Did you know the third week of January is the most common time for people to abandon their New Year's resolutions? Whether it's exercising more, learning a new language, or just trying to drink less coffee, that initial surge of fresh New Year's energy is fading, and if you want a resolution to stick, this is the key time to make a lasting change.

The same is true with any reliability resolutions you might have made.

Over the years, we've seen countless engineers and companies enthusiastically throw themselves into new reliability or Chaos Engineering efforts and get great results, only for the effort to fail to gain the traction it needs to have a true lasting impact across the company.

What’s the difference between programs that succeed and the ones that fade?

It all comes down to asking the right questions, defining ownership, and building repeatable processes that create a sustainable reliability practice.

It’s the difference between saying you’ll go to the gym or practice Spanish and actually making the time in your schedule that holds you accountable.

If you want to build an effective, long-lasting reliability program in your company, then make sure you start by asking these key questions!

What company goals does this program align with?

Every company needs to keep its applications and systems reliable. So why can it still be hard to get traction with reliability efforts? It comes down to a company's high-level goals. When you tie reliability directly to company-impacting goals, you have alignment that smooths out speed bumps and frees up resources.

A good way to think about this is to look at which teams are currently lacking the data they need to make informed reliability decisions. Not only does this help you tie into their goals, it also helps focus your efforts where they’ll have the most meaningful impact.

For example, an availability effort that has a goal of five 9s uptime will usually require redundancy and failover. Do you have the data to verify that systems fail over correctly when a failure occurs? Or that the redundant systems have enough resources to withstand a flood of new traffic? If not, then you have a reliability blind spot that could prevent you from achieving those goals.

In this case, you can start by testing those specific failure modes on critical systems. If the systems pass the tests, then you’ve created data proving progress towards that goal. And if not, then teams have the data they need to efficiently address issues before they lower availability.

Look for other major efforts, like digital transformations, migrations, compliance, or disaster recovery. The important part is to let the goals of the business shape your efforts and reliability priorities.

Do we have a mandate from leadership?

We’ve all had projects that keep getting deprioritized until they just quietly fade away. We’ve also all been on projects with massive adoption and effectiveness. The difference between success and deprioritization usually comes down to whether or not there was a mandate from leadership.

Gremlin’s founder and CEO Kolton Andrus has talked about Field of Dreams DevOps, or the idea that if you just build a good internal platform, then all your company engineers will show up and use it. Unfortunately, this approach will fail if there’s no reason for engineers to spend the time adopting a new process or platform.

But adoption soars when leadership makes it a priority—especially when reliability efforts tie directly into a major company goal.

If you don't already have this mandate, then use the first question to find the right technology leader with compatible goals, and show them how the data and fixes your reliability program creates will help achieve that goal. And even if you already have a mandate, it's still worthwhile to have the conversation to make sure you're aligned.

Just make sure you've thought about this next question, because they will undoubtedly ask it.

How are we measuring results?

For many SREs, this almost becomes a philosophical question as they try to prove results by showing the impact of an outage that never happened. Some companies may have a mature enough post-mortem and tracking system that this is possible, but for most this quickly becomes a subjective exercise.

You can get ahead of this by aligning around the right metrics. Start with the goals in the first question, then look at any related SLOs, key company metrics, or performance thresholds. Build your testing around those metrics, then use the results of tests to create a trackable, ongoing measure of progress towards your reliability goals.

As an example, a retailer might have a specific maximum allowed latency to maintain performance during the checkout process, a metric that ties directly to a company goal of preventing customer loss while purchasing.

You can use that latency level as a pass/fail criterion for reliability tests (at Gremlin, we call this a Health Check). Build your tests to introduce failures like unavailable resources, increased network latency, or traffic surges, then track whether the services can stay within the maximum allowed latency.

The first time you run all of these tests, you’ll probably find that a good number of your services fail to maintain performance. But that’s the whole point. Now you have data that shows where resources should be spent to achieve that goal. Keep testing regularly, and you should see the number of services passing the test improve steadily, showing your teams and leadership the exact metrics they need to track progress towards the goal.
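The pass/fail logic described above can be sketched in a few lines. This is a minimal illustration, not Gremlin's implementation; the threshold, service names, and measured latencies are all hypothetical examples.

```python
# Minimal sketch of a latency-based health check: a test passes only if the
# service stays under the maximum allowed latency while a fault is injected.
# The threshold, service names, and numbers below are hypothetical.

MAX_ALLOWED_LATENCY_MS = 300  # threshold tied to the checkout-performance goal

# Observed p95 latency (ms) for each service while the fault was active
observed_latency_ms = {
    "checkout-api": 240,
    "payment-gateway": 410,
    "inventory-service": 180,
}

def evaluate_health_check(latencies, threshold_ms):
    """Return a PASS/FAIL verdict per service against the latency threshold."""
    return {
        service: ("PASS" if latency <= threshold_ms else "FAIL")
        for service, latency in latencies.items()
    }

results = evaluate_health_check(observed_latency_ms, MAX_ALLOWED_LATENCY_MS)
for service, verdict in sorted(results.items()):
    print(f"{service}: {verdict}")
```

Run the same check after every test, and the PASS count over time becomes exactly the kind of trackable progress metric described above.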

And speaking of testing regularly…

What recurring processes are we using?

No one gets stronger by going to the gym once every two quarters. Just like it takes regular workouts to get in shape, it takes regular testing and issue remediation to improve the performance and reliability of systems.

That doesn’t mean you need to add a whole bunch of new processes and meetings. In fact, it’s often more effective (and less irritating to engineers) if you integrate testing, result analysis, and work planning into existing processes.

At Gremlin, we implemented these actions into our on-call process, and it's a key part of how we maintain five 9s availability. We have a suite of reliability tests scheduled to run automatically on our production system at regular intervals. As part of the on-call handoff, the engineers go over the results and flag any failures or detected risks. Then it's the on-call engineer's job to get those risks to the right person so they can integrate them into their workflow.
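The handoff step described above boils down to grouping failed test results by owner so risks land with the right team. Here's a rough sketch of that idea; the result records, test names, and team names are hypothetical, not Gremlin's actual tooling.

```python
# Sketch of an on-call handoff summary: collect the latest scheduled test
# results and surface any failures as risks routed to an owning team.
# The records, test names, and team names are hypothetical examples.

from collections import defaultdict

# Each record: (test name, owning team, passed?)
latest_results = [
    ("failover-primary-db", "storage", True),
    ("latency-under-load", "checkout", False),
    ("zone-outage-recovery", "platform", False),
]

def build_handoff_report(results):
    """Group failed tests by owning team so on-call can route each risk."""
    risks = defaultdict(list)
    for test_name, team, passed in results:
        if not passed:
            risks[team].append(test_name)
    return dict(risks)

report = build_handoff_report(latest_results)
for team, failures in sorted(report.items()):
    print(f"{team}: {', '.join(failures)}")
```

A summary like this keeps the handoff focused on open risks rather than raw test output.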

And this isn’t limited to companies our size. We’ve seen similar effective approaches at Fortune 100 global organizations. The key is to have clear processes, handoffs, and ownership of both testing and fixes, so when a risk is uncovered, it gets to the right person to address it before it causes an outage.

You don’t need resolutions to start testing

The truth is, you don’t need a New Year’s Resolution to get in shape. You can start a new exercise program any time of the year. The important part is that you keep going.

The same is true with reliability programs. You just need to get some answers to these questions, then start testing, building reliability data, and fixing risks. Because right now, you’ve got reliability blind spots, and the sooner you have actionable data, the sooner you can take effective action.

Your answers to the above questions will probably change over time, and they should as your organization matures. But the important part is to keep going.

Ready to make sure you’re set up for reliability success? Check out the How to Build a Best-in-Class Reliability Program checklist.

Start your free trial

Gremlin's automated reliability platform empowers you to find and fix availability risks before they impact your users. Start finding hidden risks in your systems with a free 30 day trial.

Ready to learn more?

See Gremlin in action with our fully interactive, self-guided product tours.

Take the tour
Gavin Cahill
Sr. Content Manager