Don’t just react to incidents

Incident response has been the cornerstone of reliability for decades. From digging in the server logs to navigating modern observability dashboards, responding quickly to incidents and outages is a big part of minimizing downtime. And it should be! When something breaks, your team should move as quickly as possible to address and repair the problem.

But building a reliability practice around observing and reacting alone isn’t enough anymore. Today’s architectures are complex and ephemeral. Every feature or application requires a string of connected services to keep it going, each with its own host of potential issues just waiting to bring it down. Layer in automation, such as spinning down a Kubernetes node and replacing it, and the number of variables that could lead to an outage grows rapidly. And that’s before you account for teams constantly shipping new code.

It’s not enough to build your reliability practice around incident response. If you want to be able to give your customers the availability they need, then you need to proactively build resiliency and prevent outages before they happen with reliability testing.

You have to stop fighting fires all the time

When an outage fire flares up, you have to fight it, and fast. But what about when there isn’t an outage? If your reliability strategy is built entirely around improving your reaction time, then your team will always have the pressure of another outage hanging over their heads. Between the emotional and cognitive costs of that stress, you’ll end up with a team that’s always on edge. Something could break at any time; they just don’t know what or when.

And to make matters worse, it’s incredibly difficult to show positive progress over time when your team is only resolving incidents. And if you’ve been spending precious engineering resources scrambling to fix problems instead of releasing new features, it can lead to more than your fair share of awkward review conversations.

This approach is unsustainable. It leaves your team in a perpetual cycle of incident response and firefighting that only gets worse. The engineers on your team will feel burned out and underappreciated. And you’ll have a harder time proving your team is actively creating value.

Get ahead of incidents and outages

What if there were a different, proactive way to improve reliability? One where you could walk into reviews saying, “My team increased the reliability of critical services by 50%,” and back it up with data? When you build a proactive reliability strategy, your goal is to document the reliability status of your systems, spot reliability risks, and stop damaging outages before they happen.

And by prioritizing fixes and integrating them into your sprints, you can improve reliability on your schedule. Not only will your teams be able to use their resources more effectively, but you’ll also be able to demonstrate that you’ve created a positive impact on reliability and availability. It might sound like a huge lift, but with the right model and tools, a proactive reliability practice can be easily integrated into your current processes and shift your efforts from reaction to prevention.

Since 2016, Gremlin has worked with Reliability Engineering teams at hundreds of companies in virtually every environment. We’ve worked with and helped companies define some of the most mature reliability and Chaos Engineering organizations in the world. And our team of reliability experts has come up with a simple, spreadsheet-based way to track the reliability of your systems and align your teams.

Align your teams to improve reliability

As with any cross-team endeavor, the key is alignment around a common source of truth. Right now, most engineers probably have some idea about the various ways their applications or services are going to fail. For example, they might know that their application sometimes has issues with zonal failover, so if something goes wrong, that’s the first thing they check. It’s an approach that’s workable enough, so we’ve all just rolled with it.

But it also means that knowledge and understanding of reliability risks are siloed in the heads of individual engineers. To get everyone on the same page, you’ll need a way to bring that knowledge together in one place: in this case, a Reliability Tracker spreadsheet. Once you have that, you can start testing your systems to create a reliability map.

Sketched spreadsheet grid showing the layout of what could break, what could break it, and if it breaks.

To build the spreadsheet, you’ll need answers to these three questions:

What could break?
What could break it?
Does it break?

We’ve built this concept out to create a Reliability Tracker template for you to download and use. When you fill the template out, it should look something like this:

Screenshot of the Reliability Tracker spreadsheet template.

On the left, you’ll see a list of services, or what could break. (For the sake of this spreadsheet, we’re defining a service as a specific functionality provided by one or more systems within an environment.) On the top, you’ll see common ways that it could fail, known as failure modes. In the template, we’ve included some of the most common ways services can fail, but engineers will know the common failure points for their services, and this is where you get those out of their heads and on the page.

In the middle, you’ll see the results of testing the failure modes using Chaos Engineering tests, with a green “OK” for passed tests, a yellow “?” for tests not performed, and a red “X” for a failed test. In the example above, the Support service responds as expected when there’s a sudden increase in the CPU load. But the control API service? Not so much.

This layout also includes a space for you to tier your services, with Tier 1 being mission-critical services and Tier 3 being less important services like internal tools. (You can adjust this to fit your tiering system.)

On the right is a coverage score for each service showing the percentage of failure modes with a passed test, which is the share of reliability risks that have been resolved. (Untested is scored the same as a failed test, because an unknown risk is still a risk.)
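To make the scoring rule concrete, here is a minimal Python sketch of how a coverage score like this could be computed. It is an illustration, not the formula used in the Reliability Tracker template: the `Result` enum, the `coverage_score` function, and every failure mode other than CPU are hypothetical names and values chosen for the example.

```python
from enum import Enum


class Result(Enum):
    """The three cell states in the tracker (names here are hypothetical)."""
    PASSED = "OK"
    UNTESTED = "?"
    FAILED = "X"


def coverage_score(results):
    """Return the percentage of failure modes with a passed test.

    Untested and failed cells both count against the score, because
    an unknown risk is still a risk.
    """
    if not results:
        return 0.0
    passed = sum(1 for r in results.values() if r is Result.PASSED)
    return 100.0 * passed / len(results)


# Illustrative rows only. The CPU results mirror the example in the text;
# the other failure modes and their results are made up for this sketch.
tracker = {
    "Support": {
        "CPU": Result.PASSED,
        "Memory": Result.PASSED,
        "Zone failover": Result.UNTESTED,
        "Dependency latency": Result.FAILED,
    },
    "Control API": {
        "CPU": Result.FAILED,
        "Memory": Result.PASSED,
        "Zone failover": Result.UNTESTED,
        "Dependency latency": Result.UNTESTED,
    },
}

scores = {service: coverage_score(row) for service, row in tracker.items()}
```

With these made-up rows, Support passes two of four failure modes and Control API passes one of four, so an untested cell drags the score down exactly as much as a failed one.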

A spreadsheet like this gives you a single place to align your reliability efforts. Instead of guessing about which issues to address, you’ll be able to have data-driven conversations with stakeholders about which risks to prioritize based on the potential impact of an outage.

And just as importantly, it also gives you a way to show improvement over time. When you first fill out this spreadsheet, you get a baseline of your current state. Make a copy of it, and a month later (or two weeks or whatever your testing cadence is), you’ll have a new state of reliability. If you’re working to resolve reliability risks, then you’ll develop a steady string of reliability metrics that can be used to prove the effectiveness of your efforts, get buy-in from engineering teams, and show the positive value you’re creating for your organization.
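As a rough sketch of what tracking improvement could look like, the Python below compares coverage scores from a baseline copy of the spreadsheet against a later copy. The `score_changes` function, the service names, and all of the numbers are hypothetical and only show the shape of the comparison.

```python
def score_changes(baseline, current):
    """Return the change in coverage score, in percentage points,
    for every service that appears in both snapshots."""
    return {
        service: current[service] - baseline[service]
        for service in current
        if service in baseline
    }


# Made-up coverage scores from two copies of the tracker.
baseline = {"Support": 50.0, "Control API": 25.0}
latest = {"Support": 75.0, "Control API": 50.0, "Billing": 25.0}

changes = score_changes(baseline, latest)

# Services added since the baseline have no earlier score to compare to,
# so they are listed separately rather than reported as an improvement.
new_services = sorted(set(latest) - set(baseline))
```

Keeping newly added services out of the comparison avoids overstating progress: a service that was not in the baseline has a first score, not an improvement.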

Next steps: Learn how to find and fix reliability risks

Ready to start building a Reliability Tracker of your own? Download the Navigating the Reliability Minefield whitepaper and Reliability Tracker Template. One of the creators of this approach, Sam Rossoff, Principal Engineer at Gremlin, also sat down for the Navigating the Reliability Minefield webinar. In this webinar, he goes over the approach to building the spreadsheet—and how it can be used to get buy-in with teams across your organization. You can also watch a quick, 3-minute overview video of the spreadsheet on our YouTube channel.

It’s time to stop spending all your time and resources responding to incidents. With this simple alignment spreadsheet, you can start fixing reliability risks before they become outages.
