Building momentum for a reliability program can be tough. Improving reliability takes time, effort, and resources. But when everything from launching new features to improving security demands those same resources, it can be a struggle to get the buy-in you need to address reliability risks.
And it makes sense! If a team spends time patching a known security bug or shipping a new feature, they have a clear demonstration of the value they created. But if you resolve a reliability risk before it causes an outage, how can you show that you prevented something that never happened?
To get that buy-in, you need to do three things:
- Align around a common set of metrics for the current state of system reliability.
- Use those metrics to find and fix reliability risks.
- Create a regular cadence of review and reporting to show the results of your efforts.
In this post, we'll cover how to do all three using a Reliability Tracker in the form of a spreadsheet. We've created a free template you can download, along with a whitepaper and webinar that go more in-depth about using it to navigate the reliability minefield.
Whether you use our template, create your own, or use the Reliability Management dashboards that are part of Gremlin, the important part is that you have a centralized source of truth for your systems’ reliability and use it regularly to improve reliability and prove your results.
Ask any engineer what could break about the application or service they run and they’ll probably be able to rattle off a list of the most common potential failures. Unfortunately, when all of that knowledge is siloed in the heads of engineers, it’s impossible to align an organization around it.
A Reliability Tracker is designed to give you a map of the reliability of your services. (For the sake of this spreadsheet, we’re defining a service as a specific functionality provided by one or more systems within an environment.) It should look something like this:
To build the spreadsheet, list your services down the left side. Across the top, put the various ways the services could fail, also known as failure modes. In the template, we’ve included some of the most common failure modes, but you should customize them to fit your systems. In the middle, fill out the results from testing the services against the failure modes using Chaos Engineering tests. In our template, we use a green “OK” for passed tests, a yellow “?” for tests not performed, and a red “X” for a failed test.
On the right is a coverage score for each service showing the percentage of tests passed—or the number of reliability risks that have been resolved. (Untested is scored the same as a failed test, because an unknown risk is still a risk.)
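To make the scoring concrete, here's a minimal sketch of the coverage-score calculation described above. The function name and cell values are our own shorthand for the template's markers ("OK" = passed, "X" = failed, "?" = untested), not part of the spreadsheet itself:

```python
def coverage_score(results):
    """Percentage of failure modes with a passing test.

    Untested ("?") counts the same as a failure ("X"),
    because an unknown risk is still a risk.
    """
    if not results:
        return 0.0
    passed = sum(1 for r in results if r == "OK")
    return round(100 * passed / len(results), 1)

# One row of the tracker: a service tested against five failure modes.
print(coverage_score(["OK", "OK", "X", "?", "OK"]))  # 60.0
```

Note that a "?" drags the score down exactly as much as an "X"—which is the point: the score only rises when you actually run the test and pass it.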
The goal is to show you, at a glance, where you know your services will perform as expected, where you know they’ll fail, and where there’s still an uncertain risk that needs to be tested.
Most organizations track reliability by using backward-facing metrics like downtime, MTTR, or the number of incidents. All of these are important metrics, but they’re also all lagging indicators. They show that the system was unreliable in the past, but they don’t show how reliable the system is right now.
In the Navigating the Reliability Minefield webinar, Sam Rossoff, principal software engineer at Gremlin, goes into more detail about the importance of lagging vs. leading indicators.
When you use a Reliability Tracker, you create leading indicators for your reliability and availability. Instead of just looking at when your system failed in the past, you’re able to find risks where it could fail in the future. On the Reliability Tracker template, the key metric to watch is the coverage score. Every time you resolve a risk and pass a new test, your coverage score increases, which means your system is that much less likely to fail.
By using these leading indicators, you shift from a reactive, scrambling approach to reliability to a more proactive approach where you’re finding and fixing availability risks before they cause incidents and impact users.
Once you’ve completed your first round of testing and created a baseline on the Reliability Tracker, you need to figure out how to prioritize your work and tackle each risk one at a time.
Start by tiering the risks by their importance or potential impact. For example, a risk that causes a brown-out on an internal service is probably less destructive than one that would cause an outage on external, mission-critical services.
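One way to sketch that tiering (the tier names, services, and rankings below are illustrative, not part of the template) is to rank each risk by its blast radius and the criticality of the affected service, then work the list from the top:

```python
# Illustrative only: tiers and risk entries are hypothetical examples.
# Lower tier number = higher priority.
IMPACT_TIERS = {
    ("external", "outage"): 1,    # mission-critical, user-facing outage
    ("external", "brownout"): 2,
    ("internal", "outage"): 3,
    ("internal", "brownout"): 4,  # degraded internal service
}

risks = [
    {"service": "checkout-api", "scope": "external", "effect": "outage"},
    {"service": "report-batch", "scope": "internal", "effect": "brownout"},
    {"service": "search", "scope": "external", "effect": "brownout"},
]

# Sort so the most destructive risks get addressed first.
risks.sort(key=lambda r: IMPACT_TIERS[(r["scope"], r["effect"])])
print([r["service"] for r in risks])
# ['checkout-api', 'search', 'report-batch']
```

However you define the tiers, the value is in having an agreed-upon ordering before the prioritization conversation starts, rather than arguing case by case.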
This is where the Reliability Tracker is an essential organizational tool. By aligning around common metrics, you can spark conversations and drive action to actively address reliability risks.
At first, you may encounter some resistance or hesitance to spend engineering resources on testing and resolution. After all, everyone is busy and has more than enough on their plate. This can make it hard to justify spending resources on potential risks that aren’t active incidents.
This is where having a centralized Reliability Tracker is particularly useful. If a service owner says they won’t fix something, then you don’t have to worry about digging your heels in. Just mark that you showed them the risk, but they decided not to fix it yet. This also provides documentation so you know which services need additional focus during the next testing cycle.
Remember, the point of the Reliability Tracker is to align people and kickstart conversations. Maybe they’re really busy this time, but next time you test and it’s still unresolved, they’ll have time to fix it. Or it may take a director spotting the failed test in a review and directing resources towards it. Either way, the risk has been identified and is known, and now an informed decision can be made about how to address it.
Also, as you start resolving the risks and increasing the coverage metric, you’ll be able to demonstrate the results, which will make buy-in conversations easier.
The Reliability Tracker isn’t a one-time exercise that you file away in the archives. Use it as a living tool that you revisit regularly. In our template, we recommend copying the tests to a new sheet every testing cycle, creating a record of reliability coverage over time. Now you have reliability trends that can be used to make informed decisions about resource allocation and prioritization.
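As a small sketch of what those per-cycle sheets give you (the dates and scores below are made up for illustration), keeping one coverage score per cycle lets you compute a simple trend—how much coverage improved between cycles:

```python
# Hypothetical coverage scores copied from each testing cycle's sheet.
cycles = {
    "2024-01": 40.0,
    "2024-02": 55.0,
    "2024-03": 70.0,
}

# Change in coverage between consecutive cycles.
scores = list(cycles.values())
deltas = [round(b - a, 1) for a, b in zip(scores, scores[1:])]
print(deltas)  # [15.0, 15.0]
```

A flat or declining delta is just as useful a signal as a rising one: it tells you where testing has stalled and which services need attention in the next cycle.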
It also means you have a method of proving the effectiveness of your reliability improvement efforts. By first showing a baseline of reliability, then improvements to reliability, you can demonstrate to leadership that your work has created value. And if they dig deeper, you have clear, demonstrable impacts that you can show them by tying specific tests to failure modes on specific services—and that you drove the action to fix the risks and improve availability.
At Gremlin, we use our Reliability Management platform to automatically run standard tests every week, then have a battery of manual tests we run on a schedule. Setting this up can feel daunting at first, but the key is to establish a regular time to test and revisit the Reliability Tracker, such as every other week or monthly.
Once you have the test results, go over them with your engineering teams to prioritize efforts and decide which reliability risks will be addressed. The more you repeat this testing and review cadence, the more improvements you’ll be able to prove and the more momentum you’ll create for your reliability program.
There’s a saying in business that if you can’t measure it, then it didn’t happen. It’s a pain many engineers feel when they fix something, but can’t measure the improvement. When you build and align around a Reliability Tracker, you make it possible to measure reliability. And that means when you make reliability improvements, you can prove that they happened.
Ready to build your own? Go more in depth about using a Reliability Tracker with the Navigating the Reliability Minefield whitepaper, get more insights from the webinar that goes with it, or check out a short 3-minute video on how to use the Reliability Tracker on YouTube.
You can also start a free trial for Gremlin and see how the Reliability Tracker concept has been incorporated into automated testing and dashboards.