If your organization asked you to report on the reliability improvements you’ve made over the past 90 days, would you be able to pull up a report? 

If you’re like many engineers, this question might make you anxious. Reliability is difficult to quantify in a meaningful way, let alone measure. Observability and incident response tools often report metrics like mean time to detection (MTTD) and mean time to resolution (MTTR), but these don’t tell us how likely a failure is, only how quickly we detect and resolve failures once they happen. They’re backward-looking metrics, when what we really need is a forward-looking one.

Just as organizations conduct pen-testing to understand their security posture and test financial controls to understand their financial risks, organizations must test the resiliency of their services to understand their reliability posture.

In this blog, we’ll look at different methods of quantifying reliability, how you can use this information to make your services more resilient, and how to show your organization the progress you’re making.

How do you measure reliability?

Measuring reliability starts with a service. A service is any software component that an engineering team is responsible for running and managing. Node.js applications, databases, and third-party SaaS applications are common examples. Organizations that have adopted DevOps best practices put availability targets on their services. These might be as simple as “keep the service up and running as long as possible,” or they might be comprehensive Service Level Objectives (SLOs) made up of multiple metrics. In any case, these availability targets should be the focus of your reliability work.
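To make this concrete, here’s a minimal sketch (plain Python, not Gremlin-specific; the service name and numbers are made up) of how a team might express an availability target and the error budget it implies:

```python
from dataclasses import dataclass

@dataclass
class AvailabilityTarget:
    """A simple availability target for a service (illustrative only)."""
    service: str
    target: float           # e.g. 0.999 for "three nines"
    window_days: int = 30   # evaluation window for the target

    def error_budget_minutes(self) -> float:
        """Minutes of downtime allowed within the window before the target is missed."""
        return (1.0 - self.target) * self.window_days * 24 * 60

# A 99.9% target over 30 days allows roughly 43 minutes of downtime.
checkout = AvailabilityTarget(service="checkout", target=0.999)
print(f"{checkout.service}: {checkout.error_budget_minutes():.1f} minutes of error budget per month")
```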

The next step is to identify the reliability risks that could prevent the service from meeting and exceeding its availability targets. A reliability risk is anything that jeopardizes a service’s availability, such as a host failure, data center outage, network failure, etc. We can test our services using tools like Gremlin to determine whether our service is susceptible to each risk, and this becomes our metric for measuring reliability.

If we compare the number of reliability risks a service has mitigated against the total number of relevant reliability risks, we get a percentage. This percentage is the service’s reliability score, and it’s the metric we can use to quantify reliability.

(Number of risks mitigated in a service / Total number of relevant risks) * 100 = Reliability score
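In code, that calculation might look something like the following sketch (our own simplification, assuming each relevant risk maps to a test that either passes or fails; the risk names are illustrative):

```python
def reliability_score(relevant_risks: set[str], detected_risks: set[str]) -> float:
    """Percentage of relevant risks the service has been shown to withstand.

    relevant_risks: every risk the service should be tested against
    detected_risks: risks the service is still susceptible to (failed tests)
    """
    if not relevant_risks:
        return 100.0
    mitigated = relevant_risks - detected_risks
    return len(mitigated) / len(relevant_risks) * 100

# Example: 10 relevant risks, 3 still present -> a score of 70%.
all_risks = {"host-failure", "zone-outage", "dependency-latency", "dependency-failure",
             "cpu-exhaustion", "memory-exhaustion", "disk-full", "dns-failure",
             "network-blackhole", "certificate-expiry"}
still_present = {"zone-outage", "dependency-latency", "disk-full"}
print(f"Reliability score: {reliability_score(all_risks, still_present):.0f}%")
```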

Finding the reliability risks in a service

There are countless things that can go wrong when deploying to production. Worse, these vary depending on which cloud platforms you deploy to, what your architecture looks like, which dependencies you use, and even how your code is written.

Gremlin manages this in two ways: first, it auto-detects common misconfigurations that create reliability risks; and second, it runs a suite of reliability tests based on industry best practices to see how your systems actually behave. These tests cover multiple facets of reliability, such as scalability, redundancy, and resilience against slow and failed dependencies. While running these tests, Gremlin monitors the state of your service: if your service remains available and no monitors or alerts fire, then your service is resilient against that type of failure and it passes the test. If the service goes offline or takes too long to respond, it fails the test. The results of each test contribute to the service’s overall reliability score.
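Conceptually, each of these tests follows the same pattern: inject a failure, watch the service’s health while the failure is active, and record a pass or fail. The sketch below is our own simplified illustration of that loop, not Gremlin’s actual API; the health check URL and fault callables are stand-ins for whatever your service and tooling provide:

```python
import time
import urllib.request

def is_healthy(url: str, timeout: float = 2.0) -> bool:
    """Single health probe: the service must answer quickly with a 2xx response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except Exception:
        return False

def run_reliability_test(inject_fault, revert_fault, health_url: str,
                         duration_s: int = 60, interval_s: int = 5) -> bool:
    """Inject a fault, monitor the service while it's active, and report pass/fail.

    inject_fault / revert_fault: callables that start and stop the failure
    (e.g., adding latency to a dependency or shutting down a host).
    """
    inject_fault()
    try:
        deadline = time.time() + duration_s
        while time.time() < deadline:
            if not is_healthy(health_url):
                return False      # service became unavailable: the test fails
            time.sleep(interval_s)
        return True               # service stayed healthy under failure: the test passes
    finally:
        revert_fault()            # always halt the injected fault, pass or fail
```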

Gremlin screenshot showing a service with a reliability score of 70%, one detected risk, and six dependencies
Want to learn more about how Gremlin’s reliability score works? Check out our blog.

Tracking reliability over time

Your reliability score is more than just a point-in-time measure of your services’ reliability posture. Gremlin also tracks your score over time so you can see how the reliability posture of your service has changed as you continue to test and improve it. This is especially useful for reviewing past test results, determining when you last tested this service, and proving to your manager that you've been putting effort into improving your service's reliability.
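If you wanted to keep a similar record yourself (Gremlin does this for you in the product), a minimal sketch is just a dated history of scores and the change across the reporting window; the dates and scores below are made up:

```python
from datetime import date, timedelta

# Hypothetical history: one recorded score per test run.
today = date.today()
score_history = {
    today - timedelta(days=85): 68.0,
    today - timedelta(days=40): 74.0,
    today: 81.0,
}

def change_over_window(history: dict[date, float], days: int = 90) -> float:
    """Score delta between the oldest and newest entries inside the window."""
    cutoff = date.today() - timedelta(days=days)
    in_window = sorted((d, s) for d, s in history.items() if d >= cutoff)
    if len(in_window) < 2:
        return 0.0
    return in_window[-1][1] - in_window[0][1]

print(f"Reliability score change over the past 90 days: {change_over_window(score_history):+.1f} points")
```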

As an example, the Company Summary report has two graphs showing the overall reliability of our Gremlin organization over the past 90 days. We can see three clear points where our reliability dropped noticeably: first in late December, again in mid-January, and again in mid-February. We also saw an increase in the number of reliability risks Gremlin detected in late January, coinciding with the drop in the reliability score:

Screenshot of a report in Gremlin

In addition, we can see the current reliability score for each service in our organization, as well as a 30-day historical chart. We can sort this list by score to determine which services need additional testing or reliability work to bring them up to an acceptable score. We can also filter the list using tags: for instance, Gremlin has a built-in tag to flag services running in production, so if we want to see only the average score, risks, and per-service scores for production services, we can just select that tag. Gremlin automatically detects cloud provider tags like availability zone, region, and cluster ID, but you can also define your own custom tags for even more granular filtering.

Filtering the services shown in the Company Summary report by tag
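To illustrate the idea, here’s a small sketch of that workflow over hypothetical service records (the names, scores, and tags are made up): keep only production services, then list the lowest scores first so the services most in need of attention surface at the top.

```python
# Hypothetical service records with scores, detected risks, and tags.
services = [
    {"name": "frontend",  "score": 95, "risks": 0, "tags": {"env": "production"}},
    {"name": "checkout",  "score": 61, "risks": 3, "tags": {"env": "production"}},
    {"name": "reporting", "score": 74, "risks": 2, "tags": {"env": "staging"}},
]

# Keep only production services, then surface the lowest scores first.
production = [s for s in services if s["tags"].get("env") == "production"]
for svc in sorted(production, key=lambda s: s["score"]):
    print(f"{svc['name']}: score {svc['score']}%, {svc['risks']} detected risk(s)")
```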

With historical and point-in-time reliability metrics, we now have a comprehensive record of our reliability efforts that we can use to shape future development. The services in the above screenshot all have excellent scores of 94–95%, have zero risks, and have zero single points of failure (SPoFs). We can work to increase those scores until we achieve 100%, or we can shift our focus to less reliable services and work on increasing their scores. In doing so, we raise our overall team and company scores and can anticipate fewer critical failures across our entire deployment.

Here, we have several services that haven’t had much (if any) reliability work done. Some of these, like the support-portal, are critical infrastructure and should be tested as soon as possible. Others, like the adservice, are less critical, but may still impact business operations and revenue if they fail.

Screenshot of four services with reliability scores ranging from 74% to 44%

How to use reliability metrics and scores in your day-to-day work

A reliability score doesn’t necessarily equate to uptime. For example, a service with a score of 99% might have three nines of availability (99.9%), or it might only have 80% availability. What the score represents is how well your service holds up against a specific set of risks, whether it’s the default test suite that Gremlin provides or a custom test suite your team has created. When your service faces these same risks in production—for instance, if one of its hosts fails, or a dependency goes offline—you already know how your service will respond based on whether it passed or failed the corresponding test.

Another way to look at this is: the reliability score represents how resilient your service is against a set of reliability risks. Ideally, this set of risks is one that your company determined to be the most important based on compliance requirements, internal best practices, and/or industry best practices. If you don’t have or need a custom test suite, Gremlin’s recommended test suite will get you most of the way there, since it’s based on industry standards like the AWS Well-Architected Framework and known causes of incidents and outages.

Get your score even higher for better reliability

Of course, reliability isn’t a one-and-done task. Systems change, bugs get introduced, and regressions emerge in once-stable systems. With Gremlin, you can automate your test suites so they run weekly. This ensures your reliability score stays up to date, and you and your organization can feel confident knowing that your systems have a strong reliability posture.
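If you’d rather wire this into your own tooling as well, the general shape is simple: a small script, triggered weekly by your scheduler of choice (cron, a CI pipeline, etc.), that records the latest score and flags regressions. The helper below is a hypothetical sketch, not part of Gremlin’s automation:

```python
import json
from datetime import date
from pathlib import Path

SCORE_FILE = Path("reliability_scores.json")
ALERT_THRESHOLD = 5.0   # flag drops of more than this many points between runs

def record_weekly_score(new_score: float) -> None:
    """Append this week's score to a local history and warn on regressions.

    new_score is whatever your latest test run produced; fetching it is up to you.
    """
    history = json.loads(SCORE_FILE.read_text()) if SCORE_FILE.exists() else []
    if history and history[-1]["score"] - new_score > ALERT_THRESHOLD:
        print(f"WARNING: reliability score dropped from {history[-1]['score']} to {new_score}")
    history.append({"date": date.today().isoformat(), "score": new_score})
    SCORE_FILE.write_text(json.dumps(history, indent=2))

# Invoke from your weekly job, e.g.:
# record_weekly_score(82.0)
```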

To get your free reliability report, log into your Gremlin account or sign up for a free trial. You can also learn more about Gremlin’s built-in reporting tools by reading our blog: Measuring the impact of your reliability work with reports.

Andre Newman
Sr. Reliability Specialist
Start your free trial

Gremlin's automated reliability platform empowers you to find and fix availability risks before they impact your users. Start finding hidden risks in your systems with a free 30-day trial.
