
How Gremlin's reliability score works

In order to make reliability improvements tangible, there needs to be a way to quantify and track the reliability of systems and services in a meaningful way. This "reliability score" should indicate at a glance how likely a service is to withstand real-world causes of failure without having to wait for an incident to happen first. Gremlin's upcoming feature allows you to do just that.

When you add a service to Gremlin, Gremlin automatically assigns it a reliability score and provides a series of reliability tests for you to run. This score is based on how many reliability tests you've run, and how many have passed or failed. This lets you measure the resilience of the service and how much additional reliability work your team may need to put into it. In addition, because Gremlin tracks your score over time, you can easily see how your reliability has improved and quantify the value of your reliability work.

But what exactly does this score represent, and how does Gremlin calculate it? In this blog, we pick apart Gremlin's reliability score so you can get the full picture of how it works.

What is a reliability score?

A reliability score is a calculated value between 0 and 100 that represents how reliable a particular service is. We define a service as a set of functionality provided by one or more systems within an environment, such as an authentication or checkout service, and reliability is how well we can trust that service to withstand turbulent conditions like node failures, exhausted resources, and slow or unavailable dependencies. In other words, the reliability score measures how well we can depend on a service to remain available.

Each service that you define in Gremlin is assigned a reliability score starting at 0. To increase your score, you'll need to run reliability tests such as autoscaling CPU and memory, zone and host redundancy, and dependency failures. Each of these tests a specific behavior of your service while simultaneously monitoring your service's availability via Golden Signals, which are metrics that indicate the health and stability of your service. If running a test causes any of your Golden Signals to become unhealthy, then the test is marked as a failure. Otherwise, it's marked as passed.

How is the reliability score calculated?

The reliability score is calculated in three steps:

  • Each reliability test is given a score between 0 and 100 based on whether your service passed or failed.
  • Tests are grouped into different categories. Each category has a score that's simply the average of the scores of the tests in that category.
  • The scores from every category are added up and averaged to provide the service's reliability score.

Reliability test score

Each reliability test can have one of four scores depending on whether the test was successful, failed, not yet run, or if it was successful but is outdated (i.e. hasn't been run within the past month). The value of these scores is shown here:

Test status                                          Score
Test passed                                          100
Test passed, but has not been run in the past month  75
Test failed                                          50
Test not yet run                                     0

We give 50 points for a failed test because it shows that effort has been taken to check the service's reliability, even though the service failed the test. We also give a reduced score of 75 for tests that passed previously but haven't been run in the past month. This is to encourage teams to run these tests frequently to ensure services keep passing as new code is deployed, and ideally to automate tests on a regular schedule or as part of a CI/CD pipeline. The only way to get a full (i.e. passing) score is to run the reliability test all the way through without any of the service's Golden Signals failing, and to keep running the test at least once a month.
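The scoring table above can be sketched as a small function. This is only an illustrative sketch, not Gremlin's implementation: the names `Outcome`, `STALE_AFTER`, and `score_for_test` are hypothetical, "the past month" is approximated as 30 days, and a failed test is assumed to score 50 regardless of how long ago it ran.

```python
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class Outcome(Enum):
    """Result of the most recent run of a reliability test (hypothetical type)."""
    PASSED = "passed"
    FAILED = "failed"


# Assumption: "the past month" is approximated as 30 days.
STALE_AFTER = timedelta(days=30)


def score_for_test(outcome: Optional[Outcome],
                   last_run: Optional[datetime],
                   now: datetime) -> int:
    """Map a test's latest outcome and run time to the scores in the table."""
    if outcome is None or last_run is None:
        return 0    # test not yet run
    if outcome is Outcome.FAILED:
        return 50   # failed, but effort was made (assumed regardless of age)
    if now - last_run > STALE_AFTER:
        return 75   # passed, but not run in the past month
    return 100      # passed recently
```

Under these assumptions, a test that passed 45 days ago would score 75 until it is run again.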

Category score

Reliability tests are grouped into categories. These include:

  • Scalability
  • Redundancy
  • Dependency Loss
  • Dependency Latency

Each category represents a known, critical reliability risk and contributes an equal percentage to the total reliability score. Since there are four categories, each one accounts for 25% of the total score. Gremlin automatically removes empty categories from the total score; for example, if a service has no dependencies, then Dependency Loss and Dependency Latency tests won't factor into the score.

Within each category, we take the average of your test scores to get the category score. For example, let's say we want to test redundancy. There are two redundancy reliability tests: Zone and Host. We passed the Host test, but we failed the Zone test.

[Figure: Redundancy reliability tests]

This means that our category score is 75 ((50 + 100) / 2). This isn't a terrible score, but it does show that we still have more work to do in this category before we can feel confident about this service's redundancy.
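The same arithmetic as a minimal sketch. The function name `category_score` is hypothetical, and returning `None` for a category with no tests is an assumption made here to model an "empty" category.

```python
from typing import Optional, Sequence


def category_score(test_scores: Sequence[float]) -> Optional[float]:
    """Average the test scores in one category.

    Returns None for a category with no tests (an assumption of this sketch).
    """
    if not test_scores:
        return None
    return sum(test_scores) / len(test_scores)


# Redundancy example from the text: Zone failed (50), Host passed (100).
redundancy = category_score([50, 100])
print(redundancy)  # 75.0
```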

Service reliability score

The service reliability score is the sum of each category's contribution, on a scale of 0 to 100. To calculate it, we first determine the maximum that each category can contribute by dividing 100 by the number of categories. With four categories, each one can contribute up to 25 points (100 / 4). Next, for each category, we multiply that maximum contribution by the category score expressed as a fraction (the score divided by 100) to get its contribution to the reliability score.

For example, we know that our redundancy category score is 75, or 0.75 as a fraction. We know that each category's maximum contribution is 25, so to get the redundancy category's contribution, we multiply 25 by 0.75, which is 18.75. We then add 18.75 to the reliability score and repeat this step for each of the other categories.

The end result will look similar to this:

[Figure: Reliability score as a sum of all category scores]
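To make the calculation concrete, here is a hedged sketch of the steps described in this section, including the removal of empty categories. The function name `service_score` is hypothetical, and every category score below other than Redundancy's 75 is made up for illustration. Representing an empty category as `None` and returning 0 when no categories remain are assumptions of this sketch.

```python
from typing import Dict, Optional


def service_score(category_scores: Dict[str, Optional[float]]) -> float:
    """Combine category scores (0-100) into a service reliability score.

    Categories whose score is None are treated as empty and left out,
    so the remaining categories share the full 100 points equally.
    """
    active = [s for s in category_scores.values() if s is not None]
    if not active:
        return 0.0  # assumption: no categories means a score of 0
    max_contribution = 100 / len(active)  # e.g. 25 with four categories
    return sum(max_contribution * (score / 100) for score in active)


# Four categories: 25 + 18.75 + 12.5 + 0
print(service_score({
    "Scalability": 100,        # illustrative value
    "Redundancy": 75,          # from the example in the text
    "Dependency Loss": 50,     # illustrative value
    "Dependency Latency": 0,   # illustrative value
}))  # 56.25

# A service with no dependencies: only two categories count, 50 points each.
print(service_score({
    "Scalability": 100,
    "Redundancy": 75,
    "Dependency Loss": None,
    "Dependency Latency": None,
}))  # 87.5
```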

Tracking your reliability score over time

Your reliability score is more than just a point-in-time measure of reliability. Gremlin also tracks your score over time so you can see how the reliability of your service has changed as you continue to test and improve it. This is especially useful for reviewing past test results, determining when you last tested this service, and proving to your manager that you've been putting effort into improving your service's reliability.

What is a good score?

Reliability scores work a bit like the grades you’d receive in school. You can still get to 50% even if you fail every test, because each failed test is still worth 50 points. Getting to 100% and keeping it there means you've tested every area and are keeping your scores high by running tests at least monthly. Not every service is equally important to your application's reliability, so you may want to set a goal of 100% on critical services, but set a lower target for less critical services. You might also set lower score requirements for specific test categories (e.g. a service needs to pass Scalability tests, but not Redundancy).

Using the score in practice

Being able to proactively test and measure reliability before an incident is valuable on its own, but how can we drive even greater business results across the organization?

As a first step, we can use the score to manage teams more effectively by setting reliability standards. We can see how teams are scoring relative to each other, monitor for changes in scores, identify risks, and think about improvements we can make. Teams now have a positive reliability metric that they can use to proactively plan improvements. Contrast this with the retrospective meetings most teams run after incidents.

The next step is to automate. Once we begin passing tests, we can schedule them to make sure our services are continuously validated as systems change. We could also use the Gremlin API to run tests from a CI/CD platform or other tool. This reduces the amount of toil that teams have to put into reliability testing, making them more efficient.

Third, we can use our service reliability score to prevent unreliable code from reaching production by tying it into our CI/CD pipeline. Imagine a condition in the pipeline that blocks new deployments unless the reliability score is above 90. If bad code causes one or more of our reliability tests to fail in staging, we can automatically prevent the problem from reaching customers and remind our teams to fix reliability issues before deploying.
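A deployment gate like this could be as simple as the sketch below. It deliberately does not show how the score is retrieved, since that depends on your tooling. The function name `gate_exit_code` and the default threshold of 90 are assumptions for illustration only.

```python
def gate_exit_code(score: float, threshold: float = 90.0) -> int:
    """Return a process exit code for a CI/CD step.

    0 lets the pipeline continue; 1 fails the step and blocks the deploy.
    The score must be strictly above the threshold to pass.
    """
    return 0 if score > threshold else 1


# Hypothetical scores, as if fetched earlier in the pipeline.
print(gate_exit_code(95.0))   # 0: deploy proceeds
print(gate_exit_code(56.25))  # 1: deploy is blocked
```

A pipeline step would typically pass this value to `sys.exit` so that a low score fails the build.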

There’s a lot you can do with reliability scores to operationalize reliability; these are just three of the most common ways we see this being adopted at scale.

Conclusion

Translating something as complex as reliability into a numerical score isn't always clear cut. A score of 90 doesn't mean that your service will only work 90% of the time. Instead, it represents how well your service stands up to the different failure scenarios that you've put it through, and how it compares to other services across your organization. It also shows the amount of effort that you and your team have put into making your services as resilient as possible. Always strive for 100, and if you fall short, at least you know what to fix first!