Introducing Custom Reliability Test Suites, Scoring and Dashboards

Last year, we released Reliability Management, a combination of pre-built reliability tests and scoring to give you a consistent way to define, test, and measure progress toward reliability standards across your organization.

Today, we fulfill the next stage of that promise with the release of Custom Reliability Test Suites, Custom Scoring, and Dashboards.

Now you can customize your reliability test suites and scores to fit your system and your organization's priorities, then give your organization and technology leadership a full and accurate view of your reliability posture with scoring and reporting.

Taken together, these features help cloud architects and center of excellence leaders define customized software reliability standards and measure progress toward them. It’s easier than ever to meet external compliance requirements, consistently ensure resilience to past outages, or scale organizational best practices. Read on for details, or skip to the end to watch a demo.

Define reliability standards with custom reliability test suites

With increased compliance requirements around reliability, and the rising costs of unreliability, enterprise organizations are looking to define a standard of reliability that’s not based on backward-facing metrics or error budgets.

The suite of reliability tests Gremlin introduced last year has been helping organizations ensure they are compliant with industry best practices and prove resilience to known reliability risks. While this suite of Gremlin-recommended tests covers the most common issues out-of-the-box, we recognize that every organization has unique reliability needs.

Starting today, you can define the tests most meaningful to you.

With Custom Test Suites, you can set the standard for reliability in your organization. Whether that means testing toward compliance requirements such as those from the OCC or DORA, following industry frameworks such as the AWS Well-Architected Framework (WAF), testing against past painful outages, or tweaking the Gremlin-recommended suite to better fit your needs, Gremlin makes it easy.

A custom Test Suite showing seven reliability tests split across four categories

Each test in your custom suite is essentially a Scenario: a series of fault injections run in sequence while pulling in system health data from your monitoring tools to understand real-world impact. You can test for things like host and zone redundancy, ensure your system scales up and down as resources are saturated, understand behavior when dependencies become slow or disconnected, and much more. By using custom Fault Injection scenarios, you can create tests specific to your systems and reliability standards. You can then promote those Scenarios, paired with a Health Check, into a test suite that can be rolled out across your team or organization.
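To make the structure above concrete, here is a minimal Python sketch of a suite as a list of tests, where each test is a sequence of fault injections paired with a health check. Every name in it (FaultInjection, ReliabilityTest, CustomSuite, the inject and health_check hooks) is hypothetical and chosen for illustration; it is not Gremlin's API or data model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class FaultInjection:
    """One step of a scenario (hypothetical model, not Gremlin's API)."""
    name: str
    inject: Callable[[], None]  # assumed hook that starts the fault


@dataclass
class ReliabilityTest:
    """A scenario (ordered fault injections) paired with a health check."""
    name: str
    category: str
    steps: List[FaultInjection]
    health_check: Callable[[], bool]  # assumed to query your monitoring tool

    def run(self) -> bool:
        # Run the faults in sequence and stop as soon as the service
        # is reported unhealthy.
        for step in self.steps:
            step.inject()
            if not self.health_check():
                return False
        return True


@dataclass
class CustomSuite:
    """A named collection of tests that can be shared across teams."""
    name: str
    tests: List[ReliabilityTest] = field(default_factory=list)

    def run(self) -> Dict[str, bool]:
        # Map each test name to whether it passed.
        return {test.name: test.run() for test in self.tests}
```

In this sketch a test fails as soon as the health check reports a problem after any step, so later faults in the same scenario are never injected against an already unhealthy service.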

Measure reliability proactively with custom scoring

Prior to the introduction of Gremlin’s Reliability Score, reliability metrics were really measurements of unreliability, such as downtime, MTTR, and MTTD. As these metrics changed, they raised the question: are we getting better, or just lucky? The Gremlin Reliability Score was developed as a way to measure the current reliability of your system without having to react to incidents. Over time, this can show you the changes in your reliability posture, rather than just the moments when things broke.

Now, when you define a custom test suite, you’ll get a custom reliability score for each service too. This score shows how much testing is being done and how you’re doing against those tests. Previously, this score was generated from the recommended pre-built reliability tests, which cover the majority of the standard reliability issues. Any situations that were unique to your systems had to be tested separately with Fault Injection Scenarios and weren’t included in the score. Because you can integrate custom scenarios into standardized scores now, you can get a fast, complete picture of your reliability posture in one place.
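As a rough illustration of a score that reflects both how much testing is being done and how a service fares against those tests, the sketch below counts passed tests against every test in the suite, so a test that was never run lowers the score just as a failure does. The formula and the reliability_score name are assumptions made for this example; the source does not describe how Gremlin actually calculates its score.

```python
from typing import Dict, Optional


def reliability_score(results: Dict[str, Optional[bool]]) -> float:
    """Return a 0-100 score for one service (illustrative formula only).

    `results` maps a test name to True (passed), False (failed),
    or None (defined in the suite but never run).
    """
    if not results:
        return 0.0
    passed = sum(1 for outcome in results.values() if outcome is True)
    # Tests that were never run count against the score, so the number
    # drops both when tests fail and when testing is not happening.
    return round(100.0 * passed / len(results), 1)
```

For a seven-test suite with five passes, one failure, and one test never run, this sketch returns 71.4.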

A screenshot of a service in Gremlin. It has a reliability score of 84% across four categories.

You can measure and compare that reliability posture across services, teams, and your organization to see which teams are testing and which teams aren’t, identify weak spots, and reward teams for making real improvements. You can also use the score as a gate in CI/CD to allow reliable teams to have fewer checks, or have them bypass manual resilience testing events because testing is automated in the background.
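A CI/CD gate of the kind described above can be as small as a threshold check whose result becomes the exit code of a pipeline step. The sketch below assumes the score has already been fetched by some other means; the function name and the default threshold of 80 are hypothetical, not Gremlin settings.

```python
def gate_exit_code(score: float, threshold: float = 80.0) -> int:
    """Return 0 to let the pipeline continue, or 1 to block it.

    `score` is a service's reliability score on a 0-100 scale and
    `threshold` is an assumed, team-chosen minimum.
    """
    return 0 if score >= threshold else 1
```

A pipeline step could pass this value to sys.exit so that services below the threshold are routed to additional checks while the rest proceed.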

Identify reliability risks organization-wide with executive dashboards

Better ways to measure reliability call for better visibility into that reliability across the organization, which is what the new executive dashboards provide. Reliability is one of those problems that spans the organization, with each service owner or team ultimately responsible for the reliability of their own service. With this many owners and stakeholders, it can be hard to keep track of the reliability of different services.

The new executive dashboards change that by putting all the information you need in a single pane of glass. You can view scores, test results, and detected risks for every service and team, or even roll scores and risks up into company-wide reports to show how your organization is mitigating reliability risks over time.
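One simple way to picture a roll-up is to average service scores per team and then across the whole organization. The sketch below does exactly that with an unweighted mean; the roll_up name, the input shape, and the choice of averaging are assumptions for illustration, since the source does not say how Gremlin aggregates scores.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple


def roll_up(
    service_scores: List[Tuple[str, str, float]],
) -> Tuple[Dict[str, float], float]:
    """Aggregate (team, service, score) rows into team and org scores.

    Uses a plain unweighted mean (an assumption). Raises
    statistics.StatisticsError if `service_scores` is empty.
    """
    by_team: Dict[str, List[float]] = defaultdict(list)
    for team, _service, score in service_scores:
        by_team[team].append(score)
    team_scores = {team: round(mean(scores), 1) for team, scores in by_team.items()}
    org_score = round(mean(score for _, _, score in service_scores), 1)
    return team_scores, org_score
```

Note that the organization score here averages over services, not over teams, so a team with many services carries more weight than a team with one.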

You can filter and sort by team, score, and recent changes. Coming soon, you’ll also have the ability to tag services however you wish, and use these tags to group services and build custom reports.

A screenshot of the Team Score report in Gremlin showing a chart of the team's reliability score over time, and a matrix of services with the results of each reliability test

The result: proactively drive reliability and meet IT governance requirements

These standards are built around metrics that show you your current reliability, giving you greater confidence in your actions and helping you avoid outage costs like unplanned downtime, reputational risks, lost revenue, and stalled feature development.

Big changes are underway for companies that invest in reliability ahead of incident response. With Gremlin, you can now validate infrastructure and releases, shift away from a break/fix model and the backward-facing metrics that lead to firefighting, and get ahead of incidents and outages before they impact customers, all while tailoring Gremlin to your organization’s specific needs.

To get started,