Managing and improving reliability using Gremlin's Reliability Dashboard

Part of a successful reliability program is being able to monitor and review your progress toward improving reliability. Being able to run tests on services is a big part of it, but how can you tell you're making progress if you can only see your latest test results? There should be a way to track improvements or regressions in your reliability testing practice across your organization in a way that's easy to digest. That's where the Reliability Dashboard comes in.

What is a Reliability Dashboard?

The Reliability Dashboard is a single-pane-of-glass view of all the services in your environment, along with their reliability scores. It combines point-in-time data with historical trends to show how services compare with one another and how their reliability has changed over time. Each service's reliability score is calculated by running regular reliability tests via Gremlin.

A view of the Reliability Dashboard for a Company in Gremlin.

The Reliability Dashboard lists all services belonging to your Gremlin organization. By default, it sorts them in descending order by their reliability score. Next to each score is an indicator showing how it compares to the previous week's score. For example, the Transaction History Service has a score of 92, but because last week's score was higher, the indicator shows a decrease.

Gremlin also spotlights the three services with the largest score changes over the past week. This serves two purposes:

It highlights teams that have made significant reliability improvements to their services.
It flags services that have had substantial drops in reliability so teams can identify at-risk services.
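
As a rough illustration of how a spotlight like this could be derived, the sketch below ranks services by the absolute size of their week-over-week score change. The service names, the scores, and the largest_changes helper are all hypothetical; this is not Gremlin's implementation or API.

```python
from typing import NamedTuple


class ScoreChange(NamedTuple):
    service: str
    current: int
    previous: int

    @property
    def delta(self) -> int:
        # Positive means the score improved, negative means it dropped.
        return self.current - self.previous


def largest_changes(current, previous, top_n=3):
    """Return the top_n services with the largest absolute week-over-week change."""
    changes = [
        ScoreChange(name, score, previous[name])
        for name, score in current.items()
        if name in previous  # skip services with no score last week
    ]
    return sorted(changes, key=lambda c: abs(c.delta), reverse=True)[:top_n]


# Hypothetical scores, for illustration only.
this_week = {"checkout": 88, "transaction-history": 92, "search": 61, "auth": 75}
last_week = {"checkout": 70, "transaction-history": 95, "search": 80, "auth": 74}

for change in largest_changes(this_week, last_week):
    if change.delta > 0:
        direction = "improved"
    elif change.delta < 0:
        direction = "dropped"
    else:
        direction = "unchanged"
    print(f"{change.service}: {direction} by {abs(change.delta)} points")
```

Ranking by absolute change surfaces both large improvements and large drops in the same list, which matches the two purposes described above.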

The Dashboard also highlights the services with the highest scores. This showcases service owners who have invested time and attention in reliability and sets a gold standard for other teams. It also adds an element of gamification: service owners can compete for the highest scores, and team leads can even reward teams that consistently score high.

What can you do with a Reliability Dashboard?

The main benefit of the Reliability Dashboard is seeing all of your organization's services in relation to one another. While the services page shows a list of services with their scores broken down by category, it's limited to the current Gremlin team. That's fine for teams working on their own services, but without cross-team visibility, it's much harder for team leads to see how the organization as a whole is doing. Below are some of the many benefits that a dashboard provides.

You can see the percentage of services that meet your reliability targets

Teams can implement a minimum reliability target that all services must meet. For example, a service might need a score of 80 or higher before it can be considered production-ready. With the Reliability Dashboard, you can easily sort services by score and see all services that fall below the target. You can also see how scores are trending, which tells you at a glance what progress (if any) the team has made toward the standard.
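
To make the arithmetic concrete, here is a minimal sketch that computes the share of services meeting a target of 80 and lists the ones that fall below it. The scores and the target_report function are hypothetical and only illustrate the calculation; the Dashboard does this for you.

```python
def target_report(scores, target=80):
    """Return (percent of services meeting target, names below target, lowest first)."""
    if not scores:
        return 0.0, []
    below = sorted(
        (name for name, score in scores.items() if score < target),
        key=lambda name: scores[name],
    )
    percent_meeting = 100 * (len(scores) - len(below)) / len(scores)
    return percent_meeting, below


# Hypothetical scores, for illustration only.
scores = {"checkout": 88, "transaction-history": 92, "search": 61, "auth": 75}

percent, below = target_report(scores, target=80)
print(f"{percent:.0f}% of services meet the target")
print("Below target:", ", ".join(below))
```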

You can determine whether to start or stop initiatives based on team performance

The Dashboard also provides a view of team performance across the organization. This can alert you to macro-level patterns. For example, if you notice a lot of services have had their scores drop simultaneously, this might indicate a larger trend within your organization. Maybe developers aren't focusing enough on improving reliability, or a recent environment change is causing reliability tests to fail across the board. Regardless, a Dashboard makes it much easier to notice and analyze these kinds of issues.

You can stay on top of potential risks

Lastly, because the Dashboard lists all services, you can easily sort by the lowest scores to see which services are at risk. Here, we've identified three services with very low scores:

This view also highlights the fact that two of these services haven't had any reliability tests run in over a month. Considering how much these services have likely been modified since then, this means that we don't even have an up-to-date measure of their reliability. We'll want to run our reliability tests on these services as soon as possible and, ideally, auto-schedule them weekly. For all we know, these services could be perfectly reliable, but because we haven't tested them, we have no idea if that's true. Teams can use gamification here, too: if these teams can quickly get their scores above target, they should be recognized for doing so.
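
The staleness check described here can be sketched in a few lines. The example below flags any service whose last reliability test ran more than 30 days ago, or never ran at all. The service names, dates, and the stale_services helper are hypothetical and are not part of Gremlin's API.

```python
from datetime import date, timedelta


def stale_services(last_tested, today, max_age_days=30):
    """Return names of services not tested within max_age_days, sorted by name."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(
        name
        for name, tested in last_tested.items()
        if tested is None or tested < cutoff  # None means never tested
    )


# Hypothetical last-test dates, for illustration only.
last_tested = {
    "search": date(2024, 5, 1),
    "auth": date(2024, 6, 25),
    "reports": None,
    "billing": date(2024, 5, 20),
}

for name in stale_services(last_tested, today=date(2024, 6, 30)):
    print(f"{name}: no reliability tests in over 30 days")
```

Passing today in explicitly, rather than calling date.today() inside the function, keeps the check repeatable when you rerun it against the same data.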

Conclusion

A high-level view of service reliability makes it much easier to implement an organizational reliability strategy. It lets you:

See the relative reliability posture of individual services and teams
Monitor team performance and look for macro trends, such as a widespread drop in reliability
Identify services that don't meet reliability standards

Of course, this isn't an exhaustive list, and your team may have use cases that aren't covered here. In any case, if you want to see the value that a Reliability Dashboard can provide you and your teams, get a free demo by going to gremlin.com/demo.

