Measuring the impact of your reliability work with reports

Improving reliability is important, but how do you prove that your efforts are having an impact? A critical part of reliability work is having the tools to measure and track your progress. Gremlin supports this with several built-in reports, which update automatically and are available on demand. This blog post is a quick introduction to Gremlin’s reporting capabilities.

Which reports does Gremlin provide?

Currently, Gremlin provides three reports:

The Company Summary report gives an overview of your entire Gremlin company’s reliability score and risks. It also lists each of the services in your company along with their reliability score, any reliability risks Gremlin detected, and the number of dependencies marked as a single point of failure (SPoF).
The Team Risk report shows all of the detected risks for a Gremlin team, grouped by service. It also includes a line chart of the total number of unmitigated detected risks over the past three months.
The Team Score report is similar to the Team Risk report, only it shows each team’s reliability score and per-service test results in place of detected risks.

How can you use these reports to improve reliability?

Find gaps in your reliability testing coverage

The Team Score report is very useful for finding services whose tests have expired, failed, or haven’t been run yet. This report also breaks down each team’s reliability score into its main categories. Together, these give you an immediate snapshot of your team’s overall performance and highest-risk areas.

For example, this team scored well on detected risks, but scored very poorly on redundancy and dependency resilience. We’ll want to encourage this team to spend time making their services redundant, setting up automatic failover, and tuning their scaling parameters. They should also take time to make their services less tightly coupled to their dependencies by adding circuit breakers, automatic timeouts, and similar mechanisms:

Screenshot of a team's reliability score of 52%

Find common risks

Detected Risks are high-priority reliability concerns that Gremlin automatically identified in your environment, such as failed containers and missing scaling rules. The Team Risk report shows all of these risks across all services in a team, letting you see the complete list of unresolved risks at a glance.

In this example, we have a team that’s done a great job of cleaning up its risks, but there’s one that consistently reappears: availability zone (AZ) redundancy. This suggests that the team isn’t aware of AZ redundancy as a best practice, or that this Kubernetes cluster isn’t set up for multi-AZ redundancy. We could confirm this by looking at other teams in the Team Risk report to see whether their services all have the same detected risk.

A chart showing changes in the number of risks detected over a 90 day period

This report is also useful for tracking improvements (or regressions) in risks over time. This team had a peak risk count of 14 before dropping it to just one a few days later. We can assume they put in work in late November to reduce their risks. But now that the count is creeping back up, we’ll want to address the issue so the team can reduce their risks again. This also helps set standards, so any services added in the future are less likely to be susceptible to these same risks.
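To make the "find common risks" check concrete, here is a minimal sketch of counting how many services share each unresolved risk. It assumes you have already copied the risks out of the Team Risk report by hand into a dictionary; the service names, the `risks_by_service` structure, the `common_risks` helper, and the 50% threshold are all hypothetical and are not part of Gremlin's product or API.

```python
from collections import Counter

# Hypothetical data: service name -> set of unresolved detected risks,
# transcribed manually from a Team Risk report.
risks_by_service = {
    "checkout": {"AZ redundancy"},
    "cart": {"AZ redundancy", "missing scaling rules"},
    "search": {"AZ redundancy"},
    "auth": set(),
}


def common_risks(risks_by_service, min_share=0.5):
    """Return (risk, service_count) pairs for risks that affect at least
    `min_share` of all services, most widespread first."""
    total = len(risks_by_service)
    if total == 0:
        return []
    # Count each risk once per service that still has it unresolved.
    counts = Counter(
        risk for risks in risks_by_service.values() for risk in set(risks)
    )
    shared = [(risk, n) for risk, n in counts.items() if n / total >= min_share]
    # Sort by descending count, then by name for a stable order.
    return sorted(shared, key=lambda pair: (-pair[1], pair[0]))


print(common_risks(risks_by_service))
```

A risk that shows up in most services, as AZ redundancy does in this made-up data, points to a cluster-level or process-level gap rather than a problem with any single service.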

Roll up your work to leadership

Leadership teams don’t need to know the nitty-gritty details of each service in the company. They just need to know: “are we reliable?” That’s where the Company Summary report comes in.

While the Company Summary report does list all services (with the ability to group, filter, and sort as needed), the two graphs at the top are the highlights. These graphs take the average reliability score and detected risk graphs from the other two reports and aggregate them across all teams, giving you a complete view of reliability across the entire company.

A company summary report showing a high reliability score, few risks, and highly rated services

The reliability score graph is color-coded so you can read it at a glance: a score of 70% or higher is good, a score from 50% up to 70% is fair, and a score below 50% is poor. Each service is also color-coded based on its score. This way, you can quickly get a sense of where your company is on the road to reliability, and how many services need work to get you as close to 100% as possible.
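The score bands described above can be written as a small lookup, which is handy if you want to apply the same labels in your own dashboards or spreadsheets. This is only a sketch: the `score_band` function name and the service scores are hypothetical, and it assumes that a score of exactly 70% counts as good and exactly 50% counts as fair, which is how the thresholds read.

```python
def score_band(score_percent):
    """Map a reliability score (0 to 100) to the band used in the report."""
    if not 0 <= score_percent <= 100:
        raise ValueError("score must be between 0 and 100")
    if score_percent >= 70:
        return "good"
    if score_percent >= 50:
        return "fair"
    return "poor"


# Hypothetical per-service scores, for illustration only.
service_scores = {"checkout": 88, "cart": 52, "search": 31}

for service, score in service_scores.items():
    print(f"{service}: {score}% ({score_band(score)})")
```

Under these bands, the team scoring 52% in the earlier example would land in the fair range.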

Start getting real-time reliability reports today

Engineers put a lot of effort into making their systems more reliable. Whether you’re showing your work to leadership, catching up on new detected risks, or looking for areas needing improvement, Gremlin’s reports have you covered. Over the coming weeks, we’ll be publishing more blog posts that dive into each report in detail. Those will be linked below, so stay tuned!

In the meantime, to get your instant reports, sign up for a free 30-day trial, then head to app.gremlin.com/reports.

