Measuring the impact of your reliability work with reports
Improving reliability is important, but how do you prove that your efforts are having an impact? A critical part of reliability work is having the tools to measure and track your progress. Gremlin supports this by providing several built-in reports, which update automatically and are available on demand. This blog post is a quick introduction to Gremlin’s reporting capabilities.
Which reports does Gremlin provide?
Currently, Gremlin provides three reports:
- The Company Summary report gives an overview of your entire Gremlin company’s reliability score and risks. It also lists each of the services in your company along with their reliability score, any reliability risks Gremlin detected, and the number of dependencies marked as a single point of failure (SPoF).
- The Team Risk report shows all of the detected risks for a Gremlin team, grouped by service. It also includes a line chart of the number of unmitigated detected risks over the past three months.
- The Team Score report is similar to the Team Risk report, only it shows each team’s reliability score and per-service test results in place of detected risks.
How can you use these reports to improve reliability?
Find gaps in your reliability testing coverage
The Team Score report is useful for finding services whose tests have expired, failed, or never been run. It also breaks down each team’s reliability score into its main categories. Together, these give you an instant snapshot of your team’s overall performance and highest-risk areas.
For example, this team scored well on detected risks but very poorly on redundancy and dependency resilience. We’ll want to encourage this team to spend time making their services redundant, setting up automatic failover, and tuning their scaling parameters. They should also take time to make their services less tightly coupled to their dependencies by adding circuit breakers, automatic timeouts, and similar mechanisms.
Find common risks
Detected Risks are high-priority reliability concerns that Gremlin automatically identified in your environment, such as failed containers and missing scaling rules. The Team Risk report shows all of these risks across all services in a team, letting you see the complete list of unresolved risks at a glance.
In this example, we have a team that’s done a great job of cleaning up its risks, but there’s one that consistently reappears: availability zone (AZ) redundancy. This suggests that either the team isn’t aware of AZ redundancy as a best practice, or its Kubernetes cluster isn’t set up for multi-AZ redundancy. We could check which by looking at other teams in the Team Risk report to see whether their services have the same detected risk.
This report is also useful for tracking improvements (or regressions) in risks over time. This team had a peak risk count of 14 before dropping it to just one a few days later. We can assume they put in work in late November to reduce their risks. But now that the count is creeping back up, we’ll want to address the issue so the team can reduce their risks again. This also helps set standards, so any services added in the future are less likely to be susceptible to these same risks.
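The recurring-risk check described above (does one risk show up across most services?) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input is a hand-built mapping of service names to risk labels, the service and risk names are made up, and `common_risks` is a hypothetical helper, not part of any Gremlin API or export format.

```python
from collections import Counter


def common_risks(risks_by_service, min_share=0.5):
    """Return (risk, count) pairs for risks found on at least
    `min_share` of the services, most frequent first.

    `risks_by_service` maps a service name to a set of risk labels.
    Both the structure and the labels are assumptions for illustration.
    """
    counts = Counter(
        risk for risks in risks_by_service.values() for risk in risks
    )
    threshold = min_share * len(risks_by_service)
    # Sort by count (descending), then by name, so output is deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(risk, n) for risk, n in ranked if n >= threshold]


# Hypothetical data: one risk recurs on three of four services.
team_risks = {
    "checkout": {"az-redundancy", "no-scaling-rule"},
    "cart": {"az-redundancy"},
    "search": {"az-redundancy"},
    "auth": set(),
}

for risk, n in common_risks(team_risks):
    print(f"{risk}: {n} of {len(team_risks)} services")
```

A risk that appears on most services is more likely to point at a shared cause, such as cluster configuration, than at a problem in any one service.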
Roll up your work to leadership
Leadership teams don’t need to know the nitty-gritty details of each service in the company. They just need to know: “are we reliable?” That’s where the Company Summary report comes in.
While the Company Summary report does list all services (with the ability to group, filter, and sort as needed), the two graphs at the top are the highlights. These graphs take the average reliability score and detected risk graphs from the other two reports and aggregate them company-wide, giving you a complete view of reliability across the entire company.
The reliability score graph is color-coded for ease: a score of 70% or higher is good, a score between 50 and 70% is fair, and a score below 50% is poor. Each service is also color-coded based on its score. This way, you can quickly get a sense of where your company is on the road to reliability, and how many services will need to be worked on to get you as close to 100% as possible.
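As a rough sketch of the banding described above, the thresholds can be written as a small Python function. The function names and the sample services are hypothetical, and treating a score of exactly 50% as fair is an assumption drawn from "below 50% is poor". This is not how Gremlin computes or exposes its scores.

```python
def score_band(score):
    """Map a reliability score (0 to 100) to the band named in the report."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    if score >= 70:
        return "good"
    if score >= 50:  # assumption: exactly 50 counts as fair
        return "fair"
    return "poor"


def group_by_band(scores):
    """Group service names by band; `scores` maps service name to score."""
    bands = {"good": [], "fair": [], "poor": []}
    for service, score in scores.items():
        bands[score_band(score)].append(service)
    return bands


# Hypothetical scores, for illustration only.
example_scores = {"checkout": 82, "cart": 64.5, "search": 50, "auth": 31}
print(group_by_band(example_scores))
```

Grouping services this way gives a quick count of how many still sit in the fair or poor bands.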
Start getting real-time reliability reports today
Engineers put a lot of effort into making their systems more reliable. Whether you’re showing your work to leadership, catching up on new detected risks, or looking for areas needing improvement, Gremlin’s reports have you covered. Over the coming weeks, we’ll be publishing more blog posts that dive into each report in detail. Those will be linked below, so stay tuned!
Gremlin's automated reliability platform empowers you to find and fix availability risks before they impact your users. Start finding hidden risks in your systems with a free 30-day trial.