Uncovering hidden reliability risks in complex systems

Did you know that more than half of all services deployed by organizations are only deployed to a single availability zone (AZ)? This means that if you’re one of the many organizations hosting critical services in the cloud, and your primary AZ fails, your entire application may fail with it. Fortunately, once engineers are made aware of this risk, 28% make their services multi-AZ redundant within 9 days on average.

These numbers are based on our internal product metrics, and they tell an interesting story. What other risks are lurking in these systems, and how can we surface those risks to engineers so they can address them?

In this blog post, we’ll answer this question by looking at one of Gremlin’s most popular reports: the Team Risk Report. We’ll show you how it works and how it helps teams uncover and fix latent reliability risks.

What is a reliability risk?

A reliability risk is anything that poses a threat to your environment’s stability—potential points of failure in your system where an outage could occur. These include misconfigurations, bad default values, and reliability anti-patterns that are often overlooked or introduced unintentionally. If you can find and remediate reliability risks, then you can prevent incidents before they happen.

Gremlin automatically monitors systems for many of these risks with our Detected Risks capability. It focuses on the most relevant and critical reliability risks, with an emphasis on those most likely to lead to outages or downtime in Kubernetes environments. These include:

Missing CPU requests
Missing liveness probes
No Availability Zone redundancy
Missing memory requests
Missing memory limits
Application version non-uniformity
Pods in a CrashLoopBackOff state
Pods in an ImagePullBackOff state
Pods with init container errors
Unschedulable pods
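
To make a few of these risks concrete, here is a minimal Python sketch that checks a parsed Kubernetes pod spec for missing CPU requests, memory requests, memory limits, and liveness probes. It is not how Gremlin detects risks; the function name `find_config_risks` and the example spec are assumptions made for illustration. The field names it inspects (`resources.requests`, `resources.limits`, `livenessProbe`) are the standard Kubernetes container fields.

```python
from typing import Any


def find_config_risks(pod_spec: dict[str, Any]) -> list[str]:
    """Return configuration risks found in a parsed Kubernetes pod spec.

    Illustrative only: a risk is reported if at least one container
    in the spec is missing the corresponding setting.
    """
    risks: set[str] = set()
    for container in pod_spec.get("containers", []):
        resources = container.get("resources") or {}
        requests = resources.get("requests") or {}
        limits = resources.get("limits") or {}
        if "cpu" not in requests:
            risks.add("Missing CPU requests")
        if "memory" not in requests:
            risks.add("Missing memory requests")
        if "memory" not in limits:
            risks.add("Missing memory limits")
        if "livenessProbe" not in container:
            risks.add("Missing liveness probes")
    return sorted(risks)


# Hypothetical pod spec: requests are set, but there is no memory limit
# and no liveness probe.
example_spec = {
    "containers": [
        {
            "name": "web",
            "image": "example/web:1.0",
            "resources": {"requests": {"cpu": "250m", "memory": "128Mi"}},
        }
    ]
}

print(find_config_risks(example_spec))
```

For this example spec, the function reports the missing liveness probe and the missing memory limit, which are exactly the kind of easily overlooked defaults described above.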

How do I find my reliability risks?

The “traditional” method of finding reliability risks is to wait until they cause an incident. Engineers respond by deploying a fix and documenting what happened so the risk can be prevented or avoided in the future. Some teams take a more proactive approach by researching common failure modes for their systems using documentation, deployment guides, and well-architected frameworks.

But what if there were a way to combine and automate these two approaches? What if you had access to a library of risks relevant to your services, and could see at a glance whether your services were susceptible to them? That’s what the Team Risk Report is for.

The Team Risk Report lists all of the services in your Gremlin team, all of the detected risks available in Gremlin, and whether each service is susceptible to each risk. It also charts the total number of detected risks for your team over a 90-day period. This gives you a view into not only your outstanding detected risks, but also how many risks you’ve addressed in the past three months.
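
The shape of that report can be pictured as a services-by-risks grid. The sketch below is a rough illustration in Python, not Gremlin’s implementation or API: the function names, the service names, and the sample data are all invented for this example.

```python
def build_risk_report(
    detected: dict[str, set[str]], all_risks: list[str]
) -> dict[str, dict[str, bool]]:
    """Map each service to a {risk: is_susceptible} row."""
    return {
        service: {risk: risk in found for risk in all_risks}
        for service, found in sorted(detected.items())
    }


def total_risks(report: dict[str, dict[str, bool]]) -> int:
    """Count every (service, risk) pair that is currently flagged."""
    return sum(flag for row in report.values() for flag in row.values())


# Hypothetical input: risks detected per service.
all_risks = [
    "Missing CPU requests",
    "Missing memory limits",
    "No Availability Zone redundancy",
]
detected = {
    "checkout": {"No Availability Zone redundancy"},
    "search": {"Missing CPU requests", "Missing memory limits"},
}

report = build_risk_report(detected, all_risks)
for service, row in report.items():
    flagged = [risk for risk, at_risk in row.items() if at_risk]
    print(f"{service}: {', '.join(flagged) or 'no detected risks'}")
print("Total detected risks:", total_risks(report))
```

Recording the total once per day is what would produce a trend line like the 90-day chart.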

For example, we have a team that’s done a great job of cleaning up its risks, but there’s one that consistently reappears: availability zone (AZ) redundancy. This suggests that either the team isn’t aware of AZ redundancy as a best practice, or its Kubernetes cluster isn’t set up for multi-AZ redundancy. In any case, this team should revisit each of its services and ensure they’re configured for redundancy.
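
One way to reason about AZ redundancy is to ask how many distinct zones a service’s pods are running in. The Python sketch below assumes you already have a list of pods (with the service and node each belongs to) and the labels for each node; the input shapes and the function name are hypothetical, and this is not how Gremlin performs the check. The zone label `topology.kubernetes.io/zone` is the well-known Kubernetes node label.

```python
from collections import defaultdict

# Well-known Kubernetes node label that records the node's zone.
ZONE_LABEL = "topology.kubernetes.io/zone"


def services_without_az_redundancy(
    pods: list[dict[str, str]], node_labels: dict[str, dict[str, str]]
) -> list[str]:
    """Return services whose pods run in fewer than two known zones."""
    zones_by_service: dict[str, set[str]] = defaultdict(set)
    for pod in pods:
        # Touch the entry first so services with no known zone are still listed.
        service_zones = zones_by_service[pod["service"]]
        zone = node_labels.get(pod["node"], {}).get(ZONE_LABEL)
        if zone is not None:
            service_zones.add(zone)
    return sorted(
        name for name, zones in zones_by_service.items() if len(zones) < 2
    )


# Hypothetical cluster: two nodes in us-east-1a, one in us-east-1b.
node_labels = {
    "node-a": {ZONE_LABEL: "us-east-1a"},
    "node-b": {ZONE_LABEL: "us-east-1b"},
    "node-c": {ZONE_LABEL: "us-east-1a"},
}
pods = [
    {"service": "checkout", "node": "node-a"},
    {"service": "checkout", "node": "node-b"},
    {"service": "search", "node": "node-a"},
    {"service": "search", "node": "node-c"},
]

print(services_without_az_redundancy(pods, node_labels))
```

In this example, `search` has two replicas but both land in the same zone, so it is still exposed to a single-AZ failure.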

The 90-day trend chart also makes it possible to track improvements (or regressions) in risks over time. This team had a peak count of fourteen risks before dropping to just one a few days later. This is a huge drop, and it indicates that the team put in a lot of work to improve the reliability of their services. But now that the count is creeping back up, we’ll want to raise the issue with the team so its engineers can implement fixes or check for other risks.
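
The same reading of the chart can be expressed as a small calculation: find the peak, find the lowest count since the peak, and compare it to today’s count. This is a sketch under stated assumptions, not part of Gremlin; the `TrendSummary` type, the function name, and the daily counts are invented, with the counts loosely modeled on the peak of fourteen and low of one described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrendSummary:
    peak: int
    low_after_peak: int
    current: int
    regressing: bool


def summarize_trend(daily_counts: list[int]) -> TrendSummary:
    """Summarize a series of daily detected-risk totals (oldest first)."""
    if not daily_counts:
        raise ValueError("daily_counts must contain at least one day")
    peak = max(daily_counts)
    peak_day = daily_counts.index(peak)
    # Lowest total reached on or after the day of the peak.
    low_after_peak = min(daily_counts[peak_day:])
    current = daily_counts[-1]
    return TrendSummary(
        peak=peak,
        low_after_peak=low_after_peak,
        current=current,
        regressing=current > low_after_peak,
    )


# Hypothetical daily totals: a peak of 14, a drop to 1, then a slow climb.
counts = [9, 14, 14, 6, 1, 1, 2, 3]
summary = summarize_trend(counts)
print(summary)
```

A `regressing` value of `True` is the signal to look at which risks have reappeared and why.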

See your team’s reliability risks in minutes

To get your instant Team Risk Report, current users can log in to the Gremlin web app and select “Team Risk” from the Reports menu, or visit this link. You’ll see all of the services belonging to your team, and whether they have outstanding detected risks. Gremlin’s reports update daily, and can be exported to PDF to attach to your ticketing system, document your reliability work, or show off during standup.

If you’re not using Gremlin yet, you can sign up for a free trial and view your reliability risks in just a few minutes.

Start your free trial

Gremlin's automated reliability platform empowers you to find and fix availability risks before they impact your users. Start finding hidden risks in your systems with a free 30-day trial.