Measuring the benefits of Chaos Engineering

We often talk about the tangible benefits that Chaos Engineering provides for teams managing complex systems, such as reduced downtime, increased availability, and decreased mean time to resolution (MTTR). And while these benefits are easy to understand intuitively, measuring them is much harder. After all, the entire goal of Chaos Engineering is to prevent impactful events from happening in the future. How do you quantify and demonstrate the value of something that won't happen?

In this article, we'll provide guidelines that you can use to measure the success of your Chaos Engineering efforts and demonstrate value to the rest of the business.

The benefits of Chaos Engineering

Before we can talk about measuring the benefits of Chaos Engineering, let's discuss what those benefits are.

A key benefit is cost savings. The primary goal of Chaos Engineering is to prevent incidents from causing problems in production and creating downtime. Downtime costs companies an average of $9,000 per minute, with complete outages costing twice as much as partial outages. Chaos Engineering reduces downtime by preventing outages before they happen and avoiding these costs altogether. And while there is an upfront cost to implementing Chaos Engineering, a survey by Forrester Consulting found that it had a 245% return on investment (ROI).

However, the only way to demonstrate cost savings is to track costs related to Chaos Engineering and improved reliability. If you can demonstrate that Chaos Engineering saved costs in the long term by reducing downtime and incidents, management will be more likely to continue investing in it. It's important to keep this in mind as we explore ways to measure and prove the benefits later in this article.

The challenge in measuring Chaos Engineering's benefits

Now that we've covered the key benefits, why is it so hard to measure them?

For one, Chaos Engineering is a proactive, preventative process. The goal is to prevent failures and outages from happening in the future, when you need your systems to be at their most reliable. With that in mind, how do you measure something that you're trying to prevent?

Measuring the cost of an outage is hard enough on its own, let alone measuring the cost of an outage that never happened. Instead, look at similar outages that your team or other teams experienced and use their cost as your own basis for calculation. But even this number can be misleading, as each team is different, and even individual services contribute different amounts of value to the business.

Also, outage prevention isn't the only value Chaos Engineering provides. Chaos Engineering helps engineers find and fix bugs earlier in the development process. This provides significant savings on engineering labor and opportunity costs, as bugs are up to 30x more expensive to fix in production than in earlier stages of the software development lifecycle. We need to factor these costs into the ROI as well.

Measuring Chaos Engineering's impact

Now that we've introduced the challenges, let's look at ways to calculate the cost of downtime and the value of a Chaos Engineering initiative.

Ensure you're capturing baseline metrics and costs

Start by understanding your baseline performance metrics. This refers not only to your systems and services, but also to how well your team handles incident response. These metrics include:

Alerting/on-call metrics such as mean time to detection (MTTD: how long it takes to detect a problem after it started) and mean time to resolution (MTTR: how long it takes to resolve a problem after it started).
High-severity incidents such as the number of SEV1 and SEV2 incidents, outage lengths, etc.
Infrastructure metrics such as resource consumption, latency, throughput, and replication lag.
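
As a rough illustration of capturing the alerting baseline, the sketch below computes MTTD and MTTR from a list of incident records, using the definitions given above (both measured from when the problem started). The record fields and sample timestamps are hypothetical; in practice, they would come from your incident management or on-call tooling.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records (sample data, not real incidents).
incidents = [
    {
        "started": datetime(2024, 1, 5, 9, 0),
        "detected": datetime(2024, 1, 5, 9, 12),
        "resolved": datetime(2024, 1, 5, 10, 30),
    },
    {
        "started": datetime(2024, 2, 11, 14, 0),
        "detected": datetime(2024, 2, 11, 14, 4),
        "resolved": datetime(2024, 2, 11, 14, 50),
    },
]


def mean_minutes(records, start_key, end_key):
    """Average elapsed minutes between two timestamps across all records."""
    return mean(
        (r[end_key] - r[start_key]).total_seconds() / 60 for r in records
    )


# MTTD: start of the problem until detection.
mttd = mean_minutes(incidents, "started", "detected")
# MTTR: start of the problem until resolution.
mttr = mean_minutes(incidents, "started", "resolved")

print(f"MTTD: {mttd:.1f} min, MTTR: {mttr:.1f} min")
```

Recomputing these values at regular intervals gives you a baseline to compare against once you start running chaos experiments.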

For past incidents, calculate and track the economic impact that the business experienced. Ask questions like:

Was there downtime?
How long did the incident last?
How much revenue was lost due to the outage?
How many engineers were working on the issue? How many person-hours went into fixing the problem?
Were there any other direct costs associated with the incident (e.g., provisioning new systems)?

An easy way to calculate revenue loss is by estimating the average amount of revenue the business normally generates over a period (e.g., one hour) and multiplying this by the duration of the outage.
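
Here is a minimal sketch of that estimate, extended to include the labor and direct costs raised in the questions above. The function names, the hourly labor rate, and the sample figures are assumptions made for illustration, not values from any real incident.

```python
def revenue_loss(avg_revenue_per_hour, outage_hours):
    """Estimated revenue lost: normal hourly revenue times outage duration."""
    return avg_revenue_per_hour * outage_hours


def incident_cost(avg_revenue_per_hour, outage_hours, person_hours,
                  hourly_labor_rate, other_direct_costs=0.0):
    """Rough total cost of one incident.

    Adds engineering labor (person-hours times an assumed hourly rate)
    and any other direct costs to the estimated revenue loss.
    """
    lost = revenue_loss(avg_revenue_per_hour, outage_hours)
    labor = person_hours * hourly_labor_rate
    return lost + labor + other_direct_costs


# Illustrative numbers only.
lost = revenue_loss(avg_revenue_per_hour=12_000, outage_hours=1.5)
total = incident_cost(
    avg_revenue_per_hour=12_000,
    outage_hours=1.5,
    person_hours=10,
    hourly_labor_rate=100,
    other_direct_costs=500,
)
print(f"Revenue lost: ${lost:,.0f}, total incident cost: ${total:,.0f}")
```

This treats revenue as evenly spread over time. If your traffic has strong peaks, using the average revenue for the specific time of day or week of the outage will give a closer estimate.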

Track the number of issues addressed through Chaos Engineering

Whenever you uncover an issue in your infrastructure using Chaos Engineering, make sure to document that issue in your ticketing system. Add a note or tag specifying that the issue was found with Chaos Engineering. This allows your team to easily find and list all issues found and fixed with Chaos Engineering so you can prove its effectiveness to management.

Additionally, track the amount of work that went into fixing these issues. Ideally, this will reflect the number of person-hours (the number of people working on the issue multiplied by the time taken to resolve it). If that's not available, estimate the number of hours spent troubleshooting, fixing, testing, and deploying the fix.
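
The tagging and person-hour tracking described above could look something like the following sketch. The ticket structure, ticket IDs, and the "chaos-engineering" tag name are hypothetical; most ticketing systems let you export or query equivalent fields.

```python
# Hypothetical ticket export (sample data for illustration).
tickets = [
    {"id": "OPS-101", "tags": {"chaos-engineering"},
     "engineers": 2, "hours_to_resolve": 3.0},
    {"id": "OPS-102", "tags": {"customer-report"},
     "engineers": 1, "hours_to_resolve": 5.0},
    {"id": "OPS-103", "tags": {"chaos-engineering", "database"},
     "engineers": 3, "hours_to_resolve": 1.5},
]


def chaos_tickets(all_tickets, tag="chaos-engineering"):
    """Return only the tickets tagged as found through Chaos Engineering."""
    return [t for t in all_tickets if tag in t["tags"]]


def person_hours(ticket):
    """Number of people working on the issue times the time to resolve it."""
    return ticket["engineers"] * ticket["hours_to_resolve"]


found = chaos_tickets(tickets)
total_person_hours = sum(person_hours(t) for t in found)

print(f"{len(found)} issues found with Chaos Engineering, "
      f"{total_person_hours} person-hours spent fixing them")
```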

Track trends in SEV1 and SEV2 incidents

If you're unfamiliar with incident severity levels, they're a way of identifying and prioritizing incidents based on their impact on the business. Each incident is assigned a SEV (short for severity) number, with 0 or 1 being a critical impact and 3 or higher being a low impact. For example, if a bank's customer emailing service failed, it would likely be a SEV3, but if their core transaction processing service failed, it would be a SEV1 or SEV0.

As you use Chaos Engineering to uncover weaknesses and implement fixes, you should see the number of incidents, especially high-severity incidents, decrease over time. This is another strong indicator that your Chaos Engineering efforts are working.

According to the 2021 State of Chaos Engineering report, nearly 20% of organizations experience 10 to 20 high-severity incidents per month!

If you've already started tracking your cost per incident as described earlier in this article, this is another way to calculate your cost savings thanks to Chaos Engineering. For each severity level, count how many fewer incidents occurred compared to before you implemented Chaos Engineering, and multiply this by the average cost of an incident for that severity level. This gives you the estimated cost savings per severity level, and if you add them together, you'll get your estimated total savings. For example, for SEV1 incidents:

((Number of SEV1 incidents last period) - (Number of SEV1 incidents this period)) * (Average cost of a SEV1 incident)
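
To apply this across several severity levels at once, a sketch along these lines may help. It counts how many fewer incidents occurred in the current period than in the earlier one, then multiplies by the average cost for that level. The incident counts and average costs are made-up sample values, and the function name is our own.

```python
def estimated_savings(incidents_before, incidents_after, avg_cost):
    """Estimated savings per severity level.

    For each level: (incidents before - incidents after) * average cost.
    A negative value means incidents increased for that level.
    """
    savings = {}
    for sev, cost in avg_cost.items():
        reduced = incidents_before.get(sev, 0) - incidents_after.get(sev, 0)
        savings[sev] = reduced * cost
    return savings


# Illustrative numbers only.
before = {"SEV1": 4, "SEV2": 10}          # period before Chaos Engineering
after = {"SEV1": 1, "SEV2": 6}            # comparable period afterwards
avg_cost = {"SEV1": 50_000, "SEV2": 8_000}

per_level = estimated_savings(before, after, avg_cost)
total_savings = sum(per_level.values())

print(per_level)
print(f"Estimated total savings: ${total_savings:,}")
```

Compare periods of equal length, and keep in mind that other changes made during the same period (new features, migrations, traffic growth) also affect incident counts.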

Estimate potential customer impact of fixed issues

Ultimately, the goal of all of this effort is to do what's best for your customers. The fewer latent failure modes there are in your system, the less likely it is that those failure modes will cause production incidents, and the less likely it is that customers will have a poor experience using your service. But for those few issues that do end up reaching production, it's valuable to know what the potential impact might be.

If possible, calculate the number of potential customers that an issue would impact. One way to measure this could be by tracking the number of individual user IDs or sessions that are handled by the services being impacted. For example, if you have a service that exclusively handles multi-factor authentication for users, and 20% of your users use multi-factor authentication, then a service outage could theoretically impact up to 20% of your users.

If we assume all customers contribute equally to sales, a 20% drop in users results in a 20% drop in sales. To calculate the revenue lost, we simply take our normal sales volume over a period (e.g., one hour), multiply it by the length of the outage, then multiply by 20%.
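
As a quick sketch of this calculation, under the same simplifying assumption that every customer contributes equally to sales, the example below derives the affected share from user counts and applies it to the outage window. The user counts, hourly sales figure, and function names are hypothetical.

```python
def affected_fraction(users_on_service, total_users):
    """Share of users that depend on the impacted service."""
    return users_on_service / total_users


def revenue_at_risk(hourly_sales, outage_hours, fraction_affected):
    """Normal sales over the outage window, scaled by the affected share.

    Assumes all customers contribute equally to sales.
    """
    return hourly_sales * outage_hours * fraction_affected


# Illustrative numbers: 2,000 of 10,000 users rely on the service.
fraction = affected_fraction(users_on_service=2_000, total_users=10_000)
at_risk = revenue_at_risk(
    hourly_sales=10_000, outage_hours=2, fraction_affected=fraction
)

print(f"{fraction:.0%} of users affected, about ${at_risk:,.0f} at risk")
```

If some customer segments spend more than others, replacing the single fraction with a per-segment breakdown would give a more realistic figure.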

We also can't forget to include the indirect costs that these issues would have caused had they not been fixed. For one, customers who experience outages are less likely to continue using our services in the long term, causing a drop in future sales and revenue. We'd also need to invest several engineering person-hours into identifying and fixing the issue, which has associated salary costs and missed opportunity costs.

In short, proactively fixing reliability issues saves us from:

Lost revenue due to customers no longer having access to our service.
Long-term revenue loss due to customers losing trust in our service.
More engineering time spent fixing problems instead of generating business value.

Conclusion

To summarize, here's how we measure the benefits of Chaos Engineering:

What's our cost of downtime? How much revenue do we lose, how many customers churn after each incident, and how many engineer person-hours do we dedicate to fixing issues after an incident happens?
How much did our previous incidents cost? How many fewer incidents do we have on average after implementing Chaos Engineering? How many fewer high-severity incidents do we have?
Did we reduce our rate of support tickets? How many tickets did we resolve due to Chaos Engineering? How much time were our engineers spending on those tickets, and how much did we save by reducing them?
How many customers did our previous incidents impact, and how many were impacted recently (after implementing Chaos Engineering)?

If you want to dig more into the value of reliability and its positive impact on the business, read our blog on The KPIs of improved reliability.