
The KPIs of improved reliability

Prioritizing reliability can be challenging for businesses. Although reliable systems and services are necessary for building customer trust and growing revenue, businesses also need to focus on initiatives such as developing new products and features. When determining which initiatives to prioritize and which ones to defer, it's understandable for business leaders to choose those that provide an obvious return.

Of course, reliability does have value, but making that value clear to the business is surprisingly tricky. Site reliability engineers (SREs) generally understand the need for reliable systems better than anyone else, since they're the ones managing those systems, responding to incidents, and fixing outages. For business leaders to understand the value of reliability, SREs must be able to tie these benefits back to business-level metrics and key performance indicators (KPIs) by demonstrating how reliable systems contribute to revenue growth, cost reduction, and increased customer satisfaction.

In this blog post, we'll look at several reliability-centric metrics and KPIs that show the positive impact that reliability has on businesses.

Why is reliability important for businesses?

Reliability is how well we can trust a system to remain available, whether it’s an application, a distributed service, a single server running in a data center, or even a process that employees follow, such as an incident response runbook. The more reliable a system is, the longer it can run before failing or requiring human intervention.

When we think about online services, reliability is key. Modern customers don't just want consistently fast and stable access to online services, they demand it. Outages are immediately publicized through social networks and websites like Downdetector, and can irreversibly harm a company’s reputation in a matter of minutes. Customers have plenty of options and plenty of motivation to leave services that don't meet their reliability expectations, especially in highly competitive markets like e-commerce and SaaS.

The less reliable our systems are, the more we lose in sales, brand recognition, and customer trust. For services that already compete on features and usability, reliability can become a key differentiator.

To understand the impact that reliability can have on a business, we need ways to measure reliability meaningfully and objectively. This is where metrics and KPIs come into play.

Quantifying reliability with metrics and key performance indicators (KPIs)

A key performance indicator (KPI) is a measurable value tracking the business’ progress towards a specific goal or objective. A metric is a method of measuring something, or the results obtained from a measurement. Metrics are used to track progress towards a KPI, and how well you're meeting your KPIs indicates how well you're meeting your objectives. For example, if you're an online retailer and your KPI is to increase sales by 20%, your metrics might include the average order size, number of transactions, email and ad conversion rates, and shopping cart abandonment rate. Each of these data points directly impacts sales in a meaningful way.

In order to make reliability a business objective, we need a way to measure, track, and demonstrate the benefits of improved reliability. In addition, we need to be able to benchmark against previous efforts and industry peers. This helps demonstrate the effectiveness of our reliability efforts and the benefits they provide to the business over time. We'll look at four commonly used metrics: uptime, Service Level Agreements (SLAs), mean time between failures (MTBF), and mean time to resolution (MTTR).

Uptime

Uptime is the amount of time that a system is available for use. It's typically measured as the percentage of time that a system is accessible to users, or as the percentage of user requests that are successfully fulfilled, over a given period. 100% uptime is the equivalent of zero errors or downtime, and while this is an ideal we should strive for, it simply isn't realistic due to the unpredictability of complex distributed systems. Instead, teams aim for high availability, which sets a high minimum target uptime. For example, Netflix promises 99.99% availability, which allows for less than five minutes of downtime per month.
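To make these targets concrete, the downtime a given availability target allows is just the total minutes in the period multiplied by the fraction of time the service may be unavailable. Here's a minimal sketch of that arithmetic in Python, assuming a 30-day month:

    # Downtime allowed per month for a given availability target.
    # Assumes a 30-day month for illustration.
    def allowed_downtime_minutes(availability_pct: float, days: int = 30) -> float:
        """Maximum downtime (in minutes) that still meets the target."""
        total_minutes = days * 24 * 60
        return total_minutes * (1 - availability_pct / 100)

    for target in (99.0, 99.5, 99.9, 99.99):
        print(f"{target}% uptime allows {allowed_downtime_minutes(target):.1f} minutes of downtime per month")

At 99.99%, that works out to roughly 4.3 minutes of downtime per month.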

The higher the target uptime, the harder it is to achieve and maintain. The benefit is that your organization gains a reputation for being reliable, and customers can feel more confident putting their trust in you.

Service level agreements (SLAs)

A service level agreement (SLA) is a contract between your organization and your customers promising a minimum level of availability. This level is often measured by uptime, but can also be tracked using response time or error rate. If availability falls below the minimum promised in the SLA, customers may be entitled to discounts or reimbursements. For example, AWS will provide a full service credit to customers of their EC2 service if availability falls below 95% for any given month. SLAs aren't mandatory, but service providers that offer them show their customers that they care enough about reliability and user satisfaction to provide financial compensation if they fail to meet their own standards.
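As a rough sketch of how SLA compliance might be checked, the same uptime arithmetic can be compared against the promised threshold. The 95% figure mirrors the EC2 example above; the pass/fail logic is illustrative, not any provider's actual credit schedule:

    # Check measured monthly uptime against an SLA threshold.
    # The 95% default mirrors the EC2 example; the logic is illustrative only.
    def monthly_uptime_pct(downtime_minutes: float, days: int = 30) -> float:
        total_minutes = days * 24 * 60
        return 100 * (1 - downtime_minutes / total_minutes)

    def sla_breached(downtime_minutes: float, sla_pct: float = 95.0) -> bool:
        return monthly_uptime_pct(downtime_minutes) < sla_pct

    print(sla_breached(30))    # False: about 99.93% uptime
    print(sla_breached(2500))  # True: about 94.2% uptime, a credit may be owed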

Can I easily identify the quality of our customer experience? I’d much rather spend the time and effort understanding my systems on my dime than understanding our systems on our customers’ dime, when they may be hitting SLA violations [or chargebacks] because we didn’t do a good enough job understanding the things that are running in production.

Mean time between failures (MTBF)

Mean time between failures (MTBF) is the average amount of time between system failures. This metric directly affects uptime. A low MTBF means our systems are failing often, which suggests our engineers are deploying problematic code and aren't addressing the underlying causes of failure. This hurts us twice over: customers experience impacts more frequently, and our operations teams have to constantly manage those issues.
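As a minimal sketch, MTBF can be computed from a list of failure timestamps by averaging the gaps between consecutive failures. The timestamps below are hypothetical example data:

    # Compute MTBF from incident start times (hypothetical example data).
    from datetime import datetime, timedelta

    failures = [
        datetime(2023, 1, 3, 14, 0),
        datetime(2023, 1, 12, 9, 30),
        datetime(2023, 1, 27, 22, 15),
    ]

    # Average the gaps between consecutive failures.
    intervals = [later - earlier for earlier, later in zip(failures, failures[1:])]
    mtbf = sum(intervals, timedelta()) / len(intervals)
    print(f"MTBF: {mtbf}")  # e.g. 12 days, 4:07:30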

Mean time to resolution (MTTR)

Mean time to resolution (MTTR) is the average amount of time for our engineers to detect and fix a problem. Unlike MTBF, we want a low MTTR since this means our engineers are addressing problems quickly. There are several ways we could accomplish this, including:

  • Using monitoring and alerting solutions to quickly notify engineers of problems.
  • Creating incident response playbooks to guide engineers through resolving problems.
  • Automating as much of our incident response process as possible.

A high MTTR means our systems are down for extended amounts of time, and that our engineers are struggling to troubleshoot and resolve the issue. We can reduce our MTTR by finding and addressing failure modes, and by preparing our teams to respond to incidents.
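As a companion sketch to the MTBF example, MTTR is the average of each incident's detection-to-resolution duration. The incident records here are hypothetical:

    # Compute MTTR from incident records (hypothetical example data).
    from datetime import datetime, timedelta

    incidents = [
        {"detected": datetime(2023, 2, 1, 10, 0), "resolved": datetime(2023, 2, 1, 10, 42)},
        {"detected": datetime(2023, 2, 9, 3, 15), "resolved": datetime(2023, 2, 9, 4, 0)},
        {"detected": datetime(2023, 2, 20, 16, 5), "resolved": datetime(2023, 2, 20, 16, 23)},
    ]

    # Average the time from detection to resolution across incidents.
    durations = [i["resolved"] - i["detected"] for i in incidents]
    mttr = sum(durations, timedelta()) / len(durations)
    print(f"MTTR: {mttr}")  # e.g. 0:35:00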

One of the long-term benefits of [our Chaos Engineering practice] is it improved our MTTR and incident communications because we were able to practice without being in a live-fire scenario.

Translating metrics to KPIs to business objectives

Metrics like uptime, SLAs, MTBF, and MTTR tell us the state of our systems in terms of reliability, but they don’t tell us the value that we get from being reliable. For that, we need to connect them to our KPIs and business metrics.

As an example, imagine you're an engineering manager at a SaaS company with aggressive year-over-year growth targets. Your SaaS product is the company's sole source of income, so customers must be able to access and use your service for the company to generate revenue. The more outages and downtime your service has, the more likely customers are to churn, demand reimbursements, or simply stop using your service, and the less likely the company is to reach its growth targets.

With this context, we can create a KPI such as "our service must have a monthly uptime of 99.5%." This gives us our target, but how will we measure and track compliance with that target? If we define uptime as "the number of minutes our service was accessible out of the total number of minutes in a month", this gives us a clear metric to measure using a monitoring or observability tool. It also ties the metric directly to our KPI, since we know our monthly downtime must not exceed 0.5% of the month (roughly 216 minutes). The more time our service is accessible, the more likely we are to meet our monthly uptime KPI, and since this KPI has a direct impact on sales, the company is more likely to reach its growth target.
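One common way to track a KPI like this is as an error budget: the downtime the target allows, minus the downtime already used. A minimal sketch, again assuming a 30-day month:

    # Track the 99.5% monthly uptime KPI as a remaining error budget.
    # Assumes a 30-day month for illustration.
    def error_budget_remaining(downtime_minutes_used: float,
                               target_pct: float = 99.5,
                               days: int = 30) -> float:
        total_minutes = days * 24 * 60
        budget = total_minutes * (1 - target_pct / 100)  # ~216 minutes at 99.5%
        return budget - downtime_minutes_used

    print(error_budget_remaining(45))   # ~171 minutes of downtime left this month
    print(error_budget_remaining(230))  # negative: the KPI has already been missed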

While business objectives are often unique to the organization, some are commonly shared. These include revenue and customer retention, which we'll explore next.

Lost revenue and added costs

The most immediate impact of an outage is lost revenue. This is especially true in e-commerce, where the top e-commerce sites risk losing as much as $300,000 in sales for every minute of downtime. Outages also divert engineering resources away from activities meant to generate revenue, such as feature development and performance optimization. In highly regulated industries such as finance and healthcare, outages can also come with significant fines, loss of trust, and even personal liability.
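Estimating this cost for your own service is straightforward arithmetic: multiply downtime by revenue per minute. The figures below are hypothetical inputs, not benchmarks:

    # Rough estimate of revenue lost to downtime (hypothetical inputs).
    def lost_revenue(downtime_minutes: float, revenue_per_minute: float) -> float:
        return downtime_minutes * revenue_per_minute

    # A 12-minute outage at $300,000 in sales per minute:
    print(f"${lost_revenue(12, 300_000):,.0f}")  # $3,600,000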

Customer attrition

In addition to lost sales, poor reliability can also lead to lost customers. Customers abandon services for several reasons, and reliability is a key factor. Akamai found that sites that went down experienced a permanent abandonment rate of 9%, and sites that performed slowly experienced a permanent abandonment rate of 28%. If customers can’t trust the services they depend on to be operational, or if downtime is impacting their ability to run their business, they’ll move to a competitor. Service providers (including SaaS, PaaS, and IaaS providers) in particular need to set the reliability standard for their customers by offering services that are reliable, performant, and quick to recover.

A key indicator for customer attrition is Net Promoter Score (NPS), which measures customer satisfaction and loyalty. A high NPS indicates that customers are enthusiastic about a service, while a low NPS indicates that customers are unhappy with a service and may even dissuade potential customers. While many variables contribute to NPS, reliability has a direct and significant impact.
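NPS is typically computed from 0-10 survey responses: the percentage of promoters (scores of 9 or 10) minus the percentage of detractors (scores of 0 through 6). A minimal sketch, with hypothetical responses:

    # Compute Net Promoter Score from 0-10 survey responses (hypothetical data).
    def net_promoter_score(responses: list[int]) -> float:
        promoters = sum(1 for r in responses if r >= 9)
        detractors = sum(1 for r in responses if r <= 6)
        return 100 * (promoters - detractors) / len(responses)

    print(net_promoter_score([10, 9, 9, 8, 7, 6, 10, 3, 9, 9]))  # 40.0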

Reliability is good for customers, it's good for engineers because you're going to get paged less; and when you do get paged, hopefully, it's something that's urgent so you don't have to waste that executive decision making ability. And, it's good for the business because it improves revenue, it lowers churn, it does all those healthy things.

How Chaos Engineering helps improve reliability

Starting a reliability initiative involves:

  • Taking an in-depth look at our systems.
  • Identifying the different ways that they can fail.
  • Addressing these failure modes by deploying fixes.
  • Testing our systems to ensure they’re no longer vulnerable to those failure modes.
  • Creating and promoting a culture of reliability within our company.

To do this, we need to be able to test our systems against a wide range of failure modes, resolve any issues we find, then re-test our systems to make sure these fixes work. The question is: how can we do this in a safe, effective, and controlled way that doesn’t put our systems or customers at risk? The answer is with Chaos Engineering.

Chaos Engineering is the science of performing intentional experimentation on a system by injecting precise and measured amounts of failure. By observing how our systems react to this failure, we can find ways to improve their resilience.

Chaos Engineering tests our assumptions about how our systems behave under certain conditions. For example, if one of our servers goes offline, can we successfully failover to another server without our application crashing? With Chaos Engineering, we can perform a simulated outage on a specific server, observe the impact on our application, then safely halt and roll back the outage. We can then use our observations and insights to build resilience into the server and reduce the risk of real-world outages. By repeating this process, we can gradually build up the resilience of our entire deployment without putting our operations at risk. In addition, we can prepare our engineers to respond to these situations so that if they happen in production, they already have the training and muscle memory needed to respond quickly and effectively.
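To make that process concrete, here's a generic, minimal sketch of an experiment loop: verify the steady state, inject a fault, watch a health check, and always roll back. The inject_fault and rollback_fault helpers and the health endpoint are hypothetical stand-ins, not Gremlin's API:

    # Generic chaos experiment loop (hypothetical helpers, not Gremlin's API).
    import time
    import urllib.request

    HEALTH_URL = "http://localhost:8080/health"  # hypothetical health endpoint

    def healthy() -> bool:
        try:
            with urllib.request.urlopen(HEALTH_URL, timeout=2) as resp:
                return resp.status == 200
        except OSError:
            return False

    def run_experiment(inject_fault, rollback_fault, duration_s: int = 60) -> bool:
        assert healthy(), "steady state not met; aborting before injecting failure"
        inject_fault()  # e.g. take one server in the pool offline
        try:
            deadline = time.time() + duration_s
            while time.time() < deadline:
                if not healthy():  # hypothesis violated: users would be impacted
                    return False   # halt early; the finally block rolls back
                time.sleep(5)
            return True            # failover held up; hypothesis confirmed
        finally:
            rollback_fault()       # always restore the system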

62% of [developers surveyed by PagerDuty] are spending 10 or more extra hours a week resolving incidents...and 39% are firefighting or focused on unplanned outages 100% of the time. There’s a significant strain on practitioners charged with keeping digital services running.

The improvements that Chaos Engineering brings show up first in our low-level metrics, especially uptime and MTBF. Increasing availability and reducing our failure rate will, in turn, reduce the risk of missed revenue, lower our cost of downtime, and improve customer satisfaction. Given how competitive online services are, companies that differentiate on reliability will not only gain and retain customer trust, but also avoid costly incidents.

The unpredictable, the unknown, has just as much impact—if not more—on your business than the known. Being prepared for [unexpected events] and trying to find them is one of the most important things we can do.

Learn how Gremlin can help you improve reliability at your company. Request your free trial today.