October 29, 2020 - 5 min read

The KPIs of improved reliability

Reliability can be a tricky subject for service providers. Building reliable systems and applications is necessary for serving customers and generating revenue, but teams often delay reliability programs in favor of other initiatives. For businesses to understand the importance of reliability, we need to tie its benefits back to business-level metrics and demonstrate how it contributes to revenue, growth, costs, and customer satisfaction.

In this blog post, we’ll present reliability-centric metrics and key performance indicators (KPIs) that show the positive impact that reliability has on businesses.

Why is reliability important?

Reliability is how well we can trust a system to remain available, whether it’s an application, a server, or even a process that our employees follow. The more reliable a system is, the longer it can run before failing or requiring intervention from an engineer.

Why is this important? Customers expect consistently fast and stable access to online services. Outages become visible immediately through websites like Downdetector and social networks, and can irreversibly harm a company’s reputation. Because online services markets are so competitive, customers have plenty of options and the motivation to leave services that don’t meet their reliability expectations.

The less reliable our systems are, the more we lose out in sales, brand recognition, and customer trust. For services that already compete on features and usability, reliability can become a key differentiator.

Next, we’ll explore how to measure reliability in a meaningful way.

Quantifying reliability with metrics and key performance indicators (KPIs)

A key performance indicator (KPI) is a measurable value that tracks the business’s progress towards a specific goal or objective. When making reliability a business objective, we need to demonstrate our progress in a way that can be quantified and benchmarked against previous periods and industry peers. This helps us show the effectiveness of our initiatives and the benefits they provide to the business.

Uptime

Uptime is the amount of time that a system is available for use. It’s typically measured as a percentage of a period of time, or as the percentage of user requests that are successfully fulfilled. 100% uptime is ideal, but isn’t realistic. Instead, teams aim for high availability, which sets a high minimum uptime target. For example, companies like Netflix promise 99.99% availability, which allows for less than five minutes of downtime per month.
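To make the arithmetic behind an uptime target concrete, here is a minimal Python sketch. The function names and the 30-day month are assumptions chosen for illustration; they are not part of any particular monitoring tool.

```python
def allowed_downtime_minutes(target_uptime_pct: float, period_days: float = 30) -> float:
    """Minutes of downtime permitted in a period for a given uptime target."""
    if not 0 <= target_uptime_pct <= 100:
        raise ValueError("target_uptime_pct must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - target_uptime_pct / 100)


def measured_uptime_pct(successful_requests: int, total_requests: int) -> float:
    """Request-based uptime: the share of user requests that were fulfilled."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return 100 * successful_requests / total_requests


# Compare the downtime budget for a few common targets (30-day month assumed).
for target in (99.0, 99.9, 99.99):
    budget = allowed_downtime_minutes(target)
    print(f"{target}% over 30 days allows {budget:.2f} minutes of downtime")
```

Under these assumptions, a 99.99% target leaves a budget of roughly 4.3 minutes per 30-day month, which is consistent with the "less than five minutes" figure above.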

The higher the target uptime, the harder it is to achieve and maintain. The benefit is that your organization gains a reputation for being reliable, and customers can feel more confident in putting their trust in you.

Service level agreements (SLAs)

A service level agreement (SLA) is a contract between your organization and your customers promising a minimum level of availability. This level is often measured by uptime. If availability falls below what’s promised by the SLA, customers may be entitled to discounts or reimbursements. For example, AWS will provide a full service credit to customers of its EC2 service if availability falls below 95% for any given month. SLAs aren’t mandatory, but service providers that offer one show their customers that they care about reliability and the satisfaction of their users.
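As a rough illustration of how an SLA turns measured uptime into a credit, the sketch below looks up a credit percentage from a tier table. The tier table is hypothetical: only the "full credit below 95%" threshold comes from the example above, and the other tiers are invented for illustration and do not describe any real provider's terms.

```python
# Hypothetical credit tiers: (uptime threshold in %, credit as % of monthly bill).
# A month whose uptime falls below a threshold earns that tier's credit.
EXAMPLE_TIERS = [
    (95.0, 100),   # full credit, mirroring the threshold mentioned above
    (99.0, 30),    # assumed tier, for illustration only
    (99.99, 10),   # assumed tier, for illustration only
]


def sla_credit_pct(monthly_uptime_pct: float, tiers=EXAMPLE_TIERS) -> int:
    """Return the credit owed (as % of the bill) for a month's measured uptime."""
    # Check the most severe breach first so the largest applicable credit wins.
    for threshold, credit in sorted(tiers):
        if monthly_uptime_pct < threshold:
            return credit
    return 0  # the SLA was met, so no credit is owed


for uptime in (99.995, 99.5, 98.0, 94.0):
    print(f"uptime {uptime}% -> credit {sla_credit_pct(uptime)}%")
```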

Can I easily identify the quality of our customer experience? I’d much rather spend the time and effort understanding my systems on my dime than understanding our systems on our customers’ dime, when they may be hitting SLA violations [or chargebacks] because we didn’t do a good enough job understanding the things that are running in production.
Tyler Wells
Senior Director of Engineering - SRE Platform at Twilio

Mean time between failures (MTBF)

Mean time between failures (MTBF) is the average amount of time between system failures. This metric directly affects uptime. A low MTBF means our systems are failing often, which suggests that we are deploying problematic code or leaving the underlying causes of failure unaddressed. This hurts twice: customers are affected more often, and operations teams have to constantly manage these issues.
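One simple way to compute MTBF is to average the gaps between consecutive failure timestamps. The sketch below assumes we already have a list of failure start times (for example, exported from an incident tracker); the function name and sample dates are hypothetical.

```python
from datetime import datetime, timedelta


def mean_time_between_failures(failure_times: list[datetime]) -> timedelta:
    """Average gap between consecutive failure start times."""
    if len(failure_times) < 2:
        raise ValueError("need at least two failures to measure a gap")
    ordered = sorted(failure_times)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps, timedelta()) / len(gaps)


# Hypothetical failure log: gaps of 10 and 20 days.
failures = [
    datetime(2020, 1, 1),
    datetime(2020, 1, 11),
    datetime(2020, 1, 31),
]
print("MTBF:", mean_time_between_failures(failures))
```

Some teams instead divide total operating time by the number of failures, which excludes repair time from the gaps; either convention works as long as it is applied consistently across reporting periods.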

Mean time to resolution (MTTR)

Mean time to resolution (MTTR) is the average amount of time it takes our engineers to detect and fix a problem. Unlike MTBF, where a higher value is better, we want a low MTTR, since this means our engineers are addressing problems quickly. There are several ways we could accomplish this, including:

  • Using monitoring and alerting solutions to quickly notify engineers of problems.
  • Creating incident response playbooks to guide engineers through resolving problems.
  • Automating as much of our incident response process as possible.

A high MTTR means our systems are down for extended periods, and that our engineers are struggling to troubleshoot and resolve issues. We can reduce our MTTR by finding and addressing failure modes, and by preparing our teams to respond to incidents.
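MTTR can be calculated the same way from incident records. This sketch assumes each incident is a pair of timestamps (when the problem started and when it was resolved); the record format and sample data are hypothetical.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average of (resolved_at - started_at) across all incidents."""
    if not incidents:
        raise ValueError("need at least one incident")
    durations = [resolved - started for started, resolved in incidents]
    if any(d < timedelta(0) for d in durations):
        raise ValueError("an incident was resolved before it started")
    return sum(durations, timedelta()) / len(durations)


# Hypothetical incident log: one 30-minute and one 90-minute incident.
incident_log = [
    (datetime(2020, 3, 2, 9, 0), datetime(2020, 3, 2, 9, 30)),
    (datetime(2020, 3, 9, 14, 0), datetime(2020, 3, 9, 15, 30)),
]
print("MTTR:", mean_time_to_resolution(incident_log))
```

Tracking this value per quarter is one way to see whether playbooks, alerting, and automation are actually shortening incidents.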

One of the long-term benefits of [our Chaos Engineering practice] is it improved our MTTR and incident communications because we were able to practice without being in a live-fire scenario.
Justin Turner
Senior Software Engineering Manager at H-E-B

Translating KPIs to business objectives

Metrics like uptime, SLAs, MTBF, and MTTR tell us the state of our systems in terms of reliability, but they don’t tell us the value that we get from being reliable. For that, we need to look at business-oriented metrics.

62% of [developers surveyed by PagerDuty] are spending 10 or more extra hours a week resolving incidents...and 39% are firefighting or focused on unplanned outages 100% of the time. There’s a significant strain on practitioners charged with keeping digital services running.
Rachel Obstler
VP, Product at PagerDuty

Lost revenue and added costs

The most immediate impact of an outage is lost revenue, especially for e-commerce websites. The top e-commerce sites risk losing as much as $300,000 in sales for every minute of downtime. Outages also divert engineering resources away from activities meant to generate revenue, such as feature development and performance optimization. In highly regulated industries such as finance and healthcare, outages can also come with significant fines.
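A back-of-the-envelope estimate of what an outage costs can combine these factors. The sketch below is a simplification under stated assumptions: revenue is lost at a constant per-minute rate, engineering cost is hours worked times an hourly rate, and fines are a known lump sum. All of the figures in the example are hypothetical.

```python
def downtime_cost(
    minutes_down: float,
    revenue_per_minute: float,
    engineers: int = 0,
    hourly_rate: float = 0.0,
    fines: float = 0.0,
) -> dict[str, float]:
    """Estimate the cost of an outage, broken down by category."""
    lost_revenue = minutes_down * revenue_per_minute
    # Time that engineers spend on the incident instead of planned work.
    engineering_cost = engineers * hourly_rate * minutes_down / 60
    return {
        "lost_revenue": lost_revenue,
        "engineering_cost": engineering_cost,
        "fines": fines,
        "total": lost_revenue + engineering_cost + fines,
    }


# Hypothetical outage: 12 minutes, $5,000 of sales per minute, 4 engineers at $90/hour.
estimate = downtime_cost(12, 5_000, engineers=4, hourly_rate=90)
for category, amount in estimate.items():
    print(f"{category}: ${amount:,.2f}")
```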

Customer attrition

As service providers, we need to set the reliability standard for our customers by offering services that are reliable, performant, and quickly recoverable. Customers abandon services for several reasons, and reliability is a key factor. Akamai found that sites that went down experienced a permanent abandonment rate of 9%, and sites that performed slowly experienced a permanent abandonment rate of 28%. If customers can’t trust our services to remain operational, or if downtime is impacting their ability to run their business, they’ll move to a competitor.

A key indicator for customer attrition is Net Promoter Score (NPS), which measures customer satisfaction and loyalty. A high NPS indicates that customers are enthusiastic about our service, while a low NPS indicates that customers are unhappy with our service and may even dissuade potential customers. While many variables contribute to NPS, reliability has a direct and significant impact.
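NPS is conventionally computed from a 0 to 10 "how likely are you to recommend us?" survey: the percentage of promoters (scores of 9 or 10) minus the percentage of detractors (scores of 0 through 6). A minimal sketch, with hypothetical survey responses:

```python
def net_promoter_score(ratings: list[int]) -> float:
    """NPS = % promoters (9-10) minus % detractors (0-6); ranges from -100 to 100."""
    if not ratings:
        raise ValueError("need at least one rating")
    if any(not 0 <= r <= 10 for r in ratings):
        raise ValueError("ratings must be between 0 and 10")
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    return 100 * (promoters - detractors) / len(ratings)


# Hypothetical survey: 5 promoters, 2 passives, 3 detractors.
survey = [10, 9, 9, 8, 7, 6, 3, 10, 9, 5]
print("NPS:", net_promoter_score(survey))
```

Comparing NPS from surveys taken before and after a major outage is one way to look for a reliability effect, though other factors will influence the score too.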

Reliability is good for customers, it's good for engineers because you're going to get paged less; and when you do get paged, hopefully, it's something that's urgent so you don't have to waste that executive decision making ability. And, it's good for the business because it improves revenue, it lowers churn, it does all those healthy things.
Paul Osman
Lead Instrumentation Engineer at Honeycomb.io

How Chaos Engineering helps improve reliability

Starting a reliability initiative involves:

  • Taking an in-depth look at our systems.
  • Identifying the different ways that they can fail.
  • Addressing these failure modes by deploying fixes.
  • Testing our systems to ensure they’re no longer vulnerable to those failure modes.
  • Creating and promoting a culture of reliability within our company.

To do this, we need to be able to test our systems against a wide range of failure modes, resolve any issues we find, then re-test our systems to make sure these fixes work. The question is: how can we do this in a safe, effective, and controlled way that doesn’t put our systems or customers at risk? The answer is with Chaos Engineering.

Chaos Engineering is the science of performing intentional experimentation on a system by injecting precise and measured amounts of failure. By observing how our systems react to this failure, we can find ways to improve their resilience.

Chaos Engineering tests our assumptions about how our systems behave under certain conditions. For example, if one of our servers goes offline, can we successfully fail over to another server without our application crashing? With Chaos Engineering, we can perform a simulated outage on a specific server, observe the impact on our application, then safely halt and roll back the outage. We can then use our observations and insights to build resilience into the server and reduce the risk of real-world outages. By repeating this process, we can gradually build up the resilience of our entire deployment without putting our operations at risk.
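The shape of such an experiment (inject a failure, observe, roll back) can be sketched with a toy in-memory model. This is only an illustration: the Server and FailoverClient classes and the run_experiment function are hypothetical stand-ins and do not represent any real Chaos Engineering tool's API, and a real experiment would target actual infrastructure with proper safeguards.

```python
class Server:
    """Toy server that can be switched offline to simulate an outage."""

    def __init__(self, name: str):
        self.name = name
        self.healthy = True

    def handle(self, request: str) -> str:
        if not self.healthy:
            raise ConnectionError(f"{self.name} is offline")
        return f"{self.name} served {request}"


class FailoverClient:
    """Tries each server in order and returns the first successful response."""

    def __init__(self, servers: list[Server]):
        self.servers = servers

    def send(self, request: str) -> str:
        for server in self.servers:
            try:
                return server.handle(request)
            except ConnectionError:
                continue  # fail over to the next server
        raise ConnectionError("no healthy servers available")


def run_experiment(client: FailoverClient, target: Server, requests: int = 100) -> float:
    """Take `target` offline, measure the request success rate, then roll back."""
    target.healthy = False  # inject the failure
    try:
        succeeded = 0
        for i in range(requests):
            try:
                client.send(f"request-{i}")
                succeeded += 1
            except ConnectionError:
                pass  # count as a failed request
        return succeeded / requests
    finally:
        target.healthy = True  # always roll back, even if the experiment errors


primary, standby = Server("primary"), Server("standby")
client = FailoverClient([primary, standby])
rate = run_experiment(client, primary)
print(f"success rate during simulated outage: {rate:.0%}")
```

In this toy model the hypothesis is "requests still succeed when the primary is offline." If the success rate dropped during the experiment, that would point to a failover gap worth fixing before it shows up in production.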

The improvements that Chaos Engineering brings are reflected first in our low-level metrics, especially uptime and MTBF. Increasing availability and reducing our failure rate will, in turn, reduce the risk of missed revenue, lower our cost of downtime, and improve customer satisfaction. Given how competitive online services are, companies that differentiate on reliability will not only gain and retain customer trust, but also avoid costly incidents.

The unpredictable, the unknown, has just as much impact—if not more—on your business than the known. Being prepared for [unexpected events] and trying to find them is one of the most important things we can do.
Matt Simons
Senior Engineering Manager at Workiva

© 2020 Gremlin Inc. San Jose, CA 95113