How do you track reliability in an organization with hundreds of engineers, dozens of daily production changes, and over 32 million monthly users? What's more, how do you do this in a way that's simple, presentable to executives, and doesn't dump a ton of extra work onto engineers' plates?
Slack recently wrote about how they created the Service Delivery Index for Reliability (SDI-R), a simple yet comprehensive metric that became the basis for many of their reliability and performance indicators. In this post, we take a look at the SDI-R from an outside perspective to see what other companies can learn from Slack's experience and expertise.
The SDI-R is, in Slack's own words, "a composite measure of successful API calls and content delivery (as measured at the edge), along with important user workflows (e.g. sending a message, loading a channel, [and] using a huddle)." In other words, it measures how successful a service is at handling user requests and responding to those requests.
If a service successfully handles a request but can't deliver the response to the user, or if it fails to process the request and delivers an error to the user, then the service isn't reliable. But a service that successfully processes the request and delivers the response is. The SDI-R tracks this.
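To make that definition concrete, here is a minimal sketch of a success ratio in which a request counts only if it was both processed and delivered. This is not Slack's actual SDI-R formula (the weighting and data sources aren't described here); the `Request` type, its fields, and the function names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Request:
    """Hypothetical record of one user request (illustrative only)."""
    processed_ok: bool  # the service handled the request without an error
    delivered_ok: bool  # the response actually reached the user


def is_successful(req: Request) -> bool:
    # A request is reliable only if BOTH steps succeeded.
    return req.processed_ok and req.delivered_ok


def success_ratio(requests: Sequence[Request]) -> float:
    """Fraction of requests that were processed and delivered."""
    if not requests:
        raise ValueError("need at least one request to compute a ratio")
    return sum(is_successful(r) for r in requests) / len(requests)


if __name__ == "__main__":
    sample = [
        Request(processed_ok=True, delivered_ok=True),
        Request(processed_ok=True, delivered_ok=False),   # never reached the user
        Request(processed_ok=False, delivered_ok=True),   # user received an error
        Request(processed_ok=True, delivered_ok=True),
    ]
    print(f"{success_ratio(sample):.0%}")  # prints "50%"
```

A real composite metric would likely combine several such ratios (API calls, content delivery, key user workflows), but how those are weighted is up to the organization.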
For Slack, the motivation centered on its engineers. Slack was suffering from "Hero Engineers": skilled engineers who jump head-first into incidents and outages and fix the underlying causes. They get to walk out of the datacenter with their heads held high, knowing they saved the day. There are a few problems with this approach, though:
- Being on-call and responding to major incidents creates stress and eventually leads to burnout.
- Although the organization, its services, and its infrastructure can scale, individual engineers can't. Organizations need a scalable, systemic way of handling incidents.
- It's harder for teams to move to a shared culture of reliability and service ownership, since there will always be a team to handle emergencies.
This isn't the only reason, of course. Slack was also building a culture of reliability by building out Incident Management and Response processes, plus enabling teams to fully own the services they work on. A key part of reliability culture is being able to track reliability. Or, as Slack puts it:
"If you’re driving a reliability culture in a service-oriented company, you must have a measurement of your service reliability before all else, and this metric is quintessential in driving decision-making processes and setting customer expectations. It allows teams to speak the same language of reliability when you have one common understanding."
Any metric that measures reliability must be clearly defined, since "reliability" can have different definitions depending on the organization or team. One team might consider 99% uptime to be reliable, while a company like Slack requires 99.99% uptime (for context, that's a jump from over 7 hours of downtime per month to under 5 minutes of downtime per month).
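The downtime figures above follow from simple arithmetic. As a quick sketch, assuming a 30-day month (the function name and the `days` parameter are our own, chosen for illustration):

```python
def allowed_downtime_minutes(uptime_percent: float, days: int = 30) -> float:
    """Minutes of downtime permitted in a period at a given uptime target.

    Assumes a 30-day month by default; adjust `days` for other periods.
    """
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    total_minutes = days * 24 * 60
    return total_minutes * (100 - uptime_percent) / 100


if __name__ == "__main__":
    for target in (99.0, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% uptime -> {minutes:.2f} minutes per month")
    # 99.0% uptime -> 432.00 minutes per month (7.2 hours)
    # 99.99% uptime -> 4.32 minutes per month
```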
At Gremlin, we offer a similar definition with our reliability score. This score is calculated for each service in your environment based on the results of running a suite of tests. For instance, if you add a service to Gremlin, run the full set of auto-generated reliability tests, and the service passes all of them, it gets a score of 100%. If it fails half of them, it gets a score of 50%, and so on. Like Slack's SDI-R, this is a number that tells you at a glance how reliable a service is.
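As a simplified illustration of the pass-fraction idea described above, the score can be sketched as the percentage of tests a service passes. This is only a sketch under that assumption, not the product's actual scoring logic, which may weight tests differently; `reliability_score` is a hypothetical name.

```python
from typing import Sequence


def reliability_score(test_results: Sequence[bool]) -> float:
    """Percentage of reliability tests passed (True = passed).

    Simplified: every test counts equally.
    """
    if not test_results:
        raise ValueError("no test results to score")
    return 100 * sum(test_results) / len(test_results)


if __name__ == "__main__":
    print(reliability_score([True, True, True, True]))    # 100.0
    print(reliability_score([True, False, True, False]))  # 50.0
```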
The main difference between Gremlin's score and Slack's SDI-R is that Slack's SDI-R works off operational data like uptime and the number of successful API calls, whereas Gremlin's score aims to be predictive by detecting known reliability risks and testing against common failure modes involving autoscaling and slow or missing dependencies. The two metrics approach reliability from different angles. We've written at length about why it's important to have a forward-looking reliability metric. Using these two metrics together gives you a comprehensive view of reliability throughout the software delivery lifecycle.
It's always fascinating to see major engineering organizations like Slack post about their inner workings. The way these engineering teams handle incidents and reliability testing can become best practice for the entire industry, much like how Slack itself has become a primary collaboration platform for millions of users.
Want to learn more? Check out these resources:
- Follow the Slack engineering blog to see how they approach engineering challenges.
- Download the Navigating the Reliability Minefield whitepaper to see how to build a reliability tracker spreadsheet of your own.
- Watch the More Reliability, Less Firefighting webinar and learn how to build a successful, best-in-class reliability program.