The Cost of Downtime

May 15th, 2018

In 2016, IHS Markit surveyed 400 companies and found downtime was costing them a collective $700 billion per year. How do you estimate your own cost?

A number like that makes for sensational headlines, but it’s hard to wrap your head around. Per-company figures are more down-to-earth: Gartner cites $5,600 per minute (well over $300,000 per hour) in its estimates. But what’s your cost? This post will help you get an idea.

During the Downtime

As soon as your service degrades or crashes, you start losing money, so first tally the most obvious cost: the revenue you forfeit every moment you’re down. Call it R.

If you make money from ads, R is lost ad revenue (by one 2015 estimate, Facebook stood to lose $1.7 million per hour). If you run an e-commerce store, it’s the number of lost sales times the average sale amount. If you’re a ride-hailing service, it’s the number of failed hails times the expected average fare.
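As a rough sketch of that arithmetic in Python: the function names and every number below are hypothetical, and estimating missed transactions from your typical hourly volume is an assumption of this example, not something the surveys above prescribe.

```python
def estimate_missed_transactions(typical_per_hour: float, hours_down: float) -> float:
    """Assumption: during the outage you miss your typical hourly volume."""
    if typical_per_hour < 0 or hours_down < 0:
        raise ValueError("inputs must be non-negative")
    return typical_per_hour * hours_down


def lost_revenue(missed_transactions: float, average_value: float) -> float:
    """R: missed transactions (sales, hails, ...) times their average value."""
    if missed_transactions < 0 or average_value < 0:
        raise ValueError("inputs must be non-negative")
    return missed_transactions * average_value


# Hypothetical store: 120 sales per hour, $40 average sale, down for 1.5 hours.
missed_sales = estimate_missed_transactions(typical_per_hour=120, hours_down=1.5)
revenue_lost = lost_revenue(missed_sales, average_value=40.0)
print(f"R = ${revenue_lost:,.2f}")
```

The same two functions cover the ride-hailing case: pass failed hails and the expected average fare instead.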

Then there’s E, the cost of lost employee productivity. In the IHS Markit survey, an incredible 78% of that $700 billion was from E; just 17% was from R.

From the first moment of downtime, it’s all hands on deck: the engineers drop everything and hole up in a room together; the support team struggles to tamp down swelling ticket and phone queues; and the executives, if it’s bad enough, work with PR to start apologizing to stakeholders. Add to E the hours spent by every affected employee, multiplied by their hourly pay plus benefits.
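Here is one way to tally E, as a minimal sketch. The StaffTime record, the team names, and the hourly figures are all made up for illustration; plug in your own headcounts and loaded hourly costs.

```python
from dataclasses import dataclass


@dataclass
class StaffTime:
    team: str
    people: int
    hours_each: float
    hourly_cost: float  # pay plus benefits, per person per hour


def productivity_cost(entries: list) -> float:
    """E: hours spent by all affected employees times their loaded hourly cost."""
    return sum(s.people * s.hours_each * s.hourly_cost for s in entries)


# Hypothetical outage: who got pulled in, and for how long.
outage_staff = [
    StaffTime("engineering", people=6, hours_each=3.0, hourly_cost=90.0),
    StaffTime("support", people=10, hours_each=4.0, hourly_cost=40.0),
    StaffTime("executives and PR", people=3, hours_each=2.0, hourly_cost=150.0),
]
employee_cost = productivity_cost(outage_staff)
print(f"E = ${employee_cost:,.2f}")
```

Hours spent on the aftermath (root-cause analysis, follow-up work) can be appended to the same list as extra entries.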

The Aftermath

Unfortunately, the costs keep accruing after your service is back up. At a minimum, engineers need to find the root cause and design safeguards against future outages (which should be easier if you have an SEV Management Program). So keep adding to E the hours of any employees dealing with the aftermath.

Next, if you’re a B2B company, your customers probably lost some revenue, too. (Amazon’s S3 outage last year cost its customers around $150 million.) Add anything you owe customers to another figure, C. If you have a service-level agreement (SLA) and you breached it, prepare to pay up. Also add to C any money that, while not contractually due, you pay as penance. If you run an airline and strand tens of thousands of customers, you may be buying hotel rooms or comping flights.

What’s the Damage?

The total cost of downtime (COD), then, is easy to calculate:

COD = R + E + C

If you incurred other significant miscellaneous costs (for outside consultants, the recovery of lost data, etc.), compile those into one more figure, M, and append it to the equation: COD = R + E + C + M.
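The formula is simple enough to do on a napkin, but a small record keeps the components labeled and shows which one dominates. This is only a sketch: the OutageCost class and the dollar amounts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class OutageCost:
    revenue: float               # R: revenue forfeited while down
    productivity: float          # E: lost employee productivity
    customer_payouts: float      # C: SLA payments and goodwill credits
    miscellaneous: float = 0.0   # M: consultants, data recovery, etc.

    @property
    def total(self) -> float:
        """COD = R + E + C (+ M)."""
        return (self.revenue + self.productivity
                + self.customer_payouts + self.miscellaneous)

    def shares(self) -> dict:
        """Fraction of the total contributed by each component."""
        total = self.total
        if total == 0:
            return {}
        return {
            "R": self.revenue / total,
            "E": self.productivity / total,
            "C": self.customer_payouts / total,
            "M": self.miscellaneous / total,
        }


# Hypothetical figures for a single outage.
outage = OutageCost(revenue=7200.0, productivity=4120.0,
                    customer_payouts=2500.0, miscellaneous=1000.0)
print(f"COD = ${outage.total:,.2f}")
for name, share in outage.shares().items():
    print(f"  {name}: {share:.0%}")
```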

Now that you’ve got a ballpark number for one outage, how do you estimate yearly downtime cost? That’s not so straightforward; not every outage is equal. But you can approximate it: divide COD by the number of hours that outage lasted to get an hourly cost, then multiply by roughly how many hours you were down in the past year. Make sure the example outage you picked to calculate COD was neither especially impactful (e.g., occurring on Black Friday) nor insignificant; use an outage of average severity.
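A sketch of that extrapolation, with hypothetical inputs. It assumes every hour of downtime costs about as much as an hour of your representative outage, which is exactly why the choice of outage matters.

```python
def yearly_downtime_cost(outage_cost: float, outage_hours: float,
                         yearly_hours_down: float) -> float:
    """Scale one representative outage's hourly cost to a year of downtime."""
    if outage_hours <= 0:
        raise ValueError("outage_hours must be positive")
    if yearly_hours_down < 0:
        raise ValueError("yearly_hours_down must be non-negative")
    hourly_cost = outage_cost / outage_hours
    return hourly_cost * yearly_hours_down


# Hypothetical: a $14,820 outage lasting 1.5 hours, 12 hours down over the year.
yearly_cost = yearly_downtime_cost(outage_cost=14820.0, outage_hours=1.5,
                                   yearly_hours_down=12.0)
print(f"Estimated yearly cost: ${yearly_cost:,.2f}")
```

If you have COD figures for several outages, averaging their hourly costs before multiplying would be a natural refinement.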

Hidden Damage of Downtime

For the yearly cost of downtime, how many SREs could you set to the task of preventing it? You’ll never eradicate downtime—even if you put every penny of those costs towards preventive efforts—yet minimizing it is still worthwhile. Why? Because downtime has other, incalculable costs.

How many would-be customers read about your last outage and decided not to sign up? You cannot know, but it isn’t zero. How many existing customers churned? A drop in your Net Promoter Score (NPS) may give you an idea (a raft of outages dropped Telstra’s score, for example), but suffice it to say, customers won’t put up with one outage after another.

Maybe the most pernicious cost of downtime—especially if it’s chronic—is its drain on employee morale. And employees may not keep it to themselves. Word gets out.

SRE Bob: Hey Alice, how’s $NEW_JOB going?

SRE Alice: Meh. The on-call is miserable. Engineers deploy code without testing it.

Bob: Really?

Alice: Yeah. But we’re hiring a lot. You should come have lunch and see the office!

Bob: Ok, maybe!

Bob, of course, is only being polite.

Reducing Downtime with Chaos Engineering

Hopefully, unlike Alice’s company, 1) you test your code thoroughly, and 2) your software engineers and SREs work in harmony. But these practices alone won’t minimize your downtime.

Modern software architectures are more distributed than ever. Long gone are the days of applications running in one server rack—or even one datacenter. At first blush, distributed applications would seem to be more resilient, and in many ways, they are. But they’re also incredibly complex, often glued together by a mixed bag of third-party services running halfway around the world. (In the IHS Markit survey, network interruptions were the number one cause of downtime.)

Beyond code test coverage and DevOps culture, mature teams practice chaos engineering. They proactively run Gamedays to unearth weaknesses in their architecture before those weaknesses cause downtime. But they know chaos doesn’t mean all madness and no method: they thoughtfully plan chaos experiments rather than kill services with reckless abandon. Whether you’re a veteran or a newbie in the chaos engineering community, come say hello on Slack!

Categories: Chaos Engineering