Last Sunday, Alaska Airlines suffered a three-hour outage that led to more than 200 flight cancellations and disrupted 15,600 passengers. The culprit?

"A critical piece of multi-redundant hardware at our data centers, manufactured by a third-party, experienced an unexpected failure. When that happened, it impacted several of our key systems that enable us to run various operations, necessitating the implementation of a ground stop to keep aircraft in position."

Redundancy is a best practice for any critical piece of hardware or software. In fact, Chaos Monkey was invented specifically to verify that systems were redundant by shutting off random production servers. The Alaska Airlines IT team clearly follows this practice, yet even with multi-redundancy, the system still couldn't weather the failure and went down.

So how much redundancy is enough?

Find the cost vs. resilience balance

Ideally, every critical system would have a backup, and each of those backups would have backups of its own. But that gets expensive very quickly, and it ends with leadership and finance asking you to explain massive invoices.

But the fewer redundancy measures you take, the greater your risk. So how can you find the right balance of cost vs. resilience?

It comes down to understanding your risk. Many organizations make these decisions based on assumptions or educated estimates, but ultimately those are guesses made in the absence of concrete risk data.

Resilience testing gives you that data. And once you have it, your organization can perform an accurate cost/benefit analysis.

For example, you might find that a third layer of cache redundancy could prevent performance bottlenecks, but comes with a $10 million price tag. It's perfectly reasonable for an organization to decide to accept that risk rather than spend the money. But that decision should be made based on data instead of in a vacuum.
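
To make that concrete, here's a back-of-the-envelope version of the expected-loss math. Every number below is hypothetical, including the outage probability, which is exactly the figure resilience testing lets you replace with a measured one:

```python
# Hypothetical cost/benefit comparison for a redundancy investment.
# All figures are illustrative assumptions, not real data.

redundancy_cost_per_year = 10_000_000  # the third cache layer's price tag

outage_probability_per_year = 0.05     # ideally measured via resilience testing
outage_cost = 40_000_000               # lost revenue, SLA penalties, recovery

expected_annual_loss = outage_probability_per_year * outage_cost

print(f"Expected annual loss without the extra layer: ${expected_annual_loss:,.0f}")
print(f"Annual cost of the extra layer:               ${redundancy_cost_per_year:,.0f}")

# With these numbers, accepting the risk ($2M expected loss) beats spending
# $10M -- but raise the measured probability to 0.5 and the math reverses.
```

The point isn't the specific numbers. It's that resilience testing replaces the guessed probability with a measured one, so the comparison stops being a judgment call.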

Blackhole resilience tests give you reliability data

Blackhole experiments cut off all network traffic to specific resources, such as an availability zone or database. As far as your service is concerned, that resource is down and unavailable, which means you’re simulating an outage without having to actually take down the resource.

This makes them the perfect way to test redundancy. Ideally, your service behaves exactly as designed, failing over to the backup or redundant resource with minimal downtime or performance degradation.
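
Conceptually, a blackhole experiment is just dropped packets. As a rough, DIY-level sketch only (this is not how Gremlin implements it, and the target IP and duration are hypothetical), you could stage a crude version on a Linux host with root access:

```python
# Crude blackhole sketch: drop all outbound traffic to one resource
# using iptables (Linux, requires root). Illustrative only.
import subprocess
import time

DB_HOST = "10.0.2.15"    # hypothetical IP of the database to "blackhole"
DURATION_SECONDS = 300   # how long to simulate the outage

def iptables(action: str) -> None:
    # "-A" appends the DROP rule; "-D" deletes that same rule afterward.
    subprocess.run(
        ["iptables", action, "OUTPUT", "-d", DB_HOST, "-j", "DROP"],
        check=True,
    )

iptables("-A")
try:
    # While the resource is unreachable, watch your service:
    # does it fail over cleanly, queue requests, or fall over?
    time.sleep(DURATION_SECONDS)
finally:
    # Always restore traffic, even if the experiment is interrupted.
    iptables("-D")
```

A managed platform adds the safety rails this sketch lacks, such as precisely scoped targets and automatic rollback if an experiment causes more damage than expected.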

But things don’t always go as planned. Your backup resource may not have the right specs to handle the increased load when there’s no redundancy left, or it may increase latency to the point where it causes brownouts on other parts of your application.

Blackhole experiments show you these unexpected behaviors so you can make data-driven decisions about your redundancy.

Run regular tests for resilience you can prove

Testing should never be a one-and-done exercise. Your systems change constantly as new code ships, infrastructure is updated, and dependencies evolve. Failover tests that passed easily last week might fail this week due to something as small as a latency timeout change in one of your dependencies.

The only way to ensure continued resilience is through regular, standardized testing. By running the same group of tests weekly or bi-weekly, you’ll catch shifts in your reliability risk posture so you can address issues before they catch you off guard in an outage.
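
What "the same group of tests on a schedule" looks like depends on your tooling, but here's a hypothetical sketch of the pattern. The suite contents and the run_experiment placeholder are illustrative, not a real API:

```python
# Hypothetical weekly resilience suite. run_experiment() is a placeholder;
# a real version would trigger experiments through your chaos tooling.
import datetime
import json

SUITE = [
    {"name": "db-primary-blackhole", "target": "postgres-primary"},
    {"name": "cache-node-blackhole", "target": "redis-node-2"},
    {"name": "az-blackhole", "target": "us-west-2a"},
]

def run_experiment(experiment: dict) -> bool:
    """Placeholder: trigger the experiment, then return True only if the
    service met its failover expectations (e.g., stayed within SLO)."""
    print(f"Running {experiment['name']} against {experiment['target']}...")
    return True  # pretend the expectation was met

def run_suite() -> None:
    record = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "results": {exp["name"]: run_experiment(exp) for exp in SUITE},
    }
    # Appending every run to one log turns pass/fail results into a trend line.
    with open("resilience-history.jsonl", "a") as log:
        log.write(json.dumps(record) + "\n")

if __name__ == "__main__":
    run_suite()
```

The history file is the important part: running an identical suite and logging every result is what turns individual test runs into the trend you can show leadership.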

Regular tests also give you a consistent resilience metric over time. This gives you, your team, and your leadership a way to see the effectiveness of your efforts. It also lets you answer the question every team gets from executives after an outage: "Will this happen again?"

By recreating the same failure conditions, you can show them the results and prove that your systems are now resilient to that outage.

That same metric can also be used to make a case for increased infrastructure or resource investment. If a service shows regular increases in risk despite your team’s best efforts, then that data can be used to justify spending the money on more redundancy.

Test redundancy at scale with Gremlin

Gremlin is designed to make reliability management easy with standardized test suites, customized experiments and scenarios, and team-wide dashboards. By using Gremlin, teams can make data-driven decisions about their reliability risks and prevent costly outages before they happen.

Gavin Cahill
Sr. Content Manager