GCP’s recent outage on June 12th was a reminder of just how interconnected modern architectures are. The 2-hour, 28-minute outage affected dozens of companies and spanned more than 80 Google services and products.

But what was really illuminating was just how far the outage spread due to hidden dependency risks. Many companies that don’t run on GCP were startled to find their services suddenly affected because they depended on services or vendors that did use GCP.

Cloudflare was an example of this spread. The web platform serves thousands of companies and handles an average of 78 million HTTP requests per second. But during the GCP outage, one of Cloudflare’s own critical services was impacted, causing a Cloudflare service outage. Suddenly, companies that use Cloudflare were also running into issues, even if none of their architecture ran on GCP.

This story isn’t new. It’s a continued risk as architectures grow in an ever-increasing web of dependencies and services.

Fortunately, there is something you can do about it. Here are the testing best practices teams should follow to minimize the impact of large-scale outages so they don’t catch you by surprise.

1. Simulate failures to verify the resilience of critical services

Engineers and architects think they know which services are critical to their applications. They’ll purposefully design their systems to eliminate single points of failure and make those critical systems more resilient. But even if you design with the best intentions and multiple failover redundancies, you don’t really know a system is resilient until a failure occurs.

Which is, of course, the worst time to find out it doesn’t work correctly.

Instead, engineers can use resilience testing to simulate these failures ahead of time. Resilience testing lets you control the failure so you can safely verify that your systems respond the way you intend. A Blackhole experiment blocks inbound or outbound traffic to specific locations, simulating an outage where a service is unreachable.

Use a Blackhole experiment to simulate outages and verify a variety of failure responses, like:

  • Failing over to an origin database if a cache is unavailable, and vice versa (a minimal sketch of this pattern follows the list)
  • Failing over if a region or availability zone suffers an outage
  • Switching to a different instance of a critical service when the default goes down
  • Falling back safely to a degraded experience with limited features
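Here’s a minimal Python sketch of that cache-to-database pattern, with hypothetical names (get_user_profile, the cache host, the database stub) standing in for your own service code. The point of a Blackhole experiment is to prove that the fallback path actually runs when the cache endpoint goes dark:

```python
import socket

CACHE_HOST = ("cache.internal.example", 6379)  # hypothetical cache endpoint
CACHE_TIMEOUT_SECONDS = 0.5                    # fail fast instead of hanging

def get_from_cache(key: str) -> str:
    """Try the cache; raises OSError if it's unreachable (e.g., blackholed)."""
    with socket.create_connection(CACHE_HOST, timeout=CACHE_TIMEOUT_SECONDS) as conn:
        conn.sendall(f"GET {key}\n".encode())
        return conn.recv(4096).decode()

def get_from_database(key: str) -> str:
    """Hypothetical origin-database lookup used as the fallback path."""
    return f"value-for-{key}-from-db"

def get_user_profile(key: str) -> str:
    try:
        return get_from_cache(key)
    except OSError:
        # Cache unreachable or timed out: fall back to the origin database.
        # A Blackhole experiment against CACHE_HOST verifies this path works.
        return get_from_database(key)

if __name__ == "__main__":
    print(get_user_profile("user:42"))
```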

And the Blackhole experiment is just the beginning. It’s a good idea to run the full battery of standard tests to ensure your systems aren’t vulnerable to other common failures, like traffic spikes or increased latency.

Your goal should always be to make sure critical services act the way you expect them to. And it can’t just be a one-time effort. Systems are constantly changing, and those changes in your (or other people’s) systems can make services unreliable again.

By testing, you move from thinking your services are resilient to knowing they are. And by testing regularly, you keep up with the changing state of your dependencies to remain resilient.

2. Run Blackhole and latency tests against your SaaS vendors

Modern architectures come with a massive number of third-party SaaS dependencies. These can do wonders to improve performance, security, and resilience, but they do introduce additional points of failure.

The GCP outage highlighted this risk. Companies that didn’t use any GCP products suddenly found their services failing due to SaaS dependencies that did. That doesn’t mean we should all run out and strip our architectures of SaaS products. Those dependencies were chosen because the benefits of using them outweigh the risks. It does mean, though, that you should manage that risk through testing.

Start by running Blackhole tests against dependencies to learn how your service reacts when a SaaS call fails. That way, no matter what infrastructure those vendors run on, you’ll know whether your service is resilient to their failure.
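Before running a full experiment, it can help to see what that failure mode looks like in code. Below is a minimal sketch, with a hypothetical fraud-scoring endpoint and a made-up default score: the vendor call gets a short timeout and an explicit degraded path, so a blackholed SaaS dependency slows the request down at worst instead of taking it down.

```python
import urllib.request

# Hypothetical third-party endpoint; during a Blackhole test this host
# becomes unreachable from your service.
FRAUD_API_URL = "https://fraud-scoring.vendor.example/v1/score"

def get_fraud_score(order_id: str) -> float:
    """Call the SaaS dependency, but never let it take the checkout flow down."""
    try:
        with urllib.request.urlopen(f"{FRAUD_API_URL}?order={order_id}", timeout=2) as resp:
            return float(resp.read().decode())
    except (OSError, ValueError):
        # Vendor unreachable, slow, or returning garbage: degrade to a
        # conservative default instead of failing the whole request.
        return 0.5

if __name__ == "__main__":
    print(get_fraud_score("order-123"))
```

A Blackhole test against the vendor’s hostname is what verifies this path in a real environment rather than in a unit test.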

In the wake of the GCP outage, industry leaders like Cloudflare are already performing this kind of testing and working to improve the resiliency of their systems. Their post-mortem includes line items like:

“Short-term blast radius remediations for individual products that were impacted by this incident so that each product becomes resilient to any loss of service caused by any single point of failure, including third party dependencies.”

Other organizations should take note and use this opportunity to do the work before a hidden dependency causes an outage.

Also, be sure to fully map your dependencies. In the sprawling web of modern architectures, it can be easy for unknown dependencies to sneak in. Make sure you know all of your service dependencies and test against their loss to be fully prepared.
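As a starting point for that inventory, here’s a small sketch that pulls every external hostname out of a service’s configuration. The config shape and names are illustrative; in practice you’d supplement this with traffic logs or tracing data to catch dependencies that never made it into config:

```python
import json
from urllib.parse import urlparse

# Illustrative service config holding the URLs of external dependencies.
CONFIG = """
{
  "payment_api": "https://api.payments.example/v2",
  "email_api": "https://mail.vendor.example/send",
  "cdn": "https://cdn.provider.example"
}
"""

def external_hosts(config_text: str) -> set[str]:
    """Collect the unique external hostnames a service is configured to call."""
    cfg = json.loads(config_text)
    return {urlparse(url).hostname for url in cfg.values()}

if __name__ == "__main__":
    # Each of these hosts is a candidate for a Blackhole or latency test.
    for host in sorted(external_hosts(CONFIG)):
        print(host)
```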

3. Test your backups to make sure they’re functional

Backups can be really easy to set and forget. After all, the whole idea is that you should be able to run without using them the vast majority of the time. But when you need them to be there, you’ll really need them to work.

There are many reasons a backup could go wrong, covering everything from corrupted data to incompatible software versions to misconfigured cloud provider settings. The last time you want to find out about these errors is when your primary system is down.

That’s why operations engineers like to say, “If you haven’t done a restore, then you don’t have a backup.”

So do the restore. Run large-scale tests on your infrastructure to make sure everything rolls over, restores correctly, and runs from the backup as planned.
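What a restore drill looks like depends on your stack, but even a small scripted check beats none. Here’s a minimal sketch assuming a SQLite backup file with an orders table (both hypothetical); for Postgres or a managed database, the same drill applies with pg_restore or your provider’s restore tooling:

```python
import sqlite3

BACKUP_PATH = "orders-backup.db"  # hypothetical backup file

def verify_backup(backup_path: str) -> None:
    """Restore the backup into a scratch database and sanity-check the result."""
    source = sqlite3.connect(backup_path)
    scratch = sqlite3.connect(":memory:")
    try:
        # Restore by copying the backup into the scratch database.
        source.backup(scratch)
        # Structural integrity check, then confirm a key table isn't empty.
        integrity = scratch.execute("PRAGMA integrity_check").fetchone()[0]
        assert integrity == "ok", f"integrity_check failed: {integrity}"
        (rows,) = scratch.execute("SELECT COUNT(*) FROM orders").fetchone()
        assert rows > 0, "restored 'orders' table is empty"
        print(f"restore OK: {rows} rows in orders")
    finally:
        source.close()
        scratch.close()

if __name__ == "__main__":
    verify_backup(BACKUP_PATH)
```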

In some industries, such as finance, this Disaster Recovery planning and testing is not only a best practice, but required by regulation. The GCP outage shows why every industry should embrace regular, standardized, large-scale Disaster Recovery and backup testing to make sure they’re resilient to large-scale failures.

And if your backup involves failing over to a different service provider, such as with multicloud redundancy, it’s a good idea to run latency tests. After all, if one provider goes down, the others will likely see a sudden, drastic increase in traffic, which could lead to slowdowns or performance degradation on your backup.
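A proper latency experiment injects delay on the network path so you can watch how your service copes, but even a rough client-side probe of the failover endpoint gives you a baseline to compare against. A minimal sketch, with a hypothetical health-check URL on the secondary provider:

```python
import statistics
import time
import urllib.request

FAILOVER_URL = "https://secondary.provider.example/healthz"  # hypothetical endpoint

def sample_latency(url: str, samples: int = 20) -> None:
    """Measure round-trip latency to the failover endpoint and report median/p95."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        try:
            with urllib.request.urlopen(url, timeout=5):
                pass
            timings.append(time.perf_counter() - start)
        except OSError:
            timings.append(float("inf"))  # treat failures as worst-case latency
    timings.sort()
    p95 = timings[int(0.95 * (len(timings) - 1))]
    print(f"median={statistics.median(timings):.3f}s p95={p95:.3f}s")

if __name__ == "__main__":
    sample_latency(FAILOVER_URL)
```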

It comes down to knowing your reliability posture

Outages are inevitable.

It’s not a matter of if something will fail, but a matter of when. What’s important is whether or not you know how your systems will react when the failure occurs.

And that’s where knowing your reliability posture comes in.

With regular, standardized testing, you can have confidence in how your systems will react to failures. It does require an investment of resources and effort, but it’s small compared to what’s needed to respond to P0 and P1 incidents. If you do the work and the testing now, you increase resilience on your own schedule and without impacting customers.

All so when big issues happen, they’re a non-issue.

Gavin Cahill
Sr. Content Manager