Updated June 7, 2021
Off the top of your head, do you know the answer to these questions: what will happen to your application if Amazon S3 — one of the most widely used Amazon Web Services products — suddenly becomes unavailable? Are you confident that your website will continue serving requests if it fails to load assets from the cloud? What about your deployment system? Will you still be able to add capacity to your fleet? And don’t forget the other dozen or so backend services you’re running for file sharing, data analytics, secrets management, etc. that all depend on S3 to operate correctly. The question really becomes: is the distributed system you’ve built reliable enough to survive such an outage?
Unfortunately, this scenario isn’t hypothetical; it really happened. On February 28th, 2017, a significant service disruption of S3 caused many websites and services (including the one you’re browsing at this very moment) to go offline for hours. A quick summary of the incident: while debugging an issue, a member of the S3 team executed a command that removed a larger set of servers from service than intended. The subsequent recovery process, which required restarting two S3 subsystems, took longer than expected. Worse, the failure not only cascaded to other AWS services, like EC2 and EBS, but also caused the AWS Service Health Dashboard to report stale status information.
Major outages like this one are rare, but the lesson we should take away is that sooner or later, complex systems will fail. It’s not a matter of if, but when. There will always be something that can, and will, go wrong, from self-inflicted outages caused by bad configurations or buggy container images to events outside our control like denial-of-service (DoS) attacks or network failures. No matter how hard you try, you can’t build perfect software (or hardware, for that matter). Nor can the companies you depend on. This isn’t because engineers are careless or unable to build reliable systems, but because we live in an imperfect world. Even services designed for >99.99% availability can fail in surprising and unexpected ways.
The way to overcome this imperfection is by focusing on the things you can control: creating a quality product or service that is resilient to failures. Consider the conditions your systems will be exposed to in production, and the potential failures that can affect them. Design your systems to cope with these failures, and leave room for the failures you can’t predict. Build applications that are resilient by using techniques such as detecting failed network connections and degrading gracefully, decoupling system components so that they operate independently of each other, and automatically restarting or recovering components when they fail. As the saying goes, hope for the best and prepare for the worst.
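To make graceful degradation concrete, here is a minimal Python sketch, not a definitive implementation. It assumes a hypothetical `fetch_remote` function that loads an asset from a remote store such as S3, and a plain dictionary standing in for a local cache; `AssetUnavailable`, `load_asset`, and `placeholder` are names invented for this illustration.

```python
from typing import Callable, Dict


class AssetUnavailable(Exception):
    """Raised by the (hypothetical) remote store when an asset cannot be fetched."""


def load_asset(
    key: str,
    fetch_remote: Callable[[str], bytes],
    cache: Dict[str, bytes],
    placeholder: bytes = b"",
) -> bytes:
    """Return the asset for `key`, degrading gracefully if the remote store fails."""
    try:
        data = fetch_remote(key)
    except (AssetUnavailable, TimeoutError, ConnectionError):
        # The dependency is down: serve the last known good copy,
        # or a placeholder if this asset was never cached.
        return cache.get(key, placeholder)
    cache[key] = data  # remember the last good copy for future failures
    return data
```

With this shape, a failed fetch yields stale or placeholder content instead of a failed request. Whether stale content is acceptable depends on the asset, so treat the fallback policy as a design decision rather than a default.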
That’s the theory. Actually building reliable systems in practice — especially making sure you’re ready for disaster to strike in production — is easier said than done, of course.
If your infrastructure stack also happens to rely on a critical third-party service, as many do, chances are you now have an idea of what can go wrong when that service fails. If you were impacted by the S3 outage or a similar outage, you might already have conducted a post-mortem to better understand how the incident impacted you and how to prevent similar problems in the future. That’s good! Learning from outages after the fact is crucial. However, it shouldn’t be the only method for acquiring operational knowledge. Dealing with outages is never a pleasant experience, much less so if it's at 2 a.m. and you're at the mercy of your infrastructure provider. But what’s the alternative?
Building reliable systems requires experience with failure, but we can’t just wait for failures to happen in production and risk impacting customers. Instead, we need a way to inject failures in a proactive and controlled way in order to gain confidence that our production systems can withstand these failures. By simulating potential failures in advance, we can verify that our systems behave as we expect, and fix them if they don’t. Because these failures are controlled, we can adjust their magnitude (their severity) and blast radius (the number of systems impacted), reducing the risk of an outage.
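As a rough illustration of how magnitude and blast radius act as dials, the following Python sketch wraps a function call and injects latency or errors into a configurable fraction of calls. It is a toy, not how any particular tool works: `with_injected_faults`, `InjectedFailure`, and all parameters are hypothetical names, and blast radius is approximated here as the fraction of calls affected rather than the number of systems.

```python
import random
import time
from typing import Any, Callable, Optional


class InjectedFailure(Exception):
    """Raised in place of a real failure during an experiment."""


def with_injected_faults(
    call: Callable[..., Any],
    *,
    blast_radius: float,       # fraction of calls affected, 0.0 to 1.0
    latency_s: float = 0.0,    # magnitude: extra delay added to affected calls
    fail: bool = False,        # magnitude: raise instead of calling through
    rng: Optional[random.Random] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> Callable[..., Any]:
    """Wrap `call` so that a controlled share of invocations is degraded."""
    if not 0.0 <= blast_radius <= 1.0:
        raise ValueError("blast_radius must be between 0.0 and 1.0")
    chooser = rng if rng is not None else random.Random()

    def wrapper(*args: Any, **kwargs: Any) -> Any:
        # random() returns a value in [0.0, 1.0), so 0.0 never fires
        # and 1.0 always fires.
        if chooser.random() < blast_radius:
            if latency_s > 0:
                sleep(latency_s)
            if fail:
                raise InjectedFailure("injected by experiment")
        return call(*args, **kwargs)

    return wrapper
```

An experiment might start with `blast_radius=0.01` and a small `latency_s`, then raise either value only after confirming that the system behaved as expected at the previous setting.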
Controlled failures can include terminating nodes in a cluster, dropping network traffic to a service to simulate an outage, adding latency and packet loss to network traffic to simulate degraded conditions, stopping a process and observing whether it restarts, and more. Experiments like these will teach you how tightly coupled your system is and reveal subtle dependencies and other problems you might not be aware of.
This might sound scary at first, but the point is to build resilience, not create chaos. For example, we don’t recommend wreaking havoc on production from day one (although experimenting in production should be the ultimate goal). Begin by experimenting in an isolated staging or test environment, with the smallest possible impact that will teach you something. Then gradually ramp up testing efforts as you learn more about how your systems operate and the ways they can fail.
Breaking things on purpose is one of the core ideas behind Chaos Engineering and related practices like GameDay exercises. It’s also at the heart of Gremlin’s Reliability as a Service. Our product helps you find weaknesses in your system before they end up in the news. You can use Gremlin to recreate real-world outages today, to help verify your fixes and validate that you’re ready for future incidents.