Off the top of your head, do you know the answers to these questions: what will happen to your application if Amazon S3 — one of the most widely used Amazon Web Services — suddenly goes down? Are you confident that your website will continue serving requests if it fails to load assets from the cloud? What about your deployment system? Will you still be able to add capacity to your fleet? And don’t forget the other dozen or so backend services you’re running for file sharing, data analytics, secrets management, etc. that all depend on S3 to operate correctly. The question really becomes: is the distributed system you’ve built reliable enough to survive such an outage?
Unfortunately, this scenario isn’t hypothetical — it’s real. Two weeks ago, on February 28th, we witnessed a significant service disruption of S3 that caused many websites and services — including the one you’re browsing at this very moment — to go offline for hours. A quick summary of the incident: while debugging a different issue with S3, a member of the S3 team executed a command that removed a larger set of servers than intended from service. The subsequent recovery process, which required restarting two S3 subsystems, took much longer than expected. Ironically, the failure cascaded not only to other AWS services, like EC2 and EBS, but also caused the AWS Service Health Dashboard to report stale status information.
The lesson we should learn and remember is that sooner or later, all complex systems will fail. It’s not a matter of if, it’s a matter of when. There will always be something that can — and will — go wrong, from self-inflicted outages caused by bad configuration or buggy images to events outside our control, like denial-of-service attacks or network failures. No matter how hard you try, you can’t build perfect software (or hardware, for that matter). Nor can the companies you depend on. Even S3, which is designed for >99.99% availability and has a long track record of reliability, will break in surprising ways.
We live in an imperfect world. Things break, that’s just how it is. Accept it and focus on the things you can control: creating a quality product/service that is resilient to failures. Build software that can cope with both expected and unexpected events. Gracefully degrade whenever possible. Decrease coupling between system components so that parts of your site will continue working as expected, seemingly unaffected by S3 outages and similar incidents. As the saying goes, hope for the best and prepare for the worst.
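To make the idea of graceful degradation concrete, here is a minimal sketch in Python. It assumes a hypothetical `fetch_from_s3` helper and an in-process cache — neither is a real AWS client, and a production system would use something more robust (a CDN, a local replica, stale-while-revalidate headers, etc.) — but the pattern is the same: keep a last-known-good copy so an S3 outage degrades the experience instead of breaking it.

```python
# Sketch of graceful degradation: fall back to a cached copy of an asset
# when the primary store is unavailable. fetch_from_s3 and _cache are
# hypothetical stand-ins, not a real AWS SDK call.

_cache = {}

def fetch_from_s3(key):
    # Placeholder for a real S3 GET; here it always fails,
    # simulating an outage like the one on February 28th.
    raise ConnectionError("S3 unavailable")

def get_asset(key):
    """Try the primary store first; fall back to the last known good copy."""
    try:
        value = fetch_from_s3(key)
        _cache[key] = value  # refresh the fallback copy on every success
        return value
    except ConnectionError:
        if key in _cache:
            return _cache[key]  # degraded but still serving
        raise  # nothing cached: surface the failure to the caller

# Pretend an earlier successful fetch populated the cache.
_cache["logo.png"] = b"stale-but-usable bytes"
```

With the cache warm, `get_asset("logo.png")` keeps returning the stale copy throughout the simulated outage — the page renders, seemingly unaffected.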
That’s the theory. Building a reliable system in practice — making sure you’re ready for disaster to strike in production — is easier said than done, of course.
If your infrastructure stack also happens to rely on S3, as many do, chances are you now have an idea of what can go wrong when Amazon’s storage service fails. If it did affect you, you might already have conducted a postmortem to better understand how the incident impacted you and how to prevent similar problems in the future. That’s good! Learning from outages after the fact is crucial (and we will come back to postmortems in another post). However, it shouldn’t be the only method for acquiring operational knowledge.
In fact, as an operator, you should be tired of learning it the hard way. Dealing with outages is never a pleasant experience, much less so if you’re at the mercy of your infrastructure provider in the middle of the night. But what’s the alternative?
Building reliable systems requires experience with failure. Waiting for things to break in production is not an option. Instead, we should inject failures proactively and in a controlled way to gain confidence that our production systems can withstand them. By simulating potential errors in advance, we can verify that our systems behave as we expect — and fix them if they don’t.
While this may seem counterintuitive at first, we see these Antifragile principles at play in everyday life. Take vaccines — we inject something harmful into a complex system (an organism) in order to build immunity to it. This translates well to our distributed systems, where we want to build immunity to hardware and network failures, our dependencies going down, or anything else that might go wrong.
Don’t wait for trouble. Intentionally terminate cluster machines, kill worker processes, delete database tables, cut off access to some services, inject network latency, etc. This will teach you about the coupling of your system and reveal subtle dependencies and other problems you might otherwise overlook.
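Latency injection in particular is easy to prototype in-process before reaching for network-level tools. The sketch below wraps a dependency call with an artificial delay and checks whether the caller stays within its latency budget. Note that this version only measures the overrun after the fact rather than interrupting the call — a real client would use socket or request timeouts — and all the names here are illustrative.

```python
import time

# Sketch of in-process latency injection: add an artificial delay to a
# dependency call and verify the caller treats an over-budget call as a
# failure. This measures elapsed time after the call returns; a real
# client would enforce the timeout on the connection itself.

def slow_dependency(injected_delay_s):
    time.sleep(injected_delay_s)  # the injected fault
    return "data"

def call_with_budget(fn, budget_s):
    start = time.monotonic()
    result = fn()
    elapsed = time.monotonic() - start
    if elapsed > budget_s:
        return None  # over budget: treat as a failure, trigger fallbacks
    return result

# Inject 50ms of latency against a 10ms budget: the caller should give up.
assert call_with_budget(lambda: slow_dependency(0.05), 0.01) is None
# Without injected latency, the same call succeeds.
assert call_with_budget(lambda: slow_dependency(0.0), 0.01) == "data"
```

Even a toy like this can reveal whether your code has a latency budget at all — many services discover during such an experiment that they will happily wait forever.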
Sounds scary? We don’t recommend wreaking havoc on production from day one (although that should be the ultimate goal). Begin by experimenting in an isolated staging/test environment, with the smallest possible impact that will teach you something, then gradually ramp up your testing efforts.
Breaking things on purpose is the core idea behind Chaos Engineering and related practices like GameDay exercises. It’s also at the heart of Gremlin’s Reliability as a Service. Our product helps you find weaknesses in your system before they end up in the news. You can use Gremlin to recreate the S3 outage today, to help verify your fixes and validate that you’re ready for future incidents.
If you’re interested in a demo of Gremlin, you can request one here!