Why Reliability Engineering Matters: an Analysis of Amazon's Dec 2021 US-East-1 Region Outage
In the field of Chaos Theory, there’s a concept called the Synchronization of Chaos—disparate systems filled with randomness will influence the disorder in other systems when coupled together. From a theoretical perspective, these influences can be surprising. It’s difficult to understand exactly how a butterfly flapping its wings could lead to a devastating tornado. But we often see the influences of seemingly unconnected systems play out in real life.
On December 7, 2021, Amazon Web Services (AWS) experienced a nearly seven-hour incident in their Northern Virginia region (US-East-1) that had far-reaching effects, touching everything from package delivery to the price of cryptocurrencies.
Understanding what happened
Incidents in complex systems are rarely simple, and in this case the effects of the incident were varied. Some AWS customers experienced very little impact, while others were left completely incapacitated. At the start of the incident, this left a lot of organizations wondering whether the problem really lay with AWS or whether something else was happening. Internally, AWS engineers were also struggling to understand what was happening.
According to AWS’s incident report, an automated scaling event triggered unexpected behavior in a number of clients. That resulted in a surge of activity and a retry storm that overloaded the internal network that AWS uses to manage their systems.
This meant that some existing customer workloads were not affected. For example, running EC2 instances interacting with Elastic Load Balancers (ELBs) use a dedicated customer network, so they were unaffected by the congestion on AWS’s management network. However, administrative actions such as provisioning new EC2 instances or updating configurations are performed over the management network and were impacted. Additionally, CloudWatch monitoring was delayed and some customers lost metrics for brief periods of time.
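A retry storm like the one described above happens when many clients fail at the same moment and then all retry at the same moment. AWS’s report does not show any client code, so what follows is only a generic sketch of one common mitigation: capped exponential backoff with random jitter. The names `backoff_delays` and `call_with_retries`, and the default values, are assumptions made for illustration.

```python
import random
import time


def backoff_delays(retries, base=0.1, cap=5.0, rng=random.random):
    """Yield one randomized delay (in seconds) per retry."""
    for attempt in range(retries):
        # The upper bound doubles on each attempt but never exceeds `cap`.
        upper = min(cap, base * (2 ** attempt))
        # "Full jitter": wait a random amount in [0, upper) so that clients
        # that failed together do not all retry together.
        yield rng() * upper


def call_with_retries(operation, max_attempts=5, sleep=time.sleep, **backoff):
    """Call `operation` until it succeeds or `max_attempts` is reached."""
    delays = list(backoff_delays(max_attempts - 1, **backoff))
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:  # a real client would catch narrower error types
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(delays[attempt])
```

The cap bounds how long any one client waits, and the jitter spreads retries out over time instead of letting them arrive in synchronized waves.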
AWS’s incident report does not detail the exact causes of the issue, but it does reveal three factors that contributed to the incident or delayed the resolution.
- Impaired monitoring on the AWS network meant that engineers didn’t have visibility into the issue and had to spend time searching logs to try to piece together an accurate picture of what was happening.
- Impacted deployment systems meant that fixes couldn’t be implemented quickly.
- Because existing customer workloads were minimally affected, there was an abundance of caution to ensure that any remediation actions did not impact them.
These three factors aren’t unique to AWS’s incident or their architecture, and there’s a lot that we can learn from them to make our own systems more resilient.
It’s difficult to solve a problem when you can’t accurately see what the problem is. Metrics monitoring, distributed tracing (sometimes called Observability or APM), and logging are all important tools when it comes to finding problems in your systems and identifying their source.
AWS has an amazing team of engineers, and they’re large enough to have a team that can focus on monitoring, observability, and logging. However, running your own tooling means that you’re also responsible for it, and it puts you at risk of incidents that affect not only your applications but your monitoring tooling as well. If you’re running your own tooling, it’s prudent to have a secondary monitoring solution available as a backup.
But whether you’re operating your own monitoring tools or not, it’s even more important to ensure that you’re receiving the correct metrics. I’ve observed countless incidents where monitoring was in place, but dashboards didn’t surface the right information or metrics thought to mean one thing meant something completely different.
Running a controlled incident, also called a GameDay, is a way to ensure that monitoring is in place and showing the right information. By initiating an incident that you can control (and immediately halt if it gets out of hand), you can validate your incident response processes and verify that your dashboards are accurate.
My colleague Tammy has written about how Gremlin runs GameDays to validate our monitoring and alerting.
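To make the idea of a controlled incident concrete, here is a minimal sketch of a GameDay loop: inject a fault, watch for the expected alert, and halt if the impact exceeds an agreed limit. It is not Gremlin’s API. `run_gameday`, `GameDayResult`, and the four callbacks are hypothetical names, and you would wire them to your own fault injection and monitoring tools.

```python
from dataclasses import dataclass


@dataclass
class GameDayResult:
    halted_early: bool  # True if the abort threshold was crossed
    alert_fired: bool   # True if monitoring raised the expected alert
    steps_run: int      # number of observation steps completed


def run_gameday(inject_fault, stop_fault, read_error_rate, alert_is_firing,
                steps=10, abort_error_rate=0.25):
    """Run a controlled incident and report whether monitoring noticed it."""
    halted_early = False
    alert_fired = False
    steps_run = 0
    inject_fault()
    try:
        # A real run would pause between steps; omitted to keep this short.
        for _ in range(steps):
            steps_run += 1
            # Remember whether the expected alert has shown up so far.
            alert_fired = alert_fired or alert_is_firing()
            # Halt as soon as impact exceeds what we agreed to tolerate.
            if read_error_rate() > abort_error_rate:
                halted_early = True
                break
    finally:
        # Always remove the fault, even if one of the checks above raised.
        stop_fault()
    return GameDayResult(halted_early, alert_fired, steps_run)
```

A result with `halted_early=True` and `alert_fired=False` is the interesting one: the system was visibly hurting, yet the alert you were counting on never fired.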
Delivering fixes quickly
While some incidents may be resolved by simply rolling back or reverting the latest changes to the last stable release, incidents such as the one AWS faced cannot. In these cases it’s important to ensure that you can deploy fixes as quickly as possible. Similar to monitoring, if you’re running your own deployment system, it’s wise to have a backup plan to deploy fixes if your primary deployment system goes down.
One thing to be mindful of: when trying to implement fixes quickly to remediate an issue, it can be tempting to shortcut the deployment system and process altogether. Logging into a server to update configurations or pushing changes directly seems expedient, but bypassing existing checks can lead to botched fixes and exacerbate the outage.
Implementing fixes safely
Ensuring that fixes follow your standard deployment testing procedures will help prevent errors that could negatively affect functioning systems. To further guard against issues caused by deploying code or configuration updates, use a canary or staggered deployment strategy. Roll out your changes to a subset of impacted systems and monitor them to confirm the correct behavior before deploying the changes to all impacted systems.
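As a rough illustration of a staggered rollout, the sketch below deploys to a growing fraction of hosts and checks their health before widening the blast radius. The function `staged_rollout`, its callbacks, and the stage fractions are assumptions for the example, not a description of any particular deployment tool.

```python
def staged_rollout(hosts, deploy, is_healthy, rollback, stages=(0.1, 0.5, 1.0)):
    """Deploy to growing fractions of `hosts`, stopping at the first failure.

    Returns (succeeded, hosts_touched). On failure, every touched host has
    already been rolled back by the time this function returns.
    """
    updated = []
    for fraction in stages:
        # Number of hosts that should be updated by the end of this stage.
        target = max(1, round(len(hosts) * fraction))
        for host in hosts[len(updated):target]:
            deploy(host)
            updated.append(host)
        # A real rollout would wait and watch metrics here before checking.
        if not all(is_healthy(host) for host in updated):
            # Undo the change in reverse order, newest host first.
            for host in reversed(updated):
                rollback(host)
            return False, updated
    return True, updated
```

With ten hosts and the default stages, the change reaches one host, then five, then all ten. A failed health check at any stage means the remaining hosts are never touched.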
Synchronization of Chaos
One of the interesting things about the Synchronization of Chaos is that the more tightly coupled the chaotic systems are, the more they influence each other and help reduce each other’s disorder.
Similarly, if your complex systems are reliant on AWS, one way to reduce the chaos is more AWS. The AWS Well-Architected Framework’s (WAF) Reliability Pillar outlines how to use availability zones (AZs) and design your application to use multiple AWS regions to ensure reliability when one AZ or region fails.
As you implement zonal and regional redundancies, use GameDays to validate and practice your failover procedures by instigating controlled failures.