Chaos Engineering: the history, principles, and practice
Last Updated February 27th, 2018
With the rise of microservices and distributed cloud architectures, the web has grown increasingly complex. As a result, “random” failures have grown difficult to predict. At the same time, our dependence on these systems has only increased.
These failures cause costly outages for companies. These outages hurt customers when they try to shop, transact business, and get work done. Even brief issues hit company bottom lines, and as a result the cost of downtime is becoming a KPI for many engineering teams. For example, in 2017, 98% of organizations said a single hour of downtime will cost their business over $100,000. One outage can cost a single company millions of dollars. The CEO of British Airways recently explained that a technological failure which stranded tens of thousands of British Airways (BA) passengers in May 2017 cost the company 80 million pounds ($102.19 million USD).
Companies need a solution to this challenge because waiting for the next incident to respond is too late. To meet this challenge head on, companies are turning to chaos engineering.
What is Chaos Engineering - Preventative medicine
Chaos engineering is a disciplined approach to identifying failures before they become outages. By proactively testing how a system will respond under duress, you can identify and fix failures before they end up in the news.
Chaos engineering compares what you think will happen to what actually happens in your systems. In chaos engineering, you literally “break things on purpose” to learn how to build more resilient systems.
A brief history of Chaos Engineering
Chaos engineering first became relevant at internet companies that were pioneering large scale, distributed systems. These systems were so complex that they required a new approach to test for failure.
The Netflix Eng Tools team created Chaos Monkey. Chaos Monkey was created in response to Netflix’s move from physical infrastructure to cloud infrastructure provided by Amazon Web Services, and the need to be sure that a loss of an Amazon instance wouldn’t affect the Netflix streaming experience.
The Simian Army was born. The Simian Army added additional failure injection modes on top of Chaos Monkey that would allow testing of a more complete suite of failure states, and thus build resilience to those as well. “The cloud is all about redundancy and fault-tolerance. Since no single component can guarantee 100% uptime (and even the most expensive hardware eventually fails), we have to design a cloud architecture where individual components can fail without affecting the availability of the entire system” (Netflix, 2011).
When Netflix shared the source code for Chaos Monkey on GitHub in 2012, they included the message that they “have found that the best defense against major unexpected failures is to fail often. By frequently causing failures, we force our services to be built in a way that is more resilient” (Netflix, 2012).
The Netflix team decided they would create a new role, the Chaos Engineer. Bruce Wong coined the role and Dan Woods shared it with the greater engineering community via Twitter. Dan Woods explains, “I learned more about chaos engineering from Kolton Andrus than anyone else, he called it failure injection testing”.
In October of 2014, while Gremlin co-founder Kolton Andrus was at Netflix, they announced Failure Injection Testing (FIT), which built on the concepts of the Simian Army but gave developers more granular control over the “blast radius” of their failure injection. The Simian Army tools had been so effective that in some instances they created painful outages, and as a result Netflix developers grew wary of using the tools in the first place. FIT gave developers control over the scope of their failure so they could realize the insights of chaos engineering while mitigating the potential downside.
What are the Principles of Chaos Engineering?
Chaos engineering involves running thoughtful, planned experiments which teach us how our systems behave in the face of failure.
These experiments follow three steps:
You start by forming a hypothesis about how a system should behave when something goes wrong.
Then, you design the smallest possible experiment to test it in your system.
Finally, you measure the impact of the failure at each step, looking for signs of success or failure. When the experiment completes, you should have a better understanding of your system’s real-world behavior.
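The three steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Experiment` class, the `run` function, the toy three-replica system, and the 1% tolerance are all hypothetical names and numbers chosen for the example, not part of any real tool.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    """One chaos experiment: a hypothesis, a fault to inject, and a measurement."""
    hypothesis: str
    inject: Callable[[], None]    # starts the failure (hypothetical hook)
    revert: Callable[[], None]    # undoes the failure
    measure: Callable[[], float]  # returns the metric under observation
    tolerance: float              # allowed relative drop from baseline (assumed)


def run(exp: Experiment) -> bool:
    """Return True if the hypothesis held, False if the metric degraded."""
    baseline = exp.measure()      # steady state before the failure
    exp.inject()
    try:
        observed = exp.measure()  # behavior while the failure is active
    finally:
        exp.revert()              # always restore, even if measuring raises
    return observed >= baseline * (1 - exp.tolerance)


# Toy system: three replicas, each able to serve 100 requests per second.
replicas = {"a": True, "b": True, "c": True}
demand = 150.0


def served_rps() -> float:
    capacity = 100.0 * sum(replicas.values())
    return min(demand, capacity)


exp = Experiment(
    hypothesis="Losing one of three replicas does not reduce served traffic",
    inject=lambda: replicas.update(a=False),   # smallest possible failure
    revert=lambda: replicas.update(a=True),
    measure=served_rps,
    tolerance=0.01,
)
hypothesis_held = run(exp)
```

In this toy setup the hypothesis holds because two replicas still cover the demand. Injecting the loss of two replicas instead would drop served traffic below the baseline, and `run` would report that the hypothesis was disproved.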
Which companies practice chaos engineering?
Because they have adopted distributed systems like microservices, many large technology companies practice Chaos Engineering, including Twilio, Netflix, LinkedIn, Facebook, Google, Microsoft, and Amazon.
Chaos Engineering is also practiced within the (traditionally very conservative) banking and finance industry. For example, in 2014, the National Australia Bank migrated from physical infrastructure to Amazon Web Services and used Chaos Engineering to dramatically reduce incident counts.
Why would you break things on purpose?
It’s helpful to think of a vaccine or a flu shot where you inject yourself with a small amount of a potentially harmful foreign body in order to prevent illness. Chaos engineering is a tool we use to build such an immunity in our technical systems by injecting harm (like latency, CPU failure, or network black holes) in order to find and mitigate potential weaknesses.
These experiments have the added benefit of helping our team build muscle memory in resolving outages, akin to a fire drill (or changing a flat tire in the Netflix analogy). By breaking things on purpose we are able to identify unknown issues that could impact our systems and customers.
If you’re looking for a couple of starting places for chaos experiments, you can read these posts:
- 4 Chaos Experiments to Start With
- How to Run a Gameday
- Gremlin’s Gameday: Breaking DynamoDB (We take our own medicine!)
What’s the role of chaos engineering in distributed systems?
Distributed systems are inherently more complex than monolithic systems, and therefore it’s difficult to predict all the ways that they might fail. The eight fallacies of distributed systems shared by Peter Deutsch and others at Sun Microsystems describe false assumptions that programmers new to distributed applications invariably make.
Fallacies of Distributed Systems:
- The network is reliable
- Latency is zero
- Bandwidth is infinite
- The network is secure
- Topology doesn’t change
- There is one administrator
- Transport cost is zero
- The network is homogeneous
Several of these fallacies are the focus of Chaos Engineering experiments such as “packet-loss attacks” and “latency attacks”. For example, network outages can cause a range of failures for applications that severely impact customers. Applications may stall while they wait endlessly for a packet. Applications may permanently consume memory or other Linux system resources. And even after a network outage has passed, applications may fail to retry stalled operations, or may retry too aggressively. Applications may even require a manual restart. Each of these examples needs to be tested and prepared for.
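As a rough illustration of the retry behavior a packet-loss experiment is meant to exercise, here is a minimal Python sketch of a bounded retry with exponential backoff and jitter. The function `call_with_retries`, the simulated dependency `lossy`, and all default values are assumptions made for the example. A real client would also need a per-call timeout so that it never waits endlessly for a packet.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retries(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.1,
    max_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> T:
    """Retry a timed-out operation a bounded number of times with backoff."""
    rng = rng or random.Random()
    for attempt in range(max_attempts):
        try:
            return operation()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # give up instead of retrying forever
            # Exponential backoff with jitter keeps clients from retrying
            # in lockstep once the network recovers.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng.uniform(0, ceiling))
    raise ValueError("max_attempts must be at least 1")


def lossy(fail_first: int) -> Callable[[], str]:
    """Simulated dependency that times out for the first `fail_first` calls."""
    calls = {"n": 0}

    def operation() -> str:
        calls["n"] += 1
        if calls["n"] <= fail_first:
            raise TimeoutError("simulated packet loss")
        return "ok"

    return operation


# Record the delays instead of sleeping so the example runs instantly.
delays = []
result = call_with_retries(lossy(fail_first=2), sleep=delays.append)
```

Here the call succeeds on the third attempt after two short, randomized waits. If the dependency keeps timing out, the function raises after `max_attempts` calls, which gives the caller a clear failure to handle rather than a stalled or endlessly retrying application.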
What are the customer, business and technical benefits of chaos engineering?
Companies practicing Chaos Engineering work towards system resilience as this benefits their customers, business, and engineering team.
We describe the benefits as follows:
- Customer: the increased availability and durability of service means fewer outages disrupt their day-to-day lives.
- Business: chaos engineering can help prevent extremely large losses in revenue and maintenance costs, create happier and more engaged engineers, improve on-call training for engineering teams, and improve the SEV (incident management) program for the entire company.
- Technical: the insights from chaos experiments can mean a reduction in incidents, reduction in on-call burden, increased understanding of system failure modes, improved system design, faster mean time to detection for SEVs, and reduction in repeated SEVs.
Chaos engineering for service teams
Many engineering organizations, including Netflix and Stitch Fix, have dedicated Chaos Engineering teams. These teams are often small in size, with 2-5 engineers. The Chaos Engineering team will own and advocate for Chaos Engineering across the organization. However, they will not be the only engineers using chaos engineering day-to-day. They will empower teams across their engineering organization to use Chaos Engineering.
Examples of service teams that would often be amongst the first to perform Chaos Engineering within a company:
- Traffic Team (e.g. Nginx, Apache, DNS)
- Streaming Team (e.g. Kafka)
- Storage Team (e.g. S3)
- Data Team (e.g. Hadoop/HDFS)
- Database Team (e.g. MySQL, Amazon RDS, PostgreSQL)
Other companies, such as Remind, are integrating chaos engineering into their normal release cycle, alongside other best-practice testing, as a way to ensure reliability is baked into every feature.
Which Chaos Engineering experiments do you perform first?
We argue that you should perform your experiments in the following order:
- Known Knowns - Things we are aware of and understand
- Known Unknowns - Things we are aware of but don’t understand
- Unknown Knowns - Things we understand but are not aware of
- Unknown Unknowns - Things we are neither aware of nor understand
The diagram below illustrates this concept:
To illustrate this in practice with examples, we will demonstrate how to select experiments based on a sharded MySQL Database. In this example we have a cluster of 100 MySQL hosts with multiple shards per host.
In one region, we have a primary database host with 2 replicas and we use semi-sync replication. We also have a pseudo primary and 2 pseudo replicas in a different region.
- We know that when a replica experiences a shutdown it will be removed from the cluster. We know that a new replica will then be cloned from the primary and added back to the cluster.
- We know that the clone will occur as we have logs that confirm if it succeeds or fails, but we don’t know the weekly average of the mean time it takes from experiencing a failure to adding a clone back to the cluster effectively.
- We know we will get an alert that the cluster has only one replica after 5 minutes but we don’t know if our alerting threshold should be adjusted to more effectively prevent incidents.
- If we shut down the two replicas for a cluster at the same time, we don’t know exactly the mean time during a Monday morning it would take us to clone two new replicas off the existing primary. But we do know we have a pseudo primary and two replicas which will also have the transactions.
- We don’t know exactly what would happen if we shut down an entire cluster in our main region, and we don’t know if the pseudo region would be able to fail over effectively because we have not yet run this scenario.
We would create the following chaos engineering experiments and work through them in the following order:
- Known-Knowns: shut down one replica and measure the time it takes for the shutdown to be detected, the replica to be removed, the clone to kick off, the clone to be completed, and the clone to be added back to the cluster. Before you kick off this experiment, increase replicas from two to three. Run the shutdown experiment at a regular frequency but aim to avoid the experiment resulting in 0 replicas at any time. Report on the mean total time to recovery for a replica shutdown failure and break this down by day and time to account for peak hours.
- Known-Unknowns: Use the results and data of the known-known experiment to answer questions which would currently be “known-unknowns”. You will now know the weekly average of the mean time it takes from experiencing a failure to adding a clone back to the cluster effectively. You will also know if 5 minutes is an appropriate alerting threshold to prevent SEVs.
- Unknown-Knowns: Increase the number of replicas to four before conducting this experiment. Shut down two replicas for a cluster at the same time, and collect the mean time during a Monday morning over several months to determine how long it would take us to clone two new replicas off the existing primary. This experiment may identify unknown issues; for example, the primary cannot handle the load from cloning and backups at the same time and you need to make better use of the replicas.
- Unknown-Unknowns: Shutdown of an entire cluster (primary and two replicas) would require engineering work to make this possible. It is possible this failure will occur unexpectedly in the wild but you are not yet ready to handle it. Prioritize the engineering work to handle this failure scenario before you perform chaos engineering experiments.
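The reporting step in the known-known experiment (mean time to recovery, broken down by day and time) could look something like the following Python sketch. The timestamps in `runs` are invented sample data, and the function and variable names are hypothetical. Only the 5-minute alert delay comes from the example above.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

DAYS = ("Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday")
ALERT_AFTER_MINUTES = 5  # alert delay from the example above

# Invented experiment log: (replica shut down at, clone rejoined cluster at).
runs = [
    (datetime(2018, 2, 5, 9, 0), datetime(2018, 2, 5, 9, 12)),     # Monday
    (datetime(2018, 2, 12, 9, 30), datetime(2018, 2, 12, 9, 48)),  # Monday
    (datetime(2018, 2, 7, 2, 0), datetime(2018, 2, 7, 2, 4)),      # Wednesday
]


def mean_recovery_minutes(runs):
    """Average recovery time, grouped by (weekday, hour) of the failure."""
    buckets = defaultdict(list)
    for failed_at, recovered_at in runs:
        minutes = (recovered_at - failed_at).total_seconds() / 60
        key = (DAYS[failed_at.weekday()], failed_at.hour)
        buckets[key].append(minutes)
    return {key: mean(values) for key, values in buckets.items()}


report = mean_recovery_minutes(runs)

# Time slots where the clone typically rejoins only after the alert has fired.
slower_than_alert = sorted(
    key for key, minutes in report.items() if minutes > ALERT_AFTER_MINUTES
)
```

With this sample data, the Monday 09:00 slot averages 15 minutes and the Wednesday 02:00 slot averages 4 minutes, so only the Monday morning slot would normally trigger the alert before recovery completes. A breakdown like this is the kind of evidence you would use to decide whether the alerting threshold needs adjusting.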
How do you plan for your first chaos engineering experiments?
Planning your First Experiment
One of the most powerful questions in Chaos Engineering is “What could go wrong?”. By asking this question about our services and environments, we can review potential weaknesses and discuss expected outcomes. Similar to a risk assessment, this informs priorities about which scenarios are more likely (or more frightening) and should be tested first. By sitting down as a team and white-boarding your service(s), dependencies (both internal and external), and data stores, you can formulate a picture of “What could go wrong?”. When in doubt, injecting a failure or a delay into each of your dependencies is a great place to start.
Creating a Hypothesis
You have an idea of what can go wrong. You have chosen a scenario, the exact failure to inject. What happens next? This is an excellent thought exercise to work through as a team. By discussing the scenario, you can hypothesize on the expected outcome when running live. What will be the impact to customers, to your service or to your dependencies?
Measuring the Impact
In order to understand how your system behaves under duress, you need to measure your system’s availability and durability. It is good to have a key performance metric that correlates to customer success (such as orders per minute, or stream starts per second). As a rule of thumb, if you ever see an impact to these metrics, you want to halt the experiment immediately. Next is measuring the failure itself where you want to verify (or disprove) your hypothesis. This could be the impact on latency, requests per second, or system resources. Lastly, you want to survey your dashboards and alarms for unintended side effects.
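The rule of thumb about halting can be expressed as a simple guard around the experiment. The sketch below is a hypothetical illustration: `run_with_guard`, the sample values, and the 5% allowance for normal metric noise are all assumptions, and a real setup would read the KPI from your monitoring system rather than from a list.

```python
from typing import Callable, Iterable


def run_with_guard(
    kpi_samples: Iterable[float],
    baseline: float,
    max_drop: float,
    halt: Callable[[], None],
) -> bool:
    """Watch a customer-facing KPI and halt at the first sample below the floor.

    Returns True if the experiment ran to completion, False if it was halted.
    """
    floor = baseline * (1 - max_drop)
    for sample in kpi_samples:
        if sample < floor:
            halt()  # stop the failure injection immediately
            return False
    return True


# Orders per minute observed while the failure is active (invented values).
halted = []
completed = run_with_guard(
    kpi_samples=[1002.0, 995.0, 940.0, 990.0],
    baseline=1000.0,
    max_drop=0.05,  # assumed allowance for normal noise
    halt=lambda: halted.append("reverted"),
)
```

In this run the third sample falls below the floor of 950 orders per minute, so the guard calls `halt` once and stops reading further samples.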
Have a Rollback Plan
Always have a plan in case things go wrong. You must accept that sometimes even the backup plan can fail. Talk through the ways in which you’re going to revert the impact. If you’re running commands by hand, be thoughtful not to break ssh or control plane access to your instances. One of the core aspects of Gremlin is safety. All of our attacks can be reverted immediately, allowing you to safely abort and return to steady state if things go wrong.
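If you are scripting experiments yourself, one way to make the rollback harder to skip is to arm an automatic revert before injecting the failure. The Python sketch below assumes a hypothetical `TimeBoxedFault` helper and a toy `state` dictionary standing in for a real system. It is not how Gremlin implements reverts.

```python
import threading


class TimeBoxedFault:
    """Inject a fault and guarantee a revert on exit or after a deadline."""

    def __init__(self, start, revert, max_seconds):
        self._start = start
        self._revert = revert
        self._max_seconds = max_seconds
        self._lock = threading.Lock()
        self._reverted = False
        self._timer = None

    def _revert_once(self):
        # The timer and the normal exit path may race; revert only once.
        with self._lock:
            if self._reverted:
                return
            self._reverted = True
        self._revert()

    def __enter__(self):
        # Arm the automatic revert first, so it exists before the fault does.
        self._timer = threading.Timer(self._max_seconds, self._revert_once)
        self._timer.daemon = True
        self._timer.start()
        self._start()
        return self

    def __exit__(self, exc_type, exc, tb):
        self._timer.cancel()
        self._revert_once()
        return False  # never swallow exceptions from the experiment body


# Toy stand-in for a real system under test.
state = {"latency_ms": 0}
reverted = threading.Event()


def add_latency():
    state["latency_ms"] = 300


def remove_latency():
    state["latency_ms"] = 0
    reverted.set()


with TimeBoxedFault(add_latency, remove_latency, max_seconds=30):
    during = state["latency_ms"]  # observe the system while the fault is active
after = state["latency_ms"]
```

The revert runs when the block exits, including when the experiment body raises. If the block never exits, the timer reverts the fault after `max_seconds`. This only helps while the process itself is alive, so it complements a talked-through manual rollback plan and does not replace one.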
Go fix it!
After running your first experiment, hopefully, there is one of two outcomes. You’ve verified either that your system is resilient to the failure you introduced, or you’ve found a problem you need to fix. Both of these are good outcomes. On one hand, you’ve increased your confidence in the system and its behavior; on the other, you’ve found a problem before it caused an outage.
Chaos Engineering is a tool to make your job easier. By proactively testing and validating your system’s failure modes you will reduce your operational burden, increase your availability, and sleep better at night. Gremlin makes it safe and simple to get started. Email us to begin today!
Where can you find additional Chaos Engineering resources?
Pavlos Ratis has created a GitHub repo called “Awesome Chaos Engineering,” which is a curated list of Chaos Engineering resources. You can find Books, Tools, Papers, Blogs, Newsletters, Conferences, MeetUps, Forums and engineers to follow on Twitter. View the GitHub repo here: https://github.com/dastergon/awesome-chaos-engineering. Gremlin Principal Software Engineer Matt Jacobs has written a guide on 4 Chaos Engineering Experiments To Start With.
What are the recommended Chaos Engineering conference presentations to view?
- QCon 2015 - Kolton Andrus (Gremlin) on Breaking Things at Netflix
- AWS re:Invent 2017 - Nora Jones (Netflix) Describes Why We Need More Chaos - Chaos Engineering, That Is
- Velocity 2017 - Kolton Andrus (Gremlin) shares the Evolution of Chaos
- Australian Government - Tammy Butow (Gremlin) gives an Introduction to Chaos Engineering
- SRECon 2017 - Kolton Andrus (Gremlin) on Breaking Things
Where can you find the chaos engineering community?
The Chaos Engineering community is a global community with engineers based in over 10 countries around the world.
- Chaos Engineering Meetup community with over 2,000 engineers. Join here: meetup.com/pro/chaos.
- Chaos Engineering Slack community with over 400 engineers. Join here: gremlin.com/slack.
- Follow Gremlin on Twitter @gremlininc and Instagram @thegremlininc.
As web systems have grown much more complex with the rise of distributed systems like microservices, system failures have become difficult to predict. So in order to prevent failures from happening, there is a need to be proactive in our efforts to learn from failure.
In this paper, we’ve shared a brief history of Chaos Engineering, and then demonstrated how Chaos Engineering offers us new insights into our systems. Finally, we shared Chaos Engineering resources where you can learn more about how to incorporate a little chaos into your engineering culture.
We look forward to hearing about your Chaos Engineering journey and encourage you to share your progress with the Chaos Engineering community.