The Discipline of Chaos Engineering
Last time, we introduced you to the idea of breaking things on purpose in order to build more reliable systems. By triggering failures intentionally in a controlled way, we gain confidence that our systems can deal with those failures before they occur in production.
Today's complex web systems are largely intractable. They're composed of many moving parts (web servers, databases, load balancers, CDNs, routers, and a lot more) working together to form an intricate whole. It's difficult, if not impossible, to fully understand how all the bits and pieces interact with each other under different conditions. That's why we need to be proactive in our efforts to learn from failure.
At Gremlin, we believe that experience with failure is a prerequisite for creating reliable systems. And for us, the perfect embodiment of this idea is Chaos Engineering. This post kicks off a series designed to introduce you to Chaos Engineering. We will start with the fundamentals today and later explore best practices to make the most of it.
Chaos Engineering defined
So, what exactly is Chaos Engineering?
To put it simply, Chaos Engineering is one particular approach to "breaking things on purpose" that aims at teaching us something new about systems by performing experiments on them. Ultimately, our goal is to identify hidden problems that could arise in production. Only then will we be able to address systemic weaknesses and make our systems fault-tolerant.
Chaos Engineering goes beyond traditional (failure) testing in that it's not only about verifying assumptions. It also helps us explore the many unpredictable things that could happen and discover new properties of our inherently chaotic systems.
While the underlying concepts aren't that new, it was Netflix who originally formalized Chaos Engineering as a discipline. (Netflix, as it happens, is more than a video-streaming service. The company behind the service is also a pioneer in the field of automated failure testing, having developed tools like the notorious Chaos Monkey, which randomly terminates EC2 instances in AWS.)
In the Principles of Chaos Engineering, Netflix defines the discipline as follows:
"Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system's capability to withstand turbulent conditions in production."
As a Chaos Engineer, you test a system's ability to cope with real-world events (server failures, traffic spikes, malformed messages, etc.) in a series of controlled experiments. According to the principles, these chaos experiments typically consist of four steps:
- Define the system's normal behavior—its steady state—based on measurable output like overall throughput, error rates, latency, and so on.
- Hypothesize about the steady-state behavior of an experimental group, as compared to a stable control group. (For practical purposes, both groups may contain the same systems but at different points in time, that is, before and during the experiment.)
- Expose the experimental group to simulated real-world events such as server crashes, malformed responses, or traffic spikes.
- Test the hypothesis by comparing the steady state of the control group and the experimental group. The smaller the differences, the more confidence we have that the system is reliable.
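To make the four steps a little more concrete, here is a minimal Python sketch of the first and last ones: summarizing a steady state from measurements and comparing the control group with the experimental group. The names `SteadyState`, `summarize`, and `hypothesis_holds`, as well as the tolerance values, are assumptions made up for illustration. They are not part of the Principles of Chaos Engineering or of any Gremlin API, and sensible thresholds depend entirely on your own system.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Sequence


@dataclass
class SteadyState:
    """Summary of measurable output over one observation window."""
    error_rate: float       # fraction of failed requests, 0.0 to 1.0
    mean_latency_ms: float  # average response time in milliseconds


def summarize(latencies_ms: Sequence[float], errors: int) -> SteadyState:
    """Step 1: reduce raw observations to a steady-state summary."""
    total = len(latencies_ms)
    if total == 0:
        raise ValueError("need at least one observation")
    return SteadyState(error_rate=errors / total,
                       mean_latency_ms=mean(latencies_ms))


def hypothesis_holds(control: SteadyState,
                     experiment: SteadyState,
                     max_error_delta: float = 0.01,
                     max_latency_ratio: float = 1.5) -> bool:
    """Step 4: compare both groups; small differences support the hypothesis.

    The default tolerances are illustrative assumptions, not recommendations.
    """
    error_ok = experiment.error_rate - control.error_rate <= max_error_delta
    latency_ok = (experiment.mean_latency_ms
                  <= control.mean_latency_ms * max_latency_ratio)
    return error_ok and latency_ok


# Hypothetical measurements taken before and during an experiment.
control = summarize([100, 110, 90], errors=0)
experiment = summarize([900, 1200, 1500], errors=1)
print(hypothesis_holds(control, experiment))  # prints False
```

Steps two and three (stating the hypothesis and injecting the failure) happen outside this snippet. The comparison only tells you whether the steady state held while the failure was active.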
Needless to say, it's a good idea to address any problems uncovered this way. The whole point of simulating potentially catastrophic events over and over again is to make them non-events, irrelevant to our infrastructure and its operators.
As a basic example, let's say you want to know what happens if, for some reason, your MySQL database isn't available. Injecting failures so that individual pieces of your infrastructure become unavailable is an excellent way to learn about system coupling. Besides, it's an entirely realistic thing to happen in production.
A word of caution: never conduct a chaos experiment in production if you already know that it will cause severe damage, possibly affecting customers and, with them, your reputation. Always try to fix known problems first! Chaos Engineering requires a base level of reliability.
Back to the experiment. You hypothesize that, in the absence of the database, your web application would stop serving requests, immediately returning an error instead. To simulate the event, you block access to the database server. There are a couple of ways to achieve this. You could, for example, add an iptables rule to drop traffic to port 3306, or modify your cloud provider's security groups accordingly. (Or you could use Gremlin, our product, which supports a variety of network attacks to impact traffic to your application and safely reverts everything when done.)
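As a rough sketch of the iptables approach, the Python snippet below adds a rule that drops outbound TCP traffic to port 3306 and removes it again when the experiment ends, even if something goes wrong in between. It assumes the rule is applied on the application host, that the script runs with root privileges, and that dropping traffic in the `OUTPUT` chain is what you want. The helper names `iptables_rule` and `blocked_port` are hypothetical, and your setup may call for a different chain or for matching on the database host's address as well.

```python
import subprocess
from contextlib import contextmanager
from typing import Callable, Iterator, List


def iptables_rule(action: str, port: int) -> List[str]:
    """Build an iptables command that appends (-A) or deletes (-D) a DROP rule."""
    if action not in ("-A", "-D"):
        raise ValueError("action must be '-A' (append) or '-D' (delete)")
    return ["iptables", action, "OUTPUT",
            "-p", "tcp", "--dport", str(port), "-j", "DROP"]


@contextmanager
def blocked_port(port: int = 3306,
                 run: Callable[..., object] = subprocess.run) -> Iterator[None]:
    """Drop outbound traffic to `port` for the duration of the with-block."""
    run(iptables_rule("-A", port), check=True)
    try:
        yield
    finally:
        # Always revert, so a failed experiment does not leave the rule behind.
        run(iptables_rule("-D", port), check=True)
```

You would wrap the observation phase of the experiment in `with blocked_port(3306): ...`. The `run` parameter exists only so the command runner can be swapped out, for example to log the commands in a dry run instead of executing them.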
When performing the experiment, something unexpected happens though: you notice that the web application seems to take forever to respond. Your system metrics confirm the suspicion that something must be wrong. After some investigation, you find the cause—a misconfigured client timeout—and fix it in a matter of minutes.
So your initial hypothesis turned out to be wrong. Fortunately, that's a good problem to have because you just discovered a new property of your system—a weakness, actually—that you didn't know about before the experiment! It's much cheaper to deal with these problems now rather than wait for them to manifest in production.
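The post doesn't say which client or timeout setting was involved, so the following is only a generic illustration of the idea behind the fix: give every network call an explicit, short timeout so that an unreachable dependency produces a quick error rather than a hanging request. The function name `first_byte_or_unavailable` and the timeout value are assumptions. A real MySQL client would expose its own connect and read timeout options instead.

```python
import socket


def first_byte_or_unavailable(host: str, port: int, timeout_s: float) -> str:
    """Return "ok" if the server sends data within timeout_s, else "unavailable".

    create_connection applies timeout_s to the connect attempt and leaves it
    set on the socket, so the recv call below is bounded by it as well.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as conn:
            data = conn.recv(1)
            # An empty result means the peer closed the connection.
            return "ok" if data else "unavailable"
    except OSError:
        # Covers timeouts, refused connections, and unreachable hosts.
        return "unavailable"
```

With a bounded timeout in place, the application can return an error promptly, which is the behavior the original hypothesis expected.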
From our experience, there's a lot of low-hanging fruit you can fix when you're starting out with Chaos Engineering. You'll be surprised how much there is to learn from chaos in general and failure in particular.
Over the next couple of weeks, we will continue our series on Chaos Engineering, taking a closer look at some best practices (automating experiments, running them in production, reducing risk, etc.) that allow you to design even better experiments.
We hope you enjoyed this introduction to Chaos Engineering, a powerful approach to building reliable systems. As always, the best way to internalize new concepts is by practice. Start today and run your first chaos experiments with Gremlin.