The transition from learning about Chaos Engineering to practicing it can be difficult. Chaos Engineering as a concept has been around for a while now, and plenty of talks, papers, and books have explained why it's becoming essential to making distributed systems resilient and stable.
Introducing chaos to increase resilience seems contradictory at first. It's important to recognize that Chaos Engineering is an engineering discipline: you run carefully isolated and controlled experiments in system failure to expose couplings in your system, and those couplings show you where resilience can be increased.
The ideal chaos experiment introduces behavior into your system that is both likely to occur "in the wild" and likely to impact core business functionality. This article provides a few examples of high-value chaos experiments that reproduce common outages we've experienced. They should be a good starting point for your chaos experimentation. In each experiment, I've included a "Gremlin Recipe" in case you want to use our tool to conduct the experiment.
Resources on computers are finite. A machine, VM, or container will inevitably hit a resource limit at some point, and the application will be forced to handle the lack of that resource. Most commonly, the resource is CPU, memory, or I/O.
We can reproduce CPU exhaustion by conducting a chaos experiment. Running this experiment will consume CPU cycles, leaving the application with the same amount of customer-facing work, and less CPU to do it with. As always, we advocate starting small on a single instance, then increasing the blast radius as confidence grows. Common reactions to CPU exhaustion are an increase in errors and latency and a reduction in successful requests to customers.
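For readers who want to see the mechanics without any tooling, here is a minimal sketch of a CPU-consuming experiment in Python. It is not how Gremlin implements its attack; the function names `burn_cpu` and `exhaust_cpu` are made up for illustration. The key property is that the load is bounded in both scope (number of cores) and time, so the experiment ends on its own.

```python
import multiprocessing
import time


def burn_cpu(seconds: float) -> int:
    """Busy-loop on one core until the deadline passes; return iterations done."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1  # pointless work that keeps the core busy
    return iterations


def exhaust_cpu(cores: int, seconds: float) -> None:
    """Run one busy-loop process per core so the load is not limited by the GIL."""
    workers = [
        multiprocessing.Process(target=burn_cpu, args=(seconds,))
        for _ in range(cores)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()  # the experiment ends by itself after `seconds`
```

A small first run might be `exhaust_cpu(cores=1, seconds=10)` on a single instance, called from under an `if __name__ == "__main__":` guard, while you watch error rate and latency on your dashboards. Increase `cores` and `seconds` only once the smaller run behaves as expected.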
Network dependencies are a fact of life in a distributed system, and as distributed systems grow in both adoption and complexity, Chaos Engineering becomes an effective way to test for potential failures on the path to increasing resilience.
We know from the Fallacies of Distributed Computing that the network is unreliable. This implies that the success of our overall system will be determined by how many network calls we make, and how much we can shield the customer from the failures that do occur.
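To make the effect of call count concrete, here is a back-of-the-envelope sketch. It assumes every network call succeeds independently with the same probability, which real systems rarely satisfy exactly, and the 99.9% figure is an illustrative number, not a measurement.

```python
def chain_success_rate(per_call_success: float, calls: int) -> float:
    """Probability that every one of `calls` independent network calls succeeds."""
    return per_call_success ** calls


# How the end-to-end success rate shrinks as one request fans out to more calls.
for calls in (1, 10, 100):
    print(calls, round(chain_success_rate(0.999, calls), 3))
```

Under these assumptions, a request that depends on 10 such calls succeeds about 99.0% of the time, and one that depends on 100 calls only about 90.5% of the time, unless retries, caching, or fallbacks shield the customer from individual failures.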
Applications generally have both internal and external network dependencies. Here, I'm defining internal dependencies to be those systems that are under our organization's control. Someone is carrying a pager for that system and they work for the same company I do. We can (and should) expect that internal teams try to maintain availability, and use the proper tools and channels to notify them if they don't.
When you do have a dependency that is out of your control, it's critical to understand how the system reacts when that dependency is unavailable. A network blackhole chaos experiment tests this by making the designated addresses unreachable from your application. Once you've applied the blackhole, you should check that your application starts up normally and is able to serve customer traffic without the dependency.
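What "serving traffic without the dependency" looks like in code depends on your application, but a common shape is a hard timeout plus a fallback value. The sketch below is one way to write that in Python; `fetch_recommendations` and `with_fallback` are hypothetical names, and the external service is assumed to be a plain HTTP endpoint.

```python
import urllib.request
from typing import Callable


def fetch_recommendations(url: str, timeout: float = 2.0) -> bytes:
    """Call the (hypothetical) external dependency with a hard timeout."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def with_fallback(call: Callable[[], bytes], fallback: bytes) -> bytes:
    """Return the dependency's answer, or `fallback` if it cannot be reached."""
    try:
        return call()
    except OSError:
        # URLError, connection refusals, and socket timeouts all derive from OSError.
        return fallback
```

With a blackhole in place, a call such as `with_fallback(lambda: fetch_recommendations(url), b"[]")` should return the fallback after roughly the timeout rather than hanging. If the request hangs or the process fails to start, the experiment has found something worth fixing.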
All interesting applications have some sort of storage, and managing the relationship between application and datastore is critical to overall system health. There are a variety of ways that an application may overwhelm a datastore (poor queries, lack of indices, bad sharding, upstream caching decisions, etc.), but all of them result in what appears to be an unresponsive data layer.
It's important to understand how datastore saturation manifests in your application. There are a few ways of modeling this with a Chaos Experiment. You can blackhole your datastore, making it appear completely unavailable. You can add latency to requests to your datastore, making it appear slow. Finally, you can consume I/O bandwidth to simulate a congested path to the datastore.
Since the datastore is a critical dependency, you should expect that some features of your application are slow or unavailable. Ideally, only the set of features backed by the datastore are affected. If the impact is wider than you expect, you should investigate for a hidden dependency on the datastore. This is also a great opportunity to tune timeouts to the datastore and test that they're cutting off requests as you expect them to.
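The timeout check can also be rehearsed at the application level before running a network-level attack. The following sketch wraps a datastore call with artificial latency and confirms that a caller-side timeout cuts it off. The helper names `add_latency` and `call_with_timeout` are illustrative, and a real datastore client would normally expose its own timeout setting that you would test instead.

```python
import concurrent.futures
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def add_latency(query: Callable[[], T], delay_seconds: float) -> Callable[[], T]:
    """Wrap a datastore call so every invocation is delayed, simulating saturation."""
    def slow_query() -> T:
        time.sleep(delay_seconds)
        return query()
    return slow_query


def call_with_timeout(query: Callable[[], T], timeout_seconds: float) -> T:
    """Run `query` in a worker thread and stop waiting after `timeout_seconds`."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        return executor.submit(query).result(timeout=timeout_seconds)
    finally:
        executor.shutdown(wait=False)  # do not block the caller on the slow call
```

If `call_with_timeout(add_latency(query, 0.5), 0.1)` does not raise `concurrent.futures.TimeoutError`, the timeout is not doing what you think it is. Note that this sketch only stops waiting; the slow call itself keeps running in its worker thread until it finishes.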
It's easy to forget the critical role that DNS plays in keeping our systems running. Many companies experienced customer-facing issues when a real-world DNS failure occurred in October 2016. Because a failure like this is relatively rare, getting a recovery plan together is challenging.
The best way forward is to induce a DNS outage and understand how your application behaves. If you blackhole DNS traffic on a single instance, it will appear to that instance as if DNS is unavailable. The fixes will vary depending on the issue, but common solutions are to pass around IP addresses instead of hostnames for internal addressing and to use a backup DNS provider. Once you've run this experiment, an exercise for the reader is to consider what sort of damage an outage of an internal service discovery tool (such as etcd, Eureka, or Consul) would cause.
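As one illustration of the "IP addresses for internal addressing" idea, the sketch below resolves a hostname through DNS and falls back to a last-known-good address when resolution fails. The hostname, the IP address, and the `resolve` helper are all placeholders, and a static map like this has to be kept up to date by some other process.

```python
import socket
from typing import Callable, Dict

# Hypothetical last-known-good addresses; hostname and IP are placeholders.
STATIC_ADDRESSES: Dict[str, str] = {"inventory.internal.example": "10.0.0.12"}


def resolve(
    hostname: str,
    static_addresses: Dict[str, str] = STATIC_ADDRESSES,
    lookup: Callable[[str], str] = socket.gethostbyname,
) -> str:
    """Resolve via DNS, falling back to a static address when resolution fails."""
    try:
        return lookup(hostname)
    except socket.gaierror:
        if hostname in static_addresses:
            return static_addresses[hostname]
        raise  # no fallback known: surface the DNS failure to the caller
```

The `lookup` parameter exists so the failure path can be exercised in a unit test by passing a function that raises `socket.gaierror`, without touching the network. During a real DNS blackhole, expect the lookup to take as long as the resolver's own timeout before the fallback is used.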
This set of chaos experiments should help you get started. We'll be discussing more sophisticated chaos experiments in future articles, but in the meantime, we'd love to hear about other interesting tests you've run or any chaos success stories you have. We're active on Slack and we look forward to continuing the conversation there.
Gremlin empowers you to proactively root out failure before it causes downtime. See how you can harness chaos to build resilient systems by requesting a demo of Gremlin.