By proactively testing how a system responds under stress, we can identify and fix failures before they end up in the news. Ultimately, the goal of Chaos Engineering is to enhance the stability and resiliency of our systems.
Chaos and Reliability Engineering techniques are quickly gaining traction as essential disciplines for building reliable applications. Many organizations, both big and small, have embraced Chaos Engineering over the last few years.
Creating reliable software is a fundamental necessity for modern cloud applications and architectures. As we move to the cloud or rearchitect our systems to be cloud native, our systems are becoming distributed by design, and the potential for unplanned failure and unexpected outages increases significantly. Additionally, moving to DevOps has further complicated reliability testing.
Testing disciplines like QA and others emerge in response to something that breaks consistently and warrants a new testing methodology.
For example, unit tests verify that a bit of code we write does what it’s supposed to. Integration tests verify that code we wrote plays nicely with the rest of the codebase. Sometimes we have system tests that attempt to verify that the entire system conforms to design specifications. Traditionally, development teams would pass their code to be tested to verify that it worked as expected or to find issues that needed to be fixed.
At this point, the code would be tossed over the proverbial wall to an operations team whose job it was to make that code run in a production environment. Operations bore the responsibility for getting stuff running, and because of the uniqueness of each organization’s environment, individual operations teams would come up with their own strategies and plans.
DevOps merged the development and operations teams together and made them share responsibility for production readiness and deployment. Agile and DevOps software processes have increased our development and deployment velocity by orders of magnitude so we can get products and features to customers faster.
But the faster code is created and checked into master, the more frequently QA has to write tests and the more tests are needed. With faster velocity, the chance that an occasional error will slip past grows. To keep up, testing has been automated as much as possible.
Additionally, we moved to microservices and other distributed, cloud-based architectures. These distributed systems have emergent behaviors, responding to various production conditions by scaling up and down in order to make sure the application can deliver a seamless experience to increasing customer demands. In other words, these systems never follow the same path to arrive at the customer experience. Emergent behaviors also mean emergent failures. Distributed systems will fail, but it’s unlikely that they will fail the same way twice.
Our previous understanding of tests does not account for the unique and constantly changing production environments of today. The Ops side of DevOps does its best to make things work, but its mandate frequently only covers getting the code into production and hoping for the best, then rolling back changes or making hotfixes when failures occur. Ops teams automate some testing, but don’t typically run tests that would uncover system failure arising from turbulent conditions in production.
Traditional quality assurance only covers the application layer of our software stack. And no amount of traditional QA testing or other traditional testing is going to verify whether our application, its various services, or the entire system will respond reliably under any condition, whether “working as designed” or under extreme loads and unusual circumstances.
A failure at any software stack or application layer can disrupt the customer experience. Traditional QA testing methods will not catch any of these potential problem conditions before they actually happen.
Furthermore, most traditional QA activities were absorbed into other teams. Many tests are now automated by CI/CD pipelines and watched over by an SRE or DevOps team. Finding and fixing problems has become the responsibility of service owners. Adding to that is the undeniable fact that it is impossible to make testing and staging environments that accurately mimic production environments.
Because Chaos Engineering can test the quality of code at runtime, and has the potential for both automated and manual forms of testing, the discipline emerged as a powerful tool in the new Quality Assessment toolbox.
Earlier we explained how distributed systems are constantly changing, which means they’ll never break the same way twice, but that they will break. Chaos Engineering helps businesses guard against these failures by allowing engineers to simulate how their systems will respond to failures in a safe and controlled environment.
We use chaos experiments to simulate things on canary instances that we know have the potential to cause problems, like network latency. Does the new service hold up under light testing? Medium? Heavy? We push the new instances hard. In production. We gradually build up and even test past the point where we expect things to work. And we learn things. What we learn often creates opportunities to refine our work further in the next build. This is safe in production because other instances of the service are handling customer needs; no one should even be able to tell we are doing Chaos Engineering.
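To make the idea of gradually building up concrete, here is a minimal Python sketch of a latency ramp. Nothing is actually injected: the canary's response time is modelled as a fixed base latency plus the added delay, and the names `ramp_latency`, `first_breaking_stage`, and the 250 ms budget are assumptions made for illustration, not part of any chaos tool.

```python
from dataclasses import dataclass


@dataclass
class StageResult:
    added_latency_ms: int   # delay injected at this stage
    total_latency_ms: int   # modelled response time of the canary
    within_budget: bool     # did the canary stay inside its latency budget?


def ramp_latency(base_latency_ms, budget_ms, stages_ms):
    """Step through increasing amounts of injected latency.

    Hypothetical model: total latency = base latency + injected delay.
    A real experiment would measure the canary instead of computing this.
    """
    results = []
    for added in stages_ms:
        total = base_latency_ms + added
        results.append(StageResult(added, total, total <= budget_ms))
    return results


def first_breaking_stage(results):
    """Return the smallest injected delay that broke the budget, or None."""
    for result in results:
        if not result.within_budget:
            return result.added_latency_ms
    return None


# Light, medium, and heavy stages, ending past the point we expect to work.
results = ramp_latency(base_latency_ms=40, budget_ms=250,
                       stages_ms=[10, 50, 100, 200, 400])
for result in results:
    print(result)
print("first breaking stage:", first_breaking_stage(results))
```

Recording every stage, not just the failing one, is what gives us something to compare against when we repeat the ramp on the next build.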
Chaos Engineering is the only way to find systemic issues in today’s complex reality, regardless of whether we use canary deployments or not. How will our REST API-driven inventory service behave when network latency increases by two microseconds? What happens when a large number of delayed requests all hit the microservice concurrently? How do we know? We test it.
We start by designing a small chaos experiment, one with a magnitude that is way smaller than we think has the potential to cause trouble. Next, we limit the blast radius and the real potential for harm so that we keep our system and data safe while our chaos testing is in progress. Then we run the experiment and, once it is complete, carefully examine our monitoring, observability, and other system data to see what we learned.
What was affected by our chaos experiment? That data drives how we prioritize our efforts, mitigating the small problems we found before they can become big problems (and definitely mitigating any big problems we find right away!). Then we follow our work up by running the same chaos experiment again to confirm our work was effective.
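The steps above can be sketched as a small experiment runner. This is an illustration under stated assumptions, not a real chaos tool: `FakeFleet` is a hypothetical stand-in for a production fleet in which each impaired host adds two percentage points of errors, and `max_targets` and `abort_threshold` are names we made up for the blast radius limit and the safety cutoff.

```python
import random


class FakeFleet:
    """Hypothetical stand-in for a fleet of hosts we can impair and restore."""

    def __init__(self, hosts):
        self.hosts = list(hosts)
        self.impaired = set()

    def inject(self, host):
        self.impaired.add(host)

    def revert(self, host):
        self.impaired.discard(host)

    def error_rate(self):
        # Assumption for the sketch: every impaired host adds 2% errors.
        return 0.02 * len(self.impaired)


def run_experiment(fleet, max_targets, abort_threshold, seed=0):
    """Impair at most max_targets hosts, halting if errors pass the threshold."""
    rng = random.Random(seed)
    targets = rng.sample(fleet.hosts, min(max_targets, len(fleet.hosts)))
    affected, aborted = [], False
    peak = fleet.error_rate()
    try:
        for host in targets:
            fleet.inject(host)
            affected.append(host)
            peak = max(peak, fleet.error_rate())
            if fleet.error_rate() > abort_threshold:
                aborted = True  # stop before the harm spreads further
                break
    finally:
        # Always restore every host we touched, even on abort or error.
        for host in affected:
            fleet.revert(host)
    return {"affected": affected, "aborted": aborted, "peak_error_rate": peak}


fleet = FakeFleet(f"host-{i}" for i in range(10))
result = run_experiment(fleet, max_targets=3, abort_threshold=0.05)
print(result)
print("error rate after cleanup:", fleet.error_rate())
```

Running the same call again after a fix, with the same seed, selects the same hosts, which is one simple way to confirm that the mitigation changed the outcome.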
Doing this repeatedly, starting small and fixing what we find each time, quickly adds up. Our systems become better and better at handling real-world events that we cannot control or prevent, such as when our cloud provider has an unexpected outage.
“Oh, no! Our Amazon S3 bucket in us-east-2 just went down?” No worries, we anticipated that and our system is still performing well from a customer standpoint. Perhaps we already had a failover backup in place in us-west-1 and designed our system to switch over when performance degraded to a certain level, before customers would notice.
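One way such a switch might be designed is sketched below in Python. It is only an illustration: the `FailoverRouter` class, the 200 ms threshold, and the three-sample window are assumptions for the example, and the latency values are fed in by hand instead of coming from a real health check.

```python
from collections import deque


class FailoverRouter:
    """Route to the primary until latency stays degraded for a full window."""

    def __init__(self, primary, secondary, latency_threshold_ms, window=3):
        self.primary = primary
        self.secondary = secondary
        self.threshold = latency_threshold_ms
        self.samples = deque(maxlen=window)  # most recent latency samples
        self.active = primary

    def record_latency(self, latency_ms):
        """Record one sample and return the region that should serve traffic."""
        self.samples.append(latency_ms)
        window_full = len(self.samples) == self.samples.maxlen
        # Switch only when every sample in the window is over the threshold,
        # so a single slow request does not trigger a failover.
        if window_full and min(self.samples) > self.threshold:
            self.active = self.secondary
        return self.active


router = FailoverRouter("us-east-2", "us-west-1", latency_threshold_ms=200)
for latency_ms in [80, 90, 250, 300, 120, 260, 270, 310]:
    active = router.record_latency(latency_ms)
print("serving from:", active)
```

The sketch fails over in one direction only. Deciding when it is safe to switch back is a separate design question, and a chaos experiment is a good way to test whichever answer we choose.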
Whatever our solution, we designed it, we implemented it, and then we tested it with Chaos Engineering. As a result, it worked as expected when a production failure occurred that was out of our control and, more importantly, our customers never even knew it happened.