Chaos Engineering is a practice that is growing in implementation and interest. What is it and why are some of the most successful companies in the world adopting it?
Chaos Theory, in mathematics, studies how changing systems behave as the result of apparently random actions. Recall the widely known idea of a butterfly flapping its wings in South America having the potential to cause a hurricane in the Caribbean or the Gulf of Mexico.
In that sense, Chaos Engineering studies how highly complex, large-scale computer systems respond to apparently random events. A networking glitch in one data center, or a bad server configuration change that propagates across your system, can have catastrophic results and cause expensive downtime.
Originally, Chaos Engineering involved subjecting extremely complex cloud-deployed systems to randomized, negative behavior, like shutting down an individual node or instance to see how the system responded. This was Netflix’s rationale for creating, and later releasing as open source, their Chaos Monkey: they were migrating to AWS and needed their systems to be resilient to random host failures.
The ultimate purpose was never chaos, but reliability. Learning how a system responds to a stimulus gives engineers the opportunity to adjust their systems to automatically mitigate future occurrences of the same stimulus. When you simulate a problem while controlling the parameters, you collect useful data while limiting the impact.
Stated simply, Chaos Engineering is not chaotic: it actively seeks to limit the chaos of outages by carefully investigating how to continually make a system more robust. For this reason, some prefer broader terms like reliability engineering. A site reliability engineer may perform Chaos Engineering experiments, but that is only a part of their overall job.
As Chaos Engineering has matured (and continues to mature), the idea of injecting random failures gave way to instead injecting intentional, measured, known failures in carefully designed ways.
This is done thoughtfully, scientifically, and experimentally, with designs that limit the impact area (often called the blast radius) and the potential impact on other parts of the wider system, and with a means to stop and roll back the experiment at any moment if we discover it is causing harm beyond the goals of the experiment. Metrics are used across the system to measure the results of testing and provide as full a picture of the response to the stimulus as possible. Experiments may be conducted anywhere: across cloud instances, services, microservices, Docker or other containers, and even via intentional application-layer fault injection. We even see engineers testing the resource constraints of their monoliths.
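The experiment design described above (limit the blast radius, watch metrics throughout, halt and roll back the moment harm is detected) can be sketched in a few lines of Python. This is a minimal, hypothetical harness, not any particular tool's implementation; the `inject`, `rollback`, and `measure` hooks are assumptions that stand in for your own tooling and metrics pipeline:

```python
import random

def run_experiment(targets, inject, rollback, measure,
                   blast_radius=0.1, error_threshold=0.05):
    """Run a fault-injection experiment against a limited subset of targets.

    inject(host)/rollback(host) apply and undo the fault on one target;
    measure() returns the current system-wide error rate from your metrics.
    Returns (affected_hosts, status), where status is "completed" or "aborted".
    """
    # Limit the blast radius: only touch a small random fraction of targets.
    sample = random.sample(targets, max(1, int(len(targets) * blast_radius)))
    injected = []
    try:
        for host in sample:
            inject(host)  # e.g. add latency, kill a process, drop packets
            injected.append(host)
            if measure() >= error_threshold:
                return injected, "aborted"  # harm detected: stop early
        return injected, "completed"
    finally:
        # Whatever happens, roll the fault back on every host we touched.
        for host in injected:
            rollback(host)
```

With simulated hooks, a healthy run completes across the sampled hosts and leaves no fault applied afterwards, while a run whose error rate breaches the threshold stops after the first host. The real work in practice is in the `measure` hook: it is only as good as your observability.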
While we at Gremlin have some clear thoughts and ideas of what Chaos Engineering is and how we believe it is best performed, there’s an entire community of engineers bringing the culture and practice to their organizations.
To that end, we decided to ask experts in diverse roles from multiple companies for their thoughts and opinions and have compiled them here.
In my opinion, Chaos Engineering is when you make resiliency a must-have for your system. It's accomplished through controlled experiments or GameDays where specific failure modes are tested and the results are used to make the system more robust. Chaos Engineering makes development teams think about the environment where their system actually runs, where failures are simply another day in the office: production.
Right now, Chaos Engineering is starting to spread across many companies, as they see the value of avoiding hard downtime. It's easier to get buy-in once there's an actual quantification of downtime's cost to the business. In the future, I expect Chaos Engineering to go mainstream and experimentation toggles to be built into more software. For me, it feels like it will augment integration testing, as it's a great point to test overall system stability.
Chaos Engineering is performing self-contained engineering experiments in order to validate hypotheses about system behavior or exercise response paths, in a controlled manner that can be rolled back if they cause harm.
We have to think about Chaos Engineering not just in single-service or single-machine scenarios, but also use it to establish certainty that we will detect problems where individual users see dramatically different performance from the rest of the herd. Good observability practice is critical for being able to establish a baseline, validate that your experiments are having the intended effect, debug when things behave differently than expected (or exercise the debugging paths for people engaging in a game day), and ensure things return to nominal afterwards.
The industry is at a tipping point with respect to the principles pushed forward by Chaos Engineering. More often than not, we are seeing that a view of systems reliability limited to tighter operational controls, reliable response rates, and low latency is not enough. This is especially true with the advent of microservice-oriented architectures, which have given rise to more complex, distributed systems. The industry is starting to grapple with the fact that we need to evaluate systemic failure scenarios before problems occur, so that we thoroughly and holistically understand our systems.
I think about this problem as an analogy to vaccination. We can only do so much once a disease has entered and is spreading through our system. However, if we take active, preventative measures to defend our systems by exposing them to controlled failure scenarios, we can build up the system's immune response and fend off much worse, uncontrolled situations. I am excited to see the principles of Chaos Engineering making their way into the foundational layers of infrastructure that build up today's internet-scale platforms. For instance, in the Envoy project (of which I am a maintainer) we have created a fault injection filter, which has allowed the resilience team at Lyft to routinely inject network-level failures into our platform. With this mentality at the forefront, we are able to build safer, more resilient products for our customers.
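Envoy's fault filter works at the proxy layer, but the same idea can be sketched at the application layer. The Python decorator below is a hypothetical analogue, not Envoy's actual filter or API: the `FaultInjected` exception and all percentages are illustrative, standing in for a proxy returning an error or holding a request.

```python
import functools
import random
import time

class FaultInjected(Exception):
    """Stands in for an injected failure (e.g. a proxy returning HTTP 503)."""

def inject_faults(abort_pct=0.0, delay_pct=0.0, delay_s=0.5,
                  rng=random.random, sleep=time.sleep):
    """Decorator that aborts or delays a percentage of calls, mimicking a
    proxy-level fault filter from inside the application itself."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if rng() < abort_pct / 100.0:
                raise FaultInjected(f"injected abort in {fn.__name__}")
            if rng() < delay_pct / 100.0:
                sleep(delay_s)  # injected latency
            return fn(*args, **kwargs)
        return wrapper
    return decorator

# Abort ~5% and delay ~10% of calls to this hypothetical handler.
@inject_faults(abort_pct=5, delay_pct=10, delay_s=0.3)
def fetch_profile(user_id):
    return {"id": user_id}
```

Injecting `rng` and `sleep` keeps the behavior deterministic under test; in production you would gate the percentages behind configuration so the fault rate can be dialed to zero instantly.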
Chaos Engineering is an exciting space. As it is today, it's still very new, but it's gaining traction. When it comes up in conversation, people know of it and why we need it. From talking to engineers, Chaos Engineering is something they want to implement eventually. As distributed cloud architectures and systems become more prevalent, I think Chaos Engineering will play an important role in ensuring a system's availability and durability. The costs are too high not to. I look forward to seeing how Chaos Engineering gets adopted.
Chaos Engineering has a marketing problem. It often gets interpreted as breaking things on purpose. Unfortunately, that interpretation leaves out the safety considerations. In reality, Chaos Engineering is nothing but controlled hypothesis testing of assertions about a proposed change in the architecture of a system. You make a hypothesis about how the system would behave in a given failure mode, and then you perform a controlled test of that failure mode. There is nothing chaotic about this.
Most attempts today still seem to test small-scale failure modes, such as introducing latencies or turning off machines in pre-production environments.
I see two opportunities for Chaos Engineering:
- Help teams better understand the physics of complex failure modes. This becomes possible when you start making hypotheses about production incidents and testing them under controlled conditions. It is hard and time-consuming, yet essential to gain comfort in dealing with failures.
- Enable testing of redundancy and compartmentalization. As simple as these principles are, ever-changing dependencies make them quite easy to break.
Chaos Engineering is the practice of hypothesis testing through planned experiments to gain a better understanding of a system’s behavior.
We have today reached the point where Chaos Engineering has gone from being just a buzzword and a practice used by a few large organizations in very specific fields, to being put into use by companies of all sizes and industries. It's easy to compare this to where testing, automation, and observability were 5-10+ years ago. We've seen Chaos Engineering move from being only a conference talk topic to a widespread practice.
A lot has happened in the last couple of years regarding tools for Chaos Engineering, both on the commercial side and on the open-source side, which has made it easier to get started. Much of the focus is still on traditional infrastructure such as virtual machines and containers, so I believe there will be progress made on tools for experimenting with managed cloud services and the serverless space.
All the best software engineers know that code is just a fraction of the work you need to do to ship a robust new feature. There is documentation, monitoring, instrumentation, and testing.
There are different kinds of tests: unit, functional, end-to-end, integration, and so on.
A few years ago, DevOps introduced a new layer of complexity for sysadmins: code. Programmatically provisioning your infrastructure doesn't by itself make your system more reliable.
Today, to me, Chaos Engineering is the testing framework for reliability. It is a practice you should use to validate scenarios such as:
- What happens to my queue if I stop the workers for 5 minutes? What if the workers become very slow at handling messages?
- Let me try to bring down my MySQL master from time to time to see how writes recover.
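Questions like these can be turned into concrete, testable hypotheses before touching production. A toy queue model (all rates and windows below are illustrative numbers, not measurements from any real system) makes the first scenario's hypothesis explicit:

```python
def simulate_queue(arrival_rate, service_rate, outage_start, outage_end, total_minutes):
    """Toy model: messages arrive every minute; workers drain them except
    during an outage window. Returns the queue depth at the end of each minute."""
    depth, history = 0, []
    for minute in range(total_minutes):
        depth += arrival_rate
        if not (outage_start <= minute < outage_end):
            depth = max(0, depth - service_rate)  # workers are running
        history.append(depth)
    return history

# Hypothesis: a 5-minute worker outage creates a backlog that drains
# within 10 minutes of recovery (100 msgs/min in, 150 msgs/min capacity).
history = simulate_queue(arrival_rate=100, service_rate=150,
                         outage_start=10, outage_end=15, total_minutes=30)
```

Running the real experiment then checks whether production matches the model: if the backlog drains much more slowly than the simulation predicts, you have learned something about your actual service rate, and that is the point of the exercise.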
Being able to answer these questions is a significant step forward, but I don't think it is enough. I expect Chaos Engineering to be driven more by telemetry data from infrastructure and applications: what Charity Majors calls observability.
If we do Chaos Engineering right, we will put a controlled, but still significant amount of stress on our systems. It means that we need to learn as much as we can from every new session.
Chaos without a learning experience is just chaos. When we can validate our scenarios with data, we will truly be doing "Chaos Engineering."
As someone deeply ensconced in the Site Reliability Engineering space, my understanding of Chaos Engineering is strongly influenced by that outlook. For me, Chaos Engineering has the potential to offer a principled and accelerated way to learn things about a production environment that ordinarily would require both pain and time to reveal. This ability to uncover how applications and their substrate really behave through intentional experimentation is really attractive to me.
Given this potential, I’ve made the claim in public (for example, in my book Seeking SRE) that Chaos Engineering is one of those practices which will become “normal” for SREs in the next 3-5 years. I predict SREs will be including some practices that are clearly drawn from Chaos Engineering as part of their standard toolkit within this time period.
Chaos Engineering shines the light of reliability and resiliency on engineers' assumptions and educated guesses, exposing actual weaknesses before they are a career- or business-ending catastrophe.
Chaos Engineering should not be separate from “Engineering” or an afterthought -- it should be part of the full-stack software engineering process, and also utilize artificial intelligence to detect breakages and recommend resolutions.
Myra Haubrich, Senior SRE, Adobe Experience Platform
First, a HUGE thank you to everyone who contributed their thoughts to this discussion starter.
We at Gremlin hope this post helps nurture the continuing discussion of Chaos Engineering, builds some understanding among those new to the idea, and promotes growth in the implementation and understanding of building reliability by injecting failure.
If you are looking for a way to get started with Chaos Engineering, may we suggest you take a look at Gremlin Free?
Do you have thoughts you would like to share? Comments? Opinions? Polite and friendly rebukes? Tweet them out with a link to this article while using the #ChaosEngineering hashtag and let’s keep the ideas flowing.