When reading about Chaos Engineering, you’ll likely hear the terms “fault injection” or “failure injection.” As the name suggests, fault injection is a technique for deliberately introducing stress or failure into a system in order to see how the system responds. But what exactly does this mean, and how does this relate to Chaos Engineering? In this post, we’ll look at the history of fault injection, how it’s evolved over time, and how it contributed to Chaos Engineering as we know it today.
Fault injection originated as a technique for simulating failures at the hardware level. Engineers exposed devices to various harmful conditions and observed the devices to determine how well they continued functioning. These tests involved shorting connections between pins, creating electromagnetic interference, disrupting the power supply, and even bombarding circuits with radiation. The goal was to see how stressors like these affected the device’s operations, determine at what point the device would fail, and redesign the hardware to be more resilient.
Over time, engineers developed tools for introducing faults using other methods. Devices started including specialized debugging ports such as JTAG, which allowed for injecting controlled faults directly into circuits. With the development of software fault injection, engineers could simulate faults in their applications, test error- and exception-handling code, alter source code to inject simulated faults (known as compile-time injection), and trigger faults on actively running systems (known as runtime injection).
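To make runtime injection concrete, here is a minimal Python sketch: a hypothetical `inject_fault` decorator raises a simulated error on a configurable fraction of calls, without changing the wrapped function's source. The function and error names are invented for illustration.

```python
import random

def inject_fault(probability, error=None):
    """Decorator that raises a simulated error on a fraction of calls.

    This is runtime injection: the fault is triggered while the program
    runs, with no change to the wrapped function's source code.
    """
    error = error or ConnectionError("injected fault")
    def wrap(fn):
        def inner(*args, **kwargs):
            if random.random() < probability:
                raise error
            return fn(*args, **kwargs)
        return inner
    return wrap

@inject_fault(probability=1.0)  # always fail, to exercise the error path
def fetch_user(user_id):
    return {"id": user_id, "name": "example"}

try:
    fetch_user(42)
except ConnectionError as exc:
    print(f"caught simulated fault: {exc}")
```

In a real system the probability would be low and configurable at runtime, so faults can be dialed up gradually while the blast radius stays controlled.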
Runtime fault injection became especially popular among companies managing large, complex distributed systems. In 2011, Netflix open-sourced Chaos Monkey, a tool that randomly terminated compute instances running in their cloud infrastructure. By shutting down running systems without warning, Chaos Monkey helped Netflix validate that their workloads could tolerate sudden, unexpected failures. Netflix later introduced their Failure Injection Testing (FIT) platform in 2014, a more sophisticated solution for orchestrating failure on a larger scale and across multiple teams. These tools laid the groundwork for Chaos Engineering as we know it today.
Being able to simulate failures in systems has several benefits:
We can thoroughly test our applications and systems. Traditional software testing focuses on happy path testing, or testing the code paths that we expect our applications and systems to take. This doesn’t account for the many ways that our systems can deviate from that path, whether due to unexpected user behavior, changing environmental conditions, failures in dependencies, or other situations. We want to be sure that any resilience mechanisms we have in place—error handling code, redundant instances, auto-healing policies, etc.—will work when we expect them to, and fault injection helps us verify this.
Read more about the limitations of traditional testing in our white paper, The New QA.
We can better identify the nature and cause of production failures. When our production environment fails, we immediately go into response mode. Our top priority is to stop the problem, and only after restoring service can we work on understanding the cause. Depending on the nature or severity of the problem, it could be days or weeks before we have a clear answer. Fault injection gives us full control over when and how we inject failure, including the ability to reproduce production incidents, so that we can validate fixes and deploy more confidently.
We can prepare for the unexpected. There’s a lot that can go wrong in production, and with how complex modern distributed systems are, even small failures can cascade into large outages. Fault injection lets us test conditions that are hard to anticipate, such as cluster-wide spikes in CPU or memory usage, multiple simultaneous host failures, and regional outages. This lets us prepare by adding resiliency mechanisms, adjusting our monitoring and alerting tools, updating our runbooks, and validating our disaster recovery plans.
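To ground the first of these benefits, here is a small Python sketch: it injects a simulated timeout into a hypothetical pricing dependency (using the standard library's `unittest.mock`) and verifies that the fallback path actually fires. The function names and prices are illustrative, not from any particular system.

```python
from unittest import mock

def get_price(product_id, pricing_client):
    """Return the live price, falling back to a cached value on timeout."""
    try:
        return pricing_client.lookup(product_id)
    except TimeoutError:
        # Resilience mechanism under test: serve a stale cached price
        # rather than failing the whole request.
        return 9.99

# Fault injection: make every call to the dependency time out.
flaky_client = mock.Mock()
flaky_client.lookup.side_effect = TimeoutError("injected timeout")

print(get_price("sku-123", flaky_client))  # falls back instead of crashing
```

Without the injected fault, the `except` branch might never run in testing, and the first time the fallback is exercised would be during a real production incident.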
Chaos Engineering is the practice of injecting controlled amounts of failure into a system, observing how the system responds, then using those observations to improve its reliability. If fault injection is a technique for injecting failure, then Chaos Engineering is the strategy of using fault injection to accomplish the goal of more reliable systems.
Chaos Engineering tools like Gremlin provide ways of injecting faults, such as shutting down a host, consuming CPU or memory, simulating network outages or latency, and terminating processes. However, the real value lies in uncovering and addressing failure modes, validating your monitoring and incident response processes, and reducing your risk of production outages. Chaos Engineering takes a holistic approach to improving reliability that goes beyond testing systems, even though testing remains a core part of it. Gremlin supports this by providing recommended experiments based on your infrastructure, encouraging you to record your observations after each experiment, helping you run incident response exercises such as GameDays and FireDrills, and tracking your progress towards better reliability.
Fault injection is a means to an end. We’re deliberately injecting failure into our systems, but we’re doing so with the goal of improving their reliability. It allows us to detect failure modes we wouldn’t find through normal testing, validate our error-handling and recovery mechanisms, and prepare our teams to handle production failures. Chaos Engineering solutions like Gremlin allow you to use fault injection so that you can safely find failure modes, reduce your risk of outages, and build confidence in your systems.