Chaos Engineering

Chaos Engineering is a disciplined approach to identifying potential failures before they become outages.

What is Chaos Engineering?

Chaos Engineering is a practice that aims to help us improve our systems by teaching us new things about how they operate. It involves injecting faults into systems (such as high CPU consumption, network latency, or dependency loss), observing how our systems respond, then using that knowledge to make improvements.
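
As a rough illustration of what "injecting a fault" can look like at the application level, here is a minimal sketch of a latency fault in Python. The `inject_latency` decorator and the `fetch_profile` function are hypothetical names made up for this example; they are not part of any particular Chaos Engineering tool, and real tools typically inject faults at the host, container, or network level instead.

```python
import functools
import random
import time


def inject_latency(delay_s, probability=1.0, rng=random.random, sleep=time.sleep):
    """Delay calls by delay_s seconds with the given probability (hypothetical helper)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if rng() < probability:
                sleep(delay_s)  # the injected fault: artificial latency
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_latency(delay_s=0.05)
def fetch_profile(user_id):
    # Stand-in for a call to a real dependency.
    return {"id": user_id, "name": "example"}


start = time.monotonic()
profile = fetch_profile(42)
elapsed = time.monotonic() - start
print(f"call returned {profile} after {elapsed:.3f}s")
```

Observing how callers of `fetch_profile` behave while the delay is active (timeouts, queueing, user-facing errors) is the "observe" half of the experiment.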

To put it simply, Chaos Engineering identifies hidden problems that could arise in production. Identifying these issues beforehand lets us address systemic weaknesses, make our systems fault-tolerant, and prevent outages in production.

Chaos Engineering goes beyond traditional (failure) testing in that it's not only about verifying assumptions. It helps us explore the unpredictable things that could happen, and discover new properties of our inherently chaotic systems.

Chaos Engineering as a discipline was originally formalized by Netflix. They created Chaos Monkey, the first well-known Chaos Engineering tool, which worked by randomly terminating Amazon EC2 instances. Since then, Chaos Engineering has grown to include dozens of tools used by hundreds (if not thousands) of teams around the world.

How does Chaos Engineering work?

Chaos Engineering involves running thoughtful, planned experiments that teach us how our systems behave in the face of failure. These experiments follow three steps:

Plan an experiment.
Contain the blast radius.
Scale or squash the experiment.

You start by forming a hypothesis about how a system should behave when something goes wrong. For example, if your primary web server fails, does traffic automatically fail over to a redundant server?

Next, create an experiment to test your hypothesis. Make sure to limit the scope of your experiment (traditionally called the blast radius) to only the systems you want to test. For example, start by testing a single non-production server instead of your entire production deployment.

Finally, run and observe the experiment. Look for both successes and failures. Did your systems respond the way you expected? Did something unexpected happen? When the experiment is over, you’ll have a better understanding of your system's real-world behavior.

Why would you break things on purpose?

Chaos Engineering is often called “breaking things on purpose,” but the reality is much more nuanced than that. Think of a vaccine or a flu shot, where you inject yourself with a small amount of a potentially harmful foreign body in order to build resistance and prevent illness. Chaos Engineering is a tool we use to build such an immunity in our technical systems. We inject harm (like latency, CPU failure, or network black holes) in order to find and mitigate potential weaknesses.

These experiments also help teams build muscle memory in resolving outages, akin to a fire drill (or changing a flat tire, in the Netflix analogy). By breaking things on purpose, we surface unknown issues that could impact our systems and customers.

According to the 2021 State of Chaos Engineering report, the most common outcomes of Chaos Engineering are increased availability, lower mean time to resolution (MTTR), lower mean time to detection (MTTD), fewer bugs shipped to product, and fewer outages. Teams who frequently run Chaos Engineering experiments are more likely to have >99.9% availability.
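
To put an availability figure like 99.9% in perspective, it helps to convert it into a downtime budget. The short sketch below does that arithmetic; the function name and the 365-day period are assumptions chosen for illustration.

```python
def allowed_downtime_minutes(availability_pct, period_days=365):
    """Minutes of downtime permitted over the period at a given availability."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


for target in (99.0, 99.9, 99.99):
    budget = allowed_downtime_minutes(target)
    print(f"{target}% availability -> {budget:.1f} minutes of downtime per year")
```

At 99.9%, the budget works out to roughly 525.6 minutes (under nine hours) per year, which is why a single long outage can consume the whole allowance.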

What's the role of Chaos Engineering in distributed systems?

Distributed systems are inherently more complex than monolithic systems. It’s hard to predict all the ways they might fail. The eight fallacies of distributed computing, attributed to L. Peter Deutsch and others at Sun Microsystems, describe false assumptions that engineers new to distributed applications invariably make:

The network is reliable
Latency is zero
Bandwidth is infinite
The network is secure
Topology doesn't change
There is one administrator
Transport cost is zero
The network is homogeneous

Many of these fallacies drive the design of Chaos Engineering experiments such as “packet-loss attacks” and “latency attacks”:

Network outages can cause a range of failures for applications that severely impact customers
Applications may stall while they wait endlessly for a packet
Applications may permanently consume memory or other Linux system resources
Even after a network outage has passed, applications may fail to retry stalled operations or may retry too aggressively. Applications may even require a manual restart.

We need to test and prepare for each of these scenarios.
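
The failure modes listed above (waiting endlessly, never retrying, or retrying too aggressively) are commonly mitigated with timeouts and bounded retries that back off between attempts. The following is a minimal sketch of that pattern, not a prescribed implementation: `call_with_retries` and `flaky_operation` are hypothetical names, and the simulated failures stand in for what a packet-loss or latency experiment would cause.

```python
import random
import time


def call_with_retries(operation, attempts=4, base_delay_s=0.1, max_delay_s=2.0,
                      sleep=time.sleep, rng=random.random):
    """Run operation(), retrying on timeouts and connection errors.

    The wait between attempts grows exponentially, is capped at max_delay_s,
    and is randomized so that many clients do not retry in lockstep.
    """
    for attempt in range(attempts):
        try:
            # operation() is expected to enforce its own timeout, for example
            # socket.create_connection((host, port), timeout=2.0).
            return operation()
        except (TimeoutError, ConnectionError):
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure instead of hanging
            cap = min(max_delay_s, base_delay_s * 2 ** attempt)
            sleep(cap * rng())  # random wait in [0, cap)


# Simulated dependency: fails twice, then recovers.
outcomes = iter([TimeoutError("simulated packet loss"),
                 ConnectionError("simulated connection reset"),
                 "ok"])


def flaky_operation():
    result = next(outcomes)
    if isinstance(result, Exception):
        raise result
    return result


waits = []  # record the waits instead of actually sleeping
print(call_with_retries(flaky_operation, sleep=waits.append))
print(f"retried {len(waits)} times")
```

A network experiment is a good way to check that limits like these actually hold in a running system rather than only on paper.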

Chaos Engineering vs. performance engineering

Regardless of whether your applications live on-premises, in a cloud environment, or somewhere in between in a hybrid state, you’re likely familiar with the struggles and complexities of scaling environments and applications. All engineers eventually must ask themselves: “Can my application and environment scale? And if we attract the users the business expects, will everything work as designed?”

For decades, enterprises used Performance Engineering to put their systems to the test in preparation for increased demand. Solutions like Micro Focus LoadRunner Professional and open source offerings like JMeter have helped engineers ensure proper performance and scaling of their systems to meet customer (and business) expectations. It’s the engineering team’s responsibility to validate that a system can handle an influx of users for peak events such as Cyber Monday or a big promotional sale.

But we often test performance in a stable environment. These tests usually run under ideal conditions that differ from real-world conditions: there are no service issues, regional outages, or any of the thousands of other complexities found in on-premises or, more complicated still, cloud-native environments.

Simply put, scaling is incomplete unless it is coupled with resilience. It won’t mean much if your systems can scale but are offline. The important question then becomes: “I know my application can handle 50k users, but can it handle those 50k users amid a critical infrastructure outage or the outage of a dependent service?”

Let’s use a simple analogy: building the world’s tallest building, the Burj Khalifa in Dubai, which stands at a staggering 2,717 ft. We could equate performance engineering with the ability to make this the tallest building in the world. But a tall building that nobody can access or that falls over in high winds isn't very impressive.

Reliability and resiliency are just as important as performance. Look at how many other features engineers built into the tower to account for earthquakes, high winds, and failures in other portions of the building.

Chaos Engineering and Performance Engineering are different. However, they’re complementary rather than exclusionary. Companies who adopt both not only have the ability to scale but scale in a way that keeps resiliency top of mind.

This dual approach lets engineers reassure the business while providing a great customer experience. The benefits include a reduction in incidents, higher availability numbers, and more robust, scalable systems.

Metrics for Chaos Engineering

Prior to your first Chaos Engineering experiments, it’s important to collect a specific set of metrics. These metrics include infrastructure monitoring metrics, alerting/on-call metrics, high severity incident (SEV) metrics, and application metrics.

Why is it important to collect baseline metrics for Chaos Engineering?

If you don’t collect metrics prior to practicing your Chaos Engineering experiments, you can’t measure whether they’ve succeeded. You also can’t define your success metrics and set goals for your teams.

When you do collect baseline metrics, you can answer questions such as:

What are the top 5 services with the highest counts of alerts?
What are the top 5 services with the highest counts of incidents?
What is an appropriate goal to set for a reduction in incidents in the next 3 months? 2x? 10x?
What are the top 3 main causes of a CPU spike?
What are typical downstream/upstream effects when the CPU is spiking?
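
Questions like "which services generate the most alerts?" can usually be answered from an export of your alerting tool's data. The sketch below assumes a hypothetical list of alert records with a `service` field; the real field names depend on the tool you use.

```python
from collections import Counter

# Hypothetical alert export: one record per alert fired.
alerts = [
    {"service": "checkout", "name": "HighLatency"},
    {"service": "search", "name": "ErrorRate"},
    {"service": "checkout", "name": "HighCPU"},
    {"service": "auth", "name": "HighLatency"},
    {"service": "checkout", "name": "HighLatency"},
    {"service": "search", "name": "HighCPU"},
]


def top_services(records, n=5):
    """Return the n services with the highest alert counts, highest first."""
    counts = Counter(record["service"] for record in records)
    return counts.most_common(n)


for service, count in top_services(alerts):
    print(f"{service}: {count} alerts")
```

Running the same query before and after a series of experiments gives you a simple way to see whether alert volume is trending in the right direction.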

How do you collect baseline metrics for Chaos Engineering?

To ensure you have collected the most useful metrics for your Chaos Engineering experiments, you need to cover the following:

Infrastructure monitoring metrics

Infrastructure monitoring software will enable you to measure trends, disks filling up, CPU spikes, I/O spikes, redundancy, and replication lag. You can collect appropriate monitoring metrics using monitoring software such as Datadog, New Relic, and SignalFx.

You should aim to collect the following infrastructure metrics:

Resource: CPU, IO, Disk & Memory
State: Shutdown, Processes, Clock Time
Network: DNS, Latency, Packet Loss

Collecting these metrics involves two simple steps:

Roll out your chosen monitoring agent across your fleet (e.g., inside a Docker container)
Use the out-of-the-box monitoring metrics and dashboards provided by your software

Alerting and on-call metrics

Alerting and on-call software will enable you to measure total alert counts by service per week, time to resolution for alerts per service, noisy alerts by service per week (self-resolving) and the top 20 most frequent alerts per week for each service.

These metrics include:

Total alert counts by service per week
Time to resolution for alerts per service
Noisy alerts by service per week (self-resolving)
Top 20 most frequent alerts per week for each service

Software applications that collect alerting and on-call metrics include PagerDuty, VictorOps, OpsGenie and Etsy OpsWeekly.

High Severity Incident (SEV) metrics

Establishing a High Severity Incident Management (SEV) program will enable you to create SEV levels (e.g. 0, 1, 2 and 3), measure the total count of incidents per week by SEV level, measure the total count of SEVs per week by service and the MTTD, MTTR and MTBF for SEVs by service.

Important SEV metrics include:

Total count of incidents per week by SEV level
Total count of SEVs per week by service
MTTD, MTTR and MTBF for SEVs by service
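
As a worked example, the sketch below computes MTTD, MTTR, and MTBF for one service from a hypothetical list of incident records. Definitions vary between organizations, so the choices here are assumptions: detection and resolution times are both measured from the start of the incident, and time between failures is measured from one incident's resolution to the next incident's start.

```python
from datetime import datetime
from statistics import mean

# Hypothetical SEV records for a single service, sorted by start time.
incidents = [
    {"started": datetime(2024, 1, 1, 10, 0),
     "detected": datetime(2024, 1, 1, 10, 5),
     "resolved": datetime(2024, 1, 1, 10, 45)},
    {"started": datetime(2024, 1, 8, 14, 0),
     "detected": datetime(2024, 1, 8, 14, 15),
     "resolved": datetime(2024, 1, 8, 15, 15)},
    {"started": datetime(2024, 1, 20, 9, 0),
     "detected": datetime(2024, 1, 20, 9, 10),
     "resolved": datetime(2024, 1, 20, 9, 30)},
]


def minutes(delta):
    return delta.total_seconds() / 60


def sev_metrics(records):
    """Return MTTD, MTTR, and MTBF in minutes for time-ordered incident records."""
    mttd = mean(minutes(r["detected"] - r["started"]) for r in records)
    mttr = mean(minutes(r["resolved"] - r["started"]) for r in records)
    # Uptime between the end of one incident and the start of the next.
    gaps = [minutes(nxt["started"] - prev["resolved"])
            for prev, nxt in zip(records, records[1:])]
    mtbf = mean(gaps) if gaps else None
    return {"mttd_min": mttd, "mttr_min": mttr, "mtbf_min": mtbf}


print(sev_metrics(incidents))
```

Whatever definitions you pick, keep them fixed between the baseline and later measurements so the numbers stay comparable.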

Application metrics

Observability metrics will enable you to monitor your applications. This is particularly important when you are practicing application-level failure injection. Software that collects application metrics includes Sentry and Honeycomb.

Collecting these metrics involves two steps:

Out of the box SDKs will attempt to hook themselves into your runtime environment or framework to automatically report fatal errors. However, in many situations, it’s useful to manually report errors or messages.


Chaos Engineering use cases

Demonstrating regulatory compliance

Technology organizations in regulated industries have strict, often complex requirements for availability, data integrity, etc. Chaos Engineering helps ensure that your systems are fault tolerant by letting you test key compliance aspects, such as disaster recovery plans and automatic failover systems.

Use Chaos Engineering to:

Verify your recovery time objective (RTO) and recovery point objective (RPO).
Test your automated incident mitigation schemes, like redundant instances, database failover, and data recovery.
Confirm that your data preservation methods work as intended.
Ensure your monitoring tools properly send alerts when necessary.
Test system performance under heavy load, including high CPU usage, high network latency or packet loss, and unusual disk I/O usage.
Demonstrate that your system can successfully handle a Distributed Denial of Service (DDoS) or similar cyber attack.
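
Verifying RTO and RPO comes down to comparing timestamps recorded during an experiment against your targets. The sketch below shows one way to express that check; the function name, parameters, and example times are hypothetical, and it assumes the data loss window is the time between the last good backup (or replicated write) and the failure.

```python
from datetime import datetime, timedelta


def check_recovery_objectives(failure_at, restored_at, last_good_backup_at, rto, rpo):
    """Compare a measured recovery against RTO and RPO targets."""
    recovery_time = restored_at - failure_at
    data_loss_window = failure_at - last_good_backup_at
    return {
        "recovery_time": recovery_time,
        "data_loss_window": data_loss_window,
        "rto_met": recovery_time <= rto,
        "rpo_met": data_loss_window <= rpo,
    }


# Example: a simulated database failure during a failover experiment.
report = check_recovery_objectives(
    failure_at=datetime(2024, 3, 1, 12, 0),
    restored_at=datetime(2024, 3, 1, 12, 20),
    last_good_backup_at=datetime(2024, 3, 1, 11, 50),
    rto=timedelta(minutes=30),
    rpo=timedelta(minutes=5),
)
print(report)
```

In this example the service recovers within the 30-minute RTO, but ten minutes of data would be lost against a five-minute RPO, which is exactly the kind of gap an experiment is meant to surface before an auditor or a real incident does.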

Learn more in our article: Using Chaos Engineering to demonstrate regulatory compliance.

Maximizing resilience

Tuning today’s complex applications is becoming increasingly challenging, even for experienced performance engineers. This is due to the huge number of tunable parameters at each layer of the environment. Adding to this complexity, these systems often interact in counterintuitive ways. Likewise, they may behave under special workloads or circumstances in such a way that vendor defaults and best practices become ineffective, or worse, negatively impact resilience.

Chaos Engineering uncovers unexpected problems in these complex systems, verifies that fallback and failover mechanisms work as expected, and teaches engineers how to best maximize resilience to failure.

Site reliability

One of the main use cases for Chaos Engineering is ensuring that your technology systems and environment can withstand turbulent or unfavorable conditions. Failures in critical systems like load balancers, API gateways, and application servers can lead to degraded performance and outages. Running Chaos Engineering experiments validates that your systems and infrastructure are reliable so that developers can feel confident deploying workloads onto them.

Disaster recovery

Organizations with high availability requirements will often create a disaster recovery (DR) plan. Disaster recovery is the practice of restoring an organization’s IT operations after a disruptive event, like an earthquake, flood, or fire. Organizations develop formal procedures for responding to events (disaster recovery plans, or DRPs) and practice those plans so engineers can respond quickly in case of an actual emergency.

Chaos Engineering lets teams simulate disaster-like conditions so they can test their plans and processes. This helps teams gain valuable training, build confidence in the plan, and ensure that real-world disasters are responded to quickly, efficiently, and safely.

Benefits of Chaos Engineering

Business benefits

Chaos Engineering helps businesses reduce their risk of incidents and outages. Outages can result in lost revenue due to customers not being able to use the service. Chaos Engineering also gives businesses a competitive advantage by making availability a key differentiator.

In highly regulated industries like financial services, government, and healthcare, poor reliability can lead to heavy fines. Chaos Engineering helps avoid these fines, as well as the high-profile stories that usually accompany them. Chaos Engineering also helps accelerate other practices designed to identify failure modes, such as failure mode and effects analysis (FMEA). The result is a more competitive, more reliable, less risky business.

Engineering benefits

Engineers benefit from the technical insights that Chaos Engineering provides. These can lead to reductions in incidents, reduced on-call burden, better understanding of system design and system failure modes, faster mean time to detection (MTTD), and a reduction in high severity (SEV-1) incidents.

Engineers gain confidence in their systems by learning how they can fail and what mechanisms are in place to prevent them from failing. Engineering teams can also use Chaos Engineering to simulate failures and react to those failures as though they were real production incidents (these are called GameDays). This lets teams practice and improve their incident response processes, runbooks, and mean time to recovery (MTTR).

It’s true that Chaos Engineering is another practice for engineers to adopt and learn, which can create resistance. Engineers often need to build a business case for why teams should adopt it. But the results benefit the entire organization, and especially the engineers working on more reliable systems.

Customer benefits

All of these improvements to reliability ultimately benefit the customer. Outages are less likely to disrupt customers’ day-to-day lives, which makes them more likely to trust and use the service. Customers benefit from increased reliability, durability, and availability.

Containerization and Chaos Engineering

Chaos Engineering practices apply to all platforms and cloud providers. At Gremlin, we most often see teams apply Chaos Engineering to AWS, Microsoft Azure, and Kubernetes workloads.

AWS

In 2020, AWS added Chaos Engineering to the reliability pillar of the Well-Architected Framework (WAF). This shows how important Chaos Engineering is to cloud reliability. Chaos Engineering helps ensure resilient AWS deployments and continuously validates your implementation of the WAF.

Microsoft

Microsoft Azure is the second largest cloud provider after AWS, and Windows Server remains widely used in enterprise environments. Chaos Engineering ensures these systems’ reliability by testing for risks unique to Windows-based environments, such as Windows Server Failover Clustering (WSFC), SQL Server Always On availability groups (AG), and Microsoft Exchange Server back pressure. It also ensures that your Azure workloads are resilient.

Kubernetes

Kubernetes is one of the most popular software deployment platforms. But it has a lot of moving parts. For unprepared teams, this complexity can result in unexpected behaviors, application crashes, and cluster failures. Teams using, adopting, or planning a Kubernetes migration can use Chaos Engineering to ensure they’re ready for whatever risks a production Kubernetes deployment can throw at them.

If your team is new to Kubernetes, read why if you’re adopting Kubernetes, you need Chaos Engineering. Or, if your team is already on Kubernetes, learn how to run your first 5 Chaos Engineering experiments on Kubernetes.

Getting started with Chaos Engineering

When you’re ready to take the next step into adopting Chaos Engineering, there’s a process that will maximize your benefits.

Much like QA testing or performance testing, Chaos Engineering requires you to make an assumption about how your systems work (a hypothesis). From there, you can construct a testing scenario (called an experiment), run it, and observe the outcomes to determine whether your hypothesis was accurate.

When mapped out, the process looks like this:

Consider the potential failure points in your environment.
Create a hypothesis about a potential failure scenario.
Identify the smallest set of systems you can test to confirm your hypothesis (i.e. your blast radius).
Run a Chaos Engineering experiment on those systems.
Observe the results and form a conclusion.

If the experiment reveals a failure mode, address it and re-run the experiment to confirm your fix. If not, consider scaling up the experiment to a larger blast radius to make sure your systems can withstand a larger scale impact.

If something unexpected happened—like a failure in a seemingly unrelated system—reduce the scope of your experiment to avoid unintended consequences. Repeat this process by coming up with other hypotheses and testing other systems.
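
The process above can be sketched as a small loop: run the experiment on a small blast radius, check the steady state, always roll back, and only scale up if the hypothesis held. This is a simplified illustration under stated assumptions; the `Experiment` class, its callbacks, and the host names are hypothetical and do not correspond to any specific tool's API.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Experiment:
    hypothesis: str
    inject_fault: Callable[[Sequence[str]], None]  # start the fault on the targets
    rollback: Callable[[Sequence[str]], None]      # undo the fault on the targets
    steady_state_ok: Callable[[], bool]            # does the hypothesis still hold?


def run_with_growing_blast_radius(experiment, hosts, stages=(1, 2, 4)):
    """Run the experiment on progressively larger host sets; stop at the first failure."""
    for size in stages:
        targets = list(hosts[:size])
        experiment.inject_fault(targets)
        try:
            healthy = experiment.steady_state_ok()
        finally:
            experiment.rollback(targets)  # always undo the fault, even on errors
        if not healthy:
            return {"passed": False, "failed_at": len(targets)}
    return {"passed": True, "failed_at": None}


# Toy system for demonstration: it stays healthy while at most 2 hosts are down.
down = set()
experiment = Experiment(
    hypothesis="The service stays healthy while at most 2 of 5 hosts are down",
    inject_fault=lambda targets: down.update(targets),
    rollback=lambda targets: down.difference_update(targets),
    steady_state_ok=lambda: len(down) <= 2,
)
hosts = ["web-1", "web-2", "web-3", "web-4", "web-5"]
result = run_with_growing_blast_radius(experiment, hosts)
print(result)
```

Stopping at the first failed stage keeps the impact contained: you learn where the limit is without pushing the fault onto more systems than necessary.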

As you go, you’ll get better at running Chaos Engineering experiments. Your systems will become more reliable.

Of course, it’s not enough to hand engineers a new tool and a quick-start guide to using it. Introducing any new tool or practice is difficult on its own, not to mention a relatively unfamiliar practice like Chaos Engineering. It’s important for engineers to know the how and the why behind Chaos Engineering so they can be most effective at using it.

You may also need to create incentives to encourage engineers to integrate Chaos Engineering into their day-to-day workflows, respond to pushback, and encourage other teams in the organization to also adopt Chaos Engineering practices. To learn more, we created a comprehensive guide on how to train your engineers in Chaos Engineering.


Industry applications for Chaos Engineering

Financial services

For financial services, Chaos Engineering helps:

Improve reliability while reducing IT costs. Engineers in finance companies face unique challenges such as large, distributed teams; separation of ownership; and technology platforms managed by third parties. Chaos Engineering helps teams verify the reliability of their own systems and their ability to withstand failed dependencies.
Improve the customer experience. Modern financial companies rely on technology, whether it’s mobile banking, ATMs, peer-to-peer (P2P) payments, or customer service support. They also face increasing competition from fintechs, who aren’t restrained by legacy applications or mainframe systems. Chaos Engineering helps mitigate the risk of system failures, allowing engineers to increase their velocity of changes to better serve customers and keep pace with fintechs.
Proactively test systems for compliance. Financial institutions face enormous pressure from regulatory agencies and financial markets for service availability. Chaos Engineering gives engineers insight into how systems behave overall, not just how they fail. Doing this early in development reduces the risk of high-profile IT failures, which can result in heavy fines and lost customer goodwill.

Tech business management

Tech Business Management (TBM) is a collaborative decision-making framework meant to unite finance, business, and IT in business decisions. It defines ten core tenets for organizations to implement, with the intention of helping the organization drive IT decisions based on overall business needs like reliability and customer satisfaction. The overall goals of TBM are to communicate the value of IT spending to business leaders, and reduce IT costs without sacrificing vital services.

Chaos Engineering supports TBM goals in several ways:

Improve customer satisfaction and retention by improving service reliability, performance, and recoverability.
Find and resolve problems faster by running training events like GameDays and proactively simulating disaster recovery scenarios.

Case Studies

Ensuring reliability of customer data platforms in AWS

Charter makes data-driven decisions to perfect the customer experience. Gremlin keeps their data platforms reliable.

Tutorials

Visualize Chaos Experiments in Grafana with Gremlin webhooks

Use Gremlin webhooks to send events to Grafana for displaying annotations and alerts, and correlate Chaos Experiments with impact

How to simulate missing and failed dependencies using Gremlin

Tired of unreliable dependencies taking your applications offline? Learn how to build and test dependency-resilient services using Gremlin.

How to simulate a zone/region evacuation using Gremlin

Learn how to recreate large-scale availability zone and region outages using Gremlin, the reliability management platform for enterprises.

How to run a Chaos Engineering experiment on AWS Lambda using Java and Failure Flags

Learn how to improve the resiliency of your Java applications running on AWS Lambda using Gremlin Failure Flags.

How to run a Chaos Engineering experiment on AWS Lambda using Python and Failure Flags

Failure Flags lets you test and improve the reliability of your applications, without requiring agents or system-level access. Learn how it works in this tutorial.

How to run an experiment on AWS Lambda using Failure Flags and Node.js

Learn how to test the reliability of your Node.js AWS Lambda functions using Gremlin's Failure Flags feature.

How to run multiple experiments in parallel using Gremlin

Learn how to run parallel Chaos Engineering experiments using Gremlin. Follow this step-by-step tutorial to run multiple scenarios simultaneously for better system resilience.

How to use your Gremlin reliability score in Jenkins to ensure reliable releases

How do you prevent unreliable code from reaching production? By using a proactive reliability score from Gremlin in your CI/CD platform. Learn how to create a reliability gate in Jenkins using Gremlin.

How to create a custom Test Suite

Learn how to build your own suite of reliability tests and Chaos Engineering experiments in Gremlin. Run multiple Scenarios as a single test harness and automate your testing processes.

Chaos Engineering: the history, principles, and practice

Learn the history behind Chaos Engineering, how to apply it, and how it helps improve reliability.

How to install Gremlin on ECS

Learn how to install and use Gremlin on Amazon Elastic Container Service (ECS).

Philip Gebhardt, Software Engineer

How to use Detected Risks to quickly find reliability weaknesses

Learn how to uncover reliability risks in your systems in just a few minutes with Gremlin.