Comparing Chaos Engineering tools
Originally published October 7, 2020.
In this article, we take an in-depth look at some of the most popular open-source and commercial Chaos Engineering tools in the community.
This isn't meant to be a direct comparison between other tools and Gremlin. Rather, it's an objective look at each tool's features, ease of use, system/platform support, and extensibility.
Our team at Gremlin has decades of combined experience implementing Chaos Engineering at companies like Netflix and Amazon. We understand how to apply Chaos Engineering to large-scale systems, and which features engineers will most likely want in a Chaos Engineering solution. We'll also provide a table to show how these tools compare.
Why use Chaos Engineering tools?
The goal of using Chaos Engineering tools isn't to deliberately cause your systems to fail. Rather, it's to help you find the reliability risks in your systems that could eventually lead to incidents and outages. The only way to truly understand the resiliency of your systems is to test them, and each of these tools gives you several ways of running those tests. For example, you can run a Chaos Engineering experiment to determine whether you have redundancy set up correctly.
By using Chaos Engineering tools to find risks before they impact users, teams:
- Spend less time firefighting and more time doing high-impact work.
- Move away from manual and ad-hoc processes to more automated processes.
- Shift from making assumptions about reliability to having objective, measured results.
Litmus / Harness Chaos Engineering
Release year: 2018
License: Open source (with a managed option)
Litmus started as a testing tool for OpenEBS and has since grown into one of the largest open-source Kubernetes-native Chaos Engineering tools. It provides a library of faults for testing containers, hosts, and platforms such as Amazon EC2, Apache Kafka, and Azure. It includes a native web interface called ChaosCenter, provides a public repository of experiments called ChaosHub, and is easy to install using the official Helm Chart. While Litmus itself is open-source, Harness owns the project. Harness also provides a fully managed SaaS version of Litmus called Harness Chaos Engineering.
One of Litmus' key features is Litmus Probes, a health-checking mechanism that monitors the health of your application before, during, and after an experiment. Probes can run shell commands, send HTTP requests, or run Kubernetes commands to check the state of your environment before running an experiment. This is useful for automating error detection and halting an experiment if your systems become unhealthy. However, experiments require a lot of setup. You'll need to:
- Make sure your Pods and nodes are accurately tagged before using Litmus. While you can select Pods or nodes by their name, it's easier to select them by label.
- Specify which Litmus Agent to run the experiment on. This could get confusing if you have many agents running on many hosts or clusters.
- Create your experiments. Litmus supports many faults, including custom faults, but no pre-built or recommended experiments. For example, you can't test for specific scenarios like scalability or redundancy without creating and configuring them yourself.
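As an illustration of that setup, here's roughly what a pod-delete experiment with an HTTP probe looks like as a ChaosEngine resource. This is a hypothetical sketch: the app labels, service account, and probe URL are placeholders, and field names should be checked against the LitmusChaos docs for your version.

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos
  namespace: default
spec:
  engineState: active
  # Target workloads by label, as recommended above
  appinfo:
    appns: default
    applabel: app=nginx
    appkind: deployment
  chaosServiceAccount: pod-delete-sa
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"
        # Litmus Probe: continuously verify the app still responds,
        # and halt the experiment if it doesn't
        probe:
          - name: check-frontend
            type: httpProbe
            mode: Continuous
            httpProbe/inputs:
              url: http://nginx.default.svc.cluster.local:80
              method:
                get:
                  criteria: ==
                  responseCode: "200"
            runProperties:
              probeTimeout: 5s
              interval: 2s
```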
Should I use Litmus?
Litmus is a comprehensive tool that, unfortunately, comes with a steep learning curve. The ChaosCenter web interface makes it easy to run experiments but doesn't provide much guidance. Teams coming into Litmus need to already know what experiments to run, what to test for, and how to interpret the results, which is challenging for teams new to Chaos Engineering.
Pros:
- Web UI that lets you review experiments and their history.
- A large number of experiment types, with more available on ChaosHub.
- Native integration with observability tools for monitoring system health during experiments.
Cons:
- Running experiments is a complex process: you need to know what experiments you want to run, how to run them, and which agents to run them on.
- Reliability scoring requires manual input and only applies to workflows, not applications or services as a whole.
- Little to no guidance about what experiments to start with or what to do if they fail.
AWS Fault Injection Simulator and AWS Resilience Hub
Release year: 2021
Creator: Amazon Web Services
AWS Fault Injection Simulator (FIS) lets you introduce faults into AWS services, including Amazon RDS, Amazon EC2, and Amazon EKS. AWS Resilience Hub evaluates your AWS environment, compares it to your reliability policies, and provides improvement recommendations.
Unlike other tools, FIS can inject failure into AWS services through the AWS control plane. This allows for unique faults that are difficult for other tools to replicate, such as failing over a managed database cluster or throttling API requests. FIS can also call SSM to run custom commands on hosts, such as using stress-ng to consume resources or tc to manage network traffic. Actions can be performed sequentially or in parallel and target any number of resources. If you're using SSM, you'll also need to write an SSM document with the commands you want to run on each target. You can also define stop conditions using CloudWatch alarms, which automatically stop faults when triggered.
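As a sketch, an FIS experiment template that stops one tagged EC2 instance and halts when a CloudWatch alarm fires might look like the following. The account IDs, role, and alarm ARNs are placeholders; see the AWS FIS documentation for the full schema.

```json
{
  "description": "Stop one instance tagged chaos=allowed",
  "roleArn": "arn:aws:iam::123456789012:role/fis-experiment-role",
  "targets": {
    "chaos-instances": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": { "chaos": "allowed" },
      "selectionMode": "COUNT(1)"
    }
  },
  "actions": {
    "stop-instance": {
      "actionId": "aws:ec2:stop-instances",
      "targets": { "Instances": "chaos-instances" }
    }
  },
  "stopConditions": [
    {
      "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:high-error-rate"
    }
  ]
}
```

You would typically register a template like this with `aws fis create-experiment-template --cli-input-json` and then kick it off with `aws fis start-experiment`.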
While FIS and Resilience Hub are separate tools, they work best together. You can use Resilience Hub to find points of failure in your applications, and after you address them, use FIS to verify your fixes.
Should I use AWS FIS or AWS Resilience Hub?
Since FIS only supports a limited number of AWS services and has a limited number of attacks, whether you use FIS or Resilience Hub will depend on which services you use. Even so, running an attack in FIS can be difficult: it requires IAM roles, targeting specific AWS resource IDs, and possibly creating SSM Documents. And while the cost of an attack is only $0.10 per minute per action, this can quickly add up as you run more complex experiments. Even if you use Resilience Hub to find weaknesses, you'll likely need another fault injection tool to fill the gaps left by FIS.
Pros:
- Tight integration with the AWS platform.
- Access to the AWS backend for running specialized experiments, like API attacks.
Cons:
- Limited selection of test types.
- Limited to AWS only. No multi or hybrid cloud support.
- Both services have separate pricing, which can add up as you increase testing.
Azure Chaos Studio
Release year: 2021 (preview)
Azure Chaos Studio is a Chaos Engineering solution for running faults directly on the Azure API. It supports faults on Azure Compute instances, CosmosDB, and Azure Cache for Redis. It also supports Kubernetes via integration with Chaos Mesh.
Defining an experiment consists of running one or more faults in sequence or parallel. Faults are either service-direct (i.e., run directly on an Azure resource) or agent-based (i.e., run inside a virtual machine). You can create and manage experiments using the Azure portal or the Chaos Studio REST API. While Chaos Studio has strong controls to prevent experiments from accidentally being run on the wrong systems, this can make it harder to start. Agent-based experiments require even more setup since each target host needs stress-ng pre-installed. This can mean significant upfront time, effort, and automation.
Should I use Azure Chaos Studio?
Chaos Studio's biggest strength is running faults directly on Azure infrastructure. However, it's a new service that's still only a public preview, and although it has exclusive access to the Azure API, you can replicate most of its faults using other tools.
Pros:
- Native to Azure and integrated into the Azure console.
- Supports agentless faults.
Cons:
- Difficult to get started.
- Doesn't provide recommended experiments or templates.
- Still in a preview state with no exact release date (as of this writing).
Steadybit
Platforms: Docker, Kubernetes, Linux hosts
Release year: 2018
Steadybit is a commercial Chaos Engineering tool that aims to build remediation into its experiments.
A key part of Steadybit is resilience policies, which are declarative rules that Steadybit evaluates your systems against during an experiment. For example, if your resilience policy requires a service to be redundant and a host experiment causes the service to stop responding, Steadybit will flag your system as non-compliant. This way, you can identify vulnerabilities that specifically impact your policies.
Steadybit also provides several automatic safety mechanisms. It integrates with monitoring and observability tools to monitor your systems and halt experiments if they become unhealthy. It also comes with built-in tests called checks that assess the health of your system before running a test. For example, one check tests whether the number of healthy pods in a Kubernetes deployment matches the target number, and if it doesn't, Steadybit prevents the experiment from running.
Should I use Steadybit?
Chaos Engineering can be a challenging practice to adopt, and Steadybit does a lot to make it more accessible. Unfortunately, it has many of the same challenges as other Chaos Engineering tools:
- You're expected to know what you're testing for when you start. Steadybit provides a guided wizard, but you still need to create a hypothesis, choose the type of fault, and select your target(s).
- Steadybit doesn't explicitly require monitoring or observability, but it's critical for monitoring system health and determining whether an experiment was successful.
- There isn't an obvious way to see your progress in your reliability journey (e.g., a reliability score).
Pros:
- Straightforward UI and comprehensive set of faults.
- Resilience policies let you build reliability targets directly into your experiments.
Cons:
- Unclear pricing structure.
- No clear guidance on how or where to start.
Chaos Monkey
Release year: 2012
No Chaos Engineering list is complete without Chaos Monkey. It was one of the first Chaos Engineering tools and kickstarted the adoption of Chaos Engineering outside of large companies. From it, Netflix built out an entire suite of failure injection tools called the Simian Army, although many of them have since been retired or rolled into other tools like Swabbie.
Chaos Monkey is unpredictable by design. It only has one attack type: terminating virtual machine instances randomly during a time window. This lets you replicate unpredictable production incidents, but it can easily cause more harm than good if you're unprepared. While you can configure Chaos Monkey to check for ongoing outages before it runs, this involves writing custom Go code.
Should I use Chaos Monkey?
Chaos Monkey is historically significant, but its limited number of attacks, lengthy deployment process, Spinnaker requirement, and random approach to failure injection make it less practical than other tools.
Pros:
- A well-known tool with an extensive development history.
- Creates a mindset of preparing for disasters at any time.
Cons:
- No longer developed or maintained.
- Requires Spinnaker and MySQL.
- Only one experiment type (shutdown).
- Limited control over blast radius and execution. Attacks are entirely randomized.
- No recovery or rollback mechanism. Any fault tolerance or outage detection requires you to write code.
ChaosBlade
Platforms: Docker, Kubernetes, bare-metal, cloud platforms
Release year: 2019
ChaosBlade is a CNCF sandbox project built on nearly ten years of failure testing at Alibaba. It supports many platforms, including Kubernetes, cloud platforms, and bare-metal, and provides dozens of attacks, including packet loss, process killing, and resource consumption. It also supports application-level fault injection for Java, C++, and Node.js applications, which lets you perform arbitrary code injection, delayed code execution, and modifying memory values.
ChaosBlade is modular by design. The core tool is more of an experiment orchestrator, while separate implementation projects perform the actual attacks. For example, the chaosblade-exec-os project provides host attacks, and the chaosblade-exec-cplus project provides C++ attacks. Alternatively, you can download ChaosBlade-box, which bundles host attacks, Docker attacks, JVM attacks, C++ attacks, and Litmus integration. Unfortunately for English speakers, ChaosBlade's documentation is primarily written in Chinese. English translations were added after users requested them, but some sections remain untranslated.
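As a rough sketch, host-level faults are created and cleaned up with the blade CLI. The exact subcommands and flags vary between ChaosBlade releases, so treat these as assumptions to verify against the docs:

```shell
# Consume CPU on the local host (prints an experiment UID on success)
blade create cpu load --cpu-percent 60

# Add 3 seconds of network latency on eth0 (hypothetical interface name)
blade create network delay --time 3000 --interface eth0

# Check on, then clean up, an experiment by its UID
blade status <experiment-uid>
blade destroy <experiment-uid>
```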
Should I use ChaosBlade?
ChaosBlade is a versatile tool supporting a wide range of experiment types and target platforms. However, it lacks useful features such as centralized reporting, experiment scheduling, target randomization, and health checks.
Pros:
- Supports a large number of targets and experiment types.
- Application-level fault injection for Java, C++, and Node.js.
- Multiple ways of managing experiments, including CLI commands, Kubernetes manifests, and REST API calls.
Cons:
- Incomplete English language documentation.
- Lacks scheduling, safety, and reporting capabilities.
Chaos Mesh
Release year: 2020
License: Open source
Chaos Mesh supports 17 unique attacks, including resource consumption, network latency, packet loss, bandwidth restriction, disk I/O latency, system time manipulation, and even kernel panics. Since this is a Kubernetes tool, you can fine-tune your blast radius using Kubernetes labels and selectors. Chaos Mesh also supports node-level attacks using an add-on tool called chaosd.
Chaos Mesh is also one of the few open-source tools to include a fully-featured web user interface (UI) called the Chaos Dashboard. In addition to creating new experiments, you can use the Dashboard to manage running experiments and view a timeline of executions. Chaos Mesh also integrates with Grafana, so you can view your executions alongside your cluster's metrics to see the direct impact.
You can run experiments immediately or schedule them, although scheduling involves writing additional YAML. And while you can set a duration for faults, Chaos Mesh runs them indefinitely by default.
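For example, a NetworkChaos manifest that injects 100 ms of latency into Pods labeled app=web might look like this (a sketch against the chaos-mesh.org/v1alpha1 API; the namespace and labels are placeholders). Note the explicit duration field, which you'll want to set given the indefinite default:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: web-latency
  namespace: chaos-testing
spec:
  action: delay
  mode: all
  # Fine-tune the blast radius with namespaces and label selectors
  selector:
    namespaces:
      - default
    labelSelectors:
      app: web
  delay:
    latency: 100ms
  # Without this, the fault runs until the object is deleted
  duration: 30s
```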
Should I use Chaos Mesh?
Chaos Mesh offers a good amount of flexibility for Chaos Engineering on Kubernetes, encourages automating experiments via CI/CD, and is used by Azure Chaos Studio to inject Kubernetes faults. However, its most significant limitations are its lack of easy scheduling and safe defaults.
Pros:
- Comprehensive web UI with the ability to pause and resume experiments at any time.
Cons:
- Ad-hoc experiments run indefinitely. The only way to set a duration is by scheduling or manually terminating the experiment.
- The Dashboard is a security risk. Anyone with access can run cluster-wide chaos experiments.
Chaos Toolkit
Platforms: Docker, Kubernetes, bare-metal, cloud platforms
Release year: 2018
License: Open source
Chaos Toolkit will be familiar to anyone who's used an infrastructure automation tool like Ansible. Instead of making you select from predefined experiments, Chaos Toolkit lets you define your own.
Each experiment consists of Actions and Probes. Actions execute commands on the target system, and Probes compare executed commands against an expected value. For example, we might create an Action that calls stress-ng to consume CPU on a web server, then use a Probe to check whether the website responds within a certain amount of time. Chaos Toolkit also provides drivers for interacting with different platforms and services. For example, you can use the AWS driver to experiment with AWS services or manage other Chaos Engineering tools like Toxiproxy and Istio. Chaos Toolkit can also auto-discover services in your environment and recommend tailored experiments.
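The stress-ng example above might be written roughly like this in the Chaos Toolkit experiment format. This is a hypothetical sketch: the URL, host, and stress-ng arguments are placeholders.

```json
{
  "version": "1.0.0",
  "title": "Web server tolerates CPU pressure",
  "description": "Consume CPU and verify the site still responds",
  "steady-state-hypothesis": {
    "title": "Homepage responds with HTTP 200",
    "probes": [
      {
        "type": "probe",
        "name": "homepage-responds",
        "tolerance": 200,
        "provider": {
          "type": "http",
          "url": "http://web.example.internal/"
        }
      }
    ]
  },
  "method": [
    {
      "type": "action",
      "name": "stress-cpu",
      "provider": {
        "type": "process",
        "path": "stress-ng",
        "arguments": "--cpu 2 --timeout 60"
      }
    }
  ],
  "rollbacks": []
}
```

You'd execute this with `chaos run experiment.json`, and the steady-state hypothesis is checked before and after the method runs.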
While Chaos Toolkit supports several different platforms, it does run entirely through the CLI. This makes it difficult to run experiments across multiple systems unless you use a cloud platform like AWS or an orchestration platform like Kubernetes. Chaos Toolkit also lacks a native scheduling feature, GUI, or REST API.
Should I use Chaos Toolkit?
Chaos Toolkit is one of the most flexible tools for designing chaos experiments. Still, because of this DIY approach, it's more of a framework you build on than a ready-to-go Chaos Engineering solution.
Pros:
- Complete control over experiments, including a native rollback mechanism for returning systems to their steady state.
- Ability to auto-discover services and recommend experiments.
- Built-in logging and reporting capabilities.
Cons:
- No native scheduling feature.
- No easy way to run attacks on multiple systems (without specific drivers).
- Requires a more hands-on, technical effort in creating experiments.
- Limited portability of experiments.
Toxiproxy
Release year: 2014
License: Open source
Toxiproxy is a network failure injection tool that lets you create conditions such as latency, connection loss, bandwidth throttling, and packet manipulation. As the name implies, it acts as a proxy that sits between two services and can inject failure directly into traffic.
Toxiproxy has two components: a proxy server written in Go and a client that communicates with the proxy. When configuring the Toxiproxy server, you define the routes between your applications, then create chaos experiments (called toxics) to alter traffic behavior along those routes. You can manage your experiments using a command-line client or REST API.
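A minimal session with the CLI might look like the following. The proxy name, ports, and default toxic name are assumptions, and flag spellings vary between releases, so check `toxiproxy-cli --help` for your version:

```shell
# Route traffic destined for Redis through Toxiproxy
toxiproxy-cli create --listen localhost:26379 --upstream localhost:6379 redis_proxy

# Add a latency toxic: 1000 ms on downstream traffic
toxiproxy-cli toxic add --type latency --attribute latency=1000 redis_proxy

# Toxics run until deleted, so clean up when done
toxiproxy-cli toxic remove --toxicName latency_downstream redis_proxy
```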
Should I use Toxiproxy?
The main challenge with Toxiproxy is its design. Because it's a proxy service, you must reconfigure your applications to route network traffic. Not only does this add complexity to your deployments, but it also creates a single point of failure if you have a single server handling multiple applications. For this reason, even the maintainers recommend against using it in production.
Toxiproxy also lacks many controls, such as scheduling experiments, halting experiments, and monitoring. A toxic runs until you delete it, and there's a risk of intermittent connection errors caused by port conflicts. It's fine for testing timeouts and retries in development but not production.
Pros:
- Straightforward setup and configuration.
- Includes a comprehensive set of network attacks.
Cons:
- Creates a single point of failure for network traffic. Not useful or recommended for validating production systems.
- No security controls. Anyone with access to Toxiproxy can run experiments on any service.
- Slow development speed. The last official release was in January 2019, and some clients haven't been updated in 2+ years.
Istio
Release year: 2017
Creators: Google, IBM, and Lyft
License: Open source
Istio can inject latency or HTTP errors into network traffic between any virtual service. Experiments are defined as Kubernetes manifests, and you can choose your targets using existing Istio features like virtual services and routing rules. You can also use health checks and Envoy statistics to monitor the impact on your systems.
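For example, injecting both latency and HTTP errors into a fraction of traffic to a hypothetical ratings service looks roughly like this VirtualService (the service name and percentages are placeholders):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: ratings-fault
spec:
  hosts:
    - ratings
  http:
    - fault:
        # Delay 10% of requests by 5 seconds
        delay:
          percentage:
            value: 10.0
          fixedDelay: 5s
        # Abort 5% of requests with an HTTP 503
        abort:
          percentage:
            value: 5.0
          httpStatus: 503
      route:
        - destination:
            host: ratings
```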
This is about the extent of Istio's Chaos Engineering functionality. Experiments can't be scheduled, executed on hosts, customized extensively, or used outside of Istio. It's more or less taking advantage of Istio's place in the network to perform these experiments without adding any additional Chaos Engineering tools or functionality.
Should I use Istio?
If you already use Istio, this is an easy way to run chaos experiments on your cluster without having to deploy or learn another tool. Otherwise, it's not worth deploying Istio just for this feature.
Pros:
- Natively built into Istio. No additional setup is needed.
- Experiments are simple Kubernetes manifests.
Cons:
- Only two experiment types.
- If you don't already use Istio, adding it solely for this feature may be overkill.
Which tool is right for me?
Ultimately, any Chaos Engineering or reliability testing tool aims to help you achieve greater reliability with less toil. The question is: which tool will help you achieve that goal more easily and quickly?
Teams new to reliability testing will benefit more from a tool that provides pre-built tests and guides them through the process. In contrast, experienced teams may want a tool that gives them complete control over their experiments for hyper-specific scenarios.
For a more in-depth comparison of each of these tools, plus guidance on whether to build or buy a Chaos Engineering solution, see our Guide to Chaos Engineering Tools. We also created a comparison matrix to show how these tools stack up against each other: