Originally published October 7, 2020.
In this article, we take an in-depth look at some of the most popular open-source and commercial Chaos Engineering tools in the community.
This isn't meant to be a direct comparison between other tools and Gremlin. Rather, it's an objective look at each tool's features, ease of use, system/platform support, and extensibility.
Our team at Gremlin has decades of combined experience implementing Chaos Engineering at companies like Netflix and Amazon. We understand how to apply Chaos Engineering to large-scale systems, and which features engineers will most likely want in a Chaos Engineering solution. We'll also provide a table to show how these tools compare.
The goal of using Chaos Engineering tools isn't to deliberately cause your systems to fail. Rather, it's to help you find the reliability risks in your systems that could eventually lead to incidents and outages. The only way to truly understand the resiliency of your systems is to test them, and each of these tools gives you several ways of running those tests. For example, you can run a Chaos Engineering experiment to determine whether you have redundancy set up correctly.
By using Chaos Engineering tools to find risks before they impact users, teams:
Release year: 2018
License: Open source (with a managed option)
Litmus started as a testing tool for OpenEBS and has since grown into one of the largest open-source Kubernetes-native Chaos Engineering tools. It provides a library of faults for testing containers, hosts, and platforms such as Amazon EC2, Apache Kafka, and Azure. It includes a native web interface called ChaosCenter, provides a public repository of experiments called ChaosHub, and is easy to install using the official Helm Chart. While Litmus itself is open-source, Harness owns the project. Harness also provides a fully managed SaaS version of Litmus called Harness Chaos Engineering.
One of Litmus' key features is a health-checking feature called Litmus Probes, which monitors the health of your application before, during, and after an experiment. Probes can run shell commands, send HTTP requests, or run Kubernetes commands to check the state of your environment before running an experiment. This is useful for automating error detection and halting an experiment if your systems are unsteady. However, experiments require a lot of setup. You'll need to:
Litmus is a comprehensive tool that, unfortunately, comes with a steep learning curve. The ChaosCenter web interface makes it easy to run experiments but doesn't provide much guidance. Teams coming into Litmus need to already know what experiments to run, what to test for, and how to interpret the results, which is challenging for teams new to Chaos Engineering.
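To make the idea of a health probe that runs before, during, and after an experiment more concrete, here is a minimal sketch in plain Python. It is not Litmus's API: `run_experiment`, `probe`, `inject`, and `rollback` are hypothetical names chosen for illustration, and a real probe would send an HTTP request or run a command instead of returning a canned value.

```python
from typing import Callable, List


def run_experiment(
    probe: Callable[[], bool],
    inject: Callable[[], None],
    rollback: Callable[[], None],
    checks_during: int = 3,
) -> List[str]:
    """Run a fault guarded by a health probe and return a log of events."""
    log: List[str] = []

    # Before: refuse to start if the system is already unhealthy.
    if not probe():
        log.append("aborted: unhealthy before experiment")
        return log

    inject()
    log.append("fault injected")
    try:
        # During: poll the probe and halt on the first failure.
        for i in range(checks_during):
            if not probe():
                log.append(f"halted: probe failed during check {i + 1}")
                return log
        log.append("fault window completed")
    finally:
        # Always remove the fault, even when halting early.
        rollback()
        log.append("fault rolled back")

    # After: confirm the system recovered.
    log.append("passed" if probe() else "failed: unhealthy after experiment")
    return log
```

The `finally` block is the important design choice here: whether the experiment completes or halts early, the fault is always removed before the function returns.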
Release year: 2021
Creator: Amazon Web Services
AWS Fault Injection Simulator (FIS) lets you introduce faults into AWS services, including Amazon RDS, Amazon EC2, and Amazon EKS. AWS Resilience Hub evaluates your AWS environment, compares it to reliability policies, and provides improvement recommendations.
Unlike other tools, FIS can inject failure into AWS services through the AWS control plane. This allows for unique faults that are difficult for other tools to replicate, such as failing over a managed database cluster or throttling API requests. FIS can also call SSM to run custom commands on hosts, such as using stress-ng to consume resources or tc to manage network traffic. Actions can be performed sequentially or in parallel and target any number of resources. If you're using SSM, you'll also need to write an SSM document with the commands you want to run on each target. You can also define stop conditions using CloudWatch alarms, which automatically stop faults when triggered.
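As a rough illustration of how targets, actions, and stop conditions fit together, the sketch below builds a Python dictionary shaped like an FIS experiment template. The tag `chaos-ready`, the target and action names, and the ARNs are all placeholders invented for this example, and the function only builds the data; it does not call AWS.

```python
import json


def build_fis_template(role_arn: str, alarm_arn: str) -> dict:
    """Build a dictionary shaped like an FIS experiment template."""
    return {
        "description": "Stop one tagged EC2 instance, then restart it",
        "roleArn": role_arn,  # IAM role that FIS assumes to run the actions
        "targets": {
            "oneInstance": {  # hypothetical target name
                "resourceType": "aws:ec2:instance",
                "resourceTags": {"chaos-ready": "true"},  # hypothetical tag
                "selectionMode": "COUNT(1)",  # pick a single matching instance
            }
        },
        "actions": {
            "stopInstance": {  # hypothetical action name
                "actionId": "aws:ec2:stop-instances",
                "parameters": {"startInstancesAfterDuration": "PT5M"},
                "targets": {"Instances": "oneInstance"},
            }
        },
        # The experiment stops automatically if this CloudWatch alarm fires.
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": alarm_arn}
        ],
    }


if __name__ == "__main__":
    template = build_fis_template(
        "arn:aws:iam::123456789012:role/example-fis-role",  # placeholder
        "arn:aws:cloudwatch:us-east-1:123456789012:alarm:example",  # placeholder
    )
    print(json.dumps(template, indent=2))
```

A dictionary like this could then be submitted through the AWS CLI or an SDK. Check the current FIS API reference before relying on any field name shown here.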
While FIS and Resilience Hub are separate tools, they work best together. You can use Resilience Hub to find points of failure in your applications, and after you address them, use FIS to verify your fixes.
Since FIS only supports a limited number of AWS services and has a limited number of attacks, whether you use FIS or Resilience Hub will depend on what services you use. Even so, running an attack in FIS can be difficult, as it requires IAM roles, targeting specific AWS resource IDs, and possibly creating SSM Documents. And while the cost of attacking is only $0.10 per minute per action, this can quickly add up as you run more complex experiments. Even if you use Resilience Hub to find weaknesses, you'll likely need another fault injection tool to fill the gaps left by FIS.
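To see how the per-action, per-minute price adds up, here is a small worked calculation. It assumes only the $0.10 rate quoted above; the function name and the example workload (5 actions, 10 minutes each, 20 runs) are made up for illustration, and actual AWS pricing may differ.

```python
from decimal import Decimal

# Rate quoted in this article: $0.10 per minute, per action.
RATE_PER_ACTION_MINUTE = Decimal("0.10")


def fis_cost(actions: int, minutes_per_action: int, runs: int = 1) -> Decimal:
    """Estimate the FIS charge for repeated runs of one experiment."""
    return RATE_PER_ACTION_MINUTE * actions * minutes_per_action * runs


if __name__ == "__main__":
    # 5 actions x 10 minutes x 20 runs = 1,000 action-minutes
    print(f"${fis_cost(5, 10, 20)}")  # prints $100.00
```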
Release year: 2021 (preview)
Azure Chaos Studio is a Chaos Engineering solution for running faults directly on the Azure API. It supports faults on Azure Compute instances, CosmosDB, and Azure Cache for Redis. It also supports Kubernetes via integration with Chaos Mesh.
Defining an experiment consists of running one or more faults in sequence or parallel. Faults are either service-direct (i.e., run directly on an Azure resource) or agent-based (i.e., run inside a virtual machine). You can create and manage experiments using the Azure portal or the Chaos Studio REST API. While Chaos Studio has strong controls to prevent experiments from accidentally being run on the wrong systems, this can make it harder to start. Agent-based experiments require even more setup since each target host needs stress-ng pre-installed. This can mean significant upfront time, effort, and automation.
Chaos Studio's biggest strength is running faults directly on Azure infrastructure. However, it's a new service that's still only in public preview, and although it has exclusive access to the Azure API, you can replicate most of its faults using other tools.
Platforms: Docker, Kubernetes, Linux hosts
Release year: 2018
Steadybit is a commercial Chaos Engineering tool that aims to build remediation into its experiments.
A key part of Steadybit is resilience policies, which are declarative rules that Steadybit evaluates your systems against during an experiment. For example, if your resilience policy requires a service to be redundant and a host experiment causes the service to stop responding, Steadybit will flag your system as non-compliant. This way, you can identify vulnerabilities that specifically impact your policies.
Steadybit also provides several automatic safety mechanisms. It integrates with monitoring and observability tools to monitor your systems and halt experiments if they become unhealthy. It also comes with built-in tests called checks that assess the health of your system before running a test. For example, one check tests whether the number of healthy pods in a Kubernetes deployment matches the target number, and if it doesn't, Steadybit prevents the experiment from running.
Chaos Engineering can be a challenging practice to adopt, and Steadybit does a lot to make it more accessible. Unfortunately, it has many of the same challenges as other Chaos Engineering tools:
Release year: 2012
No Chaos Engineering list is complete without Chaos Monkey. It was one of the first Chaos Engineering tools and kickstarted the adoption of Chaos Engineering outside of large companies. From it, Netflix built out an entire suite of failure injection tools called the Simian Army, although many of them have since been retired or rolled into other tools like Swabbie.
Chaos Monkey is unpredictable by design. It only has one attack type: terminating virtual machine instances randomly during a time window. This lets you replicate unpredictable production incidents, but it can easily cause more harm than good if you're unprepared. While you can configure Chaos Monkey to check for ongoing outages before it runs, this involves writing custom Go code.
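The core behavior described above (pick a random instance, but only during a configured time window) can be sketched in a few lines. This is an illustration of the idea, not Chaos Monkey's code: the real tool is written in Go and works through Spinnaker, and the function name, default window, and instance IDs below are hypothetical.

```python
import random
from datetime import datetime, time
from typing import List, Optional


def pick_victim(
    instances: List[str],
    now: datetime,
    window_start: time = time(9, 0),   # hypothetical default window
    window_end: time = time(15, 0),
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Return one random instance to terminate, or None outside the window."""
    rng = rng or random.Random()
    in_window = window_start <= now.time() < window_end
    if not in_window or not instances:
        return None
    return rng.choice(instances)


if __name__ == "__main__":
    fleet = ["i-aaa", "i-bbb", "i-ccc"]  # placeholder instance IDs
    print(pick_victim(fleet, datetime.now()))
```

Passing in `now` and `rng` explicitly keeps the selection testable, which matters for a tool whose normal behavior is random.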
Chaos Monkey is historically significant, but its limited number of attacks, lengthy deployment process, Spinnaker requirement, and random approach to failure injection make it less practical than other tools.
Platforms: Docker, Kubernetes, bare-metal, cloud platforms
Release year: 2019
ChaosBlade is a CNCF sandbox project built on nearly ten years of failure testing at Alibaba. It supports many platforms, including Kubernetes, cloud platforms, and bare-metal, and provides dozens of attacks, including packet loss, process killing, and resource consumption. It also supports application-level fault injection for Java, C++, and Node.js applications, which lets you inject arbitrary code, delay code execution, and modify memory values.
ChaosBlade is modular by design. The core tool is more of an experiment orchestrator, while separate implementation projects perform the actual attacks. For example, the chaosblade-exec-os project provides host attacks, and the chaosblade-exec-cplus project provides C++ attacks. Alternatively, you can download ChaosBlade-box, which bundles host attacks, Docker attacks, JVM attacks, C++ attacks, and Litmus integration. Unfortunately for English speakers, ChaosBlade's documentation is primarily written in Standard Chinese. English translations were added after users requested complete documentation, but they are still missing in some places.
ChaosBlade is a versatile tool supporting a wide range of experiment types and target platforms. However, it lacks useful features such as centralized reporting, experiment scheduling, target randomization, and health checks.
Release year: 2020
License: Open source
Chaos Mesh supports 17 unique attacks, including resource consumption, network latency, packet loss, bandwidth restriction, disk I/O latency, system time manipulation, and even kernel panics. Since this is a Kubernetes tool, you can fine-tune your blast radius using Kubernetes labels and selectors. Chaos Mesh also supports node-level attacks using an add-on tool called chaosd.
Chaos Mesh is also one of the few open-source tools to include a fully-featured web user interface (UI) called the Chaos Dashboard. In addition to creating new experiments, you can use the Dashboard to manage running experiments and view a timeline of executions. Chaos Mesh also integrates with Grafana, so you can view your executions alongside your cluster's metrics to see the direct impact.
You can run experiments immediately or schedule them, although scheduling involves writing additional YAML. And while you can set a duration time for running faults, Chaos Mesh runs them indefinitely by default.
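Because Chaos Mesh runs faults indefinitely unless told otherwise, it helps to set a duration explicitly. The sketch below builds a NetworkChaos manifest as a Python dictionary, using a label selector to limit the blast radius. The namespace `shop`, the label `app: cart`, and the function name are hypothetical; verify field names against the Chaos Mesh documentation for your version.

```python
import json


def network_delay_manifest(
    namespace: str,
    app_label: str,
    latency: str = "100ms",
    duration: str = "60s",
) -> dict:
    """Build a NetworkChaos manifest that delays traffic for one matching pod."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "NetworkChaos",
        "metadata": {"name": f"{app_label}-delay", "namespace": namespace},
        "spec": {
            "action": "delay",
            "mode": "one",  # affect a single pod among those selected
            "selector": {
                "namespaces": [namespace],
                "labelSelectors": {"app": app_label},
            },
            "delay": {"latency": latency},
            # Set explicitly so the fault does not run indefinitely.
            "duration": duration,
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON manifests as well as YAML.
    print(json.dumps(network_delay_manifest("shop", "cart"), indent=2))
```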
Chaos Mesh offers a good amount of flexibility for Chaos Engineering on Kubernetes, encourages automating experiments via CI/CD, and is used by Azure Chaos Studio to inject Kubernetes faults. However, its most significant limitations are its lack of easy scheduling and safe defaults.
Platforms: Docker, Kubernetes, bare-metal, cloud platforms
Release year: 2018
License: Open source
Chaos Toolkit will be familiar to anyone who's used an infrastructure automation tool like Ansible. Instead of making you select from predefined experiments, Chaos Toolkit lets you define your own.
Each experiment consists of Actions and Probes. Actions execute commands on the target system, and Probes compare executed commands against an expected value. For example, we might create an Action that calls stress-ng to consume CPU on a web server, then use a Probe to check whether the website responds within a certain amount of time. Chaos Toolkit also provides drivers for interacting with different platforms and services. For example, you can use the AWS driver to experiment with AWS services or manage other Chaos Engineering tools like Toxiproxy and Istio. Chaos Toolkit can also auto-discover services in your environment and recommend tailored experiments.
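The stress-ng example above might look roughly like the following experiment definition, built here as a Python dictionary and written out as JSON. The URL, titles, and stress-ng arguments are assumptions made for illustration; treat this as a sketch of the experiment format, not a tested experiment.

```python
import json


def build_experiment(health_url: str) -> dict:
    """Build a Chaos Toolkit style experiment: one Probe and one Action."""
    return {
        "title": "Website stays responsive under CPU pressure",
        "description": "Consume CPU with stress-ng and check the site responds.",
        "steady-state-hypothesis": {
            "title": "Website responds with HTTP 200",
            "probes": [
                {
                    "type": "probe",
                    "name": "site-responds-in-time",
                    "tolerance": 200,  # expected HTTP status code
                    "provider": {
                        "type": "http",
                        "url": health_url,
                        "timeout": 3,  # seconds to wait for a response
                    },
                }
            ],
        },
        "method": [
            {
                "type": "action",
                "name": "consume-cpu",
                "provider": {
                    "type": "process",
                    "path": "stress-ng",
                    "arguments": "--cpu 2 --timeout 30s",
                },
            }
        ],
        "rollbacks": [],
    }


if __name__ == "__main__":
    # Hypothetical URL; save the output to a file and run it with `chaos run`.
    print(json.dumps(build_experiment("http://localhost:8080/health"), indent=2))
```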
While Chaos Toolkit supports several different platforms, it runs entirely through the CLI. This makes it difficult to run experiments across multiple systems unless you use a cloud platform like AWS or an orchestration platform like Kubernetes. Chaos Toolkit also lacks a native scheduling feature, GUI, or REST API.
Chaos Toolkit is one of the most flexible tools for designing chaos experiments. Still, because of this DIY approach, it's more of a framework you build on than a ready-to-go Chaos Engineering solution.
Release year: 2014
License: Open source
Toxiproxy is a network failure injection tool that lets you create conditions such as latency, connection loss, bandwidth throttling, and packet manipulation. As the name implies, it acts as a proxy that sits between two services and can inject failure directly into traffic.
Toxiproxy has two components: a proxy server written in Go and a client communicating with the proxy. When configuring the Toxiproxy server, you define the routes between your applications, then create chaos experiments (called toxics) to alter traffic behavior along those routes. You can manage your experiments using a command-line client or REST API.
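As a sketch of the REST workflow, the snippet below builds (but does not send) the two HTTP requests involved: one to create a proxy route and one to attach a latency toxic to it. It assumes a Toxiproxy server on its default API address of localhost:8474. The proxy name and ports are hypothetical, and the helper functions are not part of any official client.

```python
import json
from urllib.request import Request

API = "http://localhost:8474"  # assumed default Toxiproxy API address


def _post(path: str, payload: dict) -> Request:
    """Build a JSON POST request without sending it."""
    return Request(
        f"{API}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def create_proxy_request(name: str, listen: str, upstream: str) -> Request:
    """Route traffic arriving at `listen` through Toxiproxy to `upstream`."""
    return _post("/proxies", {"name": name, "listen": listen, "upstream": upstream})


def add_latency_request(proxy: str, latency_ms: int, jitter_ms: int = 0) -> Request:
    """Attach a latency toxic to responses flowing back to the client."""
    return _post(
        f"/proxies/{proxy}/toxics",
        {
            "type": "latency",
            "stream": "downstream",
            "toxicity": 1.0,  # apply to every connection
            "attributes": {"latency": latency_ms, "jitter": jitter_ms},
        },
    )


if __name__ == "__main__":
    # Hypothetical route: the app connects to :26379 instead of Redis on :6379.
    for req in (
        create_proxy_request("redis", "127.0.0.1:26379", "127.0.0.1:6379"),
        add_latency_request("redis", 1000),
    ):
        print(req.get_method(), req.full_url, req.data.decode())
```

To actually send a request, pass it to `urllib.request.urlopen`. Remember that the toxic stays active until you delete it.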
The main challenge with Toxiproxy is its design. Because it's a proxy service, you must reconfigure your applications to route network traffic. Not only does this add complexity to your deployments, but it also creates a single point of failure if you have a single server handling multiple applications. For this reason, even the maintainers recommend against using it in production.
Toxiproxy also lacks many controls, such as scheduling experiments, halting experiments, and monitoring. A toxic runs until you delete it, and there's a risk of intermittent connection errors caused by port conflicts. It's fine for testing timeouts and retries in development but not production.
Release year: 2017
Creators: Google, IBM, and Lyft
License: Open source
Istio can inject latency or HTTP errors into network traffic between virtual services. Experiments are defined as Kubernetes manifests, and you can choose your targets using existing Istio features like virtual services and routing rules. You can also use health checks and Envoy statistics to monitor the impact on your systems.
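For illustration, a fault-injecting VirtualService might look like the manifest built below, which delays a percentage of requests and aborts a smaller percentage with an HTTP 503. The service name `ratings`, the percentages, and the function name are assumptions for this sketch, and the `apiVersion` may differ depending on your Istio release.

```python
import json


def fault_virtual_service(
    host: str,
    delay_percent: float = 10.0,
    abort_percent: float = 5.0,
) -> dict:
    """Build an Istio VirtualService that injects delays and HTTP 503 errors."""
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "VirtualService",
        "metadata": {"name": f"{host}-fault"},
        "spec": {
            "hosts": [host],
            "http": [
                {
                    "fault": {
                        # Delay this share of requests by a fixed 5 seconds.
                        "delay": {
                            "percentage": {"value": delay_percent},
                            "fixedDelay": "5s",
                        },
                        # Fail this share of requests with HTTP 503.
                        "abort": {
                            "percentage": {"value": abort_percent},
                            "httpStatus": 503,
                        },
                    },
                    "route": [{"destination": {"host": host}}],
                }
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(fault_virtual_service("ratings"), indent=2))
```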
This is about the extent of Istio's Chaos Engineering functionality. Experiments can't be scheduled, executed on hosts, customized extensively, or used outside of Istio. It's more or less taking advantage of Istio's place in the network to perform these experiments without adding any additional Chaos Engineering tools or functionality.
If you already use Istio, this is an easy way to run chaos experiments on your cluster without having to deploy or learn another tool. Otherwise, it's not worth deploying Istio just for this feature.
Ultimately, any Chaos Engineering or reliability testing tool aims to help you achieve greater reliability with less toil. The question is: which tool will help you achieve that goal more easily and quickly?
Teams new to reliability testing will benefit more from a tool that provides pre-built tests and guides them through the process. In contrast, experienced teams may want a tool that gives them complete control over their experiments for hyper-specific scenarios.
For a more in-depth comparison of each of these tools, plus guidance on whether to build or buy a Chaos Engineering solution, see our Guide to Chaos Engineering Tools. We also created a comparison matrix to show how these tools stack up with each other:
| | Gremlin | Litmus | AWS FIS | AWS Resilience Hub | Azure Chaos Studio | Steadybit | Chaos Monkey | ChaosBlade | Chaos Mesh | Chaos Toolkit | Toxiproxy | Istio |
| License | Commercial | Open source | Commercial | Commercial | Commercial | Commercial | Open source | Open source | Open source | Open source | Open source | Open source |
| Works with | Kubernetes, containers, Linux and Windows hosts | Kubernetes | Cloud (AWS-only) | Cloud (AWS-only) | Cloud (Azure-only) | Kubernetes, Docker, and Linux hosts | Spinnaker | Docker, Kubernetes, bare-metal, cloud platforms | Kubernetes | Docker, Kubernetes, bare-metal, cloud platforms | Network | Kubernetes |
|Shared fault library||✔️||✔️||✔️|
Gremlin empowers you to proactively root out failure before it causes downtime. See how you can harness chaos to build resilient systems by requesting a demo of Gremlin.