This article describes some of the common tools that the Chaos Engineering community considers when starting to implement the practice in an organization. The goal is to give a high-level introduction to some frequently mentioned options and to list some of the strengths of each.
Our team at Gremlin has combined decades of experience implementing Chaos Engineering at companies like Netflix, Amazon, Google, Salesforce, and Dropbox. We have created and implemented a variety of these tools multiple times, and we have found that the features listed in this table are critical for widespread adoption of a Chaos Engineering practice in most enterprise engineering organizations. Following the table is a brief description of each entry.
These features center on common areas such as the variety of chaos experiments available and the portability of the code, but also on critical safety and security features. In particular, safety and security features ease adoption with IT and are difficult to build yourself. Anything that makes the lives of DevOps and site reliability engineers easier is worth consideration.
|Tool|Number of attack types available|Enterprise support|Can halt attack in progress|GUI|Open source|Software as a service (SaaS)|Cloud agnostic|Can attack container / pod|Can attack serverless|Has API|
|---|---|---|---|---|---|---|---|---|---|---|
|Bloomberg Powerful Seal|1| |✔️| |✔️| |✔️|✔️| | |
|ChaosIQ Chaos Toolkit|3|Via vendor| | |✔️| |✔️|✔️| |✔️|
|Netflix Chaos Monkey|1| | | |✔️| | | | | |
|Netflix Simian Army|2| | | |✔️| | | | | |
|New Relic Chaos Panda|2| | | | | | | | |✔️|
From Bloomberg, Powerful Seal is an open source tool written in Python designed for testing Kubernetes clusters. It works on nodes, pods, containers, and namespaces. It natively works with AWS, OpenStack, and local machines. You can run experiments directly or through automation. No official support is available, but documentation is available and development is active.
Chaos Toolkit is an open source project written in Python that defines an API to help you run chaos experiments that you define. The project uses a system of drivers, plugins, and extensions to allow you to customize and automate the experiments you design. Designing your own experiments and assembling the right set of pieces gives you great flexibility, but with the risk of complexity and engineering time to properly implement. Commercial support is available from ChaosIQ.
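To make the structure of a Chaos Toolkit experiment concrete, here is a minimal sketch of the declarative format the tool consumes, expressed as a Python dict mirroring the JSON layout. The service URL and process name are hypothetical placeholders, not part of any real deployment.

```python
import json

# A sketch of a Chaos Toolkit-style experiment declaration: a steady-state
# hypothesis (probes), a method (actions that inject failure), and rollbacks.
# The health-check URL and the "worker" process name are placeholders.
experiment = {
    "version": "1.0.0",
    "title": "Service stays available when a worker process dies",
    "description": "Kill one worker and verify the health endpoint still responds.",
    "steady-state-hypothesis": {
        "title": "Application is healthy",
        "probes": [
            {
                "type": "probe",
                "name": "health-endpoint-responds",
                "tolerance": 200,  # expected HTTP status code
                "provider": {
                    "type": "http",
                    "url": "http://example.internal/health",  # placeholder
                },
            }
        ],
    },
    "method": [
        {
            "type": "action",
            "name": "kill-one-worker",
            "provider": {
                "type": "process",
                "path": "pkill",
                "arguments": ["-f", "worker"],  # placeholder process name
            },
        }
    ],
    "rollbacks": [],
}

# Serialized, this is the JSON file you would hand to the `chaos run` CLI.
print(json.dumps(experiment, indent=2))
```

Saved as `experiment.json`, a declaration along these lines would be executed with `chaos run experiment.json`; the exact provider fields depend on which drivers and extensions you have installed.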
Gremlin is a commercial software as a service (SaaS) offering focused on enterprise customers and others with large-scale deployments. It includes multiple attack types and the ability to halt an attack in progress and roll back should problems occur. You can test in development, in testing, in your CI/CD process, and in production. Gremlin is controlled via either an API or a clear, easy-to-understand web-based graphical user interface. Both methods include significant security features, including encryption and granular user account permissions. The latter feature enables teams to designate who has administrator access to control Gremlin attacks and to limit some user accounts to specific functions, following the long-standing standard of granting only the level of permissions necessary to complete a task. Gremlin also has the ability to control application-level fault injection (ALFI) attacks, making it possible to use request-level metadata to construct your own attacks to be scheduled and controlled by Gremlin.
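As an illustration of driving a SaaS chaos tool through its REST API, here is a sketch of what scheduling a small, well-bounded latency attack might look like. The endpoint, authorization header format, and payload shape below are assumptions for illustration only; consult Gremlin's API documentation for the real schema.

```python
import json

# Assumed endpoint and credential format -- placeholders, not verified schema.
API_URL = "https://api.gremlin.com/v1/attacks/new"
API_KEY = "your-team-api-key"  # placeholder credential

headers = {
    "Authorization": f"Key {API_KEY}",
    "Content-Type": "application/json",
}

# A hypothetical latency attack: add 300 ms of latency to one randomly
# chosen host for 60 seconds -- a deliberately small blast radius.
attack = {
    "command": {"type": "latency", "args": ["-l", "60", "-m", "300"]},
    "target": {"type": "Random", "exact": 1},
}

body = json.dumps(attack)
# An HTTP client call would go here, e.g. with the `requests` library:
# requests.post(API_URL, headers=headers, data=body)
print(body)
```

The point is less the exact payload than the workflow: attacks are defined declaratively, scoped tightly, and launched (or halted) through an authenticated API rather than by hand on production hosts.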
Istio is a service mesh that includes some features that you can use for chaos experiments, because the istio-proxy is already intercepting all network traffic. That means the proxy can be used to change the responses or delay responses to simulate latency, provided the request you want to target is a part of your service mesh.
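Istio exposes this capability as fault injection on a VirtualService. The sketch below builds such a manifest as a Python dict (Kubernetes accepts JSON manifests as well as YAML); the service host and resource name are placeholders.

```python
import json

# A sketch of an Istio VirtualService that injects a fixed 3-second delay
# into half of the requests to a service. Host and metadata names are
# placeholders; a real manifest targets a service in your own mesh.
virtual_service = {
    "apiVersion": "networking.istio.io/v1beta1",
    "kind": "VirtualService",
    "metadata": {"name": "checkout-delay-experiment"},  # placeholder name
    "spec": {
        "hosts": ["checkout.example.svc.cluster.local"],  # placeholder host
        "http": [
            {
                "fault": {
                    "delay": {
                        "fixedDelay": "3s",           # simulated latency
                        "percentage": {"value": 50},  # half the traffic
                    }
                },
                "route": [
                    {"destination": {"host": "checkout.example.svc.cluster.local"}}
                ],
            }
        ],
    },
}

# This JSON could be written to a file and applied with `kubectl apply -f`.
print(json.dumps(virtual_service, indent=2))
```

Because the fault is applied by the sidecar proxy, no application code changes are needed, but only traffic flowing through the mesh can be targeted.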
From Mesosphere, DRAX is a container-level, DC/OS-specific resilience testing tool inspired by Netflix's Chaos Monkey. DC/OS is Mesosphere's datacenter operating system and cloud automation platform. DRAX runs as an app in Marathon, Mesosphere's container orchestration platform, killing off random tasks of any non-framework application, or of a specific one, running in Marathon.
From Netflix, Chaos Monkey is the first of all Chaos Engineering tools, the one that started it all. If a person has heard of only one tool, this is the most likely candidate. Netflix created Chaos Monkey as they were moving from an on-site deployment to the AWS cloud. Once Netflix realized that their servers and nodes were no longer under their complete control, but entrusted to a vendor, they decided to figure out what would happen if one of those nodes suddenly went down, so they could mitigate any customer-facing problems that might arise. Chaos Monkey does just that, in production: it pseudorandomly reboots hosts to suss out weaknesses and to validate that automated remediation works correctly. The Chaos Monkey code was released as open source, but it is essentially unmaintained and unsupported, so if you choose to use it you take on those responsibilities yourself. In addition, it requires that you deploy using Spinnaker, which is not a bad thing, but simply a limitation to consider.
Also from Netflix, Simian Army is the logical evolution from Chaos Monkey. It is a suite of Chaos Engineering tools designed to expand the types of failure that can be induced while experimenting to find weaknesses en route to enhancing resilience. Parts of Simian Army have been rolled into Spinnaker while other parts have been either released as somewhat-maintained, standalone open source projects or deprecated and never released because they were written as internal, private tools.
From New Relic, Chaos Panda is designed to help you implement Chaos Engineering with New Relic’s GraphQL API. Internal teams can configure GraphQL in pre-production testing to do things like add latency to query responses or to cause certain fields to fail at a specific failure rate. This tool appears to be limited currently to internal New Relic teams, but is interesting enough to warrant a mention here.
From Nutanix, X-Ray is a Chaos Engineering tool designed to test infrastructures based on their Acropolis Hypervisor (AHV) or VMware's ESXi hypervisor. It can test for node availability, performance, provisioning, and data integrity. Test scenarios can also be customized, shared, and imported into new X-Ray deployments. X-Ray is packaged as a VM and uses workloads paired with real-world scenarios to simulate typical workflows and events for their platform. Usage requires an existing Nutanix deployment and an additional registration.
Pumba is an open source project written in Go focused on chaos testing for Docker. It simulates failures related to processes, containers that stop or disappear, and various network and performance issues. The code and binaries are available from GitHub. No official support is available, but documentation is available and development is active.
From Shopify, ToxiProxy is a TCP proxy written in Go, created to simulate network connections and conditions while your application is in development, testing, and CI environments. It offers multiple attack vectors within those constraints, and the code is released as an open source project. There is no official support, so implementation is up to you; however, the documentation is clear and complete. Under Armour started their Chaos Engineering practice using ToxiProxy but has since moved to Gremlin.
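ToxiProxy is configured through an HTTP control API (listening on port 8474 by default): you register a proxy between your application and a dependency, then attach "toxics" that degrade the connection. The sketch below builds the two payloads without sending them; the names and ports are placeholders, and the exact attribute set should be checked against the ToxiProxy README.

```python
import json

# ToxiProxy's control API listens on :8474 by default (assumption to verify
# against your deployment). No request is actually sent in this snippet.
TOXIPROXY_URL = "http://localhost:8474"

# Create a proxy that sits between the app and a downstream dependency.
proxy = {
    "name": "postgres",           # placeholder proxy name
    "listen": "127.0.0.1:25432",  # the app connects here instead
    "upstream": "127.0.0.1:5432", # the real dependency
}

# Attach a latency toxic: downstream responses gain ~500 ms +/- 100 ms jitter.
toxic = {
    "type": "latency",
    "stream": "downstream",
    "toxicity": 1.0,  # apply to 100% of connections
    "attributes": {"latency": 500, "jitter": 100},
}

# With an HTTP client like `requests`, the calls would look roughly like:
# requests.post(f"{TOXIPROXY_URL}/proxies", data=json.dumps(proxy))
# requests.post(f"{TOXIPROXY_URL}/proxies/postgres/toxics", data=json.dumps(toxic))
print(json.dumps(proxy), json.dumps(toxic))
```

Pointing the application at the proxy's `listen` address instead of the real dependency lets you toggle failure conditions on and off in tests without touching the dependency itself.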
From T-Mobile, Turbulence is an extension of Chaos Toolkit that, like T-Mobile's microservices, runs in a Cloud Foundry BOSH environment. It takes an organization name, space name, and application name, and can block either access to that application or that application's access to one or more of its bound services. Turbulence and its drivers have been released as open source projects by T-Mobile. No official support is available, but documentation is available and development is active.
There are other tools that you can use to perform chaos experiments. Any technology component in your stack can theoretically be targeted with some form of intentional failure to see how the rest of your stack handles that failure. You can find ways to attack that are minimal, well-bounded, and carefully limited to minimize the blast radius. The main thing to keep in mind at all times is that you are seeking resilience, looking for areas of weakness that you can work to strengthen. Poke. Prod. Experiment. Learn. Adapt. Grow. Some paths that follow this pattern are easier, safer, or more secure than other paths. Find the one that works best for your company, your technology stack, and your budget and personnel constraints.