Chaos Monkey Guide for Engineers

Tips, Tutorials, and Training

In 2010 Netflix announced the existence and success of their custom resiliency tool called Chaos Monkey.

What is Chaos Monkey?

In 2010, Netflix decided to move their systems to the cloud. In this new environment, hosts could be terminated and replaced at any time, which meant their services needed to prepare for this constraint. By pseudo-randomly rebooting their own hosts, they could suss out any weaknesses and validate that their automated remediation worked correctly. This also helped find "stateful" services, which relied on host resources (such as a local cache and database), as opposed to stateless services, which store such things on a remote host.

Netflix designed Chaos Monkey to test system stability by forcing failures via the pseudo-random termination of instances and services within Netflix's architecture. Following their migration to the cloud, Netflix's service was newly reliant upon Amazon Web Services and needed a technology that could show them how their system responded when critical components of their production service infrastructure were taken down. Intentionally causing this kind of failure would expose weaknesses in their systems and guide them towards automated solutions that gracefully handle future failures of this sort.

Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system's capability to withstand turbulent conditions in production.

Chaos Monkey helped jumpstart Chaos Engineering as a new engineering practice. Chaos Engineering is a disciplined approach to identifying failures before they become outages. By proactively testing how a system responds to failure conditions, you can identify and fix weaknesses before they become public-facing outages. Chaos Engineering lets you check what you think will happen against what actually happens in your systems. By performing the smallest experiments that still produce measurable results, you're able to "break things on purpose" in order to learn how to build more resilient systems.

In 2011, Netflix announced the evolution of Chaos Monkey with a series of additional tools known as The Simian Army. Inspired by the success of their original Chaos Monkey tool aimed at randomly disabling production instances and services, the engineering team developed additional "simians" built to cause other types of failure and induce abnormal system conditions. For example, the Latency Monkey tool introduces artificial delays in RESTful client-server communication, allowing the team at Netflix to simulate service unavailability without actually taking down said service. This guide will cover all the details of these tools in The Simian Army chapter.
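To make the Latency Monkey idea more concrete, here is a minimal Python sketch of injecting an artificial delay in front of a client call. It is not Netflix's implementation: the decorator name `inject_latency` and its parameters are hypothetical, chosen only to illustrate how a slow dependency can be simulated without taking the service down.

```python
import random
import time
from functools import wraps
from typing import Any, Callable, Optional


def inject_latency(
    delay_seconds: float,
    probability: float = 1.0,
    rng: Optional[random.Random] = None,
    sleep: Callable[[float], Any] = time.sleep,
) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Hypothetical decorator: delay a call to simulate a slow dependency.

    `rng` and `sleep` are injectable so the behavior can be tested
    without real randomness or real waiting.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    rng = rng or random.Random()

    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        @wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            # Delay only a fraction of calls; the call itself still succeeds.
            if rng.random() < probability:
                sleep(delay_seconds)
            return func(*args, **kwargs)

        return wrapper

    return decorator
```

Wrapping a client function with something like `@inject_latency(delay_seconds=2.0, probability=0.1)` would slow roughly one call in ten, which is usually enough to see whether timeouts and fallbacks in the caller behave as expected.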

What Is This Guide?

The Chaos Monkey Guide for Engineers is a full how-to of Chaos Monkey, including what it is, its origin story, its pros and cons, its relation to the broader topic of Chaos Engineering, and much more. We've also included step-by-step technical tutorials for getting started with Chaos Monkey, along with advanced engineering tips and guides for those looking to go beyond the basics. The Simian Army section explores all the additional tools created after Chaos Monkey.

This guide also includes resources, tutorials, and downloads for engineers seeking to improve their own Chaos Engineering practices. In fact, our alternative technologies chapter goes above and beyond by examining a curated list of the best alternatives to Chaos Monkey -- we dig into everything from Azure and Docker to Kubernetes and VMware!

Who Is This Guide For?

We've created this guide primarily for engineers who are looking for an in-depth resource on Chaos Monkey, as a way to get started with Chaos Engineering. We want to help readers see how Chaos Monkey fits into the practice of Chaos Engineering.

Why Did We Create This Guide?

Gremlin's goal is to empower engineering teams to build more resilient systems through thoughtful Chaos Engineering. We're on a constant quest to promote the Chaos Community through frequent conferences & meetups, in-depth talks, detailed tutorials, and the ever-growing list of Chaos Engineering Slack channels.

While Chaos Engineering extends well beyond the scope of one single technique or idea, Chaos Monkey is the best-known tool for running Chaos Experiments and is a common starting point for engineers who are new to the discipline.

The Pros and Cons of Chaos Monkey

Chaos Monkey is designed to induce one specific type of failure. It randomly shuts down instances in order to simulate random server failure.

Pros of Chaos Monkey

Prepares You for Random Instance Failures

Chaos Monkey allows for planned instance failures when you and your team are best prepared to handle them. You can schedule terminations so that they occur, on average, once every configurable number of days, and only during a given time period each day.
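The scheduling idea above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Chaos Monkey's actual code: the function `plan_termination` and its parameters are hypothetical. The sketch decides once per day whether to terminate, with probability 1/mean_days, and if so picks a pseudo-random time inside the daily window.

```python
import random
from datetime import date, datetime, time, timedelta
from typing import Optional


def plan_termination(
    day: date,
    mean_days: float,
    window_start: time,
    window_end: time,
    rng: random.Random,
) -> Optional[datetime]:
    """Return a termination time for `day`, or None if none is scheduled.

    Hypothetical helper: names and parameters are assumptions made
    for illustration only.
    """
    if mean_days < 1:
        raise ValueError("mean_days must be at least 1")
    start = datetime.combine(day, window_start)
    end = datetime.combine(day, window_end)
    if end <= start:
        raise ValueError("window_end must be later than window_start")

    # Terminating with probability 1/mean_days on each day yields, on
    # average, one termination every mean_days days.
    if rng.random() >= 1.0 / mean_days:
        return None

    # Pick a pseudo-random moment inside the allowed daily window.
    offset = rng.uniform(0, (end - start).total_seconds())
    return start + timedelta(seconds=offset)
```

For example, calling `plan_termination(date(2024, 1, 8), 5, time(9), time(15), random.Random())` returns `None` on most days and, about one day in five, a timestamp between 09:00 and 15:00. Restricting the window to working hours reflects the point made above: failures should happen when the team is around to respond.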

Encourages Redundancy

Redundancy is part and parcel of a distributed architecture, and encouraging it is another major benefit of smart Chaos Engineering practices. If a single service or instance is brought down unexpectedly, a redundant backup may save the day.

Built Into Spinnaker

Chaos Monkey Version 2.0 relies on Spinnaker. This is both a pro and a con: it enables cross-cloud compatibility, but it also requires that you manage your deployments with Spinnaker.

Cons of Chaos Monkey

Requires Spinnaker

As discussed in The Origin of Chaos Monkey, Chaos Monkey does not support deployments that are managed by anything other than Spinnaker.

Requires MySQL

Chaos Monkey also requires the use of MySQL 5.X, as discussed in more detail in the Chaos Monkey Tutorial chapter.

Limited Failure Mode

Chaos Monkey's limited scope means it injects only one type of failure: pseudo-random instance termination. Thoughtful Chaos Engineering is about looking at an application's future, toward unknowable and unpredictable failures, beyond those of a single AWS instance. Chaos Monkey only handles one of the "long tail" failures that software will experience during its life cycle. Check out the Chaos Monkey Alternatives chapter for more information.

Lack of Coordination

While Chaos Monkey can terminate instances and cause failures, it lacks much semblance of coordination. Since Chaos Monkey is an open-source tool that was built by and for Netflix, it's left to you as the end-user to inject your own system-specific logic. Bringing down an instance is great and all, but knowing how to coordinate and act on that information is critical.

No Recovery Capabilities

A big reason why Chaos Engineering encourages performing the smallest possible experiments is so that any repercussions are contained -- if something goes awry, it's ideal to have a safety net or the ability to abort the experiment. Unfortunately, Chaos Monkey doesn't include such safety features. Many other tools and services do, including Gremlin's Halt All button, which immediately stops all active experiments.
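As a rough sketch of what "the ability to abort" can mean in practice, the Python example below runs an experiment as a series of reversible steps and undoes everything applied so far once a halt flag is set. It does not represent Gremlin's Halt All feature or anything built into Chaos Monkey; the `Step` class and `run_experiment` function are hypothetical names used for illustration.

```python
import threading
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """Hypothetical experiment step with a matching undo action."""
    name: str
    apply: Callable[[], None]
    undo: Callable[[], None]


def run_experiment(steps: List[Step], halt: threading.Event) -> List[str]:
    """Apply steps in order; if `halt` is set, stop and undo applied steps.

    Returns the names of the steps that are still applied on exit.
    """
    applied: List[Step] = []
    for step in steps:
        # Check the halt flag before every step so an abort takes
        # effect as early as possible.
        if halt.is_set():
            break
        step.apply()
        applied.append(step)

    if halt.is_set():
        # Roll back in reverse order, most recent step first.
        while applied:
            applied.pop().undo()

    return [step.name for step in applied]
```

Any other thread (an operator action or an automated health check, for instance) can call `halt.set()` to stop the experiment. Keeping each step small and reversible is what makes this kind of safety net possible.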

No User Interface

As with most open-source projects, Chaos Monkey is operated entirely through the command line, scripts, and configuration files. If your team wants an interface, it's up to you to build it.

Limited Helper Tools

By itself, Chaos Monkey fails to provide many useful functions such as auditing, outage checking, termination tracking, and so forth. Spinnaker supports a framework for creating your own Chaos Monkey auditing through its Echo events microservice, but you'll generally be required to either integrate with Netflix's existing software or to create your own custom tools in order to get much info out of Chaos Monkey.

Guide Chapters

The Origin of Chaos Monkey

Why Netflix Needed to Create Failure

In this chapter we'll take a deep dive into the origins and history of Chaos Monkey, how Netflix streaming services emerged, and why Netflix needed to create failure within their systems to improve their service and customer experiences. We'll also provide a brief overview of the Simian Army and its relation to the original Chaos Monkey technology. Finally, we'll jump into the present and future of Chaos Monkey, dig into the creation and implementation of Failure Injection Testing at Netflix, and discuss the potential issues and limitations presented by Chaos Monkey's reliance on Spinnaker.

Chaos Monkey Tutorial

A Step-by-Step Guide to Creating Failure on AWS

This chapter will provide a step-by-step guide for setting up and using Chaos Monkey with AWS. We also examine the scenarios where Chaos Monkey is the right solution, as well as its limitations, since it only handles random instance terminations.

Advanced Developer Guide

Taking Chaos Monkey to the Next Level

This chapter provides advanced developer tips for Chaos Monkey and other Chaos Engineering tools, including tutorials for manually deploying Spinnaker stacks on a local machine, virtual machine, or with Kubernetes. From there you can configure and deploy Spinnaker itself, along with Chaos Monkey and other Chaos Engineering tools!

The Simian Army

Overview and Resources

The Simian Army is a suite of failure-inducing tools designed to add more capabilities beyond Chaos Monkey. While Chaos Monkey solely handles termination of random instances, Netflix engineers needed additional tools able to induce other types of failure. Some of the Simian Army tools have fallen out of favor in recent years and are deprecated, but each of the members serves a specific purpose aimed at bolstering a system's failure resilience.

For Engineers

Chaos Monkey Resources, Guides, and Downloads

We've collected and curated well over 100 resources to help you with every aspect of your journey into Chaos Engineering. Learn about Chaos Engineering's origins and principles to shed light on what it's all about or dive right into one of the dozens of in-depth tutorials to get experimenting right away. You might also be interested in subscribing to some of the best Chaos Engineering blogs on the net or installing one of the many tools designed to inject failure into your applications, no matter the platform.

Chaos Monkey Alternatives

Tools for Creating Chaos Outside of AWS

Chaos Monkey serves a singular purpose -- to randomly terminate instances. As discussed in Chaos Monkey and Spinnaker and The Pros and Cons of Chaos Monkey, additional tools are required when using Chaos Monkey, in order to cover the broad spectrum of experimentation and failure injection required for proper Chaos Engineering.