Chaos Monkey
The Simian Army

Overview and Resources

Last Updated
October 17, 2018

The Simian Army is a suite of failure-inducing tools designed to add more capabilities beyond Chaos Monkey. While Chaos Monkey solely handles termination of random instances, Netflix engineers needed additional tools able to induce other types of failure. Some of the Simian Army tools have fallen out of favor in recent years and are deprecated, but each of the members serves a specific purpose aimed at bolstering a system's failure resilience.

In this chapter we'll jump into each member of the Simian Army and examine how these tools helped shape modern Chaos Engineering best practices. We'll also explore each of the Simian Chaos Strategies used to define which Chaos Experiments the system should undergo. Lastly, we'll plunge into a short tutorial walking through the basics of installing and using the Simian Army toolset.

Simian Army Members

Each Simian Army member was built to perform a small yet precise Chaos Experiment. Results from these tiny tests can be easily measured and acted upon, allowing you and your team to quickly adapt. By performing frequent, intentional failures within your own systems, you're able to create a more fault-tolerant application.

Active Simians

In addition to Chaos Monkey, the following three simians are the only members of the Army to have been publicly released, and they remain available for use today.

Janitor Monkey - Now Swabbie

Janitor Monkey seeks out and disposes of unused resources within the cloud. It checks any given resource against a set of configurable rules to determine if it's an eligible candidate for cleanup. Janitor Monkey features a number of configurable options, but the default behavior looks for resources like orphaned (non-auto-scaled) instances, volumes that are not attached to an instance, unused auto-scaling groups, and more.
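As a rough illustration of one such rule, the sketch below picks out volumes that are not attached to any instance. It is not Janitor Monkey's own implementation. The function name `unattached_volumes` is hypothetical, and it assumes its input is a two-column listing of volume ID and state, where EC2 reports unattached volumes in the `available` state.

```shell
# Read "VOLUME_ID STATE" rows on stdin and print the IDs of volumes
# that are not attached to any instance (EC2 state "available").
# Hypothetical helper for illustration, not part of SimianArmy.
unattached_volumes() {
  awk '$2 == "available" { print $1 }'
}

# Example input source (requires configured AWS credentials):
#   aws ec2 describe-volumes \
#     --query 'Volumes[].[VolumeId,State]' --output text | unattached_volumes
```

A real cleanup rule would also consider factors such as how long the volume has been unused before flagging it, so treat this as a starting point only.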

Have a look at Using Simian Army Tools for a basic guide to configuring and executing Janitor Monkey experiments.

Update: Swabbie is the Spinnaker service that replaces the functionality provided by Janitor Monkey. Find out more in the official documentation.

Conformity Monkey - Now Part of Spinnaker

The Conformity Monkey is similar to Janitor Monkey -- it seeks out instances that don't conform to predefined rule sets and shuts them down. Here are a few of the non-conformities that Conformity Monkey looks for.

  • Auto-scaling groups and their associated elastic load balancers that have mismatched availability zones.
  • Clustered instances that are not contained in required security groups.
  • Instances that are older than a certain age threshold.

Conformity Monkey capabilities have also been rolled into Spinnaker. More info on using Conformity Monkey can be found under Using Simian Army Tools.
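The first non-conformity in the list above, mismatched availability zones, can be sketched as a simple set comparison. This is an illustration under assumptions, not Conformity Monkey's actual rule code: the function names are hypothetical, and the zone lists are assumed to arrive as comma-separated strings.

```shell
# Turn a comma-separated zone list into a sorted, de-duplicated,
# one-per-line form so two lists can be compared regardless of order.
normalize_zones() {
  printf '%s\n' "$1" | tr -d ' ' | tr ',' '\n' | sed '/^$/d' | sort -u
}

# Succeed (exit 0) when the auto-scaling group and its load balancer
# cover exactly the same availability zones; fail otherwise.
zones_conform() {
  [ "$(normalize_zones "$1")" = "$(normalize_zones "$2")" ]
}

# Usage:
#   zones_conform "us-west-2a,us-west-2b" "us-west-2b,us-west-2a" \
#     && echo "conforming" || echo "zone mismatch"
```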

Security Monkey

Security Monkey was originally created as an extension to Conformity Monkey, and it locates potential security vulnerabilities and violations. It has since broken off into a self-contained, standalone, open-source project. The current 1.X version is capable of monitoring many common cloud provider accounts for policy changes and insecure configurations. It also ships with a single-page application web interface.

Inactive/Private Simians

This group of simians has either been deprecated or was never publicly released.

Chaos Gorilla

AWS Cloud resources are distributed around the world, with a current total of 25 geographic regions. Each region consists of one or more availability zones. Each availability zone acts as a separate, redundant private network, and the zones within a given region communicate with one another via fiber.

The Chaos Gorilla tool simulates the outage of an entire AWS availability zone. It's been successfully used by Netflix to verify that their service load balancers functioned properly and kept services running, even in the event of an availability zone failure.

Chaos Kong

While rare, it is not unheard of for an AWS region to experience outages. Though Chaos Gorilla simulates availability zone outages, Netflix later created Chaos Kong to simulate region outages. As Netflix discusses in their blog, running frequent Chaos Kong experiments prior to any actual regional outages ensured that their systems were able to successfully evacuate traffic from the failing region into a nominal region, without suffering any severe degradation.

*Netflix Chaos Kong Experiment - Courtesy of Netflix*

Latency Monkey

Latency Monkey causes artificial delays in RESTful client-server communications. While it proved to be a useful tool, Netflix later discovered that this particular Simian could be somewhat difficult to wrangle at times. By simulating network delays and failures, it allowed services to be tested to see how they react when their dependencies slow down or fail to respond, but these actions also occasionally caused unintended effects within other applications.

Netflix never publicly released the Latency Monkey code, and it eventually evolved into their Failure Injection Testing (FIT) service, which we discuss in more detail elsewhere in this guide.

Doctor Monkey

Doctor Monkey performs instance health checks and monitors vital metrics like CPU load, memory usage, and so forth. Any instance deemed unhealthy by Doctor Monkey is removed from service.

Doctor Monkey is not open-sourced, but most of its functionality is built into other tools like Spinnaker, which includes a load balancer health checker, so instances that fail certain criteria are terminated and immediately replaced by new ones. Check out the How to Deploy Spinnaker on Kubernetes tutorial to see this in action!

10-18 Monkey

The 10-18 Monkey (aka <span class="code-class-custom">l10n-i18n</span>) detects runtime issues and problematic configurations within instances that are accessible across multiple geographic regions and that serve unique localizations.

Simian Chaos Strategies

The original Chaos Monkey was built to inject failure by terminating EC2 instances. However, this provides a limited simulation scope, so Chaos Strategies were added to the Simian Army toolset. Most of these strategies are disabled by default, but they can be toggled in the <span class="code-class-custom">SimianArmy/src/main/resources/chaos.properties</span> configuration file.
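One way to toggle a strategy without opening an editor is a small helper that updates a key in place, or appends it if it is missing. This is only a sketch: `set_property` is a hypothetical function, and it assumes each strategy is switched on through its configuration key plus an `.enabled` suffix, so check the key names in your copy of the properties file before relying on it.

```shell
# set_property FILE KEY VALUE
# Update "KEY = VALUE" in a Java-style properties file, or append the
# line when the key is not present. Commented-out keys are left alone.
set_property() {
  file=$1 key=$2 value=$3
  tmp="$file.tmp.$$"
  # Escape dots so the key is matched literally by grep and sed.
  pattern=$(printf '%s' "$key" | sed 's/\./\\./g')
  if grep -q "^[[:space:]]*${pattern}[[:space:]]*=" "$file"; then
    sed "s|^[[:space:]]*${pattern}[[:space:]]*=.*|$key = $value|" "$file" > "$tmp" \
      && mv "$tmp" "$file"
  else
    printf '%s = %s\n' "$key" "$value" >> "$file"
  fi
}

# Usage (the ".enabled" suffix is an assumption):
#   set_property ~/SimianArmy/src/main/resources/chaos.properties \
#     simianarmy.chaos.burncpu.enabled true
```

Rewriting through a temporary file keeps the original intact if `sed` fails partway through.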

Instance Shutdown (Simius Mortus)

Shuts down an EC2 instance.

Configuration Key

<span class="code-class-custom">simianarmy.chaos.shutdowninstance</span>

Network Traffic Blocker (Simius Quies)

Blocks network traffic by applying restricted security access to the instance. This strategy only applies to VPC instances.

Configuration Key

<span class="code-class-custom">simianarmy.chaos.blockallnetworktraffic</span>

EBS Volume Detachment (Simius Amputa)

Detaches all EBS volumes from the instance to simulate I/O failure.

Configuration Key

<span class="code-class-custom">simianarmy.chaos.detachvolumes</span>

Burn-CPU (Simius Cogitarius)

Heavily utilizes the instance CPU.

Configuration Key

<span class="code-class-custom">simianarmy.chaos.burncpu</span>

Burn-IO (Simius Occupatus)

Heavily utilizes the instance disk.

Configuration Key

<span class="code-class-custom">simianarmy.chaos.burnio</span>

Fill Disk (Simius Plenus)

Attempts to fill the instance disk.

Configuration Key

<span class="code-class-custom">simianarmy.chaos.filldisk</span>

Kill Processes (Simius Delirius)

Kills all Python and Java processes once every second.

Configuration Key

<span class="code-class-custom">simianarmy.chaos.killprocesses</span>

Null-Route (Simius Desertus)

Severs all instance-to-instance network traffic by null-routing the <span class="code-class-custom">10.0.0.0/8</span> network.

Configuration Key

<span class="code-class-custom">simianarmy.chaos.nullroute</span>
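To make the null-routing idea concrete, here is a minimal sketch that prints the Linux `ip route` commands for adding and removing a blackhole route. The function names are hypothetical, and this is not presented as the script SimianArmy itself runs. The commands are printed rather than executed so they can be reviewed first.

```shell
# Print (but do not run) the command that null-routes a CIDR block.
# Defaults to 10.0.0.0/8, the network named in the strategy description.
null_route_cmd() {
  cidr=${1:-10.0.0.0/8}
  printf 'ip route add blackhole %s\n' "$cidr"
}

# Print the matching rollback command.
null_route_undo_cmd() {
  cidr=${1:-10.0.0.0/8}
  printf 'ip route del blackhole %s\n' "$cidr"
}

# To apply on a disposable test instance: null_route_cmd | sudo sh
```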

Fail DNS (Simius Nonomenius)

Prevents all DNS requests by blocking TCP and UDP traffic to port <span class="code-class-custom">53</span>.

Configuration Key

<span class="code-class-custom">simianarmy.chaos.faildns</span>
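A comparable effect can be sketched with `iptables` rules that drop outbound traffic to port 53 over both protocols. This is an assumption about one way to do it, not the rules SimianArmy itself installs. `fail_dns_cmds` is a hypothetical helper that only prints the commands.

```shell
# Print iptables commands that drop outbound DNS traffic (port 53)
# over TCP and UDP. Pass -D instead of the default -A to print the
# commands that delete the same rules again.
fail_dns_cmds() {
  action=${1:--A}
  for proto in tcp udp; do
    printf 'iptables %s OUTPUT -p %s --dport 53 -j DROP\n' "$action" "$proto"
  done
}

# To apply on a disposable test instance: fail_dns_cmds | sudo sh
# To roll back:                           fail_dns_cmds -D | sudo sh
```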

Fail EC2 API (Simius Noneccius)

Halts all EC2 API communication by adding invalid entries to <span class="code-class-custom">/etc/hosts</span>.

Configuration Key

<span class="code-class-custom">simianarmy.chaos.failec2</span>

Fail S3 API (Simius Amnesius)

Stops all S3 API traffic by placing invalid entries in <span class="code-class-custom">/etc/hosts</span>.

Configuration Key

<span class="code-class-custom">simianarmy.chaos.fails3</span>

Fail DynamoDB API (Simius Nodynamus)

Prevents all DynamoDB API communication by adding invalid entries to <span class="code-class-custom">/etc/hosts</span>.

Configuration Key

<span class="code-class-custom">simianarmy.chaos.faildynamodb</span>
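The three API-failure strategies above share one mechanism: hosts-file entries that point the service endpoint somewhere it cannot be reached. The sketch below generates such entries. It rests on assumptions: `api_blackhole_entries` is a hypothetical helper, the address `192.0.2.1` comes from a range reserved for documentation, and only the regional endpoint form `SERVICE.REGION.amazonaws.com` is covered, so clients using other endpoint names would be unaffected.

```shell
# api_blackhole_entries REGION SERVICE...
# Print /etc/hosts lines that map each regional AWS API endpoint to
# an address from the reserved documentation range (192.0.2.0/24).
api_blackhole_entries() {
  region=$1
  shift
  for service in "$@"; do
    printf '192.0.2.1 %s.%s.amazonaws.com\n' "$service" "$region"
  done
}

# To apply on a disposable test instance:
#   api_blackhole_entries us-west-2 ec2 s3 dynamodb | sudo tee -a /etc/hosts
```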

Network Corruption (Simius Politicus)

Corrupts the majority of network packets using a traffic shaping API.

Configuration Key

<span class="code-class-custom">simianarmy.chaos.networkcorruption</span>

Network Latency (Simius Tardus)

Delays all network packets by <span class="code-class-custom">1</span> second, plus or minus half a second, using a traffic shaping API.

Configuration Key

<span class="code-class-custom">simianarmy.chaos.networklatency</span>

Network Loss (Simius Perditus)

Drops a fraction of all network packets by using a traffic shaping API.

Configuration Key

<span class="code-class-custom">simianarmy.chaos.networkloss</span>
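On Linux, the three traffic-shaping strategies above can be approximated with the `netem` queueing discipline of `tc`. The sketch below prints one command per strategy. `netem_cmd` is a hypothetical helper, the latency values mirror the 1 second plus or minus half a second described above, and the loss and corruption percentages are caller-supplied because the text does not state them.

```shell
# netem_cmd DEVICE MODE [PERCENT]
# Print a tc command that applies one netem impairment to DEVICE.
#   latency     delay of 1000ms with 500ms of jitter
#   loss        drop PERCENT of packets
#   corruption  corrupt PERCENT of packets
netem_cmd() {
  dev=$1 mode=$2 amount=$3
  case "$mode" in
    latency)    printf 'tc qdisc add dev %s root netem delay 1000ms 500ms\n' "$dev" ;;
    loss)       printf 'tc qdisc add dev %s root netem loss %s%%\n' "$dev" "$amount" ;;
    corruption) printf 'tc qdisc add dev %s root netem corrupt %s%%\n' "$dev" "$amount" ;;
    *)          printf 'unknown mode: %s\n' "$mode" >&2; return 1 ;;
  esac
}

# To apply on a disposable test instance: netem_cmd eth0 latency | sudo sh
# To remove the impairment again:         sudo tc qdisc del dev eth0 root
```

Only one root qdisc can be attached per device this way, so remove the current impairment before applying another.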

Using Simian Army Tools

Prerequisites

Installation

  1. Start by creating an AWS Auto Scaling launch configuration.
    BASH
    
    aws autoscaling create-launch-configuration --launch-configuration-name simian-lc --instance-type t2.micro --image-id ami-51537029
    
  2. Now use the generated simian-lc configuration to create an Auto Scaling Group.
    BASH
    
    aws autoscaling create-auto-scaling-group --auto-scaling-group-name monkey-target --launch-configuration-name simian-lc --availability-zones us-west-2a --min-size 1 --max-size 2
    
  3. (Optional) Check that the scaling group was successfully added.
    BASH
    
    aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names monkey-target --output json
    
    JSON
    
    {
      "AutoScalingGroups": [
        {
          "AutoScalingGroupARN": "arn:aws:autoscaling:us-west-2:123456789012:autoScalingGroup:918a23bc-ea5a-4def-bc68-5356becfd35d:autoScalingGroupName/monkey-target",
          "ServiceLinkedRoleARN": "arn:aws:iam::123456789012:role/aws-service-role/autoscaling.amazonaws.com/AWSServiceRoleForAutoScaling",
          "TargetGroupARNs": [],
          "SuspendedProcesses": [],
          "DesiredCapacity": 1,
          "Tags": [],
          "EnabledMetrics": [],
          "LoadBalancerNames": [],
          "AutoScalingGroupName": "monkey-target",
          "DefaultCooldown": 300,
          "MinSize": 1,
          "Instances": [
            {
              "ProtectedFromScaleIn": false,
              "AvailabilityZone": "us-west-2a",
              "InstanceId": "i-0e47c9f0df5150263",
              "HealthStatus": "Healthy",
              "LifecycleState": "Pending",
              "LaunchConfigurationName": "simian-lc"
            }
          ],
          "MaxSize": 2,
          "VPCZoneIdentifier": "",
          "HealthCheckGracePeriod": 0,
          "TerminationPolicies": ["Default"],
          "LaunchConfigurationName": "simian-lc",
          "CreatedTime": "2018-09-13T03:43:13.503Z",
          "AvailabilityZones": ["us-west-2a"],
          "HealthCheckType": "EC2",
          "NewInstancesProtectedFromScaleIn": false
        }
      ]
    }
    
  4. (Optional) Add any additional, manually-propagated EC2 instances you might need, using the same ami-51537029 image used for the auto-scaling group.
    BASH
    
    aws ec2 run-instances --image-id ami-51537029 --count 1 --instance-type t2.micro --key-name id_rsa
    
    BASH
    
    # OUTPUT
    123456789012 r-0ade24933c15617ba
    INSTANCES 0   x86_64   False    xen ami-51537029    i-062b161f4a1cddbb7 t2.micro    id_rsa  2018-09-13T03:50:07.000Z    ip-172-31-30-145.us-west-2.compute.internal 172.31.30.145       /dev/sda1   ebs True        subnet-27c73d43 hvm vpc-0967976d
    
  5. (Optional) Attach any manually-created EC2 instances to the monkey-target auto-scaling group.
    BASH
    
    aws autoscaling attach-instances --instance-ids i-062b161f4a1cddbb7 --auto-scaling-group-name monkey-target
    

Receiving Email Notifications

  1. (Optional) If you want to receive email notifications you'll need to add an email address identity to AWS Simple Email Service (SES).
    us-east-1 Region only
    At present, SimianArmy only attempts to send email notifications through the AWS us-east-1 region, regardless of configuration settings. Thus, be sure the recipient address is verified with SES in the us-east-1 region.
    BASH
    
    aws ses verify-email-identity --email-address me@example.com --region us-east-1
    
  2. Open your email client and click the verification link.
  3. Verify the address was successfully added to the proper SES region.
    BASH
    
    aws ses list-identities --region=us-east-1
    
    BASH
    
    # OUTPUT
    IDENTITIES    me@example.com
    

Configuration

  1. Clone the SimianArmy GitHub repository into the local directory of your choice.
    BASH
    
    git clone https://github.com/Netflix/SimianArmy.git ~/SimianArmy
    
  2. (Optional) Modify the client.properties configuration to change AWS connection settings.
    BASH
    
    nano ~/SimianArmy/src/main/resources/client.properties
    
  3. (Optional) Modify the simianarmy.properties configuration to change general SimianArmy behavior.
    BASH
    
    nano ~/SimianArmy/src/main/resources/simianarmy.properties
    
  4. (Optional) Modify the chaos.properties configuration to change Chaos Monkey's behavior.
    BASH
    
    nano ~/SimianArmy/src/main/resources/chaos.properties
    
  5. By default, Chaos Monkey won't target AWS Auto Scaling Groups unless you explicitly enable them. If desired, enable the recently added monkey-target ASG by adding the following setting.
    BASH
    
    simianarmy.chaos.ASG.monkey-target.enabled = true
    
  6. (Optional) Modify the janitor.properties configuration to change Janitor Monkey's behavior.
    BASH
    
    nano ~/SimianArmy/src/main/resources/janitor.properties
    
  7. (Optional) If you opted to receive SES notifications, specify the recipient email address within each appropriate configuration file. The following example modifies the conformity.properties file.
    BASH
    
    nano ~/SimianArmy/src/main/resources/conformity.properties
    
    BASH
    
    # The property below needs to be a valid email address to receive the summary email of Conformity Monkey
    # after each run
    simianarmy.conformity.summaryEmail.to = foo@bar.com
    
    # The property below needs to be a valid email address to send notifications for Conformity monkey
    simianarmy.conformity.notification.defaultEmail = foo@bar.com
    
    # The property below needs to be a valid email address to send notifications for Conformity Monkey
    simianarmy.conformity.notification.sourceEmail = foo@bar.com
    

Executing Experiments

Run the included Gradle Jetty server to build and execute the Simian Army configuration.

BASH

./gradlew jettyRun

After the build completes you'll see log output from each enabled Simian Army member, including Chaos Monkey 1.X.

Using Chaos Monkey 1.X

BASH

2018-09-11 14:31:06.625 - INFO  BasicChaosMonkey - [BasicChaosMonkey.java:276] Group monkey-target [type ASG] enabled [prob 1.0]
2018-09-11 14:31:06.625 - INFO  BasicChaosInstanceSelector - [BasicChaosInstanceSelector.java:89] Group monkey-target [type ASG] got lucky: 0.9183174043024381 > 0.16666666666666666
2018-09-11 14:31:06.626 - INFO  Monkey - [Monkey.java:138] Reporting what I did...

This older version of Chaos Monkey uses probability to pseudo-randomly determine when an instance should be terminated. The output above shows that the random value <span class="code-class-custom">0.918...</span> exceeds the required chance of <span class="code-class-custom">1/6</span>, so no instance was selected for termination. However, running <span class="code-class-custom">./gradlew jettyRun</span> a few times will eventually produce a selection. If necessary, you can also modify the probability settings in the <span class="code-class-custom">chaos.properties</span> file.
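The comparison in that log line can be sketched as follows. This is an illustration rather than the Chaos Monkey source: the function names are hypothetical, and it assumes the threshold is the group's configured probability divided by the number of runs per day, which matches the logged values (1.0 / 6 is about 0.1667).

```shell
# should_terminate ROLL PROBABILITY RUNS_PER_DAY
# Succeed (exit 0) when the random ROLL falls at or below the per-run
# threshold PROBABILITY / RUNS_PER_DAY. A larger roll means the group
# "got lucky" and nothing is terminated on this run.
should_terminate() {
  awk -v r="$1" -v p="$2" -v n="$3" 'BEGIN { exit !(r + 0 <= p / n) }'
}

# Produce a pseudo-random roll in the range [0, 1).
random_roll() {
  awk 'BEGIN { srand(); print rand() }'
}

# Usage:
#   should_terminate "$(random_roll)" 1.0 6 && echo "selected" || echo "got lucky"
```

With the values from the log, `should_terminate 0.9183174043024381 1.0 6` fails, which corresponds to the "got lucky" message.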

BASH

2018-09-11 14:33:06.625 - INFO  BasicChaosMonkey - [BasicChaosMonkey.java:89] Group monkey-target [type ASG] enabled [prob 1.0]
2018-09-11 14:33:06.625 - INFO  BasicChaosMonkey - [BasicChaosMonkey.java:280] leashed ChaosMonkey prevented from killing i-057701c3ab4f1e5a4 from group monkey-target [ASG], set simianarmy.chaos.leashed=false

By default, the <span class="code-class-custom">simianarmy.chaos.leashed = true</span> property in <span class="code-class-custom">chaos.properties</span> prevents Chaos Monkey from terminating instances, as indicated in the above log output. However, changing this property to <span class="code-class-custom">false</span> allows Chaos Monkey to terminate the selected instance.

BASH

2018-09-11 14:33:56.225 - INFO  BasicChaosMonkey - [BasicChaosMonkey.java:89] Group monkey-target [type ASG] enabled [prob 1.0]
2018-09-11 14:33:56.225 - INFO  BasicChaosMonkey - [BasicChaosMonkey.java:280] Terminated i-057701c3ab4f1e5a4 from group monkey-target [ASG]

Next Steps

Now that you've learned about the Simian Army, check out our Developer Tutorial to find out how to install and use the newer Chaos Monkey toolset. You can also learn about the many alternatives to Chaos Monkey, in which we shed light on tools and services designed to bring intelligent failure injection and powerful Chaos Engineering practices to your fingertips.
