AWS GameDay Lounge : Chaos Engineering with Gremlin

Categories: Chaos Engineering


To successfully complete this course, you’ll need:

  • A computer with an Internet connection
  • An AWS account supplied at the GameDay Lounge
  • A Gremlin Pro account supplied by Gremlin

What is Chaos Engineering?

Chaos Engineering is a disciplined approach to identifying failures before they become outages. By proactively testing how a system responds under stress, you can identify and fix failures before they end up in the news.

Chaos Engineering lets you compare what you think will happen to what actually happens in your systems. You literally “break things on purpose” to learn how to build more resilient systems.

If you would like to take a deep dive into learning more about the history and principles, check out this link.

Inject something harmful to build an immunity.

Why do Chaos Engineering?

  • Cost of Downtime

How to do Chaos Engineering?

  • Principles of Chaos Engineering:

    • Plan an experiment
    • Contain the Blast Radius and Magnitude
    • Scale or Squash
  • Terms to know:

    • Blast Radius: The number of hosts and/or containers that are targeted in an experiment.

    • Magnitude: The intensity of the attack you’re running.

    • Abort Conditions: The conditions that would cause you to halt the experiment.

      • Examples: Error Rate, Latency
      • Big Red Button: Halt the experiment as soon as it hits one of your abort conditions.
  • Scientific Method:

    • Form a Hypothesis
    • Experiment and Test It
    • Analyze Results
    • Expand Scope and Re-Test
    • Share Results
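
The terms above can be made concrete in code. The following is a minimal Python sketch, not part of Gremlin or AWS, of checking abort conditions against observed metrics. The metric names (error_rate, latency_ms) and the threshold values are hypothetical; choose ones that match your own monitoring.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AbortCondition:
    """A metric and the value above which the experiment should be halted."""
    metric: str
    threshold: float


def breached_conditions(conditions, observed):
    """Return the names of metrics whose observed value exceeds its threshold.

    Metrics missing from `observed` are skipped rather than treated as breached.
    """
    return [
        c.metric
        for c in conditions
        if c.metric in observed and observed[c.metric] > c.threshold
    ]


# Hypothetical thresholds: halt above a 5% error rate or 2000 ms latency.
ABORT_CONDITIONS = [
    AbortCondition("error_rate", 0.05),
    AbortCondition("latency_ms", 2000.0),
]
```

For example, breached_conditions(ABORT_CONDITIONS, {"error_rate": 0.01, "latency_ms": 2500.0}) returns ["latency_ms"], which would be the signal to press the Big Red Button.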

Today’s Infrastructure and Demo Environment

Today’s demo environment

Getting Set Up

Access AWS Console

  • Follow the provided instructions to log in to the AWS console
  • Click on EKS to confirm there is an EKS cluster. Its name should be “GremlinGameDay” followed by a string of characters.
  • Switch over to EC2’s dashboard, and confirm there are 3 instances available for the EKS cluster as well as a bastion host.

Access Monitoring

For more information on using Container Insights, see this documentation:

Access Gremlin

Access your Bastion Host

  • Open the EC2 Dashboard in the AWS Console
  • Click on Running Instances and look for the instance with a “Bastion Host” name and select it
  • Click Connect
  • Select “EC2 Instance Connect”

For more information on EC2 Instance Connect and other options for connecting to your bastion host, see this documentation:

Access Sock Shop

  • Log in to the bastion host, and run

    • kubectl get svc -o wide -n sock-shop | grep LoadBalancer
  • Copy the load balancer’s DNS name

    • Sample:
  • Paste this DNS into a browser to access the sock shop front-end

  • Navigate around to get a feel of all the functions of the shop. Things to try out:

    • Register and Log in
    • Viewing various items
    • Adding and removing items to and from cart
    • Checking out items
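
If you would rather extract the load balancer's DNS name with a script than copy it by hand, the kubectl output from the step above can be parsed. This is a minimal Python sketch that assumes the default column order of kubectl get svc -o wide (NAME, TYPE, CLUSTER-IP, EXTERNAL-IP, and so on); the function name is hypothetical.

```python
def load_balancer_hosts(kubectl_output):
    """Return the EXTERNAL-IP value of every LoadBalancer service in the output.

    Assumes whitespace-separated columns in the order
    NAME, TYPE, CLUSTER-IP, EXTERNAL-IP, PORT(S), AGE, SELECTOR.
    """
    hosts = []
    for line in kubectl_output.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[1] != "LoadBalancer":
            continue  # header line, blank line, or a different service type
        external = fields[3]
        if external in ("<pending>", "<none>"):
            continue  # the load balancer has no address yet
        hosts.append(external)
    return hosts
```

On the bastion host, you could pass it the real output with subprocess.run(["kubectl", "get", "svc", "-o", "wide", "-n", "sock-shop"], capture_output=True, text=True).stdout.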

Deploy Gremlin

  • In your bastion host, clone the Gremlin daemonset with git clone

  • Then edit the daemonset with vi gremlin/daemonset.yaml. In this file we need to edit two fields: Team ID and Team Secret

    • To get your Team ID:

      • Within Gremlin, open your company settings and click the Teams tab. Your Team ID is listed in your team's row. Copy this value.
      • Inside the daemonset YAML file, find the line that says “value: <YOUR TEAM ID>”, and replace <YOUR TEAM ID> with the Team ID value that you copied from Gremlin.
    • To get your Team Secret:

      • In Gremlin, under the same Teams tab, click Reset in Secret Key. Copy this value as it’s viewable only once.
      • Inside the daemonset YAML file, find the line that says “value: <YOUR SECRET KEY>”, and replace <YOUR SECRET KEY> with the Team Secret you previously copied from Gremlin.
  • Save your file and exit.
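
As an alternative to editing the file by hand, the two placeholders can be substituted with a short script. This is a minimal Python sketch, not an official Gremlin tool; it assumes the file contains the literal placeholders <YOUR TEAM ID> and <YOUR SECRET KEY> described above, and the function names are hypothetical.

```python
from pathlib import Path


def fill_credentials(yaml_text, team_id, team_secret):
    """Return yaml_text with both placeholders replaced by the given values.

    Raises ValueError if a placeholder is missing, so a file that was
    already edited (or has a different layout) is not silently accepted.
    """
    replacements = {
        "<YOUR TEAM ID>": team_id,
        "<YOUR SECRET KEY>": team_secret,
    }
    for placeholder, value in replacements.items():
        if placeholder not in yaml_text:
            raise ValueError(f"placeholder {placeholder!r} not found")
        yaml_text = yaml_text.replace(placeholder, value)
    return yaml_text


def update_daemonset(path, team_id, team_secret):
    """Rewrite the daemonset file at `path` in place with the credentials filled in."""
    file = Path(path)
    file.write_text(fill_credentials(file.read_text(), team_id, team_secret))
```

Calling update_daemonset("gremlin/daemonset.yaml", team_id, team_secret) would then take the place of the manual edit. Because the Team Secret is sensitive, avoid hard-coding it in a script you keep.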

Running a Chaos Experiment

1st Experiment (Scenario)

For this first experiment, we will check to see if this cluster has autoscaling policies dialed in correctly.

1st Experiment (Scenario) Questions

  • Look at Container Insights
  • Was this failure detected?
  • Did the outcome of this failure result in expected behaviors?
  • Would the service be able to handle this failure?

Remediation: Set Up Auto Scaling

Go to the AWS Console and select EC2 from Services.

On the left navigation bar, select Auto Scaling Groups.

Each cluster gets its own Auto Scaling group. Select the one you need, then in the lower navigation, select “Scaling Policies”.

Select “Add Policy”, then “Create a simple scaling policy”

We will create two of these: one to scale up and one to set the group back to the usual 3 instances.

Name the first policy “Cluster-ScaleUp” and select “Create New Alarm”. Configure the alarm to go off when CPU utilization is greater than or equal to 13% for at least 1 minute, and name it Cluster-ScaleUp.

Press “Create Alarm”

Now you will be taken back to finish editing the policy

Edit the values to add “1” instance and then wait 120 seconds before the next scaling activity. Press Create when finished.

Follow the same steps as above to create a second policy called “Cluster-ScaleDown” with a new alarm. This alarm goes off when CPUUtilization is less than or equal to (<=) 13% within 15 minutes.
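
To reason about how the Cluster-ScaleUp and Cluster-ScaleDown policies interact before re-running the scenario, it can help to model them. This is a simplified Python sketch, not AWS's actual alarm evaluation logic: it assumes one CPU sample per minute, ignores the Auto Scaling group's maximum size, and treats the 120-second wait as a two-minute cooldown. Both alarms use 13% as their threshold, so a reading of exactly 13% satisfies both; the sketch checks scale-up first, which is a choice made here for illustration.

```python
SCALE_UP_CPU = 13.0        # percent; Cluster-ScaleUp fires at or above this
SCALE_DOWN_CPU = 13.0      # percent; Cluster-ScaleDown fires at or below this
SCALE_DOWN_MINUTES = 15    # consecutive low-CPU minutes before scaling down
COOLDOWN_MINUTES = 2       # 120 seconds between scaling activities
BASELINE_INSTANCES = 3     # the cluster's usual size


def simulate(cpu_per_minute, instances=BASELINE_INSTANCES):
    """Return the instance count after each one-minute CPU sample."""
    next_allowed = 0   # first minute at which a scaling activity may run
    low_streak = 0     # consecutive minutes at or below SCALE_DOWN_CPU
    history = []
    for minute, cpu in enumerate(cpu_per_minute):
        low_streak = low_streak + 1 if cpu <= SCALE_DOWN_CPU else 0
        if minute >= next_allowed:
            if cpu >= SCALE_UP_CPU:
                # Scale-up is checked first, so it wins at exactly 13%.
                instances += 1
                next_allowed = minute + COOLDOWN_MINUTES
            elif low_streak >= SCALE_DOWN_MINUTES and instances != BASELINE_INSTANCES:
                # Set the group back to its usual size.
                instances = BASELINE_INSTANCES
                next_allowed = minute + COOLDOWN_MINUTES
        history.append(instances)
    return history
```

For example, simulate([50.0] * 4) returns [4, 4, 5, 5]: one instance is added immediately, and a second only after the two-minute cooldown has passed.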

Re-run scenario #1. Customize your scenario if you would like to see more auto scaling events kick off.

2nd Experiment

For this experiment, we will test and discover what happens when there is network degradation as your primary service attempts to request a downstream dependency.

  • Scroll down and click “Choose a Gremlin”
  • Select “Network” -> “Latency” Gremlin
  • Change the length of the attack to 300 seconds, and set MS (the added latency in milliseconds) to 1000
  • Click Unleash Gremlin
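
To judge the effect of the latency attack with numbers rather than by feel, you could time the same request before and during the attack. This is a minimal Python sketch using only the standard library; the function names are hypothetical, and the URL you pass in is whatever load balancer DNS name you copied earlier.

```python
import time
import urllib.request
from statistics import median


def timed(action):
    """Run `action` once and return how long it took, in seconds."""
    start = time.perf_counter()
    action()
    return time.perf_counter() - start


def median_latency(action, samples=5):
    """Return the median duration of `samples` runs of `action`, in seconds."""
    return median(timed(action) for _ in range(samples))


def fetch(url, timeout=30.0):
    """Return a zero-argument callable that downloads `url` once."""
    def action():
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read()
    return action
```

Call median_latency(fetch(url)) once before unleashing the Gremlin and again while the attack is running, then compare the two numbers against your hypothesis. The median is used so that a single slow outlier does not dominate the result.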

2nd Experiment Questions

  • As the attack is running, try the following:

    • Add items to the cart
    • Remove items from the cart
    • Update quantity of
  • Is there any customer impact?

  • Are systems recovering gracefully?

  • Is there any way to mitigate this?

  • In Gremlin, click Halt All Attacks to stop this attack.

  • Did systems recover?

  • What did we learn?

3rd Experiment

For this experiment, we will test what happens if a container were to fail. Sometimes, especially in a containerized environment, your orchestration can automatically recover, but it takes time to detect and fix the issue, resulting in a potential partial outage.

  • Scroll down and click “Choose a Gremlin”
  • Select “State” -> “Shutdown” Gremlin
  • Switch off “Reboot”
  • Click Unleash Gremlin

3rd Experiment Questions

  • As the attack is running, try the following:

    • Add items to Cart
    • Access
    • Remove
    • Check out
  • Is there any customer impact?

  • Are systems recovering gracefully?

    • How long did it take?
    • Is full customer experience restored?
  • How might you mitigate this?
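
The question “How long did it take?” can be answered by polling a health check from the moment the container is shut down until it responds again. This is a minimal Python sketch; the is_healthy callable is a placeholder you supply (for example, a function that requests the shop's front page and returns True on success), and the default interval and timeout are arbitrary.

```python
import time


def time_to_recover(is_healthy, poll_interval=1.0, timeout=300.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll is_healthy() until it returns True; return elapsed seconds.

    Returns None if the service has not recovered within `timeout` seconds.
    `clock` and `sleep` are injectable so the function can be tested
    without actually waiting.
    """
    start = clock()
    while clock() - start < timeout:
        if is_healthy():
            return clock() - start
        sleep(poll_interval)
    return None
```

The result is only as precise as poll_interval, so use a short interval if you expect recovery within a few seconds. A None result is itself a finding: the system did not recover gracefully within the window you allowed.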

Running a Chaos Experiment

Create Your Own Experiment!

Now that you’ve had a chance to run some pre-planned experiments, you can create your own experiment from start to finish. There is no wrong way to create an experiment, but it’s important to go through the full thought process.

Use the Blank Chaos Experiment card above to start forming a scenario. Then create this attack in Gremlin!

Key Questions to Ask

  • Was this failure detected?

  • Did this failure have customer impact?

    • If so, what was it?
  • Was the impact of this failure expected, and did it match your hypothesis?

    • If not, what happened instead?
  • Can this failure be handled or mitigated?
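
One way to make sure every key question gets answered is to record them alongside the hypothesis. This is a minimal Python sketch of such a record; the class and field names are hypothetical and simply mirror the questions listed above, not any Gremlin data format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExperimentReview:
    """Answers to the key questions for one experiment; None means unanswered."""
    hypothesis: str
    detected: Optional[bool] = None            # Was this failure detected?
    customer_impact: Optional[str] = None      # Describe the impact, or "" for none
    matched_hypothesis: Optional[bool] = None  # Did the outcome match the hypothesis?
    can_be_mitigated: Optional[bool] = None    # Can this failure be handled or mitigated?

    def unanswered(self):
        """Return the names of the key questions that have not been answered yet."""
        questions = (
            "detected",
            "customer_impact",
            "matched_hypothesis",
            "can_be_mitigated",
        )
        return [name for name in questions if getattr(self, name) is None]
```

An experiment is ready to share once unanswered() returns an empty list.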

Break Through

Join 4000+ engineers in the Chaos Engineering Community Slack.


Where should I go to get support?

To get support, join the Chaos Engineering Community Slack, and then join the #aws-gamedaylounge channel.

Can I replay this workshop on my own?

If you want to spin up this demo environment, the CloudFormation template to do so is located here.

By default, you will have to deploy this template in us-east-1.

Once the deployment completes, you can replay the workshop on this site.

