Getting Started with Chaos Engineering

Eugene Wu
Solutions Architect
Last Updated: July 8, 2020
Categories: Chaos Engineering

Introduction

Chaos Engineering is a disciplined approach to finding failures before they become outages. You literally "break things on purpose" to learn how to build more resilient systems.

If you're curious to try Chaos Engineering for yourself, but want to practice in a demo environment first, this tutorial is for you.

In this tutorial, we'll walk through three chaos experiments to test the reliability of our demo app. We'll do this using Gremlin, a SaaS-based platform for running Chaos Engineering experiments simply, safely and securely.

After completing this tutorial, you'll have hands-on experience running chaos experiments in a demo environment and be able to run them with confidence on your own infrastructure.

Prerequisites

To successfully complete this tutorial, you’ll need:

  • An AWS account
  • Command-line access with the AWS CLI, kubectl and eksctl
  • A Gremlin account (request a free trial)

But first, what is Chaos Engineering?

Chaos Engineering is a disciplined approach to identifying failures before they become outages. By testing how a system responds under stress, we can proactively fix vulnerabilities and make our systems more reliable.

So how does this work in practice? The best way to start is by creating a thoughtful, planned chaos experiment to validate expected behavior. First, ask yourself, "What could go wrong?" (For example, what happens when one of the third-party services we rely on goes down?) Then, use the scientific method to create a hypothesis, run a controlled experiment simulating the failure, and measure the impact.

After running your experiment, you'll have one of two outcomes. Either you’ve verified that your system is resilient to the failure you introduced, or you’ve found a problem you need to fix. Both of these are good outcomes. In the first case, you’ve increased your confidence in the system and its behavior. In the second, you can fix the problem before it causes an outage.

Tutorial overview

Will our demo app be resilient in the face of failure or will we experience an outage? Let’s find out with Chaos Engineering.

We’ll use Chaos Engineering to test how our demo app handles the following failure scenarios:

  1. High CPU
  2. Dependency outage
  3. Service container failure

When we inject these failures into the demo app, we'll be able to see if the system maintains its functionality or if its services degrade under stress.

Infrastructure and demo environment

We'll use the open-source microservices-demo application as our demo environment. It provides eCommerce functionality, and we'll refer to it as the Sock Shop going forward.

Getting set up

Set up eksctl and EKS cluster

  1. Follow the official installation instructions for each tool to configure the AWS CLI, kubectl and eksctl.
  2. Launch an EKS cluster in your chosen region. For example:
    BASH
    
    eksctl create cluster --name sockshop-eks-cluster --version 1.15 --region us-west-2 --nodegroup-name standard-workers --node-type t3.medium --nodes 3 --nodes-min 1 --nodes-max 4
    
    • Note: It’s important to use --version 1.15. Kubernetes 1.16 removed several deprecated API versions (such as extensions/v1beta1 for Deployments), which breaks Sock Shop’s YAML.
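
Once eksctl finishes, it can help to confirm that all three worker nodes have registered before moving on. The sketch below is one way to do that. count_ready_nodes is a hypothetical helper name, and it assumes the default column layout of kubectl get nodes --no-headers, where the second column is the node status.

```bash
# Count nodes whose STATUS column is exactly "Ready".
# Reads `kubectl get nodes --no-headers` output on stdin.
# Nodes reported as "NotReady" or "Ready,SchedulingDisabled" are not counted.
count_ready_nodes() {
  awk '$2 == "Ready" { n++ } END { print n + 0 }'
}

# Usage (needs a kubeconfig that points at the new cluster):
#   kubectl get nodes --no-headers | count_ready_nodes
```

With the example cluster above, the count should reach 3 once every node has joined.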

Deploy Sock Shop

  1. Clone the repo below and go into the deploy/kubernetes folder.
    BASH
    
    git clone https://github.com/microservices-demo/microservices-demo
    cd microservices-demo/deploy/kubernetes
    
  2. Create the sock-shop namespace.
    BASH
    
    kubectl create namespace sock-shop
    
  3. Using kubectl, deploy Sock Shop.
    BASH
    
    kubectl apply -f complete-demo.yaml
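
The pods take a few minutes to start after the manifest is applied. As a rough sketch, kubectl wait can block until they are all ready. wait_for_pods is a hypothetical wrapper name, and the 300s default timeout is an assumption rather than a value from this tutorial.

```bash
# Block until every pod in the namespace reports Ready, or the timeout expires.
# $1: namespace (defaults to sock-shop), $2: timeout (defaults to 300s).
wait_for_pods() {
  kubectl wait --for=condition=Ready pods --all \
    --namespace "${1:-sock-shop}" --timeout="${2:-300s}"
}

# Usage:
#   wait_for_pods                # wait for the Sock Shop pods
#   wait_for_pods sock-shop 10m  # same, with a longer timeout
```

A non-zero exit status means at least one pod did not become ready in time, so it is worth checking kubectl get pods -n sock-shop before going further.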
    

Access Gremlin

  1. If you haven't already, request a free trial of Gremlin.
  2. Activate your account using the link sent to your email.

Retrieve your Team ID and Secret Key

To install the Gremlin Kubernetes agent, you will need your Gremlin Team ID and Secret Key. If you already know what those are, you can skip ahead to installing the agent. If you don’t know what your Team ID and Secret Key are, you can get them from the Gremlin web app.

  1. Visit the Teams page in Gremlin, and then click on your team’s name in the list.
  2. On the Teams screen, click on Configuration. Make a note of your Team ID.
  3. If you don’t know your Secret Key, you will need to reset it. Click the Reset button. You’ll get a popup reminding you that any running agents using the current Secret Key will need to be configured with the new key. Hit Continue. Next you’ll see a popup screen that will show you the new Secret Key. Make a note of it.

Create a Kubernetes secret from Gremlin certificates

(Skip this step if you are using secret-based authentication)

  1. Download the Gremlin certificates (you need at least team manager access)
  2. Unzip certificates.zip
  3. Rename the files in the certificates folder. Team Name.pub_cert.pem becomes gremlin.cert. Team Name.priv_key.pem becomes gremlin.key.
  4. Create a gremlin namespace: kubectl create namespace gremlin
  5. Create a Kubernetes secret by running the following:
BASH

kubectl -n gremlin create secret generic gremlin-team-cert --from-file=/path/to/gremlin.cert --from-file=/path/to/gremlin.key
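
The rename in step 3 can be scripted, which avoids typos in file names that contain spaces. This is a minimal sketch: rename_gremlin_certs is a hypothetical helper, and it assumes the unzipped folder holds exactly one *.pub_cert.pem file and one *.priv_key.pem file.

```bash
# Rename the downloaded certificate files inside directory $1.
# Runs in a subshell, so the caller's working directory is unchanged.
# Succeeds only if both gremlin.cert and gremlin.key exist afterwards.
rename_gremlin_certs() (
  cd "$1" || exit 1
  for f in *.pub_cert.pem; do
    [ -e "$f" ] && mv -- "$f" gremlin.cert
  done
  for f in *.priv_key.pem; do
    [ -e "$f" ] && mv -- "$f" gremlin.key
  done
  [ -f gremlin.cert ] && [ -f gremlin.key ]
)

# Usage:
#   rename_gremlin_certs /path/to/certificates
```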


Download and apply the Gremlin configuration manifest

  1. Download the Gremlin configuration manifest by running the following:
    BASH
    
    wget https://k8s.gremlin.com/resources/gremlin-conf.yaml
    
  2. Open the file and update the following:
    • Replace the following line with your team ID: "YOUR TEAM ID GOES HERE"
    • Replace the following line with your team secret: "YOUR TEAM SECRET GOES HERE" (If you are using certificate-based authentication, remove this line.)
    • Replace the following line with a string that you will use to identify your cluster: "YOUR UNIQUE CLUSTER NAME GOES HERE"
  3. Apply the manifest with this command:
    BASH
    
    kubectl apply -f /path/to/gremlin-conf.yaml
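
If you would rather not edit the file by hand, the placeholder substitution can be sketched with sed. fill_gremlin_conf is a hypothetical helper. It assumes the three placeholder strings appear verbatim in the downloaded file and that your values contain none of the characters |, & or \, which are special inside these sed expressions. It does not remove the secret line, so it only covers secret-based authentication.

```bash
# Print the manifest at $1 with the three placeholders replaced.
# $2: team ID, $3: team secret, $4: cluster name.
# The original file is left untouched; the result goes to stdout.
fill_gremlin_conf() {
  sed -e "s|YOUR TEAM ID GOES HERE|$2|" \
      -e "s|YOUR TEAM SECRET GOES HERE|$3|" \
      -e "s|YOUR UNIQUE CLUSTER NAME GOES HERE|$4|" \
      "$1"
}

# Usage: write the result to a new file, then apply that file.
#   fill_gremlin_conf gremlin-conf.yaml "$TEAM_ID" "$TEAM_SECRET" sockshop-eks-cluster > gremlin-conf.filled.yaml
#   kubectl apply -f gremlin-conf.filled.yaml
```

Writing to a separate file keeps the downloaded manifest as a clean template, and keeps the secret out of your shell history if the values come from environment variables.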
    

Download and apply the Gremlin agent manifest

If you are using certificate-based authentication:

  1. Download and apply the Gremlin agent manifest for your Kubernetes cluster by running the following:
BASH

kubectl apply -f https://k8s.gremlin.com/resources/gremlin-client.yaml

If you are using secret-based authentication:

  1. Download and apply the Gremlin agent manifest for your Kubernetes cluster by running the following:
BASH

kubectl apply -f https://k8s.gremlin.com/resources/gremlin-client-secret.yaml

Enabling Gremlin on the Kubernetes Master

Most Kubernetes deployments configure master nodes with the node-role.kubernetes.io/master:NoSchedule taint. You can run the following command to see if any of your nodes have this taint:

SHELL

kubectl get no -o=custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
NAME      TAINTS
kube-01   [map[effect:NoSchedule key:node-role.kubernetes.io/master]]
kube-02   

If you wish to install Gremlin on a Kubernetes master that has been tainted, add a tolerations section to the PodSpec of the Gremlin Agent Manifest.

YAML

tolerations:
  - key: node-role.kubernetes.io/master
    operator: Exists
    effect: NoSchedule

You will need to reapply the Gremlin agent manifest after making this change.
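
On a cluster with many nodes, the taint listing shown earlier can be filtered down to just the affected node names. The sketch below assumes the two-column custom-columns output from that command. tainted_masters is a hypothetical helper name.

```bash
# Print the names of nodes that carry a taint with the master key.
# Reads the NAME/TAINTS custom-columns output on stdin and skips the header row.
tainted_masters() {
  awk 'NR > 1 && index($0, "key:node-role.kubernetes.io/master") { print $1 }'
}

# Usage:
#   kubectl get no -o=custom-columns=NAME:.metadata.name,TAINTS:.spec.taints | tainted_masters
```

If the filter prints nothing, no node carries the master taint and the tolerations section is not needed.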

Download and apply the K8s agent manifest

If you are using certificate-based authentication:

  1. Download and apply the k8s agent manifest by running:
BASH

kubectl apply -f https://k8s.gremlin.com/resources/gremlin-chao.yaml

If you are using secret-based authentication:

  1. Download and apply the k8s agent manifest by running:
BASH

kubectl apply -f https://k8s.gremlin.com/resources/gremlin-chao-secret.yaml

Access monitoring

  1. Open CloudWatch in the AWS Console.
  2. On the CloudWatch Dashboard, click on Overview and select Container Insights.
  3. Make sure EKS Clusters is selected.

For more information on using Container Insights, see the Amazon CloudWatch Container Insights documentation.

Access Sock Shop

  1. Log in to the bastion host, and run:
    BASH
    
    kubectl get svc -o wide -n sock-shop | grep LoadBalancer
    
  2. Copy the load balancer’s DNS name.
    • Sample: a34.us-east-1.elb.amazonaws.com
  3. Paste this DNS name into a browser to access the Sock Shop front-end.
  4. Navigate around to get a feel for all the functions of the shop. Things to try out:
    • Register and log in
    • View various items
    • Add items to cart
    • Remove items from cart
    • Check out items
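
Instead of copying the DNS name by hand in step 2, the same kubectl output can be filtered for it. This is a sketch that assumes the default kubectl get svc -o wide column order, where TYPE is the second column and EXTERNAL-IP is the fourth. lb_hostnames is a hypothetical helper name.

```bash
# Print the EXTERNAL-IP column for every service of type LoadBalancer.
# Reads `kubectl get svc -o wide` output on stdin.
lb_hostnames() {
  awk '$2 == "LoadBalancer" { print $4 }'
}

# Usage:
#   kubectl get svc -o wide -n sock-shop | lb_hostnames
```

A value of <pending> means the load balancer is still being provisioned, so wait a minute and run it again.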

Run chaos experiments

Experiment 1: Validate auto scaling on CPU load

For this first experiment, we will check to see if this cluster has autoscaling policies dialed in correctly.

  1. Select Recommended Scenarios in Gremlin, then click View Details for the “Validate Auto Scaling” scenario.
  2. Scroll to the bottom and click Add targets and run.
  3. Select all 3 hosts found and click Run Scenario.

Questions

In the AWS Console, look at Container Insights:

  1. Was this failure detected?
  2. Did this failure result in the expected behavior?
  3. Would the service be able to handle this failure?

Remediation: Set up auto scaling

  1. Go to the AWS Console and select EC2 from Services.
  2. On the left navigation bar, select Auto Scaling Groups.
  3. Each cluster gets its own Auto Scaling group. Select the one you need, and then select Scaling Policies in the lower navigation.
  4. Select Add Policy, then Create a simple scaling policy.
    • We will be creating two of these: one to scale up and one to scale back down to the usual 3 nodes.
  5. Name the policy “Cluster-ScaleUp” and select Create New Alarm. Configure the alarm to go off when CPU utilization is greater than or equal to 13% for at least 1 minute, and name it “Cluster-ScaleUp.”
  6. Press Create Alarm. You will be taken back to finish editing the policy.
  7. Set the policy to add 1 instance and then wait 120 seconds before the next scaling activity. Press Create when finished.
  8. Repeat the steps above for a second policy called “Cluster-ScaleDown,” creating a new alarm for it. This alarm should go off when CPU utilization is less than or equal to (<=) 13% over a 15-minute period.
    • Customize your scenario if you would like to see more auto scaling events kick off.
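
The console steps above could also be scripted with the AWS CLI. The sketch below is one possible translation, not the method this tutorial was written against. The helper names are hypothetical, the Auto Scaling group name has to be looked up for your own cluster, and the 13% threshold is taken from the steps above.

```bash
# Create a simple scaling policy and print its ARN.
# $1: Auto Scaling group name, $2: policy name,
# $3: number of instances to add, $4: cooldown in seconds.
create_simple_policy() {
  aws autoscaling put-scaling-policy \
    --auto-scaling-group-name "$1" \
    --policy-name "$2" \
    --policy-type SimpleScaling \
    --adjustment-type ChangeInCapacity \
    --scaling-adjustment="$3" \
    --cooldown "$4" \
    --query PolicyARN --output text
}

# Create a CloudWatch alarm on the group's average CPU utilization (13%).
# $1: Auto Scaling group name, $2: alarm name, $3: comparison operator,
# $4: period in seconds, $5: ARN of the policy to trigger.
create_cpu_alarm() {
  aws cloudwatch put-metric-alarm \
    --alarm-name "$2" \
    --namespace AWS/EC2 --metric-name CPUUtilization --statistic Average \
    --dimensions "Name=AutoScalingGroupName,Value=$1" \
    --comparison-operator "$3" --threshold 13 \
    --period "$4" --evaluation-periods 1 \
    --alarm-actions "$5"
}

# Usage for the scale-up policy (ASG_NAME is a placeholder you must set):
#   arn=$(create_simple_policy "$ASG_NAME" Cluster-ScaleUp 1 120)
#   create_cpu_alarm "$ASG_NAME" Cluster-ScaleUp GreaterThanOrEqualToThreshold 60 "$arn"
```

The scale-down pair would follow the same pattern with a negative adjustment, the LessThanOrEqualToThreshold operator and a 900-second period.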

Experiment 2: Dependency outage

For this experiment, we will find out what happens when a dependency becomes unavailable while the primary service is still making requests to it.

  1. In Gremlin, create a new attack.
  2. Select the Containers tab.
  3. In the Search bar, look up carts-db.
  4. Scroll down and click Choose a Gremlin.
  5. Select Network -> Blackhole Gremlin.
  6. Change the length of the attack to 300 seconds.
  7. Click Unleash Gremlin.
  8. As the attack is running, try the following:
    • Add items to the cart
    • Remove items from the cart
    • Update quantity of items in cart

Questions:

  1. Is there any customer impact?
  2. Are systems recovering gracefully?
  3. Is there any way to mitigate this?
    • In Gremlin, click Halt All Attacks to stop this attack.
  4. Did systems recover?
  5. What did we learn?

Experiment 3: Service container failure

For this experiment, we will test what happens when a container fails. In a containerized environment, the orchestrator can often recover automatically, but detecting and fixing the issue takes time, which can result in a partial outage.

  1. In Gremlin, create a new attack.
  2. Select the Containers tab.
  3. In the Search bar, look up carts-db.
  4. Scroll down and click Choose a Gremlin.
  5. Select State -> Shutdown Gremlin.
  6. Switch off Reboot.
  7. Click Unleash Gremlin.
  8. As the attack is running, try the following:
    • Add items to cart
    • Access items in cart
    • Remove items in cart
    • Check out

Questions:

  1. Is there any customer impact?
  2. Are systems recovering gracefully?
    • How long did it take?
    • Is the full customer experience restored?
  3. How might you mitigate this?
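
To put a number on "How long did it take?", you can summarize a log of timestamped HTTP status codes collected while the attack ran. The sketch below assumes each log line has the form "<time> <HTTP status>", which is an illustrative format rather than anything Gremlin produces. summarize_probe is a hypothetical helper name.

```bash
# Summarize "<time> <HTTP status>" lines read from stdin.
# Counts total requests and failed requests (any status outside 200-399),
# and reports the times of the first and last failure when there is one.
summarize_probe() {
  awk '
    { total++ }
    $2 < 200 || $2 >= 400 {
      failed++
      if (t_first == "") t_first = $1
      t_last = $1
    }
    END {
      printf "total=%d failed=%d", total, failed
      if (failed) printf " first_failure=%s last_failure=%s", t_first, t_last
      printf "\n"
    }'
}

# Usage:
#   summarize_probe < probe.log
```

The gap between the first and last failure gives a rough upper bound on how long customers were affected, at the resolution of your probe interval.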

Create your own experiment

Now that you’ve had a chance to run some pre-planned experiments, you can create your own experiment from start to finish. There is no wrong way to create an experiment, but it’s important to go through the full thought process.

How to create a chaos experiment:

  • Create a hypothesis
  • Contain the blast radius
  • Run the experiment
  • Measure the impact
  • Share results

Questions

  1. Was this failure detected?
  2. Did this failure have customer impact?
    • If so, what was it?
  3. Was the impact of this failure expected? Did it match your hypothesis?
    • If not, what happened instead?
  4. Can this failure be handled or mitigated?

Now increase the reliability of your own systems

While running experiments on a demo app is admittedly pretty fun, it doesn't improve the reliability of your systems. Start running experiments on your own infrastructure to test and validate your systems' response to failure and improve overall reliability.

Check out our documentation to install Gremlin anywhere, including bare-metal, on-prem, VMs, containers, serverless and Kubernetes environments.

If you'd like to try all Gremlin Attacks, including Packet Loss and Memory, request a demo and we'll set you up with a free trial of Gremlin.
