Gremlin is a simple, safe and secure service for performing Chaos Engineering experiments through a SaaS-based platform.
This tutorial will show you how to install the Gremlin agent on Ubuntu 18.04 hosts, and how to perform your first Chaos Engineering experiment, a CPU attack.
Connect to your host with ssh, then configure the package manager to use the Gremlin repo:
echo "deb https://deb.gremlin.com/ release non-free" | sudo tee /etc/apt/sources.list.d/gremlin.list
Import the GPG key:
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 9CDB294B29A5B1E2E00C24C022E8EF3461A50EF6
Then install the Gremlin agent:
sudo apt-get update && sudo apt-get install -y gremlin gremlind
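Before moving on, it can be worth confirming that the install put the expected tools on the host. The sketch below is not part of Gremlin's tooling: `missing_commands` is a hypothetical helper, and it assumes the `gremlin` and `gremlind` packages each install a command of the same name on your PATH, which may not hold for every agent version.

```shell
#!/bin/sh
# Print the name of every listed command that is NOT found on the PATH.
# Prints nothing when all of them are present.
missing_commands() {
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "$cmd"
        fi
    done
}

# Example (assumed command names): missing_commands gremlin gremlind
```

If the example call prints nothing, both commands were found. Any name it prints is one the shell could not locate.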
Click on your team in the list. You’ll be taken to the team details page.
To configure the Gremlin agent you’ll need the Team ID and Secret Key. The Team ID is created automatically. To create the Secret Key, hit the Create button. You’ll see a window where you can copy the Secret Key:
Make a note of your Secret Key, as this is the only time you will be able to view it. If you lose it, you’ll need to hit the Reset button and generate a new one.
Now that we have the Gremlin Team ID and Secret Key, we can finish configuring the agent. Go back to your SSH session on the Ubuntu host and run this command:
Input your Team ID and Secret Key when you’re prompted for them.
The setup is now complete and you’re ready to begin running Chaos Engineering experiments.
On your Ubuntu host, run the “top” command. This is how we’ll view the CPU usage for this experiment.
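If you would like a single number to log alongside the top display, one option is to compute CPU usage yourself from two samples of the aggregate `cpu` line in `/proc/stat`. This is a minimal sketch rather than anything Gremlin provides: the helper names `cpu_busy_pct` and `sample_cpu_busy` are made up for illustration, and the calculation treats idle plus iowait time as "not busy".

```shell
#!/bin/sh
# Compute the percentage of CPU time spent busy between two samples of the
# aggregate "cpu" line from /proc/stat.
# Fields after the "cpu" label: user nice system idle iowait irq softirq steal ...
cpu_busy_pct() {
    # $1 = earlier "cpu ..." line, $2 = later "cpu ..." line
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            idle[NR]  = $5 + $6                          # idle + iowait
            total[NR] = $2 + $3 + $4 + $5 + $6 + $7 + $8 + $9
        }
        END {
            dt = total[2] - total[1]
            if (dt <= 0) { print 0; exit }               # no time elapsed
            printf "%d\n", 100 * (dt - (idle[2] - idle[1])) / dt
        }'
}

# Take two live samples one second apart (Linux only) and print the result.
sample_cpu_busy() {
    first=$(head -n 1 /proc/stat)
    sleep 1
    second=$(head -n 1 /proc/stat)
    cpu_busy_pct "$first" "$second"
}
```

Calling `sample_cpu_busy` before and during the attack gives two whole-number percentages to compare, for example in a loop that appends them to a log file.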
In the Gremlin web UI, click the Attack link in the left navigation bar, and then click the New Attack button. Select your Ubuntu host as the target:
Scroll down and click Choose a Gremlin. The CPU attack should be selected by default. If not, click on Resource and then CPU.
Scroll down again to enter the settings for the attack. For this first attack we’ll set the length to 180 seconds, select All Cores, and leave the CPU percentage at the default setting. Then click Unleash Gremlin, which will start the attack.
You’ll then see the attack listed as Active.
Go back to your SSH session on the Ubuntu host and examine your top output. Once the attack changes to a Running state, you should see much more CPU activity than previously.
The attack will end after the 180 seconds have passed. You’ll then see it listed in Gremlin as Completed.
It’s a recommended practice to define abort conditions before running Chaos Engineering experiments. Abort conditions are things that would make us want to halt an experiment immediately, because we are concerned about the safety of our systems. Abort conditions could be defined as an increase in error rate, an increase in latency, or specific alerts we receive. For abort conditions to be useful, our Chaos Engineering tool needs to allow us to halt experiments immediately. Gremlin allows us to halt individual attacks, or all running attacks.
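Writing abort conditions down as explicit thresholds makes them easier to act on in the moment. The sketch below shows one way that might look. Everything in it is an assumption for illustration: the threshold values, the variable names, and the `should_abort` and `check_abort` helpers are hypothetical, and you would supply the error rate and latency from your own monitoring. It only reports a decision; halting the attack is still done in Gremlin.

```shell
#!/bin/sh
# Hypothetical thresholds; choose values that fit your own service.
MAX_ERROR_PCT=5       # abort if more than 5% of requests fail
MAX_LATENCY_MS=500    # abort if latency exceeds 500 ms

# Return 0 (true) when either measurement breaches its threshold.
# $1 = current error rate in whole percent, $2 = current latency in ms
should_abort() {
    [ "$1" -gt "$MAX_ERROR_PCT" ] || [ "$2" -gt "$MAX_LATENCY_MS" ]
}

# Print a decision for one pair of measurements.
check_abort() {
    if should_abort "$1" "$2"; then
        echo "ABORT: halt the experiment now"
    else
        echo "OK: within limits"
    fi
}
```

For example, `check_abort 2 120` reports that the experiment is within limits, while `check_abort 2 900` reports an abort because the latency threshold is exceeded.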
In the Gremlin UI go to Attack and New Attack, and launch another CPU attack with the same settings as last time. Once it’s running you’ll see it listed again under the Active attacks.
Once the attack is in the Running state, there are two options for halting it. We can either click the Halt button to the right of the attack, or the Halt All Attacks button. In this case either would work, as we only have one attack running, but in some situations we might want to halt one attack without impacting others.
The ability to quickly halt all running experiments is an important part of Chaos Engineering, and allows us to experiment in a safe way.
At this point you have an Ubuntu 18.04 host running with Gremlin, you’ve run your first Chaos Engineering attacks, and you’ve learned how to halt running attacks. Congrats!
To learn more about Gremlin, you can read the documentation, which explains the other types of Chaos Engineering attacks you can perform. To learn more about Chaos Engineering, join our Chaos Engineering Slack and read more tutorials on our Community page.