Gremlin is a simple, safe and secure service for performing Chaos Engineering experiments through a SaaS-based platform.
This tutorial will show you how to install the Gremlin agent on Fedora hosts, and how to perform your first Chaos Engineering experiment, a CPU attack.
Before you begin, you will need a Fedora host on which you have sudo or root access, and a Gremlin account (sign up here). This tutorial was tested with Fedora 30, but should work with other versions.
Connect to your host with ssh:
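The command for this step is not shown here. A typical invocation looks like the sketch below; the login name and address are hypothetical placeholders, so substitute your own.

```shell
# Hypothetical placeholders: replace with your own login name and host address.
FEDORA_USER="your_username"
FEDORA_HOST="your_server_ip"

# Build the SSH command and print it so the target can be double-checked.
SSH_COMMAND="ssh ${FEDORA_USER}@${FEDORA_HOST}"
echo "${SSH_COMMAND}"
```

With real values filled in, run the printed command to open the session.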
Add the Gremlin repo:
sudo curl https://rpm.gremlin.com/gremlin.repo -o /etc/yum.repos.d/gremlin.repo
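Before installing, it can be worth confirming that the repo file was actually written. A minimal sketch, assuming the same destination path used in the curl command; the variable names are illustrative:

```shell
# Path taken from the curl command in this tutorial.
REPO_FILE="/etc/yum.repos.d/gremlin.repo"

# -s is true only when the file exists and is not empty.
if [ -s "${REPO_FILE}" ]; then
  REPO_STATUS="present"
else
  REPO_STATUS="missing or empty"
fi
echo "Gremlin repo file: ${REPO_STATUS}"
```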
Then install the Gremlin client and daemon.
sudo yum install -y gremlin gremlind
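To confirm that both packages landed, you can query the RPM database. This is a sketch rather than part of the original steps; the function name check_packages is made up for illustration, and the package names are the two installed above.

```shell
# check_packages PKG...: report whether each RPM package is installed.
check_packages() {
  if ! command -v rpm >/dev/null 2>&1; then
    echo "rpm not found; run this check on the Fedora host"
    return 0
  fi
  for pkg in "$@"; do
    if rpm -q "$pkg" >/dev/null 2>&1; then
      echo "$pkg: installed"
    else
      echo "$pkg: not installed"
    fi
  done
}

check_packages gremlin gremlind
```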
The next step is to configure the Gremlin agent with your Gremlin Team ID and Secret Key. Log in to the Gremlin web UI with your email address and password, then go to Company Settings and click Teams.
Click on your team in the list. Then click on Configuration.
To configure the Gremlin agent you’ll need the Team ID and Secret Key. Both are generated automatically when your company is created. The Team ID is displayed on this screen, but the Secret Key is hidden. If you don’t know your Secret Key, you can hit the Reset button to create a new one.
Resetting the key will require you to update it on any other clients you have running. After hitting Reset, you’ll see a popup explaining this and asking for confirmation. Hit Continue.
Next you’ll see a window where you can copy the Secret Key:
Be sure to note your Secret Key, as this is the only time you will be able to view it. If you lose it, you’ll need to hit the Reset button again to generate a new one.
Now that we have the Gremlin Team ID and Secret Key, we can finish configuring the client. Go back to your SSH session on the Fedora host and run this command:
Input your Team ID and Secret Key when you’re prompted for them.
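The command itself is not shown in this copy of the tutorial. Gremlin's agent CLI provides an interactive `gremlin init` subcommand for this purpose, so the sketch below assumes that is the command intended. The wrapper function, the PATH check, and the use of sudo are additions for illustration.

```shell
# Assumption: `gremlin init` is the interactive setup command meant above.
run_gremlin_init() {
  if command -v gremlin >/dev/null 2>&1; then
    # Prompts for the Team ID and Secret Key.
    sudo gremlin init
  else
    echo "gremlin not found; install the gremlin and gremlind packages first"
    return 1
  fi
}
```

Call `run_gremlin_init` on the Fedora host once the packages are installed.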
The setup is now complete and you’re ready to begin running Chaos Engineering experiments!
On your Fedora host, run the “top” command. This is how we’ll view the CPU usage for this experiment.
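If you would rather capture CPU usage as text, for example to compare readings before and during the attack, top can also run non-interactively. A small sketch, assuming the procps-ng top that Fedora ships; the function name cpu_snapshot is illustrative:

```shell
# cpu_snapshot: print the summary and the busiest processes once, then exit.
cpu_snapshot() {
  if command -v top >/dev/null 2>&1; then
    # -b: batch (non-interactive) mode, -n 1: a single iteration.
    top -b -n 1 | head -n 15
  else
    echo "top not found"
  fi
}

cpu_snapshot
```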
In the Gremlin web UI, click the Attacks link in the left navigation bar, and then click the New Attack button.
There are several ways to target which hosts or containers you want to attack. The default is Hosts, and we’ll use that. Click the Exact button and select your Fedora host.
Scroll down and click Choose a Gremlin. Select Resource and then CPU.
Scroll down again to enter the settings for the attack. For this first attack we’ll set the length to 180 seconds, select All Cores from the pulldown menu, and leave the CPU percentage at the default setting. Then click Unleash Gremlin, which will start the attack.
You’ll then see the attack listed as Running.
Go back to your SSH session on the Fedora host and examine your top output. Once the attack changes to a Running state, you should see much more CPU activity than previously.
The attack will end after the 180 seconds have passed. You’ll then see it listed on the Attacks page as Completed.
It’s a recommended practice to define abort conditions before running Chaos Engineering experiments. Abort conditions are signals that should make us halt an experiment immediately because they put the safety of our systems in doubt. The abort conditions for an experiment could be defined as an increase in error rate, an increase in latency, or specific alerts we receive.
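An abort condition is easiest to act on when it is written down as a concrete threshold. The sketch below is one hypothetical way to express "error rate rose by more than N percent over baseline". The function name and the sample numbers are illustrative, and fetching the real values from your monitoring system is left out.

```shell
# should_abort CURRENT BASELINE MAX_INCREASE_PCT
# Succeeds (exit status 0) when CURRENT exceeds BASELINE by more than
# MAX_INCREASE_PCT percent; awk handles the floating-point comparison.
should_abort() {
  awk -v cur="$1" -v base="$2" -v pct="$3" \
    'BEGIN { exit !(cur > base * (1 + pct / 100)) }'
}

# Hypothetical readings: 5.0% errors now, 2.0% baseline, 50% allowed increase.
if should_abort 5.0 2.0 50; then
  echo "abort: halt the attack"
else
  echo "ok: continue the experiment"
fi
```

With these sample values the limit is 3.0%, so the check reports that the attack should be halted.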
For abort conditions to be useful, our Chaos Engineering tool needs to allow us to halt experiments immediately. Gremlin allows us to halt individual attacks, or all running attacks.
In the Gremlin UI go to the Attacks page and hover over the three dots on the right of the attack you just ran. Click on Rerun Attack.
This will put you back in the targeting interface. The attack will default to all of the same settings you used last time, so just scroll down to the bottom of the screen and click Unleash Gremlin.
Once the attack is in the Running state, there are two options for halting it. We can either click the Halt button to the right of the attack, or the Halt All Attacks button at the top of the screen.
In this case either would work, as we only have one attack running, but in some situations we might want to halt one attack without impacting others.
The ability to quickly halt all running experiments is an important part of Chaos Engineering, and allows us to experiment in a safe way.
At this point you have a Fedora host running with Gremlin, you’ve run your first Chaos Engineering attack, and you’ve learned how to halt running attacks. Congrats! For next steps you could try running some other types of attacks, like Memory, Latency or DNS.
To learn more about Gremlin you can read the documentation, which explains the other types of Chaos Engineering attacks you can perform. To learn more about Chaos Engineering join our Chaos Engineering Slack, and read more tutorials on our Community page.
Gremlin empowers you to proactively root out failure before it causes downtime. See how you can harness chaos to build resilient systems by requesting a demo of Gremlin.