Chaos Engineering with Cassandra

Ana M Medina
Sr. Chaos Engineer
Last Updated: December 23, 2019
Categories: Chaos Engineering

Introduction

Gremlin is a simple, safe and secure service for performing Chaos Engineering experiments through a SaaS-based platform. Apache Cassandra is an open source, distributed and decentralized storage system that offers scalability and high availability without compromising performance.

This tutorial will teach you how to do Chaos Engineering on Cassandra using Gremlin.

Overview

This tutorial will show you how to use Gremlin and Cassandra together:

  • Step 1 - Install Gremlin
  • Step 2 - Install Cassandra
  • Step 3 - Add data to Cassandra using cqlsh
  • Step 4 - Install iostat
  • Step 5 - Run a Disk Resource Chaos Engineering Experiment
  • Step 6 - Run an IO Chaos Engineering Experiment
  • Step 7 - Expanding the Blast Radius of an IO Chaos Engineering Experiment

Prerequisites

Before you begin this tutorial, you’ll need the following:

Step 1 - Install Gremlin

First, ssh into your host and add the gremlin repo:

BASH

ssh username@your_server_ip

BASH

echo "deb https://deb.gremlin.com/ release non-free" | sudo tee /etc/apt/sources.list.d/gremlin.list

Import the GPG key:

BASH

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys C81FC2F43A48B25808F9583BDFF170F324D41134 9CDB294B29A5B1E2E00C24C022E8EF3461A50EF6

Install the Gremlin agent:

BASH

sudo apt-get update && sudo apt-get install -y gremlin gremlind

First, make sure you have a Gremlin account (sign up here). Next, grab the credentials needed to authenticate the agent we just installed. Log in to the Gremlin App using your Company name and sign-on credentials. (These were emailed to you when you signed up to start using Gremlin.) Click the circular avatar in the right corner and select “Company Settings”.

Then, click on the team you need. The ID you’re looking for is listed under Configuration as “Team ID”. Make a note of your Gremlin Secret and Gremlin Team ID.

Now, on your host, we will initialize Gremlin and follow the prompts.

BASH

gremlin init

Use the credentials you have saved from the last step.

Step 2 - Install Cassandra

In this step, you’ll install Cassandra on your host. First, add the Apache Cassandra repository:

BASH

echo "deb http://www.apache.org/dist/cassandra/debian 311x main" | sudo tee -a /etc/apt/sources.list.d/cassandra.sources.list

Add the Apache Cassandra repository keys:

BASH

curl https://www.apache.org/dist/cassandra/KEYS | sudo apt-key add -

Update the repositories:

BASH

sudo apt-get update

Install Cassandra:

BASH

sudo apt-get install cassandra

Verify that it has been set up properly and has started:

BASH

nodetool status

Your output should list your node with the status UN (Up and Normal).
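For scripted setups it can help to wait until the node is actually up before moving on. The following is a minimal sketch, assuming `nodetool` is on your PATH and that a healthy node is listed with `UN` in the first column of the status output. The function names, retry count and delay are hypothetical choices made for illustration.

```shell
# Count nodes reported as Up/Normal ("UN" in the first column).
# Reads `nodetool status` output on stdin, so it can be tried without a cluster.
count_up_normal() {
    awk '$1 == "UN" { n++ } END { print n + 0 }'
}

# Poll until at least one node is Up/Normal, or give up after a number of tries.
# Assumption: the defaults (30 tries, 2 seconds apart) are arbitrary.
wait_for_cassandra() {
    tries="${1:-30}"
    delay="${2:-2}"
    i=0
    while [ "$i" -lt "$tries" ]; do
        up="$(nodetool status 2>/dev/null | count_up_normal)"
        if [ "$up" -ge 1 ]; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}
```

Calling `wait_for_cassandra && echo "Cassandra is up"` after the install would block until the node reports as up, or return a non-zero status once the retries run out.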

Step 3 - Add data to Cassandra using cqlsh

In this step, you’ll add some data to Cassandra using the Cassandra Query Language (CQL). For this tutorial we are going to use the CLI tool, <span class="code-class-custom">cqlsh</span>. By default, Cassandra sets up a “Test Cluster” for us.

Start the CLI:

BASH

cqlsh

You can learn about the default configuration via:

SQL

DESCRIBE CLUSTER

We are going to create our first Keyspace. In Cassandra a Keyspace is a namespace that defines data replication on nodes. We are going to be using SimpleStrategy for replication. Read more about it and other options here.

SQL

CREATE KEYSPACE user_db WITH replication = {'class':'SimpleStrategy', 'replication_factor' : 3};

Verify the creation:

SQL

DESCRIBE KEYSPACES

Select the newly created keyspace:

SQL

USE user_db;

Create a table called user:

SQL

CREATE TABLE user (
    id int PRIMARY KEY,
    age int,
    name text,
    surname text
);

Verify the table has been created:

SQL

select * from user;

Now let’s add some data into this table:

SQL

INSERT INTO user (id, age, name, surname) VALUES (1, 21, 'ana', 'medina');
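If you would like more than one row to work with during the experiments, a small generator can produce many INSERT statements at once. This is only a sketch: the function name `generate_inserts` and the placeholder values it writes are made up for illustration, and it assumes `cqlsh` is on your PATH and the `user_db` keyspace and `user` table from this step exist.

```shell
# Emit one CQL INSERT per id in the range [first, last].
# The age, name and surname values are made-up placeholders.
generate_inserts() {
    first="$1"
    last="$2"
    id="$first"
    while [ "$id" -le "$last" ]; do
        printf "INSERT INTO user (id, age, name, surname) VALUES (%d, %d, 'name%d', 'surname%d');\n" \
            "$id" "$((20 + id % 50))" "$id" "$id"
        id=$((id + 1))
    done
}

# Usage, run from your regular shell rather than inside cqlsh:
#   generate_inserts 2 101 | cqlsh -k user_db
```

Because the statements are plain text on standard output, you can inspect them first and only pipe them into `cqlsh` once they look right.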

Verify that the row has been added:

SQL

select * from user;

We should see a table that looks like this:

Step 4 - Install iostat

To exit from <span class="code-class-custom">cqlsh</span>, just type <span class="code-class-custom">exit</span>. We will now set up <span class="code-class-custom">iostat</span>, a Linux command used for monitoring system I/O device loading. It is included when you install sysstat, a performance monitoring tool for Linux. On your host, install sysstat:

BASH

sudo apt install sysstat

Step 5 - Run a Disk Resource Chaos Engineering Experiment

Our first Chaos Engineering experiment is a Disk experiment, which consumes disk space on the host. Our hypothesis is, “When we consume 100% of our disk, we won’t be able to add entries to Cassandra.”

To monitor this experiment, we are going to run:

BASH

while sleep 2; do df --output; done
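To watch a single number instead of the full `df` table, the usage percentage for one mount point can be pulled out with `awk`. This is a minimal sketch, assuming GNU `df` (which supports `--output=pcent,target`) and mount points without spaces. The function name `disk_use_percent` is hypothetical.

```shell
# Print the Use% value (digits only) for one mount point.
# Reads `df --output=pcent,target` output on stdin and skips the header line.
disk_use_percent() {
    mount="$1"
    awk -v m="$mount" 'NR > 1 && $2 == m { sub(/%/, "", $1); print $1 }'
}

# Usage: print the usage of the root filesystem every 2 seconds.
#   while sleep 2; do df --output=pcent,target | disk_use_percent /; done
```

During the disk attack the printed value should climb toward the percentage you configured in the Gremlin UI.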

What we see below is the steady state of the application:

Going back to the Gremlin UI, select Attacks from the menu on the left and press the green “New Attack” button. We will be choosing the host we installed Gremlin on:

Next, we will choose the Gremlin. We will run a Resource attack: select “Resource” and choose “Disk” from the options. Set the length to 200 seconds and the volume to consume to 100 percent. Then press “Show Advanced Options”, change the number of workers to 4 and set the block size to 10000KB. Finally, press the green button to unleash the Gremlin.

Experiment Results

We can see on our monitoring that /dev/xvda1 is running at 100% consumption.

Were you able to add entries into Cassandra?

Can you browse all of them?

Step 6 - Run an IO Chaos Engineering Experiment

We are going to create our second Chaos Engineering experiment. Performance is something we constantly need to keep in mind when using tools like Cassandra. We are going to run a Chaos Engineering experiment to learn more about how this host and this Cassandra setup hold up under heavy disk reads and writes. Our hypothesis is, “When we consume I/O resources, Cassandra will still be usable and we will monitor this with <span class="code-class-custom">iostat</span> too.”

Going back to the Gremlin UI, select Attacks from the menu on the left and press the green “New Attack” button. We will be choosing the host from the list.

Next, we will choose the Gremlin. We will run a Resource attack: select “Resource” and choose “IO” from the options. Set the length to 300 seconds and keep the default Root Directory of /tmp and Mode of rw (reads and writes). Then select “Show Advanced Options” and set it to run 100 Workers (the number of IO workers to run concurrently), with a Block Size (the number of kilobytes (KB) that are read/written at a time) of 8000KB and a Block Count (the number of blocks read/written by workers) of 20. Finally, press the green button to unleash the Gremlin.

As the experiment starts, run <span class="code-class-custom">iostat</span> on your host and have it refresh every second:

BASH

iostat -x 1
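If you only care about how busy one device is, the `%util` column can be filtered out of the `iostat -x` output. The sketch below looks up the column position from the header line, since the column layout differs between sysstat versions. The function name `device_util` is hypothetical, and the device name `xvda` is an assumption based on the volume seen in Step 5.

```shell
# Print the %util value for one device from `iostat -x` output on stdin.
# The column index is taken from each header line that starts with "Device".
device_util() {
    dev="$1"
    awk -v d="$dev" '
        $1 ~ /^Device/ { for (i = 1; i <= NF; i++) if ($i == "%util") col = i; next }
        col && $1 == d { print $col }
    '
}

# Usage: one %util reading per second for the xvda device.
#   iostat -x 1 | device_util xvda
```

Keep in mind that the first report printed by `iostat` covers the time since boot, so the readings that follow it are the ones that reflect the attack.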

Experiment Results

While the experiment and <span class="code-class-custom">iostat</span> were running, we also tried adding a few more entries and saw that they were added without any problems.

We also want to make sure to look at our monitoring of our IO consumption on the host using <span class="code-class-custom">iostat</span>:

Since we saw that this Cassandra setup handled this IO experiment very well, we will run a third Chaos Engineering experiment.

Step 7 - Expanding the Blast Radius of an IO Chaos Engineering Experiment

We are going to expand the Blast Radius of the previous IO experiment. What does that mean? Blast radius is the subset of a system that can be impacted by an attack. We saw what would happen when using a Block Size of <span class="code-class-custom">8000KB</span>, but what if we made the Block Size larger, following the real-world example of uploading files to a file-sharing service? We are going to simulate files of <span class="code-class-custom">50MB</span>. Our hypothesis is, “When we consume more I/O resources, Cassandra will still be usable and we will monitor this with <span class="code-class-custom">iostat</span> too.”

Going back to the Gremlin UI, select Attacks from the menu on the left and press the green “New Attack” button. We will be choosing the host from the list.

Next, we will choose the Gremlin. We will run a Resource attack: select “Resource” and choose “IO” from the options. Set the length to 300 seconds and keep the default Root Directory of /tmp and Mode of <span class="code-class-custom">rw</span> (reads and writes). Then select “Show Advanced Options” and set it to run <span class="code-class-custom">100</span> Workers (the number of IO workers to run concurrently), with a Block Size (the number of kilobytes (KB) that are read/written at a time) of <span class="code-class-custom">50000KB</span> and a Block Count (the number of blocks read/written by workers) of <span class="code-class-custom">20</span>. Finally, press the green button to unleash the Gremlin.
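To get a feel for how much larger this attack is than the one in Step 6, the settings can be compared with some quick arithmetic. This is a rough sketch that assumes each worker reads or writes Block Count blocks of Block Size per pass. That interpretation is an assumption and not something the Gremlin UI text confirms. The function name `per_worker_mb` is hypothetical, and 1MB is treated as 1000KB to match the 50MB and 50000KB figures above.

```shell
# Data handled per worker in one pass, in MB (using 1MB = 1000KB).
# Arguments: block size in KB, block count.
# Assumption: every worker processes block_count blocks of block_size each.
per_worker_mb() {
    echo $(( $1 * $2 / 1000 ))
}

# Usage:
#   per_worker_mb 8000 20    # settings from Step 6
#   per_worker_mb 50000 20   # settings from Step 7
```

Whatever the exact per-pass behaviour is, the Block Size grows from 8000KB to 50000KB, a factor of 6.25, while the worker count and Block Count stay the same.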

Experiment Results

Just like the last experiment, we want to make sure to go back and look at the monitoring we are doing with iostat:

We also want to test how Cassandra handles this Chaos Engineering experiment. We see that we are able to add new entries, but it’s 2 seconds slower than in the last experiment. Chaos Engineering experiments like this allow you to make sure you can handle a high load of users trying to use your application while still giving them a great experience. In the example of a file-sharing service, you want to make sure your users are able to upload files at a timely speed, as well as view and delete files as quickly as possible.
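One way to put a number on a slowdown like this is to time the same insert before and during the attack. The helper below is a minimal sketch, assuming GNU `date` (for nanosecond timestamps via `%N`) and a `cqlsh` that can reach the `user_db` keyspace. The function name `elapsed_ms` and the sample row values are hypothetical.

```shell
# Run a command and print how long it took in milliseconds.
# The command's own output is discarded; its exit status is passed through.
elapsed_ms() {
    start="$(date +%s%N)"
    "$@" >/dev/null 2>&1
    status=$?
    end="$(date +%s%N)"
    echo $(( (end - start) / 1000000 ))
    return "$status"
}

# Usage: time a single insert (the row values are placeholders).
#   elapsed_ms cqlsh -k user_db -e "INSERT INTO user (id, age, name, surname) VALUES (2, 30, 'test', 'user');"
```

Running it a few times at steady state and again while the Gremlin is active gives you two sets of timings to compare against your hypothesis.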

Conclusion

Congrats! You’ve now run a few Chaos Engineering experiments on Cassandra. Were you able to learn something new about your Cassandra configuration? There are many other Chaos Engineering experiments you can run to focus on Cassandra resiliency. One that we see folks run less often is shutting down hosts, which treats hosts as cattle and not pets and verifies your Auto Scaling Groups. We have seen that some folks get scared of shutting down hosts, especially when dealing with data, but you want to make sure you’re constantly ready for all sorts of failure to occur. If you have any questions at all or are wondering what else you can do with this demo environment, feel free to DM me on the Chaos Slack: @anamedina (join here!).
