Amazon RDS is a managed relational database service that lets you easily deploy, scale, and replicate databases. You can create an instance of Amazon Aurora, PostgreSQL, MySQL, MariaDB, Oracle Database, or SQL Server for your application and benefit from fully managed infrastructure, automatic updates, automatic recovery from host failures, and more.
Amazon RDS has many resiliency features built in, but these don’t account for the other types of failures that can impact our RDS applications: for example, losing network connectivity between our applications and RDS. Application misconfigurations, unexpected restarts, or availability zone (AZ) failures can happen at any time, and when they do, we want to know that our application can recover gracefully without disrupting the user experience.
Using Chaos Engineering, we can proactively test these different conditions, see how our application responds, then use these insights to make both our application and RDS deployment more resilient. This tutorial will show you how to use Gremlin to run chaos experiments that will test the resiliency of your applications when using Amazon RDS.
Before starting this tutorial, you’ll need:
This tutorial will show you how to:
First, let’s create a database instance in RDS. For this tutorial, we’ll use MariaDB. Log into the AWS Management Console and select Amazon RDS from the list of services. Scroll down to the Create database section and select Create database. Choose your preferred creation method (I used Standard for this example), then choose MariaDB for the engine. Since this is just a demo instance, I recommend choosing Free tier (db.t2.micro is more than enough for this example) or Dev/Test.
Give the instance a name of your choice. We’ll name ours
mariadb-gremlin-demo. Add credentials for the admin account (or check the box to let RDS automatically generate a password) and make sure to copy these credentials as we’ll need to use them in the next step. Finish configuring the instance however you’d like, then click Create database. The new database will be provisioned in a few minutes.
Once the instance is created, we’ll need to populate it with a database and table. We’ll use the
mysql command line tool to connect to the database, although any MariaDB/MySQL client will work.
Open a connection to the database using the following command, making sure to replace
YOUR_HOST with your database server endpoint,
YOUR_PORT with the port number, and
YOUR_USER with your MariaDB username. You’ll be prompted to enter your password:
mysql -h YOUR_HOST -P YOUR_PORT -u YOUR_USER -p
From here, we’ll run two statements: the first creates a database named todo, and the second creates a table named tasks:
CREATE DATABASE todo;

CREATE TABLE todo.tasks (
  id INT(11) unsigned NOT NULL AUTO_INCREMENT,
  description VARCHAR(500) NOT NULL,
  completed BOOLEAN NOT NULL DEFAULT 0,
  PRIMARY KEY (id)
);
Enter quit to close the client. Now we’re ready to deploy our app and connect it to our database!
Next, we’ll deploy an application to our host and connect it to the database. This application provides a web page that lets users add items to a TODO list and persists these items to our MariaDB database. The application has two services: a client service that runs the website and frontend, and an API service that connects to the database and processes requests from the client. See the GitHub repository for more details.
First, we’ll open a terminal on the host where we have the Gremlin agent installed, then clone the source code from the GitHub repository provided by MariaDB:
git clone https://github.com/mariadb-corporation/dev-example-todo.git
Next, we’ll build the application. This requires NPM, which you can install by following these instructions. Once NPM is installed, we’ll run the following commands to navigate to the
client service folder and install its dependencies:
cd dev-example-todo/client
npm install
Before we actually run the client service, we need to run the API service, which acts as a middle layer between the client and MariaDB. The project provides several examples in different languages, but we’ll deploy the Node.js version. Open a second terminal window and run the following commands:
cd dev-example-todo/api/nodejs/basic
npm install
When running the API, we’ll provide our database connection details as environment variables. In the following command, replace the following strings:
YOUR_HOST: Your MariaDB server’s hostname.
YOUR_PORT: Your MariaDB server’s port number.
YOUR_USER: The username you want to use to log in to MariaDB.
YOUR_PASS: The password used to log in to MariaDB.
Note: If the database you created in step 1 has a name other than todo, set DB_NAME=<your database name>.
DB_HOST=YOUR_HOST DB_PORT=YOUR_PORT DB_USER=YOUR_USER DB_PASS=YOUR_PASS DB_NAME=todo npm start
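To make the role of these variables concrete, here is a minimal sketch of how a Node.js service could turn them into a connection configuration and fail fast when one is missing. The `loadDbConfig` helper, its default port of 3306, and its default database name are assumptions made for illustration; this is not the code the example API actually uses.

```javascript
// Hypothetical helper (not part of the example repository): builds a database
// connection config from the environment variables used in the command above.
function loadDbConfig(env = process.env) {
  // Assumption: fall back to MariaDB's default port when DB_PORT is unset.
  const port = Number(env.DB_PORT ?? 3306);
  if (!Number.isInteger(port) || port <= 0) {
    throw new Error(`DB_PORT must be a positive integer, got "${env.DB_PORT}"`);
  }

  // Refuse to start with incomplete credentials instead of failing later
  // on the first query.
  const missing = ["DB_HOST", "DB_USER", "DB_PASS"].filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing required variables: ${missing.join(", ")}`);
  }

  return {
    host: env.DB_HOST,
    port,
    user: env.DB_USER,
    password: env.DB_PASS,
    // Assumption: default to the database name created earlier in the tutorial.
    database: env.DB_NAME ?? "todo",
  };
}
```

Validating the configuration at startup means a typo in the hostname or a forgotten password shows up as an immediate, readable error rather than as a connection timeout later on.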
Once the API is up and running, switch back to your client terminal and start the client:
Now, open the URL for your server in the browser on port 3000 and you’ll see the following screen. Try adding tasks and refreshing the page. If the data persists, then the database connection is working!
Now that we’ve set up our application, let’s run a chaos experiment!
In this experiment, we’ll simulate a full-scale outage between our instance and RDS. We’ll do this by running a blackhole attack, which drops all network traffic. Since this instance communicates with several other services, we’ll limit the scope of this attack (the blast radius) so that it only affects traffic going to and from the database server.
First, we’ll log into Gremlin by signing in at app.gremlin.com. In the left-hand side bar, select Attacks, then New Attack. In the attacks screen, select the Infrastructure tab, then scroll down to select the host where the TODO application is running. Here, our host is named
Under Choose a Gremlin, select Network, then select Blackhole. In the Hostnames text box, add the hostname for the database server. You can optionally add
3306 to the Remote Ports field to only impact traffic on that port, but adding just the hostname is fine for this example.
Next, click Unleash Gremlin to start the attack. Once the attack is running, open your todo app in your browser. What happens when you try refreshing the page, or adding an item? As it turns out, the page loads just fine, but we don’t see any items.
For users of our site, this would be confusing. It appears as if all of their data just disappeared. If we take a look at our terminal output for the API server, we can see what happened. The API server sent an asynchronous (
async) request to MariaDB and continued rendering the webpage in the meantime. The database is unavailable due to the blackhole attack, so the request eventually times out and returns an error. But at this point, the page has already been displayed to the user, so we get no visual indication of a problem.
(node:12932) UnhandledPromiseRejectionWarning: Error: retrieve connection from pool timeout after 10000ms
    at Object.module.exports.createError (/home/ubuntu/dev-example-todo/api/nodejs/basic/node_modules/mariadb/lib/misc/errors.js:55:10)
    at timeoutTask (/home/ubuntu/dev-example-todo/api/nodejs/basic/node_modules/mariadb/lib/pool-base.js:300:16)
    at Timeout.rejectAndResetTimeout [as _onTimeout] (/home/ubuntu/dev-example-todo/api/nodejs/basic/node_modules/mariadb/lib/pool-base.js:322:5)
    at ontimeout (timers.js:438:13)
    at tryOnTimeout (timers.js:300:5)
    at listOnTimeout (timers.js:263:5)
    at Timer.processTimers (timers.js:223:10)
(node:12932) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 3)
(node:12932) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
(node:12932) UnhandledPromiseRejectionWarning: Error: retrieve connection from pool timeout after 10001ms
    at Object.module.exports.createError (/home/ubuntu/dev-example-todo/api/nodejs/basic/node_modules/mariadb/lib/misc/errors.js:55:10)
    at timeoutTask (/home/ubuntu/dev-example-todo/api/nodejs/basic/node_modules/mariadb/lib/pool-base.js:300:16)
    at Timeout.rejectAndResetTimeout [as _onTimeout] (/home/ubuntu/dev-example-todo/api/nodejs/basic/node_modules/mariadb/lib/pool-base.js:322:5)
    at ontimeout (timers.js:438:13)
    at tryOnTimeout (timers.js:300:5)
    at listOnTimeout (timers.js:263:5)
    at Timer.processTimers (timers.js:223:10)
(node:12932) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 6)
To avoid this issue, we could add some code to our .catch() block that shows a user-friendly error message or popup. We could also add a loading indicator on the client side to show when a database request is being made. From an operations perspective, we should consider adding redundancy to our RDS instance and implementing load balancing to reduce the risk of complete outages like this in the first place.
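As a rough illustration of the first suggestion, the sketch below catches the connection failure and turns it into an explicit error response that the client could display. The `listTasks` function, the response shape, and the `unreachablePool` stub are hypothetical; the stub only imitates a connection pool whose `getConnection()` rejects with the timeout error seen in the log, so the example runs without a database.

```javascript
// Hypothetical handler: `pool` stands in for the API's connection pool and is
// assumed to expose a promise-based getConnection().
async function listTasks(pool) {
  let conn;
  try {
    conn = await pool.getConnection();
    const rows = await conn.query(
      "SELECT id, description, completed FROM todo.tasks"
    );
    return { status: 200, body: rows };
  } catch (err) {
    // Log the technical detail for operators, but give the user a clear message
    // instead of an empty task list.
    console.error("Database request failed:", err.message);
    return {
      status: 503,
      body: { error: "We can't load your tasks right now. Please try again shortly." },
    };
  } finally {
    // Return the connection to the pool if we managed to acquire one.
    if (conn) conn.release();
  }
}

// Stub that imitates the pool's behavior during the blackhole attack.
const unreachablePool = {
  getConnection: () =>
    Promise.reject(new Error("retrieve connection from pool timeout after 10000ms")),
};

listTasks(unreachablePool).then((res) => console.log(res.status)); // prints 503
```

With a response like this, the client can tell "the list is empty" apart from "the list could not be loaded" and show an appropriate message in each case.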
Running chaos experiments like this blackhole experiment can reveal unexpected behaviors in systems and the services they depend on. We should consider other conditions that might impact our application's behavior. For example, we saw what happened when we lost connection to the database, but what happens if a network misconfiguration adds an extra 100ms of latency to database traffic? What if it introduces packet loss or corruption? What if we lose connection to our DNS server and can no longer resolve our database's hostname? We should try these experiments with our database and other dependencies.
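One way to prepare for the latency question is to give each database call an explicit deadline, so a slow dependency produces a quick, visible failure rather than a request that hangs for ten seconds. The following is a minimal sketch under that assumption; `withTimeout` and `slowQuery` are hypothetical names, and `slowQuery` simply simulates a database call with a fixed delay.

```javascript
// Hypothetical helper: rejects if `promise` does not settle within `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms}ms`)), ms);
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}

// Stand-in for a database call that takes `delayMs` to answer.
function slowQuery(delayMs) {
  return new Promise((resolve) => setTimeout(() => resolve("rows"), delayMs));
}

// A 100ms response against a 50ms budget fails fast with a clear error.
withTimeout(slowQuery(100), 50).catch((err) => console.log(err.message));
// prints "Timed out after 50ms"
```

Running a latency experiment against a deadline like this shows whether the chosen budget is realistic: too tight and normal requests fail, too loose and users wait on a degraded database.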
Now that you have an environment with an Amazon RDS instance and the Gremlin agent installed, try running these different experiments and record your observations. If you want to run more advanced experiments, check out our library of Recommended Scenarios.
Gremlin empowers you to proactively root out failure before it causes downtime. See how you can harness chaos to build resilient systems by requesting a demo of Gremlin.