June 26, 2020 - 5 min read

Performance tuning MongoDB with Chaos Engineering

You’ve pored over the MongoDB documentation, crafted highly polished and well-tuned queries, and confidently deployed your new code to production. Everything ran great at first, but once CPU or RAM usage hit a certain point, your queries suddenly slowed to a crawl. What happened, and how can you prepare for situations like this in the future?

This is an unfortunate but common scenario with databases like MongoDB. Queries behave one way in an isolated development environment, but end up underperforming once they’re deployed to production. This could be due to:

  • Spikes in CPU usage due to a surge in user traffic
  • Several large queries are running simultaneously, bringing disk I/O to a crawl
  • A misconfigured router sending network traffic along a slower route, causing latency

With Chaos Engineering, we can test for these different scenarios and optimize for them before deploying to production. This gives us the opportunity to improve the performance of both our database and our queries, thereby improving the user experience. In this post, we’ll walk you through applying Chaos Engineering techniques to MongoDB using Gremlin to simulate slowdowns and keep your queries running fast.

How can Chaos Engineering help with MongoDB optimization?

Chaos Engineering is the practice of intentionally and scientifically testing systems by injecting failure into them. It involves performing actions generally considered harmful—exhausting computing resources, blocking network traffic, shutting down instances, etc.—in a safe and controlled way, observing the results, and forming conclusions that will help us improve our systems. This is often used to make systems more reliable by testing for failure safety, but we can also use it to test performance in stressful conditions.

For distributed database platforms like MongoDB, there are a lot of things that can go wrong: unexpectedly high resource consumption, slow connections between mongod instances, failed primary nodes, and lengthy election cycles to name a few. With Chaos Engineering, we can simulate these scenarios and optimize for them long before our application goes live. This might include:

  • Provisioning infrastructure that’s better attuned to MongoDB’s requirements
  • Adjusting shard sizes, indexes, replication strategies, and auto-scaling policies
  • Testing different versions of queries to find the most efficient

Donald Knuth famously wrote that premature optimization is the root of all evil 97% of the time, but “we should not pass up our opportunities in that critical 3%.” With Chaos Engineering, we can account for that critical 3%.

How to start experimenting on MongoDB

Now that we’ve covered the benefits of Chaos Engineering, let’s start experimenting. We’ll prepare a test MongoDB cluster and application, gather baseline performance metrics, design and run an experiment, then compare the results of the experiment to our baseline.

For this example, we’re using MongoDB version 4.2.5 with a replica set of three CentOS 7 servers. Our application is a simple web-based registration form that allows users to submit their name and email address, which we’ll store in a collection called gremlin.registrations. We also have a page where users can view a list of current registrations, of which we have 25 thousand.

Step 0: Enable monitoring

The metric we want to focus on is query execution time. We can use the MongoDB database profiler to record profiling data for each operation, which includes this metric. To enable profiling, we’ll open a mongo shell and switch to our gremlin database:

1> use gremlin

Then, we’ll run the following command:

1> db.setProfilingLevel(2)

MongoDB has 3 profiling levels: 0 (off), 1 (slow operations only), and 2 (all operations). Logging all operations can add a significant amount of overhead, so we recommend only enabling it during development and testing.

Now, we can run a few queries and retrieve profiling data from the system.profile collection. When querying this data, we want to focus on the following fields:

  • ts: the timestamp of the operation. We’ll use this field to correlate profiling data to chaos experiments based on their start and end times.
  • millis: the amount of time required for MongoDB to complete the operation.
  • cursorid: a pointer to the result set of a query. MongoDB will execute our query in batches, and we’ll use this field to aggregate and sum the results.

We also recommend collecting infrastructure metrics so that you can measure the impact of experiments on your nodes. Since we’ll be using Gremlin to run our experiments, we can view metrics for CPU and shutdown attacks right in the Gremlin web app. For other experiments, we can use the Full Time Diagnostic Data Collection (FTDC) feature of MongoDB, or use a separate monitoring solution like Datadog. For now, we’ll focus on the profiling data.

Step 1: Establish a baseline

Before running any experiments, we’ll establish a performance baseline. As we run experiments, we can compare the results of those queries to our baseline to measure the change in query performance. To eliminate any outside factors, we’ll run the same query on the same data set.

Here’s the query that our application effectively uses to retrieve registration data:

1> db.registrations.find()

Instead of returning all 25,000 records at once, MongoDB returns batches of 1,000. To get the query’s total execution time, we need to aggregate our profiling data by cursor ID:

1> db.system.profile.aggregate([{ $match: {ns: "gremlin.registrations"} }, {$group: { _id: "$cursorid", millis: {$sum: "$millis"}, recordcount: {$sum: "$nreturned"}}}])

When we execute both queries, we see the following:

1{ "_id" : 34478518728, "millis" : 0, "recordcount" : 25000 }

Note that "millis" : 0, which means the query effectively finished immediately. Since we have plenty of capacity in our test cluster and no other active processes, we expect to see fast results. Let’s see if we can achieve the same level of performance under load.

Step 2: Run chaos experiments

Now that we have our baseline, let’s start experimenting. A chaos experiment consists of one or more attacks, which are methods of injecting failure. Relevant attacks for MongoDB can include:

  • CPU and RAM consumption (MongoDB and system performance)
  • I/O (read/write performance)
  • Network latency (communication between nodes)
  • Packet loss (node communication, load balancing operations, leader elections, data integrity, and more)

We’ll run our chaos experiment using Gremlin, a Chaos Engineering SaaS platform. You can try running chaos experiments for yourself by creating a Gremlin Free account.

For this example, we’ll use a memory attack to increase RAM consumption by 50%. This could simulate a number of scenarios, such as a spike in the number of read operations, a slow and inefficient query, or a rogue process on the node. We’ll run our query three times to get a decent baseline, start the attack, and repeat the query three more times while the attack is running:

BaselineMemory Attack: 50%
Run 10 ms152 ms
Run 20 ms465 ms
Run 30 ms581 ms

As we can see, this had a noticeable impact on performance. Response times can vary from just 152 milliseconds to over half a second, creating an inconsistent and often sluggish user experience. If this doesn’t seem like a lot, remember that this is just one query and one experiment. We have to consider the impact of other queries running simultaneously, as well as other factors such as CPU availability, disk throughput, and network throughput. Even just 152 milliseconds can compound once we start running in production.

From here, we can begin planning how we want to approach optimization, whether that means rewriting our query, creating indexes to avoid rescanning the entire collection, or increasing capacity on our nodes. We won’t go into detail about optimizing MongoDB, but we recommend reading the MongoDB query optimization documentation and optimization tutorial for more guidance.

Step 3: Refine and repeat

Once you’ve stress tested your queries and implemented potential fixes, continue experimenting to verify their effectiveness. For each optimization action that you take, repeat your experiments and compare the results. Increase RAM consumption to 80% instead of 50%. Consume disk I/O to see how well your cluster performs under disk pressure. If you’re using the nearest read preference, shut down one of your replicas to measure performance when the preferred node is offline. Make sure to check for regressions after each optimization. Modifying one component can have ancillary effects on other components, especially in a distributed system like MongoDB.

Congratulations on optimizing your MongoDB queries! Optimization is just one way Chaos Engineering can help with managing a MongoDB cluster. You can also use it to improve the reliability of your clusters, reduce the risk of downtime, and test for high-scale events beforehand. To learn more, read our guide on Chaos Engineering for MongoDB.