Chaos Engineering Guide: MongoDB

Introduction

MongoDB is a popular database that includes features designed for performance and scale. But as with any software, the way it works on paper can differ from real-world implementations. We’ll discuss the performance and high-availability features of MongoDB and how to test them in your own implementation to ensure they’re working for you.

Specifically, we will illustrate how to test what happens when your primary database node fails, when that node is working but loses network connectivity, when disk I/O hits hardware limits, and when latency on a single shard slows down queries. Discovering your system’s actual responses to these situations will either verify that your planning and mitigation work as intended or give you some hard data to guide reliability work.

MongoDB history and design

MongoDB began as one of many components in what was to be a platform-as-a-service that would make building and hosting apps easy. The rest of the platform created by 10gen, the company that would go on to become MongoDB, has been lost to history, but the database has become one of the most popular NoSQL databases.

That popularity is largely driven by the original intent of the platform: MongoDB makes it easier to develop and run applications at scale.

From an operational perspective, MongoDB makes scaling easier with both data sharding and replication built in. Sharding spreads large data sets across multiple machines to support high throughput. MongoDB even includes a built-in balancer that automatically redistributes data among servers in your sharded cluster. MongoDB’s replication features provide high availability, minimizing downtime and data loss in the event that a server fails.

MongoDB’s rich features should work out of the box, but as with any software implementation, your environment and specific configurations may affect this. Testing these features is critical to ensuring they work as expected and to better understanding how failure can occur.

Ensuring reliable replication

MongoDB replica sets are composed of a primary node that receives all write operations and may serve read operations, and secondary nodes that contain replicas of the primary’s data set. Secondary nodes may also serve read operations or simply be on standby for failover. Replica sets may also contain Arbiters—servers that do not store or serve data, but only exist to help with leader election.

Distributing the members of your replica set across different availability zones and regions will provide resiliency against datacenter and location-specific incidents.

Here’s how MongoDB describes the failover process:

When a primary does not communicate with the other members of the set for more than the configured electionTimeoutMillis period (10 seconds by default), an eligible secondary calls for an election to nominate itself as the new primary. The cluster attempts to complete the election of a new primary and resume normal operations.

It’s worth noting that write operations are halted during an election and only resume after a secondary has successfully been promoted to primary. The time to elect a new primary is typically short (usually a couple of seconds, but it’s highly dependent on your architecture and network). Adding this time to the 10-second (default) timeout period before MongoDB declares a primary unreachable means you may be unable to write to your cluster for long enough to impact the customer experience.
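The arithmetic behind that write outage can be sketched in a few lines of Python. This is a rough estimate only: the function names are hypothetical, and the 2-second election duration is an illustrative assumption standing in for "a couple of seconds", so measure the real value in your own environment.

```python
def worst_case_write_outage_ms(election_timeout_ms: int = 10_000,
                               election_duration_ms: int = 2_000) -> int:
    """Rough estimate of how long writes may be unavailable after a primary fails.

    election_timeout_ms mirrors electionTimeoutMillis (10 seconds by default).
    election_duration_ms is an assumed figure, not a measured one.
    """
    return election_timeout_ms + election_duration_ms


def violates_slo(outage_ms: int, slo_budget_ms: int) -> bool:
    """True when the estimated outage exceeds your write-unavailability budget."""
    return outage_ms > slo_budget_ms


if __name__ == "__main__":
    outage = worst_case_write_outage_ms()
    print(f"Estimated write outage: {outage} ms")
    print("Exceeds a 5-second budget:", violates_slo(outage, 5_000))
```

Comparing this estimate against the outage you actually observe during an experiment is a quick way to see whether your timeout settings match your Service Level Objectives.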

Chaos Experiment 1: Primary failure

Experiment

We want to test that when the primary fails, a secondary node is automatically promoted to the primary role. To do this, run a shutdown attack on the primary node with the reboot option disabled.

Hypothesis

When the primary node is lost from the replica set, the remaining members of the cluster will stop receiving heartbeats from it. This should trigger a new leader election and the elected secondary node will be promoted to primary. When the old primary node is restarted, it should rejoin the replica set, acknowledge that a new primary has been elected, demote itself to a secondary role, and begin sync operations to replicate data from the new primary.

Abort conditions

You should halt your Chaos Experiment if any of the following occur:

  • If the cluster fails to elect a primary within the time your Service Level Objectives allow
  • If your application cannot write to the cluster after a new primary is elected
  • If the cluster loses data (more on this below)
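One way to watch for the first abort condition is to poll the replica set status and time how long it takes for a different member to report itself as primary. The sketch below is a minimal illustration, not a complete monitoring tool. It assumes the status document has the shape returned by the replSetGetStatus command (a members list whose entries carry name and stateStr fields); the function names and the timeout values are hypothetical.

```python
import time
from typing import Callable, Optional


def find_primary(status: dict) -> Optional[str]:
    """Return the name of the member reporting PRIMARY, or None during an election."""
    for member in status.get("members", []):
        if member.get("stateStr") == "PRIMARY":
            return member.get("name")
    return None


def wait_for_new_primary(get_status: Callable[[], dict], old_primary: str,
                         timeout_s: float = 30.0, poll_s: float = 0.5) -> Optional[float]:
    """Poll until a member other than old_primary is PRIMARY.

    Returns the elapsed seconds, or None if timeout_s passes first,
    which should be treated as an abort signal.
    """
    start = time.monotonic()
    while time.monotonic() - start < timeout_s:
        try:
            primary = find_primary(get_status())
        except Exception:
            primary = None  # the member we asked may be unreachable; keep polling
        if primary is not None and primary != old_primary:
            return time.monotonic() - start
        time.sleep(poll_s)
    return None
```

With PyMongo, get_status could be something like lambda: client.admin.command("replSetGetStatus"), where client is connected to a surviving member of the replica set.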

Testing data durability

Early in its popularity, MongoDB was famously fast—aka “Web Scale”—but notoriously unreliable due to a misconfigured setting called write concern. The write concern indicates how many servers must acknowledge receiving the data before MongoDB can return a successful confirmation to the client. One way to boost the appearance of performance is to set the write concern to 0—no acknowledgement required. Configured this way, MongoDB will inform the client that it has successfully received data with no guarantee that the data has been safely stored.

In MongoDB releases before 5.0, the write concern is set to 1 by default (newer releases default to majority in most configurations). In a replica set, a write concern of 1 means the primary node has stored the data. But what happens in a situation, such as our previous Chaos Experiment, where the primary node has received the data, but fails before it can replicate the data to the secondary nodes?

Under those conditions, the primary node will still be demoted to a secondary role, but upon rejoining the replica set, any unreplicated data it has will be rolled back to bring it into sync with the rest of the cluster.

When running a high-availability MongoDB cluster, you’ll want to set the write concern to a number greater than 1 to ensure that secondary nodes also acknowledge the data, or to majority, which means that a majority of the replica set’s data-bearing voting members (counting the primary) have acknowledged the data. As the MongoDB documentation notes:

A rollback does not occur if the write operations replicate to another member of the replica set before the primary steps down and if that member remains available and accessible to a majority of the replica set.
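To make the majority rule concrete, here is a small sketch of the arithmetic. It is a simplification that assumes every voting member is data-bearing (no arbiters), and the function names are hypothetical.

```python
def majority(voting_members: int) -> int:
    """Smallest number of voting members that forms a majority."""
    return voting_members // 2 + 1


def is_majority_acknowledged(acknowledged_by: int, voting_members: int) -> bool:
    """True when enough members (primary included) hold the write to satisfy majority."""
    return acknowledged_by >= majority(voting_members)


if __name__ == "__main__":
    for size in (3, 5, 7):
        print(f"{size} members: majority needs {majority(size)} acknowledgements")
    # With a write concern of 1 in a three-member set, only the primary has the write.
    print("w=1 on 3 members is majority-acknowledged:", is_majority_acknowledged(1, 3))
```

A write that is not majority-acknowledged is not necessarily lost, but it is the kind of write that can be rolled back if the primary fails before replicating it.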

Chaos Experiment 2: Primary node network loss

Experiment

This experiment tests that data is persistent when the write concern is set above 1. To do this, set the write concern to majority. Next, write a record to your MongoDB instance immediately followed by a blackhole attack on the primary node, which drops all network traffic to the node.

Hypothesis

With a write concern of majority, a majority of the replica set’s members will have the data before the write is acknowledged, protecting against data loss. When the primary node is blackholed, it will demote itself to secondary and the remaining members of the cluster will elect a new primary. When the old primary node rejoins the replica set, it will keep its secondary role. No data will be lost.

Abort conditions

You should halt your Chaos Experiment if any of the following occur:

  • If the cluster fails to elect a primary quickly enough to avoid violating your Service Level Objectives (no data can be written until a primary is elected)
  • If your application cannot write to the cluster after a new primary is elected
  • If the cluster loses data. This would indicate that a secondary node with the latest data was unable to be promoted to primary
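Checking for data loss is easier if the record written just before the attack is uniquely identifiable. The following is a minimal sketch under stated assumptions: the helper names and the chaos_marker field are hypothetical, and collection is expected to behave like a PyMongo Collection that has already been configured with a majority write concern.

```python
import uuid


def write_marker(collection) -> str:
    """Insert a uniquely identifiable document and return its marker value.

    Call this immediately before starting the blackhole attack.
    """
    marker = uuid.uuid4().hex
    collection.insert_one({"chaos_marker": marker})
    return marker


def marker_survived(collection, marker: str) -> bool:
    """After a new primary is elected, confirm the marker document is still readable."""
    return collection.find_one({"chaos_marker": marker}) is not None
```

With PyMongo, a suitable collection could be obtained with database.get_collection(name, write_concern=WriteConcern(w="majority", wtimeout=5000)), where the database name, collection name, and timeout are placeholders for your own values. If marker_survived returns False after failover, treat that as the data-loss abort condition.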

Protecting performance

Sharding is a way to break up data into logical groups so those groups can be distributed across multiple nodes. MongoDB groups records into chunks (64 megabytes by default in older releases, 128 megabytes as of MongoDB 6.0) and uses a shard key to determine how to distribute those chunks. The MongoDB balancer automatically distributes chunks across shard nodes. As records are created or updated, chunks will be split and the balancer will redistribute chunks to keep them balanced among shard nodes.

It’s important to select a good shard key. Although the balancer automatically distributes chunks to balance the number per shard node, it does not account for read or write patterns. See the MongoDB manual for more information about choosing a shard key.
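A toy example shows why balanced chunks do not guarantee balanced load. The chunk names, shard names, and traffic figures below are made up for illustration; the point is only that two shards can hold the same number of chunks while one of them serves almost all of the operations.

```python
from collections import Counter
from typing import Dict


def chunk_counts(chunk_to_shard: Dict[str, str]) -> Counter:
    """Number of chunks placed on each shard (what the balancer evens out)."""
    return Counter(chunk_to_shard.values())


def shard_load(chunk_to_shard: Dict[str, str], ops_per_chunk: Dict[str, int]) -> Counter:
    """Operations served by each shard, given per-chunk traffic."""
    load = Counter()
    for chunk, shard in chunk_to_shard.items():
        load[shard] += ops_per_chunk.get(chunk, 0)
    return load


if __name__ == "__main__":
    placement = {"c1": "shardA", "c2": "shardA", "c3": "shardB", "c4": "shardB"}
    traffic = {"c1": 900, "c2": 50, "c3": 30, "c4": 20}  # hypothetical ops per minute
    print("Chunks per shard:", dict(chunk_counts(placement)))
    print("Load per shard:  ", dict(shard_load(placement, traffic)))
```

A shard key that concentrates hot records in a few chunks produces exactly this pattern, which is why the key should reflect how your application reads and writes.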

The balancer attempts to minimize the impact of chunk migrations on database performance by only moving one chunk at a time. It also has configurable migration thresholds and balancing windows. But even with these features, the balancer will put load on your shard nodes and replica sets, so you’ll need to reserve some overhead resource capacity.

Chaos Experiment 3: Disk I/O capacity

Experiment

This experiment tests that your MongoDB sharded cluster is performant, even when migrating chunks. To simulate a chunk move, first disable balancing by running the sh.disableBalancing() method on your collection—running an experiment during a real chunk migration could compound issues. Next, run an I/O attack on two shard nodes—mimicking the behavior of the balancer as it moves data between them. As the attack runs, monitor your query and write latency to ensure there is no performance degradation. When your experiment is complete, run the sh.enableBalancing() method to re-enable balancing.

Hypothesis

With properly sized nodes, you should not see any significant decrease in performance.

Abort conditions

You should halt your Chaos Experiment if any of the following occur:

  • If the cluster performance degrades and affects the user experience
  • If your application cannot write data due to blocked I/O
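To turn "performance degrades" into something you can check automatically, you could compare latency samples collected during the attack against a baseline taken beforehand. This is a sketch under assumptions: the function names are hypothetical, and both the 95th percentile and the 1.5x threshold are illustrative choices that should be replaced with figures from your own Service Level Objectives.

```python
import statistics
from typing import Sequence


def p95(samples_ms: Sequence[float]) -> float:
    """95th percentile of latency samples (requires at least two samples)."""
    return statistics.quantiles(samples_ms, n=100)[94]


def should_abort(baseline_ms: Sequence[float], attack_ms: Sequence[float],
                 max_ratio: float = 1.5) -> bool:
    """True when p95 latency during the attack exceeds max_ratio times the baseline p95."""
    return p95(attack_ms) > max_ratio * p95(baseline_ms)
```

Feeding this function query and write latencies sampled from your application while the I/O attack runs gives you an objective trigger for halting the experiment.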

Identifying potentially problematic queries

We’ve discussed sharding, but in an application using a sharded MongoDB backend, how does your application know which shards to query? Enter the MongoDB mongos—which serves as a router and data aggregator. Rather than interacting directly with shard nodes (which is not possible in a properly configured sharded cluster), your application interacts with the mongos. The mongos will analyze your query, send the query to the appropriate shards, and for read operations, aggregate the results and apply any limits, sorts, or skips, before sending them back to your application.

If your query uses the shard key to constrain results—such as querying for a specific user ID where the data collection is sharded by user ID—the mongos will identify which shard holds that record and direct the query to that shard. However, if your query does not use the shard key, the mongos will perform a broadcast operation, passing the query to all shards, then aggregate the results.

Imagine that you have a data collection that contains information about 16th century British literature. Each document has the title of the work and the author’s name. For example, one record would be { "title": "Romeo and Juliet", "author": "William Shakespeare" }. If we set the shard key to be the title, searching for this work by its title becomes simple. The mongos just targets the shard that holds literature beginning with the letter R. However, if we run a query such as db.collection.find({ "author": "J. K. Rowling" }), the mongos will need to pass that query to every shard and wait for a response from all of them before being able to authoritatively declare that there were no works by J. K. Rowling in the collection. In scenarios such as this, a single slow shard can bring down the performance of the entire cluster.
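The difference between targeted and broadcast operations can be modeled with a toy router. This is not how the mongos is implemented; it is a simplified sketch that assumes shards own ranges of first letters of the title, and the class name, shard names, and latency figures are all hypothetical.

```python
from typing import Dict, List


class ToyRouter:
    """Toy model of routing over shards ranged by the first letter of `title`."""

    def __init__(self, shard_ranges: Dict[str, str], latency_ms: Dict[str, int]):
        # shard_ranges maps a shard name to the first letters it owns, e.g. "ABCDEFGHI"
        self.shard_ranges = shard_ranges
        self.latency_ms = latency_ms

    def shards_for(self, query: dict) -> List[str]:
        """Return the shards a query must visit."""
        title = query.get("title")
        if title:  # the shard key is present, so target only the owning shard
            first = title[0].upper()
            return [s for s, letters in self.shard_ranges.items() if first in letters]
        return list(self.shard_ranges)  # no shard key: broadcast to every shard

    def query_latency_ms(self, query: dict) -> int:
        """A query finishes only when the slowest shard it visits has responded."""
        return max((self.latency_ms[s] for s in self.shards_for(query)), default=0)


if __name__ == "__main__":
    router = ToyRouter(
        {"shard1": "ABCDEFGHI", "shard2": "JKLMNOPQR", "shard3": "STUVWXYZ"},
        {"shard1": 20, "shard2": 20, "shard3": 5000},  # shard3 is the slow one
    )
    print(router.query_latency_ms({"title": "Romeo and Juliet"}))
    print(router.query_latency_ms({"author": "J. K. Rowling"}))
```

In this model the title query only ever touches shard2, so it is unaffected by the slow shard, while the author query must wait for shard3 every time.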

For more information see the MongoDB query optimization documentation.

Chaos Experiment 4: Shard latency

Experiment

In this experiment, we’ll identify queries that require broadcast operations and may degrade application performance. To do this, we’ll need to log slow operations. Slow is relative and should be based on your particular application and Service Level Objectives, but for this example we’re setting the threshold to 5000 milliseconds. Set the mongos parameter --slowms to 5000 and --slowOpSampleRate to 1.0 so that every slow operation is logged.

Next, run a scenario using a latency attack to inject 5000ms of latency on one shard. Operations taking more than 5000ms will be logged to the MongoDB diagnostic log. Because this attack also slows targeted operations that touch only the affected shard, the scenario should include subsequent steps that shift the latency attack to each of the other shards in turn. Queries that perform broadcast operations will appear in the diagnostic log during every step, no matter which shard is under attack, because those queries aggregate data from all shards.

Hypothesis

Queries that use targeted operations only need to communicate with specific shards and will see intermittent latency (dependent on which shard or shards hold the required data). Queries that use broadcast operations will always see latency as they require interaction with all shards.
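Once the scenario has run, the hypothesis can be checked by grouping the logged slow queries by the step in which they appeared. The sketch below assumes you have already extracted a label for each query (its "shape") from the diagnostic log for each step; that extraction, the function name, and the labels are hypothetical and not part of MongoDB.

```python
from typing import Dict, Set


def classify_queries(slow_by_phase: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Split logged query shapes into likely-broadcast and likely-targeted.

    slow_by_phase maps the shard under attack in each step to the set of
    query shapes that exceeded the slow threshold during that step.
    """
    phases = list(slow_by_phase.values())
    if not phases:
        return {"broadcast": set(), "targeted": set()}
    seen_anywhere = set().union(*phases)
    seen_every_phase = set.intersection(*phases)
    return {"broadcast": seen_every_phase,
            "targeted": seen_anywhere - seen_every_phase}


if __name__ == "__main__":
    result = classify_queries({
        "shardA": {"find by author", "find by title (R)"},
        "shardB": {"find by author"},
        "shardC": {"find by author", "find by title (T)"},
    })
    print("Likely broadcast:", result["broadcast"])
    print("Likely targeted: ", result["targeted"])
```

Treat the result as a list of candidates rather than proof: a targeted query whose data happens to span every shard would also show up in every step.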

Abort conditions

You should halt your Chaos Experiment if the cluster performance degrades enough to negatively affect the user experience.

Conclusion

MongoDB’s rich feature set makes operating a performant, highly available datastore easier. To understand how these features behave in your unique environment, and to confirm that they do indeed work as intended, you’ll need to test them. We’ve provided a few examples you can run. As you read and worked through them, you probably thought of other ideas for chaos experiments specific to your system that would help you learn what you want to know and, ultimately, enhance your resilience.
