Amazon DynamoDB is a NoSQL database service boasting high availability, high durability, and single-digit millisecond performance. It offers a wealth of reliability features such as automatic replication across multiple Availability Zones in an AWS region, automatic backups, in-memory caching, and optional multi-region, multi-master replication. Since it’s a fully managed service, we don’t need to worry about provisioning hardware, maintaining servers, or replicating data. These are all provided in the base product or available as add-ons.
However, that doesn’t mean we’re out of the woods. There are plenty of ways that a DynamoDB deployment can fail, and our engineers must be aware of these when building applications in order to maintain a high degree of reliability. Identifying these risks early in the development process lets our teams:
- Build more resilient applications
- Maintain performance as our applications scale
- Provide a fast, low-latency experience for users
In this article, we look at how these risks can affect a DynamoDB deployment, and how Chaos Engineering helps with identifying and mitigating these risks. This will result in a smoother DynamoDB deployment, greater application reliability, and happier customers.
DynamoDB supports a wide range of use cases, particularly those requiring access to large-scale data in near real-time. However, there are many variables that can impact DynamoDB performance and availability. While we might not have direct control over how DynamoDB operates, we can still influence these variables. These include:
Service Level Agreements (SLAs): As of this writing, AWS promises a monthly uptime percentage of 99.99% for DynamoDB. AWS calculates this from the error rates measured over each 5-minute interval in the month, so on average roughly 1 in 10,000 requests is allowed to fail before AWS violates the agreement. Enabling Global Tables increases this to 99.999%, which is the equivalent of roughly 1 in 100,000 requests.
Integration with other services: Each service that you integrate with DynamoDB brings its own SLA, adds failure points, and introduces unique complexities. For example, Amazon Redshift can be used to perform data analysis on data copied from DynamoDB, but only promises an SLA of 99.9%. If your application relies on both DynamoDB and Redshift, your overall SLA will actually be lower than if you just used DynamoDB.
Application implementation and developer experience: When developers integrate new services into their applications, there’s always a risk of bugs, poor design decisions, and accidental oversights. This might include writing inefficient queries, under-provisioning read capacity and write capacity when using provisioned mode, or approaching non-relational databases with the mindset of a relational database service. These problems aren’t necessarily caused by DynamoDB, but will greatly impact performance and cost.
Service disruptions: As rare as they are, disruptions will occasionally cause the DynamoDB service to fail. For example, the release of Global Secondary Indexes (GSIs) caused an increase in requests that exceeded the throughput capacity of the DynamoDB metadata service. As a result, a number of DynamoDB servers stopped handling requests, and the error rate rose to 55% for users of the US-East AWS region. This had ripple effects on other AWS services, including the EC2 Auto Scaling service, the Amazon CloudWatch Metrics Service, and the AWS Management Console.
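To make the SLA arithmetic above concrete, here is a minimal sketch in Python. It assumes that the application needs every listed service for every request and that the services fail independently, which is a simplification. The function names are hypothetical and exist only for illustration. The percentages are the ones quoted above, and an SLA is a commitment rather than a measurement of real availability.

```python
def composite_availability(*slas):
    """Multiply the availabilities of services that must all be up at once."""
    total = 1.0
    for sla in slas:
        total *= sla
    return total


def allowed_failures(sla, requests):
    """Average number of requests that may fail without breaching the SLA."""
    return requests * (1 - sla)


# DynamoDB (99.99%) combined with Redshift (99.9%), both taken from the text.
combined = composite_availability(0.9999, 0.999)
print(f"Combined availability: {combined:.4%}")

# Failure budget per one million requests at each DynamoDB SLA tier.
print(f"Single region: {allowed_failures(0.9999, 1_000_000):.0f} failures")
print(f"Global Tables: {allowed_failures(0.99999, 1_000_000):.0f} failures")
```

Under these assumptions the combined figure comes out just under 99.9%, which is why each extra dependency lowers the availability we can promise our own customers.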
Now that we know some of the key issues we need to address, let’s look at some of the ways we can resolve them:
Increase availability and data durability using Global Tables. DynamoDB tables are region-specific, but Global Tables replicate your tables across multiple regions, each of which can accept writes (multi-master). This provides greater resilience against outages and disruptions (such as the US-East disruption mentioned above), and can reduce latency by placing data closer to your customers. However, this also effectively multiplies the cost of data storage by the number of regions, while also adding data transfer costs.
Use DynamoDB Accelerator (DAX). DAX is an in-memory caching service that can deliver responses up to 10x faster than DynamoDB alone. DAX provides fast performance even with millions of requests per second, can reduce latency from milliseconds to microseconds, and requires minimal changes to applications. As with Global Tables, though, DAX adds hourly pricing and data transfer charges that can quickly become expensive.
Configure workloads for faster timeouts and failover. By default, DynamoDB clients can wait as long as 30 seconds for a response before they time out. This means that in the event of a failure, your application (and therefore your customers) could wait half a minute only to be told their request failed. Not only will this frustrate and drive away customers, but it can also create longer-term problems such as resource overconsumption. Instead, we can set lower timeout periods to notify users of errors much more quickly, or use an asynchronous client so that our application continues running while waiting for DynamoDB to respond.
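The fail-fast idea can be sketched with Python's standard library alone. In the example below, `fake_get_item` is a hypothetical stand-in for an asynchronous DynamoDB read, and the deadline values are arbitrary. A real application would instead lower the connection and read timeouts that its AWS SDK client exposes.

```python
import asyncio


async def fake_get_item(delay_seconds):
    """Hypothetical stand-in for an asynchronous DynamoDB read."""
    await asyncio.sleep(delay_seconds)  # simulates network and service time
    return {"id": "user-1", "plan": "pro"}


async def get_item_with_deadline(delay_seconds, deadline_seconds):
    """Return the item, or None if the call exceeds deadline_seconds."""
    try:
        return await asyncio.wait_for(
            fake_get_item(delay_seconds), timeout=deadline_seconds
        )
    except asyncio.TimeoutError:
        # Fail fast so the caller can report an error or fall back right away.
        return None


# A healthy call finishes well inside the deadline.
fast = asyncio.run(get_item_with_deadline(0.01, 0.5))

# A degraded call is abandoned at the deadline instead of blocking the user.
slow = asyncio.run(get_item_with_deadline(0.5, 0.05))

print("fast:", fast)
print("slow:", slow)
```

Returning `None` is only one possible policy. Depending on the workload, the caller might retry with backoff, serve cached data, or fail over to another region.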
Once we’ve set up DynamoDB, how do we verify that it’s as fast, reliable, and failsafe as we expect it to be? Most of the issues we face aren’t caused by the DynamoDB service, but by how we configure and network our applications when leveraging DynamoDB. With Chaos Engineering, we can systematically test each of these issues to ensure that we’ve designed our applications to take full advantage of DynamoDB.
Chaos Engineering allows us to inject measured amounts of harm into applications and infrastructure with the goal of making them more reliable. It involves planning and running chaos experiments, which take a structured and scientific approach towards causing harm.
These might include:
- Verifying that Global Tables work by blocking network traffic to a specific AWS region, and checking whether our data is still accessible.
- Testing DAX by blocking network traffic to DynamoDB endpoints and only allowing applications to access DAX endpoints.
- Measuring application response time during slow queries by injecting latency into DynamoDB calls.
Running these experiments gives engineering teams the opportunity to test applications and build reliability against performance problems or outages. It also lets us verify that DynamoDB is working as designed, and that its integration into our existing infrastructure goes as smoothly as possible.
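As an illustration of the latency experiment, the sketch below wraps a call with an artificial delay and measures the response time the application would see. It is an application-level stand-in, not how a chaos engineering tool injects latency at the network layer. `fake_query`, `inject_latency`, `measure`, and the 200 ms threshold are all hypothetical names and values chosen for the example.

```python
import time


def fake_query(key):
    """Hypothetical stand-in for a DynamoDB query."""
    return {"id": key}


def inject_latency(func, delay_seconds):
    """Wrap func so every call waits delay_seconds before running."""
    def wrapper(*args, **kwargs):
        time.sleep(delay_seconds)  # the injected fault
        return func(*args, **kwargs)
    return wrapper


def measure(func, *args):
    """Call func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


# Hypothesis: with 50 ms of added latency, responses stay under 200 ms.
slow_query = inject_latency(fake_query, 0.05)
result, elapsed = measure(slow_query, "order-7")

threshold_seconds = 0.2
print(f"elapsed: {elapsed * 1000:.0f} ms")
print("hypothesis holds" if elapsed < threshold_seconds else "hypothesis rejected")
```

Stating the hypothesis and threshold before running the experiment is what keeps it structured: the result either confirms the expected behavior or points to something that needs fixing.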
Initially, chaos experiments should target a small set of systems that we know can tolerate failure (our blast radius). This might mean limiting an attack to a single server, or only targeting development or staging applications. As we become more confident in our ability to tolerate failure, we can scale up our blast radius to include additional systems, and eventually move testing onto our production systems. With repeated experimentation, we can ensure our DynamoDB-powered applications remain available and fast for our customers.
DynamoDB lets us leverage the stability and scale of the AWS platform to build highly reliable databases, but as with any system, there’s always a risk of failure. If your team is using or considering DynamoDB, request a demo to see how Gremlin can help you identify and resolve those risks before they can become production outages.