Getting started with Chaos Engineering on AWS


Amazon Web Services (AWS) is the world's most popular cloud computing platform. Much of this success is due to the scalability and reliability it enables for customers and their applications. However, this support doesn't come out of the box. AWS provides the building blocks for reliability and scalability, but it's up to its users to implement them in a way that best supports their workloads and applications.

Knowing how to configure and apply AWS' tools is one thing; verifying that your configuration works is an entirely separate project. Fortunately, we can use Chaos Engineering—the practice of injecting faults into systems to check for weaknesses—to test this. This way, we can make sure our AWS-hosted applications can withstand any conflicts or adversity that our production environment might throw at them.

In this article, we'll explain how this works and how you can apply it to your own systems.

Reasons to implement Chaos Engineering on AWS

Reliability is critical for software companies, but the complexity of cloud platforms like AWS—combined with the fast speed of development expected by modern DevOps teams—makes it difficult for teams to guarantee reliability out of the gate. The goal of Chaos Engineering is to test and verify application resiliency in your production environments so that they can withstand real-world failures.

AWS recognizes this need, which is why they made reliability a pillar of their Well-Architected Framework (WAF). The WAF is a guide for AWS customers on how to optimize applications for AWS, all the way from the design phase up to operating and monitoring. Operational Excellence and Reliability are the key pillars for Chaos Engineering, as they encompass running and managing applications.

Although the WAF provides extensive guidance and instruction, it doesn't provide a way to test whether your newly configured applications meet its standards. In other words, how do you know whether your applications are truly well-architected after you've put in all the effort to make them so? The answer is to test for the conditions that the WAF aims to prevent, and the way to do that is by intentionally injecting faults into your AWS workloads. Doing this reveals behaviors in your applications that are unexpected, undesirable, or otherwise deviate from the WAF.

“You can’t consider your workload to be resilient until you hypothesize how your workload will react to failures, inject those failures to test your design, and then compare your hypothesis to the testing results.” — AWS Reliability Pillar announcement blog

Chaos Engineering does more than just highlight operational issues. It also helps identify potential gaps in your monitoring and alerting setup by giving you a way to trigger those alerts. For instance, if you have an Amazon CloudWatch alert designed to notify you when CPU usage reaches critically high levels, you can run a Chaos Engineering experiment designed to consume CPU to test whether that alert fires.
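As a rough illustration of the alert described above, the sketch below builds the parameters for a CloudWatch CPU alarm with boto3. The alarm name, the 90% threshold, and the three one-minute evaluation periods are illustrative assumptions, not recommendations, and the helper function names are hypothetical.

```python
def cpu_alarm_params(instance_id: str, threshold_percent: float = 90.0) -> dict:
    """Build keyword arguments for CloudWatch's PutMetricAlarm operation.

    The alarm name, threshold, and evaluation window are assumed values
    chosen for illustration only.
    """
    return {
        "AlarmName": f"high-cpu-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 60,            # seconds per datapoint
        "EvaluationPeriods": 3,  # three consecutive breaching datapoints
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanThreshold",
    }


def create_cpu_alarm(instance_id: str) -> None:
    """Create (or update) the alarm. Needs boto3 and AWS credentials."""
    import boto3  # imported here so the builder above runs without boto3

    boto3.client("cloudwatch").put_metric_alarm(**cpu_alarm_params(instance_id))


def alarm_is_firing(alarm_name: str) -> bool:
    """Return True if the named alarm is currently in the ALARM state."""
    import boto3

    response = boto3.client("cloudwatch").describe_alarms(AlarmNames=[alarm_name])
    alarms = response["MetricAlarms"]
    return bool(alarms) and alarms[0]["StateValue"] == "ALARM"
```

During a CPU experiment, polling `alarm_is_firing` shows whether the alarm changed state. For the alarm to notify anyone, it would also need an `AlarmActions` entry, such as an SNS topic ARN, which this sketch leaves out.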

Chaos Engineering isn't strictly about reliability, either. Performance is also a critical aspect of cloud systems. After you've tested your assumptions about reliability, you can use Chaos Engineering to test your assumptions about the performance of your AWS deployments. You can find the parts of your application that aren't scaling properly, aren't optimized for low CPU or low memory availability, or are especially susceptible to noisy neighbors.

How your AWS deployments can fail under chaos

It's impossible to list all of the ways an AWS deployment can fail. This isn't because AWS is inherently unreliable, but rather because the sheer number of different services, configuration options, and workflows means there's always the potential for unexpected and unpredicted failure modes. For example, these could include:

Undersized EC2 instances that can't scale fast enough to handle incoming traffic.
Network routing errors caused by a bad VPC or Route53 configuration.
A sudden and unexpected node failure caused by a datacenter power loss.
Poor application throughput caused by a misconfigured Amazon RDS instance.
Undocumented single points of failure, such as unreplicated EC2 nodes.
Gradually decreasing performance caused by a lack of caching.
Total failure caused by a region or Availability Zone (AZ) outage.

Many teams develop incident response procedures for handling these types of failures, and while incident response is important, it's a reactive process. Response plans can only be created after a failure has already happened. And even if you have a response plan, are you certain that it's effective? Have you and your team already tested it? An untested plan is an unanswered question that could result in outages.

Challenges with implementing Chaos Engineering on AWS

Implementing Chaos Engineering on AWS shares many of the same challenges as implementing Chaos Engineering on other platforms:

How do you know where to start testing?
How do you determine which systems to test and what tests to run?
How do you do multi-cloud or hybrid-cloud testing?

The first step is to identify which AWS services and features you're using. Your applications might consist of several AWS services: ECS containers, EKS clusters, Lambda functions, EC2 instances, etc. Each of these services has different failure modes and requires different types of tests. For example, Elastic Kubernetes Service (EKS) abstracts away the Kubernetes control plane, so you don't need to worry about the reliability of your control plane nodes. However, you do need to consider what happens to your applications when one of your worker nodes (the nodes that run your Kubernetes workloads) fails.

Multi-cloud testing is also an important consideration. Nearly 90% of organizations use a multi-cloud approach (Flexera), which means any Chaos Engineering tools they adopt should ideally support testing on more than one cloud. There are tools like AWS Fault Injection Simulator (FIS) that make Chaos Engineering on AWS more accessible, but if your workloads are spread across AWS, Azure, Google Cloud Platform (GCP), or others, you'll either need to adopt additional tools or forgo Chaos Engineering on those platforms altogether.

How to get started with Chaos Engineering on AWS

1. Ensure proper monitoring and metrics are in place

Observability is a key part of Chaos Engineering; without visibility into EC2 instance performance, cluster health, HTTP request/response status, and other metrics, you might not notice when a problem occurs, let alone why it happened.

You can set up monitoring relatively easily using Amazon CloudWatch, or using other tools such as Dynatrace, Datadog, New Relic, etc. One major benefit of CloudWatch is that it natively supports collecting observability data from AWS services, and usually does so by default.

2. Define your steady state and develop hypotheses

The steady state is how a system performs under normal or ideal conditions. When running in pre-production, your steady state is typically the average amount of load your application uses when it's completely up and running. It's important to collect steady state metrics because these provide a baseline for comparison. Running Chaos Engineering experiments will almost certainly change your metrics, and measuring the difference between the changed metric's value and its steady state value gives you the experiment's impact.

Steady state metrics are usually collected just before running your first experiment, as this reduces the time frame that other factors (like load) have to affect your systems.
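The baseline comparison described above comes down to simple arithmetic. The sketch below assumes you have already exported metric samples (for example, request latencies in milliseconds) as plain lists of numbers. The function names and sample values are hypothetical.

```python
from statistics import mean, pstdev


def steady_state(samples):
    """Summarize baseline samples collected before the experiment."""
    return {"mean": mean(samples), "stdev": pstdev(samples)}


def impact(baseline, experiment_samples):
    """Compare the experiment's mean against the steady-state mean."""
    delta = mean(experiment_samples) - baseline["mean"]
    # A percentage is only meaningful when the baseline mean is non-zero.
    percent = delta / baseline["mean"] * 100 if baseline["mean"] else None
    return {"delta": delta, "percent": percent}


# Made-up latency samples in milliseconds, for illustration only.
baseline = steady_state([100, 110, 90, 100])
result = impact(baseline, [150, 160, 140, 150])
print(result)
```

With these made-up samples, the mean rises from 100 ms to 150 ms, an increase of 50 ms or 50% over steady state. The standard deviation in the baseline gives a sense of how much of a change could be ordinary noise.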

3. Select a Chaos Engineering tool

You've identified the systems that you want to test. You've developed a hypothesis. And you've measured your steady state. How do you go about running Chaos Engineering experiments?

This step requires a Chaos Engineering tool. A good Chaos Engineering tool doesn't just support fault injection on the AWS services you use but also provides control plane management and reporting. It should make it easy for you to initiate experiments on select systems, observe their effect, and easily stop or roll back experiments if needed.

There are a few approaches you can take to Chaos Engineering on AWS:

AWS Fault Injection Simulator (FIS) is an AWS service built specifically for running Chaos Engineering experiments on AWS services. Since it's built by AWS, it can inject faults deep into the AWS API in ways that other tools can't. However, it only provides a limited number of test types, supports just a few AWS services, and is limited to AWS. These limitations reduce its usefulness for large-scale or complex tests, and for testing hybrid or multi-cloud deployments.
Chaos Engineering frameworks like Chaos Toolkit support AWS through drivers and add-ons. A benefit of tools like these is that they're extensible, but this often means having to set up and configure tests yourself. This process takes time away from testing, and can be error-prone depending on how much effort it takes to write tests.
Custom tooling gives you the greatest control over what AWS services you test and what tests you run. This can bridge the gap when a Chaos Engineering framework doesn't completely fulfill your requirements. For example, you might use AWS FIS to simulate AWS API failures, while using a tool like stress-ng to generate load on EC2 nodes. Unfortunately, custom tools add maintenance time and costs, especially if you need to build them from scratch. They're also more prone to failure since engineers must develop their own tooling.
Gremlin is the leading commercial enterprise Chaos Engineering and reliability management platform. Gremlin provides a suite of Chaos Engineering experiments and reliability tests that you can run on host, container, and Kubernetes-based services including EC2 and EKS. While Gremlin doesn't have in-depth integration with the AWS API the way FIS does, it supports multi-cloud, hybrid-cloud, and on-premises environments.
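To make the custom tooling option more concrete, here is a minimal sketch of a wrapper around stress-ng that generates CPU load on a host. It assumes stress-ng is installed on the target EC2 instance. The function names, worker count, and load values are illustrative assumptions.

```python
import shutil
import subprocess


def stress_ng_cpu_command(workers: int, load_percent: int, seconds: int) -> list:
    """Build a stress-ng command line that loads the CPU for a fixed time."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    if not 0 < load_percent <= 100:
        raise ValueError("load_percent must be between 1 and 100")
    if seconds < 1:
        raise ValueError("seconds must be at least 1")
    return [
        "stress-ng",
        "--cpu", str(workers),            # number of CPU stressor processes
        "--cpu-load", str(load_percent),  # target load per stressor, in percent
        "--timeout", f"{seconds}s",       # stop automatically after this long
    ]


def run_cpu_stress(workers: int, load_percent: int, seconds: int) -> int:
    """Run the stressor on this host and return its exit code."""
    if shutil.which("stress-ng") is None:
        raise RuntimeError("stress-ng is not installed on this host")
    command = stress_ng_cpu_command(workers, load_percent, seconds)
    return subprocess.run(command, check=False).returncode
```

The `--timeout` flag acts as a built-in safety limit: the load stops on its own even if the calling script dies. Even a wrapper this small illustrates the maintenance cost mentioned above, since validation, installation checks, and cleanup all become your responsibility.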

4. Create and run your tests

Once you've selected a tool, the next step is to design and run your first experiment. Running a Chaos Engineering experiment consists of six steps:

Identify your potential failure points.
Form a hypothesis about how your systems will behave under failure conditions.
Define your blast radius, which is the set of systems that your experiment will impact.
Run the experiment.
Observe the results.
Implement fixes, scale up the experiment, and repeat.

Setting a small blast radius is especially important when getting started with Chaos Engineering. The blast radius is the scale of the test. A small blast radius would be a single application or server, while a large blast radius would be an entire Availability Zone or region. Starting with a small blast radius helps isolate the impact of the experiment to a limited area, making it easier to observe and measure the effects. It also prevents any unexpected problems from impacting other systems. As you become more confident in running experiments, or if the current blast radius is too small to provide meaningful results, increase the blast radius step-by-step.
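If you use AWS FIS, one way to express a blast radius is the target's selection mode in an experiment template. The sketch below builds a template that stops tagged EC2 instances, starting with a single instance. The tag, target and action names, five-minute restart duration, and step list are assumptions for illustration, so check the field names against the current FIS documentation before relying on them.

```python
import uuid

# Hypothetical progression from a small to a large blast radius.
BLAST_RADIUS_STEPS = ["COUNT(1)", "PERCENT(25)", "PERCENT(50)", "ALL"]


def fis_stop_instances_template(role_arn, tag_key, tag_value,
                                selection_mode="COUNT(1)", stop_alarm_arn=None):
    """Build arguments for an FIS template that stops tagged EC2 instances."""
    if stop_alarm_arn:
        # FIS halts the experiment if this CloudWatch alarm goes into ALARM.
        stop_conditions = [{"source": "aws:cloudwatch:alarm", "value": stop_alarm_arn}]
    else:
        stop_conditions = [{"source": "none"}]
    return {
        "description": f"Stop EC2 instances tagged {tag_key}={tag_value}",
        "roleArn": role_arn,
        "stopConditions": stop_conditions,
        "targets": {
            "chosen-instances": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {tag_key: tag_value},
                "selectionMode": selection_mode,  # this is the blast radius
            }
        },
        "actions": {
            "stop-instances": {
                "actionId": "aws:ec2:stop-instances",
                # Restart the instances after five minutes (ISO 8601 duration).
                "parameters": {"startInstancesAfterDuration": "PT5M"},
                "targets": {"Instances": "chosen-instances"},
            }
        },
    }


def create_template(template: dict) -> str:
    """Register the template with FIS. Needs boto3 and AWS credentials."""
    import boto3  # imported here so the builder above runs without boto3

    response = boto3.client("fis").create_experiment_template(
        clientToken=str(uuid.uuid4()), **template
    )
    return response["experimentTemplate"]["id"]
```

Widening the blast radius then means rebuilding the template with the next entry in `BLAST_RADIUS_STEPS` once the smaller experiment has produced clear results.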

Lastly, for each test, you should set abort conditions. Abort conditions are the system conditions under which the test should be stopped, regardless of whether it's finished running. They're used to prevent accidental damage to the systems being tested if they enter an undesirable or unexpected state. For example, if you're testing a single application on a single server, you might set your abort conditions to:

Halt if testing impacts other applications or servers.
Halt if the system or application becomes unresponsive.
Halt if the Chaos Engineering tool fails to target the right system or application.
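Abort conditions like these can be modeled as named checks that are polled while the experiment runs. The sketch below is tool-agnostic: the `start` and `stop` callables and the checks themselves are hypothetical placeholders for whatever your Chaos Engineering tool and monitoring expose.

```python
import time
from typing import Callable, Dict, Optional


def run_with_abort_conditions(
    start: Callable[[], None],
    stop: Callable[[], None],
    abort_conditions: Dict[str, Callable[[], bool]],
    duration_s: float,
    poll_s: float = 1.0,
) -> Optional[str]:
    """Run an experiment, halting early if any abort condition trips.

    Returns the name of the condition that tripped, or None if the
    experiment ran for its full duration.
    """
    start()
    try:
        deadline = time.monotonic() + duration_s
        while time.monotonic() < deadline:
            for name, tripped in abort_conditions.items():
                if tripped():
                    return name
            time.sleep(poll_s)
        return None
    finally:
        # Always halt the fault, whether we finished, aborted, or crashed.
        stop()


# Example wiring with placeholder checks that never trip.
outcome = run_with_abort_conditions(
    start=lambda: print("fault injected"),
    stop=lambda: print("fault halted"),
    abort_conditions={
        "other systems impacted": lambda: False,
        "application unresponsive": lambda: False,
        "wrong target": lambda: False,
    },
    duration_s=0.05,
    poll_s=0.01,
)
print(outcome)
```

Putting `stop()` in a `finally` block is the key design choice: the fault is rolled back even if a check raises an exception partway through.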

5. Identify failures, deploy fixes, and re-test

When your experiment finishes running, compare the metrics gathered during the experiment to your baseline. Is the impact what you expected? Were there any unusual results, such as simultaneous spikes across multiple different metrics? Do the results prove your hypothesis, do they refute it, or do they not provide a clear answer? And most importantly, did the experiment reveal problems in your systems?
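One way to make the prove, refute, or unclear question less subjective is to write the hypothesis as a tolerance and classify the result against it. The sketch below assumes a hypothesis of the form "this metric stays within N% of steady state". The noise margin and the example numbers are hypothetical.

```python
def verdict(baseline: float, observed: float,
            max_change_pct: float, noise_pct: float = 0.0) -> str:
    """Classify an experiment result against a tolerance-style hypothesis.

    Results that land within noise_pct of the tolerance are treated as
    inconclusive rather than forced into a pass or fail.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    change_pct = abs(observed - baseline) / baseline * 100
    if change_pct <= max_change_pct - noise_pct:
        return "supported"
    if change_pct >= max_change_pct + noise_pct:
        return "refuted"
    return "inconclusive"


# Hypothesis: p95 latency stays within 20% of a 200 ms steady state.
print(verdict(200, 230, max_change_pct=20, noise_pct=2))  # 15% change
print(verdict(200, 250, max_change_pct=20, noise_pct=2))  # 25% change
print(verdict(200, 242, max_change_pct=20, noise_pct=2))  # 21% change
```

The three calls fall into the supported, refuted, and inconclusive categories respectively. An inconclusive result usually means repeating the experiment or collecting more samples before deciding on fixes.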

From your observations, create a list of fixes to implement on your systems. Once the fixes are in place, repeat the same experiments to validate that the fixes are working as intended.

A fix, like any change to a complex system, can impact the system beyond just the single service it was implemented for. That's why it's important to repeatedly run chaos experiments while gradually increasing the blast radius. This ensures that your fixes don't just improve the system(s) they were implemented for, but also improve reliability at a larger scale.

This is also an opportunity to improve your monitoring. If you notice any gaps in your monitoring/alerting setup, work on creating and deploying fixes for it. Remember, observability is key to Chaos Engineering, and having a robust monitoring setup makes running experiments much easier and much more effective. Some questions to ask of your monitoring setup are:

If the experiment caused an incident, did your monitors detect it?
Did automated alerts fire and notify the right parties about the incident?
Was a ticket automatically created, assigned to the right person, and given the correct severity level?

Chaos Engineering on AWS with Gremlin

Running Chaos Engineering experiments on AWS can seem daunting at first, but with the right tools and procedures, you can quickly start making your systems more reliable. Gremlin provides an AWS-ready platform that supports experimenting on the most popular AWS services including Amazon EC2 and Amazon EKS. Gremlin also helps you validate continued adherence to the AWS Well-Architected Framework (WAF) so you can get the most value out of the tools available to you.

Learn how Charter Communications uses Gremlin to ensure the reliability of their customer data platforms in AWS.

When you have a Chaos Engineering tool selected, watch our webinar: Continuous validation of the AWS Well-Architected Framework with Chaos Engineering.

Note

Each of these steps is described in detail in our guide on how to get started with Chaos Engineering.