We all want more time to innovate, to dream, and to make an impact. But unstable applications and fragile architectures rob us of that time. We spend too much of it reacting to outages instead of building stronger systems.

Chaos Engineering, which uses thoughtfully planned experiments to teach us how our systems behave in the face of failure, gives us that time back. Together as a community, we're on a mission to share Chaos Engineering practices so we can all have more time, space, and energy to innovate.

Here are our recommendations for getting started with Chaos Engineering today. 💥

Steps for getting started with Chaos Engineering

1. Consider your failure points and map dependencies

The practice of Chaos Engineering developed in response to the increased complexity of cloud-based architectures and shorter development cycles. These two factors introduce failure points that aren’t easily addressed by traditional testing methods. By contrast, Chaos Engineering tests the complex relationships and dependencies in distributed architectures, since this is where many critical systems fail.

Consider a single service in your architecture. What other upstream services does it rely on and what downstream services rely on it? It may be helpful to draw out a rough map of these various dependencies.
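If drawing the map by hand gets unwieldy, the same information can be kept as a small data structure. The following is a minimal sketch, not a prescribed tool: the service names and the `CALLS` mapping are hypothetical, and you would replace them with your own services.

```python
# Hypothetical services. Each key lists the services it calls directly.
CALLS = {
    "checkout": ["payments", "inventory"],
    "payments": ["fraud-check"],
    "inventory": [],
    "fraud-check": [],
}


def relies_on(service, calls):
    """Every service that `service` depends on, directly or transitively."""
    seen = set()
    stack = list(calls.get(service, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(calls.get(current, []))
    return seen


def relied_on_by(service, calls):
    """Every service that could be affected if `service` failed."""
    return {name for name in calls if service in relies_on(name, calls)}


print(sorted(relies_on("checkout", CALLS)))
print(sorted(relied_on_by("fraud-check", CALLS)))
```

The second query is usually the more interesting one for a chaos experiment: it lists the services you should watch when you inject a failure into the one you picked.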

2. Form a hypothesis

Once you have an idea of how your service interacts with other components of your architecture, think about where failures may occur.

For example:

  • Are services tightly coupled in a “distributed monolith” where a single service failure renders several or all other services inoperable?
  • Could increased network latency between services cascade (or multiply) throughout the system?
  • Are services necessary to core functionality (sometimes described as “in the critical path”) resilient to common scenarios like node failure?

Take one of these potential failure modes and develop a hypothesis. For example, “If Service A experiences a node failure, it will failover within an acceptable amount of time with no impact to Service B.” Then, consider what metrics you would need to evaluate your hypothesis. In our example, we could look at latency and error count from Service B.
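It can help to write the hypothesis down together with the numbers that would confirm or refute it. Here is one possible sketch for the Service A and Service B example. The `Hypothesis` class, the `holds` function, and the threshold values are all assumptions made for illustration; what counts as “acceptable” is for you to decide.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    statement: str
    max_p99_latency_ms: float  # assumed latency threshold for Service B
    max_error_count: int       # assumed error threshold for Service B


def holds(hypothesis, observed_p99_latency_ms, observed_error_count):
    """True when Service B stayed within both thresholds during the experiment."""
    return (
        observed_p99_latency_ms <= hypothesis.max_p99_latency_ms
        and observed_error_count <= hypothesis.max_error_count
    )


failover = Hypothesis(
    statement="If Service A loses a node, it fails over with no impact to Service B.",
    max_p99_latency_ms=250.0,  # hypothetical value
    max_error_count=0,         # hypothetical value
)

print(holds(failover, observed_p99_latency_ms=180.0, observed_error_count=0))
```

Fixing the thresholds before the experiment runs keeps you from adjusting the definition of success after you have seen the results.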

For your first experiment, it can be useful to choose a hypothesis for which you have a relatively high level of confidence. However, Chaos Engineering is an iterative process of experimentation, so don’t get too hung up on choosing the “right” thing to test first. Every chaos experiment is an opportunity to learn more about your system.

3. Define the smallest possible blast radius

We recommend running your first chaos experiment in a non-production environment. While Chaos Engineering provides more value when conducted in production environments, it is prudent to minimize risk when first starting out.

When you determine the conditions for your experiment, you are establishing the “blast radius.” For your first experiment, try to set the blast radius to be as small as possible. You can always expand the radius later if your system doesn’t respond to a minimal stimulus. For example, if you are testing a service’s tolerance to node failure, start by shutting down or restarting a single instance of a single service.

At this point, make a note of what other metrics you want to monitor to ensure that your system doesn’t go “out of bounds.” These abort conditions are a line in the sand that ensures any chaos experiment can be immediately halted if it produces unexpected negative results.
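Abort conditions are easiest to act on when they are written as explicit limits before the experiment starts. The sketch below shows one way to express them. The metric names and limits are hypothetical, and the code only evaluates values you hand it; fetching them from your monitoring system is left out.

```python
# Hypothetical abort thresholds; choose metrics and limits that fit your system.
ABORT_CONDITIONS = {
    "error_rate": 0.05,
    "p99_latency_ms": 1000.0,
}


def breached(metrics, limits):
    """Sorted names of limits that are exceeded or cannot be checked.

    A metric missing from `metrics` counts as a breach: if you cannot
    see a value, you cannot tell whether the system is still in bounds.
    """
    return sorted(
        name
        for name, limit in limits.items()
        if name not in metrics or metrics[name] > limit
    )


def should_abort(metrics, limits=ABORT_CONDITIONS):
    """True when at least one abort condition has been breached."""
    return bool(breached(metrics, limits))


print(should_abort({"error_rate": 0.01, "p99_latency_ms": 300.0}))
print(breached({"error_rate": 0.20, "p99_latency_ms": 300.0}, ABORT_CONDITIONS))
```

Treating a missing metric as a breach is a deliberate, conservative choice in this sketch. You may prefer a different policy, but it is worth deciding on one before the experiment rather than during it.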

4. Run your first attack

Now it’s time for the fun part!

There are many ways to introduce failure into your system, but we advocate using Gremlin for a few key reasons:

  • Safety. Built-in safety features like the “Halt” button and Status Checks minimize the risk of your experiment causing unexpected harm to your system and allow you to roll back any experiment quickly and completely.
  • Security. Gremlin offers SSO and role-based access controls so you can be confident that only authorized users are able to impact your system.
  • Simplicity. Gremlin’s Linux or Windows agents are quick and easy to install so you can spend more time on experimentation than setup.

If you choose to use Gremlin, select the attack type, choose the target host or container, and set the magnitude of the attack.

5. Observe the results

After you initiate the attack, closely monitor the metrics you defined in steps 2 and 3. Once the attack finishes, or as soon as you have enough data to validate or invalidate your hypothesis, halt it if it is still running and return your system to steady state.

Based on what you observed, was your hypothesis correct? Note down the results.
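A consistent record makes it easier to compare runs later. The following is a minimal sketch of what such a note could look like as structured data. The `ExperimentResult` fields and the example values are assumptions for illustration, not a format used by any particular tool.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class ExperimentResult:
    hypothesis: str
    target: str
    blast_radius: str
    observed: dict          # metric name -> value seen during the experiment
    hypothesis_held: bool
    notes: str = ""


def to_json(result):
    """Serialize a result so it can be stored next to earlier runs."""
    return json.dumps(asdict(result), indent=2, sort_keys=True)


# Hypothetical values, for illustration only.
result = ExperimentResult(
    hypothesis="If Service A loses a node, it fails over with no impact to Service B.",
    target="service-a",
    blast_radius="1 instance, non-production",
    observed={"service_b_p99_latency_ms": 180.0, "service_b_error_count": 0},
    hypothesis_held=True,
)

print(to_json(result))
```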

6. Scale or squash

At this point, you need to decide whether your experiment uncovered issues that you need to remediate or if it demonstrated that your system was resilient to the failure you injected.

If a fix is needed, be sure to re-run the same experiment against the updated version to validate that the fix works as intended.

If your system handled the failure gracefully, consider expanding the magnitude or the blast radius of your experiment, repeating these steps until you either find the point at which your system fails or are comfortable with your system’s level of resilience.
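The expand-and-repeat loop described above can be sketched as a short function. This is an illustration under stated assumptions: `escalate` and `simulated_experiment` are hypothetical names, and the simulated experiment simply pretends that the system tolerates losing up to two instances. In practice, `run_experiment` would wrap whatever you use to inject the failure and evaluate your hypothesis.

```python
def escalate(run_experiment, blast_radii):
    """Run the experiment at each blast radius in order, smallest first.

    `run_experiment(radius)` should return True when the system handled
    the failure gracefully. Stops at the first radius that fails.
    Returns (largest radius handled, first radius that failed); either
    element may be None.
    """
    last_ok = None
    for radius in blast_radii:
        if not run_experiment(radius):
            return last_ok, radius
        last_ok = radius
    return last_ok, None


# Stand-in for a real experiment: pretend the system tolerates
# losing up to 2 instances at once.
def simulated_experiment(instances_stopped):
    return instances_stopped <= 2


print(escalate(simulated_experiment, [1, 2, 3, 4]))  # prints (2, 3)
```

Stopping at the first failing radius matters: once you have found a weakness, the next step is to fix it and re-run at that same radius, not to keep expanding.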

Resources for getting started

As you begin to incorporate Chaos Engineering into your development and management workflows, you may find the following resources useful.

Tutorials

Slack

With over 5,500 members, the Chaos Engineering Slack provides an opportunity to meet other engineers practicing Chaos Engineering. You can ask questions, get feedback, or simply chat with your peers. (You even get a sheet of Chaos Engineering stickers just for joining!)

Events & Meetups

The global Chaos Engineering community is growing! Between conferences, meetups, webinars, and talks, you can find reliability-related events in your area. To hear about events happening in your region, check out Meetup or subscribe to Breaking News, the Chaos Engineering Newsletter.

Advocating for Chaos Engineering within your organization

As with any new initiative, you may encounter inertia that requires you to take specific steps to secure buy-in to roll out a Chaos Engineering program at your company. We have put together resources to help you through this process:

Tammy Butow
Principal SRE