Chaos Engineering

Chaos Engineering is a disciplined approach to identifying potential failures before they become outages.

What is Chaos Engineering?

Chaos Engineering is a disciplined approach to identifying potential failures before they become outages. Ultimately, its goal is to enhance the stability and resiliency of our systems.

Creating reliable software is a fundamental necessity for modern cloud applications and architectures. As systems become increasingly distributed by design, the potential for unplanned failures and unexpected outages increases significantly. Thankfully, Chaos and Reliability Engineering techniques are quickly gaining traction within the community. Many organizations, both big and small, have embraced Chaos Engineering over the last few years.

NOTE: The articles in this guide use the term "team" to mean a single group responsible for an application that you are testing, or considering testing, with chaos experiments like those described in this guide.

  • What Is This Guide?

    Chaos Engineering is a new practice within the realm of DevOps and Site Reliability Engineering. There are a variety of thoughts and opinions about what it is or what it should be, mostly at a high level. Our goal is to provide clarity about how to move beyond theories and concepts into practical steps.

    This guide was created to give specifics: details, concrete examples, and direction for those who have bought into the idea and want to know what to actually do to get started.

  • Who Is This Guide For?

    This guide is for Site Reliability Engineers (SREs), DevOps practitioners, Platform Engineers, and anyone else thinking about how to enhance the reliability of their computing systems, especially by improving those systems’ ability to stay up and running and provide a good experience for end users even when problems like component failures arise. It is specifically for people who want guidance without a lot of marketing and sales verbiage. At Gremlin, we have created what we believe is a user-friendly and powerful means of implementing Chaos Engineering, and we hope you will consider and ultimately use it. At the same time, we have intentionally written this content in a platform-agnostic way so that you can see the value of what Chaos Engineering offers.

  • Why Did We Create This Guide?

    We want to build upon the introductory content on our website and in our presentations. Those high-level views are intended to whet the appetite. Here we flesh out the idea in much greater detail, including implementation examples and a precise definition of what is needed before you start and while you implement in your current setting. We then follow with some extra ideas to spark your imagination for the future.

Contents

There are two main sections to our Chaos Engineering content. First, we start with a series outlining the stages of reliability within an organization as you progress from first learning about Chaos Engineering to preparing to implement it and on to increasing adoption. Second, we have a series of individual posts, each illustrating how one might choose to implement Chaos Engineering with specific technologies, architectures, and approaches that may be a part of your stack or something you are considering for the future.

Chaos Engineering Through Staged Reliability

In his ChaosConf 2018 talk titled Practicing Chaos Engineering at Walmart, Walmart’s Director of Engineering Vilas Veeraraghavan outlines how he and the hundreds of engineering teams at Walmart have implemented Resilience Engineering (which we will refer to as the pursuit of reliability within SRE). By creating a robust series of “levels” or “stages” that each engineering team can work through, Walmart is able to progressively improve system reliability while dramatically reducing support costs.

This series expands on this model by diving deep into the five Stages of Reliability. Each post examines the necessary components of a stage, describes how those components are evaluated and assembled, and outlines the step-by-step process necessary to move from one stage to the next.

This series also digs into the specific implementation of each stage by progressing through the entire process with a real-world, fully functional API application hosted on AWS. We’ll go through everything from defining and executing disaster recovery playbook scenarios to improving system architecture and reducing the recovery time objective (RTO), recovery point objective (RPO), and applicable support costs for this example app.

With a bit of adjustment for your own organizational needs, you and your team can implement similar practices to add Chaos Engineering to your own systems with relative ease. After climbing through all five stages, your system and its deployment will be almost entirely automated and will feature significant resiliency testing and robust disaster recovery failover.

An additional tool to help you get started is Gremlin's reliability calculator.

Preparing for Disaster

Stage 0 is all about implementing good site reliability engineering practices, laying the groundwork for Chaos Engineering. The steps outlined in this post aren't necessarily prerequisites; instead, they will evolve naturally alongside your Chaos Engineering practice.

What to do in Stage 0

  • Establish observability
  • Define the critical dependencies
  • Define the non-critical dependencies
  • Create a disaster recovery failover playbook
  • Create a critical dependency failover playbook
  • Create a non-critical dependency failover playbook
  • Publish the above and get team-wide agreement
  • Manually execute a failover exercise
  • Implementation example
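
As a rough illustration of the first few items above, the sketch below records an application's dependencies, marks each one as critical or non-critical, and attaches the failover steps written for it. The names used here (`Dependency`, `split_by_criticality`, `missing_playbooks`, and the example services) are hypothetical and are not taken from the guide. A real inventory would more likely live in a shared document or service catalog that the whole team has agreed on.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Dependency:
    """One service or resource the application relies on (hypothetical model)."""
    name: str
    critical: bool  # True if the application cannot serve requests without it
    failover_steps: List[str] = field(default_factory=list)


def split_by_criticality(deps):
    """Return (critical, non_critical) lists, preserving input order."""
    critical = [d for d in deps if d.critical]
    non_critical = [d for d in deps if not d.critical]
    return critical, non_critical


def missing_playbooks(deps):
    """Names of dependencies that have no failover steps written down yet."""
    return [d.name for d in deps if not d.failover_steps]


# Example inventory; the service names are made up for illustration.
inventory = [
    Dependency("primary-database", True, ["Promote the replica", "Repoint the API"]),
    Dependency("image-cdn", False, ["Serve images from origin"]),
    Dependency("email-provider", False),
]

critical, non_critical = split_by_criticality(inventory)
```

Calling `missing_playbooks(inventory)` here would flag `email-provider`, which is the kind of gap worth closing before a manual failover exercise.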

Injecting Chaos Internally

Stage 1 covers the early phase of implementing Chaos Engineering, where you begin to inject failure into non-production systems and establish good practices for documenting what you learn.

What to do in Stage 1

  • Perform critical dependency failure tests in non-production
  • Publish test results
  • Implementation example
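
To make the idea of a critical dependency failure test more concrete, here is a minimal, self-contained sketch. It assumes a toy application function (`get_price`) and a stand-in dependency (`lookup_price`), both hypothetical, and simulates the outage in-process by wrapping the dependency call. A real Stage 1 test would inject the failure into an actual non-production environment rather than into a Python function, but the shape is similar: state a hypothesis, inject the failure, observe, and record a result that can be published.

```python
class DependencyDown(Exception):
    """Raised by the simulated dependency while a fault is injected."""


def make_flaky(func, fail):
    """Wrap func so that it raises DependencyDown whenever fail() is true."""
    def wrapper(*args, **kwargs):
        if fail():
            raise DependencyDown(func.__name__)
        return func(*args, **kwargs)
    return wrapper


def lookup_price(item):
    """Stand-in for a call to a critical dependency (hypothetical)."""
    return {"widget": 10}[item]


def get_price(item, lookup, cached_price=None):
    """Application code under test: fall back to a cached price on failure."""
    try:
        return lookup(item)
    except DependencyDown:
        return cached_price


def run_experiment():
    """Inject the failure, observe behavior, and return a publishable record."""
    broken_lookup = make_flaky(lookup_price, fail=lambda: True)
    observed = get_price("widget", broken_lookup, cached_price=9)
    return {
        "hypothesis": "get_price returns the cached price when the lookup is down",
        "observed": observed,
        "passed": observed == 9,
    }


result = run_experiment()
```

The returned dictionary is one possible starting point for the "publish test results" step; the field names are illustrative only.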

Pushing the Envelope Forward

Stage 2 helps you take your first steps into automation and testing in production.

What to do in Stage 2

  • Perform frequent, semi-automated tests
  • Execute a reliability experiment in production
  • Publish test results
  • Implementation example

Automating Chaos Internally

Stage 3 is where you implement fully automated testing in your non-production systems and begin figuring out how to automate disaster recovery failover.

What to do in Stage 3

  • Automate resiliency testing in non-production
  • Semi-automate disaster recovery failover
  • Implementation example

Injecting Automated Chaos Everywhere

Stage 4 is a fully mature implementation of Chaos Engineering, where you begin to add ideas of your own and expand your testing plan.

What to do in Stage 4

  • Integrate reliability testing in CI/CD
  • Automate reliability and disaster recovery failover testing in production
  • Implementation example
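
One way to picture reliability testing inside CI/CD is as a gate: the pipeline runs its experiments, then a small script decides whether the build may proceed. The sketch below is an assumption about how such a gate could look, not a description of any particular tool. The `gate` function and the result format (`name`, `passed`) are hypothetical.

```python
def gate(results):
    """Return 0 if every experiment passed, otherwise 1 (a failing exit code)."""
    failed = [r["name"] for r in results if not r["passed"]]
    for name in failed:
        print(f"reliability experiment failed: {name}")
    return 1 if failed else 0


# Hypothetical results collected earlier in the pipeline.
results = [
    {"name": "database-failover", "passed": True},
    {"name": "cache-latency", "passed": False},
]

# A pipeline step would typically pass this value to sys.exit()
# so that a failed experiment stops the deployment.
exit_code = gate(results)
```

Returning an exit code rather than raising keeps the gate easy to call from any CI system that treats a non-zero status as a failed step.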

Chaos Engineering and Technology Options

This is a series covering interesting technologies, architectures, and approaches that companies are using today or considering for the future.

Chaos Engineering Article

Chaos Engineering for Serverless Infrastructure

Serverless deployments are becoming an important facet of many companies' overall application architectures and must also be tested with Chaos Engineering experiments to enhance reliability. Here is how.

Chaos Engineering Article

Chaos Engineering Tools Comparison

This article describes some of the common tools that the Chaos Engineering community considers when starting to implement the practice in an organization. The goal is to give a high-level introduction to some frequently mentioned options and list some of the strengths of each, using a brief table followed by an annotated list.

Chaos Engineering Article

Chaos Engineering for Istio Service Mesh

Istio is a popular open source, cloud-native service mesh. This article demonstrates how to perform a few Chaos Engineering experiments using features already available in Istio.

Community Tutorial

Chaos Engineering: the history, principles, and practice

With the rise of microservices and distributed cloud architectures, the web has grown increasingly complex. We all depend on these systems…

Blog Post

What is Chaos Engineering? SREs and Leaders Define the Practice & Where It's Going

Chaos Engineering is a practice that is growing in implementation and interest. What is it and why are some of the most successful companies…
