Gremlin

The Engineer's Guide to Chaos Engineering

Chaos Engineering is a disciplined approach to identifying potential failures before they become outages. Ultimately, the goal of Chaos Engineering is to enhance the stability and resiliency of our systems.

This guide offers specifics: details, concrete examples, and direction for those who have bought into the idea and want to know what to actually do to get started.

What Is Chaos Engineering?

Creating resilient software is a fundamental necessity for modern cloud applications and architectures. As systems become increasingly distributed by design, the potential for unplanned failures and unexpected outages grows significantly. Thankfully, Chaos and Resilience Engineering techniques are quickly gaining traction within the community. Many organizations, both big and small, have embraced Chaos Engineering over the last few years.

What Is This Guide?

Chaos Engineering is a new practice within the realm of DevOps and site reliability engineering. There are a variety of thoughts and opinions about what it is or what it should be, mostly at a high level. Our goal is to clarify how to move beyond theories and concepts to practical steps.

Who Is This Guide For?

This guide is for site reliability engineers (SREs), DevOps practitioners, platform engineers, and anyone else thinking about how to enhance the reliability of their computing systems, especially their ability to stay up and running and provide a good experience for end users even when problems such as component failures arise. It is specifically for people who want guidance without a lot of marketing and sales verbiage. At Gremlin, we have created what we believe is a user-friendly and powerful means of implementing Chaos Engineering, and we hope you will consider and ultimately use it. At the same time, we have intentionally written this content in a platform-agnostic way so that you can see the value of what Chaos Engineering offers.

Why Did We Create This Guide?

We want to build on the introductory content on the Gremlin website and in our presentations. Those high-level views are intended to whet the appetite. Here we flesh out the idea in much greater detail, including implementation examples and a precise definition of what is needed before you start and while you implement in your current setting. We then follow with some extra ideas to spark your imagination for the future.

Over a decade of collective experience unleashing chaos at companies like

  • Amazon
  • Netflix
  • Salesforce