Chaos Engineering: Finding Failures Before They Become Outages

Learn the basics of Chaos Engineering: discover the tools, tests, and culture needed to create better software and prevent outages and downtime.

Download the Whitepaper

About the Authors

Jordan Pritchard

Director of Infrastructure & Site Reliability Engineering

Michael Kehoe

Architect of reliable, scalable infrastructure

Rodney Lester

Technical Lead, Reliability Pillar of Well Architected Program

Tammy Butow

Principal SRE

Jay Holler

Manager, Site Reliability Engineering

Ramin Keene


Download the whitepaper to get a primer on Chaos Engineering and learn:

  • The fundamentals of Chaos Engineering and how to get started.
  • How Amazon and Netflix approach Chaos Engineering.
  • Benefits of Chaos Engineering, the culture, and popular tools.
  • Practical applications of Chaos Engineering in production.
  • Scheduling and automating regular Chaos Experiments.
  • Incident classification: SEV descriptions and levels, and SEV and time-to-detection (TTD) timelines.
  • Organization-wide critical service monitoring, including key dashboards and KPI metrics emails.
  • Service ownership and metrics for organizations maintaining a microservices architecture.
  • Effective on-call principles for site reliability engineers, including rotation structure, alert threshold maintenance, and escalation practices.
  • Chaos Engineering practices to identify random and unpredictable behavior in your system.
  • Monitoring and metrics to detect incidents caused by self-healing systems.
  • Creating a high-reliability culture by listening to people in your organization.

By thoughtfully injecting failure into their systems, engineers can find vulnerabilities and address them before they result in downtime and lost revenue.

This whitepaper provides a comprehensive introduction to the discipline of Chaos Engineering, including why it is needed more than ever, how to get started, and best practices for maximizing what you learn while reducing risk.

Over a decade of collective experience unleashing chaos at companies like

Avoid downtime. Use Gremlin to turn failure into resilience.

Gremlin empowers you to proactively root out failure before it causes downtime. See how you can harness chaos to build resilient systems by requesting a demo of Gremlin.
