There are two main sections to our Chaos Engineering content. First, we start
with a series outlining the stages of reliability within an organization as you
progress from first learning about Chaos Engineering, to preparing to implement
it, and on to increasing adoption. Second, we have a series of individual posts
that each illustrate how one might choose to implement Chaos Engineering with
specific technologies, architectures, and approaches that may be a part of
your stack or something you are considering for the future.
Chaos Engineering Through Staged Reliability
In his ChaosConf 2018 talk titled “Practicing Chaos Engineering at Walmart,”
Walmart’s Director of Engineering Vilas Veeraraghavan outlines how he and the
hundreds of engineering teams at Walmart have implemented Resilience
Engineering (which we will refer to as the pursuit of reliability within SRE).
By creating a robust series of “levels” or “stages” that each
engineering team can work through, Walmart is able to progressively improve
system reliability while dramatically reducing support costs.
This series expands on this model by diving deep into the five Stages of
Reliability. Each post examines the necessary components of a stage, describes
how those components are evaluated and assembled, and outlines the
step-by-step process necessary to move from one stage to the next.
This series also digs into the specific implementation of each stage by
progressing through the entire process with a real-world, fully-functional API
application hosted on AWS. We’ll go through everything from defining and
executing disaster recovery playbook scenarios to improving system
architecture and reducing the recovery time objective (RTO), recovery point
objective (RPO), and applicable support costs for this application.
With a bit of adjustment for your own organizational needs, you and your team
can implement similar practices to add Chaos Engineering to your own systems
with relative ease. After climbing through all five stages, your system
and its deployment will be almost entirely automated and will feature
significant resiliency testing and robust disaster recovery failover.
An additional tool to help you get started is Gremlin's reliability calculator.