Knowing your systems and how they can fail: Twilio and AWS talk at Chaos Conf 2020

This year’s Chaos Conf was packed full of incredible talks from some of the industry’s foremost experts on Chaos Engineering. Two of these talks were particularly interesting to me: “The More You Know: A Guide to Understanding your Systems” by Twilio’s Tyler Wells, and “Failing Over without Falling Over” by AWS’s Adrian Cockcroft. Both talks set out to answer an important question: what are we aiming to accomplish with Chaos Engineering, and how do we do it thoughtfully?

Be wary of “Availability Theater”

The first question Adrian asked is: do you have a backup datacenter, and if so, how often do you fail over to it? If you haven’t practiced failing over, or you’re not confident in your ability to fail over in an actual emergency, you only have a facade of availability.

If you’ve got a backup datacenter...and you’re not confident that you can fail over to it at a moment’s notice, you’ve invested a lot of money for a facade of availability.

Adrian Cockcroft

VP CLOUD ARCHITECTURE STRATEGY AT AWS

Testing failover is difficult in modern systems because they are already complex, and the failover process itself adds even more complexity. As this complexity builds upon itself, it becomes more likely that a failure occurs during failover and the whole system falls over.

Adrian used the analogy of complex systems as cables vs. chains. Chains are only as strong as their weakest link, while cables have multiple strands that can break before the entire cable snaps. Adding resilience to a system is like adding strands to your cable: it gives you a larger safety margin before failure. What we need to do is test these safety margins and learn the cause of snapped strands before we have a catastrophic failure.
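To put rough numbers on the analogy, here is a minimal Python sketch. It assumes every link or strand fails independently with the same probability `p`, which is a simplification of mine and not something stated in the talk (real strands share load, so their failures are correlated). The function names are hypothetical.

```python
def chain_failure_probability(p: float, n: int) -> float:
    """A chain of n links fails if ANY single link fails."""
    return 1 - (1 - p) ** n


def cable_failure_probability(p: float, n: int) -> float:
    """A cable of n strands fails only if ALL strands fail."""
    return p ** n


if __name__ == "__main__":
    p = 0.01  # assumed per-component failure probability
    for n in (1, 3, 5):
        chain = chain_failure_probability(p, n)
        cable = cable_failure_probability(p, n)
        print(f"n={n}: chain={chain:.6f} cable={cable:.2e}")
```

Under these assumptions, adding links makes a chain more likely to fail, while adding strands makes a cable less likely to fail. The gap between the two is the safety margin that chaos experiments are meant to probe.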

Build up that muscle memory to understand your systems if things go right or wrong. To know your systems is to love your users.

Tyler Wells

SENIOR DIRECTOR AT TWILIO

How do I prepare my systems?

Once we’ve identified a need for greater reliability, we need to answer the difficult question of “how do we do it?” Tyler set out to answer this question by creating a framework for learning your systems and using Chaos Engineering in a structured way to test the resilience of these systems. The first component of his framework is the service context, which tells us:

What the system does.
How customers use it.
The quality of service we’re targeting with this system.
How the system’s current quality of service compares to our service level objectives (SLOs).
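One way to capture a service context in code is a small record that holds the four items above and compares measured quality against the SLO. The sketch below is illustrative only: the `ServiceContext` class, its field names, and the request-based availability SLO are my assumptions, not part of Tyler's framework or Twilio's template.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceContext:
    name: str
    purpose: str             # what the system does
    customer_usage: str      # how customers use it
    slo_availability: float  # targeted fraction of successful requests

    def measured_availability(self, successful: int, total: int) -> float:
        """Current quality of service over some measurement window."""
        if total <= 0:
            raise ValueError("total must be positive")
        return successful / total

    def meets_slo(self, successful: int, total: int) -> bool:
        """Compare current quality of service to the SLO."""
        return self.measured_availability(successful, total) >= self.slo_availability

    def error_budget_used(self, successful: int, total: int) -> float:
        """Fraction of allowed failures already consumed (above 1.0 means overspent)."""
        allowed_failures = (1 - self.slo_availability) * total
        actual_failures = total - successful
        if allowed_failures == 0:
            return float("inf") if actual_failures else 0.0
        return actual_failures / allowed_failures


if __name__ == "__main__":
    ctx = ServiceContext(
        name="sms-api",  # hypothetical service
        purpose="Accepts and delivers SMS messages",
        customer_usage="REST calls from customer backends",
        slo_availability=0.999,
    )
    print(ctx.meets_slo(successful=999_500, total=1_000_000))
    print(round(ctx.error_budget_used(successful=999_500, total=1_000_000), 3))
```

Writing the context down in this form makes the last item on the list a calculation rather than a judgment call, which is useful when deciding whether a chaos experiment pushed the service outside its target.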

In his talk, Adrian turned to the engineers and experts who have worked on safety-critical systems. He presented the book Engineering a Safer World by Nancy G. Leveson (handbook and talks available here) and introduced System-Theoretic Process Analysis (STPA) as a way of breaking down complex, dynamic systems so that we can more easily identify failure modes (called hazards). In this model, a system has three distinct entities:

A controlled process (e.g. an application).
An automated controller (e.g. a deployment platform).
A human controller (e.g. a developer or SRE).

Put into practical terms, think about a Kubernetes application. Kubernetes is the automated controller, the application is the controlled process, and we are the human controller. We send commands to Kubernetes to control our application, Kubernetes operates on the application, then it informs us of the application’s state. When looking for failure modes, we should look for any conditions that could disrupt this process and build resilience or redundancy against them.
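The control loop described above can be sketched as a toy model in Python. This is not an implementation of STPA and does not use the Kubernetes API; the class names, the `actuator_ok` and `sensor_ok` flags, and the replica counts are all hypothetical. It only illustrates one kind of hazard: the feedback path breaks, so the human controller keeps seeing a stale, healthy-looking state.

```python
from dataclasses import dataclass


@dataclass
class ControlledProcess:
    """The application being controlled."""
    replicas: int = 0


@dataclass
class AutomatedController:
    """Stands in for a deployment platform in this toy model."""
    process: ControlledProcess
    desired_replicas: int = 0
    actuator_ok: bool = True  # can the controller act on the process?
    sensor_ok: bool = True    # can the controller read the real state?
    last_reported: int = 0

    def receive_command(self, desired: int) -> None:
        """Control action sent by the human controller."""
        self.desired_replicas = desired

    def reconcile(self) -> None:
        """Drive the process toward the desired state, if the actuator works."""
        if self.actuator_ok:
            self.process.replicas = self.desired_replicas

    def report(self) -> int:
        """Feedback to the human; stale when the sensor path is broken."""
        if self.sensor_ok:
            self.last_reported = self.process.replicas
        return self.last_reported


def human_sees_truth(controller: AutomatedController) -> bool:
    """True when the reported state matches the real state."""
    return controller.report() == controller.process.replicas


if __name__ == "__main__":
    controller = AutomatedController(ControlledProcess())
    controller.receive_command(3)
    controller.reconcile()
    print("healthy loop:", human_sees_truth(controller))

    # Inject a fault: feedback breaks, then two replicas crash.
    controller.sensor_ok = False
    controller.process.replicas = 1
    print("broken feedback:", human_sees_truth(controller))
```

Each flag corresponds to a link in the loop that a chaos experiment could deliberately break, which is one way to turn the list of entities into a list of candidate experiments.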

Proactively testing for failure with GameDays

Once we know more about what we’re testing for, how do we actually go about testing for it? And if we have redundant systems in place, how do we test failing over safely?

The answer is GameDays. A GameDay is time set aside for teams to test the resiliency of a system, observe the outcomes, and make actionable decisions based on those outcomes. The goal is to see whether our systems behave the way we expect, and if they fail, find ways to address the failure.

Before jumping into a GameDay, we need to be prepared. Fortunately, Tyler open-sourced Twilio’s GameDay template immediately after his Chaos Conf talk. Thanks, Tyler! When prepping for a GameDay, we want to:

Clearly define our service context.
Make sure our test environment is healthy.
Set up our observability and alerting tools.
Prepare any synthetic testing or load generating tools to run alongside our chaos experiments.

When we execute the GameDay, we want to look for things like:

Did our systems behave the way we expected?
Are our metrics reporting the right values, and are our dashboards showing meaningful data?
Are we still meeting our SLOs?
Where can we improve in our operations, our monitoring, and in our runbooks?
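The preparation and review questions above can be tracked with a small amount of structure. The following Python sketch is my own illustration and is not Twilio's GameDay template; the `Experiment` class, its fields, and the `incomplete_prep` helper are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass


def incomplete_prep(prep_checks: dict[str, bool]) -> list[str]:
    """Return the prep items that are not done; start only when this is empty."""
    return [name for name, done in prep_checks.items() if not done]


@dataclass
class Experiment:
    name: str
    hypothesis: str        # how we expect the system to behave
    expected_alert: bool   # should our alerting fire during the experiment?
    alert_fired: bool      # did it actually fire?
    slo_target: float      # e.g. fraction of successful requests
    observed_sli: float    # what we measured during the experiment

    def findings(self) -> list[str]:
        """Turn observations into follow-up items for the review."""
        notes = []
        if self.alert_fired != self.expected_alert:
            notes.append("alerting did not behave as expected")
        if self.observed_sli < self.slo_target:
            notes.append("SLO missed during the experiment")
        return notes


if __name__ == "__main__":
    prep = {
        "service context defined": True,
        "test environment healthy": True,
        "observability and alerting set up": True,
        "load generation ready": False,
    }
    print("blocked by:", incomplete_prep(prep))

    exp = Experiment(
        name="kill one cache node",  # hypothetical experiment
        hypothesis="Requests fall back to the database with no errors",
        expected_alert=True,
        alert_fired=False,
        slo_target=0.999,
        observed_sli=0.9995,
    )
    print("findings:", exp.findings())
```

An empty findings list means the hypothesis held and monitoring behaved as expected. Anything else becomes an action item for operations, monitoring, or runbooks.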

Make sure to check out Twilio’s GameDay template on GitHub and give the repository a star if you find it useful!

Conclusion

Knowing your systems is the first step towards knowing how they can fail. But with how complex modern systems are—and with failover mechanisms adding even more complexity—we need a way to understand our systems and how we expect them to operate. Adrian and Tyler both offer frameworks for wrapping our heads around this complexity, making it easier to understand our systems so we can move forward with making them reliable and resilient.

You can watch Adrian’s talk here: