It's here: Gameday has arrived. You've prepared diligently -- studied the principles of chaos engineering, accepted that failure will happen, injected failure yourself. You are ready to prove that you have what it takes to survive an outage in production. It's time to run a Gameday.
Gamedays are like fire drills -- an opportunity to practice a potentially dangerous scenario in a safer environment. They are the capstone which allows us to measure the resilience of a system. Running a Gameday tests our company -- from engagement to incident resolution, across team boundaries and job titles. It verifies the system at scale, ultimately in production. By proactively testing these events, we can choose the terms of engagement -- the time, the place and the root cause.
The first step in any drill is knowing what to practice. Are you evacuating one of your data centers or one of your cloud provider's regions? Validating that the loss of a key dependency won't bring you down? Planning out the Gameday is a great opportunity to collaborate with other parts of the company, share context, and learn about the system as a whole.
When in doubt, overcommunicate, especially when you're going to be breaking things. Let everyone interested or involved know your plan. Have a Command Center -- a chat room, conference bridge, or conference room -- where anyone can check on the status of the Gameday. Share your success and abort criteria, as well as any dashboards you are watching. Having many eyes on the problem can speed up detection if things go wrong.
Always have a rollback plan and a set of abort criteria. A common rule: if there is any customer-facing impact beyond what is expected, roll the experiment back and begin an investigation. Monitoring, alerting, and engagement are themselves key parts of your system to verify.
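The rollback-and-abort rule above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a prescribed implementation: the thresholds in AbortCriteria and the inject, rollback, and read_metrics callables are all hypothetical stand-ins for whatever failure-injection tooling and monitoring you actually use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AbortCriteria:
    """Hypothetical thresholds agreed on before the Gameday starts."""
    max_error_rate: float       # fraction of failed requests, e.g. 0.02 for 2%
    max_p99_latency_ms: float   # worst acceptable 99th-percentile latency


def should_abort(error_rate, p99_latency_ms, criteria):
    """Return True when any observed metric exceeds its agreed threshold."""
    return (error_rate > criteria.max_error_rate
            or p99_latency_ms > criteria.max_p99_latency_ms)


def run_experiment(inject, rollback, read_metrics, criteria, checks=5):
    """Inject failure, poll metrics, and always roll back afterwards.

    inject, rollback, and read_metrics are assumed callables supplied by
    the caller; read_metrics returns (error_rate, p99_latency_ms).
    """
    inject()
    try:
        for _ in range(checks):
            error_rate, p99_latency_ms = read_metrics()
            if should_abort(error_rate, p99_latency_ms, criteria):
                return "aborted"
        return "completed"
    finally:
        # The rollback runs whether the experiment completed, aborted,
        # or raised an exception while reading metrics.
        rollback()
```

Putting the rollback in a finally clause is a deliberate choice here: the system is restored even if the monitoring call itself fails, which is exactly the kind of surprise a Gameday tends to surface.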
It is critical to understand and minimize the blast radius of an exercise. Run your Gameday first in a test or staging environment. Start with the smallest blast radius that will teach you something about your system. This may be breaking a single container, degrading a single instance, or injecting failure into a single request. Next, fail the entire service, zone, or a percentage of requests. At each step you either gain confidence in your system or find an issue that needs to be fixed.
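One way to picture the escalation described above is as an ordered list of stages, each failing a larger fraction of requests, where the next stage runs only if the previous one passed. The sketch below assumes a hypothetical run_stage callable that performs one stage and reports whether the system behaved acceptably; the stage fractions are illustrative, not recommendations.

```python
import random

# Hypothetical fractions of requests to fail at each stage.
STAGES = [0.01, 0.10, 0.50, 1.00]


def should_inject(fraction, rng):
    """Decide whether a single request should receive injected failure."""
    return rng.random() < fraction


def escalate(stages, run_stage):
    """Run stages from smallest to largest blast radius.

    run_stage(fraction) is an assumed callable that returns True when the
    system handled that stage acceptably. Escalation stops at the first
    stage that fails, so the issue can be fixed before going further.
    Returns the list of fractions that completed successfully.
    """
    completed = []
    for fraction in sorted(stages):
        if not run_stage(fraction):
            break
        completed.append(fraction)
    return completed
```

For example, escalate(STAGES, run_stage) with a run_stage that fails at 50% would return only the 1% and 10% stages, signalling where confidence ends and where the next fix is needed.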
There is a benefit to starting small and dialing it up -- different scales teach us different characteristics of our system. At small scale we test the functional: Do we handle exceptional cases correctly? Is our system usable in a degraded state? At large scale we learn about resource constraints and cascading failure: Do we protect ourselves if traffic builds up? Are our timeouts set aggressively enough? Only by testing the small and the large scale will we be prepared for what will occur in the real world.
The end goal of any Gameday is to run in production.
"Fail services regularly. Take down data centers, shut down racks, and power off servers. Regular controlled brown-outs will go a long way to exposing service, system, and network weaknesses. Those unwilling to test in production aren't yet confident that the service will continue operating through failures. And, without production testing, recovery won't work when called upon."
-- James Hamilton, On Designing and Deploying Internet-Scale Services
It's production that matters at the end of the day. That's where customers live, where the money is made. It's production's configuration that counts if things go wrong. Many systems are tuned for the 'happy case' and find themselves woefully unprepared when failure strikes.
Furthermore, when failure does strike, there is little time for learning during the event. Don't train your on-calls by handing them a pager and wishing them good luck. Teams need to proactively test their reactive skills! By regularly testing important scenarios, your teams will build muscle memory and be able to act quickly and confidently in a crisis. De-mystify and de-stress your incidents by practicing in advance!