OK, so you've decided that Chaos Engineering sounds like a good idea. How do you get started? We get that question a lot, and we wanted to outline some tips for implementing these practices in your environment.
A quick aside: chaos is a cool name, but it is a misnomer for the best way to approach failure testing. A design decision like enabling Chaos Monkey in a new environment can be a great way to enforce realistic constraints on the teams operating there. Applying a random strategy to an existing environment, however, can be a bit daunting. We think the best way to get started is a thoughtful, planned experiment that validates expected behavior.
One of the most powerful questions in Chaos Engineering is "What could go wrong?" By asking this question about our services and environments, we can review potential weaknesses and discuss expected outcomes. Similar to a risk assessment, this informs priorities about which scenarios are more likely (or more frightening) and should be tested first. By sitting down as a team and whiteboarding your services, their dependencies (both internal and external), and their data stores, you can form a picture of what could go wrong. When in doubt, injecting a failure or a delay into each of your dependencies is a great place to start.
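To make "injecting a failure or a delay" concrete, here is a minimal sketch of the idea at the application level. It assumes you can wrap the function that calls a dependency; the names `inject_fault`, `InjectedFailure`, and `fetch_inventory` are hypothetical and are not part of Gremlin or any other tool.

```python
import random
import time
from functools import wraps


class InjectedFailure(Exception):
    """Raised in place of a real dependency error during an experiment."""


def inject_fault(delay_seconds=0.0, failure_rate=0.0,
                 rng=random.random, sleep=time.sleep):
    """Wrap a dependency call so it can be delayed or made to fail.

    delay_seconds: latency added before every call.
    failure_rate:  fraction of calls (0.0 to 1.0) that raise InjectedFailure.
    rng and sleep are injectable so the wrapper can be tested deterministically.
    """
    def decorator(call):
        @wraps(call)
        def wrapper(*args, **kwargs):
            if delay_seconds > 0:
                sleep(delay_seconds)  # simulate a slow dependency
            if rng() < failure_rate:
                # simulate the dependency being unavailable
                raise InjectedFailure(f"injected failure in {call.__name__}")
            return call(*args, **kwargs)
        return wrapper
    return decorator


@inject_fault(delay_seconds=0.2, failure_rate=0.5)
def fetch_inventory(item_id):
    # Stand-in for a real call to a downstream service or data store.
    return {"item_id": item_id, "in_stock": True}
```

A wrapper like this only covers failures visible to your own code. Network-level or host-level failures need tooling that operates below the application.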
You've got an idea of what can go wrong. You've chosen a scenario (the exact failure to simulate) to inject. What happens next? This is an excellent thought exercise to work through as a team. By discussing the scenario, you can form a hypothesis about the expected outcome when it runs live. What will be the impact on customers, on your service, or on your dependencies?
To understand how your system behaves under duress, you need to measure it. It's good to have a key performance metric that correlates with customer success (such as orders per minute, or stream starts per second). As a rule of thumb, if you ever see an impact on this metric, halt the experiment immediately. Next, measure the failure itself so that you can verify (or disprove) your hypothesis. This could be the impact on latency, requests per second, or system resources. Lastly, survey your dashboards and alarms for unintended side effects.
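The "halt on customer impact" rule can be written down as an explicit guardrail before the experiment starts. The sketch below is one way to do that, under assumptions: `Guardrail`, `run_experiment`, `read_kpi`, and `halt` are hypothetical names, and the 5% threshold in the usage is only an illustration, not a recommended value.

```python
from dataclasses import dataclass


@dataclass
class Guardrail:
    """Abort threshold for a customer-facing metric."""
    baseline: float           # e.g. orders per minute before the experiment
    max_drop_fraction: float  # tolerated drop, e.g. 0.05 for 5%

    def breached(self, current: float) -> bool:
        # True when the metric has fallen below the tolerated floor.
        return current < self.baseline * (1.0 - self.max_drop_fraction)


def run_experiment(read_kpi, guardrail, checks, halt):
    """Sample the KPI up to `checks` times; call halt() on the first breach.

    Returns the samples observed, so the hypothesis can be reviewed afterwards.
    """
    samples = []
    for _ in range(checks):
        value = read_kpi()
        samples.append(value)
        if guardrail.breached(value):
            halt()
            break
    return samples


# Usage with canned readings standing in for a real metrics query.
readings = iter([100.0, 97.0, 90.0, 100.0])
samples = run_experiment(
    read_kpi=lambda: next(readings),
    guardrail=Guardrail(baseline=100.0, max_drop_fraction=0.05),
    checks=4,
    halt=lambda: print("halting experiment"),
)
```

In a real run, `read_kpi` would query your monitoring system and the loop would wait between samples. Both are left out here to keep the sketch self-contained.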
Always have a plan in case things go wrong. Know going in that sometimes even the backup plan can fail. Talk through the ways in which you're going to revert the impact. If you're running commands by hand, be careful not to break SSH or control plane access to your instances. One of the core aspects of Gremlin is safety. All of our attacks can be reverted, allowing you to safely abort if things go wrong.
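If you script your own experiments, one way to keep the revert step from being forgotten is to pair it with the failure in code. This is a minimal sketch, not how Gremlin implements reverts; `reversible`, `apply`, and `revert` are hypothetical names, and the dictionary stands in for real system state.

```python
from contextlib import contextmanager


@contextmanager
def reversible(apply, revert):
    """Apply a failure, then always attempt to revert it on the way out.

    The revert runs even if the body raises or is interrupted with Ctrl-C.
    If apply() itself fails, revert() is not called.
    """
    apply()
    try:
        yield
    finally:
        revert()


# Usage: a dictionary stands in for real system state.
state = {"dependency_blocked": False}

with reversible(
    apply=lambda: state.update(dependency_blocked=True),
    revert=lambda: state.update(dependency_blocked=False),
):
    # Observe the system here while the failure is active.
    pass
```

This protects against an exception in the experiment script, but not against the script's host dying mid-experiment, which is one way the backup plan itself can fail.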
After running your first experiment, there should be one of two outcomes: either you've verified that your system is resilient to the failure you introduced, or you've found a problem you need to fix. Both of these are good outcomes. In the first case, you've increased your confidence in the system and its behavior; in the second, you've found a problem before it caused an outage.
Chaos Engineering is a tool to make your job easier. By proactively testing and validating your system's failure modes you will reduce your operational burden, increase your availability, and sleep better at night.