Three Simple Questions to Help You Determine Your Readiness
Chaos Engineering is one of the most exciting and advanced activities in computing today. As companies break up their monolithic infrastructures into microservices and move from company-owned data centers to the cloud, system architecture is becoming more and more complex. Most cloud deployments are now complex beyond the ability of any one person to understand every part of the system.
This has led to an ever-growing interest in the science of Chaos Engineering, with an intent toward making systems more and more resilient to the inevitable bumps and potholes inherent in complexity.
Some will say that it is impossible to foresee every potential problem. While Gremlin does not disagree with the absolute in that statement, we believe it is possible to foresee many problems, design experiments to test our hypotheses, and then work to mitigate any issues we find, making our systems ever more resilient. We believe we will also find issues we did not previously anticipate, allowing for serendipitous work and even greater resilience.
This is not a simple task, and there are prerequisites to getting started. Outlining and clearly stating what these are is the purpose of this article. A series of upcoming articles will explore in greater depth the ideas we present here in brief.
The simplest way to understand your company’s preparedness for Chaos Engineering is to ask these three qualifying questions. If you can answer Yes to each of them, you are ready to begin your resilience journey. If not, you will now have the basic information you need to prepare.
- Does your company measure downtime?
- Can you quantify damage to the business as a result of downtime?
- Does someone own that number?
Answering these questions will let you know immediately whether reliability is a priority for your business. If it is, great! You are ready to begin. If the answer to any of these questions is No, then resilience is not yet an active priority and you will fight an uphill battle trying to implement Chaos Engineering in pursuit of resilience. You must have active buy-in from the business if you are to be successful.
Chaos Engineering is not a simple thing. There are myriad experiment types and options. Using even the most basic requires that you spend time setting up aspects of your infrastructure and testing mechanisms. This is a non-trivial task that will require time and effort from your engineers, even with a SaaS, web-based solution like the Gremlin web app that is designed to get you up and running easily.
Engineering time costs the company money. In order to show that the costs of performing Chaos Engineering experiments will result in a beneficial return on investment, you must be able to quantify the costs of non-resilient behavior. Measuring downtime, quantifying the resulting damage, and knowing who owns that number are the first steps to proving the value of engineering for resilience.
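As a rough illustration of quantifying that cost, downtime damage can be estimated from an availability figure and revenue per hour. This is a back-of-the-envelope sketch; the numbers below are hypothetical placeholders, not benchmarks.

```python
# Back-of-the-envelope downtime cost model. All figures are hypothetical
# placeholders -- substitute your own availability and revenue numbers.

HOURS_PER_YEAR = 24 * 365

def annual_downtime_hours(availability: float) -> float:
    """Hours of downtime per year implied by an availability level."""
    return HOURS_PER_YEAR * (1 - availability)

def annual_downtime_cost(availability: float, revenue_per_hour: float) -> float:
    """Rough cost of downtime: lost revenue during outage hours."""
    return annual_downtime_hours(availability) * revenue_per_hour

# Example: "three nines" at a hypothetical $10,000 of revenue per hour.
hours = annual_downtime_hours(0.999)
cost = annual_downtime_cost(0.999, 10_000)
print(f"{hours:.2f} h/year -> ${cost:,.0f}/year")
```

A model this simple ignores reputational damage and engineering time lost to firefighting, so it understates the true cost; it is a floor, which makes it a conservative starting point for the conversation with the business.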
Let’s think for a moment about the system traits that must be in place for even the most basic Chaos Engineering experiment. This will help you quantify the costs of implementation so that you can compare them with the cost of downtime.
You need observability into your system. You must be able to measure what is happening. Logging is a good start. Monitoring is better. Active alerting is best. As the television show Mythbusters taught us, "the only difference between screwing around and science is writing it down." We can say something similar about experimenting on our systems; if you aren’t measuring before you try something, you have no way to know the impact of your experiment. Observability is the first step.
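The "measure before you try something" idea amounts to capturing a baseline you can later compare an experiment against. Here is a minimal sketch, assuming a hypothetical latency metric; in a real system the simulated measurement would be replaced by your monitoring data.

```python
import random
import statistics

def measure_request_latency() -> float:
    """Stand-in for a real measurement (e.g., timing a service call).
    Simulates latency in milliseconds; replace with real monitoring data."""
    return random.gauss(mu=100.0, sigma=10.0)

def capture_baseline(samples: int = 1000) -> dict:
    """Record a steady-state baseline before any experiment runs."""
    latencies = [measure_request_latency() for _ in range(samples)]
    return {
        "mean_ms": statistics.mean(latencies),
        "p95_ms": statistics.quantiles(latencies, n=20)[-1],  # 95th percentile
    }

baseline = capture_baseline()
print(baseline)
```

With a baseline like this written down, the impact of a later experiment is a comparison, not a guess.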
You need to have defined your critical dependencies. What is the foundation of your system? What does it require to function? Do you know? Find out and document everything you can think of. When a problem arises, you need to know where to look and what to look at if you are to have any hope of a quick resolution and of limiting expensive downtime. If a dependency problem comes up, what will you do about it? Create dependency resolution playbooks in advance to give your engineers a list of things to check and procedures to follow. This mitigates the risk of brain fog caused by anxiety in a critical moment, which leads to the next trait.
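One lightweight way to document this is as data that pairs each critical dependency with a health check and a playbook reference. The dependency names, checks, and playbook paths below are hypothetical; the point is the shape, not the entries.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Dependency:
    name: str
    check: Callable[[], bool]   # returns True when the dependency is healthy
    playbook: str               # where engineers look when it is not

# Example entries -- names, checks, and paths are placeholders.
dependencies = [
    Dependency("primary-database", lambda: True, "playbooks/db-outage.md"),
    Dependency("payment-gateway", lambda: False, "playbooks/payments.md"),
]

def triage() -> List[str]:
    """Return the playbooks to open for every unhealthy dependency."""
    return [d.playbook for d in dependencies if not d.check()]

print(triage())  # -> ['playbooks/payments.md']
```

Keeping the checklist executable means it doubles as documentation and as the first step of incident response.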
You need proper incident management. A problem arises. Who owns that problem? Who is responsible for finding a solution? We are still describing the reactive stage of site reliability engineering (SRE) and DevOps. You must have a plan and an owner before an issue comes up. You should have outage playbooks with a similar list of things to check and procedures to follow. You should have a SEV (severity) program defining the severity of different types of problems and the procedures to follow and people to call for each.
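A SEV program can start as something as simple as a lookup from severity level to description and escalation path. The levels and contacts below are illustrative, not a standard; define your own to match your business.

```python
# Hypothetical SEV scheme -- levels and escalation paths are illustrative.
SEV_LEVELS = {
    1: {"description": "Full outage, broad customer impact",
        "page": "on-call + incident commander"},
    2: {"description": "Degraded service, partial customer impact",
        "page": "on-call engineer"},
    3: {"description": "Minor issue, no customer impact",
        "page": "ticket, next business day"},
}

def escalation_for(sev: int) -> str:
    """Look up who to page for a given severity level."""
    if sev not in SEV_LEVELS:
        raise ValueError(f"unknown SEV level: {sev}")
    return SEV_LEVELS[sev]["page"]

print(escalation_for(1))  # -> on-call + incident commander
```

The value is not the code but the agreement it encodes: when the pager goes off, nobody is deciding severity and ownership from scratch.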
Once an issue is resolved, you need to have everyone involved perform a root cause analysis, and, importantly, learn to do so without assigning blame. Complex systems break down. Instead of looking for a scapegoat, look for resilience. How can the issue be prevented, or at least mitigated? Finally, you need someone to own the resolution and its implementation in a timely manner. Some fixes and changes take longer than others, but having an owner keeps the process moving forward toward ever greater resilience.
Only now are you ready for your first chaos experiment. So far, everything has been about observing and reacting. However, having those in place affords you the opportunity to proactively test things, knowing that you have plans for dealing with the chaos you are about to inject.
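To make "first chaos experiment" concrete, here is a minimal, self-contained latency-injection sketch. This is an illustration of the idea, not a Gremlin feature: we inject delay into a simulated dependency and check the hypothesis that the caller degrades gracefully rather than hanging.

```python
import time

INJECTED_LATENCY_S = 0.3   # the "attack": artificial delay in the dependency
TIMEOUT_S = 0.1            # the caller's latency budget

def dependency_call() -> str:
    time.sleep(INJECTED_LATENCY_S)  # simulated slow dependency
    return "live-result"

def call_with_fallback() -> str:
    """Hypothesis: when the dependency is slow, we serve a fallback.
    (A real implementation would cancel the in-flight call; this sketch
    only detects the overrun after the fact, to keep the example short.)"""
    start = time.monotonic()
    result = dependency_call()
    if time.monotonic() - start > TIMEOUT_S:
        return "cached-fallback"    # degrade gracefully instead of hanging
    return result

print(call_with_fallback())  # -> cached-fallback
```

The experiment passes if the fallback is served; if the caller instead hangs or errors, you have found a real weakness under controlled conditions rather than during an outage.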
A few large, leading companies have reached this point and begun to implement Chaos Engineering. They are ready to run controlled, manual experiments on their infrastructure and are doing so.
Very few have thus far matured past this point into automating experiments across their infrastructure, actively attacking their systems to find weaknesses on a regular, scheduled basis.
Even fewer have implemented automated experimentation and testing in the CI/CD build pipeline. This is another opportunity to use Chaos Engineering to test aspects of the system you might not otherwise have the opportunity to examine.
A very small number have reached the level of maturity to run essential tests regularly in production, with regular GameDays and with chaos experiments as a part of every deploy.
The Chaos Engineering field is young. What we have described is the landscape today along with a high level road map for the future. A series of upcoming articles will present in much greater detail the growing consensus surrounding the various stages of operational maturity that we have outlined here only briefly.
Resilience does not happen by accident. No system is perfectly reliable, and some downtime is to be expected from even the best systems. Minimizing that downtime is the goal. To do so, we must be intentional. We must embrace the risk of testing our own systems, yes, in production, and learn to manage that risk effectively, because by doing so we also manage risk across the system. This is how we create resilient systems, and it is why we embrace Chaos Engineering. Are you ready?