The Chaos Engineering Slack recently hit 1,000 members. Engineers around the world are hungry to learn more about this emerging discipline. We at Gremlin have known this for a while now; that’s why we created the Gremlin Community space, where we share tutorials and guides to teach you about Chaos Engineering.
Conferences and meetups are an even better way to learn, but not everyone has the time or opportunity to attend them. We could simply share presentation slides here on our blog—and we do that sometimes—but all too often, if you weren’t there, it’s hard to get much from slides alone. So we want to try something new: the Presentation As Blog Post. We hope this format can bridge the gap between live presentations and their bare-bones slides.
This post, the second of its kind in the Gremlin Community space (here’s the first), summarizes a brief talk I recently gave at O’Reilly’s Velocity Conference 2018. If you weren’t there, read on!
Chaos Engineering is intentionally injecting failure into a system to proactively identify and fix problems before they cause outages. It’s an emerging discipline, but its roots are decades old. So why is it now becoming the go-to approach for building resilient systems? Why does the current state of distributed architectures require chaos as the best solution for system failure?
Kolton Andrus explores how Chaos Engineering has evolved into the discipline now practiced at leading cloud companies, how to begin your journey toward resilient systems, and how to make those pagers quit buzzing at 3:00am.
First, I want to give you my definition of Chaos Engineering. In my opinion, it is the practice of running thoughtful, planned experiments designed to teach us about our systems and to find their weak points. This has gone by a few different names in the past; when I was growing up, it was called Failure Testing or Disaster Recovery Testing. But Chaos Engineering has become the term most strongly associated with the discipline today.
The analogy I like to draw is that of the vaccine: we inject something harmful in order to build up immunity. This ties the idea back to something we’re already comfortable with. If you had told someone 200 years ago that you were going to purposefully inject them with a disease, they might have figured you were crazy, but today it’s a common and accepted practice. This is how we feel about Chaos Engineering.
Chaos Engineering is an opportunity to be proactive. It’s the antithesis of most incident response I’ve been a part of throughout my career, which has been very reactive. In reactive practices, you ask the 5 Whys of what went wrong after the incident has already occurred, after it has already affected customers or your bottom line. You do the deep dive afterward, not before. Chaos Engineering lets us get in front of the problem. It gets us thinking about what can go wrong before things blow up in our faces.
Once you’ve tested a hypothesis about how your system will respond under duress (either confirming that it behaves as expected, or learning something new and implementing a fix), you then want to automate the experiment. This helps prevent what Sidney Dekker calls the “drift into failure,” whereby systems, due to the contentions within them, naturally tend to slide back into states where failure can recur. By automating experiments, we can be forewarned when we’ve made changes that are going to result in trouble.
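To make that a little more concrete, here is a minimal Python sketch of what an automated experiment could look like. It is an illustration under assumptions, not Gremlin’s API or any real tool: `Experiment`, `run_experiment`, `ToyService`, and `replica_loss_experiment` are hypothetical names, and the “system” is an in-memory stand-in. The shape is what matters: check the steady state, inject the failure, check again, and always roll back.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    """A planned experiment: a hypothesis plus a failure to inject.

    All names here are hypothetical and exist only for illustration.
    """
    name: str
    steady_state: Callable[[], bool]  # hypothesis: True while the system is healthy
    inject: Callable[[], None]        # the failure we introduce on purpose
    rollback: Callable[[], None]      # how we undo the failure


def run_experiment(exp: Experiment) -> str:
    """Run one experiment and report 'skipped', 'passed', or 'failed'."""
    # Never inject failure into a system that is already unhealthy.
    if not exp.steady_state():
        return "skipped"
    exp.inject()
    try:
        # Does the hypothesis still hold while the failure is active?
        held = exp.steady_state()
    finally:
        # Always undo the failure, even if the check itself raises.
        exp.rollback()
    return "passed" if held else "failed"


class ToyService:
    """In-memory stand-in for a replicated service (an assumption, not a real system)."""

    def __init__(self, replicas: int) -> None:
        self.healthy = replicas

    def kill_one(self) -> None:
        self.healthy = max(0, self.healthy - 1)

    def restore_one(self) -> None:
        self.healthy += 1

    def is_serving(self) -> bool:
        return self.healthy >= 1


def replica_loss_experiment(svc: ToyService) -> Experiment:
    """Hypothesis: the service keeps serving after losing one replica."""
    return Experiment(
        name="survive loss of one replica",
        steady_state=svc.is_serving,
        inject=svc.kill_one,
        rollback=svc.restore_one,
    )


if __name__ == "__main__":
    for replicas in (2, 1):
        svc = ToyService(replicas)
        print(replicas, "replica(s):", run_experiment(replica_loss_experiment(svc)))
```

With two replicas the experiment passes; with one it fails. Run on a schedule or in a deployment pipeline, a check like this is what warns you when someone quietly scales a service down to a single replica.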
Okay, so I’ve talked a bit about what Chaos Engineering is. How about what it isn’t? It’s not about being a cowboy or cowgirl, shooting servers in production just to see what happens, being secretive and not telling anyone what you are doing—in short, it’s not about being reckless.
I think when many companies first hear about Chaos Engineering, they fear recklessness. And rightfully so—giving someone with production access the green light to break things randomly is certainly worth questioning. That’s why it’s important to consider the etymology of the term: yes, there is chaos, and yes, we are saying that you need to actually break things and inject harm. But there is also engineering, which means we are being disciplined and thoughtful. We are approaching it from a history of designing systems in an intelligent way. It’s the healthy marriage of these two concepts that’s really driving the traction of companies adopting Chaos Engineering.
Thinking about the terminology a little more, I think Chaos Monkey, for all of the great things it’s done, may have caused some confusion that we are now working to undo. Back to the point about cowboys and cowgirls, the discipline has developed over the past decade to a point where it’s really no longer about randomly breaking things. Instead of the term being Chaos Engineering, perhaps the two words should be reversed—what we really want to do is engineer chaos. Food for thought!