To wield chaos tools responsibly, your organization needs a trusting, collaborative culture.
Service A is flapping again. That’s your service, so you answer the page. As you suspected, the problem lies not with your code, but with one of its dependencies: Service B. It’s feeding you malformed data. Service B has been pretty flaky lately, and you’re tired of getting paged about it. “Doesn’t management see that Service B is the real problem?” you wonder.
Tom owns Service B. He knows it’s been flaky lately; he gets paged too. He feels bad, but it’s really not his fault. Services C and D keep letting him down. “Doesn’t management see that C and D are the real problem?” he wonders.
How do teams become so isolated from each other? It starts small. After an outage, someone points a finger. If no one repudiates that behavior, others copy it, and after future outages, teams reflexively raise their defenses when they should be coming together.
In a culture fraught with finger-pointing, Chaos Engineering can help you turn the tide. But chaos tools, like any tool, are not enough on their own. It’s all in how you use them. Commit to using them with Tom, not apart from him, and not only will you both get more sleep, others will notice—and follow suit.
Early on in the DevOps movement, many engineers thought DevOps meant cool tools. If you used this configuration management tool and that continuous integration tool, you were doing DevOps. By now, most teams know DevOps Engineering is actually a practice—an ethos, even. The tools enable the practice, but only if the practitioners work together towards a common goal. Tools plus Culture equals Practice.
It’s the same with Chaos Engineering, and the stakes for ignoring culture, thereby mistaking tools for practice, are just as high.
Suppose you use a chaos tool to break Service B to show Tom (or his manager) that it’s flaky. That’s Chaos Engineering, right? And that’s better than simply talking to Tom about his service—better to show than tell—right? Wrong on both counts.
It may be easier to break Service B than to go straight to Tom; it’s never fun to call someone out (unless you’re a jerk). But this doesn’t foster a culture of trust. And if it causes a production outage, it’s unethical, says Nora Jones, co-author of Chaos Engineering (O’Reilly):
Obviously the hypothesis with Chaos Engineering should always be: I don’t think this will cause customer pain.
Alright, then can’t you break Service B in the development environment to prove your point? “Sure,” says Nora, “but what is going on inside your team such that you can’t get your point across another way?” In other words, maybe your team culture is dysfunctional.
You cannot use chaos tools effectively in the midst of cultural dysfunction, and if you try, you may end up fueling the dysfunction. So before you inject failure anywhere, you and your teammates need to get on the same page.
Start by going directly to Tom. Plan a time to get together and whiteboard how your services interact. Encourage him to invite anyone else who should be in the room. (You may be surprised when the owners of Services C and D show up.) With each such meeting, your whole system becomes a little less siloed.
It’s not that the services will be less siloed; you probably won’t decide to condense your distributed system into a monolith. It’s that the teams will be less siloed, which is more profoundly impactful than you might think.
That’s because “the way your team interacts with each other and with your software—and prioritizes things—is a part of your system,” says Nora Jones. This is not metaphorical. Your team’s behavior and planning (or lack thereof) translate into action (or inaction) that has a real-world impact on your services. Your system is more than bits and chips and fiber-optic cable—it’s also all the care you give it. Without quality care from close collaborators, your system will fall apart pretty quickly.
Once you, Tom, and the others understand what is failing, it’s time to figure out how—it’s time to plan a GameDay. Start in development or staging, design chaos experiments together, and use your favorite chaos tools to run the experiments and find failure. (You already know what the first failure is; it’s why you went to Tom in the first place.) Go fix the failures, reconvene in a week or two, and run another GameDay to find new failures. Eventually, automate your experiments to ensure already-fixed failures remain fixed.
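An automated experiment can be as simple as a regression test that re-injects a known failure and asserts the system degrades gracefully. Here is a minimal sketch in Python; the helpers (`inject_malformed_response`, `call_service_a`) are hypothetical stand-ins for your real chaos tooling and service clients:

```python
# Minimal sketch of an automated chaos experiment. All names are
# hypothetical; substitute your actual chaos tooling and clients.

def inject_malformed_response(payload: dict) -> dict:
    """Simulate Service B returning malformed data (missing 'items' key)."""
    broken = dict(payload)
    broken.pop("items", None)
    return broken

def call_service_a(service_b_response: dict) -> dict:
    """Service A's handler: it should degrade gracefully, never crash."""
    items = service_b_response.get("items", [])  # tolerate the missing field
    return {"status": "ok" if items else "degraded", "count": len(items)}

def run_experiment() -> bool:
    """Hypothesis: malformed data from B causes no customer-visible error."""
    healthy = call_service_a({"items": [1, 2, 3]})
    degraded = call_service_a(inject_malformed_response({"items": [1, 2, 3]}))
    return healthy["status"] == "ok" and degraded["status"] == "degraded"

if __name__ == "__main__":
    assert run_experiment(), "regression: Service A no longer tolerates malformed data from B"
    print("experiment passed")
```

Run this in CI and the failure you and Tom fixed together stays fixed, because reintroducing it breaks the build.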
When everyone feels ready, start running chaos experiments in production. (Because not even the most detailed staging environment can mimic production.) But first make sure everyone has full access to all chaos and monitoring tools, so anyone can see what experiments are running and halt them at any time. At no point should anyone run risky experiments without warning others. If someone does break that rule—and someday, someone will—everyone needs access to the kill switch.
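A kill switch need not be elaborate: at its core it is a shared flag that every experiment checks between injection steps. A sketch under that assumption (in practice the flag would live in a feature-flag service or shared config store, not a local `threading.Event`):

```python
import threading

# Hypothetical shared kill switch. In production this would be a feature
# flag or a key in shared config storage that any engineer can flip.
HALT = threading.Event()

def run_experiment_steps(steps):
    """Run injection steps, polling the kill switch before each one."""
    completed = []
    for step in steps:
        if HALT.is_set():
            break  # someone hit the kill switch; stop injecting immediately
        completed.append(step())
    return completed

# Usage: an operator flips the switch mid-experiment, so the final
# (and riskiest) step never runs.
steps = [
    lambda: "inject-latency",
    lambda: HALT.set() or "inject-errors",  # operator halts here
    lambda: "kill-instance",
]
print(run_experiment_steps(steps))  # → ['inject-latency', 'inject-errors']
```

The design point is social as much as technical: because the flag is shared, stopping an experiment requires no permission and no pager escalation.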
This is the practice of Chaos Engineering. Yes, each service team may curate its own documentation and architecture diagrams from the comfort of its own silo, yet these documents—as current as anyone can keep them—are no substitute for collaborative practice.
What differentiates Chaos Engineering culture from DevOps culture? Not much—they share a lot. Safety First. Test Everything. Communicate Openly and Often. Share Responsibility. Be Ready to Roll Back. And so on.
Most organizations practice these values when they—and their systems—are small. As they grow, fewer and fewer engineers know how the whole system fits together. Individual teams practice these values within themselves, but not across team boundaries. Unsurprisingly, chaos ensues. Chaos tools are a vaccine, but only in the hands of engineers devoted to reaching across team boundaries.
As you reach across team boundaries, service boundaries start to seem less real. True, your services may not live in the same country, server, or code repository, but no matter how far and wide they’re spread, they still make up one interdependent system with shared goals. The practice of Chaos Engineering helps everyone remember that—and deliver on those goals together.
If you’re doing it right, you’ll stop seeing Service B’s failure as Tom’s failure. In a perfect world, you’ll stop seeing Service B as Tom’s service (or Service A as yours). You can still ask Why is Service B failing? But you’ll start to ask the important follow-up questions too. Why can’t Service A handle Service B’s failure? How should it handle Service B’s failure? Is Service A propagating the failure to its own dependent services? When you reach for chaos tools but neglect the culture, you tend never to ask these critical questions. Commit to the culture—promote and defend it—and you will answer them.
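Those follow-up questions have concrete answers in code. For instance, “How should Service A handle Service B’s failure?” often means: validate what B sends and fall back to the last known-good value rather than propagating the failure downstream. A sketch, with hypothetical names throughout:

```python
# Sketch: Service A tolerating malformed data from Service B by serving
# the last validated value instead of propagating the failure.

_last_good = {"price": 100}  # last validated response from Service B

def validate(response: dict) -> bool:
    """Reject malformed payloads instead of crashing on them."""
    return isinstance(response.get("price"), int) and response["price"] >= 0

def handle_b_response(response: dict) -> dict:
    """Use fresh data when it's valid; otherwise serve the cached fallback."""
    global _last_good
    if validate(response):
        _last_good = response
        return response
    return _last_good  # degrade gracefully; don't pass the failure along

print(handle_b_response({"price": 120}))   # fresh, valid → {'price': 120}
print(handle_b_response({"price": None}))  # malformed → falls back to {'price': 120}
```

Crucially, this fix lives in Service A. Once you stop asking “whose fault is it?” and start asking “where can the system absorb this?”, the answer is often in your own code.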
Ready to commit, but haven’t chosen a chaos tool? Read our guide, Chaos Engineering Tools: Build vs. Buy.