A Primer on Automating Chaos

August 9th, 2017

Automation is a must when operating and scaling cloud-based systems. The more servers and services there are to manage, the harder it gets for a team to fulfill their operational duties without proper automation in place. Automation is a workforce multiplier that helps us to manage our ever-growing infrastructure, but it can do much more than that. According to The Practice of Cloud System Administration, it also achieves the following goals:

  • Improve accuracy by being less error-prone than humans
  • Increase repeatability by doing tasks in a consistent way
  • Improve reliability because it’s easier to measure and fix processes once automated
  • Save time for more important engineering work
  • Make processes faster with less room for mistakes
  • Enable more safeguards in all stages of a process
  • Empower users to do otherwise difficult or impossible tasks in a self-service manner
  • Reduce wait time for both users and system administrators, with fewer interruptions

Automated Chaos

Given these advantages, it’s not surprising that automation is one of the cornerstones of Chaos Engineering. A quick reminder: Chaos Engineering allows us to validate our assumptions and learn something new about our systems – hidden problems that could arise in production – by performing experiments on them. Automating experiments is one of the advanced Chaos Engineering principles that, when applied correctly, can further increase our confidence in distributed systems.

But don’t worry – that doesn’t mean you need to automate experiments from the very beginning, nor do you have to run them continuously to benefit from Chaos Engineering. Starting small is usually a good idea. Manual failure testing, which can be as simple as terminating a process with the kill command or shutting down a server with halt, is still a good way to learn the basics of fault injection while gradually establishing the right mindset.

In fact, manual testing is sometimes the only way to simulate more complex scenarios that would be too hard or too expensive to automate. (These larger-scale experiments are excellent candidates for Gameday team events, by the way.)
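To show how little is needed for a first manual test, here is a minimal Python sketch of the "terminate a process" case. It starts a throwaway child process and sends it SIGTERM, the signal the kill command sends by default. The child process and the helper name kill_and_observe are placeholders for illustration; in a real test you would target a process of the service under test and watch how its dependents react. The sketch assumes a POSIX system.

```python
import signal
import subprocess
import sys


def kill_and_observe(proc: subprocess.Popen, timeout: float = 5.0) -> int:
    """Send SIGTERM to the process and return its exit status."""
    proc.send_signal(signal.SIGTERM)
    return proc.wait(timeout=timeout)


# A stand-in "service": a child process that would otherwise sleep for a minute.
victim = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
status = kill_and_observe(victim)

# On POSIX, a negative status -N means the child was terminated by signal N.
print(f"child exited with status {status}")
```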

Most of the time, however, automation is both desirable and feasible. Applying it to Chaos Engineering can happen in many different ways with varying degrees of sophistication. Let’s look at an example.

From Homegrown to Enterprise

In order to simulate network latency to external services – a common real-world event – you come up with an elaborate set of tc commands, which you document in a wiki for later reuse. You soon realize, however, that copy-pasting commands and adapting them to the chaos experiment at hand is an error-prone process. You therefore decide to write a shell script that takes parameters (e.g. the service’s IP address and port), and add it to a Git repository. As it would be convenient to have the script readily available on all the servers where you want to induce latency, you invest some more time to deploy everything automatically using your favorite configuration management tool. Great! Now you’re all set for automated chaos experiments in production. Well, not really.

For one, shell scripts are ill-suited for experimenting on distributed systems where asynchronous communication is prevalent and safe error handling essential. (Never mind the fact that you probably shouldn’t check out a copy of your best-loved test scripts on production servers.) So you rewrite the existing functionality in a capable systems programming language like Rust or Go, where the build artifact is an easy-to-distribute binary. At some point it occurs to you that driving experiments from the command line is nice (“Let’s glue everything together with ssh! What can go wrong?”) but still leaves a lot to be desired in terms of reliability, interoperability, visibility, and security.

For this reason, you go out and find an existing open source tool that looks promising. It has a built-in scheduler and also an API that can be used to terminate instances programmatically. Life seems good – your infrastructure has become more resilient to a few specific server failures – until the requirements change (they always do). All of a sudden, microservices are a thing, and many teams switch to shipping their applications as Docker containers. That’s when you realize it’s not enough to merely impact the hosts the containers are running on. You also want to empower engineering teams to experiment on their application containers in isolation – ideally using the same automation features they’re already accustomed to. In other words, you’ve outgrown the open source solution.

A quick note: We at Gremlin are proud to announce first-class support for Docker containers! Our product allows you to test the resilience of your container infrastructure in a safe, secure, and simple way – no matter if you’re using Kubernetes, Nomad, or Amazon ECS. Our customers already benefit from this exciting new feature.

This example illustrates the typical progression happening in many companies adopting Chaos Engineering – and it usually doesn’t stop here. Generally speaking, the more committed an organization is to the practice, the more (automation) demands it will have.
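To make the early, homegrown stage of this progression more concrete, here is a rough sketch of a parameterized latency helper, written in Python rather than shell. It only builds and prints tc commands (a dry run), since applying them requires root on a Linux host. The function names, the interface eth0, and the address 10.0.0.5 are made up for illustration. The commands follow a commonly used prio/netem/u32 recipe and are an assumption about what such a script might contain, not the commands from any particular tool.

```python
import ipaddress
import shlex
from typing import List


def latency_commands(dev: str, ip: str, port: int, delay_ms: int) -> List[List[str]]:
    """Build tc commands that delay egress traffic to ip:port by delay_ms."""
    ipaddress.IPv4Address(ip)  # raises ValueError on a malformed address
    if not 0 < port < 65536:
        raise ValueError(f"invalid port: {port}")
    if delay_ms <= 0:
        raise ValueError(f"invalid delay: {delay_ms}")
    return [
        # Root prio qdisc with its default three bands (1:1, 1:2, 1:3).
        ["tc", "qdisc", "add", "dev", dev, "root", "handle", "1:", "prio"],
        # Attach netem to band 3 so only traffic in that band is delayed.
        # Caveat: the default priomap may place some other low-priority
        # traffic in this band as well.
        ["tc", "qdisc", "add", "dev", dev, "parent", "1:3", "handle", "30:",
         "netem", "delay", f"{delay_ms}ms"],
        # Steer packets for the target service into band 3.
        ["tc", "filter", "add", "dev", dev, "protocol", "ip", "parent", "1:0",
         "prio", "3", "u32",
         "match", "ip", "dst", f"{ip}/32",
         "match", "ip", "dport", str(port), "0xffff",
         "flowid", "1:3"],
    ]


def rollback_command(dev: str) -> List[str]:
    """Build the command that removes the root qdisc and everything below it."""
    return ["tc", "qdisc", "del", "dev", dev, "root"]


# Dry run: print the commands instead of executing them.
for cmd in latency_commands("eth0", "10.0.0.5", 443, delay_ms=200):
    print(shlex.join(cmd))
print(shlex.join(rollback_command("eth0")))
```

Validating the parameters up front and generating the rollback command alongside the attack address the copy-paste errors described above, but this sketch still has all the limitations of a script run by hand on each server.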

Advantages of Automated Fault Injection

Automation can come in many different forms and levels of sophistication – from homegrown tools to open source projects to enterprise-ready solutions like Gremlin. One thing is for sure: applying automation to Chaos Engineering the right way offers tremendous advantages over manual fault injection, for example:

  • Create reliable experiments that are more likely to produce similar outcomes when repeated.
  • Perform safety checks before, during, and after experiments. (Never confuse staging with production again, automatically roll back any impact once done, etc.)
  • Run many more experiments in less time, enabling “Chaos at scale”.
  • Schedule experiments to run periodically at fixed times, dates, or intervals (or even randomly).
  • Automatically evaluate test results, ideally as part of continuous integration.
  • Empower developers to run their own chaos experiments to improve the quality of the services they are responsible for.
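The safety checks and automatic evaluation from this list can be sketched as a small experiment runner. This is a minimal illustration under stated assumptions, not how Gremlin or any other tool implements it: the Experiment fields, run_experiment, and the notion of an is_healthy steady-state check are all hypothetical names. The runner refuses to start in an environment that is not explicitly allowed, checks health before and during the experiment, and always rolls back the fault.

```python
import time
from dataclasses import dataclass
from typing import Callable, Set


class ExperimentAborted(Exception):
    """Raised when a safety check fails before any fault is injected."""


@dataclass
class Experiment:
    name: str
    environment: str
    inject: Callable[[], None]      # starts the fault
    rollback: Callable[[], None]    # reverts the fault
    is_healthy: Callable[[], bool]  # steady-state check


def run_experiment(exp: Experiment, allowed_environments: Set[str],
                   checks: int = 3, interval_s: float = 1.0) -> bool:
    """Run one experiment; return True if the system stayed healthy."""
    # Before: never confuse staging with production.
    if exp.environment not in allowed_environments:
        raise ExperimentAborted(f"{exp.name}: {exp.environment} is not allowed")
    if not exp.is_healthy():
        raise ExperimentAborted(f"{exp.name}: system unhealthy before start")
    try:
        exp.inject()
        # During: stop at the first steady-state violation.
        for _ in range(checks):
            if not exp.is_healthy():
                return False
            time.sleep(interval_s)
        return True
    finally:
        # After: revert the impact, whatever happened above.
        exp.rollback()


# Usage with an in-memory stand-in for a real fault.
state = {"fault": False}
demo = Experiment(
    name="demo",
    environment="staging",
    inject=lambda: state.update(fault=True),
    rollback=lambda: state.update(fault=False),
    is_healthy=lambda: True,
)
passed = run_experiment(demo, allowed_environments={"staging"}, interval_s=0)
print("passed" if passed else "failed")
```

Because the runner returns a plain boolean, a continuous integration job could turn it into an exit code and fail the build when an experiment does not pass.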

Things to Keep in Mind

Be aware that automation alone is not a panacea. At the end of the day, it’s not only about solving the problem right, but also about solving (that is, automating) the right problem. You still need to think deeply about failure domains and design chaos experiments accordingly.

Try to minimize the blast radius of experiments to reduce the dangers of automation gone wrong. To that end, we put a lot of effort into Gremlin to safely revert the impact of our infrastructure attacks, no matter what. Safety is one of our key tenets, and we will cover it here in more depth in the future.
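One simple way to limit the blast radius in your own tooling is to cap how many hosts a single experiment may touch. The following sketch picks a random subset that never exceeds a given fraction of the fleet. The function pick_targets and the host names are hypothetical, and this is only one of many possible policies, not a description of how Gremlin selects targets.

```python
import math
import random
from typing import List, Optional, Sequence


def pick_targets(hosts: Sequence[str], max_fraction: float,
                 rng: Optional[random.Random] = None) -> List[str]:
    """Pick a random subset of hosts covering at most max_fraction of them.

    At least one host is returned for a non-empty fleet, so very small
    fleets still get an experiment.
    """
    if not 0 < max_fraction <= 1:
        raise ValueError("max_fraction must be in (0, 1]")
    if not hosts:
        return []
    rng = rng or random.Random()
    limit = max(1, math.floor(len(hosts) * max_fraction))
    return rng.sample(list(hosts), limit)


fleet = [f"web-{i}" for i in range(10)]
targets = pick_targets(fleet, max_fraction=0.2)
print(f"attacking {len(targets)} of {len(fleet)} hosts: {targets}")
```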

Speaking of dangers, everyone who might be affected needs to be aware of ongoing experiments: talk to stakeholders, set up dashboards, send out email or chat notifications. Visibility is a crucial aspect of Chaos Engineering, and even more so with automation always at work in the background.

Confidence to Move Forward

All systems drift into failure – it’s inevitable. Luckily, practices like automated fault injection can counteract that process, giving us the confidence to move forward in a calm way, rather than just hoping for the best.

However, as systems evolve, even that confidence will diminish if we don’t repeat experiments regularly. The solution – to run experiments as part of continuous integration/deployment – will be the topic of another blog post. Stay tuned!

We’ve built Gremlin with automation in mind: start chaos experiments from the command line, trigger them programmatically via our API, or schedule everything in a convenient web interface. Sound interesting? Contact sales@gremlininc.com to get started today.