How to train your engineers in Chaos Engineering
Adopting a Chaos Engineering tool is a great step towards improving reliability, but a tool is only useful if you know how to use it effectively.
Introducing a new tool or practice to engineering teams can be difficult. Reliability is an important part of software development, but is often overshadowed by activities such as new feature development and bug fixing. When we train our engineers on how to do Chaos Engineering, we’re not just showing them how to use a tool, but asking them to change some of their work habits. This requires a bit more effort than walking through a set of actions.
In this white paper, we’ll explain how to train your engineering teams on Chaos Engineering using Gremlin. We’ll walk through introducing and explaining the concept of Chaos Engineering, demonstrating how to create and run a chaos experiment, and promoting the practice across your engineering organization.
This white paper assumes that you have already adopted Gremlin. If you are trying to make a business case to your organization to adopt Gremlin, please read our white paper on how to convince your organization to adopt Chaos Engineering.
Introducing the concept of Chaos Engineering
First, let’s explain what Chaos Engineering is exactly. Formally, it’s the science of performing intentional experimentation on a system by injecting precise and measured amounts of harm to observe how the system responds for the purpose of improving the system’s resilience.
In other words, Chaos Engineering is the deliberate and controlled use of failures to test how a system responds under various conditions. This is often used to simulate expected real-world conditions such as outages, high resource load, and degraded performance. By observing how our systems operate under these conditions, we can infer how they’ll operate under similar conditions after being deployed to production.
The ultimate goal of Chaos Engineering isn’t to cause system failures, but to uncover causes of failure in a safe, controlled, and methodical way.
How Chaos Engineering works
Chaos Engineering involves injecting harm into applications or systems. Rather than inject harm haphazardly, we structure injections as chaos experiments, which consist of:
- A hypothesis explaining what the experiment is testing for and what you expect the outcome to be.
- An attack, which is the actual process of injecting harm. Attacks can take many forms, such as consuming CPU capacity or dropping network traffic between two resources. Attacks have:
  - A blast radius, which is the subset of systems impacted by the attack. This is often measured by customer impact (e.g. 10% of customers), but can also be measured in discrete infrastructure components such as the number of hosts or applications.
  - A magnitude, which is the scale or intensity of the attack. For example, an attack consuming 90% of CPU capacity on a host has a much higher magnitude than one consuming 10%.
- Abort conditions, which are predefined conditions indicating when we should stop an experiment in order to avoid causing further harm.
- A conclusion, which summarizes the results of the attack and what it means for our hypothesis.
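To make the structure concrete, here is a minimal sketch of how a team might write an experiment down before running it. The class and field names are hypothetical and chosen only for illustration; they are not part of Gremlin's API.

```python
from dataclasses import dataclass, field


@dataclass
class Attack:
    kind: str                 # illustrative label, e.g. "cpu" or "blackhole"
    blast_radius: list[str]   # the systems the attack is allowed to touch
    magnitude: float          # intensity, expressed here as a fraction 0.0-1.0


@dataclass
class ChaosExperiment:
    hypothesis: str
    attack: Attack
    abort_conditions: list[str] = field(default_factory=list)
    conclusion: str = ""      # filled in after the attack has run

    def is_complete(self) -> bool:
        """An experiment is only finished once a conclusion is recorded."""
        return bool(self.conclusion.strip())


# Hypothetical first experiment: one host, low magnitude.
experiment = ChaosExperiment(
    hypothesis="The host stays responsive while 10% of CPU is consumed.",
    attack=Attack(kind="cpu", blast_radius=["host-1"], magnitude=0.10),
    abort_conditions=["error rate above 5%", "host stops responding"],
)
print(experiment.is_complete())  # False until a conclusion is written down
```

Writing the hypothesis and abort conditions down before the attack starts keeps the team honest about what they expected to happen.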
For your first chaos experiments, reduce your blast radius and magnitude as much as possible. This limits the amount of complexity that your engineers have to deal with when analyzing the results of the attack and addressing any failures. As your systems become more resilient, gradually increase the scale of your attacks. This progressive approach gives your engineers time to become comfortable with running chaos experiments while improving reliability in a safe and controlled way.
The “why” of Chaos Engineering
Chaos Engineering can be a difficult concept to introduce to engineers for multiple reasons:
- It reveals faults in systems that engineers build and maintain.
- It requires engineers to perform potentially harmful actions in their environments.
- It creates additional work for engineers, especially when high-priority defects arise.
It’s important that we address these concerns upfront. To do that, we need to shift the focus of our training from “how” to “why.”
Chaos Engineering is similar to preventative car maintenance. Just because your car drives fine today doesn’t mean you should skip your next scheduled oil change. There’s an upfront investment in time, money, and labor, but the result is a significantly reduced chance of your car failing in the future. Likewise, just because your systems are working fine today doesn’t mean problems won’t arise over time.
The reason behind this is simple: like cars, hardware and software aren’t perfect. Defects can creep in from all directions, causing unpredictable and sometimes catastrophic problems. Production outages are expensive to fix, so the sooner we can find and address these issues, the happier we and our customers will be.
Responding to pushback
If you’re having trouble convincing your engineers that Chaos Engineering is worth investing their time in, present it in ways that will benefit them directly. For example, it can:
- Reduce the number of high-severity incidents they have to respond to.
- Reduce the number of bug reports from customers.
- Cut time spent on bug fixing and leave more time for feature development.
- Improve their understanding about how their systems work.
- Boost their abilities as engineers by teaching them a valuable skill.
Of course, we still want to emphasize the company-oriented objectives that Chaos Engineering helps achieve, such as preventing expensive outages and increasing customer trust. However, engineers are the ones who will be running experiments and implementing fixes, so the more we can speak to their pain points, the better.
You might also get pushback from engineering leads, which is to be expected when introducing a new initiative. You might hear arguments such as:
- We don’t have the time or budget to focus on reliability.
- Our systems are reliable enough.
- We don’t want to add any more uncertainty to our systems.
- Chaos Engineering is too complicated.
Refer back to the car analogy. Replacing a set of tires is expensive, time-consuming, and requires expertise, but it’s much cheaper and safer than losing a tire on the highway. Ask your engineers and engineering leads to think about all the time they spend responding to incidents and handling high-priority bug reports. Chaos Engineering takes just a few hours per week and, when integrated with an automated testing framework or a continuous integration and continuous delivery (CI/CD) pipeline, can significantly reduce the time spent managing incidents in both the short term and the long term.
“Chaos Engineering is a choice between a ten-second controlled failure and a multi-hour uncontrolled failure.”
Director of Engineering at Stitch FX
If your engineers argue that they don’t have time, work with their leads to reprioritize their goals. Shift some time away from new development and put it towards experimentation and fixing failure modes. You don’t need to stop development completely, but the more time engineers are able to allocate towards learning Chaos Engineering, the sooner they can start applying it.
Lastly, if your engineers or leads argue that their systems are already reliable, or that they can’t tolerate any additional uncertainty, have them run a simple chaos experiment such as consuming CPU or RAM. If something unexpected happens, such as an application crashing, they’ve just demonstrated the benefits of Chaos Engineering and now have a clear motivation for using it. As for complexity, Gremlin makes it easy to run chaos experiments. Installing the Gremlin daemon takes just a few minutes, and engineers can quickly design and run experiments using the Gremlin web app, REST API, or CLI.
Creating additional incentives
If your engineers still aren’t receptive, consider methods of positive reinforcement. For example, one approach might be to turn reliability into a competition between teams. Measure each team’s reliability metrics (such as uptime or error rate) over the course of a year or quarter, then compare each team’s metrics at the end. The team with the highest overall reliability wins a prize, whether it’s a bonus, extra vacation hours, a gift card, or some other incentive.
Another example is tying error budgets to end-of-year bonuses. An error budget is the maximum amount of time that a system is allowed to be down over a given period, and is typically based on your service level agreements (SLAs). Each team starts the period with an allocated budget, and any downtime causes them to “spend” part of it. Tying bonuses to error budgets encourages engineers to preserve as much of their budget as possible, which in turn incentivizes them to use Chaos Engineering.
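As a rough illustration of the arithmetic, the sketch below converts an availability target into an error budget and subtracts recorded downtime. The 99.9% target, the 30-day period, and the outage durations are assumed values for the example, not recommendations.

```python
def error_budget_minutes(availability_target_percent: float, period_days: int) -> float:
    """Allowed downtime, in minutes, for an availability target over a period."""
    if not 0 < availability_target_percent <= 100:
        raise ValueError("availability_target_percent must be in (0, 100]")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_target_percent / 100)


def remaining_budget_minutes(budget: float, outages_minutes: list[float]) -> float:
    """Budget left after subtracting each outage (negative means overspent)."""
    return budget - sum(outages_minutes)


# Assumed example: a 99.9% target over 30 days, with two outages so far.
budget = error_budget_minutes(99.9, 30)
remaining = remaining_budget_minutes(budget, [12.0, 7.5])
print(f"budget={budget:.1f} min, remaining={remaining:.1f} min")
# budget=43.2 min, remaining=23.7 min
```

A team that ends the period with budget to spare has stayed within its target; a negative remainder means the target was missed.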
Now that we’ve established the “why,” let’s look into the “how.” We’ll start by setting objectives for the training, then walk through creating and executing a chaos experiment using Gremlin.
Training your teams on using Gremlin
In this chapter, we’ll explain how to train a team on running chaos experiments using Gremlin. The goal of this training is twofold: teach the practice of Chaos Engineering, and teach how to use Gremlin to run chaos experiments.
Rather than train your entire engineering department all at once, we recommend starting with a single team of experienced engineers who own a specific application or service. Experienced engineers have in-depth knowledge of their services, making them more likely to understand the impact of a chaos experiment and how to address any failures that occur. This makes them more likely to not only grasp the ideas behind Chaos Engineering, but also be able to train other engineers.
With that in mind, this training will include the following steps:
- Set reliability objectives.
- Bring the team together.
- Create a training environment.
- Add observability to your systems.
- Run a chaos experiment.
- Use Recommended Scenarios to validate resilience to common failure modes.
- Analyze the results of your experiment.
- Implement fixes and re-test.
- Move Chaos Engineering into production.
Remember that this is a learning experience. The focus isn’t on system reliability or how quickly your engineers can fix problems, but on helping engineers learn Chaos Engineering, the Gremlin platform, and maybe something new about their systems.
The examples shown in this white paper are based on an open source application called Online Boutique. Online Boutique is a microservice-based e-commerce website consisting of twelve services running on Kubernetes. Each service performs a specific function, such as displaying web pages, tracking inventory, handling payments, and processing orders. This is a great example for teams learning Gremlin because of how easily we can target individual services and observe the effects of an attack.
Many of our examples focus on the frontend service, which provides the customer-facing website. The frontend is the point of entry to our shop; without it, users can’t browse products or place orders. Using Gremlin, we’ll test the reliability of our frontend service to ensure it can withstand conditions like network outages, crashes, and dependency failures.
Online Boutique is an open source project that can be deployed to any Kubernetes cluster. You can download it from https://github.com/GoogleCloudPlatform/microservices-demo.
Setting reliability objectives
Before we start training, let’s set our objectives. We want to accomplish three things:
- Teach the principles and application of Chaos Engineering.
- Train our team on using Gremlin to run chaos experiments.
- Integrate Chaos Engineering into the team’s everyday workflow.
Remember, the focus of this training is on teaching the practice of Chaos Engineering, not on making immediate reliability improvements. If a system fails, it’s not a reason to blame anyone, but rather a proof of concept. Ultimately, we want the team to reach a point where they’re motivated to run chaos experiments on their own, and have the tools and knowledge to do so.
Bringing the team together
Communication and collaboration are important when introducing Chaos Engineering to a team, especially once we start running experiments. Choose a time when the entire team can convene (physically or virtually) in a real-time meeting space, such as a video call or chat room. This helps the team share ideas, troubleshoot, and analyze outcomes faster, which in turn fosters learning.
With the team together, explain the concept of Chaos Engineering. Remember to focus on how it can benefit the team directly. Draw on any pain points that the team recently experienced, such as high-severity incidents, customer bug reports, on-call incidents, and time spent troubleshooting after hours. We’re not doing this to highlight the team’s failures, but to show how we can improve their lives with this new practice.
For example, imagine we had a recent incident where our frontend service lost connection to all other services. The service was still running, but when customers visited our website, the home page would appear to load indefinitely. We fixed the problem temporarily by restarting the service, but we still don’t know why it happened in the first place. We’ll focus this training around recreating this scenario.
Recreating past incidents like this is a great way to show your teams how to apply Chaos Engineering, help them practice their incident response procedures, and test the effectiveness of their fixes.
This training exercise is similar to a GameDay, which is a dedicated time for teams to meet and run chaos experiments on their systems. Where GameDays focus on improving reliability, this training focuses on helping teams learn the practice of Chaos Engineering. After this training, schedule a GameDay with your team and encourage them to make it a regular practice. To learn more, read our tutorial on how to run a GameDay. For geographically distributed teams, see how to run a remote GameDay.
Creating a training environment
For training purposes, have your engineering teams create standalone testing environments for running chaos experiments. Chaos Engineering is most effective when done on production systems, but while your engineers are learning, a test environment gives them more freedom to experiment without having to worry about accidentally causing an outage.
With Online Boutique, we can spin up a fresh Kubernetes cluster and deploy the shop using its manifest file. We can use Gremlin to run attacks directly on Kubernetes resources (Pods, DaemonSets, Deployments, etc.), but we first need to install the Gremlin Kubernetes agent.
Here is our test cluster as shown in the Gremlin web app. Note that the frontend service is selected and highlighted as part of our blast radius. This shows us exactly which resources are being targeted for attack:
Adding observability to your systems
To understand the real impact that a chaos experiment has on our systems, we need to be able to observe their internal and external states.
Some experiments—such as shutting down a server or blocking network traffic to a web server—have a clear and obvious impact, but others may be more subtle. For example, adding network latency to one service could reduce request throughput on other services, causing a cluster-wide decrease in throughput and increase in CPU or RAM utilization. Collecting metrics such as resource consumption, request throughput, latency, errors, and availability shows us how our systems respond under stressful conditions, which can tell us where to focus our efforts when developing fixes.
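The link between added latency and lower throughput can be estimated with Little's Law: if a caller keeps a fixed number of requests in flight, its throughput is bounded by that number divided by the per-request latency. The sketch below assumes a hypothetical caller with 50 concurrent requests and made-up latency figures.

```python
def max_throughput(concurrent_requests: int, latency_seconds: float) -> float:
    """Upper bound on requests per second for a fixed number of in-flight requests."""
    if concurrent_requests <= 0 or latency_seconds <= 0:
        raise ValueError("both arguments must be positive")
    return concurrent_requests / latency_seconds


# Assumed numbers: 50 in-flight requests, 100 ms baseline latency,
# then a latency attack that adds 250 ms to every request.
baseline_rps = max_throughput(50, 0.100)
degraded_rps = max_throughput(50, 0.100 + 0.250)
print(f"baseline ~{baseline_rps:.0f} req/s, during attack ~{degraded_rps:.0f} req/s")
```

Under these assumptions, a few hundred milliseconds of extra latency cuts the caller's ceiling by more than two thirds, which is why throughput on dependent services is worth watching during a latency attack.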
Before running an experiment, have the team determine which metrics are the most relevant to their services, and have them actively monitor these metrics over the course of the experiment. Consider creating dashboards so that team members can view the status of their systems in real-time. This level of visibility can highlight failure modes that engineers wouldn’t expect to find, including those affecting other systems. If this is a new environment, leave some time for your monitoring solution to collect baseline data so that the team can see the difference between normal operating conditions and conditions caused by the experiment.
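One simple way to compare an observation against baseline data is to check how many standard deviations it sits from the baseline mean. This is only a sketch: the three-sigma threshold and the sample latencies are assumptions, and a real monitoring tool will likely offer more robust anomaly detection.

```python
from statistics import mean, stdev


def deviates_from_baseline(baseline: list[float], observed: float,
                           sigmas: float = 3.0) -> bool:
    """True when `observed` is more than `sigmas` standard deviations
    away from the mean of the baseline samples."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline samples")
    return abs(observed - mean(baseline)) > sigmas * stdev(baseline)


# Hypothetical latency samples (ms) collected before the experiment.
baseline_latency_ms = [102.0, 98.0, 101.0, 99.0, 100.0]
print(deviates_from_baseline(baseline_latency_ms, 104.0))  # False: normal noise
print(deviates_from_baseline(baseline_latency_ms, 180.0))  # True: clear deviation
```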
Gremlin can automatically collect and display metrics for CPU and shutdown attacks. When running a CPU attack, Gremlin charts CPU usage before, during, and shortly after the attack. This shows you the exact moment the attack starts, and how CPU usage changes relative to your baseline levels. You can also link to third-party monitoring tools after an attack has completed.
If you have an existing monitoring and alerting system in place, this is a great time to consider testing your alerts. For example, if you have an alert set to fire when RAM usage exceeds 80%, create an experiment that uses a memory attack to push RAM usage above 80% across all of your nodes and verify that your engineers receive a timely notification. This demonstrates another use case of Chaos Engineering: making sure your monitoring setup is ready for production.
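When planning an alert test like this, it helps to work out how much additional memory must be consumed for usage to actually cross the threshold, since nodes already use some RAM at rest. The helper below is a hypothetical planning aid with assumed numbers; it does not call Gremlin.

```python
def memory_to_allocate_gb(total_gb: float, used_gb: float,
                          alert_threshold_percent: float,
                          margin_percent: float = 2.0) -> float:
    """Extra memory (GB) needed to push usage just past an alert threshold.

    `margin_percent` overshoots the threshold slightly so the alert
    condition ("usage exceeds X%") is actually met.
    """
    target_percent = min(alert_threshold_percent + margin_percent, 100.0)
    target_gb = total_gb * target_percent / 100
    return max(target_gb - used_gb, 0.0)


# Assumed node: 16 GB total, 6 GB already in use, alert at 80%.
extra = memory_to_allocate_gb(16.0, 6.0, 80.0)
print(f"allocate about {extra:.2f} GB to cross the threshold")
```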
Running a chaos experiment
Now we get into the real meat of the training: running an experiment.
First, let’s design an experiment. Recall that an experiment has four key components: a hypothesis, an attack, abort conditions, and a conclusion. Earlier, we explained how we had an outage where our frontend service lost communication with our backend services. After some more troubleshooting, our engineers made a fix where instead of timing out, the frontend would load the page with a user-friendly error if the backend was unavailable. To validate this, we’ll create a chaos experiment that reproduces the outage, then reload the page to see if the fix works as expected.
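For readers unfamiliar with this kind of fix, here is a minimal sketch of the general pattern: call the backend with a timeout and fall back to a friendly message if it does not answer in time. This is not Online Boutique's actual code; the function names, the message, and the timeout value are all assumptions for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

FALLBACK_PAGE = "Sorry, the product catalog is temporarily unavailable."


def render_home_page(fetch_catalog, timeout_seconds: float = 0.5) -> str:
    """Render the page, or a fallback if the backend is slow or unreachable."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fetch_catalog)
    try:
        products = future.result(timeout=timeout_seconds)
        return "Products: " + ", ".join(products)
    except (FutureTimeout, ConnectionError):
        return FALLBACK_PAGE
    finally:
        # Do not block the response while the slow call finishes.
        pool.shutdown(wait=False)


def healthy_backend() -> list[str]:
    return ["mug", "watch"]


def slow_backend() -> list[str]:
    time.sleep(0.5)  # stands in for a backend that has stopped answering
    return ["mug", "watch"]


print(render_home_page(healthy_backend))                    # Products: mug, watch
print(render_home_page(slow_backend, timeout_seconds=0.1))  # the fallback message
```

The point of the chaos experiment is to check whether a safeguard like this really triggers under the failure it was written for.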
Our hypothesis is this: if we block network traffic between our frontend service and our backend services, our website will load with an error message. We also won’t see features such as the product catalog or advertisements. To test this, we’ll use a blackhole attack to drop all of the frontend’s network traffic except web traffic to and from customers (that is, everything not on ports 80 or 443). If the website fails to load at all, we’ll abort the test and record it as a failure.
To create the attack, log into the Gremlin web app at https://app.gremlin.com. From the Dashboard, click Create Attack and select the type of resource you wish to attack. Since we’re targeting a Kubernetes service, we’ll select Kubernetes.
Next we’ll select the frontend Deployment as the target. We’ll select our Kubernetes test cluster from the drop-down, then scroll down and select Deployments > frontend. The blast radius diagram to the right of the Deployment highlights the selected service.
Now let’s set up our attack. Under Choose a Gremlin, select Network > Blackhole.
We’ll configure this attack to run for 60 seconds and exclude ports 80 and 443 from both ingress and egress (incoming and outgoing traffic) to allow HTTP and HTTPS traffic. We’ll also exclude port 8080, which Kubernetes uses to perform health checks on the service. All other ports will be blocked.
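The effect of this exclusion list can be modeled in a few lines. The sketch below only illustrates the rule "block everything except the excluded ports"; it is not how Gremlin implements the attack, and ports 3550 and 7070 are assumed stand-ins for backend service ports.

```python
# Ports excluded from the blackhole: HTTP, HTTPS, and the health check port.
EXCLUDED_PORTS = frozenset({80, 443, 8080})


def is_blackholed(port: int, excluded_ports: frozenset = EXCLUDED_PORTS) -> bool:
    """True if traffic on `port` would be dropped under this exclusion list."""
    if not 0 < port < 65536:
        raise ValueError("port must be between 1 and 65535")
    return port not in excluded_ports


# 3550 and 7070 are hypothetical backend ports used only for illustration.
for port in (80, 443, 8080, 3550, 7070):
    print(port, "blocked" if is_blackholed(port) else "allowed")
```

Checking the list this way before launching the attack helps catch a forgotten exclusion, such as the health check port.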
Click Unleash Gremlin to run the attack. When the attack enters the “Running” stage, our network resources will be impacted and we can begin testing.
If you need to stop an attack for any reason, click the “Halt” button in the Attack Details screen, or the “Halt All Attacks” button in the top-right corner of the Gremlin web app.
With the attack running, let’s try refreshing the website. Instead of displaying an error message, it seems to be stuck in a loading state. This refutes our hypothesis and shows that our fix didn’t work the way we expected it to. Let’s halt the attack and dig deeper into the cause.
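Manually refreshing the page works for training, but the same check can be scripted. The sketch below classifies a page load against the hypothesis and abort condition described earlier. The `fetch_page` callable and the error marker text are assumptions; in practice `fetch_page` might wrap an HTTP request to the shop with a short timeout.

```python
from enum import Enum


class Outcome(Enum):
    HYPOTHESIS_CONFIRMED = "page loaded with the error message"
    HYPOTHESIS_REFUTED = "page loaded without the error message"
    ABORT = "page failed to load"


def classify_page_load(fetch_page, error_marker: str) -> Outcome:
    """Compare one page load against the hypothesis and the abort condition."""
    try:
        body = fetch_page()
    except OSError:  # covers timeouts and connection errors
        return Outcome.ABORT
    if error_marker in body:
        return Outcome.HYPOTHESIS_CONFIRMED
    return Outcome.HYPOTHESIS_REFUTED


def stuck_page() -> str:
    # Stands in for a request that never returns within its timeout.
    raise TimeoutError("no response within the timeout")


result = classify_page_load(stuck_page, "temporarily unavailable")
print(result.name, "-", result.value)  # ABORT - page failed to load
```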
Using Recommended Scenarios to validate resilience to common failure modes
While we recommend having your teams design their own chaos experiments, a quick and easy way to get started is using Recommended Scenarios. These are designed to test real-world failure modes across different technologies including cloud providers, Kubernetes, message queues, databases, and monitoring solutions. You only need to select your targets, and run the Scenario. This is a great way to test your systems against common failures and learn how to design your own Scenarios.
Analyzing the results
With the experiment complete, have your team do a full analysis of the outcome. Ask questions such as:
- Which upstream service is holding up the frontend?
- Why didn’t our fix work as expected?
- Do we need to lower our timeout thresholds?
- Did the experiment have any unintended side-effects?
Running a chaos experiment can cause normal operations to fail as well. For example, the frontend service runs automated health checks over port 8080. If we were to blackhole that port, the health check would fail and Kubernetes would treat the service as unhealthy, restarting it or no longer routing traffic to it.
You can record your observations by editing the Results section of the Attack Details page in the Gremlin web app. If the experiment had an unexpected outcome, such as causing a failure in another service, record this as well. If you used a third-party tool to track and record metrics, you can add a link to the tool in the Metrics Reference field. This is useful for comparing the results of recent experiments to older experiments to track changes in reliability over time.
Remember, this is a blameless process. If something failed, don’t hold anyone at fault. The reason we’re performing Chaos Engineering is to find and fix problems like these, not to point fingers or shift blame. And the reason we’re doing this in test/staging is so that we don’t have to worry about harming the business.
Implementing fixes and re-testing
Given our results and observations, what can we do to fix the problem? This is where the expertise of your engineers is useful. Their understanding of the application, infrastructure, and tools will help them with brainstorming solutions. Encourage your engineers to talk with each other, hash out ideas (even bad ones), and try different solutions. Depending on the complexity of the failure and of your systems, this may take longer than a single training session.
For example, one solution we can try is to scale up our frontend Deployment. Notice how we only have one instance (called a replica) of our frontend service running. If we add another instance, Kubernetes will schedule the new instance, potentially on another node, and load balance network traffic between the two. Let’s try it by running the following command on our Kubernetes cluster:
kubectl scale --replicas=2 deployment/frontend
We can confirm that the replica is running by looking at our targets in the Gremlin web app:
Now let’s re-run the experiment. Under Attacks, we’ll click on the blackhole attack that we ran previously, then click Rerun. By default, this will target both instances, but we can target a single instance by reducing “Percent of pods to impact” to 50%.
Next, we’ll click Unleash Gremlin and try loading the shop. Some of the requests will go through successfully, but some will fail. This is because some requests are still being routed to the instance that the blackhole attack is running on. This tells us that this isn’t a perfect solution, but we’re on the right track. We just need to iterate and try again. We should encourage this practice of experimenting, fixing, and validating until we have a solution that works reliably and that we feel comfortable deploying to production.
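A small simulation shows why roughly half of the requests fail. It assumes strict round-robin routing between two replicas, which is a simplification of how Kubernetes distributes traffic, and the replica names are hypothetical.

```python
from itertools import cycle


def success_rate(replica_healthy: dict[str, bool], request_count: int) -> float:
    """Fraction of requests that succeed under round-robin routing."""
    if not replica_healthy or request_count <= 0:
        raise ValueError("need at least one replica and one request")
    replicas = cycle(replica_healthy)  # cycles over replica names in order
    successes = sum(
        1 for _ in range(request_count) if replica_healthy[next(replicas)]
    )
    return successes / request_count


# One replica is under the blackhole attack, the other is untouched.
replicas = {"frontend-a": True, "frontend-b": False}
print(success_rate(replicas, 100))  # 0.5
```

Adding replicas raises the success rate but never reaches 100% while traffic is still routed to an impaired instance, which is why the fix needs another iteration.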
Once you feel confident in your fix, increase the blast radius and magnitude of your experiment and test that your services remain available. Increasing the attack scale can reveal new and unexpected failure modes that your fixes might not account for, which is why it’s important to start small and gradually scale up your chaos experiments.
Moving Chaos Engineering into production
Running chaos experiments in production can be a scary thought for teams just starting out. However, the reason we do Chaos Engineering is to make our production systems more reliable. It doesn’t matter how stable our test and staging environments are if customers can’t use our applications.
With that said, how can we experiment in production while minimizing the risk to our customers? We can use deployment strategies like blue-green deployments, canary deployments, or dark launches to reduce the blast radius of our experiments in terms of impacted customer traffic. Dark launches are especially useful, as they allow us to copy user traffic to a subset of systems set aside specifically for experimentation and testing. This lets us run experiments under real-world conditions without actually affecting our ability to process requests.
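One common way to limit an experiment to a fixed share of customers, as a canary deployment does, is to bucket users deterministically by hashing their identifier. The sketch below is an assumed illustration of that technique, with hypothetical user IDs; it is not a Gremlin feature or a description of any particular load balancer.

```python
import hashlib


def in_experiment_group(user_id: str, percent: float) -> bool:
    """Deterministically place roughly `percent`% of users in the experiment group."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # buckets 0..9999
    return bucket < percent * 100


# Hypothetical user IDs; about 10% should land inside the blast radius.
users = [f"user-{n}" for n in range(10_000)]
share = sum(in_experiment_group(u, 10) for u in users) / len(users)
print(f"{share:.1%} of users fall inside the blast radius")
```

Because the assignment depends only on the user ID, the same customers stay in the same group for the whole experiment, which makes the results easier to interpret.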
If we want to experiment without changing our deployment strategy, we can fine-tune our attack parameters to only impact certain kinds of traffic. For example, we limited the magnitude of a blackhole attack to specific ports, and we can do the same for IP addresses, network devices, hosts, and certain external service provider settings (e.g. if we were running our shop on AWS). If we only wanted to block traffic to one specific upstream service, we could enter its hostname or internal IP address in the attack parameters. In this screenshot, we specifically target our product catalog service by entering its DNS name into the Hostnames field when creating an attack:
Reducing the blast radius doesn’t just limit the potential impact of our experiment, but also reduces the number of variables. We can start by looking for reliability issues between our frontend and product catalog services, then incrementally increase our blast radius to include additional services. By the time we start testing our entire application holistically, we’ve already found and addressed some of the biggest reliability threats.
Building Chaos Engineering as an organizational practice
Now that you’ve successfully introduced Chaos Engineering to one of your teams, start spreading the practice to other teams. If possible, choose another team that owns their own service(s) and has dealt with reliability issues in the past. Encourage the team that you just finished training to help with training the new team. This helps ingrain the practice into the minds of both teams, encourages cross-team collaboration, and builds camaraderie.
Some teams will be more receptive to Chaos Engineering than others. Some will need more leadership and guidance, while others might work better with a more self-guided and experimental approach. If a team needs more context, Gremlin provides a number of additional resources for learning about Chaos Engineering, including:
- The Gremlin Chaos Engineering page
- Chaos Engineering: the history, principles, and practice
- Numerous Gremlin tutorials
- The Chaos Engineering Slack community
No matter the approach, always remember to show how Chaos Engineering can help each team with their specific problems.
Onboarding with Chaos Engineering
As you hire new engineers, make Chaos Engineering training part of your onboarding process. Educate incoming engineers on the principles of Chaos Engineering and the importance of building reliable applications and systems. Show them how to run experiments using Gremlin, and have them sit in on GameDays and FireDrills. This helps:
- Teach them the process of running experiments so they can confidently run their own.
- Encourage them to think about reliability when writing code.
- Familiarize them with your applications and systems, and the different ways that they can potentially fail.
- Prepare them for problems that they might encounter in production or when working on-call.
As you continue to train your existing teams and onboard new employees, Chaos Engineering will eventually become common practice across the engineering organization. Make sure to continually promote the practice by running GameDays and FireDrills and by allowing teams to spend time on proactive experimentation. In addition, encourage your engineers to become part of the global Chaos Engineering community by joining the community Slack channel. If you have any questions about the Gremlin product, contact us.