3 min read

If you're adopting Kubernetes, you need Chaos Engineering

When Ticketmaster started their Kubernetes migration, they had to address a huge problem: whenever ticket sales opened for a popular event, as many as 150 million visitors flooded their website, creating what was effectively a distributed denial-of-service (DDoS) attack. With new events happening every 20 minutes and $7.6 billion in revenue at stake, outages could mean hundreds of thousands of dollars in lost sales.

Kubernetes is the world’s leading container orchestration tool for many reasons, but one of its strongest features is its resilience and failover capability. Kubernetes automates much of the work needed to keep applications up and running, including monitoring workloads, restarting crashed applications, and scaling up systems during peak events.

But Kubernetes isn’t a silver bullet: as many as 15% of companies using containers and orchestration tools still say reliability is one of their top challenges. Organizations planning a migration need to understand how both Kubernetes and the applications running on it can fail.

For a successful Kubernetes rollout, you need to proactively test your cluster and applications to understand how they can fail. Chaos Engineering lets you do this in a safe, controlled way. In this article, we’ll explain why Chaos Engineering is key to a successful Kubernetes adoption, and how Gremlin can help.

How Reliability Applies to Kubernetes

Reliability refers to how much you can trust that a system will remain available. A reliable system will continue serving customers even if part of that system fails. Kubernetes was designed with the assumption that customer-facing systems will fail, and so it has mechanisms built in for creating redundancy, detecting failures, rerouting customer traffic to healthy components, and restarting or replacing failed components. All of this happens automatically and behind the scenes, leaving engineers free to focus on building applications.
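To make the idea of automatic detection and replacement more concrete, here is a minimal Python sketch of a reconciliation loop. It is a toy model, not Kubernetes code: the names `Replica`, `ToyDeployment`, `reconcile`, and `route` are invented for illustration, and a real cluster does this with controllers, health probes, and Services.

```python
from dataclasses import dataclass, field
import itertools


@dataclass
class Replica:
    name: str
    healthy: bool = True


@dataclass
class ToyDeployment:
    """Hypothetical stand-in for a workload with a desired replica count."""
    desired: int
    replicas: list = field(default_factory=list)
    _ids: itertools.count = field(default_factory=itertools.count, repr=False)

    def reconcile(self):
        """One control-loop pass: drop failed replicas, then top up to the desired count."""
        self.replicas = [r for r in self.replicas if r.healthy]
        while len(self.replicas) < self.desired:
            self.replicas.append(Replica(f"replica-{next(self._ids)}"))

    def route(self):
        """Names of the replicas that are allowed to receive traffic."""
        return [r.name for r in self.replicas if r.healthy]


deployment = ToyDeployment(desired=3)
deployment.reconcile()                  # creates replica-0, replica-1, replica-2
deployment.replicas[1].healthy = False  # simulate a crashed replica
before = deployment.route()             # traffic only goes to healthy replicas
deployment.reconcile()                  # the failed replica is replaced
after = deployment.route()
print(before, after)
```

Between the failure and the next reconcile pass, traffic is served by the two surviving replicas. That window of reduced redundancy is where unexpected behavior tends to appear.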

The downside—especially for teams that are unfamiliar with Kubernetes—is that all of these features add complexity, and this creates a steep learning curve. If one of these components fails, engineers need to know:

  • What impact does this have on our cluster, our applications, and our customers?
  • How do we recover as quickly and efficiently as possible?
  • Does Kubernetes have any mechanisms for handling this kind of failure automatically?
  • How do we prevent this from happening in the future?

The only way to answer these questions is through experience. Teams need to deliberately and proactively test Kubernetes by causing failure, and this is only possible through Chaos Engineering.

How Chaos Engineering Makes Kubernetes More Reliable

Chaos Engineering is the practice of deliberately injecting failure into applications and systems in order to test for reliability. We can use Chaos Engineering to ensure that Kubernetes and our applications remain available even when a component fails. With a new Kubernetes deployment, Chaos Engineering also gives us an opportunity to test different aspects of our deployment for failure as we’re building them. Our engineers will have a better understanding of how Kubernetes works and how to architect it in the most resilient way possible, long before we start running production workloads.
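The shape of a chaos experiment can be sketched in a few lines of Python. This is an illustrative simulation under stated assumptions, not a real tool: the service is a dictionary of replica health flags, `serve`, `success_rate`, and `run_experiment` are hypothetical helpers, and the 99% threshold is an arbitrary example of a steady-state hypothesis.

```python
import random


def serve(replicas, rng, failover):
    """Send one request to a random replica; optionally retry once on a different one."""
    first, second = rng.sample(sorted(replicas), 2)
    if replicas[first]:
        return True
    return failover and replicas[second]


def success_rate(replicas, rng, failover, requests=1000):
    """Fraction of simulated requests that succeed."""
    return sum(serve(replicas, rng, failover) for _ in range(requests)) / requests


def run_experiment(replicas, target, failover, threshold=0.99, seed=0):
    """Measure a baseline, inject one failure, measure again, and always restore."""
    rng = random.Random(seed)
    baseline = success_rate(replicas, rng, failover)
    replicas[target] = False          # inject the failure
    try:
        during = success_rate(replicas, rng, failover)
    finally:
        replicas[target] = True       # return to the steady state
    return {"baseline": baseline, "during": during, "hypothesis_held": during >= threshold}


replicas = {"a": True, "b": True, "c": True}
with_failover = run_experiment(replicas, "b", failover=True)
without_failover = run_experiment(replicas, "b", failover=False)
print(with_failover["hypothesis_held"], without_failover["hypothesis_held"])
```

In this toy model the hypothesis holds only when the client retries against a second replica. Finding that kind of design gap is the purpose of running such an experiment before production traffic arrives.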

[Chaos Engineering] allows you to consider edge cases and ways that your code may go wrong in ways that you did not previously consider prior to running a chaos experiment.

Tom Deering
Senior Software Engineer at Workiva

Gremlin provides a safe, secure, and easy-to-use platform for running chaos experiments on Kubernetes. With Gremlin, you deploy an agent that automatically detects Kubernetes hosts and resources, allowing you to target specific resources for experimentation. Gremlin provides a number of attacks that create various failure states including network outages, high CPU load, and stopped processes. If an experiment causes unexpected problems, you can halt the experiment and fall back to a steady state at the click of a button, preventing any harm to your cluster or applications. When your team is ready to run more complex experiments, you can use Scenarios to combine several attacks and simulate real-world outages.
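The two ideas in this paragraph, chaining several failures together and halting with a clean rollback, can be illustrated with a short, generic Python sketch. It does not use or represent Gremlin's API: `attack`, `run_scenario`, the state keys, and the health check are all hypothetical, and the "system" is just a dictionary.

```python
from contextlib import ExitStack, contextmanager


@contextmanager
def attack(state, key, value):
    """Hypothetical attack: override one piece of state and restore it on exit."""
    previous = state[key]
    state[key] = value
    try:
        yield
    finally:
        state[key] = previous


def run_scenario(state, steps, is_healthy):
    """Apply steps in order; halt early and revert everything if the health check fails."""
    applied = []
    with ExitStack() as stack:
        for key, value in steps:
            stack.enter_context(attack(state, key, value))
            applied.append(key)
            if not is_healthy(state):
                # Leaving the ExitStack reverts every applied attack, newest first.
                return {"halted": True, "applied": applied}
    return {"halted": False, "applied": applied}


state = {"cpu_percent": 20, "network_up": True, "worker_running": True}
original = dict(state)
steps = [("cpu_percent", 95), ("network_up", False), ("worker_running", False)]

# Assumed guardrail for this example: the system counts as unhealthy once the network is down.
result = run_scenario(state, steps, is_healthy=lambda s: s["network_up"])
print(result, state == original)
```

The scenario stops at the second step, and the state matches the original afterwards. The design point is that every injected failure registers its own undo step, so halting is always safe regardless of how far the scenario got.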

Selecting Kubernetes resources to target for a chaos experiment in Gremlin

We use Gremlin to test various failure scenarios and build confidence in the resiliency of our microservices. The ability to target containerized services with an easy-to-use UI has reduced the amount of time it takes us to do fault injection significantly.

Paul Osman
Senior Engineering Manager at Under Armour

To see how easy it is to perform Chaos Engineering on Kubernetes with Gremlin, read our blog post on targeting Kubernetes resources in Gremlin. Once your team is ready to start improving reliability, our whitepaper on The First 5 Chaos Experiments to Run on Kubernetes offers several example experiments to help you get started.
