When Ticketmaster started its Kubernetes migration, it had to address a huge problem: whenever ticket sales opened for a popular event, as many as 150 million visitors flooded its website, producing traffic that was effectively a distributed denial-of-service (DDoS) attack. With new events happening every 20 minutes and $7.6 billion in revenue at stake, outages could mean hundreds of thousands of dollars in lost sales.

Kubernetes is the world’s leading cloud-native container orchestration tool for many reasons, but one of its strongest features is its resilience and failover capability. Kubernetes automates much of the work needed to keep applications up and running, including monitoring workloads, restarting crashed applications, and scaling up systems during peak events.

But Kubernetes isn’t a silver bullet, and in 2020, 12% of companies using containers and orchestration tools cited reliability as one of their top challenges to adoption. Organizations planning a migration need to be aware of how Kubernetes—and the applications, services, and workloads running on it—can potentially fail.

For a successful Kubernetes rollout, you need to proactively test your cluster and applications to understand how they can fail. Chaos Engineering lets you do this in a safe and controlled way. In this article, we’ll explain how Chaos Engineering is key to a successful Kubernetes adoption, and how Gremlin can help. For a complete list of the most critical Kubernetes risks, download a free copy of our ebook.

How Reliability Applies to Kubernetes

A reliable system is one that you can trust to remain available. Even if part of the system fails, it will continue operating and serving customers. Kubernetes was designed with the assumption that customer-facing production environments will fail, and so it has built-in mechanisms for detecting these failures, creating redundancy, rerouting customer traffic to healthy components, and restarting or replacing failed components. All of this happens automatically and behind the scenes, leaving engineers free to focus on building applications.
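To make these mechanisms concrete, here is a minimal sketch in Python of the detect, reroute, and replace cycle. It is a toy model, not Kubernetes code: `Replica`, `ToyController`, `reconcile`, and `route` are hypothetical names chosen for illustration, and the health check is just a boolean flag.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    healthy: bool = True


@dataclass
class ToyController:
    """Toy stand-in for a self-healing controller; not real Kubernetes code."""
    desired: int
    replicas: list = field(default_factory=list)
    created: int = 0

    def reconcile(self):
        # Drop replicas that failed their health check,
        self.replicas = [r for r in self.replicas if r.healthy]
        # then create replacements until the desired count is met again.
        while len(self.replicas) < self.desired:
            self.created += 1
            self.replicas.append(Replica(f"replica-{self.created}"))

    def route(self):
        # Only healthy replicas receive traffic.
        healthy = [r for r in self.replicas if r.healthy]
        if not healthy:
            raise RuntimeError("no healthy replicas available")
        return random.choice(healthy)


controller = ToyController(desired=3)
controller.reconcile()                   # start three replicas
controller.replicas[0].healthy = False   # simulate a crash
served_by = controller.route()           # traffic avoids the failed replica
controller.reconcile()                   # the failed replica is replaced
```

The point of the sketch is the division of labor: routing hides the failure from customers right away, while reconciliation restores redundancy afterward. Real clusters add many more moving parts, which is where the complexity comes from.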

The downside—especially for teams that are new to Kubernetes—is that all of these features add complexity, and this creates a steep learning curve. If one of these components fails, engineers need to know:

  • What impact does this have on our cluster, our applications, and our customers?
  • Does Kubernetes have any mechanisms for handling this kind of failure automatically?
  • How do we recover as quickly and efficiently as possible?
  • How do we prevent this from happening in the future?

Unfortunately, these questions can’t always be answered using traditional testing methods. The most effective way to validate that a failure-handling mechanism works is to actually create the type of failure it’s meant to handle. This practice of deliberately causing failure in order to validate the resilience of a system is exactly what Chaos Engineering is.
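The shape of such an experiment can be sketched in a few lines of Python. This is an illustrative outline under stated assumptions, not Gremlin's API: `run_experiment`, `steady_state`, `inject`, and `halt` are hypothetical names, and the "system" is just a set of replica names that are currently up.

```python
def run_experiment(steady_state, inject, halt):
    """Run one experiment: verify, inject failure, verify again, always halt."""
    if not steady_state():
        return "skipped: system was not healthy before the experiment"
    inject()
    try:
        return "passed" if steady_state() else "failed: hypothesis disproved"
    finally:
        halt()  # always return the system to its original state


# Hypothetical system under test: names of replicas that are currently up.
up = {"a", "b", "c"}


def steady_state():
    # Hypothesis: the service stays available while at least two replicas are up.
    return len(up) >= 2


result = run_experiment(
    steady_state,
    inject=lambda: up.discard("a"),  # failure: one replica goes down
    halt=lambda: up.add("a"),        # halt: restore the replica
)
```

A failed result is as useful as a passed one, because it exposes a weakness before customers find it. The `finally` clause reflects the safety requirement: whatever the outcome, the experiment ends by undoing the failure it introduced.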

How Chaos Engineering Makes Kubernetes More Reliable

The goal of Chaos Engineering is to improve the reliability of a system by ensuring it can withstand turbulent conditions, which matters most for complex, distributed, cloud-native platforms like Kubernetes. Using Chaos Engineering, we can verify that our applications, and the Kubernetes cluster itself, remain available even when a core component fails.

This is useful for improving the resilience of existing clusters, but it’s especially important for teams that are new to Kubernetes. With a new Kubernetes migration or deployment, Chaos Engineering creates an opportunity to test different aspects of the cluster as it’s being built. This can give your engineers a better understanding of how Kubernetes works and how to architect it in the most resilient way possible, long before running your first production workloads.

[Chaos Engineering] allows you to consider edge cases and ways that your code may go wrong in ways that you did not previously consider prior to running a chaos experiment.
Tom Deering, Senior Software Engineer at Workiva

Gremlin provides a safe, secure, and easy-to-use platform for running chaos experiments on Kubernetes. The Gremlin Kubernetes agent automatically detects Kubernetes resources including nodes, containers, Pods, and Deployments, and allows you to target any of these resources for experimentation. Gremlin provides a number of attacks that can create various failure states including network outages, high CPU load, and stopped processes. If an experiment causes unexpected problems, you can halt the experiment and fall back to a steady state at the click of a button, preventing any harm to your cluster or applications. When your team is ready to run more complex experiments, you can use Scenarios to combine several attacks and recreate real-world outages.

Selecting Kubernetes resources to target for a chaos experiment in Gremlin
Gremlin makes Chaos Engineering easy and seamless. For us, it’s cut down the amount of time involved in designing and executing the chaos experiments, particularly for our Microservices and Kubernetes.
Chaitanya Krant

To see how easy it is to perform Chaos Engineering on Kubernetes with Gremlin, read our blog post on targeting Kubernetes resources in Gremlin. Once your team is ready to start improving reliability, our ebook "Kubernetes Reliability at Scale" offers several example experiments to help you get started.

Andre Newman
Sr. Reliability Specialist