Whitepaper, 14 pages, 10 min read

May 26, 2026

Critical Kubernetes Reliability Risks

In this quick reference, you’ll learn what determines a critical reliability risk, the most common critical reliability risks, and methods for detecting them at enterprise scale.

00 Introduction

Reliability risks are potential points of failure in your system where an outage could occur. If you can find and remediate them, you can prevent incidents before they happen.

In complex Kubernetes systems, these reliability risks can take a wide variety of forms, including node failures, pod or container crashes, missing autoscaling rules, misconfigured load balancing or application gateway rules, pod crash loops, and more.

In this short reference, we'll take a look at critical Kubernetes reliability risks common in most systems, as well as how you can operationalize detection of these reliability risks at scale.

01 The fundamentals

What are critical reliability risks?

Kubernetes deployments are complex with a lot of moving pieces, so it can be hard to determine where the risks are, which risks are critical, and which are minor.

You can start by looking at the most important reliability features of any Kubernetes stack:

  • Scalability

    Can the system quickly respond to changes in demand?

    Can it scale up and scale back down?

    Do you support scaling of your pods and clusters?

  • Redundancy

    Can your applications keep running if part of your cluster fails?

    Do you have replication across multiple availability zones (or even multiple regions) so your services aren't impacted if you lose a data center, an AZ, or a rack?

  • Recoverability

    Can Kubernetes recover if something fails?

    Can your pods and nodes auto-restart after an outage?

    How long does it take for Kubernetes to detect and restart failed pods?

  • Consistency/Integrity

    Are your pod replicas all using the same container image?

    Do you have any obsolete code running anywhere?

    Do you have any failed rollouts or deployments that may have left you with different versions of the same container image running at the same time?

To find critical reliability risks, you have to be able to find answers to these questions.

02 By the numbers

The most common Kubernetes reliability risks

Working closely with customers, Gremlin found critical reliability risks in nearly every organization.

  • 26% of deployments have no zone redundancy
  • 80% are redundant in one or fewer zones (the AWS Well-Architected Framework recommends more than two)
  • 25% don't have memory limits
  • 20% of objects were bare pods (pods that won't restart if they fail)
  • 15% don't have memory requests

By cross-referencing the data with the definition of a critical reliability risk above, we arrived at a list of the most common Kubernetes critical reliability risks. To prevent these problems, you'll need a way to automatically detect each of these risks and determine whether you're still vulnerable to them. We'll explain more about how to do this in the next section.
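As one illustration of what automatic detection can look like, the bare pods mentioned above can be spotted by checking whether a pod has an owning controller. This is a minimal sketch, assuming pod objects are available as plain dictionaries (for example, parsed from `kubectl get pods -o json`); the function name `find_bare_pods` and the sample data are hypothetical.

```python
def find_bare_pods(pods):
    """Return "namespace/name" for every pod that no controller owns."""
    bare = []
    for pod in pods:
        meta = pod.get("metadata", {})
        # Pods created by a ReplicaSet, StatefulSet, DaemonSet, or Job
        # carry an ownerReferences entry; bare pods do not.
        if not meta.get("ownerReferences"):
            bare.append(f'{meta.get("namespace", "default")}/{meta.get("name")}')
    return bare


# Hypothetical sample data for illustration only.
pods = [
    {"metadata": {"name": "web-7d9f", "namespace": "shop",
                  "ownerReferences": [{"kind": "ReplicaSet", "name": "web"}]}},
    {"metadata": {"name": "debug-shell", "namespace": "shop"}},
]

print(find_bare_pods(pods))
```

Here only `shop/debug-shell` is reported, because nothing owns it and nothing will recreate it if it is lost.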

Each of the risks below includes links to blog posts to help you learn more about finding and fixing them.

Resource risks

Running out of resources directly impacts system stability. If your nodes don't have enough CPU or RAM available, they may start slowing down, locking up, or terminating resource-intensive pods to make room.

Setting requests is the first step towards preventing this, because they specify the minimum resources Kubernetes reserves for each container when it schedules a pod. Limits are somewhat the opposite: they set an upper cap on how much CPU or RAM a container can use, which prevents a memory leak from taking all of a node's resources.
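A check for missing requests and limits could look something like the following. It is a sketch under stated assumptions: the pod spec is a plain dictionary shaped like the `spec` of a pod manifest, and `missing_resource_settings` is a hypothetical helper, not part of any Kubernetes library. It treats a limit as a fallback for a missing request, since Kubernetes copies the limit into the request when only the limit is set.

```python
def missing_resource_settings(pod_spec):
    """Return (container name, finding) pairs for missing requests and limits."""
    findings = []
    for container in pod_spec.get("containers", []):
        resources = container.get("resources") or {}
        requests = resources.get("requests") or {}
        limits = resources.get("limits") or {}
        name = container["name"]
        # A limit without a request is accepted here because Kubernetes
        # defaults the request to the limit in that case.
        if "cpu" not in requests and "cpu" not in limits:
            findings.append((name, "no CPU request"))
        if "memory" not in requests and "memory" not in limits:
            findings.append((name, "no memory request"))
        if "memory" not in limits:
            findings.append((name, "no memory limit"))
    return findings


# Hypothetical pod spec for illustration only.
spec = {
    "containers": [
        {"name": "api",
         "resources": {"requests": {"cpu": "250m", "memory": "256Mi"}}},
        {"name": "sidecar"},
    ]
}

for name, finding in missing_resource_settings(spec):
    print(f"{name}: {finding}")
```

In this sample, `api` is flagged only for its missing memory limit, while `sidecar` is flagged for all three settings.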


Redundancy risks

Unfortunately, containers often crash, terminate, or restart with little warning. Even before that point, they can have less visible problems like memory leaks, network latency, and disconnections. Liveness probes allow Kubernetes to detect these problems, then terminate and restart the affected container.

On the node level, you should set up Kubernetes in multiple availability zones (AZs) for high availability. When these risks are remediated, your system will be able to detect pod failures and fail over to nodes in another AZ if there's an AZ failure.
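Both redundancy checks can be sketched in a few lines. The example below assumes pods and nodes are plain dictionaries shaped like Kubernetes API objects, and that nodes carry the well-known `topology.kubernetes.io/zone` label; the function names and sample data are hypothetical.

```python
from collections import Counter

ZONE_LABEL = "topology.kubernetes.io/zone"


def containers_without_liveness_probe(pod_spec):
    """Return the names of containers that define no livenessProbe."""
    return [c["name"] for c in pod_spec.get("containers", [])
            if "livenessProbe" not in c]


def zones_in_use(pods, nodes):
    """Count scheduled pods per availability zone, using node zone labels."""
    zone_by_node = {
        n["metadata"]["name"]: n["metadata"].get("labels", {}).get(ZONE_LABEL)
        for n in nodes
    }
    counts = Counter()
    for pod in pods:
        zone = zone_by_node.get(pod.get("spec", {}).get("nodeName"))
        if zone:  # skip unscheduled pods and nodes without a zone label
            counts[zone] += 1
    return dict(counts)


# Hypothetical cluster snapshot for illustration only.
nodes = [
    {"metadata": {"name": "node-a", "labels": {ZONE_LABEL: "us-east-1a"}}},
    {"metadata": {"name": "node-b", "labels": {ZONE_LABEL: "us-east-1a"}}},
    {"metadata": {"name": "node-c", "labels": {ZONE_LABEL: "us-east-1b"}}},
]
pods = [
    {"spec": {"nodeName": "node-a", "containers": [
        {"name": "web", "livenessProbe": {"httpGet": {"path": "/healthz", "port": 8080}}},
        {"name": "worker"},
    ]}},
    {"spec": {"nodeName": "node-b", "containers": [{"name": "web"}]}},
]

zones = zones_in_use(pods, nodes)
print("zones in use:", sorted(zones))
print("single-zone risk:", len(zones) < 2)
print("no liveness probe:", containers_without_liveness_probe(pods[0]["spec"]))
```

Both sample pods land in the same zone even though the cluster spans two, which is exactly the kind of gap a node-level view alone would miss.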


Container deployment risks

If a container crashes, Kubernetes waits for a short delay and restarts it. If the container keeps crashing, the delay between restarts grows, and Kubernetes reports the container with a CrashLoopBackOff status. Similarly, when Kubernetes fails to pull the container image, it retries with an increasing delay and gives the container a status of ImagePullBackOff.

There are also times when a pod simply can't be scheduled to run. Commonly, this happens because the cluster doesn't have the resources, or your pod requires a persistent volume that isn't available.
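These conditions show up in a pod's `status`, so they can be read programmatically. The sketch below assumes a pod object as a plain dictionary (for example, from `kubectl get pod -o json`); `deployment_risks` and the sample pods are hypothetical names used for illustration.

```python
RISKY_WAITING_REASONS = {"CrashLoopBackOff", "ImagePullBackOff", "ErrImagePull"}


def deployment_risks(pod):
    """Return a list of risk descriptions found in a pod's status."""
    risks = []
    status = pod.get("status", {})
    # Crash loops and image pull failures appear as a "waiting" container state.
    for cs in status.get("containerStatuses", []):
        waiting = cs.get("state", {}).get("waiting") or {}
        if waiting.get("reason") in RISKY_WAITING_REASONS:
            risks.append(f'{cs["name"]}: {waiting["reason"]}')
    # A pod that cannot be placed has PodScheduled=False, reason Unschedulable.
    for cond in status.get("conditions", []):
        if (cond.get("type") == "PodScheduled"
                and cond.get("status") == "False"
                and cond.get("reason") == "Unschedulable"):
            risks.append("pod: Unschedulable")
    return risks


# Hypothetical pod statuses for illustration only.
crashing = {"status": {"containerStatuses": [
    {"name": "api", "state": {"waiting": {"reason": "CrashLoopBackOff"}}},
    {"name": "proxy", "state": {"running": {}}},
]}}
pending = {"status": {"phase": "Pending", "conditions": [
    {"type": "PodScheduled", "status": "False", "reason": "Unschedulable"},
]}}

print(deployment_risks(crashing))
print(deployment_risks(pending))
```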

You should be able to automatically detect these risk conditions.

Application risks

Whenever you update your application, there are hidden reliability risks. Updates typically roll out gradually, not all at once. What happens if your team releases another update before the first rollout finishes? What happens if you push a release while Kubernetes is upgrading itself? You might end up with two different versions running side-by-side.

Another common application risk is introduced by using init containers. These are handy for preparing an environment for the main container, but they introduce a potential point of failure: if an init container can't run or fails, the main container never starts.
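Both application risks can be checked from pod data. This is a rough sketch, assuming the pods belong to the same service and are available as plain dictionaries shaped like Kubernetes pod objects; `mixed_versions` and `failed_init_containers` are hypothetical helpers.

```python
def mixed_versions(pods):
    """Return {container name: images} where replicas run different images."""
    images = {}
    for pod in pods:
        for c in pod["spec"]["containers"]:
            images.setdefault(c["name"], set()).add(c["image"])
    return {name: sorted(found) for name, found in images.items() if len(found) > 1}


def failed_init_containers(pod):
    """Return names of init containers that exited non-zero or are crash looping."""
    failed = []
    for cs in pod.get("status", {}).get("initContainerStatuses", []):
        state = cs.get("state", {})
        terminated = state.get("terminated") or {}
        waiting = state.get("waiting") or {}
        if terminated.get("exitCode", 0) != 0 or waiting.get("reason") == "CrashLoopBackOff":
            failed.append(cs["name"])
    return failed


# Hypothetical replicas of one service, for illustration only.
pods = [
    {"spec": {"containers": [{"name": "api", "image": "shop/api:1.4.0"}]},
     "status": {"initContainerStatuses": [
         {"name": "migrate", "state": {"terminated": {"exitCode": 0}}}]}},
    {"spec": {"containers": [{"name": "api", "image": "shop/api:1.5.0"}]},
     "status": {"initContainerStatuses": [
         {"name": "migrate", "state": {"terminated": {"exitCode": 1}}}]}},
]

print(mixed_versions(pods))
print([failed_init_containers(p) for p in pods])
```

Note that two images are expected briefly during a normal rolling update, so a real check would likely only flag a mix that persists.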

03 At scale

How do you detect these risks at scale?

There are a number of tools available to find these risks, but many of them are designed to be used by individual engineers or isolated small teams.

Enterprise-level systems require a different approach that can work across an organization to uncover reliability risks in a standardized, repeatable process. If you want to monitor for and surface these risks predictably, then you'll need an approach that is:

  • Automatic

    The system needs to run continuously, not just during deployment or provisioning.

  • Comprehensive

    It needs to analyze all of your deployments across all namespaces, and it needs to provide the same insights into each one (whether they're at risk or not).

  • Universal

    It needs to work across all Kubernetes environments, whether it's self-hosted in an organization-owned data center or fully managed in the cloud.

  • Fast

    It needs to detect and report on risks as close to instantaneously as possible to limit the risk of outages.

Let's look at some of the current methods for detecting these risks.

Admission Controllers

Admission Controllers (ACs) are built into Kubernetes to check and validate objects that are being deployed. Typically these are used as gatekeepers to ensure security, but they can also be used to check for reliability risks. They function across individual clusters or namespaces, applying the same policies universally. A big strength of ACs is that they detect risks immediately.

Shortcomings
  • They can't be easily managed across multiple clusters or namespaces
  • Risks are only evaluated when a deployment is created, which won't catch risks if there are in-production changes or configuration drift
  • If you have a deployment that knowingly breaks policy, you can't specify exceptions.
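To make the admission-time check concrete, here is a sketch of the decision logic a validating admission webhook might apply to a Deployment. It assumes the `admission.k8s.io/v1` AdmissionReview request and response shapes; the `review` function, the policy it enforces (a memory limit on every container), and the sample request are hypothetical, and the HTTP server and webhook registration are left out.

```python
def review(admission_review):
    """Build an AdmissionReview response that rejects Deployments
    whose containers have no memory limit."""
    request = admission_review["request"]
    containers = request["object"]["spec"]["template"]["spec"]["containers"]
    missing = [
        c["name"] for c in containers
        if "memory" not in (c.get("resources") or {}).get("limits", {})
    ]
    response = {"uid": request["uid"], "allowed": not missing}
    if missing:
        response["status"] = {
            "code": 403,
            "message": "containers without a memory limit: " + ", ".join(missing),
        }
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": response,
    }


# Hypothetical incoming request for illustration only.
incoming = {
    "apiVersion": "admission.k8s.io/v1",
    "kind": "AdmissionReview",
    "request": {
        "uid": "example-uid-123",
        "object": {"kind": "Deployment", "spec": {"template": {"spec": {
            "containers": [{"name": "api"}],
        }}}},
    },
}

result = review(incoming)
print(result["response"]["allowed"], result["response"]["status"]["message"])
```

The decision is made once, at request time, which is why drift that appears later goes unnoticed by this approach.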

Observability tools

You're probably already monitoring your systems, so it seems natural to use observability tools to detect these risks. After some setup, these can be used to catch when reliability risks cause issues so you can respond quickly to the incident.

Shortcomings
  • It's a reactive process, and you won't know something's gone wrong until after it's happened
  • Many risks are present before deployment, and all of them can be detected long before observability kicks in
  • Can require additional setup, tweaking, validating, and more.

Manual checks

Probably the most common method to detect these misconfigurations is for an engineer to manually scan or scrape configs, then correct any issues they find. While you don't have to pay for a tool, it's a process that takes a lot of engineering time.

Shortcomings
  • Checks are hard to standardize and can't be run on a continuous basis
  • Coverage is difficult and depends on many different teams or service owners
  • Issues are only detected when an engineer pulls the config, which means they could be present for some time before being found and resolved.
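A manual check often amounts to a small script run against a config dump. The sketch below parses the kind of JSON list returned by `kubectl get deployments --all-namespaces -o json` and flags deployments with fewer than two replicas. The two-replica threshold, the function name `single_replica_deployments`, and the inline sample are assumptions made for illustration.

```python
import json


def single_replica_deployments(kubectl_json):
    """Return sorted "namespace/name" entries for deployments with < 2 replicas."""
    doc = json.loads(kubectl_json)
    flagged = []
    for item in doc.get("items", []):
        # spec.replicas defaults to 1 when it is not set.
        replicas = item.get("spec", {}).get("replicas", 1)
        if replicas < 2:
            meta = item["metadata"]
            flagged.append(f'{meta["namespace"]}/{meta["name"]}')
    return sorted(flagged)


# Hypothetical dump standing in for real kubectl output.
sample = json.dumps({
    "kind": "List",
    "items": [
        {"metadata": {"namespace": "shop", "name": "cart"}, "spec": {"replicas": 3}},
        {"metadata": {"namespace": "shop", "name": "search"}, "spec": {"replicas": 1}},
        {"metadata": {"namespace": "ops", "name": "cron-ui"}, "spec": {}},
    ],
})

print(single_replica_deployments(sample))
```

The result is only as fresh as the dump it was run against, which is the core weakness of manual checks described above.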

04 How Gremlin helps

Gremlin's Detected Risks

As part of the Gremlin Reliability Platform, the Detected Risks feature identifies potential failure points and guides teams toward remedies in minutes with minimal configuration. After installing the Gremlin agent, add your services to Gremlin or annotate your Kubernetes deployments, and it will automatically scan for risks and show you a report.

It was designed to help you continuously uncover reliability risks as easily as possible with minimal setup so you can fix them before they ever reach your customers. With Detected Risks, you can easily monitor all of your Kubernetes services, whether they're running on-prem, in the cloud, or in a hybrid environment.

Avoid downtime. Use Gremlin to turn failure into resilience.

Gremlin empowers you to proactively root out failure before it causes downtime. See how you can harness chaos to build resilient systems by requesting a demo of Gremlin.
