Reliability risks are potential points of failure in your system where an outage could occur. If you can find and remediate reliability risks, then you can prevent incidents before they happen.

In complex Kubernetes systems, these reliability risks can take a wide variety of forms, including node failures, pod or container crashes, missing autoscaling rules, misconfigured load balancing or application gateway rules, pod crash loops, and more.

And they’re more prevalent than you might think. Working closely with large enterprise customers, Gremlin found critical reliability risks in nearly every organization’s production environments.

Let’s take a look at the top ten critical Kubernetes reliability risks—and what you can do to find and mitigate them before they cause an outage.

1. Missing CPU requests

A common risk is deploying pods without setting a CPU request. It may seem like a minor, low-severity omission, but a missing CPU request can have a big impact, up to and including preventing your pod from running.

Requests serve two key purposes:

  1. They tell Kubernetes the minimum amount of the resource to allocate to a pod. This helps Kubernetes determine which node to schedule the pod on and how to schedule it relative to other pods.
  2. They protect your nodes from resource shortages by preventing too many pods from being scheduled onto a single node.

Without a request, Kubernetes might schedule a pod onto a node that doesn't have enough capacity for it. Even if the pod uses little CPU at first, its usage could grow over time, leading to CPU exhaustion on the node.
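As a minimal sketch, a CPU request is set under the container's resources field (the pod name and image here are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-server            # illustrative name
spec:
  containers:
    - name: web
      image: nginx:1.25
      resources:
        requests:
          cpu: "250m"         # reserve a quarter of a CPU core for scheduling
```

With this in place, the scheduler will only place the pod on a node with at least 250 millicores unallocated.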

Find out how to detect missing CPU requests and how to resolve the risk.

2. Missing memory requests

A memory request specifies how much RAM should be reserved for a pod's container. When you deploy a pod that needs a minimum amount of memory, such as 512 MB or 1 GB, you can define that in your pod's manifest. Kubernetes then uses that information to determine where to deploy the pod so it has at least the amount of memory requested.

When deploying a pod without a memory request, Kubernetes has to make a best-guess decision about where to deploy the pod.

If the pod gets deployed to a node with little free memory remaining, and the pod gradually consumes more memory over time, it could trigger an out-of-memory (OOM) event that terminates the pod. If the terminations keep repeating, the pod can end up in the dreaded CrashLoopBackOff status.
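A minimal sketch of a pod manifest with a memory request, matching the 512 MB example above (the names and image are hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: api-server            # illustrative name
spec:
  containers:
    - name: api
      image: example/api:1.0  # hypothetical image
      resources:
        requests:
          memory: "512Mi"     # only schedule onto nodes with at least 512 MiB free
```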

Learn more about finding and resolving memory request risks.

3. Missing memory limits

A memory limit is a cap on how much RAM a pod is allowed to consume at any point. When you deploy a pod without a memory limit, it can consume as much RAM as it wants, like any other process. If it continually uses more and more RAM without freeing any (known as a memory leak), eventually the host it's running on will run out of RAM.

At that point, a kernel process called the OOM (out of memory) killer jumps in and terminates the process before the entire system becomes unstable.

While the OOM killer should be able to find and stop the offending pod, it isn't guaranteed to succeed. If it doesn't free enough memory, the entire system could lock up, or it could kill unrelated processes in an attempt to free memory.

Setting a limit and a request creates a range of memory that the pod could consume, making it easier for both you and Kubernetes to determine how much memory the pod will use on deployment.
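As a sketch, setting both fields gives Kubernetes a range of 512 MiB to 1 GiB to plan around (the values are illustrative):

```yaml
resources:
  requests:
    memory: "512Mi"   # minimum reserved at scheduling time
  limits:
    memory: "1Gi"     # hard cap; exceeding it gets the container OOM-killed
```

A container that exceeds its limit is terminated by the OOM killer, which keeps one leaking pod from starving the rest of the node.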

Find out how to set memory limits and prevent memory leaks.

4. Missing liveness probes

A liveness probe is essentially a health check that periodically sends an HTTP request to a container, runs a command inside it, or opens a TCP connection, then waits for a healthy response. If the response doesn't arrive, or the container returns a failure, the probe triggers a restart of the container.

The power of liveness probes is in their ability to detect container failures and automatically restart failed containers. This recovery mechanism is built into Kubernetes itself without the need for a third-party tool. Service owners can define liveness probes as part of their deployment manifests, and their containers will always be deployed with liveness probes.

In theory, the only time a service owner should have to manually check their containers is when restarting doesn't fix the problem (such as a pod stuck in CrashLoopBackOff). But for any of this to happen, a liveness probe has to be defined in the container's manifest.
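A minimal sketch of an HTTP liveness probe in a container spec (the endpoint, port, and timings are illustrative assumptions):

```yaml
livenessProbe:
  httpGet:
    path: /healthz            # hypothetical health endpoint
    port: 8080
  initialDelaySeconds: 10     # give the app time to start before the first probe
  periodSeconds: 15           # probe every 15 seconds
  failureThreshold: 3         # restart after 3 consecutive failures
```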

Learn how to detect missing liveness probes and make sure they’re defined.

5. No Availability Zone redundancy

By default, many Kubernetes cloud providers provision new clusters within a single Availability Zone (AZ). Because these AZs are isolated, one AZ can experience an incident or outage without affecting other AZs, creating redundancy—but only if your application is set up in multiple AZs.

If a cluster is set up in a single AZ and that AZ fails, the entire cluster will fail along with any applications and services running on it. That's why the AWS Well-Architected Framework recommends having at least two redundant AZs for high availability.

Kubernetes natively supports deploying across multiple AZs, both in its control plane (the systems responsible for running the cluster) and worker nodes (the systems responsible for running your application pods).

Setting up a cluster for AZ redundancy usually requires additional setup on the user's side and leads to higher cloud hosting costs, but for critical services, the benefits far outweigh the risk of an incident or outage.
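Once a cluster has nodes in multiple zones, a pod spec can ask the scheduler to spread replicas across them with a topology spread constraint. A minimal sketch (the app label is illustrative):

```yaml
topologySpreadConstraints:
  - maxSkew: 1                                # zones may differ by at most one replica
    topologyKey: topology.kubernetes.io/zone  # spread across Availability Zones
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: web                              # illustrative label
```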

Find out how to set up AZ redundancy and scan for missing redundancy.

6. Pods in CrashLoopBackOff

CrashLoopBackOff is the state that a pod enters after repeatedly terminating due to an error. Normally, if a container crashes, Kubernetes waits for a short delay and restarts the pod.

The time between when a pod crashes and when it restarts is called the backoff delay. On each restart, Kubernetes doubles the delay, starting at 10 seconds, then 20 seconds, then 40 seconds, and so on, up to a maximum of 5 minutes. While a pod is waiting out this delay between failed restarts, Kubernetes gives it the status CrashLoopBackOff.

CrashLoopBackOff can have several causes, including:

  • Application errors that cause the process to crash.
  • Problems connecting to third-party services or dependencies.
  • Trying to allocate unavailable resources to the container, like ports that are already in use or more memory than what's available.
  • A failed liveness probe.

There are many more possible causes, which is why CrashLoopBackOff is one of the most common issues that even experienced Kubernetes developers run into.

Get tips for CrashLoopBackOff troubleshooting, detecting it, and verifying your fixes.

7. Images in ImagePullBackOff

Before Kubernetes can create a container, it first needs an image to use as the basis for the container. An image is a static, compressed bundle containing all of the files and executable code needed to run the software inside it.

Normally, Kubernetes downloads images as needed (i.e. when we deploy a manifest). Kubernetes uses the container specification to determine which image to use, where to retrieve it from, and which version to pull.

If Kubernetes can't pull the image for any reason (such as an invalid image name, a poor network connection, or missing credentials for a private repository), it will retry after a set amount of time. As with CrashLoopBackOff, it exponentially increases the wait between retries, up to a maximum of 5 minutes, and sets the container's status to ImagePullBackOff while it waits.
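For the private-repository case, the fix is usually to reference a registry credential in the pod spec. A minimal sketch, where the Secret name and registry are hypothetical:

```yaml
spec:
  imagePullSecrets:
    - name: registry-credentials                  # Secret holding registry login details
  containers:
    - name: app
      image: registry.example.com/team/app:1.4.2  # hypothetical private image
```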

Learn about detecting and troubleshooting ImagePullBackOff, then verifying your fixes.

8. Unschedulable pod errors

A pod is unschedulable when it's been put into Kubernetes' scheduling queue, but can't be deployed to a node. This can be for a number of reasons, including:

  • The cluster not having enough CPU or RAM available to meet the pod's requirements.
  • Pod affinity or anti-affinity rules preventing it from being deployed to available nodes.
  • Nodes being cordoned due to updates or restarts.
  • The pod requiring a persistent volume that's unavailable or bound to an unavailable node.

Although the reasons vary, an unschedulable pod is almost always a symptom of a larger problem. The pod itself may be fine, but the cluster isn't operating the way it should, which makes resolving the issue even more critical.

Unfortunately, there is no easy, direct way to query for unschedulable pods. Pods waiting to be scheduled are marked "Pending," but so are pods that are being deployed normally. The difference comes down to how long a pod remains in "Pending": a pod that can't be scheduled stays there indefinitely.

Find out how to detect and resolve unschedulable pod issues.

9. Application version non-uniformity

Version uniformity refers to whether all replicas of a workload run the same container image version. When you define a pod or deployment in a Kubernetes manifest, you can specify which version of the container image to use in one of two ways:

  • Tags, human-readable labels assigned by the image's publisher. Tags are mutable: multiple image versions can carry the same tag, meaning a single tag can refer to different image versions over time.
  • Digests, which are the result of running the image through a hashing function (usually SHA-256). Each digest identifies exactly one version of an image; changing the image in any way also changes its digest.

Tags are easier to read than digests, but they come with a catch: a single tag could refer to multiple image versions. The most infamous example is latest, which always points to the most recently released version of a container image. If you deploy a pod using the latest tag today, then deploy another pod tomorrow, you could end up with two completely different versions of the same pod.
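The difference looks like this in a container spec (the registry, image name, and digest are hypothetical):

```yaml
containers:
  - name: app
    # Mutable tag: may resolve to different images over time
    # image: registry.example.com/team/app:latest
    # Immutable digest: always pulls exactly one image version
    image: registry.example.com/team/app@sha256:9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08
```

Pinning by digest trades readability for the guarantee that every replica runs the identical image.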

Unfortunately, this means you might end up with two different versions running side-by-side: one with the latest fix, and one without it.

Learn more about version non-uniformity and how to resolve it.

10. Init container errors

An init container is a container that runs before the main containers in a pod. Init containers are often used to prepare the environment so the main container has everything it needs to run.

For example, imagine you want to deploy a large language model (LLM) in a pod. LLMs require datasets that can be several gigabytes in size. You can create an init container that downloads these datasets to the node so that when the LLM container starts, it immediately has access to the data it needs.
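That pattern can be sketched as a pod with an init container that downloads the data into a shared volume before the main container starts (all names, images, and the URL are hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: llm-server                 # illustrative name
spec:
  initContainers:
    - name: fetch-model
      image: alpine:3.19
      # Hypothetical dataset URL; this must succeed before the main container starts
      command: ["wget", "-O", "/models/model.bin", "https://example.com/model.bin"]
      volumeMounts:
        - name: model-data
          mountPath: /models
  containers:
    - name: llm
      image: example/llm-server:1.0   # hypothetical image
      volumeMounts:
        - name: model-data
          mountPath: /models
  volumes:
    - name: model-data
      emptyDir: {}
```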

Init containers are incredibly useful for setting up a pod before handing it off to the main container, but they introduce an additional point of failure.

Init containers run during the pod's initialization process and must finish before the main container starts. If you define multiple init containers, they run sequentially, and each must complete successfully before the next one begins.

If an init container fails and the pod's restartPolicy is Always or OnFailure, Kubernetes restarts the pod repeatedly until the init container succeeds, which can leave it in the status Init:CrashLoopBackOff. If the restartPolicy is Never, Kubernetes marks the entire pod as failed.

Find out more about init container errors, how to detect them, and how to troubleshoot them.

Kubernetes is complex, but detecting risks can be simple

While any of these risks can be enough to take down your Kubernetes applications, most of them can be fixed relatively easily once you know they’re there.

That makes it essential to have a method for automatically detecting these risks and surfacing them so your team can remediate the issue before it causes an incident or outage.

Find out more in the Critical Kubernetes Risks and How to Detect Them at Enterprise Scale eBook.

You’ll learn more about what determines if a reliability risk is critical, categories of critical reliability risks, and methods for detecting them at an enterprise scale.

Gavin Cahill
Sr. Content Manager

Gremlin's automated reliability platform empowers you to find and fix availability risks before they impact your users. Start finding hidden risks in your systems with a free 30 day trial.