How to monitor Kubernetes reliability risks

Automate Kubernetes risk monitoring to detect reliability risks before they cause incidents or outages

Editor’s Note: Metrics and reporting are only one part of the Kubernetes Resiliency Framework. Find out more about the framework in The Ultimate Guide to Kubernetes High Availability.


What is Kubernetes reliability risk monitoring?

While resiliency testing is necessary for uncovering some reliability risks, the nature of Kubernetes makes it possible to scan for key misconfigurations, bad default values, or anti-patterns that create reliability risks within the cluster. You can deploy a tool across your cluster to detect Kubernetes resources and analyze configurations across the deployment. This makes it possible to automatically detect key reliability risks and surface them before they start causing behavior that could lead to an outage.

This kind of automated risk monitoring is different from observability or resiliency testing. With observability, these risks only present themselves once they create unexpected behavior in your systems. Instead of finding out about the risk ahead of time, you’re reacting after it’s already caused an incident or outage. Resiliency testing, on the other hand, artificially injects faults that would trigger the risk. This allows you to uncover the risk proactively before it causes problems, but the tests themselves have to be run.

Kubernetes risk monitoring uses the cluster, node, and pod data to uncover critical reliability risks automatically without testing or waiting for an observability alert. Many of these are caused by configuration issues or require small changes to images that can be relatively quick to address. By setting up a system to monitor for these key risks, you can proactively surface them without the delay of other methods.

The nature of Kubernetes and the complexity of deployments have the potential to create a large number of risks, but there’s a core group of ten that should be included in any risk monitoring practice.

These are the most common critical risks that could cause major outages if left unaddressed. When building out your Kubernetes reliability tooling and standards, start by making sure these ten reliability risks are being detected and covered. From there, you can add other reliability risks to your monitoring list. 

Kubernetes cluster resource risks

Running out of resources directly impacts system stability. If your nodes don’t have enough CPU or RAM available, they may start slowing down, locking up, or terminating resource-intensive pods to make room.

Setting requests is the first step towards preventing this, because they specify the minimum resources needed to run a pod. Limits do the opposite: they cap how much of a resource, such as RAM, a pod can use, preventing a memory leak from taking all of a node’s resources.

1. Missing CPU requests

A common risk is deploying pods without setting a CPU request. While it may seem like a low-impact, low-severity issue, not using CPU requests can have a big impact, including preventing your pod from running. 

Requests serve two key purposes:

  1. They tell Kubernetes the minimum amount of the resource to allocate to a pod. This helps Kubernetes determine which node to schedule the pod on and how to schedule it relative to other pods.
  2. They protect your nodes from resource shortages by preventing over-allocating pods on a single node.

Without this, Kubernetes might schedule a pod onto a node that doesn't have enough capacity for it. Even if the pod uses a small amount of CPU at first, that amount could increase over time, leading to CPU exhaustion.
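As a rough illustration of how this risk could be detected, the sketch below scans a pod manifest that has already been loaded into a Python dictionary (for example, from JSON exported by your tooling) and lists the containers that set no CPU request. The function name and the example pod are hypothetical, and the sketch only inspects the manifest as written.

```python
def containers_missing_request(pod, resource="cpu"):
    """Return names of containers in a pod manifest that set no request for `resource`."""
    missing = []
    for container in pod.get("spec", {}).get("containers", []):
        requests = (container.get("resources") or {}).get("requests") or {}
        if resource not in requests:
            missing.append(container["name"])
    return missing


# Hypothetical pod: "web" sets a CPU request, "sidecar" sets nothing.
example_pod = {
    "metadata": {"name": "checkout"},
    "spec": {
        "containers": [
            {"name": "web", "resources": {"requests": {"cpu": "250m"}}},
            {"name": "sidecar"},
        ]
    },
}

print(containers_missing_request(example_pod))  # ['sidecar']
```

Passing resource="memory" runs the same check for memory requests.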

Find out how to detect missing CPU requests and how to resolve the reliability risk: How to ensure your Kubernetes Pods have enough CPU

2. Missing memory requests

A memory request specifies how much RAM should be reserved for a pod's container. When you deploy a pod that needs a minimum amount of memory, such as 512 MB or 1 GB, you can define that in your pod's manifest. Kubernetes then uses that information to determine where to deploy the pod so it has at least the amount of memory requested.

When deploying a pod without a memory request, Kubernetes has to make a best-guess decision about where to deploy the pod.

If the pod gets deployed to a node with a limited amount of free memory remaining, and the pod gradually consumes more memory over time, it could trigger an out of memory (OOM) event that terminates the pod. If the pod is repeatedly terminated and restarted this way, it can end up in the CrashLoopBackOff status.

Learn more about finding and resolving memory request risks: How to ensure your Kubernetes Pods have enough memory

3. Missing memory limits

A memory limit is a cap on how much RAM a pod is allowed to consume over its lifetime. When you deploy a pod without memory limits, it can consume as much RAM as it wants, just like any other process. If it continually uses more and more RAM without freeing any (known as a memory leak), eventually the host it's running on will run out of RAM.

At that point, a kernel process called the OOM (out of memory) killer jumps in and terminates the process before the entire system becomes unstable.

While the OOM killer should be able to find and stop the appropriate pod, it's not always guaranteed to be successful. If it doesn't free enough memory, the entire system could lock up, or it could kill unrelated processes to try to free up enough memory.

Setting a limit and a request creates a range of memory that the pod could consume, making it easier for both you and Kubernetes to determine how much memory the pod will use on deployment.
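To make the idea of a memory range concrete, here is a minimal sketch that reads the memory request and limit from a container definition and converts both to bytes. The parser handles only a simplified subset of Kubernetes quantity suffixes, and the function names and the example containers are hypothetical.

```python
# Simplified subset of quantity suffixes; exponent forms such as "129e6" are not handled.
_UNITS = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_memory(quantity):
    """Convert a memory quantity such as '512Mi' or '1G' to bytes."""
    # Check two-letter suffixes first so "Mi" is never mistaken for "M".
    for suffix in sorted(_UNITS, key=len, reverse=True):
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * _UNITS[suffix])
    return int(quantity)  # plain number of bytes


def memory_range(container):
    """Return (request_bytes, limit_bytes); a value that is not set is None."""
    resources = container.get("resources") or {}
    request = (resources.get("requests") or {}).get("memory")
    limit = (resources.get("limits") or {}).get("memory")
    return (
        parse_memory(request) if request is not None else None,
        parse_memory(limit) if limit is not None else None,
    )


# Hypothetical containers: one bounded, one with no limit (the risk described above).
bounded = {"name": "api", "resources": {"requests": {"memory": "512Mi"}, "limits": {"memory": "1Gi"}}}
unbounded = {"name": "worker", "resources": {"requests": {"memory": "512Mi"}}}

print(memory_range(bounded))    # (536870912, 1073741824)
print(memory_range(unbounded))  # (536870912, None)
```

A limit of None in the second result is the signal a monitoring tool would flag.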

Find out how to set memory limits and prevent memory leaks: How to detect and prevent memory leaks in Kubernetes applications

Kubernetes cluster redundancy risks

Unfortunately, containers often crash, terminate, or restart with little warning. Even before that point, they can have less visible problems like memory leaks, network latency, and disconnections. Liveness probes allow you to detect these problems, then terminate and restart the affected container.

On the node level, you should set up Kubernetes in multiple availability zones (AZs) for high availability. When these risks are remediated, your system will be able to detect pod failures and fail over to nodes in another AZ if there’s an AZ failure.

These two reliability risks directly affect your deployment’s ability to have the redundancy necessary to be resilient to pod, node, cluster, or AZ failure.

4. Missing liveness probes

A liveness probe is essentially a health check that periodically sends an HTTP request (or sends a command) to a container and waits for a response. If the response doesn't arrive, or the container returns a failure, the probe triggers a restart of the container.

The power of liveness probes is in their ability to detect container failures and automatically restart failed containers. This recovery mechanism is built into Kubernetes itself without the need for a third-party tool. Service owners can define liveness probes as part of their deployment manifests, and their containers will always be deployed with liveness probes.

In theory, the only time a service owner should have to manually check their containers is if the liveness probe fails to restart a container (like the dreaded CrashLoopBackOff state). But in order to restart the container, a liveness probe has to be defined in the container’s manifest.
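The check for this risk can be sketched in a few lines, assuming the pod manifest has been loaded into a Python dictionary. The function name, the example pod, and the /healthz path are hypothetical; livenessProbe, httpGet, periodSeconds, and failureThreshold are the standard manifest fields.

```python
def containers_without_liveness_probe(pod):
    """Return names of containers in a pod manifest that define no liveness probe."""
    return [
        container["name"]
        for container in pod.get("spec", {}).get("containers", [])
        if not container.get("livenessProbe")
    ]


# Hypothetical pod: "web" is probed over HTTP every 10 seconds, "cache" has no probe.
example_pod = {
    "metadata": {"name": "storefront"},
    "spec": {
        "containers": [
            {
                "name": "web",
                "livenessProbe": {
                    "httpGet": {"path": "/healthz", "port": 8080},
                    "periodSeconds": 10,
                    "failureThreshold": 3,
                },
            },
            {"name": "cache"},
        ]
    },
}

print(containers_without_liveness_probe(example_pod))  # ['cache']
```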

Learn how to detect missing liveness probes and make sure they’re defined: How to keep your Kubernetes Pods up and running with liveness probes

5. No Availability Zone redundancy

By default, many Kubernetes cloud providers provision new clusters within a single Availability Zone (AZ). Because these AZs are isolated, one AZ can experience an incident or outage without affecting other AZs, creating redundancy—but only if your application is set up in multiple AZs.

If a cluster is set up in a single AZ and that AZ fails, the entire cluster will also fail along with any applications and services running on it. This is why the AWS Well-Architected Framework recommends having at least two redundant AZs for High Availability.

Kubernetes natively supports deploying across multiple AZs, both in its control plane (the systems responsible for running the cluster) and worker nodes (the systems responsible for running your application pods).

Setting up a cluster for AZ redundancy usually requires additional setup on the user's side and leads to higher cloud hosting costs, but for critical services, the benefits far outweigh the risk of an incident or outage.
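One way to scan for missing AZ redundancy is to count the distinct zones across a cluster's nodes. The sketch below assumes node objects have been loaded into Python dictionaries and that nodes carry the well-known topology.kubernetes.io/zone label, which cloud providers typically set. The function names, node names, and zone names are hypothetical.

```python
from collections import Counter

ZONE_LABEL = "topology.kubernetes.io/zone"


def nodes_per_zone(nodes):
    """Count nodes per zone; nodes without the zone label are counted as 'unknown'."""
    zones = Counter()
    for node in nodes:
        labels = node.get("metadata", {}).get("labels") or {}
        zones[labels.get(ZONE_LABEL, "unknown")] += 1
    return dict(zones)


def has_zone_redundancy(nodes, minimum_zones=2):
    """Return True if nodes are spread over at least `minimum_zones` known zones."""
    known = {zone for zone in nodes_per_zone(nodes) if zone != "unknown"}
    return len(known) >= minimum_zones


# Hypothetical cluster: both nodes sit in the same zone.
single_az = [
    {"metadata": {"name": "node-1", "labels": {ZONE_LABEL: "us-east-1a"}}},
    {"metadata": {"name": "node-2", "labels": {ZONE_LABEL: "us-east-1a"}}},
]
multi_az = single_az + [
    {"metadata": {"name": "node-3", "labels": {ZONE_LABEL: "us-east-1b"}}},
]

print(has_zone_redundancy(single_az))  # False
print(has_zone_redundancy(multi_az))   # True
```

The default of two zones mirrors the recommendation mentioned above; a stricter standard can raise minimum_zones.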

Find out how to set up Availability Zone redundancy and scan for missing redundancy: How to deploy a multi-availability zone Kubernetes cluster for High Availability

Kubernetes container deployment risks

If a container crashes, Kubernetes waits for a short delay and restarts it. If the container keeps crashing, Kubernetes lengthens the delay between restarts and gives the container a CrashLoopBackOff status. Similarly, when Kubernetes fails to pull the container image, it retries with an increasing delay and gives the container a status of ImagePullBackOff.

There are also times when a pod simply can’t be scheduled to run. Commonly, this happens because the cluster doesn’t have the resources, or your pod requires a persistent volume that isn’t available. 

Containers in these states should be able to be restarted when a failure occurs, but are unable to, creating a risk to the resiliency of your deployment.

6. Pods in CrashLoopBackOff

CrashLoopBackOff is the state that a pod enters after repeatedly terminating due to an error. Normally, if a container crashes, Kubernetes waits for a short delay and restarts the pod.

The time between when a container crashes and when it restarts is called the delay. On each restart, Kubernetes doubles the length of the delay, starting at 10 seconds, then 20 seconds, then 40 seconds, continuing in that pattern up to a maximum of 5 minutes. While the container is waiting out this delay, the pod shows the status CrashLoopBackOff. Kubernetes keeps retrying at the 5-minute maximum, so a pod that stays in CrashLoopBackOff will not recover until the underlying problem is fixed.
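The doubling pattern is easy to work through in code. The following sketch reproduces the delay schedule using the 10-second starting value and 5-minute cap described above; the function name is hypothetical.

```python
def restart_delays(restarts, base=10, cap=300):
    """Return the delay in seconds before each of the first `restarts` restarts."""
    return [min(base * 2**attempt, cap) for attempt in range(restarts)]


print(restart_delays(7))  # [10, 20, 40, 80, 160, 300, 300]
```

The sixth delay would be 320 seconds if the pattern continued unchecked, so this is the point where the 5-minute cap takes effect.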

CrashLoopBackOff can have several causes, including:

  • Application errors that cause the process to crash.
  • Problems connecting to third-party services or dependencies.
  • Trying to allocate unavailable resources to the container, like ports that are already in use or more memory than what's available.
  • A failed liveness probe.

There are many more reasons why a CrashLoopBackOff can happen, and this is why it's one of the most common issues that even experienced Kubernetes developers run into.

Get tips for CrashLoopBackOff troubleshooting, detecting it, and verifying your fixes: How to fix and prevent CrashLoopBackOff events in Kubernetes

7. Images in ImagePullBackOff

Before Kubernetes can create a container, it first needs an image to use as the basis for the container. An image is a static, compressed folder containing all of the files and executable code needed to run the software embedded within the image.

Normally, Kubernetes downloads images as needed (i.e. when you deploy a manifest). Kubernetes uses the container specification to determine which image to use, where to retrieve it from, and which version to pull.

If Kubernetes can't pull the image for any reason (such as an invalid image name, poor network connection, or trying to download from a private repository without credentials), it will retry after a set amount of time. Like a CrashLoopBackOff, it will exponentially increase the amount of time it waits before retrying, up to a maximum of 5 minutes. While it waits between attempts, it sets the container's status to ImagePullBackOff.
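A container stuck on an image pull reports the problem as a waiting reason in its status, so detection can be sketched as a scan over pod status data loaded into a Python dictionary. The function name and example pod are hypothetical; ErrImagePull is included on the assumption that you also want to catch the failed attempt that precedes the back-off.

```python
IMAGE_PULL_REASONS = {"ErrImagePull", "ImagePullBackOff"}


def waiting_containers(pod, reasons):
    """Return (container name, reason) pairs for containers waiting for one of `reasons`."""
    found = []
    for status in pod.get("status", {}).get("containerStatuses") or []:
        waiting = (status.get("state") or {}).get("waiting") or {}
        if waiting.get("reason") in reasons:
            found.append((status["name"], waiting["reason"]))
    return found


# Hypothetical pod status: "web" is running, "reports" cannot pull its image.
example_pod = {
    "metadata": {"name": "analytics"},
    "status": {
        "containerStatuses": [
            {"name": "web", "state": {"running": {}}},
            {"name": "reports", "state": {"waiting": {"reason": "ImagePullBackOff"}}},
        ]
    },
}

print(waiting_containers(example_pod, IMAGE_PULL_REASONS))  # [('reports', 'ImagePullBackOff')]
```

Calling the same function with {"CrashLoopBackOff"} as the reasons finds crash-looping containers instead.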

Learn about detecting and troubleshooting ImagePullBackOff, then verifying your fixes: How to fix and prevent ImagePullBackOff events in Kubernetes

8. Unschedulable pod errors

A pod is unschedulable when it's been put into Kubernetes' scheduling queue, but can't be deployed to a node. This can be for a number of reasons, including:

  • The cluster not having enough CPU or RAM available to meet the pod's requirements.
  • Pod affinity or anti-affinity rules preventing it from being deployed to available nodes.
  • Nodes being cordoned due to updates or restarts.
  • The pod requires a persistent volume that's unavailable, or bound to an unavailable node.

Although the reasons vary, an unschedulable pod is almost always a symptom of a larger problem. The pod itself may be fine, but the cluster isn't operating the way it should, which makes resolving the issue even more critical.

Unfortunately, there is no easy, direct way to query for unschedulable pods. Pods waiting to be scheduled are held in the "Pending" status, and a pod that can't be scheduled will remain in this state. However, pods that are being deployed normally are also marked as "Pending." The difference comes down to how long a pod remains in "Pending."
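A time-based check along these lines can be sketched as follows, assuming pod objects have been loaded into Python dictionaries. The 5-minute threshold is an arbitrary assumption to tune for your cluster, and the function names and pods are hypothetical. As an extra signal, the sketch also reports the reason from the pod's PodScheduled condition when the scheduler has recorded one.

```python
from datetime import datetime, timedelta, timezone


def parse_timestamp(value):
    """Parse a timestamp such as '2024-01-01T12:00:00Z' as a UTC datetime."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def scheduling_reason(pod):
    """Return the reason recorded on a failed PodScheduled condition, if any."""
    for condition in pod.get("status", {}).get("conditions") or []:
        if condition.get("type") == "PodScheduled" and condition.get("status") == "False":
            return condition.get("reason")
    return None


def stuck_pending(pods, now, threshold=timedelta(minutes=5)):
    """Return (name, reason) for pods that have been Pending for at least `threshold`."""
    stuck = []
    for pod in pods:
        if pod.get("status", {}).get("phase") != "Pending":
            continue
        created = parse_timestamp(pod["metadata"]["creationTimestamp"])
        if now - created >= threshold:
            stuck.append((pod["metadata"]["name"], scheduling_reason(pod)))
    return stuck


# Hypothetical pods: one stuck for 10 minutes, one just created, one running.
now = datetime(2024, 1, 1, 12, 10, tzinfo=timezone.utc)
pods = [
    {"metadata": {"name": "worker-a", "creationTimestamp": "2024-01-01T12:00:00Z"},
     "status": {"phase": "Pending", "conditions": [
         {"type": "PodScheduled", "status": "False", "reason": "Unschedulable"}]}},
    {"metadata": {"name": "worker-b", "creationTimestamp": "2024-01-01T12:09:00Z"},
     "status": {"phase": "Pending"}},
    {"metadata": {"name": "worker-c", "creationTimestamp": "2024-01-01T11:00:00Z"},
     "status": {"phase": "Running"}},
]

print(stuck_pending(pods, now))  # [('worker-a', 'Unschedulable')]
```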

Find out how to detect and resolve unschedulable pod issues: How to troubleshoot unschedulable Pods in Kubernetes

Kubernetes application risks

Whenever you update your application, there are hidden reliability risks. Updates typically roll out gradually, not all at once. What happens if your team releases another update before the first rollout finishes? What happens if you push a release while Kubernetes is upgrading itself? You might end up with two different versions running side-by-side.

Another common application risk is introduced by using init containers. These are handy for preparing an environment for the main container, but introduce a potential point of failure where the init container can’t run and causes the main container to fail.

Both of these risks occur at the application level, which means infrastructure or cluster-level detection could miss them.

9. Application version non-uniformity

Version uniformity refers to the image version used when declaring pods. When you define a pod or deployment in a Kubernetes manifest, you can specify which version of the container image to use in one of two ways:

  • Tags, which are created by the image's creator to identify a single version of a container. Multiple container versions can have the same tag, meaning a single tag could refer to multiple different container versions over time.
  • Digests, which are the result of running the image through a hashing function (usually SHA256). Each digest identifies one single version of a container. Changing the container in any way also changes the digest.

Tags are easier to read than digests, but they come with a catch: a single tag could refer to multiple image versions. The most infamous example is latest, which by convention points to the most recently released version of a container image. If you deploy a pod using the latest tag today, then deploy another pod tomorrow, you could end up with two completely different versions of the same pod running side-by-side.
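To show the difference in practice, the sketch below classifies an image reference by how firmly it pins a version. The category names and the function are hypothetical, the example registry and image names are made up, and the parsing is deliberately simplified rather than a full implementation of the image reference grammar.

```python
def image_pinning(image):
    """Classify an image reference as 'digest', 'tag', or 'floating'."""
    if "@sha256:" in image:
        return "digest"  # immutable: identifies exactly one image version
    # Only look for a tag after the last "/", so a registry port is not mistaken for one.
    last_segment = image.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return "floating"  # no tag given, so the latest tag is used by default
    tag = last_segment.rsplit(":", 1)[1]
    return "floating" if tag == "latest" else "tag"


for reference in [
    "nginx",
    "nginx:latest",
    "nginx:1.25.3",
    "registry.example.com:5000/shop/api",
    "nginx@sha256:" + "0" * 64,  # placeholder digest, not a real image
]:
    print(reference, "->", image_pinning(reference))
```

A reference classified as "tag" is more predictable than a floating one, but only a digest guarantees that every pod runs the same image.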

Learn more about version non-uniformity and how to resolve it: How to ensure consistent Kubernetes container versions

10. Init container errors

An init container is a container that runs before the main container in a pod. They're often used to prepare the environment so the main container has everything it needs to run.

For example, imagine you want to deploy a large language model (LLM) in a pod. LLMs require datasets that can be several GB. You can create an init container that downloads these datasets to the node so that when the LLM container starts, it immediately has access to the data it needs.

Init containers are incredibly useful for setting up a pod before handing it off to the main container, but they introduce an additional point of failure.

Init containers run during the pod's initialization process and must finish running before the main container starts. To add to this, if you have multiple init containers defined, they'll all run sequentially until they've either completed successfully or failed.

If an init container fails and the pod's restartPolicy is not set to Never, Kubernetes will repeatedly restart the init container until it succeeds, and the pod shows the status Init:CrashLoopBackOff while Kubernetes backs off between attempts. If restartPolicy is set to Never, Kubernetes marks the entire pod as failed.
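Init container failures show up in a separate part of the pod status, so a detection pass can be sketched as a scan of initContainerStatuses in pod data loaded into a Python dictionary. The function name, the pod, and the fetch-dataset container (borrowed from the LLM example above) are hypothetical.

```python
def failing_init_containers(pod):
    """Return (name, problem) pairs for init containers that crash-loop or exit non-zero."""
    failing = []
    for status in pod.get("status", {}).get("initContainerStatuses") or []:
        state = status.get("state") or {}
        waiting = state.get("waiting") or {}
        terminated = state.get("terminated") or {}
        if waiting.get("reason") == "CrashLoopBackOff":
            failing.append((status["name"], "CrashLoopBackOff"))
        elif terminated and terminated.get("exitCode", 0) != 0:
            failing.append((status["name"], f"exit code {terminated['exitCode']}"))
    return failing


# Hypothetical pod: the first init container finished, the second keeps crashing.
example_pod = {
    "metadata": {"name": "llm-server"},
    "status": {
        "initContainerStatuses": [
            {"name": "prepare-volume", "state": {"terminated": {"exitCode": 0}}},
            {"name": "fetch-dataset", "state": {"waiting": {"reason": "CrashLoopBackOff"}}},
        ]
    },
}

print(failing_init_containers(example_pod))  # [('fetch-dataset', 'CrashLoopBackOff')]
```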

Find out more about init container errors, how to detect them, and how to troubleshoot them: How to fix Kubernetes init container errors

Next steps in your Kubernetes high availability journey:

Download the comprehensive eBook

Learn how to build your own resiliency management practice for Kubernetes in the 55-page guide Kubernetes Reliability at Scale: How to Improve Uptime with Resiliency Management
