Reliability risks are potential points of failure in your system where an outage could occur. If you can find and remediate them, you can prevent incidents before they happen.
In complex Kubernetes systems, these reliability risks can take a wide variety of forms, including node failures, pod or container crashes, missing autoscaling rules, misconfigured load balancing or application gateway rules, pod crash loops, and more.
In this short reference, we'll take a look at critical Kubernetes reliability risks common in most systems, as well as how you can operationalize detection of these reliability risks at scale.
What are critical reliability risks?
Kubernetes deployments are complex with a lot of moving pieces, so it can be hard to determine where the risks are, which risks are critical, and which are minor.
You can start by looking at the most important reliability features of any Kubernetes stack:
- Scalability: Can the system quickly respond to changes in demand? Can it scale up and scale back down? Do you support scaling of your pods and clusters?
- Redundancy: Can your applications keep running if part of your cluster fails? Do you have replication across multiple availability zones (or even multiple regions) so your services aren't impacted if you lose a data center or an AZ or a rack?
- Recoverability: Can Kubernetes recover if something fails? Can your pods and nodes auto-restart after an outage? How long does it take for Kubernetes to detect and restart failed pods?
- Consistency/Integrity: Are your pod replicas all using the same container image? Do you have any obsolete code running anywhere? Do you have any failed rollouts or deployments that may have left you with different versions of the same container image running at the same time?
To find critical reliability risks, you have to be able to find answers to these questions.
The most common Kubernetes reliability risks
Working closely with customers, Gremlin found critical reliability risks in nearly every organization.
By cross-referencing the data with the definition of a critical reliability risk above, we arrived at a list of the most common Kubernetes critical reliability risks. To prevent these problems, you'll need a way to automatically detect each of these risks and determine whether you're still vulnerable to them. We'll explain more about how to do this in the next section.
Each of the risks below includes links to blog posts to help you learn more about finding and fixing them.
Resource risks
Running out of resources directly impacts system stability. If your nodes don't have enough CPU or RAM available, they may start slowing down, locking up, or terminating resource-intensive pods to make room.
Setting requests is the first step towards preventing this, because requests tell the scheduler how much CPU and memory to reserve for a pod, so the pod is only placed on a node with enough free capacity. Limits do roughly the opposite: they cap how much of a resource a container can use, so that, for example, a memory leak can't consume all of a node's RAM.
- Missing CPU requests: Is enough CPU reserved for your pod?
- Missing memory requests: Does your pod have the minimum amount of RAM it needs?
- Memory limits: Have you set the maximum amount of RAM your pod can use?
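As a rough illustration of how these three checks could be automated, the sketch below scans a pod spec that has already been parsed into a Python dictionary. The field names (`containers`, `resources`, `requests`, `limits`) follow the Kubernetes pod manifest; the function name `find_resource_risks` and the sample data are hypothetical and only meant to show the idea.

```python
# Hypothetical helper: flag containers that are missing CPU requests,
# memory requests, or memory limits. The input mirrors the "spec"
# section of a pod manifest parsed into a dict.

CHECKS = [
    ("requests", "cpu", "missing CPU request"),
    ("requests", "memory", "missing memory request"),
    ("limits", "memory", "missing memory limit"),
]


def find_resource_risks(pod_spec):
    """Return a list of (container_name, risk) tuples, in manifest order."""
    risks = []
    for container in pod_spec.get("containers", []):
        resources = container.get("resources") or {}
        for section, resource, label in CHECKS:
            # A missing section or an empty value both count as "not set".
            if not (resources.get(section) or {}).get(resource):
                risks.append((container.get("name", "<unnamed>"), label))
    return risks


# Sample data (made up for illustration).
example_spec = {
    "containers": [
        {
            "name": "web",
            "resources": {
                "requests": {"cpu": "250m", "memory": "128Mi"},
                "limits": {"memory": "256Mi"},
            },
        },
        {"name": "sidecar", "resources": {"requests": {"cpu": "50m"}}},
    ]
}

for name, risk in find_resource_risks(example_spec):
    print(f"{name}: {risk}")
```

In this sample, only the `sidecar` container is reported, because it sets a CPU request but no memory request or memory limit.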
Redundancy risks
Unfortunately, containers often crash, terminate, or restart with little warning. Even before that point, they can have less visible problems like memory leaks, network latency, and disconnections. Liveness probes allow Kubernetes to detect when a container has stopped responding, then terminate and restart it.
On the node level, you should set up Kubernetes in multiple availability zones (AZs) for high availability. When these risks are remediated, your system will be able to detect pod failures and fail over to nodes in another AZ if there's an AZ failure.
- Missing liveness probes: Do you have liveness probes deployed to restart your pods in case they fail?
- No AZ redundancy: Do you have replicas of your pods running in at least two different AZs?
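Both questions can be answered from data Kubernetes already exposes. The sketch below assumes you have a pod spec, a list of running replicas with the node each one is scheduled on, and each node's labels, all parsed into plain Python structures. `livenessProbe` and the `topology.kubernetes.io/zone` node label are standard Kubernetes names; the function names and sample data are hypothetical.

```python
from collections import Counter

# Well-known node label that records the node's availability zone.
ZONE_LABEL = "topology.kubernetes.io/zone"


def containers_without_liveness_probe(pod_spec):
    """Return the names of containers that define no livenessProbe."""
    return [
        c.get("name", "<unnamed>")
        for c in pod_spec.get("containers", [])
        if not c.get("livenessProbe")
    ]


def replicas_per_zone(pods, node_labels):
    """Count replicas per zone; unscheduled pods are counted as 'unknown'."""
    zones = Counter()
    for pod in pods:
        labels = node_labels.get(pod.get("nodeName"), {})
        zones[labels.get(ZONE_LABEL, "unknown")] += 1
    return zones


def has_az_redundancy(pods, node_labels, minimum_zones=2):
    """True if replicas run in at least `minimum_zones` known zones."""
    zones = replicas_per_zone(pods, node_labels)
    return len([z for z in zones if z != "unknown"]) >= minimum_zones


# Sample data (made up for illustration): two replicas, both in one AZ.
node_labels = {
    "node-a": {ZONE_LABEL: "us-east-1a"},
    "node-b": {ZONE_LABEL: "us-east-1a"},
}
pods = [
    {"name": "web-1", "nodeName": "node-a"},
    {"name": "web-2", "nodeName": "node-b"},
]
spec = {
    "containers": [
        {"name": "web", "livenessProbe": {"httpGet": {"path": "/healthz", "port": 8080}}},
        {"name": "worker"},
    ]
}

print(containers_without_liveness_probe(spec))
print(has_az_redundancy(pods, node_labels))
```

Here the `worker` container is flagged for having no liveness probe, and the redundancy check fails because both replicas sit on nodes in the same zone, even though they run on different nodes.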
Container deployment risks
If a container crashes, Kubernetes restarts it after a short delay. If the container keeps crashing, Kubernetes lengthens the delay between restart attempts (up to a cap of five minutes) and gives the container a CrashLoopBackOff status while it waits. Similarly, when Kubernetes fails to pull the container image, it retries with an increasing delay and gives the container a status of ImagePullBackOff. In both cases Kubernetes keeps retrying, but the pod can't do useful work until the underlying problem is fixed.
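The delay between restart attempts grows with each failure. A minimal sketch of that schedule follows, assuming the behavior described in the Kubernetes documentation: a 10-second initial delay that doubles after each attempt and is capped at 300 seconds. The function name and its parameters are hypothetical, and actual timing can differ by Kubernetes version and configuration.

```python
def restart_backoff_delays(attempts, base_seconds=10, cap_seconds=300):
    """Return the assumed delay (in seconds) before each restart attempt.

    Assumption: exponential backoff that doubles from `base_seconds`
    and never exceeds `cap_seconds`.
    """
    return [min(base_seconds * 2 ** n, cap_seconds) for n in range(attempts)]


print(restart_backoff_delays(7))  # [10, 20, 40, 80, 160, 300, 300]
```

Under these assumptions, a container that keeps failing is retried about once every five minutes after the first few attempts, which is why a crash loop can go unnoticed unless you check for it.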
There are also times when a pod simply can't be scheduled to run. Commonly, this happens because the cluster doesn't have the resources, or your pod requires a persistent volume that isn't available.
- CrashLoopBackOff: Are your pods in a crash loop?
- ImagePullBackOff: Can your nodes access and retrieve the image they need to create a container?
- Unschedulable pod errors: Do you have enough cluster capacity to run all of your pods?
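All three conditions show up in a pod's status, so detection can be a matter of reading fields. The sketch below assumes a pod object parsed into a dictionary (for example, from JSON output of the Kubernetes API). The field names (`containerStatuses`, `state.waiting.reason`, and the `PodScheduled` condition with reason `Unschedulable`) follow the Kubernetes pod status schema; `find_deployment_risks` and the sample pods are hypothetical.

```python
# Waiting reasons that indicate Kubernetes is backing off between retries.
BACKOFF_REASONS = {"CrashLoopBackOff", "ImagePullBackOff"}


def find_deployment_risks(pod):
    """Return a list of (pod_name, container_name_or_None, reason) tuples."""
    name = pod.get("metadata", {}).get("name", "<unnamed>")
    status = pod.get("status", {})
    risks = []

    # Containers stuck waiting in a backoff state.
    for cs in status.get("containerStatuses", []):
        waiting = (cs.get("state") or {}).get("waiting") or {}
        if waiting.get("reason") in BACKOFF_REASONS:
            risks.append((name, cs.get("name"), waiting["reason"]))

    # Pods the scheduler could not place on any node.
    for cond in status.get("conditions", []):
        if (
            cond.get("type") == "PodScheduled"
            and cond.get("status") == "False"
            and cond.get("reason") == "Unschedulable"
        ):
            risks.append((name, None, "Unschedulable"))
    return risks


# Sample data (made up for illustration).
crashing_pod = {
    "metadata": {"name": "api-7d9f"},
    "status": {
        "containerStatuses": [
            {"name": "api", "state": {"waiting": {"reason": "CrashLoopBackOff"}}}
        ]
    },
}
pending_pod = {
    "metadata": {"name": "batch-x1"},
    "status": {
        "conditions": [
            {"type": "PodScheduled", "status": "False", "reason": "Unschedulable"}
        ]
    },
}

for pod in (crashing_pod, pending_pod):
    print(find_deployment_risks(pod))
```

Running a check like this across every pod in a cluster on a schedule gives you a simple report of which workloads are currently stuck.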
Application risks
Whenever you update your application, there are hidden reliability risks. Updates typically roll out gradually, not all at once. What happens if your team releases another update before the first rollout finishes? What happens if you push a release while Kubernetes is upgrading itself? You might end up with two different versions running side-by-side.
Another common application risk is introduced by using init containers. These are handy for preparing an environment for the main container, but they introduce a potential point of failure: if an init container can't run or exits with an error, the main container never starts.
- Application version non-uniformity: Is every replica of a pod running the same version?
- Init container errors: Do your init containers complete successfully before your primary container starts?
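Both checks can be sketched against pod objects parsed into dictionaries. The first compares the image string of each container across all replicas, and the second looks for init containers that exited with a non-zero code or are in a crash loop. The field names (`spec.containers[].image`, `status.initContainerStatuses`) follow the Kubernetes pod schema; the function names and sample data are hypothetical. Comparing image strings is a simplification, since the same tag can point to different image builds.

```python
def non_uniform_containers(pods):
    """Map container name -> sorted images, for containers whose image
    differs between replicas."""
    images = {}
    for pod in pods:
        for c in pod.get("spec", {}).get("containers", []):
            images.setdefault(c["name"], set()).add(c["image"])
    return {name: sorted(found) for name, found in images.items() if len(found) > 1}


def failed_init_containers(pod):
    """Return names of init containers that failed or are crash looping."""
    failed = []
    for cs in pod.get("status", {}).get("initContainerStatuses", []):
        state = cs.get("state") or {}
        terminated = state.get("terminated") or {}
        waiting = state.get("waiting") or {}
        if terminated.get("exitCode", 0) != 0 or waiting.get("reason") == "CrashLoopBackOff":
            failed.append(cs.get("name"))
    return failed


# Sample data (made up for illustration): a rollout left one old replica.
replicas = [
    {"spec": {"containers": [{"name": "web", "image": "shop/web:1.4.0"}]}},
    {"spec": {"containers": [{"name": "web", "image": "shop/web:1.4.0"}]}},
    {"spec": {"containers": [{"name": "web", "image": "shop/web:1.3.2"}]}},
]
pod_with_bad_init = {
    "status": {
        "initContainerStatuses": [
            {"name": "migrate", "state": {"terminated": {"exitCode": 1}}},
            {"name": "fetch-config", "state": {"terminated": {"exitCode": 0}}},
        ]
    }
}

print(non_uniform_containers(replicas))
print(failed_init_containers(pod_with_bad_init))
```

In the sample, the `web` container is reported as running two versions at once, and `migrate` is reported as a failed init container.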
