A common risk is deploying Pods without setting a CPU request. While it may seem like a low-impact, low-severity issue, a missing CPU request can have a big impact, up to and including preventing your Pod from running. In this blog, we explain why missing CPU requests are a risk, how you can detect them using Gremlin, and how you can address them.
In Kubernetes, you can control how resources are allocated to individual Deployments, Pods, and even containers. When you specify a limit, Kubernetes won't allocate more than that amount to the Pod. Conversely, when you specify a request, you're telling Kubernetes the amount the Pod requires to run.
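To make the distinction concrete, here's a sketch of a container spec that sets both (the values are illustrative, not recommendations):

```yaml
resources:
  requests:
    cpu: '250m'   # the scheduler reserves at least this much for the container
  limits:
    cpu: '500m'   # the container is throttled if it tries to use more
```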
Kubernetes measures CPU request values as CPU units. For example, 1 CPU unit is the same as 1 physical or virtual CPU core. This value can be fractional: 0.5 is half of one core, 0.1 is one tenth of a core, etc.
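In manifests, the fractional and millicore (`m`) notations are interchangeable: `1000m` equals one CPU unit. For instance, both containers below request half a core:

```yaml
containers:
  - name: worker-a
    resources:
      requests:
        cpu: '0.5'    # fractional core notation
  - name: worker-b
    resources:
      requests:
        cpu: '500m'   # millicore notation: 1000m = 1 CPU unit
```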
Requests serve two key purposes:
- They tell Kubernetes the minimum amount of the resource to allocate to a Pod. This helps Kubernetes determine which node to schedule the Pod on and how to schedule it relative to other Pods.
- They protect your nodes from resource shortages by preventing Kubernetes from over-allocating Pods onto a single node.
Without this, Kubernetes might schedule a Pod onto a node that doesn't have enough capacity for it. Even if the Pod uses a small amount of CPU at first, that amount could increase over time, leading to CPU exhaustion.
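One way to spot Pods in this situation is to list each Pod's CPU request and filter for the ones that have none. The helper below is our own sketch (the function name and the simulated input are ours, not a kubectl feature); against a real cluster you would pipe `kubectl get pods -o custom-columns=...` into it.

```shell
# List Pods whose first container has no CPU request.
# Input columns: NAME CPU_REQUEST (kubectl prints "<none>" when the field is unset).
find_missing_cpu_requests() {
  awk 'NR > 1 && $2 == "<none>" { print $1 }'
}

# Against a real cluster (assumption: single-container Pods):
#   kubectl get pods \
#     -o custom-columns='NAME:.metadata.name,CPU:.spec.containers[0].resources.requests.cpu' \
#     | find_missing_cpu_requests

# Simulated kubectl output so the filter can be demonstrated offline:
printf 'NAME CPU\nnginx 250m\nlegacy-worker <none>\n' | find_missing_cpu_requests
```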
To mitigate this risk, specify an appropriate resource request for each of your containers using `spec.containers[].resources.requests.cpu`. If you're not sure what value to set, you can get a baseline estimate using this process:
- Run your Pod normally.
- Collect metrics using the Kubernetes Metrics API, an observability tool, or a cloud platform. An easy way to do this is by running `kubectl top pod`. Ideally, you should gather these metrics from a production system for the most accurate results.
- Find the CPU usage for your Pod, then use that value as the CPU request amount. You might want to increase this amount to leave some overhead, especially if the Pod isn't under any load.
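The steps above can be sketched as a tiny helper that adds headroom to an observed usage figure (the 25% headroom and the function name are our own assumptions; pick a margin that suits your workload):

```shell
# Turn an observed CPU usage (in millicores, e.g. from `kubectl top pod`)
# into a suggested request with 25% headroom, rounded up to a whole millicore.
suggest_cpu_request() {
  awk -v m="$1" 'BEGIN { printf "%dm\n", int(m * 1.25 + 0.999) }'
}

suggest_cpu_request 200   # 200m observed -> suggests 250m
```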
For example, imagine we have a Pod running Nginx that we want to set CPU requests for. After some testing, we determined that the container uses `200m` of CPU time. To be safe, we'll request `250m` by adding it to the `resources` section of our Kubernetes manifest:

```yaml
# nginx-manifest.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
    - name: nginx
      image: nginx:1.25.2
      resources:
        requests:
          cpu: '250m'
      ports:
        - containerPort: 80
```
Then, apply the change and wait for Kubernetes to re-deploy your Pod:
```shell
kubectl apply -f nginx-manifest.yaml
```
Once your Pod finishes restarting, you can use the Kubernetes Dashboard (or `kubectl describe node <node name>`) to list each Pod running on the specified node, along with their resource requests and limits. If your CPU request applied successfully, then the Nginx Pod should have a value listed in the "CPU Requests" column:
```
Non-terminated Pods:          (23 in total)
  Namespace    Name                               CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------    ----                               ------------  ----------  ---------------  -------------  ---
  default      nginx-767687bc57-b4g6w             250m (3%)     0 (0%)      1Gi (7%)         0 (0%)         143d
  kubevirt     virt-api-66859f4c8d-4c2pn          5m (0%)       0 (0%)      500Mi (3%)       0 (0%)         10d
  kube-system  coredns-59b4f5bbd5-ns25b           100m (1%)     0 (0%)      70Mi (0%)        170Mi (1%)     124d
  kubevirt     virt-controller-8545966675-2fjd9   10m (0%)      0 (0%)      275Mi (1%)       0 (0%)         10d
  kubevirt     virt-operator-6c649b9567-9l7g4     10m (0%)      0 (0%)      450Mi (3%)       0 (0%)         10d
  kubevirt     virt-handler-g9phl                 10m (0%)      0 (0%)      325Mi (2%)       0 (0%)         10d
```
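If you'd rather not scan the whole node listing, you can pull a single Pod's row out of that output. The helper below is a sketch of ours (not a kubectl subcommand); it assumes the `Namespace Name CPU-Requests ...` column layout shown above:

```shell
# Print the CPU Requests column for the first Pod whose name matches.
cpu_request_for() {
  awk -v pod="$1" '$2 ~ pod { print $3; exit }'
}

# Against a real cluster: kubectl describe node <node name> | cpu_request_for nginx
# Simulated row from the listing above:
printf '  default  nginx-767687bc57-b4g6w  250m (3%%)  0 (0%%)\n' | cpu_request_for nginx
```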
You can also use Gremlin to verify your mitigation. Gremlin's Detected Risks feature immediately detects any high-priority reliability concerns in your environment. These can include misconfigurations, bad default values, or reliability anti-patterns. If you've addressed this risk, then the CPU requests risk will show as "Mitigated" instead of "At Risk".
A more thorough way to validate this is by seeing how Kubernetes responds when the Pod grows beyond its request. For example, what happens when our Pod uses exactly 250m of CPU time? What about 300m? This requires an active approach to testing using a method called fault injection.
With fault injection, you can consume specific amounts of CPU time within a Pod or container to ensure your Pod doesn't get evicted or moved to a different node. In Gremlin, an ad-hoc fault injection is called an experiment.
To test this scenario:
- Log into the Gremlin web app at app.gremlin.com.
- Select Experiments in the left-hand menu and select New Experiment.
- Select Kubernetes, then select our Nginx Pod.
- Expand Choose a Gremlin, select the Resource category, then select the CPU experiment.
- Change CPU Capacity to the percentage of CPU we want to consume. We want to use 250m of CPU time, which equates to 1/4 of a single core. In other words, we want to use 25%. In Gremlin, we'll set CPU Capacity to 25 and keep the number of cores set to 1.
- Click Run Experiment to start the experiment.
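The millicore-to-percentage arithmetic above (250m of a single core = 25%) generalizes: one core is `1000m`, i.e. 100% of a single core. A quick sketch (the helper name is ours):

```shell
# Convert millicores to a percentage of one CPU core (1000m = 100%).
millicores_to_percent() {
  awk -v m="$1" 'BEGIN { printf "%g\n", m / 1000 * 100 }'
}

millicores_to_percent 250   # -> 25
millicores_to_percent 300   # -> 30
```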
Now, we keep an eye on our Nginx Pod. We'll see usage increase above 250m, but the Pod itself should keep running just fine. If it does get evicted or rescheduled, that tells us one of several things:
- We're requesting an unnecessarily high number of CPU units.
- We don't have enough capacity to run our workloads, and we need to scale our cluster vertically.
- We're not leaving enough overhead for this Pod to let it grow, and so we should increase our minimum requested CPU.
While the experiment runs, you can watch the Pod's CPU usage with `kubectl top pod`:

```shell
kubectl top pod
```

```
NAME                        CPU(cores)   MEMORY(bytes)
...
frontend-6f7b5f7f88-cn5xr   293m         66Mi
...
```
You can use these same methods to test for memory requests. In fact, Gremlin's Detected Risks automatically finds Kubernetes resources that don't have memory requests defined, just like how it finds resources without CPU requests. We'll go into detail on this in a future blog, but in the meantime, you can read our Detected Risks announcement blog to learn more.
Ready to find out which of your Kubernetes resources are missing CPU request definitions? Sign up for a free 30-day trial, install the Gremlin agent, and get a report of your reliability risks in minutes.