How to ensure consistent Kubernetes container versions

One of Kubernetes' killer features is its ability to seamlessly update applications no matter how large your deployment is. Did a developer make a code change, and now you need to update a thousand running containers? Just run kubectl apply -f manifest.yaml and watch as Kubernetes replaces each outdated pod with the new version.

Unfortunately, like with many Kubernetes features, there are hidden risks here that could impact the reliability of your applications. Updates typically roll out gradually, not all at once. What happens if your team releases another update before the first rollout finishes? What happens if you push a release while Kubernetes is upgrading itself? Depending on how you identify container image versions, you might end up with two different versions running side-by-side: one with the latest fix, and one without it.

In this blog, we'll explore the container version uniformity problem, what the risks are, how you can avoid them, and how Gremlin helps ensure consistent versioning across your environment.

What is version uniformity and why is it important?

Version uniformity means that every replica of a pod runs the exact same version of its container image. When you define a pod or deployment in a Kubernetes manifest, you can specify which version of the container image to use in one of two ways:

  • Tags, which are human-readable labels assigned by the image's publisher to identify a version of a container. Tags are mutable: the same tag can be reassigned to new images, so a single tag may refer to several different container versions over time.
  • Digests, which are the result of running the image through a hashing function (usually SHA256). Each digest identifies one single version of a container; changing the container in any way also changes the digest.

Tags are easier to read than digests, but they come with a catch: a single tag can refer to multiple image versions over time. The most infamous example is latest, which always points to the most recently released version of a container image. If you deploy a pod using the latest tag today, then deploy another pod tomorrow, you could end up with two completely different versions of the same container image running as the same workload.
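
To make the difference concrete, here are the two reference styles side by side, using the userservice image and a digest that appear later in this post:

# Option 1, tag: readable, but mutable; re-pushing "v0.5.10" re-points it
image: gcr.io/bank-of-anthos-ci/userservice:v0.5.10

# Option 2, digest: immutable; this can only ever resolve to one exact image
image: gcr.io/bank-of-anthos-ci/userservice@sha256:d33e608c24821613713e8b85ce5fbec118a18076140c1b3ee39359d606ce20ef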

As an example, imagine we have a Kubernetes application called the Bank of Anthos. One of the deployments in our application is the "userservice," which handles actions like authenticating users and storing personal data. We want this service to have plenty of redundancy and headroom, so we deploy 20 replicas of it across our cluster:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: userservice
spec:
  replicas: 20
  selector:
    matchLabels:
      app: userservice
  template:
    metadata:
      labels:
        app: userservice
    spec:
      containers:
      - name: userservice
        image: gcr.io/bank-of-anthos-ci/userservice:v0.5.10

Now, imagine we need to make a quick hotfix to the userservice. It's not a major change, so we push the updated image directly to our image repository without changing the tag. The updated image has a new digest, and the tag now points to it instead of the original version. If Kubernetes then schedules a new pod onto a different node (e.g. when adding a replica or scaling up), that node pulls the updated image, while the already-running pods continue using the old version. We'll have two different versions of the same container running side-by-side.
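
Here's a sketch of how that hotfix might get pushed, assuming a Docker-based build; the image name matches the manifest above:

# Rebuild the image with the hotfix, reusing the existing tag
docker build -t gcr.io/bank-of-anthos-ci/userservice:v0.5.10 .

# The push overwrites what "v0.5.10" points to in the registry:
# the image gets a new digest, but the tag name stays the same
docker push gcr.io/bank-of-anthos-ci/userservice:v0.5.10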

You can imagine what kind of problems this could cause if we changed the way user data was stored in the database, or the way passwords were hashed in response to a critical security bug. Users might see strange errors or might not be able to log in at all. Worse yet, only a percentage of users might be impacted, making it even harder to troubleshoot the problem.

How do I prevent version mismatches?

Start by checking your manifest (YAML) files. This is where you define the parameters for your pods, including the container image location and version. When specifying an image, always use a digest. This pins the deployment to a specific version, whereas a mutable tag like latest could pull different versions depending on when you deploy the pod.

For example, when we're deploying the userservice, we should specify it as gcr.io/bank-of-anthos-ci/userservice@sha256:d33e608c24821613713e8b85ce5fbec118a18076140c1b3ee39359d606ce20ef (note the @ separating the repository name from the digest). The default (risky) choice is to use userservice:latest. However, latest is a floating tag that references whatever the current version of the image happens to be. If we add another container instance to this deployment, it could pull a different version of the image and run alongside older, potentially incompatible versions.
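
If you don't have the digest handy, you can resolve a tag to its digest after pulling the image. Here's one way to do it with the Docker CLI (tools like crane or skopeo can also resolve digests directly from the registry):

# Pull the tagged image, then read back the digest the registry assigned it
docker pull gcr.io/bank-of-anthos-ci/userservice:v0.5.10
docker inspect --format='{{index .RepoDigests 0}}' gcr.io/bank-of-anthos-ci/userservice:v0.5.10
# Prints the full name@digest reference to use in your manifest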

Another place where version mismatches often occur is during updates. When rolling out an update, Kubernetes doesn't replace every container at once, as this could cause service downtime, failed requests, and a poor experience for users. Instead, it gradually replaces individual pods while keeping a minimum percentage of the deployment's pods running (by default, at least 75%). Using digests prevents mismatches here too, but an alternative approach is to use a different deployment strategy. For example, in a blue/green deployment, the new version (green) is deployed alongside the old version (blue) while traffic continues going to the old version. Once the new version is ready, traffic is switched over all at once, and the old version is taken down. We cover a few of these methods in our blog on testing in production.
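
For reference, that 75% floor comes from the Deployment's rolling update parameters, which you can also set explicitly. This minimal sketch spells out the Kubernetes defaults:

spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%  # at most 25% of desired pods down at once, i.e. at least 75% running
      maxSurge: 25%        # up to 25% extra pods may be created during the rollout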

How do I validate that my fix works?

The most direct way to check for mismatched container versions is by using kubectl to query every container image. For example, the official Kubernetes documentation provides this command for listing each container image across all namespaces, along with the number of Pods actively using that image:

kubectl get pods --all-namespaces -o jsonpath="{.items[*].status.containerStatuses[*].imageID}" |\
tr -s '[[:space:]]' '\n' |\
sort |\
uniq -c

This prints a report like the one below:

   1 gcr.io/bank-of-anthos-ci/accounts-db@sha256:04da06045c2ce2d9fd151fda682907eecb8eb9faeb84d0a60ea2a221e0b85441
   2 gcr.io/bank-of-anthos-ci/balancereader@sha256:164ef93c47334e0c5ce114326397abbe730e8114398072f48fb63ffe447237ad
   2 gcr.io/bank-of-anthos-ci/contacts@sha256:5f28ba99be16ac8173ac73d22f72b94e34c3b33b8d0497b8b05364fcbd1a161b
   2 gcr.io/bank-of-anthos-ci/frontend@sha256:2317dfa4351d6cb63b9b52161c39feaf84e4f3e9460ac601175ffc5e1774d354
   1 gcr.io/bank-of-anthos-ci/ledger-db@sha256:73e6f191dccc5344ee795470db676dd107f62a40d5425f47d116609dadf5efa4
   2 gcr.io/bank-of-anthos-ci/ledgerwriter@sha256:bc8263483ea15427fe4ee06a67dea42811177c62fb68cefcab843d14dd54dc25
   2 gcr.io/bank-of-anthos-ci/loadgenerator@sha256:6aaed05ef6342c8476fed2b32224fdace0ff6403688112cb816867b110dae0ac
   2 gcr.io/bank-of-anthos-ci/transactionhistory@sha256:578eee3c7a84a6dceae1c0a8823fd0ab091fa32a216e47f4c7f8691adc2ba1ce
   1 gcr.io/bank-of-anthos-ci/userservice@sha256:1d0e45ca69fed59a1fa4c5c3ea356b0e47779149b47e45f8d3ec422a61560909
   1 gcr.io/bank-of-anthos-ci/userservice@sha256:d33e608c24821613713e8b85ce5fbec118a18076140c1b3ee39359d606ce20ef

You'll notice there are two versions of userservice, each with a different digest. To fix this, we'd want to pin the image by adding the correct digest to our manifest, then re-deploying it.
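
One quick way to converge every pod on the pinned version is kubectl set image, sketched below with the digest we want to keep. Note that this patches the live Deployment; you'd also update the manifest itself so the next kubectl apply doesn't revert the change:

# Point the deployment at the pinned digest and trigger a rollout
kubectl set image deployment/userservice \
  userservice=gcr.io/bank-of-anthos-ci/userservice@sha256:d33e608c24821613713e8b85ce5fbec118a18076140c1b3ee39359d606ce20ef

# Watch the rollout replace the mismatched pods
kubectl rollout status deployment/userservice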

What other Kubernetes risks should I be looking for?

We've covered the more common Kubernetes reliability risks, such as resource requests and limits, liveness probes, and high availability, in our ongoing Detected Risks blog series. We're also launching a brand new set of Detected Risks this year, followed by more blog posts like this one. When you're ready to start uncovering risks, sign up for a free 30-day trial and get a complete report of your reliability risks in minutes.