Many cloud infrastructure providers make deploying services as easy as a few clicks. However, making those services highly available (HA) is a different story. What happens to your service if your cloud provider has an Availability Zone (AZ) outage? Will your application still work, and more importantly, can you prove it will still work?

In this blog, we'll discuss AZ redundancy with a focus on Kubernetes clusters. We'll explain why it's important, how you can implement it, and how Gremlin automatically checks your systems to ensure they're AZ redundant.

What is a High Availability (HA) Kubernetes cluster and why is it important?

By default, many Kubernetes cloud providers provision new clusters within a single Availability Zone (AZ). An AZ is a location in a cloud provider's network that's generally isolated from others. Because these locations are isolated, one AZ can experience an incident or outage without affecting other AZs, creating redundancy. However, if a cluster is set up in a single AZ and that AZ fails, the entire cluster will also fail along with any applications and services running on it.

Kubernetes natively supports deploying across multiple AZs, both in its control plane (the systems responsible for running the cluster) and its worker nodes (the systems responsible for running your application pods). Setting up a cluster for AZ redundancy usually requires additional setup on the user's side and leads to higher cloud hosting costs, but for critical services, the benefits far outweigh the extra cost. This is so important that we built an AZ redundancy check into our Detected Risks feature, which automatically scans environments for reliability risks. What we found is that nearly half of the services deployed to Kubernetes don't have AZ redundancy.

Looking for more Kubernetes risks lurking in your system? Grab a copy of our comprehensive ebook, “Kubernetes Reliability at Scale.”

How do I make my Kubernetes cluster Highly Available?

To make a Kubernetes cluster Highly Available, we'll need to distribute its control plane and worker nodes across at least two AZs. The control plane is the brain of the cluster, so focus on distributing those nodes first. Some managed Kubernetes providers like AWS will provide you with an HA control plane automatically, while tools like kubeadm let you create HA unmanaged clusters.
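For unmanaged clusters, the kubeadm approach looks roughly like the sketch below. The load balancer endpoint is a placeholder you'd replace with your own; the key idea is that a stable address in front of the API server lets control plane nodes in different AZs share one endpoint:

```shell
# Sketch only: initialize the first control plane node behind a load-balanced
# API endpoint (k8s-api.example.com is a placeholder for your own address)
kubeadm init --control-plane-endpoint "k8s-api.example.com:6443" --upload-certs

# On additional control plane nodes (ideally in other AZs), join with the
# --control-plane flag using the token and certificate key printed above:
# kubeadm join k8s-api.example.com:6443 --token <token> \
#   --discovery-token-ca-cert-hash sha256:<hash> \
#   --control-plane --certificate-key <key>
```

The `<token>`, `<hash>`, and `<key>` values come from the output of the `kubeadm init` step and are unique to your cluster.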

As an example, let's focus on Amazon Elastic Kubernetes Service (EKS). Amazon manages the control plane for you, so running an HA cluster in EKS means deploying the worker nodes across multiple AZs. EKS leverages Auto Scaling Groups (ASGs) to add and remove nodes in response to demand, and ASGs can span multiple AZs.

For example, if you wanted to create an HA cluster in the <span class="code-class-custom">us-west-2</span> region using all three AZs in that region, you could run this <span class="code-class-custom">eksctl</span> command:


eksctl create cluster --region us-west-2 --zones us-west-2a,us-west-2b,us-west-2c
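Once the cluster is up, one way to sanity-check the spread is to list each node alongside its zone label. This assumes <span class="code-class-custom">kubectl</span> is pointed at the new cluster; <span class="code-class-custom">topology.kubernetes.io/zone</span> is the well-known label cloud providers set on nodes:

```shell
# Show each node alongside the AZ it was provisioned in
kubectl get nodes --label-columns topology.kubernetes.io/zone
```

If every node reports the same zone, your node group isn't spanning AZs yet.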

Alternatively, you could manually provision worker nodes in different AZs and connect them to your cluster. This isn't recommended, as it doesn't allow for autoscaling, but it's an option for static, non-scaling clusters. You could also create a separate ASG in each AZ and let each AZ act as a sort of independent cluster, but this adds management overhead and means duplicating your deployments.

Having an AZ-redundant HA cluster is just the first step. You'll also want to make sure your applications and services are redundant. When deploying an application or service to your cluster, run at least two replicas spread across at least two different zones. One way to do this is with topology spread constraints, which control how Kubernetes distributes pods across a cluster. Cloud providers attach topology information such as zone and region to each node as labels, which Kubernetes can use when scheduling pods. For example, if you want pods to be evenly spread across each zone in a cluster, you could use the following specification. Note the <span class="code-class-custom">topologySpreadConstraints</span> section, which specifies that these pods must be evenly distributed according to the <span class="code-class-custom">topology.kubernetes.io/zone</span> topology key:


kind: Pod
apiVersion: v1
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: nginx
  containers:
    - name: nginx
      image: nginx:latest
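After applying a spec like the one above, a quick way to confirm the spread is to map each pod to the zone of the node it landed on. This sketch assumes the <span class="code-class-custom">app: nginx</span> label from the example and a working <span class="code-class-custom">kubectl</span> context:

```shell
# For each pod matching app=nginx, print the zone of the node it's scheduled on
for node in $(kubectl get pods -l app=nginx -o jsonpath='{.items[*].spec.nodeName}'); do
  kubectl get node "$node" --no-headers --label-columns topology.kubernetes.io/zone
done
```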

How do I validate that my cluster is Highly Available?

Once you've set up your HA cluster, you can use Gremlin to confirm that your cluster remains available even if an entire AZ goes down. Of course, we can't really bring down an AZ (AWS won't allow it, and for good reason), but we can simulate an AZ failure by dropping network traffic to our instances within an AZ.

After deploying an HA Kubernetes cluster:

  1. Deploy Gremlin to your Kubernetes cluster by following the instructions in our installation docs.
  2. Log into the Gremlin web app.
  3. Select Experiments in the left-hand menu and select New Experiment.
  4. Select Infrastructure, then Hosts.
  5. Under Choose Hosts to target, expand Zone, then select one of the Zones listed. If only one Zone is listed, try scaling up your cluster to two or more nodes.
  6. Expand Choose a Gremlin, select the Network category, then select the Blackhole experiment.
  7. Increase Length to 300. This runs the experiment for five minutes, giving the cluster more time to respond to the outage and giving us more time to test.
  8. Click Run Experiment to start the experiment.

While the experiment is running, monitor your cluster using a tool like <span class="code-class-custom">kubectl</span>, the Kubernetes Dashboard, or your cloud provider's console. If the cluster is truly HA, you'll be able to interact with the cluster and its applications normally. If the cluster becomes unresponsive or your applications are unavailable for an extended amount of time, then your cluster isn't HA.
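For example, a simple way to keep an eye on things while the blackhole experiment runs is to watch pod placement from outside the affected zone (assuming <span class="code-class-custom">kubectl</span> access that doesn't route through the blackholed AZ):

```shell
# Watch pods and the nodes they're scheduled on; nodes in the blackholed AZ
# should eventually report NotReady, while pods in other zones keep running
kubectl get pods -o wide --watch
```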

Keep in mind that if Kubernetes needs to reschedule pods from the unavailable zone, it may take a minute or two. Setting the length of the experiment to 300 seconds (5 minutes) should leave enough time for this to happen. If the pods still aren't available by the time the experiment finishes running, you'll want to re-check your <span class="code-class-custom">topologySpreadConstraints</span> configuration.

What similar availability risks should I be looking for?

Your cluster isn't the only resource that needs to be configured for cross-AZ redundancy. You'll likely also have a load balancer in front of your cluster that you'll need to ensure is AZ-redundant. Otherwise, it may fail to route network traffic to working nodes. Services like Amazon Elastic Load Balancing natively support load balancing across multiple AZs, or you can use a standalone tool like Traefik or NGINX.

While this blog focused on Kubernetes AZ redundancy specifically, the same concepts apply to other systems running in your environment. Compute instances, data storage, and serverless functions all benefit from AZ redundancy. How to set these up will vary depending on your cloud provider and deployment methods. If you want a comprehensive look at all the risks that could be lurking in your system, grab a copy of our ebook, “Kubernetes Reliability at Scale.”

In the meantime, the best way to start is by signing up for a Gremlin account, deploying the Gremlin agent to your environment, and defining your services in the Gremlin web app. Gremlin will automatically detect whether each service is AZ redundant, and if not, will flag it in our web app. Once you've successfully made a service AZ redundant, Gremlin will automatically update its list of detected risks, and you can confirm its redundancy by running a blackhole experiment or a pre-built AZ redundancy test. Sign up for a free 30-day trial and get a complete report of your reliability risks, or read our documentation to learn more about how you can mitigate AZ redundancy risks.

Andre Newman
Sr. Reliability Specialist