How to build zone-redundant cloud instances and clusters

Redundancy is a core tenet of cloud computing. While major cloud platforms have high targets for reliability, they can still fail, and it’s important for teams to have a plan for when they do. But how can you build services that can withstand something as disruptive as a datacenter outage?

In this blog, we’ll show you how to prepare for availability zone outages by proactively detecting services operating in a single zone. We’ll show how Gremlin detects this reliability risk for you, how you can mitigate it using commonly available cloud computing tools, and how you can simulate zone and region outages to prove your resilience.

What is availability zone redundancy, and why is it important to address?

Availability zone (AZ) redundancy is when a computing resource (a server, a database, a container, etc.) is replicated in at least two separate AZs. An AZ is an isolated location within a cloud provider’s region, and is functionally independent of the other AZs in that region. An AZ can be a standalone datacenter, or it can be one section of a larger datacenter. The important takeaway is that a failure in one AZ shouldn’t directly impact other AZs or the services running in them.

While AZ failures are rare, they do happen. Misconfigurations in critical services, hardware failures like cut cables, and other unexpected failures can suddenly make an entire zone unreachable. If your services are deployed only to the failed zone, you’re completely dependent on the cloud provider to bring the zone back online before you can even begin to recover. On some platforms, as many as 30% of IT decision makers have experienced disruptions due to outages in the past 12 months.

The good news is that distributing resources across zones is easier than you might think. The exact steps depend on your cloud provider, but the overall process is similar. For this blog, we’ll use AWS as an example.

How do I ensure my services are running in multiple availability zones?

Imagine you want to deploy an EC2 instance, but you want to ensure it’s AZ-redundant. In EC2, your instance’s AZ is determined by its subnet. When you choose the subnet, you’re implicitly specifying the AZ it will run in (note the “Availability Zone” tag in the screenshot below). Our recent tutorial on simulating zone/region evacuations covers this in more detail.

Once this EC2 instance is created, you can create a new EC2 instance from a template (or use the Launch more like this option) and select a second subnet in a different AZ. For example, in the previous screenshot, we chose subnet-d3256e9f in us-east-2c for the first instance. For the new instance, we could choose subnet-a072d8cb, which is in us-east-2a.
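To make the subnet choice concrete, here’s a minimal sketch in Python. It assumes you’ve already fetched your subnets in the shape returned by the EC2 DescribeSubnets API (a list of dictionaries with SubnetId and AvailabilityZone keys). The helper names pick_subnet_in_other_az and launch_in_subnet are our own, and the second helper assumes boto3 is installed and AWS credentials are configured.

```python
from __future__ import annotations


def pick_subnet_in_other_az(subnets: list[dict], current_subnet_id: str) -> str | None:
    """Return the ID of the first subnet whose AZ differs from the current subnet's AZ.

    `subnets` is assumed to look like the "Subnets" list from EC2 DescribeSubnets.
    Returns None when every subnet sits in the same AZ as the current one.
    """
    by_id = {s["SubnetId"]: s for s in subnets}
    current_az = by_id[current_subnet_id]["AvailabilityZone"]
    for subnet in subnets:
        if subnet["AvailabilityZone"] != current_az:
            return subnet["SubnetId"]
    return None


def launch_in_subnet(image_id: str, instance_type: str, subnet_id: str) -> dict:
    """Launch one instance in the given subnet (and therefore in that subnet's AZ)."""
    import boto3  # imported lazily so the helper above works without AWS access

    ec2 = boto3.client("ec2")
    return ec2.run_instances(
        ImageId=image_id,
        InstanceType=instance_type,
        MinCount=1,
        MaxCount=1,
        SubnetId=subnet_id,
    )


# The two subnets from the example above.
subnets = [
    {"SubnetId": "subnet-d3256e9f", "AvailabilityZone": "us-east-2c"},
    {"SubnetId": "subnet-a072d8cb", "AvailabilityZone": "us-east-2a"},
]
second_subnet = pick_subnet_in_other_az(subnets, "subnet-d3256e9f")
```

With the two subnets from our example, second_subnet ends up as subnet-a072d8cb, which you could then pass to launch_in_subnet along with your own AMI ID and instance type.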

From here, you’d need a way to direct customer traffic to these instances. In AWS, you can create a Target Group and register both instances to it. Then, you can create a Load Balancer to send traffic to instances in the Target Group. Application Load Balancers (ALBs) enable cross-zone load balancing by default, though you can turn it off at the target group level. Autoscaling systems, like Kubernetes clusters and fleets of EC2 instances, require some additional work; for those, you can create Auto Scaling Groups (ASGs) with multiple subnets defined, and the ASG will distribute new instances between them.
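As a rough illustration of the ASG step, the sketch below builds the request for the Auto Scaling CreateAutoScalingGroup API with more than one subnet. The function names, the default sizes, and the placeholder IDs are assumptions for this example; the parameter names (VPCZoneIdentifier, TargetGroupARNs, and so on) are the ones boto3 expects. Note that the check only counts subnets, so it’s still up to you to pick subnets that live in different AZs.

```python
from __future__ import annotations


def build_asg_request(
    name: str,
    launch_template_id: str,
    subnet_ids: list[str],
    target_group_arn: str,
    min_size: int = 2,
    max_size: int = 4,
) -> dict:
    """Build keyword arguments for create_auto_scaling_group spanning several subnets."""
    if len(set(subnet_ids)) < 2:
        # A single subnet pins every instance to a single AZ.
        raise ValueError("provide at least two subnets, each in a different AZ")
    return {
        "AutoScalingGroupName": name,
        "LaunchTemplate": {"LaunchTemplateId": launch_template_id, "Version": "$Latest"},
        "MinSize": min_size,
        "MaxSize": max_size,
        # The API takes the subnets as one comma-separated string.
        "VPCZoneIdentifier": ",".join(subnet_ids),
        # New instances are registered with this Target Group automatically.
        "TargetGroupARNs": [target_group_arn],
    }


def create_asg(request: dict) -> None:
    """Send the request to AWS. Assumes boto3 and AWS credentials are available."""
    import boto3  # imported lazily so build_asg_request works without AWS access

    boto3.client("autoscaling").create_auto_scaling_group(**request)


request = build_asg_request(
    name="web-asg",  # hypothetical name
    launch_template_id="lt-0123456789abcdef0",  # placeholder ID
    subnet_ids=["subnet-d3256e9f", "subnet-a072d8cb"],
    target_group_arn="arn:aws:elasticloadbalancing:region:account:targetgroup/example",  # placeholder
)
```

Keeping the request builder separate from the AWS call makes it easy to review (or unit test) the multi-subnet configuration before anything is created.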

There’s one other important consideration: data storage. This setup works fine for stateless services, where you don’t need to store persistent data (or if you’re already using a distributed data store, like Amazon S3). Stateful services, like databases, also need their data to be available in more than one zone; for those, services like Amazon FSx let you create distributed file stores.

How do I validate that my services are availability zone redundant?

With dynamic systems like autoscaling clusters, there’s always a risk that instances will end up concentrated in a single AZ despite your settings. That’s why it’s important to watch for single-AZ risks, not just at the infrastructure level, but also at the service level.

Gremlin offers this functionality built-in. After you deploy a service to your instances (whether it’s a container, Kubernetes deployment, or a process running directly on the instance), Gremlin can detect which AZ it’s running in and whether there are other instances running in other AZs. If you have at least one other instance running in another AZ, then the service is AZ-redundant. If all instances are running in the same zone, or if you only deployed one instance, Gremlin will report this as a reliability risk. You’ll also get recommendations on how to fix the issue directly in the web app.

Detected Risks in the Gremlin web app, showing "Availability Zone Redundancy" at risk
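The rule described above (a service is AZ-redundant only if its instances span at least two zones) is simple enough to sketch yourself. The snippet below is not Gremlin’s implementation; it’s an illustrative Python version that assumes you can produce a list of (service name, zone) pairs, one per running instance. The function name and the service names are made up for the example.

```python
from collections import defaultdict


def find_single_zone_services(placements):
    """Return the names of services whose instances all run in one zone.

    `placements` is an iterable of (service_name, zone) pairs, one per instance.
    A service with a single instance is flagged too, since it has only one zone.
    """
    zones_by_service = defaultdict(set)
    for service, zone in placements:
        zones_by_service[service].add(zone)
    return sorted(
        service for service, zones in zones_by_service.items() if len(zones) < 2
    )


placements = [
    ("checkout", "us-east-2a"),
    ("checkout", "us-east-2c"),  # second zone, so checkout is AZ-redundant
    ("cart", "us-east-2a"),
    ("cart", "us-east-2a"),      # two instances, but both in the same zone
    ("search", "us-east-2b"),    # only one instance
]
at_risk = find_single_zone_services(placements)
```

Here, at_risk contains cart and search: cart has two instances in the same zone, and search has only one instance.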

The great news is that once you define your services in Gremlin, Gremlin will keep monitoring them for these and other risks. You’ll also see reports for any other services that you (or someone in your Gremlin team) have added, along with their risks. These risks ultimately feed into the Score, which is a measure of how reliable the service is.

A list of Services that have been defined in a team in the Gremlin web app

If you want to verify that your service is AZ-redundant, you can also use one of Gremlin’s pre-built reliability tests. The default Gremlin reliability test suite comes with a zone redundancy test that simulates an AZ outage. It does this by running a blackhole experiment, which drops all network traffic to and from the AZ (excluding traffic from Gremlin) for 5 minutes. During this time, it uses Health Checks (which you can integrate with your monitoring or observability tools, including CloudWatch) to track the state of your service. If your service remains healthy and responsive throughout the test, it means it can successfully withstand losing an AZ. However, if it becomes inaccessible or unresponsive, Gremlin automatically halts the test, returns your service to normal, and records the test as having failed.

Running a Zone Redundancy reliability test using Gremlin RM
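To show the shape of the test described above (inject the outage, poll a health check for the duration, halt early and roll back on failure), here’s a simplified Python sketch. It does not call Gremlin’s API. The start_outage, stop_outage, and is_healthy callables are hypothetical hooks you’d supply yourself, and the 10-second polling interval is an assumption; only the 5-minute duration comes from the description above.

```python
import time


def run_zone_outage_test(
    start_outage,
    stop_outage,
    is_healthy,
    duration_s=300.0,       # 5 minutes, as in the zone redundancy test
    poll_interval_s=10.0,   # assumed polling interval
    sleep=time.sleep,
    now=time.monotonic,
):
    """Return True if the service stays healthy for the whole simulated outage."""
    start_outage()
    try:
        deadline = now() + duration_s
        while now() < deadline:
            if not is_healthy():
                return False  # halt early; the finally block restores the zone
            sleep(poll_interval_s)
        return True
    finally:
        # Always restore normal conditions, whether the test passed, failed, or raised.
        stop_outage()
```

The sleep and now parameters exist so you can swap in a fake clock and exercise the logic without waiting five minutes.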

What about multi-region or multi-cloud redundancy?

Multiple AZs are all well and good, but what if you need greater redundancy? What if you’re running business-critical applications, like processing high-volume financial transactions, tracking flights, or processing medical data?

The next step beyond AZ redundancy is region redundancy, or replicating services across entire regions. This is more difficult, since many cloud providers scope their resources to a single region. For example, with Amazon EC2, all of your instances, launch templates, and ASGs are region-specific. Managing and orchestrating deployments across multiple regions typically requires a separate tool, such as an infrastructure-as-code tool like Terraform or a multi-cloud platform like Anthos, and is beyond the scope of this blog.

However, you can reproduce a region failure using Gremlin. When you deploy the Gremlin agent to a cloud compute environment, it automatically detects tags from the instance it’s running on, such as the operating system, hostname, availability zone, and region. You can create a Chaos Engineering experiment or Scenario in Gremlin that targets all instances containing a specific region tag (or select the availability zones corresponding to a region), then use a network experiment to drop traffic, add latency, check for expiring TLS certificates, create packet loss or jitter, or simulate a DNS outage. You can also use the rest of Gremlin’s experiment types to recreate other failure modes, including CPU and memory load, disk exhaustion, process exhaustion, and desynchronized system clocks.
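Conceptually, targeting by region tag is just a filter over your hosts. The sketch below illustrates the idea in Python; it isn’t Gremlin’s API, and the dictionary layout and the tag names (region, zone) are assumptions chosen for the example.

```python
def select_targets_by_region(agents, region):
    """Return the hostnames of agents whose (hypothetical) region tag matches."""
    return [
        agent["hostname"]
        for agent in agents
        if agent.get("tags", {}).get("region") == region
    ]


agents = [
    {"hostname": "web-1", "tags": {"region": "us-east-2", "zone": "us-east-2a"}},
    {"hostname": "web-2", "tags": {"region": "us-east-2", "zone": "us-east-2c"}},
    {"hostname": "web-3", "tags": {"region": "us-west-2", "zone": "us-west-2b"}},
]
targets = select_targets_by_region(agents, "us-east-2")
```

In this example, targets holds web-1 and web-2, so an experiment aimed at them would affect every zone in us-east-2 while leaving us-west-2 untouched.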

To get started with zone and region redundancy, log into your Gremlin account or sign up for a free 30-day trial.
