In January of 2023, Google released its infrastructure reliability guide, which provides guidelines on how to build high-availability applications in Google Cloud. While it's written for Google Cloud, it provides some excellent general-purpose information on how to architect reliable applications on any cloud provider, including:
- What reliability means in the context of cloud computing.
- Steps that cloud providers take to provide reliable infrastructure, and what to do if that infrastructure fails.
- Factors that impact reliability on a zonal, regional, and global basis, and how to manage those you have control over.
- How to assess your reliability requirements and posture.
In this blog, we'll explain each of these factors and how you can use Gremlin to ensure you're meeting your reliability requirements.
Google Cloud defines reliability as "meeting your current objectives for availability and resilience to failures," where availability (also called uptime) is the percentage of time an application is usable. Availability is often measured in two ways:
- The minutes, hours, or days a system is down (e.g., 99% availability equates to roughly 7 hours of downtime in a 30-day month).
- The rate of successful requests (e.g., 99% availability equates to 10 out of every 1000 requests failing).
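Both framings above reduce to the same arithmetic. As a quick sanity check, here is a minimal sketch of that math (the function names are illustrative, not from the guide):

```python
def downtime_per_month(availability: float, hours_in_month: float = 720.0) -> float:
    """Hours of downtime per 30-day month implied by an availability percentage."""
    return hours_in_month * (1 - availability / 100)

def expected_failures(availability: float, total_requests: int) -> float:
    """Expected failed requests out of a total, at a given availability."""
    return total_requests * (1 - availability / 100)

print(round(downtime_per_month(99.0), 1))       # ~7.2 hours of downtime per month
print(round(expected_failures(99.0, 1000)))     # ~10 failed requests per 1,000
```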
How many nines do you actually need? Gremlin CTO Kolton Andrus explains how to define availability in a way that makes sense for your organization.
While Google is responsible for maintaining its cloud infrastructure, it's up to you—the customer—to use it in a way that ensures the greatest availability for your applications. Even Google can't guarantee 100% uptime for all of its services, and the same is true of AWS, Azure, and other providers. Several other factors can affect availability, too, including:
- How reliable a specific service is, such as Google Compute Engine.
- How well your application is designed to handle errors (timeout and retry logic, exception handling, etc.).
- Whether you've provisioned enough capacity to handle your current workloads.
- What external dependencies your application communicates with, and their reliability.
- The DevOps processes you use to build, deploy, and maintain your workloads and infrastructure.
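The second factor above—application-level error handling—is often the cheapest to improve. As a sketch of what timeout and retry logic can look like, here is a minimal retry helper with jittered exponential backoff (the function names, limits, and exception types are illustrative assumptions, not from Google's guide):

```python
import random
import time

def call_with_retries(request, max_attempts=3, base_delay=0.5, timeout=2.0):
    """Call `request(timeout)`, retrying transient failures with jittered backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return request(timeout)
        except (TimeoutError, ConnectionError):
            if attempt == max_attempts:
                raise  # retries exhausted; surface the error to the caller
            # Exponential backoff with jitter so clients don't retry in lockstep
            time.sleep(base_delay * 2 ** (attempt - 1) * random.uniform(0.5, 1.5))
```

The jitter matters: without it, many clients that failed at the same moment will retry at the same moment, turning a brief blip into a retry storm.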
When working with any cloud provider, it's important to be aware of what Google calls failure domains: a resource or group of resources that can fail independently of other resources. An example of a failure domain is a single standalone Compute Engine instance, but you can also think of an entire zone or region as a failure domain. While a zone is arguably less likely to fail than a single instance, it's still possible, so you need to factor it into your architecture design.
It's also important to remember that your reliability risk and tolerance can vary between workloads. For example, a banking service that processes customer transactions requires significantly more availability than a service that emails promotional materials. Consider your reliability needs for each workload, as that will help you identify where you should focus your reliability efforts.
More specifically, Google recommends the following design tips:
- Avoid single points of failure (SPOF). These risk taking down your entire application if they fail. For example, if you have replicated servers behind a single load balancer, the load balancer is now a SPOF.
- Distribute resources and create redundancy. Having redundant systems lets you maintain service even when one of those systems fails, and making them distributed ensures that a local outage won't take down your entire application.
- Use multi-zone and multi-region deployments. Zone and region failures are relatively uncommon, but they can still happen. Distributing your workloads across multiple zones or regions ensures that you can always redirect traffic to another location if one location becomes unavailable.
Applying these tips will significantly increase the availability of your service. A standalone virtual machine instance might have 99.9% availability, but two instances replicated across multiple regions could have 99.999% availability. This might not look like much, but it's the difference between 5 minutes of downtime every three days and 5 minutes of downtime every year.
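The compounding effect of redundancy is easy to verify with a quick calculation. The sketch below assumes fully independent replica failures, which real deployments only approximate—correlated failures (shared dependencies, bad deploys pushed everywhere) are exactly why conservative estimates like the 99.999% figure above come in below the theoretical maximum:

```python
def combined_availability(per_instance: float, replicas: int) -> float:
    """Availability (%) of `replicas` independent instances, any one of which can serve."""
    unavailability = (1 - per_instance / 100) ** replicas
    return 100 * (1 - unavailability)

print(round(combined_availability(99.9, 1), 4))  # 99.9
print(round(combined_availability(99.9, 2), 4))  # 99.9999 under ideal independence
```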
Google provided a framework for building resilient and reliable services, but how does it work in practice? How do you verify that your reliability work is paying off and that your Google Cloud deployments really are more resilient?
Gremlin's Reliability Management (RM) solution provides pre-built tests that validate your services against the best reliability practices from Google, AWS, and Azure, including scalability, redundancy, and resilience to dependency failures. Running these tests will help you:
- Identify single points of failure, whether individual hosts or entire zones.
- Test your service's ability to scale in response to high CPU and memory use.
- Test your service's ability to withstand third-party dependency failures, including SaaS dependencies.
After each test, Gremlin calculates and assigns a reliability score to each of your services. This score acts as a standardized, objective measure of reliability for your team so you can easily see which services are reliable and which services need additional attention. Gremlin also tracks scores over time, so you can see how reliability is trending. If you want to run more advanced reliability tests, Gremlin Fault Injection (FI) gives you access to Gremlin's full suite of faults, which let you test network latency, DNS outages, and more.
You can find the full Google Cloud infrastructure reliability guide at https://cloud.google.com/architecture/infra-reliability-guide/. To see how Gremlin can help you meet your reliability targets, join our weekly demo or talk to one of our reliability experts.