Seven tests to measure and improve reliability: what matters and how it works

Note: this blog was originally published on September 2, 2022.

Legendary race car driver Carroll Smith once said, "Until we have established reliability, there is no sense at all in wasting time trying to make the thing go faster." Even though he was referring to cars, the same goes for technology: no amount of code optimization or new features can replace stable systems. Unfortunately, much like race cars, it's hard to know that a system is unreliable until it blows a tire, the brakes stop working, or the steering wheel comes off the column. By then, it's too late: you're panicking, you and other engineers are scrambling to fix the issue, and your users are angry.

The best way to prevent an incident like this is to prepare for it by testing your systems before they go into production. This means running tests explicitly designed to validate a system's reliability—in other words, how well you can trust it to remain available under less-than-ideal conditions. The challenge is knowing how to run these tests and which ones to run in the first place.

Gremlin provides pre-built reliability tests designed to validate resilience against common failure modes and ensure your services meet reliability best practices. In this blog, we explain why you should run these tests and how they can help assess your services' reliability.


Testing the reliability of cloud-native distributed systems

For modern distributed systems to be considered reliable, they generally need to have these capabilities:

Automatically scale in response to demand to maintain high performance even under heavy traffic.
Have at least one redundant system ready to bring services back online if they fail.
Work around or withstand being disconnected from other services, especially dependencies.
Work around or withstand poor or unstable network performance (e.g., latency).

In other words, our systems must be scalable, redundant, resilient to network outages, and able to tolerate latency.

These are also the categories Gremlin uses to group reliability tests. The Scalability category tests CPU and Memory to ensure your services scale when CPU or RAM usage is high. Likewise, the Redundancy category tests host and availability zone (AZ) outages to validate that your service can fail over to a healthy host or zone, depending on the scale of the outage. For each dependency your service communicates with, the Dependency category tests your service by simulating dependency outages, slow dependencies, or dependencies with expiring TLS certificates.

Screenshot of a service's overview page showing the reliability tests available.
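To make the expiring-certificate idea concrete, here is a minimal sketch of the kind of check such a test implies. It is not Gremlin's implementation: the function names and the 30-day warning window are assumptions chosen for illustration. It takes a certificate's "notAfter" timestamp, in the format Python's ssl module reports, and works out how many days remain.

```python
import ssl

SECONDS_PER_DAY = 86_400


def days_until_expiry(not_after: str, now_epoch: float) -> float:
    """Return the days left before a certificate's notAfter time.

    not_after uses the format found in ssl certificate dictionaries,
    for example "Jan  5 09:34:43 2018 GMT".
    """
    expires_epoch = ssl.cert_time_to_seconds(not_after)
    return (expires_epoch - now_epoch) / SECONDS_PER_DAY


def expires_soon(not_after: str, now_epoch: float, warn_days: float = 30) -> bool:
    """Flag certificates that expire within warn_days (a hypothetical window)."""
    return days_until_expiry(not_after, now_epoch) < warn_days
```

In practice you would pass time.time() as now_epoch and read the notAfter value from each dependency's certificate. Keeping the clock as a parameter makes the check easy to test.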


Why did we choose these tests?

When deciding which reliability tests to include, we looked at the best practices teams adopt—and the common failure modes they experience—when working with modern distributed cloud-native systems.

For example, AWS Well-Architected is a framework designed by AWS to help its customers build resilient and scalable systems on AWS. It comprises six pillars, each tackling a specific challenge: operational excellence, security, reliability, performance efficiency, cost optimization, and sustainability. Successfully implementing these pillars will better optimize your workloads for AWS, but the general principles can apply to any large-scale distributed system.

Learn how to use the Well-Architected Framework to make the best architectural decisions based on your business needs by watching our webinar, presented in partnership with Amazon Web Services: Continuous validation of the Well-Architected Framework.

Gremlin has years of experience working with some of the world's largest enterprises across many different industries. This gives us a unique look into the challenges that teams face when testing reliability at an enterprise scale, and we used this insight when designing our tests. We have added tests over time, and will continue to do so, as cloud-native organizations' needs and expectations change along with the technology landscape.


What happens if we don't run regular reliability tests?

Like traditional QA tests, reliability tests catch defects before they make their way into production. Gremlin's tests target failure modes common to modern distributed systems and help answer questions such as:

Are my services redundant enough to withstand one or more hosts failing?
Can I scale automatically to meet increasing demand?
Is my service resilient to slow or failed dependencies?

We must regularly test these assumptions or risk unexpected and undesired impacts on our customers. Much like regression testing, regular reliability testing ensures that future changes to our code and infrastructure don't introduce bugs or uncover old ones. Failing to run these tests regularly can lead to the following impacts.

Systems will scale less efficiently

Scaling is an inherent benefit of distributed systems and cloud-native applications. One of the main draws of cloud platforms like AWS, Azure, and GCP is that they support automatic scaling. Once demand reaches a certain threshold, the platform can automatically provision and deploy new systems to add capacity and lower your overall resource saturation. If you don't have autoscaling, you'll either need to monitor your systems' usage continually and manually deploy new systems when required, or risk exhausting all available resources once demand grows too high. Both situations can cost you time, money, effort, or even downtime due to crashes.
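As a rough illustration of threshold-based scaling, the sketch below applies a proportional rule similar to the one documented for the Kubernetes Horizontal Pod Autoscaler: scale the replica count by the ratio of observed to target utilization. The function name, the default cap of 10 replicas, and the sample numbers are assumptions for illustration, not any platform's actual configuration.

```python
import math


def desired_replicas(current_replicas: int,
                     current_utilization: float,
                     target_utilization: float,
                     max_replicas: int = 10) -> int:
    """Return the replica count that brings utilization back toward target."""
    if current_replicas < 1 or target_utilization <= 0:
        raise ValueError("need at least one replica and a positive target")
    # Proportional rule: more load per replica than the target means more replicas.
    wanted = math.ceil(current_replicas * current_utilization / target_utilization)
    # Never drop below one replica or exceed the (hypothetical) cap.
    return max(1, min(wanted, max_replicas))
```

For example, four replicas running at 90% CPU against a 60% target would grow to six. A scalability test is a way to confirm that your platform makes this kind of adjustment when CPU or memory usage climbs.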

Services will be more susceptible to zone and region failures

The unfortunate reality about complex systems is that they can fail anytime and for any reason. Cloud providers work hard to ensure their systems are reliable or, at the very least, can fail over quickly, but it's up to us as engineers to ensure our services and applications can do the same. Some scenarios could negatively impact us in a way that a cloud provider can't immediately resolve, such as a zone or region outage.

The great thing about today's distributed systems is that redundancy can be mostly automated. For example, Kubernetes lets you easily replicate individual services using Deployments, and cloud computing platforms like Amazon EC2 let you run redundant instances across multiple availability zones with automatic load balancing. This way, your systems and services will remain resilient even during large-scale outages.

Network outages will have a greater impact

Modern applications often depend on external dependencies. These can include:

Cloud provider services like load balancers, databases, and message queues.
Services that our team maintains.
Services that other teams in our organization manage.

Imagine that we have a web service that connects to an external database. Now imagine that a network problem or system crash causes our service to disconnect from the database. How will our service respond? Does it recognize that the database is unavailable and return an error message? Does it keep retrying the connection and eventually give up? Or does it not recognize the disconnection and time out or crash? These are all important to know, as they tell us how our service will behave if the same scenario happens in production.
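One of the behaviors described above, retrying the connection and eventually giving up, can be sketched as follows. This is an illustrative pattern rather than a prescribed implementation: the names call_with_retries and DependencyUnavailable, the default of three attempts, and the backoff delays are all assumptions.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class DependencyUnavailable(Exception):
    """Raised when the dependency is still unreachable after all attempts."""


def call_with_retries(operation: Callable[[], T],
                      attempts: int = 3,
                      base_delay: float = 0.1,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Run operation, retrying on connection errors with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            return operation()
        except (ConnectionError, TimeoutError) as error:
            last_error = error
            if attempt < attempts - 1:
                # Wait longer after each failure: 0.1s, 0.2s, 0.4s, ...
                sleep(base_delay * 2 ** attempt)
    # Give up with a clear error instead of hanging or crashing.
    raise DependencyUnavailable(f"gave up after {attempts} attempts") from last_error
```

A caller can catch DependencyUnavailable and return an error message to the user. A dependency outage test then shows whether the service really takes this path or instead hangs until it times out.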

Services will be more susceptible to unreliable dependencies

Microservice-based applications are much more susceptible to latency—delays in network traffic—than monolithic applications. This is especially true for traffic that services send over the Internet, which may travel hundreds of miles through dozens of routers, switches, gateways, and firewalls. These delays are usually minor (often only a few milliseconds), but when added up for each network packet, they can quickly make your application seem slow or sluggish.
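To see how small delays accumulate, consider a simplified model in which one user request triggers a series of calls to other services. The helper names and the numbers below (40 calls at 5 ms each) are hypothetical, and the model counts whole calls rather than individual packets.

```python
from typing import Sequence


def sequential_latency_ms(call_latencies_ms: Sequence[float]) -> float:
    """Total added delay when each call must finish before the next starts."""
    return sum(call_latencies_ms)


def parallel_latency_ms(call_latencies_ms: Sequence[float]) -> float:
    """Added delay when all calls run at once: the slowest call dominates."""
    return max(call_latencies_ms, default=0)


calls = [5] * 40  # hypothetical: 40 service calls, 5 ms of network delay each
total_if_sequential = sequential_latency_ms(calls)  # 200 ms
total_if_parallel = parallel_latency_ms(calls)      # 5 ms
```

Under these assumptions, a delay of only 5 ms per call adds 200 ms to a request that makes its calls one after another, which is enough for users to notice.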

This is one of the main drawbacks of distributed systems like Kubernetes. In Kubernetes, cross-service communication happens almost exclusively over the network. Even if we deploy a Kubernetes cluster on a high-speed network, any change in network saturation or reliability can significantly impact our service. This is why we must proactively test our service's ability to tolerate different latency levels, especially when communicating with critical dependencies like databases and file stores.

How to start running reliability tests

Gremlin makes it easy to run reliability tests on your services. Follow our Reliability Management (RM) quick-start guide, find the test you want to run, and click run. While a test is running, Gremlin continuously monitors your service via its Health Checks to ensure it's available, responsive, healthy, and stable.

If you want to run additional tests on your service or customize an existing test, you can choose a different Test Suite. A Test Suite is a collection of reliability tests that applies to your entire Gremlin team. Gremlin currently offers its default test suite, as well as the Well-Architected Cloud Test Suite for testing cloud environments.

If you’d like to see more of what Gremlin offers, sign up for a demo.
