When discussing reliability, we tend to focus on the things that we have control over: applications, virtual machine instances, deployment patterns, etc. But this ignores a significant and ever-growing part of nearly all modern software: dependencies.

Dependencies are services that provide extra functionality for other services and applications. For instance, many websites depend on databases, caches, payment processors, and similar services in order to function. This puts website operators in a difficult position, though: if a dependency fails, what happens to the application? Can it keep working despite not having access to a critical resource? Or will a dependency failure cause a cascade of other failures?

In this blog, we’ll explain the role dependencies play in reliability, how they can fail, and how you can build resilience against unstable and unreliable dependencies.

What are service dependencies, and what are the risks of them failing?

As mentioned earlier, a dependency is any computing component that provides functionality used by another service. In the context of this blog, we’re referring to service dependencies. These include:

  • Software as a Service (SaaS) platforms.
  • Databases like MySQL, Azure Cosmos DB, and Amazon DynamoDB.
  • Services owned and managed by other teams in your organization.

Importantly, we’re not talking about code dependencies. Code dependencies (or libraries) get imported into an application’s source code during development. Service dependencies aren’t a part of your application, but are hosted separately and are typically interacted with over a network connection.

When your application requires a dependency to perform a function, this is the typical flow (sketched in code below the list):

  1. The service sends a request to the dependency.
  2. The dependency processes the request and returns a response.
  3. The service processes the response and acts on it (e.g. updating the page and informing the user).
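
To make this concrete, here’s a minimal TypeScript sketch of that flow, assuming a hypothetical userservice dependency reachable over HTTP. The hostname, route, and response shape are illustrative, not from a real system:

```typescript
// Minimal sketch of the three-step flow above, assuming a hypothetical
// "userservice" dependency reachable over HTTP. The hostname, route, and
// response shape are illustrative. Requires Node.js 18+ for global fetch.
async function getAccountBalance(userId: string): Promise<number> {
  // 1. The service sends a request to the dependency.
  const response = await fetch(`http://userservice.internal/balance/${userId}`);

  // 2. The dependency processes the request and returns a response.
  if (!response.ok) {
    throw new Error(`Dependency returned HTTP ${response.status}`);
  }

  // 3. The service processes the response and acts on it
  //    (e.g. rendering the balance for the user).
  const body = (await response.json()) as { balance: number };
  return body.balance;
}
```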

Depending on how many interdependent services you have, dependency chains can become extremely long and complex. Consider an online banking application: you might have a single frontend service, but you might also have an account management service, an authentication service, a ledger writing service, a fraud prevention service, etc. If any one of those services fails—especially a critical one like the authentication service—it could cause an incident or an outage.

How to prepare your services for slow dependencies

Slow dependencies are particularly difficult to prepare for due to their variability. A dependency may perform fine 99% of the time, but all it takes is a poor network connection, traffic surge, or other unexpected event to reduce its performance. If that happens, it can have cascading effects on the performance and responsiveness of your application. There are a few techniques for handling slow dependencies, and we’ll cover some of them in this section.

Multithreading and asynchronous processing

In a single-threaded application, events occur sequentially. If your service requires a dependency to fulfill a user's request, it must wait for the dependency to respond before it can respond to the user. If the dependency is unreliable, your application needs a way to detect the problem and report the error, usually by setting a timeout (say, 10 seconds) or retrying a certain number of times. But even with this protection, the application can’t do anything else until it’s either gotten a response or given up on the request.

Runtimes like Node.js work around this with constructs like Promises, which let you create an asynchronous operation that gets processed at some later time. This lets you perform other actions, like rendering a form or populating images with placeholder data, while waiting for the dependency to fulfill the Promise.
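
As a rough illustration, here’s what combining a timeout with a Promise-based dependency call might look like in TypeScript. The userservice.internal hostname is hypothetical, and global fetch assumes Node.js 18 or later:

```typescript
// Racing a dependency call against a timeout so the app can fall back
// instead of blocking indefinitely.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  const timeout = new Promise<never>((_, reject) =>
    setTimeout(() => reject(new Error(`Timed out after ${ms}ms`)), ms)
  );
  // Whichever settles first wins: the dependency's response or the timeout.
  return Promise.race([promise, timeout]);
}

// While this Promise is pending, the service can do other work, like
// rendering a form or showing placeholder data.
withTimeout(fetch("http://userservice.internal/profile/42"), 10_000)
  .then((res) => console.log(`Dependency responded: ${res.status}`))
  .catch((err) => console.error(`Giving up: ${err.message}`));
```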

Caching

A cache is a data store placed between two services to hold data temporarily. Instead of requesting data directly from a dependency, your service can request it from the cache. This reduces load on the dependency and can keep your service running through brief dependency failures. Caches can also be load-balanced and geographically distributed to further reduce the risk of outages.

One downside to caching is data freshness. Data stored in a cache is usually tagged with an expiration time (or TTL, Time To Live), after which it’s no longer valid. Once the TTL passes, the data is flagged as “expired”, and the cache retrieves an up-to-date copy from the dependency. But what happens if the cache serves data that hasn’t yet expired, but differs from what’s in the dependency? For instance, if we’re running an online banking service, what happens if a user checks their account immediately after making a transaction, but our cached data is five or ten minutes old? We’d need to exclude this data from the cache, force the cache to refresh under certain conditions, or just keep serving stale data until the TTL expires. These are all trade-offs engineers need to weigh when designing caching systems.
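
Here’s a minimal cache-aside sketch in TypeScript that illustrates the TTL trade-off described above. The in-memory Map stands in for a real cache like Redis, and fetchFromDependency is a hypothetical callback that queries the dependency directly:

```typescript
// Minimal cache-aside sketch with a TTL.
type Entry = { value: string; expiresAt: number };

const cache = new Map<string, Entry>();
const TTL_MS = 5 * 60 * 1000; // five minutes: reads can be this stale

async function getCached(
  key: string,
  fetchFromDependency: (key: string) => Promise<string>
): Promise<string> {
  const entry = cache.get(key);
  if (entry && entry.expiresAt > Date.now()) {
    return entry.value; // possibly stale, but still within its TTL
  }
  // Missing or expired: refresh from the dependency and re-cache.
  const value = await fetchFromDependency(key);
  cache.set(key, { value, expiresAt: Date.now() + TTL_MS });
  return value;
}

// One way to avoid serving stale balances: invalidate after every write,
// forcing the next read to go back to the dependency.
function invalidate(key: string): void {
  cache.delete(key);
}
```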

Circuit breakers

A circuit breaker is a software development pattern that detects slow or unavailable dependencies and fails fast instead of sending more requests to them. Unlike a standard retry model, where the application resends failed requests a set number of times, a circuit breaker acts as a proxy: your service sends a request to the circuit breaker, which forwards it to the dependency. If the dependency is unavailable, the circuit breaker detects this and notifies your service. The circuit breaker also remembers that the dependency is unavailable, and rejects future requests until a certain amount of time passes or it re-establishes contact with the dependency.

The main benefit of circuit breakers is that they can immediately respond to requests when a dependency is unavailable, reducing wait time and load on the dependency. They do add some complexity, though, and any calls to the dependency have to be rewritten to include the circuit breaker.
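
For illustration, here’s a bare-bones TypeScript sketch of the pattern. Production-grade circuit breakers (such as the opossum library for Node.js) add half-open states, metrics, and fallbacks, but the core idea fits in a few lines. The threshold and cooldown values are arbitrary:

```typescript
// Bare-bones circuit breaker sketch: after `threshold` consecutive
// failures the breaker "opens" and fails fast until `cooldownMs` elapses,
// at which point one trial request is allowed through.
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(private threshold = 3, private cooldownMs = 30_000) {}

  async call<T>(request: () => Promise<T>): Promise<T> {
    if (this.failures >= this.threshold) {
      if (Date.now() - this.openedAt < this.cooldownMs) {
        // Open: reject immediately instead of waiting on a dead dependency.
        throw new Error("Circuit open: dependency marked unavailable");
      }
      this.failures = 0; // cooldown elapsed: allow a trial request
    }
    try {
      const result = await request();
      this.failures = 0; // success closes the circuit
      return result;
    } catch (err) {
      this.failures++;
      this.openedAt = Date.now();
      throw err;
    }
  }
}

// Usage: wrap every call to the dependency in the breaker.
const breaker = new CircuitBreaker();
// breaker.call(() => fetch("http://userservice.internal/health"));
```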

Exponential backoff

Exponential backoff (also called “retry with backoff”) is another software pattern built on retrying failed requests. Unlike a plain retry loop, the wait time between attempts increases with each failure, and as the name suggests, this increase is often exponential. For instance, the first failure might wait one second before retrying, then two seconds, then four seconds, etc. This gives the dependency extra time to recover, but it also has another important function. When many services share one dependency and that dependency fails, those services may make the problem worse by continuously trying to contact it. This is known as a thundering herd. To prevent this, exponential backoffs often include jitter, which adds a small amount of randomness to the backoff times. Jitter, combined with exponential backoff, limits the likelihood of a thundering herd by spreading requests out over a longer period. AWS claims that this can reduce wait times without overloading dependencies, but you may see different results depending on your architecture and environment.
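
Here’s a short TypeScript sketch of retry with exponential backoff and “full jitter”, one of the variants AWS has written about. The parameter values are arbitrary defaults, and request is any function that calls the dependency:

```typescript
// Retry with exponential backoff and "full jitter": each attempt waits a
// random duration between 0 and min(capMs, baseMs * 2^attempt), spreading
// retries out so many clients don't hit a recovering dependency at once.
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function retryWithBackoff<T>(
  request: () => Promise<T>,
  maxAttempts = 5,
  baseMs = 1_000,
  capMs = 30_000
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await request();
    } catch (err) {
      if (attempt + 1 >= maxAttempts) throw err; // out of retries
      const backoff = Math.min(capMs, baseMs * 2 ** attempt);
      await sleep(Math.random() * backoff); // full jitter
    }
  }
}
```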

How do you test your resilience to slow dependencies?

Once you have a remediation in place, the next step is to verify that it works. How do you test a dependency resilience mechanism? By running it against a slow dependency.

Of course, that’s easier said than done: you don’t always know when a dependency will fail, and if it’s an external dependency like a SaaS product, you can’t just ask the provider to slow down their service for a test. Instead, you can use Gremlin to run a latency experiment on your service and target only the network traffic headed to that dependency.

Gremlin comes with a pre-built Scenario called Dependencies: Latency Test that tests this exact situation. A Scenario is a workflow that runs one or more Chaos Engineering experiments sequentially. The Scenario works by adding a customizable amount of latency to network traffic between a service and a dependency. You can fine-tune Gremlin’s latency experiment (and other network experiments) to target specific network traffic by destination hostname, port number, network device, IP address, and/or protocol (TCP, UDP, or ICMP). This way, you can add latency to a specific dependency without impacting any others.

Details of the Dependency Latency Test Scenario in Gremlin

This Scenario only has one step: add 100ms of latency for five minutes. This might not sound like much, but keep in mind that this is 100ms per packet. If your service communicates with your dependency frequently, or constantly opens and closes connections to it, the impact could be significantly more than 100ms. But if data transfers are small or infrequent, the impact will be smaller. As an example, one of our tests found that adding just 20ms of latency between a WordPress website and its MySQL database increased response times from 128ms to 719ms.

Adding a Health Check to a Gremlin Scenario

Before we run this Scenario, we should add a Health Check. A Health Check is a periodic check that Gremlin makes to verify that a service is still responsive during testing. If the check returns a healthy status code within your response-time criteria, the Scenario keeps running. If not, Gremlin immediately stops the Scenario and returns your service to its normal state to prevent an incident or outage. Gremlin natively integrates with several observability tools, including Amazon CloudWatch, Datadog, Prometheus, and Grafana, and we recommend using your existing monitors or alerts as Health Checks instead of creating brand-new ones.
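
As a rough approximation of what an HTTP Health Check does, here’s a minimal TypeScript sketch that polls an endpoint and treats a 2xx response within two seconds as healthy. This is for illustration only, not how Gremlin implements Health Checks internally:

```typescript
// Poll an endpoint and treat a 2xx response within two seconds as
// healthy. Requires Node.js 18+ for global fetch and AbortSignal.timeout.
async function checkFrontend(url: string): Promise<boolean> {
  const started = Date.now();
  try {
    const res = await fetch(url, { signal: AbortSignal.timeout(2_000) });
    return res.ok && Date.now() - started <= 2_000;
  } catch {
    return false; // timeout or network error counts as unhealthy
  }
}
```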

Note
If you want to follow along with the example in this blog, you can deploy your own copy of the application from the Bank of Anthos GitHub repository.

To add the Health Check, first open the Gremlin web app at app.gremlin.com, or click this link to go directly to the Scenario. From this page, click Customize. This brings you to a page where you can change the steps in the Scenario. Under the “Health Checks” heading, click Add Health Check. If you or someone else on your team has already created a Health Check, you can select it here. Otherwise, you can create a new one. For this blog, I’ll be using a Health Check I already created called “Application frontend”, which simply sends HTTP requests to our website’s frontend and waits for a response within two seconds.

A Health Check added to a Gremlin Scenario

The next step is to select targets for this experiment. Clicking the three-dot menu icon on the right side of the experiment node and selecting Edit opens a pane where you can select which systems to target. Although the focus of this blog is on testing dependencies, remember: we’ll be running the experiment on the service calling the dependency. We can only test our own service’s ability to handle slow and failed dependencies; we can’t test the resilience of the dependency itself. That’s the responsibility of the dependency’s owner, whether that’s another team in our organization or a third-party provider like a cloud platform.

Since the target is a Kubernetes Pod, click the Kubernetes tab, then select your target Deployment, DaemonSet, StatefulSet, or standalone Pod. You can also use the search box just below the cluster name to find a specific target, or switch between clusters and namespaces using the drop-down boxes. Here, I’ll select the application’s frontend service, which hosts the public-facing website:

Selecting a Kubernetes Deployment as a target for a Gremlin experiment

Once you choose your target(s), you’ll see them highlighted to the right. This is your blast radius: the set of resources that will be directly impacted by the experiment. Below that, you can specify the number or percentage of Pods within the Deployment to target, if you don’t want to target all of them. If you reduce this number, Gremlin will randomly choose targets from the pool of available targets until it meets the number or percentage you specify. You can also choose whether to include Pods that are detected while the experiment is running. We want to keep this enabled; otherwise, traffic might flow to our dependency through those new Pods and bypass the experiment.

We also want to specify which types of traffic to target. Our frontend depends on a downstream service called the userservice, which in turn depends on a database to serve user account data. We can target traffic to the userservice in one of two ways:

  1. Finding the hostname that the frontend uses to communicate with the userservice.
  2. Finding the local and/or remote port number(s) that the two services communicate over.

Tip
If you’re using Gremlin Reliability Management and have your service defined in the services list, you can see your dependencies and their connection info by clicking on the service and scrolling down to Dependencies.

We know that our frontend connects to the service at userservice.bank-of-anthos.svc.cluster.local, so we’ll enter that in the Hostnames field:

Configuring the parameters of a latency experiment in Gremlin

Click Update Scenario to save your changes, then click Save Scenario. When you’re ready to run the Scenario, click Run Scenario.

The first thing Gremlin does is run the Health Check to ensure the service is in a healthy state. We can monitor the status of the Health Check in real time by clicking on it while the Scenario is running. Next, the latency experiment starts, and Gremlin begins injecting latency into IP packets heading to our dependency. While this is happening, our Health Check repeats every 12 seconds. Fortunately, there doesn’t appear to be any significant negative impact on performance, as our response time hovers around 1000 ms, well within our two-second threshold:

Health Check status during a Scenario run. All criteria are green.

Once the Scenario finishes, we can record our observations:

The Scenario results screen showing a successful test and documented observations

Next steps: testing your services against increasing amounts of latency

100ms of latency isn’t a lot, even for chatty services. HubSpot recommends a response time of 100 to 500 ms, with users starting to notice delays at around 1–2 seconds. So, let’s tweak our Scenario to test higher amounts of latency.

We’ll keep the original 100ms experiment, but we’ll reduce its runtime to one minute. We’ll then duplicate it twice and increase the latency amounts to 500ms, then 1000ms (one full second). We’ll also add five-second delays between each experiment so traffic can return to normal in between:

Three sequential latency tests over a course of three minutes

Even at these higher latency amounts, our website never exceeds the 2000ms threshold we set. We can confidently say we’re resilient to latency in our userservice dependency!

A successful Scenario results screen using the updated Scenario with three tests

Other dependency tests you should run

This blog only covers dependency latency. What do you do if one of your dependencies outright fails? We’ll answer this in a future blog post by looking at the Dependencies: Failure Test Scenario. In the meantime, keep tweaking your latency Scenario until you find the point where the Health Check fails. This way, you’ll know the limits of your resilience and whether you need to modify your resilience mechanisms.

If you don’t yet have a Gremlin account and want to ensure your systems are reliable, sign up for a free 30-day Gremlin trial and run your first Recommended Scenario in minutes.

Andre Newman
Sr. Reliability Specialist