How to map out your application’s critical path

Andre Newman
Sr. Reliability Specialist
Last Updated: October 12, 2020
Categories: Chaos Engineering, SRE
In a recent blog post, we explained how every application has a critical path, which is the set of components that are essential to the application’s operation. A failure along the critical path makes your application unavailable, which means unhappy customers, reduced revenue, and a hit to your company’s reputation. For these reasons, we need to focus on making our critical path as reliable as possible. But to do that, we first need to know which components are part of it. We need to consider:

  • Tightly coupled dependencies, such as code libraries. Problems can occur when a library changes or disappears unexpectedly.
  • Internal dependencies owned by other teams, such as databases, security/authentication systems, and message pipelines. For example, LinkedIn has 10,000+ unique dependencies that interact in complex ways, requiring a complete dependency management and testing service.
  • External dependencies such as SaaS services and cloud platform providers. When services like Amazon S3 become unavailable, teams lose a core part of their applications.

In this tutorial, we’ll show you how to identify the parts of your critical path using Chaos Engineering. We’ll identify each of the components that make up our application, run chaos experiments to determine how essential they are, observe how our systems respond, and use this information to map out our path. This way, we can determine where to focus our efforts on improving reliability while learning more about how our applications work.

Prerequisites

Before starting this tutorial, you’ll need:

  • A Gremlin account, with the Gremlin agent installed on your Kubernetes cluster
  • A Kubernetes cluster running the Online Boutique demo application
  • kubectl access to the cluster so you can view service logs

Step 1 - Identify your application’s components

First, we need to identify the different components, services, and dependencies that make up our application. If you already have an architecture diagram, great! If not, draft a high-level diagram showing these components and how they’re related. Draw a line between services that are linked or networked together, as this indicates a dependency. For simplicity, focus only on application components and services, not infrastructure or network topology. For example, Online Boutique provides the following diagram:

Map of the Online Boutique microservice application

Based on this diagram alone, we can make some assumptions about the critical path. Remember that the critical path is the set of components that must be up and running for our application to perform its core function. For Online Boutique—as with any e-commerce site—this means letting customers browse products, add to cart, and place orders.

We can assume that the Frontend service is part of this path because it’s the point of entry for customer traffic. We can also assume that the ProductCatalogService, CheckoutService, PaymentService, CartService, and Redis are part of the critical path, since they handle the product catalog, checkout, payment processing, and the shopping cart (with Redis serving as the cart’s datastore). If we highlight these services, our diagram now looks like this:

Application map of Online Boutique with the critical path highlighted

This is a good starting point, but how do we know that we’ve identified every critical service? We can assume that the AdService and EmailService aren’t required for customers to place orders, but how do we know this for sure? To test our assumptions, we’ll use Gremlin to simulate an outage in one of these services, try performing our application’s core function, and use our observations to determine whether the service is critical. By repeating this process with different services, we can create a clearly defined map of our critical path.

Step 2 - Choose a service to experiment on

Next, we need to choose which service to test. We could pick a service that we’re confident is part of our critical path, like the Frontend, but for this experiment let’s pick one that we’re not so sure about, like the EmailService.

EmailService is a backend service that gets called by CheckoutService whenever a customer places an order. CheckoutService sends the order details, and EmailService sends a confirmation email to the customer. Ideally, this process should happen asynchronously: customers should be able to complete their orders without waiting on the confirmation email. We’ll design a chaos experiment around this assumption, then use Gremlin to test it.

Our hypothesis is this: if the EmailService is down, customers can place orders without noticing any changes in application performance or latency. We’ll simulate an outage by using a blackhole attack to block all network traffic between the CheckoutService and EmailService. For safety, we’ll abort the test if the attack causes orders to fail, as this tells us that EmailService is part of the critical path. Our chaos experiment looks like this:

  • Hypothesis: EmailService is not part of our critical path, and an outage won’t prevent customers from making purchases.
  • Experiment parameters: Run a blackhole attack on the “checkoutservice” Pod, dropping all traffic to the “emailservice” hostname.
  • Abort conditions: Halt the attack if customers can’t make purchases.

Step 3 - Run a blackhole attack using Gremlin

Now that we’ve defined our experiment, let’s run the attack. First, open your application in a web browser so that you can directly observe the impact by trying to place orders while the attack is running.

  1. Log into the Gremlin web app and select “Create Attack” from the Dashboard.
  2. Select “Kubernetes” and choose the cluster and namespace where your application is deployed.
  3. Expand “Deployments” and select “checkoutservice” as the target.
  4. Expand “Choose a Gremlin” and select “Network”, then “Blackhole”.
  5. In the “Hostnames” field, enter “emailservice”.
  6. Click “Unleash Gremlin” to run the attack.

Selecting the checkoutservice in the Gremlin web app

Configuring a blackhole attack in the Gremlin web app

While the attack is running, switch back to your browser and place an order. When you click “Place Order,” the page gets stuck in a loading state, and the order only completes after 20–30 seconds. If we check the logs for the CheckoutService, we’ll see connection errors related to the EmailService:

BASH

kubectl logs deploy/checkoutservice
{
  "message":"failed to send order confirmation to \"someone@example.com\": rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.43.247.203:5000: i/o timeout\"",
  "severity":"warning",
  "timestamp":"2020-09-15T19:48:40.701358275Z"
}

If we look at the source code for the CheckoutService, we see that it makes an RPC call to the EmailService. This call is synchronous, meaning it will block execution until it receives a response. And while there is a timeout set, it’s long enough for customers to become frustrated while waiting for the site to load. EmailService is effectively part of our critical path even though we didn’t intend for it to be.
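
To make this concrete, here’s a minimal Go sketch of the synchronous pattern. The EmailClient interface, method name, and function below are illustrative stand-ins, not the actual Online Boutique code, which uses a gRPC client generated from its protobuf definitions:

GO

package checkout

import (
	"context"
	"log"
	"time"
)

// EmailClient stands in for the generated EmailService gRPC client.
// The method name and signature here are illustrative only.
type EmailClient interface {
	SendOrderConfirmation(ctx context.Context, email, orderID string) error
}

// placeOrder shows the synchronous pattern: checkout blocks on the
// confirmation call until EmailService responds or the timeout expires.
func placeOrder(emailSvc EmailClient, email, orderID string) error {
	// ...charge the card, create the shipment, empty the cart...

	ctx, cancel := context.WithTimeout(context.Background(), 20*time.Second)
	defer cancel()

	// If emailservice is blackholed, this call hangs for the full timeout
	// before failing, and the customer waits along with it.
	if err := emailSvc.SendOrderConfirmation(ctx, email, orderID); err != nil {
		log.Printf("failed to send order confirmation to %q: %v", email, err)
	}
	return nil
}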

How do we fix this? Order confirmation emails are important, but they don’t need to be delivered immediately. We can make the call to EmailService asynchronous so that control returns to the checkout flow right after the RPC call is made. The problem with this approach is that if the EmailService is down, we’d need to build our own retry mechanism or risk never emailing the customer.
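
Reusing the hypothetical EmailClient from the sketch above (in the same package), the asynchronous, fire-and-forget version might look like the following. Checkout no longer waits on EmailService, but a failed RPC means the email is simply dropped unless we add retry or queueing logic ourselves:

GO

// sendConfirmationAsync decouples checkout latency from EmailService, but
// offers no delivery guarantee: if the RPC fails, the email is lost.
func sendConfirmationAsync(emailSvc EmailClient, email, orderID string) {
	go func() {
		ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
		defer cancel()
		if err := emailSvc.SendOrderConfirmation(ctx, email, orderID); err != nil {
			log.Printf("order confirmation email to %q failed: %v", email, err)
		}
	}()
}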

A more advanced—but more effective—solution would be to use a message pipeline like Apache Kafka to broker messages between the two services. This decouples the EmailService from the CheckoutService, allowing both services to send and consume data at their own speed. This adds another layer of complexity to our stack, but it will let us tolerate dependency failures, guarantee message delivery, and ensure a great customer experience.
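
As a rough sketch of that pattern, CheckoutService could publish the order details to a topic and let EmailService consume them on its own schedule. The example below uses the segmentio/kafka-go client; the topic name, message format, and broker address are assumptions for illustration:

GO

package checkout

import (
	"context"
	"encoding/json"

	kafka "github.com/segmentio/kafka-go"
)

// PublishOrderConfirmation hands the order details to a Kafka topic instead
// of calling EmailService directly. If EmailService is down, checkout still
// succeeds; the message waits in the topic until it's consumed.
func PublishOrderConfirmation(ctx context.Context, w *kafka.Writer, email, orderID string) error {
	payload, err := json.Marshal(map[string]string{
		"email":   email,
		"orderId": orderID,
	})
	if err != nil {
		return err
	}
	return w.WriteMessages(ctx, kafka.Message{
		Key:   []byte(orderID), // keeps events for one order on one partition
		Value: payload,
	})
}

The writer would be configured once at startup (for example, &kafka.Writer{Addr: kafka.TCP("kafka:9092"), Topic: "order-confirmations"}), and EmailService would run a consumer on the same topic, retrying failed sends without involving CheckoutService.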

Next Steps

We found a hidden dependency in one service, but what about other services, like the ShippingService, RecommendationService, or AdService? What about third-party dependencies like mailing services, managed databases, and external payment processors? We can repeat this same experiment across each of those services to test whether they’re also part of our critical path. We should also test the services we believe are on the critical path, such as the CurrencyService: they might turn out to be non-critical, and if they are critical, testing shows whether we can tolerate their failure. Using Gremlin, we can run all of these scenarios across our entire stack quickly and safely.

Additionally, dependencies can fail in more ways than just becoming unavailable. Latency, packet corruption, and elevated resource consumption can cause problems that are just as bad as, if not worse than, a complete outage. Which failure modes to test will vary depending on the type of dependency and how our application interacts with it. For example:

  • Third-party dependencies, such as SaaS services and cloud platforms, can become unavailable due to network outages, DNS failures, high latency, and packet corruption. We can test these conditions using Blackhole, DNS, Latency, and Packet Loss attacks. Some cloud platforms experience clock drift, but we can prepare for this by using a Time Travel attack.
  • Databases are a common cause of poor performance due to unoptimized queries, which cause excessive CPU/disk consumption. We can test these by using resource attacks to see how our application and databases perform when the system is under stress.
  • Internal dependencies, such as services owned and operated by other teams, have their own risks of outages. We can prepare for this by using Blackhole attacks to drop network traffic to these services.

While we’d like to guarantee four or five nines of uptime for every service we manage, we often don’t have the time or resources, and not all of our services are equally critical to the user experience. Reliability is an incremental process: by focusing on the applications and services that are most essential to our business, we can greatly reduce the risk of an outage taking down our core operations.
