Modern applications are a web of interdependent services. As applications grow in size and complexity, and as more engineering teams adopt service-based architectures like microservices, this web becomes deeper and denser. Eventually, keeping track of the interdependencies between services becomes a complex and time-consuming task in and of itself. In addition, if any of these dependencies fails, it can have cascading impacts on the rest of your services and on the application as a whole.
In this blog post, we explore the link between service reliability and dependency reliability and offer suggestions for making your services resilient even if your dependencies aren't.
A service is a discrete set of functionality provided by one or more systems in an environment. A service running on multiple hosts is called a distributed service. Microservice applications consist of multiple separate services communicating via a network.
The key benefit of a microservices architecture compared to a more traditional monolithic architecture is that it lets developers isolate different components of their application. Each service can be deployed, updated, and managed independently of other services. This makes it much easier for service owners (typically DevOps teams) to scale, deploy fixes, and even restart their services without impacting other services. In addition, because each service provides a discrete function, there's less duplication of effort and code, as teams can call other services for the functionality they need.
Services generally have three qualities:
- They're lightweight and can be spun up or torn down quickly.
- They have discrete, clearly defined boundaries, typically exposed via APIs.
- They can be managed independently of each other, including the ability to restart, update, and scale without impacting or depending on other services.
A dependency is a software component that provides features or functionality to another component. Dependencies make it possible to separate and compartmentalize functionality so that developers don't need to build and bundle everything into a single application.
A common example is a database. Countless applications store data in databases, so engineers commonly deploy database servers alongside their services or use a managed database service like Amazon RDS. This makes the database a dependency of the application, as the application relies on it in order to operate.
The main benefit of dependencies is that they let application developers focus on building their applications without also having to build the functionality provided by the dependency. Imagine if every development team had to build and integrate database logic into their applications: teams would spend so much time working on non-core features that they wouldn't be able to update their core application. The downside is that each dependency creates a new point of failure for the application. If the database fails and the developer built their application without ever anticipating a failure, then the application itself could fail. This is called a tightly coupled (or hard) dependency. On the other hand, if the developer builds the application so that it can keep operating without the dependency, for example by adding an in-memory cache between the application and the database, then it's a loosely coupled (or soft) dependency.
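To make the hard versus soft distinction concrete, here is a minimal sketch of the in-memory cache idea. The `CachedReader` class, the `DatabaseUnavailable` exception, and the simulated database are all hypothetical names invented for this illustration, not part of any real database client.

```python
class DatabaseUnavailable(Exception):
    """Raised by the (hypothetical) database client when the database is down."""


class CachedReader:
    """Reads through to the database and falls back to the last known value."""

    def __init__(self, fetch_from_db):
        self._fetch_from_db = fetch_from_db  # callable: key -> value
        self._cache = {}                     # in-memory copy of past reads

    def get(self, key):
        try:
            value = self._fetch_from_db(key)
        except DatabaseUnavailable:
            # Soft dependency: serve possibly stale data instead of failing.
            if key in self._cache:
                return self._cache[key]
            raise  # nothing cached for this key, so the failure still propagates
        self._cache[key] = value
        return value


# Simulated database that can be switched off to mimic an outage.
db_rows = {"user:1": "Ada"}
db_state = {"up": True}


def fetch_from_db(key):
    if not db_state["up"]:
        raise DatabaseUnavailable(key)
    return db_rows[key]


reader = CachedReader(fetch_from_db)
first_read = reader.get("user:1")    # database is up, value gets cached
db_state["up"] = False
second_read = reader.get("user:1")   # database is down, served from the cache
```

The trade-off is staleness: during an outage the application keeps serving the last value it saw, and keys that were never read still fail. Whether that is acceptable depends on the data.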
You might be tempted to track dependencies in a spreadsheet, a document, or in-code comments. While this might work fine for small applications, larger applications need more comprehensive methods of tracking dependencies.
Think of a relatively simple deployment, like a Node.js application running on a managed computing service like Amazon EC2. The application has clear direct dependencies like Node.js and Amazon EC2, but it also has indirect (or transitive) dependencies like:
- Amazon VPC for networking
- Amazon Route 53 for DNS
- AWS IAM for security and access control
- Amazon RDS for SQL databases or Amazon DynamoDB for NoSQL databases
- Amazon S3 for data storage
Even this simple Node.js application could end up with a massive, sprawling dependency tree.
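One way to see how quickly such a tree grows is to walk it programmatically. The sketch below assumes you already have a map of direct dependencies; the edges in `DIRECT_DEPENDENCIES` are made up for illustration and are not a statement about how AWS services depend on each other internally.

```python
from collections import deque

# Hypothetical dependency map: each key depends directly on the listed items.
DIRECT_DEPENDENCIES = {
    "node-app": ["Node.js", "Amazon EC2", "Amazon RDS"],
    "Amazon EC2": ["Amazon VPC"],
    "Amazon RDS": ["Amazon VPC", "Amazon S3"],
}


def transitive_dependencies(service, graph):
    """Return every direct and indirect dependency of `service`."""
    seen = set()
    queue = deque(graph.get(service, []))
    while queue:
        dep = queue.popleft()
        if dep in seen:
            continue  # already visited, which also guards against cycles
        seen.add(dep)
        queue.extend(graph.get(dep, []))
    return seen


all_deps = transitive_dependencies("node-app", DIRECT_DEPENDENCIES)
# Dependencies the application never references directly.
indirect_deps = all_deps - set(DIRECT_DEPENDENCIES["node-app"])
```

With this sample map, `indirect_deps` contains Amazon VPC and Amazon S3, neither of which appears in the application's own list.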
This isn't meant to criticize the AWS ecosystem, of course, but rather to illustrate just how complex and interdependent modern applications are.
So what does this mean for our applications? Can we only be as reliable as our dependencies? Not at all, but we need to be aware of our dependencies and how they could impact our application.
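A quick back-of-the-envelope calculation shows why awareness matters. If every dependency is a hard dependency and failures are independent (a simplifying assumption), the availabilities multiply. The availability figures below are hypothetical and chosen only for illustration.

```python
import math


def availability_with_hard_dependencies(own, dependencies):
    """Availability when every dependency must be up, assuming independent failures."""
    return own * math.prod(dependencies)


def downtime_minutes_per_30_days(availability):
    """Expected minutes of downtime over a 30-day period."""
    return (1 - availability) * 30 * 24 * 60


service_availability = 0.999                    # hypothetical: our own code and hosts
dependency_availabilities = [0.999, 0.999, 0.9995]  # hypothetical hard dependencies

combined = availability_with_hard_dependencies(
    service_availability, dependency_availabilities
)
extra_downtime = (
    downtime_minutes_per_30_days(combined)
    - downtime_minutes_per_30_days(service_availability)
)
```

Here `combined` comes out at roughly 0.9965, lower than any individual component. Each hard dependency that is turned into a soft one drops out of the product, which is how a service can end up more reliable than the things it depends on.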
There are a few ways to identify dependencies. We'll look at three approaches: using service definitions, manually inspecting code and infrastructure, and analyzing network traffic.
Service meshes like Istio use advanced, developer-defined networking rules to connect services to each other. For example, with Istio, you can deploy your services with routing rules to control the flow of network traffic to other services. When paired with a visualization tool like Kiali, you can relatively easily build a service mesh graph based on your traffic rules. This has the added benefit of saving you a step since these rules are ones you'd already write and deploy as part of your application.
The downside, of course, is that this requires a service mesh. If you don't already use a service mesh with these features, the cost of implementing one would likely outweigh the benefits it would provide.
Manual dependency mapping relies on the knowledge and expertise of the developers and engineers who own each service. Developers can identify dependencies by reviewing their code for network requests to other services and for third-party libraries that call other services on their behalf.
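Part of that code review can be scripted. The sketch below is a rough first pass that pulls hostnames out of URLs found in source text. The sample source and hostnames are invented, and a simple pattern like this will miss endpoints that are built at runtime or loaded from configuration, so treat the output as a starting list rather than a complete one.

```python
import re
from urllib.parse import urlparse

# Matches http(s) URLs up to the first whitespace, quote, bracket, or parenthesis.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>)]+")


def find_external_hosts(source_text):
    """Return the sorted, de-duplicated hostnames referenced by URLs in source code."""
    hosts = set()
    for match in URL_PATTERN.findall(source_text):
        host = urlparse(match).hostname
        if host:
            hosts.add(host)
    return sorted(hosts)


# Invented sample source for illustration.
sample_source = '''
const users = await fetch("https://users.internal.example.com/v1/users");
const charge = await axios.post('https://payments.example.com/charge', body);
// docs: https://users.internal.example.com/docs
'''

hosts = find_external_hosts(sample_source)
```

In practice you would read each file in the repository and pass its contents to `find_external_hosts`, then review the resulting hostnames with the service owner.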
Meanwhile, SREs and operations teams can review their system architecture for dependencies. To know what to look for, consider which cloud providers and third-party services you currently use. This includes PaaS, IaaS, SaaS, and others.
The third method uses inbound and outbound network traffic to identify the external endpoints that your service communicates with. This is how dependency discovery works in Gremlin Reliability Management. When you define your service in Gremlin (whether it's a Kubernetes resource, host, or container) and select its process name, Gremlin uses traffic data to identify network resources that your service communicates with. It then lists these resources as dependencies and generates reliability tests so you can see how your service responds if your dependencies are slow or unavailable.
While this method can't identify transitive dependencies (i.e., dependencies of dependencies), it identifies the most critical dependencies for your particular service.
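The general idea behind traffic-based discovery can be sketched in a few lines. This is not Gremlin's implementation. It assumes you have already collected connection records in a hypothetical `(process name, remote host, remote port)` format, and the hosts shown are invented.

```python
from collections import Counter

# Hypothetical connection records: (process name, remote host, remote port).
connections = [
    ("node", "db.example.internal", 5432),
    ("node", "db.example.internal", 5432),
    ("node", "api.payments.example.com", 443),
    ("sshd", "10.0.0.7", 22),
    ("node", "api.payments.example.com", 443),
    ("node", "db.example.internal", 5432),
]


def discover_dependencies(records, process_name):
    """Count connections per remote endpoint for one process, busiest first."""
    counts = Counter(
        (host, port) for proc, host, port in records if proc == process_name
    )
    return counts.most_common()


dependencies = discover_dependencies(connections, "node")
```

Filtering by process name keeps unrelated traffic on the same host, such as the `sshd` connection above, out of the service's dependency list.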
As the number of dependencies increases, the difficulty of maintaining reliable services also increases. What makes this so difficult is that while we have control over our own code and infrastructure, we typically don't have control over our dependencies. We must rely on the dependency's owner to maintain their own reliability or, better yet, build our applications with the assumption that dependencies will eventually fail.
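One common way to build in that assumption is a circuit breaker, which stops calling a dependency after repeated failures and retries only after a cool-down. The version below is a minimal sketch, not a production implementation. The class name, thresholds, and injectable `clock` parameter are choices made for this example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when calls are skipped because the dependency looks unhealthy."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_seconds=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_seconds = reset_after_seconds
        self._clock = clock          # injectable so the cool-down can be tested
        self._failures = 0
        self._opened_at = None       # time the circuit opened, or None if closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_seconds:
                raise CircuitOpenError("dependency calls are temporarily suspended")
            # Cool-down elapsed: allow one trial call (half-open state).
            # A single failure now reopens the circuit immediately.
            self._opened_at = None
            self._failures = self.failure_threshold - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0           # any success closes the circuit fully
        return result
```

A caller would wrap each request to the dependency in `breaker.call(...)` and catch `CircuitOpenError` to return a fallback response, such as cached data or a degraded feature, instead of waiting on a dependency that is known to be failing.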
That's why Gremlin automatically generates reliability tests for each service's dependencies. You can test how your service behaves when a dependency is slow or unavailable. While the test runs, Gremlin monitors your services for any changes in error rate, availability, or functionality. If something unexpected happens, like a slow dependency causing a spike in service latency, Gremlin will immediately halt the test, mark it as failed, and flag latency as the cause of the failure. This way, you know which dependency caused the failure, why it failed, and where to begin troubleshooting. Once you've implemented a fix, simply repeat the reliability test to ensure your service can withstand the failure.
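The test-and-halt loop described above can be illustrated with a small simulation. This is not how Gremlin works internally. The handlers return simulated latencies in seconds rather than measuring real requests, and every name and number here is hypothetical.

```python
def run_latency_test(handler, injected_delays, latency_budget):
    """Apply growing simulated dependency delays and halt at the first breach."""
    for delay in injected_delays:
        observed = handler(dependency_delay=delay)
        if observed > latency_budget:
            # Halt immediately and report what caused the failure.
            return {"passed": False, "halted_at_delay": delay,
                    "observed_latency": observed}
    return {"passed": True, "halted_at_delay": None, "observed_latency": None}


def naive_handler(dependency_delay):
    """Waits for the dependency no matter how slow it is."""
    own_work = 0.05
    return own_work + dependency_delay


def resilient_handler(dependency_delay):
    """Gives up on the dependency after a 0.2 second timeout and uses a fallback."""
    own_work = 0.05
    timeout = 0.2
    return own_work + min(dependency_delay, timeout)


delays = [0.1, 0.5, 1.0]
naive_result = run_latency_test(naive_handler, delays, latency_budget=0.5)
resilient_result = run_latency_test(resilient_handler, delays, latency_budget=0.5)
```

The naive handler fails as soon as the injected delay pushes it past the budget, while the handler with a timeout passes every step. Rerunning the same test after a fix is what confirms the fix actually worked.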
The result is a set of highly resilient services you can trust to keep running, even if a critical dependency goes down.
Dependencies are everywhere and incredibly difficult to track. Nonetheless, we should build our applications and services under the assumption that dependencies can and will fail. Reliability Management platforms like Gremlin help you not only uncover dependencies you never knew existed but also ensure that your services can run reliably without them.