When it comes to building reliable and scalable software, few organizations have as much authority and expertise as Google. Their Site Reliability Engineering book, first published in 2016, details the practices Google uses to maintain reliability at scale. But when you have over a million servers running thousands of services across more than twenty data centers, how do you monitor them in a consistent, logical, and relevant way? The answer is the four Golden Signals: latency, traffic, error rate, and resource saturation.
In this blog, we explain what the Golden Signals are, how they work, and how they can make monitoring complex distributed systems easier.
First, let's define the Golden Signals:
- Latency is the amount of time it takes to serve a request.
- Traffic is the volume of requests that a system is currently handling.
- Error rate is the percentage of requests that fail or return an unexpected response.
- Resource saturation is the percentage of available resources being consumed.
Latency is best measured as the difference between when a user sends a request and when the user receives a response. From the user's perspective, it's the delay between performing an action and receiving feedback: for example, how long it takes a webpage to refresh after submitting a form. Lower latency is better because this means your systems are responding faster and not keeping users waiting.
Latency can also be used to measure the response time between services. For example, if we have a web service that talks to a database, the time it takes the database to respond to a query from the web service is also measured as latency. This type of latency can be even more critical: a single user request often fans out into many backend calls, so a small delay added at each hop compounds into a much slower overall response. Measuring both kinds of latency is essential, but the latency between the user's request and response is most relevant to your user experience.
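Because a handful of slow requests can ruin the experience for the users who hit them, latency is usually summarized with percentiles rather than a simple average. A minimal sketch (the sample values and the nearest-rank method here are illustrative):

```python
def latency_percentile(samples_ms, pct):
    """Nearest-rank percentile of a list of latency samples (in ms).

    Sorts the samples and picks the value at rank ceil(pct/100 * n).
    """
    if not samples_ms:
        raise ValueError("no latency samples")
    ordered = sorted(samples_ms)
    rank = max(1, -(-pct * len(ordered) // 100))  # ceil(pct * n / 100)
    return ordered[rank - 1]

# Hypothetical samples: most requests are fast, a few are slow.
samples = [12, 15, 14, 13, 200, 16, 14, 15, 13, 400]
p50 = latency_percentile(samples, 50)  # -> 14
p95 = latency_percentile(samples, 95)  # -> 400: the tail dominates
```

Note how the median (p50) looks healthy while the 95th percentile exposes the slow outliers — which is exactly why averages alone are misleading for latency.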
At a high level, traffic is the amount of demand users place on your application. The actual definition depends on the service you're monitoring: for example, an e-commerce shop might measure traffic as orders per second, while a bank might measure transactions per second. As a starting place, consider measuring:
- HTTP requests or page loads per second for a web service.
- API requests per second for a backend service or public-facing API.
- Transactions per second for a database or file storage service.
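Whichever unit you choose, the measurement itself can be as simple as counting events inside a sliding time window. A minimal sketch (the class name, window size, and timestamps are illustrative; timestamps are passed in by the caller to keep the counter easy to test):

```python
from collections import deque

class TrafficCounter:
    """Count requests in a sliding time window to derive a rate."""

    def __init__(self, window_s=60):
        self.window_s = window_s
        self.timestamps = deque()

    def record(self, now):
        """Record one request arriving at time `now` (in seconds)."""
        self.timestamps.append(now)
        # Drop requests that have fallen out of the window.
        while self.timestamps and now - self.timestamps[0] >= self.window_s:
            self.timestamps.popleft()

    def rate_per_second(self):
        return len(self.timestamps) / self.window_s

counter = TrafficCounter(window_s=10)
for t in [0, 1, 2, 2, 3, 11]:  # hypothetical arrival times in seconds
    counter.record(t)
# The requests at t=0 and t=1 have aged out; 4 remain in the window.
```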
No service successfully processes every user request. Maybe a request was formatted improperly and contained invalid data, a bug in our backend code caused the service to crash, or a network problem dropped the connection to the user. Some portion of requests will always fail. When measured as a percentage of all requests, this is known as the error rate.
Tracking errors can be problematic depending on the cause and location of the error. Errors detected by backend systems are fairly easy to track since we can pull up logs, metrics, and traces on our systems to dig into the cause. Errors detected by users are more complicated since we don't have direct insight into those systems. We can also have errors buried within successful requests: for example, if a user successfully connects to our website but receives an HTTP 404 error on a webpage, should that count as an error? We need to differentiate between errors and successful requests on both an infrastructure and a service level to understand how our systems impact the user experience.
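Once you've decided which responses count as failures — here, anything in the 4xx or 5xx range, which is one common convention — computing the rate is straightforward. A minimal sketch with made-up status codes:

```python
def error_rate(status_codes):
    """Percentage of requests whose HTTP status is 4xx or 5xx.

    In practice you may want to count 4xx (client-side problems like
    the 404 example above) separately from 5xx (server-side failures).
    """
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if code >= 400)
    return 100.0 * errors / len(status_codes)

codes = [200, 200, 404, 200, 500, 200, 200, 200, 200, 200]
rate = error_rate(codes)  # -> 20.0 (2 failures out of 10 requests)
```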
Saturation measures how much of a given resource is being consumed at a time. At an infrastructure level, this includes CPU, memory (RAM), disk, and network utilization. Reaching 100% saturation on any resource could lead to performance drops, higher latency, reduced throughput, and a higher error rate. It's also possible to measure saturation on a per-service level using a resource management system, like Kubernetes' pod resource limits.
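At its core, saturation is just consumption divided by capacity. A minimal sketch using hypothetical readings for a single host:

```python
def saturation_pct(used, total):
    """Resource saturation as a percentage of total capacity."""
    if total <= 0:
        raise ValueError("total capacity must be positive")
    return 100.0 * used / total

# Hypothetical readings for one host.
cpu = saturation_pct(used=6, total=8)       # 6 of 8 cores busy -> 75.0
memory = saturation_pct(used=28, total=32)  # 28 GiB of 32 GiB  -> 87.5
```

The same calculation applies at the service level — for example, comparing a container's usage against its configured resource limit rather than the host's physical capacity.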
A major benefit of distributed systems is the ability to add or remove capacity in direct response to saturation. If saturation gets too high, we can scale up by adding additional capacity, and likewise, if saturation gets low, we can scale down by removing capacity. This ensures our systems are always performant enough to meet user demand while saving costs during periods of little or no demand.
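A threshold-based policy like the one described above can be sketched in a few lines (the thresholds are illustrative; real autoscalers also smooth measurements over time so brief spikes don't cause capacity to flap up and down):

```python
def scaling_decision(saturation, scale_up_at=80.0, scale_down_at=30.0):
    """Decide whether to add or remove capacity based on saturation (%)."""
    if saturation >= scale_up_at:
        return "up"    # demand is close to capacity: add instances
    if saturation <= scale_down_at:
        return "down"  # capacity is mostly idle: remove instances
    return "hold"
```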
The Golden Signals are the four key metrics teams should monitor on each user-facing system. User-facing means production systems that end users and customers interact with directly. This includes infrastructure like servers and networks, as well as applications and services.
Golden Signals come out of observability, which is the ability to measure the internal state of a system based on its external outputs. Observability is critical to understanding how a system works, whether it's meeting its operational requirements, and if it's at risk of failing. It essentially lets you "peek inside" a system to monitor for problems and troubleshoot and resolve issues.
However, modern systems have many moving parts, and each of these parts generates a ton of observability data. Collecting this data isn't too difficult, but making sense of it, identifying patterns, and removing extraneous noise is extremely difficult. As an example, open your computer's Task Manager or Activity Monitor. Note how much information is shown on this screen and how frequently it updates. Imagine recording all of this information in one location and repeating this process for hundreds of thousands of other computers. It's an incredible amount of data, and much of it is unnecessary since it doesn't tell us anything about the user experience. We can make assumptions, but without additional context, most of it is meaningless.
This is where the Golden Signals become valuable. The Golden Signals aren't just a starting point for observability. They paint a clear (albeit incomplete) picture of how our systems are operating. With these four data points, we can tell how a system is performing, how close it is to its max capacity, any unexpected problems or errors it generates, and how many users it can reasonably serve.
Golden Signals shouldn't just measure what your systems are doing. Ultimately what matters most is what your users experience when interacting with your services, and your Golden Signals should reflect that. For this reason, application performance monitoring (APM) tools are ideal for tracking metrics like Golden Signals since they can collect and categorize them on a per-service basis. This also makes it easier to define our metrics.
For example, imagine our service is a basic web application. We can clearly define:
- Latency as the difference in time between our web server receiving a request and sending a response.
- Traffic as the number of HTTP requests received in one hour.
- Error rate as the percentage of requests that return an HTTP 4xx or 5xx status code.
- Saturation as the percentage of available CPU and RAM consumed by our web server.
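Under definitions like these, all four signals for one reporting window can be computed from per-request records plus a resource reading. A minimal sketch with made-up data:

```python
def golden_signals(requests, cpu_used, cpu_total):
    """Summarize the four Golden Signals for one reporting window.

    Each request is a (latency_ms, status_code) pair; traffic here is
    simply the request count for the window.
    """
    latencies = sorted(lat for lat, _ in requests)
    n = len(requests)
    p95_rank = max(1, -(-95 * n // 100))  # nearest-rank 95th percentile
    errors = sum(1 for _, code in requests if code >= 400)
    return {
        "latency_p95_ms": latencies[p95_rank - 1],
        "traffic": n,
        "error_rate_pct": 100.0 * errors / n,
        "saturation_pct": 100.0 * cpu_used / cpu_total,
    }

window = [(12, 200), (15, 200), (14, 404), (13, 200), (250, 500)]
signals = golden_signals(window, cpu_used=3, cpu_total=4)
# -> p95 latency 250 ms, 5 requests, 40.0% errors, 75.0% saturation
```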
This gives us the context necessary to follow these metrics in a meaningful way and set relevant alerting thresholds. We can also use Golden Signals to proactively understand our reliability and the user experience by pairing them with reliability testing and scoring applications, such as those available in Gremlin.
So now you know what the Golden Signals are, how to use them to understand your systems better, and how they can help improve your user experience. The next question is: how do you get started using them?
The best option here is to see if your observability provider has any guides on setting up Golden Signals. For example, Datadog, New Relic, and SolarWinds provide tutorials for configuring and monitoring Golden Signals. Identify the relevant metrics, create monitors to track those metrics, then set up alerts to notify the service owners if the metrics indicate an unhealthy, unstable, or undesirable state.
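Whatever tool you use, the core of a monitor is comparing a metric against a threshold. A minimal sketch (the metric names and thresholds are illustrative; production monitors usually also require the condition to hold for several evaluation periods before notifying anyone, to avoid noisy alerts):

```python
def check_alert(metric, value, threshold, above_is_bad=True):
    """Return an alert message if the metric breaches its threshold."""
    breached = value > threshold if above_is_bad else value < threshold
    if breached:
        return f"ALERT: {metric} is {value} (threshold {threshold})"
    return None

alerts = [a for a in (
    check_alert("error_rate_pct", 7.5, threshold=5.0),  # fires
    check_alert("latency_p95_ms", 180, threshold=300),  # healthy
) if a is not None]
```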
Once you've set up your monitors, you can ensure they're working and accurately reflecting your service's state by running proactive reliability tests. Gremlin provides a suite of tests that integrate with your Golden Signals to identify failure modes in your systems without waiting for an incident. For example, you can:
- Consume CPU or RAM to make sure resource saturation is being properly monitored.
- Add a short delay to network packets to observe the impact on traffic and latency.
- Reboot or terminate a host to ensure your error rate monitor sends out an alert.
This ensures your signals are set up correctly, report accurate information about your systems, and stay up to date.
To learn more about how Golden Signals can help you improve the reliability of your services, visit gremlin.com/demo.