Reliability Intelligence: Tune monitors for sensitivity to failures

Supported platforms:

N/A

If a failure occurs but your monitor doesn't trigger, your monitoring thresholds, scope, or coverage likely need adjustment. Effective monitors focus on measuring what impacts your users, such as request latency or the other Golden Signals. For systems and teams to react to issues in a timely manner, monitors must be sensitive enough to catch failures early.

Define Clear, Actionable Signals

Google’s SRE Book outlines four important signals of health for a user-facing system: latency, traffic, errors, and saturation. Gremlin recommends modeling your health checks on these signals to achieve test results that closely reflect system reliability. Similarly, health checks work best when they closely reflect the service level objectives (SLOs) of your service.

Examples of effective signals:

  • 95% of requests to my service in the past five minutes completed in less than 200ms
  • 99.9% of requests to my service in the past five minutes completed without an error
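The two example signals above can be sketched as simple threshold checks. This is an illustrative Python sketch, not a Gremlin API; the function names, sample data, and thresholds are hypothetical.

```python
# Hypothetical evaluation of the two example signals over a five-minute
# window of request samples. All names and values are illustrative.
from statistics import quantiles

def p95_under(latencies_ms, threshold_ms=200):
    """True if the 95th-percentile latency is below threshold_ms."""
    # quantiles(n=20) yields 19 cut points; index 18 is the 95th percentile
    p95 = quantiles(latencies_ms, n=20)[18]
    return p95 < threshold_ms

def error_rate_ok(outcomes, target=0.999):
    """True if the fraction of successful requests meets the target."""
    successes = sum(1 for ok in outcomes if ok)
    return successes / len(outcomes) >= target

# 1,000 requests: most complete in 120 ms, with a few slow outliers
latencies = [120] * 980 + [450] * 20
print(p95_under(latencies))           # p95 is 120 ms -> True

outcomes = [True] * 998 + [False] * 2  # 99.8% success rate
print(error_rate_ok(outcomes))         # below the 99.9% target -> False
```

Note that both checks are expressed directly in user-facing terms (latency percentiles, success ratios), which makes the monitor's pass/fail condition map cleanly onto an SLO.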

Aggregate Metrics at the Right Level

Too much aggregation in a service metric can dilute failures. For example, a monitor that aggregates a service's error rate across all availability zones may be unable to report on important errors occurring only within a single zone. Monitoring signals by individual availability zone or by critical HTTP path can be an effective way to keep monitors sensitive to important failure modes.
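The dilution effect can be made concrete with a small calculation. The zone names, request counts, and the 5% alert threshold below are all made up for illustration.

```python
# Illustration (with made-up numbers) of how cross-zone aggregation can
# hide a single-zone failure. Assume a hypothetical 5% error-rate alert.
THRESHOLD = 0.05

# (requests, errors) per availability zone; zone "c" is having an incident
zones = {
    "us-east-1a": (10_000, 50),     # 0.5% errors
    "us-east-1b": (10_000, 40),     # 0.4% errors
    "us-east-1c": (10_000, 1_200),  # 12% errors -- a real failure
}

# Aggregated monitor: one error rate across all zones
total_req = sum(r for r, _ in zones.values())
total_err = sum(e for _, e in zones.values())
aggregate_rate = total_err / total_req
print(f"aggregate: {aggregate_rate:.1%} -> triggers: {aggregate_rate > THRESHOLD}")

# Per-zone monitors: each zone evaluated on its own
for zone, (req, err) in zones.items():
    rate = err / req
    print(f"{zone}: {rate:.1%} -> triggers: {rate > THRESHOLD}")
```

In this sketch the aggregate error rate is 4.3%, which stays under the 5% threshold even though one zone is failing 12% of its requests; only the per-zone monitors fire.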

Tune Evaluation Windows to Trigger within a Reasonable Time

Long evaluation windows (hourly, for example) are a poor fit for Gremlin reliability tests, which expect monitors to trigger quickly in response to an introduced failure. During a real failure scenario, a long evaluation window delays the time it takes for your team to notice a problem with the service.
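The same dilution math applies to time windows. This sketch, with invented per-minute error rates and a hypothetical 5% threshold, shows a ten-minute error burst that an hourly window averages away but a five-minute window catches.

```python
# Sketch of how window length affects detection. A ten-minute burst of
# 20% errors, averaged into an hourly window, stays under the threshold;
# five-minute windows over the same data trigger. Numbers are illustrative.
THRESHOLD = 0.05

# Per-minute error rates for one hour, with a burst from minute 20 to 29
minutes = [0.002] * 60
for m in range(20, 30):
    minutes[m] = 0.20  # 20% of requests failing during the burst

# Hourly evaluation window: the burst is diluted below the threshold
hourly_avg = sum(minutes) / len(minutes)
print(f"hourly window: {hourly_avg:.1%} -> triggers: {hourly_avg > THRESHOLD}")

# Five-minute evaluation windows: the burst stands out immediately
for start in range(0, 60, 5):
    window = minutes[start:start + 5]
    avg = sum(window) / len(window)
    if avg > THRESHOLD:
        print(f"minutes {start}-{start + 4}: {avg:.1%} -> triggers")
```

Here the hourly average works out to 3.5%, under the threshold, while the five-minute windows covering the burst each average 20% and would trigger within minutes of the failure starting.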
