Chaos Engineering Monitoring & Metrics Guide

Introduction

Before running your first Chaos Engineering experiments, it is important to collect a specific set of metrics: infrastructure monitoring metrics, alerting/on-call metrics, high severity incident (SEV) metrics, and application metrics. If you do not yet have an incident management program in place, read our guide on How To Establish A High Severity Incident Management Program.

Why Is It Important To Collect Baseline Metrics For Chaos Engineering?

If you do not collect metrics before practicing your Chaos Engineering experiments, it will be extremely difficult to determine whether your experiments have had the intended impact. It will also be difficult to define your success metrics and set goals for your teams.

When you do collect baseline metrics, you will be able to answer questions such as:

  • What are the top 5 services with the highest alert counts?
  • What are the top 5 services with the highest incident counts?
  • What is an appropriate goal to set for incident reduction in the next 3 months? 2x? 10x?
  • What are the top 3 causes of a CPU spike?
  • What are the typical downstream/upstream effects when the CPU is spiking?

How Do You Collect Baseline Metrics For Chaos Engineering?

To collect the most useful metrics for your Chaos Engineering experiments, make sure you cover the following areas:

Infrastructure Monitoring Metrics

Infrastructure monitoring software will enable you to measure trends such as disks filling up, CPU spikes, IO spikes, redundancy, and replication lag. You can collect appropriate monitoring metrics using software such as Datadog, New Relic, and SignalFx.

Collecting these metrics involves two simple steps:

  1. Roll out your chosen monitoring agent across your fleet (e.g. inside a Docker container)
  2. Use the out-of-the-box monitoring metrics and dashboards provided by your software
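While your agents are rolling out, you can sanity-check a host with a point-in-time resource snapshot using nothing but the Python standard library. This is a minimal sketch for a Unix host, not a replacement for a monitoring agent, and the field names are illustrative:

```python
import os
import shutil
import time

def baseline_snapshot(path="/"):
    """Capture a point-in-time snapshot of basic host resources."""
    disk = shutil.disk_usage(path)          # total/used/free bytes for the mount
    load1, load5, load15 = os.getloadavg()  # 1/5/15-minute load averages (Unix only)
    return {
        "timestamp": time.time(),
        "cpu_count": os.cpu_count(),
        "load_1m": load1,
        "load_5m": load5,
        "load_15m": load15,
        "disk_total_bytes": disk.total,
        "disk_used_pct": round(100 * disk.used / disk.total, 1),
    }

snapshot = baseline_snapshot()
```

Recording a handful of these snapshots before an experiment gives you a crude "before" picture even if your dashboards are not yet in place.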

Alerting and On-Call Metrics

Alerting and on-call software will enable you to measure total alert counts by service per week, time to resolution for alerts per service, noisy (self-resolving) alerts by service per week, and the top 20 most frequent alerts per week for each service.

Software that you can use to collect alerting and on-call metrics includes PagerDuty, VictorOps, OpsGenie, and Etsy's OpsWeekly.
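Once alerts are flowing into one of these tools, you can export them and compute the baselines above yourself. A hedged sketch using the standard library, assuming each alert is exported as a dict with service, name, resolution time, and a self-resolved flag (the field names are our own illustration, not any vendor's schema):

```python
from collections import Counter

# Illustrative export: one dict per alert for a single week.
alerts = [
    {"service": "checkout", "name": "HighCPU",  "minutes_to_resolve": 12, "self_resolved": False},
    {"service": "checkout", "name": "HighCPU",  "minutes_to_resolve": 3,  "self_resolved": True},
    {"service": "search",   "name": "DiskFull", "minutes_to_resolve": 45, "self_resolved": False},
]

# Total alert counts by service for the week.
counts_by_service = Counter(a["service"] for a in alerts)
# Noisy alerts: those that resolved on their own.
noisy_by_service = Counter(a["service"] for a in alerts if a["self_resolved"])
# Most frequent alerts (top 20) across services.
top_alerts = Counter((a["service"], a["name"]) for a in alerts).most_common(20)

def mean_time_to_resolution(service):
    """Average minutes to resolve alerts for one service."""
    times = [a["minutes_to_resolve"] for a in alerts if a["service"] == service]
    return sum(times) / len(times)
```

Run weekly, these few aggregations are enough to answer the "top 5 services" questions from the introduction.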

High Severity Incident (SEV) Metrics

Establishing a High Severity Incident (SEV) Management Program will enable you to create SEV levels (e.g. 0, 1, 2, and 3), measure the total count of incidents per week by SEV level, measure the total count of SEVs per week by service, and measure the mean time to detection (MTTD), mean time to resolution (MTTR), and mean time between failures (MTBF) for SEVs by service. Read our guide on How To Establish A High Severity Incident Management Program.
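All three measures fall out of incident timestamps directly. A minimal sketch, assuming each SEV record carries started/detected/resolved times (field names are illustrative, and note that some teams define MTTR from start rather than from detection):

```python
from datetime import datetime

# Illustrative SEV records for one service, ordered by start time.
sevs = [
    {"started": datetime(2023, 1, 1, 10, 0), "detected": datetime(2023, 1, 1, 10, 5),  "resolved": datetime(2023, 1, 1, 11, 0)},
    {"started": datetime(2023, 1, 8, 10, 0), "detected": datetime(2023, 1, 8, 10, 15), "resolved": datetime(2023, 1, 8, 12, 0)},
]

def minutes(delta):
    return delta.total_seconds() / 60

# MTTD: how long incidents go unnoticed, on average.
mttd = sum(minutes(s["detected"] - s["started"]) for s in sevs) / len(sevs)
# MTTR: detection-to-resolution time, on average.
mttr = sum(minutes(s["resolved"] - s["detected"]) for s in sevs) / len(sevs)
# MTBF: average gap between consecutive incident starts.
gaps = [minutes(b["started"] - a["started"]) for a, b in zip(sevs, sevs[1:])]
mtbf = sum(gaps) / len(gaps)
```

With the sample data above, MTTD is 10 minutes, MTTR is 80 minutes, and MTBF is one week (10,080 minutes).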

Application Metrics

Observability metrics will enable you to monitor your applications. This is particularly important when you are practicing application level failure injection. Software that you can use to collect application metrics includes Sentry and Honeycomb.

Collecting these metrics involves two simple steps:

  1. Install an SDK for your application language of choice
  2. Configure the SDK

Out of the box, SDKs will attempt to hook into your runtime environment or framework to automatically report fatal errors. However, in many situations it's useful to manually report errors or messages.
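With Sentry's Python SDK, for example, the manual calls are `sentry_sdk.capture_exception` and `sentry_sdk.capture_message`. The sketch below wraps the former behind a small helper that falls back to logging when the SDK is not installed; the helper name and fallback behaviour are our own illustration, not part of the SDK:

```python
import logging

try:
    import sentry_sdk  # pip install sentry-sdk; call sentry_sdk.init(dsn=...) at startup
    HAVE_SENTRY = True
except ImportError:
    HAVE_SENTRY = False

log = logging.getLogger("errors")

def report_error(exc):
    """Manually report a caught exception (hypothetical helper)."""
    if HAVE_SENTRY:
        sentry_sdk.capture_exception(exc)
    else:
        log.error("unreported error", exc_info=exc)
    return "sentry" if HAVE_SENTRY else "log"

try:
    1 / 0
except ZeroDivisionError as e:
    destination = report_error(e)
```

Wrapping the SDK call this way also gives you one place to add context before failure injection experiments begin.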

What Baseline Metrics Should You Collect For Chaos Engineering?

You should aim to collect the following metrics before you get started practicing Chaos Engineering:

Infrastructure Monitoring Metrics

  • Resource: CPU, IO, Disk & Memory
  • State: Shutdown, Processes, Clock Time
  • Network: DNS, Latency, Packet Loss

Alerting and On-Call Metrics

  • Total alert counts by service per week
  • Time to resolution for alerts per service
  • Noisy alerts by service per week (self-resolving)
  • Top 20 most frequent alerts per week for each service

High Severity Incident (SEV) Metrics

  • Total count of incidents per week by SEV level
  • Total count of SEVs per week by service
  • MTTD, MTTR and MTBF for SEVs by service

Application Metrics

  • Events
  • Stack traces
  • Context
  • Breadcrumbs
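To make those four terms concrete, here is a toy event payload assembled with the standard library. The shape loosely mirrors what error-tracking SDKs send, but the exact fields are illustrative:

```python
import traceback

breadcrumbs = []  # recent application actions leading up to the error

def leave_breadcrumb(message):
    breadcrumbs.append(message)

def build_event(exc, context):
    """Assemble an illustrative error event from a caught exception."""
    return {
        "event": type(exc).__name__,
        "stack_trace": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
        "context": context,               # e.g. release, environment, user id
        "breadcrumbs": list(breadcrumbs), # snapshot of actions so far
    }

leave_breadcrumb("user clicked checkout")
try:
    {}["missing"]
except KeyError as e:
    event = build_event(e, {"release": "1.2.3", "environment": "staging"})
```

During application-level failure injection, the context and breadcrumbs are what let you tie an error event back to the specific experiment that triggered it.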

Conclusion

By taking the time to collect baseline metrics before you practice Chaos Engineering, you will set your company up for success from Day 0. It is important to cover infrastructure monitoring metrics, alerting/on-call metrics, high severity incident (SEV) metrics, and application metrics. The majority of these metrics can be captured with out-of-the-box software that is readily available.

Avoid downtime. Use Gremlin to turn failure into resilience.

Gremlin empowers you to proactively root out failure before it causes downtime. Try Gremlin for free and see how you can harness chaos to build resilient systems.