Site Reliability Engineering

Running reliable production systems

Incident repro & playbook validation for SREs

Tammy Butow (Principal SRE @ Gremlin) and Robert Ross (CEO @ Firehydrant) discuss how SREs can be proactive with Chaos Engineering.

SRE Best Practices for Incident Management

Learn about the rise of Site Reliability Engineering, and how this approach to incident management can not only coexist with a DevOps approach to development, but also strengthen it.


The SRE reliability hierarchy

An SRE's primary job is making and keeping a service or application reliable, and this involves a lot of moving pieces! The layers below follow the Service Reliability Hierarchy, according to Google, and note how Chaos Engineering can help at each one.

  • Add to your CI/CD pipeline to test for failures before staging. Add good fault tolerance to your engineering skill set.
  • Capacity planning: Replicate high-traffic events to plan for that level of traffic.
  • Testing + release procedures: Test for failures before pushing to production.
  • Postmortem analysis: Recreate past outages to prevent rollbacks.
  • Incident response: Run GameDays to train your team for IR protocols.
  • Monitoring: Tune your monitoring using real-world scenarios.

SREs and Chaos Engineering

Site Reliability Engineers have a responsibility to quantify how confident they are in the systems that they maintain. Chaos Engineering is an important discipline for validating reliability: it uses controlled experiments to test various attributes of your system, from Monitoring all the way up to the Product.


SREs measure reliability in the following ways, and there are often SLOs for each.


Availability: How much uptime does your application have (measured in 9s)?

There are two important KPIs of Availability.

  • SLA (defined and agreed to in a contract, e.g., 99.9%)
  • SLO (internal objective, usually stricter than the SLA, e.g., 99.99%)
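
To make the 9s concrete, here is a minimal sketch that converts an availability target into the downtime it allows. The 30-day window and the function name `allowed_downtime_minutes` are assumptions chosen for illustration; the 99.9% and 99.99% targets are the examples given above.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(availability_pct: float, days: int = 30) -> float:
    """Return the minutes of downtime an availability target permits.

    The 30-day default window is an assumption; use whatever period
    your SLA or SLO is actually measured over.
    """
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return (1 - availability_pct / 100) * days * MINUTES_PER_DAY


sla_budget = allowed_downtime_minutes(99.9)   # external SLA example
slo_budget = allowed_downtime_minutes(99.99)  # stricter internal SLO example

print(f"SLA 99.9%:  {sla_budget:.1f} min per 30 days")
print(f"SLO 99.99%: {slo_budget:.1f} min per 30 days")
```

Over 30 days, 99.9% allows about 43.2 minutes of downtime, while 99.99% allows about 4.3 minutes. The gap between the two is the margin an internal objective gives you before the contractual commitment is at risk.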


Durability: How resilient is your system to data loss?

This can be measured in 9s as well. Your primaries have replicas beneath them, and beneath those you have your backups. The more layers of backups, the more durable the system. Turtles all the way down!


SREs track other important metrics as well. To name several:

  • Traffic
  • Error Rate
  • Saturation
  • Latency
  • Packet Loss
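
A few of the metrics listed above can be computed directly from request records. The sketch below is illustrative only: the `Request` record, the choice of 5xx responses as errors, and the nearest-rank percentile method are all assumptions, and a real monitoring stack would normally compute these for you.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record, used only for this example."""
    latency_ms: float
    status: int  # HTTP status code


def traffic_rps(requests: list[Request], window_seconds: float) -> float:
    """Traffic: requests per second over the observation window."""
    return len(requests) / window_seconds


def error_rate(requests: list[Request]) -> float:
    """Error rate: fraction of requests with a 5xx status (an assumption)."""
    if not requests:
        return 0.0
    errors = sum(1 for r in requests if r.status >= 500)
    return errors / len(requests)


def latency_percentile(requests: list[Request], pct: float) -> float:
    """Latency: nearest-rank percentile of request latency in ms."""
    if not requests:
        raise ValueError("no requests to measure")
    ordered = sorted(r.latency_ms for r in requests)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Ten made-up requests: latencies 10..100 ms, the slowest one failed.
sample = [Request(latency_ms=10 * i, status=200) for i in range(1, 10)]
sample.append(Request(latency_ms=100, status=503))

print(traffic_rps(sample, window_seconds=5))
print(error_rate(sample))
print(latency_percentile(sample, 95))
```

Running a chaos experiment while watching numbers like these is one way to check that monitoring actually reflects what the injected failure is doing.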

Capacity & Configuration

It's important to validate autoscaling rules.

In the cloud you may not need to buy new hardware to plan for a launch or big event, but you still need to make sure you're configured to scale when the time comes.
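
As a rough illustration of what validating an autoscaling rule can mean, the sketch below checks a hypothetical proportional scaling rule against an expected peak. The rule, the function names, and every number are assumptions made up for this example; they do not describe Gremlin or any particular cloud provider's autoscaler.

```python
import math


def desired_replicas(current_replicas: int, current_cpu_pct: int,
                     target_cpu_pct: int, min_replicas: int,
                     max_replicas: int) -> int:
    """Hypothetical proportional rule: scale the replica count by the
    ratio of observed to target CPU, then clamp to the configured bounds."""
    raw = math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)
    return max(min_replicas, min(max_replicas, raw))


def replicas_needed(peak_rps: int, rps_per_replica: int) -> int:
    """Replicas required to serve the expected peak traffic."""
    return math.ceil(peak_rps / rps_per_replica)


def rule_covers_peak(max_replicas: int, peak_rps: int,
                     rps_per_replica: int) -> bool:
    """True if the configured ceiling is high enough for the peak."""
    return max_replicas >= replicas_needed(peak_rps, rps_per_replica)


# 4 replicas running at 90% CPU against a 60% target ask for 6 replicas.
print(desired_replicas(4, 90, 60, min_replicas=2, max_replicas=10))

# A 5000 rps launch at 400 rps per replica needs 13 replicas,
# so a ceiling of 10 would leave the service under-provisioned.
print(rule_covers_peak(max_replicas=10, peak_rps=5000, rps_per_replica=400))
```

A check like this only covers the arithmetic. Replicating the high-traffic event itself is what confirms that the scaling actually triggers, and triggers quickly enough.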

Avoid downtime. Use Gremlin to turn failure into resilience.

Gremlin empowers you to proactively root out failure before it causes downtime. See how you can harness chaos to build resilient systems by requesting a demo of Gremlin.
