SREs and Chaos Engineering
Site Reliability Engineers are responsible for quantifying how confident they are in the systems they maintain. Chaos Engineering is a discipline for validating reliability through controlled experiments that test attributes of your system at every level, from Monitoring all the way up to the Product.
There are two important KPIs for Availability:
- SLA (defined and agreed to in a contract, e.g., 99.9%)
- SLO (internal objective, usually stricter than the SLA, e.g., 99.99%)
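To make the gap between those two example targets concrete, here is a minimal sketch that converts an availability percentage into a downtime budget. The function name `allowed_downtime_minutes` and the 30-day measurement window are assumptions chosen for illustration; your contract or internal policy may measure over a different period.

```python
# Minutes in the measurement window (assumption: a 30-day window).
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200


def allowed_downtime_minutes(availability_pct: float,
                             period_minutes: float = MINUTES_PER_30_DAYS) -> float:
    """Return the downtime budget, in minutes, implied by an availability target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    # The unavailable fraction of the window is the downtime budget.
    return period_minutes * (1 - availability_pct / 100)


sla_budget = allowed_downtime_minutes(99.9)   # about 43.2 minutes per 30 days
slo_budget = allowed_downtime_minutes(99.99)  # about 4.32 minutes per 30 days

print(f"SLA 99.9%  -> {sla_budget:.2f} min of downtime per 30 days")
print(f"SLO 99.99% -> {slo_budget:.2f} min of downtime per 30 days")
```

Each additional 9 shrinks the budget by a factor of ten, which is why an internal SLO set one 9 above the SLA leaves room to react before the contractual target is at risk.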
Durability can be measured in 9s as well. You have your systems, with replicas under the primaries, and then you have your backups. The more layers of backups, the more durable your data. Turtles all the way down!
- Error Rate
- Packet Loss
- ...to name several.
Capacity & Configuration
In the cloud, you may not need to buy new hardware to prepare for a launch or big event, but you still need to make sure your systems are configured to scale when the time comes.
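One way to sanity-check that configuration ahead of an event is to compare the projected peak against the ceiling your scaling policy allows. The sketch below is illustrative only: the function names, the 25% headroom default, and all the traffic figures are hypothetical, and real per-instance throughput should come from your own load tests.

```python
import math


def instances_needed(peak_rps: float, rps_per_instance: float,
                     headroom: float = 0.25) -> int:
    """Instances required to serve peak_rps with a safety margin on top."""
    if rps_per_instance <= 0:
        raise ValueError("rps_per_instance must be positive")
    return math.ceil(peak_rps * (1 + headroom) / rps_per_instance)


def can_scale_for_event(peak_rps: float, rps_per_instance: float,
                        max_instances: int, headroom: float = 0.25) -> bool:
    """True if the configured scaling ceiling covers the projected peak."""
    return instances_needed(peak_rps, rps_per_instance, headroom) <= max_instances


# Hypothetical launch: 12,000 requests/s expected, 500 requests/s per instance.
needed = instances_needed(peak_rps=12_000, rps_per_instance=500)
print(f"Instances needed with headroom: {needed}")
print("Ceiling of 24 is enough:", can_scale_for_event(12_000, 500, max_instances=24))
print("Ceiling of 32 is enough:", can_scale_for_event(12_000, 500, max_instances=32))
```

A check like this only confirms that the limit is high enough on paper. Whether the system actually scales in time is what a controlled experiment would need to validate.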