Site Reliability Engineering

Running reliable production systems

What is SRE? A primer for engineering leaders

Site Reliability Engineering (SRE) is the outcome of combining system operations responsibilities with software development.

Incident repro & playbook validation for SREs

Tammy Butow (Principal SRE @ Gremlin) and Robert Ross (CEO @ Firehydrant) discuss how SREs can be proactive with Chaos Engineering.

SRE Best Practices for Incident Management

Learn about the rise of Site Reliability Engineering and how its approach to incident management can not only coexist with, but also strengthen, a DevOps approach to development.
Download white paper

The SRE reliability hierarchy

An SRE's primary job is making and keeping services and applications reliable, and this involves a lot of moving pieces! The layers below make up the Service Reliability Hierarchy, according to Google, listed from the top of the hierarchy to its base. Chaos Engineering can help at each layer.
Product
Development
Capacity planning
Testing + release procedures
Postmortem analysis
Incident response
Monitoring

SRE vs DevOps: Can they coexist or do they compete?

Learn how DevOps and SRE teams can work together to create high-performing, reliable sites.

The role and responsibilities of SREs in software engineering

Site Reliability Engineering teams are made up of people from diverse backgrounds who work together toward the common goal of keeping systems and services reliably available.

How to become a top-notch SRE

Free resources and tools that will help you learn the skills you need to become an SRE.

How much money do SREs make?

We polled the industry to give you a sense of salary ranges for SREs.

SRE interview questions and job descriptions

As you're building your SRE team, here are some interview questions to help you find the best candidates, along with job descriptions you can use.

Avoid downtime. Use Gremlin to turn failure into resilience.

Gremlin empowers you to proactively root out failure before it causes downtime. See how you can harness chaos to build resilient systems by requesting a demo of Gremlin.

Request a demo