WEBINAR

Improving Incident Management and Postmortem Analysis at Google

Are you curious how one of the world’s largest software companies learns from incidents? Maybe you’re trying to make sense of the outcomes of your existing incident retrospective process. Either way, this session is for you.

On-demand

About this webinar

In this conversation, we are joined by Gustavo Franco, a Site Reliability Engineer at Google and co-founder of BreakFix, one of the teams responsible for thoughtfully breaking production systems, analyzing every incident, and fixing those systems at scale.

Specifically, Gustavo will share how teams can scale their postmortem analysis through a tool that helps SREs author incident retrospectives, download and parse the important elements, and then upload that metadata to Google BigQuery for analysis.
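
As a rough illustration of the download, parse, and upload flow described above, the sketch below parses a plain-text postmortem into a metadata record and serializes it as newline-delimited JSON, a format that BigQuery load jobs accept. The document layout, the field names, and the helper names (`parse_postmortem`, `to_ndjson`) are assumptions made for illustration; they are not the format or API of the tool discussed in the session.

```python
import json
import re

# Hypothetical postmortem layout: simple "Key: value" header lines.
SAMPLE = """\
Title: Checkout latency spike
Date: 2020-03-14
Duration minutes: 42
Root cause: Bad config push
Trigger: Deployment
"""

# A key made of letters and spaces, a colon, then the value.
FIELD_RE = re.compile(r"^(?P<key>[A-Za-z ]+):\s*(?P<value>.+)$")


def parse_postmortem(text):
    """Extract "Key: value" lines into a dict with snake_case keys."""
    record = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line.strip())
        if match:
            key = match.group("key").strip().lower().replace(" ", "_")
            record[key] = match.group("value").strip()
    # Store the duration as a number so it can be aggregated later.
    if "duration_minutes" in record:
        record["duration_minutes"] = int(record["duration_minutes"])
    return record


def to_ndjson(records):
    """Serialize records as newline-delimited JSON, one object per line."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)


if __name__ == "__main__":
    print(to_ndjson([parse_postmortem(SAMPLE)]))
```

Output written to a file in this shape could then be loaded into a table, for example with `bq load --source_format=NEWLINE_DELIMITED_JSON`; the dataset, table, and schema would depend on your own setup.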

You'll walk away with specific techniques to help you connect and improve your incident management, postmortem analysis, and fault injection processes.

You’ll also have the opportunity to have your questions answered by our experts during our Q&A segment.

Agenda
  • In this live session, we will discuss the limitations of typical incident management programs and postmortem analysis.
  • Then, we'll introduce BreakFix and its end-to-end approach to incident prevention through Chaos Engineering, incident management and insightful incident analysis.
  • Finally, you'll get a demo of programmatically analyzing postmortems to better understand the opportunities for improving reliability.
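
To give a sense of what programmatic postmortem analysis can look like, here is a minimal sketch that groups incident records by trigger and totals their duration. The records are made-up sample data, and the field names and the `summarize_by_trigger` helper are hypothetical; the demo in the session may use a different schema and run its queries in BigQuery instead.

```python
from collections import Counter, defaultdict

# Made-up records for illustration only; the schema is an assumption.
INCIDENTS = [
    {"trigger": "deployment", "duration_minutes": 42},
    {"trigger": "config change", "duration_minutes": 15},
    {"trigger": "deployment", "duration_minutes": 30},
]


def summarize_by_trigger(incidents):
    """Count incidents and sum their duration for each trigger."""
    counts = Counter()
    minutes = defaultdict(int)
    for incident in incidents:
        counts[incident["trigger"]] += 1
        minutes[incident["trigger"]] += incident["duration_minutes"]
    return {
        trigger: {"incidents": counts[trigger], "total_minutes": minutes[trigger]}
        for trigger in counts
    }


if __name__ == "__main__":
    for trigger, stats in summarize_by_trigger(INCIDENTS).items():
        print(trigger, stats)
```

A summary like this points to the triggers that account for the most incident time, which is one way to prioritize reliability work.
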

About the speaker

Gustavo Franco

Site Reliability Engineer
Google

Gustavo Franco is a Site Reliability Engineer on the Customer Reliability Engineering (CRE) team at Google. He has been at Google for more than 11 years and is a co-founder of BreakFix. Gustavo has also managed and served as technical lead for several other SRE teams.

Avoid downtime. Use Gremlin to turn failure into resilience.

Gremlin empowers you to proactively root out failure before it causes downtime. See how you can harness chaos to build resilient systems by requesting a demo of Gremlin.