Improving Incident Management and Postmortem Analysis at Google

Are you curious how one of the world’s largest software companies learns from incidents? Maybe you’re trying to make sense of the outcomes of your existing incident retrospective process. Either way, this session is for you.



About this webinar

In this live conversation, we are joined by Gustavo Franco, a Site Reliability Engineer at Google and co-founder of BreakFix, one of the teams responsible for thoughtfully breaking and fixing production systems at scale and for analyzing all incidents.

Specifically, Gustavo will share how teams can scale their postmortem analysis through a tool that helps SREs author incident retrospectives, download and parse the important elements, and then upload that metadata to Google BigQuery for analysis.
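
To make the parse-then-upload idea concrete, here is a minimal sketch in Python. It is not the tool discussed in the session. It assumes a hypothetical postmortem layout with "Key: value" header lines, and the field names (`title`, `severity`, `root cause`, `duration minutes`) are invented for illustration. The output is newline-delimited JSON, a format BigQuery can load. The upload step itself is left out.

```python
import json
import re

# Assumption: postmortems start with "Key: value" header lines.
# The layout and field names are hypothetical, not Google's format.
FIELD_PATTERN = re.compile(r"^(?P<key>[A-Za-z ]+):\s*(?P<value>.+)$")
WANTED_FIELDS = {"title", "severity", "root cause", "duration minutes"}


def parse_postmortem(text):
    """Extract the wanted 'Key: value' fields from one postmortem."""
    record = {}
    for line in text.splitlines():
        match = FIELD_PATTERN.match(line.strip())
        if not match:
            continue
        key = match.group("key").strip().lower()
        if key in WANTED_FIELDS:
            # Use snake_case keys so they work as column names.
            record[key.replace(" ", "_")] = match.group("value").strip()
    if "duration_minutes" in record:
        # Store the duration as a number so it can be aggregated later.
        record["duration_minutes"] = int(record["duration_minutes"])
    return record


def to_ndjson(records):
    """Serialize records as newline-delimited JSON, one row per line."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)


SAMPLE = """Title: Checkout latency spike
Severity: S2
Root cause: Connection pool exhaustion
Duration minutes: 47

Timeline: 10:02 alert fired
"""

if __name__ == "__main__":
    print(to_ndjson([parse_postmortem(SAMPLE)]))
```

Once the rows are in a table, questions such as "which root causes recur most often?" become ordinary queries instead of manual reading.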

You'll walk away with specific techniques to help you connect and improve your incident management, postmortem analysis, and fault injection processes.

You’ll also be able to get your questions answered by our experts during the Q&A segment.

  • In this live session, we will discuss the limitations of typical incident management programs and postmortem analysis.

  • Then, we'll introduce BreakFix and its end-to-end approach to incident prevention through Chaos Engineering, incident management and insightful incident analysis.

  • Finally, you'll get a demo of programmatically analyzing postmortems to better understand the opportunities for improving reliability.

About the speakers

© 2023 Gremlin Inc. All rights reserved.