Inside Gremlin: 2019 Gremlin GameDays Roadmap

Last Updated: April 22, 2019

Topics: GameDay



This is an older tutorial and may not represent the latest or most up-to-date information. If anything in this tutorial is incorrect, please let us know.

GameDays were created with the goal of increasing reliability by purposefully creating major failures on a regular basis. GameDays facilitate Chaos Engineering. Here at Gremlin, we have created a roadmap for all of the GameDays we plan to run internally in 2019.

Openly sharing our internal GameDay strategy and recaps as we progress through the year is important to us. We believe the best way to improve as an industry is to share our strengths and weaknesses transparently, based on real, practical work. Please feel encouraged to reuse our GameDay strategy with your own team.

How did we create our GameDay Roadmap?

Our GameDay roadmap is based on the Service Reliability Hierarchy in the Google SRE book (O’Reilly). Each level of this pyramid is important, and covering every level will give us a holistic understanding of our strengths and weaknesses across our engineering organisation and services.

We decided to run the following GameDays based on the Google SRE book:

GameDay - Monitoring and Alerting on Staging
GameDay - Monitoring and Alerting on Production
GameDay - Product Launches
GameDay - Incident Response
GameDay - Postmortem / Root Cause Analysis
GameDay - Testing / Release
GameDay - Capacity Planning
GameDay - Development (Distributed Consensus)
GameDay - Development (Data processing pipelines)
GameDay - Development (Data Integrity)

Looking at this list, you might be wondering how we can run a GameDay for postmortems. Postmortems are a critical component of your SRE practice, and a GameDay is a chance to exercise the process behind them. What tools does your team require to run effective postmortems? What systems do they need timely access to for SEV 0 postmortems?

We used the Task Tracking template in Jira to create this GameDay project.

Each GameDay epic will have three high-priority tasks / action items that we need to complete. We add these items to the appropriate epic after the GameDay.
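To picture the structure described above, here is a minimal Python sketch of the roadmap as a list of epics, each holding up to three action items. It is only an illustration: the class name `GameDayEpic`, the constant `MAX_ACTION_ITEMS`, and the sample action item are hypothetical, and the sketch does not use Jira or its API. Treating three as a hard cap is also an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import List

# Assumption for this sketch: the "3 action items per epic" rule is a hard cap.
MAX_ACTION_ITEMS = 3


@dataclass
class GameDayEpic:
    """One GameDay on the roadmap, with its follow-up action items."""
    name: str
    action_items: List[str] = field(default_factory=list)

    def add_action_item(self, summary: str) -> None:
        """Record an action item after the GameDay, up to the cap."""
        if len(self.action_items) >= MAX_ACTION_ITEMS:
            raise ValueError(
                f"{self.name!r} already has {MAX_ACTION_ITEMS} action items"
            )
        self.action_items.append(summary)


# The ten GameDays listed in the roadmap, in the order given in the article.
ROADMAP_2019 = [
    GameDayEpic(name)
    for name in (
        "Monitoring and Alerting on Staging",
        "Monitoring and Alerting on Production",
        "Product Launches",
        "Incident Response",
        "Postmortem / Root Cause Analysis",
        "Testing / Release",
        "Capacity Planning",
        "Development (Distributed Consensus)",
        "Development (Data processing pipelines)",
        "Development (Data Integrity)",
    )
]

if __name__ == "__main__":
    # Hypothetical action item, added after the first GameDay has been run.
    ROADMAP_2019[0].add_action_item("Example follow-up task (placeholder)")
    for epic in ROADMAP_2019:
        print(f"GameDay - {epic.name}: {len(epic.action_items)} action item(s)")
```

Capping the list keeps each epic focused on a small set of high-priority follow-ups instead of an open-ended backlog.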

We will be sharing recaps for each of our GameDays publicly. To keep an eye on our future GameDays and SRE work, you can find me on Twitter @tammybutow.


