GameDays were created with the goal of increasing reliability by purposefully creating major failures on a regular basis. GameDays facilitate chaos engineering. Here at Gremlin, we have created a roadmap for all of the GameDays we plan to run internally in 2019.
Openly sharing our internal GameDay strategy and recaps as we progress through the year is important to us. We believe the best way to improve as an industry is by transparently sharing our strengths and weaknesses based on real practical work. Please feel encouraged to re-use our GameDay strategy with your own team.
Our GameDay roadmap is based on the Service Reliability Hierarchy in the Google SRE book (O’Reilly). Each level of this pyramid is important, and running a GameDay for every level will give us a holistic understanding of our strengths and weaknesses across our engineering organisation and services.
We decided to run the following GameDays based on the Google SRE book:
- GameDay - Monitoring and Alerting on Staging
- GameDay - Monitoring and Alerting on Production
- GameDay - Product Launches
- GameDay - Incident Response
- GameDay - Postmortem / Root Cause Analysis
- GameDay - Testing / Release
- GameDay - Capacity Planning
- GameDay - Development (Distributed Consensus)
- GameDay - Development (Data processing pipelines)
- GameDay - Development (Data Integrity)
Looking at this list, you might be thinking: how can we run a GameDay for Postmortems? Postmortems are a critical component of your SRE practice, and a GameDay lets you rehearse them by answering questions such as: What tools does your team require to run effective postmortems? What systems do they need timely access to for SEV 0 postmortems?
We used the Task Tracking template in Jira to create this GameDay project.
Each GameDay epic will have three high-priority tasks (action items) that we need to accomplish. We add these items to the appropriate epic after the GameDay.
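If you want to mirror this structure outside Jira, the roadmap can be modelled as a small data structure: one epic per GameDay, each holding at most three action items. The following is only a minimal sketch in Python. The names `GameDayEpic`, `add_action_item`, `build_roadmap` and `MAX_ACTION_ITEMS` are hypothetical and are not part of Jira or any Gremlin tooling, and the action items in the usage example are placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Assumption: mirrors the "three high-priority action items per epic" rule.
MAX_ACTION_ITEMS = 3

# GameDay titles taken from the roadmap list in this post.
ROADMAP = [
    "Monitoring and Alerting on Staging",
    "Monitoring and Alerting on Production",
    "Product Launches",
    "Incident Response",
    "Postmortem / Root Cause Analysis",
    "Testing / Release",
    "Capacity Planning",
    "Development (Distributed Consensus)",
    "Development (Data processing pipelines)",
    "Development (Data Integrity)",
]


@dataclass
class GameDayEpic:
    """One epic per GameDay; action items are added after the GameDay runs."""

    title: str
    action_items: List[str] = field(default_factory=list)

    def add_action_item(self, summary: str) -> None:
        # Refuse a fourth item so each epic stays focused on three priorities.
        if len(self.action_items) >= MAX_ACTION_ITEMS:
            raise ValueError(
                f"'{self.title}' already has {MAX_ACTION_ITEMS} action items"
            )
        self.action_items.append(summary)


def build_roadmap() -> Dict[str, GameDayEpic]:
    """Create an empty epic for every GameDay on the roadmap."""
    return {title: GameDayEpic(title) for title in ROADMAP}


if __name__ == "__main__":
    roadmap = build_roadmap()
    epic = roadmap["Incident Response"]
    # Placeholder summaries; real items come out of the GameDay itself.
    epic.add_action_item("Placeholder action item 1")
    epic.add_action_item("Placeholder action item 2")
    for title, item in roadmap.items():
        print(f"{title}: {len(item.action_items)} action item(s)")
```

Capping the list at three is a design choice that follows the rule above; if your team tracks more follow-ups per GameDay, raise the limit or drop the check.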
We will be sharing recaps for each of our GameDays publicly. To keep an eye on our future GameDays and SRE work, you can find me on Twitter @tammybutow.