Ideas for How to Maintain and Validate Existing Runbooks
All technical documentation for software that is still in active development has one thing in common: if it is not already outdated, it will be. We must be intentional if we want information to stay current. This includes runbooks.
There are many kinds of runbooks (also called playbooks and sometimes disaster recovery plans) that are used in Site Reliability Engineering (SRE). Examples include a runbook for disaster recovery failover (incident management), critical dependency failover, and non-critical dependency failover. Different companies may use different terms, and that is okay. Some companies require runbooks because of established IT service management (ITSM) processes. Many start by having senior engineers create runbooks to help delegate regular maintenance tasks and onboard new team members.
What is important is that we are actively documenting useful diagnostic procedures or checklists of ordered steps for on-call engineers to use to reduce mean time to repair (MTTR) during an outage. Often it is useful to include server build documentation, architectural designs, and details about templates used for cloud server provisioning.
While we can quickly agree about the usefulness of documentation during an active alert or an outage, we don’t always take the time that we should to ensure that the content is current. We need to change this. This article provides some guidance to help us include runbook documentation updates in our processes.
Use Production Changes
Deploying microservices to the cloud has greatly enhanced our ability to deploy new code quickly. For some, that means weekly or monthly releases. In these cases, we can add a task to the release checklist for someone to make any runbook updates needed as a result of the current changes. If your organization has a defined change management procedure, updating runbooks should be included as an important part of production changes.
For others, the microservices and cloud combination means daily or even more frequent releases, at least in a testing or stage environment, and maybe even into production. It may not be feasible to keep our documentation current as a part of this cadence, especially if we are documenting in deep, precise detail.
Many SRE teams have decided to write runbook information at a higher level, trusting that the included entries will be adequate to help engineers know what to check and which paths are the most likely to be useful or to rule out first. This also lets the documentation stay viable and accurate for longer without updates.
For example, we might have an entry outlining aspects of our infrastructure to check when latency alerting reaches a level that we set, somewhere below where latency became an issue in a previous outage. Our engineers may not be given the exact steps needed to check those aspects, but will have a guidepost leading them to the parts of the system to check.
If runbooks are prescriptive to the level of “run this set of commands in this order,” we will get more benefit out of learning how to automate those steps. This can be a script to be run by our team or something that is triggered automatically as a result of monitoring data. Consider creating an automation deployment runbook while you do this, if you don’t already have one.
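As a minimal sketch of what automating a "run these commands in this order" runbook might look like, the example below models each step as a small function and runs the steps in sequence, stopping at the first failure so that an engineer can take over. The `Step` and `run_steps` names and the two placeholder actions are hypothetical and exist only for illustration; real steps would call your own tooling.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    """One runbook step: a description plus an action that returns True on success."""
    description: str
    action: Callable[[], bool]


def run_steps(steps: List[Step]) -> Tuple[bool, List[str]]:
    """Run steps in order and stop at the first failure.

    Returns (overall_success, log_lines) so the caller can attach the log
    to an incident ticket or print it for the on-call engineer.
    """
    log: List[str] = []
    for number, step in enumerate(steps, start=1):
        try:
            ok = step.action()
        except Exception as exc:  # an exception counts as a failed step
            log.append(f"{number}. {step.description}: ERROR ({exc})")
            return False, log
        log.append(f"{number}. {step.description}: {'ok' if ok else 'FAILED'}")
        if not ok:
            return False, log
    return True, log


# Hypothetical placeholder actions (assumptions, not real checks).
def check_disk_space() -> bool:
    return True


def restart_worker() -> bool:
    return True


if __name__ == "__main__":
    success, lines = run_steps([
        Step("Check disk space", check_disk_space),
        Step("Restart worker", restart_worker),
    ])
    print("\n".join(lines))
    print("All steps succeeded" if success else "Stopped: hand over to on-call")
```

The same function could be called by hand or from whatever mechanism reacts to monitoring data; stopping at the first failed step keeps the ordering guarantee that the written runbook relied on.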
When we hire a new team member, part of our job is training them in how our system works. Some groups write runbooks specifically tailored for helping newbies get up to speed. We like to use our standard runbooks for this purpose, especially because the newcomer will not yet have acquired a store of tribal knowledge. This means that, while they learn where to find the most accurate and useful dashboards and so on, they will quickly discover gaps in our documentation that veterans might not notice. In either case, the new team member can make note of any outdated information that needs to be updated.
One efficient method is to run FireDrills with teams composed of both rookie and senior engineers, but put the rookie in charge of guiding the team via the runbook. The experienced engineers can maintain a calm atmosphere and guide the process while the newer ones focus on the information.
Even in the case where you have written higher-level guidance, you must make time for scheduled, regular tests of your runbooks to see if the information included is adequate and correct. This is a great reason to run a quarterly or monthly FireDrill (intentionally spelled by us as one word in camel case to differentiate from events involving actual fire), where you stage an incident for the purpose of training your teams to respond to incidents.
As with fire drills in school, the focus is on learning the procedures so that they are familiar. This reduces stress and mean time to execution (MTTE) in the event of a real production incident. Some teams even run FireDrills as a surprise event to team members, drawing an even closer parallel to the classroom.
The more frequently we have FireDrills, the better prepared our teams are to handle incidents. Since runbooks are used during FireDrills, this is a great opportunity to note any needed changes and make plans for updating them. Sometimes, the location of a particular team’s or service’s runbooks is tribal knowledge, which can be lost. FireDrills help teams remember where to look in case of an emergency, along with what to do.
Use Issue Tracking
The biggest reason that documentation doesn’t get updated is that time was never scheduled to update it. Documentation is easily forgotten. We schedule development work. We schedule maintenance. We schedule product launches.
We can just as easily create a ticket in our issue tracker for each runbook (and not just during or after a retrospective). One idea is to have on-call people dedicate 30 minutes a week to checking runbooks and creating update tickets. Another is to have service owners own a periodic review. Perhaps you want to do both, so that reviews happen from multiple perspectives.
The ticket should cover a full read-through, testing, and all edits. Make the last task in the ticket scheduling the next runbook review. Quarterly runbook reviews are a good benchmark. If your system changes so frequently that your runbooks are getting outdated before the reviews come around, schedule them monthly.
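Scheduling the next review can be as simple as a date calculation done when the ticket closes. The sketch below is one way to do that with only the Python standard library; the `add_months` and `next_review` helpers and the cadence names are hypothetical, chosen to mirror the quarterly and monthly options described above.

```python
import calendar
from datetime import date


def add_months(start: date, months: int) -> date:
    """Return the date `months` calendar months after `start`.

    If the target month is shorter (for example, 31 January plus one month),
    the day is clamped to the last day of that month.
    """
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def next_review(completed: date, cadence: str = "quarterly") -> date:
    """Due date of the next review, given when the current one was completed."""
    months_by_cadence = {"quarterly": 3, "monthly": 1}  # assumed cadence names
    return add_months(completed, months_by_cadence[cadence])


if __name__ == "__main__":
    print(next_review(date(2024, 1, 15)))             # quarterly cadence
    print(next_review(date(2024, 1, 31), "monthly"))  # clamped to end of February
```

The resulting date can go straight into the due-date field of the next review ticket, whichever issue tracker you use.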
Runbooks are an important tool for Site Reliability Engineers and DevOps teams. To help keep them up-to-date, review, update, and validate them regularly. Put two dates in every runbook: when the contents were last updated and when the contents were last validated.
Keeping track of these two dates gives every member of the team the opportunity to notice when information is getting old and to be proactive in doing something about it. Ultimately, we want everyone to buy in not only to the value of runbooks but also to the process, as co-owners. Anything we can do to foster this is a good thing.
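Once every runbook carries both dates, checking for stale ones is easy to automate. The following is a minimal sketch, assuming the dates have already been read from the runbooks into simple records; the `Runbook` class, the `stale_runbooks` function, the example runbook names, and the 90-day default (roughly one quarter) are all illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List


@dataclass
class Runbook:
    name: str
    last_updated: date    # when the contents were last edited
    last_validated: date  # when the contents were last tested, e.g. in a FireDrill


def stale_runbooks(runbooks: List[Runbook], today: date,
                   max_age_days: int = 90) -> List[str]:
    """Return names of runbooks whose update or validation date is too old."""
    cutoff = today - timedelta(days=max_age_days)
    return [
        rb.name
        for rb in runbooks
        if rb.last_updated < cutoff or rb.last_validated < cutoff
    ]


if __name__ == "__main__":
    books = [
        Runbook("db-failover", date(2024, 5, 1), date(2024, 1, 15)),
        Runbook("cache-flush", date(2024, 4, 10), date(2024, 5, 20)),
    ]
    # "db-failover" was edited recently but not validated within 90 days.
    print(stale_runbooks(books, today=date(2024, 6, 1)))
```

Treating the two dates separately matters: a runbook that was edited last week but has not been exercised in months is still a candidate for the next FireDrill.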
Use Post-incident Reviews
Whether you call them retrospectives or postmortems or something else, during the blameless discussion of an incident, its causes, and how your team repaired the failures, talk about runbooks. Were they accurate? Useful? If there are gaps or needed updates, schedule that work.
Many teams make GameDays out of past incidents. Take any noted failures, figure out how to reproduce them using Chaos Engineering experiments, and as a team validate your hypotheses and fixes. Then update the runbooks accordingly.
Chaos Engineering (CE) is a great way to build reliability into our systems. It is also a great way to check whether the guidance in our runbooks is accurate. With CE you can precisely target a part of your system for failure and then follow the runbook documentation to see if the information is adequate to help you solve the problem.
Ultimately, what we all want is reliable systems. Runbooks help us as a part of a wider set of practices, all designed to build reliability. Keeping runbooks up-to-date is a vital part of site reliability engineering. Use the ideas we have given and combine them with your own to enhance your team’s ability to respond to system failures and also to help you prevent similar failures in the future. Use Gremlin's reliability calculator to help you prioritize your efforts.