Use remote FireDrills to test your team’s incident response preparedness and ensure business continuity.
Incident response requires clear communication and coordinated teamwork. COVID-19 has forced us to move from in-office to remote work, which has dramatically changed how we communicate and collaborate. We cannot wait for an incident to discover that our processes don’t adapt to remote incident response. Running remote FireDrills validates that our teams can handle an incident from their homes as well as from the office.
A FireDrill (Gremlin uses the camel-case, single-word format to distinguish it from drills that prepare for actual fires) is a planned event that validates people and processes. Specifically, it is designed to run a team through the proper actions to take when a specific problem arises. As with the building fire drills we participated in during our school years, the goal is to practice the proper responses frequently enough that we develop the muscle memory to act appropriately should a real emergency arise.
This is different from a GameDay, where we test and validate systems and applications (and people and processes as a side effect). In a GameDay, the technology part is the focus. In a FireDrill, the people and plans and processes are what is tested.
Like business continuity plans, FireDrills should be a regular and expected facet of our incident management preparation. Where continuity plans outline our mitigation schemes for loss of the workplace, facility, or technology, running a FireDrill provides an opportunity to practice those plans.
A FireDrill is all about validation and practice. First, we want to validate that our incident response plans are complete, appropriate, and work as intended. Second, we want to validate that our team is able to carry out the steps we have outlined in those plans.
If we find something that doesn’t work, this is our chance to take notes and make changes. If things work, this is our opportunity to begin to make our incident response second nature so that when real problems arise we can solve them quickly.
Both improving plans and practicing plans work to improve team responsiveness. Pay attention to metrics like mean time to detect (MTTD) and mean time to resolve (MTTR) and watch how knowing what to do and practicing it frequently improves the team’s ability to take care of problems efficiently.
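As a rough illustration of tracking these metrics across drills, here is a minimal Python sketch. The `DrillRecord` shape and the function names are hypothetical, not part of any Gremlin tooling, and the sketch assumes both metrics are measured from the moment the failure starts (some teams measure time to resolve from detection instead).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class DrillRecord:
    """Timestamps noted during one FireDrill (hypothetical record shape)."""
    started: datetime   # when the practice failure began
    detected: datetime  # when the team noticed it or was alerted
    resolved: datetime  # when the team declared it fixed


def mean_time_to_detect(records):
    """Average gap between failure start and detection."""
    seconds = mean((r.detected - r.started).total_seconds() for r in records)
    return timedelta(seconds=seconds)


def mean_time_to_resolve(records):
    """Average gap between failure start and resolution (assumed definition)."""
    seconds = mean((r.resolved - r.started).total_seconds() for r in records)
    return timedelta(seconds=seconds)


# Two made-up drills, for illustration only.
example = [
    DrillRecord(datetime(2020, 4, 1, 10, 0), datetime(2020, 4, 1, 10, 8), datetime(2020, 4, 1, 10, 40)),
    DrillRecord(datetime(2020, 4, 8, 10, 0), datetime(2020, 4, 8, 10, 4), datetime(2020, 4, 8, 10, 20)),
]
print("MTTD:", mean_time_to_detect(example))   # MTTD: 0:06:00
print("MTTR:", mean_time_to_resolve(example))  # MTTR: 0:30:00
```

Recording the same three timestamps at every drill makes it possible to compare these averages from one month to the next.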
Running weekly drills to ensure uptime, reliability and bandwidth, helps decrease the need for foresight. “You’ve already done your homework and tested how your systems respond to failure,” [Yee] said.
Jason Yee, quoted in “Complications of guaranteeing uptime during a pandemic,” CIO Dive
Here are some of the ways a team can validate their remote incident response by running regular FireDrills:
- Validate playbooks (runbooks) are accurate, complete, and up to date
- Validate escalation policies
- Validate alerting and notifications
- Validate engineering knowledge
- Validate dashboards
- Validate access requirements for monitoring tools, systems, etc.: can the necessary team members log in?
- Validate or discover team dependencies—I didn’t know I needed Team Cobra to help
- Validate team resources, especially as members are now working from home—are power cords plugged in, does everyone have all the correct support tools installed, do they have the right passwords, is the current equipment at home adequate (can they be efficient with just one monitor), do they feel comfortable and ready?
- Validate expectations and actions from leadership and co-workers with regard to response and recovery (Engineer: Sorry, it’s taking a long time! Manager: It’s okay, I respect you and know it’s a tough time right now. Just keep working on it.)
- Learn to be blameless—for example, perhaps our infrastructure and scaling weren’t set up to handle “peak traffic” every day—that is not the engineers’ fault during these trying times; fix it and move forward
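One way to keep track of a list like this from drill to drill is to record each item as a pass or fail with notes, then pull out the failures as follow-up work. The sketch below is only an illustration; the `Check` class and `follow_ups` function are hypothetical names, and the sample entries are invented.

```python
from dataclasses import dataclass


@dataclass
class Check:
    """One item validated during a FireDrill (hypothetical structure)."""
    name: str
    passed: bool = False
    notes: str = ""


def follow_ups(checks):
    """Return the checks that failed, i.e. the things to fix before the next drill."""
    return [c for c in checks if not c.passed]


# Invented results from a single drill, for illustration only.
results = [
    Check("Playbook is accurate and up to date", passed=True),
    Check("Everyone can log in to monitoring tools", passed=False,
          notes="One engineer had no VPN access from home"),
    Check("Escalation policy reaches the on-call engineer", passed=True),
]

for check in follow_ups(results):
    print(f"TODO: {check.name} ({check.notes})")
```

Keeping the notes factual and about the process, not the person, fits the blameless approach described in the last item.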
When all of us are in the same physical space running an incident response FireDrill, we can communicate face to face, which is helpful. Everything else is done using computers. If we close the communication gap, most of our efforts and results should end up similar whether team members are co-located or remote. So communication is priority number one.
Plan and implement good communication methods across team members now. Do it before you run a FireDrill and keep those methods available. This removes the only real hurdle most people have when working remotely in terms of effectiveness (in fact, most people end up being more productive when working remotely once they adapt to the new paradigm). The big three communication options that you need to have are:
- Instant messaging: for quick notifications across a team or quick Q&A between specific team members. Don’t use this for detailed issues, only to inform or to get quick responses to simple questions.
- Video conferencing: for things that are urgent and require more than a sentence or two to resolve. Send an instant message to the team members you need to talk to, check their availability, and then jump on a call.
- Email: for detailed things that are not urgent.
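The guidance in this list can be written down as a small decision rule, which may help when adding it to a runbook. This is a sketch of one reading of the three options; the `Channel` enum and `pick_channel` function are hypothetical names.

```python
from enum import Enum


class Channel(Enum):
    INSTANT_MESSAGE = "instant messaging"
    VIDEO_CALL = "video conferencing"
    EMAIL = "email"


def pick_channel(urgent: bool, detailed: bool) -> Channel:
    """Choose a channel from two questions: is it urgent, and is it detailed?"""
    if not detailed:
        # Quick notifications and simple questions go to instant messaging.
        return Channel.INSTANT_MESSAGE
    if urgent:
        # Urgent and needs more than a sentence or two: get on a call.
        return Channel.VIDEO_CALL
    # Detailed but not urgent.
    return Channel.EMAIL


print(pick_channel(urgent=True, detailed=True).value)    # video conferencing
print(pick_channel(urgent=False, detailed=True).value)   # email
print(pick_channel(urgent=True, detailed=False).value)   # instant messaging
```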
Plan what you will verify in advance. Inform all team members. Don’t start with an unannounced FireDrill! There is no need for the additional stress when you are just starting out. You can have unannounced FireDrills in the future, once everyone is confident that the processes and incident response plans are solid. Solid, not perfect; the plans and processes will never be perfect because our systems and teams are constantly changing, but once we have established how to run FireDrills and have done so several times, things should be solid enough that the occasional blip will be trivial to overcome.
Begin by having each team run a small FireDrill that replays a prior incident, preferably something from recent memory. See how the team works together fixing something familiar while working under different conditions. Use a simulation or a Chaos Engineering tool to reproduce the incident.
Next, run a cross-team FireDrill reproducing a prior incident that affected more than one team. Again, plan this in advance with the knowledge of all teams involved. This is a chance to verify cross-team communication and processes, and is not intended to produce stress. Rather, the goal is to make real incidents less stressful through good preparation and training.
Run as many of these types of FireDrills as needed to get team members comfortable finding documentation, communicating, and resolving issues remotely. Then, consider using a non-critical incident as an opportunity to run an unannounced FireDrill, treating the non-critical incident as a critical one. Doing something real with the knowledge and practice already gained will help keep FireDrills interesting and make them feel a little more real. It also builds confidence in the practice, making it even more likely that the team will be efficient and effective in dealing with a critical incident when one appears.
Our goal is reliability: building reliable systems. We want all of our customers to be successful, and FireDrills are an important tool in the toolbox. Gremlin also wants to give back to the industry some of the knowledge and ideas we have gained as a remote-first company that runs regular FireDrills internally, especially during this COVID-19 crisis.
It’s tempting to use the Gremlin Chaos Engineering tool less frequently during this time, because change is stressful and many of us are experiencing much higher traffic volumes than ever before. Development teams that are not used to working remotely are more likely to introduce software bugs and configuration drift because of the stress and distractions of this new normal.
Running regular FireDrills can help everyone come together and get settled into proper processes and procedures as a team. Just as a daily standup is standard for most development teams, a weekly or bi-weekly FireDrill is useful and should become standard as well, because of the confidence and competence it builds. Using the Gremlin Chaos Engineering tool as a part of these drills can be extremely helpful in making practice incidents feel as real as possible while also providing assurance to leadership that tests and practice are being conducted in a standard, appropriate manner.