Understanding The Role Of The Incident Manager On-Call (IMOC)

Tammy Butow, Principal SRE
Last Updated: February 27, 2018
Categories: SRE

This is the second post in a three-part series on High Severity Incident (SEV) Management Programs. Check out part 1, How To Establish a SEV Management Program, and part 3, Understanding The Role Of The Technical Lead On-Call (TLOC).

Introduction

The primary role of the Incident Manager On-Call (IMOC) is to resolve high severity incidents (SEVs) in a safe and fast manner. IMOCs lead and coordinate the SEV team through the SEV lifecycle. The SEV lifecycle encompasses detection, diagnosis, mitigation, prevention, and closure. The IMOC role is also commonly referred to as the Call Leader.

Why is the IMOC role needed?

Having a single person responsible for leading a SEV results in faster resolution times. This is measured as MTTR (mean time to resolution).
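As a rough illustration of how MTTR could be tracked, the sketch below averages the time from detection to resolution across a set of SEVs. The function name and the sample timestamps are hypothetical and not taken from any particular SEV tool.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(sevs):
    """Average (resolved - detected) over a list of (detected, resolved) pairs."""
    if not sevs:
        raise ValueError("need at least one resolved SEV")
    total = sum((resolved - detected for detected, resolved in sevs), timedelta())
    return total / len(sevs)


# Hypothetical sample data: two SEVs resolved in 12 and 18 minutes.
sevs = [
    (datetime(2018, 2, 1, 10, 0), datetime(2018, 2, 1, 10, 12)),
    (datetime(2018, 2, 9, 22, 30), datetime(2018, 2, 9, 22, 48)),
]
mttr = mean_time_to_resolution(sevs)
print(mttr)  # 0:15:00
```

Comparing this number before and after introducing an IMOC rotation is one way to check whether the role is having the intended effect.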

Having an IMOC responsible for a SEV provides many benefits:

  • The IMOC is focused on TTR (time to recovery) for the SEV
  • The IMOC will aim to resolve SEV 0s within 15 minutes
  • Everyone in the company knows who to follow as a top priority
  • The IMOC uses their leadership skills to keep the team calm
  • Engineers working on SEV mitigation can remain focused on the technical work required to recover. They will only be required to provide status updates to the IMOC.
  • The IMOC will keep the entire company updated on the progress of the SEV

What is the role of an IMOC?

As we explained in “How To Establish A High Severity Incident Management Program”, a SEV 0 is classified as “Catastrophic Service Impact”. The IMOC plays a critical role during these SEV 0s. The entire company will be notified and the external status page will be updated to notify customers. The IMOC will aim to resolve SEV 0s within 15 minutes.

The IMOC promotes the following principles:

  • Stay calm
  • Work as a team
  • Follow the lead of the IMOC
  • Communicate clearly and concisely
  • Prioritize: focus on the right things at the right time

The IMOC will create and facilitate any required chat channels, conference calls, video calls or in-person SEV rooms. They will use the most suitable communication methods that enable them to work effectively with everyone actively working on the SEV.

The IMOC will keep everyone on the same page by creating and updating a SEV timeline during the SEV. The timeline will include what actions are happening and who is responsible. The IMOC makes sure to identify and raise anything that has changed during the SEV.

An example SEV timeline appears below:

Action Item                 | Who Is Responsible? | Estimated Delivery (UTC)
Create PR for emergency fix | Annie               | 10:30
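If the timeline is kept in a script or chat bot rather than a shared document, it could be modelled as a small list of records. The sketch below is one possible shape: the `TimelineEntry` class and `render_timeline` function are hypothetical names, and the only entry is the example row from the timeline above.

```python
from dataclasses import dataclass
from datetime import time


@dataclass
class TimelineEntry:
    action: str      # what is happening
    owner: str       # who is responsible
    eta_utc: time    # estimated delivery time, in UTC


def render_timeline(entries):
    """Return the timeline as aligned plain-text columns."""
    header = ("Action Item", "Who Is Responsible?", "Estimated Delivery (UTC)")
    rows = [header] + [
        (e.action, e.owner, e.eta_utc.strftime("%H:%M")) for e in entries
    ]
    widths = [max(len(row[i]) for row in rows) for i in range(3)]
    return "\n".join(
        "  ".join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()
        for row in rows
    )


timeline = [TimelineEntry("Create PR for emergency fix", "Annie", time(10, 30))]
print(render_timeline(timeline))
```

Appending a new entry and re-posting the rendered text each time something changes keeps everyone looking at the same, current view of the SEV.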

How does the IMOC ensure the SEV team and wider company operate effectively during SEVs?

The IMOC has a wide knowledge of services and engineering teams. They have an understanding of all major changes that are happening across all services. They are aware of product launches, of migrations and of changes to team services and structure. The IMOC stays calm and collected at all times. They have an ability to focus and drive the entire company towards mitigation and resolution.

The IMOC is supported by the IMOC rotation team. This is usually a team of fewer than 10 engineers from across the company. The IMOC rotation team will work together to proactively ensure the entire company understands how SEVs are managed and categorised. The more effective the entire company is at categorising and communicating SEVs, the quicker the IMOC can grok the priority and impact of SEVs.

How do you create an effective IMOC rotation?

The IMOC rotation is a small rotation of engineering leaders. It is a 24/7 rotation in which one Primary IMOC and one Secondary IMOC are on-call for a week at a time, so one person holds the lead role at any point in time.

An example of a five person rotation appears below:

Week 1           | Week 2           | Week 3           | Week 4           | Week 5
Primary IMOC A   | Primary IMOC B   | Primary IMOC C   | Primary IMOC D   | Primary IMOC E
Secondary IMOC A | Secondary IMOC B | Secondary IMOC C | Secondary IMOC D | Secondary IMOC E
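A weekly schedule like this can be generated rather than maintained by hand. The sketch below is a minimal example with a hypothetical `build_rotation` function. It assumes that each week's Secondary IMOC is the next person in the list, which is one common convention and is not something the table above specifies.

```python
def build_rotation(engineers, weeks):
    """Return a list of (week_number, primary, secondary) tuples.

    Assumption: the secondary for a given week is the person who will be
    primary the following week, so nobody holds both roles at once.
    """
    if len(engineers) < 2:
        raise ValueError("a rotation needs at least two engineers")
    schedule = []
    for week in range(weeks):
        primary = engineers[week % len(engineers)]
        secondary = engineers[(week + 1) % len(engineers)]
        schedule.append((week + 1, primary, secondary))
    return schedule


# Hypothetical five-person rotation over five weeks.
schedule = build_rotation(["A", "B", "C", "D", "E"], weeks=5)
for week, primary, secondary in schedule:
    print(f"Week {week}: primary={primary}, secondary={secondary}")
```

Having the secondary become the next primary gives each person a week of lower-pressure context before they lead, though any assignment works as long as both roles are always covered.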

It is useful for the IMOC rotation to hold a monthly sync where all IMOCs are allocated time to share feedback and to raise and review action items.

How do you train IMOCs for SEV 0s?

IMOC training is best conducted in a one-hour face-to-face training session with time for questions. It involves gaining an understanding of the following:

  • SEV levels
  • The full lifecycle of SEVs
  • Examples of previous SEV 0s that have occurred within your company
  • Access to an IMOC runbook with communications templates
  • GameDays

It is important to train IMOCs before they do their first rotation because they are the sole person responsible for leading the entire company towards resolution of the SEV 0.

What are the recommended tools for IMOCs?

  • A SEV reporting tool which collects SEV details
  • Set up the SEV reporting tool to automatically page the Primary IMOC for SEV 0s
  • Set up automatic escalation to the Secondary IMOC if the Primary IMOC does not acknowledge (ack) the page within 1 minute
  • Access to monitoring dashboards, especially a critical services dashboard
  • An IMOC runbook with comms templates to enable clear and concise communication with the entire company
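The paging and escalation rule in the list above can be expressed as a small decision function. The sketch below is illustrative only: `escalation_target` and `ACK_TIMEOUT` are hypothetical names, it is not tied to any real paging product, and it assumes escalation happens once exactly one minute has passed without an ack.

```python
from datetime import datetime, timedelta

ACK_TIMEOUT = timedelta(minutes=1)  # escalation window from the list above


def escalation_target(paged_at, acked_at, now):
    """Decide who should be paged for a SEV 0 at time `now`.

    Returns "primary", "secondary", or None once the page has been acked.
    """
    if acked_at is not None and acked_at <= now:
        return None  # someone has acknowledged; stop paging
    if now - paged_at >= ACK_TIMEOUT:
        return "secondary"  # primary did not ack in time
    return "primary"


# Hypothetical timestamps for a SEV 0 page sent at 10:00:00 UTC.
t0 = datetime(2018, 2, 27, 10, 0, 0)
print(escalation_target(t0, None, t0 + timedelta(seconds=30)))  # primary
print(escalation_target(t0, None, t0 + timedelta(seconds=60)))  # secondary
print(escalation_target(t0, t0 + timedelta(seconds=20), t0 + timedelta(seconds=90)))  # None
```

Keeping the rule as a pure function of timestamps makes it easy to test the escalation behaviour before a real SEV 0 depends on it.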

How does the IMOC prevent SEVs?

The IMOC will make sure to gather everything needed for the SEV review and will present the SEV during the in-person meeting.

IMOCs will work with service teams to focus on improving MTBF (mean time between failures) and MTTP (mean time to prevention).

Conclusion

The role of the Incident Manager On-Call (IMOC) is to resolve high severity incidents (SEVs) in a safe and fast manner. By following this guide you will be able to define the IMOC role within your company and establish an IMOC rotation.
