Understanding The Role Of The Incident Manager On-Call (IMOC)

Last Updated: February 27, 2018

This is an older tutorial and may not represent the latest or most up-to-date information. If anything in this tutorial is incorrect, please let us know.

This is the second post in a three-part series on High Severity Incident (SEV) Management Programs. Check out part 1, How To Establish a SEV Management Program, and part 3, Understanding The Role Of The Technical Lead On-Call (TLOC).

Introduction

The primary role of the Incident Manager On-Call (IMOC) is to resolve high severity incidents (SEVs) in a safe and fast manner. IMOCs lead and coordinate the SEV team through the SEV lifecycle. The SEV lifecycle encompasses detection, diagnosis, mitigation, prevention, and closure. The IMOC role is also commonly referred to as the Call Leader.

Why is the IMOC role needed?

Making exactly one person responsible for leading each SEV will result in faster resolution times. Resolution speed is measured as MTTR (mean time to resolution).
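To make the MTTR measurement concrete, here is a minimal Python sketch. It assumes each resolved SEV is recorded as a pair of detection and resolution timestamps. The function name mean_time_to_resolution and the sample data are hypothetical, and your SEV reporting tool may define the start and end of a SEV differently.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    """Return the mean of (resolved - detected) over all incidents as a timedelta."""
    if not incidents:
        raise ValueError("need at least one resolved incident")
    # Sum the individual resolution times, starting from a zero timedelta.
    total = sum((resolved - detected for detected, resolved in incidents), timedelta())
    return total / len(incidents)


# Hypothetical sample data: (detected, resolved) pairs in UTC.
sample = [
    (datetime(2018, 2, 1, 10, 0), datetime(2018, 2, 1, 10, 12)),  # 12 minutes
    (datetime(2018, 2, 9, 3, 30), datetime(2018, 2, 9, 3, 48)),   # 18 minutes
]
mttr = mean_time_to_resolution(sample)
print(mttr)  # 0:15:00
```

Tracking this number per SEV level makes it easier to see whether a target such as the 15-minute goal for SEV 0s is being met.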

Having an IMOC responsible for a SEV provides many benefits:

The IMOC is focused on TTR (time to resolution) for the SEV
The IMOC will aim to resolve SEV 0s within 15 minutes
Everyone in the company knows who to follow as a top priority
The IMOC uses their leadership skills to keep the team calm
Engineers working on SEV mitigation can remain focused on the technical work required to recover. They will only be required to provide status updates to the IMOC.
The IMOC will keep the entire company updated on the progress of the SEV

What is the role of an IMOC?

As we explained in “How To Establish A High Severity Incident Management Program”, a SEV 0 is classified as “Catastrophic Service Impact”. The IMOC plays a critical role during SEV 0s: the entire company is notified, and the external status page is updated to notify customers. The IMOC will aim to resolve SEV 0s within 15 minutes.

The IMOC promotes the following principles:

Stay calm
Work as a team
Follow the lead of the IMOC
Communicate clearly and concisely
Prioritize: focus on the right things at the right time

The IMOC will create and facilitate any required chat channels, conference calls, video calls or in-person SEV rooms. They will use the most suitable communication methods that enable them to work effectively with everyone actively working on the SEV.

The IMOC will keep everyone on the same page by creating and updating a SEV timeline during the SEV. The timeline will include what actions are happening and who is responsible. The IMOC makes sure to identify and raise anything that has changed during the SEV.

An example SEV timeline appears below:

Action Item	Who Is Responsible?	Estimated Delivery (UTC)
Create PR for emergency fix	Annie	10:30
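If you keep the timeline in a script or chat bot rather than a shared document, it could be modeled roughly as follows. This is only a sketch: the class names TimelineEntry and SevTimeline are hypothetical, and the fields simply mirror the three columns of the example table above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TimelineEntry:
    action: str   # what is happening
    owner: str    # who is responsible
    eta_utc: str  # estimated delivery as "HH:MM" in UTC


@dataclass
class SevTimeline:
    entries: List[TimelineEntry] = field(default_factory=list)

    def add(self, action, owner, eta_utc):
        """Append a new action item to the timeline."""
        self.entries.append(TimelineEntry(action, owner, eta_utc))

    def render(self):
        """Return the timeline as tab-separated text, header row first."""
        rows = ["Action Item\tWho Is Responsible?\tEstimated Delivery (UTC)"]
        rows += [f"{e.action}\t{e.owner}\t{e.eta_utc}" for e in self.entries]
        return "\n".join(rows)


timeline = SevTimeline()
timeline.add("Create PR for emergency fix", "Annie", "10:30")
print(timeline.render())
```

The rendered text can be pasted into the SEV chat channel each time the IMOC posts an update.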

How does the IMOC ensure the SEV team and wider company operate effectively during SEVs?

The IMOC has a wide knowledge of services and engineering teams. They have an understanding of all major changes that are happening across all services. They are aware of product launches, of migrations and of changes to team services and structure. The IMOC stays calm and collected at all times. They have an ability to focus and drive the entire company towards mitigation and resolution.

The IMOC is supported by the IMOC rotation team. This is usually a team of fewer than 10 engineers from across the company. The IMOC rotation team works together to proactively ensure the entire company understands how SEVs are managed and categorised. The more effective the entire company is at categorising and communicating SEVs, the quicker the IMOC can grok the priority and impact of SEVs.

How do you create an effective IMOC rotation?

The IMOC rotation is a small rotation of engineering leaders. One person is on call as the Primary IMOC at any point in time. It is a 24/7 rotation with one Primary IMOC and one Secondary IMOC on call for a week.

An example of a five person rotation appears below:

Week 1	Week 2	Week 3	Week 4	Week 5
Primary IMOC A	Primary IMOC B	Primary IMOC C	Primary IMOC D	Primary IMOC E
Secondary IMOC A	Secondary IMOC B	Secondary IMOC C	Secondary IMOC D	Secondary IMOC E
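A weekly schedule like this can also be generated programmatically. The sketch below makes one assumption that the table does not state: the Secondary IMOC for a week is the person who will be Primary the following week. The function name weekly_rotation is hypothetical, and you should adapt the pairing rule to however your rotation actually assigns secondaries.

```python
def weekly_rotation(people, weeks):
    """Return a list of (week_number, primary, secondary) tuples.

    Assumption: the secondary is the next person in the list, wrapping
    around at the end, so primary and secondary are always different.
    """
    if len(people) < 2:
        raise ValueError("need at least two people for primary and secondary")
    schedule = []
    for week in range(weeks):
        primary = people[week % len(people)]
        secondary = people[(week + 1) % len(people)]
        schedule.append((week + 1, primary, secondary))
    return schedule


schedule = weekly_rotation(["A", "B", "C", "D", "E"], 5)
for week, primary, secondary in schedule:
    print(f"Week {week}: primary={primary}, secondary={secondary}")
```

With five people, each person is Primary for one week and Secondary for one week in every five-week cycle.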

It is useful for the IMOC rotation to hold a monthly sync where all IMOCs are allocated time to share feedback and to raise and review action items.

How do you train IMOCs for SEV 0s?

IMOC training is best conducted in a one-hour, face-to-face training session with time for questions. It involves gaining an understanding of the following:

SEV levels
The full lifecycle of SEVs
Examples of previous SEV 0s that have occurred within your company
Access to an IMOC runbook with communications templates
GameDays

It is important to train IMOCs before they do their first rotation because they are the sole person responsible for leading the entire company towards resolution of the SEV 0.

What are the recommended tools for IMOCs?

A SEV reporting tool which collects SEV details
Set up the SEV reporting tool to automatically page the Primary IMOC for SEV 0s
Set up automatic escalation to the Secondary IMOC if the Primary IMOC does not acknowledge (ack) the page within 1 minute
Access to Monitoring Dashboards, especially a critical services dashboard
Automatically page the Primary IMOC for SEV 0s
An IMOC runbook with comms templates to enable clear and concise communication with the entire company
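The paging and escalation rule in the list above can be expressed as a small decision function. This is a sketch of the logic only, not the API of any particular paging product: the names ACK_TIMEOUT and who_should_be_paged are hypothetical, and a real paging tool would normally be configured with an escalation policy rather than custom code.

```python
from datetime import datetime, timedelta

# Escalation window taken from the list above: 1 minute to acknowledge.
ACK_TIMEOUT = timedelta(minutes=1)


def who_should_be_paged(paged_at, acked_at, now):
    """Return the roles that should have been paged for a SEV 0 by `now`.

    paged_at: when the Primary IMOC was paged
    acked_at: when the Primary IMOC acknowledged, or None if not yet acked
    """
    roles = ["primary"]
    deadline = paged_at + ACK_TIMEOUT
    acked_in_time = acked_at is not None and acked_at <= deadline
    # Escalate only once the deadline has passed without a timely ack.
    if not acked_in_time and now >= deadline:
        roles.append("secondary")
    return roles


t0 = datetime(2018, 2, 27, 10, 0, 0)
print(who_should_be_paged(t0, None, t0 + timedelta(seconds=30)))  # ['primary']
print(who_should_be_paged(t0, None, t0 + timedelta(seconds=61)))  # ['primary', 'secondary']
```

Keeping the timeout in a single named constant makes it easy to review the escalation window during the monthly IMOC sync.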

How does the IMOC prevent SEVs?

The IMOC gathers everything needed for the SEV review and presents the SEV during the in-person review meeting.

IMOCs will work with service teams to focus on improving MTBF (mean time between failures) and MTTP (mean time to prevention).
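As a rough illustration of how MTBF might be tracked, the sketch below averages the gaps between consecutive SEV start times for one service. Definitions of MTBF vary (some teams measure from the end of one failure to the start of the next), so treat this as one possible convention. The function name mean_time_between_failures and the sample dates are hypothetical. MTTP is not computed here because the text does not define how it is measured.

```python
from datetime import datetime, timedelta


def mean_time_between_failures(failure_times):
    """Return the mean gap between consecutive failure start times."""
    if len(failure_times) < 2:
        raise ValueError("need at least two failures to measure a gap")
    ordered = sorted(failure_times)
    # Pair each failure with the next one and measure the time between them.
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps, timedelta()) / len(gaps)


# Hypothetical SEV start times for a single service.
starts = [datetime(2018, 1, 1), datetime(2018, 1, 11), datetime(2018, 1, 31)]
mtbf = mean_time_between_failures(starts)
print(mtbf)  # 15 days, 0:00:00
```

A rising MTBF over successive review periods suggests that the prevention work coming out of SEV reviews is taking effect.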

Conclusion

The role of the Incident Manager On-Call (IMOC) is to resolve high severity incidents (SEVs) in a safe and fast manner. By following this guide you will be able to define the IMOC role within your company and establish an IMOC rotation.
