This is the second post in a three-part series on High Severity Incident (SEV) Management Programs. Check out part 1, How To Establish a SEV Management Program, and part 3, Understanding The Role Of The Technical Lead On-Call (TLOC).
The primary role of the Incident Manager On-Call (IMOC) is to resolve high severity incidents (SEVs) in a safe and fast manner. IMOCs lead and coordinate the SEV team through the SEV lifecycle. The SEV lifecycle encompasses detection, diagnosis, mitigation, prevention, and closure. The IMOC role is also commonly referred to as the Call Leader.
Making exactly one person in the company responsible for leading a SEV results in faster resolution times. Resolution time is measured as MTTR (mean time to resolution).
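To make the MTTR metric concrete, here is a minimal sketch of how it could be computed from a list of resolved SEVs. The record layout (a detection timestamp paired with a resolution timestamp) and the sample values are assumptions made for illustration, not data from a real SEV tracker.

```python
from datetime import datetime, timedelta

# Hypothetical SEV records: (detected_at, resolved_at), both in UTC.
sevs = [
    (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 12)),
    (datetime(2024, 1, 19, 22, 30), datetime(2024, 1, 19, 23, 0)),
    (datetime(2024, 2, 2, 3, 15), datetime(2024, 2, 2, 3, 33)),
]


def mean_time_to_resolution(records):
    """Return the average of (resolved_at - detected_at) across all records."""
    if not records:
        raise ValueError("need at least one resolved SEV")
    durations = [resolved - detected for detected, resolved in records]
    return sum(durations, timedelta()) / len(durations)


mttr = mean_time_to_resolution(sevs)
print(mttr)  # 0:20:00
```

Tracking this number before and after introducing the IMOC role is one way to check whether resolution times are actually improving.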
Having an IMOC responsible for a SEV provides many benefits:
As we explained in “How To Establish A High Severity Incident Management Program”, a SEV 0 is classified as “Catastrophic Service Impact”. The IMOC plays a critical role during a SEV 0. The entire company will be notified and the external status page will be updated to notify customers. The IMOC will aim to resolve SEV 0s within 15 minutes.
The IMOC promotes the following principles:
The IMOC will create and facilitate any required chat channels, conference calls, video calls or in-person SEV rooms. They will use the most suitable communication methods that enable them to work effectively with everyone actively working on the SEV.
The IMOC will keep everyone on the same page by creating and updating a SEV timeline during the SEV. The timeline will include what actions are happening and who is responsible. The IMOC makes sure to identify and raise anything that has changed during the SEV.
An example SEV timeline appears below:
| Action Item | Who Is Responsible? | Estimated Delivery (UTC) |
|---|---|---|
| Create PR for emergency fix | Annie | 10:30 |
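If the timeline is kept in a tool rather than a shared document, its structure could be modelled roughly as follows. This is only a sketch: the class names `TimelineEntry` and `SevTimeline` and the `render` helper are hypothetical, and the fields simply mirror the three columns of the example timeline.

```python
from dataclasses import dataclass, field
from datetime import time


@dataclass
class TimelineEntry:
    action_item: str
    owner: str
    estimated_delivery_utc: time


@dataclass
class SevTimeline:
    entries: list[TimelineEntry] = field(default_factory=list)

    def add(self, action_item, owner, estimated_delivery_utc):
        """Record a new action item and who is responsible for it."""
        entry = TimelineEntry(action_item, owner, estimated_delivery_utc)
        self.entries.append(entry)
        return entry

    def render(self):
        """Return the timeline as text, ordered by estimated delivery time."""
        rows = ["Action Item | Who Is Responsible? | Estimated Delivery (UTC)"]
        for e in sorted(self.entries, key=lambda e: e.estimated_delivery_utc):
            rows.append(
                f"{e.action_item} | {e.owner} | {e.estimated_delivery_utc:%H:%M}"
            )
        return "\n".join(rows)


timeline = SevTimeline()
timeline.add("Create PR for emergency fix", "Annie", time(10, 30))
print(timeline.render())
```

Keeping the owner and estimated delivery as required fields means every action item on the timeline always answers "who" and "when".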
The IMOC has a wide knowledge of services and engineering teams. They have an understanding of all major changes that are happening across all services. They are aware of product launches, of migrations and of changes to team services and structure. The IMOC stays calm and collected at all times. They have an ability to focus and drive the entire company towards mitigation and resolution.
The IMOC is supported by the IMOC rotation team. This is usually a team of fewer than 10 engineers from across the company. The IMOC rotation team will work together to proactively ensure the entire company understands how SEVs are managed and categorised. The more effective the entire company is at categorising and communicating SEVs, the quicker the IMOC can grok the priority and impact of SEVs.
The IMOC rotation is a small rotation of engineering leaders. One person is on-call in this role at any point in time. It is a 24/7 rotation with one Primary IMOC and one Secondary IMOC on-call for a week.
An example of a five person rotation appears below:
| Week 1 | Week 2 | Week 3 | Week 4 | Week 5 |
|---|---|---|---|---|
| Primary IMOC A | Primary IMOC B | Primary IMOC C | Primary IMOC D | Primary IMOC E |
| Secondary IMOC A | Secondary IMOC B | Secondary IMOC C | Secondary IMOC D | Secondary IMOC E |
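A weekly rotation like this can be generated with a few lines of code. The sketch below is one possible approach, not a prescribed one: the function name `build_rotation` is hypothetical, and because the example does not say how the Secondary IMOC is chosen, it assumes the secondary is simply the next person in the list so that nobody holds both roles in the same week.

```python
def build_rotation(engineers, weeks):
    """Return a list of (week_number, primary, secondary) tuples.

    Assumption: the secondary is the next person in the list, which makes
    this week's secondary next week's primary.
    """
    if len(engineers) < 2:
        raise ValueError("need at least two people for primary and secondary")
    schedule = []
    for week in range(weeks):
        primary = engineers[week % len(engineers)]
        secondary = engineers[(week + 1) % len(engineers)]
        schedule.append((week + 1, primary, secondary))
    return schedule


schedule = build_rotation(["A", "B", "C", "D", "E"], weeks=5)
for week, primary, secondary in schedule:
    print(f"Week {week}: primary={primary}, secondary={secondary}")
```

With five people, each person is primary for one week in five under this scheme, and the person about to become primary spends the preceding week as secondary.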
It is useful for the IMOC rotation to hold a monthly sync where all IMOCs are allocated time to share feedback and to raise and review action items.
IMOC training is best conducted in a one-hour face-to-face training session with time for questions. It involves gaining an understanding of the following:
It is important to train IMOCs before they do their first rotation because they are the sole person responsible for leading the entire company towards resolution of the SEV 0.
The IMOC will make sure to gather everything needed for the SEV review and will present the SEV during the in-person meeting.
IMOCs will work with service teams to focus on improving MTBF (mean time between failures) and MTTP (mean time to prevention).
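As an illustration of how MTBF might be tracked, the sketch below averages the gaps between the start times of consecutive SEVs for one service. This is one common simplification and an assumption here (some teams measure from the end of one failure to the start of the next), and the timestamps are made up for the example.

```python
from datetime import datetime, timedelta


def mean_time_between_failures(start_times):
    """Average gap between consecutive failure start times."""
    ordered = sorted(start_times)
    if len(ordered) < 2:
        raise ValueError("need at least two failures to measure a gap")
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps, timedelta()) / len(gaps)


# Hypothetical SEV start times for a single service (UTC).
starts = [
    datetime(2024, 1, 1),
    datetime(2024, 1, 11),
    datetime(2024, 1, 31),
]
mtbf = mean_time_between_failures(starts)
print(mtbf)  # 15 days, 0:00:00
```

A rising MTBF over successive review periods suggests that the follow-up work coming out of SEV reviews is reducing how often the service fails.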
The role of the Incident Manager On-Call (IMOC) is to resolve high severity incidents (SEVs) in a safe and fast manner. By following this guide you will be able to define the IMOC role within your company and establish an IMOC rotation.