Understanding The Role Of The Technical Lead On-Call (TLOC)

Tammy Butow, Principal SRE
Last Updated: June 6, 2018
Categories: SRE

This is the third post in a three-part series on High Severity Incident (SEV) Management Programs. Check out part 1, How To Establish a SEV Management Program, and part 2, Understanding The Role Of The Incident Manager On-Call (IMOC).

Who are TLOCs?

TLOCs are technical experts from different service areas. They’re charged with diagnosing, mitigating, and resolving SEVs as quickly and safely as possible. But they aren’t burdened with keeping engineers calm or keeping management in the loop—that’s the IMOCs’ job. Rather, a TLOC settles in the trenches and stays laser-focused on technical problem solving, calling up to the IMOC for help—or to give status updates—only when necessary.

Other engineers respect the TLOC’s need to focus, but are ready to jump in and help when called on—the TLOC works heroically, but not alone!

After a SEV, the TLOCs work with their service teams to determine its root cause and create action items—for example, fixing a bug or deprecating some legacy system. After any fixes, the TLOCs lead chaos experiments to ensure the SEV doesn’t recur. These experiments are like integration tests, but for your entire application stack.

Over time, this post-SEV practice improves MTBF (mean time between failures) and MTTP (mean time to prevention).
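To make MTBF concrete, here is a minimal sketch of how it might be computed from a list of SEV start times. The function name and the sample dates are hypothetical, and the sketch measures start-to-start gaps, which is a simplification: a stricter definition would subtract the time spent resolving each SEV.

```python
from datetime import datetime, timedelta


def mean_time_between_failures(sev_starts):
    """Average gap between consecutive SEV start times (simplified MTBF)."""
    ordered = sorted(sev_starts)
    if len(ordered) < 2:
        raise ValueError("need at least two SEVs to measure a gap")
    # Pair each SEV with the next one and measure the time between them.
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps, timedelta()) / len(gaps)


# Hypothetical SEV start times, for illustration only.
sevs = [
    datetime(2018, 1, 1),
    datetime(2018, 1, 11),
    datetime(2018, 2, 10),
]
mtbf = mean_time_between_failures(sevs)
print(f"MTBF: {mtbf.days} days")
```

Tracking this number before and after each round of fixes and chaos experiments is one way to see whether the post-SEV practice is paying off.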

Creating Rotations

TLOCs take turns being on-call, of course. If your engineering team is small (e.g. 5 engineers), you’ll create a single TLOC rotation that covers all service areas. If your engineering team is larger (e.g. 50 engineers), you’ll create one TLOC rotation for each service area. How those service areas break down depends on the size of your team.

Suppose you have 10 engineers. That’s enough for two rotations, given that an ideal rotation has five TLOCs. (Any more, and no TLOC will be on-call often enough to stay sharp; any fewer, and the TLOCs may burn out.) With two rotations, you need to break down your services into two buckets. For example:

TLOC Rotation 1 - Infrastructure Engineering Services: Responsible for internal services such as MySQL, Memcache, Amazon S3, Kafka, Monitoring, and Self-Healing Software.

TLOC Rotation 2 - Product Engineering Services: Responsible for customer-facing services such as UI, Billing, Web Apps, Desktop Apps, and Mobile Apps.
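The sizing arithmetic above can be sketched in a few lines. This is only an illustration: the function name is made up, and rounding down when the team size is not a multiple of five is an assumption, since the post does not say how to handle leftover engineers.

```python
IDEAL_ROTATION_SIZE = 5  # from the post: an ideal rotation has five TLOCs


def rotation_count(engineers: int) -> int:
    """Number of TLOC rotations a team of this size can staff.

    Assumption: teams smaller than five still get one rotation, and any
    leftover engineers join an existing rotation rather than forming a new one.
    """
    if engineers < 1:
        raise ValueError("a team needs at least one engineer")
    return max(1, engineers // IDEAL_ROTATION_SIZE)


for size in (5, 10, 15):
    print(size, "engineers ->", rotation_count(size), "rotation(s)")
```

The number of rotations then tells you how many service buckets to define.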

During any given week, each rotation designates a Primary and a Secondary TLOC. At any given moment, however, each rotation has only one acting TLOC. Letting a single engineer take charge keeps everything moving forward, which improves your mean time to diagnosis (MTTD) and mean time to resolution (MTTR).

Your two rotations might look like this:

                     Week 1       Week 2    Week 3   Week 4   Week 5
Infra    Primary     Prima        Sylvain   Atul     Diane    Eric
Infra    Secondary   Sylvain      Atul      Diane    Eric     Prima
Product  Primary     Christophe   Gillian   Hank     Isabel   Juan
Product  Secondary   Gillian      Hank      Isabel   Juan     Christophe

Notice that each TLOC serves first as Secondary, then as Primary the week after. This lets a likely-rusty TLOC warm up as they return to duty. The TLOCs should meet weekly so the most recent on-calls can share lessons learned and hand off action items to the next on-calls.
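The Secondary-then-Primary pattern in the schedule above can be generated by stepping through the member list one position per week. The sketch below is one possible way to do it; the function name and the dictionary layout are illustrative choices, not part of any scheduling tool.

```python
def build_schedule(members, weeks):
    """Assign a Primary and a Secondary TLOC for each week.

    Each week's Secondary becomes the following week's Primary, so every
    TLOC gets a warm-up week before taking charge.
    """
    n = len(members)
    if n < 2:
        raise ValueError("a rotation needs at least two TLOCs")
    return [
        {
            "week": week + 1,
            "primary": members[week % n],
            "secondary": members[(week + 1) % n],
        }
        for week in range(weeks)
    ]


# Names taken from the Infra rotation in the example schedule.
infra = ["Prima", "Sylvain", "Atul", "Diane", "Eric"]
schedule = build_schedule(infra, weeks=5)
for row in schedule:
    print(f"Week {row['week']}: Primary {row['primary']}, Secondary {row['secondary']}")
```

Because the assignment wraps around with the modulo operator, asking for more weeks than there are members simply repeats the cycle.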

As your engineering team grows, you’ll add more 5-person rotations and redefine your service buckets in whatever way makes sense for your company. For example, if your Mobile App is more complex and less stable than other parts of your stack, it may deserve its own TLOC rotation as your team grows to 15. A different company may give their Web App its own rotation.

Preparing New TLOCs

Since TLOCs are solely responsible for driving technical resolution of SEVs, new TLOCs must receive training before their first on-call rotation. One or more experienced TLOCs should hold a one-hour, face-to-face training session, covering:

After training, add each new TLOC to the pager rotation for their service area—and test that they actually receive pages. Also test that pages roll over to Secondary TLOCs when the Primary doesn't answer within one minute.
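The rollover rule can be written down as a small model so that everyone agrees on what a passing test looks like. This is a sketch of the expected behavior only: the function and parameter names are hypothetical, it does not call any real paging service, and it is no substitute for sending actual test pages.

```python
ROLLOVER_SECONDS = 60  # from the post: roll over after one minute without an answer


def page_recipients(primary, secondary, primary_ack_after=None,
                    rollover_seconds=ROLLOVER_SECONDS):
    """Return, in order, everyone who should be paged for one SEV.

    primary_ack_after is the number of seconds the Primary takes to
    acknowledge the page, or None if the Primary never answers.
    """
    recipients = [primary]
    # The Secondary is paged only if the Primary misses the rollover window.
    if primary_ack_after is None or primary_ack_after > rollover_seconds:
        recipients.append(secondary)
    return recipients


print(page_recipients("Prima", "Sylvain", primary_ack_after=20))
print(page_recipients("Prima", "Sylvain", primary_ack_after=None))
```

When you run the real test, have the Primary deliberately ignore a test page and confirm that the Secondary's pager goes off after the rollover window.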

Finally, give each TLOC full access to any monitoring, reliability, networking, and performance tools and dashboards.

Conclusion

The Technical Lead On-Call (TLOC) is a technical expert who diagnoses and resolves high severity incidents (SEVs) quickly but safely. This post has shown you how to think about the TLOC role and establish TLOC rotations at your company. If you want to become a TLOC at your company, just ask your Engineering Manager—it’s a fantastic opportunity for any engineer. If you’re already a TLOC, share your war stories with us in the comments!
