What does that really mean, though?
It means that, as an on-call hero, I’ve woken up at 3am to answer phone calls and acknowledge pages about (sometimes false) production alerts. It means I’ve led incident calls to help mitigate customer-impacting events. It means I’ve participated in countless post-mortems and reviews about what we could do better next time.
It also means I’ve received a fair share of :thumbsup: and :highfive: emojis from my peers and leaders.
Incidents and outages are going to happen, and when they do, at least one superstar may emerge…we want to set that person up for success.
In most professions, we reward people who solve hard problems when everything is on the line. Awards and medals are bestowed on first responders to emergency situations. Trophies and plaques are awarded to sports team members for (arguably) subjective accomplishments such as “6th Man of the Year” or “Most Improved Player”. In some cases, these awards also come with a contractual bonus, sometimes into the hundreds of thousands of dollars.
In the world of software development and production operations, we turn our on-call teams into heroes for “fighting the fire” and “being a great on-call”. We’re rewarding those individuals for “fixing someone else’s (or our own) mistakes”. In an on-call rotation, we operate just like software release cycles: fix fast, fix often.
Unfortunately, while we like the attention, the excessive late-night calls often result in burnout, and sometimes even turnover. We need to appreciate our heroes while also helping them become something greater: someone who prevents issues altogether. How do we get there?
We concentrate our appreciation on the heroes, but we should also celebrate the contributions that help prevent the problems from ever happening. The fact that someone had to be a superhero and save the day means that something could have gone better much earlier in the process.
We recognize diplomats for avoiding costly wars. We value teaching homeowners proper fire safety so they do not set their homes on fire. As software engineering and operations leaders, shouldn’t we measure and reward our teams and engineers for proactively avoiding outages before they impact end users?
It’s sometimes simple to quantify the work and results of engineering efforts: from the number of lines of code to the number of features released or projects completed. Operationally, we could also quantify the number of pages or alerts acknowledged and cleared, or the number of Sev-1 incidents we’ve mitigated.
But what about those proactive efforts and metrics? Things such as:
- Time to Resolution reduced / downtime reduced
- Time Between Failures increased
- Alert fatigue diminished
- Turnover and burnout lowered
- Escalations reduced
- On-call training time shortened
- Unknown single points of failure discovered and mitigated
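Two of the metrics above, Time to Resolution (MTTR) and Time Between Failures (MTBF), are straightforward to compute once you have incident start and resolution timestamps. A minimal sketch, using made-up incident records purely for illustration:

```python
from datetime import datetime, timedelta

# Hypothetical incident records: (started, resolved) timestamp pairs.
incidents = [
    (datetime(2023, 1, 3, 2, 15), datetime(2023, 1, 3, 3, 45)),
    (datetime(2023, 1, 20, 14, 0), datetime(2023, 1, 20, 14, 30)),
    (datetime(2023, 2, 11, 9, 10), datetime(2023, 2, 11, 10, 10)),
]

def mean_time_to_resolution(records):
    """Average downtime per incident (MTTR) -- lower is better."""
    total = sum((end - start for start, end in records), timedelta())
    return total / len(records)

def mean_time_between_failures(records):
    """Average healthy gap between one incident's resolution and the
    next incident's start (MTBF) -- higher is better."""
    ordered = sorted(records)
    gaps = [ordered[i + 1][0] - ordered[i][1] for i in range(len(ordered) - 1)]
    return sum(gaps, timedelta()) / len(gaps)

print(mean_time_to_resolution(incidents))    # 1:00:00
print(mean_time_between_failures(incidents))
```

Tracking these two numbers over quarters gives the proactive work a visible trend line: MTTR should fall and MTBF should grow as preventative fixes land.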
In the second grade, my teacher would hand out “coupons” for accurately completing a task or scoring well on our homework. At the end of every quarter, we’d be given the opportunity to redeem those coupons for small novelties such as pencils, stickers, etc. I got a few over the first quarter, but when I saw other kids trading their coupons for more “fun stuff” than I did, I wanted to earn more for next time. I made it a priority to earn as many as I could, because there was this Transformers lunchbox on the shelf that I really wanted…
This “elementary solution” applies directly to us in the working world. Rewarding the good behavior in a public way not only reinforces the practices of the “good engineer/team”, but also encourages others to want to try to achieve the same status (or even exceed it). If an engineer cannot earn a promotion from their superiors, or at least some version of recognition from their peers, what motivation would there be to deviate from the status quo of “fix fast, fix often”?
Let’s be real - we’re not all in a position to throw around promotions and gifts to everyone for all the good work that they do. What we can do, however, is help communicate our successes and wins, and provide an opportunity for ourselves, our teams, and our practices to be recognized and established.
Different teams and companies utilize different methods of sharing their successes. Some share testimonials of their on-call experiences, for instance. Other teams might share positive changes in their on-call metrics. A few teams I have worked with reduced their number of pages month-over-month and year-over-year by applying proactive practices, and their on-call teams were appreciative. Release velocity is another example that could easily be publicized. Teams that make the effort to decrease the ratio of rollbacks to releases are increasing their productivity, making more time for features and enhancements, and ultimately making more money for the company.
These types of wins, successes, and stories should be shared. Let’s talk about how and what to share to show your appreciation of your teams’ otherwise invisible wins. Sharing these stories recognizes the behind-the-scenes preventative work, and not just the day-saving disaster recovery efforts.
Your team knows your services, your struggles, and your victories better than anyone. In operations, sharing methods and successes across teams is mutually beneficial as practices are learned and iterated on. When you want to bring focus to the good things that are happening, it often helps to frame the data and wins you’re presenting in the form of a story that other teams can relate to. Here are the parts of a good story.
First, the characters: the folks who spend time on-call and/or responding to incidents, often responsible for services they didn’t write. As we continue to share our stories and wins, the characters might remain the same, or they might rotate, but at the end of the day, we all have some shared experience (or conflict) that people can relate to. Some common characters include:
- On-call Operations / Engineers / Developers - use their knowledge and skills to solve problems, usually problems they did not cause, and work to prevent those problems from recurring
- Incident Managers and Call-Leaders - manage the team response to major issues, coordinating responsibilities across team members to prevent duplicated efforts as well as ignored needs
- Engineering / Release / Change Management - give sanity checks for potential fixes while overseeing that the process of deploying reviewed fixes is smooth and can be rolled back if unforeseen problems arise
- Product Managers - know how the product should work and what end users need from both a business logic standpoint as well as a user experience standpoint
- Public Relations - communicate to customers about the efforts and resolutions of your teams, making the company look good even in the midst and aftermath of crisis
- End Users - report problems and are often willing to test deployed solutions to confirm success
Next, the conflict: any struggle that our characters have faced or are currently facing. Maybe we had a recent outage that lasted an excessive amount of time, and the post-mortem revealed a significant number of gaps that needed to be addressed. If you’ve faced an issue, it’s likely that others have faced it too, or could face it in the future. Maybe by running some chaos experiments the team found a potential point of failure.
Finally, the resolution: what you’re going to do, or what you have done, about the conflict. These are the actual wins and results that your team or management should be proud (or humbled) to present.
Maybe one of the following applies:
- Year-over-year, the year-to-date total of Sev-1 incidents was reduced
- Engineering Hours spent responding to on-call incidents reduced month-over-month
- Lost transactions/revenue due to availability-related incidents decreased
Each one of these examples can be tied to an actual dollar amount. Most of the time, the good guys win. :)
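Tying those wins to dollars is simple arithmetic: hours of downtime avoided times revenue lost per hour, plus on-call engineering hours saved times the loaded cost of an engineer. A rough sketch, where every figure below is invented for illustration and should be replaced with your own numbers:

```python
# Illustrative figures only -- the rates and before/after numbers
# are made up; substitute your organization's real values.
REVENUE_PER_DOWNTIME_HOUR = 50_000  # revenue lost per hour of outage
ENGINEER_COST_PER_HOUR = 120        # loaded hourly cost of an engineer

def annual_savings(downtime_hours_before, downtime_hours_after,
                   oncall_hours_before, oncall_hours_after):
    """Translate reduced downtime and reduced on-call toil into dollars."""
    downtime_saved = downtime_hours_before - downtime_hours_after
    toil_saved = oncall_hours_before - oncall_hours_after
    return (downtime_saved * REVENUE_PER_DOWNTIME_HOUR
            + toil_saved * ENGINEER_COST_PER_HOUR)

# e.g. downtime cut from 12 to 4 hours, on-call toil from 400 to 250 hours
print(annual_savings(12, 4, 400, 250))  # 418000
```

Even with conservative inputs, a single slide with a number like this makes the proactive work legible to leadership in a way a pager-volume chart rarely does.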
The intention of this message is not to encourage engineers and teams to “toot their own horn”, but rather to expose their often unseen efforts and thereby help leadership appreciate the benefits that come from taking a proactive approach to increasing uptime and availability.
Minimizing downtime is not just a result of a great Incident Management process or team. It can also be directly attributed to the dedicated efforts of the teams and individuals that prepare us for the failures in the first place.