What does reliability look like at a company that has thousands of employees and provides critical communication services to over 150,000 customers?
We talked with Tyler Wells, Senior Director of Engineering at Twilio, to learn how he and his team created a culture of reliability at Twilio. He talked in depth about his experiences developing reliability goals, building reliability practices, and aligning engineering teams on these objectives. In this post, we’ll look back at some of the key takeaways from his talk.
If Twilio’s reliability efforts could be summed up in a single word, it would be "trust". Customers need to trust that their services will be available and responsive, and Twilio’s culture is built on this foundation. Reliability has a direct impact on the customer experience, and part of Tyler’s responsibility is ensuring that the entire company understands this relationship and the benefit of focusing on reliability.
[The answer to ‘why do we need to be reliable’] is a single word: trust! Trust is the most important thing that we can deliver. For our platform to be viable, our customers have to trust that we will be available, and in order for us to earn the trust of our customers, we have to be reliable.
Twilio has thousands of engineers, and aligning this many people takes time, education, and practice. To manage this, Tyler breaks down his approach into three key elements: culture, customer empathy, and accountability.
Reliability starts with fostering and promoting a company culture that’s grounded in the customer experience. This is especially true in engineering teams, whose work directly influences usability, availability, service quality, and other aspects of the product. Making engineers aware of the impact that incidents and design decisions can have on the customer experience fosters a mindset where engineering effort is focused on creating the best possible experience, not just delivering features.
A large part of building this culture also meant using Chaos Engineering to anticipate and proactively address failures. Originally, the team used an open source failure injection tool to simulate poor network conditions. They later found that Gremlin fit their goals and allowed them to build out their Chaos Engineering practice more effectively.
We adopted Gremlin to help foster a culture of being prepared for anything and building reliable services. We were able to take [Gremlin], embed it inside of our teams, and use it to simulate the types of experiences our customers could experience when using our APIs and our products. We really found that utilizing Gremlin made the efforts to simulate these types of failures a heck of a lot easier.
Empathizing with customers is a crucial part of Twilio’s culture. Tyler discussed two traditions Twilio uses to help its engineers understand the customer experience.
First, new engineers are placed into a support queue for one week after onboarding and training. Handling support tickets gives them a direct look into customer pain points and the effect that their decisions can have on the customer experience. “Wearing the customer’s shoes,” as Tyler puts it, helps relate the work that engineers do back to the customer.
Second, Twilio encourages and rewards employees who build applications using the Twilio API. Employees who successfully build and demo their applications earn a signature red track jacket. Besides rewarding creative employees, the program also helps employees understand the experience customers have when using Twilio, and can help surface issues. Several of these “track jacket apps” are even published on the Twilio blog so that customers can benefit directly.
Last is accountability, which Tyler breaks down into two key elements: measurement and psychological safety.
When discussing reliability, teams often make the mistake of focusing on incident counts. At Twilio, this is only part of the full story. More important are metrics that directly reflect the customer experience, such as mean time to detection (MTTD) and mean time to resolution (MTTR). Not only does Twilio measure MTTD and MTTR, but teams that successfully reduce their MTTD and MTTR are encouraged to share their learnings, practices, and strategies with other teams.
You can’t fix what you don’t measure... When you’re thinking about the measures that you want to put in place as you’re starting to drive towards a culture of accountability, think about what it is you’re going to measure that’s going to reflect that experience that your customers had.
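To make the two metrics concrete, here is a minimal sketch of how a team might compute MTTD and MTTR from a list of incident records. It is not Twilio's tooling: the `Incident` fields, the helper names, and the sample timestamps are all hypothetical. It also assumes that both metrics are measured from the moment customer impact began. Some teams instead measure MTTR from the time of detection.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record shape; real incident trackers will differ.
    started_at: datetime   # when customer impact began
    detected_at: datetime  # when the team became aware of the problem
    resolved_at: datetime  # when customer impact ended


def mean_duration(durations):
    """Average a collection of timedeltas."""
    return timedelta(seconds=mean(d.total_seconds() for d in durations))


def mttd(incidents):
    """Mean time to detection: average of (detected - started)."""
    return mean_duration([i.detected_at - i.started_at for i in incidents])


def mttr(incidents):
    """Mean time to resolution: average of (resolved - started).

    Assumption: measured from the start of impact, not from detection.
    """
    return mean_duration([i.resolved_at - i.started_at for i in incidents])


# Made-up sample data for illustration only.
incidents = [
    Incident(
        started_at=datetime(2024, 1, 1, 10, 0),
        detected_at=datetime(2024, 1, 1, 10, 5),
        resolved_at=datetime(2024, 1, 1, 10, 45),
    ),
    Incident(
        started_at=datetime(2024, 1, 2, 14, 0),
        detected_at=datetime(2024, 1, 2, 14, 15),
        resolved_at=datetime(2024, 1, 2, 15, 15),
    ),
]

print("MTTD:", mttd(incidents))  # MTTD: 0:10:00
print("MTTR:", mttr(incidents))  # MTTR: 1:00:00
```

Tracking these values per team over time, rather than a raw incident count, shows how long customers were actually affected.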
In addition, Twilio practices a blameless culture by providing psychological safety. When an incident happens, engineers aren’t blamed, but empowered to resolve the problem, learn from it, and improve. Incident response teams need a high level of communication, support, and trust to feel psychologically safe. That safety makes for a more effective incident response process where the end goal is bettering the product, not finding out who’s at fault.
How do you create a culture that strives to learn from the unanticipated investments that can be incidents?
Twilio didn’t build its reliability culture overnight. Making organization-wide cultural changes is an ongoing, iterative process that takes practice. The key is to have a clear, unifying goal that teams can strive towards: in Twilio’s case, providing the best possible experience for their customers.
Here are some tips for helping your teams strengthen their reliability practice:
- Relate your core values back to your customers. Always focus on the customer and maintain customer empathy.
- Start early in the development life cycle. Start thinking about the customer experience in the design phase of a new product or service. Set aspirational goals for reliability in design, then continuously test against them as engineers build and test the product.
- Learn from failings. Use incidents as a learning experience, whether they're your own or those experienced by other companies.
There’s much more to the webinar than we covered here, including how Twilio runs GameDays, how to measure the customer impact of incidents, and how to demonstrate the value of reliability to management. Make sure to watch the full on-demand webinar here: How Twilio built a culture of reliability.