Podcast: Break Things on Purpose | JJ Tang: People, Process, Culture, Tools
For this episode we’re continuing to “Build Things on Purpose” with JJ Tang, co-founder of Rootly, who joins us to talk about incident response, the tool he’s built, and his many lessons learned from incidents. Rootly aims to automate some of the more tedious work around incidents and to keep that process consistent. JJ chats about why he and his co-founder built Rootly, and the problems they’re trying to fix and eliminate when it comes to reliability. JJ reflects on what sets Rootly apart, how they handle chaos engineering, and more!
In this episode, we cover:
- 00:00:00 - Introduction
- 00:00:57 - Rootly, an incident management platform
- 00:02:20 - Why build Rootly
- 00:06:00 - Unique aspects of Rootly
- 00:09:50 - How people should use Rootly
- Rootly: https://rootly.com/demo
JJ: How do you now get this massive organization to change the way that they work? Even if they were following, like, a checklist in Google Docs, that still represents a fairly significant cultural change, and so we need to be very mindful of it.
Jason: Welcome to another episode of Build Things on Purpose, part of the Break Things on Purpose podcast. In our build episodes, we chat with the engineers and developers who create tools that help us build modern applications, or help us fix them when they break. In this episode, JJ Tang, co-founder of Rootly, joins us to chat about incident response, the tool he’s built, and the lessons he’s learned from incidents.
So, in this episode, we’ve got with us JJ Tang, who’s the co-founder of a company and a tool called Rootly, welcome to the show.
JJ: Thank you, Jason, super excited to be here. Big fan of what you guys are doing over at Gremlin and all things Chaos Engineering. Quick intro on my side. I’m JJ, as you mentioned. We are building Rootly, which is an incident management platform built on top of Slack.
So, we help a bunch of different companies automate what we believe to be some of the most manual and tedious work when it comes to incidents, like creating virtual war rooms, Zoom Bridges, tracking your action items in Jira, generating your postmortem timeline, adding the right responders, and generally just helping build that consistency. So, we work with a bunch of different fast-growing tech companies like Canva, Grammarly, Bolt, Faire, Productboard, and also some of the more traditional ones like Ford and Shell. So, super excited to be here. Hopefully, I have some somewhat engaging insight. [laugh].
Jason: Yeah, I think you will because in our discussions previously, we’ve always had fantastic conversations. So, you’ve kind of covered a lot of the first question that I normally ask, and that’s what did you build? And so as you explained, Rootly is an incident management tool; works with Slack. But that naturally leads into the other question that I asked our Build Things guests, and that’s why did you build this? Was it something from your experience as an engineer that you’re just like, “I need a tool to solve this?” What’s the story behind Rootly?
JJ: Yeah, definitely. Sorry to jump the gun on the first question. I was a little bit too excited, I think. But yeah, so my co-founder, and I—his name is Quinton—we both used to work at Instacart, the grocery delivery startup. He was there super, super early days; he was actually one of the first SREs there and kind of built out that team.
And I was more on the product side of things, so I helped us build out our enterprise and last-mile delivery products. If you’re curious what grocery has to do with reliability [laugh], actually, not that much, but the challenges we were dealing with were at very great scale. So, it all started back when the pandemic first kicked off. Instacart was growing rapidly at the time, we were scaling really well, we were hitting the numbers where we wanted them to be, but with the lockdowns suddenly occurring, everyone who didn’t care about grocery delivery and thought, “Well, why don’t I just drive to Walmart,” [laugh] suddenly wanted to order things on Instacart. So, the company grew 500 to 600%, nearly overnight.
And with that, our systems just could not handle the load. And it’d be the most obscure things you wouldn’t think would break, but under such immense stress and demand, we just couldn’t keep the site up all the time. And what that really exposed on our end was that we didn’t have a really good incident management process. What we were doing was, we kind of just had every engineer in a single incident channel on Slack. And if you got paged, you just kind of pinged in there. “I just got woken up. Did anyone else? Does this look legit?”
And there was no formal way, so there was no consistency in terms of how the incidents were created. And then, of course, from that top-of-funnel into the postmortem, there wasn’t too much discipline there. So, we really thought about, you know, after the dust kind of settled, there must be a better way to do this. And like most organizations that we work with, you start thinking about how can I build this myself?
I think there’s probably a little bit of a gap right now in this space. People generally understand monitoring tools really well, like New Relic and Datadog, and alerting tools super well, like PagerDuty and Opsgenie; they do a really good job at it. But everything afterwards, the actual orchestration and learning from the incidents, tends to be a little bit sparse. So, we started embarking on our own solution. And on my co-founder’s side of things, he was more at the heart of the incident than I was. I think I was the one complaining and breathing down his neck a little bit about why things [laugh] sometimes weren’t working.
And—yeah, and, you know, as we started thinking about internal solutions, we took a step back and thought, “Well, you know, if Instacart is facing this problem then I think a lot of companies must be as well.” And luckily, our hypothesis has proven to be true, and yeah, the rest is just history now.
Jason: That’s really fascinating, particularly because, I mean, it is such a widespread issue, right? And I think I’ve experienced that as well, where you’ve got a general on-call or incidents channel, and literally everybody in the organization’s in there, not just engineers, but, like yourself, product people and customer success or support folks are all in there. And the idea is, it’s this giant crowd of folks who are just, like, waiting and wondering. And so having a tool to help manage that is extremely useful. As you started building out this tool, I’m noticing there are starting to be a lot more incident management tools, or incident response management tools, so talk to me about what are the unique points about Rootly?
Because I suspect that a lot of it is influenced from, “These are the pain points that I had during my incidents,” and so you pulled them over? And so I’m curious, what are those that you brought to the tool that really help it shine during an incident?
JJ: Yeah, definitely. I think the space that we’re in right now is certainly heating up, judging by the different conferences and the content that’s put out there. Which is great because that means everyone is educating the broader audience about what’s going on, and it just makes my job a little bit easier. There were a couple of, you know, original hypotheses that we had for the product that just ended up not being as important. And that has really defined how we think about Rootly and how we differentiate a lot of what we do.
How we did incidents at Instacart wasn’t all that unique, you know? We used the same tools everyone else did. We had Opsgenie, we used Slack, Datadog, Jira, we wrote our postmortems on Confluence, stuff like that, and our initial reaction was, “Well, people are using the same tools, they must be following a very similar process.” And we also looked and worked a lot with people that are deep into the space, you know, Google, Stripe, the Airbnbs of the world, people that have a very formal process. And so we actually embarked on this journey building a relatively opinionated tool; “This is how we think the best incidents can be run.” And that actually isn’t the best fit for everyone.
I think if you have no incident management process whatsoever, that’s a great fit. You know, we give you super powerful defaults out of the box, like we do today, and you can just hit the ground running super fast. But what we found is despite everyone using basically the same kind of tools, the way they use them is super different. You might only want to create a Zoom Bridge for, you know, high severity incidents, whereas someone else wants to create it for every single incident, for example. So, what we did was really focus on how do we balance between building something that’s opinionated versus flexible, and where customers should be able to turn the knobs and the dials.
And a big part of it is we built what we call our workflows, and that allows customers to create a process that’s very similar to theirs. And a part that we didn’t anticipate at the very beginning was, although the tool is super simple to use, I think our average install time is probably 13 minutes, all the integrations and everything, on a quick call with our customers, the really heavy lifting comes with, how do you now get this massive organization to change the way that they work? Even if they were following, like, a checklist in Google Docs, that still represents a fairly significant cultural change, and so we need to be very mindful of it. So, we can’t be just ripping tools out of their existing stack, we can’t be wildly changing every process; everything has to happen progressively, almost, in a way. And that is a lot more digestible than saying you’re going to replace everything.
So, I think that’s probably one of the key differences is we tend to lean more on the side of playing with your existing stack versus changing everything up.
Jason: That’s a really good insight, particularly because, coming from Chaos Engineering, that is almost entirely about changing the way that people work, right, since Chaos Engineering is a new practice. So, I definitely empathize with you, or sympathize with you, on that struggle of, like, how do you change what people are doing and really get them to embrace it? That said, being opinionated is also a really good thing because you have a chance to lead people, and so that leads me to our final question that we always ask folks, and this is where being opinionated is good: if folks were to use Rootly, or even just wanted to improve their incident response processes in general, what are some of those opinions that you had about how people should be doing that, that they should consider embracing?
JJ: Yeah, that’s an awesome question. So, a couple things, and this is a little bit related to your second question, about what we initially thought that just proved to not be as important for us. Everything that we built at the beginning, and still build, is relatively laser-focused on helping you get to that resolution as fast as possible. But from an organizational perspective, what we found is, people don’t think about incident management success as how quickly they can resolve an incident. A lot of it’s actually just having that security and framework and consistency around the incident. So ironically, for a tool in incident management, the most important things are actually your people and the process and the culture that you can develop around the tool.
No matter how good the thing we build is, you know, let’s say you’re an organization that just brings in Rootly: you have a very blameful way of handling postmortems, no one generally understands how severities in the organization work, you’re super laser-focused on, you know, tracking MTTR, which may not always be the best metric but you still want to interpret it as such. In that case, it’s very difficult to make the tool successful. So, that’s the biggest advice that we give to our customers: when we see those types of red flags from, like, a process and culture standpoint, we’ll try to guide them the best that we can. And we’ll also do it from a product perspective. With what you get out of the box today, we have companies as small as, you know, 20 people, for example, being able to hit the ground running; they’ll use workflow templates that are pre-built based on some best practices that we’ve seen to just kind of layer in that framework. So, I think the really big thing that we’ve noticed is it’s not all about us; it’s not all about the product and the benefits that we can provide; it’s about how we can actually enable our customers to get to that stage.
Jason: I love that answer. Well, JJ, thanks for being a guest on the show and sharing a bit more about your journey and the journey of Rootly. If folks are interested in trying out the product and getting better at incident response, where can they find more info about you and about Rootly?
JJ: Yeah. You can just visit rootly.com/demo. We do offer a 14-day trial if you want to sign up for free.
If you want to talk to one of us or our partnerships team, you’re welcome to book a personalized session. I recommend that because then you get to see my super cute dog, who isn’t with me right now, and it wouldn’t matter anyway because this is audio only, but I love showing her off. That’s my favorite part of my job.
Jason: So, if you want to go see JJ's dog, or learn more about Rootly and incident management, go check it out. Thanks again.
JJ: Yeah, thanks for having me.
Jason: For links to all the information mentioned, visit our website at gremlin.com/podcast. If you liked this episode, subscribe to the Break Things on Purpose podcast on Spotify, Apple Podcasts, or your favorite podcast platform. Our theme song is called Battle of Pogs by Komiku and is available on loyaltyfreakmusic.com.