Podcast: Break Things on Purpose | Ep. 11: Ryan Kitchens, Senior Site Reliability Engineer at Netflix

We’re excited to kick off Season 2 of Break Things on Purpose next month. In anticipation of our next season, here’s a bonus show from our archives!

This episode we speak with Ryan Kitchens. Ryan is a Senior Software Engineer from the Core SRE team at Netflix.

Topics covered include:

Working on reliability for World of Warcraft compared to reliability at scale for Netflix.
Chaos Monkey and ironies of automation.
The optimal number of incidents.
False confidence that can be inferred from MTTX.
Mental Models and post-mortems.
The myths of root cause analysis.
Conveying successes to management.

Ryan's Twitter

Transcript of the Episode

Rich Burroughs: Hi, I'm Richard Burroughs and I'm a Community Manager at Gremlin.

Jacob Plicque: And I'm Jacob Plicque, a Solutions Architect at Gremlin, and welcome to Break Things on Purpose, a podcast about chaos engineering.

Rich Burroughs: In this episode, we speak with Ryan Kitchens from the core SRE team at Netflix. Ryan has focused on researching incidents, which is a pretty fascinating job to have and we talked with him about a lot of aspects of reliability. Jacob, what sticks out to you from our chat with Ryan?

Jacob Plicque: So Rich, I expected to discuss a lot of things in relation to chaos engineering on this podcast, but never in my wildest dreams did I think we'd be discussing World of Warcraft and their tailored reliability, and it blows my mind. What about you?

Rich Burroughs: Yeah, that was awesome. I love the idea that incidents aren't bad or good, but just things that happen and that we can use them to improve our mental models. Chaos engineering is a really great tool for learning about our systems and I always love talking about that.

Jacob Plicque: Agreed, same here. So just a reminder for everyone, you can subscribe to our podcast on Apple podcasts, Spotify, Stitcher, and other podcast players. Just search for the name, Break Things on Purpose.

Rich Burroughs: Great. Let's go to the interview.

Rich Burroughs: Today we're speaking with Ryan Kitchens. Ryan is a Senior Engineer on the Core SRE team at Netflix. Welcome Ryan.

Ryan Kitchens: Hi, thanks for having me.

Jacob Plicque: Yeah, absolutely, super great to meet you and super excited to chat a bit with you today. So, just to kind of kick things off, if you could tell us a little bit about your career leading up to Netflix, where you are today. I'm especially interested to hear about your experience at Blizzard. I read somewhere that you were a founding member of the SRE team there. Is that right?

Ryan Kitchens: Yeah, that's right. So I like to explain it as: I've had the privilege of working in a variety of different positions in software. My first job was at a shared hosting company, where I did both responding to problems in the data center, like racking and stacking stuff and building out enterprise clusters for our clients, and then also tier three sysadmin sort of work. From there, I moved on to uverse.com, which was a video OTT streaming platform, which might sound familiar given where I work now at Netflix. But yeah, I also worked at Blizzard Entertainment. I got to work with the World of Warcraft team, which was a dream of mine. It was fantastic. I grew up playing the game, so getting to work with all the engineers over there was awesome. We got to form an SRE team. I was one of the first members that sort of laid the foundation for it. And it's still going today, which is awesome. I hear good things. So yeah.

Rich Burroughs: How do you compare your time working in gaming with what you do at Netflix?

Ryan Kitchens: I think there was a really interesting dynamic around what availability means for a video game like World of Warcraft versus something like Netflix. At a lot of companies, particularly in web development, you treat your customer base as a little homogeneous when you think about outages and who's impacted. It can be very different for something like World of Warcraft, where each cluster is kind of unique because you have this concept of realms. Everyone has a home realm, people have guilds, and they have persistent, long-running connections to the servers. So you can knock one high-tier raiding guild offline, and the customer impact is different: the players' competitive advantage in the race for world-first boss kills gets affected, and we're talking about 20 to 40 people. Then you look at an outage for something like Netflix and you have thousands to millions of people affected. So the way you gauge the implications of availability, and how you measure that customer experience, is pretty nuanced.

Rich Burroughs: Wow. That's super interesting. I did a little bit of time playing WoW. I was seriously addicted at one point in my life, where I would just go to work all day and then go play WoW all night. And yeah, it's super competitive. Everybody is just trying to get further than those other people that you see hanging around in the town square or whatever.

Ryan Kitchens: Yeah. So when we think of things like the blast radius of an outage or an experiment, that has wildly different parameters that go into it when you're experimenting on video game infrastructure.

Jacob Plicque: That's so fascinating to me. I will admit as well, I'm the third of three that was also addicted to World of Warcraft in my time. And I'm fascinated by that view of blast radius. When you say "a little," Rich, does that mean a little in months or a little in years? Because for me it was a solid three or four years of playing. And it's really funny, because I've never really considered it that way. I mean, obviously we think about supply and demand, and how demand is driven by the fact that people find a particular thing important. Those 20 or 40 people that you're talking about, Ryan, they're ravenous, right? Versus the millions of people that are trying to watch Stranger Things, episode one. Those millions of people are probably not as vocal as the 20 to 40, because those 20 to 40 people couldn't get in because of scaling issues or latency, person 51 is now number one, and now that's all you hear about. It's the 40 people that missed out.

Ryan Kitchens: I'm really big on thinking about the product-customer feedback loop. When you have a dimension of your customer base that is incredibly passionate, vocal power users, this can be interpreted in different ways depending on your business. So you can think about the UI of Netflix: there are different sections of our user base who may want different things, or different recommendations, or different functionality in the UI. And so you have to think about the way that people engage with your service. If you're talking about something like World of Warcraft, that raiding guild may have more weight put on what they think, because they're much more in tune with the way the game fundamentally works.

Rich Burroughs: I'm going to talk to you guys after we're done recording and we're going to set up a WoW podcast where we're going to make this happen.

Jacob Plicque: That's amazing. That's amazing.

Rich Burroughs: So Ryan, you did a really great talk at SREcon Americas that was called "How Did Things Go Right? Learning More from Incidents," and we're going to want to dig into that some, because I just loved it. I think Jacob did as well. You mentioned chaos engineering some in the talk, in terms of eliminating some of the low-hanging fruit that you can find with experiments.

And we actually had Haley Tucker from Netflix on the show a couple of months ago. We talked about this some with Haley, and she talked about it in terms of Chaos Monkey: because Chaos Monkey has been running there for so long, for the most part that lesson is already learned, and new people that show up run right into it. But it's not, for the most part, preventing big incidents, because you all have enough experience with it that it's sort of baked into the culture at this point. So I'm wondering what kinds of experiments you all do at Netflix that you see as interesting and valuable?

Ryan Kitchens: Okay. So something peculiar with Chaos Monkey, let's start with that, because we're at a level of maturity where Chaos Monkey is opt-out at this point. It's just there, running all the time, and there's an expectation that you have to deal with it. If you are a service owner, you have to architect your system to handle single node failures, or to be rescheduled at any given point if you're, say, a container running on Titus.

So you start to hit these ironies where, if we turned Chaos Monkey off, this may reveal some risks, like file descriptor leaks or memory leaks in the long term, because what Chaos Monkey kind of does is recycle your instances.

Rich Burroughs: I think we talked about that. Maybe it was you; we chatted a bit at REdeploy Conf, and I think that came up. I thought that was a really fascinating thing: because it is terminating all these instances, it's recycling them, like you said.

Ryan Kitchens: Yeah, I think a lot about this kind of irony-of-automation problem. It's such a peculiar one, and it's really relevant to chaos engineering.

Rich Burroughs: Yeah, we'll have to link to that paper in the show notes, "Ironies of Automation," which is a brilliant one for folks who haven't read it.

Jacob Plicque: Well, that's one I haven't read myself, so I'll have to take a look at it as well. There are a few things that really stood out to me from the talk, and one was about how we're looking at failure, not quite wrongly, but in the way we're so focused on it.

A lot of the experiments that we're doing are around things that we're trying to prevent, when maybe we need to zone in and focus more on how we're actually reacting to those types of failures. I think the point you were making was: okay, great, I can run the gamut of everything that can go wrong, but there's a possibility that I'm missing out on a lot of the value because I'm focused on running these experiments and checking boxes, versus how my team responds to that failure. Because I think the argument is really that failure is a good thing for us to learn from.

Ryan Kitchens: Yeah. So one of my colleagues, J. Paul Reed, just recently wrote, "Are you having the optimal number of incidents?" The things we learn through incident response actually build a kind of expertise, and you're never really going to get rid of all of them. You can chase that forever. And when we talk about things like continuous improvement, it's not a gradient from good to better.

The word continuous means that the things we are doing, the problems we're solving, always create a new kind of risk, and we're forever chasing that. That flux of how we move on this spectrum is really what ideas like that are about. It's not in terms of maturity, it's not good or bad, it's life. And I think this resonates a lot with people in the chaos engineering space.

Jacob Plicque: For sure. And I think that also ties into something that's almost tongue in cheek, but I think it's a really interesting point: success is kind of boring, and we're not really looking at it at all. The idea behind a successful business is that they're up and running a majority of the time, but we're not even looking at that data. Right?

Ryan Kitchens: Right. Yeah. If you think about the standard deviation or distribution of the incidents you're having, and you're at a point where the measurement you use to determine availability is like three to five nines, the times you're having incidents are actually a really small portion of what is going on.

So the question gets posed: when it seems like nothing bad's happening, what's going on? What are people doing that is creating the safety around us? If we want to tap into those things, near misses are the easiest way to think about this: those almost-incidents, those little problems people are changing and adapting to all the time. If we can get a handle on those, we learn a little bit more without the pressure of the incident.
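To put rough numbers on how small that portion is, here is a minimal sketch that converts a count of nines into allowed downtime per year. It assumes a 365-day year and treats "n nines" as an availability of 1 - 10^-n; the function name is made up for illustration and does not come from the episode.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_minutes_per_year(nines: int) -> float:
    """Allowed downtime per year at `nines` nines of availability (3 -> 99.9%)."""
    unavailability = 10 ** -nines  # e.g. 3 nines -> 0.001
    return MINUTES_PER_YEAR * unavailability


for n in (3, 4, 5):
    minutes = downtime_minutes_per_year(n)
    print(f"{n} nines: about {minutes:.2f} minutes of downtime per year")
```

Even at three nines, the incident time is roughly a tenth of a percent of the year, which is the point being made: almost all of the system's life is spent in the "nothing bad is happening" state.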

Rich Burroughs: You use the word "surprises" as a way that you'd like to see incidents described.

Ryan Kitchens: Yeah. So my colleague Lorin Hochstein and I are trying to push people toward this phrase, "operational surprises," because it gets at a lot of what I think folks like John Allspaw are advocating for: recalibrating our mental models is a really significant activity we have to do following incidents, because the incidents we have are a window into the way things actually work. And I can't stress enough this idea that it's not a matter of good or bad. This is just the way things are. So the idea that John Allspaw has is that incidents are unplanned investments, and you've already spent that money. It's a sunk cost.

So your responsibility is to maximize the ROI on it. And you do that by learning as much as possible following an incident. And what I mean by learning is not like sitting in a class; I'm not talking about education or training. I'm talking about how people construct knowledge between each other, how skill transfer happens, where expertise lies in your organization.

Jacob Plicque: And I think an interesting consequence of incidents like that is what we typically run into when documenting the aftermath. Right? So-and-so went through said incident. That incident happens again, and they say, "All right, cool. Nailed it. I know exactly what to do." But I, having not gone through that incident, have no idea what's going on.

And then when I run into that, I have to either call that person or hope they wrote it down somewhere.

Ryan Kitchens: Yeah, what you're describing I find really fascinating, because people who have been on call for a long time can sit through an incident review for an issue they've had that's kind of similar, or maybe ostensibly a repeat incident, and they feel like they know everything and they're not learning anything. Yet everyone else in the room may be having incredible insights. When you interview people and ask for feedback on an incident review, you might find that the people with the most tenure in your organization feel like they're getting less out of it. So in order to combat this, what you have to do is get that kind of tacit knowledge out of their heads. This is really what good facilitators do, by asking the naive questions and getting the experts to articulate just what it is that they take for granted.

Rich Burroughs: So you mentioned mental models earlier, and I think that's what we're talking about here, right? Everybody who interacts with the system has their own level of experience with it, their own areas of expertise, the things they know the most about. And that's going to vary from individual to individual.

Ryan Kitchens: Right, so Nora Jones has brought up a really good point: when you find people disagreeing, even on the implementation of something, like, "Oh, I think it works that way" versus "I think it works this way," that is actually really significant data you can use when you're following up on an incident, because what this is getting at is how people's mental models differ. How did it come to be that you thought the system worked that way? And how do I get these two people to have that aha moment, to collaborate and calibrate their mental models?

Rich Burroughs: This is one of the use cases I think that's really great for chaos engineering. If you're having game days and getting people together and experimenting together, they have a place for that shared learning to happen.

Ryan Kitchens: Right. It's so satisfying to go through the experience of developing the experiment and setting up all of your assumptions and expectations and the invariants of the things you think should never happen. And then you get to prove it out. That feels so good.

Jacob Plicque: It boggles the mind to me. It's never boring, because that's absolutely so true. I've seen it day in and day out. It's amazing because in some cases it's not necessarily mental model versus mental model; we're just proving an assumption or a hypothesis right or wrong. Especially with things like retry and timeout logic: when we architected that, three or four years ago, we probably didn't even know how to architect for failure. So we kind of were like, "Yeah, that sounds about right."

Rich Burroughs: So Ryan, do you all use chaos engineering when you're investigating and responding to incidents at Netflix?

Ryan Kitchens: Mostly, we use the incidents to inform what we're going to do with chaos experiments. I think the end game here is that, following an incident, you have a written document of how we got here, what happened, and the things we understand about the consequences and implications of that incident.

And then there's some of the work that's going to be done going forward. If you can take that and annotate it such that you've highlighted risks and vulnerabilities to feed into something like what we have at Netflix, Monocle, for example, you can let teams know that they're subject to these gotchas that we've identified.

That's the way that you get to amplify and scale the things you've learned from an incident. So often, especially at large organizations, you'll have an incident review, you'll file it away, and then that feels kind of like the end of it. The more you can do to keep that thread going, to amplify what has been learned and get more people engaged in that information, the more you get to solve that problem for the whole company.

Rich Burroughs: Yeah. We had a really good chat with Haley Tucker about Monocle, and for anyone listening who hasn't heard that episode, go back and listen to it after this one, because it's pretty fascinating.

Ryan Kitchens: Yeah, Haley is awesome. We're kind of sister teams in a way, I suppose.

Jacob Plicque: One thing that actually ties back into the talk a little bit is something you've talked about with incident postmortems: there's no root cause. Right? There was actually a camera in the room, and I was almost curious if that was a small standing ovation for a second. It got really loud in that talk for a second.

Ryan Kitchens: Yeah. So this was the first talk I ever gave, by the way. So to -

Jacob Plicque: No way, really?

Ryan Kitchens: So to have that cheer was so incredibly rewarding and amazing, and I was completely blown away by that.

Jacob Plicque: That's amazing.

Rich Burroughs: We're going to link to the video of the talk in the show notes. People really should go watch it. And I heard you say afterwards, Ryan, that it was your first talk ever, and I could not believe it because it was so good. I've heard a lot of other people talk about this, and you mentioned J. Paul Reed earlier, who's someone who's talked about it a lot: the fact that there's not a root cause to an incident. Do you think that message is getting out there more? Because it seems like that root cause mentality is so baked into ITIL and similar things that people have been doing for so long in the industry.

Ryan Kitchens: Yeah. And I think everyone generally has a sense of, or a belief in, the way that failure works. Right? There's just a common mentality that exists about this in the world. So to speak to this, to say, "Well, the things you believe are a little more complicated than that," is initially met with a lot of skepticism and defensiveness.

I feel like our field, in contrast to other fields like aviation and medicine, doesn't have to deal with a lot of that. We are pretty good about being at the forefront of things. So when we say "there is no root cause" and someone goes, "Well, I can point to this one little component that failed," to say there is no root cause is not dismissing the idea that there was a triggering event.

What we're saying is that stopping at a root cause is sort of the issue here. Ironically, the idea around root cause is kind of surface level and shallow. But when people are onboarding into the industry, they're taught that this is really how you get to the bottom of it. And so they're coming to find out that no, you can actually kick it up a notch. There's more to go into. It's not only the things that triggered the incident; that's kind of table stakes. You kind of have to do that to understand what happened. But there are socio-technical conditions, both in the organization and in the software we've designed, that allow this to happen. That's the nuance we're getting to: there are systemic factors that a causal reconstruction of an incident is not able to explain.

Rich Burroughs: Yeah, agreed. I mean, it seems like the causal thinking is the problem there, and that's sort of hard to get away from. I think that we, as people, are sort of wired to think about things that way.

Ryan Kitchens: It's very interesting to think about the way that causal explanation is incredibly satisfying, yet on the holistic plane of existence, it limits what you learn in the long run.

Jacob Plicque: I think what's also interesting about that, and I don't know if it's a consequence or a really good thing, is that it comes back to the fact that our systems are now more complex than ever, and that's just not good enough anymore. It's not as simple as, "All right, let me walk into the data center. Yep, that's not powered on anymore. That's weird. Okay, cool, and I'm done." We're talking about latency and microservices, and it's just not as simple as it was. Maybe that's the trap that folks are falling into: as our systems become larger and more complex, we're kind of used to applying that band-aid, and we need to figure out a way to get better. I think that's really what it comes down to.

Ryan Kitchens: There's kind of a natural tendency to want to take the reductionist view, like, how much bang for my buck can I get out of this thing? But the cool thing about software, where we are today and its complexity, is that given how much exists around metrics and observability and auditing, you can see so much. Well, I say "see"; that's also another thing. In incidents, we actually kind of have to go in and ask people, "Well, how did you really see it?" Because you can't really watch the instructions execute. But you really get to annotate and articulate what this complexity looks like, and when you see the death ball of microservice call spans, it's pretty easy for someone to go, "Okay. Yeah, I get it."

Rich Burroughs: So Ryan, your focus is investigating incidents. That's your job. I wonder if you could talk us through what you do as that process. So say that there's an incident that you want to look at. What does that process look like?

Ryan Kitchens: Yeah. So you kind of have to determine the criteria for "is this worth diving into deeper?" And I think the only way to get into this is by doing. Over time you will develop these criteria for looking at an incident. And I am still surprised sometimes, where someone will DM me in Slack and say, "This looks pretty interesting," and I'm like, "You know, I'm not sure." And I look into it and, oh my gosh, the timeline goes back three or four months. I talked about a particular incident at REdeploy involving a database that went down, and lo and behold, you trace it all the way back to the early days, and somebody just wanted names consistently formatted in a spreadsheet.

Jacob Plicque: What.

Ryan Kitchens: So determining how to develop these criteria around "Is it worthwhile? Is there a lot of meat there?" is really interesting. The other thing is, for people wanting to get started, we suggest not using a really big, visible incident, because there's a lot of downward pressure from your stakeholders and management and that sort of thing.

Rich Burroughs: Oh, interesting.

Ryan Kitchens: And so if you're just trying to get started, that is not the kind of incident you want to do it with. You want to take something like a near miss, something a little more subtle, or a really novel bug to dig into. And what you do is work through this idea of second stories. This gets into the critical safety literature, but the typical write-up we've seen, "here's kind of what happened, here's what triggered it, here's what we're doing about it," is your first story. To get to the second story, you sit down with people and you chat with them and you say, "Show me how you usually do this thing."

And you ask probing questions to try to elicit all the things in their head that they kind of take for granted, that expert thing I was talking about earlier. People don't have access to their own cognition, and they remember things slightly differently. But the goal here is to get as many diverse perspectives as possible to help tell the story of what happened and amplify everyone's voice in there. The traditional way you see incident write-ups done is from this kind of omniscient, objective perspective. So what we try to do is tell a narrative story that includes all these different perspectives, to say: this is what was hard, this is what people were thinking about, these are the actions they took at the time based on the information they had. And the response to this, a lot of times, is for people to say, "Well, they ought to have done this," or "They could've, should've done that," or "Why didn't we do it a different way?" These things are called counterfactuals. Those things did not actually happen. So when you hear those kinds of comments, write them down. They're really useful. What you're going to want to do with those is project them into the future.

So it's not "that person should have done this thing." What you want to know is, what made sense to them at the time for them to do that? Otherwise they wouldn't have done it. For all the coulda, shoulda, woulda stuff, it's: in the future, how could we make those better? That is a really productive way to take that, rather than casting judgment and saying, "Now that I have the benefit of hindsight, I can make the connection to say this is what should have happened." So we try to rework a lot of those traditional ideas into something more productive. That's a concept that's generally applicable.

Rich Burroughs: You know, when the big AWS S3 outage happened in 2017, there were people on Twitter whose reaction was, "Oh, that person's going to get fired," or "Wow, they really screwed up." People who were more thoughtful about it understood that that operator making a mistake was part of a system of things. Perhaps they shouldn't have been allowed to make that mistake by the system.

Ryan Kitchens: Right. Yeah. I guess the best mindset to take on this is that it's always an artifact of the design of the system. There's only so much you can do to not, quote unquote, make mistakes. Any system that relies on people not making mistakes is fundamentally broken.

Jacob Plicque: Okay. It kind of reminds me of the fact that empathy is a choice. Right? And I find that a lot of the incident write-ups that we see publicly are even less about "this is what broke and why it broke" and more about "this is what we did in reaction to it, and this is how it affected us." And something I hadn't thought about until you mentioned it was that it's not even one person telling you that story. Right? It's the entire team. I find that super fascinating, because now that you mention it, that's such a "duh" for me. So I think that's a big takeaway for me.

Ryan Kitchens: Yeah. So the goal of writing up a narrative of what happened, and capturing all the different perspectives, is to generate new insight. Once you have this narrative, what we do is break it down into a few sections. Here are the risks we can generalize coming out of this. Unbounded queue lengths is a pretty common example I like to throw out there, where you have some service on, like, a pub/sub system, and the queue just grows and grows and grows until everything falls over. Another one -
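As a rough illustration of the unbounded queue risk described above, the sketch below uses Python's standard-library `queue.Queue` with a `maxsize`, so that a producer finds out the queue is full instead of letting it grow without limit. The `publish` helper, the limit of 3, and the event names are all hypothetical; a real pub/sub system would have its own backpressure or retention settings.

```python
import queue


def publish(q: queue.Queue, message: str) -> bool:
    """Try to enqueue without blocking; return False when the queue is full."""
    try:
        q.put_nowait(message)
        return True
    except queue.Full:
        # The caller can now shed load, retry later, or alert,
        # rather than letting memory grow until the process falls over.
        return False


# Hypothetical limit; queue.Queue() with no maxsize would be unbounded.
bounded = queue.Queue(maxsize=3)

results = [publish(bounded, f"event-{i}") for i in range(5)]
rejected = results.count(False)

print(f"accepted={results.count(True)} rejected={rejected} depth={bounded.qsize()}")
```

The design choice here is only that the limit is explicit: once a queue has a bound, "what happens when it is full" becomes a decision someone made, instead of a surprise discovered during an incident.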

Rich Burroughs: I've never encountered that. I have no idea what you're talking about.

Jacob Plicque: Kafka anyone?

Ryan Kitchens: There's a really powerful idea in asking, "What made this incident not worse than it was?" We refer to these as mitigators. This is our idea of how things went well, because if you want to talk about finding sources of resilience and all the things that went well, you can just point to something that's working and be like, "Well, that's preventing an incident right now, because it's working."

And so this could be an infinite litany of things. So we try to look at this in a scoped way, to say, "What here really saved the day?" or was an example of excellence and expertise. What things happened that stopped it from being worse? And the other thing is articulating this by breaking it down line item by line item, and I had an example of this in my SREcon talk: the contributors and enablers. This is not only what triggered the incident. It's all the things that existed to create the conditions that allowed it to happen. And you'll find, if you peel back the layer and go beyond what just triggered an incident, 10 or 15 things that you can dig into.

Jacob Plicque: So I think what's interesting is that the challenge, as we talk through this, is how you then expose the value back to the business and say, "Hey, this is why we're spending the time," because obviously these things take a good, I'd say great, amount of time. So how does that then funnel upwards?

Ryan Kitchens: Right. This sort of gets to a lot of the discussion around metrics and reporting and stuff. I think the responsibility here is to take what's been discovered post-incident and inform the people who are creating broadly applicable software in your organization: platform teams, deployment systems, any of this kind of stuff. I mentioned feeding the vulnerabilities into Monocle as one example of this. Each of those teams is looking at how their user engagement is and how their users are solving problems with the tools they're providing. So the thing that's happening here is that you're using deeply qualitative data to inform the quantitative data.

Because the problem with just looking at the numbers is that you've dropped all the context. These numbers are mostly meaningful only if they have the context.

Jacob Plicque: Which ties directly back to the "nines don't matter" argument.

Rich Burroughs: So I want to ask you about this because we were at REdeploy Conf, which was awesome by the way. I think folks should think about attending when that rolls around next year. You and I had a little conversation where I asked you about this, because you said the thing about how the nines don't matter, and I've heard a lot of other people talk about how things like mean time to detect and mean time to resolve can be gamed and they're kind of meaningless. And I come from more of a pragmatic background, and if I'm somebody at a company and I want to spin up a new chaos engineering program, I'm going to need some sort of resources to do that, right? We're either going to be writing software, or we're going to be implementing open source software, or we're going to be buying licenses. There's going to be a cost associated with that. And for me to go to management and justify that cost, I'm probably going to have to provide them with some sort of metrics. I guess I'm wondering what you would say to those folks in that position.

Ryan Kitchens: So, particularly on using MTTR as a holy grail: it's the distribution that matters. So if you break out the histogram of all the TTRs, you're going to end up finding pockets and outliers, and you're going to wonder what that's about, because looking at the mean of it is really the problem. And so with the context, this particular bucket may be contextualized by this particular problem set. And we have a tool that solves that problem. Well, using TTR for that subset of context is probably pretty useful to make that argument. So this is what I mean: just looking at the mean time as one singular number is fraught with pitfalls, but when you start to bring the context into it, then it makes a little more sense.

I would hesitate to say any of these metrics are meaningless, but they need to be informed by data. And it's a cycle. So once you've made changes to your tool that solves this problem, and you're looking at how it's improving the time to resolve for this classification of issues, that doesn't necessarily mean you are reducing risk, because if you look at the outliers in your TTR for other incidents, you're going to have things that are maybe in a 10 to 15 minute band, and you'll have some that take two to three hours to resolve. So you might think, "Oh, well, let's just look at the outliers because those are the really gnarly ones," and you may find that there are themes across incidents, even across these categories, that lead to the conditions for those incidents to happen. So there's a lot of nuance, even in classifying incidents, which we try to categorize like, "Oh, this was a vendor problem. Oh, this was with the deployment process."

But you really have to look at the conditions that create the risk to try to outmaneuver that stuff. And to simply say "we're getting better" goes back to what I was talking about at the beginning, where it's like, maybe, for some definition of better, we can better that to an extent, but then at some point you're going to have a new set of problems to deal with.

So you're continually chasing this stuff and driving it down. And simply reporting on the number is going to have diminishing returns at some point. So you've got to have this analysis that happens to keep informing you how to get ahead of what's coming at you.
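
The point about the distribution mattering more than the mean can be illustrated with a minimal sketch. The time-to-resolve values below are made up for illustration and are not Netflix data, and the 30-minute bucket width and the helper names `bucket` and `histogram` are assumptions chosen for this example.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical time-to-resolve (TTR) values in minutes; not real incident data.
ttr_minutes = [10, 12, 15, 11, 14, 13, 150, 170, 12, 10]


def bucket(minutes, width=30):
    """Return the lower edge of the fixed-width bucket a value falls into."""
    return (minutes // width) * width


def histogram(values, width=30):
    """Count values per bucket, ordered by bucket lower edge."""
    counts = Counter(bucket(v, width) for v in values)
    return dict(sorted(counts.items()))


mttr = mean(ttr_minutes)
hist = histogram(ttr_minutes)

print(f"mean TTR: {mttr:.1f} min, median TTR: {median(ttr_minutes)} min")
for lower, count in hist.items():
    # One '#' per incident in the bucket.
    print(f"{lower:>4}-{lower + 29:<4} min | {'#' * count}")
```

With these made-up numbers the mean is about 42 minutes, yet no incident falls in the 30 to 59 minute bucket: most resolve in under 30 minutes and two take well over two hours. The single number describes none of the incidents, which is why the pockets and outliers need their own context.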

Rich Burroughs: I had a really interesting conversation with Paul Osman from Under Armour, who was one of our earlier guests, but this was outside the podcast. We talked about this a little bit, and he mentioned one thing he'd been doing with his team is actually doing surveys on what their confidence in the system was, and that was another signal that they included in the stuff that they reported up to management.

Ryan Kitchens: That's cool.

Rich Burroughs: Yeah. Are there any other kind of metrics or signals that you would suggest people look at?

Ryan Kitchens: Particularly with regard to, like, are you doing a good job at incidents, kind of thing? Is that what you mean? Yeah. So there are signals and indicators around how people are engaging with the artifacts that you're producing following an incident. Some of this you can find in the Etsy Debriefing Facilitation Guide, which asks two questions: did you learn anything that would change the way you work in the future, and would you come to an incident review ever again?

I would add to this, and I think John Allspaw had a blog post recently that has many of these in there: new hires may start using these incident documents as onboarding material. People will start sharing links to this stuff across Slack. You can have analytics on the reports to see who is actually reading them. But then people start to refer to incidents as, you know, that thing, like "the 1806" or whatever, and they become reified in this way. I find that really peculiar.

Rich Burroughs: Is 1806 a real number for it?

Ryan Kitchens: It is, yeah.

Rich Burroughs: I really did appreciate your talk at REdeploy. We'll link to the video for that as well. You shared some actual incidents that you'd had there at Netflix, and I appreciated the fact that you did that, because not a lot of folks, I think, talk publicly about their incidents to that level.

The thing that people post is the thing to make their customers feel better. But the actual introspective kind of document that you're talking about, not many people share that sort of thing. And it was really cool to see you talk about those things on stage.

Was there any kind of push back against doing that at Netflix or are people totally comfortable with you all talking publicly about the kind of failures you've had internally?

Ryan Kitchens: Yeah, there was no pushback against this. Ultimately it's up to people's judgment, but it depends on the details you share. I mean, not everything is in there, you know? It was well received here. People were like, "Oh, we need to do this kind of mapping more." So it was really well received, I think.

That's awesome. It's something that we, and by we I mean us at Gremlin, myself included, have really wanted to see more of: people sharing their failures, because there really are a lot of learnings that aren't just org specific, and a lot of people can benefit from that.

Ryan Kitchens: The internal versus external utility of it matters a lot depending on your business. But I think Honeycomb found a really good balance there with the recent post that they put out of an incident. The other thing internally, I'll say, one of the best utilities that people may not really realize, is that when you have all of this information following an incident in a very coherent, crisp way, you can take it to the group of engineers and their managers who were involved in it. And there's no one particular voice behind it. So everyone feels like the bar is kind of lowered to give their input and to articulate what they think is important, or what they found difficult, or the particular design decisions they may want to revisit. And I think this is incredibly useful, particularly if you have a disparity of new hires and old hires, or any kind of way you would split your team based on functional areas or something like that, because everyone kind of feels like they have a voice in it.

Jacob Plicque: And of course everyone's muscle memory is different, or maybe in the case of a new hire, there isn't any. So just being on a level playing field, and allowing that person that's been here forever, let's say, to continue to have that voice, while also allowing the new hire that's coming in, who maybe doesn't have the same, what's the term, rose-colored glasses or rose-tinted, something like that.

So that kind of ties into one of my favorite tweets I've read in a long time: "What if success was the incidents we made along the way?" I'd love to dive a little deeper into that. I got a good chuckle from it.

Ryan Kitchens: Right. Yeah. So what I will say to this is, just going back to the idea that having incidents is not inherently a bad thing, and it's nothing to feel ashamed about. Nobody dropped the ball. Because if you're doing a good job learning from incidents, you're going to address that classification of problems, you'll have a new set of problems to deal with, and you're going to have more incidents. So in a way you're having incidents because you are successful.

Rich Burroughs: If you didn't have customers, if you didn't have traffic, none of those things would be happening.

Ryan Kitchens: Yeah. The only stable system is the one that went out of business.

Jacob Plicque: It's funny though. Something that I've been thinking a lot about is not just incidents in general, but incidents that maybe you've caused. So, practice what I preach: I am fortunate enough to have brought down a multi-million dollar website with a preceding space while doing a blue-green deployment, because we used to do it by hand, flipping over the ELB. Copy, paste, enter. Goodbye, website. Of course, in the moment, it's like, "Oh my God," losing my mind, right? But I learned from it, because I kind of knew that this was a bad practice in general, or bad process, I should say. Now, the onus is on me to not own this, but at least from my perspective, anyway, I was like, "You know what, now I've just proved that this is bad. What can I do about it?" And then literally the next week I helped write some automation for it. And then that incident never happened again at that company.

So I'm curious as to whether you've run into a situation similar to that.

Ryan Kitchens: There's one thing I really want to poke on that you said, which was that you had a sense that this was kind of a bad practice. And what I would say is, there were pressures that existed at the time that made it make sense to do this, probably, for you. And the sense that you had that, "Ah, something's a little off here," we call that operational spidey sense. And this happens across so many incidents, but it also happens way before, sometimes in design phases, where you'll come to a team and say, "Hey, there's this thing I want to implement." And the team is like, "Well, you know, we don't really support that, but we kind of do.

And, you know, I think I could make it work, but I'm not quite sure about it." And then you have this little migration and then it blows up and you're like, "Oh, it was that thing. I had a pretty good sense of it." And if you can tap into that stuff, that is this expertise that will live for such a long time in your organization. It's huge. And it's incredibly hard to get at, and you never will unless you actually talk to people.

Rich Burroughs: I've talked about, instead of code smells, infrastructure smells. And I think that's very true that anyone, especially after you've been in the industry for a while, there are things where you just have that reaction where it's like, this is not going to go well.

Jacob Plicque: And then it does it and you're like, you're right. That was right.

Rich Burroughs: Hey, listen, I think we're about out of time, Ryan. We've had such a great time talking with you, really appreciate you coming on the podcast. Where is it that people can find you on the internet if they want to learn more about you and the things you think about incidents and the rest of our work as SREs?

Ryan Kitchens: So two places. One is I'm on Twitter at this_hits_home; those are separated by underscores. The other is, I'll be putting out a blog post soon on Nora Jones's learningfromincidents.io website, and I'm hoping to have all my contact information in there too. It's been awesome talking to you.

Rich Burroughs: Yeah, thanks so much, Ryan. We will link to those things in the show notes. And, thanks a lot for listening everyone.

Patrick Higgins: The music for Break Things on Purpose is from Komiku. The song is called Battle of Pogs. For more of Komiku's music please visit loyaltyfreemusic.com or click on the link in the show notes. For more information on the chaos engineering community, visit gremlin.com/community. Thanks for listening and join us in the new year for season two.
