Break Things on Purpose is a podcast for all things Chaos Engineering. Check out our latest episode below.
You can subscribe to Break Things on Purpose wherever you get your podcasts.
In this episode of the Break Things on Purpose podcast, we speak with J Paul Reed, Sr Applied Resilience Engineer at Netflix.
- What is an Applied Resilience Engineer? (1:12)
- Facilitating emergent discussions in a remote world (11:14)
- Incentives and discretionary space (17:40)
- Shifting from Newtonian to Quantum thinking (24:59)
J Paul Reed: I'll shut up now. See, this is what happens. No sleep, lots of talking. You should be recording all of... Oh good, you're recording all of this. Great. There you go.
Jason Yee: Welcome to the Break Things on Purpose Podcast, a show about the triumphs and tragedies of complex systems. I'm Jason Yee. In this episode, Pat Higgins and I chat with J Paul Reed, who explains what a Senior Applied Resilience Engineer does, how his role has impacted reliability at Netflix, and shifting from Newtonian to Quantum thinking.
Patrick Higgins: How are you doing, Paul?
J Paul Reed: Doing well. As you might've heard earlier, a little punchy because I was up early doing an incident review but happy to be here.
Patrick Higgins: Yeah, absolutely stoked to have you, very happy to have you here. We did want to ask you off the bat. Your job title is Senior Applied Resilience Engineer. It's a job title we don't see a lot yet in the industry. I wanted to ask you about what your day-to-day looks like. What do you do as a Resilience Engineer?
J Paul Reed: Yeah. That's a great question. I'll talk a little bit about the title, and then I'll talk about kind of what the day-to-day is because the title story is a bit of an interesting one. Netflix has a resilience engineering team, and I am not on that team. When Netflix hired and kind of created this role and came up with the title of Senior Applied Resilience Engineer, I'm on the CORE team, which stands for Critical Operations and Reliability Engineering. They didn't have that title, and we actually had to chat with the resilience engineering team because it's like, "Are you okay if we use those words?" Those words have an academic meaning, and then the R's is... I've done a couple of presentations on the various R's: resilience, robustness, and that sort of thing.
When we were talking through, because words have meaning, right? Resilience Engineering is a field of study. My Master's in Human Factors and System Safety looks a lot at resilience engineering as a practice, but it is kind of a falsehood to say, "I am a Resilience Engineer", like I engineer resilience into systems, because resilience is a practice. It's a thing that you do, right? We like to say that resilience is a verb, or sometimes I say, "We're all resiliencing together". When you look at trying to increase the resilience in a socio-technical system, like say Netflix, that's really a journey that you go on with the organization and you go on with the various teams that you're involved with. So, we wanted to be very careful about, in some sense, when you're doing this work on an ops team or a dev ops team, or whatever type of team you work on, we're all actually sort of resilience engineers if we're kind of practicing this together.
We wanted to give a nod that we're trying to emphasize the resilience engineering part, but I couldn't, in good conscience, call myself a resilience engineer, so we kind of offset it by Applied. I think if I called myself a resilience engineer, I would have gotten a nasty tweet from John Allspaw, and it would have been well-deserved. That's why we kind of put that in there. Now, what does it look like day to day? As I mentioned, I was up really early this morning doing an incident review, and the reason it was early in the morning was that it was a review with people around the globe, so it was a globe-friendly time.
The CORE team, what they do... It'd be good to sort of explain that because my mom asks, "What do you do at work, honey?" The easiest way to explain it is we hold the pager for Netflix. We look at the key metrics that we've developed over time and we continue to develop them, but the key metrics that really try to capture the customer experience for Netflix customers. When that experience is degraded in some way, our team gets paged. Usually, we're among the first teams to get paged. Often, we're the first. Although sometimes if you go, and we find this out in incident reviews, if you go back and look, there's one or two services that got paged three minutes before we did, so they're starting to look in when we get paged, but we get paged for very high-level metrics, something we call SPS, or stream starts per second.
There's literally a graph that we can see of how many people are pressing that play button, and if that deviates, that means people are pressing play but it's not playing for some reason. That's when we get paged. There's an on-call rotation for that. When there's an incident, I will often get called in. I'm actually on the emergency page everybody rotation. So, sometimes I'll show up for bigger incidents but after the incident is stabilized, I work with the incident commander, whoever that was on my team. We decide what to do. Sometimes, we'll hold an IR. Sometimes we won't. Sometimes we'll do a bunch of followups, but it's not kind of the formal IR. Netflix is a big place, and holding big IRs that have 40 or 60 people can be expensive in terms of time and all of that sort of stuff.
We don't just hold them to hold them. My manager, Dave Hahn, who's a personality if you've ever met him, has a great quote that "We do not hold IRs for spectacle. We entertain people with our content, not our incident reviews". So, I'll help with that. I'll help put that stuff together, and then when we're not looking at incidents, one of the things that I do a lot of is sort of a thematic socio-technical system analysis. Sorry, risk analysis. What that means is we look at all of the incidents that we've had, all the IRs, and we actually start to try to figure out thematic risk patterns. It's not just us. It's not like I sit there and read all of the incident tickets, and then say, "You should be worried about this thing".
A lot of times it's actually creating space, and carving out space and time for engineers to have those conversations in ways that actually promote discussions of risk in ways that you don't actually see in an incident. It would be great if we could do this before things actually blow up, but basically my day to day is doing incident follow-up, doing risk analysis for different parts of the organization, and then also actually helping teams, within Netflix of course, level up their own sort of incident analysis skills because the CORE team is relatively small for a company with 5,000 engineers.
We want to make sure that we provide resources so that teams can hold internal incident reviews, but do it in a way that is really high value, gets them what they want but also helps them talk about those things in a way that's healthy and productive, so we don't get kind of reports of counterproductive discussions because we take that stuff very seriously. We want to help folks have a good experience with that.
Patrick Higgins: When you're looking to figure out and isolate risk patterns, what does that look like? What is that process like for you?
J Paul Reed: Yeah. There's a couple of things that we look at, and you'll probably hear me use this phrase a few times: socio-technical system. It's this idea that our technology systems are, of course, technology, but they're made of people. One of the sort of analysis vectors that you see a lot of people doing, and this is sort of well-established in the industry, is the technical risk analysis. This is where we do the thing where it's like, "Oh, we've got this RDS instance on Amazon, and we're always having problems with RDS, so let's re-architect it to use something else". To get us to that thematic realization, that may take three or four or five or a hundred incidents. So, there's that kind of technical analysis where we're looking at architectural patterns and systems, but also, and I enjoy this more, the socio part of that socio-technical system, there's the social system analysis.
That's things like on-call health. You find the on-call folks in an incident maybe making the incident worse because of something that they did because they're tired. That's an on-call health thing. We look at things like that. We look at ways that information propagates between teams through the system. This is one of those things where you see... We've all had this experience, right? We're walking through the hallway in the office and we might hear someone say, "Oh hey, we're going to flip that bit on that thing", and then somebody goes, "Whoa, you don't know that if you do, that's going to break my thing, and I didn't know you were going to do it except I just happened to walk by". They'll have that conversation. A lot of times we see subtle versions of that theme during incidents, right?
I'll tell you a funny story. I wrote an article about this and I can link everyone to it. We were talking about an example of this at Netflix. My colleagues and I have dubbed it "something somebody doesn't know", which seems like a kind of obvious risk, but one of the things we realized is how often vacations actually factor into an incident. In other words, somebody was on vacation and that's why the incident happened. Of course in the article, the punchline of the story is, "The action item is nobody can ever take vacations", which of course, nobody believes that. We shouldn't do that. We should let people rest and recuperate, but that was actually a realization for us. It's a pattern of how often somebody is on vacation during the incident.
It's like they were on vacation and came back, and did something, and because they came back... The way that it played out, it was an incident three days after they got back from vacation. We started finding that as a thematic pattern. The upshot is when you do these sort of deep, long-form investigations where you start to get stories about people. They're not just about the database went down or whatever. We start to get the stories with the people involved. That's where you get these thematic patterns. You see themes in all these stories, and then you get really juicy stuff about how your organization actually works in practice. I have never talked to anyone, from a line engineer all the way up to a senior executive, where you share some of the stuff that we learn when we dig in there and they aren't just blown away by, "Wow, this system actually works like that". It's like, "Yeah, I didn't know that either".
Jason Yee: I'm curious along these lines. You mentioned a lot of that comes from things like being in the hallways and overhearing something, and obviously with COVID and the pandemic, that's not happening. I'm curious, along with this thematic discovery, what themes have you found because of COVID and people working remote now?
J Paul Reed: Yeah. It's funny. I think the entire industry has learned that a lot of organizations that were sort of remote first or built up that muscle really strongly early on are faring, I think, a little better. They're starting to fare with... They're contending, not so much with the remote mechanics of how to do the work, but they're contending with... Your kids are there now with you, or those sorts of things. Netflix was famously... Everybody has to be in LA or Los Gatos, we'll move you out, but remote is not a thing. That's something that we've actually talked about a lot. That is a muscle that Netflix is finally leaning into and starting to build because we had to. We had to do it in the context of not only was COVID going on for us, but people were leaning on Netflix too, in COVID.
That was kind of an interesting whole dynamic. One of the things though that we were doing before COVID and that we really leaned into, is creating sort of what we call an emergent space or a space for emergent discussion, where we actually try to recreate those hallway conversations in an emergent way that those conversations can just sort of happen. It's hard. It's actually super difficult to do that. So, one of the things that we do, we have a meeting actually called risk radar, and we used to do it monthly but we actually switched to every other month.
What we do is we've made it very public that if you see a risk in the system, technical, team health, whatever it is, let us know, let CORE know, let me know. One of the really fascinating things about that... Oh, sorry, let me finish. Let me know, and then at that meeting, we'll actually bring those up and it's just discussed. We don't always talk about what to do about them because the point is to get the people to discuss it and share that, "Yeah, that's a thing", and what our different viewpoints of that risk are. It's not necessarily to solve it. It's more to let people know that this is a thing that's going on in the ecosystem. Some people then will go back to their teams and just say, "Hey, let's try to figure out how to solve it", or "How can we contribute to the solution?" Other times, the solution comes up in the context of an incident: "Oh yeah, I remember at that risk radar two weeks ago, we talked about X. Maybe I shouldn't push that button I was thinking about pushing because they told me they were doing something".
Now, one of the really interesting things about doing it this way, there's a lot of times people are antsy about sharing risks that they see in a system. That makes sense, right? You're kind of like, "I have a gut feel. It's not based on a ton of data. It's just every time we talk about it, the hair on the back of my neck stands up and it just feels icky". One of the things that we realized when we moved this meeting to remote because of COVID, and started trying to collect things before the meeting via email and kind of support the remote aspect of it, is we found that actually I would get the same risks submitted to me four or five or six or seven times.
Then, I would present it in the meeting, and people would think that I was the only one that said that, and I would have to clarify, "No, actually I heard this from five other people at this meeting. You are all worried about this in your system, so let's talk about it". That opens up a really interesting avenue of conversation when you create space for the people, not only to talk about risk, but to have their own feelings and heuristics about a specific risk validated by other folks.
Patrick Higgins: Yeah. That's wild.
J Paul Reed: Yeah. It's one of those things where, I have to say, this is one of the reasons I love my job and I love the socio system part of it. I'm kind of like the wizard of Oz, the guy behind the curtain. I got all the submissions. I collated them. I know kind of who said what, but to then introduce it and sort of just be like, "Hey, this is a thing". A lot of times too, that's a role. You asked sort of what my day-to-day is? Sometimes, my role is actually to go in and just be really, really, really curious and kind of be the straight man, the stupid person in the room and just ask lots of questions.
That actually is a method to sometimes facilitate. It happens in IRs. It happens in the risk radar I was talking about. That's a way to actually facilitate that discussion. Fun fact. I was talking about the Master's in Human Factors and System Safety. That's based in the social sciences. A lot of the people reviewing the theses that we produce have their degrees in Social Science because it's people. It's people all the way down.
Patrick Higgins: Yeah. I've actually got a background in the social sciences, and I studied a lot about pathologies in organizations. That actually seems to correlate really well.
J Paul Reed: Yeah. When we talk about pathologies, a lot of times those things are related to incentives. So, when you start to kind of step back and try to just look at the socio system, obviously humans can't take all of it in, but when you start to see some of these connections, and then you can apply things like what are the incentive structures and that sort of thing, it's interesting when you can get to a place where you just look at an incident and you say there's something pathological about what happened here. When you start to dig into it and find all of the constraints and the incentives and the things that people were working under, it actually is a fun way. I know you probably had this experience where you actually can become more empathetic when you realize, "Oh, it wasn't a pathology. They had a specific incentive from this to this, to do that thing". That's actually from a personal journey perspective when you study this stuff. That's been fun. It's like working out your empathy muscle.
Patrick Higgins: Yeah. That's really interesting. Something I was really kind of ruminating on a bit was I read your post about climate change and resilience engineering. Something that came up for me talking about incentive structures, I was wondering about as this practice of thinking about socio-technical systems moves forward, when you map that to business incentives and people's incentives within organizations, how is that going? Does it map well? What are the things you see? How does it go?
J Paul Reed: Yeah. That's a really good question. One of the things that is just good context to have when we talk about these things is a lot of the safety sciences, the things that they studied before us software nerds got into the safety sciences were aviation and medical, and all of nursing and air traffic control, and nuclear power, and those sorts of things where there are real consequences if somebody makes a mistake. Now, of course we all know certain software systems maybe like Netflix aren't safety critical. Remind me, we can come back to that because I love Nora Jones. She actually has a great take on that.
Speaking of Nora Jones, she tells this story where in the Lund program, they were doing an intro of who does what, and it's, "I'm a pilot", "I'm an air traffic [controller]", "I'm a doctor", "I'm a nurse". She's like, "Yeah, I work at Netflix", and they all laughed. They're like, "No, no, really". What's interesting to me is that a lot of these discussions, the reason for the positions that we have is there's that history of people went to prison because somebody or the system determined it was human error. When you're talking about incentives and companies and software systems, I think that's kind of actually one of the elephants in the room that we don't ever really talk about. It's one of my favorite elephants too, in that room, which is when we go through and do an IR, and then come up with a list of remediation items, the thing we often don't actually talk about is what actually happens with those action items and do they get done?
We all know we'll come up with a list of like 10, and everybody in that room is like, "Oh, we're going to do that one and that one". No one's going to touch that one. We do that. I have some scribbles about this, and your question, Patrick, is going to make me write this blog post because it makes the scribbles. What's interesting is that the thing you're talking about, incentive structures, does the business find value in that? Why would we even pay someone like me to run around your company and do this sort of work? The reason is because when we get into an IR and we talk about action items we're going to do, there's a spectrum of action items based on how much we're going to invest. On one end are action items that are so valuable and so cheap. They're literally done probably 10 minutes after the incident. A person's like, "I'm just going to fix that bug right now", or they make that change before the IR even happens.
There's a bucket of those that nobody is going to... Obviously, we would do those. On the other end, real incident... By the way, this is before I was at Netflix. I was working with an organization that basically had a multi-million dollar outage. It was in the news. It was a big deal. Basically, the TL;DR is they found out that there was a setting on a certain set of machines in a data center that they were decommissioning anyway, and they didn't know which machines had which settings, and they couldn't really take them down. The point was they were already fixing that problem, but it wasn't going to be fixed for 18 months.
So, they're not going to do that. They're not going to go into that old data center and flip all those. They're just not. It's not going to happen. So, we've got this bucket of things that are so obvious and so high value and so cheap to implement that we would do. They're already done. We've got this bucket of things we all know we're never going to do, and we've got this discretionary space of action items. That's where the incentive conversation actually happens because we've all had that list. Remember those 10 remediation items I was talking about? How many of us have had that conversation where everybody says, "Item number three, we got to go do", and the VP of engineering says, "No, we're not going to do that", and the whole team's like, "We should really do that"?
That's where we go back to the socio stuff about figuring out the real incentives behind that. Super useful. Actually, this gets into... I think one of the things we might or we're going to touch on is like, "Okay, Paul, you're in a Senior Applied Resilience Engineer role. How is that different from what SREs or DevOps engineers do?" There's a thread to that, that there's actually a lot of value if you do a lot of incident reviews and that type of work. There's an argument to be made about, Patrick, to your point, standing back and understanding the incentives so that you can make a better argument about the risk, and why this trade-off is the right trade-off for somebody who has to justify it to someone in a spreadsheet. That's where you kind of see those things really playing out. The better that you can do that, in some sense, the better understanding you have of why the CFO would tell the VP of engineering you can't have that. It's going to make you more successful than trying to explain it really to the rest of the organization.
Jason Yee: That's a really interesting point about why it makes sense to have that as an independent role. To have you go in and help those other teams because no single team should hopefully have that many incidents to build up the muscle memory, and even if they did, they wouldn't have that ability to share it across the company.
J Paul Reed: Yeah. I think what's interesting is Netflix is very forward-thinking on a lot of things. I give them a lot of credit for sort of doing what you said, Jason, deciding. I think it was a bet. I think it was an experiment with Dave. My manager is listening. I hope it's paying off, that the experiment is working out, but I think it was a conversation to be like, "We're going to hire a person whose only job is to run around and not put out the fires, just bask in the glow of the fires, and then tell us how the fires felt". How do you explain that? "I'm going to pay someone money to go do that".
The good news is you are seeing that more and more in roles around companies. I think there's this interesting connection to what Gremlin does, and how they think about it because I don't think we could be having those conversations if we didn't have folks talking about the chaos engineering aspect and all of those things. They're not the same thing, but they are very connected. They are very simpatico on how they think about the problem space.
Patrick Higgins: Right. I just got to follow up. I've got to ask, do you have folks that work in a similar capacity to you that you hear about in the community though? My business essentially said, "We want to make resilience engineering a priority", or "We want to invest in this process", but anytime I look to facilitate an incident review or something like this, they want takeaways up front. They want action items straight away. They're not willing to come in with any kind of equalization when they want to actually work through a process. Hierarchy is inherent in a way that these groups are actually playing out. How does that work?
J Paul Reed: Yeah. I'll say a few words, and then I'm going to ask you if I answered your question because I think I understand it. One of the things that I will say, when we look at organizations, you can kind of step back and say, "Okay, what is the function even if I'm an SRE? What's the function of me being here? If we have an incident like doing IRs and then we come up with action items, and then we communicate those out, we'll hold off on whether we do them but we communicate them out. This is what we're going to do". What's the function? At a very basic level, what's the function of that?
Patrick Higgins: What is the function from a technical perspective or from an organizational?
J Paul Reed: From an organizational perspective.
Patrick Higgins: In a purely cynical fashion, I'd say it's to make higher ups feel like the work is getting done to remediate an issue.
J Paul Reed: Here's the thing. That's not cynical at all. That's actually how socio-technical systems work. The point is those people higher up just want to know that the organization did something. They want to know, this bad thing happened, did you do something? By the way, this extends to society. When PG&E burned down half of California, people are like, "Are you doing something about that?" They were like, "Not really", and then well, the politician was like, "Maybe you should do something about it". This applies in sort of that safety pattern all over the place. The point is, when you understand that bit of it, it's actually really easy to get kind of empathetic about what the really higher-ups are going for. They've got five-year plans to deliver on, and you just had an incident that caused the stock to go down five points.
They want to know what happened. There's an aspect to it that's that. I think the big shift though, and this is hard, this is a real shift in perspective, is that having people talk to each other and understand how they wrangled the system in peril, back over the edge to a good or stable state, and got customers together sharing that information. Sharing the story about how that happened, that is doing something. That's just as valuable as, "Hey, we automated an XYZ thing". There's one of the other traps that you fall into, and again, once you see this, you can't unsee it. That's why it's useful to kind of talk about it. How many times have you heard, "We had an incident, somebody fat-fingered a thing, and then we automated it"?
Then six months later, somebody was like, "Yeah, that automation went and just blew everything away because we forgot to do that right". So, the thing there is that a lot of times it's happened... I had this Twitter conversation last night. "Oh, well, human error. We automated, so it will never happen again". No. Automation is not bad but when we think automation automatically solves it, and asterisk, this is the dangerous part. When a leader, your manager all the way up to the CXO hears, "Oh, they automated it, so that will never happen again", that's the worst takeaway that they can take. We all know that. We've all seen automation go wonky. We know that. This is one of those things where sometimes we pass each other in the night as a leader in an org, and a team doing the work, the blunt end of the org and the sharp end of the org, and that is actually really dangerous. That's kind of that weird genesis of how incidents happen again because they heard all is okay, and we're like, "We automated it, so it's better".
Patrick Higgins: This actually reminds me. Ryan Kitchens came on to the pod and was talking about how scheduling for Chaos Monkey, or Chaos Monkey itself has led to these bizarre kind of side effects that no one foresaw. Actually, now that everyone assumes that Chaos Monkey is going to run, that provides its own sets of side effects.
J Paul Reed: Right. Yeah. Again, I really liked the way you kind of framed it, and Kitchens probably talked about it. It's that those side effects are emergent side effects. When we look at the emergent side effects, what do they actually affect? They affect the behavior of the organization. The teams that opt into Chaos Monkey are going to make different engineering decisions that affect the resilience of the system. That's why we call resilience engineering an activity because the introduction of that Chaos Monkey created an emergent forcing function that changed our behavior that had the effect of increasing sort of the resilience or what we would call the adaptive capacity within the system.
Yeah. Lots of fun there. I will say this because you were asking. Again, I don't know if I answered your question about the business side of it.
Patrick Higgins: You did.
J Paul Reed: I will say this. The challenge that I see, and we see this in the safety sciences, there's this whole idea of Safety One versus Safety Two. I won't go into the whole huge discussion on it, but the TL;DR is Safety One is sort of this model that's very linear dominoes, so if you remove the domino or make a bigger domino, it's going to be fine. Also, dominoes always fall in the same pattern, and also once you stop a domino, it's not going to ever find another way to get to the failure point. Safety Two was like, "That's all bullshit". I mean, it's not exactly like that, but the point is there's that fundamental shift.
It's interesting. There's this kind of a connection. It's as big as the shift of Newtonian Physics to Quantum Physics. All that stuff in quantum works but when we look at it, it's like, "What? That makes no sense". Also, quantum, really, really small is similar to universe size, really big. That's weird just from a cognitive like, "What's going on there?" The point is though, I find that businesses and leaders that get it have already made that shift to the quantum universe. Quantum physics makes sense to them, and businesses that really struggle with, "Why would I invest in resilience engineering or an applied resilience engineer? Why would I waste time on an IR that has no action items?" The organizations that ask that are very Newtonian Physics, very Safety One. The quantum stuff just doesn't make sense, and again, that's an opportunity for us to be empathetic and understand. They need to understand that level of shift in perspective about how their system actually works.
Patrick Higgins: That's a lot of food for thought. There's a lot to wade through there.
J Paul Reed: Yeah. It's a buffet of resilience, really.
Patrick Higgins: Yeah. 100%. If this has whetted our appetite for learning more about this subject, where would you send us both?
Jason Yee: Wait. Don't you mean that if we want to gorge ourselves on the buffet of resilience?
J Paul Reed: That wouldn't be very resilient. I think that there might be a little... Listen, I'm not going to come to that incident review, Jason, when that goes sideways. I'm just letting you know right now, but I was going to say, I have a couple of articles in O'Reilly. I just did a 97 Things Every SRE Should Know. Also, there's a 97 Things Every Cloud Engineer Should Know. There's two of them. I have an article in both of them, but I was perusing through the SRE one, and I actually just got my copy of the cloud one today right before the podcast, so I haven't had a chance to look at it, but the thing I actually did notice in the SRE one, there's a bunch of my colleagues like Lorin Hochstein who has a couple of articles in there about resilience.
The way that they structured that book, the first section is zero to one, and the last section... The title of my article is something about how safety science nerds see the world or something like that. You go maybe steps from zero to one, and then crazy people over here that work at Netflix and do weird things, but the point is there's actually a lot of articles and a wide variety of views and a number of articles that tackle chaos and resilience and all that kind of stuff. That's a good place to look for stuff. Both of those books, I think, they just recently came out, so I'd look there. Oh, and the resilience engineering website and the LFI (Learning From Incidents) website, Patrick, I think you mentioned. The LFI site has a lot of blog posts. Kitchens and Nora Jones and all sorts of folks are blogging there. That's the safety science nerds talking.
Patrick Higgins: Really good stuff. Love it. Awesome. Thanks so much for that.
Jason Yee: For links to all the information mentioned, visit our website at gremlin.com/podcast. If you liked this episode, subscribe to the Break Things on Purpose Podcast on Apple Podcasts, Spotify, or your favorite podcast app.