Take a trip down memory lane with Mandi Walls, to chat about the many changes over the years in chaos engineering and other areas of tech. We’ve also got the newest addition to Gremlin’s Developer Advocacy team, Julie Gunderson, who has joined Jason to chat with their mutual friend. Mandi talks about her previous work alongside Julie, but also the variegated nature of her background in operations and systems administration. From the days of Moviefone, AOL disks, and #hacktheplanet to more recent innovations in chaos engineering and making empathy a central ethos, Mandi has plenty of fun and valuable insights.

Show Notes

In this episode, we cover:

  • 00:00:00 - Introduction
  • 00:04:30 - Early Dark Days in Chaos Engineering and Reliability
  • 00:08:27 - Anecdotes from the “Long Dark Time”
  • 00:16:00 - The Big Changes Over the Years
  • 00:20:50 - Mandi’s Work at PagerDuty
  • 00:27:40 - Mandi’s Tips for Better DevOps
  • 00:34:15 - Outro

Transcript

Jason:  — hilarious or stupid?

Mandi: [laugh]. I heard that; I listened to the J. Paul Reed episode and I was like, “Oh, there’s, like, a little, like, cold intro.” And I’m like, “Oh, okay.”

Jason: Welcome to Break Things on Purpose, a podcast about reliability and learning from failure. In this episode, we take a trip down memory lane with Mandi Walls to discuss how much technology, reliability practices, and chaos engineering have evolved over her extensive career in technology.

Jason: Everybody, welcome to the show, Julie Gunderson, who recently joined Gremlin on the developer advocacy team. How’s it going, Julie?

Julie: Great, Jason. Really excited to be here.

Jason: So, Mandi is actually a guest of yours. I mean, we both have been friends with Mandi for quite a while but you had the wonderful opportunity of working with Mandi.

Julie: I did, and I was really excited to have her on our podcast now as we ran a podcast together at PagerDuty when we worked there. Mandi has such a wealth of knowledge that I thought we should have her share it with the world.

Mandi: Oh, no. Okay.

Julie: [laugh].

Jason: “Oh, no?” Well, in that case, Mandi, why don’t you—

Mandi: [crosstalk 00:01:28]. I don’t know.

Jason: Well, in that case with that, “Oh no,” let’s have Mandi introduce herself. [laugh].

Mandi: Yeah, hi. So, thanks for having me. I am Mandi Walls. I am currently a DevOps advocate at PagerDuty, Julie’s last place of employment before she left us to join Jason at Gremlin.

Julie: And Mandi, we worked on quite a few things over at PagerDuty. We actually worked on things together, joint projects between Gremlin, when it was just Jason, and us, where we would run joint workshops to talk about chaos engineering and actually how you can practice your incident response. And I’m sure we’ll get to that a little bit later in the episode, but will you kick us off with your background so everybody knows why we’re so excited to talk to you today?

Mandi: Oh, goodness. Well, so I feel like I’ve been around forever. [laugh]. Prior to joining PagerDuty, I spent eight-and-a-half years at Chef Software, doing all kinds of things there, so if I ever trained you on Chef, I hope it was good.

Prior to joining Chef, I was a systems administrator for AOL.com and a bunch of other platforms and sites at AOL for a long time. So, things like Moviefone, and the AOL Sports Channel, and dotcom, and all kinds of things. Most of them ran on one big platform because the monolith was a thing. So yeah, my background is largely in operations, and just systems administration on that side.

Jason: I’m laughing in the background because you mentioned Moviefone, and whenever I think of Moviefone, I think of the Seinfeld episode where Kramer decides to make a Moviefone competitor, and it’s literally just his own phone number, and people call up and he pretends to be that, like, robotic voice and has people, like, hit numbers for which movie they want to see and hear the times that it’s playing. Gives a new meaning to the term on-call.

Mandi: Indeed. Yes, absolutely.

Julie: And I’m laughing just because I recently watched Hackers and, you know, they needed that AOL.com disc.

Mandi: That’s one of my favorite movies. Like, it’s so ridiculous, but also has so many gems of just complete nonsense in it. Absolutely love Hackers. “Hack the planet.”

Julie: “Hack the planet.” So, with hacking the planet, Mandi, and your time working at AOL with the monolith, let’s talk a little bit because you’re in the incident business right now over at PagerDuty, but let’s talk about the before times, the before we practiced Chaos Engineering and before we really started thinking about reliability. What was it like?

Mandi: Yeah, so I’ll call this the Dark Ages, right? So before the Enlightenment. And, like, for folks listening at home, [laugh] the timeline here is probably—so between two-thousand-and-fi—four, five, and 2011. So, right before the beginning of cloud, right before the beginning of, like, Infrastructure as Code, and DevOps and all those things that’s kind of started at, like, the end of my tenure at AOL. So, before that, right—so in that time period, right, like, the web was, it wasn’t like it was just getting started, but, like, the Web 2.0 moniker was just kind of getting a grip, where you were going from the sort of generic sites like Yahoo and Yellow Pages and those kinds of things and AOL.com, which was kind of a collection of different community bits and news and things like that, into more personalized experiences, right?

So, we had a lot of hook up with the accounts on the AOL side, and you could personalize all of your stuff, and read your email and do all those things, but the sophistication of the systems that we were running was such that like, I mean, good luck, right? It was migration from commercial Unixes into Linux during that era, right? So, looking at when I first joined AOL, there were a bunch of Solaris boxes, and some SGIs, and some other weird stuff in the data center. You’re like, good luck on all that. And we migrated most of those platforms onto Linux at that time; 64 bit. Hurray.

At least I caught that. And there was an increase in the use of open-source software for big commercial ventures, right, and so less of a reliance on commercial software and bought solutions for things, although we did have some very interesting commercial web servers that—God help them, they were there, but were not a joy, exactly, to work on because the goals were different, right? That time period was a huge acceleration. It was like a Cambrian explosion of software pieces, and tools, and improvements, and metrics, and monitoring, and all that stuff, as well as improvements on the platform side. Because you’re talking about that time period also being the migration from bare metal and, like, ordering machines by the rack, which really only a handful of players need to do now, and that was what everybody was doing then.

And in through the earliest bits of virtualization and really thinking about only deploying the structures that you needed to meet the needs of your application, rather than saying, “Oh, well, I can only order gear, I can only do my capacity planning once a year when we do the budget, so like, I got to order as much as they’ll let me order and then it’s going to sit in the data center spinning until I need it because I have no ability to have any kind of elastic capacity.” So, it was a completely, [laugh] completely different paradigm from what things are now. We have so much more flexibility, and the ability to, you know, expand and contract when we need to, and to shape our infrastructures to meet the needs of the application in such a more sophisticated and almost graceful way that we really didn’t have then. So, it was like, “Okay, so I’m running these big websites; I’ve got thousands of machines.” Like, not containers, not services.

Like, there’s tens of thousands of services, but there’s a thousand machines in one location, and we’ve got other things spread out. There’s like, six different pods of things in different places and all this other crazy business going on. At the same time, we were also running our own CDN, and like, I totally recommend you never, ever do that for any reason. Like, just—yeah. It was a whole experience and I still sometimes have, like, anxiety dreams about, like, the configuration for some of our software that we ran at that point. And all of that stuff is—it was a long… dark time.

Julie: So, now speaking of anxiety dreams, during that long, dark time that you mentioned, there had to have been some major incidents, something that stands out that you just never want to relive. And, Mandi, I would like to ask you to relive that for us today.

Mandi: [laugh]. Okay, well, okay, so there’s two that I always tell people about because they were so horrific in the moment, and they’re still just, like, horrible to think about. But, like, the first one was Thanksgiving morning, sometime early in the morning, like, maybe 2 a.m. something like that, I was on call.

I was at my mom’s, so at the time, my mom had terrible internet access. And again, this time period don’t have a lot of—there was no LTE or any kind of mobile data, right? So, I’m, like, on my mom’s, like, terrible modem. And something happened to the database behind news.aol.com—which was kind of a big deal at the time—and unfortunately, we were in the process of, like, migrating off of one kind of database onto another kind of database.

News was on the target side but, like, the actual platform that we were planning to move to for everything else, but the [laugh] database on-call, the poor guy was only trained up on the old platform, so he had no idea what was going on. And yeah, we were on that call—myself, my backup, the database guy, the NOC analyst, and a handful of other people that we could get hold of—because we could not get in touch with the team lead for the new database platform to actually fix things. And that was hours. Like, I missed Thanksgiving dinner. So, my family eats Thanksgiving at midday rather than in the evening. So, that was a good ten-hour call. So, that was horrifying.

The other one wasn’t quite as bad as that, but like, the interesting thing about the platform we were running at the time was it was AOLserver, don’t even look it up. Like, it was just crazytown. And it was—some of the interesting things about it was you could actually get into the server platform and dig around in what the threads were doing. Each of the servers had, like, a control port on it and I could log into the control port and see what all the requests were doing on each thread that was live. And we had done a big push of a new release of dotcom onto that platform, and everything fell over.

And of course, we’ve got, like, sites in half a dozen different places. We’ve got, you know, distributed DNS that’s, like, trying to throw traffic between different locations as they fall over. So, I’m watching, like, all of these graphs oscillate as, like, traffic pours out of the [Secaucus 00:11:10] or whatever we were doing, and into Mountain View or something and, like, then all the machines in Secaucus recover. So, then they start pinging and traffic goes back, and, like, they just fall over, over and over again. So, what happened there was we didn’t have enough threads configured in the server for the new time duration for the requests, so we had to, like, just boost up all of the threads we could handle and then restart all of the applications. But that meant pushing out new config to all the thousands of servers that were in the pool at the time and then restarting all of them. So, that was exciting. That was the outage where I learned that the CTO knew how to call my desk. So, highly don’t recommend that. But yeah, it was an experience. So.

Julie: So, that’s really interesting because there’s been so many investments now in reliability. And when we talk about the Before Times when we had to cap our text messages because they cost us ten cents apiece, or when we were using those AOL discs, the thought was there; we wanted to make that user experience better. And you brought up a couple of things, you know, you were moving to those more personalized experiences, you were migrating those platforms, and you actually talked about your metrics and monitoring. And I’d like to dig in a little on that and see, how did that help you during those incidents? And after those incidents, what did you do to ensure that these types of incidents didn’t occur again in the future?

Mandi: Yeah, so one of the interesting things about, you know, especially that time period was that the commercially available solutions, even some of the open-source solutions, were pretty immature at that time. So, AOL had an internally built solution that was fascinating. And it’s unfortunate that they were never able to open-source it because it would have been something interesting to sort of look at. The scale of it was just absolutely immense. But the things that we could look at at the time to sort of give us, you know, an indication of something, like, an AOL.com, it’s kind of a general purpose website; a lot of different people are going to go there for different reasons.

It’s the easiest place for them to find their email, it’s the easiest place for them to go to the news, and they just kind of use it as their homepage, so as soon as traffic starts dropping off, you can start to see that, you know, maybe there’s something going on and you can pull up sort of secondary indicators for things like CPU utilization, or memory exhaustion, or things like that. Some of the other interesting things that would come up there is, like, for folks who are sort of intimately tied to these platforms for long periods of time, to get to know them as, like, their own living environment, something like—so all of AOL’s channels at the time were on a single platform—like, hail to the monolith; they all lived there—because it was all linked into one publishing site, so it made sense at the time, but like, oh, my goodness, like, scaling for the combination of entertainment plus news plus sports plus all the stuff that’s there, there’s 75 channels at one time, so, like, the scaling of that is… ridiculous.

But you could get a view for, like, what people were actually doing, and other things that were going on in the world. So like, one summer, there were a bunch of floods in the Midwest and you could just see the traffic bottom out because, like, people couldn’t get to the internet. So, like, looking at that region, there’s, like, a 40% drop in the traffic or whatever for a few days as people were not able to be online. Things like big snowstorms where all the kids had to stay home and, like, you get a big jump in the traffic and you get to see all these things and, like, you get to get a feel for more of a holistic attachment or holistic relationship with a platform that you’re running. It was like it—they are very much a living creature of their own sort of thing.

Like, I always think of them as, like, a Kraken or whatever. Like, something that’s a little bit menacing, you don’t really see all of it, and there’s a lot of things going on in the background, but you can get a feel for the personality and the shape of the behaviors, and knowing that, okay, well, now we have a lot of really good metrics to say, “All right, that one 500 error, it’s kind of sporadic, we know that it’s there, it’s not a huge deal.” Like, we did not have the sophistication of tooling to really be able to say that quantitatively, like, and actually know that but, like, you get a feel for it. It’s kind of weird. Like, it’s almost like you’re just kind of plugged into it yourself.

It’s like the scene in The Matrix where the operator guy is like, “I don’t even see the text anymore.” Right? Like, he’s looking directly into the matrix. And you can, kind of like—you spend a lot of time with [laugh] those applications, you get to know how they operate, and what they feel like, and what they’re doing. And I don’t recommend it to anyone, but it was absolutely fascinating at the time.

Julie: Well, it sounds like it. I mean, anytime you can relate anything to The Matrix, it is going to be quite an experience. With that said, though, and the fact that we don’t operate in these monolithic environments anymore, how have you seen that change?

Mandi: Oh, it’s so much easier to deal with. Like I said, like, your monolithic application, especially if there are lots of different and diverse functionalities in it, like, it’s impossible to deal with scaling them. And figuring out, like, okay, well, this part of the application is memory-bound, and here’s how we have to scale for that; and this part of the application is CPU-bound; and this part of the application is I/O bound. And, like, peeling all of those pieces apart so that you can optimize for all of the things that the application is doing in different ways when you need to, makes everything so much smoother and so much more efficient across, like, your entire ecosystem over time, right?

Plus, looking at trying to navigate the—like an update, right? Like, oh, you want to do an update to your next version of your operating system on a monolith? Good luck. You want to update the next version of your runtime? Plug and pray, right? Like, you just got to hope that everybody is on board.

So, once you start to deconstruct that monolith into pieces that you can manage independently, then you’ve got a lot more responsibility on the application teams, that they can see more directly what their impacts are, get a better handle on things like updates, and software components, and all the things that they need independent of every other component that might have lived with them in the monolith. Noisy neighbors, right? Like, if you have a noisy neighbor in your apartment building, it makes everybody miserable. Let’s say if you have, like, one lagging team in your monolith, like, nobody gets the update until they get beaten into submission.

Julie: That is something that you and I used to talk about a lot, too, and I’m sure that you still do—I know I do—was just the service ownership piece. Now, you know who owns this. Now, you know who’s responsible for the reliability.

Mandi: Absolutely.

Julie: You know, I’m thinking back again to these before times, when you’re talking about all of the bare metal. Back then, I’m sure you probably didn’t pull a Jesse Robbins where you went in and just started unplugging cords to see what happened, but was there a way that AOL practiced Chaos Engineering with maybe not calling it that?

Mandi: It’s kind of interesting. Like, watching the evolution of Chaos Engineering from the early days when Netflix started talking about it and, like, the way that it has emerged as being a more deliberate practice, like, I cannot say that we ever did any of that. And some of the early internet culture, right, is really built off of telecom, right? It was modem-based; people dialed into your POP, and like, that was the reliability they were expecting was very similar to what they expect out of a telephone, right? Like, the reason we have, like, five nines as a thing is because you want to pick up dial tone, and—pick up your phone and get dial tone on your line 99.999% of the time.

Like, it has nothing to do with the internet. It’s like 1970s circuits with networking. For part of that reason, like, a lot of the way things were built at that time—and I can’t speak for Yahoo, although I suspect they had a very similar setup—that we had a huge integration environment. It’s completely insane to think now that you would build an integration environment that was very similar in scope and scale to your production environment; simply does not happen. But for a lot of the services that we had at that time, we absolutely had an integration environment that was extraordinarily similar.

You simply don’t do that anymore. Like, it’s just not part of—it’s not cost effective. And it was only cost effective at that time because there wasn’t anything else going on. Like, you had, like, the top ten sites on the internet, and AOL was, like, number three at the time. So like, that was just kind of the way things are done.

So, that was kind of interesting and, like, figuring out that you needed to do some kind of proactive planning for what would happen just wasn’t really part of the culture at the time. Like, we did have a NOC and we had some amazing engineers on the NOC that would help us out and do some of the things that we automate now: putting a call together, or paging other folks into an incident, or helping us with that kind of response. I don’t ever remember drilling on it, right, like we do. Like, practicing that, pulling a game day, having, like, an actual plan for your reliability along those lines.

Julie: Well, and now I think that yeah, the different times are that the competitive landscape is real now—

Mandi: Yeah, absolutely.

Julie: And it was hard to switch from AOL to something else. It was hard to switch from Facebook to MySpace—or MySpace to Facebook, I should say.

Mandi: Yeah.

Julie: I know that really ages me quite a bit.

Mandi: [laugh].

Julie: But when we look at that and when we look at why reliability is so important now, I think it’s because we’ve drilled it into our users; the users have this expectation and they aren’t aware of what’s happening on the back end. They just kn—

Mandi: Have no idea. Yeah.

Julie: —just know that they can’t deposit money in their bank, for example, or play that title at Netflix. And you and I have talked about this when you’re on Netflix, and you see that, “We can’t play this title right now. Retry.” And you retry and it pops back up, we know what’s going on in the background.

Mandi: I always assume it's me, or, like, something on my internet because, like, Netflix, they [don’t ever 00:21:48] go down. But, you know, yeah, sometimes it’s [crosstalk 00:21:50]—

Julie: I just always assume it’s J. Paul doing some chaos engineering experiments over there. But let’s flash forward a little bit. I know we could spend a lot of time talking about your time at Chef, however, you’ve been over at PagerDuty for a while now, and you are in the incident response game. You’re in the business of lowering that Mean Time to Identification and Resolution. And that brings that reliability piece back together. Do you want to talk a little bit about that?

Mandi: One of the things that is interesting to me is, like, watching some of these slower-moving industries as they start to really get on board with cloud, the stairstep of sophistication of the things that they can do in cloud that they didn’t have the resources to do when they were using their on-premises data center. And from an operation standpoint, like, being able to say, “All right, well, I’m going from, you know, maybe not bare metal, but I’ve got, like, some kind of virtualization, maybe some kind of containerization, but like, I also own the spinning disks, or whatever is going on there—and the network and all those things—and I’m putting that into a much more flexible environment that has modern networking, and you know, all these other elastic capabilities, and my scaling and all these things are already built in and already there for me.” And your ability to then widen the scope of your reliability planning across, “Here’s what my failure domains used to look like. Here’s what I used to have to plan for with thinking about my switching networks, or my firewalls, or whatever else was going on and, like, moving that into the cloud and thinking about all right, well, here’s now, this entire buffet of services that I have available that I can now think about when I’m architecting my applications for the cloud.” And that, just, expanded reliability available to you is, I think, absolutely amazing.

Julie: A hundred percent. And then I think just being able to understand how to respond to incidents; making sure that your alerting is working, for example, that’s something that we did in that joint workshop, right? We would teach people how to validate their alerting and monitoring, both with PagerDuty and Gremlin, through the practice of incident response and of Chaos Engineering. And I know that one of the practices at PagerDuty is Failure Fridays, and having those regular GameDays that are scheduled is so important to ensuring the reliability of the product. I mean, PagerDuty has no maintenance windows, correct?

Mandi: No that—I don’t think so, right?

Julie: Yeah. I don’t think there’s any planned maintenance windows, and how do we make sure for organizations that rely on PagerDuty—

Mandi: Mm-hm.

Julie: —that they are one hundred percent reliable?

Mandi: Right. So, you know, we’ve got different kinds of backup plans and different kinds of rerouting for things when there’s some hiccup in the platform. And for things like that, we have out of band communications with our teams and things like that. And planning for that, having that GameDay to just be able to say—well, it gives you context. Being able to say, “All right, well, here’s this back-end that’s kind of wobbly. Like, this is the thing we’re going to target with our experiments today.”

And maybe it’s part of the account application, or maybe it’s part of authorization, or whatever it is; the team that worked on that, you know, they have that sort of niche view, it’s a little microcosm, here’s a little thing that they’ve got and it’s their little widget. And what that looks like then to the customer, and that viewpoint, it’s going to come in from somewhere else. So, you’re running a Failure Friday; you’re running a GameDay, or whatever it is, but including your customer service folks, and your front-end engineers, and everyone else so that, you know, “Well, hey, you know, here’s what this looks like; here’s the customers’ report for it.” And giving you that telemetry that is based on customer experience and your actual—what the business looks like when something goes wrong deep in the back end, right, those deep sea, like, angler fish in the back, and figuring out what all that looks like is an incredible opportunity. Like, just being able to know that what’s going to happen there, what the interface is going to look like, what things don’t load, when things take a long time, what your timeouts look like, did you really even think about that, but they’re cascading because it’s actually two layers back, or whatever you’re working on, like that kind of insight, like, is so valuable for your application engineers as they’re improving all the pieces of architecture, whether it’s the most front-end user-facing things, or in the deep back-end that everybody relies on.

Julie: Well, absolutely. And I love that idea of bringing in the different folks like the customer service teams, the product managers. I think that’s important on a couple of levels because not only are you bringing them into this experience so they’re understanding the organization and how folks operate as a whole, but you’re building that culture, that failure is acceptable and that we learn from our failures and we make our systems more resilient, which is the entire goal.

Mandi: The goal.

Julie: And you’re sharing the learning. When we operate in silos—which even now as much as we talk about how terrible it is to be in siloed teams and how we want to remove silos, it happens. Silos just happen. And when we can break down those barriers, any way that we can to bring the whole organization in, I think it just makes for a stronger organization, a stronger culture, and then ultimately a stronger product where our customers are living.

Mandi: Yeah.

Julie: Now, I really do want to ask you a couple of things for some fun here. But if you were to give one tip, what is your number one tip for better DevOps?

Mandi: Your DevOps is always going to be—like, I’m totally on board with John Willis’s CAMS to, like, move to the CALMS sort of model, right? So, you’ve got your culture, your automation, your learning, your metrics, and your sharing. For better DevOps, I think one of the things that’s super important—and, you know, you and I have hashed this out in different things that we’ve done—we hear about it in other places, is definitely having empathy for the other folks in your organization, for the work that they’re doing, and the time constraints that they’re under, and the pressures that they’re feeling. Part of that then sort of rolls back up to the S part of that particular model, the sharing. Like, knowing what’s going on, not—when we first started out years ago doing sort of DevOps consulting through Chef, like, one of the things we would occasionally run into is, like, you’d ask people where their dashboards were, like, how are they finding out, you know, what’s going on, and, like, the dashboards were all hidden and, like, nobody had access to them; they were password protected, or they were divided up by teams, like, all this bonkers nonsense.

And I’m like, “You need to give everybody a full view, so that they’ve all got a 360 view when they’re making decisions.” Like you mentioned your product managers as part of, like, being part of your practice; that’s absolutely what you want. They have to see as much data as your applications engineers need to see. Having that level of sharing for the data, for the work processes, for the backlog, you know, the user inputs, what the support team is seeing, like, you’re getting all of this input, all this information, from everywhere in your ecosystem and you cannot be selfish with it; you cannot hide it from other people.

Maybe it doesn’t look as nice as you want it to, maybe you’re getting some negative feedback from your users, but pass that around, and you ask for advice; you ask for other inputs. How are we going to solve this problem? And not hide it and feel ashamed or embarrassed. We’re learning. All this stuff is brand new, right?

Like, yeah, I feel old talking about AOL stuff, but, like, at the same time, like, it wasn’t that long ago, and we’ve learned an amazing amount of things in that time period, and just being able to share and have empathy for the folks on your team, and for your users, and the other folks in your ecosystem is super important.

Julie: I agree with that. And I love that you hammer down on the empathy piece because again, when we’re working in ones and zeros all day long, sometimes we forget about that. And you even mentioned at the beginning how at AOL, you had such intimate knowledge of these applications, they were so deep to you, sometimes with that I wonder if we forget a little bit about the customer experience because it’s something that’s so close to us; it’s a feature maybe that we just believe in wholeheartedly, but then we don’t see our customers using it, or the experience for them is a little bit rockier. And having empathy for what the customer may go through as well because sometimes we just like to think, “Well, we know how it works. You should be able to”—

Mandi: Yes.

Julie: Yes. And, “They’re definitely not going to find very unique and interesting ways to break my thing.” [laugh].

Mandi: [laugh]. No, never.

Julie: Never.

Mandi: Never.

Julie: And then you touched on sharing and I think that’s one thing we haven’t touched on yet, but I do want to touch on a little bit. Because with incident—with incident response, with Chaos Engineering, with the learning and the sharing, you know, an important piece of that is the postmortem.

Mandi: Absolutely.

Julie: And do you want to talk a little bit about the PagerDuty view, your view on the postmortems?

Mandi: As an application piece, like, as a feature, our postmortem stuff is under review. But as a practice, as a thing that you do, like, a postmortem is an—it should be an active word; like, it’s a verb, right? You hol—and if you want to call it a post-incident review, or whatever, or post-incident retrospective, if you’re more comfortable with those words, like that’s great, and that’s—as long as you don’t put a hyphen in postmortem, I don’t care. So, like—

Julie: I agree with you. No hyphen—

Mandi: [laugh].

Julie: —please. [laugh].

Mandi: Please, no hyphen. Whatever you want to call that, like, it’s an active thing. And you and I have talked a number of times about blamelessness and, like, making sure that what you do with that opportunity, this is—it’s a gift, it’s a learning opportunity after something happened. And honestly, you probably need to be running them, good or bad, for large things, but if you have a failure that impacted your users and you have this opportunity to sit down and say, all right, here’s where things didn’t go as we wanted them to, here’s what happened, here’s where the weaknesses are in our socio-technical systems, whether it was a breakdown in communication, or breakdown in documentation, or, like, we found a bug or, you know, [unintelligible 00:32:53] defect of some kind, like, whatever it is, taking that opportunity to get that view from as many people as possible is super important.

And they’re hard, right? And, like, we—John Allspaw, on our podcast, right, last year talked a bit about this. And, like, there’s a tendency to sort of write the postmortem and put it on a shelf like it’s, like, in a museum or whatever. They are hopefully, like, they’re learning documents that are things that maybe you have your new engineers sort of review to say, “Here’s a thing that happened to us. What do you think about this?” Like, maybe having, like, a postmortem book club or something internally so that the teams that weren’t maybe directly involved have a chance to really think about what they can learn from another application’s learning, right, what opportunities are there for whatever has transpired? So, one of the things that I will say about that is like they aren’t meant to be write-only, right? [laugh]. They’re—

Julie: Yeah.

Mandi: They’re meant to be an actual living experience and a practice that you learn from.

Julie: Absolutely. And then once you’ve implemented those fixes, if you’ve determined the ROI is great enough, validate it.

Mandi: Yes.

Julie: Validate and validate and validate. And folks, you heard it here first on Break Things on Purpose, but the postmortem book club by Mandi Walls.

Mandi: Yes. I think we should totally do it.

Julie: I think that’s a great idea. Well, Mandi, thank you. Thank you for taking the time to talk with us. Real quick before we go, did you want to talk a little bit about PagerDuty and what they do?

Mandi: Yes, so Page—everyone knows PagerDuty; you have seen PagerDuty. If you haven’t seen PagerDuty recently, it’s worth another look. It’s not just paging anymore. And we’re working on a lot of things to help people deal with unplanned work, sort of all the time, right, or thinking about automation. We have some new features that integrate more with our friends at Rundeck—PagerDuty acquired Rundeck last year—we’re bringing out some new integrations there for Rundeck actions and some things that are going to be super interesting for people.

I think by the time this comes out, they’ll have been in the wild for a few weeks, so you can check those out. As well as, like, getting better insight into your production platforms, like, with a service graph and other insights there. So, if you haven’t looked at PagerDuty in a while or you think about it as being just a place to be annoyed with alerts and pages, definitely worth revisiting to see if some of the other features are useful to you.

Julie: Well, thank you. And thanks, Mandi, and looking forward to talking to you again in the future. And I hope you have a wonderful day.

Mandi: Thank you, Julie. Thank you very much for having me.

Jason: For links to all the information mentioned, visit our website at gremlin.com/podcast. If you liked this episode, subscribe to the Break Things on Purpose podcast on Spotify, Apple Podcasts, or your favorite podcast platform. Our theme song is called Battle of Pogs by Komiku and is available on loyaltyfreakmusic.com.
