For this episode your hosts, Jason Yee and Julie Gunderson, are sitting down for a year in review! With the new year just around the corner, let’s take a glance back at a year of chaos...engineering, that is. The rest of the chaos we will leave out of the conversation. Julie and Jason talk about their favorite outages of the year. From Fastly to texts from Julie’s mom, we’ve definitely got a heck of a year to consider!
In this episode, we cover:
- 00:00:00 - Introduction
- 00:03:00 - Fastly Outage
- 00:04:05 - Salesforce Outage
- 00:07:25 - Hypothesizing
- 00:10:00 - Julie Joins the Team!
- 00:14:05 - Looking Forward/Outro
Jason: There’s a bunch of cruft that they’ll cut from the beginning, and plenty of stupid things to cold-open with, so.
Julie: I mean, I probably should have not said that I look forward to more incidents.
[audio break 00:00:12]
Jason: Hey, Julie. So, it’s been quite a year, and we’re going to do a year-end review episode here. As with everything, this feels like a year of a lot of incidents and outages. So, I’m curious, what is your favorite outage of the year?
Julie: Well, Jason, it has been fun. There’s been so many outages, it’s really hard to pick a favorite. I will say that one that sticks out as my favorite, I guess, you could say was the Fastly outage, basically because of a lot of the headlines that we saw such as, “Fastly slows down and stops the internet.” You know, “What is Fastly and why did it cause an outage?” And then I think that people started realizing that there’s a lot more that goes into operating the internet. So, I think from just a consumer side, that was kind of a fun one. I’m sure that the increases in Google searches for Fastly were quite large in the next couple of days following that.
Jason: That’s an interesting thing, right? Because I think for a lot of us in the industry, like, you know what Fastly is, I know what Fastly is; I’ve been friends with folks over there for quite a while and they’ve got a great service, but for everybody else out there in the general public, suddenly, this company they’d never heard of that, you know, handles, like, 25% of the world’s internet traffic is suddenly on the front page news, and they didn’t realize how much of the internet runs through this service. And I feel it that way with a lot of the incidents that we’re seeing lately, right? We’re recording this in December, and a week ago, Amazon had a rather large outage affecting us-east-1, which it seems like it’s always us-east-1. But that took down a bunch of stuff, and similarly, there are people, like, you know, my dad, who’s just like, “I buy things from Amazon. How did this crash, like, the internet?”
Julie: I will tell you that my mom generally calls me—and I hate to throw her under the bus—anytime there is an outage. So, Hulu had some issues earlier this year and I got texts from my mom actually asking me if I could call any of my friends over at Hulu and, like, help her get her Hulu working. She does this similarly for Facebook. So, when that Facebook outage happened, I always—almost—know about an outage first because of my mother. She is my alerting mechanism.
Jason: I didn’t realize Hulu had an outage, and now it makes me think we’ve had J. Paul Reed and some other folks from Netflix on the show. We definitely need to have an engineer from Hulu come on the show. So, if you’re out there listening and you work for Hulu, and you’d like to be on the show and dish all the dirt on Hulu—actually don’t do that, but we’d love to talk with you about reliability and what you’re doing over there at Hulu. So, reach out to us at email@example.com.
Julie: I’m sure my mother would appreciate their email address and phone number just in case—
Julie: —for the future. [laugh].
Jason: If you do reach out to us, we will connect you with Julie’s mother to help solve her streaming issues. You had mentioned one thing though. You said the phrase about throwing your mother under the bus, and that reminds me of one of my favorite outages from this year, which I don’t know if you remember, it’s all about throwing people under the bus, or one person in particular, and that’s the Salesforce outage. Do you remember that?
Julie: Oh. Yes, I do. So, I was not here at the time of the Salesforce outage, but I do remember the impact that that had on multiple organizations. And then—
Julie: —the retro.
Jason: —the Salesforce outage was one where, similarly, Salesforce affects so much, and it is a major name. And so people like my dad or your mom probably knew like, “Oh, Salesforce. That’s a big thing.” The retro on it, I think, was what really stood out. I think, you know, most people understand, like, “Oh, you’re having DNS issues.” Like, obviously it’s always DNS, right? That’s the meme: It’s always DNS that causes your issues.
In this case it was, but the retro on this that they publicly published was basically, “We had an engineer that went to update DNS, and this engineer decided to push things out using an EBF process, an Emergency Break Fix process.” So, they sort of circumvented a lot of the slow rollout processes because they just wanted to get this change made and get it done without all the hassle. And it turns out that they misconfigured it and it took everything down. And so the entire incident retro was basically throwing this one engineer under the bus. Not good.
Julie: No, it wasn’t. And I think that it’s interesting because especially when I was over at PagerDuty, right, we talked a lot about blamelessness. That was very not blameless. It doesn’t teach you to embrace failure, it doesn’t show that we really just want to take that and learn better ways of doing things, or how we can make our systems more resilient. But going back to the Fastly outage, I mean, the NPR headline was, “Tuesday’s Internet Outage was Caused by One Customer Changing a Setting, Fastly says.” So again, we could have better ways of communicating.
Jason: Definitely don’t throw your engineers under the bus, but even more so, don’t throw your customers under the bus. I think for both of these, we have to realize, like, for the engineer at Salesforce, like, the blameless lesson learned here is, what safeguards are you going to put in place? Or what safeguards were there? Like, obviously, this engineer thought, like, “The regular process is a hassle; we don’t need to do that. What’s the quickest, most expedient way to resolve the issue or get this job done?” And so they took that.
And similarly with the customer at Fastly, they’re just like, “How can I get my systems working the way I want them to? Let’s roll out this configuration.” It’s really up to all of us, and particularly within our companies, to think about how are people using our products. How are they working on our systems? And, what are the guardrails that we need to put in place? Because people are going to try to make the best decisions that they can, and that obviously means getting the job done as quickly as possible and then moving on to the next thing.
Julie: Well, and I think you’re really onto something there, too, because I think it’s also about figuring out those unique ways that our customers can break our products, things that we didn’t think through. And I mean, that goes back to what we do here at Gremlin, right? Then that goes back to Chaos Engineering. Let’s think through a hypothesis. Let’s see, you know, what if ABC Company, somebody there does something. How can we test for that?
And I think that shouldn’t get lost in the whole aspect of now we’ve got this postmortem. But how do we recreate that? How do we make sure that these things don’t happen again? And then how do we get creative with trying to figure out, well, how can we break our stuff?
Jason: I definitely love that. And that’s something that we’ve done internally at Gremlin this year is, we’ve really started to build up a better practice around running Chaos Engineering internally on our own systems. We’ve done that for a long time, but a lot of times it was just specific teams, and so earlier this year, the advocacy team was partnering up with the various engineering teams and running Chaos Engineering experiments. And it was interesting to learn and think through some of those ideas of as we’re doing this work, we’re going to be trying to do things expediently with the least amount of hassle, but what if we decide to do something that’s outside of the documented process, but for which there is no technical guardrails? So, some of the things that we ended up doing were testing dependencies, right, things that again, are outside of the normal process.
Like, we use LaunchDarkly for feature flagging. What happens if we decide to circumvent that, just push things straight to production? What happens if we decide to just block LaunchDarkly all together? And we found some actual critical issues and we’re able to resolve those without impacting our customers.
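The experiment Jason describes—blocking a feature-flag dependency and seeing whether the application degrades gracefully—can be sketched in a few lines. This is a hypothetical illustration, not Gremlin’s actual setup or LaunchDarkly’s real API; the client, flag names, and defaults are all made up:

```python
# Hypothetical sketch of the failure mode under test: a feature-flag
# lookup that falls back to safe defaults when the flag service
# (e.g. LaunchDarkly) is blocked or unreachable.

SAFE_DEFAULTS = {"new-checkout-flow": False, "beta-dashboard": False}

class FlagClient:
    def __init__(self, fetch, timeout_s=0.2):
        self.fetch = fetch          # callable that talks to the flag service
        self.timeout_s = timeout_s

    def is_enabled(self, flag, user):
        try:
            return self.fetch(flag, user, self.timeout_s)
        except Exception:
            # Dependency unreachable: serve a safe default rather than
            # failing the whole request.
            return SAFE_DEFAULTS.get(flag, False)

# Simulate the chaos experiment: the flag service is blocked entirely.
def blocked_fetch(flag, user, timeout_s):
    raise ConnectionError("flag service unreachable")

client = FlagClient(blocked_fetch)
print(client.is_enabled("new-checkout-flow", user="u123"))  # False (safe default)
```

The point of the experiment is exactly the `except` branch: if no such fallback exists, blocking the dependency turns a degraded feature into a full outage.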
Julie: That’s the key element: Practice, play, think through the what ifs. And I love the what ifs part. You know, going back to my past, I have to tell you that the IT team used to always give me all of the new tech because if something was going to break for some reason—they used to call me the “AllSpark” to be honest with everybody out there—for some reason, if something was going to break, with me it would break in the most unique possible way, so before anything got rolled out to the entire company, I was the one that got to test it.
Jason: That’s amazing. So, what you’re saying is on my next project, I need to give that to you first?
Julie: Oh, a hundred percent. Really, it was remarkable how things would break. I mean, I had keyboards that would randomly type letters. I definitely took down some internal things, but I’m just saying that you should leverage those people within your organization, as well. The thing was, it was never a, “Julie is awful; things break because of Julie.” It was, “You know what? Leverage Julie to learn about what we’re using.” And it was kind of fun. I mean, granted, this was years ago, and that name has stuck, and sometimes they still definitely make fun of me for it, but really, they just used me to break things in unique ways. Because I did.
Jason: That’s actually a really good segue to some of the stuff that we’ve been doing because you joined Gremlin, now, a few months back—more than a few months—but late summer, and a lot of what we were doing early on was just, we had these processes that, internally for myself and other folks who’d been around for a while, it was just we knew what to do because we’d done it so much. And it was that nice thing of we’re going to do this thing, but let’s just have Julie do it. Also, we’re not going to tell you anything; we’re just going to point you at the docs. It became really evident as you went through that of, like, “Hey, this doc is missing this thing. It doesn’t make sense.”
And you really helped us improve some of those documentation points, or some of the flows that we had, you would execute, and it’s like, “Why are we doing it this way?” And a lot of times, it was like, “Oh, that’s a legacy thing. We do it because—oh, right, that thing we did it because of doesn’t exist anymore. Like, we’re doing it completely backwards because of some sort of legacy thing that doesn’t exist. Let’s update that.” And you were able to help us do that, which was fantastic.
Julie: Oh, yeah. And it was really great on my end, too, because I always felt like I could ask the questions. And that is a cultural trait that is really important in an organization, to make sure that folks can ask questions and feel comfortable doing so. I’ve definitely seen it the other way, and when folks don’t know the right way to do something or they’re afraid to ask those questions, that’s also where you see the issues with the systems, because they’re like, “Okay, I’m just going to do this.” And even going back to my days of being a recruiter—which is when I started in tech, but don’t worry, everybody, I was super cool; I was not a bad recruiter—that was something that I always looked for in the interview process. When I’d ask somebody how to do something, would they say, “I don’t know, I would ask,” or, “I would do this,” or would they just fumble their way through it? I think it’s important that organizations really adopt that culture of, again, failure, blamelessness, and it being okay to ask questions.
Jason: Absolutely. I think sort of the flip side of that, or the corollary of that, is something that Alex Hidalgo brought up. So, on one of our very first episodes of 2021 on this podcast, we had Alex Hidalgo, who’s now at Nobl9, and he brought up a thing from his time at Google called Hyrum’s Law. And Hyrum’s Law is this guy Hyrum who worked at Google basically said, “If you’ve got an API, that API will be used in every way possible. If you don’t actually technically prevent it, somebody is going to use your API in a way it wasn’t designed for. And because it allows that, it becomes, like, a plausible or a valid use case.”
And so as we think about this, and thinking about blamelessness, using the end-run around the normal process to deploy this DNS change, like, that’s a valid process now, because you didn’t put anything in place to validate against it and to guarantee that people weren’t using it in ways that were not intended.
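The guardrail Jason is describing—make the emergency path exist, but gate it so it can’t quietly become the default—can be sketched as code. Everything here is hypothetical (the function, the reason codes, the approval count), purely to illustrate the idea:

```python
# Illustrative guardrail in the spirit of Hyrum's Law: if a fast-path
# "emergency" deploy exists, enforce minimal checks on it explicitly,
# so circumventing the slow rollout can't become the routine way to
# ship changes. All names here are made up for the example.

ALLOWED_EMERGENCY_REASONS = {"sev1-outage", "security-patch"}

def deploy(change, emergency=False, reason=None, approvals=0):
    if emergency:
        if reason not in ALLOWED_EMERGENCY_REASONS:
            raise PermissionError("emergency deploys require a valid reason")
        if approvals < 1:
            raise PermissionError("emergency deploys require an approver")
    # Normal path: staged rollout checks would run here.
    return f"deployed {change}"
```

With a gate like this, “I used the emergency process to skip the hassle” stops being a valid use case, because the system technically prevents it.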
Julie: I think that that makes a lot of sense. Because I know I’ve definitely used things in ways that were not intended, which people can go back and look at my quest for Diet Cherry 7 Up during the pandemic, when I used tools in ways they weren’t intended, but I would like to say that Diet Cherry 7 Up is back, from those tools. Thank you PagerDuty and some APIs that were open to me to be able to leverage in interesting ways.
Jason: If you needed an alert for Diet Cherry 7 Up, PagerDuty, I guess it’s a good enough tool for that.
Julie: Well, the fact is, I [laugh] was able to get very creative. I mean, what are terms of service, Jason?
Jason: I don’t know. Does anybody actually read those?
Julie: Yeah. I would call them ‘light guardrails.’
Jason: [laugh]. So Julie, we’re getting towards the end of the year. I’m curious, what are you looking forward to in 2022?
Julie: Well, aside from, ideally, the end to the pandemic, I would say that one of the things that I’m looking forward to in 2022, from joining Gremlin, I had a really great opportunity to work on certifications here, and I’m really excited because in 2022 we’ll be launching some more certifications and I’m excited for what we’re going to do with that and getting creative around that. But I’m also really interested to just see how everybody evolves or learns from this year and the outages that we had. I always love fun outages, so I’m kind of curious what’s going to happen over the holiday season to see if we see anything new or interesting. But Jason, what about you? What are you looking forward to?
Jason: You know I, similarly, am looking forward to the end of the pandemic. I don’t know if there’s really going to be an end, but I think we’re starting to see a return to some normalcy. And so, we’ve already participated in some great events, went to KubeCon a couple months ago, went to Amazon re:Invent a few weeks ago, and both of those were fantastic just to see people getting out there, and learning, and building things again. So, I’m super excited for this next year. I think we’re going to start seeing a lot more events back in person, and a lot of people really eager to get together to learn and build things together. So, that’s what I’m excited about. Hopefully, less incidents, but as systems get more complex, I’m not sure that that’s going to happen. So, at least if we don’t have less incidents, more learning from incidents is really what I’m hoping for.
Julie: I like how I’m looking forward to more incidents and you’re looking forward to less. To be fair, from my perspective, every incident that we have is an opportunity to talk about something new and to teach folks things, and just sometimes it’s fun going down the rabbit holes to find out, well, what was the cause of this? And what was the outcome? So, when I say more incidents, I don’t mean that I don’t want to be able to watch the Queen’s Gambit on Netflix, okay, J. Paul? Just throwing that out there.
Jason: Well, thanks, Julie, for being on. And for all of our listeners, whether you’re seeing more incidents or less incidents, Julie and I both hope that you’re learning from the incidents that you have, that you’re working to become more reliable and building more reliable systems, and hopefully testing them out with some chaos engineering. If you’d like to hear more from the Break Things on Purpose podcast, we’ve got a bunch of episodes that we’ve published this year, so if you haven’t heard some of them, go back into our catalog. You can see all of the episodes at gremlin.com/podcast. And we look forward to seeing you in our next podcast.
Jason: For links to all the information mentioned, visit our website at gremlin.com/podcast. If you liked this episode, subscribe to the Break Things on Purpose podcast on Spotify, Apple Podcasts, or your favorite podcast platform. Our theme song is called Battle of Pogs by Komiku and is available on loyaltyfreakmusic.com.