Podcast: Break Things on Purpose | Brian Holt, Principal Program Manager at Microsoft
Break Things on Purpose is a podcast for all-things Chaos Engineering. Check out our latest episode below.
You can subscribe to Break Things on Purpose wherever you get your podcasts.
In this episode of the Break Things on Purpose podcast, we speak with Brian Holt, Principal Program Manager at Microsoft.
- The importance of reliability in dev tooling (1:57)
- Chaos at Reddit (4:03)
- Frontend perspectives on Chaos Engineering (15:09)
Patrick Higgins: So, how are you doing Brian?
Brian Holt: I'm doing okay. I'm off today so I'm doing extra good.
Patrick Higgins: That's great. Awesome.
Brian Holt: I was just going to say, I'm just working on my new course, so I guess I am still kind of working a little bit.
Patrick Higgins: Nice. What is the latest one?
Brian Holt: This is version six of my intro to React course.
Patrick Higgins: Oh, awesome. Very cool.
Ana Medina: That's a pretty good use for a day off Monday, but very much not really a day off.
Brian Holt: Yeah. I kind of get myself into these situations. I guess you could say this is me breaking my own life on purpose.
Jason Yee: Welcome to Break Things on Purpose. A podcast about building reliable systems out of the chaos. In this episode, Pat Higgins and Ana Medina chat with Brian Holt about developer tooling, an incident at Reddit, and chaos in the frontend.
Patrick Higgins: So, Brian, can you tell the listeners a bit about what you do on the day-to-day? What does a principal program manager at Microsoft actually take care of?
The importance of reliability in dev tooling
Patrick Higgins: So something that comes to mind for me is, engineers can be pretty hard taskmasters when it comes to our time and reliability with their tooling. Have you found that that has to be a priority when you're thinking about building out product?
Brian Holt: Yeah. I mean, especially when it comes to things like Visual Studio Code. If we had some sort of breaking bug that we shipped with VS Code, it would be catastrophic for not only our reputation, but just general productivity of the world, right? How much of the world is writing code on VS Code these days, and if all of a sudden we shipped an update that broke everyone's developer tool, we would never hear the end of it and we'd also lose a lot of trust from developers.
Patrick Higgins: Yeah. I know on a personal level, I'd never let you hear the end of it as well. So it's good to hear that it's top of mind for you.
Brian Holt: Yeah. That's the other reason is I'm afraid of Pat. That's the other reason that I don't ship bugs.
Ana Medina: I do love that you touched a little bit upon, you have to keep this platform up and running because trust will be lost. And we know in developer tools that it's like, you get an engineer mad for one day of something being down, they're going to go find another tool. They'll go pay for another service. And they might tweet about it too.
Brian Holt: Yeah, absolutely. I heard Scott Guthrie, our executive vice president of Cloud and AI, I don't know if he came up with this quote, I just heard him say it, that you gain developer trust in drops and you lose it in buckets. So something like that, you'll spend another five years recovering from the reputation loss of one major outage.
Patrick Higgins: Yeah. That's heavy. That's so brutal to think through, particularly seeing that we take so much of our time. So much of our life is dedicated to the work that we do and being able to lose it that quick, it's quite a brutal sentiment to really think about.
Brian Holt: It's tough on developer tooling too, because there's so many good tools out there. It's not just VS Code, right? You can hop on Sublime, you can hop on WebStorm and they're all fantastic tools as well. And so that reliability piece of it is perhaps the most important part because people really want to trust you, but not only that, they just don't want to think about it, and when you make them think about it, then they get upset.
Chaos at Reddit
Patrick Higgins: So Brian, on the podcast, we like to ask our guests about any horror stories that they've had. What has been a horrible incident that you've encountered in your career?
Brian Holt: So I remember when you asked me this and I sat there and thought about it. And I've had a couple pretty big ones, but the biggest one that I was personally most responsible for and had the most, one, skin in the game and two, I had to respond to the outage, was one that ironically, Pat, you were in the room for. Years and years ago, when I was working at Reddit and Pat was working on his political thing that he was doing when we both lived in Utah. So I used to work at Reddit. I was the director of engineering for the experimental division part of Reddit, which one, doesn't exist anymore and two, I was director of myself and I just liked telling people that I was a director at Reddit. So one of the things that we did, we had a Reddit marketplace because our directive was to find ways for Reddit to make money that wasn't just advertising.
So one of the things we thought was, there's a lot of commerce going on right now, we should have a marketplace where people can go on there and we can get Redditors to put their wares up there. It basically turned out to be kind of an Etsy for people on Reddit. I'm a creator on Dogecoin, I want to go sell Dogecoin t-shirts, which we did. We had Dogecoin t-shirts, back before it was cool. So it was really just kind of me and maybe five other people working on this. And I was really the only frontend person fully focused on this. We got some attention from other developers, but they were working on other kinds of initiatives. So one of the things we wanted to do is we wanted to start running flash sales, really big, we're going to put something on the front page of Reddit saying, "Hey, Redditors, we're having a huge liquidation sale," kind of trying to build up for a Black Friday type sale.
But we were kind of running chaos experiments in production, like, "Hey, let's see how this breaks first if we run it in April, and then we can kind of run several of these building up to Black Friday." It turns out that was a very wise thing to do because it broke in profoundly unexpected ways. So I remember the first one that we ran, I think it was in April, and it was me and another engineer, Eric Fish, bless him. He's definitely not going to listen to this so we can talk as much crap about Fish as we want to.
Patrick Higgins: I hope he does.
Brian Holt: Now you're going to send it to him.
Patrick Higgins: Yeah. Yep.
Brian Holt: Worked tirelessly night and day on this. There were a couple of nights in the office that I didn't leave until one or two in the morning. But we felt pretty good going into it. We had a heavy caching layer in front of it, and I guess it's probably interesting what kind of architecture we had. It was a Django Python backend. I think we were running MySQL, and then the frontend was written in React... Parts of it were in Angular and parts of it were in React, and this would have been 2014 probably. Yeah, so we got all excited about it. We launched it. We put the blog post on the front page of Reddit, which is just an absolutely insane amount of traffic. So instantly we posted it and we had just a massive spike of traffic to, I can't even remember. There's a point where Google Analytics gives up trying to tell you how many people are on your website, and I want to say it was 240,000 people at a time. It was something like that.
Patrick Higgins: Wow.
Brian Holt: I Googled this: "has anyone ever seen this before?" Because there's not many people that get to see 240,000 people concurrently on their site who are using the free tier of Google Analytics. So we got on there, and the way the sale worked was every 30 minutes, it was flipping... It was one item for sale every 30 minutes. And there was a big fat timer that kind of put some pressure on people to buy something. And all this was heavily cached because it was content that we could understand in advance. But we noticed that the frontend wasn't counting down correctly and it was always serving the same time. So if you loaded the page, it would say that there's 29 minutes left, and then if you refreshed the page after two minutes, it still said there was 29 minutes. And we realized we had cached the timer, which is a problem because the timer is supposed to go down. So we're like, "Oh, okay. We'll just invalidate the cache." That totally makes sense. So I click invalidate the cache-
Patrick Higgins: Oh no.
Brian Holt: And I just immediately recognized my problem because it was just classic thundering herd. There were zero requests hitting the server and all of a sudden, every request was hitting the server. And so the servers just burst into flames. There was just an unreal spike of traffic. Our ingress just had no idea how to route traffic, like, "Hey, everything's down. Do I go down now?" It was just... It was a profoundly amazing crash. And so essentially what we had to do is we had to take the entire site offline, redirect people to a "sorry, we're under heavy load" sign, get everything back up, manually put the cache back in, and we just deleted the timer off the site. So I was like, "well, people will figure it out when they can't buy it."
Patrick Higgins: That'll do.
Brian Holt: Yeah. Does that make sense? My profound idiocies? Does that make sense there?
Ana Medina: It just sounds like things had to break in order for you to learn.
Brian Holt: Yeah, absolutely. And so this was the first of, I think, four or five that we ended up running, but by the time that we actually got to Black Friday, we actually had things mostly smoothed out to the point that I don't think we had any outages, despite the fact that we had much more traffic on the Black Friday sale day.
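The crash Brian describes is the textbook thundering herd: flushing the cache sends every concurrent request straight to the backend at once. A common mitigation is a "single-flight" cache, where only one caller regenerates a missing value while the rest wait for that result. This is a minimal sketch of the idea, not Reddit's actual code; all names here are hypothetical:

```python
import threading

class Origin:
    """Stand-in backend that counts how many requests actually reach it."""
    def __init__(self):
        self.hits = 0
        self._lock = threading.Lock()

    def fetch(self, key):
        with self._lock:
            self.hits += 1
        return f"value-for-{key}"

class SingleFlightCache:
    """On a miss, exactly one caller goes to the origin; everyone else
    waits for that result instead of stampeding the backend."""
    def __init__(self, origin):
        self.origin = origin
        self._cache = {}
        self._lock = threading.Lock()

    def get(self, key):
        with self._lock:
            if key not in self._cache:
                # Only one thread executes this per missing key.
                self._cache[key] = self.origin.fetch(key)
            return self._cache[key]

# Simulate a freshly invalidated cache getting hit by many concurrent readers.
origin = Origin()
cache = SingleFlightCache(origin)
results = []
threads = [threading.Thread(target=lambda: results.append(cache.get("timer")))
           for _ in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(origin.hits)  # 1 -- the backend is hit once, not 50 times
```

The single global lock here serializes all reads for brevity; real implementations typically use per-key locks or promises so that misses on different keys don't block each other.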
Patrick Higgins: So were you running the same format for Black Friday as well?
Brian Holt: Yeah. Yeah. So it was always a reskin on it and we would always throw on some Easter eggs. One of them, I remember, if you click the little lightning bolt face, you would get an extra secret lightning bolt plushie that you could buy that was super limited. I think we only had like a hundred of them. So we would do stuff like that. But other than that, we were trying to keep it on the same code base so that really we could scale up something that we understood.
Ana Medina: I had a question around, when y'all noticed that things were going down, was it that you had some form of monitoring/observability to kind of tell you, "Oh, we're having issues with the timer being served stale," or anything like it, or were you kind of like, "all right, cool. We have 240,000 users on it. Everything's successful. I can still hit the website."
Brian Holt: Well, let me tell you about the most accurate and fast paced monitoring solution that I can think of, and that's having all of Reddit look at it because you will hear everything that's wrong with it. You'll also hear about your mother. You'll also hear about conspiracy theories and cryptocurrencies and everything that you didn't want to hear in the first place anyway.
So I think I noticed the timer just by perusing the site, but most of the issues, we would just look at the thread underneath the blog post and just see people yell, "Hey, this doesn't work. This is broken." I would not recommend it.
Ana Medina: It does remind me of my last employer, Uber. We used to say our best monitoring was just a Twitter search for "Uber down" and just have it showing, "well, when people start blowing us up with tweets, we should really check what's going on in our microservice architecture."
Brian Holt: I mean, hopefully your monitoring tells you that before that, but ours definitely did not.
Patrick Higgins: So what was that process once everything burst into flames, essentially. Did you have any contingencies or backup plans for getting that back up? Or did you have to think really on your feet about how to get it done?
Brian Holt: It was pretty on our feet. We over-provisioned, which was definitely helpful because in that case, we were able to take half out of rotation, load them with the new code, whereas the old ones were just serving the down page. So that was definitely helpful. And then we had a quick redeploy script that we had written that was just wildly, just Wild West. It was just poorly written Python that would query your AWS account, get all the EC2 instances and then deploy to them. It was nothing, it wasn't Terraform or anything like that. This was... I mean, you had a frontend engineer writing infrastructure. This is the kind of thing that you get. But I think beyond that, it was mostly just me and Fish panicking. I don't know, Fish probably had his shit together more than I did. I mean, just in general in life and such, but in this particular case, that's what we had.
Patrick Higgins: So what were your takeaways looking back on it? What were the things that you remember? How did that really affect you and your processes in terms of your work going forward?
Brian Holt: I think one is, I learned a lot about caching that day. I profoundly understand the thundering herd problem now. And I think that's just, when you kind of have to have it happen to you, you realize like, "Oh, if there's nothing in the cache I'm screwed." It's a really easy strategy to say, "Hey, if this doesn't exist in cache, go get it and then put it in the cache." That nice, succinct, beautiful code. But in reality, you're just asking for something terrible to happen to you. I think from there, we did come up with some different strategies, we expect things to go wrong in the future, but the first one is really the only time that we had straight up downtime.
The rest of the time, we kind of had figured out our caching strategy, which was more, we had a background prep process, basically like a Lambda running in the background that was just constantly populating the cache, and so the server is only getting one request every 10 seconds and then that was always reading from a populated cache. And if it just wasn't in the cache, it just didn't hit the server and the client would smartly handle a 500 or something like that. And once we switched to an architecture like that, where it basically said, if it's in the cache, it's in the cache, if it's not, then just assume a 500, that really made us much more resilient to lots and lots of scale.
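The refresh-ahead strategy Brian describes can be sketched like this (a simplified illustration under stated assumptions, not the actual Reddit code): a single background writer repopulates the cache on a fixed cadence, and the read path never falls through to the backend — a miss is simply an error the client handles, so readers can never stampede the origin.

```python
import threading

def serve_from_cache(cache, key):
    """Read path: never falls through to the backend on a miss.
    A cold or flushed cache returns a 500 the client handles,
    so reader traffic can't stampede the origin."""
    if key in cache:
        return 200, cache[key]
    return 500, None

def background_refresher(cache, fetch, keys, interval, stop):
    """The only writer: repopulates every key on a fixed cadence, so the
    backend sees one request per key per interval no matter how many
    clients are reading."""
    while not stop.is_set():
        for key in keys:
            cache[key] = fetch(key)
        stop.wait(interval)

cache = {}
stop = threading.Event()
worker = threading.Thread(
    target=background_refresher,
    args=(cache, lambda k: f"rendered-{k}", ["sale-item"], 0.05, stop),
)
worker.start()
stop.wait(0.2)  # give the refresher a couple of cycles
print(serve_from_cache(cache, "sale-item"))  # (200, 'rendered-sale-item')
print(serve_from_cache(cache, "missing"))    # (500, None) -- miss, never hits origin
stop.set()
worker.join()
```

The design choice is the same trade Brian mentions: you give up on-demand freshness (a miss is an error, not a fetch) in exchange for a hard cap on backend load, which is what made the later flash sales survive Black Friday traffic.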
Patrick Higgins: So I'm curious. When you had any incidents that happened, did you have any kind of formal process for that? Or was it like, let's go down to Whiskey Street and talk about it over a beer?
Brian Holt: Yeah, that's the whole process. So Whiskey Street was the bar that our office was above. It was either that or Junior's which was the other much crappier bar down the street. Depending on who was buying I suppose. It's like, how bad do I feel about this? So that was relatively... A lot of times we would call the engineers over at Reddit and kind of get them to weigh in on it. So that was another huge advantage that we had is while we were a really small team, we had access to some world-class infrastructure engineers, so we could give them a call and say, "okay, this happened, what do you think about this?" And they could basically come and consult with us and help us get our stuff back up and running. What was hard is that they had to run Reddit itself, which always has that kind of traffic and they always had their own fires to put out. So thanks, Jason, thanks, Ricky, Neil, all those guys. Except Ricky.
Patrick Higgins: We love you Ricky. And we'll make sure Ricky listens to this as well.
Brian Holt: He better.
Frontend perspectives on Chaos Engineering
Patrick Higgins: So, Brian, you've spent a lot of your career or at least up until that point, you had spent a lot of time working on a frontend and have been a bit of a frontend specialist. And obviously, at that point you had to step into some infrastructure and reliability related issues. Do you have any good advice for frontend engineers that might find themselves in a similar situation?
Brian Holt: So at that point, my best advice for anyone in any situation of that variety is just ask all of the dumbest questions. It doesn't matter if you look dumb, it's because you are dumb. So you need to rectify the being dumb part. And so in these particular cases, I would just call the Reddit engineers and just say, "this is going wrong. This is broken. Here's all the things I don't know how to do. Please help me." And that was definitely the best way for me to just get started was just to soak up all the experience of these people that had been doing it for a long time. And then eventually kind of through the course of listening a lot and just going through stuff like this, you kind of start to gain your own experience and get into it, but just use your ignorance as a superpower to ask all the questions to all the people that know how to do it. It's really just a huge shortcut to learning anything.
Patrick Higgins: Right. So essentially having the humility to ask the questions that are a bit embarrassing, which is a bit hard to do.
Brian Holt: Yeah. Or just being totally shameless and not feeling any shame whatsoever. That's kind of my strategy.
Ana Medina: I think that's great advice, whether it's for people starting out in tech or they're 10 years in: don't let your ego make your career stumble. Sometimes you just have to be like, "no, I actually don't really know how this works. Can we sit down for a bit, whether it's a brain dump, or let's walk through this code or architecture diagram." We're all humans. We have to constantly be learning.
Brian Holt: For sure. Yeah. And I recognize my, I guess my privilege in this particular situation where I can just kind of shamelessly go into that and generally not feel a lot of consequences for that, but I think if you find the right people and you work with the right kind of people, you can find people that are more than happy to share that kind of knowledge with you and kind of impart good direction. So I think maybe identifying those kinds of people in advance as well, who can I go to and really start to absorb the wisdom. And this is why I was making fun of Ricky earlier. He's still my good friend, Pat and I still text with him a lot. But he was definitely that person for me, that was when stuff really started going wrong, I knew I could always send Ricky a message, like "I will send you a bottle of whiskey if you help me fix this thing that I don't know how to fix."
Ana Medina: One of the questions that we had for you, we were kind of wondering if you had ever done Chaos Engineering, and if you had any recent experiences that you wanted to share about.
Brian Holt: So I actually have a lot of experience with Chaos Engineering because I was, in general, the victim of Chaos Engineering when I worked at Netflix. I mean, obviously Netflix is well known for all the Chaos Monkey and Chaos Kong and all that kind of stuff that they do there. And so I was constantly just being involved with those kinds of things. It's just permeated the entire culture of the company of like, "Hey, we need to make sure that we don't have single points of failure anywhere," and that ranges from, sometimes my development database would go down, and so just dealing with stuff on that level, personal development kind of stuff, to what happens if the entire region goes down with Chaos Kong, and who gets paged and are they responding appropriately?
And what happens if this person's not available? Who's going to respond to this page? So we had chaos throughout the entire structure of the organization to make sure that we didn't have a bus problem with one developer being the know-all for everything, down to what you traditionally think of as Chaos Engineering: "what happens if one of my instances goes down?"
Ana Medina: Apart from getting a chance to do some of that Chaos Engineering in the Netflix space, is there any recent experiment, like your latest Chaos Engineering experiment that you want to share with listeners about?
Brian Holt: Pat worked with me on the one that I did last, which was, I wrote a little browser extension, which I was calling Imp at the time, that would go in and kind of just mess with a frontend engineer's workflow, and just say, what happens if we introduce latency? What happens if we 500 here? What happens if I don't let any fonts load? What happens if I shut down Facebook as a domain? And just kind of trying to introduce some chaos into a developers' development process so that they could face issues that you would expect someone browsing your site to see. Unfortunately, I haven't worked on it in quite some time.
Patrick Higgins: Brian gave a really good talk that we'll link to in the show notes at React Rally about Frontend Engineers and Chaos Engineering and why frontend should be focused on it. We'll link to that so that everyone can have a look at that talk as well. Brought up some really interesting points.
Brian Holt: It's just a really under-thought-of part of the... And I guess another thing is, I'd say frontend engineers are the most poised to understand chaos, because when you're in infrastructure, and I can say this now that I've worked in infrastructure for some years, you have a lot of control over what's happening. You may not necessarily have access to the rack anymore, but you have control of the network configurations, you know what kind of operating system is running, what patches are running. As a frontend developer, you write this app that's then shipped to 9,000 different browser engines that are all going to interpret and run it in different ways. You basically run your code in a hostile environment all the time. And so you constantly have to try it in seven different browsers at different viewport sizes, with different resolutions, on different processors with different memory, right?
This idea of chaos is just innate to the frontend. And so I wanted to try and find a way to harness that so that we could really capture it all up front. It's like, imagine your development workflow if it changed browsers every time that you refresh the page and it changed viewport sizes and it changed network latency and just kind of introduce that chaos directly into their process so that they could see it instead of having to react to it. We could be proactive, which I think is kind of the point of Chaos Engineering in general, right? It's like, let's identify the bugs up front by introducing chaos. So anyway, that was the idea and I didn't follow through because I'm lazy.
Patrick Higgins: You're very busy as well. Brian, thanks so much for coming on the show today. Before we wrap things up, do you have anything you're really excited about that's coming up? Any things that you'd like to plug that are in the future for you?
Brian Holt: I'm pretty excited about some of the courses I've taught for Frontend Masters. I think some of them will really apply to some of the listeners of this. But the one I'm working on right now is the Complete Intro to React v6, but I also recently taught courses on the Complete Intro to Containers, to Linux and the command line and all that's on frontendmasters.com. I know it says frontend in the name, but it's actually a kind of a full stack place now. The other thing that I always want to plug is AnnieCannons and Vets Who Code, who are two of my very, very favorite charities.
Patrick Higgins: Awesome, great plugs. I can attest to Frontend Masters being really good, and I've learned a lot on there from Brian, so thank you very much for that, Brian. Thanks for coming on the show. It's been an absolute pleasure.
Brian Holt: Yeah, no, thanks for having me.
Jason Yee: For links to all the information mentioned, visit our website at gremlin.com/podcast. If you liked this episode, subscribe to Break Things on Purpose on Spotify, Apple Podcasts, or wherever you get your podcasts. Our theme song is called Battle of Pogs by Komiku, and is available on loyaltyfreakmusic.com