Break Things on Purpose is a podcast for all-things Chaos Engineering. Check out our latest episode below.

You can subscribe to Break Things on Purpose wherever you get your podcasts.

If you have feedback about the show, find us on Twitter at @BTOPpod or shoot us a note at podcast@gremlin.com!

In this episode, we speak with Haley Tucker from the Resilience Engineering team at Netflix.

Transcript of Today's Episode

Rich Burroughs: Hi, I'm Rich Burroughs and I'm a Community Manager at Gremlin.

Jacob Plicque: And I'm Jacob Plicque, a solutions architect at Gremlin and welcome to Break Things On Purpose, a podcast about Chaos Engineering.

Rich Burroughs: Welcome to episode eight. In this episode we speak with Haley Tucker from the Resilience Engineering team at Netflix. Haley was a lot of fun to talk to and she filled us in on a lot of cool things they do at Netflix that allow us to binge watch our favorite shows. Jacob, what stood out to you from our chat with Haley?

Jacob Plicque: Yeah, so at Gremlin we talk a lot about Netflix and how they started with Chaos Monkey and how they built FIT, their failure injection tool, but that was a few years ago, so I wanted to know what they'd been up to since then. We were able to deep dive into their Chaos Engineering practice today using ChAP, their Chaos Automation Platform, and Monocle. How about you?

Rich Burroughs: Yeah, I also loved hearing about the way their experimentation platform has evolved. They do very sophisticated testing, combining Chaos Engineering, load testing, and doing it all with canary analysis. It was very interesting to hear about.

Jacob Plicque: Agreed. I think it will be super helpful for folks to hear. As a reminder, you can subscribe to our podcast on Apple Podcasts, Spotify, Stitcher, and other podcast players. Just search for the name, Break Things On Purpose.

Rich Burroughs: Great. Let's go to the interview.

Rich Burroughs: Today we're speaking with Haley Tucker, Haley's a Senior Software Engineer at Netflix on the Resilience Engineering Team. Welcome.

Haley Tucker: Thank you. Happy to be here.

Jacob Plicque: Yes indeed, thank you so much for joining us. So just to kick things off, could you tell us a little bit about your career that led you into Netflix?

Haley Tucker: Sure. So when I started off in my career, I was actually in the defense sector at Raytheon, was there for a few years, came out and moved into a consulting role doing Java integrations with various Oracle products. So I kind of bounced around a bit. And then when I was moving out to California, I ended up interviewing with Netflix for a team that at the time was primarily doing business rules around what we allowed during playback. And so since I had some business rules experience, that's how I ended up getting into that team. But that quickly turned into not just a team that owned a bunch of business rules and policies but also operating a service at scale for Netflix. So I was on that team doing playback stuff for about three years, at which point I moved into the Resilience Engineering team because I decided I actually really liked the operational aspects and learning how distributed systems behave at scale.

Rich Burroughs: That's very cool. Resilience Engineering involves a lot of other things besides Chaos Engineering. What are some of the other areas that you work on?

Haley Tucker: Yeah, so my team, we've just recently been adding the ability to do load testing of services using production data, where we can kind of ramp up the load slowly using a canary strategy and monitor the impact to find where the optimal place is for services to run as far as requests per second. So that's one area. We have the Chaos Engineering platform, which we can talk about as well. And then we also own all of canaries at Netflix. So when you're getting ready to deploy new code into production, can we vet the new build against the existing build and make sure that we're not introducing any new regressions?

Rich Burroughs: And so this is like A/B testing, right? So is that involved in the Chaos Engineering as well?

Haley Tucker: Yes. So actually our Chaos Engineering platform is an A/B testing platform. So when we're introducing Chaos into our experiments, we use a baseline of users that are not getting the treatment. And then we take the canary population and inject latency or failure. But we're actually looking at what that does to the end users. So we're not monitoring necessarily the service itself for the failures being injected, but rather the KPIs that we care about from a customer or user perspective. So in our case, that's primarily streams per second. So we want to know if users are still able to get playback while the test is ongoing.

Rich Burroughs: But you're comparing that to a population that's not getting the experiment?

Haley Tucker: Correct.

Rich Burroughs: Yeah. Wow. That's super cool.

Jacob Plicque: Yeah. I think it's a really key component that we hear and talk about a lot is everyone has all these different dashboards and things like that. But maybe you don't care if that EC2 instance hits 80% CPU if it's not affecting your KPIs. Right? So is that kind of the mindset or does it go deeper than that?

Haley Tucker: So that's the kind of high level mindset as far as whether or not we will let the test continue. So I would say there's tiers of things we care about. So during the test, we actually want to know are we negatively impacting users, and if so we want to shut it off. Because if we found a problem, there is no reason to continue running the test. We should just notify server owners about that and have them fix it. If the test runs the full duration and we don't detect any noticeable impact to end users, we still will check things like system impacts, like where the system may be approaching a level in which it would fall over. So that we will do like a canary analysis on the system at the end of the experiment. And so we can still surface that information to service owners in case they want to add headroom or something like that. So kind of a tiered approach.
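To make that tiered approach concrete, here is a minimal sketch of the control loop, assuming a KPI source that reports streams per second for the baseline and canary populations. The class names, thresholds, and polling interval are invented for illustration; this is not Netflix's actual platform.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical sketch of a tiered chaos-experiment check (not Netflix's real platform). */
public class TieredExperimentMonitor {

    /** Assumed source of the user-facing KPI for each population ("baseline" or "canary"). */
    interface KpiSource {
        double streamsPerSecond(String population);
    }

    private static final double MAX_KPI_DROP = 0.05; // abort if canary SPS drops 5% below baseline

    public static boolean run(KpiSource kpis, Duration duration) throws InterruptedException {
        Instant end = Instant.now().plus(duration);
        while (Instant.now().isBefore(end)) {
            double baseline = kpis.streamsPerSecond("baseline");
            double canary = kpis.streamsPerSecond("canary");
            // Tier 1: users are being hurt, so stop immediately and notify the service owner.
            if (canary < baseline * (1.0 - MAX_KPI_DROP)) {
                System.out.println("User impact detected, aborting experiment");
                return false;
            }
            Thread.sleep(30_000); // poll every 30 seconds
        }
        // Tier 2: no user impact for the full duration, but still compare system-level metrics
        // (CPU, latency, error rates) between baseline and canary to surface headroom issues.
        System.out.println("No user impact; running post-experiment system canary analysis");
        return true;
    }
}
```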

Jacob Plicque: Makes sense. And so then would you use that to justify, to your point earlier about kind of falling over, that data, is that what allows you to expand the blast radius a bit or do you tend to analyze a little longer first before even redoing the experiment?

Haley Tucker: Let me make sure I understand what you're asking. So you want to know from the system perspective, if we detect issues, will we run the experiment for longer or increase the scope of the experiment?

Jacob Plicque: Actually the other side of it. If it's not affecting your KPIs, because you mentioned earlier, you're still taking that data and kind of making a decision from there. Does it kind of stay that way or do you tend to go up the blast radius level from that perspective? If that makes a little more sense.

Haley Tucker: Ah, so we don't typically increase the breadth of the blast radius, we wouldn't add more customers to it. We might run it for longer periods of time. We also tend to run, from a Chaos perspective, we try to run things in different regions or different times of day because we do see cases where the device mix might be different in one region than the other and so we need to run it in different ways to find if there's a problem in one region but not another. In that regard, we're sort of increasing the blast radius but not really. We'll keep it to a very small percent of users for the duration of the experiment. And these are typically like 30 minute experiments.

Rich Burroughs: Wow. Sure. And are these happening as part of a pipeline?

Haley Tucker: So it varies by team. Some of this is users manually going and interacting with the tool because they have something that they're rolling out and they want to verify it. We also see that kind of around this time of year going into the holidays, people take more of an active role in making sure that their dependencies are behaving as expected. We have some teams that do run these as part of pipelines, so every week or every other week depending on what the dependency is to make sure that they're catching regressions in their dependencies. We also have an experimental system that we've been building out where we can automatically generate and run experiments periodically. From our perspective, we can look at a service and kind of look for common patterns that we can run tests against. So that's not happening, that's not user-driven or pipeline driven. It's kind of coming out of a centralized tool. But we've been turning that on and off as we experiment with it.

Rich Burroughs: Wow, that's super cool. So is that the Lineage Driven Fault Injection that you're using for that or are you doing something else?

Haley Tucker: No, we're doing something slightly different. So basically the problem we were trying to solve is we've got a few large clusters that have just tons and tons of dependencies on them and they wanted to do more fault injection testing, but it was just a lot of kind of churn and tedious work to get all the experiments set up, to run them and then to analyze them. And so we wanted to kind of cut down on the barrier to entry there and be able to provide those service owners with actionable things. Instead of having them focused on running the experiments, have them focused on fixing the issues that come out of them. So what we did was we did a static analysis of the clusters that fall into this category. We can say for all of their dependencies, which ones have fallbacks? That's a good place to start because if you have a fallback, in theory it's a noncritical dependency. So can we focus just on testing failures for noncritical dependencies? So we did a static analysis of the data and then try to generate those tests for them and run them.

Rich Burroughs: Awesome. So let's explain to folks who might not know what a fallback is.

Haley Tucker: Yeah, so fallback is basically if you have a network call for instance, and that network call fails and you want to execute something to return in its place. So say one particular example is in the Netflix UI you might be looking at a movie and in that movie page, it has a list of like, this is an HD title and it has 5.1 audio. Those icons we call badges. And in the case that we're not able to hit the service that tells us which badges to show, we want to just not show them. It's a better user experience to still be able to get playback rather than to fail it all together. So that is an example of what we would consider to be a non critical fallback case.
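As a rough illustration of the fallback pattern Haley describes, the badges case could look something like the Hystrix command below, where the fallback returns an empty badge list so playback can continue. The class, service client, and return type are made up for the example.

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import java.util.Collections;
import java.util.List;

/** Illustrative only: fetch badge icons ("HD", "5.1") with a non-critical fallback. */
public class GetBadgesCommand extends HystrixCommand<List<String>> {

    private final String videoId;

    public GetBadgesCommand(String videoId) {
        super(HystrixCommandGroupKey.Factory.asKey("BadgeService"));
        this.videoId = videoId;
    }

    @Override
    protected List<String> run() {
        // Network call to a (hypothetical) badge service; this may time out or fail.
        return callBadgeService(videoId);
    }

    @Override
    protected List<String> getFallback() {
        // Degraded but acceptable experience: show no badges rather than fail playback.
        return Collections.emptyList();
    }

    private List<String> callBadgeService(String videoId) {
        throw new RuntimeException("badge service unavailable"); // stand-in for a real RPC
    }
}
```

With something like this in place, `new GetBadgesCommand("some-title-id").execute()` returns an empty list when the badge service is failing, instead of propagating the error up into playback.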

Rich Burroughs: Cool. Yeah, and sometimes you would fall back to some sort of cached data or something as well, right?

Haley Tucker: Right. Yeah, I would definitely recommend either something just like static or something from cache, something very quick and easy to get to that's not going to also have a failure.

Rich Burroughs: Right, right. And that ends up taking it from the failure causing a total failure of the playback or whatever service is involved to suddenly the user has a pretty good experience. Not the optimal one, but they're able to do what they need to do.

Haley Tucker: But it's slightly degraded. Exactly. Yeah, absolutely.

Jacob Plicque: And honestly, they probably wouldn't really know otherwise unless they were very specifically looking at that particular part, right? So it makes perfect sense. Actually, I want to kind of circle back on a point you made earlier, because from what it sounded like, the churn became a problem and would essentially cause things to bottleneck. And it actually reminded me, I just read an article, or I guess it was more of a paper, that Adrian Colyer covered, and I think you actually may be involved with this one. It's a blog on his website about automating chaos experiments in production. And I think that something that came out of that was something called Monocle. Are you involved with that at all?

Haley Tucker: Yes. Yeah. So actually that's the tool that I was talking about. So Monocle is two pieces. It's the static analysis, which is basically looking at the service and seeing what do we think is safe to fail and what could we experiment on? And then it's actually the runner piece of it, which will then take that data, prioritize experiments, generate experiments and run them on some sort of a periodic basis. So that's exactly what that is.

Jacob Plicque: That's awesome. Yeah. So I have to imagine at least at a high level there's some form of service discovery, of knowing what talks to what. And then how does uncovering the service dependencies automatically come in? Because I think that's unbelievably cool.

Haley Tucker: We basically start with our distributed tracing data. Currently it's South, which is our internal one, but I think we're moving everything to Zipkin. So we look at that to see what dependencies the service is actually talking to. Because if you just looked at all the jars that were pulled into the service, it would be a huge number of things. So we look at the distributed tracing data to see what it is actually talking to. We look at our metrics, like our Atlas data, to see kind of the volume, because when we're looking for things that we want to experiment on, we don't really want to bother with things that are one call a week or one call a minute even, we're looking for things that are higher scale. So we look at our metrics to tell us that.

Haley Tucker: We also then have metrics that will tell us if we've seen fallbacks succeed within the last four days. Because if we see a fallback that is failing, then as part of that static analysis, we will actually just surface that to users to say like, "Hey, you have a fallback but it's failing, so maybe you should go fix that." We won't actually run an experiment on it. Yeah. And so it's a collection of all of this data. We also look at configuration data for a service so we can get access to how it's configured as far as retries and timeouts and if the fallbacks are turned on or not. There's a bunch of analysis there that we can surface to users to say like, "Hey, your timeout for your Hystrix command is shorter than your timeout for the wrapped RPC call that's underneath the covers." And so that can actually end up in some very unexpected timeout behavior. So that's another one of the things that we would surface to users before actually running a test on it.
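A compressed sketch of the kinds of checks described above: take the dependencies observed in tracing data, drop the low-volume ones, skip anything whose fallback has not succeeded recently (and surface that instead), and flag timeout misconfigurations such as a command timeout shorter than the wrapped RPC timeout. The data model and thresholds are invented for illustration and are not Monocle's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of Monocle-style dependency analysis; not the real tool. */
public class DependencyAnalysis {

    /** Assumed per-dependency data gathered from tracing, metrics, and config. */
    record Dependency(String name, double callsPerSecond, boolean hasFallback,
                      boolean fallbackSucceededRecently, long commandTimeoutMs, long rpcTimeoutMs) {}

    static final double MIN_CALLS_PER_SECOND = 10.0; // ignore very low-volume dependencies

    static List<String> selectExperimentCandidates(List<Dependency> deps, List<String> warnings) {
        List<String> candidates = new ArrayList<>();
        for (Dependency d : deps) {
            if (d.commandTimeoutMs() < d.rpcTimeoutMs()) {
                warnings.add(d.name() + ": command timeout is shorter than the wrapped RPC timeout");
            }
            if (!d.hasFallback() || d.callsPerSecond() < MIN_CALLS_PER_SECOND) {
                continue; // only failure-test high-volume dependencies that have fallbacks
            }
            if (!d.fallbackSucceededRecently()) {
                warnings.add(d.name() + ": fallback exists but has not succeeded recently; fix before testing");
                continue;
            }
            candidates.add(d.name()); // safe to auto-generate a failure experiment for this one
        }
        return candidates;
    }
}
```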

Rich Burroughs: This is really fascinating stuff. Like this is very next level. Most of the people that we talk to are just trying to get started with Chaos Engineering. At least on a day to day basis, they just want to be able to get to the point where they can run an experiment a month or something. And I think that it's really good to get this information out there, because I guess the analogy that I'll use is I play video games some and I suck at video games. I like playing the new Mortal Kombat and I'm horrible at it, but I watch these high level tournament players play and it gives me a new idea of what's possible, right? And so I think this is really going to be great for people to hear in terms of where they might want to move down the road as they're getting their programs going.

Haley Tucker: Yeah, that's awesome. The one thing that I will throw out, I mentioned we've been turning this thing on and off. The reason we've been turning it off is because we got this going and we actually had a lot of success with finding problems. But then the next problem that comes along is scaling the analysis. So that's kind of where we are now is we can find these problems but can we localize them? Because when we're monitoring the impact on users, we may be failing something that's three layers down in the stack, but the actual impact shows up two layers up. So how do we get the right insights and the right observability to say this is the problem and this is the general area where you need to go look to fix it? So yeah, that's the next level of like how do you make sure that as you're finding problems, you can quickly address them and get fixes turned around.

Rich Burroughs: So you're talking about, like if a service fails because its dependency has a problem, you might see the error in the service, not the dependency, is that what you're saying?

Haley Tucker: Yeah. Or even in the device. So we return a fallback from a service that's two or three layers down the stack. That bubbles up. Maybe everybody's correctly handling the failure and then we hand it back to the device even. But that device gets that and it's a bad fallback and it can't handle it. So the error actually comes in through a device error message. And so because you're so many layers removed, and when you're tracking user impact, the actual system KPIs that you care about, actually tracing that from the experiment directly to the correct error message is a little tricky. So that's what we're working on to kind of make this really viable.

Rich Burroughs: Then you all have to support it. There's a ton of different devices, right? Netflix runs on everything.

Haley Tucker: Yeah. Thousands. Thousands of devices.

Rich Burroughs: I think I could probably play it on my microwave or something if I tried hard enough.

Haley Tucker: There's probably been a hack day for that.

Rich Burroughs: That would not surprise me. So talk to me a little bit about, say that I'm a service owner at Netflix and I know that you all have this Freedom and Responsibility ethic where people are encouraged to figure out their own way to do things, but they're also on the hook for the results. Am I summarizing that correctly?

Haley Tucker: Yep.

Rich Burroughs: Yeah. Yeah. So say that I'm a service owner, talk me through how I interact with your team. If I'm rolling out a whole new service, what does that look like?

Haley Tucker: So basically when teams are getting ready to roll out a new service, most of our tooling is primarily focused around production traffic. We have some mechanisms where you can say test a failure for an individual device or customer ID. So usually when it's a brand new service, we recommend people start with that. So I could actually take my iPhone and configure our failure injection framework to fail a call for my iPhone or inject latency for my device. So then they could actually do kind of an end to end failure test on a very, very small scale. Make sure that functionally everything is working. So that's usually where we tell people to start.
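The "fail a call just for my device" idea could be sketched as a simple request filter: if an injection rule matches the calling device and the targeted call, fail it or add latency; every other request passes through untouched. The names below are illustrative, not the actual API of Netflix's failure injection framework.

```java
import java.util.Optional;
import java.util.function.Supplier;

/** Illustrative sketch of a failure-injection filter scoped to one device or customer. */
public class ScopedFaultFilter {

    record InjectionRule(String targetDeviceId, String targetCall, long addedLatencyMs, boolean fail) {}

    private final Optional<InjectionRule> rule;

    public ScopedFaultFilter(Optional<InjectionRule> rule) {
        this.rule = rule;
    }

    /** Wraps a downstream call; only requests from the targeted device are affected. */
    public <T> T invoke(String deviceId, String callName, Supplier<T> downstream)
            throws InterruptedException {
        if (rule.isPresent()
                && rule.get().targetDeviceId().equals(deviceId)
                && rule.get().targetCall().equals(callName)) {
            Thread.sleep(rule.get().addedLatencyMs());       // injected latency, if any
            if (rule.get().fail()) {
                throw new RuntimeException("injected failure for " + callName);
            }
        }
        return downstream.get();                             // all other traffic is untouched
    }
}
```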

Haley Tucker: And then most of our tooling really kicks in once they actually deploy it into production because we need the load and the volume of traffic to exercise a lot of our tools. So at that point, depending on how familiar they are with our tools or not, some teams are fully self service and they just run with it. They'll start doing canaries, they'll start doing squeeze tests, they'll start doing chaos tests on their own. But then other teams, we absolutely do kind of a consulting model where we'll meet with them, show them the tooling, answer any questions they have and make sure they're comfortable with it. Because we want people to understand the tools. We also want people to be able to take action on the results. So we don't want them feeling like they don't know or understand what's going on.

Rich Burroughs: Right. And if they're enabled to do it themselves, then they're not sitting around waiting for you all the time.

Haley Tucker: Exactly.

Rich Burroughs: Yeah. So as a user, I can launch these experiments myself?

Haley Tucker: Yep, absolutely. You can launch the experiments. You can also stop them, or stop all experiments. So we have a big giant stop-all button that anybody in the company can hit at any time.

Rich Burroughs: We have that too. It's important.

Haley Tucker: It's really important for if there's a production outage or something happening, we don't want people debating whether or not it's related. Just shut it all down. It's fine.

Rich Burroughs: Yeah. So that's an interesting point because I've heard of companies doing things like even blocking deployments while there's an outage. Do you all have any tooling around anything like that?

Haley Tucker: We have some support for calendar-based quiet periods where we'll not run things. We don't necessarily prevent people if they want to manually go in and kick one off; there are ways for them to force a run through. And then for failovers, we do have some mechanisms that will detect if a failover starts, and we'll actually shut down experiments, because that's an indication that there is a problem in the environment and we don't want to add to that. So we'll detect that the failover is starting, and anything that's queued will just stay queued. We won't kick them off, and anything that's in progress we'll shut down.
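A sketch of those safety checks might look like the guard below: consult a calendar of quiet periods and a failover signal before starting an experiment, allow a manual run to force past a quiet period, and stop anything in flight the moment a failover begins. The calendar and failover sources are placeholders, not Netflix's actual tooling.

```java
import java.time.LocalDate;
import java.util.Set;

/** Hypothetical guard deciding whether chaos experiments may run right now. */
public class ExperimentGuard {

    /** Assumed signal from regional traffic tooling. */
    interface FailoverSignal {
        boolean failoverInProgress();
    }

    private final Set<LocalDate> quietDays;      // e.g. holiday freeze dates
    private final FailoverSignal failover;

    public ExperimentGuard(Set<LocalDate> quietDays, FailoverSignal failover) {
        this.quietDays = quietDays;
        this.failover = failover;
    }

    /** Queued experiments only start when this returns true; a manual run can force past quiet periods. */
    public boolean mayStart(boolean forcedByUser) {
        if (failover.failoverInProgress()) {
            return false;                        // never add chaos during a failover
        }
        return forcedByUser || !quietDays.contains(LocalDate.now());
    }

    /** Running experiments are shut down as soon as a failover begins. */
    public boolean mustStop() {
        return failover.failoverInProgress();
    }
}
```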

Rich Burroughs: Wow. That's super cool. I think that what I'm going to take away from this episode is that I'd like to come and hang out with you all for like a month.

Jacob Plicque: Right?

Rich Burroughs: And just see what you do.

Jacob Plicque: Is that in a consulting role, or just like, I'm a fan?

Rich Burroughs: I don't expect they would pay me.

Haley Tucker: We'd love to have you anytime.

Rich Burroughs: No, it's funny, because you all have been hiring so many great people too. I'm friends with J Paul Reed, and Jessica DeVita just got hired there, and it's such a great team you all have.

Haley Tucker: Yeah, I love it. It's a great group.

Rich Burroughs: So you have a lot of folks there who are coming more from that sort of Human Factors, Safety side. I guess the literal kind of John Allspaw definition of what Resilience Engineering is. Does that bleed through to your team too?

Haley Tucker: A little bit. To the extent that our primary goal is to enable the service owners to understand their services, be able to deal with them. We're not looking to automate away all of the knowledge and the ability from the people. So from that regard, I think the human factors definitely plays a part because that is what we're trying to do.

Rich Burroughs: Right, no, that totally makes sense.

Haley Tucker: So yeah, I think my team is not as active in the Twittersphere and stuff around Human Factors space, but definitely, yeah, plays a part.

Rich Burroughs: Yeah. I mean, a big part of that is, the way that Allspaw talks about it is adaptive capacity, right? The idea that your team is able to adjust to problems and surprises.

Haley Tucker: Exactly. And I think for me as a service owner, that was the biggest benefit of Chaos Engineering, was just being able to see what happens when something fails and understand it and be able to know that the mechanisms that I put in place are going to work or that they're not going to work and I need to fix it.

Rich Burroughs: Yeah.

Jacob Plicque: I think that's what's really interesting is there definitely is a piece or... A question that I've received a lot is around what's next and when does AI and machine learning come in and stuff like that. And I always kind of chuckle and I'm like, honestly, as cool as that stuff sounds, the human impact is still so important and so interesting. I mean, we talk about things like pager pain and getting woken up at 4:00 AM on Christmas, and these are things that we are still dealing with. Even though Chaos Engineering has been around for a while and is of course becoming a lot more popularized, these are still things that we're fighting today. And I don't think that-

Haley Tucker: Yeah.

Jacob Plicque: So yeah, I'm just curious to know what do you think gets us to that next level or is that possible?

Haley Tucker: So I think we can do a lot around testing for the known knowns or even to an extent the known unknowns. But when you look at outages, at least for us, most outages are some crazy combination of things going wrong. It's not the things that we can easily verify. And so I think that's where I don't know that we ever get away fully from oncall and pagers and things like that. But what we can do is just make sure that people are more comfortable when things do go wrong, that they're able to adjust, that we have... Big levers like failover I think are a great tool to just say something's going wrong, let's stop the bleeding and then we can fix the problems without impacting our users. So I don't know that we ever get away from that. I think we can lessen the space that teams have to worry about through Chaos Engineering by testing the things that we can predict and make sure that we don't have anything predictable happen, but there's always going to be that category of things that's a combination of factors that you can't predict.

Rich Burroughs: Sure. And the big outages really are often the kind of cascading failures where a combination of things happen before it gets to the point where there's an actual outage.

Haley Tucker: Right.

Jacob Plicque: Yeah. I think it's really key, because in a lot of our talks, we show this big microservice deathball of Amazon and Netflix. It always gets a nice chuckle, but I always say, "Hey, the secret sauce is that this diagram is like four years old. So imagine what it looks like now." And then on the flip side, because we kind of joked about Twitter earlier, our customer demand is so massive now. I demand Netflix to work at 3:34 in the morning because I really need to watch Stranger Things season three, and I don't care what's happening to you behind the scenes; that's my expectation for the money I'm paying a month.

Haley Tucker: Yeah, absolutely.

Jacob Plicque: And so as long as we're dealing with that, which I don't think is going away, I'd argue maybe ever, considering it's done nothing but go up as social networking has become the mainstay, I think it's going to be key. So I couldn't agree more. I don't think it's going away, maybe ever. So we have to kind of answer the call, so to speak.

Haley Tucker: Yep. Absolutely.

Rich Burroughs: Yeah. It's interesting too, because I think as reliability gets better, it actually sort of conditions people to that new state, right? Where we're like, I almost never have a problem with Netflix. It almost always works for me. I can't remember the last time when I couldn't play something. Once in a while I get kicked out of something I'm watching and I go right back in. But you get accustomed to that, and then that's your new level of what you judge things to be as a user, I think.

Haley Tucker: Yep. That's what I like to hear.

Rich Burroughs: So I listened to, and we're going to put a link in the show notes, you were on the podcast that Netflix is doing called We Are Netflix, and you talked about Chaos Engineering on there with Aaron from the SRE team, and I really enjoyed the episode. Like I said, we'll link to that. But I was interested, on that podcast you mentioned that Chaos Monkey, which is the original thing, right, that shuts down an instance randomly in AWS, that that was the least interesting experiment to you. And I'm interested in why that is.

Haley Tucker: Yeah, I mean mostly, like for us it's always on now. So it's one of those things that people don't generally think about. And for the most part we don't find many problems. I think we've gotten good at making sure we have redundancy, and so it served its purpose at the time. But now it's just not something that anybody even usually thinks about, except for developers. Every developer spins up their new cluster and then that gets killed, and then they're like, "Dang it, Chaos Monkey!" So that's the extent now. And I think it just happens silently in the background, and so it's a very good thing to be there and to remind people periodically, because we will catch things periodically, but it's just not as interesting as being able to do these targeted dependency tests. I think you learn a lot more from those and you can do more user-focused testing as well.

Rich Burroughs: Yeah. Do you all have other things that run like that constantly?

Haley Tucker: No, probably not. The Squeeze tests we are starting to run daily now on a large number of our larger clusters, but that's been nice so that we can kind of start to detect regressions, performance regressions. Since we do run with a decent amount of head room, we're trying to find ways that we can detect performance regressions that don't show up at the normal traffic volumes.

Rich Burroughs: That's load testing, the Squeeze tests?

Haley Tucker: Yeah. Sorry, load testing.

Jacob Plicque: Well I think it's really interesting, and something I wanted to kind of expose a little bit is, even though we were touching earlier on how Chaos Monkey is not really all that interesting and you guys haven't found a lot of value adds to it, what's interesting about that is that you're still running it, which to me means two things. One is that from a maturity model perspective, you've gotten to the point that you've uncovered most if not all of the known issues that you'd expect, startup issues and stuff like that. But at the same time, you're understanding that you don't want to regress either, which is why it's still being run today. Does that sound right?

Haley Tucker: Yeah. And that's absolutely correct. We certainly don't want to lose any progress that we had related to Chaos Monkey.

Jacob Plicque: Exactly. Exactly. I think that's really interesting, because you go to the Chaos Monkey GitHub and it's like, blah blah blah, open source, deprecated. But I think it's just really important to note that even though it's deprecated from an open source perspective, Netflix is still using it on arguably a daily basis, I would imagine.

Haley Tucker: Oh yeah, yeah, absolutely. It runs in our prod account and kills a whole bunch of services daily. So.

Jacob Plicque: Yeah.

Rich Burroughs: So what are the things that you find more interesting? Is that the kind of newer analysis that you were talking about, that you're doing nowadays?

Haley Tucker: Basically being able to do targeted experiments, we learn a lot more from. And there's I think more actionable things that come out of it. So with Chaos Monkey, if you kill an instance and it takes your service down, then that's a sign that you need more instances, right? You're not properly scaled to have redundancy. So if you run a latency experiment and you say you inject latency into this dependency, you may see thread pools fill up, you may see circuits open, you may see those circuits opening cause a cascading failure in the form of services returning 500 responses.

Haley Tucker: There's a bunch of things that you can learn just by injecting latency and even into a single dependency. And there's very concrete things that you can change coming out of that. Either you need to make sure that your SLA stays below a certain level or you need to increase your thread pool sizes or you need to properly handle the fact that you're getting an error in these conditions to prevent those cascading type effects. And so I think it's just a lot more concrete actions that come out of it and makes your service more resilient going forward.
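The chain Haley describes, injected latency leading to saturated thread pools and then fast failures, can be seen even in a toy setup: give the dependency a small, bounded thread pool and a short client timeout, add latency to every call, and watch requests start timing out or being rejected. The pool size, timeout, and latency values below are arbitrary.

```java
import java.util.concurrent.*;

/** Toy demonstration: injected latency exhausting a bounded dependency thread pool. */
public class LatencyInjectionDemo {

    public static void main(String[] args) throws Exception {
        ExecutorService dependencyPool = new ThreadPoolExecutor(
                4, 4, 0, TimeUnit.SECONDS, new ArrayBlockingQueue<>(2)); // small pool + small queue
        long injectedLatencyMs = 2_000;   // chaos experiment: add 2s to every dependency call
        long clientTimeoutMs = 500;       // caller only waits 500ms

        int timeouts = 0, rejected = 0;
        for (int i = 0; i < 20; i++) {
            try {
                Future<String> f = dependencyPool.submit(() -> {
                    Thread.sleep(injectedLatencyMs);   // the injected latency
                    return "ok";
                });
                f.get(clientTimeoutMs, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                timeouts++;                            // slow calls hold pool threads open
            } catch (RejectedExecutionException e) {
                rejected++;                            // pool and queue full: requests are shed
            }
        }
        System.out.printf("timeouts=%d rejected=%d%n", timeouts, rejected);
        dependencyPool.shutdownNow();
    }
}
```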

Jacob Plicque: Yeah, makes sense. Because you mentioned earlier on kind of having some, I won't say notation, but having at least a note somewhere of, "Hey, when was the last time this fallback succeeded?" So maybe you don't necessarily, and this could be wrong, so feel free to tell me if I'm not right. But maybe you don't necessarily mind that you have regions that you're failing over, but it's more about what happens when it breaks versus the fact that it even happened successfully. Because I have to imagine at the scale of a company like Netflix, the fact that that even happens becomes less important over time, tying back to the Chaos Monkey point we were making earlier.

Haley Tucker: You're talking about the failovers themselves?

Jacob Plicque: Well, yes. The failovers themselves and then when a fallback works.

Haley Tucker: Oh yeah. So-

Jacob Plicque: Kind of two points.

Haley Tucker: Yeah. So with failovers, you're absolutely right. We've gotten to the point with failovers that we actually run planned exercises for failovers. I think we're at every other week now. And one of them, we just do a quick failover, fail back and one of them we do a failover and hold overnight to kind of get the scale aspect of it. And so for those, we're absolutely to the point where, not that there's not an entire team dedicated to this, that it's very important to them, but for service owners themselves, it's kind of a no... It used to be a long time ago where everybody knew that a failover was happening and you kind of watched your dashboards and might even be looped in to help scale your service. But now it just happens all the time. But we need to make sure that that lever is always available and that it's safe to do it. And so yeah, that's kind of more in the bucket of widely accepted and people are used to it. From a fallback perspective, you're also absolutely right. At our scale, we're almost always serving some fallback somewhere.

Rich Burroughs: Oh wow. That's really interesting.

Haley Tucker: So it's not a matter of verifying necessarily that an individual fallback works, but also when we actually have a service outage and we're returning nothing but fallbacks, what is that doing? Because you may have a fallback that works, that impacted a user but then they retried and it worked because it hit a different instance. But when that actually is a full-on outage and it fails at scale, you can see very different behaviors, where you see the kind of cascading effects.

Jacob Plicque: Well, I think what's cool is that based off of that information, it sounds like, I want to say an inevitability, knock on wood, but it sounds like eventually the idea naturally would be that it would be in the status that Chaos Monkey is in today, where it's just, we know this is going to happen, we've handled it, not a big deal. We still run it, we still automate it. It doesn't sound like it's automated yet; it sounds like it's in the process of it. I think that's actually really, really fascinating, because that kind of tells the tale of where you guys are taking your practice, which is really, really fantastic.

Haley Tucker: Yeah, absolutely. That's the desire behind Monocle is exactly that. If we can just always have it running and just let teams know when something breaks that they need to address, I think that would be a great position to be in.

Rich Burroughs: What kind of struggles do you see service owners having trying to make their services resilient?

Haley Tucker: One of the big areas is just the number of things that can go wrong is very large. And so trying to kind of chip away at that, especially when you have limited bandwidth to dedicate to operational type activities I think is the biggest struggle. So where do you choose to spend the bulk of your time investing when you've also got all these other product features and things that you need to deliver? So I think the time commitment as well as just like a large search space of things that you could potentially test and you're never going to be able to test all of them are both struggles.

Haley Tucker: I think the other thing is just you've got teams that are focused on performance or teams that are focused on resilience or teams that are focused on various aspects, but we kind of expect service owners to wear a little bit of all of those hats and be able to do it all, and it's a lot to be good at. And so that's where I kind of view tooling as being really important, to help them kind of bridge the gap in their knowledge and where the subject matter experts can step in and help.

Rich Burroughs: That's awesome. Where do you see Chaos Engineering heading in the future? At least maybe what you're doing there at Netflix?

Haley Tucker: I would love to see Chaos Engineering at Netflix be at a point where we are always running these targeted types of experiments and we're able to cover kind of the known degraded modes so that we don't have any cases where users are impacted by things that we could easily prevent. So I'd like to see us get there. In order for us to get there, we have to solve the problem with observability and fault localization, which I think is a big part of it. Beyond that we don't have a lot of, I think that's going to be a lot of work for us, so we don't have a lot of plans there right now. We have talked about combining chaos into load tests and canaries, being able to run larger experiments that cover more dimensions. So instead of having to do individual tests, maybe running one long canary that has different phases to it so that we can cover that all as part of a continuous deployment cycle.

Rich Burroughs: That would be super cool.

Haley Tucker: So yeah, it would be awesome. There's a lot of stuff to figure out there, but I would really love to see us get to that point so that it's just part of everybody's daily process.

Rich Burroughs: That's great. I think that's all the time that we have. I really want to thank you for coming to talk to us, Haley. I really, really appreciated hearing more about what all you are up to. Because like I said, I feel like this will give a lot of other folks an idea of things that they can do down the road in their practices. Do you have anything you want to mention, like where people can find you on the internet, on Twitter or anything like that?

Haley Tucker: So yeah, if you're interested you can find me on LinkedIn or on Twitter at hwilson1204. I have some other talks linked to my LinkedIn profile as well, and I would love to hear from you. Direct messages are open.

Rich Burroughs: Okay. Awesome. Well thank you so much for joining us Haley.

Jacob Plicque: Yep, thanks.

Haley Tucker: Great. Thank you.

Rich Burroughs: Our music is from Komiku. The song is titled Battle of Pogs. For more of Komiku's music, visit loyaltyfreakmusic.com or click the link in the show notes.

Rich Burroughs: For more information about our Chaos Engineering community, visit gremlin.com/community. Thanks for listening, and join us next month for another episode.

