November 2, 2018

Adrian Cockcroft: "Chaos Engineering - What it is, and where it's going" - Chaos Conf 2018

This is the keynote from Chaos Conf 2018, given by Adrian Cockcroft, VP of Cloud Architecture Strategy at AWS.

I'm very happy to be invited to be the opening keynote here at Chaos Conf. It's a great event. It's great to see all of the speakers and have everyone here. Since I'm starting off, I decided that I really needed to try and set context, and they gave me a whole hour, so I took some material that you might've seen before if you've been watching me talk about chaos. But I've added quite a lot to that. And what I decided we needed to start off with is a little bit about what is chaos engineering, and where did it come from? And then I'll end up with a bit about where it might be going. Okay?

I'm gonna start off with some questions. This is when I'm talking to customers or trying to figure out whether they've got their heads around this. This is always a good question to start with. What should your system do when something fails? And a lot of the time, people say, "I don't want it to fail. Could you just make it not fail?" And then you have to sort of tie them down and argue them into a corner and say, "Well, what should it do when it fails? Should it collapse in a heap?" Because that's actually the default behavior for most systems.

But what you'd really like to do is choose between two things other than collapsing in a heap. One is stop. If you're moving trillions of dollars around and you're not sure what's going on, you should probably stop. But if you're trying to decide whether somebody's a valid customer and you want to show them a movie or not, you can just carry on and say, "I'm going to assume that this looks vaguely okay and I'm just gonna carry on," because the cost of doing that is pretty marginal.

So to be more specific, a small specific case, quite often in the code, you'll be looking up a customer, you'll be looking up permissions, and you can't figure it out because say the subscriber service is down. You have to figure out what's the real cost of continuing. And how do you want to apologize to a customer if you can't figure out what to do? So this is something that developers need to really start thinking about and understanding, and there's a nice paper by Pat Helland where he describes a database as a thing that tries to remember something for you, tries to guess what you told it to remember, and then figures out how to apologize if it couldn't figure out what it was supposed to remember. And it's a nice little summary on that.
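The stop-or-carry-on choice can be made explicit in code. What follows is only a minimal sketch: the names `check_access`, `SubscriberServiceDown`, and the `fail_open` flag are made up for illustration and are not from any real system.

```python
class SubscriberServiceDown(Exception):
    """Hypothetical error raised when the subscriber lookup cannot be reached."""


def check_access(customer_id, lookup, fail_open):
    """Return (allowed, note) for a customer.

    lookup    -- callable returning True/False; may raise SubscriberServiceDown
    fail_open -- True: carry on when the lookup fails (cheap mistakes, e.g. a movie)
                 False: stop when the lookup fails (expensive mistakes, e.g. payments)
    """
    try:
        return lookup(customer_id), "verified"
    except SubscriberServiceDown:
        if fail_open:
            return True, "assumed valid; subscriber service unavailable"
        return False, "stopped; please try again later"


def broken_lookup(customer_id):
    """Stand-in for a dependency that is currently down."""
    raise SubscriberServiceDown("subscriber service is down")


print(check_access("c-123", broken_lookup, fail_open=True))
print(check_access("c-123", broken_lookup, fail_open=False))
```

The point of the sketch is that the failure branch is a deliberate business decision per call site, including what apology the customer sees, rather than whatever the default exception handler happens to do.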

The other thing that I found as a really useful line of questioning was to start with this one. So how many people here have a backup data center? A good number of you, okay. How many of you actually failover apps to it regularly, say at least once a year? Okay, I think that's a smaller number of people. How many people failover the whole data center at once? Okay, I think I can see two or three hands. I know there's a few of you out there. I call that availability theater. It's like taking your shoes off at the airport. It makes you feel better, but you're not really getting the benefit. You spent all this money on a backup data center and it's not really giving you the benefit that you were trying to get.

So what we're actually doing is we're living out this fairy tale. Once upon a time, in theory, if everything works perfectly, we have a plan to survive the disasters we thought of in advance. And it just doesn't seem to work out. That's not a good fairy tale to go at bedtime because it'll keep you up at night.

Here's a few examples from recent history. A SaaS company forgot to renew its domain name. I won't mention the name in public. But you may remember this. So your email's down. All your systems are down. Your services are down. What have you got left? What we were left with was the CEO apologizing on Twitter. That was the only channel they had to their customers for about a day, day and a half until they got it back up again. That's an interesting problem. It turns out probably the easiest way of breaking a system, a web service anyway, is to get at its DNS. It's the weak underbelly of just about everything.

So think about how would you defend against that? It's an interesting problem to solve for. I'll come back to this kind of thing a bit later. This has probably happened to just about everyone in the room that's run anything online. Certs have a timeout and you didn't pay any attention. And a bit later on, they time out, and your thing stops working, whatever it was. And this happened many times at Netflix. We ended up building a, I think it was Security Monkey or Howler Monkey or something, to just go and complain every time it found one that was getting near its end of time.
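A check of that kind can be very small. This is a sketch under assumptions, not how the Netflix tool worked: it takes the `notAfter` string in the format Python's `ssl` module reports for a peer certificate, and the 30-day warning threshold is an arbitrary choice.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after, now):
    """Days left before a certificate's notAfter timestamp.

    not_after -- string in the format of the 'notAfter' field returned by
                 ssl.SSLSocket.getpeercert(), e.g. "Jan  5 09:34:43 2019 GMT"
    now       -- timezone-aware datetime to measure from
    """
    expires = ssl.cert_time_to_seconds(not_after)
    return (expires - now.timestamp()) / 86400


def needs_renewal(not_after, now, warn_days=30):
    """True when the certificate expires within warn_days (an assumed threshold)."""
    return days_until_expiry(not_after, now) <= warn_days


now = datetime(2018, 12, 20, tzinfo=timezone.utc)
print(needs_renewal("Jan  5 09:34:43 2019 GMT", now))  # roughly 16 days left
print(needs_renewal("Jun  5 09:34:43 2019 GMT", now))  # several months left
```

Run on a schedule against every certificate you depend on, something like this complains early instead of letting the timeout find you.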

I have a good friend who lives in the New York area, and it turns out data centers don't work when they're underwater, and they don't work after the water's gone away when they're full of sand and salt and stuff. So, yeah, you have to build a new data center. And they got to practice their disaster recovery for real at the data center level. But it wasn't a practice. And unfortunately, this could be you tomorrow, right? So think about that.

But there's this other problem here like, well, what might go wrong? And this is a really good quote from Chris Pinkham, who was the original engineering manager for the first version of EC2 about a decade ago, more than a decade ago. "You can't legislate against failure." What he means is you can't write down every single thing that could go wrong. But what you can do is get really good at fast detection and response. This is the underlying principle that AWS has had for the last decade. We put a lot of effort into figuring out how to detect that something's gone wrong really quickly and how to respond to it quickly. And that will work regardless of what's gone wrong. It's a general answer to the problem.

But there's another issue. This is, I'm gonna maybe teach some of you a new phrase. And you can go dazzle your friends with this. Synoptic illegibility. It's a bit of a mouthful. But basically, it means if you can't write down what really happens, in other words, you can't write a synopsis, you can't automate a system. So in the real world, real-world systems have humans and automation. If you try to go for 100% automation, it will still fail probably because your automation couldn't actually capture every possible thing that could go wrong and every possible way to fix it.

So it's important to figure out how to design the human into the system because humans have a lot more judgment, a lot more ability to handle failure modes which you didn't anticipate. So there's this idea that you can't write a runbook, you can't automate because there needs to be a human involved to actually make some decisions. And you have to be very careful about the level to which you can take automation. This comes from a book by Sidney Dekker called "The Safety Anarchist," one of his more recent books, which is quite an entertaining story.

All right, so really these are the questions we're trying to figure out. What's supposed to happen when part of the system fails? And then another interesting question: How is it supposed to recover after the failure went away? Because quite often the system will get into a broken state, the failure goes away, and it stays in that broken state. Or if you failover to a backup data center, which I was at eBay once when this happened ... It took a day or two to scramble to get everything up in the backup data center. It took a month or two to transfer it back again because everything was in a mess, and you're trying to transfer back without causing another outage. All right, so it's actually not symmetric. Failover, failback often is not symmetric.

And Kolton mentioned this earlier. The Network is Reliable. Spoiler, it isn't. But this paper is named after the first fallacy of distributed computing. And Peter Bailis is a database researcher, really worth following. And Kyle Kingsbury, many of you know, torches databases for a living nowadays.

So a few more interesting books. "Drift Into Failure." This is one that we really dug into when I was at Netflix in particular. The point that this book makes that's really fundamental is if you do the right thing at every single step, you can still have an outage, that there can be no point at which anybody did anything wrong, everybody's optimizing for all the information, all the context they have, they make the right decision. But the global situation drifts closer and closer to a failure until the entire thing collapses. And the examples in this book are mostly airplanes crashing and people dying in hospitals.

I used to tell people not to read this book on a plane. In fact, it's pretty uncomfortable to be reading it on a plane. And then I realized that no one was reading it at all. So I did a ... You can go find the video for GOTO Chicago earlier this year. I did the entire chapter of a plane crashing out of this book as the basis of a conference talk in Chicago. And I wrote it on a plane on the way there. And this was just after the Southwest plane had had the window blow out. And the whole story is about failure modes and how things go wrong. I forget what I called this story now. It was "Dynamic Non-events." In a dynamic system, you want to have non-events. You want non-bad things to happen even though it's subject to dynamic stuff happening, stuff going on.

All right, another picture of a plane. Michael Nygard, the first version of this book was really kind of the bible we had back in the day again at Netflix where bulkheads and circuit breakers were the ideas we got out of this book. Version two, the second edition has some new ideas, just a fantastic sort of tour de force of all the ways to think about what might go wrong.

All right, another question for you. Who knows what Greenspun's 10th rule is? By the way, as I noted, one to nine never existed. He started at the 10th rule. Anyone know this rule? It's from 1993. Yeah, I guess maybe you're too young. But it comes around. You've probably seen something like this go by. Recognize this one? "Any sufficiently complicated C or Fortran program contains an ad hoc, informally specified, bug-ridden, slow implementation of half of Common Lisp."

And you see variations of this pop up. It's kind of an internet meme when something bad happens, people do a modernized version of this, right? A thing now contains a half-built implementation of thing. The more modern version of this might be some object system or something. But the real point here is developers love to reinvent things from first principles. That's the point he was making, like I am not going to actually learn how I could do this or how anyone else did it. I'm just gonna build it again from scratch because I like building things.

But what I think is, what I'm gonna do in the rest of this talk is try and give you some historical context and a bit of domain knowledge that will maybe illuminate the path and give you some references that you can use as you're going off and building your chaos systems.

So let's start with this. This is from the "Principles of Chaos Engineering." We got a nice definition here. "Discipline of experimenting on a distributed system in order to build confidence in the system's capacity to withstand turbulent conditions in production." That's out of the Chaos Engineering book. So I think this needs simplifying a little bit to tighten it up. So I think experimenting, yeah, we're definitely experimenting. We're trying to build confidence, yep.

Then there's capacity to withstand turbulent ... That's a bit of a mouthful. So just trying to boil it down a bit. And I think what we're really trying to do, build confidence is a bit fluffy. So let's change this a little bit to what I think is a bit more of a concrete idea about what we're trying to do. I think we're trying to experiment to ensure that the impact of failures is mitigated. That's some sort of a more boiled-down definition.

This is for a large ... I have a fairly broad definition of what I mean by failure and mitigate. But this is kind of something more concrete. These terms all have meanings I can go drill in and start defining them. Where I was talking about turbulent conditions is actually a bit fuzzy. I could spend a whole hour talking to you about the definition of turbulence probably.

But let's drill in a little bit to some of this. By failures, I mean what went wrong and what kind of thing failed. For impacts, I mean what are the effects of the failure, and what mitigation mechanisms are in place? Because what you want to have is a failure that doesn't cause an outage, right? Those failures are gonna happen all the time. They do happen all the time. It's when your mitigation doesn't work. So what you're trying to do with chaos engineering is to test those mitigations, to make sure that if you create this failure, it does not cause an impact.

And all of this is this big problem. We're only as strong as our weakest link. You go and you figure out and you test and you chaos test and you do all of these things. And you run all of your game days. And then the one thing you didn't try is the thing that breaks, because that was your weakest link. So, again, how do you think of everything that might fail?

So I think it's useful, I'm gonna go through now lots of lists of things that I've seen fail and different areas to think about it. But I think it's useful to try and come up with something that's maybe some kind of taxonomy of failures, like what are all the different types of failures? What are all the different failure modes? And can we come up with a list? And that list maybe should be a public list that we store somewhere.

There's a CNCF working group on chaos engineering that maybe that's a good place to store it, but somewhere where you can go and say, "Okay, here's the list of failures. Have I thought of all these? Which ones do I think really matter to me?" Rather than trying to invent them on the spot. And I'm gonna basically put this into four layers. Start with the infrastructure, then the software stack, which is the software you imported, plus your application, plus the way you operate it. Okay?

So stuff that goes wrong with infrastructure. Disks, power supplies, cabling, circuit boards, firmware, things you can't upgrade that run inside devices. Devices fail; we know that. CPU failures. What I mean here is things like cache corruption and logic bugs. Back in the 1990s, there was a famous Intel Pentium bug where under certain conditions, the Pentium would give you a different answer for a floating point calculation. This was in all the Pentium chips. And it was a pain in the neck to work out how to deal with that.

When I was working at Sun Microsystems then, we shipped some machines which worked perfectly most of the time. Every now and again, a bit would flip in one of the caches, and you'd get the wrong answer or it would crash or whatever. Again, once hardware is deployed into the installed base, it's really difficult to fix.

Then there's data center failures, pretty common ones. But the interesting thing here is these failures don't all look alike. And there's a story of Jesse Robbins simulating a data center failure as if it was a fire, not as if you just pulled the plug or a digger took out the connection to it, but as if it had caught fire. He said he planned out, since he's a fireman in his spare time, he said, "This is the sequence of failures I would expect if there was a fire in the data center." And he ran that exercise at Amazon 15 years ago when that was what he was working on.

But so think about that. A quake or a wind ... If wind blows the roof off your data center or it floods, maybe all the machines die in the flood. Well, do you have under-floor cabling or over-rack cabling and you flood a data center? That would make a difference, wouldn't it? Right? Do you want all your cables and the power supplies to get wet first, or do you want to get them wet last? Yeah.

And then I'm not quite sure how to put this in, but what I call internet failures, meaning the actual connection to the ISP or the route or the DNS, the way to get to your system failed. I really regard that as infrastructure nowadays for a web service. So that's one class of failures. And you could think about this list, and this is just a starting point. Add more things to this. Think about how you can extend it.

Now, if we look at the software stack, what I call time bombs, and they're bombs because they have a fuse, you will eventually run out of memory and the system will crash because it's got a memory leak, or a counter will wrap round. There was a time when I was helping run the Sun websites, Sun.com, and after nine months the machine crashed because the 100 hertz clock wrapped around after nine months. We had to fix that.
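The arithmetic behind that kind of counter wrap is easy to check. The sketch below assumes a signed 32-bit tick counter, which the talk does not specify; under that assumption a 100 hertz clock overflows after roughly 248 days, in the same ballpark as the nine months recalled here.

```python
def days_until_wrap(ticks_per_second, counter_bits=32, signed=True):
    """Days of uptime before a tick counter of the given width overflows."""
    usable_bits = counter_bits - 1 if signed else counter_bits
    seconds = (2 ** usable_bits) / ticks_per_second
    return seconds / 86400


print(round(days_until_wrap(100), 1))                # signed 32-bit at 100 Hz
print(round(days_until_wrap(100, signed=False), 1))  # unsigned 32-bit at 100 Hz
```

Running the same calculation for every fixed-width counter in a system gives a rough list of fuse lengths to test against before production uptime finds them.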

Then there's date bombs. Fine, everything's gonna be fine until you hit a leap year or a leap second or an epoch. And some of you might know about this, but you can subscribe on Facebook to a nice event which is gonna happen called the end of Unix time, 2038. All 32-bit Unix machines will keel over or pause for 80 years or set the clock back to ... I don't know what it's gonna do. So how many people here test their systems and software by setting the time stamps, setting the date to arbitrary places, the sort of Y2K type testing? It's not that common. A few people maybe. Yeah.

It's a good thing to do. Re-run a leap second to make sure that you don't have any bugs there. Run past the end of ... I'll be quite old at this point. I'll be coming out of my retirement home bearing a copy of the manual set in case anybody's still got any old machines running. But this is the kind of thing that you can test for pretty easily. Put it in your test framework, but there's a lot of cases where this has taken systems out.
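As a small illustration of the end of Unix time, this sketch mimics how a signed 32-bit `time_t` would store a timestamp. The helper name `as_32bit_time` is made up for the example; `ctypes.c_int32` is used because it wraps silently on overflow, the way the C type would.

```python
import ctypes
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
MAX_INT32 = 2 ** 31 - 1


def as_32bit_time(seconds_since_epoch):
    """Interpret a timestamp the way a signed 32-bit time_t would store it."""
    wrapped = ctypes.c_int32(seconds_since_epoch).value  # wraps silently on overflow
    return EPOCH + timedelta(seconds=wrapped)


last_good = as_32bit_time(MAX_INT32)
one_second_later = as_32bit_time(MAX_INT32 + 1)
print(last_good.isoformat())         # the last representable second, in 2038
print(one_second_later.isoformat())  # the clock jumps back to 1901
```

A test suite can feed timestamps on either side of that boundary, and around leap days and leap seconds, into any code that stores or compares times.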

Here's another kind of failure. Certificates timing out, I mentioned. But revocation, if you forget to re-subscribe. That's basically that DNS problem the SaaS vendor had, right? They just didn't pay the bill, so somebody forgot to pay it, that service got revoked. You have a whole bunch of dependencies. Maybe it's a library you've licensed that's got some license key in it. Or maybe it's a SaaS service you depend on. You don't want those things to time out and get shut down, or even suppliers going away, right? Suppliers can go out of business, right?

Maybe that IoT device that's running your house, the vendor didn't sell enough of them, goes out of business, and now your mobile phone doesn't connect to your light switches anymore or something. Exploits, security failures, things like Heartbleed. Again, this gets into your software stack in a systemic way. You have to go and clean it up.

Here's a few more. It's amazing how reliable our languages are, right? But if you think about taking a whole bunch of text and turning it into an executable binary that works perfectly every time, that's a hard problem. And there are language bugs. They're in compilers and interpreters. And there's bugs in the JVM and Docker and Linux and the hypervisor that can affect your runtime.

And then quite often protocols. Protocols work fine when everything's nice. But then you put a bit of latency or delay, and you discover the protocol didn't really handle that error condition well, and it'll keel over under a lot of latency. Like you can use UDP packets, and that's all fine until you put a bit of packet loss on the network or flood your server and you get lots of packets being dropped. You have to go test for those things.

Now let's move up to the application layer. And again, you have the same time bombs and date bombs in your own code, not just in this infrastructure code. But then there's this interesting one I call content bombs. I got a great example here. This is a Netflix example. There was this interesting movie and it had a cedilla in its title, and somebody had encoded it using that little character sequence, but they've missed the closing bracket or they've missed the semicolon or something. And it looked fine when you just looked at it, and nobody noticed until it started becoming a popular movie.

And whenever the presentation tier tried to render this movie, it hit an infinite loop. So thread by thread ... And we had no idea why the hell we were losing ... Every presentation tier machine at Netflix in the entire site started losing threads to an infinite loop, and they would just gradually slow down. And we could restart them and they'd go back again as this movie got popular, but it never actually rendered to a customer.

So we couldn't render it to the customer. The pages would just hang and it would just loop. It took a long time, hours, to figure out exactly what was going wrong and get in there and basically kill that movie so that it didn't try and render anymore. But the code was six years old. It was a latent bug. You couldn't roll back to fix this bug. It was a content bug. And you tend to see these things like packets of death or something like that. A condition comes up that the code's never seen before and it will break. Another type of failure that's hard to guard against and hard to test for, but something you should think about.

And then there's always the wrong config or bad syntax. Who's uploaded a configuration file with a missing semicolon in it or something like that? We've seen quite a few cases of that taking out fairly large systems. And then just incompatible mixes. You got your versioning wrong.

A few more. Lots of different ways this can go wrong. Cascading failures. If your system sees a small error and then it crashes or it blows up in some way, it can just keep cascading, and the error just propagates through the system. It just gets worse and worse. So a small error gets magnified into a bigger error. Also, sometimes a system gets unhappy. And let's say it's getting too slow. Its response to getting too slow is to log lots of messages about, "Hey, I'm getting unhappy." And logging the messages is extra work, so it gets even slower. And you've basically got a runaway cascading overload. And excessive logging is one way that you can actually kill a system, lock contention, those kinds of things.

And then hysteresis, which is an interesting concept. One example of things that cause hysteresis is a retry storm, because when you have too many retries, you get work amplification, right? You're actually doing the work twice, because you're trying it and then you're trying it again. So if you get the system into a bad state, because it's got too much traffic, say, and then you drop the traffic back to what would have been a good level of traffic that it should be able to cope with, because of the work amplification, it is still stuck in the bad state. So that's hysteresis. The system has a memory of the state it was in, and you quite often have to drop the traffic level really low to get it to clear, or even reboot the machine.

So bad timeout strategies, I did a whole talk on this a few years ago. But if you just do a timeout strategy, you will create retry storms and work amplification. So something to think about. And actually the [inaudible 00:23:10] did a post a few days ago about some testing that they did with DynamoDB. And I noticed that the timeout strategy was actually the problem that they uncovered there. That's a great example of exactly that problem.
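To put rough numbers on work amplification, the sketch below computes the expected attempts per request when every failure is retried immediately, assuming failures are independent. It then shows capped exponential backoff with full jitter, a commonly used mitigation that the talk does not prescribe here; the function names and default values are illustrative.

```python
import random


def attempts_per_request(failure_rate, max_retries):
    """Expected attempts per request when every failure is retried immediately,
    up to max_retries extra attempts (failures assumed independent)."""
    return sum(failure_rate ** k for k in range(max_retries + 1))


def backoff_delays(max_retries, base=0.1, cap=5.0, rng=random):
    """Capped exponential backoff with full jitter: each delay is drawn
    uniformly from 0 up to min(cap, base * 2**attempt) seconds."""
    return [rng.uniform(0, min(cap, base * 2 ** attempt))
            for attempt in range(max_retries)]


# A healthy system barely notices retries; an overloaded one does far more work.
print(attempts_per_request(failure_rate=0.01, max_retries=3))
print(attempts_per_request(failure_rate=0.9, max_retries=3))
print(backoff_delays(max_retries=5, rng=random.Random(42)))
```

With three retries and a 90% failure rate, each request turns into more than three attempts on average, which is why traffic has to drop well below the normal level before the system clears.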

All right, operations failures. What can go wrong for a perfectly good system that's got no bugs in it at all? Well, you didn't expect the traffic to look like that, right? This big spike in workload. There was just too much stuff coming in. The mix of work is wrong. You can do all kinds of things which are put under the heading of capacity planning. You just did not plan. The system is not designed to handle the actual work that's coming in.

Incident management. What if you don't have an incident management process or you couldn't find the person that was supposed to be on call when something happened or even the system goes wrong, but no one got called? You had an incident management process, but the call never happened. You get on the call and nobody can find the right monitoring dashboards, or they found them, but they can't log in to them, or all those kinds of problems. These are the things you find in game days.

Insufficient observability of systems. Like, okay, I have all my dashboards, but they don't tell me what's wrong. Probably because I'm confused, I now do the wrong corrective action, and I make it worse and break it. Very common situation.

So let's just look a little bit at the layers of mitigation. If you're trying to mitigate things going wrong, you tend to try and replicate data. The core thing you have to do is move the data so it's in more than one place and replicate things. And the areas you typically do that ... One way of doing this is at the storage block level. There are products you can get for data centers that will just take every block on disk and copy it somewhere else. And then you just start up your databases on the other side after you've done a failover.

Or you can do database-to-database replication that's more structured at the database level. This is the kind of thing that we're doing with sort of MySQL or DynamoDB, Aurora, things like that. We have products that will help you do database replication. Then to really control what's gonna happen, a lot of the times you do application-level replication, so you have logic in the application, which kind of understands that when things go wrong, we have to hand off the data somewhere else. So we're dual-processing all the traffic and synchronizing as we go. So those are sort of the three levels to think about, how you're mitigating all of those different kinds of faults.

So I'm gonna talk a bit about historical context and just the past, present, and future of resilience. And to start with, it used to be called disaster recovery. So I'll talk a bit about what that looked like. Now, we have chaos engineering. And where are we going? I think what we're working on right now is really resilient critical systems, because we are starting to move entire data centers to the cloud and all the stuff that's in there, regardless of how critical that workload is. It may be safety critical, life critical, mission critical to a business. It may be moving enormous amounts of money around. What does it look like when we do that?

So there was a company years ago called Sungard. That actually is ... [inaudible 00:26:22] I corrected that typo, but it didn't catch. Mainframe batch backup. It was the Sun Corp., which was the Sunoco, Sun Oil Company. And they had a backup mainframe company that they spun off. And Gard, it was G-A-R-D, guaranteed access to replicated data.

And that's where we started defining these things like recovery point objective, which means do you do daily backups? That means you can recover to that point that day. If you do hourly backups, your recovery point is now an hour. And then recovery time objective, how long it takes to find the tapes, bring them back, restore. And people started setting SLAs around this.

This is the one where I corrected the spelling of Sungard. So there's a whole market around business continuity. And there are standards around it. And those standards have nice things and glossaries of terms and stuff like that. So rather than redefining what all the terminology should be for your chaos engineering architecture and figuring out what all of your internal business processes should be around chaos engineering, you can use what's already been defined. The business processes are the same. We want to keep the system running. We should do all the due diligence around it.

The implementation is very different. And I'll get onto why that's different a bit later. But there's a couple of these that are relevant, one around Infosec and one around what's called Societal Security, business continuity management.

And I'm just gonna wind forward a bit because there was a whole market here. But let's look at a few more interesting data points. Around about 2004, Jesse Robbins's title at Amazon was Master of Disaster. He went around unplugging data centers, as I mentioned, simulating data centers on fire to try and make sure that the Amazon website would stay up when bad things happen. And there's a few people here from Amazon AWS still working in that area. We have teams at Amazon that work on making sure that we do the right internal simulations of things to cause problems and figure out how to do it.

So in about 2010, when we were figuring out the Netflix cloud architecture, Greg Orzell ... His Twitter handle's chaossimia. It's a bit of a giveaway. He built the first implementation of Chaos Monkey, and we had this decision that said, what we really wanted to do was get all of our developers to write stateless services. We wanted to be able to kill any instance. We wanted to be able to send traffic to any instance. We didn't want any session state. We wanted to auto-scale. And an auto-scaler has to be able to scale down as well as up.

In other words, the auto-scaler will kill instances as it scales down. So the system, basically the idea was we will have an auto-scaler. We will test the auto-scaler by killing machines occasionally. And every now and again, the auto-scaler will say, "Hi, I'm missing an instance. I'll add another one back." So we were exercising the entire system that way.

And this is a designed control for an architectural principle of statelessness. That was the original real reason for it. And the other thing was to get people used to the idea that they were running on commodity infrastructure, which was supposed to be able to fail and go away and that that should not impact anything that happened in the application. So that's where we got to.
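The loop described here can be shown with a toy simulation. This is not the Chaos Monkey implementation; `AutoScaler` and `chaos_monkey` are hypothetical stand-ins that only model killing a random instance and restoring the desired count.

```python
import random


class AutoScaler:
    """Toy auto-scaler that keeps a fixed number of stateless instances alive."""

    def __init__(self, desired):
        self.desired = desired
        self.instances = set()
        self.launched = 0
        self.reconcile()

    def reconcile(self):
        """Add instances until the group is back at its desired size."""
        while len(self.instances) < self.desired:
            self.launched += 1
            self.instances.add(f"i-{self.launched:04d}")


def chaos_monkey(scaler, rng):
    """Kill one randomly chosen instance and return its id."""
    victim = rng.choice(sorted(scaler.instances))
    scaler.instances.remove(victim)
    return victim


rng = random.Random(2010)
group = AutoScaler(desired=3)
for _ in range(5):
    killed = chaos_monkey(group, rng)
    group.reconcile()  # "Hi, I'm missing an instance. I'll add another one back."
    print(f"killed {killed}, group is now {sorted(group.instances)}")
```

Because the instances hold no session state, the only visible effect of each kill is a replacement instance with a new id, which is exactly the property the experiment is meant to enforce.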

A few years later, I think we wrote a blog post on it in 2011. We open-sourced it in 2012. 2016, Gremlin was founded. Kolton was working at Netflix during some of this time. 2017, we had the Chaos Engineering book. We also saw the Chaos Toolkit Project come out, which I think is an interesting place for us to sort of share where we can share ideas. And I think what we're seeing right now is chaos concepts are starting to be adopted pretty widely. And it's great to see this conference.

So I'm gonna talk a bit more about chaos from an architectural perspective. Some of you may have seen this section of the slides before, but I think it's worthwhile going over. What we're trying to do is have no single point of failure in our infrastructure. We get globally distributed systems now. We can deploy stuff anywhere. The cloud gives us this capability. So that's our infrastructure.

But on top of that, we need to be able to route traffic. So how do you switch and interconnect your customers or your consumers to the services? It's really important, and the problem we have here ... Let's say something goes wrong in some region and we reroute the customers to another region, and then it comes back. We have to decide, is it actually back? Okay, have we done the right anti-entropy recovery? And how can we switch customers back to it?

The code that does this, the switching and interconnecting code, is usually the least well-tested code in any infrastructure. And it's some of the most critical code. And this is a basic principle of reliability. If you're trying to switch between two things, the switch itself needs to be an order of magnitude more reliable than the things it's switching between, because otherwise, overall, you're less reliable. By adding a second thing you think you're making yourself twice as reliable or more reliable, but you'll actually be less reliable overall if the switch is less reliable than the things it switches between.
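A back-of-the-envelope model makes the switching principle concrete. The sketch assumes independent failures and a simple series/parallel structure (the pair is up when at least one copy is up and the switch works), which is a simplification; the availability figures are made-up inputs.

```python
def failover_pair(a, switch):
    """Availability of two independent copies behind a switch.

    The pair is up when at least one copy is up AND the switch works,
    a simplified series/parallel model that assumes independent failures.
    """
    return switch * (1 - (1 - a) ** 2)


a = 0.99  # availability of a single copy on its own
for switch in (0.999, 0.99, 0.95):
    combined = failover_pair(a, switch)
    print(switch, round(combined, 6), "better" if combined > a else "worse")
```

In this model, a switch that fails ten times less often than a single copy leaves the pair ahead of one copy, while a switch that is only as reliable as the copies, or worse, makes the redundant setup less available than running a single copy.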

So it's a really key principle. And what we're doing by running game days and chaos engineering is we're testing this layer. We're testing the way that the system fails over. We're testing our human processes and our automation. So that's switching.

Then we have application failures. What does your application do if it gets an error or a slow response, or it just drops the connection? Does it keel over? Does it carry on nicely? This is relatively easy to test. Every little test framework, every microservice should have the error injection to do this in your test environment before it even sees production. But you should be testing this in production, too. So that's the app.
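A minimal sketch of that kind of error injection in a test environment might look like the following. Everything here is hypothetical: `with_faults`, `InjectedFault`, `lookup_subscriber`, and `can_play` are illustrative names, and the fallback policy (carry on and allow playback) is just one choice an application could make.

```python
import random
import time


class InjectedFault(Exception):
    """Raised in place of a real dependency error."""


def with_faults(call, error_rate=0.0, delay_s=0.0, rng=None, sleep=time.sleep):
    """Wrap a dependency call so it sometimes fails or responds slowly."""
    rng = rng or random.Random()

    def wrapped(*args, **kwargs):
        if delay_s > 0:
            sleep(delay_s)  # simulate a slow response
        if rng.random() < error_rate:
            raise InjectedFault("injected dependency failure")
        return call(*args, **kwargs)

    return wrapped


def lookup_subscriber(customer_id):
    """Hypothetical stand-in for a call to a subscriber service."""
    return {"id": customer_id, "active": True}


def can_play(customer_id, lookup):
    """Application code under test: carry on with a default if lookup fails."""
    try:
        return lookup(customer_id)["active"]
    except InjectedFault:
        return True  # degrade gracefully rather than keel over


# With every call failing, the application should still give an answer.
always_failing = with_faults(lookup_subscriber, error_rate=1.0)
print(can_play(42, always_failing))
```

Passing the random source and the sleep function in as parameters keeps the wrapper deterministic under test.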

Then there's these pesky users and operators. Quite often a perfectly good working system will get taken down by its operators because they got confused. And there's some great examples of this. The Chernobyl nuclear disaster is actually an example of this. They were running through some test procedures. There was actually nothing wrong with the nuclear reactor at the time that they started the test. They got confused. They started ignoring alarms because they thought it was part of the test and all this kind of stuff.

And basically a human interface, human experience, a usability failure of the interaction between the systems that were operating the plant and the operators themselves caused an entire meltdown. So there's a number of examples of that.

Usually rebooting it is the wrong answer. If somebody in the middle of an outage says, "Let's reboot it," you should be like, "Hang on a minute. Are you really sure? Because if you're just doing it because you couldn't think of anything else, you'll probably make everything worse."

So people training. I've given this talk quite a few times around the world. This is usually when the fire alarm goes off. So hopefully we won't all have to clear out at this point. But we make everyone leave the building and go and stand in the parking lot. I've been around the world, and every single elevator has a sign on it that says, "In the event of fire, do not use the elevator." That is universal. It's amazing how universal it is. And what that means is that when there really is a fire, people universally know how to behave and how to get out. That must have saved an awful lot of lives around the world.

So who runs the fire drill for IT? When your system is on fire, in quotes, how did you train everybody to deal with that? And this is I think one of the key functions of the chaos engineering team. They're like the fire safety facilities people that run the fire drills to make sure that your building is safe. And what you're doing is you're doing it for the application.

So this is a great book with some great ideas on how to do this. And there's a bunch of tools, and the tools operate at different levels. So I think Gremlin and tools like that operate at the infrastructure level, maybe the switching level and application level; they're working their way up. The Chaos Automation Platform from Netflix is their internal tooling. Chaos Toolkit is a way to define experiments and figure out how to schedule them and look at the results. Simian Army is a lot of the Netflix open-source things. And then just the whole idea of having game days to exercise the people is really important.

And I think there's an analogy here with the security red team approach, where the red team are using tools to try and break into the system. And if you don't have a red team, you don't really know how good your security is. So they use tools for social networking. SafeStack AVA is a tool for doing spear phishing on your employees. Infection Monkey is from GuardiCore. It will actually just try and infect your network and take over your systems. In fact, if you want to map everything in a data center, it's actually probably a good tool for that, although it is a bit scary.

ChaoSlingr was an interesting tool. It generates policy violations, things like if you say, "I should never have any S3 buckets open to the world," ChaoSlingr will create one and then sit back and say, "Did you notice? Did you complain? You're supposed to notice and complain when I do that." So it creates these security policy violations to prove that you have the correct controls and corrective actions in place to detect it. And there's a few others that are more network-level for attacking things, more products.
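The shape of that kind of security experiment can be sketched without touching a real cloud account. This is not ChaoSlingr's code or API; `FakeCloud`, `detect_public_buckets`, and `run_policy_experiment` are hypothetical names, and the "cloud" is an in-memory dictionary.

```python
class FakeCloud:
    """In-memory stand-in for a cloud account; not a real cloud API."""

    def __init__(self):
        self.buckets = {}  # name -> {"public": bool}

    def create_bucket(self, name, public=False):
        self.buckets[name] = {"public": public}

    def delete_bucket(self, name):
        self.buckets.pop(name, None)


def detect_public_buckets(cloud):
    """The control under test: report every bucket open to the world."""
    return sorted(name for name, cfg in cloud.buckets.items() if cfg["public"])


def run_policy_experiment(cloud, detector, canary="chaos-canary-bucket"):
    """Inject a violation, check that the detector notices, then clean up."""
    cloud.create_bucket(canary, public=True)
    try:
        return canary in detector(cloud)
    finally:
        cloud.delete_bucket(canary)  # always remove the injected violation


cloud = FakeCloud()
cloud.create_bucket("private-data")
print(run_policy_experiment(cloud, detect_public_buckets))
```

A detector that never reports anything makes the experiment return False, which is the signal that the control is missing.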

So I think there's an attitude here, which is we're trying to break it to make it better. But in some senses, we're really breaking it to make it safer. And there's this whole other area, and if you're not familiar with this, if you're into safety in general, there's a lot of work going on. I've mentioned Sidney Dekker a few times. John Allspaw has done a lot of work on integrating the general industry ... There is a safety industry. And if we just build our stuff in our own little ivory tower over here in software and say, "We don't care about that," we're missing a lot.

So the Stella Report, that's a URL, Stella.report, was when he brought together people from industrial safety and people from software to work together to help define things and start communicating. And I listen to Todd Conklin's Pre-accident Podcast. He wrote a really good book called "Workplace Fatalities." You don't necessarily want to read a book about people dying at work. But the really interesting point in this book is that nobody comes to work every day saying, "I think we're gonna kill a few people today," unless they work in a battlefield or a hospital. There's some exceptions.

But generally speaking, in most environments, you do not expect and you do not plan for people to die. And it's such an outlier that it doesn't fit the model. So the point here is your failure model will not include the outlier that breaks everything. It's not part of the continuous distribution of types of failures. And the other interesting thing is that there's a real survey here that looked at airline failures, airline fatalities. The airlines that reported the most incidents had the fewest fatalities. The airlines that reported the fewest incidents had the most fatalities.

And most people seem to think intuitively that if you have lots of incidents, you're worse, so you're going to have more fatalities. So it's backwards. And the reason is psychological. People, if you say, "I don't want any incidents because I don't want anything bad to happen, so I don't want small things to happen either," people under-report and they just stop reporting the small things. And the small things build up, and eventually drift into failure. You get a big thing.

So it's really important to have a culture of sharing and learning. And this is the most important thing, and the whole blameless postmortem, the blameless culture, comes out of this. Very critical, because if you start suppressing failures and outliers, you will actually get a big failure at some point.

So key thing here, failures are a system problem. It's a lack of safety margin, not something with a root cause. That is, typically, if you have a component failure or a human error, it should not cause an outage. It may cause a failure, but it should be mitigated. Your system should deal with it.

And I have a little example thought experiment here. If I was blindfolded walking around here and just started running around on stage, I'd eventually fall off, right? But what I would actually do is be pretty careful and reach out with my foot and make sure that I knew where the edge was. Or I could take the blindfold off. But quite often we can't see things well.

So what we're really doing with hypothesis testing is we're saying, "I think I have safety margin. I think there's a cliff over there. But I don't know how far away it is. I'm gonna run a hypothesis test, an experiment which pushes the system in that direction far enough that I know I've got some margin, but not far enough to fall off the cliff." And that's a way of thinking about what a chaos engineering experiment is. Test, obviously, but you really only know where these margins are in production. That's where it really matters.
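One way to picture "reaching out with your foot" is an experiment that raises the stress in steps and stops as soon as the steady-state check fails. The sketch below is an illustration under stated assumptions: `find_safety_margin` is a hypothetical helper, and `simulated_system` is a toy model in which errors appear once added latency passes 300 ms.

```python
def find_safety_margin(measure_error_rate, steps, max_error_rate):
    """Push the system step by step and stop before the cliff.

    measure_error_rate(level) applies the given stress level and returns the
    observed error rate. Returns the highest level that stayed within
    max_error_rate, or None if even the first step breached it.
    """
    last_safe = None
    for level in steps:
        observed = measure_error_rate(level)
        if observed > max_error_rate:
            break  # hypothesis failed at this level: abort and roll back
        last_safe = level
    return last_safe


def simulated_system(added_latency_ms):
    """Toy model (an assumption): errors appear once latency passes 300 ms."""
    return 0.0 if added_latency_ms <= 300 else 0.25


margin = find_safety_margin(
    simulated_system, steps=[50, 100, 200, 400, 800], max_error_rate=0.01
)
print(margin)
```

The loop never tries the 800 ms step, because the 400 ms step already breached the threshold; the experiment learns there is margin up to 200 ms without going over the edge.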

So what we're trying to get to is a point where we have experienced staff, they know how to be on call, they know how to work through an outage, know how to sort things out. They've been through game days. They're using robust applications that don't keel over at the first sign of failure and have sensible retry and timeout strategies. We've got a dependable switching fabric, so we really can switch traffic between alternatives when something really is broken. And we're running on a redundant service foundation, meaning we have storage redundancy, data redundancy, CPU redundancy.

So that's basically where we need to get to. So that's kind of a summary of the chaos engineering bit. I'm gonna talk next a bit about some possible future directions, a few other things that I think are interesting that maybe we haven't looked at really in that much detail before.

Three different areas really. Observability of systems. We can talk a bit about epidemic failure modes and then a bit about how we can use automation to get to what maybe is called continuous chaos. Observability, and this is something where I think at some point, people started arguing about it, "A new term we should be defining and arguing about." It's actually an old term. It comes from 1961. A really good definition by Kalman, who was one of the founders of control theory, "A system is observable if the behavior of the entire system can be determined by only looking at its inputs and outputs."

That is a control theory definition. It actually works perfectly well. And I'd rather people learned some control theory than reinvented and redefined terminology. The CNCF paper redefines observability as some waffly thing. I changed it once to be the definition. Somebody changed it back again. So at some point I give up. But it would be really nice if we didn't reinvent terms that have strong definitions that are well understood in the industry. So that's just a little mini rant, if you like.

But the key thing here is that if you have a system that is very linear and well-behaved, you poke it and it does a thing, and it needs very little logging. The more complicated the system, the more internal observability you need. So there's this kind of range here. If you think about a function with no side effects, like a lambda function, you call it, it does a thing. You call it 10 times, it does the thing 10 times.

It takes a certain length of time. It's got no side effects. It's got no internal state. It's going to do the same thing every time. It is inherently very observable. That's one of the reasons that the lambda, the functional models that we have there have inherently pretty good reliability and observability.

If you get a microservice, a nice small one that does one thing, again, it may have a bit more internal state this time. It may be a little bit more complex, but it is much more observable than say a monolith. Even with tracing and logging, it's really hard to reason about what a monolith will do. You give it a bunch of inputs, you look at the outputs, but it's hard to model it because it's got so many different things it does. It has all these different operations it can perform. It's got internal state. It doesn't do the same thing if you ask it to do it twice quite often.
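The contrast between those two ends of the range can be shown with a small sketch. Both pieces are hypothetical: `price_with_tax` stands in for a function with no side effects, and `RateLimitedPricer` stands in for a component whose hidden internal state changes what the same input produces.

```python
def price_with_tax(amount_cents: int) -> int:
    """No internal state, no side effects: output depends only on the input."""
    return amount_cents + amount_cents * 8 // 100  # 8% is an arbitrary rate


class RateLimitedPricer:
    """Hidden internal state: the same input can produce different outputs."""

    def __init__(self, limit: int):
        self.calls = 0
        self.limit = limit

    def price_with_tax(self, amount_cents: int):
        self.calls += 1
        if self.calls > self.limit:
            return None  # behaviour changed, and nothing in the input says why
        return price_with_tax(amount_cents)


# The pure function gives the same answer every time it is called.
print([price_with_tax(1000) for _ in range(3)])

# The stateful one cannot be understood from inputs and outputs alone.
pricer = RateLimitedPricer(limit=2)
print([pricer.price_with_tax(1000) for _ in range(3)])
```

To explain the third result from the stateful object you need to see its call counter, which is exactly the extra internal observability that more complicated systems demand.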

So this is thinking about ... This is I think one of the reasons chaos engineering appeared at Netflix at the time that it did was that we had moved to microservices and we had single-function services. And we had built them to be stateless with basically functions with no side effects but implemented as services. We had built systems which were inherently tractable for the problems of trying to build chaos engineering on top.

And when we talk about it, people think, "You're crazy. You can't go around rebooting machines. My monolith would die." Yes, it would. Monoliths don't like chaos engineering. That was one of the ... So the move to microservices is intimately tied with the rise of chaos engineering, although the disaster recovery principles apply.

The other point is around automation. I'll talk a bit about that later. So here's some thoughts on failures. You can have independent failures. And this is the way lots of modeling thinks. Well, this goes wrong, we're fine. If this goes wrong, we're fine. So we're good, right? Well, if they both go wrong at once, well, that won't happen. But then they do go wrong at once. And you get correlated knock-on effects. This thing failed. It caused this thing to fail, which caused this thing to fail. Now you have three failures, and your system wasn't designed to cope with three failures. So that's a pretty common thing to worry about how to avoid correlated failures.
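A small sketch makes the knock-on effect concrete. It assumes a very simple model in which a service fails as soon as any one of its hard dependencies has failed; the `cascade` function and the service graph are hypothetical illustrations.

```python
def cascade(depends_on, initially_failed):
    """Return every service that ends up failed.

    depends_on maps a service to the services it cannot work without.
    A service fails if any one of its hard dependencies has failed.
    """
    failed = set(initially_failed)
    changed = True
    while changed:
        changed = False
        for service, deps in depends_on.items():
            if service not in failed and failed.intersection(deps):
                failed.add(service)
                changed = True
    return failed


# Hypothetical service graph, for illustration only.
graph = {
    "web": ["api"],
    "api": ["auth", "catalog"],
    "auth": ["database"],
    "catalog": [],
    "database": [],
}
print(sorted(cascade(graph, {"database"})))
```

One initial failure in the database turns into four failed services here, which is the situation a design that only considered single independent failures never planned for.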

And then there's epidemic failures. I'm gonna talk a bit about classic epidemic failures. Think about crops. You have a field of wheat and it's all genetically the same wheat. And some bug comes in or some disease comes in, and it kills all the wheat in one go. That's the kind of epidemic failure. Or we see human epidemics when diseases and the flu season comes by every year and everyone gets flu. That's an epidemic, right?

Here's the examples from computing. That Linux leap second bug, remember that? There was a version of Linux where, when we hit the leap second, everything died. They just stopped. I remember seeing entire websites just stop responding. They went off the internet until they figured out how to patch them. I was at Netflix at the time. We lost some of our Hadoop clusters. Most of our machines we hadn't patched. Most of our systems were too old, and the rest of them were too new. So we had a diversity of versions of Linux running across our infrastructure, and we got taken out, but only on some of them.

I mentioned this Sun bit flip. This was really annoying. Hardware is really hard to fix. We had to do workarounds, which were expensive, shipping people new CPU modules. We had to do software workarounds, which slowed the system down. It was difficult. I won't go into exactly why it happened. [inaudible 00:45:45] good, good reasons why this chip was designed the way it was. It turns out the cache was much bigger than they thought it was gonna be. And the error rate of the chips the cache was built with wasn't within spec. So they were getting little alpha particles, flipping bits inside them.

A cloud zone or region failure. This is an epidemic. It takes out a whole chunk of machines in one go. DNS failure, again, the same way. Think how would you guard against DNS failure? Let's say you would go back to the SaaS company, what would you do? Well, maybe you have a secondary DNS address with a different name. You register it at a different DNS provider. There's a bunch of different things you could do. You could have your software failover, your mobile app, if this name doesn't work, try another name. If you test for it, you can build around these things.
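The "if this name doesn't work, try another name" idea can be sketched as a client-side fallback. This is a minimal illustration: `resolve_with_fallback` and `broken_primary` are hypothetical helpers, the hostnames are placeholders, and the resolver is injectable so a DNS outage can be simulated without touching the network.

```python
import socket


def resolve_with_fallback(hostnames, resolver=socket.gethostbyname):
    """Try each name in order and return (name, address) for the first success."""
    errors = []
    for name in hostnames:
        try:
            return name, resolver(name)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            errors.append(f"{name}: {exc}")
    raise OSError("all names failed: " + "; ".join(errors))


def broken_primary(name):
    """Test resolver that simulates an outage at the primary DNS provider."""
    if name.endswith(".example.com"):
        raise socket.gaierror("simulated DNS outage")
    return "192.0.2.10"  # address from the documentation range


print(resolve_with_fallback(["api.example.com", "api.example.net"],
                            resolver=broken_primary))
```

The fallback only helps if the second name really is registered with a different provider, and it is only trustworthy if the failure path is exercised regularly.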

And then security configuration. And there's one of the cloud providers managed to ... global company managed to take out its entire network for a short period once by globally propagating a security update that was missing a semicolon somewhere.

So what you need is quarantine. You have to be able to ... If you have an epidemic, you have to quarantine it. You start ... That's why when you go through airports in Asia, they have these little scanners that see whether you've got a temperature or they want to take you off and see what kind of bug you've got.

So here's a few ideas for how you could quarantine or contain this class of failures. Why are you running everything on Linux? Maybe you should also be able to deploy on Windows, because if you're looking for diversity of code base, Windows is about as different as you can get from Linux. Maybe a few things, like the TCP handling code, might be copied or use the same basic packages. But most of the code on Windows, the kernel, is different. So you're unlikely to get a zero day or a leap second bug on both at the same time. Or maybe BSD, or maybe just different kernel versions on Linux.

So think about how much diversity you have because the power of all of the automation we have now is we can make everything the same. In the old data center world, we had a pile of different systems. They'd fail more independently. Now, we're able to cookie-cutter everything to be identical. It's the same AMI they've built. We've deployed it 10,000 copies in an auto-scale group or 10,000 copies of the same container, they all fail the same way at the same time. So we should think about programmatically introducing diversity into the systems we're building.
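Introducing diversity programmatically could be as simple as spreading a fleet across several build variants instead of stamping out one image. The sketch below is an assumption about how that might look; `assign_variants` and the variant names are hypothetical.

```python
from collections import Counter
from itertools import cycle


def assign_variants(instance_ids, variants):
    """Spread instances round-robin across several build variants."""
    return dict(zip(instance_ids, cycle(variants)))


# Hypothetical variant names; each could be a different kernel, OS, or CPU type.
variants = ["linux-kernel-a", "linux-kernel-b", "bsd"]
fleet = assign_variants([f"i-{n:04d}" for n in range(10)], variants)

# If one variant hits an epidemic bug, only its share of the fleet goes down.
print(Counter(fleet.values()))
```

With three variants, a bug specific to one of them takes out roughly a third of the instances rather than all of them.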

Use a variety of CPU implementations. Obviously cross-zone and region replication. Think about multiple domains and providers. And then when you do deployments, don't deploy globally. Somebody said, "To err is human. To err and deploy globally is devops," or something. There was a tweet. I have to go find that tweet. But that's kind of the problem. So you want to deploy in stages. And Amazon's very good at just deploying one zone at a time, and then it does a region. And then if that looks like it works, it waits a while, and then does another region, and walks around the world.

So we tend to deploy updates in a very rolling fashion to make sure that we don't take anything out. And Netflix does the same thing with all its code deployments. It does a series of canary tests around the world.
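The rolling pattern described above, one zone, then a region, then the next region, can be sketched as a loop that halts at the first unhealthy stage. This is an illustration only, not how Amazon or Netflix implement it; `staged_rollout` and the stage names are hypothetical, and a real rollout would also wait between stages.

```python
def staged_rollout(stages, deploy, healthy):
    """Deploy one stage at a time, stopping at the first unhealthy stage.

    Returns (completed_stages, failed_stage); failed_stage is None on success.
    """
    completed = []
    for stage in stages:
        deploy(stage)
        if not healthy(stage):
            return completed, stage  # halt: later stages are never touched
        completed.append(stage)
    return completed, None


# Hypothetical stage names: one zone first, then whole regions.
stages = ["zone-1a", "region-1", "region-2", "region-3"]
deployed = []

done, failed = staged_rollout(
    stages,
    deploy=deployed.append,
    healthy=lambda stage: stage != "region-2",  # simulate a bad canary result
)
print(done, failed, deployed)
```

Because the rollout stops at region-2, region-3 never receives the bad change, which is the containment that a global deploy gives up.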

So this is the real problem. And this is a larger-scale problem. You can uncover some kinds of these things with chaos engineering, but think about how much the idea ... The future idea I've got is that we should be managing automatically the diversity in our systems and deliberately introducing it. And there's some discussion of this in the second edition of the "Release It!" book, because the first time I discussed this problem was with Michael Nygard a few years ago.

All right, so now I'm gonna talk a bit about a few ideas about actually putting mechanisms in practice and where we're going here and talk about AWS and Kubernetes in the last few minutes here. So how does AWS do isolation? We have no global network, which is annoying, right? You can't create a global network address space on AWS. And we do it very deliberately. It's part of our isolation strategy. And some other cloud providers make it easy for you. And they are also opening themselves up to a class of failure that we don't have. So that's a philosophical idea that we want to have completely independent regions.

We have ways of replicating things across regions. But we don't want any failure to propagate from one region to another. And this is a really deep philosophy that we work at very hard. And there are places where there has been coupling in the past where we are systematically removing that coupling. So there's some very old S3 options which are a bit more global. We're removing those, and we're re-implementing them. We just don't use them, right? We don't take away old things that worked. We don't like shutting stuff down.

But think carefully if you're using a global feature of a cloud. Regions are made up of availability zones, they say between 10 and 100 kilometers apart, which is close enough to be a few milliseconds for synchronous access but far enough apart they're in separate flood plains, separate electricity supply, separate network drops. That's been the combination we've always tried to go for.

Each zone, the big zones are multiple buildings. The small zones may be one building. But potentially multiple buildings. But even some of these buildings have got so big that within the buildings, we actually subdivide. So when you get that annoying, "Hey, there was a problem with some of your machines on AWS," it's very rare that it affects every customer even within a zone or a region. It's typically just a rack or a subset of a zone. So we've got a lot of bulkheads in the system at lots of layers, layers that you can't see deep in the system. We have lots and lots of bulk-heading going on.

And then there's this redundant private network around the world. So we run our own network. If we had to, we can route traffic across the internet, but we generally don't do that nowadays. And in certain cases, we guarantee that we don't, if you want to be sure about that for security purposes.

So here's a couple of things just to make sure people are aware we're doing a few useful things. You can actually do fault injection on databases on Aurora now. So you run an "ALTER SYSTEM SIMULATE" query with a percentage of failure. You can crash masters and replicas. You can cause disk failures and congestion. So it's because the underlying storage layer, which is implementing this, is hidden from you. It's under the hood. It just looks like a MySQL or Postgres database. But you need to be able to get under it and simulate what if this happened? So really encourage people to try these kinds of things out in their chaos experiments.
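As a rough sketch, the fault injection queries could be generated by small helpers like the ones below. The statement shapes follow the Aurora MySQL fault injection query syntax as documented by AWS, but they should be checked against the current documentation before use; the helper functions themselves are hypothetical and only build strings, they do not connect to a database.

```python
INTERVAL_UNITS = {"YEAR", "QUARTER", "MONTH", "WEEK", "DAY", "HOUR", "MINUTE", "SECOND"}


def _check(percent, unit):
    """Validate the shared arguments before building a statement."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    if unit not in INTERVAL_UNITS:
        raise ValueError(f"unsupported interval unit: {unit}")


def crash(target="INSTANCE"):
    """Statement that crashes the instance, dispatcher, or node."""
    if target not in {"INSTANCE", "DISPATCHER", "NODE"}:
        raise ValueError(f"unsupported crash target: {target}")
    return f"ALTER SYSTEM CRASH {target};"


def disk_failure(percent, quantity, unit="MINUTE"):
    """Statement that marks a percentage of disk requests as failing."""
    _check(percent, unit)
    return (f"ALTER SYSTEM SIMULATE {int(percent)} PERCENT DISK FAILURE "
            f"FOR INTERVAL {int(quantity)} {unit};")


def disk_congestion(percent, min_ms, max_ms, quantity, unit="MINUTE"):
    """Statement that delays a percentage of disk requests."""
    _check(percent, unit)
    return (f"ALTER SYSTEM SIMULATE {int(percent)} PERCENT DISK CONGESTION "
            f"BETWEEN {int(min_ms)} AND {int(max_ms)} MILLISECONDS "
            f"FOR INTERVAL {int(quantity)} {unit};")


print(crash())
print(disk_failure(25, 5))
print(disk_congestion(50, 10, 100, 2))
```

Each statement carries its own time interval, so the injected fault expires on its own even if the experiment runner dies partway through.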

Another one if you're doing multi-region is that you can now scope API calls to a region. So let's say you're using DynamoDB or S3 and you have a multi-region application. You can actually cut all traffic to a region at the IAM [role 00:52:59] level at a service-by-service level. So we're using the identity access management system permissions to dynamically create the inability to call an API in a region. So it doesn't work for every possible thing you might want to do. But it at least gives you another knob you can turn if you're trying to prove that something is working right in a multi-region manner.
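One plausible way to express that knob is an IAM policy that denies a service's API calls when the request targets a given region. The sketch below assumes the `aws:RequestedRegion` global condition key is the mechanism meant here, which the talk does not state; the function name and the `Sid` value are hypothetical, and attaching the policy to a role is left out.

```python
import json


def deny_service_in_region(service_prefix, region):
    """Build an IAM policy document denying one service's API calls in one region."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "ChaosDenyRegion",
            "Effect": "Deny",
            "Action": f"{service_prefix}:*",
            "Resource": "*",
            # Assumption: the region scoping is done with this condition key.
            "Condition": {
                "StringEquals": {"aws:RequestedRegion": region}
            },
        }],
    }


policy = deny_service_in_region("dynamodb", "us-west-2")
print(json.dumps(policy, indent=2))
```

Attaching a policy like this for the length of an experiment and then detaching it gives a reversible way to make one region look unreachable for one service.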

We also have been doing quite a bit of work with Kubernetes recently. Gremlin has a bunch of dedicated tools for driving Kubernetes. The open-source Chaos Toolkit has also got some interesting work here. There's drivers here for Gremlin. You can actually coordinate Gremlin from the open-source toolkit, but also talk directly to EC2 and to Kubernetes. And we built a workshop for running Kubernetes on AWS. And the Chaos Toolkit people contributed a module to the workshop for how to do it.

And the CNCF has created this chaos working group. It's quite interesting. They're trying to define things. They're trying to come up with a common place where people that care about this can get together. And you can do things like kill your control plane instances or kill your nodes and make sure your Kubernetes cluster recovers. So there's obviously much more sophisticated things to do here.

The interesting thing here is that this is a common place where everybody can come together and share open-source implementations of experiments and everybody can take that away and use it. So the fact that Kubernetes is a very common API across data centers and multiple cloud vendors gives us a place where we can actually start to have a concerted group effort to come up with a better way of doing things that then everybody can leverage rather than doing it piecemeal one-by-one. So I encourage you to have a look at that.

But the other thing that cloud did was it provided the automation that led to chaos engineering. I think the reason why this came up at Netflix was microservices and the API-driven automation. It's just too difficult to do in a data center. But what we're seeing, as data centers move to cloud, is that the fragile manual disaster recovery processes that they've used in the past, which, if you ask them, they really don't test that often in many cases, are going away. The new architectures use chaos engineering to replace disaster recovery.

So if you're working in this space, I'd really recommend going and reading up on the disaster recovery terminology, go read those ISO standards. They're not too bad to read. You have to pay for downloading the PDF, but I managed to Google around and find a copy that was openly downloadable. So there's a lot of nice summaries here where you can go find stuff and learn about the challenges of taking a more traditional disaster recovery system and updating it to be chaos engineering.

So I think what we're looking at is we're moving from a scary annual experience where the auditors come and make you do it to automated continuous chaos. This is something that should be running all the time. And the chaos automation platform that Netflix talked about a year or so ago, they're running 1,500 automated experiments every day. That is a lot to keep running. But think of it as a continuously automated process that is just making sure that you always have the resilience to do something, not a thing where you gather everyone together in a room then go, "Oh God, we're gonna try this thing and hopefully it won't break."

So I think that's where we're trying to get to. So say thanks for listening. And there's some AWS people here, and I'm particularly interested in this. If [inaudible 00:56:44] things that you think AWS should be doing or implementing that can help you build better chaos experiments or control what's going on, we're really interested in doing this. We're trying to figure out right now what should our products be in this space? How should we approach it? What capabilities and products should AWS be bringing to the market to help you build these systems?

So thanks. Yeah, that's all I had. So thanks very much.
