Dave Rensin: Chaos Engineering for People Systems - Chaos Conf 2019
The following is a transcript from Google Senior Director of Engineering, Dave Rensin’s talk at Chaos Conf 2019, which you can enjoy in the embedded video above. Slides are available here.
Good morning everyone. Let's try that again, good morning everyone. Thank you, we're live on Twitch. We need to be interactive, like I need to be playing Mario Kart or something. I am Dave, I do work at Google. Before I get started this morning, I need to give you a small trigger warning. These slides that you are about to see in this presentation were built by me, and I am an engineer, and yes I have worked in lots of parts of Google, support, SRE, even finance, but I've never worked in marketing. And you will be able to tell that instantaneously.
That is my contact information, I would like to hear from you. Okay, so trigger warning. Oh, and the other thing is I am going to say some things this morning, some of them you may find provocative, I kind of hope you do. Some of them you may even mildly disagree with, that is working as intended. I will be at the show today. Find me, let's talk.
All right, that having been said, let's do a little audience participation. Everyone raise your left hands right now, high in the air. Let's go. Up, up, up, come on, everyone means everyone. Leave them up. With any good experiment we want to document it. Excellent. All right, you can put them down, thank you.
Okay, now that we're all in the hand-raising mood, let me ask a question. Show of hands, how many of you are actively practicing Chaos Engineering of some kind intentionally? We all practice it unintentionally. Intentionally in your companies today, let me see hands. Oh, that's a good, goodly number of people. Thank you. All right. Everyone else I assume is here to learn about Chaos Engineering, maybe learn how to get started. We'll call you the chaos curious. That is fine. It's a perfectly natural part of development to be chaos curious.
So, why do we do Chaos Engineering? For those of you maybe who are a little newer to this. The thing I hear a lot from people is, "Oh, this is just testing, right? You tech people, you just like to invent new terms for things." No, actually this is different. When you test a system, you think you have some idea of how it works. At a minimum, you have some idea of what the properties of the system are. That is the foundation of the test. A test is pass, fail. Did it do the thing I expected? The problem is that systems that are diverse, that have lots of moving pieces, that are large and distributed, exhibit emergent properties. Over time, they do things we did not explicitly engineer into the system, and we need to discover these properties.
Chaos engineering is about running experiments so that we can discover the new properties of our system, so we can actually discover how our system really works, not how we thought it worked when we deployed it to production. And that is a real thing. Systems really do drift, the larger they are, the more it happens.
Why do we do this? Like why is that important? Because it is a metaphysical certainty that if we don't do this, these properties will be discovered just by our users. And when our users discover these new and interesting properties of our systems, they will be very excited and they will come and tell us in a very excited way about them. We call these customer support. Or if they get really excited, they'll want to tell everybody about these new and exciting properties, via say Twitter. We call that a bad day. If we have too many bad days over any period of time, we will soon find we have no users. We call that bad luck.
So, Chaos Engineering, if you're looking for an unnecessarily compacted definition, this is my unnecessarily compact definition, is a discipline for systematically minimizing bad luck. Because bad luck is, well, bad. We do not want that. If you'd like something a little more memorable or tweet worthy, I always enjoyed this quote from Eliyahu Goldratt, "Good luck is when opportunity meets preparation." I think we all know that. "While bad luck is when lack of preparation meets reality."
All right, so what are the kinds of things we do to our systems to discover these properties? We have a number of techniques that we do. I have a few favorites. Like we might start with fault injection. So fault injection, for those of you who don't know, is when we tell our servers, let's say, to periodically just report an error, not an incorrect value, an error. Just every so often say something went wrong, sorry, and see how the rest of our systems handle that.
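As a rough illustration of the fault injection described above, here is a minimal Python sketch. It is not taken from the talk: the names `with_fault_injection`, `InjectedFault`, and `lookup_price` are hypothetical, and the idea of wrapping a single function stands in for whatever mechanism a real server would use.

```python
import random


class InjectedFault(Exception):
    """Raised in place of a real response when a fault is injected."""


def with_fault_injection(handler, error_rate, rng=None):
    """Wrap `handler` so that roughly `error_rate` of calls raise an error.

    The wrapper never returns an incorrect value: each call either returns
    the real result or fails outright.
    """
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be between 0 and 1")
    rng = rng or random.Random()

    def wrapped(*args, **kwargs):
        # random() is in [0, 1), so a rate of 0 never fires and 1 always does.
        if rng.random() < error_rate:
            raise InjectedFault("something went wrong, sorry")
        return handler(*args, **kwargs)

    return wrapped


def lookup_price(item):
    # Hypothetical backend call, used only for illustration.
    return {"apple": 3, "pear": 5}[item]
```

Wrapping a call as `with_fault_injection(lookup_price, error_rate=0.1)` and pointing callers at the wrapped version is one way to watch how the rest of a system copes with an occasional hard failure.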
Or we can go a step further, we can do fuzz testing. Fuzz testing is when we intentionally inject plausible but incorrect data into a stream to make sure we can handle it. Like weird Unicode characters in a text stream. Those of you who have ever heard of SQL injection bugs, that's because we didn't do enough fuzz testing in a system.
We might do artificial resource constriction, which is a fancy way of saying we turn a bunch of stuff off, right? So we turn off a bunch of back ends to make sure that our load balancing and our load shedding works the way we think it might. We'll discover all sorts of weird and interesting dependencies that way. Or, if we're feeling particularly nasty, we'll generate a huge amount of load to one part of our system, wait for it to compensate, and then swing it to a completely different part of the system to see how it all reacts and how the things interplay with one another. Pejoratively we call that laser beaming or hot spotting. More formally, we call that randomized load swings.
All right, so those are examples of the kinds of things, kinds of ways we abuse our systems to discover their new and emerging properties before our users do. But here's the thing, if you accept the property that complex distributed systems gain emergent properties over time, and they do, that is not conjecture, and you accept the theory or the premise that finding these properties before our users do is better for us, I think so, but maybe we could argue it, then we want to do it to all of our complex large distributed systems, right? I mean, we apply these principles to software and hardware.
But here's the thing. Our software and hardware systems are not the most complex distributed or largest systems we deal with in our day to day lives. Our companies are. Companies are large distributed systems. And guess what? Most of the complexity in that large distributed system is not in the software or the hardware, it is in the humans. So maybe we should try to apply some of these principles of Chaos Engineering in a reasonable, principled way to our companies so that we can discover the properties of our businesses before our customers do. Because then you can really have bad days.
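A minimal Python sketch of that kind of fuzzing follows. Everything in it is an assumption made for illustration: `parse_username` is a hypothetical function under test, and the list of odd fragments is one arbitrary choice among many.

```python
import random

# Plausible-but-awkward fragments: a combining accent, a right-to-left
# override, a zero-width space, an emoji outside the Basic Multilingual
# Plane, a single quote, and a NUL character.
ODD_FRAGMENTS = ["\u0301", "\u202e", "\u200b", "\U0001F600", "'", "\x00"]


def fuzz_text(seed_text, rng, insertions=3):
    """Return `seed_text` with a few odd fragments inserted at random spots."""
    chars = list(seed_text)
    for _ in range(insertions):
        position = rng.randint(0, len(chars))  # inclusive on both ends
        chars.insert(position, rng.choice(ODD_FRAGMENTS))
    return "".join(chars)


def parse_username(raw):
    """Hypothetical function under test: accept only short alphanumerics."""
    if not raw.isalnum() or len(raw) > 32:
        raise ValueError("invalid username")
    return raw.lower()


def run_fuzz(target, seed_text, rounds=200, seed=0):
    """Feed fuzzed inputs to `target` and collect unexpected exceptions."""
    rng = random.Random(seed)
    surprises = []
    for _ in range(rounds):
        candidate = fuzz_text(seed_text, rng)
        try:
            target(candidate)
        except ValueError:
            pass  # A clean rejection is the expected, healthy outcome.
        except Exception as exc:  # Anything else is a finding worth logging.
            surprises.append((candidate, exc))
    return surprises
```

The design choice here is to treat a clean `ValueError` as success and any other exception as a discovery, since the point of the exercise is to learn how the system handles data it was not built for.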
I want to be very clear about a thing. Like a lot of people, I like analogies. I think they are useful frameworks to reason about a problem. This is not an analogy. Companies are large distributed systems, they're not like large distributed systems. This is an actual truth about large human systems. Actually, any human system with more than like 10 people is going to start to exhibit these properties. All right? This is not an analogy. This is an analogy. In our large distributed system called a company, our human beings look like semi-autonomous units of execution with inconsistent outputs and opaque systems internals. Essentially buggy, biological microservices. Yeah, everyone laughs at that line.
But here is a true statement. If you give any collection of humans in your company even a simple task, something very straightforward, and ask them to do it, some relatively predictable percentage of them will do it incorrectly every time. Don't believe me, let's just go look at the photographic evidence. By my count, I'd say 20-ish percent of you raised your right hand when I asked you to raise your left hand. And okay, a shocking number of you raised both hands. [inaudible 00:08:33] I don't really know what that's about, but all right.
Don't feel too bad about this, actually. This is a very important part of what it is to be human. We don't just do things by rote, we don't do these things consistently, we always, for at least a moment, interpose some judgment between when we decide to do a thing and when we actually do it. And that's where all of our innovation and creativity comes from, is people, as humans. This is a feature of people, not a bug. And this is not a new thought, by the way, this thought is very old in the world. Seneca, the great stoic philosopher. We all know what that means, right? No.
"Errare humanum est, sed perseverare diabolicum." It just means to err is human, but to persist in error knowingly is of the devil, is diabolical. Meaning like it's totally reasonable to make mistakes. It is completely awful once you know you are making a mistake to continue doing the same mistake ridden thing, please stop doing that. So, we've known this for ... this quote's on the order of 2000 years old. Cicero said something before even this, and it probably goes back much further in antiquity.
Okay, so the goal here is not to make humans more reliably execute the things, because they won't, that is contrary to what it means to be a human being. We have large collections of these mostly reliable but completely non-inspectable systems that all interact in a semi-autonomous way in our large distributed system called a company, we are going to get emergent properties. In fact, we do all the time in any company with more than a few people. And we don't want our partners or our customers to discover these properties before we do, that can actually be existential to our businesses.
So, how might we apply the principles of Chaos Engineering to our people systems? I am going to talk about four experiments you can run. You can run many more. I'm going to pick these four because I have actually run these four, sometimes with hilarious results, sometimes with terrifying results, always with interesting results. My argument to you today is that you should be doing things similar to this, things, activities that rhyme with these in your businesses, otherwise you are going to continuously get surprised. And the larger your organizations grow, as they do with success, the likelier it is you're going to be unpleasantly surprised, the likelier it is you will have bad days, and the likelier it is someday you will have bad luck.
So, exercise number one, which I like to think of as a game, The Wheel of Staycation. This is easy to do. You can do this tomorrow with your teams. Once a week, I usually do it on a Monday, we pick a random team member and they get a staycation. Now what that means is they don't go home, they stay at work but they set their out of office, they answer no work questions, they have no work conversations. They spend that day working on whatever project it is they would like to have time to work on, a whole concentrated day, they spend the whole concentrated day working on it. It's kind of nice. The reason they stay at work, by the way, is because we might actually need them, we may have to break glass in this experiment.
Like any good experiment, this experiment has a proctor. You designate a proctor, the proctor decides when it is you have to break glass, right? When the pain of not having this person, or having had this person suddenly disappear is too much. But just because you're not going to talk to them about work doesn't mean you like ... they're not pariahs, have lunch, hang out, like be humans with them, but do respect the rules.
Why do we do this? We want to find hidden information SPOFs in our company. Now you might be thinking to yourself, "Wait a minute, like my people take vacations, right? And then they're gone for more than a day, like I do this already." No. Vacations are planned outages. Responsible people, when they're getting ready to take vacation, do work to make sure things can run well when they leave. Irresponsible people should not be working for you, they shouldn't come back, right? They do transition documents or whatever. Tell people they're going to be on vacation, right? This is what happens if you suddenly lose someone for a day. You want to discover who are the single points of failure? What bits of tribal knowledge do they know that suddenly you needed and didn't have access to? Right?
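The mechanics of the Wheel of Staycation are simple enough to sketch in a few lines of Python. This is only an illustration under stated assumptions: `spin_wheel` and `StaycationLog` are hypothetical names, and recording each break-glass event with a reason is one possible way for a proctor to keep notes, not something the talk prescribes.

```python
import random
from dataclasses import dataclass, field


@dataclass
class StaycationLog:
    """One week's experiment record, kept by the proctor."""
    person: str
    break_glass_events: list = field(default_factory=list)

    def break_glass(self, reason):
        # The proctor records why the team could not wait for this person.
        self.break_glass_events.append(reason)

    @property
    def found_spof(self):
        # Any break-glass event suggests a single point of failure.
        return bool(self.break_glass_events)


def spin_wheel(team, rng=None):
    """Pick one team member uniformly at random for this week's staycation."""
    rng = rng or random.Random()
    return StaycationLog(person=rng.choice(team))
```

For example, `spin_wheel(["Ana", "Ben", "Chen"])` returns a log for whoever was picked, and calling `break_glass("only Ben knows the deploy key rotation")` on it records the kind of tribal knowledge the experiment is meant to surface.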
In companies where you have components that are on call, like for support, or SRE, or whatever, what happens if the super technical, awesome expert human is suddenly not there when the big terrible thing happens? Anytime you find an information SPOF like that, some bastion of tribal knowledge that's only in one person's head, you need to rectify it very quickly. Because some day that person could quit or whatever, and then you'll be in a lot of trouble.
Obviously if you need to break glass, then you know you have found a SPOF, and that is good and productive. The goal here, the team really should notice when a random human goes missing for a day, like the team should feel some impact, it should not be friction-free. If it is friction-free, that might be a signal that that person has successfully worked themselves out of a job and should go to another project or another team. Don't laugh, this happened to me. I was running a team at Google and we were running a game very much like this, and I noticed after a few times where I was randomly selected that when we were gathering at the end of a month to sort of collect the last four weeks worth of experiments, that the impact when I was gone was asymptotically zero. That was my signal as a manager that I had successfully worked myself out of a job. You know what? I picked a successor and I moved on to a different part of the business.
Okay, so that's easy to do. The Wheel of Staycation, very straightforward. We're going to move up the difficulty stack a little bit here. Tortoise Time. I like this one. This is my second favorite one. My most favorite one comes next. You're going to select 20% of your team, give or take, again at random, very important. So we have a proctor, and for a workweek, five continuous work days, preferably Monday through Friday but if you have to do Wednesday through Tuesday, whatever, that's fine. And for that entire week, those humans may not respond to any work thing in less than one hour. You get an email at 9:00 AM, you can't respond before 10:00 AM. We are introducing latency into the system. And again, the proctor decides if and when you break glass, okay? The proctor has a very important role, set of roles here.
Anyway, and you can play with the latency time, right? So why do we do this? Well, it's like in any distributed system you'll insert latency bits between layers to see how brittle they've become. Your company, my company, Google, my teams even have hidden layers that develop over time that I don't discover, hidden dependencies between humans that I don't know about. Right? And by introducing some latency, we can test the brittleness and the recoverability of these systems. What we want to know is how much pain did the person asking the question feel if they needed an hour before they could get the answer? An hour doesn't seem like very long, but you'd be surprised kind of at what that expectation looks like for people.
Do they like redirect to some other source to get an answer, etc.? We want to know how long can you stand to run this experiment before you have to break glass? Pro tip, the first time you run it, you'll be lucky to go two days into your five with 20% of your team introducing latency. That's how brittle some of your information paths will turn out to be in the company.
How quickly did people asking, the people sending the email, go to alternate sources of truth, i.e., intuition or making it up? That's a good thing to also discover. So again, we want to find the hidden layers of our business, and we want to find hidden latency dependencies that we did not know existed. Again, relatively straightforward, a little riskier than the Wheel of Staycation, but entirely doable.
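The Tortoise Time rules described above (a random 20% of the team, a one hour minimum delay on every reply) can be sketched in Python as follows. The function names are hypothetical, and rounding the 20% to the nearest whole person with a floor of one is an assumption; the talk only says "give or take".

```python
import random
from datetime import datetime, timedelta


def pick_tortoises(team, fraction=0.2, rng=None):
    """Pick roughly `fraction` of the team at random, and at least one person."""
    rng = rng or random.Random()
    count = max(1, round(len(team) * fraction))
    return rng.sample(team, count)


def earliest_reply(received_at, latency=timedelta(hours=1)):
    """Return the earliest time a tortoise may answer a message."""
    return received_at + latency


# An email received at 9:00 AM may not be answered before 10:00 AM.
received = datetime(2019, 9, 26, 9, 0)
print(earliest_reply(received))
```

Keeping `latency` as a parameter reflects the remark that you can play with the latency time from one run of the experiment to the next.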
This next one is my absolute favorite one, it is the funnest thing you can possibly do. It is a little more risky, so you have to be careful about how you design this experiment, but it's super awesome and people love it. It's a good team building thing. Ready? It's called Liar Liar. Yeah, you're going to enjoy this, I promise.
Once a month and only once a month because my gosh, this can spiral out of control, we pick a couple of people on the team and they are our designated liars. The way this works is the proctor would pick two people say me, and Tammy, [inaudible 00:17:34] in the audience someplace. And then the proctor would tell me, "Okay Dave, 20%, one out of five of your answers are going to be lies. And Tammy, one out of three of your answers today are going to be lies." And only the proctor and you know your percentage, right? And for that day you are going to give incorrect but plausible answers to questions you get.
"Dave, what is the budget for this project?" 4 billion ... Oh no, wait, I work at Google. $4 trillion. That is an implausible answer, even our CFO will know that that is not true. So you want to get plausible but definitely incorrect answers. You definitely want to keep a list of all the wrong answers you gave that day and tell people the next day, that's pretty important, please do that. Just saying. Otherwise another bad thing happens. I do recommend to people that when you do this for things like email, or away messages on Slack or whatever, that you do some kind of disclaimer. The one I happen to use is today I am the designated liar and I have been randomly selected to be buggy. If you ask me a question, some of my answers will be intentionally incorrect. Can you tell which ones? You do want to give people some idea that ...
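For readers who think in code, here is a small Python sketch of the Liar Liar bookkeeping. It is a hypothetical model, not a tool from the talk: `DesignatedLiar` and its methods are invented names, and it assumes the liar prepares a plausible wrong answer for each question.

```python
import random
from dataclasses import dataclass, field


@dataclass
class DesignatedLiar:
    name: str
    lie_rate: float  # Known only to this person and the proctor.
    rng: random.Random = field(default_factory=random.Random)
    lies_told: list = field(default_factory=list)

    def answer(self, question, truth, plausible_lie):
        """Return the truth, or the lie with probability `lie_rate`."""
        if self.rng.random() < self.lie_rate:
            # Record every lie so it can be corrected the next day.
            self.lies_told.append((question, plausible_lie, truth))
            return plausible_lie
        return truth

    def next_day_disclosure(self):
        """List every wrong answer given, with its correction."""
        return [
            f"Q: {question} | I said: {lie} | Correct: {truth}"
            for question, lie, truth in self.lies_told
        ]
```

The list of lies is the important part of the design: the experiment is only safe if every incorrect answer is written down at the moment it is given and disclosed the following day.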
All right, why do we do this? This is a fuzz testing exercise and it's super fun, by the way. And it's hard to do well, by the way. We want to know, can the recipients of the incorrect information, do they have the ability to discern the difference between correct and incorrect? I mean not are they smart, I mean is it plausible? Is there some other place they can go to check if this information is correct or not? Or again, are they using intuition to do that, which is a terrible way to do anything, we call that luck. If the answer is no, then we have found an information SPOF in our system, the critical dependency, which we should probably figure out.
I mean look at it this way, how many people are software engineers? Like you write code, put your hands up. Yeah, most people are, that's what I would figure. When we call APIs, right, we perform some kind of testing on those results, everyone really needs to be nodding their head yes, to make sure that they are likely to be correct before just passing them through in our systems. Right? This is what we're doing here too. And these decisions have a much bigger blast radius, a wrong piece of information.
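The API comparison above can be made concrete with a short Python sketch of checking a result before passing it along. The field names `amount_usd` and `source`, and the function `check_budget_answer`, are assumptions chosen to echo the budget example; a real check would depend on what the answer is used for.

```python
def check_budget_answer(answer, lower, upper):
    """Return a list of problems; an empty list means the answer passed."""
    if not isinstance(answer, dict):
        return ["answer is not a mapping"]
    problems = []
    amount = answer.get("amount_usd")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(amount, bool) or not isinstance(amount, (int, float)):
        problems.append("amount_usd is missing or not a number")
    elif not lower <= amount <= upper:
        problems.append(f"amount_usd {amount} is outside [{lower}, {upper}]")
    if not answer.get("source"):
        # Without a source there is nowhere to go to verify the answer.
        problems.append("no source to cross-check against")
    return problems
```

The second check is the one that maps most directly onto the people experiment: a plausible number with no independent source to verify it against is exactly the situation in which the recipient falls back on intuition.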
If we are finding that recipients are pretty consistently able to tell when you're giving an incorrect answer, you're doing it wrong, these answers are not plausible enough. Learn to lie a little better, please. Bonus, you pick up a new skill. But the principle we are testing here is Nullius in verba, which just means nothing in the words. Don't take anyone's word for it. It was the motto of the Royal Society since I think what 1660? Yeah. The team I'm on now, that's actually our team motto, like take nobody's word for it. And it's a good exercise, by the way, to get your company into. Like if I ask a question and I get an answer am I able to apply some filter to it that's not just my judgment? To know whether this is not just whether this is the correct answer, even if it just seems plausible? Ton of fun to do. You've got to be careful about how you design it, but super fun to do.
Okay, number four, here we go. This experiment breaks some rules. So if you've been doing Chaos Engineering, you know that a very good piece of advice, particularly for people who are just sort of starting this process, is don't design an experiment that could actually destroy production, like for realsies if it gets out of control, that's a bad thing to do. Except in this case, I'm going to encourage you to design an experiment that could actually destroy production, or in this case the company. So, no pressure.
Obviously this is the most difficult one to do. We call it War of the Worlds. People probably know the H.G. Wells book, War of the Worlds, Martians invade Earth, we can't beat them, and then they die from the common cold. That's a super short summary. Maybe you saw the bad Tom Cruise movie. Sorry, Tom. Some of you might even know that in 1938 Orson Welles and the Mercury Radio Theater did a Halloween evening presentation of the story of War of the Worlds. They presented it, very interesting presentation, like a straight up newscast. And they had a small disclaimer at the very front of the broadcast, but of course people were dial surfing the way we channel surf and a lot of people missed the disclaimer and they caused a mild panic in the United States that Martians were invading a suburb of New Jersey, because I mean obviously that's where the Martians would go.
It's a fascinating story, by the way, if you ever read an account that says it was all an accident and they didn't mean to scare people, that is completely a lie. I will point you to interviews with Welles near when he died when he said the opposite.
The point is we want to do kind of a War of the Worlds here. We want to realistically simulate the most existential events we can think of, plausibly existential events I should say, that can happen to our company. I do not mean Martian invasions, please do not simulate Martian invasions, it's probably a waste of time. Or meteor strikes. I mean things like massive security breach. Let's really drill a massive security breach in our company, all of our customer data is on the [inaudible 00:23:17]. Or regulatory failure. Let's simulate, if you happen to work at a public company, what happens if our 10K was wrong? If we just gave the SEC wrong information? What happens if we caught some executive embezzling? How do we simulate that? How do we react?
How about a major customer meltdown, because everyone's going to have this anyway. Simulate that your biggest customer calls you and says, "I'm done, I'm going to your biggest competitor, I'm out of here, mic drop, bye." Okay? The key thing here is you need an uncomfortably small number of humans to know that you are doing this exercise, right? The CEO, that seems like a pretty good idea. Maybe your head of PR or comms you'd like to know. Someone in legal also feels like a pretty good idea, these could have implications. And of course the proctor. And practically no one else. And you want to do this like at least once, I prefer twice a year.
How many of you know the expression we train like we fight because we will fight like we've trained? Anyone ever heard that? Okay, well now you have. If we don't get people in the habit of responding to unbelievably awful things, we don't know how they will respond to unbelievably awful things. And that is a property of our companies and it changes all the time because we hire new people, and our culture, etc.
Okay, so will people do the right things? I don't just mean follow the manual, I mean will they act in a way that is compatible with the culture we think is in our company, or to our ethical standards however we're defining that in our businesses. Do they panic? Pro tip, the more you do this, the less they will panic. Does it leak? I mean, I wish we didn't live in that world, but we do live in that world. Customers are not the only people who are enthusiastic about using Twitter to say things. Sometimes our employees are too. A, you're not running this experiment to find leakers, you're running it to find out if you really have the culture you say you have, right?
Like a lot of companies ... well, I'll be very open and honest because it's all been in the press, right? For many years at Google, our feeling was like we have this very open culture internally, we argue with each other, we argue with our CEO all the time because it all kind of stays in the family, like there's a certain safety there. And recently that has not been as true, and so the first few times that happened, that that stuff went to the Twitter, and New York Times, or wherever, it took us a minute to adjust to this idea that the environment we thought we had wasn't the environment we actually have, or not completely the environment we actually have. No value judgment there, I'm just saying. That smacked us in the face a little bit, right?
Again, you'd like to discover this sort of proactively if you can. So the goal here is to make sure that the company can react calmly, deliberately, ethically, with the best interests of our users, and our company, or our shareholders if you're public or whatever, in mind. Right? We're testing did we suddenly, or over time did we evolve a culture or a way of understanding that is different than what we thought we had as a company? Also, even people who are not on call should have the experience of having to respond to emergency events. That's a good skill to have, the more you drill and practice that skill, the less you are likely to dump cortisol and adrenaline when a bad thing happens, which means you are more likely to actually be able to use the logic part of your brain as opposed to the lizard part of your brain to make decisions. Okay? So a determined reaction prevents overreaction.
All right, so those are four examples of kinds of experiments you can run, and I have actually run at different times in different companies on different teams with interesting results, which I'm happy to talk to you about later.
We're getting close to break, all right, let me just do the last couple of slides here, you ready? First thing, buy in is not the same as all in. Start small, please. Pretty please. I mean, if that was good advice when we were going to do Chaos Engineering for our software and hardware systems, it's really, really good advice when we're doing it for our human systems or people systems. But then ramp up. So you can go and do at least the first two of these things tomorrow on your teams. You can do it just with yourself if you want to, right, no permission required. And you should, you'll discover really interesting things.
And then you can build up and gradually other people will see you doing it and they'll think it's kind of neat, and then they'll want to do it with you, which will lead us to our next thing. The more you can do these things cross functionally, the better your results will be. The major components of the business, legal, finance, marketing, engineering, etc. etc. are like big systems with APIs, right, and we often don't exercise those APIs a lot. How often do your engineers talk to the lawyer cats in your company? Like excluding the, "I have to talk to you because something terrible happens." Like routinely, how often do they talk to them? How often do they have relationships with them? The answer is probably never. Asymptotically never. You don't want these relationships just to be at your whatever, director, VP, senior leader human level. You want it to be at the line level.
Also, some of these things particularly like Liar Liar, which is really great, or War of the Worlds, are really good team building exercises. Like they're actually fun to do, and they get the different silos of your business in the habit of working together day to day. So that I don't have to ask, is this a thing a lawyer should know about and how do I talk to a lawyer? I can make an educated judgment based on actual experience having run these drills, yes it is, and here's who I go talk to and I know Sally, or I know Fred and we can go have this conversation, right? So it's a great way to build relationships across the company.
If you're at a small company, maybe you don't think you care, but if you're successful, you'll blink and suddenly your small company will be a big company. And big company, by the way, is like a company large enough that you can't comfortably remember everyone's name off the top of your head. So that's for most people on the order of a hundred. You don't have to get very big to get big all of a sudden.
Okay. Last thing, since I don't want to be the person who makes you late. You can absolutely do this. There's no magic here. You don't have to have any extra special skills. You just have to be constructively devious, let's say. Let's face it, everybody has a little chaos engineer in them. It's part of what makes it fun, embrace it. It's part of being human. Go be human in your companies and do good for them in the process. Right? So this is completely doable.
That is my contact information because we're not doing Q&A because it's a keynote. I will be here at the show for a few hours, or you can email me, or you can do that Twitter thing and tell everybody how completely wrong I was, that's fine too. Of course it's on Twitch, so I guess everyone knows already. It's all good. Let me know, let me know what you think and I'm happy to share stories, and best practices, or in my case, worst practices when they have happened.
All right. Thank you to everyone at Gremlin for inviting me to come today. Thank you Kolton, thank you Tammy and Lauren who reached out. Mostly though, thank you to all of you for being engaging and patient, you didn't throw anything at me, so that was an awesome audience and I look forward to talking to you again. Thank you very much.