September 26, 2019

Subbu Allamaraju: Forming Failure Hypotheses - Chaos Conf 2019

The following is a transcript of the talk given by Subbu Allamaraju, Vice President at Expedia, at Chaos Conf 2019.

Twitter: @sallamar


I'm really glad that this conference did not happen yesterday, because I was here in the afternoon to check my slides and this place was super hot. I thought my head would explode. I was going to get some duct tape. So today is much better, but backstage it's still really hot.

So today I want to take a slightly different angle. It's probably an extension of the previous talk, by the speaker from Google. This discipline called Chaos Engineering has been around for approximately 10 years. That is unlike other industries, like healthcare and patient safety, fire and rescue, aeronautics and space, industrial engineering and manufacturing, and so on. Those industries have had decades of experience dealing with failure, and over a long period of time they have practiced and built mental models of how to make systems safe: not just the people, but also the technology, everything. And as a consequence, when we hear about Chaos Engineering, oftentimes what we hear is why we should do it, all the reasons why it is important, and all the tools and technologies available to practice it. And then, by the way, here are some nice success stories. That's what we often hear about when we hear about this topic.

But reality is different, and most of you I have talked to in recent weeks know this: we often go down a bumpy road to get to the outcomes that we want. The road is not safe, and there are a lot of disappointments. There is a lot of excitement initially, but then over a period of time things seem to die down. Many chaos engineering programs die eventually, for reasons that I will get into today.

And I want to offer a hypothesis in this talk. My hypothesis is that unless your culture, your organization, is set up to learn from incidents, thank you, your Chaos Engineering program may not succeed. Companies that study incidents and learn from them figure out the what and why of their Chaos Engineering program. Those that do not go through the same journey may end up getting disillusioned.

And the third part is that once you build a culture of learning from incidents, you will likely find out the what and why and the value of Chaos Engineering techniques. That's the gist of my talk.

So my name is Subbu Allamaraju. I live in Seattle, that's why I can't take this heat here, and that's my Twitter handle over there. And I will publish my slides on my blog, hopefully by this time tomorrow, with the speaker notes. So you don't need to worry about taking screenshots. And you'll also find a number of prior articles I wrote on this very subject and the lessons I learned over a period of time.

So before I get into the talk, I just want to take a cursory look at this definition of Chaos Engineering. Chaos Engineering is the discipline of experimenting on a system in order to build confidence in its ability to withstand turbulent conditions in production. The reason I want to put this out is because a number of us in the community still believe that this is about randomly killing stuff, randomly taking systems down to see what happens. No, it is not. To me, this is really about a scientific method. This is like A/B testing. This is about forming a hypothesis, testing it to see what happens, and learning from it in safe conditions so that you're not impacting your customers in the process.

So let me use this visual to describe what I mean by this. Imagine your system. It sits between these two nice bumps in a stable condition, and beyond those two bumps is the danger zone, the unstable zone. And you introduce [inaudible 00:04:16] to that. You push the system away from its stable condition, under the assumption that your fault boundary, or the blast radius, is that vertical arrow going up. That's your assumption. So you push the system towards that, not all the way, and the system comes back to the stable zone. It's good. Or, if your assumption is wrong and the fault boundary is actually much closer, you will find that the system goes into the unstable zone. That's what chaos engineering is all about. You're conducting a test in safe conditions to see what happens.
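The picture described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not something shown in the talk: `run_experiment`, `toy_error_rate`, the step sizes and the 5% error-rate threshold are all hypothetical names and numbers. The idea is to push the system in small steps and stop at the first reading outside the stable zone.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class ExperimentResult:
    hypothesis_held: bool              # True if the system stayed in the stable zone
    failed_at: Optional[float] = None  # fault magnitude at which it left the zone


def run_experiment(
    measure: Callable[[float], float],
    magnitudes: Iterable[float],
    threshold: float,
) -> ExperimentResult:
    """Push the system in small steps and stop at the first unstable reading.

    measure(magnitude) is a hypothetical hook that injects a fault of the
    given size and returns a health metric (here: error rate, lower is better).
    """
    for magnitude in magnitudes:
        if measure(magnitude) > threshold:
            # The assumed fault boundary was wrong: halt before pushing further.
            return ExperimentResult(hypothesis_held=False, failed_at=magnitude)
    return ExperimentResult(hypothesis_held=True)


# Toy stand-in for a real system: error rate grows with the injected fault.
def toy_error_rate(magnitude: float) -> float:
    return 0.01 + 0.5 * magnitude


result = run_experiment(toy_error_rate, [0.05, 0.1, 0.2], threshold=0.05)
print(result)
```

With these toy numbers the run stops at magnitude 0.1, which corresponds to the case where the real fault boundary sits closer than assumed. In a real experiment the `measure` hook would wrap whatever fault injection and monitoring you already have.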

So as we go through the rest of the talk, I want you to keep four questions in mind. I will answer some of these directly and some of these indirectly. The first, what is the system? Is it just the code, the microservices and monoliths and databases and the network and the storage and so on and so forth? Or does it also include your people, processes, culture?

How do you form hypotheses? Do you just copy what someone else did? What Netflix did, what Amazon did or what someone else did? Or do you go on your own journey to find your hypotheses? How do you ensure system safety? Where does the system safety come from? Does it come from just the software, or does it also come from the people that are handling the software?

Lastly, why should anyone care about this work? How can you make sense of someone saying it is important to invest in this effort? It's not a waste of time. It is not an esoteric experiment that we are doing. There is some value in it. How do you convince someone else that this is important? So those are the four questions I want to consider for this discussion.

So let's take a fun journey. That's how my journey started. Your journey may be similar. Usually you start to embrace the newer and greener technology like your cloud migrations or some cloud native technologies like Kubernetes and whatnot, and you build your muscle, you start building a lot of microservices and all the fun new applications and maybe tearing down some monoliths in that process. And things seem to go well. You're getting agility and people are happy about what they're doing, and then suddenly you realize that you're not really resilient yet. You'll find some issues here and there. Some changes that you introduce in production are knocking down your system sometimes, or maybe there's a cloud outage or some network outage in your on-prem systems, and then you start to worry about these things. Because cloud is not robust, microservices are not robust, you still need to do something. You realize that reality, "Oh I got to do something more than this." That's when you discover this thing called Chaos Engineering. Aha. There is this thing that other pioneers have practiced. Let's try that out.

That's how the journey usually starts. Then you find the literature, you read about it, you read books and blog posts and listen to podcasts and come to conferences like this, and then you get down to work. Usually this is what happens after that.

Some people will say, "Yes [inaudible 00:07:29]. We are ready. Attack us. We are ready to get attacked." Some people will say that. But I'm sure there are people that say, "Don't touch our systems, because we are the most [inaudible 00:07:41] company. Don't block us." [inaudible 00:07:43] and then maybe others saying, "Not right now. We are busy with this other thing. Come back in two weeks, [inaudible 00:07:51] whatever. [inaudible 00:07:53]" So what can you do? Being a good citizen you will isolate your systems from your attacks [inaudible 00:08:13] and then you practice [inaudible 00:08:13] because the people that have said, "Attack us," they've already [inaudible 00:08:17] done the work. Maybe they were using some [inaudible 00:08:21], maybe they were using CloudFormation with auto scaling set up, and they have done some nicer things so they're ready to embrace this discipline. So you won't find much in that exercise. In fact, this is what happened to me at work. We didn't find any miraculous faults in the system.

And then of course you won't test things that you know are going to break. Right? Let's say you have a network infrastructure between your active and passive data centers, or your cloud and on-prem. You're not going to touch that network because you know if you touch it, stuff is going to break. So you're going to skip that. So you're not going to the danger zone. So you're limiting yourself. That's when you start to doubt yourself. Should I really be doing this work? Should I find something else to do? You start to doubt your interest and passion in this topic and then you get into the Valley of Misery. You doubt yourself and you don't know what to do. You will probably wind down the project at that time.

And so, how do you get out of that? I'm sure this happens to some of us here. We may not talk about it, but this happens. I believe there are techniques we can learn to get out of this Valley of Misery.

So let me offer you a different approach. Let's imagine there is this thing called a null hypothesis. Let's imagine that chaos engineering has nothing to do with a system's capability to withstand turbulent conditions. Let's just imagine that hypothesis, because it is important to have that hypothesis, because that's what I went through, sorry about that, last year. And it's also important because the pioneers who tell the success stories, you know who you are, that this thing miraculously saved the day, they went through a journey to get to that conclusion. But have we gone through the same journey? I'm sure most of us have not gone through the journey, and starting with a null hypothesis might help us to go through the same journey. So instead of asking how do we make the system withstand turbulent conditions, set that question aside and ask how the system is behaving today, as it is today. Don't bother about what we want to change. It's like going to the doctor. A doctor won't prescribe something without first conducting a lot of tests on you, so he or she understands what condition you are in and can do the best for you.

That brings us to "as designed" versus "as it is". As software developers and architects and SRE people, or maybe not SRE people, we spend a lot of time on the "as designed" state of the system. We document, we create diagrams, we create code. All those are our understanding of how the system should work, not how the system is actually working. Even the metrics we collect, the alerts we respond to, the logs we capture, they also reflect our bias on how the system is supposed to work. That's why we go through the cycles of alert [inaudible 00:11:32] over and over. We create alerts, turn off alerts. We create more alerts, we turn off alerts. We go through the cycle, because these things reflect our bias of how the system is supposed to work.

But that's not how the system is working. The system is working in a complex environment. Stuff fails, stuff happens. We don't always know why things break. We take time. We create war rooms for incidents. In fact, we bring the war analogy to incidents, because what happens in production is different from how we see the system.

So these are two different states of a system. We need to spend time on both sides. Most of us, unfortunately, spend most of our time on the left side, the "as designed" state, not on the right side, the "as it is" state. That needs to change.

So instead of figuring out how to improve resiliency, let's understand what happens in the real world. That's what I did last year, because I started doubting myself about this time, actually in summer last year, about whether this technique is the way to go forward. So I spent several months on this, in multiple phases. I went through a lot of incidents at work. [inaudible 00:12:52] analyze all 1500 incidents during this time. And I went down to see the correction of errors and postmortems when we had them. Some days I went through the ServiceNow tickets to see the notes that different people put in there. Sometimes I found conference links. I did whatever I could to learn from the incidents. It's a very laborious exercise, not automated. I was going one by one to see what patterns I was observing in the incidents, because I wanted to understand what is working and what is not working today before I prescribe something for the system. And I have those findings published on my blog. Feel free to take a look and challenge them if you have a different point of view.

But this is what I found. The number one cause of failures seems to be the changes we make. Makes sense. A lot of failures at our company, as well as at most companies on the internet, happen when there is a change in progress or one was recently made. And depending on the sample size you may find a certain percentage of incidents triggered by changes. In my case, about 50% of incidents in my latest analysis were triggered by a change. I'm not saying change is the root cause, but change surfaced that many incidents. It's a large number. Right?
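The kind of tally described here can be sketched as follows. It is a minimal illustration in Python, assuming each incident has already been read by hand and labelled with a trigger. The `Incident` record, the trigger labels and the four sample tickets are hypothetical and are not data from the talk.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Incident:
    ticket_id: str
    trigger: str  # label assigned by the reviewer, e.g. "change" or "dependency"


def trigger_shares(incidents: List[Incident]) -> Dict[str, float]:
    """Return the fraction of incidents attributed to each trigger label."""
    if not incidents:
        return {}
    counts = Counter(incident.trigger for incident in incidents)
    return {label: n / len(incidents) for label, n in counts.items()}


# Made-up sample, only to show the shape of the calculation.
sample = [
    Incident("INC-1", "change"),
    Incident("INC-2", "dependency"),
    Incident("INC-3", "change"),
    Incident("INC-4", "unknown"),
]
shares = trigger_shares(sample)
print(shares)
```

The counting is the easy part. The laborious part is the manual review that produces the labels, and a trigger label is not a root cause.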

And the second observation I made in this process ... again, this is based on my understanding of our company's journey as well as listening to what is going on in chat rooms during incidents. Like many of you here, our production systems are complex. I know we say we're adopting microservices, Kubernetes and all the good stuff, but our production environments are complicated. Something we always feel proud of, being current today. We have older systems, technical debt that we can't get rid of yet. It's complex, expensive to get rid of. We have systems that are retired and we are afraid of turning them off, because we don't know what breaks, and we have monoliths in the middle that are maybe getting released every other week, and then we have these fast-changing microservices in the same environment. So our production environments are messy. We don't understand enough of them. As a developer you might know what's in your neighborhood, but you don't know how different parts of the systems are connected. Kolton this morning showed his Death Star architectures. It's the same.

So that's my second observation. Because of this, second- and higher-order effects are hard to troubleshoot. We spend a lot of time when things go wrong.

The third observation I made from my study of incidents was that we don't understand where a failure stops. I've seen cases where a change gets put in production, something happens, there's an incident, and you go through all the changes asking what caused it. People say, "No, that's not my change. My change could not have impacted that thing over there." And then two hours later you realize holy crap, that's the thing that broke it. So we go through this journey in incidents, because we don't really understand the fault lines, our imaginary fault lines.

So these incidents, this study, tell me a few things. First of all, we have to, at work, improve release safety through progressive delivery. Basic technique, we all know this. That reminded me, "Oh crap, we got to do that before doing anything else." So we went down chasing release safety. And then of course ensure tighter fault domains so that our "as designed" state has nicer fault domains that survive failures. That takes architecture investments. Third, implement safety in the "as designed" state. Safety could be tuning your timeouts, all your resiliency techniques, being able to fail over, being able to shed traffic, all the good techniques. But you got to invest in this.

Those second and third points on this slide are what will help us prepare for this journey of Chaos Engineering. Then you can push your envelope. Then you can go beyond your trivial failures. It takes work. You've got to prepare for those things. That's what I realized.

But one thing I want to admit is that my observations are relevant. They're all important. They are very generic for most enterprises and yet they're not. Active learning from general incidents is much more important. That's the realization I have come to. Because the more we understand the "as it is" state, the as-it-is architecture, the monitoring, metrics and other stuff in production, as well as the people, processes and culture, the more all of that will tell us what we really should do. So I would say spend time learning from incidents before embarking on these journeys, because that's what I believe the pioneers in this space have gone through. They didn't come up with these ideas out of thin air. They went through these dialogues over a period of time to come to these conclusions. So let's embark on the same journey.

So how do you prioritize such work? This is hard stuff. I was in a meeting yesterday talking to fellow practitioners, and everybody's journey is similar. How do you make the case for this? Most do this as a grassroots movement, but they run into roadblocks. People don't listen to them. The adoption of these technologies remains low. Why is that? My belief is that, first, we have to pick the most critical areas. Of course, no-brainer. Second, we need to be able to articulate the value of this work. As engineers we believe that this is important. I'm sure your leadership believes that it is important, and yet when they have to choose between doing something for the product or the customer versus doing this kind of testing, someone has to make the trade-off. How do they make a trade-off like this? Unless they know the value, they can't make a good trade-off.

I'll give an example. A little over a year ago, we were in a large conference room at work and we were debating whether we should invest in multi-region deployments and failover between those regions for a critical part of our stack at Expedia Group. Less than half the room said, yes, we should do it. It's important, it's the right thing to do. It's the right architectural practice, no doubt about it. The majority said, "Oh no, this is going to cost us more because we are putting in a new deployment, a second deployment. It costs us more, no questions about that. And it's going to take more time. We need to spend six to eight months more to go through the journey and then test the failures and all that stuff."

And we were going through this debate for a very long time and then one of my colleagues did some math on a piece of paper. How much does this part of the stack make for the company? Rough number. He produced a number, some big number. That's the money we make in a year from this part of the work. How much does it translate to per minute? Okay, this many hundreds of thousands of dollars. Okay, how much downtime can we afford for this? Two hours? Three hours? One hour? 15 minutes? We went through the debate and the people said, "Yeah, 15 minutes sounds reasonable. You shouldn't fail beyond that." Everybody said that. 15 minutes. Okay. How do you make 15 minutes happen without a second deployment and the ability to fail over quickly? Debate settled.
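The back-of-the-envelope math is simple enough to write down. The sketch below is in Python and uses a placeholder annual figure chosen only because it divides evenly. It is not Expedia's number, and the function names are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def revenue_per_minute(annual_revenue: float) -> float:
    """Spread annual revenue evenly over the year (ignores peaks and seasonality)."""
    return annual_revenue / MINUTES_PER_YEAR


def downtime_cost(annual_revenue: float, minutes_down: float) -> float:
    """Rough revenue at risk for an outage of the given length."""
    return revenue_per_minute(annual_revenue) * minutes_down


# Placeholder figure, NOT a real number from the talk.
annual = 525_600_000.0

for minutes in (15, 60, 120, 180):
    print(f"{minutes:>3} min of downtime: ${downtime_cost(annual, minutes):,.0f}")
```

Dividing evenly across the year understates the cost of an outage at peak traffic, so treat the result as a floor for the conversation, not a forecast.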

So instead of pushing for the hygiene, by pushing for value I believe we can actually help prioritize some of this kind of work.

So my journey out of the Valley of Misery was like this. Number one, learn from incidents, so you go through the same journey that other pioneers went through. Number two, make value-based decisions. No anxiety, no frustration about chaos engineering not getting the attention it deserves at work. If you focus instead on the value, the debate will change and then you can actually figure out where it makes sense and how to test, what to test and when to test. Those become natural, because somebody has to make trade-offs.

But how do you learn from incidents? I don't know. I went through a journey to learn from incidents, but I believe we all have to go through similar journeys to an extent. Not just each of you, but your teams at work. The developers and testers and your product people, they have to walk through this journey to understand how systems are working when they're working, and how they're not working when they're not working.

But I can tell you how it feels when you start to learn from incidents. Number one, you have better mental models of how the system works when it does, and how it doesn't when it doesn't. You will get to understand the physics of your complex system, not just the microservices or the database that you're focusing on. You'll understand when people are failing, what kind of training people need, what processes are working, what processes are not working. You will discover that as you study incidents, because incidents tell you the "as it is" state, not the "as imagined" state.

Number two, you're not chasing symptoms. You don't have a backlog of 10 things you want to do for each incident, a backlog that keeps growing after every incident. Instead of that, you're actually looking systemically at what improvements you need to make as an organization so that you're getting the best value from your production systems. You're not chasing symptoms.

The third, you start to understand the role of people, processes and tools in success as well as failure. This is fundamental. Like the previous speaker Dave talked about, this is about us, not just about the code. It's about us, how we are dealing with our systems in production. The culture we bring in, the temper. The temper we bring into incidents will change once you start to learn from incidents.

The fourth, and the most important of all the points, is that you are able to articulate the value of hygiene investments. It doesn't have to be Chaos Engineering. It could be some other hygiene improvement that you want to make in your company and you're stuck. You don't know how to make the case for value. But I think once you start to think of the value, how your systems are bringing in value [inaudible 00:23:48], you may have a chance of changing the conversation like the example I gave you.

So what are the lessons learned from this? Number one, learn from incidents. They will tell you a story of the "as it is" state, not the "as designed" or "as imagined" state. We will change our assumptions as we learn from incidents. And number two, even more important, there's no number two. Spend as much time as you can on the right side of the slide, on learning from the "as it is" state. Today that percentage happens to be very, very low in most enterprises. So spend more time after the incident to really understand what is going on in production systems before embarking on the improvements that we have in mind. Because we are making bets, we are making hypotheses, and we have to ground those hypotheses in reality, and the reality comes from incidents.

So thank you. Thanks for having me for the last 20 plus minutes. If you have any questions, you can ask me afterwards. I'll be hanging out here today. As I said before, my slides and speaker notes will be up on that website and on Medium by about this time tomorrow. Thank you very much. Enjoy the conference.
