November 9, 2018

Vilas Veeraraghaven: "Practicing Chaos Engineering at Walmart" - Chaos Conf 2018

The title of my talk is Practicing Chaos Engineering. That goes against the grain of the work we really do, because the goal really is resiliency. I'm gonna change that title and I'll explain why. The title is gonna change from Practicing Chaos Engineering to Practicing Resilience Engineering. The key thing to understand here is we are not trying to practice inserting chaos in a system. We are trying to practice how resilient we can be when there is chaos in a system. Taking this, we'll make sure that we carry this meaning through the rest of the slides.

What are my goals at Walmart? I joined Walmart about 10 months or so ago. I used to work at Netflix before, so the ideas that I had at Netflix I'm carrying on to Walmart. The first point that resonates with me and with Walmart as a company is the customer always comes first, which means the customer experience is paramount. Anything that impacts the customer experience and causes revenue loss is the first thing we should be fixing.

Teams own resiliency, which means each application team owns the individual ... They are essentially responsible for ensuring that the quality of their product is absolutely the best. We can obviously federate a bunch of rules and we can say "Here is how you monitor your system. Here is how you run tests," but they own resiliency. Teams should be encouraged to fail fast, which means if you make a mistake, you learn about it very quickly. They need to fail often, which means you run a lot of gamedays to figure out exactly where your vulnerabilities are. Once you do that, my goals are at least fulfilled.

This is what my team does at Walmart. The role of our team is to make sure that we centralize the best practices, we provide all the tools that the teams need, we provide all the techniques that they need in order to run gamedays and run tests, and we enforce and facilitate gamedays. By enforce, we mean we ensure that this is applied to every team's natural release policy, and that there are vulnerability checks and vulnerability reduction attached to each release. We also make sure that we create tools that are part of every stage of the CD pipeline. We don't want someone to just basically say "I just tested in prod and I'm never gonna test it in dev." We want it to be tested all the way from dev to prod. We also monitor acceptable levels of resiliency. The reason we do that is because we want to call out risks to the greater business, and we want to say "Here is where our weakest links are and here is how we will get affected on key business days," for example Thanksgiving, and Black Friday right after.

I'm gonna skip this slide because every speaker has spoken about this. These are the things that the apps should be resilient to. The most important thing from our perspective, in addition to application dependencies and infrastructure issues, is how you are deployed, which is a question that a lot of people skip over. Deployment itself comes with certain kinds of constraints that determine whether you're resilient or not. We wanted to make sure that our app teams have all the tools they need. However, what is not clear is what the process will be in order to get from zero to completely resilient. In most cases, people want to get into chaos engineering immediately. The first couple of weeks some teams came to me and said "Chaos engineering. Let's just go break something." I was like "Fine. Do you have a DR playbook? A disaster recovery playbook?", and they were like "What is that?" I said "Go away. Do your homework."

We have to stop that. In order to reduce the number of teams that just come and say "I want to switch off the data center plug and cause some issues in the edge," we need to have some prerequisites. In order to get through this entire process, we established a series of levels. Instead of saying you are resilient today and not tomorrow, we said let's go through some steps, let's go through level one through five. As the levels change from the color red to the color green, your support costs and your revenue loss get lower as your resilience gets higher. By the time you reach green, which is all the way at level five, you are causing a little over 100 or so dollars of support costs, whereas if you're at one, you're probably causing hundreds of thousands of dollars in support costs.

This is how we motivated teams. It has a bit of positive reinforcement along with telling them "Here is why this is important." The senior management team obviously looks at the left-hand side, and the teams who are actually doing this look at the right-hand side to make sure that they can get from one level to another. The prerequisites for us were: do you have a DR failover playbook, how would you manage all the traffic that is coming into multiple data centers, can you just exist in one and support everything, what are your critical dependencies. Anything that you cannot survive without, those are your critical dependencies. If you cannot function without it, if you cannot provide what the customer needs, then that is your critical dependency and you need to design a good fallback against it.

You want to have a playbook for what happens when those critical dependencies fail or do not give you what you need. We need to define non-critical dependencies too. You may have a database that you have as your L2 cache that you're going to, but what is the amount of time after which that also becomes critical? Those are things that need to be well-defined so that all of your stakeholders, all of the people who use your products know exactly what to expect.
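To make the distinction concrete, here is a minimal Python sketch (not code from the talk) of a dependency record in which a non-critical dependency is treated as critical once it has been down longer than an agreed grace period. The names `Dependency`, `grace_seconds`, `is_critical_now`, and the 300-second value are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Dependency:
    name: str
    critical: bool
    # Seconds a non-critical dependency may be down before it is treated
    # as critical. None means it never escalates. (Hypothetical field.)
    grace_seconds: Optional[float] = None


def is_critical_now(dep: Dependency, outage_seconds: float) -> bool:
    """Return True if the dependency must be treated as critical right now."""
    if dep.critical:
        return True
    if dep.grace_seconds is None:
        return False
    return outage_seconds > dep.grace_seconds


# Illustrative entry: an L2 cache that is tolerable to lose for five minutes.
l2_cache = Dependency(name="l2-cache-db", critical=False, grace_seconds=300.0)
```

Writing the threshold down next to the dependency is one way to give stakeholders the "what to expect" the playbook is meant to provide.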

The first thing after completing all of these prerequisites is we wanted everyone to get a check up. The team wrote a tool called the Resiliency Doctor. The Resiliency Doctor is literally a debugging tool for the entire application deployment. No matter where you are deployed, it gives you a one-page report saying here is your issue; if you're highly available, here is what your vulnerabilities are. If you're active-passive, then where are your rules to make sure that the transition is smooth? How do you ensure that it's low data loss? This became the first step for every resiliency exercise that we did at Walmart.
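The talk does not show how the Resiliency Doctor works internally. Purely as an illustration of the kind of check described, the Python sketch below inspects a deployment description and returns a list of findings. The `Deployment` fields, the `check_deployment` function, and the wording of the findings are assumptions, not the actual tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Deployment:
    app: str
    mode: str  # "active-active" or "active-passive" (hypothetical values)
    data_centers: List[str]
    has_failover_rules: bool


def check_deployment(dep: Deployment) -> List[str]:
    """Return human-readable findings; an empty list means nothing was flagged."""
    findings: List[str] = []
    if len(dep.data_centers) < 2:
        findings.append("single data center: no failover target")
    if dep.mode == "active-passive" and not dep.has_failover_rules:
        findings.append("active-passive without failover rules")
    return findings


# Illustrative input: an active-passive app with no transition rules defined.
report = check_deployment(
    Deployment(
        app="cart",
        mode="active-passive",
        data_centers=["dc-a", "dc-b"],
        has_failover_rules=False,
    )
)
```

A real report would cover far more than these two checks, such as data loss during the transition.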

Level one. Once you have all of your prerequisites completed, level one was given as a first step for every team to even start educating themselves about what resiliency is. All of the prerequisites are stored in a well-known place. We have agreement on those playbooks. It's not like I write something and then someone else doesn't even know how to run it. We have to make sure that it's in a language that makes sense to everyone. We end level one by making sure that we can do a failover exercise manually that verifies that the playbook actually works.

It's again additive. All of these levels are additive. At level two, the only thing that changes is making sure that you can do failure injection tests for all of your application dependencies. Level three is where automation starts kicking in. We are pushing teams to start using tools like Gremlin, and we have internal tools which they can use to basically push either their infrastructure or their applications to fail and then see what the response looks like, and make sure that they have good playbooks to fix that and reduce the revenue loss over time.
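A failure injection test for an application dependency can be as small as swapping the dependency for one that always fails and checking that the application still responds. The Python sketch below is an illustration under assumed names (`build_home_page`, `failing_dependency`); it does not use Gremlin or any Walmart tool.

```python
from typing import Callable, Dict, List


def build_home_page(get_recommendations: Callable[[], List[str]]) -> Dict[str, object]:
    """Hypothetical page builder that degrades when recommendations are down."""
    try:
        items = get_recommendations()
        degraded = False
    except Exception:
        # The dependency failed: serve the page without recommendations.
        items = []
        degraded = True
    return {"status": 200, "recommendations": items, "degraded": degraded}


def failing_dependency() -> List[str]:
    # The injected failure: stands in for the real client during the test.
    raise TimeoutError("injected failure")


def test_survives_recommendation_outage() -> None:
    page = build_home_page(failing_dependency)
    assert page["status"] == 200
    assert page["degraded"] is True
    assert page["recommendations"] == []


test_survives_recommendation_outage()
```

Passing the dependency in as a parameter keeps the test independent of any particular injection tool; the same assertion could be run while a tool injects the failure at the network or host level.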

Level four increases the amount of automation, and by the time you are at level five, you are completely automated and literally the only support that someone in SRE has to give you is a few hours of engineering time to ensure that the right buttons are clicked. We have a long way to go. Obviously these five levels are not something that we can just jump and get to, but the reason for having these levels is to ensure that teams know that there is a step ladder to success. They don't need to jump from one level to another. It also means it gives us a greater amount of time to support these teams. If there are teams at level one, we know that we can form a certain kind of support group for them, a community for them internally. Walmart has thousands of teams.
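Because the levels are additive, a team's level can be read as the highest level for which every requirement up to and including it is met. The Python sketch below illustrates that rule only. The requirement names are paraphrased from the talk, and the entries for levels four and five are placeholders, since the talk describes them only as "more automation" and "completely automated".

```python
from typing import List, Set, Tuple

# Requirement names are illustrative paraphrases, not an official checklist.
LEVEL_REQUIREMENTS: List[Tuple[int, Set[str]]] = [
    (1, {"playbooks_agreed", "manual_failover_exercise"}),
    (2, {"failure_injection_for_app_dependencies"}),
    (3, {"automated_failure_injection"}),
    (4, {"more_automation"}),   # placeholder
    (5, {"fully_automated"}),   # placeholder
]


def resiliency_level(completed: Set[str]) -> int:
    """Return the highest level reached; 0 means level one is not yet met."""
    level = 0
    for number, required in LEVEL_REQUIREMENTS:
        if not required <= completed:
            break  # additive: a gap at this level blocks all higher levels
        level = number
    return level
```

For example, a team that has automated failure injection but never ran a manual failover exercise still scores 0 under this rule, which matches the idea of climbing the ladder one step at a time.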

By creating that community internally, a community of chaos practitioners, we are able to move teams together from level one to level two, to level three and so on. That's really the key thing that we got out of this process. What have we seen so far? The net effect is that we have more than 50 teams. This is in the last five or six months since the entire team joined. About 50 teams have already passed level two and are reaching level three. There is a consolidated effort, or I should say a consolidated educational effort, where we have actually found chaos practitioners internally at Walmart who have been able to lead the effort by saying "Okay. I see something new. I want to talk to you about this," and then gathering all these people, doing internal meetups, talking to them and understanding exactly where we want to be.

There have been chaos champions internally who have been able to take not just their team but also a multitude of teams in their pillar together on the resiliency journey. That has been a big win. We have suffered some outages. Some weeks back, there was a massive storm in Texas and there was a huge outage in the Dallas data centers. Microsoft suffered this huge outage, and the same thing happened to us too. However, there were lots of teams who had already begun using the failover playbooks that they had developed over time. In a way they were prepared. Of course, not everyone was as prepared as them, but that helped reduce the kind of disaster that it could've been.

I think the biggest takeaway for me out of this is seeing that teams themselves find that they are empowered. There are no more silos. Silos exist in most teams that have been doing this technology work for years, and especially in a place like Walmart where technology has been growing for the last 30 years. There have been devs, there have been QAs, there have been performance engineers. This was the first time I could see all of these folks come together as a team saying "No, this is all us." There is a sense of ownership, a sense of passion when it comes to executing resiliency exercises and making sure that Walmart doesn't suffer any kind of revenue loss. It brought people together, and I think that was really awesome to see.

I think overall, because of all of the influence that Netflix had on me in terms of the freedom and responsibility culture, that's something that I have seen grow at Walmart. I'm hoping that all of the efforts that we made push them in that direction as well. There is increasingly a culture of accountability, which means if someone is making a change, it behooves them to also note down exactly what it impacts, what the revenue loss would be if something goes down, and what the playbook should be if a newly introduced dependency goes down. It's coming naturally to people, and I think that's a big win in my book. Those are the results so far.

Additionally, what we are trying to do in the future is automate and create tools, and make sure that we can either open source them or work with the open source community on figuring out the best possible open source model for those tools. That's gonna be the next set of steps that we're gonna start working on. We have a pretty long way to go. We have started slow. There are lots of teams spanning the entire globe, and resiliency at Walmart doesn't stop at the application layer. It goes down all the way to the infrastructure and even out-of-band failures. We've had issues where literally an engineer drives to the data center and fixes something. That is new to me, but I'm educating myself as I go as well.

We have a pretty long way to go, but we are on the right track. There is definitely a lot of help from the chaos community, which is growing every day, and as people are learning new ideas, the sense of partnership on different kinds of projects is increasing, and I'm hoping that this conference becomes that kind of birthplace for all these new ideas, all these new collaborations ... I think I'm going back. Sorry. That's all I had. Thank you.
