Lenny Sharpe and Brian Lee: Finding the Joy in Chaos Engineering - Chaos Conf 2019
The following is a transcript of the Chaos Conf 2019 talk given by Lenny Sharpe, Director of IT Resiliency Engineering at Target, and Brian Lee, Assistant Manager at Target, which you can enjoy in the embedded video above. Slides are available here.
Lenny Sharpe: So, hi. My name is Lenny Sharpe. I'm the director of Resiliency Engineering Enablement at Target. And as my title implies, my team's responsible for helping enable teams across Target to drive up resiliency through our disaster recovery and Chaos Engineering capabilities.
Prior to this role, I led teams at Target on business continuity, crisis management, and corporate security. So I bring a unique perspective to today, having spent the earlier part of my career as a non-IT professional. One thing has become true though: resiliency is definitely a passion of mine.
Brian Lee: I'm Brian Lee. I'm the lead engineer on Lenny's team. I started at Target four years ago on the cloud platform engineering team. We built out the underlying platform that target.com currently runs on. A year ago I switched over to Lenny's team to help bring Chaos Engineering to the greater Target.
Lenny Sharpe: One of these dogs is real. One of them is not. See if you can figure that out.
So everything we do at Target, it revolves around our guests and making sure that while our guests are in our stores looking at their mobile device or on .com, they're seeing a little bit of joy. And that's really what our focus is. That's our purpose. It's helping guests find the joy in everyday life. But with that, there's an expectation that whenever they want something, it's always going to be available. Downtime is unacceptable.
Whether you're going to the store or shopping online, we have a ton of options for you. This isn't a commercial, but really it's a reality. The more features we give our guests, the more complex our systems become.
As I mentioned, downtime isn't anything that they want to see. Any one little hiccup across this chain could cause a problem. Our guests won't be able to get what they want, when they want it, and how they want it. And that means we can't give them the joy that they expect.
Kolton was really kind to us earlier. He didn't include us as one of the companies that had any outages, but we're standing here today saying, "We do have outages." It's pretty evident. Any of you that tried to shop at Target that ever couldn't get an order online or in store, know what I'm talking about. But that's why we're here today. We realize that Chaos Engineering gives us a competitive advantage. It gives us an opportunity to potentially reduce shortage ... not shortages, but disruptions from occurring and prevent outages from occurring in the future.
So where we started, I almost ended. Our company was going through a major transformation a few years ago. And as that transformation was happening, the disaster recovery team, which was the team that I ran, was not getting the love that we wanted. As I looked at my team, we were getting burnt out. It wasn't exciting anymore.
Teams were building new technologies, and those technologies were more resilient, more redundant, more highly available. They weren't thinking about doing disaster recovery testing. And when we approached them about disaster recovery testing, most of the time they told us they didn't have time or we weren't important. It was then that we realized we needed to do something different. We needed to be innovative.
We started looking at different blog posts around the Simian Army, as I'm sure many of you have done the same. We started playing around with different chaos tools. The problem was we were a disaster recovery team. When we approached people to talk about chaos, they laughed. "You guys are worried about recovering, not breaking. What do you know about my systems?" So that's where it started.
There was no one telling us to do this. One of my mentors early in my career always said, "Don't wait for your ticket to the dance, just show up." And that was the gamble that I had to take. We were going to expand beyond disaster recovery and start to think about Chaos Engineering. We were going to be the resiliency engineering enablement team. We were going to find the tools. We were going to build the processes. We're going to get people enthusiastic around how they can find flaws or weaknesses in their systems before something happened.
The team around me had to upskill. Many of them weren't sure what the future was going to bring, but today they're seeing the fruits of all the work that we put in over the past few years.
Brian Lee: Target is big. The sheer size and scope of the organization, the technology, and the newness give us some unique challenges to address. We have over 1,800 stores in all 50 states, 39 distribution centers, hundreds of thousands of mobile devices, and thousands of applications. Each store and distribution center has its own Kubernetes cluster, as well as bare metal and VM infrastructure. We have a public cloud, as well as our own internal clouds. We have a lot to think about and plan when it comes to this environment.
The mixture of legacy and cutting-edge technologies presents unique and interesting challenges; a one-size-fits-all approach to Chaos Engineering wasn't going to work. We had an uphill battle ahead of us. At the time, Chaos Engineering was still a new term. People had heard of it, but they thought it was more of a load testing type of tool. Across Target, we did not have good steady state defined and observability was lacking. Leaders were skeptical about it, and they had the right to be. We were competing for resources at a time when the focus was rolling out new features, not trying to break things intentionally. We had to find teams out there that were willing to try this with us, like Lenny said.
In late 2017 we had our first game day using Gremlin. For 2018 our team had the goal of a game day a quarter. We needed to learn the processes that worked best to get teams excited and trying it at Target. In 2019, we focused on at least one game day a month (several times we had more than one), as well as offering a self-service option for the more mature teams.
We had to find the tools that would work across the organization. So we started vetting existing chaos tools, building when we needed to, and finding the ones that fit best depending on where the applications were running in Target. Every game day we uncovered new areas for improvement and continued to increase the scale and complexity of our experiments.
The team I joined at Target, we would do fire drills against our own services. The subject matter expert would go and break the service, hopefully in a way that would alert the team on it. We then validated our playbooks and our recovery procedures, and hopefully they worked as expected. If they didn't, that was our time to go and fix them.
This helped us a lot. It really meant that the team was ready for peak time. They were confident in their systems and ready to go. Peak season is a very big deal for Target. Our leadership saw this as a huge benefit. Now when we think about it, it's kind of obvious, right? But back then it was very, very new. They allowed us the time to run the fire drills as often as we wanted, and towards the end there we were doing it every other week.
Like I said earlier, we had our first game day with Gremlin. Using this tool allowed us to have a quick POC with one of our supply chain teams, doing our first true chaos experimentation. During these experiments, we uncovered a few issues with their application. On top of that, the team was responsible for several applications across the portfolio, so after finding this issue, they were able to fix it across the board before we had an incident that caused it to bubble up.
After the game day, the engineers were energized and excited to correct the problems that they uncovered. More importantly, they wanted to do it again.
Lenny Sharpe: So we got lucky with this game day. The manager of this team came from a company that used to do Chaos Engineering and run attacks on a more regular basis. He was an advocate from day one and was surprised we weren't already doing this more often. His leader was actually going around and manually breaking things and watching how his team would respond. Not something they liked because every day could be a fire drill. So going through a more organized game day exercise made the team feel more relaxed.
The other very big benefit out of this was that that manager, after we found defects, was able to put dollar amounts on those. Had they not corrected the defects in their production environments and told other teams with similar configurations to correct theirs, we could have had shortages along the way. So we had definite cost avoidance. This was one of the first times we were able to actually take a dollar amount, tie it to a chaos experiment, and show our leaders that there definitely was value here.
So we're experts now, right? We've run one game day. We found something. Not at all. But that didn't stop us from acting that way. Teams across Target started to hear about it. They wanted to know more. But we weren't going to be able to do it alone. So what we started to do is expand beyond Target.
We co-sponsored the Twin Cities chaos meetup, which started out very small but today has almost 400 members. The meetup is a great opportunity, as I'm sure many of you are participating in meetups in your own areas, but for us it was a way to interact and socialize with people from Target, just outside of Target's walls. We made a lot of connections at the meetup and were able to schedule game days, and our backlog kept getting bigger and bigger. We were also able to bring in some industry experts to these meetups, so they could validate to people at other companies within the Twin Cities that this was very valuable.
Inside of Target we have a lot of opportunities for conferences and learning. Learning is a big focus for us. So we started to bring in industry experts to help validate to our leadership that there's something here. We could see that our maturity model and our roadmap were making sense, and people started to listen.
Kolton came and spoke at one of our internal conferences, and his talk has gotten a lot of publicity; a lot of team members at Target say it's one of the best they've ever seen. So it made us feel really good that we could bring people in and they could help validate where we were at.
Finally, we started participating in hackathons internally, first to drive awareness around what we were doing with chaos experimentation and injecting faults and failures, but secondly to start hacking our own things. How could we build different tools? How could we look for different integrations? This gave my engineers a lot of learnings, and they were able to apply those to where we are today.
As I mentioned, I used to lead the disaster recovery team. And when we would look at demo days, an area of our company where we can go and share and inspire others, disaster recovery wasn't something we were out there sharing with anyone. We actually probably would never even be selected to submit on that topic. The minute we started doing Chaos Engineering, we were in.
At our first demo day booth we presented on the game day we did with the supply chain team. We had an actual user there who said, "This is valuable, we need to be doing more of it." Since then, we've participated in all demo days, both in Minneapolis and Bangalore, and we've been able to showcase how game days can be run, what types of experimentation you should be doing, and what types of tools you can be using. It's been very, very valuable for us.
Hands up if you like broccoli. A lot more people than I thought. Okay. Anyways, this will make a lot of sense in a moment. So you're probably wondering what this has to do with anything when it comes to Chaos Engineering. My boss and I were talking recently and we said, "Chaos Engineering is kind of like broccoli. You know broccoli's good for you, but maybe it doesn't taste great." With Chaos Engineering, we see the value. Everyone here, you understand the value. That's why you're here. But how much cheese do you need to put on the broccoli to get people to eat it? It was the same thing for Chaos Engineering. We were getting there. There was value, but people just weren't consuming our product. We needed to find more creative ways to do that, and so we did.
Prior to the demo day experience, I challenged my engineers: "How do we market this differently?" They said, "Let's come up with a logo. Let's come up with some stickers. Let's have the disaster dog. But more importantly, why don't we make it really easy to consume?" So we made a Chinese menu, and that menu had all the attacks. Teams could Slack us the type of attack they wanted to run. It was really, really intuitive and easy. For those teams that would grab us and leave, we also would give them a fortune cookie. And those fortune cookies had chaos themes inside. So later on when they'd open it, they'd be greeted with a funny surprise.
This is just one of the small things we started to do to drive greater awareness. We started to make it fun. Leadership started to see that there was something more to this than they saw initially, which was great for us.
Brian Lee: Because of all the marketing, we started getting a lot of visibility. Teams were excited to try it out. Each month we got more and more requests for game days for us to facilitate or for teams to conduct experiments on their own. Our team continued to grow. What was once a team of four is now a team of 10 and growing.
In order to support the scale of Target, we developed TRAP. It's a unified API and UI that allows teams to use both paid services and open-source tools behind the scenes. When we find a tool that fits well into Target's ecosystem, we can add it to our toolbox transparently to the end user. This gives teams a control mechanism to easily start and stop experiments, as well as being self-service.
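To make the pattern concrete, here is a minimal sketch of what a tool-agnostic experiment facade could look like. TRAP's actual interface isn't public, so every name below (the `ChaosBackend` adapter, `ChaosFacade`, `start_experiment`, and so on) is a hypothetical illustration of the idea Brian describes: one control surface for teams, with pluggable chaos tools behind it.

```python
# Hypothetical sketch only -- TRAP's real API is internal to Target.
# Illustrates the pattern: one unified interface, pluggable chaos tools behind it.
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Experiment:
    target_service: str    # service or host group to attack
    attack_type: str       # e.g. "cpu", "latency", "shutdown"
    duration_seconds: int


class ChaosBackend(ABC):
    """Adapter interface each tool (paid or open source) would implement."""

    @abstractmethod
    def start(self, experiment: Experiment) -> str:
        """Launch the experiment and return a run ID."""

    @abstractmethod
    def stop(self, run_id: str) -> None:
        """Halt a running experiment (the safety valve for game days)."""


class ChaosFacade:
    """The 'unified API': callers never talk to a specific tool directly."""

    def __init__(self, backends: dict[str, ChaosBackend]):
        self._backends = backends

    def start_experiment(self, tool: str, experiment: Experiment) -> str:
        return self._backends[tool].start(experiment)

    def stop_experiment(self, tool: str, run_id: str) -> None:
        self._backends[tool].stop(run_id)
```

Under this kind of design, a new tool can be added behind the facade without self-service teams changing how they request or halt an attack, which is the transparency to the end user that Brian mentions.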
We also integrated with other Target services to provide notifications to the teams that depend on those services when experiments are running. While those experiments are running, those teams can then see and validate how their service is running in the [inaudible 00:14:34] state from the other experiments. These joint learnings were invaluable to the teams.
Soon we'll automate the gathering of the metrics and the dashboards, pulling them in to create a report automatically for the users. This puts less burden on the engineers and will make them more inclined to use the tool more often.
We built this tool with one focus in mind: to find faults before they become incidents. So we started to feel the joy of Chaos Engineering through our game days and our self-service experimentation. Teams are finding defects and proving out their observability and processes. So for them, the experiments around Chaos Engineering are exciting and fun. For us, we started to realize the success of Chaos Engineering and the joy it brings to our own team, seeing the results and the interest only increase.
Lenny Sharpe: We've come a long way in the last few years, but we have a long, long way to go. There's some pigs flying that we want to help guide to the ground, and there's a few things we just want to share about our roadmap and where we're going to be taking this.
The first is really around having OKRs, objectives and key results at the very highest levels in the organization. Some teams are doing this now, and it's definitely helped with getting teams to adopt Chaos Engineering principles and drive accountability where maybe it might be lacking. It's no longer my team telling them they need to do it. It's their own leadership telling them they need to be doing it.
The next is really focused on more production-type testing and automating that testing, so we see it done more continuously. For us, that'll be integrating into Target's application platform, so that as teams are deploying, they can conduct chaos experimentation more regularly.
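As a rough illustration of what "chaos on every deploy" could look like, here is a hedged sketch of a post-deploy pipeline step that asks a chaos API to run a short, scoped experiment and fails the build if the service doesn't stay healthy. The endpoint URL, payload fields, and service names are assumptions for the example, not Target's actual platform integration.

```python
# Hypothetical post-deploy hook; the endpoint, payload, and response fields
# are illustrative placeholders, not Target's actual platform integration.
import time

import requests  # third-party HTTP client

CHAOS_API = "https://chaos.example.internal/api/v1/experiments"  # placeholder URL


def run_post_deploy_experiment(service: str, environment: str) -> bool:
    """Kick off a short, scoped experiment after a deploy and return
    whether the service stayed healthy while the fault was injected."""
    resp = requests.post(
        CHAOS_API,
        json={
            "service": service,
            "environment": environment,
            "attack_type": "latency",   # keep the blast radius small by default
            "duration_seconds": 120,
        },
        timeout=10,
    )
    resp.raise_for_status()
    run_id = resp.json()["run_id"]

    # Wait for the experiment to finish, then check the recorded health result.
    while True:
        status = requests.get(f"{CHAOS_API}/{run_id}", timeout=10).json()
        if status["state"] in ("completed", "halted"):
            return status.get("health_check_passed", False)
        time.sleep(15)


if __name__ == "__main__":
    # A failing experiment would fail the pipeline step that calls this script.
    ok = run_post_deploy_experiment("cart-service", "staging")
    raise SystemExit(0 if ok else 1)
```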
For us, the way that we measure reliability and resiliency is through something we call opsPI, ops product intelligence. There are some reliability factors and there are also some resiliency factors. Three that are part of resiliency and that my team drives are: are our runbooks up to date, are our recovery procedures up to date and have they been tested and validated, and does the architecture follow the resiliency principles that we've set forth?
We're going to be adding in metrics around our chaos experiments: not just the volume of experiments, but how many defects have we found, how many have been remediated, and how many other teams have been conducting experiments based off of those learnings.
In order to do that, we're going to have to increase our scope and scale. Our focus right now is on our guest facing applications and we want to continue to expand that and then start looking at other dependencies along the way.
And finally, we want to see our incident data mixed with our experiment data, leveraging AI and ML. How can we tell teams what to do, or have the systems tell them what they should be testing before they even think about it? So as real incidents are occurring, teams can react faster and spend more time on the types of attacks that matter versus the ones that don't.
So along the way, we realize failures are going to happen. After everything we've put into place, we can't prevent human errors from occurring or complex systems from failing along the way. Our goal is to try to minimize and hopefully eliminate some of those from occurring in the future.
But we've experienced a lot of success, as I'm sure many of you have if you've been rolling out your programs. So being a kid from Buffalo, I had to include a Buffalo Bills quote in here. This is from Marv Levy, one of the best coaches ever: along the way, you're going to have failures, but you have to enjoy a lot of those successes as well. So thank you very much for your time. Brian and I will be around later to answer some questions.
Thank you.
See our recap of the entire Chaos Conf 2019 event.