Paul Osman and Ana Medina: Embracing Chaos - Chaos Conf 2019
The following is a transcript of the talk given by Under Armour's Senior Manager of SRE, Paul Osman, and Gremlin's Chaos Engineer, Ana Medina, at Chaos Conf 2019, which you can enjoy in the embedded video above. Slides are available here.
Paul: Like Jacob said, my name is Paul Osman. I work for Under Armour, where I lead our SRE team.
Ana: My name is Ana Medina. I currently work as a chaos engineer at Gremlin.
Paul: Today we're going to be talking to you about embracing chaos. This talk is about tools and techniques and personal stories about introducing Chaos Engineering into your organization. So Ana, how did you get started with Chaos Engineering?
Ana: Well, in general, I have a pretty untraditional route to tech. I'm a self-taught coder who did a lot of web development and mobile applications. Then I actually randomly ended up as Uber's first site reliability engineering intern, but that wasn't enough chaos. I was also placed on the Chaos Engineering team, which meant that I was on call for the first time ever in my third week at Uber, and that meant production on call. It was something. I learned a lot about systems and SRE ops really, really quickly, and I had a great team to help me figure it all out.
But that first page that I got, it was 3:50 in the morning on a Thursday. My manager was out of the office. The secondary person on call was also not around. I did what I was told to do. I was thrown a PagerDuty account, told, "Hey, you're on call. You'll figure it out," and given a list of runbooks. So when I opened up that runbook as I got paged, well, that runbook had last been updated about 180 days before I got paged. That really made my confidence hella strong. This was basically me, chugging water, four in the morning, texting my girlfriends back in Florida, like, "So something's going on. What do I do?" They don't work in tech, so their answer was, "Can't you just make a copy of the file so if you delete them you have a backup?" And I was like, "All right, I've got to handle this. I'll talk to you later."
But I've been working in Chaos Engineering for about three years now. And one of the things I really like talking about is that we can help folks like me three years ago have a better on-call experience. A lot of that could have been done with Chaos Engineering: using it to update those runbooks, making sure they're up to date when someone gets paged, but also actually training people, giving them a more empathetic way to deal with their systems. I kind of wish we had done that then. But we're here to talk to you about some of the ways that we've learned to embrace chaos throughout the years.
Paul: I love that idea of just getting thrown the pager and being told, "All right, here's the deep end. Good luck." How many people have a story that's something like that? I see a lot of hands. How many people are familiar with this picture? You've seen this on the internet. It's a very popular meme, disaster girl. The funny thing about this picture that I didn't know until recently is that this is not an unplanned event. Obviously it's a perfect picture, but the context that's missing from just seeing the meme is that this is actually a controlled exercise run by the fire department in Chapel Hill, North Carolina.
And I loved learning this because we see this meme all the time at ops-related or SRE-related conferences and it's actually a game day. This is a fire department training new hires by having a controlled burn, where they have all sorts of things in place to limit the blast radius of the experiment, so that the first time you run into a building isn't when it's completely chaotic and there are no controls in place. And so one of the themes of Chaos Engineering is, wouldn't it be nice if we did that for software engineers as well?
Like I said, my name is Paul Osman. I currently work as the senior engineering manager for SRE at Under Armour. We have products like MyFitnessPal, MapMyFitness, and Endomondo, so a large number of consumers. Before Under Armour, I worked at places like PagerDuty, SoundCloud, 500px, and Mozilla, and I was really fortunate to be able to participate in some kind of Chaos Engineering at each of those companies. That's me with my own chaos engineer, my daughter, who hilariously kept interrupting our video sessions while we were doing dry runs of this talk, so she's my chaos monkey.
Ana: My name is Ana Medina. I currently work as a chaos engineer at Gremlin. I've been there for about a year and a half, and I get to work a lot with our product teams, engineering teams, and I sit on the advocacy team. But prior to joining Gremlin, I've gotten a chance to work at Uber, Google, Quicken Loans as well as the South Florida Educational Federal Credit Union. And for me, I like having some nails that actually make me feel good when I'm causing some chaos, so these are some of my favorite designs that had some fire emojis on them.
But we wanted to set the tone with some of the terms that we're going to be talking about today. First, we define Chaos Engineering as "injecting a precise and measured amount of harm into a system for the purpose of improving the system's resilience." And that also includes the human side of this resiliency. The other term that we want to define is game days. This is a group activity where you come together with some folks to practice Chaos Engineering, but we also like calling them incidents that didn't happen.
Paul: So one of the things I always find interesting about talking about Chaos Engineering, especially with folks who are new to the concept, it's not something that you do in isolation, right? It's not like you have incidents and you want to get better at them, so just do more Chaos Engineering. You have to evolve Chaos Engineering in tandem and in parallel with a bunch of other capabilities in your organization.
And so Chaos Engineering can actually help you develop a lot of these things, and a lot of these things help you run better chaos experiments. This is stuff like making sure that when you're doing chaos experiments, your observability tooling is telling you what it needs to be telling you, and your monitoring and alerting systems are working. You have an incident response process, and you're using these experiments as an opportunity to practice it. Hopefully you're doing some kind of blameless postmortems, and if not, Chaos Engineering can be a really great way to start practicing them.
Let's talk about how we got started at Under Armour. Raise your hand if you've used MyFitnessPal. Awesome. Now keep your hands up if you installed it on New Year's. All right. There's at least one or two people, so this is not an uncommon thing. New Year's is MyFitnessPal's Black Friday. Predictably, we see a huge uptick in new registrations and also returning users every year, and our job of course is to retain as many of those people as possible and help them meet their fitness goals year round. Of course we see a drop off, but we want to minimize that drop off as much as we can.
One great way of doing that is not by having any serious outages during that window, so we do a lot of preparation every year to make sure that we're ready for that New Year's surge. One year I was working as the engineering manager for the backend services team on MyFitnessPal and the team had put a lot of work into developing kill switches throughout the monolith. We still have a Ruby on Rails monolith that powers most of MyFitnessPal, so shout out to the previous talk talking about monoliths.
The team had instrumented all these kill switches so that if we started to have any problems with the activity feed or with our integrations, we could just switch those features off instead of having those errors cascade out to the rest of the system. And I wasn't satisfied just saying, "Hey, let's see if those work at 3:50 in the morning," so instead we all got together and did a controlled game day where we actually shut off parts of the app in production for a few minutes at a time, just to make sure that that was a clean operation and that everything happened the way we expected. We actually learned a lot and ended up doing some fixes before New Year's, so that we didn't learn at 3:50 in the morning.
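A kill switch like the ones the team instrumented can be as simple as a feature flag consulted at the top of a request handler, so operators can shed a non-critical feature and degrade gracefully instead of letting errors cascade. Here's a minimal sketch of the idea; all names are hypothetical, not Under Armour's actual code:

```python
# Minimal kill-switch sketch. A real deployment would back the flag
# store with Redis, Consul, or a config service so a flip applies
# fleet-wide; this in-memory version just shows the shape.

class KillSwitches:
    def __init__(self):
        self._disabled = set()

    def disable(self, feature: str):
        self._disabled.add(feature)

    def enable(self, feature: str):
        self._disabled.discard(feature)

    def is_enabled(self, feature: str) -> bool:
        return feature not in self._disabled

switches = KillSwitches()

def fetch_feed_items(user_id: int) -> list:
    # Stand-in for the real downstream call (database, feed service).
    return [f"item-for-{user_id}"]

def render_activity_feed(user_id: int) -> dict:
    # If the feed is killed, return an empty, explicitly degraded
    # response rather than letting downstream errors propagate.
    if not switches.is_enabled("activity_feed"):
        return {"items": [], "degraded": True}
    return {"items": fetch_feed_items(user_id), "degraded": False}
```

A game day for this is then exactly what the talk describes: flip `disable("activity_feed")` in production for a few minutes and verify the app degrades cleanly.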
So since then, we've been experimenting with different ways of implementing Chaos Engineering as part of our incident life cycle. John Allspaw, the former CTO of Etsy, has called incidents "unplanned investments in your company's survival," and I love that term because if that's true and incidents are unplanned investments in your company's survival, you can think of chaos experiments as planned exercises in your company's survival. Both of these have the same return on investment, which is learning, and that learning can take place in terms of updated documentation, in terms of updated runbooks, in terms of tribal knowledge that's developed, or it can take place as action items that actually make your system more resilient to types of failure.
So we used to look at incidents like this, and Ana and I have jokingly called this the Game of Life for incidents. An incident happens, you hopefully have a post-mortem, you develop some action items, you do some work, and then you do a chaos experiment to verify that those action items did what you intended for them to do. Just like the Game of Life, things aren't always that simple. So we have since evolved our process a little bit: now an incident happens, and we do some form of incident analysis. In that incident analysis, we figure out whether we have time to do a full post-mortem on this, to do everything that we want to do, or how in-depth we're going to make the post-mortem process, because we don't have time to cover everything.
When we do that, there are going to be a number of things that come out of it. One thing that we look for when we're interviewing participants in incidents, or when we're sitting together with a group of people who were involved and going over the timeline, is those aha moments or those surprise moments, those things that make engineers go, "Wait, that happened? That database failed over? I don't get that." And that's a really good thing for us to document. Then we look at action items. A lot of people are familiar with that, like, "Hey, this timeout was poorly configured." We can change that. And when there are action items, we schedule them with project managers and then track that work to completion.
If we look at this life cycle, there are two really great opportunities to insert chaos experiments. Whenever you have some learning or some surprises that come out of incidents, I think of it as deepening that rut of learning. If you can do a chaos experiment that recreates those conditions or similar conditions, so that you can see them in your observability tools and respond with a large group of people around you, that's going to deepen that rut of learning for you. Similarly, when you complete those action items, when you tweak that timeout or add that level of redundancy, you want to do a chaos experiment to give yourself confidence that it actually addressed the issues it was intended to.
Ana: Well, thanks, Paul. It was really cool to learn about how Under Armour does things, but I wanted to share the way that I saw things happen at Uber. I actually got a chance to join Uber after a Chaos Engineering tool had already been rolled out, but Uber actually started thinking about Chaos Engineering in 2015, after having one of the largest outages they've ever had. From there, they started looking at ways to be a little bit more resilient. So of course, they turned to Chaos Engineering. A lot of the conversation at that time was about what Chaos Monkey, Netflix's open source project, was doing, as well as Simian Army.
But they had a problem. They weren't running on the cloud. They were completely bare metal, so they couldn't use this open source tool, so they went ahead and actually built their own tooling. They called it uDestroy, and this was to support what by then was around a thousand microservices. It's been three years since then, and I think I recently heard in a talk that they're up to 4,000 microservices. So testing the resiliency of all these microservices really meant a lot. The way this worked is that an agent was deployed on every host that was running these services. This would talk to the workers that would actually schedule these experiments, and there was a UI and a CLI that allowed you to interact with the tool. For Uber, it was very much about making sure that the things that were critical to their business were constantly up and resilient, hoping to have five nines of reliability.
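The architecture Ana describes — an agent on every host, workers that schedule experiments, and a UI/CLI on top — can be sketched roughly as follows. This is a toy model of that shape under our own assumptions, not uDestroy's actual implementation:

```python
import random

class Agent:
    """Runs on every host; executes fault-injection commands locally."""

    def __init__(self, host: str):
        self.host = host
        self.active_faults = []

    def inject(self, fault: str):
        # A real agent would actually kill processes, add latency,
        # block traffic, etc. Here we just record the command.
        self.active_faults.append(fault)

class Worker:
    """Schedules an experiment across a fleet of agents."""

    def __init__(self, agents):
        self.agents = agents

    def run_experiment(self, fault: str, blast_radius: float, seed=None):
        # Limit the blast radius: only a fraction of hosts get the
        # fault, so the experiment stays precise and measured.
        rng = random.Random(seed)
        count = max(1, int(len(self.agents) * blast_radius))
        targets = rng.sample(self.agents, count)
        for agent in targets:
            agent.inject(fault)
        return [a.host for a in targets]

# Example: inject a shutdown fault into 10% of a 10-host fleet,
# like the "lose about 10% of your hosts" experiment Ana mentions.
agents = [Agent(f"host-{i}") for i in range(10)]
worker = Worker(agents)
hit = worker.run_experiment("shutdown", blast_radius=0.1, seed=42)
```

The UI and CLI Ana mentions would sit on top of `Worker`, adding start times and repeat schedules for each experiment.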
These are screenshots of what the UI actually looked like in 2016. We actually tested things like databases being reachable from both bare-metal data centers, out in DC and out in California. We tested what actually happens if you lose about 10% of your hosts or workers. We even did some Chaos Engineering on the Chaos Engineering tool itself, blocking outbound traffic from the tool folks were using. For scheduling, you could set a start time, have experiments repeat, and a few other things. But there was a lot more that went on behind this to make sure that folks actually adopted Chaos Engineering.
But first, I wanted to talk a little bit about the fact that at that point, they didn't have the expansion of products they now do, including Uber Eats. So I can't necessarily talk about what that journey looked like then, because we didn't have those graphics that actually told us which of the 4,200 microservices we needed to make sure that someone gets a ride, a driver gets connected to them, and we do all the payment and location and mapping and receipts and a lot of other stuff.
But I was able to find this photo of one of the folks who has been running production engineering at Uber for a few years, Donald Sangre, where he actually talks a little bit about the way that they do prod, and a lot of that is very much about those core services. You have things that run Uber Eats, you have things for the money side, the maps, as well as marketplace product engineering. So a lot of work goes into getting all these microservices to work together.
As for the way they actually implemented and embraced chaos, there were a lot of things that went on. Uber did a whole week of onboarding engineers into all of their teams, and as part of that, every two weeks they would run a one-hour class on Chaos Engineering, where one of the engineers or one of the folks on the SRE team would come and talk about why we do Chaos Engineering, why it was important, how we got started, and actually how to add your service and perform Chaos Engineering. This later expanded to office hours that folks could chime in on, as well as a dedicated Slack channel for support.
But we didn't stop just there. We did a lot of other things. There would be a lot of drills put into place. The Black Fridays for Uber were Halloween and New Year's, two days when folks like being out and don't want to be driving or responsible for the cars, so they do a lot of ride sharing. In order to make sure that capacity planning was in place, and that folks knew how to fail over from one data center to the other in case something went wrong, they did some Chaos Engineering, but they also used some other internal tools, such as a load testing tool called Hailstorm.
And as I was leaving Uber, one of my friends was actually working on another tool by the name of Gatekeeper. The idea was that this tool was basically going to be the gatekeeper for anyone who wanted to deploy their service to production. It acted as a checklist, and one of those checklist items was: onboard your service to uDestroy and run an experiment or schedule one.
Now, I would like to talk a little bit more about the way that we're doing Chaos Engineering at my current place of employment, and that's Gremlin. As you've heard, we're a resiliency company and we have a Chaos Engineering platform, and we just launched a new product today called Scenarios. We do various forms of Chaos Engineering. We do feature testing, where we all come together in a very game-day-specific format with some scenarios that are rolling out, but also certain use cases that every single person is going to run through just to test that user experience.
We also have set game days that our engineering team, as well as our principal [inaudible 00:16:15], has been really involved in running, and I'll be sharing a little bit more about our game days soon. We also have scheduled attacks that run on our staging and our production, just to make sure that we're being resilient as we continue building out stuff. And another fun thing that we get to do with Chaos Engineering, on the team that I sit on, is that we try all these open source tools, or just AWS services and things on other cloud providers, and we test the resiliency of them. It's really nice to be able to say, "Hey, I actually know what the pain points of these tools and databases are, so let me make sure that they're resilient before I actually bring them into my company."
But when we actually come together to do game days, there's a lot of prep work that goes into this. One of the biggest things is that we have a workbook that gets put in place, and with that come some roles that get defined first. We have the role of the chaos general. This person basically owns the experiments that are going to be run that game day. Then we have the role of the chaos commander. This person is the one in charge of implementing these Chaos Engineering experiments using the Gremlin tool. And we have the two other roles of chaos scribe and chaos observer, which basically get scaled depending on how many folks we have joining us in the game day. The chaos scribe is observing and writing down all the notes as the game day happens, while the chaos observer is the one testing the user experience and looking at monitoring and alerting dashboards.
And one of the cool things with this is that we actually invite the entire company. We have folks in roles that are not necessarily engineering-specific who are actually helping us be chaos observers too, which is a really great way to bring your entire company into understanding that resiliency is a core part of your business. A week before game day, we send out a calendar invite to the entire company. We have a dedicated Slack room where all the chat about this game day is going to happen. We have a Zoom call because we're a remote-friendly company, and then we also share that game day workbook so folks actually know how we're going to run through it.
And when game day happens, that's one hour that is blocked off in people's calendars, and the plan is we hope to run through three experiments. Then after the game day wraps up, we make sure to share those results, but also actually schedule some action items out of it. And as we do all that, we do this one thing that we like calling an executive summary, where we take those wins from the game day and actually share them with our founders and our executive leadership team.
And we run game days on staging and in production. This is a little bit of what we learned in a staging game day that we ran: we saw a lot of stuff go really well, but we also saw that our monitoring took a little bit of time to actually detect a lot of these things, and there was some stuff that we wanted to go ahead and automate. So out of this game day, we learned a lot of things and had some action items come out of it. Those included updating a lot of the staging dashboards, and because we were making improvements on the staging dashboards, those changes also improved our production game days. So we made our production environment more resilient by first testing on staging, which is really awesome. The entire writeup of this game day can be found at Gremlin.com/Community, along with other game days that we've been running.
Paul: That's really awesome to hear about. There are a few key takeaways that I think we can unpack from this as we close the talk. Incidents are unplanned investments. Chaos experiments can be planned investments, so maximize your return on those investments. Have a feedback loop in place. Make sure that you're documenting this stuff somewhere, and turn incidents and chaos into stories in your organization so that you can actually entrench that learning.
Ana: The last of our key takeaways is around onboarding and identifying those themes. Look at the key things that your business provides. What are those things that are really valuable, that your company must have up and running all the time? Then you want to focus on using this as a learning opportunity: helping folks learn how to be on call, learn how to be incident commanders, and share the results. In a way, you also share that tribal knowledge from the folks who have been in your company for a few years and know those outages that happened five years ago. Game days actually let you come together and make sure that the new grads you're getting onto your team, those interns, are still learning from the fires you had to fight a few years ago.
Paul: Thank you very much.
See our recap of the entire Chaos Conf 2019 event.