September 26, 2019

Robert Ross & Tammy Butow: Incident Repro & Playbook Validation with Chaos Engineering - Chaos Conf

The following is a transcript of FireHydrant CEO Robert Ross and Gremlin Principal SRE Tammy Butow's talk at Chaos Conf 2019, which you can enjoy in the embedded video above. Slides are available here.

Twitter: @bobbytables and @tammybutow

Tammy Butow: Hi, everyone. We're really excited to be here today to share with you all Incident Reproduction and Playbook Validation with Chaos Engineering.

Bobby Tables: Learning From History (Getting an A by Going Back in Time).

Tammy Butow: Who are we? I'm Tammy Butow. I'm a principal SRE at Gremlin. I get to do a lot of cool stuff like running GameDays, which is actually using Gremlin on Gremlin with our entire engineering team, which Anna spoke about a little bit earlier. That's super cool. I get to do a lot of interesting stuff. Before that, I worked at Dropbox as well as an SRE manager on databases and block storage, and I worked at DigitalOcean before that with Bobby.

Bobby Tables: My name is Bobby. People like to call me Bobby Tables, if you're familiar with the xkcd comic. My name is also Robert Ross, so if you want to call me Bob Ross, both are acceptable. I am a CEO at FireHydrant, an incident response tool, FireHydrant.io. Before that, I worked at Namely. I was an on-call engineer, causing chaos accidentally a lot of the time. Before that, I worked at DigitalOcean where I had the pleasure of working with Tammy.

Tammy Butow: Who is better at going back in time and learning than these two? Shout out their names if you know who they are.

Audience: [crosstalk 00:01:22]

Tammy Butow: Yeah, that's awesome, everyone. If you don't know who they are, this is Bill and Ted, from Bill and Ted's Excellent Adventure. They're two [inaudible 00:01:30] high school dropouts who go back in time to learn from various experts in the past to be able to bring that information to the future and pass an exam.

Bobby Tables: Bob And Tam's Chaotic Adventure.

Tammy Butow: All right. Welcome to Feb 28, 2017, where our adventure begins.

Bobby Tables: T'was a peaceful morning, similar to today, somewhere near Seattle, probably. Some engineer somewhere was typing in this command: s3-what-could-happen.sh some-arg 1000. Dramatization added for effect. Then the Internet started experiencing some problems and some very curious things afoot at the Circle K. Trello, Slack, Quora, to name a few, started experiencing issues. Media outlets started picking it up, and this started happening.

Tammy Butow: While companies around the world were updating their status pages, this tweet was posted by AWS. The way that the AWS status page was actually built meant that it had a dependency on S3, specifically in US-EAST-1, which was having an issue, so the status page could not be updated. "The dashboard not changing color is related to the S3 issue," is what was posted. "See the banner instead at the top of the dashboard for updates."

Tweets like this started to appear. This is a small example of the impact of US-EAST-1 being down. "Please. All my marketing goes through Amazon Web Services. We're losing more than $1,500 each hour." Then engineers started to recommend that AWS do some chaos engineering, which is cool. We're here at Chaos Conf a few years later.

I thought it'd be interesting to actually look through the post-mortem report, because they actually have a lot of information here. This was collected by Rich Burroughs on our team. We have 12:26 PM PST, "The index subsystem had activated enough capacity to begin servicing S3 GET, LIST and DELETE requests." Then we have a big jump. Almost an hour later, 1:18 PM PST, "Index subsystem has fully recovered." Then even more time up to 1:54 PM, "Placement subsystem has fully recovered."

Then some time after 1:54, not really sure, but I do know from when I was working on this incident ... I was actually the incident manager on call at Dropbox during the time, so I was responsible for anything that was impacted by US-EAST-1 being down, which actually ended up being thumbnails on dropbox.com. That was a hard one to figure out, but we fixed it. It was actually a five hour outage in total, so a pretty big one.

Then many issues happened. Many systems were impacted. Let's have a look at that. We can actually get that from the post-mortem report. S3 APIs were unavailable in US-EAST-1. The AWS Service Health Dashboard was impacted. The S3 console was impacted. You couldn't actually go and load it to even see what was happening. EC2 instances were not able to be created or launched, because there was a dependency on S3. EBS volumes were also impacted, and Lambda was impacted. These are just a few of the things that were impacted by US-EAST-1 being down.

What was the impact for us at Gremlin at the time? Well, actually the way that we'd built Gremlin, our entire marketing site and our application were wholly inaccessible since they were built as static sites hosted out of S3 and fronted by CloudFront, and they were in US-EAST-1. So no one was able to use the Gremlin product at all via the web application for the entire outage. That's pretty bad.

Bobby Tables: Namely was also bitten by this. All the profile images were actually served through a Ruby on Rails application to do some simple resizing. But when S3 went down, the Rails application was unable to retrieve the images, and started timing out, and actually created a request backlog that eventually tipped over, and the entire thing went down.

#DontGetBittenTwice, though. Right now, what we're going to do is we're going to go back to the future and use what we learned, Tam.

Tammy Butow: Yep. It's going to be exciting. Who here remembers the S3 outage? Put your hand up if you remember it. All right, keep your hand up if you were impacted by the S3 outage. All right, so quite a lot of hands. Keep your hand up if you were paged because of the S3 outage and it broke something. Okay. A few people, that's still quite a lot of people. All right, so yes, let's do it.

Bobby Tables: Let's do it.

Tammy Butow: We really need to get an A.

Bobby Tables: Sounds like it's incident reproduction time, Tam.

Tammy Butow: All right. What we're actually going to do is we're going to reproduce the S3 outage of 2017 using Gremlin. It's possible for us to do this, which is pretty cool. Let's take what we learned from 2017, and now go to today, 2019, September 26.

Today, everyone here and everyone on the live stream works at [ChaosConfMisfits.com 00:07:22]. You'll see everything's okay with our application. Images and assets are loading. We've got happy customers. You can easily browse what kind of misfit you might like to adopt. You know what you're getting yourself into if you bring one home. Everything looks like it's going pretty well.

Just an FYI too, this is an AWS sample app called the microservices-demo application and it was built by AWS, so you can actually grab it on GitHub. It was created in Python, Go and Java. If you go to github.com/aws-samples, you'll find it. But yeah, we'll share the link later, so you can play around with this at home too later and at work.

This is the recommended architecture that Amazon explains that you should use when you start to use their sample application. Let's have a look and see if we can identify any issues based on what we learned from going back in time as Bob and Tam. Okay, so let's see. If we had the US-EAST-1 outage again, the sample site tells us that we should create an S3 bucket in US-EAST-1, and just create one bucket. It also tells us that we should store our index.html file, our CSS file, and all of our images in our one S3 bucket in US-EAST-1.

Bobby Tables: It's probably fine.

Tammy Butow: It's probably going to be fine. All right. This is before the US-EAST-1 outage. What happens next? It's not going to work at all. If that outage happened again, we would just have nothing. Nothing would load. Nothing would be available, no index page, no CSS page, no images, not very good graceful degradation. We also wouldn't be able to get to the AWS console, like I mentioned before. It would look something like this. That doesn't seem very good. We can do better.

Let's get reliable. Alrighty. This is where we started with the example of how we should build our site. We can do better though. Let's think about adding some elastic load balancing. We're also going to use EC2 for our index page and our CSS page with some auto scaling. We're also going to use Gremlin, and then we'll be having an S3 bucket in US-EAST-1, because we want to gradually iterate and improve. We're going to try it out. That bucket will store our images. Let's see how we go.

Alrighty. Here's our site before. We can scroll through and see what we have. What we're going to do is we're going to use Gremlin Scenarios. There's actually a recommended scenario called Unavailable Dependency. Gremlin comes with scenarios out of the box, so you can just click on Scenarios and try this one out.

Because of the way our applications are built these days, microservice applications have so many dependencies, and usually actually everything needs to work for you to have a good experience, unless you think about graceful degradation and really plan for it. You can have a lot of issues.

When we run this scenario, where we're going to make US-EAST-1 unavailable, what we want to do is think about what our hypothesis is. This is really important. We're going to be saying, "Well, when our dependency's unavailable, maybe we think that the images won't load, but the index and CSS pages should load."

Here we select the host. We're going to do a black hole attack. We're going to select Network Black Hole 300 seconds. I'm choosing here S3 US-EAST-1. That's in the provider section. That enables me to actually black hole everything. I just click Run Scenario. That's it.
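The hypothesis above can also be recorded mechanically instead of eyeballed. Here's a minimal sketch of encoding it as an automated check to run while the black hole attack is active; the asset names and the commented-out URLs are hypothetical stand-ins for the demo site.

```python
# Minimal sketch: encode the scenario hypothesis ("index and CSS load,
# images don't") as a check you can run during the black hole attack.
import urllib.request

def asset_loads(url, timeout=5.0):
    """Return True if the asset responds with HTTP 200 before the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # URLError, timeouts, connection resets
        return False

def hypothesis_holds(results):
    """With S3 black-holed: index and CSS still load, images do not."""
    return results["index"] and results["css"] and not results["images"]

# During the attack you would collect real results, e.g.:
# results = {name: asset_loads(url) for name, url in ASSET_URLS.items()}
# Here we plug in what the demo actually showed:
observed = {"index": True, "css": True, "images": False}
print(hypothesis_holds(observed))  # True: the outcome matched the hypothesis
```

The printed result maps directly onto the Expected / Incident Detected fields you fill in on the scenario results page.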

All right, so now we get to this page where it says, "The scenario is set up." On this page, you can see a nice calendar that gives you a history of the scenarios you've run. As we scroll down later, we'll be able to fill in our results, our notes and observations, and whether we got what we expected. You can see all the information here. This is handy for when somebody else comes along and wants to know what scenarios you've been running.

Let's see what happens when it's now running. Alrighty. It looks a little bit better than before. Our index page loads, our CSS page loads. But we don't have any images, so there are no images available at all. So we don't really know what misfit we're adopting. We don't know what we're getting ourselves into if we actually selected one of them. That's not too good. We could definitely do better than this as well.

What we want to do, since we're doing this iterative approach, is scroll down and actually store our details in this section in the results. We're going to say our images didn't load. It's what we expected to happen. We click Expected and Incident Detected, and we can just type it in here. That's how we do our scenario. Alrighty.

What else can we do to make it even more reliable after this? Well, we could think about adding an additional S3 bucket. S3 enables you to turn on S3 replication, and then you can have a bucket that you can fail over to in a different region. We could have a bucket in, say, US-WEST-1. It's also important to think about which other areas of your architecture use S3 buckets. In this example, we're using it for enriched click data too, so we better not forget about that. But there are so many things you can do to actually improve your reliability. It doesn't take a lot of work, so it's definitely worth it.
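As a rough illustration of that replication idea, here's a hedged sketch of a cross-region replication configuration. The bucket names, region pair, and IAM role ARN are assumptions for illustration, and replication also requires versioning enabled on both buckets.

```python
# Hedged sketch of S3 cross-region replication for the static-site assets.
# All names/ARNs below are hypothetical.
SOURCE_BUCKET = "misfits-assets-us-east-1"                   # hypothetical
DEST_BUCKET_ARN = "arn:aws:s3:::misfits-assets-us-west-1"    # hypothetical
REPLICATION_ROLE = "arn:aws:iam::123456789012:role/s3-crr"   # hypothetical

replication_config = {
    "Role": REPLICATION_ROLE,
    "Rules": [
        {
            "ID": "replicate-all-site-assets",
            "Status": "Enabled",
            "Prefix": "",  # empty prefix = replicate every object
            "Destination": {"Bucket": DEST_BUCKET_ARN},
        }
    ],
}

# With boto3 and real AWS credentials, you would apply it roughly like this:
# import boto3
# s3 = boto3.client("s3")
# s3.put_bucket_versioning(
#     Bucket=SOURCE_BUCKET,
#     VersioningConfiguration={"Status": "Enabled"})
# s3.put_bucket_replication(
#     Bucket=SOURCE_BUCKET,
#     ReplicationConfiguration=replication_config)
```

With a configuration like this in place, your application or DNS layer still needs its own failover logic to read from the US-WEST-1 bucket when US-EAST-1 is unavailable.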

We could do S3 bucket replication, like I mentioned, and S3 bucket failover. We could think about multi-cloud storage. We could also look at using a CDN, or even a multi-cloud CDN. We can look into origin shields, which are also really cool and enable you to easily use multiple CDNs. We could also handle image failure on the front end using React, which is a really cool thing to chat to your front-end team about.

Bobby Tables: What about playbooks though? One thing about breaking things on purpose and chaos engineering is that it's not just for software. You can use the same principles of chaos engineering to validate things in your processes. You can use it to identify gaps in knowledge. You can use it to also onboard people.

Playbooks write down the steps necessary to complete certain tasks. For example, how many people in your organization know how to update your status page during an incident? There are a couple of ways you can check this. One way to check whether the process works, or whether you're missing one entirely, is to hold a surprise meeting, which is what I did to my coworker. That's my coworker.

Tammy Butow: Click once more. Yeah. Oh, there he is.

Bobby Tables: I have [Dylan 00:14:31] here. I scheduled a surprise meeting for him and me. Dylan, what I would like you to do is pretend there is a SEV1 incident for FireHydrant, and update our status page.

Dylan: Okay, great. I'm going to go to [statuspage.io 00:14:53]. I'm not logged in, because only Bobby has credentials for the status page. Now I'm stuck and don't know what to do.

Tammy Butow: That's not very good.

Dylan: Help. What would you ... I'm going to Slack you.

Bobby Tables: From my Slack.

Dylan: Yup.

Tammy Butow: This is really impacting [crosstalk 00:15:17].

Dylan: I'm going to Slack myself and say, "I please need the status page credentials."

Bobby Tables: I please need.

Dylan: Hey, it's a SEV1. Grammar's out the door. Help me now.

Bobby Tables: Okay, so you need an account.

Dylan: Yeah.

Bobby Tables: Okay. Let's [crosstalk 00:15:37]

Pretty quickly, we identified that if I was unavailable, and we had an incident, that we would not be able to adequately tell our customers about this incident, because nobody else knew how to update our status page.

Tools can actually help guide us through these problems, by letting us create playbooks and store them inside something like FireHydrant. One thing I'll say about playbooks is that they should guide, not prescribe. They should be something that helps you, but doesn't tell you what to do.

We've built the ability to store what we call runbooks in FireHydrant. In these, we're able to define our standard process for a SEV2 and attach it to our incident. Inside of our tool, we can say, "Post a status page." If you'd like, you can include something that says, "Keep calm and have agency," to remind your teams that you have agency during an incident. Playbooks can help remind your engineers that they have the ability to do what needs to be done to resolve incidents too.

Lessons Learned by Bob and Tam, our top three.

Tammy Butow: Alrighty. First one. It's really good to reproduce your incidents using chaos engineering. This helps you ensure that they don't happen again, which is actually possible. One of the best questions I've ever heard in a post-mortem meeting was, "How do we make sure that this never happens again?"

An example of an incident that can continuously happen again would be a batch job hitting the database every Tuesday night at 10:00 PM. It's always the same one over and over and over, but it never gets fixed. How do we make sure it never happens again?
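For that kind of recurring batch-job incident, one common permanent fix is to add random jitter so the job doesn't slam the database at the exact same moment every week. This is a generic sketch, not the specific fix from the talk; the function and parameter names are made up for illustration.

```python
# Generic sketch: add random jitter to a weekly batch job so it doesn't
# hit the database at exactly 10:00 PM every Tuesday.
import random
import time

def start_batch_job(run, max_jitter_seconds=1800, sleep=time.sleep):
    """Delay the job by up to max_jitter_seconds, then run it."""
    delay = random.randint(0, max_jitter_seconds)
    sleep(delay)  # spread the load over a 30-minute window
    return run()

# The sleep function is injectable, which makes this easy to test:
delays = []
start_batch_job(lambda: "done", sleep=delays.append)
print(0 <= delays[0] <= 1800)  # True
```

Reproducing the incident with chaos engineering (e.g., generating the same load spike on purpose) is how you verify a fix like this actually prevents a recurrence.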

Bobby Tables: Playbooks need to be validated. You can't write something, put it in Confluence, and expect it to work. You have to practice the playbook, e.g., how to update your status page.

Tammy Butow: The next thing you want to do is test your system and your team's processes. If you don't have a process, then you probably have a hidden process. Like we saw with the status page example, it's hard to know what to do if nobody ever tells you.

Bobby Tables: If the word process scares your team, use the words team traditions instead, something that Tammy was telling me she implemented at Dropbox.

Tammy Butow: Yeah, I actually did do this, because whenever I would say, "Oh, we need to probably fix this process," everyone would just run out of the building. Like, "What? Okay, guys. Okay, everybody come back in. Let's talk about this."

But I realized, when you think about it as traditions, that makes a lot of sense. Think about your family and favorite traditions you might have, say, in America for Thanksgiving. There are certain foods you love to eat. The same thing happens at work. There are certain traditions that we do that actually we enjoy, and it helps us work better together.

Bobby Tables: With that, thank you.

Tammy Butow: Thanks so much.

See our recap of the entire Chaos Conf 2019 event.
