Caroline Dickey: Think Big: Chaos Testing a Monolith - Chaos Conf 2019
The following is a transcript from Mailchimp Site Reliability Engineer, Caroline Dickey’s, talk at Chaos Conf 2019, which you can enjoy in the embedded video above. Slides are available here.
I'm Caroline Dickey, I'm a site reliability engineer at Mailchimp, and I'm going to be talking about how to Chaos test a monolith, why you'd want to Chaos test a monolith, and then a little bit about Mailchimp's approach to Chaos Engineering. So let's just start off on the same page. What's a monolith? A monolith is a traditional way of designing a system. They're easier to develop and get started with, but they can be less reliable and more complex. The idea behind microservices is that you're breaking an application up into different functional components that are frequently connected using APIs. And so, one really common approach to Chaos testing is to test the dependencies between those different microservices.
When I started doing research for this talk, I noticed something really interesting about other conference presentations about monoliths. Maybe you can see it too. We're deconstructing, taming, breaking up, but we're also killing, wrecking, slaying, grinding. I don't know if you know what that is. I don't. And then probably blowing up a monolith. All right. So, some people out there, they don't like monoliths. They aren't trendy. And most people don't talk about how great they are. That's not what this talk is about either. But here's the thing. Mailchimp has a monolith. It's a big one, almost 23 million lines of code, most of them PHP. And that's not all. Although this will probably change in the future, the vast majority of our servers are bare metal, co-located across two data centers.
But who cares? Right? Our customers depend on this monolith to communicate with their customers. They don't care about our tech stack. They just want the application to work. They've trusted us with their business and it's our job not to let them down. Chaos Engineering won't get rid of incidents entirely. That's impossible; every system is going to have something go wrong eventually. But it can reduce the frequency and severity of incidents. And that's true no matter what your architecture looks like.
So, even if you don't have a monolith, this talk is still for you. Because of the unique constraints put on us by our architecture, we've had to apply particular creativity and persistence to make Chaos Engineering a reality. We weren't able to just go in and terminate instances Chaos Monkey style, and we knew that going in. But Chaos Engineering is about so much more than just Chaos Monkey. And so, this talk is just as much about how we approached the problem as it is about a technical solution.
All right. So, when we started exploring Chaos Engineering at Mailchimp, this is something that we heard: we aren't ready for Chaos Engineering. And in our case, that may have been true. Our application wasn't inherently built with resiliency in mind. Like a game of Jenga, one bad push can take the whole thing down. So why bother? Well, in software engineering, there are always going to be things that we just can't control. There are the bad pushes I was talking about. There's PostgreSQL deciding to vacuum the database at an inconvenient time. There's a story behind this, you can ask me about it later. And of course, what are you going to do when the backhoe takes out your network? Can't do anything about that.
Luckily, there are plenty of things that we can influence and we can control. We can build redundancy into our database, load balancer, and app server configurations, so there's never a single point of failure. We can control the error handling we build around our dependencies, and we can ensure that our monitoring and alerting will let us know quickly and accurately if anything does go wrong. Chaos Engineering is all about validating our assumptions about these things, the things that we know about, but also about expanding that set: surfacing all the things that, if we did know about them, we could do something about to make our application stronger.
So, I'd like to share some approaches for starting out with Chaos Engineering, starting with using an architecture diagram. This is a very basic architecture diagram of how MailChimp's web application works, and when we started using Chaos Engineering at MailChimp, this is the approach that the SRE team took. We didn't care about what the code was doing as much as we did about what the code was running on, sometimes called the infrastructure approach. And so, as I go through this presentation, I'm going to be sharing real scenarios that we've run at Mailchimp as well as some notes that we took.
So the first one I'd like to talk about is a load balancer failover. Here's the part of our architecture diagram that we were targeting. There's load balancers, and load balancers just distribute traffic across different computing resources like app servers. This is something that we take for granted. If one of our load balancers fails, we assume and we hope that the virtual IP will flip over to that secondary load balancer and traffic will continue as usual, our customers will never know anything went wrong. But we need to validate that's true, right? So we decided to have a Gameday around it. And so, here's the actual notes that we took, we discussed why the scenario was important, how we planned on monitoring the test and what we expected to see happen.
What was interesting is that even though we all had faith in our system, and we believed that we would see that failover happen, we weren't quite on the same page about exactly how it would happen. So, what did happen? Pretty much exactly what we expected. The virtual IP flipped to the secondary and we saw some alerting, and that's great. Even though we didn't have that aha moment that you sometimes hear about with Chaos Engineering experiments, we validated that a really critical part of our infrastructure works as expected.
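As an aside, one lightweight way to watch a failover like this is to poll the virtual IP throughout the test and measure how many health checks succeed. This is only an illustrative sketch, not Mailchimp's actual tooling: the `check` callable stands in for whatever HTTP health check you'd run against the VIP.

```python
# Minimal failover-probe sketch. check() is an injected stand-in for a real
# HTTP health check against the virtual IP; attempts/interval are examples.
import time

def probe_availability(check, attempts=5, interval_s=0.0):
    """Poll the health check `attempts` times and return the fraction
    of probes that succeeded during the failover window."""
    successes = 0
    for _ in range(attempts):
        if check():
            successes += 1
        if interval_s:
            time.sleep(interval_s)
    return successes / attempts
```

During a Gameday you'd run this while taking the primary load balancer down; a result close to 1.0 means the VIP flip was effectively invisible to customers.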
All right. So the second scenario I'd like to share with you involves messing with this part of our architecture diagram, these databases — in this case, making them read-only. We wanted to understand what would happen to our application servers whenever they weren't able to write to the databases. We expected that anything that didn't need to write to the databases would degrade gracefully. And then we expected to see some alerting also.
And this was the case for most of the application, but we identified some places where some legacy code didn't have proper error handling in place and was exposing this beautiful SQL error to our users, which not only is a security concern, it's really confusing. It doesn't make any sense. So, of course we've gotten it fixed. I'd like to just highlight this. This is from an internal newsletter that I sent out after the Gameday, just to share with anyone who wasn't able to make it in person what we learned and the value we got out of it. And just to kind of sell it a little bit internally.
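The shape of that fix can be sketched in a few lines. This is a hypothetical illustration, not Mailchimp's PHP code: `ReadOnlyError` and `save_draft` are invented names. The point is simply that the write path catches the database error and returns a friendly degraded response instead of leaking the SQL error text to the user.

```python
# Hypothetical sketch: degrade gracefully when the database is read-only.

class ReadOnlyError(Exception):
    """Raised by the (hypothetical) DB layer when writes are refused."""

def save_draft(db_write, payload):
    """Attempt a write; on a read-only database, return a friendly
    degraded response rather than surfacing the raw SQL error."""
    try:
        db_write(payload)
        return {"ok": True, "message": "Draft saved."}
    except ReadOnlyError:
        # Log for operators here; show only a generic message to the user.
        return {"ok": False,
                "message": "Saving is temporarily unavailable. Please try again shortly."}
```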
All right. So the next approach I'd like to talk about is validating changes. If you work anywhere like Mailchimp, you can sometimes feel like change is the only constant. And so, going in and identifying those changes that are about to be made, and making sure that when they go live they won't take your systems down, is a really great scenario for a Gameday. So, what I'd like to share with you relates to a new file system that we were configuring on our application servers. The name of this file system is Ceph, you may have heard of it, and ceph-fuse is a client that mounts the file system. So, let's kill ceph-fuse, right? What could go wrong?
As it turned out, that was actually a really good question. Everyone in the room was very confident that something bad would happen. But what that bad thing was, we didn't know, and this was one of our very first Gamedays. So we all felt very confident that some high quality alerting would occur. And as you may have guessed, there was no alerting. We discovered that one of our engineers had written a really cool Python script that did attempt to alert but it wasn't configured correctly and attempted to remount the file system but failed silently.
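The lesson generalizes: a remount script should verify that the remount actually worked and alert loudly when it didn't. Here's a minimal sketch of that shape — the `remount` and `alert` callables are injected stand-ins, not the real script that was running at Mailchimp.

```python
# Hypothetical remount watchdog: verify the mount came back, never fail silently.
import os

def remount_and_verify(mount_point, remount, alert):
    """Attempt a remount, then confirm the path is actually mounted.
    remount() and alert() are injected so the logic is testable.
    Returns True only if the path is mounted after the attempt."""
    remount()
    if os.path.ismount(mount_point):
        return True
    # The failure mode described above: the original script swallowed this
    # condition. Alert loudly instead.
    alert(f"remount of {mount_point} failed")
    return False
```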
All right. So the next validating changes scenario I'd like to share relates to a new caching library that we had added. The name of this cache system is Memcached. It's a distributed caching system, and it speeds up websites like Mailchimp by reducing the number of times database references are made. It's distributed. Here's a fun architecture diagram of what that means for us. We have Memcached instances on all of our app servers, which talk to each other in order to access that cached data.
So we'd made a fix to the new client that we'd installed because we were seeing some timeouts, and we wanted to validate that our fix had actually worked. So we had a Gameday, and we learned a lot from it. We learned that our fix was almost completely effective — there was only one place where we had missed applying it. So, we were able to go back, make that fix, and make things better for the future. We identified concrete values for how much latency our app and API servers could handle, and based on this knowledge, we were able to reduce the Memcached timeout that we had set from four seconds to 250 milliseconds, so that if anything does go wrong with the connection between the app and other Memcached instances, it won't just take our system down; we'll be able to handle that failure gracefully and our customers won't see any heavy impact.
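The pattern here — a tight cache timeout with a graceful fallback to the database — can be sketched roughly like this. The fetch functions are illustrative stand-ins rather than a real Memcached client, and the 250 ms value just mirrors the number from the Gameday.

```python
# Illustrative cache-with-fallback pattern; not Mailchimp's actual client code.

CACHE_TIMEOUT_S = 0.250  # tightened from 4s, per the Gameday described above

def get_with_fallback(key, fetch_from_cache, fetch_from_db):
    """Try the cache with a tight timeout; on a miss or a slow/unreachable
    cache node, fall back to the database instead of stalling the request."""
    try:
        value = fetch_from_cache(key, timeout=CACHE_TIMEOUT_S)
        if value is not None:
            return value
    except TimeoutError:
        # Cache is slow or unreachable: degrade gracefully to the database.
        pass
    return fetch_from_db(key)
```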
All right. So, a final approach I'd like to talk about is testing dependencies. Now, internal dependencies is kind of cheating, because this talk is supposed to be about a monolith, right? But hear me out. Just because you have a monolithic application doesn't mean that your application isn't still talking to other applications internally. And failing to identify those dependencies can have really negative consequences, like an incident. This incident happened when an application that had been historically resilient and wasn't well known to a lot of engineers became unavailable. The name of this system is requestmapper, and all it really does is map URLs from the pretty form we want our customers to see into the form that we store internally. And of course the Mailchimp monolith makes calls out to requestmapper.
So, in our testing we were able to recreate the incident, and yep, removing that connection between Mailchimp and requestmapper did in fact result in this load balancer error page getting displayed. So, we devoted the rest of this Gameday to digging in and trying to figure out what exactly was happening. Once we were able to replicate the behavior, we did some profiling and investigation and found this. What's this? A 503 Service Unavailable response code was getting returned. So what was happening is that when the app server wasn't able to talk to requestmapper, it was returning this 503 response code back to the load balancer to tell the load balancer that the service — in this case, the app server — wasn't available. So the load balancer showed our users that the service was unavailable. And so, changing that 503 to a 500 Internal Server Error actually got rid of the issue entirely.
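The status-code fix can be reduced to a small sketch. This is illustrative Python rather than the actual PHP handler; `DependencyUnavailable` and `resolve_url` are invented names. The key idea is what each code signals to the load balancer.

```python
# Hypothetical handler sketch: map an internal-dependency failure to 500,
# not 503, so the load balancer doesn't conclude this app server is down.

class DependencyUnavailable(Exception):
    """Raised when an internal service (like requestmapper) is unreachable."""

def handle_request(resolve_url):
    try:
        body = resolve_url()
        return 200, body
    except DependencyUnavailable:
        # 503 tells the load balancer *this server* is unavailable and can
        # surface its error page; 500 says "this request failed" and keeps
        # the server in rotation.
        return 500, "Internal error resolving URL"
```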
And again, internally marketing Chaos Engineering is so important. And so, this is from an internal blog post that I wrote talking about the incident itself and then what we learned from our Chaos Gameday afterwards. So definitely I encourage you to try this out if it's something that you're able to do, maybe even an external blog post. All good. All right. So, everything has an API these days. We have an API, you have an API and you're probably using someone else's API. So, this is a great scenario to run with some developers in the room.
Mailchimp uses a lot of integrations. This isn't all of them, but it's some. And so, in our API Gameday that we did a couple of months ago, we focused on Contentful, Litmus, Facebook, and Twitter. And for the sake of time, I'm just going to talk about our Contentful experiment. Contentful is a content management system that we use for loading static data, like our marketing site. When we blocked network traffic between Mailchimp and Contentful, we expected to see anything in our popular paths cached, and then hopefully some alerting. And we did see some caching and some alerting, but we also learned that our front end wasn't actually looking at the content of the message. It was only looking at the response code.
And so in this case, this is actually a 200 success response code, but it clearly isn't actually successful because it has that error message. So, making that small change, where the front end checks whether an error message is being returned, means that if this ever happens again, our customers won't have to see this confusing message.
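That front-end check boils down to: success means a 200 status and no error in the body. A minimal sketch, with an invented response shape (this isn't Contentful's actual payload format):

```python
# Illustrative check: a 200 with an embedded error message is still a failure.
# The dict-shaped body is a made-up stand-in for the real API response.

def is_successful(status_code, body):
    """Treat a response as successful only if the status is 200 AND the
    body carries no error field."""
    if status_code != 200:
        return False
    return "error" not in body
```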
All right. So, I know I said three approaches, but how you test your monolith, or any application, is just as important as what you're testing. So for us, carefully is the operative word. So here's a question. Should you Chaos test in production? Yes, no, maybe. Well, I'm not going to settle this debate today. There are obviously a lot of pros and cons to testing in a production environment, but I'll share Mailchimp's approach. We default to testing in a staging environment that closely resembles production.
We over-communicate about our Gamedays using Slack and our engineering calendar, and we make sure that anyone who needs to know about the Gameday knows before we do it. If anyone in the room isn't comfortable moving from stage to production, we'll make a note of it and we won't. And finally, if we do run a Gameday in production, we are careful to start with the smallest amount of failure that can teach us something — limiting what's called the blast radius — and then we'll increase the magnitude of the attack slowly from there.
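That ramp-up discipline can be sketched as a simple schedule: start with the smallest impact, grow it in steps, and stop immediately on any abort signal. The percentages and the doubling factor below are examples, not Mailchimp's actual settings.

```python
# Illustrative blast-radius ramp for a Gameday; all numbers are examples.

def ramp_schedule(start_pct, max_pct, factor=2):
    """Yield traffic-impact percentages, growing by `factor` up to max_pct."""
    pct = start_pct
    while pct <= max_pct:
        yield pct
        pct *= factor

def run_gameday(inject, abort_check, start_pct=1, max_pct=16):
    """Apply the attack at each magnitude; bail out if abort_check() fires."""
    for pct in ramp_schedule(start_pct, max_pct):
        inject(pct)
        if abort_check():
            return ("aborted", pct)
    return ("completed", max_pct)
```

In practice `abort_check` would watch your dashboards or error budgets; the point is that the halt condition is decided before the attack starts, not improvised mid-test.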
All right. Onto my final section: some alternative use cases for the Gameday format, other than the Chaos Engineering Gamedays I've been talking about so far. So, training is one. Dave mentioned this a little bit earlier; it's kind of in the same train of thought, where you're trying to inject a little bit of Chaos into the way that you work as a team. And so, this example is from a networking Gameday that we did with our networking team. But you could do this with any group of engineers — say, a group of engineers about to go on call. It's up to you.
So, in this Gameday, we had a senior engineer on our networking team go in and break things manually, and then other engineers on the networking team tried to figure out what he had done and how they would fix it. And we learned a lot from this Gameday. We saw that we needed better documentation — that's almost always a takeaway from Gamedays: you need better documentation. And for us it was between our networking and our systems teams, but it was important because the communication had started to break down. We also observed the stress that our engineers felt, even though this was low stakes, a practice incident. So, coming up with a way to develop that muscle memory, so panic doesn't become part of the response process, is so important.
And finally, I think Dave mentioned this also, giving the wrong information can have some really negative consequences, especially when people trust you to give the right information. So, figuring out ways to make sure engineers are both informed and confident and comfortable asking for help if they don't know the answer is again so important.
All right. So, post-mortem counterpart — what does this mean? Well, incidents happen. They happen to us, they happen to you too. And at Mailchimp, we like to run blameless post-mortems after incidents to explore the human factors that contributed to the incident. We try to avoid root cause analysis whenever we can, because that can be limiting. And sometimes, even after our post-mortems and after the incident's resolved, there are still some technical causes left unexplained or some questions left open. So, running these types of Gamedays is a really great way to make sure that you know what happened and make sure it won't happen again.
So a post-mortem Gameday or just recreating incidents is my last example of the day. But this may look familiar because it's actually a repeat of the requestmapper scenario that I shared earlier where we had an internal dependency and it went down and so did our app. So, this is another screenshot from the internal blog post that I sent out where we used the term perfect storm to describe this incident, because it was. This application had never gone down before.
It had been resilient, and during a routine maintenance that touched a bunch of other applications too, requestmapper, just for whatever reason, didn't come back up. It was misconfigured. And because it was misconfigured, we were able to find that 503 response code issue that we would never have known about otherwise. Again, Chaos Engineering is a great approach that allows us to recreate and diagnose the incident and prevent it from ever happening again. And then also make the application stronger for the future.
The final use case I'd like to talk about is application performance. This is a little bit different. So, here's a diagram of how we view these different types of Gamedays. We're all really familiar with the Chaos Engineering Gameday, where you're going in and breaking things on purpose. And then there are some of these other formats, where the training and the post-mortem Gamedays are a little bit more focused on something very specific. But what's application performance? We're calling this a time-boxed opportunity to deep dive into something very specific — maybe you're pulling from engineering reports or support tickets. You get a group of people in the room, plan it exactly like any other Gameday, and do high-impact work quickly.
The background behind the next picture is that although this doesn't exactly fit into the breaking-things-on-purpose framing that we talk about with Chaos Engineering, it's important to remember that things are sometimes already broken, and not on purpose. For example, here is a graph that shows page load time before and after one of these Gamedays. I think you can probably guess which day we had the Gameday on.
I'd like to wrap this presentation up by reviewing the takeaways I hope to leave you with and also showing off this really cool shirt that our design team made for one of our internal Gamedays. First of all, Chaos Engineering can help make any application more resilient regardless of architecture. You can and should Chaos test a monolith. If you don't know where to start, try looking at an architecture diagram or identifying some changes about to be rolled out. Obviously the best time to find a vulnerability is always going to be before the incident happens, but if that's not possible, then recreating it after the incident is a really great way to make sure that you never see that incident again. And I think this goes without saying, Chaos Engineering is an effective tool for sharing knowledge and building empathy. Thank you.
See our recap of the entire Chaos Conf 2019 event.