Jose Esquivel: A Roadmap Towards Chaos Engineering - Chaos Conf 2019
The following is a transcript of the talk given at Chaos Conf 2019 by Jose Esquivel, Engineer Manager at Backcountry.
My name is Jose Esquivel, Engineer Manager for Backcountry. I want to talk about a road map towards Chaos Engineering for you guys. The agenda for today is first we're going to look at a road map that got us to Chaos Engineering. The important part here is that there are some things you need to do first before you make that jump, so I want to make sure I share my experience with you. The second one is eight patterns for stability, because when you run the test and something breaks, you want to know how to fix it. Right? How do you achieve stability for your system? And finally, four ways to achieve observability, because you want to know what's going on.
Okay, so this is the road map that we've taken at Backcountry. If you look at things vertically, you'll see products. If you look at things horizontally, you will see capabilities. The first capability is observability. We have tools that we have implemented and then deprecated. And we also have many tools for what we want to achieve. We're not afraid of tool proliferation, because what we care about is to answer the questions that we have of our systems. Second, alerting, in this section, we do use PagerDuty. Very simple, they solved the problem pretty well and have some cool analytics that you can see how good your stability has been throughout the days and years.
Number three is a process. Incident management is about how you deal with a problem while it's going on. How do you know the impact? How do you know the severity of it? And RCAs or post mortems are about ... RCA is root cause analysis, so this is about understanding the impact that it had on you: what was the time to discover, what was the time to route, to diagnose, and to remediate, which together make up the total time of impact. And then test harness here, we have different tools. As you can see, this is very important because this started with unit and integration testing, and it evolved to load testing, penetration testing, and, right now, Gremlin. I'm going to focus right here, right before we started using Gremlin: what are those things that you will find out are a must have? And I'm going to do that through our experience.
Before running any chaos experimentation, you need to understand stability patterns and observability, because that's the way you're going to be able to understand what's going on and then fix it if you find a problem. Eight stability patterns. The principle is that everything will fail. We know that. And the first pattern is timeouts and retries, as a pair. So what's the problem if you don't have a well-tuned timeout? Well, you're going to deplete your HTTP pool, your database pool, and you don't want that. Be careful because the default timeouts in the frameworks are regularly incredibly long, like two minutes I've seen, and that's an eternity in computer science. So make sure you tune them based on your needs. For databases, be careful because you have two variables there: number one, the time that it takes to connect to the database, and then the time that it takes the query to be executed. So, again, tune them properly.
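The two separate time budgets described above (one for connecting, one for waiting on the answer) can be sketched with Python's standard `socket` module. This is only an illustration: the talk shows no code, and the function name `read_reply` and the default values are assumptions, not Backcountry's settings.

```python
import socket


def read_reply(host, port, connect_timeout_s=1.0, read_timeout_s=2.0, max_bytes=4096):
    """Connect and read one reply, with a separate time budget for each phase."""
    # Budget 1: how long we are willing to wait for the connection itself.
    sock = socket.create_connection((host, port), timeout=connect_timeout_s)
    try:
        # Budget 2: how long we are willing to wait for the peer to answer.
        sock.settimeout(read_timeout_s)
        return sock.recv(max_bytes)
    finally:
        sock.close()
```

If either budget is exceeded, the call raises `socket.timeout` instead of holding a pooled connection for minutes. Database drivers usually expose the same two knobs under their own names.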
Pay close attention to achieving comprehensive retries, because you can basically make a problem worse. I'm going to tell you a story. We had a database performance issue. Our APIs were slow. And guess what? They were timing out and retrying and retrying and retrying. We made things worse. It took longer to recover, and it was because our timeouts were not comprehensive, and there's nothing wrong on giving up either.
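A minimal sketch of a retry policy that cannot pile onto a struggling dependency: a bounded number of attempts, a growing pause between them, and an explicit way to give up. The names `call_with_retries` and `GaveUp` and the default numbers are hypothetical, chosen only for illustration.

```python
import random
import time


class GaveUp(Exception):
    """Raised when every attempt failed; the caller decides what to do next."""


def call_with_retries(operation, *, attempts=3, timeout_s=0.5,
                      base_delay_s=0.1, sleep=time.sleep):
    """Call operation(timeout_s) at most `attempts` times, backing off between tries.

    `operation` is expected to enforce the timeout itself (for example by
    passing it to its HTTP or database client) and to raise TimeoutError.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return operation(timeout_s)
        except TimeoutError as exc:
            last_error = exc
            if attempt < attempts - 1:
                # Exponential backoff with jitter, so that many clients do not
                # retry in lockstep against an already slow dependency.
                delay = base_delay_s * (2 ** attempt)
                sleep(delay + random.uniform(0, delay))
    raise GaveUp(f"gave up after {attempts} attempts") from last_error
```

The cap on attempts is the important design choice here: once it is reached, the error goes back to the caller instead of adding more load to the slow database.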
The next one is the circuit breaker. Here it is, once again, about being a good fellow client. This has three states. The first one is the circuit breaker is closed. You have a normal operation where you have a dependency; you can do calls. But then you can set up a threshold where you say, "When this happens, I don't feel comfortable," and therefore, you can open the circuit breaker and go into graceful degradation. For us, an example I have is that we have a tax API that interfaces with our tax provider. Sometimes they are slow. They're down. So what do we do? We do our flat tax. Flat tax is the minimum tax possible that we can apply for an order. So we get some taxes for the customers, the customers can continue to check out, and we lose less money. Right? So once the circuit breaker is open, you can do the graceful degradation, and then you can set up a delay after which the circuit breaker will be half open. This means that it is going to try to see if the dependency is back on, and then if it's back on, it will go to closed, and everything has just recovered. If not, it will continue to be open.
This, you can do it. You can code it. I recommend Hystrix. They just solve it pretty well, so check it out, from Netflix. It's pretty well implemented. It's super easy to use. Last thing, be sure to include this in your monitoring.
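For readers who want to see the three states in code, here is a minimal hand-rolled sketch. It is not Hystrix and does not mirror its API; the class name, the thresholds, and the injectable `clock` are assumptions made for illustration.

```python
import time


class CircuitBreaker:
    """Closed -> open after `failure_threshold` consecutive failures;
    open -> half-open after `reset_after_s`; half-open -> closed on success."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_after_s:
            return "half-open"
        return "open"

    def call(self, operation, fallback):
        if self.state == "open":
            return fallback()  # degrade gracefully without touching the dependency
        try:
            result = operation()  # closed: normal call; half-open: trial call
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # (re)open and restart the delay
            return fallback()
        self.failures = 0
        self.opened_at = None
        return result
```

In the tax example from the talk, `operation` would be the call to the tax provider and `fallback` would return the flat tax. The `state` property is also the natural thing to export to monitoring.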
Bulkhead pattern, when we see the microservices architecture, we immediately think about a bulkhead pattern. You have your [inaudible 00:06:00] API that can be failing, but you have your order API continuing to take orders. But for the bulkhead pattern, if you go into the API, the individual API, what you could do is create connection pools individually for every end point. What does that mean? That if an end point requests a connection to the database or a query, and those tables are slow, because it's the connection pool for the end point, then the end point is the only one affected. The rest can continue to work. They have their own connection pool, and even you can imagine that with individual connection pools, you can do individual configuration based on timeouts for the query or the connection.
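One way to picture per-end-point pools is a bounded set of slots for each end point. The sketch below uses semaphores as a stand-in for real connection pools; the class name `Bulkhead`, the end point names, and the limits are hypothetical.

```python
import threading
from contextlib import contextmanager


class Bulkhead:
    """One bounded pool of slots per end point, so a slow end point
    can only exhaust its own slots."""

    def __init__(self, limits, acquire_timeout_s=0.1):
        self._slots = {name: threading.BoundedSemaphore(n) for name, n in limits.items()}
        self._acquire_timeout_s = acquire_timeout_s

    @contextmanager
    def slot(self, endpoint):
        semaphore = self._slots[endpoint]
        # Wait only briefly for a free slot; failing here keeps callers from queueing up.
        if not semaphore.acquire(timeout=self._acquire_timeout_s):
            raise RuntimeError(f"{endpoint}: pool exhausted")
        try:
            yield
        finally:
            semaphore.release()
```

A handler would wrap its database work in `with bulkhead.slot("/reports"):`. When the tables behind `/reports` are slow, only `/reports` runs out of slots, and `/orders` keeps working from its own pool.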
Okay? Steady state, like a cartoon, "I don't know what to do, the more I run, the more weight I seem to put on." This is a very common scenario with logs. If you are storing your logs in your VM, rotate them, but more importantly, make sure you do not let the logs fill up the disk. That happened to us, and it's just a matter of a configuration. Be sensitive about not filling up your API. Also, the cache is another example: you have your [inaudible 00:07:21] cache, and all that, but you let it grow indefinitely, and it ends up consuming your memory, so be sure to account for this one.
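Both cases come down to putting a ceiling on something that otherwise grows forever. A small sketch using only the Python standard library follows; the logger name, the size limits, and the `product_name` lookup are made up for the example.

```python
import logging
from functools import lru_cache
from logging.handlers import RotatingFileHandler


def build_logger(path, max_bytes=10_000_000, backups=5):
    """Cap log disk usage at roughly max_bytes * (backups + 1)."""
    handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backups)
    logger = logging.getLogger("steady_state_example")
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger


@lru_cache(maxsize=1024)  # least recently used entries are evicted past 1024
def product_name(sku):
    return f"product-{sku}"  # stand-in for an expensive lookup
```

With `backupCount` set, the oldest rotated file is deleted on each rollover, so the disk footprint stays bounded no matter how long the process runs.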
Okay, fail fast, if you imagine a client with a bulkhead pattern, circuit breaker, timeout, retry, a resilient client, then the other side of the equation is the server that can be doing fail fast. Because if you think about it, the only reason you have a timeout is because a resource is just taking too long. But what if the resource or the dependency just fail fast? You can use your SLAs. You can use the duration that you define so that the server says, "Oh no, I'm giving up, and I'm letting the client handle the error."
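On the server side, one simple way to fail fast is to carry a time budget through the request and stop as soon as it is spent. The sketch below is cooperative (it checks between steps rather than interrupting one); `run_with_budget` and `DeadlineExceeded` are hypothetical names, and the budget would come from whatever SLA you have defined.

```python
import time


class DeadlineExceeded(Exception):
    """The server gave up so that the client can handle the error early."""


def run_with_budget(steps, budget_s, clock=time.monotonic):
    """Run the steps in order, refusing to start a step once the budget is spent."""
    deadline = clock() + budget_s
    results = []
    for step in steps:
        if clock() >= deadline:
            raise DeadlineExceeded(f"budget of {budget_s}s exhausted")
        results.append(step())
    return results
```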
Another aspect of failing fast, which is very important, is early validation. If you get a request, and it already has errors or is already invalid (data integrity problems, malformed input, et cetera), why would you bother processing it, right? I know it adds an additional layer to your APIs, but it's worth it. It's also a security concern.
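Early validation can be as small as a function that runs before any expensive work. The field names (`order_id`, `items`) and the response shape below are assumptions for illustration, not Backcountry's API.

```python
def validate_order(payload):
    """Return a list of problems; an empty list means the request may proceed."""
    errors = []
    if not isinstance(payload.get("order_id"), str) or not payload["order_id"]:
        errors.append("order_id must be a non-empty string")
    items = payload.get("items")
    if not isinstance(items, list) or not items:
        errors.append("items must be a non-empty list")
    return errors


def handle_order(payload, process):
    errors = validate_order(payload)
    if errors:
        return 400, {"errors": errors}  # reject before any expensive work
    return 200, process(payload)
```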
Handshaking. Handshaking is a way to know if an API can initiate a conversation with another dependency. It's about the API saying, "Hey, I'm about to start talking to you. Are you healthy? Can I initiate a load of a certain size?" This is an example from Spring Boot. This is a health end point, and it's not only telling you if the service is up or down, but it also tells you about the dependencies, if they're happy: if I have good disk space, if I have a proper Rabbit connection to the server and a connection to the Postgres service, because if you are down, you want to understand why. Right? Do I have problems internally?
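The idea of a health end point that reports on its dependencies can be sketched in a few lines. This is a generic illustration, not Spring Boot's implementation or its exact output format; the check names and the free-space threshold are assumptions.

```python
import shutil


def enough_disk(path="/", min_free_bytes=100 * 1024 * 1024):
    """True when at least min_free_bytes are free on the disk holding `path`."""
    return shutil.disk_usage(path).free >= min_free_bytes


def health(checks):
    """Run each dependency check and report UP only if all of them pass."""
    details = {}
    for name, check in checks.items():
        try:
            details[name] = "UP" if check() else "DOWN"
        except Exception:
            details[name] = "DOWN"  # a check that blows up counts as unhealthy
    status = "UP" if all(v == "UP" for v in details.values()) else "DOWN"
    return {"status": status, "details": details}
```

A caller can read `status` before sending a large load, and an engineer can read `details` to see which dependency is the unhappy one.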
Uncoupling via Middleware, this is a very interesting one because this is kind of a tool in your tool belt. Just like you have REST calls, SOAP calls, these have a space where, for example, if you are issuing a request, and you don't need a response back, maybe you can just fire and forget. You don't really need to have a REST call that can overwhelm the other side. Or, even if you need a response, you can send the message, wait for your response somewhere else, while you can do other stuff while waiting for the server to respond to you. The key here is that with this approach, the server side will never be overwhelmed. Things might get longer to process, but there is not such a thing as taking down a server because you push millions of messages for it to consume. It's just going to take longer.
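The fire-and-forget shape can be shown with an in-process queue standing in for a real message broker. That stand-in is an assumption made to keep the example self-contained; the names `start_consumer`, `inbox`, and `processed` are hypothetical.

```python
import queue
import threading


def start_consumer(inbox, handle):
    """Drain the inbox at the consumer's own pace; None is the stop signal."""
    def run():
        while True:
            message = inbox.get()
            if message is None:
                break
            handle(message)
    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    return worker


inbox = queue.Queue()
processed = []
worker = start_consumer(inbox, processed.append)
for n in range(1000):
    inbox.put({"order": n})  # fire and forget: the producer never waits on the consumer
inbox.put(None)
worker.join()
```

The producer finishes as fast as it can enqueue, while the consumer works through the backlog one message at a time. A burst makes processing take longer, but it does not push more concurrent work onto the consumer.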
Test harness, more than a pattern, I look at it as a pyramid here. If you see the first three parts of the pyramid, we're talking about unit, integration, and UI tests, so these are things that most of us have been doing for a while. This is about understanding that the system is functioning as it's supposed to function. But, when you get toward the top, you get into more non-functional features like performance, security, and at the very top, chaos tests. What's the point here? If you haven't invested a lot of time here, or if you haven't solved these two or these three, I recommend you do that before you jump into Chaos Engineering. Okay.
We're going to look at four ways to achieve observability, because when you're running a test, or when things are in production, you want to know what tools you can have to answer all the questions you have about your system, and yes, by all I mean all the questions. So these are the four things that I see as observability: logging, metrics and reports, alerting, and tracing. Let's look at them.
Okay, so logging, you need to be ... Before we go into what's logging, let's first say what are the questions you have about your system. Let's start there before we choose tools, because it doesn't matter if you have one tool, or many tools, if you are answering the questions you have about your system, it's okay to have many tools. Personally, or in Backcountry, we do have many tools to solve, to answer these questions. We're not afraid, once again, of tool proliferation.
So logging can happen in two ways, because somebody intentionally wrote a log line, and also if you have an APM, they can capture a lot of metrics for you. APMs are great. Use them. They will be super useful, but in my opinion, the most valuable logging is the one that comes with intention. You can see here in the first example, our tax department said, "Okay, I know we do flat taxes sometimes. How about you tell me the discrepancies that we're creating? All right?" "I'll give you a discrepancy of $42 were found on this order." Okay? So, I intentionally wrote for it. The other one is more for engineers. As you can see, we wrote an error for that exception. This is actually an invalid request for a reference in a SKU. I didn't output the whole message, but we do output the whole stack trace. I think this is key because do not be afraid of, "Oh, this is too much data." No, I would put the whole stack trace. You won't regret it.
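The two kinds of intentional log line described above, one for the business and one for engineers, might look like this with Python's standard `logging` module. The message format, the field names, and the function names are assumptions; the slides' actual log lines are not reproduced in the transcript.

```python
import logging

logger = logging.getLogger("orders")


def tax_discrepancy_line(order_id, site, charged, owed):
    """Business-facing line: key=value pairs are easy to aggregate later."""
    return f"tax_discrepancy order_id={order_id} site={site} amount={owed - charged:.2f}"


def find_sku(catalog, sku):
    try:
        return catalog[sku]
    except KeyError:
        # logger.exception logs at ERROR level and appends the full stack trace.
        logger.exception("invalid_sku_reference sku=%s", sku)
        raise
```

Calling `logger.warning(tax_discrepancy_line(...))` wherever the flat tax is applied gives the tax department one line per affected order.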
Okay, tracing is about causality across systems. Like we've been hearing, one end point can slow down other end points, and they can ripple into other APIs. How do we know who's the problematic one, right? Well, tracing comes to give you that causality. By doing trace IDs ... Well, we can use trace IDs and object IDs. Trace IDs are hard, because this is one of the hardest things to code, because you're going to need to pass on the context on every API. You need to come up with a convention and say, "Hey, the trace ID is going in this field when it's an HTTP request, or is going in this field when it's a message." And all of the teams across your organization need to understand where that field is so they can capture it and pass it on to the next request. Yes, that's why it's hard. But you can also just output object IDs. If you have the opportunity to log the order, that includes the order ID, you can find where that order object has been in different APIs. Or the email of your customer, output it. You want a way to uniquely identify that request.
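The convention itself is small. The hard part is getting every team to follow it. A sketch of one possible convention follows; the header name `X-Trace-Id` and the message field `trace_id` are assumptions chosen for the example, not the convention Backcountry uses.

```python
import uuid

TRACE_HEADER = "X-Trace-Id"  # hypothetical convention for HTTP requests


def ensure_trace_id(incoming_headers):
    """Reuse the caller's trace ID, or start a new trace at the edge."""
    return incoming_headers.get(TRACE_HEADER) or uuid.uuid4().hex


def outgoing_headers(trace_id, extra=None):
    """Headers for the next HTTP hop, always carrying the trace ID."""
    headers = dict(extra or {})
    headers[TRACE_HEADER] = trace_id
    return headers


def outgoing_message(trace_id, body):
    """Envelope for the next message hop, carrying the trace ID in a fixed field."""
    return {"trace_id": trace_id, "body": body}
```

Every service calls `ensure_trace_id` on the way in, logs the ID with each line, and uses one of the two `outgoing_*` helpers on the way out.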
Let me show you what are the benefits of tracing. This is a custom trace that we built. Many tools out there. I recommend you also look at what things are out there, because building stuff requires maintenance and maybe sometimes we don't have a lot of time for that. But, as you can see here, in this event log, I can clearly see the trace IDs or event IDs, and I can see the sources. This means this request went through the order API at this time. It went back to the order API and then went to the shipment API. So that gives me an idea of this payload. As you can see, the payload, I can see it in full, the times, and even you can see and move into different ... This is two days later, so I know what happened in a single transaction and throughout the lifetime of an order.
Metrics and reports, this is now when you have your logging, now you can aggregate that data and give you information, historical information. In this case, for both metrics and reports, you can have people go and look at your dashboards, or you can push the results to people through email. The first three examples that we looked at, the log lines, this is actually how they were converted into metrics. As you can see, with just those two lines, the second one ... I can see the second one created the first dashboard, and I can immediately see the status code ratio by minute of my API. Look at this. There's a lot of 400s, so I might say, "Maybe my clients are not calling me properly. I'm throwing a 400 because I'm validating their request." So I might have a conversation with them.
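Turning log lines into a status-code ratio per minute is a small aggregation. The sketch below assumes a hypothetical line format of an ISO timestamp followed by `key=value` pairs. In practice a log platform would do this for you.

```python
from collections import Counter, defaultdict


def status_ratio_by_minute(lines):
    """Lines look like '2019-09-26T10:15:42 status=400 path=/orders' (assumed format)."""
    per_minute = defaultdict(Counter)
    for line in lines:
        timestamp, *fields = line.split()
        pairs = dict(field.split("=", 1) for field in fields)
        minute = timestamp[:16]  # truncate the seconds
        status_class = pairs["status"][0] + "xx"  # 404 -> 4xx
        per_minute[minute][status_class] += 1
    return {
        minute: {cls: n / sum(counts.values()) for cls, n in counts.items()}
        for minute, counts in per_minute.items()
    }
```

A minute where `4xx` dominates is the signal described above: callers are probably sending requests that fail validation.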
The second one, this is for the tax department. As you can see, now they understand what are the discrepancies by each of our sites as far as tax discrepancies, so they can now have information to account for this money or even catch problems. For example, if I can see right here Motorsport has a really high number of orders with discrepancies, so if they are in different platforms, if I go and start here, I'll get more value. This is the nice thing about exposing this kind of information. So, logging goes beyond engineering. It's about the business being involved. And once you tell the business you can give this to them, trust me, they will not stop asking you for this kind of things.
The last one is alerting. For alerting, I want to emphasize a few things. Be sure you have a difference between warnings and criticals. A warning, you can send an email. You can put it up on the dashboard. But a critical event must wake up somebody. That is very important. You do not send an email for a critical. [inaudible 00:17:35] send the email and call that person. And then if you do that right, which is the basic, I'm going to show you six practices to make great alerting. The first one, which I just mentioned, stop using emails for criticals. Write runbooks. I just saw a product downstairs that had the ability that when you look at a dashboard, at a critical dashboard, you have a set of links that you can customize. What if you have PagerDuty, or you get paged, and immediately you have a set of links talking about the API, what it does, what kind of problems have been seen? Maybe you had the post mortems linked to it. That is going to be valuable for the person resolving the problem.
Delete and tune alerts. This means you need to continue ... This is an iteration process. You never write the perfect alerting once. You will find yourself deleting stuff that is not useful. Things change so you will have to tune them. Use maintenance periods. If you're deploying something that is going to seem like an outage, use a maintenance period. You don't want to page people just because of a deployment. Attempt self healing, but be careful. Let me tell you a story. For those of you familiar with Node.js and Forever, Forever is really useful because it saves you from an outage. But, if you're not monitoring Forever, what happened to us is that our application basically never worked. It was going down every five minutes. Forever came to the rescue, and we didn't see that, until obviously on our busiest day, Forever just gave up, and we had to find out the hard way.
Overcome arbitrary static thresholds, this is very interesting, because at the beginning, for example, with tools like Nagios, it's very easy to say, "Alert me if the disk usage is at 80%." Okay, but wouldn't you want to understand or get paged when the disk usage went from 10% to 70% in one minute? And because you have just a static threshold, you just lost that. I recommend you look at statistical options here like percentiles, means, outliers. Honestly, the best way is also when you have an alerting system that has a time series database, because again, you want to know how, through time, that application has been behaving and also alert based on maybe when we exceed a standard deviation many times.
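Two checks that go beyond a fixed threshold, a rate-of-change check and a standard-deviation check, can be sketched with Python's `statistics` module. The function names and the default of three standard deviations are assumptions for illustration.

```python
from statistics import mean, pstdev


def jump_alert(samples, max_increase):
    """True when the latest sample rose by more than max_increase since the previous one."""
    return len(samples) >= 2 and samples[-1] - samples[-2] > max_increase


def deviation_alert(history, latest, sigmas=3.0):
    """True when latest is more than `sigmas` standard deviations from the mean of history."""
    if len(history) < 2:
        return False  # not enough data to say what "normal" looks like
    spread = pstdev(history)
    if spread == 0:
        return latest != history[0]
    return abs(latest - mean(history)) > sigmas * spread
```

With disk usage sampled once a minute, a jump from 10% to 70% trips both checks even though it never crosses a static 80% line.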
Okay, let's do a summary. We looked at the maturity model in the form of an ongoing road map that can get us to Chaos Engineering. We looked at eight stability patterns that can be applied to software, hardware, network, so when you run your Chaos Engineering, find a problem, you have this tool in your tool belt. And finally, four ways to achieve observability and be able to answer all the questions that we have about our systems. Thank you very much. Like we say in Costa Rica, Pura Vida!