Charity Majors: "Closing the Loop on Chaos with Observability"

The following is a transcript of Charity Majors's talk on Observability and Chaos Engineering at Chaos Conf 2018. Charity is the founder and CEO of Honeycomb.io.

Okay. Cool. I think we can all agree that chaos is a fancy marketing term for running tests somewhat later in the software development cycle than we're used to. Software engineers, engineers really hate being marketed to, except when it's done well, and they don't notice.

In my day, we called this testing in production. I don't know if you've seen my shirt, I test in prod. Testing in production has gotten a bad rap, and I blame this dude. It's a great meme, but it has this implied false dichotomy that you can only do one or the other. What if I told you, you could both be a responsible engineer who writes tests, and also responsibly test in prod?

I have a couple more memes because that really seemed to be the thing you should do with this. But honestly, our entire idea of what the software development lifecycle looks like seems to me like it's overdue for an upgrade. Because to start with, deploying code is not a switch. You don't switch it on or off. It's more like a process of baking. When you deploy you begin the very long and sometimes indefinitely long process of gaining confidence in your code under a lot of different scenarios.

This world of a switch is one that exists. I know in my managers' minds, it's a beautiful world. But actually the world looks a lot more like there's a lot of deploys going on. What do they do? Well some of them break, some of them roll back, some of them don't, there's some feature flags, there's some cherry picking, there's some shit that breaks, there's staged roll outs, and rolling out. What are we even doing here? These are various ... What are these? I don't even know.

It's clear that it's very complicated. It's not exactly just your on off switch. Observability is something that should span the entire thing, from start to finish. It's not a thing that you can just include once like, "Oh, we're done deploying this. Let's add some monitoring checks and we'll be good." Then, you understand why some people look at this and go, "And you want to add chaos to this?" Completely understand that impulse.

But why now? Why are things changing now? What's driving this? I am not a computer scientist, I'm a music major. I drew a graph for you, but it looks like this. Complexity is going up. I don't know. But let's look at some architectures. There is the humble LAMP stack, which still powers most of the world's business. Believe me, if you can solve your problems with a LAMP stack, please do so. Please.

This is Parse's infrastructure a couple years ago. That blob in the middle is a few hundred MongoDB replica sets running developers' queries from all over the world. Never do this. Somewhere up there is a bunch of containers. You could write your own JavaScript, and just upload it, and we had to just make it work and not conflict with anyone else's shitty queries. Never mind. Never do this.

Here's an actual electrical grid. Increasingly, this is what should be in your mind when you're building systems or software. There are some problems that you're only gonna find when you're hyperlocal. A tree fell over on Main Street in Dubuque, Iowa. Couldn't have predicted that. My capture replay script did not tell me this was going to happen today, but here it is.

Some problems are hyperlocal, some you can only see when you zoom way out, like if every bolt that was manufactured in 1982 is rusting five times as fast. There are all these different kinds of systemic problems at different levels. Capturing the right level of detail to be able to answer questions is challenging. But this is a parallel for this shift from monitoring systems to observability, which is really a story of shifting from known unknowns, to unknown unknowns.

Why does this matter? Well, we're all distributed systems engineers now. I think this means we get raises. I'm not sure. But the unknowns matter especially because distributed systems are incredibly hostile to being cloned, or imitated, or monitored, or staged. Trying to mirror your staging environment to production is a fool's errand. You should just give up now.

In terms of the problems that we have, basically distributed systems have this, I'm assuming a functional engineering team, which is a large assumption, I'll grant you. But if you have a functional engineering team, everything you have to care about is this infinitely long tail of things that almost never happen, except once they do. Or five different impossible things have to intersect before you see this behavior. How are you going to find that in staging? Spoiler alert, you're not.

Trying is a black hole for engineering time. I really like the way Martin Fowler said that you must be this tall to ride this ride when it comes to microservices. To me, that means that operational literacy is no longer a nice-to-have. You should never promote someone to be a senior engineer unless they know how to own their code. This is relevant because without observability, you don't actually have Chaos Engineering, you just have chaos.

Observability is how we close the loop. We've all worked with that one asshole who just runs around doing shit to prod. You're just like, "Stop it." Well, if you can't actually tell what they're doing, if you can't see the consequences of what you're doing, you've just got that guy. You don't actually have science. You don't actually have intentionality there.

When I say observability, there are a bunch of different definitions banging around out there. Some people say it's just another synonym for telemetry. I use this in a very particular technical sense. There's a definition that I want to run through with you so that you can understand what I think you need, in terms of your tooling, in order to be successful with Chaos Engineering.

Monitoring, I define the way Greg defined it, which is just, it's the action of observing and checking the behavior and outputs of a system and its components over time. You've got one piece of software watching another piece of software, just checking up on it all the time. You have to tell it what to check for. It's very valuable, but it's not observability.

Frankly, monitoring systems have not changed that much in the last 20 years. They've changed a lot, they've grown, but the principles, actionable alerts, every alert must have an action associated with it. You shouldn't have to look at the dashboards all day because your system should tell you when it has a problem. All of these are super great best practices for monitoring, but they're not observability. Monitoring itself is not enough for complex systems.

Observability, like the classic definition from control theory, is a measure of how well your internal states can be inferred from your knowledge of the external states. Like I said, I'm a music major. This doesn't mean a lot to me. But when I apply this to software engineers, it seems clear to me that it should mean, how much can you understand about the insides of your systems, just by asking questions via your tooling from the outside? Key point, without shipping new code. It's always easy to answer a new question if you ship out new code. It's much harder and more sophisticated to gather the right level of detail that will let you ask any question without shipping custom code.

You have an observable system when you can understand what's happening, just by interrogating it with your tools. This is also interesting to me because it represents a perspective shift. Monitoring has traditionally been this one piece of software checking up on the other piece of software. It's a third party observer. But observability is very much about a first-person perspective. It's about getting inside the software and making it explain itself back to you. In theory, this is a more reliable narrator. Well usually it is. It can be, let's put it that way. It can be. But why?

Well, complexity is going up and believe me, it bothers me too when people say complexity. It usually means they don't want to have to think too hard about what it is they're actually talking about. I'm gonna do it anyway. Complexity is increasing, but our tools are very much designed for this predictable world of the LAMP stack. Where you could see at a glance which component is timing out, what's at fault, a dashboard was great, you could fit it all into your brain and reason about it without having too much trouble. I love that. I love being that dude who just shows up, and gazes at a dashboard and like ... It's really fun. Doesn't really work anymore.

Also, a key characteristic of these systems is that we've built all of these tools that answer questions. They answer questions really fast, really well. But the types of problems that I feel like I've been having for the past few years are not ones where I know what the question is. I have a bunch of reports that may or may not be connected, from very unreliable narrators. I may have an intuition that something is wrong, or weird, and I'm just trying to explore my system. Just trying to figure out what the question is. If I know what the question is, I can figure out the answer real fast. But the hard part is figuring out what the question is. In other words, welcome to distributed systems. It's fine. It's all fine.

First lesson of distributed systems is this, your system is never actually up. That wonderful feeling of joy and pride that you get in your authoring, when you stare at the wall of green dashboards, it's a lie. You've always known deep down that it was a lie, but now you can't really avoid it. So many catastrophes exist in your systems right now, so sleep tight.

Testing in production. We act as though this is something we can avoid. It's not. Every single combination of this point in time of your infrastructure, this deploy artifact, this deploy script is unique. As anyone who's ever typoed code in production knows, you can only stage so much.

I've been talking about this at a very high abstract level. Let's look at some examples to illustrate the difference that I'm talking about, and the shift between monitoring and observability. These are taken from the LAMP stack. Some problems: photos are loading slowly, but only for some people. Why? Every single one of you has had to debug this problem before. Database is running out of connections, woo. Okay, cool. Let's examine some examples from Parse and Instagram.

I'm not sure what I'm supposed to monitor for here. I've got more. I can do this all day. The push notifications one remains one of my favorites. They're like, "Push is down." I'm like, "Push is not down. It's a queue, and I'm getting pushes. Ergo, push is not down." "Well, push is still down. Go and look at it."

It was because Android used to have to keep a persistent connection open to subscribe to its pushes. We had round-robin DNS, and we added some more capacity to it. It exceeded the UDP packet size in a response one day, which is fine, right? DNS is supposed to fail over to TCP, and it did. Everywhere in the world except this one route, Eastern Europe. Once again, I ask you, what exactly am I supposed to monitor for here?

You know the classic workflow: you have an outage, you call a postmortem. You're like, "What monitoring checks can we add so that we can find this problem immediately the next time? What dashboard can we create so that we can find this immediately next time?" It just falls apart. You can spend all your days creating all these dashboards and monitoring checks, but the problems never happen again. It's never gonna happen more than once. It's just, it's the wrong model. It's this model of there is a finite list of things that can go wrong, so we'll find them all and monitor for them. No. It's like an infinitely long tail. You just need to take a more open-ended, exploratory debugging approach to it.

Let's get down to bits on disk. What does your tooling need to do, and what do your teams need to do in order to support this kind of ad hoc exploratory debugging? Well first of all, this is 1000% an instrumentation game. For all the software that you control, you have the ability to get into its head and make it explain itself back to you, and your past self, and your future self, and your team. You should do that. I'm not gonna go into all the best practices for instrumentation, I will just say this. Start with wrapping every network call with the request, and for a database query, the normalized query and the time that it took, because these systems frequently loop back into themselves. When you have a distributed system that loops back into itself, as most platforms do, you can have any individual node or process infect the entire thing. Latency will go up for everything. You need to be able to trace it back to its source, so network.
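[Editor's note: a minimal sketch of the "wrap every network call" advice above, not code from the talk. The names `instrumented_call`, `normalize_query`, and the in-memory `EVENTS` sink are hypothetical, and the query normalization is deliberately naive (it only masks quoted strings and bare integers).]

```python
import re
import time
from contextlib import contextmanager

# Hypothetical in-memory sink; a real system would ship events to its
# observability backend instead.
EVENTS = []


def normalize_query(sql):
    """Replace literals with '?' so queries of the same shape group together."""
    sql = re.sub(r"'[^']*'", "?", sql)   # quoted string literals
    sql = re.sub(r"\b\d+\b", "?", sql)   # bare integer literals
    return " ".join(sql.split())         # collapse whitespace


@contextmanager
def instrumented_call(request_id, kind, **fields):
    """Time one outbound call and record a single structured event for it."""
    event = {"request_id": request_id, "kind": kind, "error": None, **fields}
    start = time.perf_counter()
    try:
        yield event  # the caller can attach more fields while the call runs
    except Exception as exc:
        event["error"] = repr(exc)
        raise
    finally:
        event["duration_ms"] = (time.perf_counter() - start) * 1000.0
        EVENTS.append(event)


# Usage: wrap a (pretend) database call made while serving request "req-123".
raw = "SELECT * FROM photos WHERE user_id = 42 AND tag = 'cats'"
with instrumented_call("req-123", "db", query=normalize_query(raw)) as ev:
    time.sleep(0.001)  # stand-in for the real network round trip
    ev["rows"] = 3
```

Because every event carries the request ID, the normalized query, and the duration, a slow request can later be traced back to the specific call that made it slow, which is the point being made about systems that loop back into themselves.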

The hardest part is figuring out where the problem lies. The hard part is not debugging the code. The hard part is figuring out which part of the code to debug. This is what observability is. It's about events, not metrics. The fatal flaw of metrics is that they discard all of the context, and re-add some in the form of tags, but what you want is basically all the metrics, but aggregated by the request ID. Because that is the perspective of the code: the request, as it traverses your code. That's all that matters to you. You don't actually care about the health of the system as an application developer. You care about the health of every single request. You care about being able to figure out why every single failure failed. That means the viewpoint of the event. You care about high cardinality.

If you ask your vendors one thing, one thing, ask them how they handle high cardinality. The Parse acquisition, it was traumatic. Around the time that Facebook acquired us, I had come to the conclusion that we had built a system that was basically undebuggable by some of the best engineers in the world. Couldn't be done by any tooling out there. We started using Scuba, which was this aggressively user-hostile tool that lets you slice and dice, in real time, on high cardinality dimensions, and basically the ability to break down by one in 10 million users, and any combination of anything else, saved our ass.

Sorry, when I say cardinality I mean, imagine you have a collection of a hundred million users. The highest cardinality dimensions would be any unique ID. Lower but still quite high would be first name and last name, very low would be gender, and presumably species equals human is the lowest of all. Out of all those fields, what is going to be useful to you in debugging? It's all the high cardinality stuff, all of it. 'Cause that's more unique. It's more identifying. Yet, metrics-based tools are shit at this. Sorry, I didn't ask about the swearing. I get excited. High cardinality is not a nice-to-have.
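[Editor's note: a small illustration of the cardinality definition above, assuming a made-up collection of 1,000 users rather than a hundred million. The field names and the helpers `cardinality` and `matching` are hypothetical.]

```python
FIRST_NAMES = ["Ana", "Ben", "Chi", "Dee"]

# Made-up data: every user has a unique ID, one of four first names,
# and the same species.
users = [
    {"user_id": i, "first_name": FIRST_NAMES[i % 4], "species": "human"}
    for i in range(1000)
]


def cardinality(rows, field):
    """Number of distinct values a field takes across all rows."""
    return len({row[field] for row in rows})


def matching(rows, field, value):
    """Rows left after filtering on one field: the remaining search space."""
    return [row for row in rows if row[field] == value]


for field in ("user_id", "first_name", "species"):
    print(field, cardinality(users, field))
```

Filtering on the high cardinality field narrows the search to a single record (`matching(users, "user_id", 617)`), while filtering on `species` leaves all 1,000 rows in play. That is the sense in which the high cardinality fields are the more identifying ones.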

Dashboards are a relic. Honestly, we love our dashboards, we love our world at a glance, you know, whatever. But as Ben says, do not confuse visualizing the search space with reducing the search space. Reducing the search space, like narrowing down where those problems are coming from, is everything. Every dashboard is basically just an artifact of some past failure. Every postmortem: "Oh, let's make a dashboard so we can find this right away next time." How many dashboards do you have after three years? How many of them still work? It's ridiculous. That's not science.

BI has nice things, we should have nice things in systems. In BI, they don't sit there with a handful of dashboards going, "Huh, which one of these corresponds to the user behavior that I'm trying to understand?" That would be ridiculous. No, they ask a question. Based on the answer, they ask another question. They follow the bread crumbs where the data takes them. Why don't we have nice things? I don't know.

Aggregation is a one-way trip, even if it's very small. Even if you have a one millisecond interval, you're still smooshing all of the thousands of events that happened in that timeframe into that one value. Once you've smooshed it, you can never unsmoosh it. In order to have observability, you have to be able to ask new questions of your raw events. Keep a sample of them if you care about cost. Of course everyone cares about cost. But you have to be able to ask new questions, because anytime that you aggregate, or index for that matter, anytime that you make these decisions about the way bits are laid down on disk that take away the flexibility of asking new questions, you've taken a step away from observability, and you've screwed your future self.
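[Editor's note: a toy illustration of why aggregation is a one-way trip, using invented numbers. Ten events land in the same interval; the field names and the helper `worst_by` are hypothetical.]

```python
from statistics import mean

# Invented raw events from one interval: nine fast requests, one very slow one.
events = [{"user_id": f"u{i}", "duration_ms": 10} for i in range(1, 10)]
events.append({"user_id": "u10", "duration_ms": 910})

# What a pre-aggregated metric keeps: one number for the whole interval.
avg_latency_ms = mean(e["duration_ms"] for e in events)


def worst_by(rows, field):
    """Group raw events by any field and keep the slowest duration per group."""
    worst = {}
    for row in rows:
        key = row[field]
        worst[key] = max(worst.get(key, 0), row["duration_ms"])
    return worst


# A question nobody planned for: which user is having a bad time?
per_user = worst_by(events, "user_id")
slowest_user = max(per_user, key=per_user.get)

print(avg_latency_ms)  # the aggregate alone cannot say who was slow
print(slowest_user)    # the raw events can
```

From the stored average alone there is no way to recover that one user saw 910 ms while everyone else saw 10 ms. With the raw events still available, the same data answers a new question by grouping on whatever field turns out to matter.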

In conclusion, you really can't expect to hunt needles if your tools don't handle things like this. That's what Chaos Engineering is all about. It's the entire philosophy of you don't know what's gonna break. You don't know what's next. You have to be prepared for everything. The practice of Chaos Engineering is just taking a step in the direction of practicing these principles and philosophies that we claim to hold. I'm a fan of it, even if it is a marketing term, 'cause I'm a CEO now. I'm supposed to love marketing. It's fine.

We spend way too much time looking at these elaborately falsified environments, and not enough time looking at prod. I really believe this. There is no substitute for looking at production. If you're not looking at it when things are okay, good luck finding problems when things aren't okay. Every engineer who is ... I'm a big fan of software ownership. To me, being a software owner means you write code, you have the ability and the permissions to deploy code, and to roll it back, and the ability and the permission to debug it in production. If you don't have any one of those three things, you're not really an owner of your software, and you're not really equipped for the next generation of computing.

Real data, real users, real traffic, real scale, real concurrency, real network, real deploys, real unpredictabilities. You can accept no substitute. Chaos Engineering is great, but this I think is at the core of what I wanted to say, is if you can't see what you're doing, don't do it. Just don't ... Fix that first. Fix the fact that you can't tell what you're doing before you actually go and add more chaos to the mix, okay. Thanks.
