The following is a transcript from Gremlin CEO, Kolton Andrus’ talk at Chaos Conf 2018.
It’s quite an exciting event to be a part of, and I’m honored to have the opportunity. I’m honored that many of you chose to travel from around the country and around the world to share this with us. We hope and expect that this will be a great day of learning, of sharing stories, of hearing about how people have faced similar challenges to you, and found solutions.
With that, I want to jump in. I want to take you through a little bit of the evolution of Chaos. In the beginning, there was Chaos Monkey. Chaos Monkey was a very useful tool at Netflix to help prepare for the cloud. It enforced a great behavior that hosts could be rebooted from underneath you at any point. The random nature of Chaos Monkey helped impart the thought that this could happen at any time, and it’s a great starting point. To begin with Chaos Monkey you don’t need a lot of maturity. A lot of people are willing to take that first step here.
One of the things that isn’t so great about Chaos Monkey is that it’s random. I just said it was good, it’s good for enforcing behavioral change, but that can make it difficult to measure it, to understand how it impacts your systems. It’s good as an organizational influence, but it may not be the best way to run experiments. Furthermore, it’s a single failure mode. It’s an interesting one and it’s one that we should start with, but as we mature and as we go along this journey having outages due to a single host rebooting becomes less and less likely, we’re able to handle that more. The nice thing of beginning here is it helps us to start thinking about how we’re going to handle these failures. What are we going to do when this occurs? What is our response? Just having that conversation be started, having us thinking about that is a very valuable place to begin.
As we mature we need a little bit more, and so what we found is that we need to move into a more diverse set of failures. It’s not enough that simply hosts can be rebooted, we could have services where our CPU is pegged, we could have an application with a memory leak. What happens when daylight savings happens? Or there’s a leap second? What happens if our processes die? Do we reset them? Do we restart them? As funny as it is and a little bit embarrassing to say, every company and almost everyone I’ve spoken to has had an outage related to a disk filling up and it’s not being handled correctly.
As we mature we need to handle a wider set of failures, and to do that we need to take a bit more of a disciplined approach. The random approach of just breaking everything, when you start doing it across many different failure modes, would start to become very noisy, and it might be painful to our customers. The goal here and the goal of Chaos Engineering in general is to prevent customer pain, and so we want to be very thoughtful about how we’re running these experiments. In this scenario, we need to be a bit more disciplined. We’re thinking in terms of hypotheses. What do we think the system will do when it fails? We’re thinking in terms of measuring outcomes. How do we ensure that the system fails correctly? What is the correct measure of failure in our system? It’s very important to be mindful of what our customers are experiencing and being thoughtful about preventing pain to them while we’re running experiments.
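The hypothesis-and-measurement discipline described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `Experiment`, `run_experiment`, and the stand-in metric are hypothetical names, not part of any Gremlin product, and a real `measure` would query your monitoring system.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    """A hypothetical, minimal description of a chaos experiment."""
    hypothesis: str                 # what we expect the system to do
    inject: Callable[[], None]      # starts the failure
    rollback: Callable[[], None]    # stops the failure
    measure: Callable[[], float]    # customer-facing metric, e.g. error rate
    abort_threshold: float          # stop if the metric exceeds this


def run_experiment(exp: Experiment, checks: int = 5) -> bool:
    """Return True if the hypothesis held, False if we had to abort."""
    exp.inject()
    try:
        for _ in range(checks):
            if exp.measure() > exp.abort_threshold:
                return False  # customers are feeling pain: stop immediately
        return True
    finally:
        exp.rollback()  # runs whether the hypothesis held or not


state = {"failing": False}
exp = Experiment(
    hypothesis="Error rate stays under 5% while one cache node is down",
    inject=lambda: state.update(failing=True),
    rollback=lambda: state.update(failing=False),
    measure=lambda: 0.01,  # stand-in for a real monitoring query
    abort_threshold=0.05,
)
print(run_experiment(exp))  # True
```

The `finally` clause is the important design choice: the failure is rolled back even when the measurement aborts the run early.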
Really to do this, this requires that we have a degree of basic operational readiness. If we haven’t yet developed good monitoring, or good alerting then we may not be ready to cause these more in-depth failures. In fact, this becomes a great place for us to validate our operational maturity. The ability to test that we get engaged when something goes wrong, as silly as that may sound to some, I’ve been part of many outages where someone joined 20 or 30 minutes late because their engagement wasn’t set up correctly. Or the ability to test that our monitors and our dashboards track the things that we think they’re tracking. Again, it might sound silly, but I’ve seen outages where we spent 10 or 15 minutes looking in a region at a dashboard that was configured incorrectly, and it led us down the wrong path, when we could’ve been more focused on the failure that happened.
This is good and the theme here is can I handle the things that happen to me? As we mature we need to go beyond that. The next phase is really one where we handle not only our failures, but we handle our dependency failures. There’s a great story about one of our providers at Gremlin and they had a couple of hours of downtime in the last month. Unfortunately, in their incident review they essentially pointed at Azure and said, “It was their fault.”
Now, as somebody who’s written a lot of software, and focused on resilience I can understand, and empathize with that statement, but ultimately it’s not acceptable. As a customer I pay them to be available, I pay them for their service. I don’t pay Azure, that’s an indirect relationship. We as the guardians of our customer behavior, of our customer trust have to be able to handle things that happen not only to us, but things that happen to our dependencies.
How do we do this? As we mature in the way we think about things we can start to look at network failures, so what happens when a data center fails? What happens when a region fails? What occurs when there’s a network partition or a large amount of packet loss? In this case, there’s inherently a larger blast radius. If you’re new to Chaos Engineering, the concept of the blast radius is that we always want to think about the risk of running an experiment; we coined that as the blast radius. Whenever we’re running experiments we always want to try to minimize that blast radius, we always want to try to de-risk things as much as possible.
It may be worth just stating explicitly that this idea of minimizing risk is really at the heart of what Chaos Engineering’s all about. We all make trade-offs, we all have risks in our system. How we deal with those risks, and how we quantify them, and how we judge how important they are, and how much time to spend on them is a key aspect of how we learn about how our systems fail. In this case, we need something that’s a little bit more widely scoped. Network failures inherently are distributed.
It’s funny, with the rise of microservices, with the rise of distributed systems while it allows us to move more quickly, and be more agile as teams we’ve also introduced the network between everything. There’s one of the classic programmer fallacies, “The network is reliable,” and the network is not reliable. It will cause pain, it will cause trouble.
When we’re running these experiments we don’t necessarily know the knock-on effects. We could be impacting one of our dependencies, we could be testing what happens when we lose a zone or a region, but we may have an inadvertent side effect, or a knock-on effect. That’s good, we want to find those. The cascading failures and the things that could go wrong later are absolutely important to discover, but as such it requires us to take a more coordinated effort. When we’re running Chaos Monkey or we’re running infrastructure failures maybe that’s just us, and our team, but when we’re running the network failures or larger scale ones we need to be doing it more like a game day. We need to be coordinating with other teams, we need an opportunity to have many eyes keeping an eye on things. This is a great opportunity to practice, the same as practicing our operations, practicing our coordination amongst teams, because if a large-scale event happens, if there’s a network partition, then we will have to work together as a company, and this gives us that opportunity to practice.
As a little tangent, I often joke the on-call training at every company I’ve been at, essentially, amounts to, “Here’s your pager. Good luck.” Maybe there’s a dashboard over there, who knows if it’s up to date? Maybe there’s a run book. Imagine instead a world where when you joined a team they said, “Look, we’re going to help you train for operations on our team. We’re going to run an experiment or a game day, maybe we’re going to tell you about it, maybe we’re not, but we’re going to do it in a safe way. We’re gonna have a safety net. We’re gonna know what failures we’re introducing, but we want you to treat it like a real incident, to be engaged, to look at your run book, to look at your dashboard, to be able to ask questions during the day when things are a little bit more sane as opposed to in the middle of the night when everyone is urgently trying to address the problem.”
This game day process is a way for us to test how our teams work together, how our company’s organization, operational process fits together, and that’s a key aspect. Do we join a conference bridge? Do we hang out on Slack? Do we go to someone’s desk? What do we do as a company when things go wrong? It’s very valuable to be asking those questions and getting the answers. This sounds good, this sounds valuable, but there’s some toughness that comes with this, there’s some difficulties.
Let’s take a real scenario to talk about where I think this process starts to break down. We have an incident that recently occurred, and from the provenance, from the metrics we’re able to tell that it really only impacted Android users. We think we’ve got a solution that we can test a fix for, but we want to roll it out in a thoughtful way, and we want to validate it by minimizing that blast radius. What we’d really like to do is test only against Android users and not against iPhone or other mobile users, we want to isolate the traffic.
How do we do this at the network level? One of the problems with the network level is that it’s packet focused, we’re dealing a lot in terms of IP addresses and ports, these streams between services. If we’ve moved into a managed environment, if we’re using Kubernetes, if we’re using service meshes, we may not even know what the IP addresses or the ports of our services are, that may be abstracted from us. How do we construct this safe experiment using this knowledge?
The problem here is that operators don’t think in terms of packets, operators think in terms of requests, and when we look at the request level we see much more information that is available to us to make better decisions upon. What country did this request come from? What company does it belong to? What team? What user or user agent? Here, we start to have the availability to capture this information, and to be able to make more intelligent decisions. As we’re maturing we learned that the host level failures and the network level failures are necessary to building a robust and resilient system, but not sufficient. We need more, we need a finer granularity. We need to be able to control that blast radius to make experimentation easier and safer.
How do you accomplish that? One thing I’m very proud to announce today is a new product that we’ve launched at Gremlin called ALFI, Application Level Fault Injection. What’s exciting about this approach is it helps address this exact problem. We want to be able to go run very safe constrained experiments within our environments. We want to minimize the risk, and the overhead, and we want to keep this blast radius as small as possible.
With that, let me give you a little bit more information about how this works, and how you might apply this approach. The key at this level is really about validating the user experience. When your service fails, when your endpoint fails, when your application fails what does a user see? More than just ensuring that your hosts stay up, or that your service is healthy when things go wrong what is the customer going to think? What are they going to say? One of the ways that we’re able to validate this is to make it very easy for us to run these experiments, and see for ourselves what our users are seeing.
Now, in this case, it’s a bit more of an advanced technique in that we need monitoring, we need operational readiness, and we’re starting to dip our toe into the world of, what debatably might be called, observability, the ability to really go in real time and understand how our system is behaving, and what’s going wrong. How do we achieve this greater precision? The key here is these precision experiments.
There’s this concept of coordinates. At the end of the day, our applications essentially boil down to a lot of key-value pairs about how we slice and dice, and define our infrastructure. In this case, what do we have available to us that we could leverage to build safer experiments or to scope how we exercise these? There are things that occur at the application level, a user, a device type, a user agent. What AB test are they in? There are platform level concerns. What region did it come from? What service does it belong to? What dependency? Things on our side, things in our control.
Now, the nice thing, as programmers and engineers, is that this essentially results in a key-value pair that allows us to do slicing and dicing intelligently on our experimentation. Then, really anything we know about can be an attribute that we use. This is the key power, the key flexibility: at the end of the day, whatever it is that’s important to you, we, Gremlin, or I don’t need to know it. We can build it in a way that allows you to manage it.
To give you a quick code example of how this looks. Here’s an opportunity to define custom coordinates. In this case, it’s a relatively straightforward example. We care about customers, we care about devices, we care about country. One of the keys here is it doesn’t matter if your company uses customer ID, or CID, or C_ID, or device or device type, whatever it is can be available for us to exercise and to run our experiments on.
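The code shown on stage is not part of this transcript. As a stand-in, here is a minimal Python sketch of the idea of coordinates as key-value pairs. The key names (`customer_id`, `device`, `country`) and the `matches` helper are hypothetical and are not the ALFI API.

```python
from typing import Mapping

# Hypothetical coordinate names; use whatever keys your company already has
# (customer_id, CID, C_ID, device, device_type, ...).
Coordinates = Mapping[str, str]


def matches(request_coords: Coordinates, target: Coordinates) -> bool:
    """True when every key in the target has the same value on the request.

    Keys the target does not mention are ignored, so a smaller target
    matches more traffic (a larger blast radius).
    """
    return all(request_coords.get(key) == value for key, value in target.items())


request = {"customer_id": "c-1234", "device": "android", "country": "US"}

print(matches(request, {"device": "android"}))                   # True
print(matches(request, {"device": "android", "country": "DE"}))  # False
print(matches(request, {}))                                      # True
```

Because the target is just a dictionary, adding another attribute narrows the experiment without any change to the matching code.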
There’s a little bit of code integration that comes with this approach. When we’re doing application level we no longer have an agent on the box, we’re no longer directly impacting some cloud provider’s API. We need to be able to be in the application to have access to this information. Think of it like cut points or aspects, the ability to annotate a function, and be able to then arbitrarily inject failure in that function, maybe even just a subset of functions, subset of traffic.
With that, let me give you some examples, some use cases. It always works better with a story, it’s always a little bit easier to grok when you can see real-world examples of how this is leveraged. The first one is something that I’ve used several times in running experiments in the past, being very targeted. I’ve discussed a bit about this concept of the blast radius, but in this case we want to be able to run our experiments from 0 to 100 in a safe way that mitigates the risk. Remember, the goal is never to break customers. The goal here is to prevent customers from feeling pain.
The way that we do that is we first test on ourselves, we run a single experiment, and we impact our user, or our device, or a device in our control. We have a device lab and maybe we’re going and running on a specific iPhone or Android device. We’re going to see what happens when we cause a failure. We may fail a service, we may fail a region, we may fail a subset of our platform functionality. When this occurs if the user experience for us is broken, and if it fails then we know it doesn’t meet our expectations, we know it doesn’t do what we want it to. If it does meet our expectations, if we gracefully degrade, if we handle that failure without diminishing the user experience then we have an opportunity to increase that failure scope, that blast radius. Now, we run for 1% of users, or maybe 1% of users in the East region, or maybe 1% of users in the East region that are Android users. We can compose these to be very careful and thoughtful about who we’re going to impact.
Then, if that works the way we expect, we are going to be very diligent, we’re going to be measuring carefully. For that 1% of users are they experiencing pain? We may have customer success in the room, or support listening, testing to see if there’s anything going wrong. Again, if at any point things go wrong we’re done, we’ve won, we’ve found something that shouldn’t be, and we can go fix it. If it works correctly we’ll scale it more, we’ll go to 10%, we’ll go to 25%, we’ll go up to 100%.
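One way to implement this kind of 1% to 10% to 25% to 100% ramp is to hash each user ID into a stable bucket. The Python sketch below shows that approach. Hash bucketing is an assumption made for illustration, not a description of how Gremlin selects traffic, and `bucket` and `in_blast_radius` are hypothetical names.

```python
import hashlib


def bucket(user_id: str) -> int:
    """Map a user ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def in_blast_radius(user_id: str, percent: int) -> bool:
    """True if this user falls inside the first `percent` buckets."""
    return bucket(user_id) < percent


users = [f"user-{i}" for i in range(10_000)]
for percent in (1, 10, 25, 100):
    impacted = sum(in_blast_radius(u, percent) for u in users)
    print(f"{percent:>3}% target -> {impacted} of {len(users)} users impacted")
```

Because the bucket depends only on the user ID, anyone impacted at 1% is still impacted at 10% and 25%, so each step widens the same experiment rather than picking a fresh random group.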
We learn different things at different scales. At the small scale, the single request, the user, 1%, we might be learning whether or not we handle an exception, whether or not we’ve written a fallback that works with our user interface, whether or not we handle null correctly. At the large scale we’re testing a different set of criteria for our system. Do we gracefully degrade if we receive too much traffic? Do we shed load to protect ourselves? Are we a good citizen in backing off of downstream dependencies with an exponential or other back-off strategy, or do we just fast retry and bury that service? It’s very interesting and useful to be able to test these things. These are key aspects of ensuring our systems behave correctly.
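For readers who have not written one, a back-off strategy of the kind mentioned here might look like the Python sketch below. It uses exponential back-off with full jitter, which is one common choice among several. `call_with_backoff` and its parameters are hypothetical, and the `sleep` argument exists only so the waits can be observed or stubbed out.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `operation`, waiting longer (with jitter) after each failure."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # The cap grows 0.1, 0.2, 0.4, ... up to max_delay. "Full jitter"
            # picks a random wait below the cap so that many clients do not
            # retry in lockstep against a struggling dependency.
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, cap))
    raise RuntimeError("max_attempts must be at least 1")
```

The contrast with a "fast retry" is the `sleep` call: without it, every failure immediately produces another request against a service that is already in trouble.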
One of my pet peeves is timeouts. I’m a lazy engineer, so don’t take offense at this, but we tend to look at a graph, and draw a line, and say, “Nothing should ever pass this. Timeouts good, ship it.” Timeouts are built to protect us when things are going wrong, so if we haven’t actually seen how they behave when our system’s under duress, when it’s underwater, when it’s having trouble then we may not know if that timeout actually protects us, or if it’s too lenient and it still allows us to get into trouble, so this is one approach.
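Checking a timeout under injected latency can be as simple as the Python sketch below. It assumes the dependency call can be wrapped in a thread; `slow_dependency` and `fetch_with_timeout` are hypothetical stand-ins, and the delay is simulated with `time.sleep` rather than a real network fault.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def slow_dependency(injected_delay: float) -> str:
    """Stand-in for a downstream call with injected latency."""
    time.sleep(injected_delay)
    return "fresh data"


def fetch_with_timeout(injected_delay: float, timeout: float) -> str:
    """Return the dependency's answer, or a fallback if it takes too long."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(slow_dependency, injected_delay)
    try:
        return future.result(timeout=timeout)
    except FutureTimeout:
        return "cached data"  # the timeout fired and protected the caller
    finally:
        pool.shutdown(wait=False)  # do not block on the slow call


print(fetch_with_timeout(0.0, timeout=0.5))  # fresh data
print(fetch_with_timeout(0.5, timeout=0.1))  # cached data
```

Running the same check with a timeout that is too lenient (say, 5 seconds against a 0.5 second delay) returns "fresh data" only after the caller has waited the full delay, which is exactly the case a line drawn on a graph would not reveal.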
Another approach is the ability to quickly, and without a lot of overhead reproduce an outage. This is a real story, there’s a member on my team that experienced an outage, and while this outage was going on he had a good idea about what might have been happening underneath the covers, he had a hypothesis, and what he was able to do was to go create an experiment impacting only his user, and testing this hypothesis. What happens if, let’s say, it’s the identity service or something of that nature, what if that fails? Is that what would cause this?
Now, this engineer was able to go run this experiment, this is in the context of an outage, we’re 20 minutes into an outage, we’ve triaged things, we see that there’s this subset of users impacted, we’ve got this hypothesis. This engineer’s able to go and run this experiment against his own user testing this hypothesis. What happens is it fails. He pulls out his phone or he hits a webpage and he loads it, and he sees the same failures that users are seeing, so this confirms his suspicion. Now, he’s able to go log into boxes, find logs, find the metrics or the provenance that show why that failed. From that, he was able to derive what the root cause of the problem was and how to fix it. In this case, 20 minutes after the triage of the outage was done a pull request was ready to fix it.
Now, this may sound like a simple thing, but there’s something inherently powerful about empowering engineers to go, and answer these questions quickly. When we’re running game days, when we’re running large scale failure tests there’s a lot of coordination. We want to do it safely, and there’s value in that coordination, but at times we would like to quickly and easily answer questions. This could happen while we’re developing the code. Oftentimes some of this work gets left until it’s time to deploy or it’s time to harden a service, but if it’s easy to do then it may be something that your teams or your engineers are willing to do more ad hoc, or as part of the development process. It’s much easier to fix some of these things when you’re aware of them during design and early implementation as opposed to once things have been deployed.
This is something that we believe deeply in at Gremlin, making it easy for people to do the right things. I’ve seen this in action many times at companies I’ve been a part of. If you have a very clever tool and it’s difficult to use people won’t put the time into it, and they won’t get the value from it. If it’s straightforward, if it’s easy, we’ve seen a prevalence of good user interfaces and good dev ops tools, then people will be more willing to spend time because, at the end of the day, it’s a trade-off of how much time you have and what you’re able to accomplish, and so making it easy for people to quickly do their jobs, to save them time.
One of the core tenets, to me, of Chaos Engineering is that we’re investing this time up front to save time later. If we can spend 10% of our time today and not have to spend 25% of our time dealing with outages later then we’ve just saved 15% of our time, and we can focus more on our features and on the aspects of our application.
Then, there’s this new technology, it’s got a little buzz, you might’ve heard of it: Serverless, Lambda, Azure Functions. There’s a lot of interest and research going on in this space, a lot of companies are deciding and debating which part of their applications to move, and how to run in these environments. The downside of Serverless is also its upside, there’s no host to manage, there’s no host to reboot, Chaos Monkey can’t do a whole lot here, there are no processes to kill. A lot of the failure modes have been abstracted away from us, they allow us to focus on the code and on the application, but that doesn’t absolve us of having to deal with failure. Quite the opposite, the user experience is still in our hands, and whether that’s a developer’s user experience or an end user’s experience it’s key that we’re mindful of that responsibility. The ability to take this approach and apply it in a Serverless environment allows us to build trust and confidence as we migrate and move into that environment. That’s a common pattern I see for Chaos Engineering as well. People are looking at Kubernetes, they’re looking at Serverless, they’re looking at new ideas, but before they put all their eggs in one basket, move production over carte blanche, they want to ensure that it fails and behaves the way they expect. This is a great opportunity to test that.
What does this look like? How do you use it? How would you run it? I’ve got a little demo here, a recorded one to show you what it looks like in action. Here, we have an opportunity to choose what kind of failure. Do we want to fail in Lambda, in EC2? We have an opportunity to filter down what traffic we’re going to impact. In this case, we’re going to impact all zones, all regions, all instances, but we’re going to narrow it down to a specific service: only requests going from the picture service are going to be impacted, everything else will be left alone.
Then, we’re going to choose how we want to fail it. Are we failing inbound traffic, outbound traffic, database traffic? In this case, we’re going to fail HTTP traffic, any verb, but any outbound call to the pictures API is going to be impacted. In this case, we’re going to cause a 50% impact, 50% of the traffic will be impacted. In this case, we’re also going to further narrow down so that only Android traffic is receiving this pain, so everything else remains alone. We’re going to introduce some delay, 2 1/2 seconds. What do our users see? Or what do our Android users see if aspects of the application slow down, or begin to fail?
With that, let’s take a look at what this looks like in action. Here we have two phones, maybe one’s an Android, maybe one’s an iPhone. On the left we see things behaving kind of as normal. It loads, it takes a little bit of time, but on the right we see this increased delay being added for a subset of pictures, 50% are being delayed even more. From this it really helps us to understand what do our users see when things slow down? Is this an acceptable user experience? Are we comfortable with this? Since we’re talking about mobile, if you’re testing mobile applications and you’re always testing right next to the server with a very fast ping time do you know what it looks like for your customers on a 2G network on the other side of the world? It’s a useful tool to be able to answer and understand these questions.
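The experiment configured in the demo (picture service, outbound HTTP calls to the pictures API, Android only, 50% of traffic, 2.5 seconds of delay) can be written down as data plus one decision function. The Python sketch below is an illustration only: the field names, the `/pictures` path prefix, and the `platform` attribute are assumptions, not Gremlin’s actual schema.

```python
import random
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class LatencyExperiment:
    """Hypothetical spec mirroring the demo; not Gremlin's actual schema."""
    source_service: str = "picture-service"
    target_path_prefix: str = "/pictures"
    attributes: Dict[str, str] = field(
        default_factory=lambda: {"platform": "android"}
    )
    percent: float = 50.0
    delay_seconds: float = 2.5


def delay_for(
    exp: LatencyExperiment,
    service: str,
    path: str,
    request_attrs: Dict[str, str],
    rng: random.Random,
) -> float:
    """Seconds of delay to add to one outbound HTTP call (0.0 = untouched)."""
    if service != exp.source_service:
        return 0.0
    if not path.startswith(exp.target_path_prefix):
        return 0.0
    if any(request_attrs.get(k) != v for k, v in exp.attributes.items()):
        return 0.0
    if rng.random() * 100 >= exp.percent:
        return 0.0  # this call landed in the untouched half of traffic
    return exp.delay_seconds


rng = random.Random(42)
exp = LatencyExperiment()
android = {"platform": "android"}
iphone = {"platform": "ios"}
hits = sum(
    delay_for(exp, "picture-service", "/pictures/1", android, rng) > 0
    for _ in range(1000)
)
print(f"android calls delayed: {hits} of 1000")
print(delay_for(exp, "picture-service", "/pictures/1", iphone, rng))  # 0.0
```

Every filter short-circuits to "no delay", so a request has to pass all of them to be impacted. That ordering is what keeps the blast radius limited to the traffic that was explicitly selected.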
What happens when things fail? If we introduce exceptions? In this case, we want to test that we have a fall back, or that we gracefully degrade. In this case, our application doesn’t crash, the whole thing doesn’t fail, we catch these exceptions. In this case, we’re showing the handsome Gremlin mascot in its lieu, but again if we hadn’t handled this exception or null well the entire screen might’ve broken, so maybe this is appropriate, maybe this is where now we layer in retries on those failures, maybe we have fallback images we want to use. The point is, we can see how it behaved for customers and we can test and execute that.
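The fallback behavior described here, showing a placeholder instead of crashing, might be sketched in Python as follows. `load_picture`, the `fetch` callable, and the placeholder file name are hypothetical. The point is only that both an exception and a null response degrade to the same safe result.

```python
from typing import Callable, Optional

FALLBACK_IMAGE = "gremlin-mascot.png"  # hypothetical placeholder asset


def load_picture(fetch: Callable[[str], Optional[str]], picture_id: str) -> str:
    """Return the picture's URL, or a placeholder if the fetch fails or is empty."""
    try:
        url = fetch(picture_id)
    except Exception:
        return FALLBACK_IMAGE  # injected or real exception: degrade, don't crash
    return url if url else FALLBACK_IMAGE  # also covers a None or empty response


def healthy(picture_id: str) -> Optional[str]:
    return f"https://example.com/pictures/{picture_id}.png"


def failing(picture_id: str) -> Optional[str]:
    raise ConnectionError("injected failure")


def returns_null(picture_id: str) -> Optional[str]:
    return None


print(load_picture(healthy, "42"))       # https://example.com/pictures/42.png
print(load_picture(failing, "42"))       # gremlin-mascot.png
print(load_picture(returns_null, "42"))  # gremlin-mascot.png
```

Retries or a cached image could be layered in at the `except` branch without changing what the caller sees.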
I’m proud to say that ALFI is available today for any Gremlin customers or anyone that’s interested in participating in one of our free trials. Thank you. There’s been a lot of great work by my team, and I want to recognize them in bringing this to fruition and building it, making it work in a generic way that everyone can leverage, and really take advantage of the strength of this approach.
With that, I’m also proud to announce that Gremlin has closed our Series B, we’re excited to welcome Redpoint Ventures … thank you.
We’re excited to welcome Redpoint Ventures into the Gremlin family. We’re going to take this opportunity, this support and this investment to continue focusing on what we think is important at Gremlin. We believe that we win when the industry wins. We believe that when all engineers get paged less, have more resilient systems that the world’s a better place. We live in a day and an age where the Internet and our software is paramount to not just our comfort, but our daily lives. When airlines fail people get delayed. When self-driving cars fail what happens? These things are critical to us, and so this opportunity for us to continue investing in educating the industry, and teaching people what we’ve learned, and helping to build community, and build events like this is very important. This will allow us to continue pushing the state of the art in Chaos Engineering, and building resilient systems.
With that, thank you very much. I’m excited to have you all here. I think it’s going to be a great day of learning and education. Thank you all for coming.