Crystal Hirschorn: The Future of Chaos Engineering: In Pursuit of Unknown Unknowns

The following is a transcript of the talk given by Crystal Hirschorn, VP of Engineering at Condé Nast, at Chaos Conf 2019.

Twitter: @cfhirschorn

Crystal H.: I'm the VP of engineering at a company called Condé Nast. I'll go on in a minute a little bit about what they do. I've been doing Chaos Engineering for quite a long time. And yeah, I'm going to share some of my experiences with you, but also where do we think chaos is going to evolve to in the future, and how can we pull things from resilience engineering into our learnings and practically apply them on a day-to-day basis.

So this is one of my favorite books that talks about human factors, system safety, resilience engineering. And the person who wrote this book was Sidney Dekker. He was actually an airline pilot and also somebody who's involved in human factors and system safety.

And the quote is "Complexity doesn't allow us to think in linear, unidirectional terms along which progress or regress can be plotted." So just for a minute there, just have a think. When we get on in a minute about root cause analysis and how ... In this book it talks about we have sociotechnical systems. There are a lot of parts to it. It's not just about the systems that we build, but it's also about the organizations in which we work, the factors that sit outside of that which have constraints on us and push us in different directions ... in our managements. There are lots of different factors to this.

But also like when we think about root cause, it often pushes us in this very linear direction of trying to find a single point of failure, which is probably a wrong way of thinking of it as well. I mean, given our systems and networks, it's very unlikely that there is a single root cause.

But yes, go read this book. It's amazing.

So I work at a company called Condé Nast. Often I get the question of what is Condé Nast? He was actually the owner of the company. The company itself is 120 years old. It was a publishing company, doing things from printing press and now we're doing the whole digital transformation journey. And you'll know a lot of these titles. I suppose Wired is a key one. We do actually have a couple of technical brands in our portfolio. But there's about 60 brands.

And just to give you a sense of the scale at which we run ... Sorry if this slide induces any kind of epilepsy, by the way. I do think it's quite a beautiful graphic, but it's quite a lot. But as you can see, we are a global company. And we do have 500 million or so visitors per month across the globe. I think Dave mentioned a one-hour latency. We can't imagine the kind of latency that we have to deal with on a human scale, which is, you know, if I want to talk to my team in Japan, the Japanese engineering team, that's a 12-hour time zone difference. So it's very challenging. Sometimes resolving something can take a week, because of the amount of latency that we have.

We are running on Kubernetes. We use mainly AWS. We use some GCP for the data pipelines and data warehousing. We are also running in very challenging countries like Russia and China. We do have other CDN providers there. We're able to use AWS for now. But it's likely we'll be swapping to a multi-CDN, multi-cloud environment. And that's actually why we chose to use Kubernetes in the first place. The stack is mainly Node, React and JavaScript as well. And we use Fastly CDN in most places, which I think is also a really amazing CDN.

So my talk was about trying to identify unknown unknowns, right? So I want to try and talk to you about what this means. So a lot of you probably have seen this before. So we have things that ... In the middle we have disorder. But there are different ... This quadrant kind of shows, okay, we have the very complex, which is the unknown unknowns. We have things which are complicated, which are known unknowns. We have things that are chaotic, which are unknowables. And we have things that are simple, which are known knowns.

And kind of going through this ... So I think Dave mentioned this morning, again, unknown unknowns or the emergent practice, like our systems, our architectures, even the way that people behave within the system, are always changing. So the properties will be emergent.

Good practice is more in the complicated space. And what I would say there is maybe you could leverage something like runbooks or playbooks in this space. But playbooks and runbooks will only get you so far, that they can only ... You can only use them so much in terms of a disaster recovery.

Chaotic is more like novel practice. And what they say here is a lot of that's like tacit, implicit knowledge, not to invest too much energy in that because it's extremely unlikely that you would get much value from it.

And simple is best practice. So this is where you might have documentation, right? Or your tests, right? This is how I feel about this quadrant anyway in terms of what we do.

So what can we do to kind of experiment effectively? So there was a keynote speaker here last year, Adrian Cockcroft, who's the VP of cloud architecture at AWS, and he showed this diagram. And I love this diagram. I show it a lot in my talks. But I think this is a really interesting way of thinking of it as well. We have multiple layers. And this is, I appreciate, quite a simplified example, but still multiple layers in terms of how we would want to perform Chaos Engineering and how we have people involved in this as well. And that's where you get things like game days.

There are more tools becoming available out there. I think it's a shame that some of the bigger companies are deciding to close source rather than open source some of their tools. But there is quite a lot more out there today than what was represented here last year.

So I think there are a couple of Dungeons and Dragons fans who might recognize the dice. But this is what you would do, right? If you're going to perform chaos, you would define a steady state and your control groups. You'd form your hypothesis. You'd run an experiment. You'd verify the results. You'd tweak them in some way. You might run them again. That sounds pretty straightforward. But what about ... How do you even define the steady state? Because let's think about our architectures for a minute and how they have evolved recently as well.
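The loop described above could be sketched roughly as follows. This is a minimal illustration, not how Condé Nast or any particular tool implements it: the `Experiment` class, the `inject` and `rollback` hooks, the `measure` callable and the 10% tolerance are all assumptions made for the example.

```python
import statistics
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Experiment:
    """Hypothetical container for one chaos experiment."""
    name: str
    hypothesis: str
    inject: Callable[[], None]    # assumed hook that starts the fault
    rollback: Callable[[], None]  # assumed hook that removes the fault


def in_steady_state(samples: List[float], baseline: float, tolerance: float) -> bool:
    """True if the median sample is within `tolerance` (a fraction) of `baseline`."""
    return abs(statistics.median(samples) - baseline) <= tolerance * baseline


def run_experiment(exp: Experiment, measure: Callable[[], float],
                   baseline: float, tolerance: float = 0.10, n: int = 20) -> str:
    # 1. Control: confirm the system is in its steady state before touching it.
    control = [measure() for _ in range(n)]
    if not in_steady_state(control, baseline, tolerance):
        return "aborted: not in steady state"
    # 2. Experiment: inject the fault, always rolling back afterwards.
    exp.inject()
    try:
        observed = [measure() for _ in range(n)]
    finally:
        exp.rollback()
    # 3. Verify: did the steady state survive the fault?
    if in_steady_state(observed, baseline, tolerance):
        return "hypothesis held"
    return "hypothesis refuted"
```

The hard part the talk goes on to raise is not this loop but choosing `measure` and `baseline` in the first place, which is exactly the question of how you define steady state for a given architecture.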

So here we have the good old LAMP stack, which is perfectly fine. I think a few people have talked about monoliths today. And I think a lot of us probably still have monoliths somewhere in our organization, if not completely. But you know, it was quite a simple time, really. It's like we had the application or a set of applications running on a single database or maybe a database with other backups. We had a server, web application. It was much more easy times then.

That's not the case today as well. And what I would say as well is I think there's been a bit of hate on PHP today. I spent most of my years as an engineer writing PHP. I don't think it's terrible. But I think it probably still runs 80% of the internet, and 70% of that is probably WordPress. So microservice is great and all that. We're still getting there. And PHP isn't bad.

So how have we evolved recently? So even as far back as five years ago, 10 years ago, we were doing things more on physical machines. A lot of us talked about Bare Metal today, still running things in data center. That's very much my case now as well, is that my teams in London tend to do a lot of things. You know, cloud-first principles, cloud native. But I look after 11 teams, 12 teams in fact globally. And they're running this across the data centers and cloud as well. So it's a whole mix of things. And of course maturity as well. There's a ton of legacy there that needs to be dealt with as well.

So we have the units of scale here as well in terms of physical servers going up to machines, virtual machines, then you know, applications and containers. But even serverless, which isn't exactly serverless, it's just somebody else's servers, but we won't talk about that. And the way that we can spin these up really quickly now, and the serverless, it's like it's a different paradigm. We still have to care about that, but we need to think about these different kind of properties and characteristics of our architecture when we're designing our chaos experiments.

So this is a view of our current microservices architecture. And I just want to say one thing as well about architecture. It's never a static thing, right? I think we're always on the path of migration with architecture as well. Once we get to the end of this microservices approach, I'm sure we'll already be migrating to some other architectural paradigm. And normally in our businesses as well, we have multiple different architectures running in tandem. It's not just this one, usually. Unless you're maybe a really early-stage startup, if you have some amount of maturity and legacy, there will be multiple different paradigms.

So we started moving to this microservices approach from a service-oriented approach about sort of 18 months ago. And at the minute, this is a slightly old picture, but we have now about 40 or 50 microservices running. I was going to point out as well, this is taken from Datadog. So I do quite like their APM, which gives you this nice trace, this nice trace of the architecture.

Also running Kubernetes, we have a service mesh, which runs on the control plane. And as you can see, it's quite a complicated architecture. You've got your services which are fronted by proxies, which also get configuration from different things like their TLS certificates, their configuration and also any other type of data that it needs basically for telemetry.

And then we have serverless as well. And serverless has lots of different types of paradigms. So you can do things just synchronously, just push. You can do things asynchronously through multiple requests using Lambda functions. This is quite good for us as well because we have a lot of Node developers. And I think serverless is quite nice for application/product developers to get used to because it fits into paradigms that we're already quite used to in terms of evented architecture. And then you have things like streaming as well.

And then again, another thing that we have where I work as well, we have a big portfolio of brands that run in multiple markets across the globe. So we decided when we built our platform that we would go multi-tenant as well in terms of web application.

And I think a lot of times when I hear people talk about Chaos Engineering, they focus a lot on the lower levels of the architecture around networking, switching, infrastructure attacks that you can do. But there's a lot of complexity in the application layers themselves and even up to the, you know, right to the edge nodes as well with the CDN caching.

We're quite lucky as a company because as a publisher, our content can have very long TTLs. So we can actually serve stale for quite a long time. But it depends on your company, right? So that's the thing you have to bear in mind, is what kind of products do you have? What kind of needs do your customers have? An eCommerce business would have a very different set of needs in terms of the experiments they need to run. And recently we did have a platform outage, which I'm going to talk about in a second, but because we've gone for this multi-tenant architecture, which is great in a lot of ways, it actually ended up affecting something like 25 of our websites because of the sort of shared architecture.

And then if we go further up the stack, so micro-frontends is becoming more of a trend at the minute. And this is sort of identifying how can we take something like the server-side approach of microservices and apply it to frontends? And this is something we're starting to do as well where I work. And a lot of big companies are starting to do this as well. It's becoming ... The complexity is on both sides. It's on the client. It's on the server. And we can't keep ignoring this as well.

So I think somebody asked a really good question at one of the most recent events I was at, which was Chaos Community Day in London. And they said, "How do you chaos test frontend, like your frontend architecture?" And it got me thinking. I don't think that this question comes often enough or that people are talking about this often enough. But at the minute, yeah, as you can see, there are a lot of JavaScript frameworks which already do this.

So I'm just going to tell a quick story about a platform outage that we had at work. This is quite recent. I think this was in late August. We do have outages, of course, like everyone. But I think it's just interesting to ... I don't tend to go into Slack and follow people when it's happening. But I always love to go back and review what actually happened and just see how do people react and how do they figure out tacitly this knowledge that they have in their heads? How did they work this out and who's playing what role as the thing's unfolding?

And as you can see, somebody's suddenly saying, "Oh, I'm starting to see DNS errors." Well, for a start, hopefully we'd have some alerting in place that would have told us that first. But we all know in practice that doesn't always happen.

And then we started seeing this error 503 actually coming into the browser, which is really bad user experience. And this was affecting, again, like I said before, 25 of the sites that we run. And it was kind of intermittent, which is always the best kind of failure, right?

And then you can start to see people are saying, "Okay, well, it only affects 8.5% of the requests. Some of them are 404s. Some are 503s." And they keep going through like, okay, let's see ... You can see a gradual ramp-up. Somebody identified that it started two hours before anyone started talking about it in Slack.

And then we move it over to an incident room. So like you would do with an incident process, we move it to an incident room so that people can go in there and not cloud up another channel with the conversation. So yeah, carries on.

One thing I wanted to point out is this person ... You know how incidents go. This guy's on holiday, but he still decided to go in there and have a look. We're like moths to a flame when there is an incident. It's like, "Oh, it's quite exciting. Going to jump in there and help." But it kind of also causes more chaos as well.

So somebody actually wrote here a hypothesis. Here's a hypothesis of what's going wrong. Sort of like an out-of-memory, a crash is happening. It turns out they weren't that far wrong.

And so, yeah, different things were tried. But in the end, one of the things that we found is that Kubernetes can be a bit too ... It can make too many requests for things. And I don't know fully what happened, but it sounds like when it should have been making two requests, it was actually making about 15 to 20 requests. And it was doing this also on IPv4 and IPv6. So it was making like 10 sets of requests on both basically each time it needed to make a single request. And it just saturated DNS and just totally screwed up everything.
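One well-known way a single lookup turns into many inside Kubernetes is the resolver's search list combined with `ndots:5`, with each candidate name queried for both A (IPv4) and AAAA (IPv6) records. The talk does not confirm that this was the exact mechanism in this incident, so the sketch below is only an illustration of that general behaviour. The function name, the example hostname and the fourth search domain are assumptions, and the list shows the worst case where every earlier candidate fails to resolve.

```python
from typing import List, Tuple


def expand_queries(name: str, search_domains: List[str],
                   ndots: int = 5) -> List[Tuple[str, str]]:
    """List the (query name, record type) pairs a stub resolver may send
    for one lookup, in order, assuming no earlier candidate resolves."""
    if name.endswith("."):
        # A trailing dot marks the name as fully qualified: no search list.
        candidates = [name]
    else:
        searched = [f"{name}.{domain}." for domain in search_domains]
        absolute = [f"{name}."]
        if name.count(".") < ndots:
            # Too few dots: try every search domain before the name itself.
            candidates = searched + absolute
        else:
            candidates = absolute + searched
    # Each candidate is asked for twice: once for IPv4 and once for IPv6.
    return [(c, rtype) for c in candidates for rtype in ("A", "AAAA")]


# Typical pod search list plus one (hypothetical) node-level domain.
search = ["default.svc.cluster.local", "svc.cluster.local",
          "cluster.local", "eu-west-1.compute.internal"]
for query in expand_queries("api.example.com", search):
    print(query)
```

With four search domains, one lookup of an external name with two dots can fan out to ten queries before it succeeds, which gives a sense of how DNS can be saturated by traffic that looks modest at the application level.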

But anyway, so ... Yeah. I think in that scenario, it took quite a long time for people to play that out and to figure out where is the failure, but also like was it a single thing as well. So come on to that in a minute, like how do we identify actions? Because as you can see, people are trying to work out what's going on. They're using graphs. Some people are sort of suggesting what it could be. People are probably looking through SSH and trying different commands to figure out what's going on directly on the machines themselves. But I'll come on to that in a minute what kind of actions can you play out and how you can roll those into your experiments.

So there's been quite a lot of talk about tracing today, which is great. It's something I'm seeing more and more people pick up. And I'd say in terms of if you're going to run chaos experiments, you need to make sure that you're instrumenting your observability practices and things like tracing. It's not something that's a nice-to-have. It's a must-have. Please do not go and run chaos experiments if you don't have the observability to kind of back that up. You'll find your gaps because there will be gaps. But just don't go in there and start breaking stuff and being like, "Ah, shit. I don't know actually what happened." But yeah, you're breaking stuff. It's cool.

This is a really great blog post by the way. It's by a lady called Cindy Sridharan. I'd say follow her on Twitter. She's amazing. But also she talks about how to set up observability in distributed systems and how to do request tracing properly. So, observability. I love this tweet, by the way. So yeah, this is how I feel every time I try to figure out what the fuck's going on.

Yeah. So one thing I noticed recently, it was during this outage we had, I was going to ... So we use Datadog quite a lot. I think Datadog's really amazing but almost too powerful in some ways. It's almost too easy to create extra dashboards and different monitors. I just remember going in and thinking, I don't even know where to start. Our dashboard had grown from a set of probably like eight different visual representations of different metrics to 60. I was like, "Where do I look?" So yeah, so we have all kinds of stuff, right?

So, like, this, this is not so tricky. This, I don't know what the fuck's going on. If somebody can tell me what this is showing me and what this is about, I'd still love to know. I'm hoping that the cloud platform engineers actually know what's going on. But yeah, so there's just a whole bunch of things here. I mean, this is looking a little bit more acceptable. And I quite love this feature as well. I think somebody showed this earlier, like the flame graphs that you can get as well from Datadog in terms of request tracing. And you can find all the tagging against that as well.

So I don't know if any of you have ever watched Halt and Catch Fire. It's an amazing series. I'd recommend it. It's all about what happened in the early days of Silicon Valley and how startups started. It's not based on a real story. It's kind of a dramatization. But it's a really amazing series so I recommend going and watching it. And this guy here, he's the product manager. And they're like, "Ooh, what's a product manager"? So it was like quite a new concept.

But this is one of the quotes that he said during the series. He said, "Progress depends on us changing the world to fit us, not the other way around." And I think this is true in terms of our systems. We build the systems. We're not, as operators and builders of the systems where ... You know, somebody said earlier, "Humans are fallible." That is true. But at the same time, we need to be able to mold the systems to fit our requirements and not the other way round essentially.

And just to give you ... This also is from a recent talk that I gave. So there's this concept of the sharp end and the blunt end of the spectrum in terms of what kind of pressures and constraints influence our work and our architectures and the outputs that we create as well. And this was created by two people who are really big in the field of resilience engineering, Richard Cook and David Woods. So again, please go and read stuff from them. They've written a lot of amazing books and white papers. A lot of them are freely accessible on the internet. So do go and read them.

But so one thing that I like to think about is there is a lot of outside influences, and we can't often control those, but they're there, right? So for us, things like regulations and regulatory environments is very important when you think about China and Russia. We can't get around that. And you know, we actually tested recently ... We need to be able to run a multi-region platform with different Kubernetes clusters. And we tried to send traffic ... We just did some rudimentary testing. We tried to send traffic to China. Different, just basic tests that we could run. But we found that it dropped about 20% of our packets, which is high. So then we started thinking, okay, well, what do we need to do here? Because we didn't expect that to happen. And do we need to actually set up some sort of direct connects there in order to make that work? Which obviously is more expensive. But we have this issue.

And a story I like to tell is the CTO in our China engineering team has been to the police station four times because of things that accidentally got published on the website. And he acts like it's just no big deal. He's like, "Yeah, I've been to the police station a few times." But for me, I'm like, "Holy shit. You went to the police station because you published something on your website?" But yeah, apparently it's just like this is what happens there. Anyway.

Yeah, and then we have all these geopolitical things. And when you run a global company, geopolitical things can have a big impact, right? It could be market dynamics in terms of economics as well. Sometimes we'll see in our markets, like the economics itself is crashing and therefore it puts existential risk on that part of the company, the brands that exist there as well.

And then you have things within your company as well, like what is your management like? What are their behaviors and their values? What kind of governance do we have in place? Can we make quick decisions or is it kind of like culture by committee, which can slow things down? You know, OPEX, CAPEX tradeoffs and pressures there. Your cultural norms. You know, are you a blame culture or a blameless culture? How do you deal with accountability? That sort of thing.

And management, right? Like now, me as a manager, because I was an engineer for a very long time, it's very difficult. Sometimes I'm lacking a lot of the detail. And it really pains me sometimes to be talking to engineers and feel like I'm really lacking that information. But we try to make the decisions that we think are best, but we're often without all the information that we need.

And then, us or you as engineers here, you're at the sharp end, what they call the sharp end of the model. And I think one thing that people talk about is mental models. Our systems are now so ... We're not using LAMP stacks in most cases anymore, right? So our architectures are so complex that we can't hold them in our heads anymore. We just can't. It's just too, it's too complex. And so, the engineer that sits next to you might have a completely different view of what their mental model of the architecture looks like than you do, or even how the system's designed, what the inputs and outputs are, what the expected behavior is. And this is why things like Chaos Engineering and doing things like game days are super important.

And I think also another really key one here is esoteric knowledge. So there is a bit of research happening in resilience engineering called cognitive task analysis, which is how do people know the things that they know, and how do people gain esoteric knowledge in the work that they do, and how can we surface that up so that it becomes knowledge that we can share because often that is a very difficult thing to do.

But for me, in my company, it's like, you know, when we talk about Chaos Engineering, it's like, at what cost? And I think Gremlin have done quite a good job about talking about quantifying or using proxy metrics to kind of quantify what is the risk to your customers in terms of the monetary value. Maybe trust is another one. You have to think about what kind of customers are we talking about here, internal or external customers, as well. And they've written some really great blog posts about this.

But this is how I try to get sponsorship to do this anyway. And this is how I try to talk to my managers and the executive team in terms of getting buy-in. And I have been quite lucky anyway. I know some people here have said that as well, like "I have been quite lucky. My boss has been quite good at backing me up in order to do this." You know, I had the product director often saying, "They don't need any time. We need to just launch on this day." And my boss said, "No, just give them the time that they need in order to run their game days to make sure that they're happy with what they're doing." But yeah, you can talk to me about the long road afterwards, about how to get that.

Another thing that I think is changing and I'd like to see changing more is trying to take an alternative view and direction on post-mortems. And there's also ... I try to put a lot of research and things that I've read into my talk, by the way, as you might have noticed. But there's a guy called Steven Shorrock. Again, resilience, human factors expert who writes really, really well on this topic. And he talks about work-as-imagined versus work-as-done. But there's actually a few more things happening here. So, work-as-imagined. So we imagine how, in our minds, the work that we do or someone else's doing. There's also the work-as-prescribed by our managers or by other factors. There's work-as-disclosed, so what we might actually mention in terms of what we tell our colleagues, what we document, what goes into a post-mortem. And then there's the work that's actually done. So that's that esoteric knowledge that I talked about earlier as well, and how do we get that surfaced in a way that it becomes learning for the company?

And so, what do we do? So we started trying to change the language a little bit and call them post-incident reviews. And we actually try to make sure, not only just invite, but just make sure that there's quite a diverse audience that comes along to our reviews. And that includes people from the exec team. It can also include people, depending on what customers it affects ... Like you wouldn't normally get outside customers, but we have commercial teams that we serve, editorial teams that we serve. We create platforms for developers. So we bring along product managers, designers, UXers. Just get them in there because they will bring a unique perspective. And also what's important to the product manager might seem quite at odds with what's important to the developer and trying to, I think what somebody was saying earlier, it's like bringing business value. You know, it depends on whether or not the product manager is actually passing that information on in a way that can be understood and kind of get everybody aligned against a single mission basically and what they're doing.

So these were actually the actions that we wrote up as well. This is taken from our document. As you can see, a lot of these are technical fixes. And they're assigned to a team. And a lot of it was around CPU, DNS exhaustion. You know, I was also observing. And I'm glad that they put this one at the end, because I was thinking, Oh God, I hope they put that in there. But you know, it's like, how did we get to the point where DNS causes things to throw a 503 and that's what the user sees? Because for me that's like we're doing ... You know, we could be doing better probably at sort of error-handling and error sort of bubbling as well and sort of capturing that before that's what the actual customer sees.
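One way to keep an upstream failure from reaching the customer as a raw 503, given that publisher content can tolerate being served stale, is to fall back to the last good response. The sketch below is an assumption about what such error capturing could look like in general, not a description of the fix the team made; the `StaleOnError` class, the `fetch` callable and the staleness limit are hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


class StaleOnError:
    """Serve the last good response when the upstream fetch fails."""

    def __init__(self, fetch: Callable[[str], str], max_stale_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch                  # assumed to raise on upstream failure
        self._max_stale = max_stale_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[str, float]] = {}

    def get(self, key: str) -> Tuple[str, bool]:
        """Return (body, is_stale). Re-raises only if no usable copy exists."""
        now = self._clock()
        try:
            body = self._fetch(key)
        except Exception:
            # Broad on purpose for the sketch: any upstream failure falls back.
            cached = self._store.get(key)
            if cached is not None and now - cached[1] <= self._max_stale:
                return cached[0], True
            raise
        self._store[key] = (body, now)
        return body, False
```

The `is_stale` flag matters for observability: the customer sees a page, but the fallback should still be counted and alerted on, otherwise the failure is hidden rather than handled.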

But you need to think beyond purely technical fixes. Because I mean, that's great, right? And I think that's really important. You need to be able to go back and actually identify whether or not you actually did that work. But what else could we do? What else can we do? So, you know, one thing that we tend to do in our game ... Actually, I'll tell you a story about the first one we did.

We actually did role-playing for a long time before we actually started actually breaking anything. So we kind of created some dummy scenarios and brought everybody into a room and said, "How would you go about fixing this?" And the first thing that became pretty apparent is like there were several services that people weren't even aware of. Some people knew about them and others didn't. And the way that people would say, "This is what I would ... These are the steps that I would probably take to even identify what's going on." They were all quite different.

And it actually taught me something about the culture at that time as well. Because we actually had somebody from the ops team there and we said, "Well, what would happen?" So it's something about like an image was on the homepage of vogue.com, like the image wasn't displaying and instead it's throwing some error. And then the person tried to remediate that. But by doing that, the operator actually made it worse. And then they took down the whole site and a couple of other sites as well, given the multi-tenanted architecture. And the person who is kind of doing incident commander said, "Okay, well, what would we do?" And somebody said, "We would fire them." And I thought, No, this is exactly the opposite of the culture I want to create. But it was a good way of like learning a little bit about the culture at the same time.

But yeah, one thing I'd recommend is doing white-boarding during a review. Say, somebody, okay, could you ... And if they don't feel comfortable, you can do things like one-to-ones with people and not do it in a room. But it would be good to get to the point where you can say to somebody, "Okay, can you draw an architecture diagram of the part of the system where you think there might've been part of the failure?" And then you get somebody else to do it. And you can usually find that actually there are gaps there. And you can say, "Okay, well, maybe what we could do is get them to do a rotation on that team. Get them to pair with that team more." Find some way of bridging that knowledge gap.

A really common thing I see as well is incident management processes like we build this beautiful incident management process. It's got some sort of flow to it, like kind of a decision-tree flow of what to do. When it's actually the moment, in the moment, I find that often that process can break down. Or it's surprising, like the way that we designed it, in actuality it doesn't flow that well. And we have seen things where we've forgotten to communicate to the customers in a timely way and [inaudible 00:31:20] how you update that.

And this last one is a thing I mentioned before. Too many graphs to be able to even discern what's going on. There's just too much information. We need to kind of hone in on what's important. And sometimes if you have too much of a wealth of information, it becomes more of a hindrance than a help.

Okay, so I think a few people have mentioned Casey Rosenthal, who kind of wrote about Chaos Engineering, set the principles of Chaos Engineering. This isn't one of our pipelines. I took this from the internet. But we do use CircleCI. And CircleCI is really great because it allows development teams, product teams to create their own pipelines for CI/CD. And so we have a lot of different pipelines, as you can imagine, and different steps to those as well, because they can choose to do what they want. That does create a bit of extra complexity. But one thing that Casey talks about as well is can we build something called continuous verification into our pipelines as well? CI/CD is great, but instead of trying to validate the internals of our architecture, can we actually verify the inputs and outputs that we expect on a continuous basis and even run our experiments through that pipeline continuously?
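As a rough idea of what a continuous verification step might look like, the sketch below checks a system purely from the outside: for a given input, is the output what we expect? It is an assumption-laden illustration rather than Casey Rosenthal's or CircleCI's definition; the `Expectation` class and the `probe` callables (which here stand in for real HTTP requests) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Iterable, List


@dataclass(frozen=True)
class Expectation:
    """One black-box check: a probe and the outputs we accept from it."""
    name: str
    probe: Callable[[], int]  # assumed to return an HTTP status code
    allowed: FrozenSet[int] = frozenset({200})


def verify(expectations: Iterable[Expectation]) -> List[str]:
    """Run every probe and return a description of each violated expectation."""
    failures = []
    for e in expectations:
        try:
            status = e.probe()
        except Exception as err:
            # A probe that cannot complete counts as a failure, not a crash.
            failures.append(f"{e.name}: probe raised {err!r}")
            continue
        if status not in e.allowed:
            failures.append(
                f"{e.name}: got {status}, expected one of {sorted(e.allowed)}")
    return failures
```

A pipeline step could run `verify` on a schedule, or while a chaos experiment is in progress, and exit non-zero whenever the returned list is not empty.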

So I had ... I mean, I do look around quite a lot. But I have observed as well that there is quite a bit of tooling and there are tool chains out there. If you choose to kind of go your own path, maybe not use a commoditized, hosted service like Gremlin, it does take a bit of work. We did try this at work before we went and used Gremlin. We did try to set up a tool chain separately and see what the effort was. But one thing I saw was it relies heavily on Java. A lot of these services are Java-based, which isn't so good if you're running a Node set of applications. So, yeah, I guess one of my asks would be please contribute to the open-source community. But it would be good to see other languages being supported there a bit more as well. Yeah, we would love to do more application-level fault injection, but at the minute there isn't so much tooling there to support that.

And I wanted to plug someone that I have been speaking to at this conference and is a really sort of amazing writer on the subject and knows a lot about the subject. His name is Adrian, and he's a principal architect at AWS. And he wrote a lot about some of these things that you can do that have been open-sourced.

So there is something that AWS has, which is called Systems Manager, which you can go in, and you can actually run chaos experiments across different parts of your stack. All kinds of things like kill switches, infrastructure attacks, DNS. There are lots of different attacks. And this is something you can do if you're running on AWS that I think isn't that ... I wanted to bring it up because it's not that well-exposed. I didn't even know about it until yesterday or today. And I think this is great because this allows us to do the kind of attacks programmatically through our own teams. And actually I've been told as well you can run it on containers and on other clouds as well. It doesn't have to be just AWS.

But yeah, do find Adrian. He is here somewhere today. Yeah.

And this is one of my last slides. So I guess for me, I think it was great to hear Gremlin announce their scenarios, because this is kind of how I'd like to see things move as well, is being able to run a multi-vector attack. Because again, the root-cause fallacy, like there usually isn't a single root cause. It's normally maybe something happened. Maybe there was a gradual memory leak in your application which caused knock-on effects somewhere else and then it caused another effect over here. And actually what you want to see is maybe doing something like this, kind of hitting lots of things, either linearly or all at the same time, just concurrently, and seeing what happens basically, but also to be able to do this in a much more random way.

Because I think that the problem that we often have with Chaos Engineering is that in order to create experiments, they're fairly contrived. You have to imagine what could happen upfront. But is that true to experience? Usually the point in chaos is that it's fundamentally surprising to us. We think that we have risk under control. We think that we know what could fail. But usually in an incident, it's surprising. That's the nature of it. It should be surprising. Or even in the near misses, they say to pay attention to near misses as well, where you went ... That could've blown up into something really big. But I do think we need to get to a point where we can evolve this to be very sort of, yeah, randomly generated or stochastic as they say.
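One way to picture a randomly generated, multi-vector experiment is the sketch below. It only builds a plan and does not execute anything. The attack and target names, the `Step` and `Scenario` types, and `generate_scenario` are all hypothetical; they are not Gremlin's scenario format or any real tool's API. A seed is used so that a surprising run can be replayed afterwards.

```python
import random
from dataclasses import dataclass
from typing import List

# Hypothetical catalogues; a real setup would list what its tooling supports.
ATTACKS = ["cpu", "memory", "latency", "packet_loss", "dns", "shutdown"]
TARGETS = ["web", "api", "cache", "database"]


@dataclass(frozen=True)
class Step:
    attack: str
    target: str
    duration_seconds: int


@dataclass(frozen=True)
class Scenario:
    mode: str          # "sequential" or "concurrent"
    steps: List[Step]


def generate_scenario(seed: int, max_steps: int = 4) -> Scenario:
    """Build a random multi-vector scenario that is reproducible from its seed."""
    rng = random.Random(seed)
    mode = rng.choice(["sequential", "concurrent"])
    count = rng.randint(2, max_steps)  # at least two vectors per scenario
    steps = [
        Step(rng.choice(ATTACKS), rng.choice(TARGETS), rng.randint(30, 300))
        for _ in range(count)
    ]
    return Scenario(mode, steps)


scenario = generate_scenario(seed=2019)
```

Recording the seed alongside the results means the same combination of faults can be regenerated when something unexpected shows up.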

And this is my last slide. So any time you hear this song again, I hope this is what you hear. "It's stochastic. It's fantastic." But yeah. And yeah, I think that's it for me. So, yeah. Thank you.

See our recap of the entire Chaos Conf 2019 event.
