Podcast: Break Things on Purpose | Omar Marrero, Chaos and Performance Engineering Lead at Kessel Run
In this episode, we chat with Omar Marrero, Chaos and Performance Engineering Lead at Kessel Run, a company at the forefront of delivering “combat capability that can sense and respond to any conflict in any domain, anytime, anywhere.” To say that Omar and Kessel Run are at the forefront is an understatement. Join the conversation as Omar and Jason discuss bringing chaos into the DOD, bringing the best software possible to the warfighters, convincing the DOD to get on board with chaos engineering, and more! Tune in for the rest.
In this episode, we cover:
- What Kessel Run is Doing: 00:01:27
- Failure Never has a Single Point: 00:05:50
- Lessons Learned: 00:10:50
- Working with the DOD: 00:13:40
- Automation and Tools: 00:18:02
- Kessel Run: https://kesselrun.af.mil
- Kessel Run LinkedIn: https://www.linkedin.com/company/kesselrun/
Omar: But I’ll answer as much as I can. And we’ll go from there.
Jason: Yeah. Awesome. No spilling state secrets or highly classified info.
Jason: Welcome to Break Things on Purpose, a podcast about chaos engineering and building reliable systems.
Jason: Welcome back to Break Things on Purpose. Today with us we have guest Omar Marrero. Omar, welcome to the show.
Omar: Thank you. Thank you, man. Yeah, happy to be here.
Jason: Yeah. So, you’ve been doing a ton of interesting work, and you’ve got a long history. For our listeners, why don’t you tell us a little bit more about yourself? Who are you? What do you do?
Omar: I’ve been in the military, I guess, public service for a while. So, I was military before, left that and now I’ve joined as a government employee. I love what I do. I love serving the country and supporting the warfighters, making sure they have the tools. And throughout my career, it’s been basically building tools for them, everything they need to make their stuff happen.
And that’s what drives me. That’s my passion. If you’ve got the tool to do your mission, I’m in and I’ll make that happen. That’s kind of what I’ve done for the whole of my career, and chaos has always been involved there in some fashion. Yeah, it’s been a pretty cool run.
Jason: So, you’re currently doing this at a company called Kessel Run. Tell us a little bit more about Kessel Run.
Omar: So, we deliver combat capability that can sense or respond to conflict in any domain, anywhere, any time. Or deliver award-winning software that our warfighters love. So, Kessel Run’s kind of… you might think of it as a software factory within the DOD. So, the whole creation of Kessel Run is to deliver quickly, fast. If you follow the news, you know DOD follows waterfall a little bit.
So, the whole creation of Kessel Run was to change that model. And that’s what we do. We deliver continuously non-stop. Our users give us feedback and within hours, they got it. So, that’s the nature behind Kessel Run. It’s like a hybrid acquisition model within the government.
Jason: So, I’m curious then, I mean, you obviously aren’t responsible for the company naming, but I’m sure many of our listeners being Star Wars fans are like, “Oh, that sounds familiar.”
Omar: Yep, yep.
Jason: If you haven’t checked out Kessel Run’s website, you should go do that; they have a really cool logo. I’m guessing that relates to just the story of Kessel Run being like, doing it really fast and having that velocity, and so bringing that to the DOD, is that the connection?
Omar: Actually, it goes into the smuggling DevSecOps into the DOD, so the 12 parsecs. So, that’s where it comes from. So, we are smuggling that DevSecOps into the DOD; we’re changing that model. So, that’s where it comes from.
Jason: I love that idea of we’re going to take this thing and smuggle it in, and that rebellious nature. I think that dovetails nicely into the work that you’ve been doing with chaos engineering. And I’m curious, how did you get into chaos engineering? Where did you get your start?
Omar: I’ve been breaking things forever. So, delivering tools that our warfighters can use, that’s been my jam. So, I’ve been doing, you can say, chaos forever. I used to walk around, unplug power cables, network cables, turn down [WAN 00:03:24]. Yeah, that was it.
Because we used to build these tools and they’re like, “Oh, I wonder if this happens.” “All right, let’s test it out. Why not?” Pull the cable and everybody would scream and say, “What are you doing?” It was like, “We figured it out.”
But yeah, I’ve been following chaos engineering for a while, ever since Netflix started doing it and Chaos Monkey came out and whatnot, so that’s been something that’s always been on my mind. It’s like, “Ah, this would be cool to bring into the DOD.” And Kessel Run just made that happen. Kessel Run, the way we build tools, our distributed system was like, “Yep, this is the prime time to bring chaos into the DOD.” And Kessel Run just adopted it.
I tossed out the idea, I was like, “Hey, we should bring chaos into Kessel Run.” And we slowly started ramping up, and we built a team for it; the team is called Bowcaster. So, we follow the breaking stuff. And that’s it. So, we’ve matured, and we’ve deployed and, of course, we’ve learned how to deploy chaos in our different environments. And I mean, yeah, it’s been a cool run.
Jason: Yeah, I’m curious. You mentioned starting off simply, and that’s always what we recommend to people to do. Tell us a little bit more about that. What were some of the tests that you ran then, and then maybe how have they matured, and what have you moved into?
Omar: So, our first couple of tests were very simple. Hey, we’re going to test a database failover, and it was really manual at that point. We would literally go in and turn off Database A and see what happened. So, it was very basic, very manual work. We used to record them so we can show them off like, “Hey, check this out. This is what we did.”
So, from there, we matured. We got a little bit more complex. We eventually got to the point where we were actually corrupting databases in production and seeing what happens. You should have seen everybody’s faces when we proposed that. So, from there, we’re running basically, we call it ‘Chaos Plus’ in Kessel Run.
So, we’ve taken chaos engineering, the concept of chaos engineering, right, breaking things on purpose, but we’ve added performance engineering on top of it, and we’ve added cybersecurity testing on top of it. So, we can run a degraded system, and at the same time say, “All right, so we’re going to ramp up and see what a million users does to our app while it’s fully degraded.” And then we would bring in our cyber team and say, “All right, our system is degraded. See if you can find a vulnerability in it.” So, we’ve kind of evolved.
And I call it, put chaos on a little bit of steroids here. But we call it Chaos Plus; that’s our thing. We’ve recently added fuzzing while we’re doing chaos. So, now we got performance chaos, our cyber team, and we’re fuzzing the systems. So, I’m just going to keep going until somebody screams at me and says, “Omar, that’s too much.” But that’s essentially a little bit of our ride in Kessel Run.
Jason: That’s amazing. I love that idea of we’re going to do this test, and then we’re going to see what else can happen. One of the things that I’ve been chatting with a bunch of folks recently about is this idea, we always talk about, especially in the resilience engineering space, that failure never has a single point. It’s not a singular root cause; it’s always contributing factors. And the problem is, when you’re doing chaos engineering, you’re usually testing one thing.
And then it’s like, “Oh, I did the failover on that database and that worked.” I’ve been suggesting that people now start to ask, “Well, if this is in a degraded state, and that’s still working, what are the contributing factors that can lead to a major catastrophe?” That’s one of the nice things that actually performing these failures allows you to do rather than just imagining them and trying to work up some sort of response process to your imagination.
Omar: That’s our thing. So, from our perspective, that’s what I charge the team to do is like, “Hey, we need to make sure these things are working.” Comes back to my passion, right? We’re delivering tools to the warfighters; the warfighter needs to have tools that work. And that’s what Kessel Run does; that’s what Kessel Run exists for.
We deliver that award-winning software that our airmen love. So, following that trend, that’s where chaos comes in place. So, we’re building fancy tools, and we got an awesome platform that supports it and all that stuff. We’re just there to make sure, “Hey yeah, this is engineered correctly. It’s responsive to fault or any kind of failure.” And we just—I mean, we’re literally blasting it with anything we can imagine to make sure it could support that.
Jason: I’m curious if you could dive into some details about one of your recent chaos engineering experiments. Was anything unusual or unexpected? And what did you learn from it?
Omar: So, I think one of the cool ones, which is the latest one, was that database corruption. There were a lot of questions on, “Hey, we have some tools in place we built. The engineering is in place to make sure that if the database goes down, nothing is impacting our system and whatnot. What would happen if the database gets corrupted?” For some odd reason. I don’t know, that’s probably going to happen once in a million, I don’t know.
But it’s like, “Hey, let’s figure it out.” So, my team came up with an experiment; we went and we started corrupting databases in staging. It’s like, “All right yeah, that was cool.” Oh, and then we went to the leader, she was like, “Hey, we want to do this in production and call an outage and see how the team responds.” And at the same time, we’re going to throw a whole bunch of curves.
We’re going to disappear key people, we’re going to make sure you don’t have access to certain things. It was not just database corruption; we’re going to throw curveballs at you like there’s no tomorrow here. So, we did, and it was actually a pretty good experience. We figured out, hey, yeah, the database corruption just happens, and the team, our SRE team, actually figured it out. It took them a little bit because there were a lot of curveballs, but we learned: all right, if this does happen and we have all these issues happening at once, it’s probably a non-realistic, I’d call it, fire drill, but it’s something we’ve got to prepare for just in case.
We’ve learned from it and we actually practiced it again. So, from the initial time it took us to go through the curveballs, we did another one, threw different curveballs at them, and that was like a no-brainer. They’re like, “Yep. We got this. Don’t worry about it. We ran this through once, so we know.”
Which is why we do these things. You want to practice and then, if there’s an outage, shorten the time, make sure it’s not impacting. What was really cool to see is, like, it didn’t matter how many databases we corrupted and how many curveballs we threw at the system, there was never an impact to the end-user, which is the goal. We practice chaos to make sure that it’s always working. So, we validated that our system can tolerate all these curveballs and all these things we were doing at it. And it’s something that we’ve never tried before, so it was pretty cool.
Jason: I love that you mentioned what you threw at people was maybe not realistic, it’s not something that would happen in the real world, but I think it brings up that idea of training for things: if you’re an athlete and you train harder than you normally would in a game, and you’re constantly stressing yourself, then when it comes to that real-world situation, it just seems easy.
Omar: Yeah. And that’s what the SRE team—because we do the normal, “Hey, we did the test,” and then we go, [it’s like 00:10:31], “This is what we saw.” And then we actually asked for feedback from the teams. It’s like, “Any way we could have done this test better?” The normal process.
And they’re like, “We loved this. We’ve learned so much that helps us either automate more scripts or streamline our process.” So, from our standpoint, we’ll keep throwing curveballs. And I think they did that, aside from, hey, this is a very realistic scenario, and then we go to the—this is probably a little bit over the edge, but we still want to do it. We do both. It’s good.
Plus, it doesn’t keep a same [unintelligible 00:11:04]. We’re used to it. All of a sudden you’re throwing all these curveballs at the team, they can nitpick from all these lessons learned and put better processes in place, make it faster, better engineering. The team’s awesome. All the team that supports Kessel Run, our SRE team, our platform team, everybody’s super smart, super amazing, and I’m just there to test their ability to respond. Which is why I like my job.
Jason: You mentioned lessons learned, and I’m curious, as somebody who’s been doing chaos engineering for quite a long time, actually, what are some of the top lessons that you would give, or the top advice you would give to our listeners as they start to do chaos engineering?
Omar: I would say: you start simple, and that’s key. You start simple. If you mention chaos to somebody who’s not familiar, the first thing they’re going to do is they’re going to Google ‘chaos engineering,’ and what they’re going to find is Netflix and Chaos Monkey. That’s an awesome tool, but do your research, figure out what other people are doing, and get involved in the chaos community; there’s a lot of people doing some cool stuff.
Start with a small test so you can see and get the data from there, and scale up. As you learn and as you go, you scale up. And it helps, because chaos sometimes—for the most part—scares your senior leadership, because you’re telling them, “Hey, I’m going to come in and break stuff.” So, doing small-scale tests allows you to prove and show, hey, this is why it’s beneficial.
The actual event is not chaos. We call it chaos engineering, but the actual event is very controlled. We know what we’re doing, we’re watching, we have somebody in place to say stop in case things are going haywire. So, you have to explain that while you’re doing it. And just do it; it’s just like testing, you have to test your applications, and the more testing you do, the better.
The closer you shift left, the better, too, but you have to test. You got to make sure your apps are working. So, chaos engineering is just another flavor of that. The word chaos usually scares people. So, you just got to slowly do it and show them the value of doing chaos.
Hey, you’re doing chaos, this is what it brings. Hey, we just proved your database can failover. That’s a good thing. And if it didn’t fail over, it’s like, how can we make it happen? So, that’s a small-scale test that provides that feedback and data you need to say, this is why we have to adopt chaos engineering.
And as you go, go crazy, right? As your leadership allows you to do stuff, it’s like, yeah, let’s just do it. And work with your teams. Work with the SRE teams, work with the app teams and get feedback. What do you need?
What is your biggest problem? That’s one thing I ask my team to do. So, every month, they go to the team and say, “All right, so what’s your biggest hurdle? What right now is your—why don’t you sleep?” And we go, “Okay, can we replicate ‘the why don’t you sleep’ so we can let you sleep?”
So, that’s an approach that’s worked for us. And a whole bunch of our tests are based on that. It’s like, okay, “What keeps you up at night?” We’ll test it so you can sleep. And then next month, give me the next thing that keeps you up at night. And we go in and we test it.
Jason: And like that iterative approach of let’s work on, what’s your biggest pain point? What keeps you up at night? And then let’s solve that. And then what’s the next thing? And keep working down that chain until, hopefully, nothing keeps you up at night.
Omar: Yeah, that’ll be good. We all sleep and it’s like, “Oh, this thing’s on cruise control. Let’s go.”
Jason: You mentioned convincing management or the upper levels of management in allowing you to do this. What’s that process like at Kessel Run? And then, what’s that process look like as Kessel Run convinces the broader Department of Defense to adopt this?
Omar: Oh, that’s a fun one. Yeah, so when we first brought it up, we got the, “What are you trying to do?” look—because it was like, “Hey, we want to do chaos engineering.” It was like, “Okay, yeah, we’ve heard a little bit about this. What does that mean?”
It’s like, “I’m just going to break stuff.” Which probably wasn’t the smart approach at the moment, but that’s what I said. And they’re like, “No, wait. What do you mean?” And I’m like, “Yeah, and eventually I want to do it in production.”
So, I just went all out. That was my presentation. You know, I’ve learned from that. It’s like, okay, baby steps, Omar. But initially, it was like, “I want to do chaos and I want to get to production.” They were like, “Yeah, sounds good, but I need a plan.” I was like, “Okay. I’ll come up with a plan. And we’ll figure it out.”
And so that’s how we slowly started. And I stood up the team, Bowcaster, and from there we kind of, all right, how do we show the value of chaos engineering? How do we learn chaos and all that stuff? So, it was easy to get them to adopt it. It was the actual execution of tests that was a concern.
Because there were a lot of unknowns. We didn’t know what we were going to break or how it was going to react. And how do we actually do this? And we slowly just kind of did those little tests. It was like, all right, we’re going to do this, we’re going to do that. And that’s how we got it.
And now that we’re moving to the rest of the DOD, that’s a really cool adventure because our framework, what Bowcaster has built in Kessel Run, is what they want to move to the rest of the DOD. So, the Chaos Plus model is what’s interesting. The fact that it’s moving to the rest of the DOD is very cool because it’s something I believe should be in the rest of the DOD. And we’re happy to experiment. From the Kessel Run perspective, that’s what we’re here for.
We’ll experiment and we’ll let you know what fails and what doesn’t fail because we’re an experimental lab. And, yeah. But the senior leadership in the DOD in charge of all the software development and stuff like that, they’re all over it. They just want to—hey, how do we make it happen? What do you need?
You’ll see there’s a different mind change now that chaos engineering is more familiar around the DOD and the tech space. “Hey, yeah. This thing called chaos engineering.” It’s not just, yeah, Netflix does chaos engineering. It’s like, yeah, everybody’s doing chaos engineering.
So, you see the little mind shift from, initially, when I brought it in. It was like, “Hey, I want to break stuff in production.” And everybody’s like, “Whoa, hold up there. There’s [no 00:17:05] baby steps here, Omar.” Now, it’s like, “Hey, let’s go and do it.” It’s like, “Yeah, let’s do it. How do we execute? But let’s do it.” So, it’s a very cool thing to see.
Jason: I’m wondering if maybe that readiness to adopt things like this comes from your time in the military. I haven’t served, but from what I understand, the military takes testing really, really seriously. And in some cases, not production testing, of course; we don’t start wars just to train the military, but there is the idea of things like live-fire testing. Do existing practices within the military influence the perception of chaos engineering and help people actually understand it better, maybe more so than with standard civilians and corporate enterprise?
Omar: Yes. Testing is very important in our systems. So, it’s a different mindset, I would say. Because in the corporate world, it’s all about the money, making the system work and making sure it’s not going down because you lose profit. Or if you’re—that’s the mindset on that one.
For us, we are in charge of defending the nation, so our system has to be proven and ready to rock within seconds. So, we do a lot of tests, and chaos engineering is just one extra layer to those tests. And now that we are moving to this massive DevSecOps transformation, chaos engineering is key. There’s no way we can do this without having chaos engineering involved. So, that’s what our senior leadership is pushing.
Hey, yeah, this is another flavor of testing. It’s important because we’re building distributed complex systems across the cloud and whatnot, to support the DOD mission. So, chaos engineering is there. Same thing with the live-fire testing. We got to do live-fire testing to make sure that the ammunition is working, and the guns are working, and everything’s working right. This is just a different flavor of live fire testing, just on software, and applications, and infrastructure, and the whole deal.
Jason: You mentioned running game days and throwing curveballs, and that sounds like more of a manual game day where you’ve got people running the attacks and people responding. You’ve mentioned Kessel Run and really that velocity, and getting faster at things, and automating. Have you started automating the chaos engineering process as well?
Omar: So, we have and we’re following the same approach as when we started. So, the baby steps approach. So, we are going to slowly work with the SRE team to automate some of these tests. And that’s ongoing. My team’s working on it right now, so we’re getting there.
It’s part of our slow-learning kind of process. The manual game days won’t stop. Those will keep going because of the curveballs we want to keep throwing at the teams, but the automation is coming. The idea is to get chaos engineering as close to the dev cycle as we can, so shift left as much as possible. And that’s our next goal.
So, we’re working on that. And I think a lot of it comes down to where do we do it. So, we work in different environments. It’s not just what we call the internet right now. We have different environments, so how do we automate across all environments?
And part of it is how we architect that so it works. So, if we make it work in one environment, how does it work in all the environments? So, that’s usually where our timelines are. So, trying to make sure that our architecture supports all environments versus having to spend a lot of resources, you know, all right, we’re going to engineer one environment, then we’ve got to engineer another environment, then another environment. We want to make sure it just works out of the box, here we go. But that is part of our goal, and we are starting with baby steps, so the database failover test is probably the one we will automate first.
Jason: As you’ve done chaos engineering, you’re doing the game days manually; what was the process like in terms of tools and adoption? I think a lot of people start off and they hear of Chaos Monkey and so they immediately jump over and, “Cool, let me grab Chaos Monkey and see if I can use that.” Any listeners that have tried that have probably quickly recognized that that tool is not so great for public consumption; it was very much designed for Netflix. So, I’m curious if you could tell me more about your tools adoption. What have you used? What are you using now? What does that evolution look like?
Omar: Yeah, so we actually—the first thing I told my team was you are going to research tools. [laugh]. I know Chaos Monkey is out there, but I’m like, there’s definitely more tools that we should look at. I’m sure there’s been a whole bunch of tools created, depending on our platform. And that’s what they did.
So, they went and they researched a whole bunch of tools. And they came back and they presented the tools they wanted to use, or kind of just integrate into our architecture. When the team started, right, so when we started that chaos team, the Bowcaster team was supposed to focus just on chaos engineering, but the more I kept thinking about it, it was like we need to focus on chaos and some other stuff. So, that’s where the performance engineering and the fuzzing came into play, and bringing the cyber team into the game. So, from a tool perspective, when you look at us, Bowcaster the team is also the tool.
So, they have a tool, Bowcaster is the tool that we deploy across KR to do chaos engineering. Now, within that tool or that framework, there’s the tools behind it. And there’s a combination of open-source tools and other tools that we do there, but those just provide the engine for us to perform all of our tests on what we call Chaos Plus. So, Bowcaster is our tool. Yeah, it’s the team and the tool is kind of weird, right?
But the team and the tool, so when you go into KR and you say, “Hey, I want to do chaos engineering.” It’s like, “All right. Go do chaos engineering with the Bowcaster tool that the Bowcaster team built.” But the architecture behind that, there’s a lot of tools. And that was the task I gave the team.
It’s like, “I need you to research tools. I know Chaos Monkey is out there. I know Simian Army, I know all these tools that originally come up when you Google.” It’s like, if Netflix created it, that’s the first thing that comes up. But there has to be more, especially in the Kubernetes world. There’s a whole bunch of tools. So, that’s what they did, and we took a combination of those tools and we built Bowcaster. And that’s what we got.
Jason: That’s an excellent point, though, about not just a chaos engineering tool. And I think a lot of times when people think of chaos engineering because it’s chaos engineering it sounds like this well-defined practice of, this is it. If you have chaos engineering, you must have chaos engineers, and so it seems siloed when in actuality, it’s just one of many practices that SREs and DevOps and all engineers should practice. So, this idea of, we’re going to build a tool that has not just the chaos engineering, but all of these other things that you need, and providing that as a service is, I think, a fantastic idea.
Omar: That’s always been the charter I’ve given the teams. Yes, we want to do chaos engineering; chaos engineering is awesome. We all dig it, we preach it, we’re huge advocates of it, but what else can we provide? I mean, we’re already degrading the system, so what else can we test? [unintelligible 00:24:25] break the system and blast it with a million users and see what happens. And it’s like, “All right, systems degraded; we’re blasting it. Let’s see if we can hack it.”
And maybe while that’s degraded and getting blasted, maybe we figure out there’s a vulnerability or something. So, that’s always been the concept. It’s like putting chaos engineering a little bit on steroids, we call it. And that’s what Bowcaster does. Bowcaster’s job is to build these things and support it.
And I’m sure we’ll come up with other crazy stuff as we get feedback from teams, like, “Hey, it would be cool if you can do this.” And we’ll just build it into our framework and it will just be another service that Bowcaster provides aside from performance and chaos engineering.
Jason: Omar, thanks for coming on the show. Fantastic information. It’s inspiring to see the journey of where you’ve come from and where you’re headed, especially with the Bowcaster team at Kessel Run. Before we go, though, I wanted to ask, do you have anything that you want to plug or promote, job openings, upcoming speaking? Where can people find you on the internet to learn more about the stuff you’ve been doing?
Omar: So, Kessel Run, very active, so you can find us at LinkedIn: Kessel Run, or just go to our site, kesselrun.af.mil and you’ll find a whole bunch of information there, careers, so if you’re interested come work, we’re cool people. I promise we do cool stuff.
And if you come work for Bowcaster, we’ll hire you and you can break stuff with us, which is why we—can’t get better than that, right? Yeah, come check us out, kesselrun.af.mil. Lots of information there, careers, you can follow us and yeah.
Jason: Awesome. Thanks again for coming on the show.
Jason: For links to all the information mentioned, visit our website at gremlin.com/podcast. If you liked this episode, subscribe to the Break Things on Purpose podcast on Spotify, Apple Podcasts, or your favorite podcast platform. Our theme song is called Battle of Pogs by Komiku and is available on loyaltyfreakmusic.com.