Break Things on Purpose is a podcast for all-things Chaos Engineering. Check out our latest episode below.
You can subscribe to Break Things on Purpose wherever you get your podcasts.
In this episode of “Break Things on Purpose,” the podcast all about Chaos Engineering, we speak with Matthew Simons, Senior Product Development Manager at Workiva.
Rich Burroughs: Hi, I’m Rich Burroughs and I’m a Community Manager at Gremlin.
Jacob Plicque: I’m Jacob Plicque, a Solutions Architect at Gremlin, and welcome to Break Things on Purpose, a podcast about Chaos Engineering.
Rich Burroughs: Welcome to episode seven. In this episode we speak with Matthew Simons from Workiva. Matthew is focused on quality and reliability in Workiva services and we had a great discussion about reliability and customer focus. Jacob, what jumps out to you from our chat with Matthew?
Jacob Plicque: Matt is just super awesome and the way that Workiva’s doing Chaos Engineering just feels different. His team is sort of guns for hire, helping other teams get better and the way they go about that I think is really super interesting. How about you?
Rich Burroughs: I really liked hearing about his team’s experience implementing Chaos Engineering in a shop with a very mature, existing product. I expect most people starting to do Chaos Engineering aren’t doing it on greenfield projects. They want to experiment on the apps and the services that are making their company money today.
Jacob Plicque: Could not agree more. A quick reminder, we’re on Twitter at BTOPPod. That’s B-T-O-P-P-O-D and if you’d like to give us feedback on the podcast, our email is firstname.lastname@example.org.
Rich Burroughs: Great. Let’s go to the interview.
Rich Burroughs: Today we’re speaking with Matthew Simons. Matthew is a Senior Product Development Manager at Workiva and he leads the Quality Assessment team there. Welcome.
Matthew Simons: Hi, good to be here.
Jacob Plicque: Awesome. Yes, thanks for joining us. Well, kicking off with Rich’s introduction, why don’t you tell us a little bit about Workiva’s Quality Assessment team and maybe overall what does Workiva do as a company?
Matthew Simons: Yeah, so I’ll start with the second one first. What we do as a company, that’s a little easier to describe. We basically provide reporting solutions, compliance reporting solutions for companies. We’re business to business and what that means is there are different regulatory compliance requirements on companies depending on how big you are and what you do. We facilitate the creation and the electronic filing of those reports. The biggest solution that we have is our SEC filing product. Companies that need to file quarterly or yearly earnings reports with the SEC, the Securities and Exchange Commission, will create those reports and then file them electronically through our platform, Wdesk. In a very simplistic way, I guess you could say we’re like TurboTax for big companies, but there’s a lot more than that.
Matthew Simons: It’s connected reporting. It’s every piece of data within our entire platform between documents and presentations and spreadsheets can all be linked to different places because a lot of these reports they are presented in like a text format with embedded tables and things like that and the numbers come from layers and layers of spreadsheets and formulas that get baked up. We make it so that at the very last minute you can change something that’s eight layers deep in spreadsheets and that change will trickle all the way through the system and update documents and documents and even up to your filing documents where numbers are written out in word form and… The technology behind it is pretty cool but it’s not a sexy product in a lot of ways.
Rich Burroughs: But it sounds very important, that people are using this for pretty important stuff.
Matthew Simons: It is. Our mission is really to provide more transparency in financial markets. Our whole goal is to… We operate under this thesis that if we had been around in 2008 or prior to 2008 that maybe we could have helped prevent some of the financial events that took us into a depression, right? That a lot of the problems that we have with financial markets are a lack of observability and transparency in those markets. By providing a reporting platform that is very transparent and connected, that we can increase that observability into that process.
Rich Burroughs: And your team, the Quality Assessment team.
Matthew Simons: Yeah, so the Quality Assessment team is an unorthodox thing. To put it mildly, we’re an experiment. I have a small team, I’ve got three engineers. We are equal parts quality guns for hire, quality standards consultants, and the inquisition. Although we try not to… it’s actually the opposite of the perception we try and put out there. What we achieve, we achieve largely through collaboration and relationship building. That’s absolutely a joke to say that we’re the inquisition.
Rich Burroughs: So nobody expects the Quality Assessment team?
Matthew Simons: That’s right.
Jacob Plicque: Rich you beat me to that by like two seconds.
Matthew Simons: Oh my gosh. We’ll get t-shirts like that eventually.
Rich Burroughs: It’s interesting to me though because typically it seems that Chaos Engineering projects are owned by operations people or SREs, but you all are focused on the overall quality of the company’s products, including the reliability.
Matthew Simons: We are. That’s actually where our pedigree lies. We come from the infrastructure side of the house where our team right now… I’ve been involved in the past in our infrastructure, I’ve worn a lot of different hats there from reliability engineer to engineering manager over various squads of engineers that have worked on products in our infrastructure department. Other engineers on the team come from SRE, from test engineering and from that kind of product focused infrastructure side. That’s definitely our pedigree. But what we’re focused on is very much a holistic platform level. But where we find ourselves helping out the most is really in that infrastructure quality side of things. We’re running microservices on Kubernetes deployed on AWS and there’s a lot of practices on how to get things.
Matthew Simons: There’s simple things like helping people get better load testing, helping people make sure that their Helm charts and configurations for Kubernetes are set correctly, that they’re requesting the right amount of resources, that they have good horizontal scaling strategies. A lot of this is fairly basic stuff, but it’s really how we ensure that we have standards across a large microservices platform and that translates to reliability.
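Editorial sketch: the kind of configuration check Matthew describes could look something like the following. This is not Workiva's tooling. The function name `find_missing_requests` and the sample manifest are hypothetical and exist only for illustration. The one thing taken from Kubernetes itself is the Deployment layout, where each container may declare `resources.requests` for `cpu` and `memory`.

```python
def find_missing_requests(deployment: dict) -> list[str]:
    """Return the names of containers that lack a CPU or memory request."""
    # Walk the standard Deployment layout: spec.template.spec.containers
    containers = (
        deployment.get("spec", {})
        .get("template", {})
        .get("spec", {})
        .get("containers", [])
    )
    missing = []
    for container in containers:
        requests = container.get("resources", {}).get("requests", {})
        # Flag the container if either request is absent
        if "cpu" not in requests or "memory" not in requests:
            missing.append(container.get("name", "<unnamed>"))
    return missing


# Hypothetical manifest, already parsed into a dict (for example from rendered chart output)
example = {
    "kind": "Deployment",
    "spec": {"template": {"spec": {"containers": [
        {"name": "api",
         "resources": {"requests": {"cpu": "250m", "memory": "256Mi"}}},
        {"name": "sidecar", "resources": {}},
    ]}}},
}

print(find_missing_requests(example))
```

A check like this only confirms that requests are present. Whether the requested amounts are right for the service is something load testing would have to answer.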
Rich Burroughs: Do those individual teams own operating their applications?
Matthew Simons: Yes and no. We’ve mostly made it self-service so they have the ability to define their own Helm charts, request resources from AWS via CloudFormation that they can create. We have self-service CI/CD so we have an internal pipelines product, Release Pipelines, that allows our teams to script out or orchestrate their deployments. We have a few touch points that are manual, but for the most part it’s all self-service. One of our main goals as the Quality Assessment team, as we go and perform individual assessments on different parts of our product, is to ensure that the best practices for engaging with all of those mechanisms are followed and adhered to.
Jacob Plicque: That makes sense. Now on the Kubernetes piece, that’s still fairly, I’d say fairly new from an age perspective, right? Because when you started at Workiva you guys were using App Engine initially before moving over to more of an IaaS Kubernetes implementation. Is that where you guys were thinking about Chaos Engineering or even… Well, actually let’s back up. What was the reason for the switchover initially? Let’s start there.
Matthew Simons: There were lots of them. The decision to move to microservices and move off of App Engine happened as I was joining the company so… It’s something that I can talk about, but I won’t have all of the perspectives, but I can say that some of them include a lack of flexibility on App Engine. One of them is that everything is on rails, which we’re now figuring out is a great thing as well as a bad thing. But a lot of those rails were sort of restrictive in terms of allowing us to create the product that we wanted to create. There was a certain amount of, I guess you could say, creative freedom, wanting to break out of the very prescriptive development paradigm.
Matthew Simons: Frankly, there’s reliability issues. App Engine, I don’t know what the nines are, I haven’t measured, that’s probably something I should know, but we had a lot of issues where being on App Engine, if App Engine had issues, our entire platform went down. Google has a… they have a pattern of making changes without testing them adequately. I mean, we all do, right? It’s something that we’re all guilty of in one way or another, but these felt fairly egregious and that it was often enough that it made us go, “Yeah, that’s another drop in the bucket of reasons for us to potentially find a new platform.”
Jacob Plicque: Right. It became more of a… Because of the fact that we can’t peek behind the curtain, so to speak, and make some changes, we need to get out in front of that and let’s get back to maybe controlling that piece. Is that what led the Chaos Engineering practice to kick off after that?
Matthew Simons: Yeah. I mean Chaos Engineering on App Engine, you don’t really have any option for, or rather options are very limited, right? You don’t control the platform and so if you don’t really control the platform then you can’t mess with the platform, right? You can’t break it in ways that… We weren’t able to simulate the failures that we were getting on App Engine and so we weren’t able to really build around them or build for that kind of failure.
Jacob Plicque: Yeah, because it sounds like when you were down you were hard down and you just had to twiddle the thumbs a little bit.
Matthew Simons: Yeah.
Jacob Plicque: Not to speak for you, but that’s what it sounds like at least.
Matthew Simons: Yeah. No, all we could do is call our TAM and bitch at the TAM and hope that they acted with urgency. You feel powerless even if in that situation they are working as hard as they can. I like to give them credit and assume that they are. The perception as a customer is that you’re just sitting there twiddling your thumbs, feeling powerless, and that’s a really hard thing to deal with when all of your customers are suffering.
Jacob Plicque: Exactly. I’m glad you mentioned the customer side of things and we talked about this in our previous episodes, but I think it’s important to reiterate: your customer doesn’t care whether you’re on App Engine or you’re on AWS or on EKS, they care if you’re up or not and if they can rely on you.
Matthew Simons: Yeah, absolutely. It’s bad form to blame your provider as well.
Jacob Plicque: Sure. Absolutely.
Rich Burroughs: Yeah. It’s interesting because I think that there seems to be a lot more movement towards managed services, especially with things like Kubernetes. It’s an interesting counterpoint to think about the fact that you do have less control than if you’re operating your own clusters or doing something like that. But Jacob mentioned that you’re on EKS, is that right?
Matthew Simons: We are, yeah. We’re on EKS. Which has been an interesting struggle, adventure, I guess. Maybe adventure is a fair way to phrase that. It’s been an interesting adventure on its own, right? I think when we started on EKS there were a lot of people that felt like it wasn’t quite ready for prime time. It’s certainly matured. It’s gotten better. We’re still figuring things out but I think for the most part we’re doing very well and we have a very stable platform on it now.
Rich Burroughs: That’s great. I saw your talk at SREcon in 2017. I was actually in the audience when you gave it, and I remember this because I was watching the video the other day and there’s a joke in there about Iowa being a foreign country. I grew up in Iowa. The minute I saw that, I was like, “Oh yeah, I was there for that.” I enjoyed the talk. It was about crisis management at scale. One of the things you talked about there that I think is super important was the idea of tightening that customer feedback loop.
Matthew Simons: Yeah, yeah. It’s something that’s actually really hard to do. The customer feedback loop, at its core, really gets to a point of pain allocation, or correctly getting the feedback where it needs to go. The core thesis there, I guess, is that in organizations where you have specific teams or departments that are designed for handling crisis response, the people that are responsible for causing the crisis are often not the ones that have to deal with it or that deal with the pain of the crisis. When we do that, when we remove the cause from the consequence, it just invites more dysfunction in that area. It reduces the feedback that we should have between our developers and our customers.
Matthew Simons: When a customer feels pain and then they report that pain, then that goes to a team where all they do all day is deal with pain. Then as a side note, out of a retro, we create an action item for a team. That team never had to feel all the pain. We’re not correctly, I don’t want to say incentivizing or disincentivizing, it’s not really a… Because it’s not a motivation thing. People want to make good products, it’s just a matter of intelligence. It’s observability again, it’s this idea that we shield people from the pain of their actions. I think the metaphor that I used in the talk is a child, a toddler putting their hand on a hot stove. If the feedback is either delayed or goes to someone else entirely, it doesn’t change the behavior. Ultimately what we need to do to protect our customers is change our behaviors to ensure that our behaviors align with good customer outcomes.
Rich Burroughs: I think this is the big argument that a lot of people have for putting engineers into on-call rotations.
Matthew Simons: Yeah, yeah, absolutely. There’s pushback, there’s always pushback when you suggest that, but I think that there’s a realization, there’s a couple of arguments that we always try and push when we’re talking to teams about this because this is one of the best practices that we’re really trying to push is, to make sure that teams have a dedicated rotation, that they have a defined rotation and they have an escalation path. If teams push back on that and say, “Well, we don’t want to go on call, that doesn’t… whatever. That’s just something we don’t want to do. I haven’t had to be on call.” We call baloney on them. I don’t know how colorful I’m allowed to be on your podcast.
Rich Burroughs: Baloney is the right level of color.
Matthew Simons: Okay. We call baloney on them and say, “You’ve been on call, you are on call. The truth is right now, if your product that’s in production, if your service goes sideways, if it’s 3:00 AM and customers are hurting, we’re calling you, we’re going to come wake you up, we’re going to find you or somebody on your team. But if you don’t have a rotation, especially if you are the point person, we’re calling you every time.” If you want to balance that load and if you want to ensure that we can reduce the time to resolution, then we need to have defined rotations and get people into our paging systems.
Rich Burroughs: That’s a good point. I feel like in my time in the industry, which is like 25 years, when I started out, I didn’t ever hear about engineers being on call. I mean it just was a concept that I think didn’t even exist at that time, you know? It really seems that, I think especially in the last like five years or so, a lot of people have come around on that.
Jacob Plicque: Like for me, I don’t have the years that Rich has, I’ve only been really in the tech industry in the last probably six or seven but when I started I think this was right before PagerDuty became really, really popular. I was working overnight. The pain point wasn’t waking people up to fix things, it was, “Oh man, I got to pick up the phone and hopefully I dialed the right person in the right department in the right team.” There were times where that did not go that way for me I will say and got a bit of a talking to. Getting that experience definitely helped me when I became the one waking up because I didn’t want to wake up. It’s on me to understand why are these things happening and why am I waking up and what could I do to sleep more? It really is what it comes down to.
Matthew Simons: Yeah. Steering the direction now towards Chaos a little bit. This is where Chaos has a really strong selling point for developers or should have a really strong selling point for developers, which is, how do you want to find out about these problems? Do you want to find out about them when you’re looking for them during business hours, when everyone on the team is there and with you or do you want to find out about them at 3:00 AM and you’re in this half asleep haze trying to then troubleshoot an issue probably on your own.
Rich Burroughs: Right, with not the context that you have when you’re intentionally injecting the failure.
Matthew Simons: Right. Yeah. It also speaks I think to this point about a fatal optimism almost that we have as product creators, which is that it’s also sometimes hard to sell people on that accountability and on being on call and looking for failures, the whole… They might look at it as pessimistic. I think of it more as realistic, but that whole side of, listen, failure is part of anything, anything we build. We’re almost like parents, right? Where the things that we build are almost like children to us. It’s really easy when something, I’ll say, wets the bed to say like, “Well, that’s just Johnny was scared tonight, right? He’s a good kid. It was a bad night tonight, right?” We make that excuse every time it happens and there comes a point where you just say, “No, Johnny’s not a good kid. There’s a pattern here. We need to make Johnny a good kid, right?” Your product, it’s an ugly baby, stop making excuses for it. Let’s make it better.
Jacob Plicque: Let’s have a chat Johnny. Have a seat.
Rich Burroughs: Yeah, I mean I think that the end goal, always is to try to improve the experience for the users, right? And to keep that in mind.
Matthew Simons: Yeah. I’ll say too, that’s a generalization. There are a lot of teams that don’t operate like that. That’s the worst example. There are a ton of teams that care very much about the quality of their product, that are willing to accept that it’s not perfect and that it’s not a personal reflection on them as people or developers, but sometimes it does degenerate into that, where it’s very tough to get people to care about those failure scenarios and do proactive things like defining on-call rotations, like doing Chaos testing, those kinds of activities that are really important.
Jacob Plicque: Yeah, I find that it’s also about how you address it and applying the right kind of friction. I mean, I was running a Game Day recently, we ran the attack and the scenario and things went really well and there was a… it was fine, but there was an off-color remark around like, “Oh it’s staging, it’s not a big deal.” I grinned and I looked at the engineer that I was talking to that was talking about it and I was like, “All right, well now how comfortable do you feel about running the same experiment in production?” And, “Oh no, no, no.” Okay well then… I was like, “Wait a minute. What happened? You were so comfortable, you were so ready to go just a few moments ago.” But production is where your customers are and that’s where they care and if this failure happens in production, how ready are you with that? All right, maybe not this particular magnitude, maybe we tuned it down a little bit, but that’s something that we need to get towards.
Jacob Plicque: What it ties into overall is that operational maturity. We’re getting a wave overall I think as an industry from the aspect of being reactive and fixing things just isn’t moving the needle enough, especially when people can go to Twitter and complain about your brand for five minutes and that two-minute outage costs you much more because people don’t want to go to you anymore. I’m curious as to… because I know you said that’s more of the worst case scenario, but are you running into anything like that and how are you handling it?
Matthew Simons: Yeah. Okay. This actually speaks to something that I’ve been thinking about for a while that perplexes me and I’m curious to know your thoughts on, which is that… I spoke to it a little bit already, but the optimism in the face of scenarios where we shouldn’t be optimistic. In the sense that there are things that we should be afraid of that we’re not. If you’ll indulge me, I’ll go on a little bit of a tangent and I promise I’ll bring it back. Okay. Understanding, this is going to sound really tangential, but let me ask you a hypothetical here, which is hypothetically, if there were something that could unite all of humanity, the entire world, some event, some hypothetical happening that would bring us all together, let us cast aside all of our differences and work towards a common goal single-mindedly, what do you think that would be?
Rich Burroughs: Wow, that’s tough.
Jacob Plicque: Let’s see. Well, let’s go with the classic example of world peace and end of hunger. Let’s just go for it. We’re in the optimism line, right? Let’s just go for it.
Matthew Simons: God, I love where your head is at, that’s so great. I’m going to go a totally different direction, which is that in order to get there, we’d have to have an external existential threat. Aliens, right? If aliens showed up tomorrow or today, and they were like, “Hey, you guys, we don’t like you, right? We’re going to pick a fight.” The rest of us would go, “You know what? We have differences, but we got to figure this out otherwise we’re all toast.” Right?
Jacob Plicque: Makes sense.
Matthew Simons: Existential threats should be the kind of thing that would unify humanity, they shouldn’t be partisan issues. They should be something that we could all say this is something we get behind. But in practice we’ve seen that that’s not the case. In practice, global warming we’ve seen as actually a partisan issue. Why is that? Why does that become a thing? Why do we let that become a thing?
Matthew Simons: Another thing is the year 2182 is significant. The year 2182 is significant because if you Google “asteroid 2182” what you’ll see is that there is an enormous asteroid that is on a collision course for the earth that’s set to hit in 2182.
Jacob Plicque: That’s news to me. I’m going to write that down.
Matthew Simons: It’s about half a kilometer wide, right? Now, I’ll say this. I think it’s the highest percentage chance to hit us. It’s still a small chance, but this is like of all the things we’ve observed, this is the one that’s most likely to really screw us up. I don’t know that if people are going to listen to this, there’s going to be all kinds of corrections because I might misquote stuff. It’s not big enough to potentially be a game ender for us, but it’s big enough that it’s like we’re going to lose a continent. We’re going to jump back 500 years in our… It’s going to be that, it’s going to be real bad.
Matthew Simons: What’s interesting about it is that there’s two other dates associated with this. 2060 and 2080. If we do something by 2060 we can actually divert the course enough. To get the asteroid not to hit us, we have to divert its course, right? We have to apply some amount of force to change its trajectory. If we do it by the year 2060 the amount of effort required is negligible. We can do it with these existing technologies. It’s not going to require hundreds of nations coming together to make something happen. If we wait until 2080 the technologies do not exist that would allow us to divert its course. If we wait until after 2080 we can throw the entire nuclear arsenal at it, we could send Bruce Willis up on a rocket. Whatever we do, it doesn’t matter, right?
Jacob Plicque: Oh yeah.
Matthew Simons: Whatever we do at that point, it does not matter. Now it’s almost… so from 2020 right? It’s 2019 right now. This is news to both of you guys. It was news to me fairly recently too. How long do we have to actually prepare something by 2060? This is probably something people should be aware of right now, but what we’re doing and what we constantly do as humans is adopt this fatal optimism where we’ve been playing a game of Russian roulette for so long that we’ve just been pulling the trigger and nothing’s happened to where we’ve been positively reinforced where we believe we can keep pulling the trigger and nothing will happen in complete denial of the one rule of the game, which is that the gun is loaded.
Rich Burroughs: This is an interesting conversation to me because as somebody with an operations background, I’ve always felt, if anything, very pessimistic about the system that I’m running. I’m not that kind of person at all and so… I could see what you’re talking about for sure. That there are people who just don’t consider the kinds of failures they may run into or the kinds of problems, the fact that they don’t have a good DR or something like that, that those things are never going to bite them.
Matthew Simons: Yeah. Well, and if you want another example: wearing your seatbelt. There are tons of people that don’t wear their seat belts, even though we know that dying in a car crash is actually… There’s a significant chance, way better than getting hit by an asteroid, that they are actually going to die in a car crash at some point. That’s going to be how you end your stay here. But still there are tons of people that don’t wear their seat belts and why would they? They’ve managed to get away with it for this long, right? They don’t do it. They’ve never died in a car crash, so they don’t wear their seat belts.
Jacob Plicque: I’ll have you know sir that I am on life six and …
Matthew Simons: But bringing it back to Chaos, bringing it back to the way that we treat our systems. There are metaphorical asteroids that are on a collision course with your business. There are things that you don’t know about. There are risks that you don’t know about that you’re not going to find out about unless you do things like Chaos testing to discover them. If we take a cowboy attitude towards how we run our businesses, how we run our software and don’t take precautions like wearing our seatbelt, we’re going to go through a windshield and it’ll be too late.
Jacob Plicque: Right. It’s funny to think about, especially when you… I don’t know if I’m alone on this, but only when I really started looking into, and started doing Chaos Engineering experiments that I start to see them in the outside world. Of course the seatbelt I should say crash test dummies and stuff like that. Of course they’re Chaos Engineering experiments. We were talking about it a lot around like, “Hey, Chaos Engineering is very new,” but we’ve been doing these types of tests forever. It actually creates this really interesting… as new as it is, we’ve been doing things like this forever but now I’m just quantifying what the world’s largest chaos experiment against an asteroid would be and what’s the blast radius I should start with?
Rich Burroughs: It’s funny to me though because when I started working at Gremlin, I didn’t have any experience with Chaos Engineering or that was what I thought anyway. I had seen some conference talks about it and read about it, but I had never done anything. But the more I started to think about it, I realized that, I’m sure there are times in my past where I did things like set a null route or block a certain port number or things like that to test how something works. I think I had been injecting failures earlier in my career and I just didn’t really think about it that way.
Rich Burroughs: I talked with Jade, our VP of Engineering, when he came on board and he’d come from New Relic and he was talking about some of the kinds of things they had done there. I said to him, “You realize you were doing Chaos Engineering, right?” And he was like, “Wait, what?” I think this kind of failure injection is something that comes pretty naturally to people who are used to doing other kinds of testing.
Matthew Simons: Yeah, I would agree. One of my favorite things about it too is if people aren’t used to it, you wield a lot of power bringing it to a team or to any particular issue. I mean it’s bringing data to an opinion fight, right? Which is, there are oftentimes these problems, the kinds that we run into are issues with quality, right? Where people have some problem that they’ve been trying to solve for a while and they’ve been throwing a lot of hypotheses around of, “Well, where might this be coming from?” You can get into arguments about the best way to solve it or attack it and bringing data to an opinion fight is the opposite of bringing a knife to a gunfight. Everyone’s running around the room trying to stab each other and you just walk in, fire two shots into the air and go, “Listen up.” Everyone pays attention because it’s a new tool. If you’ve been running without it for a while, having it introduced is a new capability that just changes the game for you in terms of how you attack problems. It gives you a way to get data.
Jacob Plicque: Yeah. I was just going to ask kind of tangentially a little bit, one thing that I love about Workiva is that you guys have taken the Game Day in name and switched it a little bit to something called a Chaos Jam, which I think is a phenomenal name. I would love to know a little bit more about the name change and how that’s resonating.
Matthew Simons: Yeah. I guess for us it just felt natural because we do a lot of… Whenever we have cross team collaborations. Every year we have our own developer conference where we fly our developers in from all around the more interesting places of the world to come to Ames, Iowa. We often will do what we call Jams on different areas of the product and it’s… A jam just feels like to us it’s a very collaborative sit down, let’s just hack out ideas, let’s figure out… It’s just a collaborative catch all I guess. For us to just call it a Chaos Jam I think is a more familiar term for working with some of our different product teams here at… I think it just boils down to for them, it’s easier to grok what that means upon hearing it as opposed to a Game Day.
Rich Burroughs: Sure. We’ve talked about this on previous episodes, but to me that’s one of the most exciting things about Chaos Engineering is that collaborative aspect. Bringing together people from different teams and getting them to sit down and maybe learn about parts of the application that they didn’t understand before. Things like that.
Matthew Simons: Yeah, at least for us, it’s helped us to build a lot of rapport, I should say. Our last assessment we performed was with the team that had been struggling with an issue that had caused incidents in production a number of times over the last few months. They had had trouble reproducing it. They hadn’t been able to reproduce it reliably to be able to determine what was happening and get a good fix out. We just walked in carrying this new tool and we’re able to reproduce it using the Chaos, using Gremlin specifically, we were able to reproduce that issue within the first couple of days of being there. That’s not that we’re any smarter than them or any better than them as engineers. These are dedicated, motivated, quality focused, amazing engineers on their team. They just didn’t have this tool set. And so for us to walk in with slightly different perspectives and to be able to offer them a new tool set meant they were able to tackle this problem that they’d been struggling with for months in what’s really a pretty short period of time.
Rich Burroughs: Yeah, that was honestly one of my first reactions digging into these tools was, “Boy, I wish I would’ve had this stuff earlier in my career.” I know there are times when I was fighting with problems that I could have figured out a lot easier.
Jacob Plicque: Yeah. I mean especially when you’re going down the rabbit hole of troubleshooting and you’re just hoping for the best and it… Especially in that situation when it’s tough to reproduce and you’re essentially having to wait and hope that the patch you put in or the PR you put in did what it needed to do. The ramification of that is that, if you’re down again, you’re losing maybe more money, versus being able to reproduce. Then my assumption there, and correct me if I’m wrong Matt, but were they able to fix it and then reproduce or rerun that experiment again to prove that they were good to go now?
Matthew Simons: Yes. Yes. Absolutely. We were able to both reproduce it, get the fix and then verify that the fix worked.
Jacob Plicque: Yeah, not much more powerful than that.
Rich Burroughs: Yeah, being able to close the loop like that is great. There’s always going to be new problems that come up and always going to be things that we haven’t discovered yet. But every one of those things we can cross off the list is great.
Jacob Plicque: Absolutely.
Rich Burroughs: What kind of challenges have you had with rolling out this program to do Chaos Engineering at a company that was already really well established with well established products?
Matthew Simons: Yeah. The challenges that we’ve had have largely just been challenges of inertia, right?
Rich Burroughs: Mm-hmm (affirmative).
Matthew Simons: As a company that started on App Engine and didn’t have the ability to do Chaos testing, beginning that practice when we’re not used to it, when that’s not been a part of what we do, well, it’s like telling people, “Hey, you need to wear seat belts,” and they go, “What are seat belts?” There’s a whole education process that has to go with it and a demonstration really. Show beats tell every day of the week, so it’s really about getting people to set time aside, and that’s maybe the harder part. Everybody is under deadlines to ship product, to ship features, to get things out the door. Trying to sell somebody on, “Hey, set some time aside, do this, start this as a practice. We promise it will up your velocity and save you time in the long run,” can be a difficult sell.
Jacob Plicque: Sure.
Rich Burroughs: Yeah. I mean, every team that I was ever on had a backlog that was way bigger than we could ever tackle.
Jacob Plicque: Yeah. I mean, on the flip side of that, how many times have you heard “We have enough chaos?” I probably would be retired right now. But I think even though it’s a fun anecdote, I think it’s important to address too. Maybe you do, but wouldn’t you like to get ahead of it and just understand what happens when things inevitably go wrong? To tie back to the “failure is inevitable” point earlier. We’re tied back around.
Matthew Simons: Yeah. It can also be difficult selling people on the baseline failure load injection. The experimental part, the sit down, make an experimental plan, break things, observe, iterate. That’s a very deliberate, intentional cycle. But even just selling people on the background noise version of Chaos has its difficulties too. We actually just recently switched the environment where we’re primarily doing Chaos, so that now we’re doing it in an environment that’s closer to production. It’s actually our internal dog-fooding environment. We use our own product a lot. It’s primarily a compliance reporting platform, but what that means is that it’s also an entire connected document office suite in the cloud and so we use it. We use it internally a lot. That means that the entire company is invested in not having that disrupted a ton. Even though if you look at it, it’s like, well actually the whole point of dog-fooding is to have it disrupted and learn our lessons there.
Rich Burroughs: It’s true.
Matthew Simons: As we have started with some low level constant disruptions there, there are people that… It’s funny. I’ll back up. The one that we’re starting with. We are starting with a single planned regular disruption, which is to bounce a service that is very important. It’s core to the entire platform. It’s a communications platform, it’s a messaging bus. In bouncing this, we have the potential to bring down the entire platform. What we’ve done though is we’ve done enough tests with this before to know that we’re pretty buttoned up, that we should be able to do this and that we should be resilient to this.
Matthew Simons: Now, introducing this at a regular interval, we’re going to do it just I think once a day to start with. During the middle of the day when everybody’s here so that if there’s problems we can see it. We still get people… one of the questions that I got was, “Well, who are we telling about this? When it happens, where’s the visibility? Who sees it?” The answer that I believe is true but is very uncomfortable for people to hear is, “Nobody. We are telling nobody. This is it. This is your notice.” The reason for that is very intentional, which is that we’re injecting a mode of failure that we believe we should be resilient to. If we are resilient to it, great. Why would I tell anybody to expect failure when we don’t expect failure?
Jacob Plicque: Exactly.
Matthew Simons: If it does fail, then we absolutely want to treat it like a failure, a big failure, and we want to call an incident and we want to track it and we want to treat it just like everything else. It’s actually at that point indicative of probably some… either a regression or some latent edge case that we just never caught.
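The planned disruption Matthew describes (bounce the service on a schedule, announce nothing in advance, and open an incident only if the platform fails to absorb it) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Workiva's or Gremlin's actual tooling: `bounce_service`, `platform_is_healthy`, and `open_incident` are hypothetical callables standing in for whatever restarts the messaging bus, checks platform health, and files the incident.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class DisruptionResult:
    healthy_after: bool    # True if every health check passed after the bounce
    incident_opened: bool  # True if the failure was escalated as an incident


def run_planned_disruption(
    bounce_service: Callable[[], None],       # hypothetical: restarts the service
    platform_is_healthy: Callable[[], bool],  # hypothetical: health probe
    open_incident: Callable[[str], None],     # hypothetical: incident hook
    checks: int = 3,
    interval_seconds: float = 10.0,
) -> DisruptionResult:
    """Bounce a service we expect to be resilient to losing, then verify.

    Nobody is notified up front, because no failure is expected. If any
    health check fails, it is treated like any other production failure.
    """
    bounce_service()
    for _ in range(checks):
        time.sleep(interval_seconds)
        if not platform_is_healthy():
            open_incident("Platform unhealthy after planned service bounce")
            return DisruptionResult(healthy_after=False, incident_opened=True)
    return DisruptionResult(healthy_after=True, incident_opened=False)


if __name__ == "__main__":
    # Stubbed run: the bounce is a no-op and the platform stays healthy.
    result = run_planned_disruption(
        bounce_service=lambda: None,
        platform_is_healthy=lambda: True,
        open_incident=print,
        checks=2,
        interval_seconds=0,
    )
    print(result)
```

Running this once a day during working hours would be left to an ordinary scheduler; the point of the sketch is only that the failure path goes through the same incident process as everything else.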
Jacob Plicque: Exactly.
Matthew Simons: There’s still very much that like, “This isn’t how we do things, don’t disrupt things.” We panic a little, right? We don’t like that level of disruption.
Jacob Plicque: There’s a lot of value and it removes the fatal portion of the fatal optimism statement earlier. It’s like, “We believe that we’ve built ourselves up to this point that we believe that we can inject this failure regularly and not have any problems and now we’re proving it literally every day.”
Matthew Simons: Every day. Then you increase it, right?
Jacob Plicque: Right.
Matthew Simons: Not just that particular… So you increase potentially the frequency of that particular attack, but then you also add other attacks. The reality is that any failure that you’ve observed in production is a failure that you should probably expect to observe again and you should plan for. It then becomes fair game for something like a consistent attack until people start to really plan for it. What we’re doing is we’re teaching a whole development organization. We’re teaching the people and, through association, the product itself, how to exist and run in a failure mode. We’re saying life is… Well, just like the human body, we are constantly running in one failure mode or another, but we have lots of coping mechanisms for dealing with that and ensuring that the organism as a whole continues to function and do critical things despite those failures, and our products need to live that way too.
Rich Burroughs: Yeah. That’s great Matthew. Hey, I think we’re about out of time here. Thanks so much for coming to talk to us. We really appreciate it. Where can people find you on the internet? You have anything you want to plug, Twitter or anything like that?
Matthew Simons: No, honestly you can find me on LinkedIn. For my own sanity, I mostly stay off of social media. I think I used Twitter once years ago to shame Wells Fargo for not having two-factor and that’s like the one tweet that I didn’t have.
Jacob Plicque: Amazing.
Rich Burroughs: Oh my God.
Matthew Simons: LinkedIn is where you can find me, outside of that, I like to speak regularly at conferences and jump around, so that’s fun too.
Rich Burroughs: Great. All right, well, thanks so much for joining us, Matthew.
Matthew Simons: Yeah, thanks. Thanks for having me.
Our music is from Komiku. The song is titled, Battle of Pogs. For more of Komiku’s music, visit loyaltyfreakmusic.com or click the link in the show notes. For more information about our Chaos Engineering community, visit gremlin.com/community. Thanks for listening and join us next month for another episode.