Break Things on Purpose is a podcast for all-things Chaos Engineering. Check out our latest episode below.
You can subscribe to Break Things on Purpose wherever you get your podcasts.
In this episode, we speak with Paul Osman, a Senior Site Reliability Engineering Manager at Under Armour.
Rich Burroughs: Hi, I’m Rich Burroughs, and I’m a Community Manager at Gremlin.
Jacob Plicque: I’m Jacob Plicque, a solutions architect at Gremlin. Welcome to Break Things on Purpose, a podcast about Chaos Engineering.
Rich Burroughs: So, welcome to episode three. We’re really excited about this interview that we did with Paul Osman from Under Armour. It’s been really fun to get to talk to different people about their Chaos Engineering experiences, and Paul has a lot to share.
Jacob Plicque: Absolutely. Before we get started, just a reminder that you can reach out to us through email or Twitter if you have feedback, questions or comments. Our email address is email@example.com. Our Twitter handle is @btoppod. So, B-T-O-P-P-O-D.
Rich Burroughs: So Jacob, what did you find most interesting from our conversation with Paul?
Jacob Plicque: My biggest takeaway came from listening to Paul talk about his Chaos Engineering practice and how it’s changed from company to company. Effectively, it’s his personal Chaos Engineering maturity and how that’s changed as he moved onward and upward. What about you?
Rich Burroughs: Yeah, mine was the exact same thing. He starts off doing manual Chaos Engineering about five years ago I think he said, and then, he moves to PagerDuty, where they already had a program in place that was pretty sophisticated, and then he goes to Under Armour where there wasn’t a Chaos Engineering practice, and he bootstraps that.
Rich Burroughs: I think that doing Chaos Engineering over that length of time in those different kinds of circumstances gives him a really unique point of view that a lot of us don’t have.
Jacob Plicque: Exactly. It allows folks like us and folks like our listeners out there to learn more. With that, let’s get to the interview with Paul.
Rich Burroughs: Hi, today we’re speaking with Paul Osman. Paul leads the Site Reliability Engineering team at Under Armour. Welcome, Paul.
Paul Osman: Thank you very much. Glad to be here.
Jacob Plicque: Yes, indeed. Awesome to have you. I’d like to kick things off with a little bit about your background, and what brought you into the Chaos Engineering space?
Paul Osman: Yeah, certainly. I think going back, the first time I got exposed to Chaos Engineering is probably from watching folks like Adrian Cockcroft or Nora Jones talk about stuff that’s going on at Netflix, and I started practicing it at a startup that I used to work at in Toronto called 500 Pixels. We were in a very common space where the product had been built using a monolithic framework, in particular Rails.
Paul Osman: A lot of components of the app, a lot of the features were really tightly coupled. We started looking at migrating things into microservices and starting to develop new features that way. That allowed us to start thinking about resiliency. As we were designing new features using this new architectural pattern, we were able to think about failure from the get go. I was part of a team there that started actually practicing game days.
Paul Osman: Every time we would launch a new product, we would put it in production, we would make sure all of the things that we did to try to account for failure were actually working by kind of forcing those failure conditions. It was very manual back then, but a really great way to kind of get introduced to the process and start to learn more about good ways of doing it.
Paul Osman: From there, I started working at PagerDuty; and PagerDuty, I mean as most people who are familiar with them know, have a really great reputation for resiliency. Working there was kind of like a master class in doing Chaos experimenting. The Infrastructure and Site Reliability Engineering teams there led a practice called Failure Fridays. So every Friday, they would pick a team and basically say, like, “We’re going to attack your services,” and kind of go through a runbook where they would try cutting off connection between services and various data stores, making sure that things failed over properly.
Paul Osman: Having that regular, routine kind of health check was a really interesting experience. So, that’s definitely how I got my start in this space in particular.
Rich Burroughs: How long ago was it that you were doing this stuff at 500 Pixels?
Paul Osman: That would have been about, I’d say about five years ago now.
Rich Burroughs: Wow. So, you were on board pretty early on.
Paul Osman: Yeah. Yep.
Rich Burroughs: That’s awesome.
Jacob Plicque: Yeah. What I think is particularly interesting there is that monolith-to-microservices transition. As everyone knows, that’s an incredibly easy transition to make, right? [laughs] As far as things go for the uninitiated. But how did doing those experiments help with that journey? Did it kind of level set things? Did it help from like an agility perspective from what you thought?
Paul Osman: Yeah, quite a bit actually. Part of our motivation for microservices was already reliability, but also, the company was in a growth spurt and they were looking at ways to allow engineering teams to do all the usual things that you want to do with microservices; like, deploy independently, not have to get into a big deploy queue, increase the rate of deploys, things like that.
Paul Osman: That was the big motivation, but also going from a monolith to a microservices architecture allows you to just rethink a lot of these failure strategies. In particular, I can think of the first experiment we did, the first kind of Game Day we did. We were introducing a new search feature and we decided to move all responsibility for search in the site to a separate service. We didn’t have any edge routing infrastructure or anything like that at the time. This was really pretty rudimentary. We had CDN to a load balancer to Rails application servers.
Paul Osman: What we’re doing with microservices was because we didn’t have any edge routing infrastructure, we were really just calling out to the microservice from the Rails monolith. So, the Rails monolith was evolving into a kind of edge routing component of our infrastructure. What that allowed us to do is when we were doing that search feature, we thought about all these things like, “Okay, if the service can’t talk to Elasticsearch, what can happen then? If the monolith can’t talk to the search service, what happens then?”
Paul Osman: Things that used to be automatic cascading failures for us, we started being able to handle them more gracefully. Instead of saying, like, “Elasticsearch is having trouble, the whole site’s unusable, or the whole app’s unusable,” we were able to say, like, “Search will 503,” it’ll fail fast.
Paul Osman: So users in the interface can get a nice little dialogue box saying, “Search isn’t working right now, please try again later.” That was just a hugely positive experience for us. It increased the reliability of the site quite a bit, and I would say there’s a trade off for sure but it definitely did increase velocity for certain teams.
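The fail-fast behavior Paul describes could look roughly like the sketch below. It is only an illustration under stated assumptions: the names `handle_search`, `SearchUnavailable`, `Response`, and the two stand-in clients are hypothetical and are not taken from the 500 Pixels codebase. The point is that a failing dependency produces a quick 503 for one feature instead of taking the whole app down.

```python
from dataclasses import dataclass


class SearchUnavailable(Exception):
    """Raised when the (hypothetical) search service cannot be reached."""


@dataclass
class Response:
    status: int
    body: str


def handle_search(query, search_client, timeout_s=0.5):
    """Return search results, or a fast 503 if the search service is failing."""
    try:
        # search_client is any callable taking (query, timeout_s); an assumption.
        results = search_client(query, timeout_s)
    except (SearchUnavailable, TimeoutError):
        # Fail fast: only search degrades, the rest of the app keeps working.
        return Response(503, "Search isn't working right now, please try again later.")
    return Response(200, ", ".join(results))


def healthy_client(query, timeout_s):
    """Stand-in for a working search service."""
    return [f"photo matching {query!r}"]


def broken_client(query, timeout_s):
    """Stand-in for a search service that cannot reach its data store."""
    raise SearchUnavailable("cannot reach the search backend")


if __name__ == "__main__":
    print(handle_search("sunset", healthy_client).status)
    print(handle_search("sunset", broken_client).status)
```

A Game Day for this feature would then amount to swapping in the broken path on purpose and confirming that users see the 503 message rather than a site-wide failure.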
Rich Burroughs: Then you show up at Under Armour. Was there a Chaos Engineering program already when you got there or did you end up kicking that off?
Paul Osman: No, that was something I ended up kicking off. I started at Under Armour as a staff software engineer. Here, what that means is I was really responsible for working between all of these different teams, kind of building platform components that we would use to integrate all of our various products. For those who maybe aren’t familiar, Under Armour is the parent company of products like My Fitness Pal, and Map My Fitness, and Endomondo.
Paul Osman: It was really interesting coming into this company where we had three different products that were all the product of acquisitions. These were consumer products that were built in their own way, they use different stacks, they use different architectural patterns. My first responsibility was building up platform components that helped us integrate those, or improve the integration between the apps. But then, I started managing backend services teams and that’s where I introduced Chaos Engineering.
Paul Osman: I think the first thing we did, it was really like dipping our toe into this space. For My Fitness Pal in particular, we have this huge traffic spike every New Year. There are users downloading the app again who maybe were previous users who churned, and they’re renewing their commitment to improving their awareness of their fitness. Obviously from our perspective, we also have a lot of people who are downloading the app for the first time, and we want to make sure that that experience is really, really good so that we can retain as many of those users as possible, and hopefully turn them into year-long users.
Paul Osman: The first time I went through that exercise where we were with a team, where we were doing all the pre-scaling and kind of capacity planning for this traffic spike, I decided that we would test out a few kill switches in the app; these were things where the team had already designed this functionality.
Paul Osman: The idea behind it was like if there was a problem with, let’s say a third party integration or a cluster of asynchronous task workers or something like that, let’s make sure that we can turn that feature off instead of having a queue backup cause problems for the rest of the app or something like that. The idea being, of course, like if your step tracking isn’t working, you should still be able to log your food.
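A kill switch of the kind described here can be sketched in a few lines. This is a minimal illustration, not Under Armour's implementation: the `KillSwitches` class, the feature names, and the handler functions are all hypothetical. In practice the switch state would live in a shared configuration store rather than in process memory.

```python
class KillSwitches:
    """In-memory registry of features that have been switched off."""

    def __init__(self):
        self._disabled = set()

    def disable(self, feature):
        self._disabled.add(feature)

    def enable(self, feature):
        self._disabled.discard(feature)

    def is_enabled(self, feature):
        return feature not in self._disabled


def handle_request(feature, switches, handlers):
    """Run the handler for a feature unless its kill switch is thrown."""
    if not switches.is_enabled(feature):
        return f"{feature} is temporarily unavailable"
    return handlers[feature]()


if __name__ == "__main__":
    # Hypothetical features, for illustration only.
    handlers = {
        "step_tracking": lambda: "steps synced",
        "food_logging": lambda: "food logged",
    }
    switches = KillSwitches()
    switches.disable("step_tracking")  # e.g. a third-party integration is failing
    print(handle_request("step_tracking", switches, handlers))
    print(handle_request("food_logging", switches, handlers))
```

The Game Day check is then twofold: turning the feature off leaves everything else working, and turning it back on restores it just as cleanly.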
Rich Burroughs: That’s a really, really important concept in distributed systems and it’s one that a lot of companies seem to be embracing more. That idea of just turning off parts of the system that aren’t functioning, but making sure that there’s still a good user experience for the parts that are. The fact that part of the application isn’t working shouldn’t prevent you from doing things.
Paul Osman: Yeah, absolutely. Failing to do that, or not having the support and infrastructure in place to be able to turn off those features of the app that may not be as central to a user experience, it results in this familiar pattern for a lot of organizations where every incident becomes a Sev 1. If anything is failing, then everything is failing. Users are, of course, hugely impacted by that.
Paul Osman: You see that show up in revenue loss, you see it show up in churn reports, and in various other areas. Our first real Chaos experiment here was actually getting together like a month before New Year’s, and shutting off all these parts of the app in production and making sure that that operation was clean, and that we could bring things back online just as easily.
Paul Osman: That’s where we introduced our Customer Happiness team to the process. We made sure that they were in the loop, and that they were sending out notifications and broadcasts to users saying, “Look, there might be a five minute interruption in some step tracking functionality. We’re doing some routine maintenance and testing.”
Paul Osman: It was received really well. I didn’t know how this was going to go, frankly [laughs] but sort of announced that we were going to be doing this. Thankfully, our chief digital officer was like, “That sounds like a really good idea actually.” So, we just kept doing it.
Rich Burroughs: It’s one of those things that I think people sometimes have a little bit of fear around making some of these changes and doing Chaos Engineering. As scary as a step like that might be, the idea is that it’s going to prevent bigger outages down the road that are going to be a lot more harmful to the company.
Paul Osman: That’s right. One of the things I often recommend to people who are looking at picking this practice up in their organization or trying to sell this internally: if you already have some kind of incident response process fairly well documented, that makes it a lot easier. If you’re already reporting on incidents and making sure that there’s awareness around severities of incidents, and frequency of incidents, and recording things like mean time to resolve, then it’s really easy to make the case that, “Look, these failures are going to happen. Let’s make them happen when we’re all awake, and we’re all in a room, and we’re all ready for it.” To your point, that’s something that you can use to alleviate some of that fear sometimes.
Jacob Plicque: Yeah. Especially when you can tie it directly into some business objectives. Right? That really helps with the executive alignment. The fact that the Chief Digital Officer, though I can imagine there was some internal thought initially, was able to give the green light really ties into that; because if reliability is a P0 in the organization, then Chaos Engineering is definitely something to look at.
Jacob Plicque: You mentioned the Customer Happiness team. Is that actually the name of the incident management team?
Paul Osman: No, I should clarify, sorry. That’s our front line customer support team. So, they’re the great folks who answer all the emails from customers, who manage our support Twitter account, who field all of our inquiries, and just help our users be successful.
Jacob Plicque: Sure, sure. The first line of defense, so to speak, right? Then, they’re able to kind of escalate things if incidents do happen, based off of those reports. I mean, hopefully, the idea is to catch those things before that, right? Obviously, and that tells the full tale a little bit. But yeah, the reason I asked, I was like, “Man, that’s the name? Like, man, that’s a brilliant name.”
Paul Osman: I love that. Actually, now you’re making me think about reorgs and things like that. That’s the goal. Like, that’s a great culture value actually, if you can kind of encode that. The goal of all this stuff is to make our customers happy. I’ve always described one of the things that really attracts me to this field, to Reliability Engineering specifically, is it’s one of those rare areas that almost involves no trade offs.
Paul Osman: Reliability is good for customers, it’s good for engineers because you’re going to get paged less; and when you do get paged, hopefully, it’s something that’s urgent so you don’t have to waste that executive decision making ability. And, it’s good for the business because it improves revenue, it lowers churn, it does all those healthy things. I’m now actually thinking like, “How can we make our team advertised as like the customer happiness?”
Jacob Plicque: It’s really interesting because a lot of it is about that. It really trends upwards from there because if you’re making your customers happy, your customers are buying. Then if that’s happening, your engineers are getting paged less. So, they’re happy and they’re able to shift to being more agile, and doing more interesting things, and making the overall environment better from that aspect. It’s part of a huge culture shift, and I think it’s really awesome to see.
Paul Osman: Cool. Yeah, I completely agree.
Rich Burroughs: Paul, if you were talking to somebody at a company that didn’t have an established Chaos Engineering practice yet and they were trying to get up and running, what kind of advice would you give them?
Paul Osman: That’s a really great question. It’s one I get a lot actually. Yeah, I’ve done conference talks on this, I’ve done some lightning talks on this. I should have an answer at the ready, but it’s not always easy.
Rich Burroughs: How about in terms of like pitching it to the executives, specifically?
Paul Osman: Yeah, so I think before you’re able to do that, and I hate to put like, “you have to be this tall to ride this ride” kind of limitations on people, but I really do think you have to have a few things in place already. One of them is, you have incidents. No software organization in the world doesn’t have incidents. If they do, I want to find out what they’re doing. How you currently deal with those incidents I think is going to really establish the culture that you’re starting from.
Paul Osman: Like I mentioned earlier, having a good incident response process, I think is like step one. That means having engineers be on call for their own code, for the code that they shipped to production that they operate. That means having some tool like PagerDuty that can put people on call and have escalation policies and schedules so that you have a fairly good certainty that when things do happen, the right person’s going to get paged.
Paul Osman: Having good communication in place so that non-engineering stakeholders, like execs, and product managers, and customer support people have awareness about what’s going on. From there, you can start to at least address reality. You can start to look at like, “Look, we are having incidents. Things are happening that impact our customers negatively.” If you don’t have that, I think it’s really hard to start talking about like, “Hey, we’re going to start breaking things.” You know? It’s putting the cart before the horse in a lot of ways.
Paul Osman: The other thing that I think you have to have in place is at least some kind of decent observability. There’s plenty of people who have spoken much more eloquently about this than I can, but if you can’t see what’s going on in your production environment, if you can’t easily look at basic RED metrics (request rate, error rate, and duration of requests) for any service in your infrastructure, then you really aren’t prepared to start doing this stuff, and it’s really not going to go well. I wouldn’t even start to try to convince execs that this is something you want to do.
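As a rough illustration of the baseline metrics mentioned here (request rate, error rate, and request duration), the sketch below computes them from a list of request records. The `Request` record, the `red_metrics` function, and the choice to count only 5xx responses as errors are assumptions made for the example; a real setup would read these numbers from a metrics system rather than compute them by hand.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Request:
    status: int         # HTTP status code of the response
    duration_ms: float  # time taken to serve the request


def red_metrics(requests, window_s):
    """Summarize rate, errors, and duration for requests seen in a time window."""
    total = len(requests)
    # Assumption: only server-side (5xx) responses count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    durations = [r.duration_ms for r in requests]
    return {
        "rate_per_s": total / window_s,
        "error_ratio": errors / total if total else 0.0,
        "median_duration_ms": median(durations) if durations else 0.0,
    }


if __name__ == "__main__":
    sample = [
        Request(200, 100.0),
        Request(200, 120.0),
        Request(503, 900.0),
        Request(200, 80.0),
    ]
    print(red_metrics(sample, window_s=2))
```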
Rich Burroughs: Yeah. The way I like to think of it is that, “This is an experiment we’re doing, and we’re injecting this failure and we want to measure the results, and our observability tools are how we measure the results.” So if you don’t have those, you can’t really experiment.
Paul Osman: That’s exactly right. It’s not an experiment at that point. It actually is just chaos.
Jacob Plicque: Yeah, exactly. Because, it’s tough to, even before you click the button, to kick off an experiment. You want to have a well-defined hypothesis, right? It’s a little more difficult to be able to build one if you don’t have what you expect. You’ve touched on response rates, right? If I’m injecting a hundred milliseconds of latency and I don’t know what that even looks like, maybe I’m just clicking the UI in my application versus looking at a dashboard that tells me what that looks like.
Jacob Plicque: If that doubles, triples, affects my requests or my time out logic, I’m not seeing the full picture, and I want to make sure that at least that baseline is in there before I get started.
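The hypothesis Jacob outlines can be written down before the experiment starts. The sketch below is one possible way to express it, with made-up numbers: the function name `hypothesis_holds`, the 20 ms tolerance, and the timeout value are assumptions for illustration. The idea is that injecting 100 ms of latency should add roughly 100 ms to the observed duration and should stay under the caller's timeout; anything much larger suggests amplification somewhere in the call path.

```python
def hypothesis_holds(baseline_ms, injected_ms, observed_ms, timeout_ms, tolerance_ms=20):
    """Check an experiment result against a simple latency hypothesis.

    Hypothesis: observed latency is about baseline + injected latency,
    and it stays below the caller's timeout.
    """
    expected_ms = baseline_ms + injected_ms
    within_expected = observed_ms <= expected_ms + tolerance_ms
    under_timeout = observed_ms < timeout_ms
    return within_expected and under_timeout


if __name__ == "__main__":
    # Baseline of 120 ms, 100 ms injected, caller times out at 1000 ms.
    print(hypothesis_holds(120, 100, observed_ms=230, timeout_ms=1000))
    # Latency roughly quadrupled: the injected delay is being amplified.
    print(hypothesis_holds(120, 100, observed_ms=480, timeout_ms=1000))
```

Without a measured baseline there is nothing to pass in as `baseline_ms`, which is why the dashboard has to exist before the experiment does.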
Paul Osman: That’s exactly right. Yeah, I completely agree. Then I think once you have those things in place, then you’re ready to start testing this idea internally and see how it might be received. At that point, I think it all depends on your company and culture. Like, I don’t think that there’s one runbook that’s going to work for everybody. If you have a very conservative culture, you might want to think about some conservative examples.
Paul Osman: One of the favorites that I always go to is, look, banks do disaster recovery. They have a stigma of sometimes being like conservative technical organizations, but they do this stuff on the regular because they have to. Maybe phrasing it as like a Disaster Recovery Game Day or something like that might be more effective for your culture.
Paul Osman: Luckily at Under Armour, our engineering organization is pretty autonomous. When Under Armour acquired Map My Fitness, My Fitness Pal, and Endomondo, it was very clear from the get go that Under Armour understood like, “We’re hiring you because you know how to do engineering.” Within those organizations, we already had a culture that was very much like, teams know what to do, teams have the bandwidth to do it, let’s let teams just experiment with processes, and then let’s pick up what works. So, that’s pretty much how we approached it here; but at other places, you might need a little bit more formal buy in.
Jacob Plicque: So, I’m actually really curious about- I was chatting with Rich a little earlier and he mentioned that you spoke at Kafka Summit, and I wanted to kind of know what your experience is with Kafka as a whole, and how you kind of see it.
Jacob Plicque: The reason why I ask is when we had Chaos conference last year, back in September, we had a bunch of folks come up and talk to, essentially, we worked at as like a Chaos Engineering, check with a Chaos engineer or build an experiment, things like that. The number one application that we heard about was Kafka, which then kind of tied into like a blog that one of our coworkers did. So, I’m curious of your experience with that and what that looks like for you right now?
Paul Osman: Oh, cool. Yeah, we’re really big Kafka users. I mentioned when I first started at Under Armour, I was working as an engineer on a platform team here. Kafka was a big part of my job. The team I was on, what we were working on and what we ended up building, was a messaging platform built on top of federated Kafka brokers, Kafka clusters.
Paul Osman: Different parts of the organization would have their own cluster, and we have a topology service that really allows you to say like, “Here’s a namespace and an event name.” Your application doesn’t need to know exactly where the brokers and the Zookeeper clusters are, and doesn’t need to do service discovery. It just talks to this topology service, which then routes those messages to the right cluster. You can do that for producing and for consuming messages, and there’s a whole bunch of reasons internally why that’s a really good fit for us.
Paul Osman: What that’s allowed is more and more services, more and more features, are being developed with this kind of evented model in mind, where something happens in one part of the system that puts a message into a Kafka message bus. Then, there’s consumers that are sitting in consumer groups consuming this stuff and doing actual work. That shows up quite a lot in Game Days. We tend to look at things like, “What happens when you just kill off all the consumers? What do we know about things like retention windows or log size on the actual Kafka brokers?”
Paul Osman: But also, what happens when you start those consumers back up or how do you guarantee that you’re consuming all of the messages in the right way depending on the business need? I would say that for us, it’s actually been really nice because not depending on solely synchronous network RPCs just has allowed us all of these different ways that we can fail gracefully, you know? But, we haven’t actually done any Chaos testing on Kafka itself. That would be something I would be very interested in.
Rich Burroughs: So, yeah. So our coworker, Tammy Butow, actually did a little bit of that. She posted a tutorial that’s up on gremlin.com/community.
Paul Osman: I saw that. I’m very curious to give that a go, actually.
Rich Burroughs: Yeah, if folks are interested, they might want to take a look at that. It sounds like you’re also using Kubernetes at Under Armour?
Paul Osman: That’s correct. We’re about a year into our Kubernetes journey, so we’re still in a hybrid environment. We have been migrating all of our services from two home rolled orchestration platforms. That’s kind of amazing. One of them uses containers, the other does not. So, you can imagine which one’s easier to migrate from, but we started using Kubernetes about a year ago. We host our own Kubernetes clusters running in Amazon Web Services using tools like kops. Yeah, it’s been really, really good so far. Just can’t wait to be 100% on Kubernetes.
Rich Burroughs: Have you done experiments around that stuff?
Paul Osman: We have. We’ve had one Game Day where we attacked Kubernetes. It was great actually. We learned a lot in that one. One of the things, actually, this brings up something I think quite a lot about. Obviously the main goal of Chaos experiments is resiliency, and improving resiliency, and inevitably improving availability for customers and whatnot. But the other side effect that we’ve noticed is just increasing confidence amongst engineers, right, and with Kubernetes specifically.
Paul Osman: Our infrastructure team, who operate our Kubernetes clusters, led this one, and they went through a bunch of different attack scenarios, including like what happens when you take out a whole availability zone in Amazon? We’d have a bunch of pods running on nodes in that availability zone. Like, what’s happening? They actually visualized, using dashboards, the pods all being rescheduled on different nodes, making sure that this happened in an adequate amount of time without having any real customer impact.
Paul Osman: Obviously, we learned a lot about Kubernetes in that Game Day, and we learned a lot about how our specific setup could be tweaked. There were a few places where just injecting a small amount of latency had a much bigger impact than we anticipated. Almost as importantly, there were the engineers who participated in that. As an internal practice, Game Days are completely open, so I tend to invite everyone. Not everyone comes, but a lot of engineers from different teams show up even if it’s not their team who’s doing the Game Day. We had a lot of people who were just curious about Kubernetes. I think seeing us attack it, and seeing how the clusters were able to cope with these things because of various resiliency patterns inherent in Kubernetes, increased people’s understanding of and confidence in the system.
Jacob Plicque: Yeah, that’s brilliant. You touched on comfort, which I think is so, so spot on because I think a majority of the information and knowledge that I’ve built on Kubernetes has about 95% to do with running Game Days with it. So, I’m right there with you on that. It’s really interesting. One thing is that it’s very new still, right? You said you guys are about a year into your journey, and there’s a lot of folks that are just getting started.
Jacob Plicque: I think maybe it’s important to note, from what I’ve learned so far, that Kubernetes is not easy. I’d love to know, just from an overall standpoint, some things for folks to look out for when kick starting that journey, and maybe with Kafka as well.
Paul Osman: Yeah, certainly. So, I think one thing, and I’ve spoken to some organizations that have overlooked this point: Kubernetes is open source. Kubernetes has a great community supporting it. That’s all fantastic, but you need a team if you’re going to take this on. Yeah, you need an infrastructure team that’s willing to really dive deep into Kubernetes and learn all of the ins and outs because there are so many little “gotchas” that depend on your environment, things that no one can give you a simple runbook for.
Paul Osman: I think it was Alice Goldfuss, an SRE at GitHub. She did a fantastic talk recently on GitHub’s journey to containers and to container platforms, specifically Kubernetes, and just nailed it in my opinion. I was sitting there nodding while I was watching the talk: the types of skills that you need, the types of team organization that you need, and really that you have to be prepared to invest in this, and really stub your toe a bit and learn.
Jacob Plicque: Yeah. I think that ties right back into the Chaos Engineering practice almost too perfectly.
Rich Burroughs: It’s interesting to me because, we talk about this sometimes with, and I’ve actually heard Jacob talk about this before in some of our conversations about the fact that sometimes customers will kind of make excuses for why they can’t move ahead with Chaos Engineering yet. One of those might be, “Oh, we’re doing a migration to the Cloud or a migration to Kubernetes,” or something like that. But from my perspective, that’s a perfect time to start, right? When you’re kind of tackling a new platform and you want to understand how it works.
Paul Osman: Yeah, absolutely. I think admittedly, we’re still a little bit into this part of our journey. Chaos Engineering is often thought of as something you do as an afterthought. Like, you’ve built everything, you’ve got it in production, now you’re ready. You’ve got this crazy strong system and you’re going to test it, you know? But I think, and like I said, we’re working on this at Under Armour. We’re not there yet.
Paul Osman: I think the earlier, though, that you can build it into processes, that’s where you can start to get a lot of pay off. If you’re migrating to, let’s say, you’re just going from bare metal to a cloud provider, or let’s say you’re investing in a new configuration management system or a new infrastructure as code solution, or if you’re moving to a platform like Kubernetes, as you move things, how are you testing it?
Paul Osman: Like as you deploy stuff on this and put production workloads on this, how are you making sure that the resiliency stuff that you have in mind is actually working? That might be a great way to start flexing that muscle and working on it.
Jacob Plicque: Yeah, absolutely. Even something as small as the runbooks that you use for incident management and being able to build those out. Because most of the times, unfortunately, those things come out of postmortems from incidents, and, “Oh man, we should’ve seen that coming,” and, “This was the issue and now we know how to fix it for next time.”
Jacob Plicque: It’s really powerful, and this goes back to the executive alignment. How powerful is it to say, “Hey, what if we could have seen that coming at 2:00 in the afternoon a week ago versus now at 4:00 AM?”
Paul Osman: Yeah, absolutely. Yeah. Then, that ties into, depending on your company and culture, that’s great stuff that can be used to, like you said, sell this stuff internally.
Rich Burroughs: So, I was interested by something you said a little earlier. You were talking about your Game Days and the fact that you invite everybody, and there’s different engineers from different teams that come. What do you think about using Chaos Engineering to sort of share information and get a deeper learning of the system?
Paul Osman: Oh, I love it. I’ll also just add that it’s not just engineers who show up, which I’m super happy about. We’ve had lots of customer support people come, we’ve had product managers come. Part of me thinks that maybe they’re just like, “What the heck is this team doing?”
Jacob Plicque: Y’all are the cool kids. They want to see what the cool kids are up to.
Paul Osman: I hope so. I hope they’re not like, “This is terrifying and I want to see it.” But to your question, I really think that that’s something that’s not tapped enough. Like, it’s a really untapped potential. Something that we’re starting to think about, and we haven’t rolled this out yet, so hopefully this was something I could talk about tangibly soon, but I would love to use Chaos Engineering as part of our onboarding.
Paul Osman: Every company I’ve been a part of, even PagerDuty, they’re as good as anyone at this, onboarding people into the on call rotation is painful, and inevitably, there’s some point at which it kind of feels like, especially as a manager on teams, you’re like, “Well, this is the deep end now, and I hope you know how to swim.”
Paul Osman: You can certainly page people, you can do things, but this is it. I can’t help but think like there’s a better way to maybe get someone to lead a Game Day so that the whole teams around as a part of their introduction to on call.
Jacob Plicque: You know, that’s a really powerful use case. Like Kolton says it, one of my favorite lines that he’s ever made is like, “Here’s your pager, good luck.”
Rich Burroughs: Yes. For those who might not know, Kolton’s our CEO at Gremlin. Actually in our first episode, we talked to a couple of our coworkers, and Tammy Butow brought this very thing up about the fact that most folks just sort of throw people into that deep end when they put them on call, and that participating in some Game Days would be a great way to learn some things before you get put in the rotation and start getting pages. I think that it would give you some familiarity with observability software and a lot of other things.
Paul Osman: Yeah. The other thing that we actually noticed, to that point specifically: I was really glad to see a lot of mobile engineers in our company were attending Game Days. It really piqued, I think, a lot of people’s interest who hadn’t done any sort of backend development, and this is specifically Game Days that are attacking backend systems. Just having familiarity with some of our operational dashboards, with how we look at metrics, how we use tracing, things like this to debug problems, sparked a lot of interest.
Paul Osman: As a direct result of that, we have a bunch of mobile engineers who want to do Game Days now, and I’m super excited about having a mobile-led Game Day. We also have mobile engineers who are really interested in backend systems and want to start learning new languages, and programming backend systems, and seeing if they can expand their horizons a bit and get a bit more full stack there.
Rich Burroughs: Well, I want to hear about your mobile Game Days when those happen, that sounds great.
Jacob Plicque: Yeah, 100%. Just to kind of clarify, was it that they kind of walked into the Game Day and left like, “Huh, that’s interesting. I’ve never thought about that piece of our infrastructure in that way. I want to learn more about it”?
Paul Osman: Absolutely. A specific case I’m thinking of is, we have a woman named Lisa on our Android team who had attended a Game Day, and was familiar with the fact that we did this every so often. I think it was like a month later, she was working with another engineer on recreating this problem that customers were having. It was like this very intermittent bug. I forget the details of the bug, but it was something to do with saving a workout in Map My Fitness, and sometimes it would work, sometimes it wouldn’t.
Paul Osman: What they were finding is it would only work under like really poor network conditions. She actually approached me and was like, “This sounds like something that we could do a Game Day around. We’re jumping through all these sorts of hoops, trying to recreate these situations in staging environments and stuff like that. Why don’t we just organize a Game Day where we attack some of our backend systems by adding this packet loss, or what have you, and see how the mobile experience is?”
Paul Osman: This has actually sparked some interest in using ALFI to really do targeted application level fault injection, which is something on our roadmap that I really hope that we can get going soon.
Rich Burroughs: So yeah, again, for those who might not be familiar with it, ALFI is a feature of our Gremlin tool that lets you inject failures within your application as opposed to some of the infrastructure attacks that we do.
Jacob Plicque: Yup, at the request level. So, a very small blast radius, and it gets very specific and very interesting.
Paul Osman: Yes, absolutely.
Rich Burroughs: Paul, I’m a community manager as my role, and I know that you all at Under Armour host that Chaos meetup down there in Austin. I’m curious if you have any thoughts about that?
Paul Osman: Yeah. Unfortunately, we haven’t done one in a while. Thanks for the reminder. Actually, I mean, that’s another great story for us. The two people who organized the first Chaos Engineering meetup in Austin are Tammy Butow, who obviously works at Gremlin, and a guy named Matthew Brahms who, I don’t remember what company he was working at in Austin, but he was a really big enthusiast and just a big active community member in the Chaos Engineering community.
Paul Osman: We offered our space and we sort of hosted, and I did a talk at the first one and Kolton came and did a talk as well, which was a lot of fun. Matthew ended up applying to Under Armour and now works for us on our infrastructure team. It’s just such a great example of like how this stuff, talk loudly about what you do because you’ll attract the people who are interested in it. He’s been a fantastic addition to our infrastructure team here.
Jacob Plicque: That’s awesome.
Rich Burroughs: Matthew’s great. Yeah, I met him up at a KubeCon, and he’s pretty active in our community Slack that we have as well. So yeah, that’s great. I think that honestly, it’s one of those things like using Kubernetes or other things that are kind of seen as a little bit more cutting edge in the industry. I think it really is potentially a recruiting advantage.
Paul Osman: Yeah, absolutely. I’ve mentioned that to our recruiting team internally and they’ve said it resonates with a lot of people when they talk to them, when they’re doing like preliminary screens or when they’re just trying to sell candidates on Under Armour, just mentioning or pointing to some of the talks that we’ve done, that folks here have done has been a really big advantage.
Rich Burroughs: Well Paul, I think that’s all the time that we have today. This was a super fun conversation. Thanks so much for coming to talk to us.
Paul Osman: Thanks for having me. I had a blast. It was great to talk.
Rich Burroughs: Where can people find you on the Internet, like on the Twitters and places?
Paul Osman: Yeah. I’m somewhat active on the Twitters. My handle is pretty simple. It’s PaulOsman. One word, no underscores or anything. GitHub is the same. Pretty much everywhere on the Internet, I’m PaulOsman.
Rich Burroughs: Okay, I will just start looking for PaulOsmans on random apps and see if it’s you. Thanks again, Paul.
Paul Osman: Hey, thank you both. Thanks, Rich. Thanks, Jacob.
Rich Burroughs: Our music is from Komiku. The song is titled Battle of Pogs. For more of Komiku’s music, visit loyaltyfreakmusic.com or click the link in the show notes.
Rich Burroughs: For more information about our Chaos Engineering community, visit gremlin.com/community. Thanks for listening, and join us next month for another episode.