Podcast: Break Things on Purpose | Sam Rossoff: Data Centers Inside Data Centers

Break Things on Purpose is a podcast for all things Chaos Engineering. Check out our latest episode below.

In this episode of the Break Things on Purpose podcast, we speak with Sam Rossoff, Principal Software Engineer at Gremlin.

Episode Highlights

00:00:00 - Intro
00:02:23 - Iwata is the best, rest in peace
00:06:45 - Sam sneaks some SNES emulators/Engineer prep
00:08:20 - AWS, incidents, and China
00:16:40 - Understanding the big picture and moving from project to product
00:19:18 - Sam’s time at Snapchat
00:26:40 - Sam’s work at Gremlin, and culture changes
00:34:15 - Pokémon Go and Outro

Transcript

Sam: It’s like anything else: You can have good people and bad people. But I wouldn’t advocate for no people.

Julie: [laugh].

Sam: You kind of need humans involved.

Julie: Welcome to the Break Things on Purpose podcast, a show about people, culture, and reliability. In this episode, we talk with Sam Rossoff, principal software engineer at Gremlin, about legendary programmers, data center disasters at AWS, going from 15 to 3000 engineers at Snapchat, and of course, Pokémon.

Julie: Welcome to Break Things on Purpose. Today, Jason Yee and I are joined by Sam Rossoff, principal software engineer at Gremlin, and max level 100 Pokémon trainer. So Sam, why don’t you tell us real quick who you are.

Sam: So, I’m Sam Rossoff. I’m an engineer here at Gremlin. I’ve been in engineering here for two years. It’s a good time. I certainly enjoyed it. And before that, I was at Snapchat for six years, and prior to that at Amazon for four years. And actually, before I was at Amazon, I was at Nokia Research Center in Palo Alto, and prior to that, I was at Activision. This was before they merged with Blizzard, all the way back in 2002. I worked in QA.

Julie: And do you have any of those Nokia phones that are holding up your desk, or computer, or anything?

Sam: I think I’ve got an N95 around here somewhere. It’s, like, a phone circa 2009. Probably. I remember, it was like a really nice, expensive phone at the time and they just gave it to us. And I was like, “Oh, this is really nice.”

And then the iPhone came out. And I was like [laugh], “I don’t know why I have this.” Also, I need to find a new job. That was my primary—I remember I was sitting in a meeting—this was lunch. It wasn’t a meeting.

I was sitting at lunch with some other engineers at Nokia Research, and they were telling me the story about this app—because the App Store was brand new in those days—it was called iRich, and it was $10,000. It didn’t do anything. It was, like, a glowing—it was, like, NFTs, before NFTs—and it was just, like, a glowing thing on your phone. And you just, like, bought it to show you could waste $10,000 on an app. And that was the moment where I was like, “I need to get out of this company. I need a new job.” It was depressing at the time, I guess.

Julie: So, Sam, you’re the best.

Sam: No. False. Let me tell you a story. There’s a guy, his name is Iwata, right? He’s a software developer. He works at a company called HAL Laboratories. You may recall, he built a game called Kirby. Very famous game; very popular.

HAL Laboratories gets acquired by Nintendo. And Nintendo is like, “Hey, can you”—but Iwata, by the way, is the president of HAL Laboratories. Which is like, you know, ten people, so not—and they’re like, “Hey, can you, like, send someone over? We’re having trouble with this game we’re making.” Right, the game in question, at the time they called it Pokémon 2, now we call it Gold and Silver, and Iwata just goes over himself because he’s a programmer in addition to being president of HAL Laboratories.

And so he goes over there and he’s like, “How can I help?” And they’re like, “We’re over time. We’re over budget. We can’t fit all the data on the cart. We’re just, like, cutting features left and right.” He’s like, “Don’t worry. I got this.”

And he comes up with this crazy compression algorithm, so they have so much space left, they put a second game inside of the game. They add back in features that weren’t there originally. And they released on time. And they called this guy the legendary programmer. As a kid, he was my hero.

Also famous for building Super Smash Brothers, becoming the president of all of Nintendo later on in his life. And he died a couple years ago, of cancer, if I recall correctly. But he did this motion when he was president of Nintendo. So, you ever see somebody in Nintendo go like this, that’s a reference to Iwata, the legendary programmer.

Jason: And since this is a podcast, Sam has two hands up, or just search YouTube for—

Sam: Iwata.

Jason: That’s the lesson. [laugh].

Sam: [laugh]. His big console design after he became President of Nintendo was the Nintendo Wii, as you may recall, with the nunchucks and everything. Yeah. That’s Iwata. Crazy.

Julie: We were actually just playing the Nintendo Wii the other day. It is still a high-quality game.

Sam: Yeah.

Jason: The original Wii? Not like the… whatever?

Julie: Yeah. Like, the original Wii.

Jason: Since you brought up the Wii, the Wii was the first console I ever owned because I grew up with parents that made it important to do schoolwork, and their entire argument was, if you get a Nintendo, you’ll stop doing your homework and school stuff, and your grades will suffer, and just play it all the time. And so they refused to let me get a Nintendo. Until at one point I, like, hounded them enough; I was probably, like, eight or nine years old, and I’m like, “Can I borrow a friend’s Nintendo?” And they were like, “Sure, you can borrow it for the weekend.” So, of course, I borrowed it and I played it the whole weekend because, like, limited time. And then they used that as the proof of like, “See? All you did this weekend was play Nintendo. This is why we won’t get you one.” [laugh].

Sam: So, I had the exact same problem growing up. My parents are also very strict. And firm believers in corporal punishment. And so no video games was very clear. And especially, you know, after Columbine, which was when I was in high school.

That was like a hard line they held. But I had friends. I would go to their houses, I would play at their houses. And so I didn’t have any of those consoles growing up, but I did eventually get, like, my dad’s old hand-me-down computer for, like, schoolwork and stuff, and I remember—first of all, figuring out how to program, but also figuring out how to run SNES emulators on [laugh] on those machines. And, like, a lot of my experience playing video games was waking up at 2 a.m. in the morning, getting on emulators, playing that until about, you know, five, then turning it off and pretending to go back to bed.

Julie: So see, you were just preparing to be an engineer who would get woken up at 2 a.m. with a page. I feel like you were just training yourself for incidents.

Sam: What I did learn—which has been very useful—is I learned how to fall asleep very quickly. I can fall asleep anywhere, anytime, on, like, a moment’s notice. And that’s a fantastic skill to have, let me tell you. Especially when [crosstalk 00:07:53]—

Julie: That’s a magic skill.

Sam: Yeah.

Julie: That is a magic skill. I’m so jealous of people that can just fall asleep when they want to. For me, it’s probably some Benadryl, maybe add in some melatonin. So, I’m very jealous of you. Now I—

Jason: There’s probably a reason that I’m drinking all this cheap scotch right now.

Sam: [laugh].

Julie: We should point out that it’s one o’clock in the morning for Jason because he’s in Estonia right now. So, thank you, A, for doing this for us, and we did promise that you would get to talk about Pokémon. So—

Sam: [laugh].

Julie: [laugh].

Sam: I don’t know if you noticed, immediately, that’s what I went to. I got a story about Pokémon.

Julie: So, have you heard any of our episodes?

Sam: I have. I have listened to some. They’re mostly Jason, sort of, interviewing various people about their experience. I feel like they come, like, way more well-prepared than I am because they have, like, stuff they want to talk about, usually.

Julie: They also generally have more than an hour or two’s notice. So.

Sam: Well, that’s fair. Yeah. That probably [laugh] that probably helps. Whereas, like, I, like, refreshed one story about Iwata, and that’s, like, my level of preparation here. So… don’t expect too much.

Julie: I have no expectations. Jason already had what you should talk about lined up anyway. Something about AWS incidents in China.

Sam: Oh, my God. The first question is, which one?

Jason: [laugh].

Sam: So, I don’t know how much you’re familiar with the business situation in China, but American businesses are not allowed to operate in China. What happens is you create a Chinese subsidiary that’s two-thirds owned by Chinese nationals in some sort of way, you work through other companies directly, and you form, like, these partnerships. And I know you know, very famously, Blizzard did this many years ago, and then, like, when they pulled out of China, that company, all the people who worked at it were like, “Well, we’re just going to take your assets and make our own version of World of Warcraft and just, like, run that instead.” But Amazon did, and it was always this long game of telephone, where people from Amazon, usually, like, VP, C-level people, were asking for various things. And there were people whose responsibility it was to, like, go and make those things happen.

And maybe they did or, like, maybe they just said they did, right? And, like, it was never clear how much of it was lost in translation, or they’re just, like, dealing with unreasonable requirements, and they’re just, like, trying to get something done. But one story is one of my favorites because I was on this call. Amazon required all of their data centers to be multiple zones, right? So, now they talk about availability zones in a region. Internally at Amazon, that’s not how we referred to things; it’d be like, there’s the data center in Virginia, and there’s, like, the first one, the second one, the third one, right? They’re just, like, numbered; we knew what they were.

And you had to have three of them, and then all services had to be redundant such that they could handle a single data center failure. In the earlier days of Amazon, they would actually go turn off data centers to, like, make you prove this was the case. It was, like, a very early version of chaos engineering. Because it’s just, like, unreliable. And unfortunately, AWS kind of put the kibosh on that because it turns out people purchasing VMs on AWS don’t like it when you turn off their VMs without warning. Which, like, I’m sympathetic, uh… I don’t know.

As a side note, if you are data center redundant, that means you’re running excess capacity. So, if I’m about to lose a data center, I need to be able to maintain traffic without a real increase in error rates, that means I’ve got to be running, like, 50% excess capacity if I’ve only got three data centers, or 33% if you’ve got four data centers. And so capacity of course was always the hard problem when you’re dealing with data centers. So, when we were running the Chinese website—z.cn or amazon.cn—there was a data center in China, as you might imagine, as required by the complex business regulations and whatnot.
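[Editor's note: the capacity figures Sam quotes can be checked with a little arithmetic. The sketch below is an illustration only and is not from the episode; it assumes traffic is spread evenly across identically sized data centers, and the function name `excess_capacity_fraction` is hypothetical.]

```python
def excess_capacity_fraction(num_data_centers: int, tolerated_failures: int = 1) -> float:
    """Extra capacity to provision, as a fraction of what peak traffic needs.

    Assumption (not from the episode): load is balanced evenly and every
    data center is the same size, so the survivors must absorb the full load.
    """
    survivors = num_data_centers - tolerated_failures
    if survivors < 1:
        raise ValueError("at least one data center must survive")
    # Each site is sized at 1/survivors of peak load, so the fleet as a whole
    # is provisioned at num_data_centers/survivors times peak load.
    return num_data_centers / survivors - 1.0


if __name__ == "__main__":
    for n in (3, 4):
        print(f"{n} data centers: {excess_capacity_fraction(n):.0%} excess capacity")
```

Under those assumptions, three data centers need 50% headroom and four need about 33%, which matches the numbers in the conversation.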

And it had, you know, three availability zones, for lack of a better term. Or we thought it had three availability zones, which of course, this is what happened. One day, I got paged into this call, and they were dealing with a website outage, and we were trying to get people on the ground in China on the call, which as I recall, actually is a real hard problem to get. It was the middle of the night there; there was a very bad rainstorm; people were not near internet connectivity. If you’re unfamiliar with the Chinese landscape—well, it’s more complex today, but in those days, there were just basically two ISPs in China, and, like, Amazon only peered with one of them.

And so if you were on the other one, it was very difficult to get back into Amazon systems. And so they’d have places they could go to so they could connect when they—and so it was peered, too. And so it was a very difficult situation. It took us a while to get people on the phone, but basically, we lost two data centers at the same time, which was very surprising. And later we found out what happened is one of the data centers had flooded, which is bad, bunch of electrical machines flooding from a rainstorm that’s got whatever else going on.

It turns out the other data center was physically inside of the first [laugh] data center. Which is not the sort of isolation you want between two regions. It’s not really clear where in the conversation, you know, things got lost, such that this is what got implemented. But we had three data centers in theory, and in practice, we had two data centers, since one was inside the other. And when the first one flooded, the, like, floor gave way, and the servers crashed down on top of the other one. [laugh].

And so they were literally inside of each other after that point. They took down the Chinese website for Amazon. It was an experience. It was also one of those calls where there’s not a lot I could do to help, which is always frustrating for a lot of reasons.

Julie: So, how did you handle that call? Out of curiosity, I mean, what do you say?

Sam: Well, I’ll be honest with you, it took us a long time to get that information, to get the state of the world. Most of the call actually was trying to get ahold of people, try to get information, get translators—because almost everybody on the line did not speak either Cantonese or Mandarin, which is what the engineers were working with—and so by the time we got an understanding—I was in Seattle at the time—Seattle got an understanding of what was happening in—I think it was Beijing. I don’t recall off the top of my head—the people on the ground had done a lot of work to isolate and get things up and running, and the remainder of the work was reallocating capacity in the remaining data center so that we wouldn’t be running data center redundant, but at the very least, we would be able to serve something. It was, as I recall, a very long outage we had to take. Although in those days, the amazon.cn website was not really a profit center.

The business was—the Amazon business—was willing to sell things at steep discounts in China to establish themselves in that market, and so, there was always sort of a question of whether or not the outage was saving the company money. Which is, like, sort of a—

Julie: [laugh].

Sam: —it’s like a weird place to be in as an engineer, right? Because you’re, like, “You’re supposed to be adding business value.” I’m like, “I feel like doing nothing might be adding business value here.” It’s not true, obviously because the business value was to be in the Chinese market and to build an Amazon presence for some eventual world. Which I don’t know if they ever—they got to. I don’t work at Amazon, and haven’t in almost a decade now.

But it was definitely—it’s the kind of thing that wears on morale, right? If you know the business is doing something that is sort of questionable in these ways. And look, in sales, you know, when you’re selling physical goods, loss leaders are a perfectly normal part of the industry. And you understand. Like, you sell certain items at a loss to get people in the door, totally.

But as engineering lacked a real strong view of the cohesive situation on the ground, the business inputs, that’s hard on engineering, right, where they’re sort of not clear what the right thing is, right? And anytime you take the engineers very far away from the product, they’re going to make a bunch of decisions that are fundamentally in a vacuum. And if you don’t have a good feel for what the business incentives are, or how the product is interacting with customers, then you’re making decisions in a vacuum because there’s some technical implementation you have to commit in some way, you’re going to make a lot of the wrong decisions. And that was definitely a tough situation for us in those days. I hear it’s significantly better today. I can’t speak to it personally because I don’t work there, but I do hear they have a much better situation today.

Julie: Well, I’ll tell you, just on the data center thing, I did just complete my Amazon Certified Cloud Practitioner. And during the Amazon training, they drilled it into you that the availability zones were tens of miles apart—the data centers were tens of miles apart—and now I understand why because they’re just making sure that we know that there’s no data centers inside data centers. [laugh].

Sam: It was a real concern.

Julie: [laugh]. But kind of going back though, to the business outcomes, quite a while ago, I used to give a talk called, “You Can’t Buy DevOps,” and a lot of the things in that talk were based off of some of the reading that I did, in the book, Accelerate by Dr. Nicole Forsgren, Gene Kim, and Jez Humble. And one of the things they talked about is high-performing teams understanding the business goals. And kind of going back to that, making those decisions in a vacuum—and then I think, also, when you’re making those decisions in a vacuum, do you have the focus on the customer? Do you understand the direction of the organization, and why are you making these decisions?

Jason: I mean, I think that’s also—just to dovetail on to that, that’s sort of been the larger—if we look at the larger trend in technology, I think that’s been the goal, right? We’ve moved from project management to product management, and that’s been a change. And in our field, in SRE and things, we’ve moved from just thinking of metrics, and there were all these monitoring frameworks like USE (Utilization, Saturation, Errors) and RED (Rate, Errors, Duration) and monitoring for errors, and we’ve moved to this idea of SLOs, right? And SLOs are often supposed to be based on what’s my customer experience? And so I think, overall, aside from Accelerate and DevOps, DevOps I feel like, has just been one part of this longer journey of getting engineers to understand where they fit within the grander scheme of things.

Sam: Yeah. I would say, in general, anytime you have some sort of metric, which you’re working towards, in some sort of reasonable way, it’s easy to over-optimize for the metric. And if you think of the metric instead as sort of like the needle on a compass, it’s like vaguely pointing north, right, but keep in mind, the reason we’re heading north is because X, Y, Z, right? It’s a lot easier for, like, individuals making the decisions that they have to on a day-to-day basis to make the right ones, right? And if you just optimize for the metric—I’m not saying metrics aren’t helpful; they’re extremely important. I would rather be lost with a compass than without one, but I also would like to know where I’m going and not just be wandering northwards with the compass, right?

Julie: Absolutely. And then—

Jason: I mean, you don’t want to get measured on lines of code that you commit.

Sam: Listen. I will commit 70 lines of code. Get ready.

Julie: Well, and metrics can be gamed, right? If people don’t understand why those metrics are important—the overall vision; you’ve just got to understand the vision. Speaking of vision, you also worked at Snap.

Sam: I did. I did. That was a really fun place to work. I joined Snapchat; there were 30 people at the company and 14 engineers. Very small company. And a lot of users, you know, 20-plus million users by that point, but very small company.

And all the engineers, we used to sit in one room together, and so when you wanted to deploy the production back end, you, like, raised your hand. You’re like, “Hey, I’m going to ship out the code. Does anyone have changes that are going out, or is everyone else already doing it?” And one of my coworkers actually wrote something into our deploy script so the speakers on your computer would, like, say, “Deploying production” just so, like, people could hear when it went out the door. Because, like, when you’re all in one room, that’s, like, a totally credible deployment strategy.

We did build automation around that on CircleCI, which in those days was—I think this was 2014—much less big than it is today. And the company did eventually scale to at least 3000 engineers by the time I left, maybe more. It was hard for me to keep track because the company had just grown in all these different dimensions. But it was really interesting to live through that.

Julie: So, tell me about that. You went from, what, you said, 30 engineers to 3000 in the time that you were there.

Sam: Fifteen engineers, I was the fifteenth.

Julie: Fifteen. Fifteen engineers. What were some of the pain points that you experienced? And actually maybe even some advice for folks going through big company growth spurts?

Sam: Yeah, that hypergrowth? I think it’s easier for me to think about the areas that Snap did things wrong, but those were, like, explicit decisions we made, right? It might not be the case that you have these problems at your company. Like, one of the problems Snap had for a long time, we did not hire frontline managers or TPMs, and what that did is it created a lot of situations where you have director-levels with, like, 50-plus direct reports who struggled to make sure that—I don’t know, there’s no way you’re going to manage 50-plus direct reports as engineers, right? Like, and it took the company a while to rectify that because we had such a strong hiring pipeline for engineers and not a strong hiring pipeline for managers.

I know there’s, like, a lot of people at companies saying, “Oh, man, these middle managers and TPMs, all they do is, like, create work for, like, real people.” No. They—I got to see the world without them. Absolutely they had enormous value. [laugh]. They are worth their weight in gold; there’s a reason they’re there.

And it’s not to say you can’t have bad ones who add negative value, but that’s also true for engineers, right? I’ve worked with engineers, too, who also have added negative value, and I had to spend a lot of my time cleaning up their code, right? It’s like anything else: You can have good people and bad people. But I wouldn’t advocate for no people.

Julie: [laugh].

Sam: You kind of need humans involved. The thing that was nice about Snap is Snap was a very product-led company, and so we always had an idea of what the product is that we were trying to build. And that was, like, really helpful. I don’t know that we had, like, a grand vision for, like, how to make the internet better like Google does, but we definitely had an idea of what we’re building and the direction we’re moving it in. And it was very much led by Evan Spiegel, who I got to know personally, who spent a lot of time coming down talking to us about the design of the product and working through the details.

Or at least, you know, early on, that was the case. Later on, you know, he was busy with other stuff. I guess he’s, like, a CEO or something, now.

Julie: [laugh].

Sam: But yeah, that was very nice. The flip side meant that we under-invested in areas around things like QA and build tools and these other sorts of pieces. And, like, DevOps stuff, absolutely. Snapchat was on an early version of Google Cloud Platform. Actually an early version of something called App Engine.

Now, App Engine still exists as a product. It is not the product today that it was back in 2014. I lived through them revving that product, and multiple deprecations, and the product I used in 2014 was a disaster and huge pain, and the product they have today is actually semi-reasonable and something I would use again. And so props to Google Cloud for actually making something nice out of what they had. And I got to know some of their engineers quite well over the—[laugh] my tenure, as Snap was the biggest customer by far.

But we offloaded, like, a lot of the DevOps work onto Google (and paid them handsomely for it), and what we found is you kind of get whatever Google feels like level of support, which is not in your control. And when you have 15 engineers, that’s totally reasonable, right? Like, if I need to run, like, a million servers and I have 15 engineers, it’s great to pay Google SREs to, like, keep track of my million servers. When you have, you know, 1000 engineers though, and Google wants half a billion dollars a year, and you’re like, “I can’t even get you guys to get my, like, Java version revved, right? I’m still stuck on Java 7, and this Java 8 migration has been going on for two years, right?”

Like, it’s not a great situation to be in. And Snap, to their credit, eventually did recognize this and invested heavily in a multi-cloud solution, built around Kubernetes—maybe not a surprise to anyone here—and they’re still migrating to that, to the best of my knowledge. I don’t know. I haven’t worked in that company for two years now. But we didn’t have those things, and so we had to sort of rebuild at a very, sort of, large scale.

And there was a lot of infrastructure we set up in the early days in, like, 2014, when it was, like, “Ah, that’s good enough, this, like, janky Python script,” because that’s what we had time for, right? Like, I had an intern write a janky Python script that handled a merge queue so that we could get changes in, and that worked really great when there was like, a dozen engineers just, like, throwing changes at it. When there was, like, 500 engineers, that thing resulted in three-day build times, right? And I remember, uh, what was this… this was 2016… it was the winter of either 2016 to 2017 or 2017 to 2018 where, like, they’re like, “Sam, we need to, like, rebuild the system because, like, 72 hours is not an acceptable time to merge code that’s already been approved.” And we got down to 14 minutes.

So, we were able to do it, right, but you need to be willing to invest the time. And when you’re strapped for resources, it’s very easy to overlook things like dev tools and DevOps because they’re things that you only notice when they’re not working, right? But the flip side is, they’re also the areas where you can invest and get ten times the output of your investment, right? Because if I put five people on this, like, build system problem, right, all of a sudden, I’ve got, like, 100x build performance across my, like, 500 engineers. That’s an enormous value proposition for your money.

And in general, I think, you know, if you’re a company that’s going through a lot of growth, you have to make sure you are investing there, even if it looks like you don’t need it just yet. Because first of all, you do, you’re just not seeing it, but second of all, you’re going to need it, right? Like, that’s what the growth means: You are going to need it. And at Snap I think the policy was 10% of engineering resources were on security—which is maybe reasonable or not; I don’t know. I didn’t work on security—but it might also be the case that you want maybe 5 to 10% of the engineering resources working on your internal tooling.

Because that is something that, first of all, great value for your money, but second of all, it’s one of those things where all of a sudden, you’re going to find yourself staring at a $500 million bill from Google Cloud or AWS, and be like, “How did we do this to ourselves?” Right? Like, that’s really expensive for the amount of money we’re making. I don’t know what the actual bill number is, but you know, it’s something crazy like that. And then you have to be like, “Okay, how do we get everything off of Google Cloud and onto AWS because it’s cheaper.” And that was a—[laugh] that was one heck of a migration, I’ll tell you.

Julie: So, you’ve walked us through AWS and through Snap, and so far, we’ve learned important things such as no data centers within data centers—

Sam: [laugh].

Julie: —people are important, and you should focus on your tooling, your internal tooling. So, as you mentioned before, you know, now you’re at Gremlin. What are you excited about?

Sam: Yeah. I think there’s, like, a lot of value that Gremlin provides to our customers. I don’t know, one of the things I liked working at Snapchat is, like, I don’t particularly like Facebook. I have not liked Facebook since, like, 2007, or something. And there’s, like, a real, like, almost, like, parasitic aspect to it.

In my work at Snap, I felt a lot better. It’s easy to say something pithy, like, “Oh, you’re just sending disappearing photos.” Like, yeah, but, like, it’s a way people stay connected that’s not terrible the way that Facebook is, right? I felt better about my contribution.

And so similarly, like, I think Gremlin was another area where, like, I feel a lot be—like, I’m actually helping my customers. I’m not just, like, helping them down a poor path. There’s some, like, maybe ongoing conversation around if you worked in Amazon, like, what happens in FCs and stuff? I didn’t work in that part of the company, but like, I think if I had to go back and work there, that’s also something that might, you know, weigh on me to some degree. And so one of the—I think one of the nice things about working at Gremlin is, like, I feel good about my work if that makes sense.

And I didn’t expect it. I mean, that’s not why I picked the job, but I do like that. That is something that makes me feel good. I don’t know how much I can talk about upcoming product stuff. Obviously, I’m very excited about upcoming product stuff that we’re building because, like, that’s where I spend all my time. I’m, like, “Oh, there’s, like, this thing and this thing, and that’s going to let people do this. And then you can do this other thing.”

I will tell you, like, I do—like, when I conceptualize product changes, I spend a lot of time thinking, how is this going to impact individual engineers? How is this going to impact their management chain, and their, like, senior leadership director, VP, C-suite level? And, like, how do we empower engineers to, like, show that senior leadership that work is getting done? Because I do think it’s hard—this is true across DevOps and it’s not unique to Chaos Engineering—I do think it’s hard sometimes to show that you’re making progress in, like, the outages you avoided, right? And, like, that is where I spend, like, a lot of my thought time, like, how do I, like, help with that?

And, like, if you’re someone who’s, like, a champion, you’re, you’re like, “Come on, everyone, we should be doing Chaos Engineering.” Like, how do I get people invested? You care, you’re at this company, you’ve convinced them to purchase Gremlin, like, how do I get other engineers excited about Chaos Engineering? I think, like, giving you tools to help with that is something that, I would hope, I mean, I don’t know what’s actually implemented just yet, but I’d hope is somewhere on our roadmap. Because that’s the thing like, that I personally think a lot about.

I’ll tell you another story. This was also when I was at Amazon. I had this buddy, we’ll call him Zach because that’s his name, and he was really big on testing. And he had all this stuff about, like, the testing pyramid, if you’re familiar with, like, programming: unit testing, integration testing, all that stuff. And he worked on a team—a sister team to mine—and a lot of engineers did not care heavily about testing. [laugh].

And he used to try to, like, get people to, like, do things and talk about it and stuff. They just, like, didn’t care, even slightly. And I also kind of didn’t care, so I wasn’t any better, but something I did one day on my team is I was like, “You know, somebody else at Amazon”—because Amazon invested very heavily in developer tools—had built some way that was very easy to publish metrics into our primary metrics thing about code coverage. And so I just tossed in all the products for my team, and that published a bunch of metrics. And then I made a bunch of graphs on a wiki somewhere that pulled live data, and we could see code coverage.

And then I, like, showed it in, like, a team meeting one week, and everyone was like, “Oh, that’s kind of interesting.” And then people were like, “Oh, I’m surprised that’s so low.” And they found, like, some low-hanging fruit and they started moving it up. And then, like, the next year bi-weekly with our skip-level, like, they showed the progress, he’s like, “Oh, this is really good. You made, like, a lot of progress in the code coverage.”

And then, like, all of a sudden, like, when they’re inviting new changes, they start adding testing, or, like, all of a sudden, like, code coverage just seemed to ratchet up. Or some [unintelligible 00:30:51] would be like, “Hey, I have this thing so that our builds would fail now if code coverage went down.” Right? Like, all of a sudden, it became, sort of like, part of the culture to do this, to add coverage. I remember—and they, like, sort of pollinated to the sister teams.

I remember Zach coming by my desk one day. He’s like, “I’m so angry. I’ve been trying for six months to get people to care. And you do some dumb graphs on our wiki.” And I’m like, “I mean, I don’t know. It was just, like, an idea I had.” Right? Like, it wasn’t, like, a conscious, like, “I’m going to change the culture” moment, it was very much, like, “I don’t know, just thought this was interesting.”

And I don’t know if you know who John Rauser is, but he’s got this great talk at Velocity back in 2010, maybe 2011, where he talks about culture change and he talks about how humans do change culture readily—and, you know, Velocity is very much about availability and latency—and what we need to do in the world of DevOps and reliability in general is actually we have to change the culture of the companies we’re at. Because you’re never going to succeed, just, like, here emoting adding chaos engineering into your environment. I mean because one day, you’re going to leave that company, or you’re going to give up and there’ll be some inertia that’ll carry things forward, but eventually, people will stop doing it and the pendulum will swing back the other way, and the systems will become unreliable again. But if you can build a culture, if you can make people care—of course, it’s the hardest thing to do in engineering, like, make other engineers care about something—but if you can do it, then it will become sort of self-perpetuating, right, and it becomes, like, a sort of like a stand-alone complex. And then it doesn’t matter if it’s just you anymore.

And as an engineer, I’m always looking for ways to, like, remove myself as a critical dependency, right? Like, if I could work myself out of a job, thank you, because, like, [laugh] yeah, I can go work on something else now, right? Like, I can be done, right? Because, like, as we all know, you’re never done with software, right? There’s always a next version; there’s always, like, another piece; you’re always, like, migrating to a new version, right? It never really ends, but if you can build something that’s more than just yourself—I feel like this is, like, a line from Batman or something. “Mr. Wayne, if you can become a legend”—right? Like, you’d be something more than yourself? Yeah, absolutely. I mean, it’s not a great delivery like Liam Neeson. But yeah.

Jason: I like what you said, though. You talked about, like, culture change, but I think a big thing of what you did is exposing what you’re measuring or starting to measure this thing, right? Because there’s always a statement of, “You can’t improve until you measure it,” right? And so I think simply because we’re engineers, exposing that metric and understanding where we’re at is a huge motivator, and can be—and obviously, in your case—enough to change that culture is just, like, knowing about this and seeing that metric. And part of the whole DevOps philosophy is the idea that people want to do the best job that they can, and so exposing that data of, “Look, we’re not doing very well on this,” is often enough. Just knowing that you’re not doing well, is often enough to motivate you to do better.

Sam: Yeah, one of the things we used to say at Amazon is, “If you can’t measure it, it didn’t happen.” And like, it was very true, right? I mean, that was a large organization that moves slowly, but, like, it was very true that if you couldn’t show a bunch of graphs or reports somewhere, oftentimes people would just pretend like it never happened.

Julie: So, I do want to bring it back just a little bit, in the last couple of minutes that we have, to Pokémon. So, you play Pokémon Go?

Sam: I do. I do play Pokémon Go.

Julie: And then how do people find you on Pokémon Go?

Sam: My trainer—

Jason: Also, I’m going to say, Sam, you need to open my gifts. I’m in Estonia.

Sam: [laugh]. It’s true. I don’t open gifts. Here’s the problem. I have no space because I have, like, all these items from all the, like, quests and stuff they’ve done recently.

They’re like, “Oh, you got to, like, make enough space, or you could pay us $2 and we’ll give you more space.” I’m like, “I’m not paying $2,” right? Like—

Jason: [laugh].

Sam: And so, I just, like, I have to go in every now and then and, like, just, like, delete a bunch of, like, Poké Balls or something. Like maybe I don’t need 500 Poké Balls. That’s fair.

Jason: I mean, I’m sitting on 628 Ultra Balls right now. [laugh].

Sam: Yeah. Well, maybe you don’t need—

Jason: It’s community day on Sunday.

Sam: I know, I know. I’m excited for it. I have a trainer code. If you need my trainer code to find me on Pokémon Go, it’s 1172-0487-4013. And you can add me, and I’ll add you back because, like, I don’t care; I love playing Pokémon, and I play every day. [laugh].

Julie: And I feel it would be really rude to leave Jason out of this since he plays Pokémon a lot. Jason, do you want to share your…

Jason: I’m not sharing my trainer code because at this point, I’m nearing the limit, and I have all of these Best Friends that I’m actually Lucky Friends with, and I have no idea how to contact them to actually make Lucky trades. And I know that some of them are, like, halfway around the world, so if you are in the Canary Islands and you are a friend of mine on Pokémon Go, please reach out to me on Twitter. I’m @gitbisect on Twitter. Message me so that we can actually, like, figure out who you are. Because at some point, I will go to the Canary Islands because they are beautiful.

Sam: Also, you can get those, like, sweet Estonia gifts, which will give you those eggs from Estonia, and then when you trade them you get huge mileage on the trades. I don’t know if this is a thing you [unintelligible 00:36:13], Jason, but, like, my wife and I both compete for who can get the most mileage on the trip. And of course, we trade with each other but that’s, like, a zero-sum game, right? And so the total mileage on trades is a big thing in my house.

Jason: Well, the next time we get together, I’ve got stuff from New Zealand, so we can definitely get some mileage there.

Sam: Excellent.

Julie: Well, this is excellent. I feel like we have learned so much on this episode of Break Things on Purpose, from obviously the most important information out there—Pokémon—but back to some of the history of Nintendo and Amazon and Snap and all of it. And so Sam, I just want to thank you for being on with us today. And folks again, if you want to be Sam’s friend on Pokémon Go—I’m sorry, I don’t really know how it works. I don’t even know if that’s the right term—

Sam: It’s fine.

Julie: You’ve got his code. [laugh]. And thanks again for being on our podcast.

Jason: For links to all the information mentioned, visit our website at gremlin.com/podcast. If you liked this episode, subscribe to the Break Things on Purpose podcast on Spotify, Apple Podcasts, or your favorite podcast platform. Our theme song is called Battle of Pogs by Komiku and is available on loyaltyfreakmusic.com.
