Break Things on Purpose is a podcast for all-things Chaos Engineering. Check out our latest episode below.
Subscribe to Break Things on Purpose wherever you get your podcasts.
In this episode, we speak with Michael Kehoe, a Staff Site Reliability Engineer, about the practice of Chaos Engineering at LinkedIn.
Topics covered include:
- Site Reliability Engineering
- Building satellites at NASA
- LinkedIn’s Chaos Engineering project called Waterbear
- Using Chaos Engineering to test autoscaling
- Running Chaos Engineering experiments as regression tests in a release pipeline
- Tips for starting a Chaos Engineering practice at your company
Rich Burroughs: Hi, I’m Rich Burroughs and I’m Community Manager at Gremlin.
Jacob Plicque: And I’m Jacob Plicque, a solutions architect at Gremlin, and welcome to Break Things on Purpose, a podcast about Chaos Engineering.
Rich Burroughs: So here we are at episode two. We’re really excited. In a minute we’ll be going to our interview with Michael Kehoe from LinkedIn. But we just want to thank all of you who listened to the first episode, and let you know that you can reach out to us if you have comments, or feedback, or questions.
Jacob Plicque: Absolutely. So our email is email@example.com. And you can also find us on Twitter at @BTOPpod. That’s B-Top-Pod on Twitter and we’ll put that info in the show notes as well.
Rich Burroughs: So Jacob, what’s something that stood out for you in our conversation with Michael Kehoe?
Jacob Plicque: So yeah when thinking back on it, one thing that was really key for me was taking the learnings of a Chaos Engineering practice and seeing how an organization matures from that. They use a few different applications in order to accomplish the experiments that they run, which we’ll get into in the interview. And I think that it was really, really interesting to see what happens or how you get started, and move forward long term with a Chaos Engineering practice versus a small project. What about you?
Rich Burroughs: Well I agree with you completely. I mean what they’re doing is really advanced there and it was great to hear about. I think the most fun thing for me was hearing about Michael’s experience at NASA building satellites, and again we’ll get into that more in the interview, but…
Jacob Plicque: It’s so crazy. It’s so awesome. Yeah, I’m so excited for people to hear that.
Rich Burroughs: I find it really, really interesting because you know we live in this world where we’re building software, but we iterate on it all the time. So if there’s a problem we just patch it, and it’s such a different world when you think of the kind of work that those people at NASA do, and other space programs where you’re building something that’s going to go up into space and you’re not patching it tomorrow, right? So when you think about shifting left and making sure that you’ve got everything working right as early as possible, this is sort of the ultimate shift left kind of scenario.
Jacob Plicque: Absolutely, absolutely. Yeah I’m really excited for folks to give it a listen, so without further ado let’s get to the interview with Michael.
Rich Burroughs: So we are here today with Michael Kehoe. Michael is a Staff Site Reliability Engineer at LinkedIn. Welcome Michael.
Michael Kehoe: Thanks for having me on the podcast.
Rich Burroughs: Oh, we really appreciate you joining us.
Jacob Plicque: Absolutely.
Rich Burroughs: So can we just start off? Some of our audience might not actually know what a Site Reliability Engineer is. Can you maybe give a little bit of an overview of what it is that you do?
Michael Kehoe: Sure, so Site Reliability Engineering is a great fusion of software engineering and operations. So, as a Site Reliability Engineer I need to be proficient in, obviously, software engineering. But also in understanding how the Linux Kernel works and how networking works, how to troubleshoot systems, how to design monitoring, as well as being able to architect those systems.
Rich Burroughs: So, that sounds like a lot of things to know about.
Michael Kehoe: Definitely.
Jacob Plicque: What do you mean? It seems so simple, right? It’s interesting because I think that it’s popularized a lot now, especially with the Google SRE books and things like that. But it’s, for lack of a better term, a discipline that’s been around for a really, really long time. Did you come up on the software engineering side and then kind of switch over to the reliability side, or were you more on the infrastructure side and then the code caught up to it?
Michael Kehoe: So, when I was about 11 or 12 I was about to go on a really long overseas trip and I bought some magazines to take on the plane. And this magazine had an introduction to what was called Windows Avalon, which some of your listeners might know now as Windows Presentation Foundation in Microsoft. And that piqued my interest in coding. They had all these cool new glossy features that looked really interesting and sort of caught my attention. And from reading that magazine and then going and finding videos online I got really into coding at about that age. And then during class in high school I did as many courses, as many projects as I could. And then took sort of a bit of a pivot when I went to university and decided to study electrical engineering. And of course electrical engineering does have somewhat of a software component. But I got really interested in the operations side after we had some people come and visit the university and talk about Site Reliability Engineering. I saw it as this great fusion of being able to touch all these different things and being able to continually learn. And so that’s what really got me into this space. So I spent a lot of time teaching myself about Site Reliability Engineering and what it means, and once I joined LinkedIn I really enjoyed the culture of being able to work with people, being able to work through problems, design things and take something that might be difficult or complicated, and make it work not only from an engineering standpoint but from a business standpoint as well.
Rich Burroughs: That’s fantastic. I’ve heard a lot of people talk about that idea of having a learning mentality and how much of a difference that makes in someone’s career.
Michael Kehoe: Yeah it’s definitely something that’s important. Obviously our industry changes so quickly there’s always something new to learn. So it’s sort of a point of pride of mine to try and be on top of, you know, whatever the new technology is. So this year I’m spending more time on some of the new Linux Kernel features. As well as some of the new WebAssembly platform functionality that’s coming into browsers. And that’s my little goal for the year to push myself professionally.
Jacob Plicque: Nice. So it’s a combination of staying ahead of the curve but also it’s something that you’re passionate about and legitimately interested in. So it’s nice that that kind of pyramid comes together in that way.
Michael Kehoe: For sure.
Rich Burroughs: So I was looking at your LinkedIn page, and I think that I saw something mentioned on there about you building small satellites at NASA. And the nerd in me just can’t not ask about that.
Michael Kehoe: Yes, while I was in university I was given this great opportunity to do an internship at NASA Ames. Which is just down the road from us here at LinkedIn. And they were working on this really cool concept. It was basically a prototype at the time where they were looking to build these small satellites called CubeSats, which are four by four by four inches big. So they’re pretty small, and we put an Android phone in them and used the Android phone to do all of the computation for the satellite.
Rich Burroughs: Oh, wow.
Michael Kehoe: So this internship became a part of my thesis project. So at that time NASA was finishing their 1.0 version, and as part of my internship I helped them do some, for lack of a better word, regression testing on their software. And then for my thesis I looked at how to iterate on this design and improve it. So we found this really cool platform called I/O Board which allowed us to natively connect the Android phone onto all of this I/O bus ware which allowed us to control a number of the sensors and mechanical parts of the satellite. And we made a reasonable improvement on the design without any cost overhead. So just for the record we’re building these satellites for 3 to 10 thousand dollars, which is very cheap in the realm of product development.
Rich Burroughs: Wow.
Michael Kehoe: After the internship they launched a set of these satellites into a low Earth orbit, and they successfully completed their mission which was 11 days long I believe. So that was an awesome opportunity, and definitely gave me this great background in sort of working across multiple disciplines. So there was a little bit of mechanical there, and then a lot of electrical, and a lot of software.
Jacob Plicque: Nice, that’s awesome. Yeah that sounds once in a lifetime, literally. How do you think that brought you into the… From a Site Reliability Engineering perspective how did that help tie that together for you, and bring you where you are today?
Michael Kehoe: For these satellites… They’re an expensive investment. We’re building small satellites but you know the rocket you’re using is still expensive. So they cannot fail. So we were spending some time you know looking at ways where we could prod various parts of the platform, and make sure that things didn’t catastrophically fail. At one point we actually found a bug in the phone software which actually rebooted the phone. So through this testing we actually had to develop a way where the satellite could gracefully come back up and resume operations within a small time frame. So, definitely reliability was very much front and center of this project.
Jacob Plicque: Wow, that’s almost terrifying and exciting at the exact same time.
Michael Kehoe: Yeah it’s… I mean I think that’s a part of being a Site Reliability Engineer. You’ve got this great responsibility, and you’re always looking for ways to make it as reliable as possible and give the people using your platform the best experience possible. So this internship was definitely a good avenue into what I do as a day job now. Obviously, now I’m helping make the LinkedIn member experience the best it can be. But then it was making sure that this very expensive investment we had made did not fail during its mission.
Rich Burroughs: Yeah, so let’s talk about what you’re doing at LinkedIn. So I’ve read a piece that you wrote for an InfoQ magazine about Chaos Engineering that talked a little bit about what you’re doing there. There’s a project called Waterbear, I think?
Michael Kehoe: Yeah.
Rich Burroughs: Can you talk us through that a little bit?
Michael Kehoe: So Waterbear is LinkedIn’s main Chaos Engineering project. So I’m sure we’ll include the link to the article in the show notes.
Rich Burroughs: Yes, absolutely.
Michael Kehoe: But we basically look at three of our main Chaos Engineering tools. So the first one is Fire Drill, which is our host level fault injection platform. So we can go and manipulate network, CPU, and disk components of a system and see how the application and infrastructure respond. The second one, which is probably the coolest, is LinkedOut, where we can take a LinkedIn page and then go in, fail downstream calls just for that test user and see what happens to the end user experience. This is really cool. So if you go and look at the LinkedIn home page we have many downstream microservices serving that page. So what we can go and do is selectively fail some of those downstream calls by either making them time out, or making them throw an error back, and then see how that front end responds and what the user sees.
Rich Burroughs: That’s fantastic.
Michael Kehoe: And the best part about it is that we’re only doing it for that particular LinkedIn engineer testing it. This does not affect the member experience.
Jacob Plicque: Got you. So you can actively test in production while only having the blast radius as small as that particular user.
Michael Kehoe: Yes.
Jacob Plicque: And that particular request as well and then see how that degrades and if that affects performance.
Michael Kehoe: Yeah, absolutely. So it’s a very flexible platform that gives us a lot of granularity to see, you know, if this one particular API endpoint became unavailable or became slow, what the exact impact to the end user would be. And again as you said the blast radius is literally contained to that one person who’s testing that. So there’s no impact on the rest of the engineers at LinkedIn, and there’s no impact to the end member.
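To make the per-user blast radius concrete, here is a minimal Python sketch of the idea as Michael describes it: faults are keyed to a single test user, and the page degrades instead of erroring. This is not LinkedOut's actual implementation, which the conversation does not detail. The names `FaultInjector`, `FaultRule`, `feed-service`, and the user IDs are all hypothetical.

```python
import time
from dataclasses import dataclass, field


class DownstreamError(Exception):
    """Stands in for a failed downstream call (injected in this sketch)."""


@dataclass
class FaultRule:
    downstream: str             # hypothetical downstream service name
    mode: str                   # "error" or "delay"
    delay_seconds: float = 0.0  # only used when mode == "delay"


@dataclass
class FaultInjector:
    # Rules are keyed by user ID, so a fault only ever applies to the
    # engineer who registered it; every other caller is untouched.
    rules_by_user: dict = field(default_factory=dict)

    def add_rule(self, user_id: str, rule: FaultRule) -> None:
        self.rules_by_user.setdefault(user_id, []).append(rule)

    def call(self, user_id: str, downstream: str, func):
        """Invoke func(), first applying any fault registered for this user."""
        for rule in self.rules_by_user.get(user_id, []):
            if rule.downstream != downstream:
                continue
            if rule.mode == "error":
                raise DownstreamError(f"injected failure for {downstream}")
            if rule.mode == "delay":
                time.sleep(rule.delay_seconds)
        return func()


def render_home_page(injector: FaultInjector, user_id: str) -> dict:
    """Render a toy page that degrades gracefully when the feed call fails."""
    degraded = False
    try:
        feed = injector.call(user_id, "feed-service", lambda: ["post-1", "post-2"])
    except DownstreamError:
        feed, degraded = [], True  # show an empty feed instead of an error page
    return {"user": user_id, "feed": feed, "degraded": degraded}


injector = FaultInjector()
injector.add_rule("engineer-42", FaultRule(downstream="feed-service", mode="error"))
test_page = render_home_page(injector, "engineer-42")  # sees the degraded page
member_page = render_home_page(injector, "member-7")   # unaffected by the rule
```

Only the engineer who registered the rule sees the degraded page; the same request for any other user takes the normal path.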
Rich Burroughs: Wow.
Michael Kehoe: So the final piece we have is our D2 Tuner project. So LinkedIn has this API transport layer called Rest.li which basically allows our services to discover each other, and then talk to each other over an HTTP JSON protocol. In this Rest.li framework we have service side timeouts which are globally defined. And D2 Tuner basically collects all the operational metrics for how that service performs normally, and then provides us recommendations on what those global timeouts should be. In the past our default timeout has been rather large, which has caused us some issues. And then using this D2 Tuner platform we can go and optimize those timeouts to make sure that if the system is going to hit some problems we can fail that call as quickly as possible. And then let it retry or let the application gracefully degrade.
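One plausible way to turn normal-operation latency metrics into a timeout recommendation is to take a high percentile and add headroom. The sketch below assumes that approach; the conversation does not say how D2 Tuner actually computes its recommendations, and the function names, the 99th percentile, the 1.5x headroom, and the 50 ms floor are illustrative choices only.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank - 1, 0)]


def recommend_timeout_ms(latencies_ms, pct=99, headroom=1.5, floor_ms=50.0):
    """Suggest a timeout: a high percentile of observed latency plus headroom.

    The floor keeps very fast services from getting a timeout so tight
    that ordinary jitter would trip it.
    """
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    return max(floor_ms, percentile(latencies_ms, pct) * headroom)


# Latencies of 1..100 ms: the 99th percentile is 99 ms, so 99 * 1.5 = 148.5 ms.
recommended = recommend_timeout_ms(list(range(1, 101)))
```

A recommendation near 150 ms for a service like this one fails a stuck call far sooner than a large global default would, which leaves time to retry or degrade gracefully.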
Jacob Plicque: Oh, wow. So was that the V3? Because as you were discussing and mentioning it, my mind kind of went to thinking about Consul for service discovery and then application-level fault injection adding to that piece. Was this created as a result of doing some Chaos Engineering experiments? The reason why I bring that up is a lot of times when you’re creating an app you’re putting together timeouts and retry logic almost as a best guess, so to speak. And in some cases maybe you don’t know what the steady state of the application is. So I’m curious to know if that was… was it just kind of the mindset or was it created after something, like an incident?
Michael Kehoe: So I think it was a natural evolution of our infrastructure. So LinkedIn is a rather large software platform. So over the years we’ve migrated to this Rest.li framework, and over time we saw the need to you know make further optimizations into the operations of it. So definitely there were some people taking the initiative saying, “Hey we can do better here.” Part of it was also the LinkedOut platform giving us some hints on, “Hey you can probably do this better over here to improve the experience.” So I don’t think there was one sort of catalyst event. It was a natural evolution and then using those Chaos Engineering outcomes to better inform our operational decisions.
Rich Burroughs: And the project was started after LinkedIn had moved from a monolith to microservices, is that right?
Michael Kehoe: Yeah, so one or two years before I joined LinkedIn we started the move from this very monolithic architecture to this very microservices based architecture. Now we have a lot of moving parts. We’re very proud of the deploy velocity we have here. And so it’s very important for us to be continually testing all parts of the system. Making optimizations where we can, and using this combination of the Waterbear tool suite, we’re able to do this on a daily basis now.
Jacob Plicque: Awesome, awesome. So you mentioned some Chaos Engineering experiments there tied to CPU and the host level, and then touched on the application level. Were there ones that are top of mind for you that were more useful than the others, for example?
Michael Kehoe: So I’m a bit of a networking nerd. So I definitely started with the network related tests. So in our Fire Drill platform we have the ability to slow down the speed of the network. So my team owns one of the services that interacts with our Points of Presence, so network latency is a factor in the performance of both my application and the POPs. So we were able to use Waterbear, or Fire Drill more specifically, on this application to go and see, if the network was to degrade, what would be the impact to our infrastructure. And from this test we were actually able to make some optimizations. So that if something is not performing as well, we can go and cache that content, and we can go and retry and get that content. The second one I really like is manipulating the CPU usage, particularly in containers. So behind all the nice Docker or Kubernetes abstractions, on the Linux layer we have what’s called cgroups or control groups, which enforce CPU quotas in layman’s speak. And how this works, especially when you’re using multi-core or multi-threaded processors or… We’re a very Java-heavy application stack here at LinkedIn, so understanding how a Java application interacts with high CPU usage in a cgroup has been very beneficial for us. And it also allows us to go and test out our autoscaling platform to ensure that if we are using more CPU, we are scaling out infrastructure and application instances accordingly.
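As a rough illustration of the two ideas in that answer, the Python sketch below computes the CPU limit a cgroup enforces and the instance count a proportional autoscaler would want. The first function uses the cgroup v1 convention, where `cpu.cfs_quota_us` divided by `cpu.cfs_period_us` gives the number of CPUs a container may use and a quota of -1 means no limit. The scaling rule is the common "scale in proportion to utilization over target" formula and is an assumption here; the conversation does not describe how LinkedIn's autoscaler decides.

```python
import math


def effective_cpus(quota_us, period_us):
    """CPUs a cgroup may use per scheduling period, or None if unlimited.

    Mirrors cgroup v1 cpu.cfs_quota_us / cpu.cfs_period_us, where a
    quota of -1 means no CPU limit is enforced.
    """
    if quota_us < 0:
        return None
    return quota_us / period_us


def desired_instances(current, cpu_percent, target_percent):
    """Proportional scale-out rule (an assumption, not LinkedIn's algorithm).

    If average CPU is above the target, ask for proportionally more
    instances; never drop below one.
    """
    return max(1, math.ceil(current * cpu_percent / target_percent))


limit = effective_cpus(200_000, 100_000)              # 200 ms per 100 ms period
scaled = desired_instances(4, cpu_percent=90, target_percent=60)
```

A CPU chaos experiment can then check both halves: that the process is throttled at the computed limit, and that sustained load above the target raises the instance count.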
Jacob Plicque: Absolutely.
Rich Burroughs: Yeah, that’s very important.
Jacob Plicque: So specifically to that use case, so it’s one thing to touch on containers. That’s one piece of the scaling, right? And then from there how do you switch gears when you need to make a move from a host level perspective, or do those two things even talk to each other?
Michael Kehoe: So with Fire Drill I can basically pick any process and go and manipulate it. Which is really cool. So I can limit that blast radius to, you know, one instance. At LinkedIn we also have a feature called Dark Canaries, so we can go in and test new software, like new builds of software, in production where we don’t actually affect the responses that go back to the user. So we can go in and really prod at these applications and containers without affecting the member experience, which is very beneficial. So this allows us to do the same sort of things we’re doing with LinkedOut on the host level or on the system infrastructure. Again without affecting the end user experience.
Jacob Plicque: While also getting in a little bit deeper, even while you’re limiting the blast radius, which is really cool.
Michael Kehoe: Yes.
Jacob Plicque: Awesome.
Michael Kehoe: And as an SRE, I really want to know as much as I possibly can about my application. So how it behaves in certain situations. How far can I stress it before it breaks? So I think that makes my application more resilient, but also makes me a better engineer because I have a better depth of understanding of the applications that I am responsible for.
Rich Burroughs: Yeah, absolutely. And I think that having that deeper understanding of the applications you’re working with is one of the really big benefits of doing Chaos Engineering. What are some of the other benefits that you’ve seen from the Chaos Engineering you all do?
Michael Kehoe: I actually looked at this quote recently, and it’s from Donald Rumsfeld who was then the U.S. Secretary of Defense. And he stated at a Defense Department briefing: there are Known Knowns, there are things that we know that we know. There are Known Unknowns, which is to say there are things that we know we don’t know. But there are also Unknown Unknowns, the things that we don’t know we don’t know. And as an SRE the Unknown Unknowns really scare me.
Rich Burroughs: Yeah.
Michael Kehoe: So I will occasionally wake up in the middle of the night having had a bad dream about something I didn’t necessarily have the answer to. Chaos Engineering gives us an opportunity to shine a light on particularly those Unknown Unknowns, and some of those Known Unknowns. The way it’s done at LinkedIn in particular, we’re able to do that with such a small blast radius that we don’t have to impact the end user. So we can really deep dive into those Unknown Unknowns, and I can basically do anything I want to an application to understand what happens to it when it is stressed and how I can make it more resilient.
Rich Burroughs: I love too that you’re looking at it from the perspective of the user, right? That’s really what we should always be thinking about is, what’s the user experience and how we can improve it.
Michael Kehoe: Yeah, one of our core company principles at LinkedIn is Members First. And that does extend to our Chaos Engineering. We’re very careful to limit that blast radius, and we’ve got a great team behind this tooling that has made sure that we’ve kept that company principle first and foremost when building this suite of tools.
Rich Burroughs: I was reading a bit on your blog as well, and you mention in there the idea of using Chaos Engineering to help break down silos?
Michael Kehoe: Yes, so I think as an SRE my technical skills are very important. But being able to communicate with people is also really, really important. And I did a talk about this last year. So with Chaos Engineering there is an endless number of tests that I can go and run within the LinkedIn infrastructure. But it is also important to talk with people and communicate. So if I am going to run a chaos test on a host, I probably need to talk to some of the other tenants on that host to make sure that they’re aware. Equally, if I am doing a test that may potentially impact an upstream service, I should be reaching out to them and letting them know that they may see this interruption at this time. So there is definitely an expectation within the company that if you are testing something that another person or another team is going to notice, you proactively reach out and do the right thing. Let them know that, “Hey I’m doing this for this benefit, you may see this between these times, please reach out to me if you’re seeing anything, and we can resolve the issue.”
Rich Burroughs: Right, or even roll back if need be right?
Michael Kehoe: Yeah, of course. Equally, hopefully we can build this knowledge repository where, from the different tests that people are running, we can go and make optimizations in some of our core infrastructure and core libraries that we use to build the LinkedIn stack. So I think Chaos Engineering is naturally a great fit for SREs because we’re already conditioned to be talking and working with our counterparts in infrastructure engineering who build our data centers and networks, as well as the software engineers who are building the software that we support.
Rich Burroughs: Yeah, that seems to be the kind of role that a lot of the practitioners are in right now.
Michael Kehoe: Just to build on that, thankfully with what we’ve been able to build, a lot of our software engineers really want to go and test this stuff now as well. So, since we have a suite of tools where we can go and run tests without impacting the member, any software engineer is able to go and log into these tools, perform tests against their own user account, see how they… Say they wrote a new piece of code or new feature. They want to go and see, how does my new feature degrade gracefully, or what happens if this API is not available? So we’re able to extend some of those cultural paradigms around reliability and operational excellence to the software engineers, and we actually empower them to go and do that on their own without Site Reliability Engineers having to be involved in that process. Which is awesome.
Jacob Plicque: That’s actually really huge because that’s what removes SREs from becoming a single point of failure, right? In that aspect. Because I think a lot of folks, when they think single point of failure, they’re very focused on it from the infrastructure or application piece. But it’s also just as important from a people perspective. Using the Donald Rumsfeld point there, which I love, even though the Unknown Unknowns are absolutely terrifying in some cases, being able to prove out those Known Knowns is just as effective. It ties directly into building that knowledge base and saying, “Hey this gracefully degrades this way because we performed this experiment on this date and this is what it looks like.”
Michael Kehoe: And just one thing on that is with our LinkedOut platform we’re actually able to go and perform regression tests, or like Chaos Engineering regression tests, at release time to ensure that we haven’t introduced any bugs, where we now have either new critical downstreams or the experience degrades because of something that isn’t working in this new code release. So we’re able to ensure that we keep that operational excellence over a period of time.
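A release-time check for "new critical downstreams" could look something like the Python sketch below: fail each downstream in turn, record which failures break the page outright, and compare that set with an approved baseline. This is an assumption about the shape of such a test, not a description of LinkedIn's pipeline. The page functions and the service names `profile-service` and `recommendations` are hypothetical.

```python
def find_critical_downstreams(render, downstreams):
    """Return the downstreams whose failure makes render() raise."""
    critical = set()
    for name in downstreams:
        try:
            render(failed={name})  # fail exactly one downstream at a time
        except Exception:
            critical.add(name)
    return critical


def new_critical_downstreams(render, downstreams, allowed_critical):
    """List downstreams that became critical but are not in the baseline."""
    found = find_critical_downstreams(render, downstreams)
    return sorted(found - set(allowed_critical))


def render_profile_page(failed):
    """Toy page: the profile call is critical, recommendations are optional."""
    if "profile-service" in failed:
        raise RuntimeError("cannot render without profile data")
    recs = [] if "recommendations" in failed else ["rec-1"]
    return {"recommendations": recs}


def render_profile_page_regressed(failed):
    """Same page after a hypothetical bad release: recommendations now required."""
    if failed & {"profile-service", "recommendations"}:
        raise RuntimeError("unhandled downstream failure")
    return {"recommendations": ["rec-1"]}


DOWNSTREAMS = ["profile-service", "recommendations"]
BASELINE = {"profile-service"}

ok = new_critical_downstreams(render_profile_page, DOWNSTREAMS, BASELINE)
bad = new_critical_downstreams(render_profile_page_regressed, DOWNSTREAMS, BASELINE)
```

A pipeline step would fail the release whenever the returned list is non-empty, as it is for the regressed page here.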
Rich Burroughs: So you got these tests running as part of your pipeline. And you’ve got engineers who can kick off one of these tests if they want to. Do you all do game days as well?
Michael Kehoe: There are a number of teams that do game days on what we call InDays. Which is the day that LinkedIn gives to its employees to do whatever they want. It’s a great day for us to do something that’s not on our sprint that might interest us. So there are definitely some teams who do activities like what we call Wheel Of Doom. Where we have a wheel that we spin around with different incident scenarios and we talk through the resolution of that. There are also teams who might have noticed something in a small incident and want to go back and verify that what they thought happened, happened. And it gives them the opportunity to go and file a bug with the developer to go and optimize whatever that thing they saw was. So there is definitely the practice where teams will go and set aside some time, but equally we’ve also built the tooling into the culture now. Where you know people will go and do it on their own or have the automatic regression test running to catch these things straight away.
Rich Burroughs: So where do you see Chaos Engineering heading in the future? Do you have any thoughts about where things might go?
Michael Kehoe: So I think there is definitely more room to grow in that application infrastructure space. I think, for a rather new field, everyone runs on some sort of hardware or in a container, and we’ve been able to standardize Chaos Engineering tooling on that. I think that we will eventually get to the point where a number of API layers will have Chaos Engineering functionality baked into them. So we can test what happens to an application when API calls fail. I think that’s the natural evolution in that space. Definitely, we were mentioning single points of failure with people before as well. Hopefully, over time people will not only just think about the infrastructure or the applications on that infrastructure, but also think about the human aspect of it. So if a building is unavailable, or if a VPN is unavailable, how does the company work around that, and how do we work around the non-traditional things that we don’t usually think about as SREs.
Rich Burroughs: Sure.
Michael Kehoe: And work to reduce what we call bus factors, in either people, on-call, management, or even just VPN, which is very basic for our on-call. What happens if those things are unavailable? What is the next step for us to make sure that we can keep the site up and reliable?
Rich Burroughs: Michael, do you have any suggestions for people who want to learn more about Chaos Engineering?
Michael Kehoe: So there’s a lot of great content out there. As we mentioned earlier there’s the InfoQ mag that we did at the end of last year. There’s a number of great contributors: Patrick Higgins, myself, John Allspaw, as well as Nora Jones.
Rich Burroughs: Yeah.
Michael Kehoe: There was also recently, in New York, Chaos Community Day, which was great for hearing from a number of different companies. Especially some of the smaller companies, about what they’re doing in the Chaos Engineering space. I believe all of those talks are available online. Of course we semi-regularly post on the LinkedIn engineering blog about our Chaos Engineering journey. That is something that we’re continually evolving, and something that we’ll continue to talk about publicly as our culture around Chaos Engineering grows, as well as the tools that we build.
Jacob Plicque: Yeah, that’s actually a good question there: the challenges that you folks at LinkedIn faced when proceeding down a Chaos Engineering journey. Is there anything that’s top of mind for folks wanting to kick that off as they roll out a Chaos Engineering practice at their own work?
Michael Kehoe: So I think there’s two things. Obviously, getting over this stigma and understanding what we’re actually trying to do. I know I often hear Kolton, the CEO of Gremlin, talk about the cost of outages. And when you actually start thinking about that it does become quite scary. So if you could run your tests that validate the resiliency of your infrastructure with a small blast radius, the benefits definitely outweigh the risks over a period of time. And I think the other thing to keep in mind is it’s not too… Your company doesn’t have to be a certain size before you start practicing Chaos Engineering. Personally I view it as a craftsmanship thing. So if I am writing code I want to try and break it in every way possible before it goes out into production so I understand what its failure cases are and where I can make it more resilient. And so baking this sort of culture around Chaos Engineering and going the extra mile to do resiliency testing on your code or infrastructure… Baking that into your culture helps significantly to get over any stigma or any challenges around practicing Chaos Engineering regularly.
Jacob Plicque: Awesome, couldn’t agree more. Yeah as long as you’re starting small you’re able to really control that blast radius and find some value.
Rich Burroughs: All right, I think that’s all the time we have, Michael. Before we go I just want to ask if there’s anywhere that people can find you on the Internet to read more about what you have to say about these things?
Michael Kehoe: Sure, so you can find me on Twitter @MichaelKKehoe. And my blog is at michael-kehoe.io. You’ll always find me on LinkedIn of course and hopefully on the LinkedIn Engineering Blog soon.
Rich Burroughs: All right, thanks so much, Michael. We really appreciate you talking with us. This was a lot of fun.
Jacob Plicque: Absolutely.
Rich Burroughs: All right.
Michael Kehoe: Thanks very much, gentlemen.
Jacob Plicque: Thanks, appreciate it.
Rich Burroughs: Our music is from Komiku. The song is titled Battle of Pogs. For more of Komiku’s music visit loyaltyfreakmusic.com. For more information about our Chaos Engineering community visit gremlin.com/community. Thanks for listening and join us next month for another episode.