Podcast: Break Things on Purpose | Mikolaj Pawlikowski, Engineering Lead at Bloomberg

Break Things on Purpose is a podcast for all-things Chaos Engineering. Check out our latest episode below.

You can subscribe to Break Things on Purpose wherever you get your podcasts.

If you have feedback about the show, find us on Twitter at @BTOPpod or shoot us a note at podcast@gremlin.com!

In this episode of the Break Things on Purpose podcast, we speak with Mikolaj Pawlikowski, Engineering Lead at Bloomberg.

Transcript

Jason Yee: We may cut this later, but I have a totally random question. For those who are listening to the podcast, you can't actually see Miko, but he's got this plush penguin head that's sitting next to him. I'm curious if there's a story behind the penguin head.

Mikolaj Pawlikowski: Well, it's complicated.

Patrick Higgins: Hello and welcome to today's episode of Break Things on Purpose. My name is Patrick Higgins and I'm a Chaos Engineer at Gremlin.

Jason Yee: And I'm Jason Yee, Director of Advocacy at Gremlin.

Patrick Higgins: Today we speak with Mikolaj Pawlikowski, who is a software engineer project lead at Bloomberg. He's also the author of Chaos Engineering: Site Reliability Through Controlled Disruption. How are you doing today, Miko?

Mikolaj Pawlikowski: Doing great, thanks for having me.

Patrick Higgins: We're very excited to have you here today. Most recently, your book has come out. That's been something that I think a lot of people in our space have taken a lot of interest in, and it's been really well received, so we've got a bunch of questions about that. We would like to ask you if you've had those experiences in the past where you've really felt the heartache that comes with not thinking about things proactively, and what that's looked like for you.

Why Chaos Engineering?

Mikolaj Pawlikowski: That's a tough question. I think the entire reason why we started with this Chaos Engineering thing to begin with was because we started working on a brand new project using Kubernetes back in 2016 or so, and all of a sudden we found ourselves with all this new software, all these new pieces, a lot of moving parts where ... yeah, there were probably a lot of patches coming in, and there wasn't a good manual that would tell you, "Oh, this is the best way to configure it." You have to kind of figure this out by yourself.

So it kind of came naturally as an evolution of just trying to make sure that all the things that we can think of, or all the outages or all the problems that we had before, we would simulate them again to make sure that next time we're not actually prone to running into the same issues. That later evolved into things like PowerfulSeal, when we open-sourced the tools separately, and eventually kind of distilled into writing that book. So I often joke that this was all kind of like a sleeping aid, to sleep a little bit better at night and not be called that much.

Patrick Higgins: That's fantastic. Have you found that one of the more interesting things when you make that transition from firefighting into thinking about things ahead of time is that you can start to look at abstractions like Kubernetes, and really think in more logical and curious terms about them, more as like a learning approach, rather than desperately trying to get things fixed and put out the fires?

Mikolaj Pawlikowski: Sure. I mean, typically the good presentations that you see at conferences are either about some big outage that they had and fixed, or an outage that was prevented. I think a part of it is that if you start with a new code base, as was the case for us, or if you start on a new project, the best way to understand it is to kick the tires and understand how these things work or how they break and all of that. So this is, again, naturally going in this direction of discovering these things, but most of the time these are things that don't necessarily make for particularly exciting presentations. This is easy stuff, the low-hanging fruit, and I'm sure your experience has probably been similar with that.

Now, the example that I use to get a gotcha moment when I explain this to people who are a little bit skeptical is this: everybody knows systemd and systemd services, and a lot of people don't know that by default, if you just set it to restart always, it doesn't actually mean it's always going to restart, because the other default parameters mean that if you have, I think it's by default five crashes in a 10-second period, it's just going to stop, and everybody's going to be like, "Whoa, what's going on?"

So when I'm teaching a little bit of Chaos Engineering during training, internally and externally, that's a good moment when people go, "Oh wait." Even with simple things like that, where it's one line, if you didn't test it properly, if you didn't actually simulate this kind of thing, you might run into trouble. So you don't really have to go all the way to Kubernetes to prove those things, you can start pretty simple. I don't know, is that also your experience? Do you find that's how people react?
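
The restart behavior described above can be illustrated with a small Python sketch. This is a simplified model of start rate limiting, not systemd's actual code: the function name `first_refused_start` is made up for this example, and the defaults of 5 starts per 10 seconds are assumed to mirror the documented defaults for `StartLimitBurst=` and `StartLimitIntervalSec=` (a distribution or unit file may override them).

```python
def first_refused_start(start_times, burst=5, interval=10.0):
    """Return the index of the first start attempt that would be refused,
    or None if every attempt is allowed.

    start_times: ascending timestamps (in seconds) of start attempts.
    Simplified model: a window opens at the first attempt, at most `burst`
    attempts are allowed inside it, and a new window opens once an attempt
    arrives more than `interval` seconds after the window began.
    """
    window_begin = None
    count = 0
    for index, t in enumerate(start_times):
        if window_begin is None or t - window_begin > interval:
            # Start a fresh window with this attempt.
            window_begin = t
            count = 1
        elif count < burst:
            count += 1
        else:
            # Too many starts in the window: the service stays down.
            return index
    return None


# A service that crashes right away and is restarted every 100 ms.
crash_loop = [i * 0.1 for i in range(10)]
# A service that crashes roughly every 3 seconds.
slow_crashes = [i * 3.0 for i in range(10)]

print(first_refused_start(crash_loop))
print(first_refused_start(slow_crashes))
```

In this model the tight crash loop is cut off once the burst is used up, while the slower crash pattern keeps being restarted, which is the kind of surprise a one-line experiment can surface.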

Patrick Higgins: I've found that it's really ... what you're describing is a really interesting way to look at getting that light switch to flick for people with Chaos Engineering. Where it's taking a very simple configuration, a default configuration or parameter, or some kind of thing where most people think it's one way, but in fact it's the other way, and if you can test that and show it to them, face-to-face, then they've kind of got to confront the fact that, "Whoa, this changes the way I think about this one thing. How many other things may fall into the bucket where I think I know what it looks like, and then once we actually press that button or pull that lever, it isn't exactly as we'd expect?"

Mikolaj Pawlikowski: Yeah, exactly, and it's very often simple things, and people have this misconception that it's only for massive distributed systems, and if you're not at the scale of Netflix, you shouldn't be touching that. My point of view is that you can start with any system, even a single process, and just this mindset typically gives you a lot of value for a little effort.

Patrick Higgins: Yeah, absolutely.

Jason Yee: Yeah, I find that these days, even those simple systems are starting to become complex systems. When we look at what does it take to just set up a simple blog these days, and you're looking at various hosting providers and you've got a database and a front end, and it's suddenly a much more distributed app than it was five, ten years ago.

Mikolaj Pawlikowski: Yeah, so what do I need for my blog? Well, I need Kubernetes. I need load balancers in front of it. I need a CDN because I expect it to blow up in popularity any day now. So yeah, things get complex really quickly. That is nice, that's true.

Miko's Book

Patrick Higgins: Yeah, I was wondering about your book... The book's kind of broken down into three separate sections, where upfront you're dealing with introducing the principles of Chaos Engineering. You translate that then in the second section into thinking about Chaos Engineering experiments, and then in the third section, it's a lot of bringing things together and integrating that into real world situations from an organizational perspective or a business perspective, perhaps.

And that second section was really interesting to me because the experiments that you chose were so varied. It wasn't that you were really aiming that book at Reliability Engineers particularly. This is something that was practical, and that Application Engineers could pick up, or people coming from all different backgrounds, and I was wondering how much of that was intentional, and also what you think that means for the kind of people that are going to look to your book and take value from it.

Mikolaj Pawlikowski: I'd like to think it was fully intentional. Basically, when... So the story behind this is kind of funny, because I wasn't suspecting anything when I woke up that day, and I got a call out of the blue from the Manning publisher and they were like, "Oh yeah, well, we liked the things that you've been doing with Chaos Engineering and PowerfulSeal, the presentations. Why don't you write a book?" I was like, well, there's already been a book or two written about that, and then we were talking about it, and we realized that sure, there are these books that talk about the mindset, and that is great, but in order to be able to actually apply that, it's easy to fall into the trap of thinking that this is actually only for Netflix and Google, and whatever, and that the same methodology can't necessarily be applied elsewhere.

So that was basically how they talked me into writing that book in the first place, to share the different layers. And like you said, I was initially trying to actually start entirely from the syscalls and build up all the way to Kubernetes. We had to switch the order a little bit for marketing reasons, but yeah, the goal was to basically shout out: this is really not rocket science, most of that is simple stuff that everybody can do, regardless of whether you're an SRE at Google, or running your Kubernetes stuff or managing the Kubernetes, so that you need to understand behind the scenes how those things actually work, all the way to... we have this legacy system, we kind of know how it's supposed to work. Are we really sure about that? Okay, I think.

So yeah, the type of content was specifically designed to go through the different stacks, and different technologies, and different languages to show that it's about the breadth here rather than any specific tool. I try to throw in tools here and there, obviously, where it might make sense and give you value, but you'll notice that most of those things use tools that have been there forever and don't necessarily require you to do anything brand new or to pay money for any of that. So that was... I'm kind of happy that you noticed that trend in the book, because that was one of the big parts of what I was trying to achieve with it.

Chaos Engineering For Frontends

Patrick Higgins: Yeah, to that point, I found the part about thinking about Chaos Engineering and its effect on front ends really cool. As someone who's spent a lot of time in front end code, I really appreciated it, because taking that perspective, what we're really thinking about is user experience and reliability from a user's perspective as well. Being able to take that really holistic view of the way that these things look, and the way that this practice can really help not just us but create less pain for customers as well, I thought it was great. I think a lot of people are going to have that really affect the way that they think about practices of Chaos Engineering, but also just reliability generally, that it's not abstracted away from customers. It's actually fundamentally important to the customer experience.

Mikolaj Pawlikowski: Yeah, I'm really happy you're bringing this up. The JS crowd actually appreciated that, so that's great, and I guess the flip side of that is that it was really challenging to kind of compress enough information about those different stacks, technologies, languages, and whatnot into single chapters without overwhelming people, because you could write a book about each of those things that were touched on in those chapters. So I hope that I struck a reasonable balance, but the JS chapter was particularly funny to write and I hope I didn't offend the JS crowd too much.

eBPF

Jason Yee: I'm curious. One thing that I wanted to know earlier on, you had mentioned starting simply and so I'm curious, is there one favorite Chaos Engineering experiment that you've run repeatedly on systems and found the most benefit from?

Mikolaj Pawlikowski: Deep question. I think that, from my personal benefit, before I actually wrote that book I didn't really use things like strace that much on anything more complicated than a basic thing, and through the process of writing those chapters and finding these examples, I realized that it's much more relevant than we expect, and understanding at the kind of lower level what the thing is actually doing is not that hard. And I think another... and you probably have seen this, so this is not directly answering your question, it's a little bit on the side, but I think you might have seen it in the book. I keep repeating how amazing eBPF is, and that has really been like a game changer in the last couple of years for us in terms of what visibility we can achieve and at what cost, because obviously strace is great, but the performance hit means that you can't really touch anything in production with that.

And with eBPF you can. So for the listeners who don't know, it's the extended Berkeley Packet Filter, and it's a part of the kernel that allows you to more or less write arbitrary snippets of code that can be attached to various events, and it also allows for generating aggregations, so it can generate things like counters and other maps that are executed directly in the kernel and without a penalty, and then you can export that data outside to get your visibility. So we've had an amazing amount of great fun to begin with, but also the extended visibility that we're able to achieve with really small snippets of code, to gain the kind of observability that we didn't really have before at all. So that's really been like a game changer for us and that's something that I keep recommending to everybody, especially in the context of Chaos Engineering, where the observability is so important. You can do a lot of stuff without actually modifying any of that code, and most of the time without the application knowing anything about your observability.

So it's not really my favorite experiment, it's more an entire family of observability that opens up in this space. So if you're not using it, you should. So check it out.

Jason Yee: Yeah, I know a lot of monitoring companies have definitely taken advantage of eBPF in order to gain observability, so most of the... at least the SaaS offerings out there that I know or have used do it, and that was super exciting. So it does sound like one of those good things that you've started to do with Chaos Engineering is really just to get in there and actually start to validate your monitoring and things like that.

Mikolaj Pawlikowski: Yeah, it looks a little bit scary at the beginning, but once you get past the fact that you need to occasionally look up some bit of the kernel code, it's not that bad, so good stuff.

Patrick Higgins: What is the post game day process for you? How do you go through postmortems or post game day wrap ups? What does that look like? And do you have any tips for that as well?

Mikolaj Pawlikowski: That's an interesting one. I've got to say that I don't really do a lot of game days, because I find that most of the benefit that I get from them is just to generate the initial buy-in from team members, to kind of make it a little bit more fun, kind of hackathon style, and get them excited about it. But then I don't really want them to just do it on one particular day, every month or whatever. I just want them to think that way to begin with. So I'm not super keen on, you know, this kind of... it is cool and I see the benefit, but it's not necessarily something that I do very often. So I'm probably the wrong person to ask that question. Nothing wrong with them, though.

Patrick Higgins: No, absolutely. In terms of the experiments you do run, what tends to be the kind of models that you go after for introducing Chaos experimentation generally to teams?

SLOs

Mikolaj Pawlikowski: Sure. So obviously that depends very much on what kind of team it is. For example, for my team, we typically... we started when we actually put down on paper the SLOs that we're expecting to hit. The kind of obvious first step is to just run some kind of continuous verification that we satisfy those SLOs, even when we have the kind of failure that we expect, and if you run in any cloud environment, the failure is there all the time. So typically we would just see the things that were breaking and continuously, over time, add them to whatever process we ran for the Kubernetes stuff. We run a lot of those just as scenarios for PowerfulSeal, because it's easy to do, but it could be anything.

So the SLOs are typically a good place to start because it's the kind of thing that everybody knows they need to have, but in practice it kind of blurs a little bit here and there. I'm imagining that if, on the other hand, we talked about the JS and the front end stuff, if you're running a team like that, you might not necessarily need to have SLOs in terms of the performance of your front end. If you do, that's great. So I would expect that on the other side of the spectrum it will be very different, and probably just poking around a little bit with kind of ad hoc experiments might already be a good start. The kind of thing like in the book, when you just write a little snippet and you verify that things are going well, and, depending on your pipeline, just make sure that it's automated later. So yeah, I don't really have a one-size-fits-all answer for this. Like you said, there are a lot of different scenarios and different rules apply.
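
The idea of verifying an SLO while a failure is in play can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `slo_satisfied`, the list-of-booleans input, and the 99% target are all hypothetical, and it is not how PowerfulSeal or any particular tool does it.

```python
def slo_satisfied(probe_results, target=0.99):
    """Check whether the share of successful probes meets the SLO target.

    probe_results: iterable of booleans, one per probe sent while the
    experiment was running (True means the request succeeded).
    target: required success ratio, for example 0.99 for 99%.
    """
    results = list(probe_results)
    if not results:
        raise ValueError("no probe results to evaluate")
    success_ratio = sum(1 for ok in results if ok) / len(results)
    return success_ratio >= target


# 1000 probes recorded during an experiment, with 5 and then 20 failures.
mostly_fine = [True] * 995 + [False] * 5
too_many_failures = [True] * 980 + [False] * 20

print(slo_satisfied(mostly_fine))
print(slo_satisfied(too_many_failures))
```

Running a check like this on a schedule, with the expected failure injected, is one way to turn an SLO written on paper into something that is continuously verified.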

Patrick Higgins: Absolutely. Yeah.

Jason Yee: So I'm curious along those lines, when you mentioned SLOs. I think a lot of folks are starting to adopt that and there's often some confusion around, "Great, so you're saying I should start with my SLOs, but how do I actually start?" Do you have any sort of advice on how to create good SLOs?

Mikolaj Pawlikowski: I think that it's probably true that having a bad SLO might be even worse than having no SLO at all. I think that it's one of those things, again, that a lot of people see as something complicated and something that requires a degree in Maths, but most of the time, from my experience, the Maths is a back-of-the-napkin kind of calculation, and that tends to be good enough. A lot of the time you do need to estimate this thing. So anyway, if you can calculate, or if you can multiply and divide, you're typically doing okay, so.

Jason Yee: But what if I can't do that?

Mikolaj Pawlikowski: I'll find someone with a calculator. I think that my advice would be just to start easy, start really easy and build up. It's kind of the same thing with alerting. When teams start alerting on things, they typically go all in and want alerts on all of that, and then over time you realize that this one is actually pretty noisy, this one has an arbitrary value in it.

So, well, that shows that value really isn't that relevant anymore. That threshold should be changed, so getting the right alerts takes a bit of trial and error. I'm kind of surprised I'm saying this on record, but it is a little bit of equal parts art and science. I'm sure you've had some experiences with that too, and it's the same thing with SLOs. Sometimes you have them directly coming from the business and you don't really have a choice, and you design the entire system to meet a certain SLO, right? But there's a lot of gray area where you need to decide on something that's reasonable. The definition of reasonable is going to vary from one person to another. So yeah, if I can give you one piece of advice, it is to just not think about this as rocket science and just start simple and iterate, unless obviously your case is clear-cut and this comes from above. Anyways guys, good luck.
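
The "multiply and divide" arithmetic behind an availability SLO can be shown with a short Python sketch. The function name `error_budget_minutes` and the 30-day window are assumptions chosen for illustration; real SLOs may be defined over requests rather than time.

```python
def error_budget_minutes(target, window_days=30):
    """Return how many minutes of downtime an availability target allows.

    target: availability as a fraction, for example 0.999 for 99.9%.
    window_days: length of the measurement window in days.
    """
    if not 0 < target < 1:
        raise ValueError("target must be a fraction between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1 - target) * total_minutes


for target in (0.99, 0.999, 0.9999):
    budget = error_budget_minutes(target)
    print(f"{target:.2%} over 30 days allows about {budget:.1f} minutes of downtime")
```

A 99.9% target over 30 days works out to roughly 43 minutes of allowed downtime, which is the kind of number that can be agreed on first and then iterated on.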

Jason Yee: I like that you say art and science, because for me it's usually pain and value, or annoyance and value of, "All right, I'm getting super annoyed at this alert. It's just time to deal with it." I've known for a week that it's awful because I've gotten hundreds of alerts, so then you turn that off and you try to fiddle with things as you say iterate.

Mikolaj Pawlikowski: Yeah, that's another good measure. The threshold of pain, where eventually there is a point when you're actually going to fix it, but yeah.

Patrick Higgins: Is there anything coming up for you in terms of what you're excited about seeing at the moment in the world of reliability and perhaps in Chaos Engineering? What is exciting you right now?

Mikolaj Pawlikowski: One of the things that are personally exciting for me is the fact that I am seeing a little bit of a movement in the ecosystem away from people immediately thinking that this is just some kind of gimmick, where they remember the slogans from blog posts a couple of years ago about just randomly breaking things in production, and that usually means it's not really going to go anywhere. I think that with the work that's been done by different companies, and as more materials become available, and trainings, and workshops and whatnot, it's becoming a little bit more demystified, and people will no longer have this kind of weird laughter when you say Chaos Engineering to them with a straight face, so there's definitely that.

I think another thing, which you mentioned before: there seems to be a new startup doing observability every other day now, and it's been kind of interesting for a long time, but I think as these things are happening and as we mature the kind of observability and the eBPFs of the world too, and as we leverage all of that, it gets the observability to a point where it's easy to use, and there are tools that do that for you.

I think that at this point, the adoption can rise much quicker, because for a lot of people I speak to, the main roadblock that they mention is the horrible name, Chaos Engineering, but typically the second and the third ones are about a lack of maturity in their observability. If you don't have a really good way to verify that you didn't break anything, then you probably won't go breaking things, because what's the point, right? And the third thing is typically this training aspect, so that's what I'm kind of trying to address a little bit with the book.

So I'm hoping that this combination, this little cocktail of things, is going to get us sooner rather than later to a place where it's just a normal practice, and maybe that name doesn't stay. Maybe we should change the name from Chaos Engineering to resilience or something, so that people stop being confused.

Patrick Higgins: Yeah, so something like proactive failure mitigation. I don't know if that's a ...

Mikolaj Pawlikowski: And that sounds corporate ready.

Patrick Higgins: That sounds great. Yeah, but we can sell that to bosses, so that's all right.

Mikolaj Pawlikowski: Yep, right about that.

Patrick Higgins: Have you kind of found that, in introducing these processes, in terms of the people around you, you've seen less burnout potentially, or just better experiences for the people you work with?

Mikolaj Pawlikowski: Yeah, I mean, definitely the SRE people, they really react to the idea of being called less at night, and, if there's going to be an outage, of generating that outage during office hours, so there's definitely a good reception to that. I don't really have statistics on how well that actually works out, but just the idea of knowing that you're not just rolling the dice every time and you did your best to try to detect this is already, I think, from the therapeutic point of view, a good thing, but it also works well with the management.

They like hearing that this kind of outage, we've got it covered. That's not going to happen again, and we know for sure, because we tried the same problem and it actually doesn't cause anything anymore. So I don't have a degree in psychology, but my personal perception is that it does help.

What Miko Is Currently Excited About

Patrick Higgins: Yeah, that totally checks out. I can imagine knowing that something's not going to break the same way for the 10th time is going to reduce people's anxiety. Miko, do you have, in terms of doing some shameless self promotion, could you give us all the dates what's coming up for you? What's exciting? The name of your book, of course. All of the great things?

Mikolaj Pawlikowski: Sure, the book is called Chaos Engineering: Site Reliability Through Controlled Disruption. I know that's a mouthful, it's the third attempt at the subtitle that will hopefully work. It's a bit delayed. I think it's now supposed to hit Amazon for the physical copies in mid-January, so if you just go to manning.com, you can get an online copy now, or pre-order the physical copy.

Otherwise, if you'd like to stay in touch with me, I do run a small newsletter, chaosengineering.news. You can put in your email and hear specifically from me.

Otherwise, if you are looking for resources to start with Chaos Engineering, that's not a shameless plug, but I think one of the best resources is the awesome Chaos Engineering list on GitHub. There's plenty of different links, different things, and it's fairly up to date, so I tend to recommend this one.

And if you want to chat about that, or if you'd like me to give a presentation to your team or a talk, reach out on LinkedIn and I will figure something out.

Patrick Higgins: Awesome. Thanks Miko. That's great.
