Break Things on Purpose is a podcast for all things Chaos Engineering. Check out our latest episode below.

You can subscribe to Break Things on Purpose wherever you get your podcasts.

If you have feedback about the show, find us on Twitter at @BTOPpod or shoot us a note at podcast@gremlin.com!

In this episode of the Break Things on Purpose podcast, we speak with Jérôme Petazzoni, tinkerer extraordinaire and container technology educator.


Transcript

Jérôme Petazzoni: So I think at some point we were joking with some friends at the bar, and they were like, "Well, you don't have a middle name. You should Americanize yourself a little bit and have a middle name." And I thought, well, maybe I could try to have a middle name made of emojis. We picked the little whale, then a ship, and then the emoji with the little box. So it was like "Docker ships containers," and in every system after that, at least the non-critical ones, I would put Jérôme, then these three emojis as my middle name, and Petazzoni at the end. And of course, very often that wouldn't work. Not exactly crash, but something bad would happen.

Jason Yee: Welcome to the Break Things on Purpose podcast, a show about chaos and building more reliable systems. I'm Jason Yee. In this episode, Pat Higgins and I chat with Jérôme Petazzoni about designing for failure while doing SRE work in the very early days of Docker, and how you should approach reliability with your Kubernetes clusters.

Patrick Higgins: How are you doing Jérôme?

Jérôme Petazzoni: Hi, pretty good. Thanks.

Patrick Higgins: So Jérôme, what does your day to day tend to look like? What do you do for work?

Jérôme Petazzoni: My day to day in the past few years, basically since my departure from the band formerly known as Docker, has been training and consulting, surprisingly around the themes of, well, containers, Docker, and Kubernetes. The first year, in 2018, it was lots of Docker and a little bit of Kubernetes, and then in 2019, 2020, and now 2021, strangely the ratio has flipped around and it's mostly Kubernetes with a little bit of Docker as well.

Fun with Riak at dotCloud (Docker)

Patrick Higgins: That lines up with what we're seeing in the industry, a move towards a lot of Kubernetes; people are very bullish. When people come on the show, we like to ask them about a horror story they might've had in the past: how did that go for you, and what did you take forward in your career from it?

Jérôme Petazzoni: I've had a few horror stories, since I've been building infrastructure and doing things like that for quite a while. But thinking about it a little bit, I thought about an outage that we had at dotCloud, which is what Docker was called before it became this household brand. We were running a platform pretty similar to Heroku; the tagline of dotCloud back then was, "Hey, we're like Heroku, but we can run pretty much anything": Ruby, Python, PHP, Java, Node, and we have Mongo and MySQL and Postgres and Redis and Memcached and you name it. And of course we were doing that with containers, and you can guess where that led later on. As part of the platform we used Riak, which was this kind of distributed key-value store built in Erlang. It was extremely robust; this thing felt like it would just withstand anything you could throw at it. If we had been doing this a decade later, we probably would have gone for something like etcd or Consul, just to give you an idea of how confident we were about this thing.

So we were always super happy about Riak; it was really a pleasure to operate from an admin standpoint. Scaling, replacing dead nodes, etc., that was amazing. The folks at Basho, which was the name of that company, were absolute class acts. And of course, since we sometimes didn't know what we were doing, one day we, and by "we" I think I want to take the credit/blame here, I magically screwed up what I think was a node replacement, something like that. I don't remember the exact details, but it was maybe something like, "Oh, we need to replace these c1.whatever instances with maybe m2.something," a routine node replacement operation. And it turns out that I had no idea how Riak replication worked. Well, I thought I knew, but really I didn't.

And so what I ended up doing, instead of replacing the nodes, is that I created a second cluster, completely new and empty, and then replaced my current cluster with that new, completely empty cluster. Of course, I didn't realize I was doing that as I was doing it; I was restarting nodes and running commands and thinking everything was kind of moving along. And the way Riak operated, as far as I can remember, is this kind of hashing thing where your keys and values are distributed over the nodes, and there is some replication involved, which means that when you take a node down, everything keeps chugging along, everything is fine.

It's not like etcd, where when you start losing nodes you have alarms blaring all around, and if you lose your quorum, it stops immediately and you know you just stepped into another circle of hell. With Riak, you could take nodes down and at first nothing happens. And at some point, when you send queries, you don't get all the results. In that way, I think it's a little bit similar to what you can see with Elasticsearch. I remember some demos I do sometimes with Elasticsearch on Kubernetes where I show folks, "Hey, this is what happens when you scale down your Elasticsearch cluster a little bit too excitedly," and we have Kibana on the side, and when you reload in Kibana, at some point you see a bunch of your logs just vanishing. So some of that can happen.

So in that case, I was replacing the nodes one by one, and at some point basically everything disappeared. And that was kind of a problem because, since we were putting so much trust in Riak, that's where we would put all the important stuff, the crown jewels if you will. Without diving too much into the details, we had a mapping between the ports that we had on the platform, the front NAT ports, and the containers behind them. If we imagine dotCloud as a little bit like a giant Kubernetes cluster, this would be like the table of the NodePorts, saying that port 30001, 30002, 30003 maps to this container, and so on.
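
In Kubernetes terms, the kind of mapping he's describing looks roughly like a NodePort Service. A minimal, hypothetical manifest is below; the names and port numbers are purely illustrative, not anything from dotCloud.

```yaml
# Hypothetical NodePort Service: the cluster keeps a table saying
# "port 30001 on every node forwards to the pods selected below".
# Losing that table is roughly what losing the dotCloud port mapping meant.
apiVersion: v1
kind: Service
metadata:
  name: customer-app
spec:
  type: NodePort
  selector:
    app: customer-app
  ports:
    - port: 80          # ClusterIP port inside the cluster
      targetPort: 8080  # container port
      nodePort: 30001   # externally reachable port on every node
```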

So we basically lost that, and the long-term consequence would be that every single customer who has a port number in a configuration file somewhere is going to be magically screwed over by this. Now, I say long-term consequence because Riak is pretty awesome in terms of reliability, but we had read that it was not a good idea to hit it directly. It's not really designed to withstand a lot of queries per second; well, it could, but not in the way we were using it. So we had a bunch of caches in front of it using Redis, and the data was still in the Redis caches, and we knew that everything would stay fine until the caches refreshed, and then basically we'd be dead.

So we basically had two options at this point. One is, I still have the data in the cache, so maybe I can extract that data and try to re-inject it into Riak, which sounds like a perfectly reasonable, straightforward thing to do, except when you start thinking about the details. I need to extract the data and then somehow build a bunch of queries. Do I want to inject the data directly into Riak, or do I want to go through the little microservice that we had in front of it? What load can that take? You do some tests and you see that the little microservice is way too slow for that, for reasons I can't remember, but you do the math and you're like, well, at that speed it's going to take a few days before I can reload everything, so that's not going to fly.

And then you're like, I guess I need to send my queries directly to Riak. And then you're like, well, I don't know if I want to learn about that right now. I don't know if it was a Friday, I'm not sure about that, but it was definitely past 5:00 or 6:00 PM. So you're also factoring in: my energy levels are going down, how much coffee can I put into my system to compensate for that? How long before customers notice? Because things were still operating, but in a read-only mode, so to speak; if a customer were to deploy a new service, that service would not be added to the NAT table and everything. So eventually, after lots of trial and error, what we did is realize that we could take a pretty fresh backup of the Riak data directory, unpack that on top of one of the Riak nodes, and let it rebuild everything from there.

If I remember correctly, as part of my big cluster folding, at some point we had a single-node Riak, and on that single node we had all the data. So we just ended up taking that tarball and unpacking it everywhere, and miraculously it worked. Everything fell back into place, and I remember it took us a pretty long time, something between half an hour and an hour, to check that everything was really working, because it felt almost too good to be true. It's like, oh, you mean I just take this tarball that I kind of baked on the side with that half of the cluster, I unpack it here, and it works? And yeah, it did.

So the takeaways here: one big takeaway is, maybe do more backups before doing that kind of node upgrade. We had some backups, but I don't remember why we had to use that older one instead; something was wrong with it. Of course, it sounds like we should have done a dry run or some kind of trial, and we did, in a smaller test cluster, and we never realized that anything went wrong. At the end, you see that the cluster is up and there appears to be some data in it, because the thing feeding the cluster is already filling it with some data. And since we didn't have the whole dataset in that cluster, we didn't realize all the data was gone. It was like, there are about a dozen items in the cluster, that seems reasonable; we had 10 items before doing the update and 10 items after, that seems reasonable, without realizing that they were 10 completely different items.

So honestly, if I think about what I would do differently today, I'm kind of afraid and embarrassed to admit that I would probably do things similarly in a way, because what saved us that day is that the design of the whole thing was extremely defensive. There is Riak, and there are some caches, and there is... when we built that platform with Solomon Hykes, Sébastien Pahl, and Sam Alba back then, that was the early dotCloud team, I don't want to speak for them, but I feel like we had this shared awareness that we were a tiny team trying to build something extremely ambitious, and we had to be extremely careful in our design decisions.

We knew that we would be shooting ourselves in the foot on a daily basis, so we had to make sure that this wouldn't have disastrous effects. So we had a lot of almost defense in depth, that kind of caching thing. We had many ways to rebuild all these layers of data; all these Redis caches, for instance, we had easy ways to rebuild them because we knew they could just blow up. We had so many things like that because we knew we would get things wrong. So that helped us a lot, I think.

I think the main lesson here is that I would keep having this extremely, I don't want to say pessimistic way of thinking, but trying to envision: we are going to screw this up at some point, how are we going to recover? And I don't know how we are going to screw up, of course, because when you know what can go wrong, it's easy to mitigate. But have the backup plan, and the backup backup plan: how do I rebuild this critical datastore, and what happens if I realize that the backup is no good, and things like that. That way of thinking really helped us. It didn't save us from all the problems and outages and embarrassments, but it definitely saved us a bunch of times.

Jason Yee: I think that's such a good perspective, because a lot of times, as we think of reliability, it's based on that idea of: what are the ways that I think my application or my service can fail, and how do I mitigate that? And there is this notion that, with complex systems, you never know exactly how something's going to fail. It's always a collusion of crazy things that work together in just the right way to cause us to go down, and so you can't predict, or you can't always predict, what that will look like. So having this idea of, let's not even really try to focus on that; let's focus on the fact that things will go down, bad things will happen, how do we recover from that?

Jérôme Petazzoni: Absolutely. It's another thing I try to keep in mind when trying to design the perfect system in whatever area: with time, I tend to value more something that will fail often but recover gracefully, rather than something that will almost never fail, but when it does, you're just out of luck, dead in the water.

Patrick Higgins: It's really interesting that you note you're not taking that defensive approach just with the software you're building; you're also looking for it in the software you choose to integrate into your system. I think that's a really interesting perspective.

Reliability and the lasagna of Kubernetes

Jason Yee: You've been doing a lot of training on Kubernetes, and I feel like people have the same view of Kubernetes that you had of Riak: Kubernetes is reliable, so I don't have to think about it. So I'm curious, if you take the lessons from this incident and apply them to Kubernetes, as someone who trains people on how to use it, what would you suggest people do to make their Kubernetes clusters not only more reliable, but also able to respond to the failures they will inevitably see?

Jérôme Petazzoni: That's a good one. I'm sure many folks hearing that would be like, wait a minute, Kubernetes being super stable? I've been running Kubernetes a little bit and I've seen some different things. But there is some of that, manifesting a little bit differently. For instance, when I talk about networking in Kubernetes, I talk about this kind of "lasagna" thing, in the sense that you have the communication between pods, which is what we usually call CNI; then you have the communication with Services, ClusterIPs, and NodePorts, which is usually where we have kube-proxy; and then you have everything about filtering, which is network policies.
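
As a generic illustration of that filtering layer (a sketch, not something from the episode): a network policy is declared the same way regardless of which CNI plugin ends up enforcing it, which is part of what makes the layers swappable. All names below are made up.

```yaml
# Minimal NetworkPolicy: only pods labeled role=frontend may reach pods
# labeled app=api on TCP 8080. The API object is identical whether Cilium,
# Calico, or another plugin does the enforcement.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              role: frontend
      ports:
        - protocol: TCP
          port: 8080
```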

And one thing I find absolutely mind-blowing, and which for me commands respect for the folks who designed the whole thing, is that you can replace each of these lasagna layers, if I can call them that, without everything else falling apart. You could have Cilium for pod-to-pod communication and stick with kube-proxy for service communication, and then maybe Cilium again, or maybe something else completely different, for firewalling. The fact that you can switch these things over and everything still works, that's pretty amazing. You even have folks switching their CNI live, while clusters are running, with minimal or even no downtime. And I think that's great. So for me, with Kubernetes, the thing I really admire is the design. Then if I think about reliability, the difference is that Riak was just one component, and you replicate this component two, three, four times; great. With Kubernetes, as we know, you have etcd, you have the API Server, you have the Controller Manager, and you have all these things, and they do lots of stuff.

Even if we just talk about certificate renewal on the control plane, we're zooming in on a very small subspace of the problem. I took great joy in listening to Laurent Bernaille from Datadog. He speaks about how these big Kubernetes clusters operate, and he draws the difference between "this is the textbook replicated cluster that you see in the Kubernetes documentation" and "this is how we do things, because we actually have clusters with thousands of nodes." I'm also extremely grateful that he's up there doing these presentations and talking truthfully about how things are in the field. That's really amazing. And to me, the takeaway is: the design is super solid, it's great. But if you want everything to be as reliable as you want to think it is, all these layers, whether in my lasagna network metaphor or generally speaking, need to individually be rock solid.

So your etcd cluster needs to be up to spec, and you probably need to split the events off to a separate etcd cluster, because that thing is going to get really busy on a big cluster (a sketch of that split follows below). Then you need to scale different components differently, because some are active-passive failover, and others, like the API Server, are things you want to load balance. And that's where it gets a little bit tricky.
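
One common way to do the event split he mentions is the kube-apiserver's --etcd-servers-overrides flag, which routes a specific resource to its own etcd cluster. This is only a sketch: it assumes a dedicated etcd-events cluster already exists, and the endpoint names are made up.

```bash
# Sketch: send the chatty core-group "events" resource to its own etcd
# cluster while everything else stays in the main one. These flags are added
# to the existing kube-apiserver invocation (e.g. its static pod manifest);
# all other flags stay as they are.
kube-apiserver \
  --etcd-servers=https://etcd-main-0:2379,https://etcd-main-1:2379,https://etcd-main-2:2379 \
  --etcd-servers-overrides="/events#https://etcd-events-0:2379;https://etcd-events-1:2379;https://etcd-events-2:2379"
```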

But in parallel to that, the thing I really try to convey to folks learning Kubernetes with me is this whole idea that you cannot have 100% uptime on a system that becomes complex enough. It's easy to compute the uptime of your monolithic application: you send a Pingdom or whatever request every minute, count how many 200 OK versus anything else you got, do the ratio, and there we go, 99 point something, great. But how do we compute the uptime of a whole data center? Am I checking how long the light has been on in the little booth with the security person at the entrance of the data center, or am I checking that every single one of the thousands of servers in that data center is up, running, and reachable on the network? That becomes really hard. And that's what I try to convey: on a Kubernetes cluster, the uptime is going to be something more abstract. Depending on how you count it, maybe you can count it as 100%, because you could have many components that are actually, literally on fire, and your apps are still chugging along, everything is fine.

The best example I give for that: if you're running email, and you have a bunch of email clients regularly checking mail and sending mail, your email servers could be down half the time and nobody would notice. I mean, when someone sends an email, they might think, it took a couple of minutes for my email to get there, but guess what, that happens all the time anyway, so that's fine.

On the other hand, you could also consider that each time one pod somewhere on one node has the slightest problem, you could say, "Oh, my cluster is down," so to speak, or rather that it's not 100% up. But if you count your uptime like that, then your uptime is going to be basically 0%, because there is always something wrong in the realm of Kubernetes. So that was a really long tirade to say that we need to design our apps to withstand that. Generally it amounts to: if something fails, just retry and retry and retry and retry forever, and look at how Kubernetes itself does it.

When I want to introduce that, I show folks: hey, let's do a rolling update on this deployment. We have version 1, let's put version 2. Yay, something happens. Now put version 3, and the kicker is that version 3 actually doesn't exist. So you see your deployment start to update, and then it stops dead in its tracks because the pods don't come up, since the image doesn't exist. And we illustrate that Kubernetes is going to try again and again and again; Rick Astley-style, it's never going to let you down, it's going to try again and again until you tell it to stop or until you push that image. Otherwise, you could leave that cluster alone for six months, come back and push the image, and within a few minutes it would pick up that image and run it. And that's how we should build our applications. If we do that, then we can have some pretty solid confidence that they will run fine on a Cloud Native platform, so to speak, and of course that's easier said than done.
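
A rough reconstruction of that kind of demo, assuming a generic cluster; the image names and tags are made up, and this isn't the exact sequence from his workshops:

```bash
# Deploy v1, update to v2 (works), then to a v3 tag that doesn't exist.
kubectl create deployment web --image=ghcr.io/example/web:v1
kubectl scale deployment/web --replicas=3
kubectl set image deployment/web web=ghcr.io/example/web:v2   # rolls out fine
kubectl set image deployment/web web=ghcr.io/example/web:v3   # tag doesn't exist

# The rollout stalls instead of failing outright: new pods sit in
# ErrImagePull / ImagePullBackOff while the old ReplicaSet keeps serving
# traffic, and Kubernetes retries the image pull indefinitely.
kubectl rollout status deployment/web
kubectl get pods -l app=web

# Either roll back...
kubectl rollout undo deployment/web
# ...or push the missing v3 image, and the stalled rollout completes on its own.
```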

Enabling work that matters

Patrick Higgins: So, Jérôme, just to wrap things up, I'd like to ask you about the things that you're excited about right now in the industry. What are the things you're really looking forward to?

Jérôme Petazzoni: In our industry, in IT and so on, I would say: just wing it, get through this, and let's try to help the folks who are working on the really important stuff to be efficient at what they do. Last year, in 2020, I was extremely lucky to see my training business mostly unaffected by the pandemic. And that's a huge amount of luck and privilege and all these things. There are so many things that are now moving online, and so many things that need to be scaled like crazy, and containers and Kubernetes and all these things are pretty good at that. So people want to learn them, and that's why they come to me and many other folks for help. That's why I was doing pretty well last year.

But really, to me, my end goal in this is: if I can help the folks working on online education, not container education, but real education for kids and so on, folks working on medical systems and data analysis and protein folding and all these things; if with what I do I can help these folks do what they do better, then I consider that a good day, basically. And of course not all the folks that I train are doing things like that, but I hope at least some of them are, and that's what I'm excited about. It's helping these folks save the rest of us.

Patrick Higgins: That's some great perspective.

Jason Yee: If folks wanted to actually get training because they are doing stuff like that, but they need to scale it, what's the best way to reach out to you and get some training?

Jérôme Petazzoni: Well, I have a very basic website at container.training, and there's a form and some contact info there, and then I'm happy to find ways to help folks. I've been doing online training since March of last year, so for almost a year now.

Patrick Higgins: Awesome. Thanks very much for coming on the show Jérôme. I really appreciate it.

Jérôme Petazzoni: Thanks.

Jason Yee: For links to all the information mentioned, visit our website at gremlin.com/podcast. If you liked this episode, subscribe to the Break Things On Purpose podcast on Apple Podcasts, Spotify or wherever you listen to your favorite podcasts. Music from this episode includes Battle of Pogs by Komiku and Never Gonna Give You Up by Rick Astley.
