Podcast: Break Things on Purpose | Itiel Shwartz, CTO and Co-founder of Komodor
Welcome back to another iteration of “Build Things on Purpose” where we talk with developers and engineers who are creating tools that help us build more reliable systems. Today Itiel Shwartz, CTO and Co-Founder of Komodor, has joined us to chat about what they’re doing to help tame the chaos of Kubernetes. Itiel talks about Komodor’s goal of making troubleshooting Kubernetes not only easy, but potentially even fun! Itiel breaks down how Komodor is making this a possibility, his own troubleshooting background, some useful troubleshooting tips and more!
In this episode, we cover:
- 00:00:00 - Introduction
- 00:05:00 - Itzel’s Background in Engineering
- 00:08:25 - Improving Kubernetes Troubleshooting
- 00:11:45 - Improving Team Collaboration
- 00:14:00 - Outro
Jason: Welcome back to another episode of Build Things On Purpose, a part of the Break Things On Purpose podcast where we talk with people who have built really cool software or systems. Today with us, we have Itiel Shwartz who is the CTO of a company called Komodor. Welcome to the show.
Itiel: Thanks, happy to be here.
Jason: If I go to Komodor’s website it really talks about debugging Kubernetes, and as many of our listeners know Kubernetes and complex systems are a difficult thing. Talk to me a little bit more—tell me what Komodor is. What does it do for us?
Itiel: Sure. So, I don’t think I need to tell our listeners—your listeners that Kubernetes looks cool, it’s very easy to get started, but once you’re into it and you have a big company with complex, like, micros—it doesn’t have to be big, even, like, medium-size complex system company where you’re starting to hit a couple of walls or, like, issues when trying to troubleshoot Kubernetes.
And that usually is due to the nature of Kubernetes which makes making complex systems very easy. Meaning you can deploy in multiple microservices, multiple dependencies, and everything looks like a very simple YAML file. But in the end of the day, when you have an issue, when one of the pods is starting to restart and you try to figure out, like, why the hell is my application is not running as it should have, you need to use a lot of different tools, methodologies, knowledge that most people don’t really have in order to solve the issue. So, Komodor focus on making the troubleshooting in Kubernetes an easy and maybe—may I dare say even fun experience by harnessing our knowledge in Kubernetes and align our users to get that digest view of the world.
And so usually when you speak about troubleshooting, the first thing that come to mind is issues are caused due to changes. And the change might be deploying Kubernetes, it can be a [configurment 00:02:50] that changed, a secret that changed, or even some feature flag, or, like, LaunchDarkly feature that was just turned on and off. So, what Komodor does is we track and we collect all of the changes that happen across your entire system, and we put, like, for each one of your services a [unintelligible 00:03:06] that includes how did the service change over time and how did it behave? I mean, was it healthy? Was it unhealthy? Why wasn’t it healthy?
So, by collecting the data from all across your system, plus we are sit on top of Kubernetes so we know the state of each one of the pods running in your application, we give our users the ability to understand how did the system behave, and once they have an issue we allow them to understand what changes might have caused this. So, instead of bringing down dozens of different tools, trying to build your own mental picture of how the world looks like, you just go into Komodor and see everything in one place.
I would say that even more than that, once you have an issue, we try to give our best efforts on helping to understand why did it happen. We know Kubernetes, we saw a lot of issues in Kubernetes. We don’t try complex AI solution or something like that, but using our very deep knowledge of Kubernetes, we give our users, FYI, your pods that are unhealthy, but the node that they are running on just got restarted or is having this pressure.
So, maybe they could look at the node. Like, don’t drill down into the pods logs, but instead, go look at the nodes. You just upgraded your Kubernetes version or things like that. So, basically we give you everything you need in order to troubleshoot an issue in Kubernetes, and we give it to you in a very nice and informative way. So, our user just spend less time troubleshooting and more time developing features.
Jason: That sounds really extremely useful, at least from my experience, in operating things on Kubernetes. I’m guessing that this all stemmed from your own experience. You’re not typically a business guy, you’re an engineer. And so it sounds like you were maybe scratching your own itch. Tell us a little bit more about your history and experience with this?
Itiel: I started computer science, I started working for eBay and I was there in the infrastructure team. From there I joined two Israeli startup and—I learned that the thing that I really liked or do quite well is to troubleshoot issues. I was in a very, very, like, production-downtime-sensitive systems. A system when the system is down, it just cost the business a lot of money.
So, in these kinds of systems, you try to respond really fast through the incidents, and you spend a lot of time monitoring the system so once an issue occur you can fix it as soon as possible. So, I developed a lot of internal tools. For the companies I worked for that did something very similar, allow you once you have an issue to understand the root cause, or at least to get a better understanding of how the world looks like in those companies.
And we started Komodor because I also try to give advice to people. I really like Kubernetes. I liked it, like, a couple of years ago before it was that cool, and people just consult with me. And I saw the lack of knowledge and the lack of skills that most people that are running Kubernetes have, and I saw, like—I’d have to say it’s like giving, like, a baby a gun.
So, giving an operation person that doesn’t really understand Kubernetes tell him, “Yeah, you can deploy everything and everything is a very simple YAML. You want a load balancer, it’s easy. You want, like, a persistent storage, it’s easy. Just install like—Helm install Postgres or something like that.” I installed quite a lot of, like, Helm-like recipes, GA, highly available. But things are not really highly available most of the time.
So, it’s definitely scratching my own itch. And my partner, Ben, is also a technical guy. He was in Google where they have a lot of Kubernetes experience. So, together both of us felt the pain. We saw that as more and more companies moved to Kubernetes, the pain became just stronger. And as the shift-left movement is also like taking off and we see more and more dev people that are not necessarily that technical that are expected to solve issues, then again we saw an issue.
So, what we see is companies moving to Kubernetes and they don’t have the skills or knowledge to troubleshoot Kubernetes. And then they tell their developers, “You are now responsible for the production. You are deploying? You should troubleshoot,” and the developers really don’t know what to do. And we came to those companies and basically it makes everything a lot easier.
You have any issue in Kubernetes? No issue, like, no issue. And no problem go to Komodor and understand what is the probable root cause. See what’s the status? Like, when did it change? When was it last restarted? When was it unhealthy before today? Maybe, like, an hour ago, maybe a month ago. So, Komodor just gives you all of this information in a very informative way.
Jason: I like the idea of pulling everything into one place, but I think that obviously begs the question: if we’re pulling in this information we need to have good information to begin with. I’m interested in your thoughts of if someone were to use Komodor or just want to improve their visibility into troubleshooting Kubernetes, what are some tips or advice that you’d have for them in maybe how to set up their monitoring, or how to tag their changes, things like that? What does that look like?
Itiel: I will say the first thing is using more metadata and tagging capabilities across the board. It can be on top of the monitors, the system, the services, like, you name it, you should do it. Once an alert is triggered, you don’t necessarily have to go to the perfect playbook because it doesn’t really exist. You should understand what’s the relevant impact, what system it impacted, and who is the owner, and who should you wake up, like, now or who should look at it?
So, spending the time tagging some of the alerts and resources in Kubernetes is super valuable. It’s not that hard, but by doing so you just reduced the mental capacity needed in order to troubleshoot an issue. More than that, here in Komodor we read of this metadata label stacks, and we harness it for our own benefits. So, it is best practice to do so and Komodor also utilize this data.
And for example, for an alert, say like, the relevant team name that is responsible, and for each service in Kubernetes write the team that owns this service. And this way you can basically understand what teams are responsible for what services or issues. So, this is the number one tip or trick. And the second one is just spend time on exposing these data. You can use Komodor I think, like, it’s the best solution, but even if not, try to have those notification every time something change.
Write those, like, web hooks to which one of your resources and let the team know that things change. If not, like, what we see in companies is something break, no one really know what changed, and in the end of the day they are forced to go into Slack and doing, like, here—someone changed something that might cause production break. And if so, please fix it. It’s not a good place to be. If you see yourself asking questions over Slack, you have an issue with the system monitoring and observability.
Jason: That’s a great point because I feel like a lot of times we do that. And so you look back into your CI/CD logs, like, what pushes are made, what deploys are made. You’re trying to parse out, like, which one was it? Especially in a high-velocity organization of multiple changes and which one actually did that breaking.
Itiel: We see it across the board. There are so many changes, so many dependencies. Because microservice A talks with microservice B that speak with microservice C using SQS or something like that. And then things break and no one know what is really happening. Especially the developers, they have no idea what is happening. But most of the time also the DevOps themselves.
Jason: I think that’s a great point of, sort of, that shared confusion. As we’ve talked about DevOps and that breaking down of the walls between developers and operations, there was always this, “Well, you should work together,” and there is this notion now of we’re working together but nobody knows what’s going on.
As we talk about this world of sharing, what are some of your advice as somebody who’s helped both developers and operations? Aside from getting that shared visibility for troubleshooting, do you have any tips for collaborating better to understand as a team how things are functioning?
Itiel: I have a couple of thoughts on this area. The first thing is you must have the alignment. Both the DevOps, or operation and the developers need to understand they are in this together. And this, like, base point in other organization you see they struggle. Like, the developers are like, yeah, I don’t really need—like, it’s the ops problem if production is down, and the ops are, like, angry at the devs and say they don’t understand anything so they shouldn’t be responsible for issues in production.
So, first of all, let’s create the alignment. The organization needs to understand that both the dev and the ops team need to take shared responsibility over the system and over the troubleshooting process. Once this very key pillar is out of the way, I will say that adding more and more tools and making sure that those tools can be shared between the ops and the dev team.
Because a lot of the times we see tools that are designed for the DevOps, and a developer don’t really understand what is happening here, what are those numbers, and basically how to use them. So, I think making sure the tools fit both personas is a very crucial thing. And the last thing is learning from past incidents. You are going to have other incidents, other issues. The question is, do you understand how we improve the next time this incident or a similar incident will happen? What processes and what tools are missing in the link between the DevOps and the system to optimize it. Because it’s not after you snap your finger and everything works as expected.
It is an iterative process and you must have, like, the state of mind of, okay, things are going to get better, or they are going to get better, and so on. So, I think this is the third, like, three most important things. One make sure you have that alignment, two, create tools that can be shared across different teams, and three, learn from past incidents and understand this is like a marathon. It’s not a sprint.
Jason: Those are excellent tips. So, for our listeners, if you would like a tool that can be shared between devs and DevOps or ops teams, and you’re interested in Komodor—Itiel, tell us where folks can find more info about Komodor and learn more about how to troubleshoot Kubernetes.
Itiel: So, you can find us on Twitter, but basically on komodor.com. Yeah, you can sign up for a free trial. The installation is, like, 10 seconds or something like that. It’s basically Helm install, and it really works. We just finished, like, a very big round, so we are growing really fast and we have more and more customers. So, we’ll be happy to hear your use case and to see how we can accommodate your needs.
Jason: Awesome. Well, thanks for being on the show. It’s been a pleasure to have you.
Itiel: Thank you. Thank you. It was super fun being here.
Jason: For links to all the information mentioned, visit our website at gremlin.com/podcast. If you liked this episode, subscribe to the Break Things on Purpose podcast on Spotify, Apple Podcasts, or your favorite podcast platform. Our theme song is called Battle of Pogs by Komiku and is available on loyaltyfreakmusic.com.
Gremlin's automated reliability platform empowers you to find and fix availability risks before they impact your users. Start finding hidden risks in your systems with a free 30 day trial.sTART YOUR TRIAL
Introducing Custom Reliability Test Suites, Scoring and Dashboards
Last year, we released Reliability Management, a combination of pre-built reliability tests and scoring to give you a consistent way to define, test, and measure progress toward reliability standards across your organization.
Last year, we released Reliability Management, a combination of pre-built reliability tests and scoring to give you a consistent way to define, test, and measure progress toward reliability standards across your organization.Read more
Treat reliability risks like security vulnerabilities by scanning and testing for them
Finding, prioritizing, and mitigating security vulnerabilities is an essential part of running software. We’ve all recognized that vulnerabilities exist and that new ones are introduced on a regular basis, so we make sure that we check for and remediate them on a regular basis.
Finding, prioritizing, and mitigating security vulnerabilities is an essential part of running software. We’ve all recognized that vulnerabilities exist and that new ones are introduced on a regular basis, so we make sure that we check for and remediate them on a regular basis.Read more