Podcast: Break Things on Purpose | Zack Butcher, Founding Engineer at Tetrate
Welcome back to another edition of “Build Things on Purpose.” This time Jason is joined by Zack Butcher, a founding engineer at Tetrate. They also break down Istio’s ins and outs and the lessons learned there, the role of open source projects and their reception, and more. Tune in to this episode and others for all things chaos engineering!
Jason: Welcome to Break Things on Purpose, a podcast about chaos engineering and building reliable systems.
Jason: Everyone, welcome back to the Build Things on Purpose episodes for our podcast, Break Things on Purpose. With us today, I’ve got Zack Butcher. Zack is a founding engineer at Tetrate. But I’ll let him go ahead and introduce himself and tell us more about who he is and what he does.
Zack: Thanks for having me, first off. And like Jason said, I’m Zack, one of the founding engineers at Tetrate, which is a company that is kind of taking service mesh to enterprise. Before then, I was one of the original engineers over at Google that helped create Istio. And I know we want to talk about where Istio came from, and then some of why it came about.
It’s pretty interesting because I think the origins are maybe a little bit different than a lot of people maybe assume, based on the architecture and different things. Istio actually came about to address, primarily first, API-serving problems and it kind of grew after that. And so in particular, when we were at Google, we had one system that served all public-facing traffic. So, if you want to call a Google API, doesn’t matter if it’s a cloud API, if it was Maps, or Translate, any API, to get in there, you had to go through the front door. And that was a monolithic API gateway.
It was painful. It was painful to run, we had all kinds of reliability problems because it was this shared proxy. It was a single set of instances; obviously, it was horizontally scaled, but it was a single set of instances driven by a single set of configurations. It was multi-tenant. And so we had things like shared outages. “Oops, I fat-fingered a config and I just took out Maps.” [laugh]. Not a good thing. Or, “I just took out Gmail [laugh]. Oops.”
I didn’t personally take out either of those, but there certainly have been fat-fingers in the system where that has happened. And that was even after the system was pretty mature. And I’m sure a lot of your viewers know, in mature systems, configuration change is the thing that causes the most outages. Once you reach a certain level of maturity where you have enough testing in place to vet binary changes, where you have good rollout procedures in place to be able to do things like canarying and all, then what really starts to get you there are configuration changes. And so that was one set of problems.
Another set of problems was noisy neighbors. You know, you’re running on the same computer; it’s a multi-tenant web server; at the end of the day, there’s only so much multi-tenancy that we can build in there. You do compete for bandwidth, for CPU utilization, for other things. And so you can get noisy neighbor problems as well. Then the third one we had was cost accounting.
We have this big monolithic thing, it’s serving all the traffic, but how much does it cost to serve traffic for Maps, versus how much does it cost to serve traffic for cloud? And those economies become pretty important. And then there was additionally a locality problem. It was really hard because of the particular load-balancing setups that existed to get good cache coherency. Hey, it turns out with API-serving, caching is really, really good, really, really effective, because you have high temporal locality; when somebody comes in to interact with an API, they tend to do it a lot.
And so caching tends to be an effective tool, but we couldn’t cache effectively because of some of the load-balancing layers there. And so that’s what prompted this idea to take the sidecar pattern, and the sidecar pattern was not novel to proxies, of course. I’m sure you’ve had listeners here that have used sidecars for things like log rotation, aggregation, and that for years. And so we took that pattern and applied it to the proxy. And the initial sidecar that was delivered inside of Google, and actually, every single job in Borg used, was effectively an API gateway, a little mini API gateway.
I wanted to just, kind of, leak this out to the people so you make the connection, zero trust architectures or BeyondCorp is all in vogue now. And that idea that any given service is its own API gateway, that it has that kind of hard edge around it, is pretty important for being able to realize that. So, that was the set of constraints that started us towards the sidecar architecture. And we saw that worked really, really well. And we built out the architectural components as well, a mixer component, no longer with us anymore, and some other things.
And we built out that architecture inside of Google, and we found that it was really, really effective at doing what we had wanted to do, which is making caching work well, giving us high locality, stopping shared fate outages because of misconfigurations; instead, you’re only going to cause an outage for yourself. And we got that locality and cache coherency that we really needed to make the system performant at runtime. Out of the success of that, we started to look around and we said, “Hey, this is actually a massive problem out in the community.” Especially as, you know, I was at GCP at the time, we were looking going, “How do we get people in the cloud?” This idea that networking was one of the hardest pieces.
So, Kubernetes was created and helped solve this compute problem; storage is a relatively solved problem overall; I have good answers, good tools in the toolbox for whatever my storage needs are, pretty good tools in the toolbox for whatever my compute needs are, but networking has been very lacking. It’s been this hugely fragmented space. And so that’s where we said, and this was in early 2016, we should take those ideas, this mesh idea, and bring it out into open-source to try and start to build that networking substrate, so that we can start to make it more like compute or storage, make it easier for application developers to really start to get to work. And in that process, we actually went and reached out to IBM, who were starting to run a very similar project called Amalgam8. And Amalgam8 was really all about traffic flow and traffic routing.
And so Google had come at it from this API-serving perspective, and some of the security perspective was very important for us on the policy side, and we partnered up with IBM at the outset who were very interested in the traffic flow side. And we both agreed on this phenomenal technology that had just been open-sourced and actually, wasn’t even open-source yet, but we had an early preview from Matt Klein, of Envoy, and we said, “This thing can power the traffic side, it can power the policy side; we think this is going to be awesome.” And so we set out to start to build Istio with a little bit larger charter than what the mesh had had inside of Google because, you know, inside of Google, it was primarily oriented around API-serving and we knew that we wanted to tackle a little bit bigger space in open-source. And so we did. And actually, I know that this is going to air a lot later, but we happen to be filming this in the month of Istio’s fourth anniversary; in just another two weeks, Istio will be at the fourth anniversary of Istio 0.1, the very initial release, which is exciting. We’ve come a long way in that time.
Lessons from Istio
Jason: Yeah, you’ve come a huge way in that time. I mean, you mentioned things like having the components in Istio, right? As someone who’s had an inside look—so for listeners, Istio used to be very componentized, and I think we all think of that as the natural way that you build applications is you break things down into components so that you can not block teams, that everyone can run independently, and you maintain that developer velocity. And so when you make small changes, it doesn’t have this huge blocking effect. And yet Istio moved reverse of that, right? Talk to me a little bit more about that.
Zack: This is a fascinating question, I think, because it really actually gets into the heart of the complexity that even application developers have to grapple with, which was, like you said, we built Istio as a set of microservices from the get-go, in part because A) we always intended to run Istio with Istio. So, we always intended to have the Istio components deployed on Istio. And we actually did for a very long time, and even the initial 0.1 release, I believe we did, in a very limited fashion. The problem was that it became such a special case.
The Istio bootstrap configs for the Istio components themselves were this special snowflake that was different than anything else, just because of bootstrapping problems and a variety of different things, so we weren’t able to effectively use Istio to manage Istio. And so then with that problem, you know, I don’t have Istio to solve my microservice problem; I’m back exactly where all of our app developers are, which is, “How do I handle reliable communication? How do I do observability and introspection? How do I mandate security, and policies, and all of that?” And so that’s complicated and hard and expensive.
The second thing that we saw is that, operationally, it was challenging to actually run Istio, for a variety of different reasons: it was an earlier project and not as productionized; we didn’t do a good job of explaining which pieces needed to be scaled, when, for what reasons, and similar. And so effectively what we decided is, look, in this insanity that is the microservice-oriented architecture, with all the craziness going around, somewhere there has to be a bottom. And so we said, “Look, we’ll make Istio the bottom.” I mean, I use bottom in the sense of, like, programming languages, of Haskell; it’s a base type. So, we need something to build everything else off.
And that something needs to be easy to operate because it’s going to be the thing that allows you to operate everything else. And hey, it turns out that we have really good practices and patterns for deploying and operating a monolith. This is actually one of those fundamental tensions in computer science. It feels like we repeat everything in computer science on a pretty regular cadence. It’s actually not quite the case.
Instead, we have these cycles of creative destruction where we pull things apart so that we can iterate with them independently, and then once we do some learning, we hopefully then reassemble the system into a better whole. And this is kind of this pull and push that the industry goes over on these five, ten-year cycles. And we abbreviated that a little bit with Istio, right, where we said, “Look, this is [unintelligible 00:09:53], we see the problems that people have operationally. Our whole point is to be simple to operate. We have good principles for running and operating monoliths. Let’s deliver Istio as a monolith so that Istio itself is far easier and simpler to run, and that will enable others to be able to run their own systems more reliably.”
But we do pay some of that cost: it is a monolithic binary release. There are trade-offs there for the Istio developers, but on the whole for the world, it’s a lot better for the Istio developers to foot that cost and for the operation of the mesh across all these different sites to be easier, than it is to make it easier for us and harder for everybody that actually wants to operate it.
Jason: That’s such an interesting thought of the trade-offs, especially as an open-source project. I think it’s a natural thing that we think about when we have a commercial company offering a product, and you’re like, “Obviously I’m going to be customer-oriented.” As open-source, I think, too, naturally, we build open-source projects for the people that are involved in contributing. We don’t always quite think about our end-users and making their lives easier.
Zack: Exactly. My takeaway leaving Google—and it’s been about three years now—was that actually, that was maybe Istio’s biggest sin. We basically went to a cave and we built this really, really cool tool. And then we came out of the cave, and we handed it to people and they cut themselves on it because it was sharp; we didn’t have users that were wearing down the sharp edges as we were building and developing it. That didn’t happen until, unfortunately for the project, later in the project’s lifecycle.
That’s where Istio gets the reputation for being very hard to use, hard to adopt, because it was justified a while ago. For maybe two years, it was very challenging because we didn’t have end-users in mind. And over the last two years, as part of steering (and I didn’t mention it in my intro, but I actually do sit on Istio’s steering committee), we have made a really conscious effort to really focus on the day-two operations of Istio itself, to focus on our end-users and really optimize the system so that they can run with it. Because fundamentally, a lot of our users are the platform teams that are then trying to enable their entire company. And they already have a hard enough job, so we don’t need to make it harder.
Jason: So, you mentioned something there about releasing, and when you came out of that cave, people cutting themselves on it. And I feel like, yeah, that was definitely a struggle. Istio sort of launched and it was this hot thing, right, because it solved a lot of needs that companies really had, but because of that complexity, there was a question of how do I do this?
Jason: So, continuing on that momentum, though, Istio is still pretty hot. I think it’s still an extremely valuable and useful tool. If our listeners, if someone out there is listening and says, “You know, Istio does solve these things. I want to have more control over the networking within my Kubernetes cluster.” What are some of the tips—what’s some of that advice of how should they think about rolling this out? What are the things that they should consider, particularly so they don’t cut themselves and they can have a reliable application?
Zack: Yeah. That’s an excellent question. So, the first piece of advice that I’ll give is one that I’ve been giving for actually a couple years now, which is, pick exactly one problem. In the space that we’re in and in the space of networking, there’s a lot of essential complexity. Then Istio introduces incidental complexity.
We’ve been trying to reduce that incidental complexity, but fundamentally, there’s a level of complexity present that is just because of the problem space that we’re solving, right? Networking is hard. And adopting a new fundamental piece of your stack is hard. So, don’t try and move the whole world at once. Pick one thing.
And so, what is the killer use case for your organization? I work with a lot of financial companies and the killer use case is security. I’m sure for a lot of our listeners here that they’re interested in this podcast, right, “I need reliability. I want a traffic controller. I want canaries to be able to do safe rollouts,” those kinds of things.
Whatever it is, pick one and ignore all the other features. Just use that one thing, and use that as your tool to cut your teeth on the project, learn how to deploy it and operate it, figure out how to roll out Envoy as the sidecar to a small set of applications initially. And go incrementally. And once you gain that confidence, the relative complexity of enabling a new feature or exposing a new set of features to application developers using the mesh is so much lower than if you just say, “Hey, I installed this. Have at it.”
So, be mindful of the rollout and how you’re doing that. The second piece of advice that I would give: look at how you can hide some of the underlying complexity for users, too. You don’t need to give users virtual services, and destination rules, and all these other Istio config objects. You can, for example, give a set of Helm templates that encapsulate a small part of the overall Istio behavior, and let your developers tweak within that box that you give them, that template that you give them. The other big problem that we see quite a lot is, make sure you have a plan for upgrade.
And this is what—I see a lot of people that—or a lot of teams and organizations that get really excited about Istio; they have a hard problem to solve; they have one problem to solve; they solve that one problem. And then it’s six months later and Istio has had three different security patches and a minor version release, and hey, Istio actually only supports two versions concurrently for security patches, and so the next one that comes out, you fall off the train. In general, we see a lack of planning around day-two operations and upgrade. Fortunately, this is where Istio in modern versions has a whole lot of features to help. There’s even features coming out in 1.10 that make it even easier. 1.10 is coming out—by the time this releases, will have already been out.
And so this is an area where we’re constantly improving, but be mindful that like, hey, it’s not a static—there is a little bit of a moving target here, in terms of staying up to date for security. Just like any other piece of infrastructure that you have. And very often, I run into teams that don’t plan for that day-two bit because it’s not as widely talked about, quite frankly. It’s a shame and can leave them in painful spots. By and large, it’s a lot better these days to be able to do Istio upgrades and similar, but that has bit people.
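[Editor's sketch, not part of the conversation: Zack's suggestion above is to hand developers a small template rather than raw Istio config objects. He mentions Helm templates; the same idea is shown here in Python purely for illustration. The function name `canary_manifests`, the `stable`/`canary` subset names, and the `version` label are hypothetical choices, and the `apiVersion` string may differ by Istio release. Developers supply only a service name and a canary percentage; the platform team owns the shape of the objects.]

```python
import json
from typing import Any, Dict, List


def canary_manifests(service: str, canary_percent: int) -> List[Dict[str, Any]]:
    """Build a DestinationRule and VirtualService for a weighted canary.

    Hypothetical helper: the only knobs exposed are the service name and
    the share of traffic sent to the canary subset.
    """
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")

    # Assumed API version; check the one your Istio release serves.
    api_version = "networking.istio.io/v1beta1"

    # Subsets map names to pod labels (label key/values are assumptions).
    destination_rule = {
        "apiVersion": api_version,
        "kind": "DestinationRule",
        "metadata": {"name": service},
        "spec": {
            "host": service,
            "subsets": [
                {"name": "stable", "labels": {"version": "stable"}},
                {"name": "canary", "labels": {"version": "canary"}},
            ],
        },
    }

    # Weighted routing between the two subsets; weights sum to 100.
    virtual_service = {
        "apiVersion": api_version,
        "kind": "VirtualService",
        "metadata": {"name": service},
        "spec": {
            "hosts": [service],
            "http": [
                {
                    "route": [
                        {
                            "destination": {"host": service, "subset": "stable"},
                            "weight": 100 - canary_percent,
                        },
                        {
                            "destination": {"host": service, "subset": "canary"},
                            "weight": canary_percent,
                        },
                    ]
                }
            ],
        },
    }
    return [destination_rule, virtual_service]


if __name__ == "__main__":
    for manifest in canary_manifests("checkout", 10):
        print(json.dumps(manifest, indent=2))
```

[The point of the sketch is the narrow interface: a developer can move the canary weight but cannot reach the rest of the Istio configuration surface, which stays with the platform team.]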
Jason: I guess along those lines then, is it recommended because we are starting to see managed Istio implementations with cloud providers, should people just opt for that?
Zack: Yeah. Much like Kubernetes, this is not necessarily a layer that you need, or want, or have to operate. Now, I will say a lot of the cloud provider Istio flavors are of varying levels of support. And so your mileage may vary depending on which cloud you’re in. There are a bunch of vendors around that help with it, too.
I work for one such vendor. And so there are folks to be able to help. In general, I would say it is not valuable to your company to run this infrastructure code. Having this stuff available for your application developers is hugely valuable for your company, for the end-users of your company’s products, but running it does not provide them value. And so this is a case where, yes, I would say, along with a lot of other infrastructure projects, push that out of house and focus in on what gives you value for your customers.
Jason: Awesome. Well, that’s all fantastic advice, Zack. Thanks for joining us on the podcast. Before we go, just wanted to give you an opportunity, if there’s anything that you want to promote or tell people about. And also, how can folks contact you if they’ve got more questions or want to keep in touch?
Zack: So, obviously like I said, hey, I do work for a company that does some of this stuff. So, go check out tetrate.io. If you’re interested in some of the Istio Lifecycle Management, Istio adoption, some of that, we actually have a really, really great resource, istio.tetratelabs.io. Tetrate Labs is our home for open-source. And under the Istio site there, there’s a ton of resources, getting started guides, cheat sheets for Istio stuff. So, go check that out. If you want to get at me, I’m probably easiest to get at on Twitter at @zackbutcher, Z-A-C-K Butcher, and I’m that tag basically, I’m that handle, basically on all the things, if you want to get me on GitHub, or wherever else, that’s the tag to get me. More than happy to talk about this stuff. So, yeah.
Jason Yee: For links to all the information mentioned, visit our website at gremlin.com/podcast. If you liked this episode, subscribe to the Break Things on Purpose podcast on Spotify, Apple Podcasts, or your favorite podcast platform. Our theme song is called Battle of Pogs by Komiku and is available on loyaltyfreakmusic.com.