Podcast: Break Things on Purpose | Gustavo Franco, Senior Engineering Manager at VMware
In this episode, Jason is joined by Gustavo Franco, Senior Engineering Manager at VMware, to chat about the early days of chaos engineering. Gustavo reflects on Google’s early disaster recovery practices through to the contemporary SRE movement. Gustavo’s work at VMware has him focusing on building reliability as a product feature, and on how the new VMware Tanzu Reliability Engineering Organization is changing up the SRE game by doing “horizontal work” across the reliability spectrum. Gustavo divulges some details on exactly how that horizontal work is being done with security, in which he has an extensive history, and other areas!
In this episode, we cover:
- 00:00:00 - Introduction
- 00:03:20 - VMware Tanzu
- 00:07:50 - Gustavo’s Career in Security
- 00:12:00 - Early Days in Chaos Engineering
- 00:16:30 - Catzilla
- 00:19:45 - Expanding on SRE
- 00:26:40 - Learning from Customer Trends
- 00:29:30 - Chaos Engineering at VMWare
- 00:36:00 - Outro
- VMware Tanzu: https://tanzu.vmware.com
- GitHub for SREDocs: https://github.com/google/sredocs
- E-book on how to start your incident lifecycle program: https://tanzu.vmware.com/content/ebooks/establishing-an-sre-based-incident-lifecycle-program
- Twitter: https://twitter.com/stratus
Jason: Welcome to Break Things on Purpose, a podcast about chaos engineering and building reliable systems. In this episode, Gustavo Franco, a senior engineering manager at VMware, joins us to talk about building reliability as a product feature, and the journey of chaos engineering from its place in the early days of Google’s disaster recovery practices to the modern SRE movement. Thanks, everyone, for joining us for another episode. Today with us we have Gustavo Franco, who’s a senior engineering manager at VMware. Gustavo, why don’t you say hi, and tell us about yourself.
Gustavo: Thank you very much for having me. Gustavo Franco; as you were just mentioning, I’m a senior engineering manager now at VMware. So, recently co-founded the VMware Tanzu Reliability Engineering Organization with Megan Bigelow. It’s been only a year, actually. And we’ve been doing quite a bit more than SRE; we can talk about like—we’re kind of branching out beyond SRE, as well.
Jason: Yeah, that sounds interesting. For folks who don’t know, I feel like I’ve seen VMware Tanzu around everywhere. It just suddenly went from nothing into this huge thing of, like, every single Kubernetes-related event, I feel like there’s someone from VMware Tanzu on it. So, maybe as some background, give us some information; what is VMware Tanzu?
Gustavo: Kubernetes is sort of the engine, and we have a Kubernetes distribution called Tanzu Kubernetes Grid. So, one of my teams actually works on Tanzu Kubernetes Grid. So, what is VMware Tanzu? What this really is, is what we call a modern application platform, really an end-to-end solution. So, customers expect to buy not just Kubernetes, but everything around, everything that comes with giving the developers a platform to write code, to write applications, to write workloads.
So, it’s basically the developer at a retail company or a finance company; they don’t want to run Kubernetes clusters. They would like the ability to, maybe, but they don’t necessarily think in terms of Kubernetes clusters. They want to think about workloads, applications. So, VMware Tanzu is an end-to-end solution where the engine is Kubernetes.
Jason: That definitely describes at least my perspective on Kubernetes is, I love running Kubernetes clusters, but at the end of the day, I don’t want to have to evaluate every single CNCF project and all of the other tools that are required in order to actually maintain and operate a Kubernetes cluster.
Gustavo: I was just going to say, we acquired Pivotal a couple of years ago, so that brought a ton of open-source projects, such as the Spring Framework. So, for Java developers, I think it’s really cool, too, just being able to worry about development and the Java layer. And from a reliability and chaos engineering perspective, it kind of really gives us full tooling, the ability to have common libraries. It’s so important for reliability engineering and chaos engineering as well, to give people this common surface that we can actually use to inject faults, potentially, or even just define standards.
Jason: Excellent point about having that common framework in order to do these reliability practices. So, you’ve explained what VMware Tanzu is. Tell me a bit more about how SRE fits in with VMware Tanzu?
Gustavo: Yeah, so one thing that happened the past few years, the SRE organization grew beyond SRE. We’re doing quite a bit of horizontal work, so SRE being one of them. So, just an example, I got to charter a compliance engineering team and one team that we call ‘Customer Zero.’ I would call them partially the representatives of growth, and then quote-unquote, “Customer problems, customer pain”, and things that we have to resolve across multiple teams. So, SRE is one function that clearly you can think of.
You cannot just think of SRE on a product basis, but you think of SRE across multiple products because we’re building a platform with multiple pieces. So, it’s kind of like putting the building blocks together for this platform. So then, of course, we’re going to have to have a team of specialists, but we need an organization of generalists, so that’s where SRE and this broader organization comes in.
Jason: Interesting. So, it’s not just we’re running a platform, we need our own SREs, but it sounds like it’s more of a group that starts to think more about the product itself and maybe even works with customers to help their reliability needs?
Gustavo: Yeah, a hundred percent. We do have SRE teams that invest the majority of their time running SaaS, so running Software as a Service. So, one of them is Tanzu Mission Control. It’s purely SaaS, and what Tanzu Mission Control does is allow customers to run Kubernetes anywhere. So, if people have Kubernetes on-prem or they have Kubernetes on multiple public clouds, they can use TMC to be that common management surface, both API and web UI, across Kubernetes, really anywhere they have Kubernetes. So, that’s SaaS.
But for TKG SRE, that’s a different problem. We don’t currently have a TKG SaaS offering, so customers are running TKG on-prem or on public cloud themselves. So, what does the TKG SRE team do? So, that’s one team that actually [unintelligible 00:05:15] to me, and they are working directly on improving the reliability of the product. So, we build reliability as a feature of the product.
So, we build a reliability scanner, which is a [unintelligible 00:05:28] plugin. It’s open-source. I can give you more examples, but that’s the gist of it, of the idea that you would hire security engineers to improve the security of a product that you sell to customers to run themselves. Why wouldn’t you hire SREs to do the same to improve the reliability of the product that customers are running themselves? So, kind of, SRE beyond SaaS, basically.
Jason: I love that idea because I feel like a lot of times in organizations that I talk with, SRE really has just been a renamed ops team. And so it’s purely internal; it’s purely thinking about we get software shipped to us from developers and it’s our responsibility to just make that run reliably. And this sounds like it is that complete embrace of the DevOps model of breaking down silos and starting to move reliability, thinking of it from a developer perspective, a product perspective.
Gustavo: Yeah. A lot of my work is spent on making analogies with security, basically. One example, several of the SREs in my org, yeah, they do spend time doing PRs with product developers, but also they do spend a fair amount of time doing what we call in a separate project right now—we’re just about to launch something new—a reliability risk assessment. And then you can see the parallels there. Where like security engineers would probably be doing a security risk assessment or to look into, like, what could go wrong from a security standpoint?
So, I do have a couple of engineers working on reliability risk assessment, which is: what could go wrong from a reliability standpoint? What are the… known pitfalls of the architecture, the system design that we have? What does the architecture of the service look like? And yeah, what are the outages that we know already that we could have? So, if you have a dependency on, say, a file on a CDN, yeah, what if the CDN fails?
It’s obvious, and I know most of the audience will be like, “Oh, this is obvious,” but are you writing this down in a spreadsheet and trying to stack-rank those risks? And after you stack-rank them, are you then mitigating them, going top-down? There was an SREcon talk by Matt Brown, a former colleague of mine at Google; it’s basically a “know your enemy” tech talk at SREcon. He talks about how SRE needs to have a more conscious approach to reliability risk assessment. So, I really embraced that, and we embraced that at VMware. The SRE work that I do comes a little bit from my beginnings, my initial background of working in security.
Jason: I didn’t actually realize that you worked security, but I was looking at your LinkedIn profile and you’ve got a long career doing some really amazing work. So, you said you were in security. I’m curious, tell us more about how your career has progressed. How did you get to where you are today?
Gustavo: Very first job, I was 16. There was this group of sysadmins on the first internet service provider in Brazil. One of them knew me from BBS, Bulletin Board Systems, and they, you know, were getting hacked, left and right. So, this guy referred me, and he referred me saying, “Look, it’s this kid. He’s 16, but he knows his way around this security stuff.”
So, I show up, they interview me. I remember one of the interview questions; it’s pretty funny. They asked me, “Oh, what would you do if we asked you to go and actually physically grab the routing table from AT&T?” It’s just, like, a silly question, and I told them, “Uh, that’s impossible.” So, I kind of told them the gist of what I knew about routing, and that it was impossible to physically get a routing table.
For some reason, they loved that. I was the only candidate who would tell them, “No. I’m not going to do it because it makes no sense.” So, they hired me. And the security work was basically teaching the older sysadmins about SSH because they were all on telnet; nothing was encrypted.
There was no IDS—this was a long time ago, right, so the explosion of cybersecurity firms did not exist then, so it was new. To be, like, a security company was a new thing. So, that was the beginning. I did dabble in open-source development for a while. I had a couple of other jobs at ISPs.
Google found me because of my dev and open-source work in ’06, ’07. I interviewed, joined Google, and at Google, all of it was IC work, basically—individual contributor. And at Google, I started doing SRE-type work, but for the corporate systems. There was this failed attempt to migrate all the corporate systems from one Linux distribution to another, and I tech-led the effort that made it successful. I don’t think I should take the credit; it was really just a matter of, you know, trying a second time—the organization learned the lessons it had to learn from the first time. So, we did it a second time and it worked.
And then yeah, I kept going. I did more SRE work in corp, I did some stuff in production, like all the products. So, I did a ton of stuff. I did—let’s see—technical infrastructure, disaster recovery testing, I started a chaos-engineering-focused team. I worked on Google Cloud before we had a name for it. [laugh].
So, I was the first SRE on Google Compute Engine and Google Cloud Storage. I managed the Google Plus SRE team, and G Suite for a while. And finally, after all these runs on different teams, developing new SRE teams and organizations, and different styles, different programs in SRE, Dave Rensin, who created the CRE team at Google, recruited me with Matt Brown, who was then the tech lead, to join the CRE team, which was the team at Google focused on teaching Google Cloud customers how to adopt SRE practices. So, because I had this very broad experience within Google, they thought, yeah, it would be cool if you could share that experience with customers.
And then I acquired even more experience working with random customers trying to adopt SRE practices. So, I think I’ve seen a little bit of it all. VMware wanted me to start, basically, a CRE team following the same model that we had at Google, which culminated all this in TKG SRE that I’m saying, like, we work to improve the reliability of the product and not just teaching the customer how to adopt SRE practices. And my pitch to the team was, you know, we can and should teach the customers, but we should also make sure that they have reasonable defaults, that they are providing a reasonable config. That’s the gist of my experience, at a high level.
Jason: That’s an amazing breadth of experience. And there’s so many aspects that I feel like I want to dive into [laugh] that I’m not quite sure exactly where to start. But I think I’ll start with the first one, and that’s that you mentioned that you were on that initial team at Google that started doing chaos engineering. And so I’m wondering if you could share maybe one of your experiences from that. What sort of chaos engineering did you do? What did you learn? What were the experiments like?
Gustavo: So, a little bit of the backstory. This is probably known because Kripa mentioned it several times before—Kripa Krishnan, she actually initiated disaster recovery testing way, way before there was such a thing as chaos engineering—that was 2006, 2007. That was around the time I was joining Google. So, Kripa was the first one to lead disaster recovery testing. It was very manual; it was basically a room full of project managers with Post-its, asking teams, like, “Hey, can you test your stuff? Can you test your processes? What if something goes wrong? What if there’s an earthquake in the Bay Area type of scenario?” So, that was the predecessor.
Many, many years later, I worked with her, with my SRE teams participating in disaster recovery testing, but I was never part of the team responsible for it. And then seven years later, I was. So, she recruited me with the following pitch; she was like, “Well, the program is big. We have disaster recovery tests, we have a lot of people testing, but we are struggling to convince people to test year-round. So, people tend to test once a year, and they don’t test again. Which is bad. And also,” she was like, “I wish we had software; there’s something missing.”
We had the spreadsheets, we track people, we track their tasks. So, it was still very manual. The team didn’t have a tool for people to test. It was more like, “Tell me what you’re going to test, and I will help you with scheduling, I’ll help you to not conflict with the business and really disrupt something major, disrupt production, disrupt the customers, potentially.” A command center, like a center of operations.
That’s what they did. I was like, “I know exactly what we need.” But then I surveyed what was out there in the open-source world, and of course, Netflix gets—deserves a lot of credit for it, but there was nothing that could be applied to the way we were running infrastructure internally. And I also felt that if we built this centrally and built a catalog of tests ourselves, and that’s it, people were not going to use it. We have a bunch of developers, software engineers.
They’ve got to feel—they want to feel, and rightfully so—that they are in control, and they want to customize the system. So, in two weeks, I hacked a prototype that was almost like a workflow engine for chaos engineering tests, and I wrote two or three tests, but there was an API for people to bring their own tests to the system, so they could register a new test and basically send me a patch to add their own tests. And, yeah, to my surprise, a year later—and the absolute comparison is not really fair, but—we had an order of magnitude more testing being done through the software than through manual tests. So, on a per-unit basis, the quality of the automated tests was lower, but the cool thing was that people were testing a lot more often. And it was also very surprising to see the teams that were testing.
Because there were teams that refused to do the manual disaster recovery testing exercise, but they were using the software now to test, and that was part of their regular integration test infrastructure. So, they’re not quite starting with okay, we’re going to test in production, but they were testing staging, they were testing a developer environment. And in staging, they had real data; they were finding regressions. I can mention the most popular test, too, because I’ve spoken about this publicly before, which was fuzz testing. A lot of things are RPC services, RPC servers.
Fuzz testing is really useful in the sense that, you know, if you send random data in an RPC call, will the server crash? Will the server handle this gracefully? So, we found a lot of people—not us—a lot of people used our shared service, bringing their own tests, and fuzz testing was very popular to run continuously. And they would find a ton of crashes. We had a lot of success with that program.
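The fuzz testing Gustavo describes can be sketched roughly like this. The handler and its wire format below are entirely hypothetical (not Google’s actual RPC stack); the point is the shape of the loop: feed random bytes at a request handler and treat any escaped exception as a finding.

```python
import random

def handle_request(payload: bytes) -> dict:
    """Hypothetical RPC handler: expects UTF-8 'key=value' pairs joined by '&'."""
    try:
        text = payload.decode("utf-8")
        pairs = dict(p.split("=", 1) for p in text.split("&") if "=" in p)
        return {"status": "ok", "fields": pairs}
    except (UnicodeDecodeError, ValueError):
        # Graceful degradation: malformed input yields an error response,
        # never an unhandled exception.
        return {"status": "bad_request", "fields": {}}

def fuzz(handler, iterations=1000, max_len=64, seed=42):
    """Feed random byte strings to the handler; collect any crashes."""
    rng = random.Random(seed)
    crashes = []
    for _ in range(iterations):
        payload = bytes(rng.randrange(256) for _ in range(rng.randrange(max_len)))
        try:
            handler(payload)
        except Exception as exc:  # an exception escaping here is a finding
            crashes.append((payload, exc))
    return crashes

crashes = fuzz(handle_request)
print(f"{len(crashes)} crashing inputs found")
```

Run continuously in CI against staging, as the teams Gustavo mentions did, a non-empty `crashes` list is exactly the kind of regression this style of testing surfaces.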
This team that I ran, which was dedicated to building this shared service as a chaos engineering tool—ironically named Catzilla, and I’m not a cat person, so there’s a story there, too—was also doing more than just Catzilla, which we can also talk about because there’s a little bit more in the incident management space.
Jason: Yeah. Happy to dive into that. Tell me more about Catzilla?
Gustavo: Yeah. So, Catzilla was sort of the first from-scratch project from the team that ended up being responsible for sharing a coherent vision around incident prevention. And then we would put Catzilla there, right—the chaos engineering shared service—under prevention, alongside detection, analysis, and response. Because once I started working on this, I realized, well, you know what? People are still being paged; they have good training, we had a good incident management process, so we have good training for people to coordinate incidents, but if you don’t have SREs working directly with you—and most teams didn’t—you also struggle to communicate with executives.
It was a struggle to figure out what to do with prevention, and Catzilla sort of resolved that a little bit. So, think of a team, like an SRE team, in charge of not necessarily running a SaaS, but a team that works on behalf of the company to help it think holistically about incident prevention, detection, analysis, and response. So, we ended up building more software for those. Part of the motivation was—well, a pet peeve of mine is that people write postmortems and then they would just give them to new employees to read. So, people never really learned from the postmortems, and there was not a lot of information recovered from those retrospectives.
Some teams were very good at following up on action items and having discussions. And that’s kind of how you see the community now, people talking about how we should approach retrospectives. It happened, but it wasn’t consistent. So then, well, one thing that we could do consistently is extract all the information that people spend so much time writing into the retrospectives. So, my pitch was: instead of having this unstructured text, can we have it both unstructured and structured?
So, then we launched a postmortem template that was also machine-readable, so we could extract information and then generate reports for business leaders to say, “Okay, here’s what we see on a recurring basis, what people are talking about in the retrospectives, what they’re telling each other as they go about writing the retrospectives.” So, we found some interesting issues to resolve that were not obvious on a per-retrospective basis. That covered everything down to the analysis of the incidents. On the management part, we built tooling. You can think of it as a SaaS, but just for internal employees, similar to what externally would be an incident dashboard, you know, like a status page of sorts.
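A machine-readable postmortem template of the kind Gustavo describes might look like the sketch below. The field names and the header-plus-narrative layout are illustrative assumptions, not the actual Google template; the idea is simply that a structured header can be parsed and aggregated across many retrospectives while the free-text narrative stays intact.

```python
from collections import Counter

# Hypothetical postmortem: machine-readable header, then free-text narrative.
POSTMORTEM = """\
severity: SEV-2
trigger: config-change
detection: alerting
duration_minutes: 43
---
## Summary
A bad flag flip caused elevated 5xx rates until the change was rolled back.
"""

def parse_postmortem(doc: str) -> dict:
    """Split the structured header from the narrative; parse key: value pairs."""
    header, _, narrative = doc.partition("---")
    fields = {}
    for line in header.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            fields[key.strip()] = value.strip()
    fields["narrative"] = narrative.strip()
    return fields

def trigger_report(postmortems: list[str]) -> Counter:
    """Aggregate incident triggers across many retrospectives."""
    return Counter(parse_postmortem(p)["trigger"] for p in postmortems)

print(trigger_report([POSTMORTEM, POSTMORTEM]))
```

Running `trigger_report` over a quarter’s worth of retrospectives is what produces the recurring-theme reports for business leaders mentioned above.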
Of course, with a lot more information internally for people participating in incidents than they’d have externally. For me, it’s about how you think of the SRE—and I managed many SRE teams that were responsible for running production services, such as Compute Engine, Google Plus, Hangouts, so I do think of SREs as the folks managing production systems and going on call. But also think of them as reliability specialists. When you think of SREs as reliability specialists who can do more than respond to pages, then you can slot SREs and SRE teams into many other areas of an organization.
Jason: That’s an excellent point. Just that idea of an SRE being more than just the operations on-call unit. I want to jump back to what you mentioned about taking and analyzing those retrospectives and analyzing your incidents. That’s something that we did when I was at Datadog. Alexis Lê-Quôc, who’s the CTO, has a fantastic talk about that at Monitorama that I’ll link to in the [show notes 00:19:49].
It was very clear from taking the time to look at all of your incidents, to catalog them, to really try to derive what’s the data out of those and get that information to help you improve. We never did it in an automated way, but it sounds like with an automated tool, you were able to gather so much more information.
Gustavo: Yeah, exactly. And to be clear, we did this manually before, so we understood the cost of it. And our bar, company-wide, for people writing retrospectives was pretty low, so I can’t give you hard numbers, but we had a surprising number of retrospectives, let’s say on a monthly basis, because a lot of them were not necessarily things that many customers would experience. So, near misses or things that impact very few customers—potentially very few customers within a country—could end up in a retrospective, so we had this throughput. So, it wasn’t just, say, the highest-severity outages.
Like the stuff that you see in the press that happens maybe once or twice a year. So, we had quite a bit of data to discuss. When we did it manually, we were like, “Okay, yeah, there’s definitely something here because there’s a ton of information; we’re learning so much about what happens,” but at the same time, we were like, “Oh, it’s painful to copy and paste the useful stuff from a document to a spreadsheet and then crunch the spreadsheet.” And kudos—I really need to mention their names, too: Sue Lueder and also Jelena Oertel. Both of them were amazing project and program managers who did the brunt of this work back in the days when we were doing it manually.
We had a rotation with SREs participating, too, but our project managers were awesome. And also, a secret weapon of SRE that definitely has other value is the project managers and program managers who work with SREs. And I need to shout out—if you’re a project manager or program manager working with SREs, shout out to you.
Jason: As you started to analyze some of those incidents, every infrastructure is different, every setup is different, so I’m sure that maybe the trends that you saw are perhaps unique to those Google teams. I’m curious if you could share the, say, top three themes that might be interesting and applicable to our listeners, and things that they should look into or invest in?
Gustavo: Yeah, one thing that I tell people about adopting the practices in the SRE books is—and people joke about it, so I’ll explain the numbers a little better—70 to 75% of incidents are triggered by config changes. And people are like, “Oh, of course. If you don’t change anything, there are no incidents, blah, blah, blah.” Well, that’s not true; that number really speaks to a change in the service that is impacted by the incident.
So, that is not a change in an underlying dependency. Because people were very quick to blame their dependencies, right? Meaning, if you think of a microservice mesh, the service app is going to say, “Oh, sure, my service was throwing errors, but it was something with G or H underneath, in a layer below.” In 75% of cases—and this is public information, it’s in the books, right—when a retrospective was written, for the service that was throwing the errors, it was something that changed in that service, not above or below; 75% of the time, a config change.
And it was interesting when we would go and look into some teams where there was a huge deviation from that. For some teams, it was, I don’t know, 85% binary deploys. So, they’re not really changing config that much, or the configuration changes are not triggering incidents. For those teams, a common phenomenon was that binary deploys were spiking as contributing factors and main triggers for incidents because they couldn’t roll out config changes in production that well, so they’re like, yeah, of course, [laugh] my binary deploys will break my own service more.
But that showed a lot of people that a lot of things were, quote-unquote, “Under their control.” And it also was used to justify a project and a technique that I think is undervalued by SREs in the wild, or folks running production in the wild, which is canary evaluation systems. So, all these numbers and a lot of this analysis were used to justify extra funding for the system that, basically, systematically across the entire company, if you tried to deploy a binary to production, or a config change to production, would evaluate a canary: if the binary is in a crash loop, if the binary is throwing many errors, if something is changing in a clearly unpredictable way, it will pause, it will abort the deploy. Which, back to—much easier said than done. It sounds obvious, right, “Oh, we should do canaries,” but, “Oh, can you automate your canaries in such a way that they’re looking at monitoring time series and will stop a release and roll back a release so a human operator can jump in and be like, ‘Oh, okay. Was it a false positive or not?’”
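The automated canary gate Gustavo describes reduces, at its core, to a decision function over monitoring signals. The thresholds and signal shapes below are invented for illustration (real systems compare full time series, not two counters), but the sketch shows the continue/pause/abort logic that lets a rollout halt itself before a human ever gets paged:

```python
def evaluate_canary(baseline_errors, canary_errors, crash_looping,
                    max_ratio=2.0, min_samples=100):
    """Decide whether a rollout should proceed, based on simple guardrails.

    baseline_errors / canary_errors: (errors, requests) tuples from monitoring.
    Returns 'abort', 'pause', or 'continue'.
    """
    if crash_looping:
        return "abort"  # the new binary cannot even stay up
    b_err, b_req = baseline_errors
    c_err, c_req = canary_errors
    if c_req < min_samples:
        return "pause"  # not enough canary traffic yet to judge
    baseline_rate = b_err / max(b_req, 1)
    canary_rate = c_err / max(c_req, 1)
    # Abort if the canary's error rate is clearly worse than the baseline's.
    if canary_rate > max_ratio * max(baseline_rate, 0.001):
        return "abort"
    return "continue"

print(evaluate_canary((50, 10_000), (4, 500), crash_looping=False))   # healthy canary
print(evaluate_canary((50, 10_000), (60, 500), crash_looping=False))  # degraded canary
```

The "pause" state matters: as Gustavo notes, a human operator can then jump in and decide whether the signal was a false positive before the release resumes or rolls back.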
Jason: I think that moving to canary deployments, I’ve long been a proponent of that, and I think we’re starting to see a lot more of that with tools such as—things like LaunchDarkly and other tools that have made it a whole lot easier for your average organization that maybe doesn’t have quite the infrastructure build-out. As you started to work on all of this within Google, you then went to the CRE team and started to help Google Cloud customers. Did any of these tools start to apply to them as well, analyzing their incidents and finding particular trends for those customers?
Gustavo: More than one customer, when I described, say, our incident lifecycle management program and the chaos engineering program—especially this lifecycle stuff—in the beginning was, “Oh, okay. How do I do that?” And I open-sourced a very crufty prototype, which some customers picked up and implemented internally in their companies. And it’s still on GitHub, at /google/sredocs.
There’s an ugly parser, an example template for the machine-readable stuff, and how to basically take your retrospectives and dump the data into Google BigQuery to be able to query it more structurally. So yes, customers would ask us, “Yeah, I heard about chaos engineering. How do you do chaos engineering? How can we start?”
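The last step of that pipeline, getting parsed retrospectives into BigQuery, typically means emitting newline-delimited JSON, which BigQuery accepts as a load format. The rows and field names below are illustrative, not the actual google/sredocs schema:

```python
import io
import json

# Hypothetical rows extracted from retrospectives by a parser like the one
# in google/sredocs; the field names here are illustrative only.
RETROSPECTIVES = [
    {"id": "2021-07-001", "trigger": "config-change", "minutes_to_detect": 4},
    {"id": "2021-07-002", "trigger": "binary-deploy", "minutes_to_detect": 11},
]

def to_ndjson(rows) -> str:
    """Serialize rows as newline-delimited JSON, one object per line."""
    buf = io.StringIO()
    for row in rows:
        buf.write(json.dumps(row, sort_keys=True) + "\n")
    return buf.getvalue()

ndjson = to_ndjson(RETROSPECTIVES)
print(ndjson)
```

From there, a `bq load` of the NDJSON file into a table is enough to start asking the structural questions Gustavo mentions, like what fraction of incidents were triggered by config changes.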
So, like, I remember a retail one where we had a long conversation about it, and some folks in tech wanted to know, “Yeah, incident response; how do I go about it?” Or, “What do I do with my retrospectives?” People started to realize that, “Yeah, I write all this stuff and then we work on the action items, but then I have all these insights written down and no one goes back to read them. How can I get actionable insights, actionable information out of it?”
Jason: Without naming any names because I know that’s probably not allowed, are there any trends from customers that you’d be willing to share? Things that maybe—insights that you learned from how they were doing things and the incidents they were seeing that was different from what you saw at Google?
Gustavo: Gaming is very unique because a lot of gaming companies, when we would go into incident management, [unintelligible 00:26:59] they were like, “If I launch a game, it’s ride or die.” There may be a game where, in the first 24 or 48 hours, if the customers don’t show up, they will never show up. So, that was a little surprising and unusual. Another trend: in finance, you would expect them to be a little behind or too strict on process, et cetera, but they are still very sophisticated customers, I would say. The new teams of folks are really interested in learning how to modernize the finance infrastructure.
Let’s see… well, in tech, we basically talk the same language, with gaming being a little different. In retail, the uniqueness of having a ton of things at the edge was a little bit of a challenge. So, they have these hubs, where they have, say, a public cloud or an on-prem data center, and then they have things running at the stores, so you end up having this conversation with them about different tiers and how to manage different incidents. Because if a flagship store is offline, it is a big deal. And from a, again, SaaS mindset, if you think of SRE and you always manage through a public cloud, you’re like, “Oh, I just call my cloud provider; they’ll figure it out.”
But for a retail company with things at the edge, at a store, they cannot just sit around and wait for the public cloud to restore their service. So again, there are a lot more nuanced conversations you have to have there, like, yeah, okay. Here at, say, a VMware or a Google, we don’t deal with this problem internally, so how would I address this? The answers are very long, and they always depend.
They need to consider, oh, do you have an operational team that you can drive around? [laugh]. Do you have people, do you have staffing, that can go to the stores? How long will it take? So, the SLO conversation there is tricky. Do you want to have people on call 24/7? Do you have people near that store who can go physically and do anything about it? And more often than not, they rely on third-party vendors, so then it’s not staffed in-house and they’re not super technical, so remote management conversations come into play. And then you talk about, “Oh, what’s your network infrastructure for that remote management?” Right? [laugh].
Jason: Things get really interesting when you start to essentially outsource to other companies and have them provide the technology, and you try to get that interface. So, you mentioned doing chaos engineering within Google, and now you’ve moved to VMware with the Tanzu team. Tell me a bit more about how do you do chaos engineering at VMware, and what does that look like?
Gustavo: I’ve seen varying degrees of adoption. So, right now, within my team, what we are doing—we’re actually going, as we speak right now, through a big reliability assessment for a launch. Unfortunately, we cannot talk about it yet. We’re probably going to announce this in October at VMworld. As part of this big launch, we started by doing a reliability risk assessment.
And the way we do this is we interview the developers—this hasn’t launched yet, so we’re still designing this thing together. We ask [unintelligible 00:30:05] the developers about the architecture that they basically sketch out: like, what is it that you’re going to do? What are the user journeys, the user stories? Who is responsible for what? And let’s put an architecture diagram, a sketch, together.
And then we try to poke holes: “Okay, what could go wrong here?” We write this stuff down. More often than not, from this list—and I can already see that’s where that output, that result, fits into any sort of chaos engineering plan. One thing that I can tell you from that risk assessment, because I participated in the beginning, was that there is a level of risk involving a CDN, so one thing that we’re likely going to test before we get to general availability is: let’s simulate that the CDN is cut off from the clients.
But even before we do the test, we’re already asking. It’s trust but verify, actually; we do trust, but we verify. The client is actually another team. So, we do trust the client team that they cache, but we are asking them, “Okay. Can you confirm that you cache? And if you do cache, can you give us access to flush the cache?”
We trust them, we trust the answers; we’re going to verify. And how do we verify? Through a chaos engineering test: let’s cut the client off from the CDN and see what happens. Which could be, for us, as simple as moving the file away; we should expect them to not tell us anything, because the client will fail to read but it’s going to pick from cache—it’s not reading from us anyway. So, there is that level of: we tell people, “Hey, we’re going to test a few things.”
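The “trust but verify” experiment Gustavo describes can be sketched in a few lines. Everything below is hypothetical (the class names, the cached file path, the way the failure is injected are illustrative, not VMware’s actual setup); it just models the idea: warm the client team’s cache, cut the CDN off or move the file away, and verify the client keeps serving from its cache.

```python
# Hypothetical sketch of the trust-but-verify CDN experiment:
# the client should keep serving from its cache when the CDN is cut off.

class CDN:
    """Toy stand-in for the CDN origin."""
    def __init__(self, files):
        self.files = dict(files)
        self.available = True  # flip to False to inject the failure

    def fetch(self, path):
        if not self.available or path not in self.files:
            raise ConnectionError(f"CDN unavailable for {path}")
        return self.files[path]


class Client:
    """Toy stand-in for the client team's service: reads from the CDN,
    falls back to its local cache when the CDN is unreachable."""
    def __init__(self, cdn):
        self.cdn = cdn
        self.cache = {}

    def get(self, path):
        try:
            content = self.cdn.fetch(path)
            self.cache[path] = content   # populate cache on success
            return content
        except ConnectionError:
            if path in self.cache:
                return self.cache[path]  # verified: served from cache
            raise


# Experiment: warm the cache, then "move the file away" / cut off the CDN.
cdn = CDN({"/config.json": '{"feature": "on"}'})
client = Client(cdn)
client.get("/config.json")   # normal read warms the cache
cdn.available = False        # inject the failure

# If caching works as the client team claims, this still succeeds
# and, as Gustavo notes, nobody should even need to tell you anything.
assert client.get("/config.json") == '{"feature": "on"}'
```

The same shape applies with real infrastructure: the failure injection becomes a network block or an origin-file move, and the assertion becomes your monitoring staying quiet.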
We won’t necessarily tell them what. So, we are not just testing the system, but also testing how people react if anything happens. If nothing happens, it’s fine; they’re not going to react to it. So, that’s the level of chaos engineering that our team has been performing.
Of course, as we always talk about improving reliability for the product, we talked about, “Oh, how will chaos engineering as a tool for our customers play out in the platform?” That conversation now sits a little bit with product. So, product has to decide how and when they want to integrate it, and then, of course, we’re going to be part of that conversation once they’re like, “Okay, we’re ready to talk about it.” Other teams at VMware, not necessarily Tanzu, do all sorts of chaos engineering testing. Some of them use tools, open-source or not, and a lot of them do tabletop—basically, theoretical—testing as well.
Jason: That’s an excellent point about getting started. You don’t have a product out yet, and I’m sure everybody’s anticipating hearing what it is and seeing the release at VMworld, but you’re testing before you have a product. For so many organizations, it’s an afterthought: “I’ve built the product. It’s in production. Now, we need to keep it reliable.” I think by shifting that forward, to the moment you’ve just started diagramming the architecture, you can think about where this can break, build those tests, and begin doing that chaos engineering and reliability testing during the development of the product, so that it ships reliably rather than shipping and then figuring out how to keep it reliable.
Gustavo: Yeah. The way I talk about it—and I actually had a conversation with one of our VPs about this—is that you have technical support where, for the most part—not all the teams from support, but at least one of the tiers of support—you want it to be reactive by design. You can staff quite a few people to react to issues, and they can be very good about learning the basics, because if you’re acquiring more customers, you’re going to have a huge set of customers early in the journey with your product. And you can never make the documentation perfect and the product onboarding perfect; they’re going to run into issues. So, for that very shallow set of issues, you can have a tier of support that is reactive by design.
You don’t want that tier of support to go really deep into issues forever, because they can get caught up in a problem for weeks or months. So you add another tier, and that’s when you get more of, like, support specialists, and then they split into silos. And eventually, you do get an IC SRE at tier three or tier four, where SRE is a good in-between for support organizations and product developers, in the sense that product developers also tend to specialize in certain aspects of a product. SRE wants to be a generalist for the reliability of a product. And nothing is better for uncovering reliability issues in a product than understanding the customer pain, the customer issues.
And actually, one of the projects I can tell you about that we’re doing right now is improving the reliability of our installation. We’re going for: can we accelerate the speed of installs and reduce the issues via better automation and better error handling? That’s what I call day zero. So, day zero is: can we make this install faster, better, and more reliable? And after the install, on day one, can we get better defaults? Because I say the ergonomics for SRE should be pretty good; we’re TKG SREs, so there’s [unintelligible 00:35:24] and SRE should feel at home after installing TKG.
Otherwise, you can just go install vanilla Kubernetes. Vanilla Kubernetes does feel at home because it’s open-source; it’s what most people use and what most people know. But because it’s just Kubernetes, it’s missing a lot of things around the ecosystem that TKG can install by default. So when you add all those other things, I need to make sure it still feels at home for SREs and operators at large.
Jason: It’s been fantastic chatting with you. I feel like we could go [laugh] on and on, and I’ve gone longer than I had intended. Before we go, Gustavo, I wanted to ask if you had anything you wanted to share, anything you wanted to plug, and where people can find you on the internet?
Gustavo: Yeah, so I wrote an ebook on how to start your incident lifecycle program. It’s not completely out yet, but I’ll post it on my Twitter account, twitter.com/stratus. So @stratus, S-T-R-A-T-U-S. We’ll put the link in the notes, too. You can follow me there; I will publish the book once it’s out. It explains how to establish an incident lifecycle program. And if you want to talk about SRE stuff, or VMware Tanzu or TKG, you can also message me on Twitter.
Jason: Thanks for all the information.
Gustavo: Thank you, again. Thank you so much for having me. This was really fun. I really appreciate it.
Jason: For links to all the information mentioned, visit our website at gremlin.com/podcast. If you liked this episode, subscribe to the Break Things on Purpose podcast on Spotify, Apple Podcasts, or your favorite podcast platform. Our theme song is called Battle of Pogs by Komiku and is available on loyaltyfreakmusic.com.
Gremlin's automated reliability platform empowers you to find and fix availability risks before they impact your users. Start finding hidden risks in your systems with a free 30-day trial.