Break Things on Purpose is a podcast for all-things Chaos Engineering. Check out our latest episode below.
You can subscribe to Break Things on Purpose wherever you get your podcasts.
In this episode of the Break Things on Purpose podcast, we speak with Taylor Dolezal, Senior Developer Advocate at HashiCorp.
- It's always DNS (2:28)
- Focus on learning (9:29)
- Chaos Engineering and improvement (11:25)
- Tips for learning (16:33)
- More Chaos Engineering (21:53)
Taylor Dolezal: It's just, everything's a distributed system, if you can dice it up appropriately. It's kind of fun. Yeah. Rock, paper, scissors.
Jason Yee: Welcome to Break Things on Purpose, a podcast about Chaos Engineering and building reliable systems. In this episode, Jason Yee and Ana Medina chat with Taylor Dolezal about DNS failures, incident response, retrospectives, and learning.
Ana Medina: I'm excited to have you on here, Taylor. We were kind of just wondering what you do for your job. Can you tell us a little bit more about that?
Taylor Dolezal: Absolutely. So, thank you so much for having me. This is an absolute pleasure and a blast. I'm excited to talk about things that I've broken on purpose. I know you can't see it, but I just did air quotes. But very, very excited.
So what I do right now is, I work as a developer advocate at HashiCorp. And I'm mostly focused on infrastructure needs and application delivery, developer experience. And this year we're working through some really cool workflows on keeping things secure by default. And really just trying to offer a little bit more polish and kind of sand those rough edges down, when it comes to actually deploying your applications. And then, I've had a lot of fun working with teams and kind of getting some insights about what their application stacks look like and how to even monitor those, as well. So, quite a bit of fun.
Right now in an average week, I'd say, there's a lot of focus on creating new content for webinars that are coming up, or getting to sit down and talk with people, and jumping into community calls about what problems that they're facing. So again, it's funny, being an engineer and working on code and doing things on that front at my previous job at Disney Studios, loved talking with people. And now, I finally get to do that at HashiCorp on a day-to-day basis. And I just really, really love that.
So, fun to be immersed in the community and to be able to have the chance to take the feedback that's given. And hear about some of these problems too, that people are facing, and bring that back to the product teams and actually come up with a game plan of how to fix that problem. Or how to kind of approach that in a different way. So yeah, just, I get to have fun all day, is basically what that sums up to. Absolutely, absolutely love my job.
Ana Medina: I know you've gotten a chance to work on all sorts of application stacks. What is one of those horror stories, one of those horrible incidents that you've encountered in your career that has allowed for you to reflect on, "Uh-oh, what did I just deploy? What could have actually gone on in this configuration?" Walk us through some of those things that you've seen.
Taylor Dolezal: Absolutely. Like you said, no shortage while working at several jobs with... Mostly, I'd say, mostly in operations. I began my career as a software engineer, very focused on code. Earlier on in my life, that was kind of something that, once I realized was a possibility, was so much fun. And then later on, started my own company, and then got into, how do we actually ship these things. And then begin the... You think that they'd be bad getting into all of these horror stories, but actually, I love talking about them after the fact. But my goodness, yeah, the sweat on your brow is real, when you're in it.
I think some of the most memorable horror stories for me have to definitely be around two things. One of which is, DNS, of course it is. And the other one is around clustering, which, spoiler, is also DNS. So, I think that there was one notable incident that happened that became public news, even, it was just so intense. While I was working at Disney Studios, there was a DNS outage that affected disney.com. And as you can imagine, there are a lot of services that use subdomains off of that. So once disney.com got taken down, we started to see this kind of propagate across all of our properties and applications, our CNAMEs and our A records. And that was just an absolutely wild bridge to be on. I think that there were over a hundred, 200 people.
And I feel like since we've gone into COVID and dealt with that and people being on Zooms and people actually living their lives, being on Zooms, dogs, barking, things like that, this is in a time before that was a, okay, so you hear dogs barking. People being like, "Hey, what's going on? What are you doing on the phone?", which kind of made it entertaining. It was all we had in that outage. But I'd say that was interesting.
One other interesting domain thing was working to get Kubernetes laid out for some of our application teams. So, when we had started working to help our application teams at Disney Studios, there were a lot of applications and there wasn't a lot of standardization around how we actually worked with those, how we troubleshot them. And then it made getting them up and running, when we add new services, really problematic. Right? Because there's no set process around it. Everything was so different. So, we took a look at containerization, at the time that I was there, as well as getting these applications moved on to Kubernetes.
And we decided to... we were all in on Kubernetes, let's get these workloads lifted and shifted, moved to Kubernetes. And we can actually support that because it's a standardized API. And it also made SRE operations quite a bit easier, as well, because then we knew what we were working with. Right? Kind of like a Ruby on Rails application, that standardized layout, and you know exactly where to go. Same thing with many other frameworks.
Once we got that all set in place, we had to figure things out too, about what we would want to put into the cluster, like maybe a RabbitMQ cluster, and obviously the applications. But things that we'd also leave out of that cluster, like RDS and things like that. So, one of the things that we tried to put into the cluster was a RabbitMQ cluster. And we started using Helm charts back when they first came out. So when we got that up and running, we didn't really have a good idea. The application team said, "Hey, I think that we'd want eight write nodes and we'd want disk. And we don't want some of these to be memory." They just kind of gave what they thought was the best recommendation for how to set this cluster up. And we said, "Okay." We weren't RabbitMQ experts at the time. We just said, "This sounds good to us."
Once we got all that rolled out, there was a bug in one of the Helm charts. It used StatefulSets in Kubernetes, and it would always look up the zero-index service. So what we had happen, of course, at 3:00 AM local time, one fine morning, was that pod had gone down, gotten into a split-brain, bad, bad state. And then it kept trying to reload, and that cascaded out and didn't actually fail over to any of the other pods or services that were running. And so, it took a very long time for us to figure out what was happening by looking at the logs and kind of getting some insight on that front. And then once we had finally figured it out, got the state back up and running on that zero-index pod, then we were back and functional.
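The pitfall Taylor describes can be sketched in a few lines of Python. All hostnames here are hypothetical, not the actual chart's, but the shape of the bug is the same: a client pinned to the StatefulSet's index-0 pod has no failover path, while one that walks the replica list does.

```python
# Minimal sketch (hypothetical hostnames) of hard-coding a StatefulSet's
# index-0 pod versus failing over across all replicas.

REPLICAS = [f"rabbitmq-{i}.rabbitmq.default.svc" for i in range(3)]

def connect_zero_index_only(is_up):
    """Buggy pattern: always resolve the zero-index pod, as the chart did."""
    host = REPLICAS[0]
    if not is_up(host):
        raise ConnectionError(f"{host} is down; no failover attempted")
    return host

def connect_with_failover(is_up):
    """Try each replica in turn instead of pinning to index 0."""
    for host in REPLICAS:
        if is_up(host):
            return host
    raise ConnectionError("no replicas reachable")

# Simulate the 3:00 AM incident: pod 0 is down, pods 1 and 2 are healthy.
healthy = {REPLICAS[1], REPLICAS[2]}
is_up = lambda host: host in healthy

print(connect_with_failover(is_up))   # falls over to the rabbitmq-1 pod
try:
    connect_zero_index_only(is_up)
except ConnectionError as err:
    print(err)                        # the pinned client simply errors out
```

With two of three replicas healthy, the failover client keeps working while the pinned client fails outright, which is exactly the cascade the team saw.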
But that was a very intense service, and it was impacting some of the movie delivery functions, as well. So, would not have been fun for some people to show up to a movie and just kind of have a dark screen to look at. But thankfully, we never had that happen while I was at Disney, but it was always something to think about, in terms of a downstream effect when you're working on some of those things.
Ana Medina: I know you mentioned that you all were expecting for that failover to kind of kick in. How is it that you came across this, discovering the issue? You thought you had failover enabled in this cluster, and this implementation of RabbitMQ didn't do that. And as you were discovering this issue, what are some of the things that you were learning along the way?
Taylor Dolezal: So, we really did have a difficult time solving that, just like you said, because of the fact that it was intermittent. We would notice that the endpoint would come up and then just fail, come up and fail. And that was what really had us scratching our heads, as well as looking at the logs for those pods. Sometimes they looked like they were starting up as well. And it wasn't until later where... Because we were like, "Okay, this should be joining this zero-index service." We kind of were aware of how the internals worked. And once we saw that it was able to join and everything and then would fail out, we started scratching our heads as well. It did take a couple iterations, and I absolutely blame the lack of caffeine in that situation, around just being able to see that problem for what it actually was.
And then after that, we were able to add in better monitoring and logging on those types of clusters, and looking to see if the queues were filling up, seeing if these different endpoints were available, adding in some health checks to the applications. And so, I think that the way that we had framed that, mentally, at that point in time, was just like, RabbitMQ is this static service, but it really is not. It's a distributed system. It's something within this whole body of the application. And that was what we weren't really accounting for, were the different ways in which the application was accessing it. And even being proactive in identifying that, "Hey, this might not be working. Let's not actually ship things to the RabbitMQ cluster right now." Or, "Let's pause that until we can get this under control and then turn it back on." were some of the things that we were able to implement later on.
Ana Medina: I love that you covered those action items. Was that something that y'all were able to fix and validate that it worked, then maybe later fully migrate things to your RabbitMQ cluster?
Taylor Dolezal: Yes. And that's such a great question. I feel like I've been in a lot of enterprise and corporate situations where there has been this focus on, who did it? Who can I blame? Who can I point this finger at? And then once we establish blame, they're like, "Okay, it's going to be you this time? Great. All right. See you later." And then kind of everybody disperses and goes away and the problem keeps coming up again and again and again. People just keep getting blamed and there's not really any resolution.
That was something that we definitely did see, as well, happen in those kinds of settings. But we started talking about that and saying, "Hey, we actually want to fix this. This doesn't really make sense. What can we do to prevent this happening next time?" That's going to be the real value add too, and also, when proposing that we don't wake up at three in the morning, that's a very easy sell. If you add that to anything, it's a great way to get something sold and adopted.
And so, that's what we did. We wanted to make sure that we did actually fail things over. We were able to simulate taking down that zero-index pod after making some modifications to our operators and the ways that we had set things up. And then finding ways to kind of... how do we make sure that the state is intact? And how can we validate that? How can we back up these queues and make it easy to restore, or to fail over? And then actually, like you said, running through those things, I'd say, was the most helpful thing that we did.
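That game-day exercise can be sketched as a toy harness. Every class and replica name here is hypothetical (this is not Disney's actual tooling), but it shows the shape of the experiment: inject the failure on the zero-index replica, assert that traffic still flows, then roll the injection back.

```python
# Toy chaos-experiment harness (all names hypothetical): kill the
# zero-index replica and verify the cluster still absorbs traffic.
import random

class Replica:
    def __init__(self, name):
        self.name, self.up, self.queue = name, True, []

class Cluster:
    def __init__(self, n=3):
        self.replicas = [Replica(f"mq-{i}") for i in range(n)]

    def publish(self, msg):
        live = [r for r in self.replicas if r.up]
        if not live:
            raise RuntimeError("total outage")
        random.choice(live).queue.append(msg)   # any live node accepts writes

    def depth(self):
        return sum(len(r.queue) for r in self.replicas if r.up)

def experiment(cluster):
    cluster.publish("before")
    cluster.replicas[0].up = False      # inject the failure on index 0
    cluster.publish("during")           # should land on a surviving replica
    assert cluster.depth() >= 1, "failover did not absorb traffic"
    cluster.replicas[0].up = True       # roll back the injection
    return "survived zero-index loss"

print(experiment(Cluster()))
```

The key property being tested is the one the original setup lacked: losing index 0 should be survivable, and the experiment ends by restoring the replica so the system returns to its steady state.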
Because again, somebody I worked with gave me the advice that, there is no such thing as a backup, just a verified restore. And I highly agree with that type of mentality, because it's just... You can do something, but unless you inspect it, it's very difficult to sign off on. And why wouldn't you want that certainty? If you have the time to be able to implement that and test that, having that certainty makes things so much better for you.
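The "verified restore" advice can be sketched in Python. File copies stand in here for a real queue or database dump tool (an assumption for illustration); the point is the workflow: take the backup, restore it into a scratch location, and compare checksums before signing off.

```python
# "No such thing as a backup, just a verified restore": never trust a
# backup you have not restored and compared against the source.
import hashlib
import pathlib
import shutil
import tempfile

def checksum(path: pathlib.Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def backup_and_verify(source: pathlib.Path) -> bool:
    with tempfile.TemporaryDirectory() as tmp:
        backup = pathlib.Path(tmp) / "backup.dat"
        restored = pathlib.Path(tmp) / "restored.dat"
        shutil.copy(source, backup)      # "take the backup"
        shutil.copy(backup, restored)    # "restore" it into a scratch spot
        return checksum(source) == checksum(restored)

# Stand-in for real queue state, written to a temp file for the demo.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"queue state")
data = pathlib.Path(f.name)

print(backup_and_verify(data))  # True only if the restore matches the source
```

Only when the restored copy's checksum matches the original can you "sign off" with the certainty Taylor describes.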
Ana Medina: I'm loving this advice you're giving to folks. It's really, go through and verify. And we are in this Break Things On Purpose podcast, so I do have to ask you, how is it that something like Chaos Engineering would have helped this team not have to go through such a big pain of an incident?
Taylor Dolezal: When it comes to Chaos Engineering and actually taking a look at that in your application stack and such, that was definitely a big part of the work that I was doing at Disney Studios. And making sure we had that retrospective, making sure we called out times in which each of these things happened. And tried to correlate changes that might've been made to cause that. It's very rarely a single root cause. It's typically a bunch of things happening that bump into each other and cause your stack to face some turbulence.
And I'd say that was the biggest thing for us was, getting to the point where we started to document everything and tie that into metrics, tie that into our Datadog alerts, and kind of understand what was going on with the applications. And then taking that to form use cases that we then run against the application, before it actually reached production. And run that in production, to make sure everything was as we expected it to be.
I think that that's been really helpful. And then, that was another big thing too, was, making sure that we actually validated those backups and ran through those exercises to make sure that, "Okay, auto scaling groups, does this actually work? When we fail this out or bring these two applications out, does this actually result in the application still running as we expect it to? Zero downtime, is that real?" And being able to then, bring that back to the teams and the stakeholders that were actually driving these projects, and giving them more confidence too.
Definitely saw... Talking with some of the product teams, it was a way that we really jelled well together and realized that we do have similar goals and kind of focuses on what we're trying to get done. And I think typically, working in a bigger company, that DevOps is a verb thing, doesn't typically go so well in some cases, but here it really did because we were able to say, "Let's level the playing field. Let's rethink how we're doing this work." And then, kind of showing a different way to do things. We didn't have to copy the same thing that we did before. Let's actually rethink this and make this more fun.
People had more time to work on different features. People had time to do rewrites they didn't think that they would ever get to have the time to do. And it made it fun, bringing up certain things. People weren't afraid to bring up shortcomings that they found, or whether it be the application, whether it be a managed service that they were using, because that just was another problem for the team to solve. And it was a great way to create better teamwork and get those teams to gel a little bit better.
I really like the focus of Chaos Engineering because, to me, it's just a very real thing. Nobody's perfect, and that's something that you should work into just about everything that you work on, that you are, that you do in your life. The focus of the teams was mostly that happy path. Here's how things work when everything is okay. And there wasn't really that initial focus on that defensive programming or, what happens if this doesn't work? It really is a different type of problem solving.
And when I worked at Disney and got that first tier experience on being activated and jumping into an incident, into a situation, the type of problem solving that you go about using, is very different than what you use when you're architecting out an application. I'd argue, it shouldn't be, but typically it is, because you, again, have that focus on, here's how things work when they work a hundred percent correct. But if you can kind of get the experience of both camps, troubleshooting incidents and actually getting to architect applications, then you have that ability to see the full picture. And then be able to break things on purpose, or at least anticipate what's going to break. And then try to simulate that and make sure that everything is as you expect it to be. So, I think that's absolutely a great point.
Ana Medina: I think the way that you do mention, where you get to be part of the troubleshooting and then the architecture, that was, to me, how I fell in love with reliability engineering. It's like, I get to learn from incidents, and as I continue being embedded in critical services, I'm able to help them? And then we get to code and make things better? Keep me on this. This is great.
Taylor Dolezal: There were so many times too, where it was fun to have seen a problem crop up a few times. And it takes you a good amount of time to figure that out, or really walking through the fire to figure out what one of the root causes was. I also, again, don't believe in one root cause in most situations. There are several that kind of propagate together to create that context or that problem, that situation, that incident.
And it was a lot of fun to sit back and be like, "Oh, I know what it is. Try this." "Hey, that fixed it. What is... How did you do that?" That was a lot of fun to be on those calls. And then to be able to be like, "Okay, check this out." And then kind of like walk through what the problem was, that gave me a lot of joy in talking with people about that. And then seeing that spark light up in their eye and them be like, "Oh, I understand why this is a problem." I love that in developer advocacy now, too, and kind of like, "Do you understand what we're trying to do here?" And then seeing people grasp a new concept or topic. Absolutely love that.
Jason Yee: That dovetails into being in developer advocacy. Your job now is to help teach people. And one of the things that you mentioned about that incident was, you were told, "Hey, get RabbitMQ set up, get it running." And you definitely weren't experts. I feel like in my career, especially doing ops work, you're thrown something, make sure this runs, along with, make sure it's reliable, but you know almost nothing about this technology. I'm curious, having been on both sides now, being an engineer, and then now being tasked with teaching people, do you have tips on better ways that people can learn about new technology?
Taylor Dolezal: Absolutely. So, I really liked that focus that you brought up too, in terms of, when you work in SRE, I feel like in most cases, you are really split across many tools, many workflows, many ways of doing things. And you don't typically have the time to be able to dig into each of those and read up on those things. Personally (RIP Google Reader), I have a whole bunch of RSS feeds and use Reeder, R-E-E-D-E-R, as my application of choice, to go and read through those things when I can.
I'd be lying if I didn't admit the giant backlog that I have. But when I do have time, I do try to read some of the things in there. Reading is my main way of learning, but in teaching people, finding out all of the different ways in which people learn has been fun too. I think the biggest percentage of people I work with really like getting to see those examples, having a GitHub repository to pull down, be able to go through and step through, make issues on, ask questions about, and get that firsthand experience, and actually running the thing. As opposed to looking at a concept or listening to a video or a podcast. So I do find that interesting.
And getting to work at HashiCorp, I really do like that I do have the time, now, and time to talk with the product teams directly about how each of these things work. I get to be more of an expert, when it comes to the HashiCorp stack and the suite of tools that we have available. So that's my cheat code there, is just, work for the company of the tool that you like. And you'll absolutely, have a good chunk of time to be able to dedicate to that.
But when that doesn't work for you, I'd say, that is the next biggest thing, is kind of, try to find whatever way that you learn best. Definitely try to take some time and learn on those fronts. It might be difficult to get started on that front too. I absolutely know. I can relate to that. However, if you look at it as an investment, and if you know this tool just 1% more, that's going to be a huge amount of time savings to you, if you're going about using that. It won't feel like it at first. I absolutely agree. But if you spend that time, you'll definitely get to see benefits on that front.
I also recommend checking out the Thoughtworks Tech Radar as well. And as I find more things, I'm absolutely going to share them, via Twitter and everything else. But, I do like the organizations that take the time to put together these compiled lists of data or practices or things that are new, worth adopting, worth talking about. And then trying to share that in an abbreviated and compressed way with everyone. I really do value that. I do wish there were more, but yeah, right now, I know of just a handful.
Jason Yee: I like that you brought up talking with product people. And I've worked for SaaS companies for a while now, and obviously internally, yes, we can talk with our product folks and they love it. And they love to hear customer feedback, and they're always asking us to relay that. But ultimately, I realize that they're actually really wanting that direct connection with customers. And I think, prior to that, as an engineer, my thought was always, "Avoid them." Right? I'm going to go to StackOverflow first, I'm going to go to all these other resources. When in actuality I'm like, "Wait, I was paying for that service. They owe me support. And they actually should be available, so I should talk to their support folks. I should talk with our product folks, especially when something's broken or I want a feature." So, I think that's a great takeaway.
Taylor Dolezal: One thing that I like, that we do at HashiCorp, is we have community office hours, regularly, and we vary it up across different products. So, we will talk with the different teams. So whether it be Terraform and the AWS provider team or the SDK team, I do like that they'll have those... And again, I understand, it's hard to join, especially when you're all Zoom fatigued out. In some cases, "Let me jump on another thing to stream and to look at." But being able to kind of have that ability in real time, to be able to ask questions and get responses from those people that are directly working on those products, is really helpful.
And then, I have seen forums and I do asynchronous means of communicating too. But really, just getting that mix of that live and asynchronous type of communication going, I find that most helpful. I do like the StackOverflows and other types of solutions as well, like you said. But absolutely agree, it's just nice to talk to somebody with the familiarity with that product. Or just have the time to like, "Hey, can we just sit down and talk about this?" Whether it be internal to your company, your team, or to somebody else.
I agree. I've got an allergic reaction too, like, "Oh no, you're trying to sell me something? I don't want to... I just want to talk about it, I don't want to buy it just yet." I do feel like, in our industry, there could be more of that. And I know that's where developer advocacy is kind of supposed to solve that a little bit. But definitely wish that that was kind of an easier conversation to have, is like, "I don't feel like buying right now. Can we just talk about this for half an hour?" Something like that.
Jason Yee: We've chatted about how Chaos Engineering is useful and can be useful to test things. I'm curious, Taylor, have you done some Chaos Engineering recently? And if so, can you tell us about that experience?
Taylor Dolezal: So when it comes to Chaos Engineering, I definitely did that a lot more. I'd say, I'm doing less of it as a developer advocate. However, those ideologies and those practices are still very much instilled in what I try to do. So even now, talking with coworkers about, "Let's talk about this talk. What was good about it? What didn't go so well?" Having a retrospective on that front. It's not, hopefully not a DNS issue when you're doing a webinar or anything like that. But did have a colleague that actually did have some Zoom troubles and CDN issues, and was actually a very wild session that she had to go through.
But we're able to take those and kind of talk through like, "How do we simulate that again? How do we recreate this type of situation." And then try to make the team better too. If we use a platform and it has a couple of sharp edges, calling those out. And so, those ideas definitely still exist, but just less of an engineering sense and more of a workflow type sense with what we do now. So, I'm very bullish on just Chaos Engineering, no matter what it is, just being curious and being able to share those things with other people, is just really, really helpful.
Jason Yee: So speaking of sharing things with other people, I'm curious, as we start to wrap this up, do you have any shout outs, any things that you want to promote?
Taylor Dolezal: Ooh, that's a good question. Keep your eyes out on the Terraform repository for new versions as they come out. Granted, as we hit 1.0, those won't be as frequent, so that'll be kind of nice for everybody that is dealing with all of these upgrades. If you do find any issues or if you just want to talk Terraform or anything else, please feel free to reach out to me as @onlydole, O-N-L-Y-D-O-L-E, on Twitter. And my DMs are open. Please at me. I love talking about problems and finding ways that we can help you out.
Or just honestly, if you have any questions at all, when it comes to infrastructure, always a good conversation. Give me five minutes. I'll go rip a pot of coffee and we'll sit down and talk about it. But always fun. Thank you so much for having me. This is absolutely wonderful.
Jason Yee: For links to all the information mentioned, visit our website at gremlin.com/podcast. If you liked this episode, subscribe to the Break Things on Purpose podcast on Spotify, Apple Podcasts, or your favorite podcast platform. Our theme song is called Battle of Pogs by Komiku and is available on loyaltyfreakmusic.com.