Break Things on Purpose is a podcast for all-things Chaos Engineering. Check out our latest episode below.
You can subscribe to Break Things on Purpose wherever you get your podcasts.
In this episode of the Break Things on Purpose podcast, we speak with Veronica Lopez, Senior Software Engineer at DigitalOcean.
- When marketing goes too well (2:34)
- Introduction to Go (5:28)
- Using Elixir for fault tolerance (18:50)
Veronica Lopez: You have to work with what you have. It's like a bit of CTO-ing oneself.
Patrick Higgins: Welcome to this week's episode of Break Things on Purpose. I'm your host Patrick Higgins, a chaos engineer at Gremlin. We're joined today by Ana Medina, who is a senior chaos engineer at Gremlin as well. How are you, Ana? How's it going?
Ana Medina: Good, good. It's been a great day, very productive and very excited to be back online talking to our podcast listeners today.
Patrick Higgins: Awesome. So today we're showcasing a chat that we had quite recently with Veronica Lopez, who is a senior software engineer at DigitalOcean. But before we cut to it, Ana, I just wanted to ask you what you enjoyed about our chat with Veronica.
Ana Medina: One of the things that I enjoyed the most about talking to Veronica was hearing her experiences as one of the first engineers in Mexico to adopt Golang, being part of that community back in the early days, figuring out how to adopt Go for what her organization needed at the time, and what incidents and issues they saw when they had to scale their systems.
Patrick Higgins: Yeah, it was really interesting to hear about her pain points coming from a smaller shop and having to deal with the larger-scale complexity that comes when you end up with a bigger user base than you expected. Very interesting stuff. Without further ado, let's jump to the conversation. Here's our chat with Veronica Lopez.
Ana Medina: Hi Veronica, how are you doing today?
Veronica Lopez: I'm very happy to be here. I am doing well. I am in Ireland. So in the middle of nowhere, as far as I'm concerned, the world is perfect, peaceful and green.
Ana Medina: So today we're very excited to have you on the Break Things on Purpose podcast, where we talk about distributed systems, learning from failure, chaos engineering, anything in the software engineering space. What has been one of those incidents that you've been part of where you just thought, "That was rough"? What happened? I want to learn a little bit more.
Veronica Lopez: Yeah. So in my case, it has been interesting, at least for me, because I was very lucky to deal with this type of incident very early in my career, which definitely was not fun at all at that point. But now I'm extremely grateful, because I learned a lot of things. And actually that's how I ended up in distributed systems, because when I faced these incidents, I didn't know they were a distributed systems thing. I was just hired to be a backend developer with some Python; Go was still in diapers. It was a very typical backend shop. And I'm from Mexico City, which is definitely not a tech hub. So I was working there back then, and you'd never imagine that working for a place like that, which was basically a consultancy, you would have to deal with that type of problem.
Veronica Lopez: It was all very shop, very software shop that is very generic. One after the other, the other, the other. Well, so long story short, some of this little innocent websites systems started growing a lot. One of them specifically, it was for a bank. Obviously, they had their own internal systems, but they had to outsource many components, many things. And as innocent as this type of service was, it was like the external website for the rewards programs for different credit cards, that are not the central spinal cord of the bank, of course, but still required many validation steps in security. So, I learned, it was very fun actually to learn many things around that. But the thing is like, I'll try to be super quick in this story.
The thing is, it was a rewards program, and it was as innocent as it sounds. Every Friday, people with certain credit cards would have different options of, I don't know, surprise perks. It could be anything from free tickets to the movies to entering a draw for a trip to the beach for two people. That type of rewards program. These rewards were launched at a certain time every Friday. So there was a specific time. So what does that mean? Well, concurrency, right? Because a lot of people wanted these rewards. They could be super silly, but sometimes they could be great.
And since they were launched at the same time... well, I don't have to explain.
Ana Medina: Yeah.
Veronica Lopez: I learned concurrency at school, but never had to really deal with it. This was before cloud native. This was before Kubernetes. This was, as I said, probably within the first two years of Go's life as a widely available language, or a little bit more; it was around the first GopherCon, and I'll tell you in the story how I know that. Anyway, everything was very manual. We knew that we were dealing with traditional servers, still on the cloud, but not what cloud native means right now. They were Rackspace servers that you would need to manually configure, or you had a person there, but it was very manual. Anyway.
And yeah, we definitely were monitoring CPU usage, but at a tip-of-the-iceberg level. And when we saw that we had a lot of volume, we would manually increase the number of servers for a few hours. Whatever, simple things. Then, I don't know what happened internally. This was not a technological thing at all; it was an internal decision of the bank. Let's say that the information for the perks or the benefits was handled through a newsletter that people would receive in their email. But suddenly they started sending the newsletter in tiers: first to the internal employees of the bank, a few minutes later to VIP customers, and then a few minutes later to everyone else.
And I don't know, I'm oversimplifying it, but the same newsletter became accessible at different times to different types of customers. So we not only had concurrency, we had concurrency at different times, if that makes sense. I know how that sounds.
Ana Medina: It's building the systems and building at scale is what that sounds like.
Veronica Lopez: Yeah. So we had a huge concurrency issue, but split across different waves of concurrency. In the span of an hour, we had to deal with little clusters of concurrency every 15 minutes. And we were all, again, engineers who had nothing to do with distributed systems whatsoever. Well, long story short, the thing kept growing and growing, and the prizes got better. At some point it became obvious that we had to change the services, because they were not built with scale in mind. Around this time, my teammates and I started looking for solutions, not only on the technology side but on the human side, because we would literally be woken up by the Rackspace alerts at 3:00 AM. It was like being on call all the time.
Patrick Higgins: All through it all.
Veronica Lopez: All the time, without a strategy, without having specific roles for that. Because obviously we were a tiny team, and no digital agency, at least not at the time, would have on-call rotations or things like that. Those fancy things. So it was like being on call the entire time. And our strategy was literally to just restart the servers. That was my only job: restart Nginx and stuff like that.
Patrick Higgins: So it was basically a marketing campaign that was way too successful?
Veronica Lopez: Yeah. So we started exploring new technologies, asking what we could do about it. We'd have to talk to the bank, and we definitely didn't want to do this in Java, because none of us liked Java. But one thing took us to the next, and after a lot of quick research we ended up discovering Go, which as I've said was pretty recent. I found out that the first GopherCon was about to happen, and we had an education fund, and I had a great boss. He was not technical, but he was always trying to encourage us. So I asked, "Can I go?" And he said, "Yeah, sure." So after many tricks, I went to GopherCon with one of my peers, and it was amazing. It was a very rare opportunity for people like us who had nothing to do with Silicon Valley, with tech hubs, or with huge enterprises or services for the entire planet.
It was a very humbling experience, but definitely great. And my message: probably a lot of people who listen to this podcast think that you can only get access to big distributed systems when you work for big, fancy companies. But I don't think that's the case. That humbling experience, that little rewards program website, was my first step into big distributed systems.
Ana Medina: So it sounds like this little rewards system that you were working on gave you a really big reward: going to GopherCon and learning Go. And I know you've made huge contributions to the Go community since then, but I do want to touch back on that rewards system. After y'all decided to switch it to Go, did that fix the issues you were having around concurrency and scale?
Veronica Lopez: Yes, a lot. We basically rewrote parts of the same system, the same services, so that they would be concurrent. That's it. And Go made that super easy, thanks to goroutines. We didn't have to get extremely creative or do many things. It was literally the introduction of goroutines. Sorry, I don't remember the numbers, but the work we had to wait seconds on got wrapped into goroutines and shipped into our service. And that made a huge difference on the human side, because we didn't need to be restarting servers in the middle of the night for a very stupid reason, honestly.
We also didn't have to scale our infrastructure for a very silly reason, because at some point we'd had to buy many additional servers just to scale the thing. And again, this was before the cloud native era, so it was not that easy to just call your server company and say, "I just need a bit of a server." No, they would sell you the entire thing. So it was very expensive to scale.
Patrick Higgins: I'm really curious, when you started off on this particular campaign, did anyone flag that this could happen and that scaling might be an issue?
Veronica Lopez: No one. Even the bank itself thought that it was just an innocent thing. And at the very beginning, as I mentioned, we definitely saw a surge when the newsletter was sent. CPU usage on the servers was basically 5% the rest of the week, and it jumped to 80%, even 90%, when the newsletter went out. I was not an expert back then, but I knew that anything beyond 80% was not right. So yeah, no one knew. Really, no one knew.
Patrick Higgins: I'm really interested in what you said about the organizational scaling as well. It's really interesting that you hit a lot of those pain points when you're a small place that isn't ready to be a big place, whereas big places are built out and ready. I'd love for you to speak more to that, as someone who's worked in a bunch of different places at this point.
Veronica Lopez: Yeah. Well, honestly, I think it was a very lucky situation, because as I said, our manager back then was the owner of the company.
Patrick Higgins: Mm-hmm (affirmative).
Veronica Lopez: So we didn't have a proper CTO, which sounds rough, but at the same time it had its good things, because there was no one blocking things. We did have some technical guidance from a temporary CTO, a consultant CTO, but we didn't have someone with a very passionate vision who would be obsessed with a certain programming language or things like that. It was entirely our job as programmers, and it was not easy, to present our case to the non-technical bosses and managers and tell them, "Look, we managed to avoid buying more servers. You probably don't know what servers are or what CPU usage is, but you know for a fact that you pay for this, and I'm telling you that if you authorize this type of code to be introduced into our codebase, you won't have to pay for this and that and that, users will be happy, and you won't get called by the bank."
So you have to be very smart on that political side, on things that we developers are sometimes not great at, like advocating for ourselves on the social side. We tend to say, "These people are idiots because they don't understand it." But you have to make your case. It wasn't successful every time. And at that time, again, Go was not popular at all, let alone in Mexico.
So one of the owner's concerns was that he didn't know how to code, but he dealt with developers. His main concern was: if we left, who would maintain this? If another developer joined, we couldn't expect them to know Go. So what was our plan to teach them Go? If we hired someone who knew Python, could we work with that? What type of resources would we provide them to learn a little bit of Go? You also have to think of that, which is totally valid. Because as developers, we just come in saying, "We discovered this amazing programming language with goroutines and concurrency," but we don't have all that business-side thinking. So that was a hard experience.
Ana Medina: That's actually really interesting. I don't think I ever considered that part: wait, there are no people in my country to do this job, so how do I replace the single point of failure that's now being added to my system, on top of the issues it already had around concurrency?
Veronica Lopez: Yeah. And also, you cannot do that every time. You cannot be introducing new programming languages or new tools every single time you're dealing with something. You have to work with what you have. So it's like a bit of CTO-ing oneself.
Ana Medina: Well, it sounds like that was very much learning from failure, and it got you started on a really good career. I also know that you've given a few talks on fault-tolerant systems. What would you share about your passion for and knowledge of fault tolerance, specifically in the distributed systems space?
Veronica Lopez: Well, for me, the concept of fault tolerance comes tightly attached to distributed systems. And I'm not saying that's the only valid way to build them, but in my mind, in my experience and with the type of systems that I have worked with, one cannot exist without the other. You shouldn't, and again, this is my opinion, you shouldn't have a distributed system that doesn't deal with fault tolerance in some way, that doesn't have a robust fault tolerance strategy. Now, languages like Go don't deal with this natively; fault tolerance doesn't come out of the box. This doesn't mean that it's a bad language or that I'm trashing the language itself. But later, when I started going deeper into this discipline, I discovered other tools like Erlang and Elixir and the whole community around those languages. Erlang was built initially for telecommunications, a long time ago.
So first of all, I saw that there are actually many strategies for fault tolerance that were built many years ago, because back then those systems were, as I said, for telecommunications, which is an area where you cannot fail, where you cannot say, "Yeah, let's just ship it like that and we'll patch it in two releases." I understand that's a completely different paradigm, because in the software industry nowadays, obviously we have to be accountable for the business side, and we have to ship things and show our work constantly, constantly, constantly. But I also think there has to be a balance. So coming back to the fault tolerance thing, I think that many modern distributed systems do not take fault tolerance that seriously; they're more focused on shipping quickly.
And that is obviously very important. I think this democratization of building distributed systems, of distributed systems architecture, which is now widely available to almost anyone who wants to learn and build them, or work for a company even with no expertise beforehand, is great. But the democratization has come at the cost of ignoring pillars like fault tolerance, which you have to have either way. So I feel that nowadays, since people probably don't understand, not even the technical parts or how to do it, but the importance of having it, they end up with very fragile strategies that put a lot of load, a lot of responsibility, on the wrong tool.
Ana Medina: You just mentioned Elixir and Erlang, and I'm not sure all of our listeners understand how those two languages help with fault tolerance. In 30 seconds, how would you summarize the way they do this?
Veronica Lopez: Yeah. Okay. So with Erlang, a little bit of history. Erlang was created for telecommunications at Ericsson many years ago, and one of the main things for them was to be able to introduce hot changes live, without taking the system down, without having maintenance windows, without telling your customers, "Sorry, your system will be offline for an hour or a day or whatever, because we need to do some upgrades." Now everything has to be online forever, but we also want to be able to introduce changes or upgrades, and telecommunications work like that. Erlang has been open source and available for everyone for a long time; I don't know exactly how long, but a long time. The thing is, even though it was created with telecommunications in mind, people started using it for other types of distributed systems, and you can use it for anything, even for websites these days.
So that is Erlang in a nutshell. You can watch any of my talks about it, or not even mine; literally go to YouTube and type "Erlang fault tolerance" and you will find a bunch of excellent talks, not by me, and you will be able to learn. So one of the mechanisms they use to be able to be always online and do all these changes is fault tolerance.
One of them is not allowing the system to panic. Things will always fail. I'm not saying, and they're not saying, that with Erlang things never fail. It's how you deal with that failure, how you contain it. The heart of the matter, and how Erlang achieves it, is containment. Everything is... I was going to say containerized, but people would think I'm talking about Docker containers. No: in the generic sense of the word, containerized, or let's say compartmentalized. Isolation. Erlang is great at isolating things, for better and for worse.
So when something fails, there are many mechanisms, many paths that you can follow to deal with it, and panicking is not the first option. And it comes built in. There are many ways to deal with it manually, but you don't have to, because it comes that way out of the box. That is huge. I don't think a lot of people are aware that there are tools that already do this, and that they have been around for many years, for decades.
Ana Medina: Yeah. I think it's an interesting aspect, because it's like, how do you hold two truths at the same time, one in each hand: you cannot fail, and things will always fail. And then you have to build a strategy to make sure that when things fail, they're not customer-affecting, or that your systems just degrade by one tier but still give the customer a really good user experience.
Veronica Lopez: Yeah, it's mostly about that. It's not black or white, it's not zero or one. Basically, when you're working with these technologies, with Erlang and Elixir, and by the way, the technology behind these two languages is called the BEAM, the virtual machine that powers them, so whether I say Erlang or the BEAM, they're the same for the context of this podcast. One of the things you can configure with the BEAM is what type of user experience you're willing to sacrifice when something fails. And you can customize it a lot. You can play with it, instead of thinking about the tool, or the third-party library, or the paid third-party service that you're going to use just to avoid panicking.
Veronica Lopez: Just to avoid having a single failure that you didn't have in mind make your entire website or your entire service fail. And it's something that you don't have to keep in mind at all with this technology. So it allows you to be more creative and to focus on the actual thing that you're building, not on babysitting your system.
Patrick Higgins: So Veronica, I wanted to ask you, what are you excited about in the future? What's coming up for you now in terms of your work, in terms of the space itself, that would be quite exciting? What are you working on that's really interesting too?
Veronica Lopez: Yeah. Well, definitely distributed systems and infrastructure these days usually have the connotation of DevOps or SRE and all that, which is great, and I'm very happy that's happening right now in our industry. But I wanted to comment on this, because I would consider myself an infrastructure engineer, and I work for an infrastructure company. Being an infrastructure engineer for an infrastructure company is way different from being an infrastructure engineer for a software-as-a-service company.
Neither is better than the other; I just wanted to highlight the difference. That said, I'm thrilled right now with my current day job, working on distributed systems applied to a very specific area. I work for a servers company, a cloud company, so my work in distributed systems is applied specifically to virtualization, to hypervisors, to one layer above the bare metal.
So the type of problems I am learning about are very specific and very fun, problems that you don't have the opportunity to work on at many companies, because they don't have the need, and that's fine. But I am very excited about the opportunity to specialize in something very, very specific. It excites me because not a lot of people know it, and it is one of my purposes to be one of the few people, I don't know if in the world, or in the country, or maybe just in my tiny bubble, who can solve something that you cannot find on Stack Overflow, that you cannot find in the books, where you really have to be creative. Really, really creative, because no one else, not even other infrastructure companies, will have the exact same problem, because they don't have the exact same customers.
They don't have the exact same physical architecture, the physical racks, the virtualization technologies. The combinations are very specific to every company. So, without rambling, that's what gets me very excited: the opportunity to solve problems that no one has ever had to solve before in the same way. I know that many people have dealt with the type of job I have right now, but the very specific day-to-day scenarios, as silly or as simple as they are, are completely different from anything I had experienced before. So that's great.
Patrick Higgins: That's really cool.
Veronica Lopez: Yeah.
Ana Medina: So this has been a super fun episode. I think if you can tell listeners just one thing that they can do or anything that you want them to go check out, what would that plug be?
Veronica Lopez: Yeah. So you have to play a lot. My first method is: anytime you feel that you have discovered something, that's great, but also take two seconds to look at whether someone else had a similar idea. In my experience, chances are that 99% of the time someone else already has that idea. It doesn't necessarily mean that you cannot contribute to it, or that their implementation is better than yours will be, but at least it will give you some perspective that you're not working in a silo. Who knows? You could even end up contributing with them or working with them.
Veronica Lopez: And worst-case scenario, no one has done it before, and that's amazing. So that would be the first thing. The second thing is to play. Once you already know what's available, try to use all the resources. There are so many free things, so many papers, and play with things that don't come straight out of the books. Find your own thing. This is almost philosophical, but I promise it is 100% technical.
Patrick Higgins: Well, thanks so much for joining us today, Veronica. I really appreciated it and really enjoyed chatting with you and hearing about your amazing story. You've had so many amazing experiences.
Veronica Lopez: I'm very, very happy, and very humbled that you reached out.