Podcast: Break Things on Purpose | Carmen Saenz, Senior DevOps Engineer at Apex Clearing
This week Ana sits down with Carmen Saenz, Senior DevOps Engineer at Apex Clearing and PhD student at DePaul University in Chicago, to talk about her history in engineering. She brings to the table some anecdotes about her own time engineering chaos. Carmen goes into detail about the early days of chaos engineering and her work there, going from on-prem to the cloud, how she is always learning, her passion for teaching, and more. The lessons learned are supremely valuable; listen in for the details!
In this episode, we cover:
- Intro and an Anecdote: 00:00:27
- Early Days of Chaos Engineering: 00:04:13
- Moving to the Cloud and Important Lessons: 00:07:22
- Always Learning and Teaching: 00:11:15
- Figuring Out Chaos: 00:16:30
- Advice: 00:20:24
Jason: Welcome to the Break Things on Purpose podcast, a show about chaos engineering and operating reliable systems. In this episode, Ana Medina is joined by Carmen Saenz, a senior DevOps engineer at Apex Clearing Corporation. Carmen shares her thoughts on what cloud-native engineers can learn from our on-prem past, how she learned to do DevOps work, and what reliable IT systems look like in higher education.
Ana: Hey, everyone. We have a new podcast today, we have an amazing guest; we have Carmen Saenz joining us. Carmen, do you want to tell us a little bit about yourself, a quick intro?
Carmen: Sure. I am Carmen Saenz. I live in Chicago, Illinois, born and raised on the south side. I am currently a senior DevOps engineer at Apex and I have been in high-frequency trading for 11 out of 12 years.
Ana: DevOps engineers, those are definitely the type of work that we love diving in on, making sure that we’re keeping those systems up-to-date. But that really brings me into one of the questions we love asking about. We know that in technology, we sometimes are fighting fires, making sure our engineers can deploy quickly and keep collaboration around. What is one incident that you’ve encountered that has marked your career? What exactly happened that led up to it, and how is it that your team went ahead and discovered the issue?
Carmen: One of the incidents that happened to us was, it was around—close to the beginning of the teens [over 00:01:23] 2008, 2009, and I was working at a high-frequency trading firm in which we had an XML configuration that needed to be deployed to all the machines that are on-prem at the time—this was before cloud—that needed to connect to the exchanges where we can trade. And one of the things that we had to do is that we had to add specific configurations in order for us to keep track of our trade position. One of the things that happened was, certain machines get a certain configuration, other machines get another configuration. That configuration wasn’t added for some machines, and so when it was deployed, we realized that they were able to connect to the exchange and they were starting to trade right away. Luckily, someone noticed from an external system that we weren’t getting the position updates.
So, then we had to bring down all these on-prem machines by sending out a bash script to hit all these specific machines to kill the connection to the exchange. Luckily, it was just the beginning of the day and it wasn’t so crazy, so we were able to kill them within that minute timeframe before it went crazy. We realized that one of the big issues that we had was, one, we didn’t have a configuration management system in order to check to make sure that the configurations we needed were there. The second thing that we were missing is a second pair of eyes. We need someone to actually look at the configuration, PR it, and then push it.
And once it’s pushed, then we should have had a third person as we were going through the deployment system to make sure that this was the new change that needed to be in place. So, we didn’t have the measures in place in order for us to actually make sure that these configurations were correct. And it was chaos because you can lose money because you’re down when the trading was starting in the day. And it was just a simple mistake of not knowing these machines needed a specific configuration. So, it was kind of intense, those five minutes. [laugh].
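To illustrate the kind of pre-deploy check Carmen says was missing, here is a minimal sketch that refuses to ship a host's XML config unless it contains every required element. The element names (`exchange`, `position_tracking`), the host names, and the helper functions are all hypothetical; the episode does not describe the firm's actual config format or tooling.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule: every trading host's config must carry these elements.
REQUIRED_ELEMENTS = ("exchange", "position_tracking")


def missing_elements(xml_text, required=REQUIRED_ELEMENTS):
    """Return the required element names absent from one host's XML config."""
    root = ET.fromstring(xml_text)
    return [name for name in required if root.find(name) is None]


def hosts_blocked_from_deploy(configs):
    """Map each host with an incomplete config to its missing elements."""
    problems = {}
    for host, xml_text in configs.items():
        missing = missing_elements(xml_text)
        if missing:
            problems[host] = missing
    return problems


# Made-up example: trade-02 lacks the position-tracking element.
configs = {
    "trade-01": "<config><exchange>SIM</exchange><position_tracking/></config>",
    "trade-02": "<config><exchange>SIM</exchange></config>",
}
print(hosts_blocked_from_deploy(configs))
```

A check like this would typically run before the deploy step, alongside the human review Carmen describes rather than in place of it.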
Ana: [laugh]. So, amazing that y’all were able to catch it so quickly because the first thing that comes to mind, as you said, before the cloud—on-prem—and it’s like, do we start needing to making ‘BC’, like, ‘Before Cloud’ times when we talk about incidents? Because I think we do. When we look at the world that we live in now in a more cloud-native space, you tell someone about this incident, they’re going to look at us and say, “What do you mean? I have containers that manage all my config management. Everything’s going to roll out.”
Or, “I have observability that’s going to make us be resilient to this so that we detect it earlier.” So, with something like chaos engineering, if something like this was to happen in an on-prem type of data center, is there something that chaos engineering could have done to help prepare y’all or to avoid a situation like this?
Carmen: Yeah. One of the things that I believe—the chaos engineering, for what it’s worth, I didn’t actually know what chaos engineering was till 2012, and the specific thing that you mentioned is actually what they were testing. We had a test system, so we had all these on-prem machines and different co-locations in the country. And we would take some of our test systems—not the production because that was money-based but our test systems that were on simulated exchanges—and what would we do to test to make sure our code was up-to-date is we actually had a Chaos Monkey to break the configuration.
We actually had a Chaos Monkey and it would just pick a random function to run that day. It would be either send a bad config to a machine or bring down a machine by disconnecting its connection, doing a networking change in the middle to see how we would react. [unintelligible 00:05:01] with any machine in our simulation. And then we had to see how it was going to react with the changes that were happening, we had to deduce, we had to figure out how to roll it back. And those are the things that we didn’t have at the time. In 2012—this was another company I was working for in high-frequency trading—and they implemented chaos engineering in that simulation, specifically for them, we would catch these problems before we hit production. So yeah, that definitely was needed.
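The "pick a random function to run that day" idea can be sketched in a few lines. This is only an illustration under stated assumptions: the three fault functions are placeholders that return a description instead of touching real machines, and the host names are invented. It is not the tool Carmen's team used.

```python
import random


# Placeholder faults: each returns a description rather than doing real damage.
def send_bad_config(host):
    return f"pushed malformed config to {host}"


def drop_connection(host):
    return f"disconnected {host} from the simulated exchange"


def change_network(host):
    return f"altered the network route for {host}"


FAULTS = [send_bad_config, drop_connection, change_network]


def run_daily_fault(hosts, rng=None):
    """Pick one random fault and one random test host, then apply the fault."""
    rng = rng or random.Random()
    fault = rng.choice(FAULTS)
    host = rng.choice(hosts)
    return fault.__name__, host, fault(host)


# Test hosts only, never production.
print(run_daily_fault(["sim-01", "sim-02", "sim-03"]))
```

Passing a seeded `random.Random` makes a given day's fault reproducible, which helps when the team later has to work out what happened and how to roll it back.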
Ana: That’s super awesome that a failure encountered four years prior to your next company, you ended up realizing, wait, if this company actually follows what they do have of let’s roll out a bad deploy; how does our system actually engage with it? That’s such an amazing learning experience. Is there anything more recent that you’ve done in chaos engineering you’d want to share about?
Carmen: Actually, since I’ve just started at this company a couple of months ago, I haven’t—thankfully—run into anything, so a lot of my stories are more like war stories from the BC days. So.
Ana: Do you usually work now, mostly on-prem systems or do you find yourself in hybrid environments or cloud type of environments?
Carmen: Recently, in the last three to four years I spent in cloud-only. I rarely have to encounter on-prem nowadays. But coming from an on-prem world to a cloud world, it was completely different. And I feel with the tools that we have now we have a lot of built-in checks and balances in which even with us trying to manually delete a node in our cluster, we can see our systems auto-heal because cloud engineering tries to attempt to take care of that for us, or with, you know, infrastructure as code, we’re able to redeploy at will. So, with the cloud infrastructure, a lot of what would cause me anxiety and give me more white hairs is slightly less than [unintelligible 00:06:51].
Ana: I love the way of putting it the less amount of white hairs is because of cloud. So, thank you all, cloud providers. As this comes to mind and we think about your background of coming in from on-prem systems, is there anything that you’ve encountered in this cloud world that you think that’s a gotcha? Like, I’ve had an incident in bare metal that cloud is not really necessarily having a use case or reliability mechanism built-in, just out of the box.
Carmen: It’s easy to catch, but it’s a gotcha at the same time. So, when you come from on-prem into Cloud, the networking is… not all the same. The words are there from networking, like ‘gateway,’ and ‘firewall,’ click a few buttons, as opposed to you running [Arista 00:07:38] commands versus on a router. [laugh]. And then you have your VPC, which you can say that’s your little world and your internal network.
The words are there, but they’re different in cloud, and that’s the got me part of that transition. But at the same time, you have an easier way to visualize those things. For example, if my machine can’t connect to another machine, are they in the same subnet? I don’t have to run Arista commands to figure that out, or look at the logs on the router; it’s literally right there in front of me. So, a lot of that pain that we would have to going—you know, switching from—going to your Linux machine to then getting into the router and then running these different commands, I feel like you needed to learn more commands and more different types of languages of the things that you were using in order to interact with, as opposed to now in the cloud, I feel that those things are more blatantly in front of you to fix.
And they were a little bit more abstract in on-prem and that’s why you would need someone like a network engineer more as opposed to a DevOps engineer who—I feel like it’s easier in that sense. So, once you know it, you’re able to solve those problems that you would need a networking engineer for.
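The "are they in the same subnet?" question Carmen raises can also be answered in code with Python's standard `ipaddress` module. The addresses and CIDR block below are made-up examples, and `same_subnet` is a hypothetical helper, not something mentioned in the episode.

```python
import ipaddress


def same_subnet(ip_a, ip_b, cidr):
    """Return True when both addresses fall inside the given subnet."""
    network = ipaddress.ip_network(cidr)
    return (ipaddress.ip_address(ip_a) in network
            and ipaddress.ip_address(ip_b) in network)


print(same_subnet("10.0.1.15", "10.0.1.200", "10.0.1.0/24"))  # True
print(same_subnet("10.0.1.15", "10.0.2.7", "10.0.1.0/24"))    # False
```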
Ana: I guess now when we look at the DevOps site reliability engineers have this cloud world or this hybrid world, you end up wearing a lot of hats and you end up having to master, to an extent, various levels of networking, or knowing at least the operator side of how a lot of our infrastructure is running. It does bring me to the next question where you get a chance to come from that BC world—Before Cloud—and we now see that a lot of DevOps SREs that are joining in, they come into the magic of the cloud. What do you think is one of the things that the engineers that are just getting started and are just touching the cloud are not getting a chance to dive into, that the cloud abstraction layer really misses out on this amazing fundamental of the work that we do.
Carmen: I think it goes down to the nitty-gritty. With DevOps, you wear many hats. You’re good at everything, but not a master of one thing. You’re a little bit everything [unintelligible 00:09:49]. Before cloud, even though the term DevOps didn’t exist and you were called, like, an operations developer, operations engineer, you worked closely with the people who wore that one hat and you were working with them.
And now that people are coming into the cloud only with no on-prem, they get the layers abstracted on what is the three-way handshake in networking? What is indexes really used for in databases? How do you know that you’re not using—doing a linear search because your indices are incorrect in your database, versus doing an algorithmic search with that specific algorithm, that specific query language is using. Those are things that are so abstracted but are still very necessary because you may have to work with an on-prem system that connects to your cloud infrastructure and you may need to use Wireshark.
Who still uses that? But you do. There’s systems, older mainframe systems that mainly finance uses, or there’s still COBOL systems out there. So, I feel that’s what’s missing from being in cloud. But I hope that education and other programs like Courseras and Lyndas that if people feel like they’re lacking in something, they’re able to go and learn those fundamentals somewhere.
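Carmen's point about indexes versus a linear search is easy to see with SQLite, which ships with Python. The sketch below asks SQLite for its query plan before and after adding an index. The `trades` table, its columns, and the index name are invented for illustration, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return SQLite's plan text for a query (last column of each plan row)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (id INTEGER PRIMARY KEY, symbol TEXT, qty INTEGER)")
conn.executemany(
    "INSERT INTO trades (symbol, qty) VALUES (?, ?)",
    [("SYM%d" % i, i) for i in range(1000)],
)

lookup = "SELECT qty FROM trades WHERE symbol = ?"
plan_before = query_plan(conn, lookup, ("SYM42",))
conn.execute("CREATE INDEX idx_trades_symbol ON trades (symbol)")
plan_after = query_plan(conn, lookup, ("SYM42",))

print(plan_before)  # a full-table SCAN: every row is examined
print(plan_after)   # a SEARCH that uses idx_trades_symbol
```

Without the index, the database reads every row to find the match; with it, the lookup goes through the index structure instead. Other databases expose the same information through their own `EXPLAIN` variants.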
Ana: I know you love learning. Two things come to mind. Do you have any resources where DevOps and SREs can start learning more? And do you want to share with our listeners a little bit more about your passion and your path in learning?
Carmen: Sure. So, a lot of the things that people look at are common things like Udemy. I feel that Udemy has a lot of great DevOps courses, believe it or not. I have used them to study, I’ve used them for refreshers. I came in with Amazon cloud experience but no Google Cloud experience, so I basically took a Udemy to get my feet wet, as you would say, to get into that world.
Lynda is also good. If you have your student ID email, you can get it for free. So, [laugh] as a student, now, I use that. And then there’s just various resources. A good thing, also, like, finding groups like Techqueria DevOps, as well as Latinas in Tech, and TECHNOLOchicas that if you join groups where you meet other people who are starting in that space or have been in that space for a long time, they have the resources as well. But those are the resources.
Ana: Do you want to share a little bit more about your path on learning about DevOps and what you’re up to now?
Carmen: With DevOps, since I’m passionate in learning—because in DevOps, you have to always keep learning—as I was going through my education, and even now being in industry, is that I don’t know that many Latinas, especially back in the 2000s. What I noticed when I was in school is also that the majority of my teachers were men, they went to Harvard and MIT, and their great schools, majority of them, were [unintelligible 00:12:35] and I never had a Latina teacher or any of that. And I said to myself that I wanted to be that teacher that I didn’t have. So, I started teaching part-time at my alma mater, at Loyola University. And I loved it.
And I loved—I taught, like, data structures in [C+ 00:12:55] and Java. I’ve taught DevOps classes, I’ve taught bash scripting, I’ve taught open-source computing, intro to object-oriented programming. And I just loved engaging with students. I’ve noticed that I knew I was missing something and then I realized it was teaching, being that difference, being that change, being the face that I didn’t have. And I figured, what’s the best place than my alma mater to start that at?
As I was doing this for my fifth year, I realized that if I love it so much, I should do something about it. So, I decided to get a Ph.D. So, 10 years later [laugh], I went back to the school, and I’m currently in my third year at DePaul University in Chicago as a Ph.D. student, working in the American Sign Language Lab Avatar Project, creating an avatar to do not just American Sign Language but other sign languages—yes, there are many different ones. And that’s where I’m currently at now because I would love to teach again somewhere, full time, be it after industry or maybe at the same time I’m still in industry. Who knows what the path is. I love teaching, and I love helping, and I love engaging, and I love technology, so that’s why I wanted to go back to school and become a teacher, at one point.
Ana: So, amazing to see your passions and your background come together into that mission of pushing forward the industry and bringing more representation to it.
Carmen: Definitely. Thanks. It’s still a [ride 00:14:19]. I don’t know what the outcome will be, but [unintelligible 00:14:22] so I just hope that I pass my second exam for my Ph.D. that’s coming up in September. [laugh].
Ana: I wish you a lot of luck, and I’m sure our listeners are also rooting for you. Are we going to be seeing Dr. Carmen Saenz that’s going to be teaching DevOps, teaching [unintelligible 00:14:39], teaching chaos engineering, or would you stick to something more in ASL?
Carmen: I think a little bit of both. I think that the experience that I bring as a DevOps engineer is that some of these systems don’t exist. Some of the stuff that I was doing that I brought to the lab was, they already have the programmers, but they don’t have—they’re running on, like, a Windows machine in an internal network at school. So, how can we make this widely available? One of my posters that I did my second year was specifically in engineering, and architecting, and infrastructure that can handle creating high visualizations with, you know, a GPU graphics card, but also being scalable and then it has to be GDPR compliant because we work with European schools.
So, there is DevOps in my Ph.D.; it’s just a little hidden. But I’m hoping that I continue to bring that to the table in my lab and the work that I have won’t just be on an internal network in three years. I hope in three years, you’ll be able to—you all—can connect to it to be like, “That’s the work her lab did, and we all know that the reason why it’s visually seen by everybody was because Carmen was the one that created the infrastructure for it to work, with a group of other engineers.” So, I’m hoping that I can bring my two loves together in that sense, in three years’ time.
Ana: I mean, I think you’re already doing it. It’s super amazing to just even hear when you get those learning stories of someone is an engineer in the industry, has 10-plus years, and then they go and they help a system that is not tied to our technology space, that is not running on the cloud, that they’re running on just one Windows Server and they’re hoping to reach, what, 3000 people a month; like, how is that possibly even going to scale for them to be successful? So, I think even you just going into the lab and putting in some of those DevOps principles. And I love also what you mentioned, you know like, how do you make it be a highly scalable system when you’re running with so many GPUs, and you have all this different types of compliance in it? Which I think is always interesting when we talk about certain industries, whether it’s healthcare or finance, that it’s like, yes, we need to have compliancy based on data that we store, but then there’s certain regulations and government that might tell us we also might need to have a certain uptime or you might be breaching this type of service-level-agreement.
Carmen: So, a lot of the things I’m used to is more internal in the sense of we need to keep our logs X amount of years, and we need to know who logs into what machines. And so a lot of the compliance is pretty standard across the board for a lot of internal networks. But for something as big as this project for the Ph.D. that I’m doing on, our group has to make sure that GDPR compliance is very different. And what PII, what data do we have and how can we abstract it or break it down enough where then it won’t actually go back to a person? And those are the things that I feel that I personally don’t really have to deal with right now at work, but I have to deal with them at school.
So, there is a trade-off, something that I’m lacking at my position at work, I actually have to think about, working with different countries, to have this software and the things that they’re lacking like now having a scalable uptime system that people can communicate with, that is something that I do here at work; I’m trading off in both places. And compliance is very difficult to [ticket 00:18:10]. That’s chaos engineering, too, because you’re going to have to hire someone, or third-party company, or yourself have to literally attack your own system and see what you’re missing and make sure you’re compliant. And I think that’s the beauty of—also—chaos engineering, trying to figure that out and making sure [laugh] that you’re good, you know?
Ana: For these highly visual systems that you have in your Ph.D. program, what are some of the unknowns that you’ve had to encounter as you’re working on them since they’re not very similar to the stuff that you work on your day-to-day?
Carmen: I actually had to backtrack to my on-prem experience to a point. Unfortunately, the code was written in the late ’90s and it’s still [unintelligible 00:18:54] like that now, some of it. And it uses an executable; it had to be built by my professor, they gave me the EXE, I had to put it on the machine. And one of the things is, in Amazon, they have your GPU systems, right, and you could say, “I want this server with this graphics card and AMD and so forth.” One of the things that a lot of people don’t recall if they never worked on on-prem systems is that drivers are problematic.
And as I was trying to run this executable, I kept getting this error and I was like, “This has to be a driver issue, but how do we troubleshoot a driver on a cloud system that is pre-built for you?” So, [laugh] I was trying to figure it out? I’m like, “Is it the executable and how it was built on whatever machine? Or is it the machine that’s in the cloud? And if it is, how do I update the driver? How do I downgrade the driver?”
And so I had to Google how to downgrade drivers in VMs in the cloud. There’s specific commands that you have to run that are AWS only. You don’t have the manageability that you had when it’s your own on-Prem system. Like, you just know, you run a general AMD command or a general package installer for the driver. It’s not the case, all the time, for cloud systems.
You have to run a specific AWS command. Luckily, what I found out was my professor, I brought it up to him, and he’s like, “Oh, I have this driver. You’re using this driver. I need to do some magic on my end to build this executable and it should work on the driver for this VM.” And I was like, “Sure.” But I didn’t know how to troubleshoot those things in the cloud. But I knew how to troubleshoot them from back in the day when it was my on-prem system so there—it’s weird understanding that drivers are still an issue, you just didn’t think so because they’re so abstracted nowadays.
Ana: It’s always interesting to remember where the abstraction layers push us forward in so many ways, but that they always bring this kind of catch on the other side of it of, “Wait, no, now you actually don’t get a chance to just drive over to your data center, switch out a certain type of resource. Oh, the cable is starting to look a little hot, maybe we should stretch it out.” We now assume that a lot of these things are being handled for us.
Ana: Do you have any advice on how do you maintain systems that you don’t build, or how is it that you can hand over things better when you’re working with systems that are maybe even BC, Before Cloud? I’m going to trademark this. [laugh].
Carmen: You should trademark it because, seriously, that is such a great way to explain it. That was literally what you do when you started a new job, right? They’re like, “There’s this old system.” I asked if there’s any documentation, they usually laugh and chuckle at you. And then [laugh] they give you some notes.txt that some person left for you to look at. “Do you have any infrastructure as code?” They also might slightly chuckle at you and just give you some version that’s 15 versions behind, that if you try to [unintelligible 00:21:52], they’ll tell you, you’re missing 50 other things.
So, you do have to work with what you’ve got. And the whole point of being a DevOps engineer is that you investigate, investigate. And you shouldn’t be afraid to ask questions. And I think that’s something I learned as I got older. I was always afraid to ask questions.
And I always felt like people were going to judge the crap out of me because I was asking questions. But how are you going to understand the system that you didn’t build, and try to get into the head of the person that did build it in order for you to make it better? And… that’s okay to ask those questions. And you should get that notes.txt, and that rando Terraform that only works for a quarter of the things that were built. And see what’s missing and try to see if you can devise a plan of attack of how are you going to break this down for yourself.
And then, there may not be no diagrams. So, I’m not telling you, use Creately or anything like that to diagram, but it’s also good just to have a piece of paper and a pen and just start drawing some of that stuff out. And then, a lot of it also is okay, let’s make a test. On Saturday, I’m going to bring this down—chaos engineering—and I’m going to see who yells about it. Who’s going to care?
Who’s going to care if I break this. And that’s how you know who are the stakeholders. Sometimes, that’s what you need to do; you need to create a little chaos, to understand what your next steps are in order to get rid of all that technical debt to make your company and your product better. That’s how you have to start, and then from there, you’ll get more stakeholders that are going to care because you caused a little chaos, in order to bring the system up-to-date, that is not yours, that now is yours. [laugh].
Ana: You actually touched upon something that I was telling someone about two weeks ago where it’s that we have this mental model of what our system looks like, they gave us an architecture diagram because, you know, this was only built five years ago, but we now have the thought that all of this is perfect, and until you start unplugging things, you start doing some chaos engineering of what in this architecture diagram is actually correct? What is not? Do I really have a database in my high critical services, or do I not? And then you can kind of really start thinking about understanding your system and build it to be better.
Carmen: And also, one of the things that is just because it says dev, don’t assume it’s dev. It might be prod. Just because it’s called dev doesn’t mean that it’s dev. That is one of my biggest rules now that I’ve learned recently in the last three years. Because with cloud, it’s a little bit different than on-premise: if that’s the name, that’s what it’s going to stay and most likely it is what it is. But here, because we’re iterating so quickly, we try to fail fast in order for us to learn from our mistakes and build our product, dev becomes prod. [laugh]. More so now than it did before, you know?
Ana: It brings me to that portion you mentioned also earlier: always ask questions. Always poke holes at it. If someone tells us, “Oh, no, don’t worry, nothing is running here on production,” take a deep dive and try to find out what are some of those services, or what are some of those dependencies that could be going on. I know from my time at Uber, it took forever for us to find out which of the 2000 microservices are needed to just take a trip on the cloud. And it was like, “Uh, we don’t know. We know they’re running on prod, they’re running on dev, but what is needed for this service to actually happen?”
Carmen: Exactly. Sometimes just getting on the machine. And if you have root—which you should—if you’re a DevOps engineer, usually—look at the history and then look at the directories of who has a home directory. People don’t realize the history can give you so much good nuggets about what’s going on in the system. And those are the things that help you figure out, like you said at Uber, what’s running on here, and who’s using it, and what is the systemd daemon telling me? And like… I mean, right?
Ana: And it’s funny, you mentioned that of take a look at the history because that was actually one of the things that I’ve always done, like, reading post-mortems—
Ana: Understanding history that’s being run on systems, understanding past PRDs to try to get a better understanding. And a lot of it actually is because of that other point that you also touched upon, being afraid to ask questions. Like, similar to you, I’ve also been one of the only Latinas in the room, and I’m like, “I don’t want to raise my hand in this class, or in this meeting. I don’t want to be the person that has to ask.” But if I have ways of starting to do my own searching so I make a more informed question, that gave me confidence. So, that was one of the things that I was always doing. But now I tell people, “No. Just ask the questions. Don’t spend those five hours trying to look at history because the person next to you might actually know the answer in just two minutes.”
Carmen: Yeah, exactly. And I noticed that. Just asking that question was literally like, “Oh, it was because of X, Y, and Z.” “Okay, cool.” And then, now that I know that, at least when I look at the history, I have some background of why this was this way, and now I can just pull out what I really care about in the history, as opposed to saying, “Why is this happening in the [unintelligible 00:26:59] in the first place?”
But it sucks being the first person to ask that question. And especially if it’s just, like, you and a bunch of dudes—which usually it was, and at the time I was usually the youngest, too. I was, like, 22. Up until now, obviously; now I’m one of the oldest, but at one point, I was the youngest. And also age was a thing, and being the only Latina in the room, and it—you know, and it’s finance; it was scary. [laugh].
Ana: [laugh]. That’s the awesome part. We got to have folks like you, like me, organizations like TECHNOLOchicas and Techqueria that allow for us to create spaces that are going to say, “You’re welcome here. Ask as many questions as you want. No question is going to be stupid.”
Because we’ve all had to start somewhere. And maybe you do get a chance to have Carmen as your teacher and get to pick their brain on what DevOps is. For that, I think that’s all the questions that I had. Do you have anything else you want to share with our listeners that you have upcoming for you? Or just any words of advice?
Carmen: My advice is just keep trucking along. There’s many Carmens, there’s many Anas, there’s many Jasons out there that are willing to help. And there’s spaces now where we can ask those deep questions, like you said, like Techqueria, Latinas in Tech, TECHNOLOchicas, [unintelligible 00:28:17] Girls Can Code, Girls Who Code. There’s so many places now where you can really dig in and find the community to uplift you and keep pushing you forward in your technology inquiries and your technology career path. So, stick with it, keep going.
Ana: I love it. What are some ways that folks can get in touch with you?
Ana: Awesome. Thank you so much, Carmen.
Carmen: Thank you so much for having me. I had such a great time with both of you.
Jason: For links to all the information mentioned, visit our website at gremlin.com/podcast. If you liked this episode, subscribe to the Break Things on Purpose podcast on Spotify, Apple Podcasts, or your favorite podcast platform. Our theme song is called Battle of Pogs by Komiku and is available on loyaltyfreakmusic.com.