Dad jokes lead the way in this episode as we interview Chris Martello, manager of application performance at Cengage. Chris wears many testing hats, but his passion is chaos and breaking things on purpose. Managing chaos came naturally to Chris with his background as a middle school science teacher, so when he made the jump to tech, chaos engineering was a natural fit. Now at Cengage, Chris is steeped in higher education, where seasonal fluxes in the academic calendar serve as great testing grounds for peak traffic events. Tune in to hear the fruitful results of following the academic schedule!

Episode Highlights

In this episode, we cover:

  • 00:00:00 - Introduction
  • 00:02:11 - How Chris got into the world of chaos and teaching middle school science
  • 00:05:56 - The Cengage seasonal model and preparing for peak traffic
  • 00:11:10 - How Cengage schedules the chaos and the “day of darkness”
  • 00:15:28 - Scaling and migration and “the inches we need”
  • 00:18:18 - Communicating with different teams and the customers
  • 00:24:30 - Chris’s biggest lesson from practicing chaos engineering
  • 00:27:40 - Chris and working at Cengage/Outro

Transcript

Julie: Wait, I got it. You probably don’t know this one, Chris. It’s not from you. How does the Dalai Lama order a hot dog?

Chris: He orders one with everything.

Julie: [laugh]. So far, I have not been able to stump Chris on—[laugh].

Chris: [laugh]. Then the follow-up to that one for a QA is how many engineers does it take to change a light bulb? The answer is, none; that’s a hardware problem.

Julie: Welcome to Break Things on Purpose, a podcast about reliability, quality, and ways to focus on the user experience. In this episode, we talk with Chris Martello, manager of application performance at Cengage, about the importance of Chaos Engineering in service of quality.

Julie: Welcome to Break Things on Purpose. We are joined today by Chris Martello from Cengage. Chris, do you want to tell us a little bit about yourself?

Chris: Hey, thanks for having me today, Julie, Jason. It’s nice to be here and chat with you folks about Chaos Engineering, Chaos Testing, Gremlin. As Julie mentioned, I’m a performance manager at Cengage Learning Group, and we do a fair amount of performance testing, both on individual platforms and in coordinated load tests. I’ve been a software manager at Cengage for about five years, nine years altogether there at Cengage, and I’ve worn quite a few of the testing hats, as you can imagine: automation engineer, performance engineer, and now QA manager. So, with that, yeah, my team is about—we have ten people that coordinate and test our [unintelligible 00:01:52] platforms. I’m on the higher-ed side. We have Gale Research Library, as well as soft skills with our WebAssign and ed2go offerings. So, I’m just one of a few, but my claim to fame—or at least one of my passions—is definitely chaos testing and breaking things on purpose.

Julie: I love that, Chris. And before we hear why that’s your passion: when you and I chatted last week, you mentioned how you got into the world of QA, and I think you started with a little bit of a different type of chaos. Do you want to tell us what you did before?

Chris: Sure. Even before a 20-year career, now, in software testing, I managed chaos every day. If you know anything about teaching middle school, seventh and eighth-grade science, those folks have lots of energy. Combine that with their curiosity for life and, you know, their propensity to expend energy and play basketball and run track and do things, and I had a good time for a number of years corralling that energy and focusing it in certain directions. And you know, back to the jokes: humor was a way to engage with kids in the classroom, so there were a lot of science jokes and things like that. But generally speaking, that evolved into a passion for computers, being self-taught with programming skills, project management, and things like that. It just evolved into a different career that has been very rewarding.

And that’s what brings me to Cengage and why I come to work every day with those folks: instead of teaching seventh and eighth-grade science to young, impressionable minds, nowadays I teach adults how to test websites and how to test platforms and services. And the coaching is still the same; the mentoring is still the same. The aptitude of my students is a lot different, you know? We have adults, they’re people, they require things. And you know, the subject matter is also different. But the skills in the coaching and teaching are still the same.

Jason: If you were, like, anything like my seventh-grade science teacher, then another common thing that you would have with Chaos Engineering and teaching science is blowing a lot of things up.

Chris: Indeed. Playing with phosphorus and raw sodium metal was always a fun time in chemistry class. [laugh].

Julie: Well, one of the things that I love, there are so many parallels between being a science teacher and Chaos Engineering. I mean, we talk about this all the time with following the scientific process, right? You’re creating a hypothesis; you’re testing that. And so have you seen those parallels now with what you’re doing with Chaos Engineering over there at Cengage?

Chris: Oh, absolutely. It is definitely the basis for almost any testing we do. You have to have your controlled variables: your environment, your settings, your test scripts, and the things that you’re working on, setting up that experiment and the design, of course. And then there are your manipulated variables, the ones that you’re looking at to give you information, to tell you something new about the system that you didn’t know before you conducted your experiment. So, working with teams, almost half of the learning occurs in just the design phase, in terms of, “Hey, I think this system is supposed to do X; it’s designed in a certain way.” And if we run a test to demonstrate that, either it’s going to work or it’s not. Or it’s going to give us some new information that we didn’t know about it before we ran our experiment.

Julie: But you also have a very, like, cyclical reliability schedule that’s important to you, right? You have your very important peak traffic windows. And what is that? Is that around the summertime? What does that look like for you?

Chris: That’s right, Julie. So, our business model, or at least our seasonal model, runs off of typical college semesters. So, you can imagine that August and September are really big traffic months for us, as well as January and part of February. It does take a little extra planning in order to mimic that traffic. Traffic and transactions at the beginning of the semester are a lot different than they are at the middle and even at the end of the semester.

So, we see our secondary higher education platforms as courseware. We have our instructors doing course building. They’re taking a textbook, a digitized textbook, they’re building a course on it, they’re adding their activities to it, and they’re setting it up. At the same time that’s going along, the students are registering, they are signing up to use the course, they’re signing up to their course key for Cengage products, and they’re logging into the course. The middle section looks a lot like taking activities and tests and quizzes, reading the textbook, flipping pages, and maybe even making some notes off to the side.

And then at the end of the semester, the time is up, quite literally, on the course—you know, my course semester runs from this day to this day, ending on the 15th of December. Computers being as precise as they are, when the 15th of December at 11:59 p.m. rolls off the clock, that triggers a whole bunch of cron jobs that say, “Hey, it’s done. Start calculating grades.”

And it has to go through thousands of courses and say, “Which courses expired today? How many grades are submitted? How many grades are unsubmitted, so that now I have to calculate the zeros?” And there’s a lot of math that goes into that analytics. And when those midnight triggers kick off, some of those jobs will take eight to ten hours to process that semester’s courses that expire on that day.
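To make the shape of that end-of-term job concrete, here is a minimal sketch of the idea; the data layout, grading rule, and schedule below are hypothetical stand-ins rather than Cengage’s actual implementation.

```python
# A minimal, hypothetical sketch of the end-of-term batch job Chris describes:
# when a course's end date rolls past midnight, unsubmitted work becomes zeros
# and final grades get calculated. Names and data shapes are illustrative only.
from datetime import date

def finalize_expired_courses(courses: list[dict], today: date) -> dict[str, dict[str, float]]:
    """Return {course_id: {student_id: final_grade}} for courses that expired today."""
    results: dict[str, dict[str, float]] = {}
    for course in courses:
        if course["end_date"] != today:
            continue  # only process courses whose semester ended today
        grades: dict[str, float] = {}
        for student in course["students"]:
            scores = [
                course["submissions"].get((student, activity), 0.0)  # missing work => zero
                for activity in course["activities"]
            ]
            grades[student] = sum(scores) / len(scores) if scores else 0.0
        results[course["id"]] = grades
    return results

# In production this would be kicked off by a scheduler, e.g. a crontab line like:
#   0 0 * * *  python finalize_expired_courses.py
# which is why thousands of courses expiring on the same date can add up to an
# eight-to-ten-hour processing window.
```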

Julie: Well, and then if you experience an outage, I can only assume that it would be a high-stress situation for both teachers and students. We’ve talked about why you focus so heavily on reliability; I’d love to hear how you prepare for those peak traffic events.

Chris: So yeah, it’s challenging to design a full load test that encompasses an entire semester’s worth of traffic, and even the peaks that are there. So, what we do is utilize our analytics, which give us information on where our peak traffic days lie. It’s typically the second or third Monday in September, at one or two o’clock in the afternoon. That’s just what we’ve seen over the past couple of years: those days are our typical traffic peaks. And so we take the type of transactions that occur during those days, and we calibrate our load tests to use those as our peak, our one-times, our performance capacity.

And then that becomes our x-factor in testing. Our 1x factor is what we see in a semester at those peaks. And we go gather the rest of them during the course of the semester, and kind of tally those up in a load test. So, if our platforms can sustain a three- to six-hour load test using peak estimate values that come from our production analysis, then we think we’re pretty stable.

And then we will turn the dial up to two times that number. And that number gives us an assessment of our headroom. How much more headroom past our peak usage periods do we have in order to service our customers reliably? And then some days, when you’re rolling the dice, for extra bonus points, we go for 3x. And the 3x is not a realistic number.
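As a rough illustration of that arithmetic, the sketch below turns observed peak traffic into 1x, 2x, and 3x load-test targets; the sample numbers and function names are invented for the example, not Cengage’s real figures.

```python
# Hypothetical sketch: deriving 1x / 2x / 3x load-test targets from observed peaks.
# The sample numbers are invented for illustration.

def peak_baseline(requests_per_minute: list[int]) -> int:
    """The 1x factor: the highest observed production traffic (e.g. the
    second or third Monday in September, early afternoon)."""
    return max(requests_per_minute)

def load_test_targets(baseline: int, factors=(1, 2, 3)) -> dict[str, int]:
    """Scale the baseline to the multiples used to assess headroom."""
    return {f"{f}x": baseline * f for f in factors}

observed = [4_200, 5_100, 9_800, 7_400]   # made-up requests/minute samples
baseline = peak_baseline(observed)        # 1x = 9,800 rpm
print(load_test_targets(baseline))        # {'1x': 9800, '2x': 19600, '3x': 29400}
```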

I have this conversation with engineering managers and directors all the time. It’s like, “Well, you overblew that load test and it demonstrated five times the load on our systems. That’s not realistic.” I says, “Well, today it’s not realistic. But next week, it might be, depending on what’s happening.”

You know, there are things that sometimes are not predictable with our semesters and our traffic, but generally speaking, it is predictable. So, let’s say some other system goes down. Single sign-on. Happens to the best of us. If you integrate with a partner and your partner is uncontrolled in your environment, you’re at their mercy.

So, when that goes down, people stop entering your application. When the floodgates open, that traffic might peak for a while in terms of, hey, it’s back up again; everybody can log in. It’s the equivalent of, like, emptying a stadium and then letting everybody in through one set of doors. You can’t do it. So, those types of scenarios become experimental design conversations with engineering managers to say, “At what level of performance do you think your platform needs to sustain?”

And as long as our platforms can sustain within two to three times that, you know, we’re pretty stable in terms of what we have now. But if we end up testing at three times the expected load and things break catastrophically, that might be an indication to an architect or an engineering director that, hey, if we’re going to outgrow our capacity in a year, it might be time to start planning for that re-architecture. Start planning for that capacity, because it’s not just adding on additional servers; planning for that capacity might include a re-architecture of some kind.

Julie: You know, Chris, I just want to say to anybody from Coinbase that’s out there that’s listening, I think they can find you on LinkedIn to talk about load testing and preparing for peak traffic events.

Chris: Yeah, I think the Super Bowl saw one. They had a little QR code di—

Julie: Yeah.

Chris: —displayed on the screen for about 15 seconds or so, and boy, I sure hope they planned for that load because if you’re only giving people 15 seconds and everybody’s trying to get their phone up there, man I bet those servers got real hot real fast. [laugh].

Julie: Yeah, they did. And there was a blip. There was a blip.

Chris: Yeah. [laugh].

Julie: But you’re on LinkedIn, so that’s great, and they can find you there to talk to you. You know, I recently had the opportunity to speak to some of the Cengage folks and it was really amazing. And it was amazing to hear what you were doing and how you have scheduled your Chaos Engineering experiments to be something that’s repeatable. Do you want to talk about that a little bit for folks?

Chris: Sure. I mean, you titled our podcast today, “A Day of Darkness,” and that’s kind of where it all started. So, if I could just back up to where we started there with how did chaos become a regular event? How did chaos become a regular part of our engineering teams’ DNA, something that they do regularly every month and it’s just no sweat to pull off?

Well, that Day of Darkness was 18 hours of our educational platforms being down. Now, arguably, the students and instructors had paid for their subscriptions already, so we weren’t losing money. But in the education space and in our course creations, our currency is in grades and activities and submissions. So, we were losing currency that day and losing reputation. And so we did a postmortem that involved engineering managers, quality assurance, and performance folks, and we looked at all the different downtimes that we’d had and what the root causes were.

And after conferring with our colleagues in the different areas—we’d never really been brought together in a setting like that—we designed a testing plan that was going to validate a good amount of load on a regular basis. And the secondary reason for coordinating testing like that was that we were migrating from data center to cloud. So, this is, you know, about five, six years ago. So, in order to validate that all that plumbing and connections and integrations worked, you know, I proposed it. I says, “Hey, let’s load test it all at the same time. Let’s see what happens. Let’s make sure that we can run water through the pipes all day long and that things work.”

And we planned this for a week; we planned five days. I traveled to Boston, gathered my engineers in kind of a war room situation, and we worked on it for a week. And in that week, we came up with a list of 90 issues—nine-zero—that we needed to fix and correct and address for our cloud-based offerings before they could go live. And you know, a number of them were low priority, easy to fix, low-hanging fruit, things like that. But there were nine of them that, if we hadn’t found them, we were sure to go down.

And so those nine things got addressed, we went live, and our system survived, you know, and things went up. After that, it became a regular thing before the semesters: “Hey, Chris, we need to coordinate that again. Can you do it?” Sure enough: coordinate some of the same old teams, grab my run sheet. And we learned that we needed to give a day of preparation, because sometimes there were folks whose scripts were old, whose environment wasn’t on a current version, and sometimes the integrations weren’t working for various reasons of other platform releases and functionality implementation.

So, we had a day of preparation and then we would run. We’d check in the morning and say, “Everybody ready to go? Any problems? Any surprises that we don’t know about, yet?” So, we’d all confer in the morning and give it a thumbs up.

We started our tests with a three-hour ramp, and we learned that the three-hour ramp was pretty optimal because sometimes elastic load balancers can’t, like, spin up fast enough to pick up the load, so there were some that we had to pre-allocate and others that we had to give enough time. So, three hours became that magic window, and then three hours of steady state at our peak generation. And now, after five years, we are doing that every month.
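For readers who want to see what a “three-hour ramp, three hours of steady state” profile can look like in practice, here is a minimal sketch using Locust’s custom load-shape feature; the peak user count, spawn rate, and target host are placeholders, since the episode does not name Cengage’s actual tooling or numbers.

```python
# Minimal sketch of a "three-hour ramp, then three hours steady at peak" profile,
# written as a Locust load shape. All numbers and the target host are placeholders.
from locust import HttpUser, LoadTestShape, between, task

RAMP_SECONDS = 3 * 60 * 60     # slow ramp gives load balancers/autoscaling time to catch up
STEADY_SECONDS = 3 * 60 * 60   # hold at the 1x peak estimate
PEAK_USERS = 10_000            # hypothetical 1x peak concurrency
SPAWN_RATE = 50                # users started per second while ramping

class StudentUser(HttpUser):
    host = "https://courseware.example.invalid"  # placeholder target
    wait_time = between(1, 5)

    @task
    def load_course_home(self):
        self.client.get("/")  # stand-in for a real courseware transaction mix

class SemesterPeakShape(LoadTestShape):
    def tick(self):
        run_time = self.get_run_time()
        if run_time < RAMP_SECONDS:
            # Ramp linearly toward peak over three hours.
            return max(1, int(PEAK_USERS * run_time / RAMP_SECONDS)), SPAWN_RATE
        if run_time < RAMP_SECONDS + STEADY_SECONDS:
            return PEAK_USERS, SPAWN_RATE  # steady state at peak
        return None  # returning None ends the test

# Run headless with: locust -f this_file.py --headless
```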

Jason: That’s amazing. One of the things you mentioned in there was this migration, and I think that might tie back to something you said earlier about scaling: especially as I’m thinking about your migration to the cloud, you said, “Scaling isn’t just adding servers. Sometimes that requires re-architecting an application or the way things work.” I’m curious, are those two connected? Were some of those nine critical fixes part of that discovery?

Chris: I think those nine fixes were part of the discovery. It was: you can’t just add servers for a particular platform. It was: how big is the network pipe? Where is the DNS server? Is it on this side or that side? Database connections were a big thing: How many are there? Are there enough?

So, there were some scaling things that hadn’t been considered at that level. You know, nowadays, fixing performance problems can be as easy as more memory and more CPU. It can be. Some days it’s not. Some days, it can be more servers; some days, it can be bigger servers.

Other times, it’s—just like quality is everybody’s job, performance fixing is not always a silver bullet. There are things like page optimization by the designers. There’s code optimization by your front-end engineers. And for your back-end engineers, there are database optimizations that can be made: indexing, reindexing on a regular basis—whatever that schedule is—for optimizing your database queries. If your front-end goes to an API for five things on the first page, does it make five extra calls, or does it make one call and all five things come across at the same time?
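To illustrate that last point about chattiness, here is a small, purely hypothetical sketch; the endpoints and the batched “summary” call are invented for the example, not Cengage’s API.

```python
# Illustrative only: hypothetical endpoints, but the pattern is the one Chris
# describes: five round trips for the first page versus one batched call.
import requests

BASE = "https://api.example.invalid"   # placeholder API host
COURSE_ID = "course-123"

def load_first_page_chatty(session: requests.Session) -> dict:
    """Five separate round trips, each paying network latency."""
    return {
        name: session.get(f"{BASE}/courses/{COURSE_ID}/{name}", timeout=5).json()
        for name in ("syllabus", "roster", "activities", "grades", "announcements")
    }

def load_first_page_batched(session: requests.Session) -> dict:
    """One round trip that returns all five resources together."""
    resp = session.get(
        f"{BASE}/courses/{COURSE_ID}/summary",
        params={"include": "syllabus,roster,activities,grades,announcements"},
        timeout=5,
    )
    return resp.json()
```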

So, those are considerations where load and performance testing can tell you where to begin looking. But as the quality assurance and performance lead engineer, I might find five things, and the fixes aren’t just more testing and a little bit of extra functionality. It might have involved DevOps to tweak the server connections, it might have involved the network team to slim down the hops from four different load balancers to two, or something like that. I mean, it was always just something else that you never considered, and you utilized your full team and all of their expertise and skills in order to come up with those inches.

And that’s one of my favorite quotes from Any Given Sunday. It’s an older football movie starring Al Pacino. He gives this really awesome speech in a halftime type of setting, and the punch line for the whole thing is, “The inches we need are everywhere around us.” And I tell people that story because performance, at the software level, is a game of inches. And those inches are in all of our systems, and it’s up to us as engineers to find them and add them up.

Julie: I absolutely love everything about that. And that would have made a great title for this episode: “The Inches We Need are Everywhere Around Us.” We’ve already settled on “A Day of Darkness with Chris Martello,” though. On that note, Chris, some of the things that you mentioned involve a lot of communication with different teams. How did you navigate some of those struggles? Or even at the beginning of this, was it easy to get everybody on board with this mindset of a new way of doing things? Did you have some challenges?

Chris: There were challenges, for sure. It’s kind of hard to picture, I guess, Cengage’s platform architecture and stuff. It’s not just one thing. It’s kind of like Amazon. Amazon is probably the best example, in that a lot of their services and things work in little, little areas.

So, in planning this, I looked at an architecture diagram, and there’s all these things around it, and we have this landscape. And I just looked down here in the corner. I said, “What’s this?” They said, “Well, that’s single-sign-on.” I says, “Well, everything that touches that needs to be load tested.”

And they’re like, “Why? We can’t do that. We don’t have a performance environment for that.” I said, “You can’t afford not to.” And the day of darkness was kind of that, you know, example that kind of gave us the [sigh] momentum to get over that obstacle that said, “Yeah, we really do need a dedicated performance environment in order to prove this out.”

So, then we whittled down that giant list of applications and teams into the ones that were meaningful to our single-sign-on. And when we whittled that down, we ended up with 16 different teams that regularly participate in chaos. Those are the ones that all play together on the same playing field at the same time, and when we find that one system has more throughput than another system, or an unexpected transaction load, sometimes that system can carry that load or project it onto another system inadvertently. And if there are timeouts at one that are set higher than another, then those events start queuing up on the second set of servers. It’s something that we continually balance.

And we use these bits of information from each test and start, you know, logging and tracking these issues, and deciding whether each one is important, how long it’s going to take to fix, and whether it’s necessary. And, you know, you’re balancing risk and reward with everything you’re doing, of course, in the business world, but sometimes it’s, you know—“Chris, bring us more quality. You can do better this month. Can you give us 20 more units of quality?” It’s like, “I can’t really package that up and hand it to you. That’s not a deliverable.”

And in the same way, the reputation that we lose when our systems go down isn’t as quantifiable, either. Sure, you can watch the tweets come across the interwebs and see how upset our students are at those kinds of things, but our customer support and our service folks really take that to heart, and they listen to those tweets and they fix them, and they coordinate and reach out, you know, directly to these folks. And I think that’s why our organization supports this type of performance testing, as well as our coordinated chaos: The service experience that goes out to our customers has to be second to none. And that second to none is the table stakes: your platform must be on, must be stable, and must be performing. That’s just to enter the space, kids. You’ve got to be there. [laugh].

You can’t have your platform going down at 9 p.m. on a Sunday night when all these college students are doing their homework because they freak out. And they react to it. It’s important. That’s the currency. That is the human experience that says this platform, this product is very important to these students' lives and their well-being in their academic career. And so we take that very seriously.

Jason: I love that you mentioned that your customer support works with the engineering team. Because it makes me think of how many calls have you been on where something went wrong, you contacted customer support, and you end up reaching this point where they don’t talk to engineering, and they’re just like, “I don’t know, it’s broken. Try again some other time.” Or whatever that is, and you end up lost. We often think of DevOps as developers and operations engineers working together and everybody on the engineering side, but I love that idea of extending that.

And so I’m curious, in that vein, does your Chaos Engineering, does your performance testing also interact with some of what customer support is actually doing?

Chris: In a support kind of way, absolutely. Our customer call support is very well educated on our products, and they have a lot of different tools at their disposal in order to correct problems. And you know, many of those problems are access and permissions and all that kind of stuff that’s usual, but what we’ve seen is that even though our customer base is increasing and our call volume increases accordingly, the percentage decreases over time because our customer support people have gotten so good at answering those questions. And to that extent, when we do log issues that are not as easily fixed with a tweak or a knob toggle on the customer support side, those get grouped up into a group of tickets that we call escalation tickets, and those go directly to engineering.

And when we see groups of them that look and smell kind of the same, or have similar symptoms, we start looking at how to design that into chaos, and at whether it’s a real performance issue, especially when it’s related to slowness or errors that continuously come up at a particular point in that workflow. So, I hope I answered that question there for you, Jason.

Jason: Yeah, that’s perfect.

Julie: Now, I’d like to kind of bring it back a little bit to some of the learnings we’ve had over this time of practicing Chaos Engineering and focusing on that quality testing. Is there something big that stands out in your mind that you learned from an experiment? Some big, unknown-unknown that you don’t know that you ever could have caught without practicing?

Chris: Julie, that’s a really good question, and there isn’t, you know, a big bang or any epiphany here. When I talk about what the purpose of chaos is and what we get out of it, there’s the human factor of chaos, in terms of what this does for us. It gets us prepared, it gives us a fire drill without the sense of urgency of production, and it gets people focused on solving a problem together. So, by practicing in a performance, in a chaos sort of way, when performance does affect production, those communication channels are already greased. When there’s a problem with some system, I know exactly who the engineer is to go to and ask a question.

And that has also enabled us to reduce our mean time to resolution. That mean time to resolution is predicated on our teams knowing what to do and how to resolve those issues. And because we’ve practiced it, now that goes down. So, I think the synergy of being able to work together and triangulate our teams on existing issues in a faster way definitely helps our team dynamic in terms of solving those problems faster.

Julie: I like that a lot, because there is so much more than just the technical systems. And that’s something that we like to talk about, too: it’s your people systems as well. And you’re not trying to surprise anybody; you’ve got these scheduled on a calendar, and they run regularly. So it’s important to note that when you’re looking at making your people systems more resilient, you’re not trying to catch Chris off guard to see if he answered the page—

Chris: That’s right.

Julie: —what we’re working on is making sure that we’re building that muscle memory with practice, right, and ironing out the kinks in those communication channels.

Chris: Absolutely. It’s definitely been a journey of learning, both for, you know, myself and my team, as well as the engineers that work on these things. You know, again, everybody chips in and gets to learn that routine and be comfortable with fighting fires. Another way I’ve looked at it with Chaos Engineering and our testing adventures is that when we find something that looks a little off—it’s a burp, or a sneeze, or some hiccup over here in this system—that can turn into a full-blown fever or cold in production. And we’ve had a couple of examples where we didn’t pay attention to that stuff fast enough, and it did occur in production.

And kudos to our engineering team who went and picked it up because we had the information. We had the tracking that says we did find this. We have a solution or recommended fix in place, and it’s already in process. That speaks volumes to our sense of urgency on the engineering teams.

Julie: Chris, thank you for that. And before we end our time with you today, is there anything you’d like to let our listeners know about Cengage or anything you’d like to plug?

Chris: Well, Cengage Learning has been a great place for me to work, and I know that a lot of people enjoy working there. Anytime I ask my teams, “What’s the best part of working here?” it’s, “The people. The people we work with are supportive and helpful.” You know, we have a product that we’d like to help change people’s lives with, in terms of furthering their education and their career choices, so if you’re interested, we have over 200 open positions at the current moment within our engineering and staffing choices.

And if you’re somebody interested in helping out folks and making a difference in people’s educational and career paths, this is a place for you. Thanks for the offer, Julie. Really appreciate that.

Julie: Thank you, Chris.

Jason: Thanks, Chris. It’s been fantastic to have you on the show.

Chris: It’s been a pleasure to be here and great to talk to you. I enjoy talking about my passions with testing as well as many of my other ones. [laugh].

Jason: For links to all the information mentioned, visit our website at gremlin.com/podcast. If you liked this episode, subscribe to the Break Things on Purpose podcast on Spotify, Apple Podcasts, or your favorite podcast platform. Our theme song is called Battle of Pogs by Komiku and is available on loyaltyfreakmusic.com.

Julie Gunderson
Senior Reliability Advocate