Joyce Lin: Lightning Talk: Who Is Responsible for Chaos? - Chaos Conf 2019
The following is a transcript of the talk given by Joyce Lin, Lead Developer Advocate at Postman, at Chaos Conf 2019, which you can enjoy in the embedded video above. Slides are available here.
Well, my name is Joyce. I'm a developer advocate with Postman. Postman is an API dev environment, or an ADE, used by a bunch of people around the world. And a very small piece of my job is talking at conferences like these. A bigger part of my time is actually spent talking with people. I'm lucky enough to get to go to some of these Chaos Conferences, resiliency things, Chaos Community Days. And one of the questions that I had was, "Who is responsible for chaos?" I've been in your seat there, shoulder to shoulder and turned and said, "Who is responsible for chaos? Is it SRE?" We hear about that. "Dev Ops? Is it QA?" And I don't really get a clear answer, at least in the audience.
So the Chaos Engineering Slack group published this beautiful mind map. Can you guys read that? Okay, so this is the who's who diagram of the people and tools that are most famous and pioneering the way in chaos. Let's take a look at this data and break it down by job titles, specifically what job titles are doing chaos?
So the vast majority of famous people in chaos have an engineering title. I saw you guys all raise your hands, a lot of devs in the audience right here. There's also specialized roles like site reliability engineer. And then, you see about a third of these people coming from functions like Security, Dev Ops and R&D.
So, typically, the people that are most motivated to start a Chaos Engineering program are going to be the ones that feel the pain of a production failure: the people who are on call. Kolton Andrus had said, "It boils down to who gets paged. If that's an SRE or an Ops team, they have the most incentive to start doing this work and making their lives better." And in fact, that's how Kolton became personally invested in chaos when he first got started.
So when you're thinking about roles and responsibilities, what typical responsibilities do folks have who might be interested in chaos? What overlap do you see? Some companies clearly have dedicated chaos teams, dedicated chaos engineers, and it might actually be a core competency for their organization. Other companies have SREs, or production engineers, that are responsible for continuous improvement and production support. At Postman Engineering, we have a microservice architecture, and the developers that are building those services are actually responsible for deployment and uptime. If you have an organization that has a traditional DevOps team, it might be DevOps engineers that are responsible for those service levels.
Other people who care about chaos might be responsible for incident management, which we heard about a little bit earlier, or post-mortem analyses. The difference here is that incident management is a reactive process where you're trying to prevent something from happening again, whereas chaos is about predicting what might happen and preventing it from happening in the first place.
All right, so some companies who have chaos engineers dedicate them to a specific domain like traffic, or data, or storage. And lastly, when we think about the folks that are responsible for chaos, typically these are the ones responsible for testing in production. We already heard a caveat earlier today: production is the best environment for doing your chaos experiments. It's the most information-rich, comprehensive environment, where you can get truly accurate results from your attacks, but some companies can't test in production. There are compliance blockers; maybe you're healthcare, maybe you're fintech. You literally can't take down a database. You can't take down a host. Besides possibly this last one, testing in production, these are the typical responsibilities of folks that do chaos today.
So what roles does this translate to? Currently, the vast majority of people doing Chaos Engineering tend to have a quality-driven or production-focused ops engineering role. So I have a question. Chaos engineers are running chaos tests. They're identifying vulnerabilities and they're automating the running of these tests. My question is, why aren't testers doing chaos? Shrug.
So, before there was Chaos Engineering there was chaos testing. And, in fact, I used the Wayback Machine, so I was able to see the very first blog posts of Chaos Monkey coming out. And it was actually called Chaos Testing and Netflix launched it to the test community. Well, that actually makes sense, right? If you introduce the responsibility of resilience earlier in the dev cycle, when the cost of bugs is the lowest, then you have an ideal model. It's a noble and ideal goal, but the testers aren't the ones on call, or responsible for rolling out hotfixes in production.
So then, you have a bunch of people with SRE and ops titles pioneering the field of Chaos Engineering, and that's how we ended up here. But because of that aspirational goal of building resilience in earlier in the dev life cycle, we're starting to see, and it's at the very earliest stages, testers that are curious about chaos. And they're starting to focus on production testing, not just prerelease testing. In fact, one such test engineer, Abby, said, "The biggest limitation in the fear of delivering software faster is the focus on adding more prerelease testing." She goes on further to say that Chaos Engineering is all about "Building confidence that we aren't fragile." I'm pretty sure that's taken almost directly from the Principles of Chaos Engineering. So we have less fear that any one attack or any one change is going to take down our system.
Okay, so why aren't more testers doing chaos today? I've talked to some testers who are curious about chaos, but they're still covering the bases when it comes to prerelease testing, there's so much to do, not enough time. And in the very early stages I am talking to some organizations where the SREs are the ones responsible for creating the experiments because they know what the infrastructure looks like, they know where the vulnerabilities are. And the testers are the ones that are executing these tests and automating them.
So job titles aside, who can start a chaos program, who can get the ball rolling? So first, who has the insights, or knows about potential vulnerabilities, to properly structure your chaos experiment and limit the blast radius? Who has the access to pull the plug on a system and roll it back? Abort. Abort. Who can roll it back? And lastly, this was in the keynote earlier, but buy-in does not equal all-in.
So if you're thinking about starting a Chaos Engineering program, this actually might be the heaviest ball to roll. And Casey Rosenthal has some advice for you guys, "I wish the best of luck to you in that undertaking, but I wouldn't wager that you get it right on your first try or your second. No one's going to get it right on the first try. There's too many different aspects going into your particular context." He goes on further to say, "Who owns chaos? It depends on a bunch of stuff."
So final thoughts, who can start a chaos program? It depends on your particular context, but as more and more organizations are moving to the cloud, their world is becoming more complex. More people are thinking about Chaos Engineering, focusing more on chaos, and asking how chaos testing can complement traditional prerelease testing. And what this means is that as you cover your bases with prerelease testing, we're starting to see more people and more functions involved in chaos. And lastly, a few people have said this in their talks today, but a valuable chaos test is not just going to teach you about your systems, but about your people. So thank you very much.