Niran Fajemisin: Lightning Talk: Transitive Logic of Systems Fallibility - Chaos Conf 2019
The following is a transcript of the talk given by Niran Fajemisin, Director of Engineering at Starbucks, at Chaos Conf 2019.
Hi, my name is Niran Fajemisin. I'm the Director of Engineering at Starbucks, and my team is responsible for the loyalty platform. So basically, you buy a cup of coffee and you get points as a result. And also my team is responsible for the revamped version of the mobile order and pay platform. Basically, you buy your coffee and then you can just kind of walk over and pick it up. So I'm going to talk about this, the transitive logic of systems fallibility. It's not really a thing, but we shall make it one.
So this is how it works in my head. Systems are built by humans. Humans are fallible, and so systems are fallible. So basically, the bottom line is that you talk about failure as a first-class construct. The only thing that is constant is failure, right? As you've heard many speakers come up and say, it's bound to fail. It's going to happen. Don't bother trying to fight it, because it's just a matter of time. So what we're talking about here is really all about trying to figure out a way of mitigating the failure and coming up with mechanisms that allow us to recover in a graceful fashion. So failure must be embraced. You've heard it over and over already today, but you essentially have to take it and just embrace it as a thing, right? Because it's really not a matter of if a system is going to fail, it's a matter of when it's going to fail. And there are various forms of failure that you have heard about already. But basically, that's the point.
So you have to treat it as a first class construct. It means at the very beginning when you're thinking about designing systems, when you're thinking about coming up with services, whatever it be, maybe microservices and maybe monoliths, it doesn't really matter, but you think about failure as a first class construct. And what that really means is that you really have to come up with a different approach in terms of letting people kind of embrace the concept of failure. I see it from a human standpoint, in a sense of like you're creating an environment that makes it safe for people to fail, right? And so really the environment talks about transparency. We've heard about blamelessness, as well as accountability.
So when there's failure, the first thing that happens is we typically all talk about, oh yeah, we should really get behind it. We should kind of do things in the right way. And you know, everyone should be in there together and kind of support one another. But we switch over to our reptilian roots, and nobody wants to be the person in the hot seat, right? So the first thing is that finger pointing begins. People are pointing fingers, teams are insulating themselves and everything, and so that's where the blame comes from. And of course when you have blame, basically it's like people become guarded, right? Because you don't want to be that person. So that's what happens, and then ultimately it stifles innovation and growth.
And so what do we do? So when you think about it, when there is failure, the first thing is at this point, that's when we're most vulnerable. It is the point at which we want to actually try to change the way things are done. We have to kind of think a little bit differently in terms of how we deal with one another, how we interact with one another. And so we are most vulnerable at this time. That's why the language that we use matters, in terms of how we're actually addressing one another, in terms of how you talk to one another, and things along those lines. And so basically from that standpoint, what it means is that during this time you all are kind of thinking about, okay, well it's not so much about, oh, this person made a wrong PR or someone forgot to do something or whatever. It really doesn't matter, because at that point in time, in the moment when there is failure, no one really cares who caused it. It's all about remediating a problem. That's really what you're trying to get to.
And so focus on the problem. This thing seems very obvious, but it's very easy to forget in the heat of the moment. Focus on the problem and not the individuals. And there's a little bit of coaching involved as well, not just coaching the individual that basically might be the person that has created an errant deployment or whatever, because there will be time to kind of coach that person. During the time of the failure is not the time to coach them, because that's the time where you can do the most damage. You want to coach everyone in terms of how you deal with one another during that situation. There's also the concept of mentorship, which helps them to develop skills and things along those lines, so that they can do better the next time.
Now, when you come together and you deal with a failure from more of a humanistic standpoint, there's a collective ownership of it. In other words, the old [inaudible 00:05:10] concept about we're all in it together and things along those lines. It really does matter. So basically you all kind of collectively own the failure. There will be time to talk about it. There will be time to do, you know, postmortems and all these other things, but you're really coming together to own it. And when you do own it, the flip side of the coin is that you also are sharing the success, because you all work together, you resolve the problem, and basically you get to bask in the glory of the success.
Now, failing for the same reason is not acceptable. That's just incompetence. So let's not do this. So what does that mean? You talk about instrumentation and all those things. Well, essentially, if you can't really measure something, you can't tell how badly you're doing. So from my perspective, really instrumenting services and all the other things around the platforms that you're on is what is key, because that's what helps you to understand, first of all, what's the problem at hand? How do I go about remediating the problem? And ultimately, how do we put preventative measures in place that prevent us from having to repeat the same problems? So we've talked about alerts, people have talked about monitoring. Anomaly detection is also another thing that is kind of thrown around. We'll talk about it at another time. But basically, putting those things in place is really what helps you get ahead of the problem, helps you in resolving the problem as it's happening, and ultimately your future self will thank you for this.
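To make the instrumentation point a little more concrete, here is a minimal sketch of measuring a service's recent error rate and deciding when to raise an alert. It is not taken from the talk or from any Starbucks system. The class name `ErrorRateMonitor`, the window size, and the alert threshold are all hypothetical choices made for illustration.

```python
from collections import deque


class ErrorRateMonitor:
    """Tracks the outcomes of the most recent calls and flags a high error rate.

    Hypothetical example: the defaults below are illustrative assumptions,
    not recommended values.
    """

    def __init__(self, window_size=100, alert_threshold=0.05):
        # A bounded deque drops the oldest outcome once the window is full.
        self.outcomes = deque(maxlen=window_size)
        self.alert_threshold = alert_threshold

    def record(self, succeeded):
        """Record one call outcome: True for success, False for failure."""
        self.outcomes.append(bool(succeeded))

    def error_rate(self):
        """Return the fraction of failed calls in the current window."""
        if not self.outcomes:
            return 0.0
        failures = sum(1 for ok in self.outcomes if not ok)
        return failures / len(self.outcomes)

    def should_alert(self):
        """Return True when the error rate is strictly above the threshold."""
        return self.error_rate() > self.alert_threshold
```

In use, a service wrapper would call `record` after every request and check `should_alert` on a schedule. The same measurement that drives the alert is also what tells you, after the fact, how bad the failure was.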
So we've talked about instrumenting things, and then there are some basic things where I assume everybody's doing this already, right? So when you're talking about instrumentation, what comes naturally after that is the ability to observe. Observability is a big topic. We're not going to talk about it. It's a lightning talk, but basically I like to describe it as insight into what is and possibly what will be. So you can see exactly what's happening right now, and then you get a chance to understand what is going to be, based on looking at the trends, looking at traffic patterns and things of the like. Elasticity is another thing, building systems that are elastic, making sure that you can expand based on different kinds of workloads.
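As a rough illustration of "what is and possibly what will be", the sketch below fits a straight line through recent traffic samples, projects the next value, and uses that projection to decide whether to add capacity. It is an assumption-laden toy, not the speaker's method. The function names, the evenly spaced samples, and the 80% headroom figure are all hypothetical, and real traffic is rarely this linear.

```python
def project_next(samples):
    """Fit a least-squares line through evenly spaced samples and
    extrapolate one step past the last sample."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to estimate a trend")
    xs = range(n)
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    slope_num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples))
    slope_den = sum((x - mean_x) ** 2 for x in xs)
    slope = slope_num / slope_den
    intercept = mean_y - slope * mean_x
    # The next sample sits at x = n.
    return intercept + slope * n


def needs_more_capacity(samples, capacity, headroom=0.8):
    """Return True when projected load exceeds the chosen share of capacity.

    The headroom value is a hypothetical default for illustration only.
    """
    return project_next(samples) > capacity * headroom
```

The observed samples are the "what is"; the projection is a crude "what will be" that an elastic system could act on before the load actually arrives.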
You're talking about resiliency, another big topic, thinking about that from the ground level, right, responsiveness in the face of failure and all the other techniques that we know that help us with resiliency, and ultimately transparency. Now I can't stress this enough, because if you go back to building a culture of blamelessness and things along those lines, the problem is that we live in a world where a lot of teams feel as if they're constantly pitted against one another, instead of working together, collaborating, and serving a common goal, which ultimately is just servicing our customers and making sure we're delivering business value.
When you collaborate and you have transparency and there is no blame between teams, well, guess what happens? You get a better view of the entire platform across the board. And this is key. So in the wise words of Yoda, "The greatest teacher, failure is." It's when there is failure that we actually learn. That's when we understand tolerances. That's when we understand limits of our platform. That's when we actually understand a lot of things about ourselves. You learn a ton about yourself, about how you actually react to the issues and how you deal with one another. So I want us to take this going forward and think about it. All these things take time. It's a journey, right? And so basically, as we're going through this journey, it's all about learning along the way and not focusing on perfection or anything along those lines, but really just thinking about the lessons that you learn from failure. Thank you.