Reliability and serverless are at the forefront of today’s conversation. For this episode Gunnar Grosch, Senior Developer Advocate at AWS, is here to talk about Chaos Engineering, AWS Serverless, and the work that AWS is doing when it comes to reliability. Gunnar traces his history from AWS user to AWS Serverless Hero, to his current position where they are shaping “serverless as reliability.” Gunnar talks about his applications of Chaos Engineering to test how their different services work together, and he waxes on the outcomes.

Episode Highlights

In this episode, we cover:

  • 00:00:00 - Intro
  • 00:01:45 - AWS Serverless Hero and Gunnar’s history using AWS
  • 00:04:42 - Serverless as reliability
  • 00:08:10 - How they are testing the connectivity in serverless
  • 00:12:47 - Gunnar shares a surprising result of Chaos Engineering
  • 00:16:00 - Strategy for improving and advice on tracing
  • 00:20:10 - What Gunnar is excited about at AWS
  • 00:28:50 - What Gunnar has going on/Outro

Transcript

Gunnar: When I started out, I perhaps didn’t expect to find that many unexpected things that actually showed more resilience or more reliability than we actually thought.

Jason: Welcome to the Break Things on Purpose podcast, a show about Chaos Engineering and building more reliable systems. In this episode, we chat with Gunnar Grosch, a Senior Developer Advocate at AWS about Chaos Engineering with serverless, and the new reliability-related projects at AWS that he’s most excited about.

Jason: Gunnar, why don’t you say hello and introduce yourself.

Gunnar: Hi, everyone. Thanks, Jason, for having me. As you mentioned, I’m Gunnar Grosch. I am a Developer Advocate at AWS, and I’m based in Sweden, in the Nordics. And I’m what’s called a Regional Developer Advocate, which means that I mainly cover the Nordics and try to engage with the developer community there to, I guess, inspire them on how to build with cloud and with AWS in different ways. And well, as you know, and some of the viewers might know, I’ve been involved in the Chaos Engineering and resilience community for quite some years as well. So, topics of real interest to me.

Jason: Yeah, I think that’s where we actually met was around Chaos Engineering, but at the time, I think I knew you as just an AWS Serverless Hero, that’s something that you’d gotten into. I’m curious if you could tell us more about that. How did you begin that journey?

Gunnar: Well, I guess I started out as an AWS user, built things on AWS. As a builder, developer, I’ve been through a bunch of different roles throughout my 20-plus something year career by now. But started out as an AWS user. I worked for a company, we were a consulting firm helping others build on AWS, and other platforms as well. And I started getting involved in the AWS community in different ways, by arranging and speaking at different meetups across the Nordics and Europe, also speaking at different conferences, and so on.

And through that, I was able to combine that with my interest for resiliency or reliability, as someone who’s built systems for myself and for our customers. That has always been a big interest for me. Serverless, it came as I think a part of that because I saw the benefits of using serverless to perhaps remove that undifferentiated heavy lifting that we often talk about with running your own servers, with operating things in your own data centers, and so on. Serverless is really the opposite to that. But then I wanted to combine it with resilience engineering and Chaos Engineering, especially.

So, started working with techniques, how to use Chaos Engineering with serverless. That gained some traction, it wasn’t a very common topic to talk about back then. Adrian Hornsby, as some people might know, also from AWS, he was previously a Developer Advocate at AWS, now in a different role within the organization. He also talked a bit about Chaos Engineering for serverless. So, teamed up a bit with him, and continue those techniques, started creating different tools and some open-source libraries for how to actually do that. And I guess that’s how, maybe, the AWS serverless team got their eyes opened for me as well. So somehow, I managed to become what’s known as an AWS Hero in the serverless space.

Jason: I’m interested in that experience of thinking about serverless and reliability. I feel like when serverless was first announced, it was that idea of you’re not running any infrastructure, you’re just deploying code, and that code gets called, and it gets run. Talk to me about how does that change the perception or the approach to reliability within that, right? Because I think a lot of us when we first heard of serverless it’s like, “Great, there’s nothing. So theoretically, if all you’re doing is calling my code and my code runs, as long as I’m being reliable on my end and, you know, doing testing on my code, then it should be fine, right?” But I think there’s some other bits in there or some other angles to reliability that you might want to tune us into.

Gunnar: Yeah, for sure. And AWS Lambda really started it all as the compute service for serverless. And, as you said, it’s about having your piece of code running that on-demand; you don’t have to worry about any underlying infrastructure, it scales as you need it, and so on; the value proposition of serverless, truly. The serverless landscape has really evolved since then. So, now there is a bunch of different services in basically all different categories that are serverless.

So, the thing that I started doing was to think about how—I wasn’t that concerned about not having my Lambda functions running; they did their job constantly. But then when you start building a system, it becomes a lot more complex. You need to have many different parts. And we know that the distributed systems we build today, they are very complex because they contain so many different moving parts. And that’s still the case for serverless.

So, even though you perhaps don’t have to think about the underlying infrastructure, what servers you’re using, how that’s running, you still have all of these moving pieces that you’ve interconnected in different ways. So, that’s where the use case for Chaos Engineering came into play, even for serverless. So, testing how these different parts work together to then make sure that it actually works as you intended to. So, it’s a bit harder to create those experiments since you don’t have control of that underlying infrastructure. So instead, you have to do it in a few different ways, since you can’t install any agents to run on the platform, for instance, you can’t control the servers—shut down servers, the perhaps most basic of Chaos Engineering experiment.

So instead, we’re doing it using different libraries, we’re doing it by changing configuration of services, and so on. So, we still apply the same principles, the principles of Chaos Engineering, we just have to be, well, we have to think about it in a different way in how we actually create those experiments. So, for me, it’s a lot about testing how the different services work together. Since the serverless architectures that you build, they usually contain a bunch of different services that you stitch together to actually create the output that you’re looking for.

Jason: Yeah. So, I’m curious, what does that actually look like then in testing, how these are stitched together, as you say? Because I know with traditional Chaos Engineering, you would run a blackhole attack or some sort of network attack to disrupt that connectivity between services. Obviously, with Lambdas, they work a little bit differently in the way that they’re called and they’re more event-driven. So, what does that look like to test the connectivity in serverless?

Gunnar: So, what we started out with, both me and Adrian Hornsby was create these libraries that we could run inside the AWS Lambda functions. So, I created one that was for Node.js, something that you can easily install in your Node.js code. Adrian has created one for Python Lambda functions.

So, then they in turn contain a few different experiments. So, for instance, you could add latency to your AWS Lambda functions to then control what happens if you add 50 milliseconds per invocation on your Lambda function. So, for each call to a downstream service, say you’re using DynamoDB as a data store, so you add latency to each call to DynamoDB to see how that affects your application. Another example could be to have a blackhole or a denial list, so you’re denying calls to specific services. Or it could be downstream services, other AWS services, or it could be third-party, for instance; you’re using a third-party for authentication. What if you’re not able to reach that specific API or whatever it is?
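The latency and denylist experiments described here can be sketched in a few lines of Python. This is a minimal illustration only, not the API of either library mentioned in the conversation; the names `inject_latency`, `check_denylist`, and `DeniedHostError` are hypothetical, and the handler is a stand-in for a real Lambda function.

```python
import functools
import time
from urllib.parse import urlparse


class DeniedHostError(ConnectionError):
    """Raised when a call targets a host on the experiment's denylist."""


def inject_latency(delay_ms, enabled=True, sleep=time.sleep):
    """Decorator that delays every call to the wrapped function by delay_ms."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if enabled:
                sleep(delay_ms / 1000.0)  # simulate a slow downstream dependency
            return func(*args, **kwargs)
        return wrapper
    return decorator


def check_denylist(url, denylist):
    """Return the URL unchanged, or raise if its host is on the denylist."""
    host = urlparse(url).hostname or ""
    if any(host == denied or host.endswith("." + denied) for denied in denylist):
        raise DeniedHostError(f"chaos experiment: calls to {host} are denied")
    return url


@inject_latency(delay_ms=50)
def handler(event, context):
    # Hypothetical Lambda handler; a real one would call DynamoDB or an API here.
    return {"statusCode": 200, "body": "ok"}
```

Keeping an `enabled` flag on the decorator is one way to switch the experiment on and off through configuration rather than a redeploy; how the flag is actually loaded is left out of this sketch.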

We’ve created different experiments for—a typical use case for AWS Lambda functions has been to create APIs where you’re using an API Gateway service, an AWS Lambda function is called, and then returning something back to that API. And usually, it should return a 200 response, but you could then alter that response to test how does your application behave? How does the front-end application, for instance, behave when it’s not getting that 200 response that it’s expecting, instead of getting a 502, a 404, or whatever error code you want to test with. So, that was the way, I think, we started out doing these types of experiments. And just by those simple building blocks, you can create a bunch of different experiments that you can then use to test how the application behaves under those adverse conditions.
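A response-altering experiment of this kind might look like the following Python sketch. It assumes the handler returns the usual API Gateway proxy-style dictionary with `statusCode`, `headers`, and `body`; the decorator name `inject_status_code` and the injected message are hypothetical and not taken from any specific library.

```python
import functools
import json


def inject_status_code(status_code, enabled=True):
    """Decorator that swaps the handler's response for an error response."""
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(event, context):
            response = handler(event, context)
            if not enabled:
                return response
            # Keep the proxy-response shape but replace status and body.
            return {
                "statusCode": status_code,
                "headers": response.get("headers", {}),
                "body": json.dumps({"message": "injected failure"}),
            }
        return wrapper
    return decorator


@inject_status_code(502)
def handler(event, context):
    # Hypothetical handler that would normally answer with a 200.
    return {"statusCode": 200, "body": json.dumps({"items": []})}
```

With the decorator enabled, a front end calling this API sees a 502 instead of the 200 it expects, which is the condition the experiment is meant to exercise.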

Then if you want to move to create experiments for other services, well, then serverless, as we talked about earlier, since you don’t have control over the underlying infrastructure, it is a bit harder. Instead, you have to think about different ways to do it by, for instance, changing configuration, things like that. You could, for instance, restrict concurrent operations on certain services, or you could do experiments to block access, for instance, using different access control lists, and so on. So, different ways, all depending on how that specific service works.
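As one example of a configuration-based experiment, the sketch below temporarily caps a Lambda function's reserved concurrency and restores the previous setting afterwards. It assumes a boto3 Lambda client is passed in and uses its `get_function_concurrency`, `put_function_concurrency`, and `delete_function_concurrency` calls; the helper name `restricted_concurrency` is hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def restricted_concurrency(lambda_client, function_name, limit=0):
    """Temporarily cap a function's reserved concurrency, then restore it."""
    current = lambda_client.get_function_concurrency(FunctionName=function_name)
    previous = current.get("ReservedConcurrentExecutions")
    lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=limit,
    )
    try:
        yield
    finally:
        # Roll back even if the experiment body raises.
        if previous is None:
            lambda_client.delete_function_concurrency(FunctionName=function_name)
        else:
            lambda_client.put_function_concurrency(
                FunctionName=function_name,
                ReservedConcurrentExecutions=previous,
            )
```

A caller would wrap the observation window in `with restricted_concurrency(boto3.client("lambda"), "my-function", limit=0):`, where `my-function` is a placeholder name. A limit of 0 throttles every invocation, so this belongs in a test environment first.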

Jason: It definitely sounds like you’re taking some of those same concepts, and although serverless is fundamentally different in a lot of ways, really just taking that, translating it, and applying those to the serverless.

Gunnar: Yeah, exactly. I think that’s very important here to think about, that it is still using Chaos Engineering in the exact same way. We’re using the traditional principles, we’re walking through the same steps. And many times as I know everyone doing Chaos Engineering talks about this, we’re learning so much just by doing those initial steps. When we’re looking at the steady-state of the application, when we’re starting to design the experiments, we learn so much about the application.

I think just getting through those initial steps is very important for people building with serverless, as well. So, think about, how does my application behave if something goes wrong? Because many times with serverless, and for good reasons, you don’t expect anything to fail. Because it scales as it should, services are resilient, and they are responding. But it is that old, “What if?” What if something goes wrong? So, just starting out doing it in the same way as you normally would do with Chaos Engineering, there is no difference, really.

Jason: And know, when we do these experiments, there’s a lot that we end up learning, and a lot that can be very surprising, right? When we assume that our systems are one way, and we run the test, and we follow that regular Chaos Engineering process of creating that hypothesis, testing it, and then getting that unexpected result—

Gunnar: Right.

Jason: —and having to learn from that. So, I’m interested, if you could share maybe one of the surprising results that you’ve learned as you’ve done Chaos Engineering, as you’ve continued to hone this practice and use it. What’s a result that was unexpected for you, that you’ve learned something about?

Gunnar: I think those are very common. And I think we see them all the time in different ways. And when I started out, I perhaps didn’t expect to find that many unexpected things that actually showed more resilience or more reliability than we actually thought. And I think that’s quite common, that we run an experiment, and we often find that the system is more resilient to failure than we actually thought initially, for instance, that specific services are able to withstand more turbulent conditions than we initially thought.

So, we create our hypothesis, we expect the system to behave in a certain way. But it doesn’t, instead—it doesn’t break, but instead, it’s more robust. Certain services can handle more stress than we actually thought, initially. And I think those cases, they, well, they are super common. I see that quite a lot. Not only talking about serverless Chaos Engineering experiments; all the Chaos Engineering experiments we run. I think we see that quite a lot.

Jason: That’s an excellent point. I really love that because it’s, as you mentioned, something that we do see a lot of. In my own experience working with some of our customers, oftentimes, especially around networking, networking can be one of the more complex parts of our systems. And I’ve dealt with customers who have come back to me and said, “I ran a blackhole attack, or latency attack, or some sort of network disruption and it didn’t work.” And so you dig into it, well, why didn’t it work? And it’s actually well, it did; there was a disruption, but your system was designed well enough that you just never noticed it. And so it didn’t show up in your metrics dashboards or anything because the system just worked around it just fine.

Gunnar: Yeah, and I think that speaks to the complexity of the systems we’re often dealing with today. I think it’s Casey Rosenthal who talked about this quite early on with Chaos Engineering, that it’s hard for any person to create that mental model of how a system works today. And I think that’s really true. And those are good examples of exactly that. So, we create this model of how we think the system should behave, but [unintelligible 00:15:46], sometimes it behaves very unexpected… but in the positive way.

Jason: So, you mentioned about mental models and how things work. And so since we’ve been talking about serverless, that brought to mind one of those things for me with serverless is, as people make functions and things because they’re so easy to make and because they’re so small, you end up having so many of them that work together. What’s your strategy for starting to improve or build that mental model, or document what’s going on because you have so many more pieces now with things like serverless?

Gunnar: There are different approaches to this, and I think this ties in with observability and the way we observe systems today because as these systems—often they aren’t static, they continue to evolve all the time, so we add new functionality, and especially using serverless and building it with AWS Lambda functions, for instance, as soon as we start creating new features to our systems, we add more and more AWS Lambda functions or different serverless ways of doing new functionality into our system. So, having that proper observability, I think that’s one of the keys of creating that model of how the system actually works, to be able to actually see tracing, see how the system or how a request flows through the system. Besides that, having proper documentation is something that I think most organizations struggle with; that’s been the case throughout all of my career, being able to keep up with the pace of innovation that’s inside that organization. So, keeping up with the pace of innovation in the system, continuing to evolve your documentation for the system, that’s important. But I think it’s hard to do it in the way that we build systems today.

So, it’s not about only keeping that mental model, but keeping documentation and how the system actually looks, the architecture of the system, it’s hard today. I think that’s just a fact. And ways to deal with that, I think it comes down to how the engineering organization is structured, as well. We at Amazon and AWS, we, well, I guess we’re quite famous for our two-pizza teams, the smaller teams that build and run their systems, their services. And it’s very much up to each team to have that exact overview of how their part of the bigger picture works. And that’s our solution for doing that, but as we know, it differs from organization to organization.

Jason: Absolutely. I think that idea of systems being so dynamic that they’re constantly changing, documentation does fall out of step. But when you mentioned tracing, that’s always been one of those really key parts, for me at least coming from a background of doing monitoring and observability. But the idea of having tracing that’s just automatically going to expose things because it’s following that request path. As you dive into this, any advice for listeners about how to approach that, how to approach tracing whether that’s AWS X-Ray or any other tools?

Gunnar: For me, it’s always been important to actually do it. And I think what I sometimes see is that’s something that’s added on later on in the process when people are building. I tend to say that you should start doing it early on because I often think it helps a lot in the development phase as well. So, it shouldn’t be an add-on later on, after the fact. So, starting to use tracing no matter if it’s as you said, X-Ray or any third-party’s service, using it early on, that helps, and it helps a lot while building the system. And we know that there are a bunch of different solutions out there that are really helpful, and many AWS partners that are willing to help with that as well.

Jason: So, we’ve talked a bunch about serverless, but I think your role at AWS encompasses a whole lot of things beyond just serverless. What’s exciting you now about things in the AWS ecosystem, like, what are you talking about that just gets you jazzed up?

Gunnar: One thing that I am talking a lot about right now that is very exciting is fortunately, we’re in line with what we’ve just talked about, with resilience and with reliability. And many of you might have seen the release from AWS recently called AWS Resilience Hub. So, with AWS Resilience Hub, you’re able to make use of all of these best practices that we’ve gathered throughout the years in our AWS Well-Architected Framework that then guides you on the route to building resilient and reliable systems. But we’ve created a service that will then, in an, let’s say, more opinionated but also easier way, will then help you on how to improve your system with resilience in mind. So, that’s one super exciting thing. It’s early days for Resilience Hub, but we’re seeing customers already starting to use it, and already making use of the service to improve on their architecture, use those best practices to then build more resilient and reliable systems.

Jason: So, AWS Resilience Hub is new to me. I haven’t really gotten into it much. As far as I understand it, it really takes the Well-Architected Framework and combines the products or the services from Amazon into that, and as a guide. Is this something for people that have developed a service for them to add on, or is this for people that are about to create a new service, and really helping them start with a framework?

Gunnar: I would say that it’s a great fit if you’ve already built something on AWS because you are then able to describe your application using AWS Resilience Hub. So, if you build it using Infrastructure as Code, or if you have tagging in place, and so on, you can then define your application using that, or describe your application using that. So, you point towards your CloudFormation templates, for instance, and then you’re able to see, these are the parts of my application. Then you’ll set up policies for your application. And the policies, they include the RTO and the RPO targets for your application, for your infrastructure, and so on.

And then you do the assessment of your application. And this then uses the AWS Well-Architected Framework to assess your application based on the policies you created. And it will then see if your application RTO and RPO targets are in line with what you set up in your policies. You will also then get an output with recommendations what you can do to improve the resilience of your application based, once again, on the Well-Architected Framework and all of the best practices that we’ve created throughout the years. So, that means that, for instance, if you’ve built an application that right now is in one single availability zone, well, then Resilience Hub will give you recommendations on how you can improve resilience by spreading your application across multiple availability zones. That could be one example.

It could also be an example of recommending you to choose another data store to have a better RTO or RPO, based on how your application works. Then you’ll implement these changes, hopefully. And at the end, you’ll be able to validate that these new changes then help you reach your targets that you’ve defined. It also integrates with AWS Fault Injection Simulator, so you’re able to actually then run experiments to validate that through the help of this.
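The idea of checking an application against RTO and RPO targets can be illustrated with a small, self-contained Python sketch. This is not the Resilience Hub API or its assessment logic; the `Objective` type, the `assess` function, the disruption names, and all the numbers are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Objective:
    rto_seconds: int  # recovery time objective
    rpo_seconds: int  # recovery point objective


def assess(policy, estimates):
    """Return, per disruption type, whether the estimate meets the target."""
    results = {}
    for disruption, target in policy.items():
        estimate = estimates.get(disruption)
        results[disruption] = (
            estimate is not None
            and estimate.rto_seconds <= target.rto_seconds
            and estimate.rpo_seconds <= target.rpo_seconds
        )
    return results


# Targets the team has agreed on (illustrative values only).
policy = {
    "availability_zone": Objective(rto_seconds=300, rpo_seconds=60),
    "region": Objective(rto_seconds=3600, rpo_seconds=900),
}
# Estimated recovery figures for the current architecture (also illustrative).
estimates = {
    # A single-AZ deployment recovers slowly when that zone is lost.
    "availability_zone": Objective(rto_seconds=7200, rpo_seconds=60),
    "region": Objective(rto_seconds=3600, rpo_seconds=600),
}
print(assess(policy, estimates))
```

In this made-up case the availability zone target is missed, which is the kind of gap that would lead to a recommendation such as spreading the application across multiple availability zones.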

Jason: That’s amazing. So, does it also run those as part of the evaluation, do failure injection to automatically validate and then provide those recommendations? Or, those provided sort of after it does the evaluation, for you to continue to ensure that you’re maintaining your objectives?

Gunnar: It’s the latter. So, you will then get a few experiments recommended based on your application, and you can then easily run those experiments at your convenience. So, it doesn’t run them automatically. As of now, at least.

Jason: That is really cool because I know a lot of people when they’re starting out, it is that idea of you get a tool—no matter what tool that is—for Chaos Engineering, and it’s always that question of, “What do I do?” Right? Like, “What’s the experiment that I should run?” And so this idea of, let’s evaluate your system, determine what your goals are and the things that you can do to meet those, and then also providing that feedback of here’s what you can do to test to ensure it, I think that’s amazing.

Gunnar: Yeah, I think this is super cool one. And as a builder, myself who’s used the Well-Architected Framework as a base when building applications, I know how hard it can be to actually use that. It’s a lot of pages of information to read, to learn how to build using best practices, and having a tool that then helps you to actually validate that, I think it’s great. And then as you mentioned, having recommendations on what experiments to run, it makes it easier to start that Chaos Engineering journey. And that’s something that I have found so interesting through these last, I don’t know, two, three years, seeing how tools like Gremlin, like, now AWS FIS, and with the different open-source tools out there, as well, all of them have helped push that getting-started limit closer to the users. It is so much easier to start with Chaos Engineering these days, which I think is super helpful for everyone wanting to get started today.

Jason: Absolutely. I had someone recently ask me after running a workshop of, “Well, should I use a Chaos Engineering tool or just do my own thing? Like do it manually?” And, you know, the response was like, “Yeah, you could do it manually. That’s an easy, fast way to get started, but given how much effort has been put into all of these tools, there’s just so much available that makes it so much easier.” And you don’t have to think as much about the safety and the edge cases of what if I manually do this thing? What are all the ways that can go wrong? Since there are these tools now that just make it so much easier.

Gunnar: Exactly. And you mentioned safety, and I think that’s a very important part of it. Having that, we’ve always talked about that automated stop button when doing Chaos Engineering experiments and having the control over that in the system where you’re running your experiments, I think that’s one of the key features of all of these Chaos Engineering tools today, to have a way to actually abort the experiments if things start to go wrong.
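The automated stop button can be pictured as a guard that is checked before each step of an experiment, with every injected fault rolled back on the way out. The Python sketch below is a generic illustration, not how any particular Chaos Engineering tool implements it; `run_experiment` and its arguments are hypothetical names.

```python
def run_experiment(steps, is_healthy):
    """Run experiment steps in order, aborting as soon as the guard fails.

    steps: list of (start, rollback) callables.
    is_healthy: callable returning False when the stop condition is breached.
    Returns True if every step ran, False if the experiment was aborted.
    """
    pending_rollbacks = []
    completed = True
    try:
        for start, rollback in steps:
            if not is_healthy():
                completed = False
                break
            # Register the rollback first so a half-applied step is undone too.
            pending_rollbacks.append(rollback)
            start()
    finally:
        # Always undo injected faults, newest first.
        for rollback in reversed(pending_rollbacks):
            rollback()
    return completed
```

In practice `is_healthy` would typically read a monitoring signal, such as an alarm state or an error-rate metric, chosen before the experiment starts.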

Jason: So, we’re getting close to the end of our time here. Gunnar, I wanted to ask if you’ve got anything that you wanted to plug or promote before we wrap up.

Gunnar: What I’d like to promote is the different workshops that we have available that you can use to start getting used to AWS Fault Injection Simulator. I would really like people to get that hands-on experience with AWS Fault Injection Simulator, so get your hands dirty, and actually, run some Chaos Engineering experiments. Even though you are far away from actually doing it in your organization, getting that experience, I think that’s super helpful as the first step. Then you can start thinking about how could I implement this in my organization? So, have a look at the different workshops that we at AWS have available for running Chaos Engineering.

Jason: Yeah, that’s a great thing to promote because it is that thing of when people ask, “Where do I start?” I think we often assume not just that, “Let me try this,” but, “How am I going to roll this out in my organization? How am I going to make the business case for this? Who needs to be involved in it?” And then suddenly it becomes a much larger problem that maybe we don’t want to tackle. Awesome.

Gunnar: Yeah, that’s right.

Jason: So, if people want to find you around the internet, where can they follow you and find out more about what you’re up to?

Gunnar: I am available everywhere, I think. I’m on Twitter at @GunnarGrosch. Hard to spell, but you can probably find it in the description. I’m available on LinkedIn, so do connect there. I have a TikTok account, so maybe I’ll start posting there as well sometimes.

Jason: Fantastic. Well, thanks again for being on the show.

Gunnar: Thank you for having me.

Jason: For links to all the information mentioned, visit our website at gremlin.com/podcast. If you liked this episode, subscribe to the Break Things on Purpose podcast on Spotify, Apple Podcasts, or your favorite podcast platform. Our theme song is called Battle of Pogs by Komiku and is available on loyaltyfreakmusic.com.

Jason Yee
Director of Advocacy