September 26, 2019

Jason Yee: Lightning Talk: What Should I Monitor? - Chaos Conf 2019


The following is a transcript of Datadog Technical Evangelist Jason Yee’s talk at Chaos Conf 2019, which you can enjoy in the embedded video above. Slides are available here.

Twitter: @gitbisect

Actually big round of applause for the AV guys. They’re doing a fantastic job out there.

So people ask me this all the time: “What should I monitor?” People start throwing around these terms or these frameworks. Should I use RED, which is Rate, Errors, Duration? Should I use USE, which makes you sound like a New Jersey mobster (“Use guys”), and which is Utilization, Saturation and Errors?

And then the new hot thing is the four golden signals, right? If it’s good enough for Google to be promoting this in the SRE book, then it should be good enough for me, right? And people ask me this because I’m a technical evangelist. I am paid to know things and to guide people on their journeys of observability. And I work at Datadog, which is a monitoring platform. So clearly I know things. I work for a monitoring company, I should know how to monitor things. People want advice.

And so I always give them this generic advice of monitoring what matters, right? You should monitor what matters. And that’s a silly answer because what matters? Who decides what matters? And so the complete statement is you should monitor what matters to your customers and therefore your business.

But again that can sometimes be a little bit nebulous. Because often the flow looks like this. You start with customers and you have some business or product people that talk to those customers and then they relay that to software developers as requirements for some new feature or new functionality. And then it ends up landing on Ops or SRE to ensure that this thing is reliable, that it stays up and running and often that’s the monitoring. And the problem here is that when you go from business and you have this intermediary and developers, oftentimes Ops or SRE isn’t sure what to monitor, right? They’re not sure what actually matters because business knows what matters. Developers code something and then Ops is just supposed to run it.

So it follows that, rather than having Ops actually do the monitoring, similar to what Joyce was saying, if we push things left, if we shift them upfront, developers should code something and monitoring. And notice that I don’t say they should “do monitoring.” They should code something: they should code features and they should code monitoring.

If we start thinking about this, if I’m a developer writing a new feature, I’m writing that requirement. I should know what metrics I need to emit. And it’s pretty easy to emit metrics these days whether you’re using Datadog or some other tool. You include some library, you call a function, have it increment a number, have it set a rate or a gauge. It’s not that hard. And if you’re implementing your metrics emission as code, it’s not much harder to actually code up your monitoring. Almost every monitoring tool these days has the ability to use an API to set an alert, right?
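
A minimal sketch of what “call a function, have it increment a number” can look like, using only the Python standard library and the plain-text DogStatsD datagram format (`name:value|type|#tags`) sent over UDP to a local agent on port 8125. The metric names, the `service:checkout` tag, and the `record_checkout` function are hypothetical and are not from the talk; in practice you would more likely use your vendor’s client library than hand-roll this.

```python
import socket


def format_metric(name, value, metric_type, tags=None):
    """Build one DogStatsD datagram, e.g. 'checkout.errors:1|c|#service:checkout'."""
    datagram = f"{name}:{value}|{metric_type}"
    if tags:
        datagram += "|#" + ",".join(tags)
    return datagram


def emit(datagram, host="127.0.0.1", port=8125):
    """Send the datagram over UDP. Fire-and-forget: nothing is read back."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.sendto(datagram.encode("utf-8"), (host, port))
    except OSError:
        # Emitting a metric should never break the feature itself.
        pass


def record_checkout(succeeded, cart_size):
    """Hypothetical feature code that emits its own metrics."""
    tags = ["service:checkout"]
    outcome = "checkout.completed" if succeeded else "checkout.errors"
    datagrams = [
        # "c" is a counter: increment by one per checkout attempt.
        format_metric(outcome, 1, "c", tags),
        # "g" is a gauge: the latest observed value.
        format_metric("checkout.cart_size", cart_size, "g", tags),
    ]
    for datagram in datagrams:
        emit(datagram)
    return datagrams
```

The point of the sketch is that the developer writing the checkout feature is the person who decides that `checkout.errors` is the number that matters, and the emission lives next to the feature code.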

So within Datadog you just import our API and we can set a message and we can say what query you want that to alert on. But the great thing is if I have my monitoring as code, I can just keep it with my project, right? So now I can version it and I have control over it and I can deploy that automatically, right? I can add it to my Jenkins pipeline. So if I’m deploying code out, I’m now deploying all of the monitoring that goes along with that.
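
As a rough sketch of monitoring as code, the monitor below is a plain data structure that can be committed alongside the project, plus a helper that turns it into a request for Datadog’s create-monitor endpoint (`POST /api/v1/monitor`). The monitor name, the `checkout.errors` metric, the threshold, and the `DD_API_KEY` / `DD_APP_KEY` environment variable names are assumptions made for illustration, and the slide from the talk may well have used the Datadog client library instead of raw HTTP.

```python
import json
import os
import urllib.request

# Hypothetical monitor definition, kept in version control with the feature.
CHECKOUT_ERRORS_MONITOR = {
    "name": "Checkout errors are elevated",
    "type": "metric alert",
    "query": "sum(last_5m):sum:checkout.errors{service:checkout}.as_count() > 5",
    "message": "Checkout is failing for customers.",
    "tags": ["service:checkout", "managed-by:code"],
}


def build_create_request(monitor, api_key, app_key,
                         site="https://api.datadoghq.com"):
    """Build (but do not send) the HTTP request that creates the monitor."""
    return urllib.request.Request(
        url=f"{site}/api/v1/monitor",
        data=json.dumps(monitor).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "DD-API-KEY": api_key,
            "DD-APPLICATION-KEY": app_key,
        },
    )


def deploy_monitor(monitor):
    """Create the monitor; meant to be called from a pipeline step."""
    request = build_create_request(
        monitor,
        os.environ["DD_API_KEY"],   # assumed variable names
        os.environ["DD_APP_KEY"],
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

A pipeline stage could call `deploy_monitor(CHECKOUT_ERRORS_MONITOR)` right after the application deploy, so the alert ships with the code it watches. A real setup would also need to update an existing monitor rather than create a duplicate on every deploy, which this sketch leaves out.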

So that all begs the question though. How do we know this actually works? We’re still emitting a bunch of stuff. We think it has some relevance to what we’re actually working on. But how do we know that this works? This is where Chaos comes in. Chaos is a fantastic way to test your monitoring tools. Gremlin has this great API. You can kick off all of your Gremlin attacks using an API. It’s fantastic. And if you’re kicking off all your Gremlin attacks, well then you can add that again to your CI pipeline. Now I can deploy my code, I can deploy my monitoring, and I can deploy a test to test that monitoring.

So I now have this monitoring code. And rather than sending a message to wake up an SRE or an Ops person in the middle of the night, I can just feed that back in as the results of my tests and completely automate this.
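
One way to picture that feedback loop is a test that starts an attack, watches the monitor, and passes only if the monitor actually fires. The sketch below is deliberately vendor-neutral: `start_attack`, `stop_attack`, and `get_monitor_state` are hypothetical callables you would implement on top of the Gremlin and Datadog APIs, and the `"Alert"` string assumes the monitor status is read from Datadog, which reports a triggered monitor with that overall state.

```python
import time


def monitor_fires_during_attack(start_attack, stop_attack, get_monitor_state,
                                timeout_s=300, poll_s=10,
                                sleep=time.sleep, clock=time.monotonic):
    """Return True if the monitor reaches 'Alert' while the attack is running.

    start_attack()      -> returns an attack id (hypothetical wrapper)
    stop_attack(id)     -> halts that attack (hypothetical wrapper)
    get_monitor_state() -> returns the monitor's current state as a string
    """
    attack_id = start_attack()
    try:
        deadline = clock() + timeout_s
        while clock() < deadline:
            if get_monitor_state() == "Alert":
                return True   # the monitoring caught the failure
            sleep(poll_s)
        return False          # the failure went unnoticed: the test fails
    finally:
        # Always halt the experiment, whether or not the monitor fired.
        stop_attack(attack_id)
```

In a CI job this would sit behind an ordinary assertion, for example `assert monitor_fires_during_attack(...)`, so a monitor that stays silent during the attack fails the build instead of paging someone at night. The `sleep` and `clock` parameters exist only so the loop can be exercised quickly with fakes.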

It’s what I like to call TDD for monitoring. We’ve had this notion of writing tests upfront for decades and yet we haven’t really been doing it with the operational functionality of our code. We write unit tests. We might do some integration tests, but what about for the rest of what’s required for running our code in production?

So I like to say that moving monitoring upfront with Chaos, rather than just being test-driven development, is resilience-driven development. If we start to push these upfront, we make resilience a first-class citizen. Because ultimately your Ops or your SRE team, they can provide platforms. They can run things that are resilient. They can put Kubernetes in place or OpenShift or whatever platform you want. But it’s up to developers to build resilient applications, and resilient applications are what actually matters to your customers.

So resilience driven development. Validate your application, validate the functionality with monitoring, not with unit tests, not with integration tests. Actually be able to see that you’re doing the things that you want to be able to do and you’re monitoring those. Run those tests with Chaos. Start to think about Chaos testing, not just engineering. There’s value in Chaos as discovery, but there’s also value in Chaos as validating what you’re supposed to be doing.

And finally, monitoring and Chaos upfront. Stop thinking about it as an after-the-fact thing of “let’s deploy out and then maybe we’ll do Chaos.” Let’s push it upfront. Do it from the start. Thanks.

