How to Create Chaos Engineering Dashboards with Datadog and Gremlin

Here at Gremlin we run GameDays on our own infrastructure using Gremlin. To ensure we have visibility, we use a variety of monitoring and observability tools, including Datadog, Sentry, and AWS AutoScaling Activity History for tracking Auto Scaling Group (ASG) events. During our most recent GameDay, we created a GameDay dashboard specifically to monitor our Staging Environment Control Plane. We were then ready to use and modify this dashboard to monitor Chaos Engineering attacks in real time.

Let's walk through how we created the Chaos Engineering Dashboard for our GameDay. This dashboard was created by our team here at Gremlin (Phil Gebhardt); it does not show up automatically in Datadog. To run GameDays you will need a Gremlin account and a Datadog account.

API Metrics

To ensure we have visibility of our API, we measure sum of calls by path, server errors by path, latency by path, error rate, sum of calls by host, and sum of client errors by status.

The overall API metrics dashboard is viewable in Datadog as follows:

[Image: API metrics overview]

Next we will walk through each metric that goes into the Control Plane API overview dashboard for our Staging environment.
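As a rough text companion to the widget screenshots, the Python sketch below builds query strings in Datadog's `aggregator:metric{scope} by {tag}` form for five of the API widgets (error rate is a ratio and is left out here). The metric names (`example.api.*`), the `env:staging` scope, and the `path`, `host`, and `status` tag names are hypothetical placeholders for illustration; substitute whatever your own API reports.

```python
def build_query(aggregator, metric, scope, group_by, as_count=False):
    """Return a Datadog-style metric query string grouped by one tag."""
    query = f"{aggregator}:{metric}{{{scope}}} by {{{group_by}}}"
    if as_count:
        # .as_count() asks for raw counts per interval rather than a rate.
        query += ".as_count()"
    return query


SCOPE = "env:staging"  # assumed tag; use your own environment tag

# Hypothetical metric names, one entry per widget title.
API_WIDGETS = {
    "API - sum of calls by path": build_query(
        "sum", "example.api.calls", SCOPE, "path", as_count=True),
    "API - Server errors by path": build_query(
        "sum", "example.api.errors.5xx", SCOPE, "path", as_count=True),
    "API - Latency by path": build_query(
        "avg", "example.api.latency", SCOPE, "path"),
    "API - Sum of calls by host": build_query(
        "sum", "example.api.calls", SCOPE, "host", as_count=True),
    "API - Sum of client errors by status": build_query(
        "sum", "example.api.errors.4xx", SCOPE, "status", as_count=True),
}

for title, query in API_WIDGETS.items():
    print(f"{title}: {query}")
```

Grouping by a tag such as `path` produces one line per path, which makes it easier to see whether an attack affects a single endpoint or the whole API.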

API - sum of calls by path

[Image: API sum of calls by path]

API - Server errors by path

[Image: API server errors by path]

API - Latency by path

[Image: API latency by path]

API - Error rate

[Image: API error rate]
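An error rate is a ratio: errors divided by total calls over the same time window, usually shown as a percentage. The sketch below shows that arithmetic, plus one way the same ratio could be written as a single Datadog query. The metric names and the `env:staging` scope are hypothetical, and whether the numerator counts only server errors or all errors is a choice for your team.

```python
def error_rate_percent(error_count, call_count):
    """Errors as a percentage of calls; 0.0 when there were no calls."""
    if call_count == 0:
        return 0.0
    return 100.0 * error_count / call_count


def error_rate_query(errors_metric, calls_metric, scope):
    """Build a Datadog-style arithmetic query for the same ratio."""
    errors = f"sum:{errors_metric}{{{scope}}}.as_count()"
    calls = f"sum:{calls_metric}{{{scope}}}.as_count()"
    return f"( {errors} / {calls} ) * 100"


# Hypothetical metric names, for illustration only.
query = error_rate_query("example.api.errors", "example.api.calls", "env:staging")
print(query)
print(error_rate_percent(5, 200))  # 5 errors in 200 calls -> 2.5
```

Plotting the ratio rather than the raw error count keeps the graph meaningful when traffic rises or falls during an attack.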

API - Sum of calls by host

[Image: API sum of calls by host]

API - Sum of client errors by status

[Image: API sum of client errors by status]

Database Metrics

To ensure we have visibility of our Staging Database (AWS DynamoDB), we measure sum of calls by table, errors by table, latency by table, and error rate.

The overall Database metrics dashboard is viewable in Datadog as follows:

[Image: Database chaos dashboard]

Next we will walk through each metric that goes into the Control Plane Database overview dashboard for our Staging environment.
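To make the structure of these widgets concrete, here is a minimal sketch of how the four database graphs could be described as Datadog timeseries widget definitions. The `example.db.*` metric names, the `table` tag, and the `env:staging` scope are assumptions made for illustration; check the Metrics Explorer in your own account for the names your DynamoDB metrics actually use.

```python
import json


def timeseries_widget(title, query, display_type="line"):
    """Return a timeseries widget definition in Datadog's dashboard JSON shape."""
    return {
        "definition": {
            "type": "timeseries",
            "title": title,
            "requests": [{"q": query, "display_type": display_type}],
        }
    }


SCOPE = "env:staging"  # assumed tag

# Hypothetical metric and tag names; replace them with your own.
DB_WIDGETS = [
    timeseries_widget(
        "Database - sum of calls by table",
        f"sum:example.db.calls{{{SCOPE}}} by {{table}}.as_count()",
        display_type="bars",
    ),
    timeseries_widget(
        "Database - errors by table",
        f"sum:example.db.errors{{{SCOPE}}} by {{table}}.as_count()",
        display_type="bars",
    ),
    timeseries_widget(
        "Database - latency by table",
        f"avg:example.db.latency{{{SCOPE}}} by {{table}}",
    ),
    timeseries_widget(
        "Database - error rate",
        f"( sum:example.db.errors{{{SCOPE}}}.as_count() / "
        f"sum:example.db.calls{{{SCOPE}}}.as_count() ) * 100",
    ),
]

print(json.dumps(DB_WIDGETS[0], indent=2))
```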

Database - sum of calls by table

[Image: Database sum of calls by table]

Database - errors by table

[Image: Database errors by table]

Database - latency by table

[Image: Database latency by table]

Database - error rate

[Image: Database error rate]

System Metrics

To ensure we have visibility of our Staging System Metrics (AWS EC2) we measure Idle CPU by host, free memory by host and uptime by host.

The overall System metrics dashboard is viewable in Datadog as follows:

[Image: System metrics]

System Metrics - Idle CPU by host

[Image: System metrics idle CPU by host]

System Metrics - free memory by host

[Image: System metrics free memory by host]

System Metrics - uptime by host

[Image: System metrics uptime by host]
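The three system graphs above can be sketched as a small dashboard payload. `system.cpu.idle`, `system.mem.free`, and `system.uptime` are metrics reported by the Datadog Agent's standard system checks; the `env:staging` scope, the dashboard title, and the description are assumptions. The code only builds and prints the JSON. It does not send it anywhere, but the shape is intended to match what Datadog's dashboard API (`POST /api/v1/dashboard`) accepts.

```python
import json


def host_widget(title, metric, scope="env:staging"):
    """Timeseries widget showing one metric averaged per host."""
    query = f"avg:{metric}{{{scope}}} by {{host}}"
    return {
        "definition": {
            "type": "timeseries",
            "title": title,
            "requests": [{"q": query, "display_type": "line"}],
        }
    }


dashboard = {
    "title": "GameDay - Staging system metrics",  # assumed title
    "description": "System metrics to watch during Chaos Engineering attacks.",
    "layout_type": "ordered",
    "widgets": [
        host_widget("System Metrics - Idle CPU by host", "system.cpu.idle"),
        host_widget("System Metrics - free memory by host", "system.mem.free"),
        host_widget("System Metrics - uptime by host", "system.uptime"),
    ],
}

print(json.dumps(dashboard, indent=2))
```

Uptime by host is a useful companion to the other two graphs: a sudden drop back toward zero indicates that a host restarted or was replaced, for example after a shutdown attack.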

Conclusion

Creating custom dashboards for your GameDays will enable you to monitor your Chaos Engineering attacks in real time. They can also be used to answer new questions asked by attendees as your GameDay progresses. Being able to modify dashboards to answer questions on the fly with real data is incredibly valuable and a great skill for your engineering team to develop. To get started running GameDays, sign up for Gremlin Free.
