Here at Gremlin we run GameDays on our own infrastructure using Gremlin. To ensure we have visibility, we use a variety of monitoring and observability tools, including Datadog, Sentry, and the AWS Auto Scaling activity history for tracking Auto Scaling group (ASG) events. During our most recent GameDay, we created a dashboard specifically to monitor our Staging environment Control Plane. We were then ready to use and modify this dashboard to monitor Chaos Engineering attacks in real time.
Let’s walk through how we created the Chaos Engineering dashboard for our GameDay. This dashboard was built by our team here at Gremlin (Phil Gebhardt); it does not show up automatically in Datadog. To run GameDays you will need a Gremlin account and a Datadog account.
To ensure we have visibility of our API, we measure the sum of calls by path, server errors by path, latency by path, error rate, the sum of calls by host, and the sum of client errors by status.
The overall API metrics dashboard is viewable in Datadog as follows:
Next we will walk through how we calculate each metric to ensure we are able to create a Control Plane API overview dashboard for our Staging environment.
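As a rough illustration of what these per-path numbers mean, here is a minimal Python sketch that derives them from raw request records. It is not the implementation behind the dashboard: the `Request` record, its field names, and the definition of error rate as server errors divided by total calls are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record; field names are assumptions."""
    path: str
    host: str
    status: int
    latency_ms: float


def summarize_by_path(requests):
    """Aggregate call counts, errors, average latency, and error rate per path."""
    totals = defaultdict(
        lambda: {"calls": 0, "server_errors": 0, "client_errors": 0, "latency_total_ms": 0.0}
    )
    for r in requests:
        t = totals[r.path]
        t["calls"] += 1
        t["latency_total_ms"] += r.latency_ms
        if 500 <= r.status <= 599:
            t["server_errors"] += 1
        elif 400 <= r.status <= 499:
            t["client_errors"] += 1

    summary = {}
    for path, t in totals.items():
        summary[path] = {
            "calls": t["calls"],
            "server_errors": t["server_errors"],
            "client_errors": t["client_errors"],
            "avg_latency_ms": t["latency_total_ms"] / t["calls"],
            # Assumed definition: share of calls that ended in a 5xx response.
            "error_rate": t["server_errors"] / t["calls"],
        }
    return summary


if __name__ == "__main__":
    sample = [
        Request("/attacks", "host-1", 200, 10.0),
        Request("/attacks", "host-1", 500, 30.0),
        Request("/users", "host-2", 404, 5.0),
    ]
    for path, stats in summarize_by_path(sample).items():
        print(path, stats)
```

The same grouping idea applies to the other breakdowns listed above: swap the grouping key from `path` to `host` or `status` to get calls by host or client errors by status.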
To ensure we have visibility of our Staging database (AWS DynamoDB), we measure the sum of calls by table, errors by table, latency by table, and error rate.
The overall Database metrics dashboard is viewable in Datadog as follows:
Next we will walk through how we calculate each metric to ensure we are able to create a Control Plane Database overview dashboard for our Staging environment.
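To make the per-table breakdown concrete, the sketch below assembles Datadog-style metric query strings grouped by a table tag, plus an error-rate expression built by dividing one query by another. The metric names (`example.dynamodb.calls` and so on), the `table` tag, and the `env:staging` scope are placeholders chosen for illustration, not the names used on the dashboard; substitute whatever your own integration reports.

```python
def metric_query(aggregator, metric, scope, group_by, as_count=False):
    """Build a query string of the form 'agg:metric{scope} by {tag}'."""
    query = f"{aggregator}:{metric}{{{scope}}} by {{{group_by}}}"
    if as_count:
        query += ".as_count()"
    return query


SCOPE = "env:staging"  # assumed tag for the Staging environment
GROUP = "table"        # assumed tag that identifies the DynamoDB table

# Placeholder metric names; replace with the metrics your account reports.
calls = metric_query("sum", "example.dynamodb.calls", SCOPE, GROUP, as_count=True)
errors = metric_query("sum", "example.dynamodb.errors", SCOPE, GROUP, as_count=True)
latency = metric_query("avg", "example.dynamodb.latency", SCOPE, GROUP)

# Error rate as a percentage: errors divided by calls, per table.
error_rate = f"({errors} / {calls}) * 100"

if __name__ == "__main__":
    for name, q in [("calls", calls), ("errors", errors),
                    ("latency", latency), ("error rate", error_rate)]:
        print(f"{name}: {q}")
```

Keeping the scope and grouping tag in one place makes it easy to point the same set of graphs at a different environment or to regroup by another tag during a GameDay.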
To ensure we have visibility of our Staging system metrics (AWS EC2), we measure idle CPU by host, free memory by host, and uptime by host.
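For readers who prefer to define such a dashboard as data, here is a hedged sketch of a dashboard definition with one timeseries widget per system metric. It assumes the Datadog Agent's standard `system.cpu.idle`, `system.mem.free`, and `system.uptime` metrics and an `env:staging` tag on the hosts; the dashboard title, the tag, and the choice of `avg` aggregation are assumptions, and the sketch only builds and prints the JSON rather than sending it anywhere.

```python
import json

SCOPE = "env:staging"  # assumed tag for Staging hosts


def timeseries_widget(title, metric):
    """Return a timeseries widget that graphs one metric, split by host."""
    return {
        "definition": {
            "type": "timeseries",
            "title": title,
            "requests": [{"q": f"avg:{metric}{{{SCOPE}}} by {{host}}"}],
        }
    }


dashboard = {
    "title": "GameDay: Staging System Metrics",  # hypothetical title
    "layout_type": "ordered",
    "widgets": [
        timeseries_widget("Idle CPU by host", "system.cpu.idle"),
        timeseries_widget("Free memory by host", "system.mem.free"),
        timeseries_widget("Uptime by host", "system.uptime"),
    ],
}

if __name__ == "__main__":
    # Print the definition so it can be reviewed or stored in version control.
    print(json.dumps(dashboard, indent=2))
```

Uptime is a useful companion to the other two graphs during a GameDay, since a host that reboots or is replaced shows up as its uptime dropping back to zero.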
The overall System metrics dashboard is viewable in Datadog as follows:
Creating custom dashboards for your GameDays will enable you to monitor your Chaos Engineering attacks in real time. They can also be used to answer new questions asked by attendees as your GameDay progresses. Being able to modify dashboards to answer questions on the fly with real data is incredibly valuable and a great skill for your engineering team to develop. To get started running GameDays, sign up for Gremlin Free.
Gremlin empowers you to proactively root out failure before it causes downtime. See how you can harness chaos to build resilient systems by requesting a demo of Gremlin.