How to safely run experiments with Gremlin’s Status Checks and Datadog Monitoring

Last Updated: October 22, 2020

Topics: Datadog, Alerts, Scenarios

This is an older tutorial and may not represent the latest or most up-to-date information. If anything in this tutorial is incorrect, please let us know.

Update

Status Checks are now called Health Checks. Please see our documentation for more information.

Chaos Engineering experiments are controlled, precise attacks designed to reveal weaknesses in our systems. Once your hypothesis is proven or a bug is found, there is no need to continue the attack. Whether you’re checking your monitoring and alerting, scheduling Scenarios, or running them as part of your pipeline, adding Status Checks is a great way to automatically halt an experiment and roll back its impact without human intervention. Using Scenarios, we can string together a series of attacks that approach a monitor’s threshold, then halt the attack when the monitor is triggered. This helps us confirm that we’ve set the right monitor for our use case.

Prerequisites

A Gremlin account (request a free trial)
A host with both a Gremlin agent and Datadog agent installed

Step 1: Get your Datadog API Key and Application Key

We need API and Application keys to programmatically access our Datadog monitor endpoint. Head over to https://app.datadoghq.com/account/settings#api and create both an API Key and an Application Key. You’ll need Standard or Admin permissions. Save both keys for Step 3.

Step 2: Create a Datadog Monitor

We need to create a monitor to watch for high CPU utilization. In Datadog, go to Monitors -> New Monitor in the left navigation bar. Select “Metrics” as the type of monitor. Set the metric to system.cpu.idle and select your host. Set the monitor to trigger when the metric is below the threshold at least once during the last 1 minute, and set the threshold to 60%. This will trigger an alert if idle CPU drops below 60% or, in other words, if usage rises above 40%. This is a very sensitive alert, so tune it to your internal response needs to avoid noisy alerts. Most of the time, averaging over the past 1, 5, or 10 minutes is enough to keep alerts from being too noisy while still firing in time to catch an issue.
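
To make the threshold arithmetic concrete, here is a minimal Python sketch of the two evaluation styles described above. It only simulates the comparison locally and is not Datadog's implementation; the function names and the sample window are made up for illustration.

```python
from statistics import mean
from typing import Sequence

IDLE_THRESHOLD = 60.0  # percent idle CPU, matching the monitor threshold above


def usage_from_idle(idle_pct: float) -> float:
    """Convert an idle-CPU percentage into a usage percentage."""
    return 100.0 - idle_pct


def triggers_at_least_once(idle_samples: Sequence[float],
                           threshold: float = IDLE_THRESHOLD) -> bool:
    """Alert if any single sample in the window is below the threshold."""
    return any(sample < threshold for sample in idle_samples)


def triggers_on_average(idle_samples: Sequence[float],
                        threshold: float = IDLE_THRESHOLD) -> bool:
    """Alert only if the average over the window is below the threshold."""
    return mean(idle_samples) < threshold


# Hypothetical one-minute window of idle readings with one brief CPU spike.
window = [88.0, 90.0, 55.0, 89.0]

print(usage_from_idle(IDLE_THRESHOLD))  # 40.0
print(triggers_at_least_once(window))   # True: one sample dipped below 60% idle
print(triggers_on_average(window))      # False: the average is 80.5% idle
```

The single spike trips the “at least once” condition but not the averaged one, which is why the averaged form tends to be less noisy.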

Take note of your baseline CPU level, since you’ll need it when sizing the attacks in Step 4. For example, Datadog is showing a pre-attack usage level of 10.85%.

Click “Save”, then click “Export Monitor” and copy the monitor ID. In this example, the monitor ID was 22683435.

Step 3: Create a Gremlin Status Check

As an alternative to the next few steps, you can customize Gremlin’s recommended scenario with your own monitor_id, API Key, and Application Key, then fill in the JSON tag with the information below.

In Gremlin, we’ll add a Status Check to our Status Check library so we can reuse it in future Scenarios or as a template for new Status Checks. This Status Check will query our Datadog monitor and automatically halt any attack if the monitor is in “Alert” status.

Go to https://app.gremlin.com/health-checks/list and click “New Status Check”. Turn on Continuous Status Check, which checks Datadog every 10 seconds for a change in status. Give the Status Check a name and description. Add the endpoint v1/monitor/{monitor_id}, substituting the monitor ID from Step 2, along with your API and Application keys from Step 1. Then click “Test Request” and make sure you get a 200 response.
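
If you want to see what this request looks like outside the Gremlin UI, the sketch below builds the equivalent call with Python’s standard library. It assumes the US Datadog site (api.datadoghq.com) and the DD-API-KEY / DD-APPLICATION-KEY header names used by Datadog’s API; the helper name and the DD_API_KEY / DD_APP_KEY environment variable names are illustrative choices, not part of the tutorial.

```python
import os
import urllib.request

# Assumption: the US Datadog site. Other Datadog sites use a different host.
DATADOG_API_BASE = "https://api.datadoghq.com/api"


def build_monitor_request(monitor_id: int, api_key: str,
                          app_key: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for a single monitor."""
    url = f"{DATADOG_API_BASE}/v1/monitor/{monitor_id}"
    headers = {
        "DD-API-KEY": api_key,
        "DD-APPLICATION-KEY": app_key,
        "Accept": "application/json",
    }
    return urllib.request.Request(url, headers=headers, method="GET")


# 22683435 is the example monitor ID from Step 2; the keys are placeholders
# unless the (hypothetical) environment variables are set.
request = build_monitor_request(
    22683435,
    os.environ.get("DD_API_KEY", "<api-key>"),
    os.environ.get("DD_APP_KEY", "<application-key>"),
)
print(request.get_method(), request.full_url)

# To actually send it (needs valid keys and network access):
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

Keeping the keys in environment variables rather than in the script avoids committing them to source control by accident.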

Next, leave the healthy status code at 200, but change the request timeout to 1500. This gives Datadog a little more time to respond. Set the healthy response body criteria to overall_state String = OK. Gremlin will check the response JSON for that key-value pair. If the value is anything else, for example “Alert”, the Status Check will halt the attack.
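
The evaluation rule above can be sketched in a few lines of Python. This is only an illustration of the logic (status code 200 and overall_state equal to “OK”), not Gremlin’s code; the function name and the trimmed-down response bodies are hypothetical.

```python
import json

HEALTHY_STATUS_CODE = 200
HEALTHY_OVERALL_STATE = "OK"


def is_healthy(status_code: int, body: str) -> bool:
    """Return True only for a 200 response whose JSON has overall_state == "OK"."""
    if status_code != HEALTHY_STATUS_CODE:
        return False
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        # A body that is not valid JSON cannot satisfy the criteria.
        return False
    return (isinstance(payload, dict)
            and payload.get("overall_state") == HEALTHY_OVERALL_STATE)


# Hypothetical, heavily trimmed monitor responses.
ok_body = json.dumps({"id": 22683435, "overall_state": "OK"})
alert_body = json.dumps({"id": 22683435, "overall_state": "Alert"})

print(is_healthy(200, ok_body))     # True
print(is_healthy(200, alert_body))  # False: the monitor is alerting
print(is_healthy(500, ok_body))     # False: unhealthy status code
```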

Click “Test Evaluation”, then click “Save”.

Step 4: Create a Gremlin Scenario

Next, we’re going to add that Status Check to a Scenario with a progression of CPU attacks below and above the threshold. Go to “Scenarios” or https://app.gremlin.com/scenarios/my-scenarios. Click “New Scenario”. Give the Scenario a name and description. Click “Add a Status Check”, choose the Status Check you made in Step 3 and click “Add to Scenario”.

Select “Add a new attack” and select the host that Datadog is monitoring.

Next, select “Choose a gremlin” and select Resource -> CPU. Set the length to 120 seconds and the CPU Capacity to 20% (a 30% target minus the roughly 10% pre-attack usage noted in Step 2), select All Cores, and click “Add to Scenario”.

Next, click “Add a new attack”. The target and configuration from the previous attack will be pre-populated. Adjust the CPU Capacity to 40% and click “Add to Scenario”.

Finally, add one more attack with 60% CPU Capacity and click “Add to Scenario”. Click “Save Scenario”.
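
As a quick sanity check on the three stages, the sketch below estimates total CPU usage per stage and finds the first one expected to cross the monitor’s 40% usage threshold. It follows this tutorial’s simplifying assumption that attack capacity adds to the baseline usage; real usage will vary, and the names are illustrative.

```python
from typing import Optional, Sequence

BASELINE_USAGE = 10.0                  # percent, rounded pre-attack usage from Step 2
ALERT_USAGE = 40.0                     # monitor fires when idle < 60%, i.e. usage > 40%
STAGE_CAPACITIES = [20.0, 40.0, 60.0]  # CPU Capacity of each attack in the Scenario


def expected_usage(baseline: float, capacity: float) -> float:
    """Estimate total usage, assuming attack load adds to the baseline."""
    return min(100.0, baseline + capacity)


def first_breaching_stage(baseline: float, capacities: Sequence[float],
                          alert_usage: float) -> Optional[int]:
    """Return the 1-based stage expected to cross the threshold, if any."""
    for stage, capacity in enumerate(capacities, start=1):
        if expected_usage(baseline, capacity) > alert_usage:
            return stage
    return None


for stage, capacity in enumerate(STAGE_CAPACITIES, start=1):
    print(stage, expected_usage(BASELINE_USAGE, capacity))  # 30.0, 50.0, 70.0

print(first_breaching_stage(BASELINE_USAGE, STAGE_CAPACITIES, ALERT_USAGE))  # 2
```

Under these assumptions the first attack stays under the threshold and the second crosses it, so the third should never run.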

Step 5: Run the Scenario and watch the results

Next, we’ll run the Scenario and see if our alerting fires. Click “Run Scenario”. Gremlin will run through the first attack, then, about 3 minutes in, the Status Check will halt the second attack once utilization rises above the 40% threshold.
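
The halting behavior can be pictured as a simple polling loop. The sketch below is a local simulation under stated assumptions, not Gremlin’s implementation: the monitor states are faked, the function name is made up, and the sleep is replaced with a no-op so the example finishes instantly.

```python
import time
from typing import Callable, Optional

HEALTHY_STATE = "OK"


def watch_monitor(
    fetch_state: Callable[[], str],
    max_checks: int,
    interval_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[int]:
    """Poll the monitor state; return the 1-based check that found it unhealthy.

    Returns None if every check stayed healthy.
    """
    for check_number in range(1, max_checks + 1):
        if fetch_state() != HEALTHY_STATE:
            return check_number  # this is where the attack would be halted
        sleep(interval_seconds)
    return None


# Simulated monitor: healthy through the first 120-second attack
# (12 checks at 10-second intervals), then alerting during the second.
simulated_states = iter(["OK"] * 12 + ["Alert"])

halted_at = watch_monitor(
    fetch_state=lambda: next(simulated_states),
    max_checks=36,                # three 120-second attacks, one check per 10 seconds
    sleep=lambda _seconds: None,  # skip real waiting in this simulation
)
print(halted_at)  # 13
```

In a real run the monitor also needs time to evaluate and change state, so the halt lands somewhat after the threshold is crossed rather than on the very next check.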

We can see that Gremlin halts the Scenario before it reaches the 60% attack. In Datadog, the chart shows that the first attack did not breach the threshold, but the second attack did.

Conclusion

You’ve now added a Scenario with a Status Check! Try out a few different Status Checks with different monitoring alerts from Datadog, such as error rates or latency thresholds. This is a great way to test your alerting, and also a great safety mechanism to ensure that attacks stop when an impact is noticed. It lets you comfortably schedule and automate attacks in the future, and you can set up notifications for when a Status Check halts an attack, so you know whether your scheduled attack passed or failed.
