June 23, 2020

Announcing Status Checks to Ensure Safe Chaos Engineering Scenarios

One of the most important aspects of any Chaos Engineering program is knowing that every experiment is being run safely. And one of the simplest ways to ensure safe experiments is by having safeguards that prevent running chaos experiments on a system that is unhealthy or has an incident in progress.

Today, Gremlin is excited to announce Status Checks, which run as steps in a Chaos Engineering Scenario to verify that your system is in a steady state. If there are no disruptions in the system, the Status Check passes and the experiment can run. If the system is unstable, the Status Check fails and no later step in the Scenario runs.

Safely scale your chaos practices and expand the blast radius

Status Checks offer an additional layer of confidence and safety to engineers and organizations regularly scheduling or automating their Chaos Experiments. Status Checks allow you to safely schedule or automate Scenarios that expand the blast radius and magnitude and let those experiments run in the background.

At Gremlin, we use Status Checks to ping PagerDuty and validate there are no active incidents in our environment before we run experiments. We also use Status Checks with Datadog to validate our infrastructure is healthy before and after Chaos Experiments so we know we aren’t introducing more failure into an already unhealthy system.

Status Checks overview

To run, a Status Check calls a publicly available or third-party monitoring or alerting endpoint, such as Datadog, New Relic, or PagerDuty. It then evaluates the status code, the request's response time, and the JSON response body, and passes or fails based on your criteria.
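To make the evaluation concrete, here is a minimal Python sketch of how those three signals could be combined into a pass/fail result. It is not Gremlin's implementation: the `Criteria` class, its field names, the default thresholds, and the `"status": "ok"` response body are all assumptions made up for illustration.

```python
import json
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Criteria:
    """Hypothetical success criteria; names and defaults are illustrative only."""
    expected_status: int = 200
    max_response_ms: float = 1000.0
    json_key: Optional[str] = None   # top-level key to inspect, if any
    expected_value: Any = None       # value that key must equal


def evaluate(status_code: int, elapsed_ms: float, body_text: str,
             criteria: Criteria) -> bool:
    """Return True only if every configured criterion is met."""
    if status_code != criteria.expected_status:
        return False
    if elapsed_ms > criteria.max_response_ms:
        return False
    if criteria.json_key is not None:
        try:
            body = json.loads(body_text)
        except json.JSONDecodeError:
            return False  # an unparseable body cannot satisfy a JSON criterion
        if not isinstance(body, dict):
            return False
        if body.get(criteria.json_key) != criteria.expected_value:
            return False
    return True


# Example: a made-up monitoring response that reports a healthy system.
criteria = Criteria(json_key="status", expected_value="ok")
healthy = evaluate(200, 120.0, '{"status": "ok"}', criteria)
```

In this sketch a check fails as soon as any single criterion is missed, which matches the conservative intent of a safeguard: when in doubt, do not run the experiment.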

Adding a Status Check as a step in your Scenario is easy. First, click the “Add a Status Check” option. Then, define your status endpoint and set any data you would like to send to the endpoint, such as authentication tokens or API keys. Finally, define your success criteria—this could simply be the HTTP response code or you can parse and evaluate data returned by the status endpoint.

Use the drop-down menu on the Status Check form to select Datadog, New Relic, or PagerDuty, which pre-populates part of the form so you can get started quickly, or choose the Custom option to build your own.

Once you’ve added a Status Check to a Scenario, you can add attacks and more Status Checks as needed. The best practice is to add a Status Check before each attack to validate that your service is in a healthy state before introducing failure. In some cases you might also want to add a Status Check at the end of the Scenario to validate that your service returned to its steady state. To run the Scenario, follow the Running a Scenario instructions in the Scenario documentation.
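The ordering described above can be sketched as a short Python loop: steps run in sequence, and the first failed Status Check stops everything after it. This is only an illustration of the behavior, not Gremlin's scheduler; the `run_scenario` function, the step tuples, and the step names are hypothetical, and the lambdas stand in for real checks and attacks.

```python
from typing import Callable, List, Tuple

# (kind, name, action): kind is "status_check" or "attack" in this sketch.
Step = Tuple[str, str, Callable[[], bool]]


def run_scenario(steps: List[Step]) -> List[str]:
    """Run steps in order and return the names of the steps that ran.

    A failed status check halts the Scenario, so no later step executes.
    """
    executed = []
    for kind, name, action in steps:
        executed.append(name)
        passed = action()
        if kind == "status_check" and not passed:
            break
    return executed


# A check before each attack, plus a final check for the steady state.
steps = [
    ("status_check", "no-active-incidents", lambda: True),
    ("attack", "cpu-attack", lambda: True),
    ("status_check", "service-healthy", lambda: False),  # simulated failure
    ("attack", "latency-attack", lambda: True),
    ("status_check", "back-to-steady-state", lambda: True),
]
ran = run_scenario(steps)
# ran == ["no-active-incidents", "cpu-attack", "service-healthy"]
```

Because the second check fails, the latency attack and the final check never run, which is the safeguard that placing a Status Check before each attack is meant to provide.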

Run a Scenario with Status Checks today

Status Checks protect the safety and reliability of your applications when running Chaos Engineering Scenarios. For more information about adding Status Checks to your Scenarios, see the documentation.

Follow this step-by-step tutorial to get your first Status Checks up and running. If you don’t have a Gremlin account yet, sign up for free!

