One of the most important aspects of any Chaos Engineering program is knowing that every experiment is being run safely. And one of the simplest ways to ensure safe experiments is by having safeguards that prevent running chaos experiments on a system that is unhealthy or has an incident in progress.
Today, Gremlin is excited to announce Status Checks, which run before you kick off a Chaos Engineering Scenario to verify that your system is in a steady state. If there are no disruptions in the system, the Status Check passes and the experiment can run. If the system is unstable, the Status Check fails and prevents every later step in the Scenario from running.
Update: We've now enabled Continuous Status Checks for all Gremlin Starter, Pro, and Enterprise customers.
A Continuous Status Check evaluates the safety and health of your system throughout the duration of the experiment. A Status Check can evaluate if your system is healthy in between attacks and a Continuous Status Check can evaluate how your system is handling the failure during an attack. Enable a Status Check to run continuously by using the toggle on the configuration form.
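As a rough illustration of the difference, the sketch below re-evaluates a health probe at every interval of a simulated attack instead of only once beforehand. It is not Gremlin's implementation: the function name, the interval-based loop, and the choice to stop the attack on the first failed probe are all assumptions made for this example.

```python
from typing import Callable


def run_attack_with_continuous_check(
    total_intervals: int,
    is_healthy: Callable[[int], bool],
) -> int:
    """Simulate an attack lasting `total_intervals` polling intervals.

    `is_healthy` is a hypothetical probe called once per interval.
    Returns the number of intervals that completed before the attack
    ended or was halted (assumption: a failed probe halts the attack).
    """
    completed = 0
    for interval in range(total_intervals):
        if not is_healthy(interval):
            break  # system degraded mid-attack: stop injecting failure
        completed += 1
    return completed


# Probe that reports healthy for the first three intervals only.
print(run_attack_with_continuous_check(5, lambda i: i < 3))  # prints 3
```

A one-time check would only have seen the healthy state before the attack; the per-interval probe is what notices the degradation at the fourth interval.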
Status Checks offer an additional layer of confidence and safety to engineers and organizations regularly scheduling or automating their Chaos Experiments. Status Checks allow you to safely schedule or automate Scenarios that expand the blast radius and magnitude and let those experiments run in the background.
At Gremlin, we use Status Checks to ping PagerDuty and validate there are no active incidents in our environment before we run experiments. We also use Status Checks with Datadog to validate our infrastructure is healthy before and after Chaos Experiments so we know we aren’t introducing more failure into an already unhealthy system.
To run, a Status Check calls a publicly available or third-party monitoring or alerting endpoint, such as Datadog, New Relic, or PagerDuty. It evaluates the status code, the response time, and the JSON response body, then passes or fails based on your criteria.
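To make the pass/fail decision concrete, here is a minimal sketch of how a response could be compared against success criteria. The `Criteria` class, its field names, and the `status` key in the sample body are hypothetical and chosen for illustration; they are not Gremlin's configuration schema.

```python
import json
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Criteria:
    """Hypothetical success criteria; field names are illustrative only."""
    expected_status: int = 200
    max_response_ms: Optional[float] = None
    json_key: Optional[str] = None   # top-level key to read from the JSON body
    json_expected: Any = None        # value that key must equal


def evaluate(status_code: int, elapsed_ms: float, body: str, criteria: Criteria) -> bool:
    """Return True only if every configured criterion is satisfied."""
    if status_code != criteria.expected_status:
        return False
    if criteria.max_response_ms is not None and elapsed_ms > criteria.max_response_ms:
        return False
    if criteria.json_key is not None:
        try:
            payload = json.loads(body)
        except json.JSONDecodeError:
            return False  # an unparseable body cannot satisfy a JSON criterion
        if not isinstance(payload, dict):
            return False
        if payload.get(criteria.json_key) != criteria.json_expected:
            return False
    return True


criteria = Criteria(expected_status=200, max_response_ms=500,
                    json_key="status", json_expected="ok")
print(evaluate(200, 120.0, '{"status": "ok"}', criteria))        # True
print(evaluate(200, 120.0, '{"status": "degraded"}', criteria))  # False
```

Treating any unmet criterion as a failure keeps the check conservative: when the result is ambiguous, the experiment does not proceed.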
Adding a Status Check as a step in your Scenario is easy. First, click the “Add a Status Check” option. Then, define your status endpoint and set any data you would like to send to the endpoint, such as authentication tokens or API keys. Finally, define your success criteria—this could simply be the HTTP response code or you can parse and evaluate data returned by the status endpoint.
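The same three pieces (endpoint, request data, and a response to evaluate) can be sketched outside the form using only Python's standard library. The URL, the bearer-token header, and the helper names below are assumptions made for illustration; each monitoring service defines its own endpoints and authentication headers, so consult its API documentation for the real values.

```python
import time
import urllib.error
import urllib.request
from typing import Tuple


def build_status_request(url: str, api_key: str) -> urllib.request.Request:
    """Build a GET request carrying a (hypothetical) bearer token."""
    return urllib.request.Request(
        url,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Accept": "application/json",
        },
        method="GET",
    )


def fetch_status(request: urllib.request.Request,
                 timeout_s: float = 5.0) -> Tuple[int, float, str]:
    """Send the request; return (status_code, elapsed_ms, body_text)."""
    started = time.monotonic()
    try:
        with urllib.request.urlopen(request, timeout=timeout_s) as response:
            status = response.status
            body = response.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses still carry a status code and body to evaluate.
        status = err.code
        body = err.read().decode("utf-8", "replace")
    elapsed_ms = (time.monotonic() - started) * 1000.0
    return status, elapsed_ms, body


# Placeholder endpoint and key; nothing is sent until fetch_status is called.
request = build_status_request("https://status.example.com/health", "YOUR_API_KEY")
print(request.get_method(), request.full_url)
```

The tuple returned by `fetch_status` holds the three values a check would compare against its success criteria: the HTTP status code, the response time, and the response body.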
Use the drop-down menu on the Status Check form to select Datadog, New Relic, or PagerDuty, which pre-populates part of the form so you can get started quickly, or choose the Custom option to build your own.
Once you’ve added a Status Check to a Scenario, you can add attacks and more Status Checks as needed. The best practice is to add a Status Check before each attack to validate your service is in a healthy state before introducing failure. In some cases you might want to add a Status Check at the end of the Scenario to validate your service returned to its steady state. To run the Scenario, follow the instructions in the Scenario document for Running a Scenario.
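The ordering described above, with a check before each attack and a final check at the end, can be modeled as a simple sequential runner. This is a simplified sketch rather than Gremlin's Scenario engine: the step tuples, the step names, and the `run_scenario` function are hypothetical, and the attacks are stand-in callables that do nothing.

```python
from typing import Callable, List, Tuple

# Each step is (kind, name, action); kind is "check" or "attack".
Step = Tuple[str, str, Callable[[], bool]]


def run_scenario(steps: List[Step]) -> Tuple[bool, List[str]]:
    """Run steps in order and stop at the first failed Status Check.

    Returns (completed, names_of_steps_that_ran).
    """
    ran: List[str] = []
    for kind, name, action in steps:
        result = action()
        ran.append(name)
        if kind == "check" and not result:
            return False, ran  # later steps never start
    return True, ran


scenario: List[Step] = [
    ("check", "pre-check", lambda: True),
    ("attack", "first-attack", lambda: True),
    ("check", "mid-check", lambda: False),  # simulated unhealthy system
    ("attack", "second-attack", lambda: True),
    ("check", "final-check", lambda: True),
]
print(run_scenario(scenario))
# (False, ['pre-check', 'first-attack', 'mid-check'])
```

Because the failed `mid-check` sits directly before `second-attack`, the second attack never starts, which is the reason for placing a check ahead of each attack.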
Status Checks protect the safety and reliability of your applications when running Chaos Engineering Scenarios. For more information about adding Status Checks to your Scenarios, see the documentation.