A Status Check is a Scenario step that evaluates the health of your environment. A Status Check sends a request to an endpoint URL, evaluates the status code, the response time, and the JSON response body, and passes or fails based on the criteria you define. The endpoint can belong to a third-party tool such as Datadog, New Relic, PagerDuty, or your preferred monitoring tool. It can also be a publicly accessible health endpoint for your service, with or without authentication. If the Status Check fails, the Scenario automatically halts and records the results of the run.
Automate your chaos experiments by adding Status Checks to a Scenario, knowing that attacks will be safely halted if your system doesn’t meet your expected conditions.
Use the toggle button to run the Status Check continuously during the Scenario (polling every 10 seconds). A Continuous Status Check evaluates the success criteria throughout the run, which helps validate how your system handles the failure injected during the attack. If the evaluation fails, the Scenario halts and records the last response and the Scenario step that was interrupted.
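The continuous behavior described above can be pictured as a simple polling loop. The following is a minimal Python sketch and not the product’s implementation: `run_continuous_check`, the `check` callable, and the injectable `sleep` parameter are hypothetical names, and only the 10-second interval comes from the text.

```python
import time
from typing import Callable, Optional

POLL_INTERVAL_SECONDS = 10  # polling interval stated in the documentation


def run_continuous_check(
    check: Callable[[], bool],
    duration_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[int]:
    """Poll `check` every POLL_INTERVAL_SECONDS for the given duration.

    Returns the 0-based index of the first failing poll, or None if every
    poll passed. Hypothetical helper, for illustration only.
    """
    polls = int(duration_seconds // POLL_INTERVAL_SECONDS) + 1
    for i in range(polls):
        if not check():
            return i  # a failed evaluation is where the Scenario would halt
        if i < polls - 1:
            sleep(POLL_INTERVAL_SECONDS)
    return None
```

Passing `sleep` as a parameter keeps the loop easy to exercise without waiting in real time.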
Create a Status Check by entering a name, a description, and an endpoint URL. In the description, it’s helpful to note which services you’re testing or what you expect to happen. The endpoint URL is the address the Status Check requests; its response is evaluated to determine the success or failure of the Scenario. Use the drop-down menu on the Status Check form to select Datadog, New Relic, or PagerDuty and pre-populate the form, or choose the Custom option to build your own.
Add the headers needed to authenticate the request, which are specific to the third party the Status Check communicates with. For example, a Status Check for a Datadog endpoint needs two headers:
- Your organization’s API key
- Your application key
Once you have added the above fields, use the “Test Request” button to confirm that your request authenticates successfully. A successful response includes an HTTP status code in the 200-204 range, the time it took to respond, and the response body. An unsuccessful or unauthorized request returns a 4XX or 5XX status code.
You can also add headers that specify the content type or the API version expected by the endpoint.
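For the Datadog case, the two headers could be assembled as in the sketch below. It assumes Datadog’s documented header names, `DD-API-KEY` and `DD-APPLICATION-KEY`; the function name and the placeholder key values are hypothetical, and no request is actually sent.

```python
from typing import Dict


def datadog_auth_headers(api_key: str, application_key: str) -> Dict[str, str]:
    """Build the two authentication headers for a Datadog endpoint.

    The header names are an assumption based on Datadog's public API
    conventions; confirm them against your account before relying on them.
    """
    if not api_key or not application_key:
        raise ValueError("both the API key and the application key are required")
    return {
        "DD-API-KEY": api_key,                  # your organization's API key
        "DD-APPLICATION-KEY": application_key,  # your application key
    }


# Placeholder values only; never commit real keys.
headers = datadog_auth_headers("<your-api-key>", "<your-application-key>")
```

The resulting dictionary corresponds to the header name and value pairs you would enter on the Status Check form.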
Provide the success criteria that the response must meet for the Scenario to keep running.
Add the status code the response should return when the service is healthy. You can enter a single HTTP status code or a range such as 200-204. If the response’s status code falls outside this code or range, the Scenario automatically halts. See the list of HTTP Status Codes for more guidance.
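The single-code and range forms can be checked with a few lines of logic. This is an illustrative sketch only: `status_code_matches` is a hypothetical helper, and treating the range as inclusive on both ends is an assumption.

```python
def status_code_matches(expected: str, actual: int) -> bool:
    """Return True if `actual` satisfies `expected`.

    `expected` is either a single code such as "200" or a range such as
    "200-204". The range is assumed to be inclusive on both ends.
    """
    low, separator, high = expected.strip().partition("-")
    if separator:
        return int(low) <= actual <= int(high)
    return actual == int(low)
```

For example, `status_code_matches("200-204", 204)` passes, while a 404 response fails the same criterion and would halt the Scenario.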
For the Request Timeout, add the maximum time in milliseconds to wait for a response before halting the Scenario. For example, you might add a Status Check before starting a latency attack to validate your service is responding within your Service Level Indicator (SLI) and Service Level Objectives (SLO) requirements. This would ensure that a Scenario halts prior to introducing even more latency on your service.
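A timeout criterion compares the measured response time against the configured limit in milliseconds. The sketch below shows one way to measure and compare; `timed_call`, `within_timeout`, and the injectable `clock` are hypothetical names, and the inclusive comparison is an assumption. A real HTTP client would normally abort the request when the timeout elapses rather than measure afterwards.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def timed_call(
    request: Callable[[], T],
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[T, float]:
    """Run `request` and return its result with the elapsed time in ms."""
    start = clock()
    result = request()
    elapsed_ms = (clock() - start) * 1000.0
    return result, elapsed_ms


def within_timeout(elapsed_ms: float, timeout_ms: float) -> bool:
    """Return True if the response arrived within the configured timeout."""
    return elapsed_ms <= timeout_ms  # inclusive limit is an assumption
```

With a 300 ms Request Timeout, a response measured at 250 ms passes, while the same response fails a 200 ms limit and the Scenario would halt before the latency attack starts.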
Add the key that you expect in the response body, and then add a comparator to check that the value associated with that key is what you expect. If the value doesn’t pass the comparator, the Scenario will halt. This field is especially important for evaluating responses from third-party monitoring software. At this time, only JSON response bodies are supported. Evaluation is implemented with the Jayway JsonPath library. Please refer to its documentation for the options available when evaluating response body criteria, as well as the basic Operators and Functions tables below.
|Operator|Description|
|---|---|
|`$`|The root element to query. This starts all path expressions.|
|`@`|The current node being processed by a filter predicate.|
|`*`|Wildcard. Available anywhere a name or numeric value is required.|
|`..`|Deep scan. Available anywhere a name is required.|
|`['<name>' (, '<name>')]`|Bracket-notated child or children|
|`[<number> (, <number>)]`|Array index or indexes|
|`[start:end]`|Array slice operator|
|`[?(<expression>)]`|Filter expression. Expression must evaluate to a boolean value.|

Tables are from the Jayway JsonPath library
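To make the key-and-comparator evaluation concrete, here is a small Python sketch. It is a stand-in, not the Jayway library: `resolve` handles only the root `$`, dot-notated names, and single array indexes, and the comparator set, function names, and sample response body are all hypothetical.

```python
import operator
import re
from typing import Any

# Hypothetical comparator set, for illustration only.
COMPARATORS = {
    "==": operator.eq, "!=": operator.ne,
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
}

# Matches one path step: ".name" or "[0]".
_STEP = re.compile(r"\.([A-Za-z_][A-Za-z0-9_]*)|\[(\d+)\]")


def resolve(path: str, document: Any) -> Any:
    """Resolve a very small subset of JsonPath against parsed JSON."""
    if not path.startswith("$"):
        raise ValueError("path must start with '$'")
    node, pos = document, 1
    while pos < len(path):
        match = _STEP.match(path, pos)
        if match is None:
            raise ValueError(f"unsupported syntax at offset {pos}: {path!r}")
        name, index = match.groups()
        node = node[name] if name is not None else node[int(index)]
        pos = match.end()
    return node


def criterion_passes(document: Any, path: str, comparator: str, expected: Any) -> bool:
    """Apply the comparator to the value found at `path`."""
    return COMPARATORS[comparator](resolve(path, document), expected)


# Made-up response body, not real monitoring output.
body = {"overall_state": "OK", "monitors": [{"id": 1, "latency_ms": 120}]}
healthy = criterion_passes(body, "$.overall_state", "==", "OK")
```

Wildcards, deep scans, slices, and filter expressions from the table above are deliberately left out of this sketch; the Jayway documentation remains the reference for their behavior.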
Functions can be invoked at the tail end of a path - the input to a function is the output of the path expression. The function output is dictated by the function itself.
|Function|Description|Output type|
|---|---|---|
|`min()`|Provides the min value of an array of numbers|Double|
|`max()`|Provides the max value of an array of numbers|Double|
|`avg()`|Provides the average value of an array of numbers|Double|
|`stddev()`|Provides the standard deviation value of an array of numbers|Double|
|`length()`|Provides the length of an array|Integer|
|`sum()`|Provides the sum value of an array of numbers|Double|

Tables are from the Jayway JsonPath library
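As a rough illustration of what these functions compute, the Python sketch below applies each one to an array of numbers, as would happen at the tail of a path. It is not the Jayway implementation: `apply_function` and the `FUNCTIONS` mapping are hypothetical, and the use of the population standard deviation for `stddev` is an assumption that should be checked against the Jayway documentation.

```python
import math
from typing import Callable, Dict, Sequence, Union

Number = Union[int, float]


def _population_stddev(values: Sequence[Number]) -> float:
    """Population standard deviation (assumed variant, see lead-in)."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


# Function name -> implementation; output types follow the table above.
FUNCTIONS: Dict[str, Callable[[Sequence[Number]], Number]] = {
    "min": lambda values: float(min(values)),
    "max": lambda values: float(max(values)),
    "avg": lambda values: float(sum(values)) / len(values),
    "stddev": _population_stddev,
    "length": lambda values: len(values),
    "sum": lambda values: float(sum(values)),
}


def apply_function(name: str, values: Sequence[Number]) -> Number:
    """Apply a tail function to the array produced by a path expression."""
    return FUNCTIONS[name](values)
```

A success criterion could then compare, for example, the `max` of a series of latency samples against a threshold.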
Once you’ve added the above fields, use the “Test Evaluation” button to confirm that the Status Check criteria are set up correctly. A successful response confirms your success criteria and enables the “Add to Scenario” button. If the response fails the criteria, you can still add the Status Check to the Scenario, since your service could simply be unhealthy at that point in time.
Once you’ve added a Status Check to a Scenario you can add attacks and more Status Checks as needed. The best practice is to add a Status Check before each attack to validate your service is in a healthy state before introducing failure. In some cases you might want to add a Status Check at the end of the Scenario to validate your service returned to its steady state. Use the Continuous Status Check option for Status Checks that you want evaluated throughout the duration of the Scenario. You can follow the instructions in the Scenario document for Running a Scenario.
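The recommended ordering, with a Status Check before each attack and an optional final check, can be modeled as a list of steps that stops at the first failure. This is a conceptual sketch only: `run_scenario`, the step names, and the boolean-returning callables are hypothetical and do not reflect the product’s API.

```python
from typing import Callable, List, Optional, Tuple

# (step name, action returning True when the step succeeds)
Step = Tuple[str, Callable[[], bool]]


def run_scenario(steps: List[Step]) -> Tuple[List[str], Optional[str]]:
    """Run steps in order and halt at the first one that fails.

    Returns the names of the completed steps and the name of the step
    that caused the halt, or None if every step succeeded.
    """
    completed: List[str] = []
    for name, action in steps:
        if not action():
            return completed, name  # halt: later steps never run
        completed.append(name)
    return completed, None


# Hypothetical Scenario: the check before the second attack fails.
steps: List[Step] = [
    ("status-check-1", lambda: True),
    ("cpu-attack", lambda: True),
    ("status-check-2", lambda: False),
    ("latency-attack", lambda: True),
]
completed, halted_at = run_scenario(steps)
```

Because the second check fails, the latency attack is never started, which is the safety property the best practice is meant to provide.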
Email notifications can be configured for Scheduled Scenarios with Status Checks, provided you have the Team Manager role. If a Status Check fails, the Scenario is halted and an email is sent either to the Scenario Author or to the entire Team. Navigate to “Team Settings” and click “Notifications” to enable or disable these options. Email notifications are disabled by default.
Email example for a halted Scenario due to a failed Status Check.