Health Checks

Supported platforms:

N/A

Health Checks verify the state of your systems before, during, and after an experiment, Scenario, or reliability test. They monitor your systems to ensure they're still operating within your expectations. Health Checks also provide a level of safety when running tests: if your systems become unstable, unresponsive, or unhealthy, Health Checks automatically halt ongoing tests and return your systems to normal operation.

Health Checks work by sending REST API requests to an endpoint that you specify: typically this is a monitor or alert in your observability tool. Gremlin sends a request to the monitor and checks the response time, response status code, and optionally a JSON payload against your success criteria. If any field fails to meet the criteria, the Health Check identifies the failure and signals the active test to stop. We strongly recommend using monitors and/or alerts as endpoints for Health Checks, since your team likely already uses these to measure the health of your service.
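
The evaluation described above can be sketched in a few lines of Python. This is an illustration only, not Gremlin's implementation: the `SuccessCriteria` fields, the `evaluate` function, and the `overall_state` payload key are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class SuccessCriteria:
    # Hypothetical field names; Gremlin's real criteria schema may differ.
    max_response_ms: float
    allowed_status_codes: frozenset
    json_key: Optional[str] = None   # optional payload field to inspect
    json_expected: Any = None        # value that field must equal


def evaluate(status_code: int, response_ms: float,
             body: Optional[dict], criteria: SuccessCriteria) -> list:
    """Return a list of failure reasons; an empty list means the check passed."""
    failures = []
    if status_code not in criteria.allowed_status_codes:
        failures.append(f"unexpected status code {status_code}")
    if response_ms > criteria.max_response_ms:
        failures.append(f"response took {response_ms} ms")
    if criteria.json_key is not None:
        actual = (body or {}).get(criteria.json_key)
        if actual != criteria.json_expected:
            failures.append(f"{criteria.json_key} was {actual!r}")
    return failures


criteria = SuccessCriteria(
    max_response_ms=1000,
    allowed_status_codes=frozenset({200}),
    json_key="overall_state",      # assumed payload field, for illustration
    json_expected="OK",
)
healthy = evaluate(200, 120.0, {"overall_state": "OK"}, criteria)
unhealthy = evaluate(200, 120.0, {"overall_state": "Alert"}, criteria)
```

In this sketch, any non-empty result would be the signal to stop the active test, mirroring the rule that a single failing field fails the whole Health Check.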

Gremlin natively supports Datadog, New Relic, PagerDuty, and several other tools listed below. Gremlin also supports custom endpoints so you can integrate any tool of your choice.

Managing Health Checks

You can view your Health Checks in the Gremlin web app by clicking on Configurations in the left-hand navigation menu and then clicking Health Checks. This page shows every Health Check available to your Gremlin team and lets you create, edit, or delete Health Checks.

Health Check authentication

When you authenticate with an observability tool for the first time, Gremlin saves the authentication details. This way, you can reuse the same authentication details instead of having to redefine them for every new health check you create. These authentication details are also available to other members of your Gremlin team.

You can view and edit all saved authentication details by first navigating to the Health Checks list, then clicking the Authentication tab.

Note

If you delete an authentication that is in use, any Health Checks using that authentication will report back as failed. This can subsequently cause reliability tests and Scenarios to fail.

Authenticating using a secrets management tool

The default method of authentication securely saves your credentials to the Gremlin application. Alternatively, Gremlin can retrieve your credentials at runtime from a secrets management tool like AWS Secrets Manager. This lets Gremlin authenticate with your observability tool without requiring you to store your credentials with Gremlin.

Note

This feature is currently only available for AWS Secrets Manager.

How it works

Normally, when you authenticate an observability tool in the Gremlin web app, Gremlin securely saves your credentials and only uses them to invoke Health Checks belonging to that tool. With runtime authentication, your credentials instead remain stored in your secrets management tool (e.g. AWS Secrets Manager). Instead of reading your credentials directly, Gremlin uses the Private Network Integration (PNI) agent as a proxy. As a result, your secrets never touch Gremlin’s servers.

Using a secrets management tool for Health Checks

Before starting the configuration process, make sure you have the following:

AWS Secrets Manager.
The Private Network Integration (PNI) agent (version 0.5.0 or later) deployed to your environment.
An AWS Identity and Access Management (IAM) role that can access your AWS Secrets Manager credentials. Attach the role to the instance that your PNI agent is running on. The PNI agent will inherit this role and use it to access secrets.

To use a secret from AWS Secrets Manager:

In AWS Secrets Manager, retrieve the Amazon Resource Name (ARN) of your observability tool credentials secret. If you haven’t yet added your credentials, follow the instructions in Create an AWS Secrets Manager secret.
1. On the Choose secret type page, select Other type of secret.
2. Enter the name and value of your credentials into the key and value fields, respectively. Gremlin will treat each key/value pair as a request header.
3. After creating the secret, copy the resulting ARN.
In the Gremlin web app, navigate to Configurations > Health Checks and click + Health Check.
Select your observability tool from the Observability Tool drop-down.
Select Yes for Is this observability tool behind a firewall or on-prem?
Under Authentication, select Retrieve credentials from AWS Secrets Manager.
Enter the ARN of your AWS Secrets Manager secret into the AWS ARN box.
Click Save.

Now, whenever a Health Check for this observability tool runs, the PNI agent will use your AWS Secrets Manager secret to authenticate with the observability tool.
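
To illustrate how each key/value pair in the secret maps to a request header, here is a minimal Python sketch. It assumes the secret value is the JSON object that AWS Secrets Manager stores for key/value secrets; `headers_from_secret` and the header names in the example are hypothetical, and the PNI agent's actual logic is not shown in this document.

```python
import json


def headers_from_secret(secret_string: str) -> dict:
    """Turn a key/value secret (a JSON object) into HTTP request headers."""
    pairs = json.loads(secret_string)
    if not isinstance(pairs, dict):
        raise ValueError("expected a JSON object of key/value pairs")
    # Each key becomes a header name and each value a header value.
    return {str(name): str(value) for name, value in pairs.items()}


# Placeholder header names and values, for illustration only.
example_secret = '{"X-Api-Key": "example-key", "X-App-Key": "example-app-key"}'
headers = headers_from_secret(example_secret)
```

Because the keys are used as header names, name each key in the secret exactly as your observability tool expects the header to appear.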

Intelligent Health Checks

For certain platforms, Gremlin has the ability to create Health Checks automatically. These work like normal Health Checks, but are fully managed by Gremlin and don't require you to set up or configure an observability tool. You can use these in addition to normal Health Checks.

How Intelligent Health Checks work

Intelligent Health Checks use three metrics about your service: error rate, request rate, and latency. Gremlin first determines a baseline threshold for each metric using historical data. While a test is running, Gremlin monitors each metric in real-time to determine whether your service passes or fails the test.
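
This page doesn't describe how Gremlin derives its baselines, so the following Python sketch only illustrates the general idea of comparing a live metric against a threshold built from historical data. The mean-plus-three-standard-deviations rule and the function names are assumptions made for the example, and they only suit metrics where higher is worse (such as latency or error rate).

```python
from statistics import mean, pstdev


def baseline_threshold(history, sigmas=3.0):
    """Assumed rule: threshold = historical mean + `sigmas` standard deviations."""
    return mean(history) + sigmas * pstdev(history)


def passes(current, history):
    """A live sample passes if it stays at or below the baseline threshold."""
    return current <= baseline_threshold(history)


# Hypothetical historical latency samples, in milliseconds.
latency_history_ms = [100, 110, 90, 105, 95]
within_baseline = passes(115, latency_history_ms)
outside_baseline = passes(150, latency_history_ms)
```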

Intelligent Health Checks are available for the following platforms:

Enabling Intelligent Health Checks

To enable Intelligent Health Checks:

Navigate to the service you want to enable them for.
Select Settings > Health Checks.
Under Intelligent Health Checks, use the Type drop-down to select the platform you want to create Intelligent Health Checks for, then click + Add.
Follow the instructions for your platform.

For AWS users, you'll first need to specify which Elastic Load Balancer (ELB) is mapped to this service. The Intelligent Health Checks will query CloudWatch for metrics based on the ELB, so mapping the correct ELB is critical for accurate test results.

Disabling Intelligent Health Checks

To disable Intelligent Health Checks, click the Edit button next to any one of your Intelligent Health Checks. From the list of services, click the checkbox next to your service so that it appears unchecked. Then, click Save.

Note

If you disable Intelligent Health Checks on a service, make sure the service has at least one other Health Check. As a safety precaution, Gremlin will not test services that have no Health Checks.

Creating Health Checks

To create a new Health Check, click on Configurations in the left-hand navigation menu, then click Health Checks. From here, click the + Health Check button in the top-right corner of the page.

Choosing an observability tool

First, you must select the observability tool you wish to use. Click on the Observability Tool drop-down menu to see the list of available options. Click on the link for your tool in the list below for instructions on how to use that tool. Note that the options may change depending on which tool you select, and whether someone else in your Gremlin Team has already authenticated with that tool:

Once you've entered authentication details for an observability tool, you can reuse the same authentication for other monitors. After you've followed the instructions for your specific tool, continue with the following instructions.

Accessing observability tools on a private network

By default, Health Checks require your endpoint to be accessible by Gremlin's servers. If your endpoint is hosted internally (i.e., behind a firewall) or you have strict network security policies that prevent exposing your systems to the Internet, you can use the Private Network Integration (PNI) agent. The PNI agent proxies requests to and from the Gremlin API, letting you use Health Checks and other integrations from within a private network. You can enable this by selecting Yes under Is this observability tool behind a firewall or on-prem?

Naming your Health Checks

Enter a name for the Health Check in the Name text box. As a best practice, we recommend naming your Health Checks after their respective monitors or endpoints. This makes it easier to mentally connect a Health Check to a monitor, and to know which Health Checks to use when creating new services in Gremlin.

Once you create a Health Check, it will appear in your team's health check list. All members of your Gremlin team will be able to use these Health Checks in their experiments, Scenarios, or reliability tests. Saved Health Checks also include success evaluation criteria.

Tip

We recommend creating a Health Check for each of your most important monitors. For example, if your service has monitors for CPU usage, response time, and availability, we recommend creating a new Health Check for each one. This is to ensure that if any one of those checks fails, Gremlin can quickly halt any ongoing tests. You can use multiple Health Checks in a single experiment, Scenario, or reliability test.

Adding tags to a Health Check

Health Checks support the use of tags. This is mainly so you can specify which Private Network Integration (PNI) agent the Health Check is proxied through, if you have them deployed. Tags can also be used to organize Health Checks by adding more detailed information.

Note

Some tags cannot be edited because they come from your observability provider. These are listed separately next to Tags from observability tool.

To add tags to a Health Check:

Search for the Health Check you wish to edit and click Edit.
Look for the Tags field beneath the Health Check name, then click Edit.
1. Enter the name of the tag you want to add.
2. Enter the value(s) that this tag will contain. You can enter a single value or a comma-separated list of values.
3. Click Add Tag to add the tag and value(s). You can add more tags by repeating these steps.
4. To remove a tag, click the X next to the tag.
When you’ve added all your desired tags, click Save.

Routing Health Checks to specific PNI agents

If you have multiple Private Network Integration (PNI) agents deployed, you can use tags to determine which agent your Health Checks get routed to. Gremlin will match your Health Check to a PNI agent using the following behavior:

If both the Health Check and agent share a tag with the same name, Gremlin compares their values. If at least one value matches, Gremlin will route the Health Check through the agent.
If the Health Check has tags, but an agent doesn’t, the Health Check won’t route to that agent.
If the Health Check has no tags, Gremlin will fall back to its default behavior of routing it to a random PNI agent.

To learn how to add tags to a PNI agent, see the PNI agent documentation.
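
The matching rules above can be expressed as a short Python sketch. It is an illustration under stated assumptions, not Gremlin's routing code: the function names and agent names are hypothetical, tags are modelled as a mapping from tag name to a list of values, and the sketch assumes that a Health Check whose tags share no name with an agent's tags does not route to that agent, and that one agent is picked at random when several match.

```python
import random


def agent_is_eligible(check_tags: dict, agent_tags: dict) -> bool:
    """True if any tag name is shared and at least one of its values matches."""
    for name, values in check_tags.items():
        if set(values) & set(agent_tags.get(name, ())):
            return True
    return False


def pick_agent(check_tags: dict, agents: dict, rng=random):
    """Return the name of the agent to route through, or None if none match."""
    if not check_tags:
        # Untagged Health Check: fall back to a random agent.
        candidates = list(agents)
    else:
        candidates = [name for name, tags in agents.items()
                      if agent_is_eligible(check_tags, tags)]
    return rng.choice(candidates) if candidates else None


# Hypothetical agents and tags, for illustration only.
agents = {
    "pni-east": {"region": ["us-east-1"]},
    "pni-west": {"region": ["us-west-2"], "env": ["prod"]},
    "pni-untagged": {},
}
routed = pick_agent({"region": ["us-west-2", "eu-west-1"]}, agents)
unrouted = pick_agent({"region": ["ap-south-1"]}, agents)
```

Here `routed` is `"pni-west"`, the only agent sharing a `region` value with the Health Check, while `unrouted` is `None` because no agent has a matching value.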

Editing Health Checks

To edit a Health Check, click the Edit button next to the Health Check you want to modify. You can change its name, URL, success evaluation criteria (for custom tools), and the headers used for authentication.

Copying Health Checks

To copy a Health Check, click the Clone button next to the Health Check you want to copy. This opens the Health Check creation form with the details of the cloned Health Check already entered. This is useful for quickly adding a new Health Check without having to re-enter authentication details, custom URLs, or custom headers.

Deleting Health Checks

To delete a Health Check, click the Delete button next to the Health Check you want to delete. Click Delete again to confirm.

Note

You can't delete a Health Check if a service is currently using it and that service has no other Health Checks. In that case, you'll see the following message: "This is the last health check for [service name]. If you need to delete it, please add a new one first." Add another Health Check to the service before deleting this one.

Health Checks IP ranges

If your firewall blocks Health Check requests and you are unable to use Private Network Endpoints, add the following IP addresses to your allow list.


44.236.227.116
54.186.237.228
44.239.162.49
44.240.200.121
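
When auditing firewall rules or access logs, it can help to check whether a source address is one of the addresses listed above. The following Python sketch uses the standard `ipaddress` module; the helper name `is_health_check_source` is hypothetical.

```python
import ipaddress

# The Health Check source addresses listed above.
HEALTH_CHECK_IPS = frozenset(
    ipaddress.ip_address(ip)
    for ip in (
        "44.236.227.116",
        "54.186.237.228",
        "44.239.162.49",
        "44.240.200.121",
    )
)


def is_health_check_source(addr: str) -> bool:
    """Return True if `addr` is one of the listed Health Check addresses."""
    return ipaddress.ip_address(addr) in HEALTH_CHECK_IPS


listed = is_health_check_source("44.239.162.49")
not_listed = is_health_check_source("203.0.113.10")
```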

Privileges required

Privilege	Description
INTEGRATIONS_READ	Allows reading all team integrations
INTEGRATIONS_WRITE	Allows updating all team integrations
