Reliability Tests

Each reliability test exercises a specific behavior of your service, such as CPU and memory autoscaling, zone and host redundancy, or dependency failures. While a test is running, Gremlin continuously monitors your service's state using its Health Checks. If any Health Check becomes unhealthy during a test, Gremlin immediately halts the test and marks it as failed. Otherwise, the test is marked as passed.
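The halt-on-unhealthy behavior described above can be illustrated with a small Python model. This is only a sketch and not Gremlin's implementation or API: HealthCheck, Outcome, and run_reliability_test are hypothetical names, and the fixed number of polls stands in for whatever monitoring interval Gremlin actually uses.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class HealthCheck:
    """Hypothetical stand-in: a named probe that returns True while healthy."""
    name: str
    is_healthy: Callable[[], bool]


@dataclass
class Outcome:
    passed: bool
    failed_check: Optional[str] = None  # name of the check that halted the test
    polls_completed: int = 0            # full polling rounds finished before the end


def run_reliability_test(checks: List[HealthCheck], polls: int) -> Outcome:
    """Poll every check once per round; halt on the first unhealthy reading."""
    for completed in range(polls):
        for check in checks:
            if not check.is_healthy():
                # Halt immediately and record which check caused the failure.
                return Outcome(False, check.name, completed)
    return Outcome(True, None, polls)


# Example: one check stays healthy, the other turns unhealthy on its third poll.
readings = iter([True, True, False, True])
steady = HealthCheck("error-rate", lambda: True)
flaky = HealthCheck("latency-p99", lambda: next(readings))

result = run_reliability_test([steady, flaky], polls=4)
print(result.passed, result.failed_check, result.polls_completed)
```

In this model the fourth reading is never consumed, because the test stops as soon as one check reports unhealthy.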

A Test Suite is a collection of reliability tests. Each Gremlin team has one Test Suite assigned to it, and that Test Suite defines the reliability tests available to services owned by that team. To learn more, see Test Suites.

Running reliability tests

To run a reliability test, first click on the service you wish to test to open the Service Details page. From there, find the test that you wish to run and click Run. A modal window will appear asking you to confirm. To run the test, click Run again. The test will start and Gremlin will display details about the test along with its current status. On this screen, you can monitor the progress of the test and drill down into its execution details. If the test fails, you can see the cause of the failure. If the failure was caused by a Health Check, you can see which of the Health Checks triggered the failure.

Screenshot of a completed CPU reliability test

To run the full suite of tests, click the Run All button at the top of the service overview page, then click Run All Tests to confirm. Gremlin will run each test sequentially. The page will automatically refresh to show the current running test and the results of completed tests.

Run All skips tests on dependencies that have been marked as Single Points of Failure. See the dependency documentation to learn more.

Running dependency tests

When you define your service in Gremlin and select its process name, Gremlin uses network traffic data to identify network resources that your service communicates with. It then lists these resources in the Dependencies section. For each dependency, Gremlin automatically creates three tests:

  • The Failure Test drops all network traffic to the dependency.
  • The Latency Test delays all network traffic to this dependency by 100ms.
  • The Certificate Expiry Test opens a secure connection to your dependency, retrieves the certificate chain, and validates that no certificates expire in the next 30 days. If there is no secure connection available, and therefore no certificates, this test will pass.
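The pass/fail rule of the Certificate Expiry Test can be sketched in Python as follows. This is an illustration under stated assumptions, not Gremlin's code: certificate_expiry_passes and parse_not_after are hypothetical helpers, the expiry dates are assumed to have already been read from each certificate's notAfter field, and a certificate expiring exactly at the 30-day boundary is treated as failing (the source does not specify the boundary case).

```python
import ssl
from datetime import datetime, timedelta, timezone
from typing import Iterable


def parse_not_after(text: str) -> datetime:
    """Convert a certificate time string (the format used by Python's ssl
    module, e.g. "Jan  5 09:34:43 2018 GMT") into an aware UTC datetime."""
    return datetime.fromtimestamp(ssl.cert_time_to_seconds(text), tz=timezone.utc)


def certificate_expiry_passes(
    not_after_dates: Iterable[datetime],
    now: datetime,
    window_days: int = 30,
) -> bool:
    """Pass only if every certificate in the chain expires after the window.

    An empty chain passes, mirroring the rule that a dependency without a
    secure connection (and therefore without certificates) passes the test.
    """
    cutoff = now + timedelta(days=window_days)
    return all(expiry > cutoff for expiry in not_after_dates)


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
healthy_chain = [
    datetime(2024, 6, 1, tzinfo=timezone.utc),   # leaf certificate
    datetime(2030, 1, 1, tzinfo=timezone.utc),   # intermediate certificate
]
expiring_chain = [
    datetime(2024, 1, 20, tzinfo=timezone.utc),  # expires in 19 days
    datetime(2030, 1, 1, tzinfo=timezone.utc),
]

print(certificate_expiry_passes(healthy_chain, now))   # True
print(certificate_expiry_passes(expiring_chain, now))  # False
print(certificate_expiry_passes([], now))              # True (no certificates)
```

Checking every certificate rather than only the leaf matters because an expiring intermediate certificate breaks the connection just as an expiring leaf does.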

You can run these tests for each dependency and they will contribute to the service's reliability score.

A list of dependencies for a service created in the Gremlin web app.

Running a reliability test on a dependency works the same way as running a reliability test on a service. Simply click the Run button and click Run again to confirm.

Note on dependency testing
When running a dependency test, Gremlin doesn't actually impact the dependency. Instead, it impacts the service's network connection to the dependency. For example, if you have a web server connected to a dependency over port 3306 and run a latency experiment on the dependency, Gremlin will introduce that latency on port 3306 on the service. Then, it monitors the service's Health Checks to ensure the service still functions.

Autoscheduling Reliability Tests

A consistent testing schedule is key to improving a Service's Reliability Score. You can schedule Reliability Tests to run automatically during a weekly testing window. Gremlin will run as many eligible Reliability Tests as possible during the specified window and track the scores over time so you can see how they improve with regular testing.

You can set up a schedule to run:

  • All Reliability Tests
  • Any Reliability Tests that have passed at least once
  • Any Reliability Tests that have been run at least once

Autoscheduling is optional. You can run the Reliability Tests manually if you do not wish to use autoscheduling. Autoscheduled tests will also not run on dependencies that are marked Single Points of Failure. See the dependency documentation to learn more.
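The three scheduling options, together with the Single Point of Failure exclusion, amount to a filter over a service's tests. The Python sketch below models that filter. It is not Gremlin's API: SchedulePolicy, ReliabilityTest, eligible_tests, and the example test names are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class SchedulePolicy(Enum):
    ALL = "all"
    PASSED_ONCE = "passed_at_least_once"
    RUN_ONCE = "run_at_least_once"


@dataclass
class ReliabilityTest:
    name: str
    run_count: int = 0
    pass_count: int = 0
    # True when the test targets a dependency marked as a Single Point of Failure.
    targets_spof_dependency: bool = False


def eligible_tests(tests: List[ReliabilityTest], policy: SchedulePolicy) -> List[ReliabilityTest]:
    """Return the tests an autoschedule would consider under the given policy."""
    selected = []
    for test in tests:
        if test.targets_spof_dependency:
            continue  # never autoscheduled, regardless of policy
        if policy is SchedulePolicy.PASSED_ONCE and test.pass_count < 1:
            continue
        if policy is SchedulePolicy.RUN_ONCE and test.run_count < 1:
            continue
        selected.append(test)
    return selected


tests = [
    ReliabilityTest("CPU scalability", run_count=3, pass_count=2),
    ReliabilityTest("Zone redundancy", run_count=1, pass_count=0),
    ReliabilityTest("Host redundancy"),
    ReliabilityTest("Dependency latency: db", run_count=2, pass_count=2,
                    targets_spof_dependency=True),
]

for policy in SchedulePolicy:
    print(policy.name, [t.name for t in eligible_tests(tests, policy)])
```

The stricter policies let a team start by scheduling only tests it has already run or passed manually, then widen the schedule as confidence grows.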

To schedule Reliability Tests for a Service:

  • On the Service page, click Autoschedule (or Settings and then Scheduling).
  • Select the tests that you want to schedule:
      • All Reliability Tests
      • Any Reliability Tests that have passed at least once
      • Any Reliability Tests that have been run at least once
  • Under Test Window, specify the parameters for the test window:
      • Day
      • Start hour
      • Length of window (must be at least 2 hours)
  • Click Save.
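The Test Window parameters can be modeled as a small validated record. The sketch below is illustrative only: WeeklyWindow is a hypothetical class, the 0-23 range for the start hour is an assumption about how hours are expressed, and only the two-hour minimum comes from the steps above.

```python
from dataclasses import dataclass

DAYS = ("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
MIN_WINDOW_HOURS = 2  # the minimum length stated in the scheduling steps


@dataclass(frozen=True)
class WeeklyWindow:
    day: str
    start_hour: int     # assumed to be a 24-hour clock value, 0-23
    length_hours: int

    def __post_init__(self) -> None:
        # Reject invalid windows at construction time.
        if self.day not in DAYS:
            raise ValueError(f"unknown day: {self.day!r}")
        if not 0 <= self.start_hour <= 23:
            raise ValueError("start_hour must be between 0 and 23")
        if self.length_hours < MIN_WINDOW_HOURS:
            raise ValueError(f"window must be at least {MIN_WINDOW_HOURS} hours long")


window = WeeklyWindow(day="Tuesday", start_hour=2, length_hours=4)
print(window)

try:
    WeeklyWindow(day="Tuesday", start_hour=2, length_hours=1)
except ValueError as error:
    print("rejected:", error)
```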