Experiments

An experiment is Gremlin’s term for injecting faults into some part of a system in a safe and secure way. This can include causing network outages, creating latency, shutting down hosts, and exhausting compute resources. In addition to running ad-hoc experiments, you can schedule regular or recurring experiments, create experiment templates, and view reports of historical experiments. Experiments form the foundation of Gremlin’s other features, such as Scenarios and Test Suites.

Note

Experiments were previously called Attacks. You may still see references to "attack" in the Gremlin CLI and REST API.

Experiment categories

Gremlin provides three categories of experiments:

Resource experiments: test against sudden changes in consumption of computing resources.
Network experiments: test against unreliable network conditions.
State experiments: test against unexpected changes in your environment, such as power outages, node failures, clock drift, or application crashes.

The tables below show the experiments that belong to each category.

Resource Experiments

Resource experiments consume compute resources like CPU, memory, and I/O throughput.

Experiment	Impact
CPU	Generates high load for one or more CPU cores.
Memory	Allocates a specific amount of RAM.
GPU	Generates high compute load on a GPU.
IO	Puts read/write pressure on I/O devices such as hard disks.
Disk	Writes files to disk to fill it to a specific percentage.
Process Exhaustion	Simulates running processes on a target to consume process IDs (PIDs).

State Experiments

State experiments modify the state of a target so you can test for auto-correction and similar fault-tolerant mechanisms.

Experiment	Impact
Shutdown	Performs a shutdown (and an optional reboot) on the host operating system to test how your system behaves when losing one or more cluster machines.
Time Travel	Changes the host's system time, which can be used to simulate adjusting to daylight saving time and other time-related events.
Process Killer	Kills the specified process, which can be used to simulate application or dependency crashes. Note: Process Killer experiments do not work for process ID 1. Consider a Shutdown experiment instead.

Network Experiments

Network experiments test the impact of lost or delayed traffic to a target. Test how your service behaves when you are unable to reach one of your dependencies, internal or external. Limit the impact to only the traffic you want to test by specifying ports, hostnames, and IP addresses.

Experiment	Impact
Blackhole	Drops all matching network traffic.
Certificate Expiry	Checks for expiring security certificates.
Latency	Injects latency into all matching egress network traffic.
Packet Loss	Induces packet loss into all matching egress network traffic.
DNS	Blocks access to DNS servers.

Note

Please see these important considerations when running network experiments on Kubernetes.

How to run an experiment

There are three ways to run an experiment: using the Gremlin web app, REST API, and Gremlin CLI (command-line interface). To run an experiment using the Gremlin web app:

1. Log into the Gremlin web app and click on the Experiments link in the left-hand navigation menu.
2. Click the New Experiment button.
3. Select the type of system that you want to run the experiment on. Gremlin supports host, container, and Kubernetes-based targets (for running experiments on applications and serverless functions, see Failure Flags).
   1. After selecting a type, refine your selection to specific targets. For all three target types, you can select specific targets individually, or by using tags (or “Selectors” for Kubernetes). The default view groups targets by tag, but you can click the Exact button to switch to the individual target list.
   2. You can review and refine your selection further using the Blast Radius graph. The graph shows each host, container, and Kubernetes resource, and highlights the resources that your experiment will impact. Hover the mouse over any item in the graph to see its name.
   3. If you want to limit the impact to a random subset of resources, use the Percent to impact box. For example, if you have two hosts and set this value to 50%, Gremlin will randomly choose one of the hosts as the experiment target when the experiment is run. If you'd rather define a specific number of targets instead of a percentage, use the drop-down box to switch units. See the image below for an example.
   4. By default, Gremlin will automatically include any newly-detected targets that match the target criteria. This is useful if, for example, your target is an autoscaling cluster and you want to include new nodes while the experiment is running. To prevent this behavior, uncheck the Include New Targets check box.
4. Under Choose a Gremlin, configure the experiment that you want to run.
   1. Select the Category of the experiment, then click on the experiment name.
   2. Depending on the experiment you choose, you'll see different settings. Check each experiment’s documentation page to learn what each parameter does and what values it allows.
5. If you want to schedule this experiment to run in the future, turn on the Schedule for later toggle.
   1. To run the experiment at one specific time, select Only once and enter a date and time.
   2. To run the experiment at random during a window, select Randomly within a time frame. You can choose one or more days of the week to run the experiment, how many times to run it each day, and the time window. Gremlin will select a random time during the window to run the experiment.
6. Click Run Experiment to start the experiment (or Schedule Experiment to schedule it for later). After you click Run Experiment, Gremlin brings up the experiment progress screen, where you can track the progress of the experiment in real time.

Configuring the blast radius for a host-based experiment. There are two hosts in the us-west-1a zone, but we've set the "Percentage of hosts to impact" to 50. When the experiment starts, Gremlin will randomly select which of these hosts to target.
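
The random subset selection described above can be pictured with a short Python sketch. This is only an illustration, not Gremlin's implementation: the function name pick_targets is hypothetical, and rounding up to at least one target is an assumption, since this page does not say how fractional counts are handled.

```python
import math
import random


def pick_targets(targets, percent, rng=random):
    """Return a random subset covering roughly `percent` of `targets`."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be greater than 0 and at most 100")
    if not targets:
        return []
    # Assumption: round up so that at least one target is always chosen.
    count = max(1, math.ceil(len(targets) * percent / 100))
    return rng.sample(list(targets), count)


hosts = ["host-a", "host-b"]   # hypothetical host names
print(pick_targets(hosts, 50))  # a single host, chosen at random
```

With two hosts and 50 percent, the sketch always returns exactly one host, but which one is only decided when the function runs, mirroring how the selection is made when the experiment starts.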

Monitoring system metrics using experiment visualizations

For certain experiments, Gremlin can automatically collect and display metrics from the target. This lets you quickly verify the impact of your experiment without having to open an observability tool. For CPU experiments, you can see the amount of CPU load. For memory experiments, you can see RAM usage vs. capacity. For shutdown experiments, you can see when the target went offline.

CPU experiment results screen showing a graph of CPU metrics.

Disabling experiment visualizations

Experiment visualizations are enabled by default. Users with the Company Owner or Company Admin roles can disable them by going to the Company Settings page, clicking Preferences, and unchecking Experiment Visualizations. You can also view our data collection policy by clicking the link. Gremlin only collects metrics relevant to the experiment, and does not collect metrics when experiments are not running.

Disable experiment visualizations company-wide on the Settings page.

Disabling experiment visualizations for individual hosts

To prevent a host from sending metric data to Gremlin for visualization, open the configuration file for the Gremlin agent running on that host and add the line PUSH_METRICS=0. Then, restart the agent.

Scheduling experiments

Experiments can be run ad-hoc or scheduled, from the web app or programmatically. You can schedule experiments to execute on certain days and within a specified time window. You can also set the maximum number of experiments a schedule can generate.

Scheduling an experiment to run every Monday, Wednesday, and Friday, between 2 and 4 AM.
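
To make the idea of a random run time inside a window concrete, here is a minimal Python sketch of the schedule in the caption above (Monday, Wednesday, and Friday, between 2 and 4 AM). The function names are hypothetical and the logic is an assumption about how such a schedule could be modeled, not a description of Gremlin's scheduler.

```python
import random
from datetime import date, datetime, time, timedelta


def next_scheduled_days(first_day, weekdays, count):
    """Return the next `count` dates on or after `first_day` that fall on `weekdays`.

    Weekdays use Python's numbering: Monday is 0 and Sunday is 6.
    """
    if not weekdays:
        raise ValueError("at least one weekday is required")
    days, day = [], first_day
    while len(days) < count:
        if day.weekday() in weekdays:
            days.append(day)
        day += timedelta(days=1)
    return days


def random_run_time(day, window_start, window_end, rng=random):
    """Pick a random datetime on `day` inside the given time window."""
    start = datetime.combine(day, window_start)
    end = datetime.combine(day, window_end)
    if end <= start:
        raise ValueError("window_end must be after window_start")
    offset = rng.uniform(0, (end - start).total_seconds())
    return start + timedelta(seconds=offset)


# Monday, Wednesday, Friday between 02:00 and 04:00
for day in next_scheduled_days(date(2024, 1, 1), {0, 2, 4}, 3):
    print(random_run_time(day, time(2, 0), time(4, 0)))
```

The count argument plays the role of the cap on how many experiments a schedule can generate.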

Using agent tags

The Gremlin agent automatically detects certain metadata about the systems it’s running on. This metadata is made available in the form of tags. A tag is a simple key-value pair used to identify something about the system, like its hostname, public and private IP address, CPU architecture, and operating system. Gremlin can also find the cloud platform provider, region, availability zone, and other important information for systems running on cloud platforms like AWS.

Tags let you create groups of resources for experimentation. For example, instances in a Kubernetes cluster are automatically assigned a cluster tag with the name of the cluster as the value. Instead of having to remember which instances are part of the cluster, you can simply select the tag with the corresponding cluster name.
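
As a rough illustration of tag-based grouping, the Python sketch below filters a set of targets by key-value pairs. The target names, tag values, and the select_by_tags helper are all hypothetical. The sketch only shows the matching idea, not how Gremlin stores or queries tags.

```python
def select_by_tags(targets, **wanted):
    """Return the names of targets whose tags include every wanted key-value pair."""
    return [
        name
        for name, tags in targets.items()
        if all(tags.get(key) == value for key, value in wanted.items())
    ]


# Hypothetical inventory: target name -> tags reported by the agent
targets = {
    "node-1": {"cluster": "prod-cluster", "zone": "us-west-1a"},
    "node-2": {"cluster": "prod-cluster", "zone": "us-west-1b"},
    "node-3": {"cluster": "staging-cluster", "zone": "us-west-1a"},
}

print(select_by_tags(targets, cluster="prod-cluster"))
# ['node-1', 'node-2']
```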

To learn more about tags, see Network Tags.

Creating custom tags

You can define custom tags in the agent configuration file. When configuring an experiment in the Gremlin web app, these tags will appear under the Other Tags category. For example, you can add this to your config.yaml file to create a tag with the name service and the value pet-store, plus a second tag with the name interface and the value http:

YAML


## Gremlin Client Tags; Tag your machine with key-value pairs that help you target this machine during experiments
## (can also set with GREMLIN_CLIENT_TAGS environment variable)
tags:
  service: pet-store
  interface: http

How experiments work

Every experiment in Gremlin is made up of one or more Executions. An Execution is an instance of the experiment running on a single target. An experiment can have multiple Executions if you’ve selected multiple targets to run it on.

Experiment stages

The Stage progression of an experiment is derived from the Stages of the experiment's Executions. Gremlin weighs the importance of each Execution’s Stage to determine the experiment's overall Stage.

Stages are sorted below by descending order of importance (i.e. the Running stage holds the highest importance):

Stage	Description
Running	Experiment running on the host
Halt	Experiment told to halt
RollbackStarted	Code to roll back has started
RollbackTriggered	Daemon started a rollback of client
InterruptTriggered	Daemon issued an interrupt to the client
HaltDistributed	Distributed to the host but not yet halted
Initializing	Experiment is creating the desired impact
Distributed	Distributed to the host but not yet running
Pending	Created but not yet distributed
Failed	Client reported unexpected failure
HaltFailed	Halt on client did not complete
InitializationFailed	Creating the impact failed
LostCommunication	Client never reported finishing/receiving execution
ClientAborted	Something on the client/daemon side stopped the Gremlin and it was aborted without user intervention
UserHalted	User issued a halt, and that is now complete
Successful	Completed running on the Host
TargetNotFound	Experiment not scoped to any current targets

As an example, an experiment with three Executions will derive its final reported stage by picking the most important stage from among its Executions. So, if the three Execution Stages are TargetNotFound, Running, and TargetNotFound, the resulting stage for the experiment will be Running.
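
The derivation rule can be sketched in a few lines of Python. The stage order is copied from the table above. The names STAGES_BY_IMPORTANCE and experiment_stage are hypothetical, and the code is an illustration of the rule rather than Gremlin's implementation.

```python
# Stages in descending order of importance, as listed in the table above.
STAGES_BY_IMPORTANCE = [
    "Running", "Halt", "RollbackStarted", "RollbackTriggered",
    "InterruptTriggered", "HaltDistributed", "Initializing", "Distributed",
    "Pending", "Failed", "HaltFailed", "InitializationFailed",
    "LostCommunication", "ClientAborted", "UserHalted", "Successful",
    "TargetNotFound",
]
RANK = {stage: position for position, stage in enumerate(STAGES_BY_IMPORTANCE)}


def experiment_stage(execution_stages):
    """Return the most important stage among an experiment's Executions."""
    if not execution_stages:
        raise ValueError("an experiment needs at least one Execution")
    # A lower rank means higher importance.
    return min(execution_stages, key=RANK.__getitem__)


print(experiment_stage(["TargetNotFound", "Running", "TargetNotFound"]))
# Running
```

Under the same rule, an experiment whose Executions ended as Successful and Failed would report Failed, because Failed sits higher in the table.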

Running experiments on Kubernetes

Gremlin allows targeting objects within your Kubernetes clusters. After selecting a cluster, you can filter the visible set of objects by selecting a namespace. Select any of your Deployments, ReplicaSets, StatefulSets, DaemonSets, or Pods. When one object is selected, all child objects will also be targeted. For example, when selecting a DaemonSet, all of the pods within will be selected.

Only parent Kubernetes objects are available to target. Pods will be listed only if they don't belong to a Set or Deployment.

Selecting containers

For State and Resource experiment types, you can target all, any, or specific containers within a selected pod. Once you select your targets, these options will be available under Choose a Gremlin on the Experiment page. Selecting Any will target a single container within each pod at runtime. If you've selected more than one target (for example, Deployment), you can select from a list of common containers across all of these targets. When you run the experiment, the underlying containers within the objects selected will be impacted.

Any, all, or specific options for container experiments

Containers share resources with their hosts. Running resource experiments on Kubernetes objects will impact the hosts where the targeted containers are running, including the host's full set of containers. Targeted containers also need to be able to resolve api.gremlin.com; otherwise, the experiment will fail. Gremlin adopts all the configuration and resources of the pod it is experimenting on.

Additional configuration options

This section lists common configuration options and how to use them. For details on experiment-specific parameters, check out the links to each experiment in the tables at the top of this page.

Including new targets in ongoing experiments

When selecting targets by tag, you have the option to check the Include New Targets checkbox. When this is checked, any newly-detected targets that meet the experiment's selection criteria will join the experiment. If it is unchecked, new targets won't run the experiment, even if they match the criteria.

For example, imagine you want to run a CPU experiment on all EC2 hosts in the AWS us-east-1 region. When you run the experiment, AWS detects the increased CPU usage, automatically provisions a new EC2 instance, and installs the Gremlin agent. If Include New Targets is checked, Gremlin will add this new instance to the ongoing CPU experiment.

Specifying multiple network addresses and ports

You can specify multiple network addresses and ports using a comma-separated list. For ports, you can specify ranges by adding a dash between the lowest number and the highest number in the range (e.g. 3000-4000). This also applies to experiments run via the REST API and CLI.

Entering multiple port numbers and ranges in a Gremlin network experiment.

For a range of IP addresses, you can use CIDR notation (e.g. 10.0.0.0/24).
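
If you are unsure which addresses a CIDR value covers, Python's standard ipaddress module can expand it. The snippet below only checks the example range from this page and does not involve Gremlin.

```python
import ipaddress

net = ipaddress.ip_network("10.0.0.0/24")

print(net.num_addresses)                          # 256
print(net[0], net[-1])                            # 10.0.0.0 10.0.0.255
print(ipaddress.ip_address("10.0.0.42") in net)   # True
print(ipaddress.ip_address("10.0.1.42") in net)   # False
```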

Excluding network addresses and ports

To exclude a hostname, IP address, or port from an experiment, add a caret ^ directly in front of it. For example, in the above screenshot, ^53 prevents DNS traffic from being impacted. This also works for ranges and CIDR values.

Note

By default, network experiments impact all traffic. You can use exclude rules to create a whitelist of unimpacted traffic.
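
The port syntax described in this section (comma-separated values, dash ranges, and a leading caret for exclusions) can be modeled with a small Python parser. This is a sketch for reasoning about a port list before you enter it. The function names are hypothetical, and the matching rules (exclusions take precedence, and an empty include list means all ports) are assumptions based on the description above, not Gremlin's actual parser.

```python
def parse_port_spec(spec):
    """Split a spec such as '80,443,3000-4000,^53' into include and exclude ranges."""
    include, exclude = [], []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        bucket = include
        if item.startswith("^"):      # a caret marks an exclusion
            bucket, item = exclude, item[1:]
        low, _, high = item.partition("-")
        low = int(low)
        high = int(high) if high else low
        if not 0 <= low <= high <= 65535:
            raise ValueError(f"invalid port or range: {item}")
        bucket.append((low, high))
    return include, exclude


def port_is_impacted(port, include, exclude):
    """Assumption: exclusions win, and no include rules means all ports."""
    if any(low <= port <= high for low, high in exclude):
        return False
    return not include or any(low <= port <= high for low, high in include)


include, exclude = parse_port_spec("80,443,3000-4000,^53")
print(include)  # [(80, 80), (443, 443), (3000, 4000)]
print(exclude)  # [(53, 53)]
```

With only the rule ^53, the sketch treats every port except 53 as impacted, which matches the idea of using exclude rules to carve out unimpacted traffic.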

Targeting traffic to network provider services

For network experiments, Gremlin includes an easy way to target network traffic going to and from third-party service providers. When configuring a network experiment, click on the Providers drop-down and look for the service you want to impact. You can also search for services by typing in the box.

Searching for AWS EC2 endpoints via the Providers box in Gremlin.

Specifying which network device to use during network experiments

All network experiments accept a --device argument that refers to the network interfaces to target. Starting with Linux agent version 2.30.0 / Windows agent version 1.9.0, you can specify one or more network interfaces using either a comma-separated list or with multiple --device arguments.

When unspecified, Gremlin targets all physical network interfaces as reported by the operating system. For virtual and cloud machines, this typically includes the expected network interfaces, such as eth0 and eth1 on Linux and Ethernet on Windows.

Device discovery on older agents

Agents before Linux version 2.30.0 / Windows version 1.9.0 use a different strategy described here. All network experiments accept a --device argument that refers to the network interface to target. Gremlin network experiments target only one network interface at a time. When unspecified, Gremlin chooses an interface according to the following order of operations:

Gremlin omits all loopback devices (determined by RFC1122).
Gremlin selects the device with the lowest interface index that starts with eth, en, or for Windows, Ethernet.
If nothing is found, Gremlin selects the device with the lowest interface index that is non-private (according to RFC1918).
If nothing is found, Gremlin selects the first device with the lowest interface index.
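
The order of operations above can be expressed as a short Python sketch. It is an illustration of the listed rules only. The Interface record and the choose_device function are hypothetical, each interface is assumed to have a single IPv4 address, and the agent's real discovery code may differ.

```python
import ipaddress
from dataclasses import dataclass

LOOPBACK = ipaddress.ip_network("127.0.0.0/8")  # loopback block, RFC 1122
RFC1918 = [
    ipaddress.ip_network(block)
    for block in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


@dataclass
class Interface:
    name: str
    index: int
    address: str  # assumption: one IPv4 address per interface


def is_private(interface):
    address = ipaddress.ip_address(interface.address)
    return any(address in block for block in RFC1918)


def choose_device(interfaces):
    """Apply the fallback order described above and return one interface (or None)."""
    # Step 1: drop loopback devices, then sort by interface index.
    candidates = sorted(
        (i for i in interfaces if ipaddress.ip_address(i.address) not in LOOPBACK),
        key=lambda i: i.index,
    )
    # Step 2: lowest index whose name starts with eth, en, or Ethernet.
    for interface in candidates:
        if interface.name.startswith(("eth", "en", "Ethernet")):
            return interface
    # Step 3: lowest index with a non-private address.
    for interface in candidates:
        if not is_private(interface):
            return interface
    # Step 4: lowest index among whatever remains.
    return candidates[0] if candidates else None


devices = [
    Interface("lo", 1, "127.0.0.1"),
    Interface("docker0", 2, "172.17.0.1"),
    Interface("eth0", 3, "10.0.0.5"),
]
print(choose_device(devices).name)  # eth0
```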

Relevant privileges

Privilege	Description
EXPERIMENTS_RUN	Allows running an experiment within a team
EXPERIMENTS_READ	Allows reading all experiment information within a team
EXPERIMENTS_WRITE	Allows creating or updating an experiment for a team
HALT_WRITE	Allows halting a specific experiment
HALT_ALL	Allows halting all running experiments and tests company-wide
