
Experiments

Note
Experiments were previously called Attacks. You may still see references to "attack" in the Gremlin CLI and REST API.

An experiment is a method of injecting failure into a system in a simple, safe, and secure way. Gremlin provides a range of experiments that you can run against your infrastructure. This includes impacting system resources, delaying or dropping network traffic, shutting down hosts, and more. In addition to running one-time experiments, you can also schedule regular or recurring experiments, create experiment templates, and view experiment reports.

Gremlin provides three categories of experiments:

  • Resource experiments: test against sudden changes in consumption of computing resources
  • Network experiments: test against unreliable network conditions
  • State experiments: test against unexpected changes in your environment such as power outages, node failures, clock drift, or application crashes

Each experiment tests your resilience in a different way.

Resource Experiments

Resource experiments are a great starting point: they are simple to run and understand, and they reveal how your service degrades when starved of CPU, memory, I/O, or disk space.

  • CPU: Generates high load for one or more CPU cores.
  • Memory: Allocates a specific amount of RAM.
  • IO: Puts read/write pressure on I/O devices such as hard disks.
  • Disk: Writes files to disk to fill it to a specific percentage.
  • Process Exhaustion: Simulates running processes on a target to consume process IDs (PIDs).
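
For reference, resource experiments can also be launched from the Gremlin CLI. The sketch below is illustrative only: the attack name follows the gremlin attack form used elsewhere on this page, but the length and core-count flags are assumptions, so confirm them against your agent's CLI help before running.

BASH

# Illustrative sketch: generate CPU load on one core for 60 seconds.
# The -l (length in seconds) and -c (number of cores) flags are assumptions;
# verify the exact options with your agent's CLI help.
gremlin attack cpu -l 60 -c 1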

State Experiments

State experiments modify the state of a target so you can test auto-correction and similar fault-tolerant mechanisms.

  • Shutdown: Performs a shutdown (and an optional reboot) on the host operating system to test how your system behaves when losing one or more cluster machines.
  • Time Travel: Changes the host's system time, which can be used to simulate adjusting to daylight saving time and other time-related events.
  • Process Killer: Kills the specified process, which can be used to simulate application or dependency crashes. Note: Process experiments do not work for process ID 1; consider a Shutdown experiment instead.
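
State experiments can likewise be triggered from the CLI using the gremlin attack form shown elsewhere on this page. The flags below (a delay and a reboot switch) are assumptions for illustration only; check your agent's CLI help for the exact options.

BASH

# Illustrative sketch: shut down the host after a short delay, then reboot it.
# The -d (delay in minutes) and -r (reboot) flags are assumptions; confirm with the CLI help.
gremlin attack shutdown -d 1 -r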

Network Experiments

Network experiments test the impact of lost or delayed traffic to a target. Test how your service behaves when you are unable to reach one of your dependencies, internal or external. Limit the impact to only the traffic you want to test by specifying ports, hostnames, and IP addresses.

  • Blackhole: Drops all matching network traffic.
  • Certificate Expiry: Checks for expiring security certificates.
  • Latency: Injects latency into all matching egress network traffic.
  • Packet Loss: Induces packet loss into all matching egress network traffic.
  • DNS: Blocks access to DNS servers.

Warning: Important considerations for targeting Kubernetes Pods with Network experiments

Network host tags

You can use tags to specify the IP addresses whose traffic should be impacted during network experiments. This is important in today's ephemeral environments, where hosts live for a short time and have dynamic IP addresses. Just as custom tags indicate where an experiment should run, the same tags can indicate the hosts whose network traffic should be impacted. For example, to test latency between serviceA and serviceB, select all clients with the tag service:serviceA when choosing the Hosts to target, and select the tag service:serviceB when configuring the Network experiment. IP addresses assigned to the network interface by the container runtime are also automatically included.

Network providers

Limit the impact of a network experiment to specific external service providers. Select one or more services and their associated regions to impact. Gremlin currently supports AWS, Azure, and Datadog services. The destination network configuration is updated automatically each day using these sources: the AWS discovery service and Azure service tags.

Network device selection

All network experiments accept a --device argument that refers to the network interfaces to target. Starting with Linux agent version 2.30.0 / Windows agent version 1.9.0, you can specify one or more network interfaces using either a comma-separated list or multiple --device arguments.
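
For example, assuming interfaces named eth0 and eth1, both of the following forms target the same two interfaces in a single latency experiment (the attack type, --device, and -p options are the ones shown elsewhere on this page):

BASH

# Comma-separated list of interfaces
gremlin attack latency --device eth0,eth1 -p 443

# Equivalent form using multiple --device arguments
gremlin attack latency --device eth0 --device eth1 -p 443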

When unspecified, Gremlin targets all physical network interfaces as reported by the operating system. For virtual or cloud machines, that typically includes the expected network interfaces, such as eth0 and eth1 on Linux and Ethernet on Windows.

Device discovery on older agents

Agents older than Linux version 2.30.0 / Windows version 1.9.0 use a different strategy. On these agents, all network experiments accept a --device argument that refers to the network interface to target, and Gremlin network experiments target only one network interface at a time. When unspecified, Gremlin chooses an interface according to the following order of operations:

  • Gremlin omits all loopback devices (as defined by RFC 1122).
  • Gremlin selects the device with the lowest interface index whose name starts with eth, en, or, on Windows, Ethernet.
  • If nothing was found, Gremlin selects the device with the lowest interface index that has a non-private address (according to RFC 1918).
  • If nothing was found, Gremlin selects the device with the lowest interface index.

Experiment stage progression

Every experiment in Gremlin is composed of one or more Executions, where each Execution is an instance of the experiment running on a specific target.

The Stage progression of an experiment is derived from the Stage progressions of all of its Executions. Gremlin weighs the importance of Stages and marks the experiment with the most important Stage among its Executions.

Example

An experiment with three Executions will derive its final reported Stage by picking the most important Stage from among its Executions. So, if the three Execution Stages are TargetNotFound, Running, and TargetNotFound, the resulting Stage for the experiment will be Running.

You can see Stages ordered by their importance in the following section.

Stages

Stages are sorted in descending order of importance (the Running Stage holds the highest importance).

  • Running: Experiment running on the host
  • Halt: Experiment told to halt
  • RollbackStarted: Code to roll back has started
  • RollbackTriggered: Daemon started a rollback of the client
  • InterruptTriggered: Daemon issued an interrupt to the client
  • HaltDistributed: Distributed to the host but not yet halted
  • Initializing: Experiment is creating the desired impact
  • Distributed: Distributed to the host but not yet running
  • Pending: Created but not yet distributed
  • Failed: Client reported an unexpected failure
  • HaltFailed: Halt on the client did not complete
  • InitializationFailed: Creating the impact failed
  • LostCommunication: Client never reported receiving or finishing the execution
  • ClientAborted: Something on the client/daemon side stopped the Gremlin, and it was aborted without user intervention
  • UserHalted: User issued a halt, and it is now complete
  • Successful: Completed running on the host
  • TargetNotFound: Experiment not scoped to any current targets

Scheduling experiments

Experiments can be run ad hoc or on a schedule, from the Web App or programmatically. You can schedule experiments to execute on certain days and within a specified time window. You can also set the maximum number of experiments a schedule can generate.

Running experiments on Kubernetes objects

Gremlin allows you to target objects within your Kubernetes clusters. After selecting a cluster, you can filter the visible set of objects by selecting a namespace. Select any of your Deployments, ReplicaSets, StatefulSets, DaemonSets, or Pods. When an object is selected, all of its child objects are also targeted. For example, selecting a DaemonSet also selects all of the Pods within it.

Only parent Kubernetes objects are available to target. Pods will be listed only if they don't belong to a Set or Deployment.

Selecting containers

For State and Resource experiment types, you can target all, any, or specific containers within a selected Pod. Once you select your targets, these options appear under Choose a Gremlin on the Experiment page. Selecting Any targets a single container within each Pod at runtime. If you've selected more than one target (for example, a Deployment), you can select from a list of containers common to all of those targets. When you run the experiment, the underlying containers within the selected objects are impacted.

Any, all, or specific options for container experiments
Containers share resources with their hosts. Running resource experiments on Kubernetes objects will impact the hosts where the targeted containers are running, including the host's full set of containers. Targeted containers must also be able to resolve api.gremlin.com; otherwise the experiment will fail. Gremlin adopts all of the configuration and resources of the pod it is experimenting on.

Monitoring experiments in real time

For CPU and Shutdown experiments, you can observe your environment in real time in Gremlin to quickly verify the effect of your experiments. For CPU experiments, you can see statistics for CPU load; for Shutdown experiments, you can see machine uptime.

Monitor CPU experiments in real time

Enabling Experiment Visualizations

Company Admins and Owners can turn this feature on for their company by visiting the Company Settings, clicking Settings, and toggling Experiment Visualizations on. Only data relevant to the experiment is collected and no data is collected when experiments are not running.

Overriding Experiment Visualizations for a host

To prevent a host from sending metrics that populate experiment visualization charts, add PUSH_METRICS="0" to the gremlind configuration on that host. This overrides the company preference and prevents that particular host from sending metrics.
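
As a minimal sketch, assuming a Linux package install where gremlind reads environment variables from /etc/default/gremlind (the path and service name are assumptions and may differ in your environment), the override could be applied like this:

BASH

# Assumed environment file for a Debian/Ubuntu package install; adjust the path for your setup.
echo 'PUSH_METRICS="0"' | sudo tee -a /etc/default/gremlind

# Restart the daemon so the setting takes effect (service name assumed to be gremlind).
sudo systemctl restart gremlind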

Parameter reference

For details on parameters supplied to individual experiments, check out the links to the individual experiment pages at the beginning of this page.

Include new targets in ongoing experiments

When selecting targets by tag, you have the option to check the Include New Targets checkbox. When checked, if Gremlin detects a new target that meets the experiment's selection criteria, it will distribute the experiment to the target. By default, new targets will not run the experiment even if they match the selection criteria.

For example, imagine you select all EC2 hosts in the AWS <span class="code-class-custom">us-east-1</span> region for a CPU experiment. When you run the experiment, AWS detects the increased CPU usage and automatically provisions a new EC2 instance and installs the Gremlin agent. If Include New Targets is checked, Gremlin will add this new instance to the ongoing CPU experiment.

Multiple values

Port and address options can be used multiple times in a single command.

BASH

# Run a latency experiment on both DynamoDB and database.mydomain.org
gremlin attack latency -h dynamodb.us-west-1.amazonaws.com -h database.mydomain.org

Alternatively, a comma (,) can also be used to specify multiple values.

BASH

gremlin attack latency -p 8080,443

For a range of IP addresses, CIDR notation can be used (for example, 10.0.0.0/24).

Exclude rules

A caret (^) can be used before a port or address to exclude that argument from the set of impacted network targets.

If only exclude rules are supplied, all other traffic is impacted.

BASH

# Slow down all ports except DNS port
gremlin attack latency -p ^53

This can be particularly useful for excluding a specific IP from a range that is otherwise impacted by the experiment.

BASH

# Blackhole all hosts in 10.0.0.0/24 except for 10.0.0.11
gremlin attack blackhole -i 10.0.0.0/24 -i ^10.0.0.11
