Gremlin is a simple, safe, and secure way to use Chaos Engineering to improve system resilience. The Gremlin Platform provides a range of attacks which you can run against your infrastructure. This includes Resource Gremlins, Network Gremlins and State Gremlins. It is also possible to schedule regular attacks, create attack templates, and view attack reports.
Gremlin provides a library of possible failure modes to test. You can impact system resources, delay or drop network traffic to your dependencies, shut down your hosts, and much more!
Each attack, or "gremlin", tests your resilience in a different way:
Resource gremlins are a great starting point -- simple to run and understand. They reveal how your service degrades when starved of CPU, memory, IO, or disk.
| Gremlin | Impact |
|---|---|
| CPU | Generates high load for one or more CPU cores. |
| Memory | Allocates a specific amount of RAM. |
| IO | Puts read/write pressure on I/O devices such as hard disks. |
| Disk | Writes files to disk to fill it to a specific percentage. |
State gremlins introduce chaos into your infrastructure so that you can observe how well your service handles it or fails.
| Gremlin | Impact |
|---|---|
| Shutdown | Performs a shutdown (and an optional reboot) on the host operating system to test how your system behaves when losing one or more cluster machines. |
| Time Travel | Changes the host's system time, which can be used to simulate adjusting to daylight saving time and other time-related events. |
| Process Killer | Kills the specified process, which can be used to simulate application or dependency crashes. (Note: does not work for PID 1; consider a Shutdown attack instead.) |
Network gremlins allow you to see the impact of lost or delayed traffic to your application. Test how your service behaves when you are unable to reach one of your dependencies, internal or external. Limit the impact to only the traffic you want to test by specifying ports, hostnames, and IP addresses.
| Gremlin | Impact |
|---|---|
| Blackhole | Drops all matching network traffic. |
| Latency | Injects latency into all matching egress network traffic. |
| Packet Loss | Induces packet loss into all matching egress network traffic. |
| DNS | Blocks access to DNS servers. |
Tags can also be used to select the IP addresses whose traffic a network attack impacts. This matters in today's ephemeral environments, where hosts are short-lived and have dynamic IP addresses. Just as custom tags indicate where an attack should run, the same tags can indicate the hosts to which network traffic should be impacted. For example, to test latency between `serviceA` and `serviceB`, select all clients with the tag `service:serviceA` when choosing the Hosts to target, and select the tag `service:serviceB` when configuring the Network Gremlin to run.
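The tag-based selection described here can be pictured with a small sketch. It is illustrative only: the `HOSTS` inventory, `ips_with_tag`, and `plan_network_attack` are hypothetical names and do not correspond to any Gremlin API. The sketch only shows the idea that one tag chooses where the attack runs and another tag resolves to the destination addresses at attack time.

```python
# Hypothetical host inventory; in a real environment the tags and IPs
# would come from whatever hosts are alive when the attack starts.
HOSTS = [
    {"ip": "10.0.1.10", "tags": {"service": "serviceA"}},
    {"ip": "10.0.1.11", "tags": {"service": "serviceA"}},
    {"ip": "10.0.2.20", "tags": {"service": "serviceB"}},
    {"ip": "10.0.3.30", "tags": {"service": "serviceC"}},
]


def ips_with_tag(hosts, tag):
    """Return the IPs of hosts carrying a "key:value" tag."""
    key, value = tag.split(":", 1)
    return [h["ip"] for h in hosts if h["tags"].get(key) == value]


def plan_network_attack(hosts, target_tag, destination_tag):
    """Resolve both tags to concrete addresses (assumed behavior, for illustration)."""
    return {
        "run_on": ips_with_tag(hosts, target_tag),
        "impact_traffic_to": ips_with_tag(hosts, destination_tag),
    }


plan = plan_network_attack(HOSTS, "service:serviceA", "service:serviceB")
print(plan)
```

Because the tags are resolved against the current inventory, replacing a `serviceB` host with one at a new IP changes the destination list without any change to the attack definition.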
Limit the blast radius of a network attack to specific external service providers. Select one or many AWS services and their associated region to impact. The destination network configuration is automatically updated daily using an AWS discovery service.
Every Attack on Gremlin is composed of one or more Executions, where each Execution is an instance of the attack running on a specific target.
The Stage progression of an Attack is derived from the Stage progression of all of an Attack's Executions. Gremlin weighs the Importance of Stages so as to mark an Attack with the most important Stage of its executions.
An Attack with three Executions derives its final reported Stage by picking the most important Stage among them. So, if the three Execution Stages are `TargetNotFound`, `Running`, `TargetNotFound`, the resulting Stage for the Attack will be `Running`.
The Stages are listed below in descending order of importance (the `Running` Stage holds the highest importance).
| Stage | Description |
|---|---|
| Running | Attack running on the host |
| Halt | Attack told to halt |
| RollbackStarted | Code to rollback has started |
| RollbackTriggered | Daemon started a rollback of client |
| InterruptTriggered | Daemon issued an interrupt to the client |
| HaltDistributed | Distributed to the host but not yet halted |
| Initializing | Attack is creating the desired impact |
| Distributed | Distributed to the host but not yet running |
| Pending | Created but not yet distributed |
| Failed | Client reported unexpected failure |
| HaltFailed | Halt on client did not complete |
| InitializationFailed | Creating the impact failed |
| LostCommunication | Client never reported finishing/receiving execution |
| ClientAborted | Something on the client/daemon side stopped the Gremlin and it was aborted without user intervention |
| UserHalted | User issued a halt, and that is now complete |
| Successful | Completed running on the Host |
| TargetNotFound | Attack not scoped to any current targets |
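The rule for deriving an Attack's Stage from its Executions can be sketched in a few lines. This is a minimal illustration of the ordering in the table above, not Gremlin's implementation; the function name `attack_stage` is hypothetical.

```python
# Stage names copied from the table above, most important first.
STAGES_BY_IMPORTANCE = [
    "Running", "Halt", "RollbackStarted", "RollbackTriggered",
    "InterruptTriggered", "HaltDistributed", "Initializing", "Distributed",
    "Pending", "Failed", "HaltFailed", "InitializationFailed",
    "LostCommunication", "ClientAborted", "UserHalted", "Successful",
    "TargetNotFound",
]

# Lower rank means higher importance.
_RANK = {stage: rank for rank, stage in enumerate(STAGES_BY_IMPORTANCE)}


def attack_stage(execution_stages):
    """Return the most important Stage among an Attack's Executions."""
    if not execution_stages:
        raise ValueError("an Attack has at least one Execution")
    return min(execution_stages, key=_RANK.__getitem__)


print(attack_stage(["TargetNotFound", "Running", "TargetNotFound"]))  # Running
print(attack_stage(["Successful", "Failed", "Successful"]))  # Failed
```

One consequence of this ordering is that a single Execution that is still `Running` keeps the whole Attack reported as `Running`, and a single `Failed` Execution outweighs any number of `Successful` ones.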
For CPU and Shutdown experiments, you can observe your environment in real time in the Gremlin UI to quickly verify the effect of the experiment.
Company Admins and Owners can turn this feature on for their company by going to "Company Settings", clicking "Settings", and toggling "Attack Visualizations" on. No data is collected when attacks are not running, and only data relevant to the attack is collected:
- CPU: statistics for CPU load
- Shutdown: machine uptime
To prevent a particular host from sending metrics to populate these charts, add `PUSH_METRICS="0"` to the configuration for `gremlind` on that host. This overrides the company preference for that host only.
For details on parameters supplied to individual attacks, check out the links to each attack at the top of this page.
Port and address options can be used multiple times in a single command.
```shell
# Attack both DynamoDB and database.mydomain.org
gremlin attack latency -h dynamodb.us-west-1.amazonaws.com -h database.mydomain.org
```
`,` can also be used to specify multiple values.
```shell
gremlin attack latency -p 8080,443
```
`^` can be used before a port or address to exclude that argument from the set of impacted network targets.
Note: If only exclude rules are supplied, all other traffic is impacted.
```shell
# Slow down all ports except DNS port
gremlin attack latency -p ^53
```
This can be particularly useful for excluding a specific IP from a range that is otherwise impacted by the attack.
```shell
# Blackhole all hosts in 10.0.0.0/24 except for 10.0.0.11
gremlin attack blackhole -i 10.0.0.0/24 -i ^10.0.0.11
```
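The include and exclude semantics can be made concrete with a short sketch. It is an approximation under stated assumptions, not Gremlin's matching code: `parse_rules` and `is_impacted` are hypothetical helpers, exclusions are assumed to take precedence over inclusions, only IP addresses and CIDR ranges are handled (not hostnames), and an empty rule set is assumed to impact everything.

```python
import ipaddress


def parse_rules(*values):
    """Split values such as "10.0.0.0/24" or "^10.0.0.11" into include and exclude lists."""
    include, exclude = [], []
    for value in values:
        for item in value.split(","):  # "," separates multiple values
            item = item.strip()
            if not item:
                continue
            bucket = exclude if item.startswith("^") else include
            bucket.append(ipaddress.ip_network(item.lstrip("^")))
    return include, exclude


def is_impacted(address, include, exclude):
    """Decide whether traffic to `address` falls inside the attack's scope."""
    ip = ipaddress.ip_address(address)
    if any(ip in net for net in exclude):
        return False  # assumption: an exclude rule always wins
    # With only exclude rules, all other traffic is impacted.
    return not include or any(ip in net for net in include)


include, exclude = parse_rules("10.0.0.0/24", "^10.0.0.11")
print(is_impacted("10.0.0.10", include, exclude))  # True
print(is_impacted("10.0.0.11", include, exclude))  # False
print(is_impacted("10.0.1.5", include, exclude))   # False
```

Passing only `"^10.0.0.11"` leaves the include list empty, so every address other than `10.0.0.11` is treated as impacted, which mirrors the note about exclude-only rules.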
All network attacks accept a `--device` argument that refers to the network interface to target. Gremlin network attacks target only one network interface at a time. When unspecified, Gremlin chooses an interface according to the following order of operations:
- Gremlin omits all loopback devices (determined by RFC1122).
- Gremlin selects the device with the lowest interface index that starts with
en, or for Windows
- If nothing was found, Gremlin selects the device with the lowest interface index that is non-private (according to RFC1918).
- If nothing was found, Gremlin selects the first device with the lowest interface index.
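The selection order above can be sketched as a small function. This is a hedged illustration, not Gremlin's code: the `Interface` record and `choose_device` are hypothetical, each interface is assumed to have a single IPv4 address, and the preferred name prefixes default to `en` only because that is the one prefix visible in the list above (the Windows prefix is not shown there, so it is left as a parameter).

```python
import ipaddress
from dataclasses import dataclass

LOOPBACK = ipaddress.ip_network("127.0.0.0/8")  # loopback block per RFC 1122
RFC1918 = [
    ipaddress.ip_network(n)
    for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


@dataclass
class Interface:
    """Hypothetical record for one network interface."""
    name: str
    index: int
    address: str


def choose_device(interfaces, preferred_prefixes=("en",)):
    """Pick a default interface following the documented order of operations."""
    # Step 1: omit loopback devices, then consider the rest by interface index.
    candidates = sorted(
        (i for i in interfaces if ipaddress.ip_address(i.address) not in LOOPBACK),
        key=lambda i: i.index,
    )
    # Step 2: lowest index whose name starts with a preferred prefix.
    for iface in candidates:
        if iface.name.startswith(preferred_prefixes):
            return iface
    # Step 3: lowest index with a non-private (non-RFC 1918) address.
    for iface in candidates:
        ip = ipaddress.ip_address(iface.address)
        if not any(ip in net for net in RFC1918):
            return iface
    # Step 4: fall back to the lowest index of whatever is left.
    return candidates[0] if candidates else None


devices = [
    Interface("lo", 1, "127.0.0.1"),
    Interface("docker0", 3, "172.17.0.1"),
    Interface("ens5", 2, "10.0.1.20"),
]
print(choose_device(devices).name)  # ens5
```

On hosts with several plausible interfaces, the fallback steps may not pick the one carrying the traffic under test, which is when passing `--device` explicitly is the safer choice.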
The following sub-sections have moved. They are preserved here to support older links.