Attacks

An attack is a method of injecting failure into a system in a simple, safe, and secure way. Gremlin provides a range of attacks that you can run against your infrastructure, including impacting system resources, delaying or dropping network traffic, and shutting down hosts. In addition to running one-time attacks, you can schedule recurring attacks, create attack templates, and view attack reports.

Gremlin provides three categories of attacks:

  • Resource attacks: test against sudden changes in consumption of computing resources
  • Network attacks: test against unreliable network conditions
  • State attacks: test against unexpected changes in your environment such as power outages, node failures, clock drift, or application crashes

Each attack, or "Gremlin", tests your resilience in a different way.

Resource Gremlins

Resource Gremlins are a great starting point because they are simple to run and understand. They reveal how your service degrades when starved of CPU, memory, IO, or disk space.

| Gremlin | Impact |
| --- | --- |
| CPU | Generates high load for one or more CPU cores. |
| Memory | Allocates a specific amount of RAM. |
| IO | Puts read/write pressure on I/O devices such as hard disks. |
| Disk | Writes files to disk to fill it to a specific percentage. |

State Gremlins

State Gremlins introduce chaos into your infrastructure so that you can observe how well your service handles it, or whether it fails.

| Gremlin | Impact |
| --- | --- |
| Shutdown | Performs a shutdown (and an optional reboot) on the host operating system to test how your system behaves when losing one or more cluster machines. |
| Time Travel | Changes the host's system time, which can be used to simulate adjusting to daylight saving time and other time-related events. |
| Process Killer | Kills the specified process, which can be used to simulate application or dependency crashes. Note: Process attacks do not work for Process ID 1; consider a Shutdown attack instead. |

Network Gremlins

Network Gremlins show you the impact of lost or delayed traffic to your application. Test how your service behaves when you are unable to reach one of your dependencies, internal or external. Limit the impact to only the traffic you want to test by specifying ports, hostnames, and IP addresses.

| Gremlin | Impact |
| --- | --- |
| Blackhole | Drops all matching network traffic. |
| Latency | Injects latency into all matching egress network traffic. |
| Packet Loss | Induces packet loss into all matching egress network traffic. |
| DNS | Blocks access to DNS servers. |

Warning: Important considerations for targeting Kubernetes Pods with Network Attacks

Network host tags

You can use tags to target the IP addresses whose traffic should be impacted during network attacks. This is important in today's ephemeral environments, where hosts are short-lived and have dynamic IP addresses. Just as custom tags indicate where an attack should run, the same tags can indicate which hosts' network traffic should be impacted. For example, to test latency between serviceA and serviceB, select all clients with the tag service:serviceA when choosing the Hosts to target, and select the tag service:serviceB when configuring the Network Gremlin.

Network providers

Limit the blast radius of a network attack to specific external service providers. Select one or more AWS services and their associated regions to impact. The destination network configuration is automatically updated daily using an AWS discovery service.

Attack stage progression

Every Attack in Gremlin is composed of one or more Executions, where each Execution is an instance of the attack running on a specific target.

The Stage progression of an Attack is derived from the Stage progression of all of an Attack's Executions. Gremlin weighs the importance of Stages to mark an Attack with the most important Stage of its executions.

Example

An Attack with three Executions will derive its final reported stage by picking the most important stage from among its executions. So, if the three Execution Stages are TargetNotFound, Running, TargetNotFound, the resulting stage for the Attack will be Running.

You can see Stages ordered by their importance in the following section.

Stages

Stages are sorted in descending order of importance (the Running Stage has the highest importance).

| Stage | Description |
| --- | --- |
| Running | Attack running on the host |
| Halt | Attack told to halt |
| RollbackStarted | Code to roll back has started |
| RollbackTriggered | Daemon started a rollback of client |
| InterruptTriggered | Daemon issued an interrupt to the client |
| HaltDistributed | Distributed to the host but not yet halted |
| Initializing | Attack is creating the desired impact |
| Distributed | Distributed to the host but not yet running |
| Pending | Created but not yet distributed |
| Failed | Client reported unexpected failure |
| HaltFailed | Halt on client did not complete |
| InitializationFailed | Creating the impact failed |
| LostCommunication | Client never reported finishing/receiving execution |
| ClientAborted | Something on the client/daemon side stopped the Gremlin and it was aborted without user intervention |
| UserHalted | User issued a halt, and that is now complete |
| Successful | Completed running on the Host |
| TargetNotFound | Attack not scoped to any current targets |
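
To make the derivation concrete, here is a minimal sketch of picking the most important Stage from a set of Execution Stages, using the ordering listed above. The function names stage_rank and derive_attack_stage are hypothetical and only illustrate the rule; they are not part of Gremlin's CLI or API, and Gremlin's own implementation may differ.

```bash
# Hypothetical helper: rank of a Stage in the ordering above (1 = most important).
stage_rank() {
  case "$1" in
    Running) echo 1 ;;
    Halt) echo 2 ;;
    RollbackStarted) echo 3 ;;
    RollbackTriggered) echo 4 ;;
    InterruptTriggered) echo 5 ;;
    HaltDistributed) echo 6 ;;
    Initializing) echo 7 ;;
    Distributed) echo 8 ;;
    Pending) echo 9 ;;
    Failed) echo 10 ;;
    HaltFailed) echo 11 ;;
    InitializationFailed) echo 12 ;;
    LostCommunication) echo 13 ;;
    ClientAborted) echo 14 ;;
    UserHalted) echo 15 ;;
    Successful) echo 16 ;;
    TargetNotFound) echo 17 ;;
    *) echo 99 ;;  # assumption: unknown stages are treated as least important
  esac
}

# Hypothetical helper: print the most important Stage among the arguments.
derive_attack_stage() {
  best=""
  best_rank=100
  for stage in "$@"; do
    rank=$(stage_rank "$stage")
    if [ "$rank" -lt "$best_rank" ]; then
      best=$stage
      best_rank=$rank
    fi
  done
  echo "$best"
}

# The example from the text: three Executions, one of them still running.
derive_attack_stage TargetNotFound Running TargetNotFound   # prints "Running"
```

Because the Attack Stage is just the highest-ranked Execution Stage, an Attack keeps reporting Running for as long as any single Execution is still running.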

Running attacks on Kubernetes objects

Gremlin allows targeting objects within your Kubernetes clusters. After selecting a cluster, you can filter the visible set of objects by selecting a namespace. Select any of your Deployments, ReplicaSets, StatefulSets, DaemonSets, or Pods. When one object is selected, all child objects will also be targeted. For example, when selecting a DaemonSet, all of the pods within will be selected.

Selecting containers

For State and Resource attack types, you can target all, any, or specific containers within a selected pod. Once you select your targets, these options will be available under Choose a Gremlin on the Attack page. Selecting Any will target a single container within each pod at runtime. If you've selected more than one target (for example, Deployment), you can select from a list of common containers across all of these targets. When you run the attack, the underlying containers within the objects selected will be impacted.

Any, all, or specific options for container attacks

Monitoring attacks in real time

For CPU and Shutdown experiments, you can observe your environment in real time in Gremlin to quickly verify the effect of the experiment. For CPU attacks, you can see CPU load statistics; for Shutdown attacks, you can see machine uptime.

Monitor CPU attacks in real time

Enabling Attack Visualizations

Company Admins and Owners can turn this feature on for their company by visiting the Company Settings, clicking Settings, and toggling Attack Visualizations on. Only data relevant to the attack is collected and no data is collected when attacks are not running.

Overriding Attack Visualizations for a host

To prevent any host from sending metrics to populate attack visualization charts, add PUSH_METRICS="0" to the configuration for 'gremlind' on that host. This will override the company preference and will prevent that particular host from sending metrics.

Parameter reference

For details on parameters supplied to individual attacks, check out the links to the individual attack pages at the beginning of this page.

Multiple values

Port and address options can be used multiple times in a single command.

```bash
# Attack both DynamoDB and database.mydomain.org
gremlin attack latency -h dynamodb.us-west-1.amazonaws.com -h database.mydomain.org
```

Alternatively, use a comma (,) to separate multiple values in a single option.

```bash
gremlin attack latency -p 8080,443
```

Exclude rules

A ^ can be used before a port or address to exclude that argument from the set of impacted network targets.

```bash
# Slow down all ports except DNS port
gremlin attack latency -p ^53
```

This can be particularly useful for excluding a specific IP from a range that is otherwise impacted by the attack.

```bash
# Blackhole all hosts in 10.0.0.0/24 except for 10.0.0.11
gremlin attack blackhole -i 10.0.0.0/24 -i ^10.0.0.11
```
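
One way to read the include and exclude semantics is sketched below for port lists. This is an illustration of the behavior described above, not Gremlin's implementation: port_is_impacted is a hypothetical helper. It assumes that an excluded value is never impacted, and that when no plain (non-^) values are given, everything not excluded is impacted.

```bash
# Hypothetical helper, not part of the Gremlin CLI.
# Usage: port_is_impacted PORT SPEC
#   SPEC is a comma-separated list such as "8080,443" or "^53".
# Returns 0 if PORT would be impacted under the assumptions above, 1 otherwise.
port_is_impacted() {
  port=$1
  has_include=0
  included=0
  for rule in $(printf '%s' "$2" | tr ',' ' '); do
    case "$rule" in
      "^$port") return 1 ;;                   # this port is explicitly excluded
      "^"*) ;;                                # some other port is excluded
      "$port") has_include=1; included=1 ;;   # this port is explicitly included
      *) has_include=1 ;;                     # some other port is included
    esac
  done
  # With no include rules, every port that was not excluded is impacted.
  [ "$has_include" -eq 0 ] || [ "$included" -eq 1 ]
}

port_is_impacted 53 "^53"      || echo "port 53 is left alone"
port_is_impacted 443 "^53"     && echo "port 443 is impacted"
port_is_impacted 22 "8080,443" || echo "port 22 is left alone"
```

The same reasoning applies to addresses: in the blackhole example, 10.0.0.0/24 defines the impacted set and ^10.0.0.11 removes one member from it.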

Device discovery

All network attacks accept a --device argument that refers to the network interface to target. Gremlin network attacks target only one network interface at a time. When unspecified, Gremlin chooses an interface according to the following order of operations:

  • Gremlin omits all loopback devices (determined by [RFC1122]).
  • Gremlin selects the device with the lowest interface index whose name starts with eth or en (or Ethernet on Windows).
  • If none is found, Gremlin selects the device with the lowest interface index that is non-private (according to [RFC1918]).
  • If none is found, Gremlin selects the device with the lowest interface index.
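
The selection order above can be sketched as a small shell function. This is only an approximation for reasoning about which interface would be picked: select_device and its input format ("INDEX NAME IPV4", one device per line) are hypothetical, loopback is detected here simply by a 127.x.x.x address, and Gremlin's actual logic may differ.

```bash
# Hypothetical helper: reads "INDEX NAME IPV4" lines on stdin and prints
# the name of the device that the order of operations above would choose.
select_device() {
  # Step 1: drop loopback devices (127.0.0.0/8), then order by interface index.
  candidates=$(grep -v ' 127\.' | sort -n)

  # Step 2: lowest index whose name starts with eth, en, or Ethernet.
  choice=$(printf '%s\n' "$candidates" |
    awk 'NF >= 3 && $2 ~ /^(eth|en|Ethernet)/ { print $2; exit }')

  # Step 3: otherwise, lowest index with a non-private (non-RFC 1918) address.
  if [ -z "$choice" ]; then
    choice=$(printf '%s\n' "$candidates" |
      awk 'NF >= 3 && $3 !~ /^(10\.|192\.168\.|172\.(1[6-9]|2[0-9]|3[01])\.)/ { print $2; exit }')
  fi

  # Step 4: otherwise, lowest index among the remaining devices.
  if [ -z "$choice" ]; then
    choice=$(printf '%s\n' "$candidates" | awk 'NF >= 3 { print $2; exit }')
  fi

  printf '%s\n' "$choice"
}

# eth0 wins even though docker0 is also present.
printf '%s\n' '1 lo 127.0.0.1' '3 docker0 172.17.0.1' '2 eth0 10.0.0.5' | select_device   # prints "eth0"
```

If the device chosen this way is not the one you want to test, pass --device explicitly rather than relying on discovery.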