
Targets

A target is any infrastructure or application resource that you can run experiments on. This can include Amazon EC2 instances, Kubernetes cluster resources, DigitalOcean droplets, and bare metal servers in your own data center. You can view all targets in the Agents section of the web app.

Hosts

Hosts are bare metal or virtualized systems (including cluster nodes) that are available for experiments. When creating experiments in the Web UI, users can filter the list of hosts by their tags and select individual hosts by their Identifier.

In a containerized environment like Kubernetes or Docker, running experiments on hosts can impact the containers and processes running in those environments. To limit the scope of your experiments, you can instead use container-based or Kubernetes-based targeting in those environments.

Host states

After installation, the Gremlin agent creates an outbound network connection to api.gremlin.com over port 443. When the agent authenticates, its host appears on the hosts page in the Active state. It remains in this state for as long as the agent can reach the Gremlin API. If the agent can't reach the Gremlin API for five minutes, the host enters the Idle state. If the agent does not reconnect within the next twelve hours, the host is automatically deleted from our systems. If the agent reconnects after this happens, the host is recreated as if it were a new host.
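
This lifecycle amounts to a simple state machine. The sketch below is an illustrative model of the documented thresholds, not Gremlin's implementation:

```python
from datetime import datetime, timedelta

# Illustrative model of the documented host lifecycle. The thresholds
# come from the paragraph above; the code is not Gremlin's implementation.
IDLE_AFTER = timedelta(minutes=5)    # unreachable this long -> Idle
DELETE_AFTER = timedelta(hours=12)   # Idle this long -> removed

def host_state(last_seen: datetime, now: datetime) -> str:
    """Return the state a host would be shown in, given its last check-in."""
    elapsed = now - last_seen
    if elapsed < IDLE_AFTER:
        return "Active"
    if elapsed < IDLE_AFTER + DELETE_AFTER:
        return "Idle"
    return "Deleted"  # host is removed; reconnecting recreates it as new

print(host_state(datetime(2024, 1, 1, 12, 0), datetime(2024, 1, 1, 12, 3)))   # Active
print(host_state(datetime(2024, 1, 1, 12, 0), datetime(2024, 1, 1, 12, 10)))  # Idle
print(host_state(datetime(2024, 1, 1, 12, 0), datetime(2024, 1, 2, 1, 0)))    # Deleted
```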

Users also have the option to manually revoke a host. In this case, the host will be put into a Revoked state and will not be able to run experiments. This action can be reversed.

Cloud tagging

As of version 2.11.6, the Gremlin Agent supports automatic tagging of all hosts running in the three major cloud providers. If the Gremlin Agent can detect and parse cloud metadata, it automatically appends specific attributes to the host in the form of tags. This lets users use cloud metadata to target particular sections of their infrastructure out of the box.

AWS

The following AWS metadata attributes are supported for automatic tagging.

  • azid
  • image-id
  • instance-id
  • instance-type
  • local-hostname
  • local-ip
  • public-hostname
  • public-ip
  • region
  • zone

To include custom AWS tags, ensure your EC2 instances are granted the ec2:DescribeTags permission. For RPM and DEB installations, you will also need to install the AWS CLI.

NOTE: Update Gremlin to at least Linux 2.15.9 or Windows 1.0.11 to use the azid attribute or custom AWS tags.
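
These attributes correspond to paths in the EC2 Instance Metadata Service (IMDS). The sketch below shows one way to read a few of them yourself using IMDSv2; it is illustrative only, not how the agent is implemented:

```python
import urllib.request

# Minimal sketch of reading EC2 instance metadata (IMDSv2, token-based).
# Run on an EC2 instance; this is not the Gremlin agent's actual code.
IMDS = "http://169.254.169.254/latest"

def imds_token() -> str:
    req = urllib.request.Request(
        f"{IMDS}/api/token", method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "21600"})
    return urllib.request.urlopen(req).read().decode()

def imds_get(path: str, token: str) -> str:
    req = urllib.request.Request(
        f"{IMDS}/meta-data/{path}",
        headers={"X-aws-ec2-metadata-token": token})
    return urllib.request.urlopen(req).read().decode()

token = imds_token()
# IMDS paths behind a few of the tags listed above:
tags = {
    "instance-id": imds_get("instance-id", token),
    "instance-type": imds_get("instance-type", token),
    "image-id": imds_get("ami-id", token),
    "region": imds_get("placement/region", token),
    "zone": imds_get("placement/availability-zone", token),
    "azid": imds_get("placement/availability-zone-id", token),
}
print(tags)
```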

Azure

The following Azure metadata attributes are supported for automatic tagging.

  • azEnvironment
  • location
  • name
  • osType
  • privateIpAddress
  • publicIpAddress
  • sku
  • vmId
  • vmScaleSetName
  • zone
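
These attributes correspond to Azure's Instance Metadata Service, which returns everything in a single JSON document. A minimal sketch of reading it yourself (illustrative; run on an Azure VM):

```python
import json
import urllib.request

# Minimal sketch of querying the Azure Instance Metadata Service.
# Illustrative only; not the Gremlin agent's actual code.
URL = "http://169.254.169.254/metadata/instance?api-version=2021-02-01"
req = urllib.request.Request(URL, headers={"Metadata": "true"})
meta = json.load(urllib.request.urlopen(req))

# Several of the attributes listed above live under the "compute" section:
compute = meta["compute"]
for key in ("azEnvironment", "location", "name", "osType", "sku",
            "vmId", "vmScaleSetName", "zone"):
    print(key, "=", compute.get(key, ""))
```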

GCP

The following GCP metadata attributes are supported for automatic tagging.

  • hostname
  • id
  • image
  • local-ip
  • name
  • public-ip
  • zone

Containers

Container and pod labels are automatically detected by the Gremlin agent and displayed in the Web UI, where users can individually select the container they wish to target. Containers in a Kubernetes Pod share a namespace and cgroup, so when Gremlin applies a network impact to one container within a Pod, the impact is observed by all containers in that Pod. Note that this does not extend across Pod replicas: if you target a specific Pod replica, the effect applies only to the containers within that replica, not to the rest of the replicas.

It is always recommended to target only a single container of a Pod. If you wish to exclude some containers from the network impact, reduce your blast radius by specifying only the ports relevant to the containers you want impacted.

Containers and resource experiments

What are cgroups?

Control Groups (cgroups) are used by the Linux kernel to limit a group of processes' access to a host's hardware resources. Container runtimes like Docker use cgroups to establish memory, CPU, disk, and IOPS limits.

Multiple containers can run within the same cgroup. When an experiment is started, the Gremlin agent attaches a Gremlin sidecar to the target container(s). The sidecar operates within the same cgroup as the target so that it shares the same resource limits and namespace. When the experiment is finished, any resources used by the Gremlin sidecar are freed and the sidecar is removed.
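
The underlying mechanism is standard Linux cgroup plumbing: a process joins a cgroup by writing its PID into that cgroup's cgroup.procs file. A minimal sketch of that mechanism, using a hypothetical cgroup v2 path (this illustrates the kernel interface, not the Gremlin agent's actual code):

```python
import os

# Hypothetical pod cgroup path; the real path depends on your runtime
# and cgroup driver. Writing to cgroup.procs requires root.
target_cgroup = "/sys/fs/cgroup/kubepods.slice/kubepods-pod1234.slice"

with open(os.path.join(target_cgroup, "cgroup.procs"), "w") as f:
    f.write(str(os.getpid()))

# From this point on, this process is subject to the same memory and CPU
# limits as the other processes in the target cgroup.
```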

Gremlin and cgroups

With cgroup integration, Gremlin's resource consumption is designed to behave as if it were running inside the containers it targets. Gremlin accomplishes this by running under the cgroup of its target container, so any limits enforced on the target are also enforced on Gremlin. Kernel protections like the Out-of-Memory (OOM) killer can terminate container processes, including Gremlin, when they exhaust all of the resources available to them.

For example, when running a memory experiment, Gremlin queries the cgroup for limits and the current memory usage of the target container(s). From those values, we derive how much additional memory to consume. Likewise, the CPU experiment identifies the total system capacity allocated to the container(s) and adjusts the size of the CPU experiment accordingly.
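
To make the arithmetic concrete, here is a minimal sketch of that derivation assuming a cgroup v2 layout; the path and the 80% target are illustrative, not Gremlin's actual values:

```python
from pathlib import Path

# Hypothetical cgroup directory of the target container (cgroup v2 files).
cg = Path("/sys/fs/cgroup")

limit_raw = (cg / "memory.max").read_text().strip()
limit = int(limit_raw) if limit_raw != "max" else None  # "max" = unlimited
current = int((cg / "memory.current").read_text())

target_fraction = 0.80  # e.g. drive the container to 80% of its limit
if limit is not None:
    to_consume = max(0, int(limit * target_fraction) - current)
    print(f"limit={limit} current={current} -> consume {to_consume} more bytes")
```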

Kubernetes pod eviction and failed experiments

On Gremlin versions 2.13 and later, Kubernetes may evict pods targeted by Gremlin resource experiments when they exhaust their resource limits. When this happens, the Kubernetes scheduler deletes Gremlin along with all targeted pod resources, and Gremlin sometimes displays this as a Failed experiment. Gremlin shows warning messages when targets are potential candidates for eviction.

Updates to the Gremlin Agent regularly improve observability around pod eviction and the effect Gremlin has on Kubernetes targets.

Supported cgroup drivers

Update Gremlin to at least version 2.16 to run resource experiments against containers with proper integration with the systemd cgroup driver. See Gremlin's Kubernetes installation documentation for information on installing Gremlin with a container driver that supports the systemd cgroup driver, and Gremlin's Considerations: Container Drivers for the requirements of Gremlin's container drivers.

Container runtimes generally support two cgroup drivers: cgroupfs and systemd. If your system is running the systemd cgroup driver and you are using Gremlin's legacy Docker container driver, you may observe the following:

  • Experiments against Kubernetes pods will abide by the resource limits of the targeted pods, but the resource usage Gremlin generates will not be reflected in cAdvisor metrics (such as those from metrics-server) and may not affect autoscaling triggers
  • Experiments against Docker containers (not running within Kubernetes) may not abide by container limits, instead abiding by the system.slice root cgroup of the host machine
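
If you are unsure which cgroup setup your nodes use, a few common heuristics can help. The sketch below is a hedged, unofficial check, not a Gremlin tool, and the kubelet config path is a common default rather than a guarantee:

```python
from pathlib import Path

# 1. A unified cgroup v2 hierarchy exposes cgroup.controllers at its root.
v2 = Path("/sys/fs/cgroup/cgroup.controllers").exists()
print("cgroup v2 (unified):", v2)

# 2. On Kubernetes nodes, the kubelet config usually records its driver.
kubelet_cfg = Path("/var/lib/kubelet/config.yaml")  # common default path
if kubelet_cfg.exists():
    for line in kubelet_cfg.read_text().splitlines():
        if "cgroupDriver" in line:
            print(line.strip())  # e.g. "cgroupDriver: systemd"

# 3. Under the systemd driver, cgroup paths contain .slice/.scope units.
self_cgroup = Path("/proc/self/cgroup").read_text()
print("systemd-style cgroup paths:", ".slice" in self_cgroup)
```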

Container experiment examples

Here are some examples of infrastructure-layer experiments you can try, depending on what you intend to test:

  • Use a CPU, Memory, Disk, or IOPS experiment on the target container to test how your orchestrator handles the corresponding resource pressure.
  • Use a shutdown experiment to find out what happens when a container disappears, becomes uncommunicative, or is otherwise unusable.
  • Use a blackhole experiment to find out what happens when communication between a container and the rest of your system is disrupted. Target a single container within a pod if that is what you want to test; if you target a pod, the impact applies to all containers in the pod.
  • Use a process killer experiment to kill specific processes on the underlying host machine, for example to simulate dependency crashes. Remember that this has the potential to impact all containers on that host.

Exact vs random

There are two ways to select targets when you create an experiment: choose the exact targets, or randomly select some or all of the targets matching the supplied tags. Exact selection uses the host or container's Identifier, while Random selection uses the tags on the host or container. For automated, scheduled experiments in a dynamic environment where hosts and containers are spun up and down, Random selection is preferred.

Exact

Exact selection offers deterministic behavior: it impacts exactly the hosts or containers you select. To choose specific targets, narrow your search by tag and select your targets. All selected hosts or containers will be included in the experiment.

Random

Random selection introduces entropy and produces a more realistic scenario of partial, probabilistic failure across targets. Impacted hosts and containers are selected using a combination of tags, with the impact field narrowing the blast radius.

To select targets using the Random method, select one or more tags per tag category. Tags in the same category select targets using the OR operation; tags in different categories use the AND operation.

For example, suppose a category named Services contains 5 tags and you select 3 of them: API, Events, and Caching. Gremlin will target API or Events or Caching. If you then select the tag us-west-1a in a Zone category, Gremlin will target any of the 3 services (API, Events, or Caching), but only in zone us-west-1a.
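
This selection logic reduces to set operations. A minimal sketch with hypothetical hosts and tags:

```python
# OR within a category, AND across categories, per the rules above.
hosts = [
    {"name": "host-1", "tags": {"Services": {"API"},     "Zone": {"us-west-1a"}}},
    {"name": "host-2", "tags": {"Services": {"Events"},  "Zone": {"us-west-1b"}}},
    {"name": "host-3", "tags": {"Services": {"Caching"}, "Zone": {"us-west-1a"}}},
]
selection = {"Services": {"API", "Events", "Caching"}, "Zone": {"us-west-1a"}}

def matches(host, selection):
    # AND across categories: every selected category must match...
    return all(
        # ...OR within a category: any one selected tag is enough.
        host["tags"].get(category, set()) & tags
        for category, tags in selection.items()
    )

print([h["name"] for h in hosts if matches(h, selection)])
# -> ['host-1', 'host-3']: any of the three services, but only in us-west-1a
```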

To further narrow your experiment beyond the selected tags, random targets can be chosen by a specific count of targets to impact (M of N targets), or by a percentage of targets to impact (X percent of the applicable targets). For example, Gremlin can impact 5 of 37 targets, or 50% of 37 targets.
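
Conceptually, this is random sampling over the tag-matched targets. A minimal sketch (illustrative, not Gremlin's implementation):

```python
import random

matched = [f"host-{i}" for i in range(37)]  # 37 targets matched the tags

count_pick = random.sample(matched, 5)                        # 5 of 37
pct_pick = random.sample(matched, round(len(matched) * 0.5))  # 50% of 37

print(len(count_pick), len(pct_pick))  # 5 18 (here 50% of 37 rounds to 18)
```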

In a Kubernetes experiment, after selecting the deployments to test, you can indicate a percentage or a number of pods to be selected at random to run the experiment on. For example, Gremlin can impact a randomly selected 5 of 37 pods, or 50% of 37 pods.

