Infrastructure Layer

Targets

Gremlin can inject faults into targets running in any infrastructure environment, whether that is an EC2 instance on Amazon Web Services, a droplet on DigitalOcean, or a bare-metal server in your own data center.

With your hosts registered to the Gremlin Control Plane, users can create attacks targeting either hosts or containers in your environment.

Hosts

Clients (hosts or cluster nodes) registered to the Gremlin Control Plane appear as active targets available for attacks. In the Web UI, users can filter hosts by their tags, or select individual hosts by their Identifier, to run attacks and inject faults into the targeted hosts.

In a container environment, when users select hosts as targets to run attacks, the resulting attack is scoped at the host level with the potential to affect anything (processes, containers, etc.) on the host. Users are encouraged to minimize the blast radius of their attacks via attack parameters (network device, port, etc.).

Cloud Tagging

As of version 2.11.6, the Gremlin client supports automatic tagging of hosts running in the three major cloud providers (AWS, Azure, and GCP). If the client can detect and parse cloud metadata, it automatically appends specific attributes to the host as tags. This lets users target particular sections of their infrastructure by cloud metadata out of the box.

AWS

The following AWS metadata attributes are supported for automatic tagging.

  • image-id
  • instance-id
  • instance-type
  • local-hostname
  • local-ip
  • public-hostname
  • public-ip
  • region
  • zone

Azure

The following Azure metadata attributes are supported for automatic tagging.

  • azEnvironment
  • location
  • name
  • osType
  • privateIpAddress
  • publicIpAddress
  • sku
  • vmId
  • vmScaleSetName
  • zone

GCP

The following GCP metadata attributes are supported for automatic tagging.

  • hostname
  • id
  • image
  • local-ip
  • name
  • public-ip
  • zone

Containers

Container and pod labels are exposed in the Web UI, and users can select the individual container they wish to attack. By definition, the containers of a Kubernetes Pod all share a set of namespaces, including the network namespace, and a cgroup. This means that when Gremlin applies a network impact to one container within a Pod, the impact is observed by all containers in that Pod. Note that this does not extend to Pod replicas: if you attack a specific Pod replica, the effect applies only to the containers within that replica, not to the other replicas.

We recommend always targeting only a single container of a Pod. To exclude some containers from the network impact, reduce your blast radius by specifying only the ports relevant to the containers you want to impact.
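The Pod scoping described above can be pictured with a small sketch. This is an illustration only, not Gremlin's implementation: the Container class, the network_blast_radius function, and the pod and container names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Container:
    """Hypothetical record for a container and the Pod replica it runs in."""
    name: str
    pod: str


def network_blast_radius(target: Container, containers: List[Container]) -> List[Container]:
    """Return the containers that would observe a network impact on `target`.

    Containers in the same Pod replica are all affected; containers in
    other replicas are not.
    """
    return [c for c in containers if c.pod == target.pod]


# Two replicas of the same (hypothetical) workload, each with two containers.
containers = [
    Container("app", "web-0"),
    Container("sidecar", "web-0"),
    Container("app", "web-1"),
    Container("sidecar", "web-1"),
]

affected = network_blast_radius(containers[0], containers)
for c in affected:
    print(f"{c.pod}/{c.name}")
```

Targeting the app container in replica web-0 reaches its sidecar as well, but nothing in web-1.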

Containers and Resource Attacks

Running a resource attack on a container elicits different results than running the same attack on a machine or virtual machine. With containers, container-available resources are shared across the host.

Running a resource attack on one container may indirectly impact all other containers on that host. Due to the boundaries of containers, Gremlin is only able to consume resources on the underlying host, adjacent to the target container. This is why, for example, a CPU attack set to run at 100% does not always return the results you might expect.

To account for this, we must think differently when running resource attacks on containers or when trying to design chaos experiments that would traditionally be accomplished using resource attacks. Here are some use cases and our recommendations for experiments to fit the use cases.

First, make sure you have good testing, monitoring and observability, and alerting thresholds set. These are critical in a container environment.

When you want to test that a container-managed service and/or orchestrator (on Kubernetes) is working as expected, you don’t always want to test the same way as you would with traditional deployments. Examples of such moments include when you want to:

  • Kill an unhealthy container and start up a new one
  • Test that autoscaling is properly configured and working as expected
  • Confirm that internal DNS is working as expected, such as for Kubernetes
  • Learn what happens when your application or service encounters a memory leak or full disk

In instances like these, a resource attack is often not what you need or want when using containers. Here are some examples of alternate infrastructure layer attack options you can try that may be better suited to your intended experiment:

  • Use a shutdown attack to find out what happens when a container disappears, becomes uncommunicative, or is otherwise unusable.
  • Use a blackhole attack to find out what happens when communication between a container and the rest of your system is disrupted; make sure you target a single container within a pod here if that is what you are looking to test. If you target a pod it will impact all containers in the pod.
  • Use a process killer attack to kill specific processes on the underlying host machine and find out what happens when, for example, a dependency crashes. Remember that this has the potential to impact all containers on that host.

In addition, when you want to test that an application behaves as expected, focus your testing directly on the application in a way that does not impact other services and containers on the host. Examples of such moments include when you want to:

  • Determine whether your application scales properly when fed many concurrent requests
  • Impact only specific customer-IDs or device types to help further refine and tightly scope experiments

In instances like these, you will want to use our application layer fault injection option.

Exact vs Random

There are two ways to select targets when you create an attack: choose the exact targets, or randomly select some or all of the targets that match the supplied tags. Exact target selection uses the host's or container's Identifier as the selection method, while Random target selection uses the tags on that host or container. For automated scheduling of attacks in a dynamic environment where hosts and containers are spun up and down, Random selection is preferred.

Exact

Exact selection offers deterministic behavior. It impacts exactly the target hosts or containers that are selected. To select the specific hosts or containers you would like to attack, narrow your search by a tag, and select your targets. All hosts or containers selected will be attacked.

Random

Random selection helps introduce entropy and produces a more realistic scenario of partial, probabilistic failure on some targets. The hosts and containers impacted by an attack are chosen using a combination of tags, which select the candidates, and the impact field, which narrows the blast radius.

To select targets using the Random method, select one or more tags per tag category. Tags in the same category are combined with the OR operation. Tags in different categories are combined with the AND operation.

For example, if a category named Services contains 5 tags and 3 of them (API, Events, and Caching) are selected, Gremlin will target API or Events or Caching. If you then select the tag us-west-1a in a category named Zone, Gremlin will target any of the 3 services (API, Events, or Caching), but only in zone us-west-1a.
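The OR-within-a-category, AND-across-categories rule can be sketched in a few lines of Python. This is a minimal illustration of the matching logic only, not Gremlin's code: the matches function, the host records, and the tag keys are hypothetical.

```python
from typing import Dict, Set


def matches(target_tags: Dict[str, str], selection: Dict[str, Set[str]]) -> bool:
    """Return True when the target satisfies every selected tag category.

    AND across categories: every category in `selection` must match.
    OR within a category: any one of the selected values is enough.
    """
    return all(
        target_tags.get(category) in allowed
        for category, allowed in selection.items()
    )


# Hypothetical hosts with one tag value per category.
hosts = [
    {"id": "host-1", "tags": {"service": "API", "zone": "us-west-1a"}},
    {"id": "host-2", "tags": {"service": "Events", "zone": "us-west-1b"}},
    {"id": "host-3", "tags": {"service": "Billing", "zone": "us-west-1a"}},
    {"id": "host-4", "tags": {"service": "Caching", "zone": "us-west-1a"}},
]

# (API OR Events OR Caching) AND (us-west-1a)
selection = {
    "service": {"API", "Events", "Caching"},
    "zone": {"us-west-1a"},
}

eligible = [h["id"] for h in hosts if matches(h["tags"], selection)]
print(eligible)  # ['host-1', 'host-4']
```

host-2 runs a selected service but sits in another zone, and host-3 is in the right zone but runs an unselected service, so neither is eligible.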

To further narrow your attack beyond the tags selected, random targets can be selected by a specific count of targets to impact (M of N targets), or by percentage of targets to impact (X percent of total applicable targets). For example, Gremlin can impact 5 of 37 targets, or 50% of 37 targets.
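As a rough sketch of count-based versus percentage-based narrowing, the following Python picks a random subset of eligible targets. The pick_targets function and the target names are hypothetical, and rounding a fractional percentage result up is an assumption made for this example; the text above does not state how Gremlin rounds.

```python
import math
import random
from typing import List, Optional, Sequence


def pick_targets(
    eligible: Sequence[str],
    count: Optional[int] = None,
    percent: Optional[float] = None,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Randomly pick targets by absolute count (M of N) or by percentage."""
    if (count is None) == (percent is None):
        raise ValueError("specify exactly one of count or percent")
    if percent is not None:
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        # Assumption for this sketch: round fractional results up.
        count = math.ceil(len(eligible) * percent / 100)
    if count < 0:
        raise ValueError("count must not be negative")
    count = min(count, len(eligible))  # never ask for more than N targets
    rng = rng or random.Random()
    return rng.sample(list(eligible), count)


targets = [f"host-{i}" for i in range(37)]

by_count = pick_targets(targets, count=5)       # 5 of 37 targets
by_percent = pick_targets(targets, percent=50)  # 50% of 37 targets
print(len(by_count), len(by_percent))
```

Passing a seeded random.Random instance as rng makes the selection repeatable, which can help when replaying an experiment.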