Gremlin can inject faults into targets running in any infrastructure environment, whether that is an EC2 instance on Amazon Web Services, a droplet in DigitalOcean, or a bare metal server in your own data center.
With your hosts registered to the Gremlin Control Plane, users can create attacks targeting either hosts or containers in your environment.
Gremlin can automatically detect the services running in your environment and present them as a new target for running chaos experiments. This lets you easily run attacks on distributed systems by targeting host processes, containers, or Kubernetes resources. You can also track the attacks run on each service, view operational metadata, link monitoring dashboards and incident management runbooks, and assign other Gremlin users as owners. See Services Discovery for more details.
Clients (hosts or cluster nodes) registered to the Gremlin Control Plane will show as active targets available for attacks. In the Web UI, users can filter hosts based on their tags, and select individual hosts by their Identifier, in order to run attacks and inject faults into the targeted hosts.
In a container environment, when users select hosts as targets to run attacks, the resulting attack is scoped at the host level with the potential to affect anything (processes, containers, etc.) on the host. Users are encouraged to minimize the blast radius of their attacks via attack parameters (network device, port, etc.).
When host clients first authenticate, they will appear on the hosts page in the Active state. So long as they continue to phone home to the Gremlin Control Plane, they will remain in this state. If for any reason the host stops polling the Gremlin Control Plane, our system will wait five (5) minutes before putting the host into the Idle state. If the host does not resume polling in the next twelve (12) hours, the host will be automatically deleted from our systems.
Users also have the option to manually revoke access to the Gremlin Control Plane for a host client. In this case the host will be put into a Revoked state and will not be allowed any access to the Gremlin Control Plane, regardless of whether or not it continues to actively poll. This action can be reversed by the user should they wish to regrant the host access to the Gremlin Control Plane.
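The host states described above can be summarized in a short sketch. This is only an illustration of the timings given in the text, not Gremlin's implementation: the function name `host_state` and the `"Deleted"` label are hypothetical, and it assumes the twelve-hour window starts once the host has become Idle.

```python
from datetime import datetime, timedelta, timezone

IDLE_AFTER = timedelta(minutes=5)        # from the text: five minutes without polling
DELETE_AFTER_IDLE = timedelta(hours=12)  # from the text: twelve more hours without polling


def host_state(last_poll, now, revoked=False):
    """Return the state a host would be in (hypothetical helper, not a Gremlin API)."""
    if revoked:
        # Revocation wins even if the host keeps polling.
        return "Revoked"
    silent_for = now - last_poll
    if silent_for < IDLE_AFTER:
        return "Active"
    # Assumption: the twelve hours are counted from the moment the host went Idle.
    if silent_for < IDLE_AFTER + DELETE_AFTER_IDLE:
        return "Idle"
    return "Deleted"


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(host_state(now - timedelta(minutes=1), now))   # Active
print(host_state(now - timedelta(minutes=30), now))  # Idle
print(host_state(now - timedelta(hours=13), now))    # Deleted
print(host_state(now, now, revoked=True))            # Revoked
```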
As of version 2.11.6, the Gremlin client supports automatic tagging of all hosts running in the three major cloud providers. If the Gremlin client is able to detect and parse cloud metadata, it will automatically append specific attributes to the host in the form of tags. This allows users to easily utilize cloud metadata to target particular sections of their infrastructure out of the box.
The following AWS metadata attributes are supported for automatic tagging.
To include custom AWS tags, ensure the DescribeTags policy is granted to your EC2 instances.
For the case of RPM and DEB installations, you will also need to install the
Some details are available here and here.
The following Azure metadata attributes are supported for automatic tagging.
The following GCP metadata attributes are supported for automatic tagging.
Container and pod labels are exposed in the Web UI. Users can individually select the container they wish to attack. By definition, containers of a Kubernetes Pod all share a namespace and cgroup. This means when Gremlin applies a network impact to one container within a Kubernetes pod, the impact will be observed for all containers in the Pod. Note that this does not apply to containers in Pod replicas. If you attack a specific Pod replica, the effect applies to containers within that replica only, and does not apply to the rest of the replicas.
It is always recommended to target only a single container of a Pod. If you wish to exclude some containers from the network impact, reduce your blast radius by specifying ports relevant to the containers you wish to see impact.
Control Groups (cgroups) are used by the Linux kernel to limit the access that a group of processes has to a host's hardware resources. Container runtimes like Docker use cgroups to establish Memory, CPU, Disk, and IOPS limits.
With cgroup integration, Gremlin resource consumption is designed to behave as if it were running inside the containers it targets. Gremlin accomplishes this by running under the cgroup of its target container, so that any limits enforced on the target are also enforced on Gremlin. Kernel protections like Out-of-Memory killers (OOMKillers) can terminate container processes, including Gremlin, when they compete for all of the resources available to them.
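To make "running under the cgroup of its target" more concrete, here is a minimal sketch of how cgroup membership can be inspected on Linux. It parses text in the format of `/proc/<pid>/cgroup` (`hierarchy-ID:controller-list:path`) and checks whether two processes are in the same cgroups. The helper names and the sample paths are hypothetical, and this says nothing about how Gremlin itself joins a cgroup.

```python
def parse_cgroup_file(text):
    """Parse /proc/<pid>/cgroup content into (hierarchy_id, controllers, path) tuples."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split at most twice so that colons inside the path are preserved.
        hierarchy_id, controllers, path = line.split(":", 2)
        entries.append(
            (int(hierarchy_id), controllers.split(",") if controllers else [], path)
        )
    return entries


def share_cgroups(text_a, text_b):
    """True if both processes are listed under exactly the same cgroups."""
    return sorted(parse_cgroup_file(text_a)) == sorted(parse_cgroup_file(text_b))


# Hypothetical file contents; real paths depend on the runtime and cgroup driver.
container = "0::/kubepods.slice/pod1234/container-abcd\n"
helper = "0::/kubepods.slice/pod1234/container-abcd\n"
outsider = "0::/system.slice/sshd.service\n"

print(share_cgroups(container, helper))    # True
print(share_cgroups(container, outsider))  # False
```

A process that shares the target's cgroup, like `helper` here, is counted against the same limits as the target, which is the behavior the paragraph above describes.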
On Gremlin versions 2.13 and later, Kubernetes may evict pods targeted by Gremlin resource attacks when they exhaust their resource limits. When this happens, the Kubernetes scheduler deletes Gremlin and all targeted pod resources. Sometimes Gremlin displays this as a Failed attack. Gremlin will show some warning messages when targets are potential candidates for eviction.
Updates to the Gremlin Agent regularly improve the observability around pod eviction and the effect Gremlin had on the Kubernetes targets.
Update Gremlin to at least 2.16 in order to use resource attacks against containers with proper integration with the systemd cgroup driver.
See Gremlin's Kubernetes installation documentation for more information on installing Gremlin with a container driver that supports the systemd cgroup driver.
See Gremlin's Considerations: Container Drivers for more information on the requirements of Gremlin's container drivers.
Container runtimes generally provide support for two cgroup drivers:
- cgroupfs
- systemd
If your system is running the systemd cgroup driver, and you are running Gremlin's legacy docker container driver, you may observe the following:
- Attacks against Kubernetes pods will abide by the resource limits of the pods targeted in the attack, but the resulting resource usage Gremlin generates will not be reflected in cAdvisor metrics (like from metrics-server) and may not impact autoscaling group triggers
- Attacks against Docker containers (not running within Kubernetes) may not abide by container limits and instead abide by the system.slice root cgroup of the host machine
Here are some examples of infrastructure layer attack options you can try that may be better suited to your intended experiment:
- Use a CPU, Memory, Disk, or IOPS attack on the target container to test how your orchestrator handles the corresponding resource pressure.
- Use a shutdown attack to find out what happens when a container disappears, becomes uncommunicative, or is otherwise unusable.
- Use a blackhole attack to find out what happens when communication between a container and the rest of your system is disrupted; make sure you target a single container within a pod here if that is what you are looking to test. If you target a pod it will impact all containers in the pod.
- Use a process killer attack to kill specific processes on the underlying host machine to find out what happens as you simulate dependency crashes, for example. If you do, remember that it has the potential to impact all containers on that host.
There are two ways to select targets when you create an attack: choose the exact targets, or randomly select some or all of the targets that match the supplied tags. Exact target selection uses the host or container's Identifier as the selection method, while Random target selection takes advantage of tags on that host or container. For automated scheduling of attacks in a dynamic environment where hosts and containers are spun up and down, Random selection is preferred.
Exact selection offers deterministic behavior. It impacts exactly the target hosts or containers that are selected. To select the specific hosts or containers you would like to attack, narrow your search by a tag, and select your targets. All hosts or containers selected will be attacked.
Random selection helps introduce entropy, and produces a more realistic scenario of partial and probabilistic failure on some targets. Hosts and containers that are impacted as a result of an attack are selected using a combination of tags to select them, and the impact field to narrow the blast radius.
To select targets using the Random method, select one or more tags per tag category. Tags in the same category select targets using the OR operation. Tags in different categories use the AND operation.
For example, if there are 5 tags in a category named Services and 3 of them, Caching among them, are selected, Gremlin will target anything carrying at least one of those 3 tags. Then, if you select a tag in a category named Zone, Gremlin will target any of the 3 selected services, but only in that zone.
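The tag logic can be sketched in a few lines. This assumes that tags in different categories combine with AND, which is what the example suggests, and that each target has one value per category. The function `matches` and every tag and host name below are hypothetical and do not come from Gremlin.

```python
def matches(target_tags, selected):
    """Return True if a target satisfies the selection.

    target_tags: dict mapping a tag category to the target's value in it.
    selected:    dict mapping a tag category to the set of values picked.
    """
    return all(
        target_tags.get(category) in values       # OR within one category
        for category, values in selected.items()  # AND across categories
    )


# Hypothetical inventory; none of these names come from Gremlin.
targets = [
    {"id": "host-1", "tags": {"service": "caching", "zone": "zone-a"}},
    {"id": "host-2", "tags": {"service": "caching", "zone": "zone-b"}},
    {"id": "host-3", "tags": {"service": "billing", "zone": "zone-a"}},
]
selected = {"service": {"api", "database", "caching"}, "zone": {"zone-a"}}

chosen = [t["id"] for t in targets if matches(t["tags"], selected)]
print(chosen)  # ['host-1']
```

Only `host-1` matches one of the selected services and also sits in the selected zone. In this sketch, an empty selection matches every target.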
To further narrow your attack beyond the tags selected, random targets can be selected by a specific count of targets to impact (M of N targets), or by percentage of targets to impact (X percent of total applicable targets). For example, Gremlin can impact 5 of 37 targets, or 50% of 37 targets.
In the case of a Kubernetes attack, after selecting the deployments to test, you can indicate a percentage or a number of pods to be selected at random to run the attack on. For example, Gremlin can impact a randomly selected 5 of 37 pods, or 50% of 37 pods.
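Count-based and percentage-based random selection can be illustrated with the following sketch. The function `pick_targets` is hypothetical, and the rounding is an assumption: the text does not say how a fractional result such as 50% of 37 is handled, so this example rounds up.

```python
import math
import random


def pick_targets(candidates, count=None, percent=None, rng=random):
    """Pick a random subset either by count (M of N) or by percentage."""
    if (count is None) == (percent is None):
        raise ValueError("pass exactly one of count or percent")
    pool = list(candidates)
    if percent is not None:
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        # Assumption: fractional results round up, so 50% of 37 becomes 19.
        count = math.ceil(len(pool) * percent / 100)
    if count < 0:
        raise ValueError("count must not be negative")
    # sample() never picks the same target twice.
    return rng.sample(pool, min(count, len(pool)))


pods = [f"pod-{i}" for i in range(37)]      # hypothetical pod names
print(len(pick_targets(pods, count=5)))     # 5
print(len(pick_targets(pods, percent=50)))  # 19
```

Passing a seeded `random.Random` instance as `rng` makes the selection repeatable, which can help when reproducing an experiment.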