An experiment is a method of injecting failure into a system in a simple, safe, and secure way. Gremlin provides a range of experiments that you can run against your infrastructure. These include impacting system resources, delaying or dropping network traffic, shutting down hosts, and more. In addition to running one-time experiments, you can also schedule regular or recurring experiments, create experiment templates, and view experiment reports.
Gremlin provides three categories of experiments:
- Resource experiments: test against sudden changes in consumption of computing resources
- Network experiments: test against unreliable network conditions
- State experiments: test against unexpected changes in your environment such as power outages, node failures, clock drift, or application crashes
Each experiment tests your resilience in a different way.
Resource experiments are a great starting point because they are simple to run and understand. They reveal how your service degrades when starved of CPU, memory, IO, or disk space.
State experiments modify the state of a target so you can test auto-correction and similar fault-tolerant mechanisms.
Network experiments test the impact of lost or delayed traffic to a target. Test how your service behaves when you are unable to reach one of your dependencies, internal or external. Limit the impact to only the traffic you want to test by specifying ports, hostnames, and IP addresses.
Network host tags
You can use tags to target the IP addresses whose traffic should be impacted during network experiments. This is important in today's ephemeral environments, where hosts live for a short time and have dynamic IP addresses. Just as custom tags indicate where an experiment should run, the same tags can indicate the hosts whose network traffic should be impacted. For example, to test latency between <span class="code-class-custom">serviceA</span> and <span class="code-class-custom">serviceB</span>, select all clients with the tag <span class="code-class-custom">service:serviceA</span> when choosing the Hosts to target, and select the tag <span class="code-class-custom">service:serviceB</span> when configuring the Network experiment. IP addresses assigned to the network interface by the container runtime are also automatically included.
Limit the impact of a network experiment to specific external service providers. Select one or many services and their associated region to impact. Gremlin currently supports AWS, Azure, and Datadog services. The destination network configuration is automatically updated daily using these sources: AWS discovery service, Azure service tags.
Network device selection
All network experiments accept a <span class="code-class-custom">--device</span> argument that refers to the network interfaces to target. Starting with Linux agent version 2.30.0 / Windows agent version 1.9.0, you can specify one or more network interfaces using either a comma-separated list or with multiple <span class="code-class-custom">--device</span> arguments.
When unspecified, Gremlin targets all physical network interfaces as reported by the operating system. For virtual or cloud machines, this typically includes the expected network interfaces, such as <span class="code-class-custom">eth0</span> and <span class="code-class-custom">eth1</span> on Linux and <span class="code-class-custom">Ethernet</span> on Windows.
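The two ways of naming several interfaces (a comma-separated list, or repeated <span class="code-class-custom">--device</span> arguments) resolve to the same set of devices. The following is a minimal sketch of that equivalence using Python's <span class="code-class-custom">argparse</span>. It is not Gremlin's own parser, and the function name <span class="code-class-custom">parse_devices</span> is hypothetical.

```python
import argparse


def parse_devices(argv):
    """Collect interface names from repeated and/or comma-separated --device flags.

    Illustrative only: this mimics the documented behavior, not Gremlin's code.
    """
    parser = argparse.ArgumentParser(add_help=False)
    # action="append" gathers every occurrence of --device into one list.
    parser.add_argument("--device", action="append", default=[])
    args, _unknown = parser.parse_known_args(argv)

    devices = []
    for value in args.device:
        # Each occurrence may itself be a comma-separated list.
        for name in value.split(","):
            name = name.strip()
            if name and name not in devices:
                devices.append(name)
    # An empty list stands for "unspecified", i.e. all physical interfaces.
    return devices
```

Under these assumptions, <span class="code-class-custom">parse_devices(["--device", "eth0,eth1"])</span> and <span class="code-class-custom">parse_devices(["--device", "eth0", "--device", "eth1"])</span> return the same list.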
Device discovery on older agents
Agents before Linux version 2.30.0 / Windows version 1.9.0 use a different strategy, described here. On these agents, the <span class="code-class-custom">--device</span> argument refers to a single network interface, because Gremlin network experiments target only one network interface at a time. When unspecified, Gremlin chooses an interface according to the following order of operations:
- Gremlin omits all loopback devices (determined by [RFC1122]).
- Gremlin selects the device with the lowest interface index that starts with <span class="code-class-custom">eth</span>, <span class="code-class-custom">en</span>, or for Windows <span class="code-class-custom">Ethernet</span>.
- If no such device is found, Gremlin selects the device with the lowest interface index that is non-private (according to [RFC1918]).
- If no such device is found, Gremlin selects the device with the lowest interface index.
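The order of operations above can be sketched as follows. This is an illustration under stated assumptions rather than the agent's actual implementation: the <span class="code-class-custom">Interface</span> record and <span class="code-class-custom">choose_device</span> function are hypothetical, each interface is assumed to have a single IPv4 address, and name prefixes are assumed to be matched case-sensitively.

```python
import ipaddress
from dataclasses import dataclass

# Private IPv4 ranges defined by RFC 1918.
RFC1918_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]
# Name prefixes from the list above (case-sensitive matching is an assumption).
NAME_PREFIXES = ("eth", "en", "Ethernet")


@dataclass
class Interface:
    """Hypothetical record describing one network interface."""
    name: str
    index: int
    address: str  # a single IPv4 address, for simplicity


def _is_rfc1918(address):
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in RFC1918_NETWORKS)


def choose_device(interfaces):
    """Pick one interface following the documented fallback order."""
    # Step 1: omit loopback devices (127.0.0.0/8 for IPv4).
    candidates = [
        i for i in interfaces
        if not ipaddress.ip_address(i.address).is_loopback
    ]
    # Step 2: devices whose name starts with a known prefix.
    by_name = [i for i in candidates if i.name.startswith(NAME_PREFIXES)]
    # Step 3: devices with a non-private address.
    non_private = [i for i in candidates if not _is_rfc1918(i.address)]
    # Step 4: any remaining device. Take the lowest index of the first non-empty group.
    for group in (by_name, non_private, candidates):
        if group:
            return min(group, key=lambda i: i.index)
    return None
```

With a host that has <span class="code-class-custom">lo</span>, <span class="code-class-custom">docker0</span>, <span class="code-class-custom">eth0</span>, and <span class="code-class-custom">eth1</span>, this sketch returns whichever of <span class="code-class-custom">eth0</span> and <span class="code-class-custom">eth1</span> has the lower interface index.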
Experiment stage progression
Every experiment in Gremlin is composed of one or more Executions, where each Execution is an instance of the experiment running on a specific target.
The Stage progression of an experiment is derived from the Stage progression of all of an experiment's Executions. Gremlin weighs the importance of Stages to mark an experiment with the most important Stage of its executions.
For example, an experiment with three Executions derives its reported Stage by picking the most important Stage among those Executions. If the three Execution Stages are <span class="code-class-custom">TargetNotFound, Running, TargetNotFound</span>, the resulting Stage for the experiment is <span class="code-class-custom">Running</span>.
You can see Stages ordered by their importance in the following section.
Stages are sorted in descending order of importance (the <span class="code-class-custom">Running</span> Stage holds the highest importance).
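The derivation of an experiment's Stage from its Executions can be sketched as a lookup against an importance ordering. This is a minimal illustration, not Gremlin's implementation: the function name <span class="code-class-custom">derive_experiment_stage</span> is hypothetical, and the ordering is passed in by the caller. The example ordering contains only the two Stages named in the text above and is deliberately incomplete.

```python
def derive_experiment_stage(execution_stages, importance_order):
    """Return the most important Stage among an experiment's Executions.

    importance_order lists Stages from most to least important.
    Raises ValueError if there are no Executions and KeyError if a
    Stage is missing from importance_order.
    """
    # A lower position in the ordering means a more important Stage.
    rank = {stage: position for position, stage in enumerate(importance_order)}
    return min(execution_stages, key=lambda stage: rank[stage])


# Partial ordering for illustration only: Running outranks TargetNotFound.
EXAMPLE_ORDER = ["Running", "TargetNotFound"]
result = derive_experiment_stage(
    ["TargetNotFound", "Running", "TargetNotFound"], EXAMPLE_ORDER
)
print(result)  # prints "Running"
```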
Experiments can be run ad-hoc or scheduled, from the Web App or programmatically. You can schedule experiments to execute on certain days and within a specified time window. You can also set the maximum number of experiments a schedule can generate.
Running experiments on Kubernetes objects
Gremlin allows targeting objects within your Kubernetes clusters. After selecting a cluster, you can filter the visible set of objects by selecting a namespace. Select any of your Deployments, ReplicaSets, StatefulSets, DaemonSets, or Pods. When one object is selected, all child objects will also be targeted. For example, when selecting a DaemonSet, all of the pods within will be selected.
For State and Resource experiment types, you can target all, any, or specific containers within a selected pod. Once you select your targets, these options will be available under Choose a Gremlin on the Experiment page. Selecting Any will target a single container within each pod at runtime. If you've selected more than one target (for example, Deployment), you can select from a list of common containers across all of these targets. When you run the experiment, the underlying containers within the objects selected will be impacted.
Monitoring experiments in real time
For CPU and Shutdown experiments, you can observe your environments in real time in Gremlin to quickly verify the effect of your experiments. For CPU experiments, you can see the statistics for CPU load; for Shutdown experiments, you can see machine uptime.
Enabling Experiment Visualizations
Company Admins and Owners can turn this feature on for their company by visiting Company Settings, clicking Settings, and toggling Experiment Visualizations on. Only data relevant to the experiment is collected, and no data is collected when experiments are not running.
Overriding Experiment Visualizations for a host
To prevent any host from sending metrics to populate experiment visualization charts, add <span class="code-class-custom">PUSH_METRICS="0"</span> to the configuration for <span class="code-class-custom">gremlind</span> on that host. This overrides the company preference and prevents that particular host from sending metrics.
For details on parameters supplied to individual experiments, check out the links to the individual experiment pages at the beginning of this page.
Include new targets in ongoing experiments
When selecting targets by tag, you have the option to check the Include New Targets checkbox. When checked, if Gremlin detects a new target that meets the experiment's selection criteria, it will distribute the experiment to the target. By default, new targets will not run the experiment even if they match the selection criteria.
For example, imagine you select all EC2 hosts in the AWS <span class="code-class-custom">us-east-1</span> region for a CPU experiment. When you run the experiment, AWS detects the increased CPU usage and automatically provisions a new EC2 instance and installs the Gremlin agent. If Include New Targets is checked, Gremlin will add this new instance to the ongoing CPU experiment.
Port and address options can be used multiple times in a single command.
Alternatively, a <span class="code-class-custom">,</span> can be used to separate multiple values in a single option.
A <span class="code-class-custom">^</span> can be used before a port or address to exclude that argument from the set of impacted network targets.
This can be particularly useful for excluding a specific IP from a range that is otherwise impacted by the experiment.
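To make the <span class="code-class-custom">,</span> and <span class="code-class-custom">^</span> semantics concrete, here is a small sketch of how such values could be split into included and excluded sets, and how an exclusion carves one IP out of a range. It is an illustration only: the function names <span class="code-class-custom">split_targets</span> and <span class="code-class-custom">is_impacted</span> are hypothetical, ranges are assumed to be written in CIDR notation, and hostnames are not handled.

```python
import ipaddress


def split_targets(values):
    """Split option values into (included, excluded) lists.

    values holds one string per use of the option; each string may
    contain several comma-separated items, and a leading ^ marks an
    item as excluded.
    """
    included, excluded = [], []
    for value in values:
        for item in value.split(","):
            item = item.strip()
            if not item:
                continue
            if item.startswith("^"):
                excluded.append(item[1:])
            else:
                included.append(item)
    return included, excluded


def is_impacted(address, included, excluded):
    """True if address falls in an included range and in no excluded one."""
    ip = ipaddress.ip_address(address)

    def matches(specs):
        # A bare IP is treated as a single-address network.
        return any(ip in ipaddress.ip_network(s, strict=False) for s in specs)

    return matches(included) and not matches(excluded)
```

For instance, <span class="code-class-custom">split_targets(["10.0.0.0/24,^10.0.0.5"])</span> yields one included range and one excluded address, so <span class="code-class-custom">10.0.0.7</span> counts as impacted while <span class="code-class-custom">10.0.0.5</span> does not.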