Managing the Gremlin Agent

Supported platforms:

N/A

The Gremlin Agent is an executable binary installed on a host operating system, container runtime, or Kubernetes cluster. It maintains a heartbeat connection to the Gremlin Control Plane to let Gremlin know that the host is active and able to receive orders, such as initiating a reliability test or injecting a fault. The agent only requires an outbound network connection to the Gremlin Control Plane, letting you run it behind a firewall without opening inbound ports. All traffic is encrypted.

Agent lifecycle

Agents have three primary states:

Active: the agent is running.
Deactivated: the agent is disabled.
Idle: the agent is active, but Gremlin has lost communication with it for more than five minutes.

When an agent is installed and authenticated, and has no issues, it appears as "Active" in the Agents list. The Agent Type column of this list shows what type of agent it is: infrastructure agents show "Infrastructure", Failure Flags agents show "Failure Flags", and so on. Each agent has a unique identifier, usually its hostname, but you can customize this. You can also search for agents by name or by tag.

Idle and Deactivated agents don't count towards your billable agents.

Deactivating agents

If you no longer want to run an agent, you can deactivate it. This makes it no longer possible to run experiments on the agent. You can reactivate a deactivated agent to make it available for experimentation again.

Idle agents

An agent goes into an Idle state if the Gremlin Control Plane detects no activity for at least 5 minutes. You cannot run or schedule experiments on Idle agents. If Gremlin does not hear from these agents for a period of 12 hours, the agents become Deactivated. However, if an agent starts communicating with Gremlin again while still within the 12 hour idle window, the agent returns to the Active state.
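The timing rules above can be sketched as a small shell function. This is only an illustration of the documented thresholds, not Gremlin's implementation: the function name agent_state and its input (seconds since the Control Plane last heard from the agent) are assumptions made for this example, and manually deactivated agents are not modeled.

```shell
# Hypothetical helper: classify an agent from the number of seconds since
# the Control Plane last heard from it, using the thresholds described above.
agent_state() {
  elapsed=$1
  idle_after=$((5 * 60))               # 5 minutes without contact -> Idle
  deactivate_after=$((12 * 60 * 60))   # 12 hours without contact -> Deactivated
  if [ "$elapsed" -ge "$deactivate_after" ]; then
    echo "Deactivated"
  elif [ "$elapsed" -ge "$idle_after" ]; then
    echo "Idle"
  else
    echo "Active"
  fi
}

agent_state 60      # prints "Active"
agent_state 900     # prints "Idle"
agent_state 50000   # prints "Deactivated"
```

An agent that resumes contact before the 12-hour mark is simply evaluated again with a small elapsed value, which returns it to "Active".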

Unhealthy agents

If an agent has a problem, such as not having the correct permissions to run certain tests, it will show as Active with warnings. You can click on the agent identifier for a detailed description of the problem.

Logs

Logs can be found under the /var/log/gremlin directory. Agent logs can be found in the daemon.log file. Log entries in this file may indicate events where the Gremlin Agent is not able to communicate with the Control Plane.

Each fault injection performed by the Agent is logged under /var/log/gremlin/executions using its unique experiment execution ID. This is useful for troubleshooting experiments that do not complete.
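When troubleshooting, it can help to see which execution logs were written most recently. The following is a minimal sketch: the helper name recent_execution_logs is hypothetical, and it makes no assumption about how entries in the directory are named beyond what is stated above.

```shell
# Hypothetical helper: list the most recently modified entries in the
# executions log directory, newest first.
#   $1 = directory (defaults to /var/log/gremlin/executions)
#   $2 = number of entries to show (defaults to 5)
recent_execution_logs() {
  dir=${1:-/var/log/gremlin/executions}
  count=${2:-5}
  ls -t "$dir" | head -n "$count"
}
```

On a host running the agent, calling recent_execution_logs with no arguments lists the five newest entries. Reading the directory may require elevated privileges depending on how the agent was installed.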

Log size

To see how much disk space is being used by logs, run the du utility on the /var/log/gremlin directory:

SHELL


du -sh /var/log/gremlin

Bandwidth usage

Idle state

The Gremlin Agent uses very little bandwidth in its idle state. In testing over a 5 minute period, the Agent sent a total of 11.3KB and received 24.8KB—an average combined bandwidth of 0.12KB/s.

Attack state

There is a slight increase in overall bandwidth consumption during experiments. While an experiment is running, the Agent stays in constant communication with the Control Plane so that it can detect when the experiment should be aborted. The bandwidth used is not affected by the type of experiment being run. In testing over a 5 minute period, the Agent sent a total of 112.3KB and received 114.0KB, for an average combined bandwidth of 0.75KB/s.
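The averages quoted for the idle and experiment states are the total bytes sent plus received, divided by the 300 seconds of the test window. As a quick check of that arithmetic (the function name avg_kb_per_sec is only an illustration):

```shell
# Average combined bandwidth in KB/s: (sent + received) / duration in seconds.
avg_kb_per_sec() {
  awk -v sent="$1" -v recv="$2" -v secs="$3" \
    'BEGIN { printf "%.2f\n", (sent + recv) / secs }'
}

avg_kb_per_sec 11.3 24.8 300     # idle figures: prints 0.12
avg_kb_per_sec 112.3 114.0 300   # experiment figures: prints 0.75
```

The same function can be applied to byte counts you measure on your own hosts, as long as both totals cover the same time window.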

Process Collection

When Process Collection is enabled, the Gremlin Agent will send additional data and the bandwidth consumed will depend on how many processes are discovered. The information is gzip compressed in order to minimize network consumption. To measure the actual bandwidth consumed by Gremlin for your particular installation, we recommend using a tool such as iptraf or nethogs.

Agent warnings

If an agent has a problem, Gremlin will display a warning icon next to it. Click on the agent identifier to see the complete message. For additional context, these messages are included below.

W000: Linux: Agent upgrade recommended

Gremlin recommends upgrading your Linux agents to version 2.59.0, which contains important fixes and behavior changes.

Memory-based reliability tests are more accurately scored as a result of this upgrade. Out of Memory events are correctly triggered on the target application's processes, instead of the Gremlin process only.
Gremlin properly detects when traffic shaping rules are applied by a third party, and avoids changing them.

To upgrade your Gremlin installation, see Updating Gremlin.

W001: Network experiments limited to container and Kubernetes targets

Starting in Linux version 2.59.0, Gremlin properly detects when the target has traffic shaping rules (Traffic Control) that have been installed by a third party other than Gremlin. Any network experiments performed against these targets will fail. The Gremlin agent also performs this check on startup and reports this warning in the Gremlin web app if it applies to the node (including on Kubernetes nodes, where this is most common).

Agents reporting this warning can still support network attacks against containers and Kubernetes pods running on affected nodes, but not on the node itself.

How do I inspect these traffic shaping rules?

If you are not aware of the existing traffic shaping rules on your system, you can print them with the following command on the host:

SHELL


# where $DEV is the name of the network interface, such as `eth0`
tc qdisc show dev $DEV

This prints the traffic shaping rules (queueing disciplines, or qdiscs) on the interface. Any value other than 0 in the qdisc handle, the number before the colon, indicates to Gremlin that third-party rules are present. For example:

SHELL


qdisc prio 1: root refcnt 5 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
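To scan this output on a host with many interfaces, a small filter can help. The sketch below assumes that the value Gremlin looks at is the qdisc handle (the third field, such as 1: in the example above); that interpretation, and the helper name nonzero_qdiscs, are assumptions for illustration and not Gremlin's actual detection logic.

```shell
# Sketch: read `tc qdisc show` output on stdin and print the kind and handle
# of every qdisc whose handle is not 0. Treating the handle as the relevant
# "value" is an assumption made for this example.
nonzero_qdiscs() {
  awk '$1 == "qdisc" { split($3, h, ":"); if (h[1] != "0") print $2, $3 }'
}

# The example line above is reported as "prio 1:"
echo 'qdisc prio 1: root refcnt 5 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1' | nonzero_qdiscs
```

On a live host, pipe the real command into the filter, for example tc qdisc show dev "$DEV" | nonzero_qdiscs. No output means every listed qdisc has handle 0.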

What can I do about these rules to support network experiments?

Remove unnecessary Etcd optimizations from Kubernetes worker nodes

If Gremlin is running in a Kubernetes cluster on nodes built with image-builder, your worker nodes may be built with traffic shaping rules for optimizing Etcd traffic. To enable node-level network experiments on such nodes, you should disable this traffic shaping. To disable etcd-network-tuning in your node image, remove the file /etc/udev/rules.d/90-etcd-tuning.rules. See etcd-network-tuning.sh and 90-etcd-tuning.rules.
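Before changing a node image, you may want to confirm whether the udev rule file named above is present. The following is a minimal sketch: the helper name has_etcd_tuning_rule and its optional root-directory argument (useful when inspecting an unpacked image rather than a running node) are assumptions for this example.

```shell
# Hypothetical helper: succeed if the etcd tuning udev rule exists under the
# given root directory. With no argument, the running node's / is checked.
has_etcd_tuning_rule() {
  root=${1:-}
  [ -f "$root/etc/udev/rules.d/90-etcd-tuning.rules" ]
}

if has_etcd_tuning_rule; then
  echo "etcd tuning rule present"
else
  echo "etcd tuning rule not found"
fi
```

The absence of this file does not by itself mean a node has no third-party traffic shaping rules; the tc qdisc show command remains the direct way to check.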

Reach out to Gremlin for guidance on removing these tuning rules for your environment.

Other

Open a support ticket with Gremlin to inquire about supporting network attacks with your existing rules.

Privileges required

Privilege	Description
CLIENTS_READ	Allows reading all client information within the team
CLIENTS_WRITE	Allows editing all client information within the team
