Managing the Gremlin Agent
The Gremlin Agent is an executable binary installed on a host operating system, container runtime, or Kubernetes cluster. It maintains a heartbeat connection to the Gremlin Control Plane so that Gremlin knows the host is active and able to receive orders, such as initiating a reliability test or injecting a fault. The Agent requires only an outbound network connection to the Gremlin Control Plane, so you can run it behind a firewall without opening inbound ports. All traffic is encrypted.
When an agent is installed and authenticated, it appears as "Active" in the Agents list. It also identifies any targets for fault injection, such as hosts or containers.
You can run attacks only on "active" Gremlin Agents. An Agent goes into an "idle" state if the Gremlin Control Plane detects no activity from it for at least 5 minutes. You cannot run or schedule attacks on idle Agents. If Gremlin does not hear from an idle Agent for 24 hours, the Agent is removed from the list. However, if an Agent starts communicating with Gremlin again within the 24-hour idle window, it is reactivated and returned to the "active" state.
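The state transitions described above can be sketched as a small shell function. This is an illustrative model only, not Gremlin's implementation: the function name `agent_state` is hypothetical, and measuring the 24-hour window from the Agent's last contact is an assumption.

```shell
#!/bin/sh
# Illustrative model: classify an Agent by the number of seconds since the
# Control Plane last heard from it. Thresholds are taken from the text above;
# counting the 24-hour window from the last contact is an assumption.
IDLE_AFTER=300        # 5 minutes without activity -> idle
REMOVE_AFTER=86400    # 24 hours without activity -> removed from the list

agent_state() {
    since_last_contact=$1
    if [ "$since_last_contact" -lt "$IDLE_AFTER" ]; then
        echo "active"
    elif [ "$since_last_contact" -lt "$REMOVE_AFTER" ]; then
        echo "idle"
    else
        echo "removed"
    fi
}

agent_state 120      # prints: active
agent_state 1800     # prints: idle
agent_state 90000    # prints: removed
```

In this model an idle Agent that reports in again simply resets its elapsed time to zero, which returns it to "active".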
Logs can be found under the /var/log/gremlin directory. Agent logs can be found in the daemon.log file. Log entries in this file may indicate events where the Gremlin Agent is not able to communicate with the Control Plane.
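One way to scan the Agent log for possible communication problems is to filter it for likely keywords. The sketch below is hedged: the function name `recent_agent_issues` is hypothetical, and the keyword list is a guess rather than Gremlin's documented log vocabulary, so adjust it to match what your log actually contains.

```shell
#!/bin/sh
# Print the most recent log lines that look like connection problems.
# The keyword list is an assumption, not Gremlin's documented log format.
recent_agent_issues() {
    log_file=${1:-/var/log/gremlin/daemon.log}   # log path from the text above
    max_lines=${2:-20}                           # how many matches to show
    grep -i -E 'error|fail|timeout|refused|unreachable' "$log_file" \
        | tail -n "$max_lines"
}
```

Calling `recent_agent_issues` with no arguments reads /var/log/gremlin/daemon.log and shows up to 20 matching lines. Reading the log may require elevated permissions on some hosts.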
Each fault injection performed by the Agent is logged under /var/log/gremlin/executions using its unique attack execution ID. This is useful for troubleshooting attacks that do not complete.
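To locate the log for a specific attack, you could search the executions directory by ID. This is a minimal sketch that assumes the execution ID appears somewhere in the log file's name; the exact naming scheme is not described here, and the function name `find_execution_logs` is hypothetical.

```shell
#!/bin/sh
# List log files whose name contains the given attack execution ID.
# Assumes the ID is part of the file name; adjust the pattern if it is not.
find_execution_logs() {
    execution_id=$1
    log_dir=${2:-/var/log/gremlin/executions}    # directory from the text above
    find "$log_dir" -type f -name "*${execution_id}*"
}
```

For example, `find_execution_logs <execution-id>` prints the path of each matching file, which you can then open with a pager such as `less`.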
To see how much disk space is being used by logs, run the du utility on the log directory:
du -sh /var/log/gremlin
The Gremlin Agent uses very little bandwidth in its idle state. In testing over a 5-minute period, the Agent sent a total of 11.3KB and received 24.8KB, an average combined bandwidth of 0.12KB/s.
Bandwidth consumption increases slightly during attacks. While an attack is running, the Agent stays in constant communication with the Control Plane to check whether the attack should be aborted. The bandwidth used is not affected by the type of attack being run. In testing over a 5-minute period, the Agent sent a total of 112.3KB and received 114.0KB, an average combined bandwidth of 0.75KB/s.
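The averages quoted above are the sent and received totals added together and divided by the length of the test window (5 minutes, or 300 seconds). The arithmetic can be reproduced with a small helper; the function name `avg_kb_per_sec` is hypothetical and only illustrates the calculation.

```shell
#!/bin/sh
# Average combined bandwidth in KB/s: (sent + received) / window in seconds.
avg_kb_per_sec() {
    awk -v sent="$1" -v recv="$2" -v secs="$3" \
        'BEGIN { printf "%.2f\n", (sent + recv) / secs }'
}

avg_kb_per_sec 11.3 24.8 300      # idle Agent: prints 0.12
avg_kb_per_sec 112.3 114.0 300    # during an attack: prints 0.75
```

The same helper can be used with totals you measure on your own hosts to compare against these reference figures.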
When Process Collection is enabled, the Gremlin Agent sends additional data, and the bandwidth consumed depends on how many processes are discovered. This data is gzip-compressed to minimize network consumption. To measure the actual bandwidth consumed by Gremlin in your installation, we recommend using a tool such as iptraf or nethogs.