Troubleshooting

This page contains troubleshooting instructions for errors you might encounter. If you can't find the answer to your question, check the Gremlin knowledge base for additional information.

Gremlin Agent

Unhealthy State

Check that you are only running attacks on active Gremlin agents. You can run an attack on an agent in an unhealthy state, but the attack may not complete. An unhealthy state indicates a problem with the installation or configuration of the Gremlin agent. If you see a Gremlin agent in an unhealthy state, or you are experiencing problems running attacks (such as receiving "Attack Interrupted" errors), refer to the Gremlin knowledge base for more information.

Troubleshooting attacks in an unhealthy state

LostCommunication

There are several reasons an agent can lose communication with the Gremlin Control Plane. Common examples include:

  • A network-based attack affected the agent's own traffic. Ensure both api.gremlin.com and DNS are whitelisted.
  • A CPU attack starved Gremlin of the CPU time it needs to compute API encryption. This is rare, but it does happen.

In the event of a LostCommunication error, the Gremlin agent triggers its dead-man switch and ceases all attacks.

This error can also occur when you run a network attack on a host where an earlier network attack was interrupted, that is, where the user, the system, or another tool killed the Gremlin agent mid-attack before Gremlin could run garbage collection.

To resolve this, run gremlin rollback.

Failed to parse execution attribute ‘pid’ for execution < HASH_STRING >

There are two non-exclusive modes of failure that can occur with this error message:

  • The running version of Gremlin is several versions out of date.
    • Update the Gremlin agent or Docker image.
  • /var/lib/gremlin/executions has become corrupt.
    • Delete the file /var/lib/gremlin/executions.

Kubernetes

Run Chao in debug mode

Chao supports the GODEBUG environment variable, which can be used to enable debug features such as verbose logging of HTTP activity. You can enable verbose HTTP logs by adding the following variable to the environment section of the Chao deployment.

NOTE: Verbose logging prints sensitive information like HTTP request and response bodies. This configuration is intended to be a troubleshooting measure only, and should be removed when no longer needed.

yaml
- name: GODEBUG
  value: http2debug=2

Chao's logs now include verbose output for HTTP requests.

Run Gremlin checks

You can run Gremlin's check subcommand on Kubernetes clusters to troubleshoot common configuration or compatibility issues with the environment. The following is an example Job that you can run to get gremlin check output.

yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: gremlin-check
  namespace: gremlin
  labels:
    k8s-app: gremlin
    version: v1
spec:
  template:
    metadata:
      labels:
        app.kubernetes.io/name: gremlin-check
    spec:
      restartPolicy: Never
      containers:
        - name: gremlin
          image: gremlin/gremlin
          # You can also pass subcommands (like `proxy` to check only proxy information)
          args: [ "check" ]
          env:
            # Pass the same environment you would pass to the Gremlin DaemonSet, including secrets and proxy information
            - name: GREMLIN_TEAM_PRIVATE_KEY_OR_FILE
              value: file:///var/lib/gremlin/cert/gremlin.key
            - name: GREMLIN_TEAM_CERTIFICATE_OR_FILE
              value: file:///var/lib/gremlin/cert/gremlin.cert
            - name: GREMLIN_IDENTIFIER
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            # Example proxy configuration
            # - name: https_proxy
            #   value: http://my-proxy:3128
            # - name: SSL_CERT_FILE
            #   value: /etc/gremlin/ssl/proxy-ca.pem
            # - name: GREMLIN_TEAM_ID
            #   value: my-team-id
          volumeMounts:
            - name: docker-sock
              mountPath: /var/run/docker.sock
            - name: gremlin-state
              mountPath: /var/lib/gremlin
            - name: gremlin-logs
              mountPath: /var/log/gremlin
            - name: gremlin-cert
              mountPath: /var/lib/gremlin/cert
              readOnly: true
            # Example proxy configuration
            # - name: proxy-ca
            #   mountPath: /etc/gremlin/ssl
      volumes:
        - name: docker-sock
          hostPath:
            path: /var/run/docker.sock
        - name: gremlin-state
          hostPath:
            path: /var/lib/gremlin
        - name: gremlin-logs
          hostPath:
            path: /var/log/gremlin
        - name: gremlin-cert
          secret:
            secretName: gremlin-secret
        # Example proxy configuration
        # - name: proxy-ca
        #   configMap:
        #     name: proxy-ca
  backoffLimit: 4

Once deployed, you can get the output of gremlin check by pulling the logs of the Pod associated with the Job:

shell
kubectl logs --follow \
  --namespace gremlin \
  $(kubectl get pods --namespace gremlin --selector=job-name=gremlin-check --output=jsonpath='{.items[*].metadata.name}')

proxy
====================================================
https_proxy   : http://proxy.local:3128
http_proxy    : (unset)
SSL_CERT_FILE : /etc/gremlin/ssl/proxy-ca.pem
Service Ping  : OK

Docker

Non-zero exit code (137)

Docker has killed the container via kill -9 (SIGKILL). This is often caused by out-of-memory (OOM) conditions and is most often seen when running a memory attack. Allocating more RAM to Docker usually solves the issue.
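
The value 137 follows the common shell convention of reporting 128 plus the signal number when a process is ended by a signal, and SIGKILL is signal 9. The sketch below reproduces that status without Docker by letting a throwaway child shell kill itself. It only illustrates where the number comes from and is not a Gremlin or Docker command.

```shell
# Illustration only: a child shell sends SIGKILL (signal 9) to itself,
# and the parent records the exit status it observes.
status=0
sh -c 'kill -9 $$' || status=$?

# A process ended by signal N is reported as 128 + N, so SIGKILL gives 137.
echo "exit status: $status"
```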

Non-zero exit code (1)

  • Unable to find local credentials file: Gremlin is not configured to point to the correct credentials file, usually located in /var/lib/gremlin. Ensure the credentials files, either certificates or API keys, exist and that Gremlin has read+write access to them.

  • Permission denied (os error 13): The Gremlin container does not have proper filesystem permissions. Gremlin requires write access to /var/lib/gremlin, including the ability to create new files. Check permissions on the host, and ensure write access is passed through by Docker when running the Gremlin container.
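
One way to test the write-access requirement is to try creating a file in the state directory, on the host and again from inside the container. The following is a minimal sketch; check_state_dir is a hypothetical helper written for this page, not part of the Gremlin CLI, and the probe file name is arbitrary.

```shell
# check_state_dir DIR: verify that DIR exists and that new files can be
# created in it. Hypothetical helper, not a Gremlin command.
check_state_dir() {
  dir="${1:-/var/lib/gremlin}"
  if [ ! -d "$dir" ]; then
    echo "missing: $dir"
    return 1
  fi
  # Creating and removing a probe file exercises the "create new files" case.
  probe="$dir/.write-probe.$$"
  if touch "$probe" 2>/dev/null; then
    rm -f "$probe"
    echo "ok: $dir is writable"
  else
    echo "not writable: $dir"
    return 1
  fi
}

check_state_dir /var/lib/gremlin || echo "fix this before starting the Gremlin container"
```

If the check passes on the host but fails inside the container, the volume mount or the container user is the more likely cause than the host permissions.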

OS Error 1

This is often observed in the context of Capabilities: Unable to inherit one or more required capabilities: cap_net_admin, cap_net_raw

Solution: Add the missing required capabilities to that Docker container (full list here: https://help.gremlin.com/security/#linux-capabilities)

Example: docker run -it --cap-add=NET_ADMIN --cap-add=KILL --cap-add=SYS_TIME gremlin/gremlin syscheck
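
To see which capabilities a container actually received, you can decode the CapEff bitmask that Linux exposes in /proc/self/status. The sketch below checks the two capabilities named in the error message. It assumes a Linux container with awk available; has_cap is a hypothetical helper, and the bit numbers 12 (CAP_NET_ADMIN) and 13 (CAP_NET_RAW) come from the Linux capability list rather than from Gremlin's documentation.

```shell
# Capability bit numbers as defined by the Linux kernel.
CAP_NET_ADMIN=12
CAP_NET_RAW=13

# has_cap MASK BIT: succeed if bit BIT is set in the hex mask MASK
# (the format of the CapEff field in /proc/<pid>/status).
# Hypothetical helper, not a Gremlin command.
has_cap() {
  [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# Run inside the container: report which of the two capabilities are missing.
if [ -r /proc/self/status ]; then
  mask=$(awk '/^CapEff:/ { print $2 }' /proc/self/status)
  if [ -n "$mask" ]; then
    has_cap "$mask" "$CAP_NET_ADMIN" || echo "missing cap_net_admin"
    has_cap "$mask" "$CAP_NET_RAW" || echo "missing cap_net_raw"
  fi
fi
```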

API Return codes

401

The Gremlin agent is unable to authenticate against the API. This error is usually caused by bad or missing credentials files or certificates, or by a revocation issued against the client.

Examples:

  • 401 Unauthorized - Authorization header is missing or malformed
  • Client has been revoked (401 Unauthorized)
  • AUTH_RENEW: 401 Unauthorized

Solution:

  • Ensure you have valid credentials (Certificates or API keys) in a location that Gremlin can read from.
  • Ensure Gremlin has proper read+write access to /var/lib/gremlin
  • Remove the file /var/lib/gremlin/.credentials if it exists
  • Rerun gremlin init

This error can also be the result of a race condition in which the Gremlin daemon starts before the environment variables have been exported.

In some specific cases, this error can also occur when multiple hosts or agents are configured with the same GREMLIN_IDENTIFIER. Common places this can occur:

  • Improperly configured ECS/Kubernetes/Mesosphere, where multiple Gremlin agents are assigned the same virtual IP
  • Missing host metadata on AWS/GCP/Azure, which causes Gremlin to revert to the default localhost identifier
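
If you can collect the identifier each agent reports (for example, by hand from each host), a quick way to spot collisions is to look for repeated values. The following is a minimal sketch; find_duplicate_identifiers is a hypothetical helper and the sample identifiers are made up for illustration.

```shell
# find_duplicate_identifiers: read one identifier per line on stdin and
# print each value that appears more than once.
find_duplicate_identifiers() {
  sort | uniq -d
}

# Hypothetical input: identifiers gathered from three agents.
printf '%s\n' 10.0.0.5 localhost localhost | find_duplicate_identifiers
```

Any value printed is shared by at least two agents; a repeated localhost suggests the missing-metadata case described above.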

402

The client limit for your company or team has been reached, so Gremlin does not have a license to apply to the client.

You may terminate or revoke existing clients, or contact Gremlin Sales to increase the client limit.

403

The account, most likely a trial account, has expired. Please contact Gremlin Sales to extend the trial.

408

This is most often attributed to a host having bad time data. Verify the system clock of the host and try again. If this problem persists after validating your host's system clock, contact Gremlin Support.

409

A 409 error code indicates that a conflicting attack is already running on the host. This is most often seen when one network attack is running (for example, a blackhole attack) and you attempt to launch a second network attack. It can also occur when you try to run two concurrent network or state attacks against the same target.