Troubleshooting Gremlin on OpenShift
This issue most often appears as timeout errors in both the Chao and Gremlin logs, for example:
```
error sending request for url (https://api.gremlin.com/v1/daemon/poll?multiple=1): operation timed out
```
This usually stems from network rules that prevent Gremlin from reaching the internet. To work out the intended network behavior for Gremlin on your infrastructure, start with these questions:
- What other services connect to the internet within your cluster?
- Do services within your cluster rely on an HTTP proxy when connecting to the internet?
If you've reviewed the proxy requirements and determined that Gremlin does not need an HTTP proxy, but you are still unable to connect Gremlin to the internet, it's likely one or more OpenShift projects are preventing internet access with an EgressNetworkPolicy.
You can list such policies in any project with the following command:
```bash
oc -n $PROJECT get egressnetworkpolicies
```
```
NAME   AGE
test   20m
```
If you look at the details of such a policy, you can see if network access for api.gremlin.com is denied. Here's an example of a policy which denies api.gremlin.com, because it only allows specific IP address ranges and host names while denying everything else.
```bash
oc -n $PROJECT get egressnetworkpolicy test -o yaml
```
```yaml
apiVersion: network.openshift.io/v1
kind: EgressNetworkPolicy
metadata:
  name: test
  namespace: test
spec:
  egress:
  - to:
      cidrSelector: 18.104.22.168/24
    type: Allow
  - to:
      dnsName: www.foo.com
    type: Allow
  - to:
      cidrSelector: 0.0.0.0/0
    type: Deny
```
Adding api.gremlin.com to such an EgressNetworkPolicy, ahead of the final Deny rule, will fix this problem.
```yaml
apiVersion: network.openshift.io/v1
kind: EgressNetworkPolicy
metadata:
  name: test
  namespace: test
spec:
  egress:
  - to:
      cidrSelector: 22.214.171.124/24
    type: Allow
  - to:
      dnsName: www.foo.com
    type: Allow
  - to:
      dnsName: api.gremlin.com
    type: Allow
  - to:
      cidrSelector: 0.0.0.0/0
    type: Deny
```
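EgressNetworkPolicy rules are evaluated in order and the first matching rule applies, so the position of the api.gremlin.com rule relative to the catch-all Deny matters. As a rough sketch, the hypothetical helper below (gremlin_egress_verdict is an illustrative name, not an oc or Gremlin command) reads a policy manifest on stdin and prints the type of the first rule that names api.gremlin.com or 0.0.0.0/0. It assumes each rule lists its target before its type, and it does not evaluate other CIDR ranges that might also cover the API.

```shell
# Hypothetical helper, not part of oc or Gremlin.
# Reads an EgressNetworkPolicy manifest on stdin and prints the type (Allow or
# Deny) of the first rule whose target is api.gremlin.com or 0.0.0.0/0.
# Assumption: every rule lists dnsName/cidrSelector before its type field.
# Prints nothing if no rule names either target.
gremlin_egress_verdict() {
  awk '
    # Remember the target of the rule currently being read.
    $1 == "dnsName:" || $1 == "cidrSelector:" { target = $2 }
    # The first rule that names the API host or the catch-all range decides.
    $1 == "type:" && (target == "api.gremlin.com" || target == "0.0.0.0/0") {
      print $2
      exit
    }
  '
}
```

Piping the output of `oc -n $PROJECT get egressnetworkpolicy test -o yaml` into `gremlin_egress_verdict` should print Deny for a policy that blocks the API and Allow once the api.gremlin.com rule precedes the catch-all.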
This issue will generate a variation of the following error:
```
container details : time="2022-05-11T13:07:21Z" level=error msg="container \"2584cede1cf01e77d9d9ac8f864f99f1c155268ec1095af2bbde850e73d936a2\" does not exist"
```
The Gremlin agent currently relies on the presence of a "sandbox" container to resolve container namespaces. OpenShift 4.9 uses the CRI-O 1.22 container runtime. This runtime generates a "sandbox" container (referred to as the "infra" container) during pod creation, but drops it immediately afterward by default. This behavior is controlled by the drop_infra_ctr flag in the container runtime (crio.runtime) table of the CRI-O configuration, which is set to true by default. In order to run attacks against an OpenShift 4.9 cluster, the drop_infra_ctr flag must be set to false.
To apply this workaround, save the following machine configuration as 95-gremlin-drop_infra_ctr.yaml for your OpenShift cluster:
```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 95-crio-worker-config
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
      - contents:
          source: data:,%5Bcrio.runtime%5D%0Adrop_infra_ctr%20%3D%20false%0A
        mode: 420
        overwrite: true
        path: /etc/crio/crio.conf.d/95-gremlin-drop_infra_ctr
  extensions: null
  fips: false
  kernelArguments: null
  kernelType: ""
  osImageURL: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d98cb7fe51f82a5eedce8ca8dd1a3a65406f15d13a40435786646148517392f3
```
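The source field in this machine configuration is a data URL whose payload is percent-encoded. To see what the MachineConfig writes to each node, the payload can be decoded locally. The sketch below relies on bash-specific parameter expansion and the bash printf builtin; the variable names are illustrative only.

```shell
# Requires bash: uses ${var//pattern/replacement} and printf '%b' with \xHH.
source_url='data:,%5Bcrio.runtime%5D%0Adrop_infra_ctr%20%3D%20false%0A'

# Strip the "data:," prefix to leave only the percent-encoded payload.
payload="${source_url#data:,}"

# Rewrite every %XX as \xXX, then let printf expand the escapes.
decoded="$(printf '%b' "${payload//%/\\x}")"

printf '%s\n' "$decoded"
# [crio.runtime]
# drop_infra_ctr = false
```

The decoded text is the drop-in file placed under /etc/crio/crio.conf.d/, which confirms that the only setting being changed is drop_infra_ctr.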
Install the machine configuration:
```bash
oc apply -f 95-gremlin-drop_infra_ctr.yaml
```
Wait for the apply command to propagate through to each machine config pool:
```bash
oc get machineconfigpools -w
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-4da59b029c1dc49757c63426cee6afe2   True      False      False      3              3                   3                     0                      13h
worker   rendered-worker-18cfed020d41141d6b6056c61b130685   True      False      False      3              3
```
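When this wait needs to be scripted rather than watched, one possible approach is to poll the same listing and check the UPDATED and UPDATING columns. The helper below is a hypothetical sketch (pools_settled is an illustrative name, not an oc subcommand) and assumes the default column order of the listing above, with a header row first.

```shell
# Hypothetical helper, not part of oc.
# Reads `oc get machineconfigpools` output on stdin and succeeds only if at
# least one pool is listed and every pool shows UPDATED=True, UPDATING=False.
# Assumption: default columns (NAME CONFIG UPDATED UPDATING ...) with a header.
pools_settled() {
  awk '
    NR == 1 { next }                                  # skip the header row
    { rows++ }                                        # count pool rows
    $3 != "True" || $4 != "False" { pending = 1 }     # pool still rolling out
    END { exit ((rows == 0 || pending) ? 1 : 0) }
  '
}
```

A polling loop could then look like `until oc get machineconfigpools | pools_settled; do sleep 30; done`, where the 30-second interval is an arbitrary choice.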
Once all machines are updated and ready, Gremlin agent attacks will work against the cluster.