How to Install and Use Gremlin with Mesosphere Marathon

Categories: Chaos Engineering

This tutorial walks you through installing Gremlin on Mesosphere using Marathon container orchestration, then testing the installation with a CPU Chaos Engineering experiment.

Prerequisites

  • We recommend creating a Gremlin application group in Marathon and, inside that group, creating an application definition for each role in your Marathon configuration. This guide assumes the default roles of private and public.
  • A Gremlin account (sign up here)
  • Certificate-based authentication should be used, and the certificates should be made available to the Gremlin daemon in /var/lib/gremlin on each host.
  • You will need to create a Marathon application definition for each Mesosphere role in your cluster. Currently those roles should be private and public; depending on your Mesos version, they may be expressed as either agent and agent_public, or slave and slave_public.
  • For each Mesosphere role on which you wish to run Gremlin, the number of instances in the application definition should match the number of nodes assigned to that role.
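
The last prerequisite, matching the instance count to the node count, can be scripted rather than edited by hand. The following is a minimal Python sketch and not part of the original tutorial: `with_instances` is a hypothetical helper name, and the node count is assumed to be known already (for example, from your own cluster inventory).

```python
import copy


def with_instances(app_def, node_count):
    """Return a copy of a Marathon app definition sized to one instance per node.

    Hypothetical helper: `node_count` is the number of nodes assigned to the
    role this definition targets; the caller is assumed to know it.
    """
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    sized = copy.deepcopy(app_def)  # leave the caller's definition untouched
    sized["instances"] = node_count
    return sized


if __name__ == "__main__":
    private = {"id": "/gremlin/gremlin-agent", "instances": 2}
    print(with_instances(private, 5)["instances"])
```

Combined with the hostname UNIQUE constraint used in the definitions in this guide, one instance per node should leave each node in the role with a single agent.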

Step 1: Install the Gremlin container into Marathon

This step will need to be repeated for each resource pool you wish to install the Gremlin agent on.

  • In the Marathon interface, navigate to the group in which you wish to create the application definition.
  • Click Create Application, located in the upper right-hand corner.
  • In the new application modal, use the radio button on the top right to select JSON mode.
  • Copy the appropriate Gremlin JSON definition (see JSON Application Definitions below) into the JSON text field.
  • Click Create Application, located in the lower right-hand corner of the modal.
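
If you would rather script this step than click through the UI, Marathon also accepts application definitions over its REST API with a POST to /v2/apps. The sketch below only builds the request, so it can be inspected before anything is sent. The Marathon address is an assumption (replace it with your own), `build_create_request` is a hypothetical helper name, and authentication, which many clusters require, is left out.

```python
import json
import urllib.request


def build_create_request(marathon_url, app_def):
    """Build (but do not send) a POST to Marathon's /v2/apps endpoint."""
    body = json.dumps(app_def).encode("utf-8")
    return urllib.request.Request(
        url=marathon_url.rstrip("/") + "/v2/apps",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Assumed address; replace with your own Marathon endpoint.
    request = build_create_request(
        "http://marathon.example.com:8080",
        {"id": "/gremlin/gremlin-agent", "instances": 2},
    )
    print(request.get_method(), request.full_url)
    # To actually submit the definition: urllib.request.urlopen(request)
```

Repeating the call once per resource pool, with the matching definition, mirrors repeating the UI steps above.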

Step 2: Install the Swissknife container into Marathon

The Swissknife container uses shellinabox to expose a few tools in the browser for testing purposes. This container should not be left running once you have finished this tutorial, but we have found it very helpful for troubleshooting our container environments.

For more information on the Swissknife container, see its Docker Hub page (khultman/swissknife).

  • In the Marathon interface, navigate to the group in which you wish to create the application definition.
  • Click Create Application, located in the upper right-hand corner.
  • In the new application modal, use the radio button on the top right to select JSON mode.
  • Copy the Swissknife JSON definition (see JSON Application Definitions below) into the JSON text field.
  • Click Create Application, located in the lower right-hand corner of the modal.

Step 3: Open HTOP from the Swissknife container

Let's open the htop application to get real-time metrics for the running swissknife container. As we run an example attack from Gremlin, this will help us visualize what is happening to the container. The htop application can be found at http://<PUBLIC-NODE-IP>:8888/htop. The container also provides a root shell at http://<PUBLIC-NODE-IP>:8888/, a constant ping to Google at http://<PUBLIC-NODE-IP>:8888/gping, and a running iostat at http://<PUBLIC-NODE-IP>:8888/iostat.

  • In the Marathon interface, navigate to the swissknife application.
  • Under the running instances, find the IP address associated with the container.
    • Depending on your network security settings, you may need to open access to port 8888.
    • The JSON definition assumes the role slave_public; you may need to change this if that is not the public role you use.
    • If the IP address given by Mesosphere is the private IP of the node the container is on, you will need to get the public IP of the node.
  • In a new browser window, open http://<PUBLIC-NODE-IP>:8888/htop, substituting the IP address found above.
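
The address lookup can also be scripted. Marathon's REST API lists an application's tasks at /v2/apps/<app-id>/tasks, and each task entry carries the host it runs on. The sketch below assumes you have already fetched and parsed that response; the sample payload and both helper names are illustrative only. As noted in the list above, the reported host may be a private address, in which case you still need to map it to the node's public IP yourself.

```python
def hosts_from_tasks(tasks_payload):
    """Extract agent hosts from a parsed /v2/apps/<app-id>/tasks response."""
    return [task["host"] for task in tasks_payload.get("tasks", [])]


def swissknife_urls(public_ip, port=8888):
    """Build the tool URLs listed in this step for one public node address."""
    base = f"http://{public_ip}:{port}"
    return {
        "shell": base + "/",
        "htop": base + "/htop",
        "gping": base + "/gping",
        "iostat": base + "/iostat",
    }


if __name__ == "__main__":
    # Illustrative payload, trimmed to the one field used here.
    sample = {"tasks": [{"host": "10.0.4.12", "ports": [8888]}]}
    print(hosts_from_tasks(sample))
    # 203.0.113.10 is a documentation-only address standing in for a public IP.
    print(swissknife_urls("203.0.113.10")["htop"])
```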

Congratulations, you should now see the htop interface in your web browser. Leave this open, as we'll be referring back to it in the next step.

Step 4: Run a test attack

  • In a new browser window, open the link to the Gremlin Attacks UI: https://app.gremlin.com/attacks
  • Click New Attack
  • Select the Containers tab
  • Select the swissknife container we created as your target by clicking the checkbox next to its container ID. You should be able to find it by the Docker key-value pair we added, app:swissknife.
  • Click Choose a Gremlin
  • Select the Resource category and CPU attack
  • The default values should be fine. Click Unleash Gremlin to launch the attack.
  • In the open htop browser window, observe the increased CPU load on the Docker container.

Conclusion

You now have Gremlin up and running in your Mesosphere environment and have validated its functionality against a running swissknife container. For security, you should remove the swissknife container from your cluster, as it provides an unsecured view into your running environment.

Feel free to expand this to other Mesosphere environments and have fun running Chaos Experiments!
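
The cleanup recommended above, removing the swissknife container, can be done from the Marathon UI or through Marathon's REST API with a DELETE to /v2/apps/<app-id>. The sketch below builds that request without sending it. The Marathon address is an assumption, `build_delete_request` is a hypothetical helper name, and authentication is left out.

```python
import urllib.request


def build_delete_request(marathon_url, app_id="/swissknife"):
    """Build (but do not send) a DELETE for one Marathon application."""
    return urllib.request.Request(
        url=marathon_url.rstrip("/") + "/v2/apps/" + app_id.strip("/"),
        method="DELETE",
    )


if __name__ == "__main__":
    # Assumed address; replace with your own Marathon endpoint.
    request = build_delete_request("http://marathon.example.com:8080")
    print(request.get_method(), request.full_url)
    # To actually remove the app: urllib.request.urlopen(request)
```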

JSON Application Definitions

  • The Gremlin application definitions map the mount points /var/run/docker.sock, /var/log/gremlin and /var/lib/gremlin to the same locations on the host nodes
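  • The GREMLIN_TEAM_ID and GREMLIN_CLIENT_TAGS fields, along with the credentials, will need to be updated to fit your environment. The definitions below show GREMLIN_TEAM_SECRET as the credential; for the certificate-based authentication recommended in the prerequisites, set GREMLIN_TEAM_CERTIFICATE_OR_FILE and GREMLIN_TEAM_PRIVATE_KEY_OR_FILE instead.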
  • As stated above, you will need to update the number of instances to match the number of nodes in each resource pool.
  • If you are using overlapping resource pools, take additional care to ensure an even, non-overlapping deployment of Gremlin, as running multiple agent containers on a single host is not recommended.
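
Because the definitions ship with angle-bracket placeholders such as <TEAM_ID_HASH>, it is easy to submit one that has not been fully filled in. The following is a small pre-flight check, offered as a sketch rather than as part of Gremlin's or Marathon's tooling: `unfilled_env` is a hypothetical helper that reports which environment variables still contain a placeholder.

```python
import re

# Matches any <LIKE_THIS> placeholder left in a value.
PLACEHOLDER = re.compile(r"<[^<>]+>")


def unfilled_env(app_def):
    """Return the env variable names whose values still contain a placeholder."""
    env = app_def.get("env", {})
    return sorted(
        name
        for name, value in env.items()
        if isinstance(value, str) and PLACEHOLDER.search(value)
    )


if __name__ == "__main__":
    definition = {
        "id": "/gremlin/gremlin-agent",
        "env": {
            "GREMLIN_TEAM_ID": "<TEAM_ID_HASH>",
            "GREMLIN_CLIENT_TAGS": "mesoscluster=prod,owner=me,mesosrole=private",
        },
    }
    print(unfilled_env(definition))
```

An empty list means every environment value has been filled in; anything else names the fields that still need attention.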

Gremlin Private JSON Definition

json
{
  "id": "/gremlin/gremlin-agent",
  "cmd": "/entrypoint.sh daemon",
  "cpus": 0.25,
  "mem": 64,
  "disk": 0,
  "instances": 2,
  "constraints": [["hostname", "UNIQUE"]],
  "acceptedResourceRoles": ["*"],
  "container": {
    "type": "DOCKER",
    "docker": {
      "forcePullImage": true,
      "image": "gremlin/gremlin",
      "parameters": [
        {
          "key": "cap-add",
          "value": "NET_ADMIN"
        },
        {
          "key": "cap-add",
          "value": "SYS_BOOT"
        },
        {
          "key": "cap-add",
          "value": "SYS_TIME"
        },
        {
          "key": "cap-add",
          "value": "KILL"
        }
      ],
      "privileged": true
    },
    "volumes": [
      {
        "containerPath": "/var/run/docker.sock",
        "hostPath": "/var/run/docker.sock",
        "mode": "RW"
      },
      {
        "containerPath": "/var/log/gremlin",
        "hostPath": "/var/log/gremlin",
        "mode": "RW"
      },
      {
        "containerPath": "/var/lib/gremlin",
        "hostPath": "/var/lib/gremlin",
        "mode": "RW"
      }
    ]
  },
  "env": {
    "GREMLIN_TEAM_ID": "<TEAM_ID_HASH>",
    "GREMLIN_TEAM_SECRET": "<TEAM_SECRET_HASH>",
    "GREMLIN_CLIENT_TAGS": "mesoscluster=<Your_Mesos_Cluster_Name>,owner=<Your_Name>,mesosrole=private"
  },
  "labels": {
    "name": "gremlin"
  },
  "portDefinitions": [
    {
      "port": 10000,
      "protocol": "tcp"
    }
  ]
}

Gremlin Public JSON Definition

json
{
  "id": "/gremlin/gremlin-agent-public",
  "cmd": "/entrypoint.sh daemon",
  "cpus": 0.25,
  "mem": 64,
  "disk": 0,
  "instances": 2,
  "constraints": [["hostname", "UNIQUE"]],
  "acceptedResourceRoles": ["slave_public"],
  "container": {
    "type": "DOCKER",
    "docker": {
      "forcePullImage": true,
      "image": "gremlin/gremlin",
      "parameters": [
        {
          "key": "cap-add",
          "value": "NET_ADMIN"
        },
        {
          "key": "cap-add",
          "value": "SYS_BOOT"
        },
        {
          "key": "cap-add",
          "value": "SYS_TIME"
        },
        {
          "key": "cap-add",
          "value": "KILL"
        }
      ],
      "privileged": true
    },
    "volumes": [
      {
        "containerPath": "/var/run/docker.sock",
        "hostPath": "/var/run/docker.sock",
        "mode": "RW"
      },
      {
        "containerPath": "/var/log/gremlin",
        "hostPath": "/var/log/gremlin",
        "mode": "RW"
      },
      {
        "containerPath": "/var/lib/gremlin",
        "hostPath": "/var/lib/gremlin",
        "mode": "RW"
      }
    ]
  },
  "env": {
    "GREMLIN_TEAM_ID": "<TEAM_ID_HASH>",
    "GREMLIN_TEAM_SECRET": "<TEAM_SECRET_HASH>",
    "GREMLIN_CLIENT_TAGS": "mesoscluster=<Your_Mesos_Cluster_Name>,owner=<Your_Name>,mesosrole=public"
  },
  "labels": {
    "name": "gremlin"
  },
  "portDefinitions": [
    {
      "port": 10000,
      "protocol": "tcp"
    }
  ]
}
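
The private and public Gremlin definitions differ in only three places: the application id, the accepted resource role, and the mesosrole client tag. If you keep the definitions in code, one option is to derive the public variant from the private one instead of maintaining two copies. The sketch below is an illustration under that assumption; `public_variant` is a hypothetical helper, and the default role name slave_public should be changed if your cluster uses a different public role.

```python
import copy


def public_variant(private_def, public_role="slave_public"):
    """Derive the public-role Gremlin definition from the private-role one."""
    public = copy.deepcopy(private_def)  # do not mutate the caller's definition
    public["id"] = private_def["id"] + "-public"
    public["acceptedResourceRoles"] = [public_role]
    env = public.setdefault("env", {})
    tags = env.get("GREMLIN_CLIENT_TAGS", "")
    env["GREMLIN_CLIENT_TAGS"] = tags.replace(
        "mesosrole=private", "mesosrole=public"
    )
    return public


if __name__ == "__main__":
    private = {
        "id": "/gremlin/gremlin-agent",
        "acceptedResourceRoles": ["*"],
        "env": {"GREMLIN_CLIENT_TAGS": "owner=me,mesosrole=private"},
    }
    derived = public_variant(private)
    print(derived["id"], derived["acceptedResourceRoles"])
```

The instance count is copied unchanged, so remember to set it to the number of public nodes before submitting.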

Swissknife JSON definition

json
{
  "id": "/swissknife",
  "cmd": null,
  "cpus": 0.25,
  "mem": 64,
  "disk": 0,
  "instances": 1,
  "constraints": [["hostname", "UNIQUE"]],
  "acceptedResourceRoles": ["slave_public"],
  "container": {
    "type": "DOCKER",
    "docker": {
      "forcePullImage": true,
      "image": "khultman/swissknife",
      "parameters": [],
      "privileged": false
    },
    "volumes": [],
    "portMappings": [
      {
        "containerPort": 8888,
        "hostPort": 8888,
        "labels": {},
        "name": "default",
        "protocol": "tcp",
        "servicePort": 8888
      }
    ]
  },
  "labels": {
    "app": "swissknife"
  },
  "networks": [
    {
      "mode": "container/bridge"
    }
  ],
  "portDefinitions": []
}
