Correlate Chaos Engineering and system impact using Gremlin webhooks and New Relic

When performing chaos experiments, it’s important to correlate the attack with the impact on your system. Webhooks send information about your attack to your monitoring tool of choice. In this way, you can observe from a single pane of glass the timeline of the attack and the reaction of your system. Exporting reports of the results also provides valuable information that can be shared when assessing the reliability of your applications.

In this tutorial, we’ll set up a Gremlin webhook to send custom events to New Relic to demonstrate how you can highlight the effect of a Gremlin Attack on a system.

Step 1: Register a New Relic API key

Register a new API key for the Gremlin webhook in New Relic’s Insights tool. Go to “Manage data” > “API Keys” and click the “+” symbol next to “Insert Keys.” Enter a description, such as “Gremlin Attack events,” and click “Save your notes.” Note the API key and the endpoint URL at the end of the example cURL command: https://insights-collector.newrelic.com/v1/accounts/{{account_id}}/events

Gather your New Relic endpoint and API key
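Before wiring up Gremlin, it can help to confirm that the key and endpoint accept a custom event. The following is a minimal sketch, not an official client: it only builds the POST request, using the endpoint pattern and “X-Insert-Key” header from this step. The `Content-Type` header, the function name, and the `source` field are assumptions made for illustration.

```python
import json
import urllib.request

# Endpoint pattern quoted in Step 1; the account ID is filled in per account.
INSIGHTS_URL = "https://insights-collector.newrelic.com/v1/accounts/{account_id}/events"


def build_event_request(account_id, insert_key, event):
    """Build (but do not send) a POST request carrying one custom event."""
    body = json.dumps([event]).encode("utf-8")
    return urllib.request.Request(
        INSIGHTS_URL.format(account_id=account_id),
        data=body,
        headers={
            "Content-Type": "application/json",  # assumed; the body is JSON
            "X-Insert-Key": insert_key,
        },
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder values: substitute your own account ID and Insert key.
    request = build_event_request(
        "YOUR_ACCOUNT_ID",
        "YOUR_INSERT_KEY",
        {"eventType": "GremlinAttack", "source": "manual-test"},  # "source" is hypothetical
    )
    print(request.get_method(), request.full_url)
    # To actually send the event: urllib.request.urlopen(request)
```

Keeping the send step separate from request construction makes it easy to inspect the URL and headers before any data reaches your account.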

Step 2: Log into your Gremlin account

Log in to your Gremlin account. If you don’t have an account, you can sign up for free.

Step 3: Add a new Gremlin webhook

Click the Settings button to the left of the red “Halt All Attacks” button. Select “Team Settings” then “Webhooks”. Select “New Webhook”. Fill in the New Relic endpoint URL and API key header from Step 1:

Endpoint URL:

    https://insights-collector.newrelic.com/v1/accounts/{{account_id}}/events

Header:

    "X-Insert-Key": "{{API_KEY}}"

New Relic requires one field in the body of every custom event, “eventType”. Add it to the body of the Gremlin webhook:

    "eventType": "{{value}}"

This example uses “eventType”:“GremlinAttack” as the primary event to filter on inside New Relic. Customize the other key-value pairs as needed. New Relic picks up the other fields as attributes that you can filter on in NRQL queries and charts.
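As a rough illustration of the body described above, the sketch below assembles a JSON object with the required “eventType” plus any extra key-value pairs. The helper name and the `team` and `environment` fields are invented for this example; they are not fields that Gremlin is known to send.

```python
import json


def build_webhook_body(event_type, extra_fields=None):
    """Return a JSON body with the required eventType plus optional fields."""
    if not isinstance(event_type, str) or not event_type:
        raise ValueError("eventType must be a non-empty string")
    body = {"eventType": event_type}
    for key, value in (extra_fields or {}).items():
        if key == "eventType":
            raise ValueError("eventType is set separately")
        body[key] = value
    return json.dumps(body)


if __name__ == "__main__":
    # "team" and "environment" are made-up field names for illustration.
    print(build_webhook_body("GremlinAttack", {"team": "checkout", "environment": "staging"}))
```

Rejecting an empty “eventType” up front mirrors the one requirement the tutorial calls out, so a malformed body is caught before it is pasted into the webhook form.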

Add a webhook to your Team using the "New Webhook" button

At this stage, all new attacks will send information in the form of custom events to New Relic. From here you can add custom charts and alerting to show when attacks occurred and the corresponding impact on your system, generate reports on how your infrastructure fared against attacks, or warn engineering leaders of the start of a FireDrill. Next, we’ll walk through a custom chart example.

Step 4: Create a chart that shows impact

Using NRQL, you can design a variety of charts. For this example, we’ll use a chart to highlight the effect of an attack upstream. To add a Gremlin chart, go to your dashboard and click “Query your data”. Enter the following query in the query field, then click “Run”.

    SELECT filter(count(*), WHERE attackStatus = 'RUNNING') AS 'Attack Running', filter(count(*), WHERE attackStatus = 'FINISHED') AS 'Attack Finished' FROM GremlinAttack SINCE 30 minutes ago TIMESERIES 1 minute

Add a query on the Gremlin events to build a chart
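If you want the same chart for several event types or time ranges, the query can be generated rather than typed by hand. This is a small sketch under the assumption that each status gets one filtered count, as in the query above. The function name and parameters are hypothetical, and the status values are inserted verbatim, so only pass trusted strings.

```python
def build_status_query(statuses, event_type="GremlinAttack", since_minutes=30):
    """Build an NRQL query with one filtered count per attack status."""
    columns = ", ".join(
        f"filter(count(*), WHERE attackStatus = '{status}') "
        f"AS 'Attack {status.capitalize()}'"
        for status in statuses
    )
    return (
        f"SELECT {columns} FROM {event_type} "
        f"SINCE {since_minutes} minutes ago TIMESERIES 1 minute"
    )


if __name__ == "__main__":
    # Reproduces the query used in this step.
    print(build_status_query(["RUNNING", "FINISHED"]))
```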

Step 5: Run an attack

The last step is to run an attack and share your results. The CPU Attack is a great first step and is useful to verify tools such as monitoring, alerting, and autoscaling. We’ll run a CPU Attack and make sure that New Relic reports and informs us of the increased load.

In the Gremlin app, select “Create Attack” and select your hosts. Next, click “Choose a Gremlin”. Select “Resource” for the category and “CPU” for the attack. For this example, we’ll set the “Length” to 120 seconds, the “CPU Capacity” to 80%, and the impact to All Cores. Then click “Unleash Gremlin.”

Add a CPU Attack

You can see the start and end of the attack in the chart we created and the increased CPU load in the CPU Utilization chart. Our webhooks and New Relic monitoring tool are working.

New Relic charts showing the impact of a CPU Attack
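The comparison the charts make visually can also be expressed as a number: average CPU during the attack window minus average CPU outside it. The sketch below assumes you already have CPU samples as (timestamp, percent) pairs. The sample values and attack window are made up for illustration and are not results from a real attack.

```python
from statistics import mean


def split_by_attack_window(samples, attack_start, attack_end):
    """Split (timestamp, cpu_percent) samples into during-attack and baseline lists."""
    during = [cpu for ts, cpu in samples if attack_start <= ts <= attack_end]
    baseline = [cpu for ts, cpu in samples if ts < attack_start or ts > attack_end]
    return during, baseline


def cpu_impact(samples, attack_start, attack_end):
    """Return mean CPU during the attack minus mean CPU outside it."""
    during, baseline = split_by_attack_window(samples, attack_start, attack_end)
    if not during or not baseline:
        raise ValueError("need samples both inside and outside the attack window")
    return mean(during) - mean(baseline)


if __name__ == "__main__":
    # Made-up samples: (seconds since start of observation, CPU percent).
    samples = [(0, 10.0), (60, 12.0), (120, 80.0), (180, 82.0), (240, 11.0)]
    # A 120-second attack assumed to run from t=100 to t=220.
    print(cpu_impact(samples, attack_start=100, attack_end=220))
```

A large positive difference suggests the attack landed and monitoring observed it; a value near zero would be a cue to check that the hosts were targeted and that metrics are being collected.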

At the end of the experiment, you can export the result from New Relic as a PDF to share your findings and to provide metrics for the reliability of your system.

New Relic's export as PDF function

Create your own integrations

Webhooks open up many ways to gain insight into the behavior of your systems. You can integrate with any system that accepts JSON POST requests and customize the results to be most relevant to your organization. For example, recreate past incidents with Gremlin and hunt down the problem using your monitoring tool, or integrate with your monitoring tool to ensure you’re gathering the right metrics, have visibility into the health of your services, and are meeting your Service Level Objectives (SLOs).
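To make “any system that accepts JSON POST requests” concrete, here is a minimal sketch of such a receiver built on Python’s standard library. It is an illustration only: the handler, the in-memory `received_events` list, the port, and the 204 response are assumptions, and a real integration would add authentication and persistent storage.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

received_events = []  # in-memory store, for illustration only


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        """Accept a JSON body, store it, and reply without content."""
        length = int(self.headers.get("Content-Length", 0))
        try:
            event = json.loads(self.rfile.read(length))
        except ValueError:
            # Body was empty or not valid JSON.
            self.send_response(400)
            self.end_headers()
            return
        received_events.append(event)
        self.send_response(204)
        self.end_headers()

    def log_message(self, format, *args):
        pass  # silence per-request logging


def make_server(port=8080, host="127.0.0.1"):
    """Create the receiver; call serve_forever() on the result to run it."""
    return HTTPServer((host, port), WebhookHandler)
```

Calling `make_server().serve_forever()` starts the receiver locally. Pointing a webhook at it would require the address to be reachable from the sender.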

You can learn more about Gremlin webhooks in the Docs. We’re excited to see what you build with this feature!


© 2020 Gremlin Inc. San Jose, CA 95113