
Reliability Intelligence

Supported platforms:

N/A

Reliability Intelligence analyzes your Gremlin environment to provide real-time reliability insights and recommendations. This includes identifying failed reliability tests and offering step-by-step instructions to remediate the underlying problems. Gremlin also provides a Model Context Protocol (MCP) server, which lets you integrate Gremlin into your LLM or AI service of choice.

Note
Gremlin will not send data to LLM or AI services without your consent, nor will we use your data to train LLMs.

How Reliability Intelligence works

Reliability Intelligence works automatically to uncover reliability risks in your Gremlin organization and provide recommended resolutions.

Experiment analysis and remediation

Reliability Intelligence can determine the root cause of failed tests and suggest remediations tailored to your service. It uses context from your service, including:

  • The type of test that was run.
  • The type of service that the test ran on.
  • The Health Check that raised the error.
  • Any unusual events during the test (such as an out-of-memory event).

For example, here is a Kubernetes deployment that failed its Memory Scalability test. The icon next to the “failed” result means Reliability Intelligence has insight into this test result.

The Reliability Intelligence icon appears to the right of failed tests whenever a suggestion is available. Click the icon to view the suggestions.

In this case, Kubernetes' out-of-memory (OOM) killer terminated one of the pods in the deployment, which in turn increased the error rate. Gremlin diagnosed that these two events are related and that we can fix the problem by increasing the replica count or reserving more memory for the pods in advance.
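The two remediations above can be sketched in a Deployment manifest. The names and values here are illustrative only, not taken from the failed test:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app            # illustrative name
spec:
  replicas: 3                  # remediation 1: extra replicas absorb the loss of an OOM-killed pod
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
        - name: app
          image: example-app:latest
          resources:
            requests:
              memory: "512Mi"  # remediation 2: reserve more memory up front
            limits:
              memory: "1Gi"    # container is OOM-killed if it exceeds this limit
```

Raising `resources.requests.memory` makes the scheduler place pods on nodes with enough free memory, while a higher `replicas` count keeps the service available when one pod is terminated.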

Failed reliability tests like this Host Redundancy test will display any relevant Reliability Intelligence diagnoses in the test results.

Enabling Reliability Intelligence

Reliability Intelligence is enabled by default.

Enabling LLM access for Reliability Intelligence

You can optionally grant Gremlin permission to use a large language model (LLM) for Reliability Intelligence. This gives you access to more detailed diagnoses and recommendations as we improve Reliability Intelligence.

To enable LLM access:

  1. Open the Company Settings page.
  2. Next to Advanced, click Edit.
  3. Select the checkbox next to Allow LLM access.
  4. Click Save.

To disable LLM access, repeat steps 1 and 2, uncheck the box, and click Save.

Using the Gremlin MCP Server

Gremlin provides a Model Context Protocol (MCP) server, which lets you interact with your Gremlin account using your choice of AI agent.

See the GitHub repository for the prerequisites and detailed instructions on configuring and using the Gremlin MCP server.
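As a sketch, MCP-capable clients are typically pointed at a server through a configuration entry like the one below. The command, package name, and environment variable here are assumptions for illustration; check the GitHub repository for the actual values:

```json
{
  "mcpServers": {
    "gremlin": {
      "command": "npx",
      "args": ["-y", "@gremlin/mcp-server"],
      "env": {
        "GREMLIN_API_KEY": "<your-api-key>"
      }
    }
  }
}
```

Once the client is configured, your AI agent can call the server's tools to query your Gremlin account on your behalf.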

The Gremlin MCP server lets you analyze your services, dependencies, reliability scores, and more through your LLM of choice. The recommendation shown in this example was generated by Claude Sonnet 4.