Frequently asked questions

Answers to common questions about Gremlin, Chaos Engineering, and Reliability Management.

General

What is Gremlin?

Gremlin is the world's first enterprise reliability platform that helps engineering teams find and fix reliability risks before they become expensive outages. Gremlin combines Chaos Engineering, Reliability Management, and Reliability Intelligence to give you full visibility and control over your reliability posture—especially critical in the era of AI, where uptime and trust matter more than ever.

What is Chaos Engineering?

Chaos Engineering is the practice of identifying potential failures in a system before they become outages. It involves injecting faults into systems (such as increasing CPU consumption, adding network latency, or simulating zone outages), observing how those systems respond, and then using that knowledge to improve the system and reduce the risk of production outages.
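As a rough illustration of the inject-and-observe loop described above, the sketch below adds artificial latency to a function call and checks whether the call still completes within a time budget. It does not use Gremlin; `with_latency`, `fetch_profile`, `call_within_budget`, and the 0.1-second budget are hypothetical names and values chosen for this example.

```python
import time
from typing import Callable


def with_latency(func: Callable[[], str], delay_seconds: float) -> Callable[[], str]:
    """Wrap func so every call is delayed, simulating added network latency."""
    def wrapped() -> str:
        time.sleep(delay_seconds)  # the injected fault
        return func()
    return wrapped


def fetch_profile() -> str:
    """Stand-in for a call to a downstream dependency (hypothetical)."""
    return "profile-data"


def call_within_budget(func: Callable[[], str], budget_seconds: float) -> bool:
    """Observe the system: did the call finish within its time budget?"""
    start = time.monotonic()
    func()
    return time.monotonic() - start <= budget_seconds


# Baseline behavior, then the same call with 0.2 s of injected latency.
baseline_ok = call_within_budget(fetch_profile, 0.1)
degraded_ok = call_within_budget(with_latency(fetch_profile, 0.2), 0.1)
print(f"baseline within budget: {baseline_ok}")
print(f"with injected latency within budget: {degraded_ok}")
```

A failing result under injected latency is the kind of finding that would prompt a fix, such as a timeout or a fallback, before the same condition occurs in production.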

Learn more about Chaos Engineering.

What is Reliability Management?

Reliability Management is a standards-based approach to baseline, remediate, and automate the reliability of complex, distributed systems. It helps teams measure and improve service reliability before incidents occur, while also standardizing reliability testing across the organization. Like Chaos Engineering, Reliability Management uses fault injection to test systems, but it automates the process of detecting potential failures and creating action items.

Learn more about Reliability Management.

How is this different from traditional testing?

Reliability testing goes beyond traditional failure testing because it's not only about verifying assumptions. It uncovers behaviors that are hard to predict and helps teams better understand complex, dynamic systems.

How is this different from performance testing?

Reliability testing and performance testing complement each other. Both practices stress systems to see how they behave under demanding conditions, but in different ways. In production, systems are likely to experience both high load and unstable conditions. Combining reliability and performance testing helps ensure your systems are resilient under these conditions, better preparing them for real-world scenarios.

Learn more in our blog post on how reliability testing and load testing are complementary.

We already have observability and incident response. Why do we need reliability testing?

Observability and incident response are important, but they only provide insights after an issue has already occurred. Reliability testing proactively checks your systems to reveal reliability risks so you can address them before they lead to incidents.

Learn more in our blog post on how observability and incident response need resilience testing.

What measurable benefits can we expect from using Gremlin?

Gremlin has helped our customers achieve:

More generally, finding and fixing reliability risks reduces downtime, which in turn reduces lost revenue, lost productivity, and lost customer goodwill. On average, enterprises lose $14,056 per minute of downtime. Increasing availability from 98% to 99% cuts downtime in half and, at that average cost, saves roughly $72 million annually. Other gains are harder to quantify, such as faster time to market, more successful product launches, and reduced stress on engineers.
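The arithmetic behind that estimate can be sketched as follows. This is a rough check, assuming a 365-day year and a constant cost per minute of downtime; the function names are illustrative and not part of any Gremlin tooling. Under these assumptions the result comes out slightly above the rounded figure quoted above.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year
COST_PER_MINUTE = 14_056          # average downtime cost in USD, from the text


def downtime_minutes(availability: float) -> float:
    """Expected minutes of downtime per year at a given availability (0 to 1)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def annual_savings(before: float, after: float,
                   cost_per_minute: float = COST_PER_MINUTE) -> float:
    """Cost avoided per year by raising availability from `before` to `after`."""
    avoided_minutes = downtime_minutes(before) - downtime_minutes(after)
    return avoided_minutes * cost_per_minute


saved = annual_savings(0.98, 0.99)
reduction = 1 - downtime_minutes(0.99) / downtime_minutes(0.98)
print(f"Downtime reduced by {reduction:.0%}")
print(f"Estimated savings: ${saved:,.0f} per year")
```

Going from 98% to 99% availability removes about 5,256 minutes (roughly 3.65 days) of downtime per year, which is where the savings come from.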

Learn more about the benefits in our blog on the ROI of reliability.

Features

What can I test using Gremlin?

Gremlin’s experiments can test every layer of your stack, including networks, compute resources, operating systems, and applications. You can run experiments against hosts, containers, applications, and Kubernetes resources. These experiments include:

Resource experiments:

  • CPU exhaustion
  • Memory exhaustion
  • Disk I/O stress
  • Disk storage exhaustion
  • GPU stress

State experiments:

  • Process termination
  • Host shutdown/restart
  • System time manipulation

Network experiments:

  • Network outage/blackhole
  • Network latency
  • Packet loss/packet corruption
  • SSL/TLS certificate expiration
  • DNS outage

See Gremlin’s complete list of experiments.

Will Gremlin support our tech stack?

The Gremlin agent supports Linux and Windows hosts, containerized environments like Docker and Kubernetes, and applications. Gremlin is cloud-agnostic and supports all public cloud environments—including AWS, Azure, and GCP—as well as on-prem or air-gapped environments with Gremlin Private Edition.

Gremlin also natively integrates with:

  • CI/CD pipelines (Jenkins, GitLab, etc.)
  • Observability tools (CloudWatch, AppDynamics, Grafana, Datadog, Dynatrace, Honeycomb, New Relic, PagerDuty, Prometheus)
  • Cloud platforms (AWS, Azure, GCP, Pivotal Cloud Foundry, etc.)
  • Container orchestration tools (Kubernetes, Docker, Podman, etc.)
  • Other tools (using the Gremlin REST API or Webhooks)

See our list of supported platforms.

What are GameDays and how do they work?

GameDays are organized team events designed to test systems and processes. They involve coordinated experiments that simulate realistic failure scenarios, helping teams practice incident response and discover system weaknesses in a controlled environment.

How does Gremlin help with disaster recovery planning?

Gremlin helps you validate that your systems, infrastructure, and response plans are reliable. You can also use Gremlin to practice your disaster recovery plans to prepare engineers to respond quickly in an actual emergency.

What reporting and analytics capabilities does Gremlin provide?

Gremlin includes built-in dashboards and reports to review your reliability posture on the service, team, and organization levels. Users of AI tools like ChatGPT and Claude can also use the Gremlin MCP server to query their Gremlin data and create custom reports.

Safety & Security

Is it safe to run chaos experiments and reliability tests in production?

Safety and control are core aspects of Gremlin. All of our experiments can be reverted immediately, allowing you to safely abort and return to steady state if things go wrong. Gremlin also provides enterprise safety and security features, such as role-based access control and fail-safe agents.

More details are available on our security page.

What security standards does Gremlin meet?

Gremlin is SOC 2 and GDPR compliant, and follows industry-standard security guidance from OWASP and NIST. The platform is designed for enterprise environments with role-based access controls and audit logging.

What happens if something goes wrong during an experiment?

The Gremlin web app displays a “halt” button on all active experiment screens. Clicking this button immediately stops and rolls back the experiment. The web app also includes a “halt all experiments” button in the top-right corner, accessible from any page.

For Scenarios and reliability tests, Gremlin monitors your systems using Health Checks. If a Health Check raises an error or detects a problem with your systems, Gremlin will stop the experiment.

Gremlin’s built-in safety mechanisms include:

  • Immediate halt capabilities - stop any experiment at the push of a button, with real-time system monitoring.
  • Blast radius controls - precisely target specific infrastructure components without affecting others.
  • Agent fail-safes - experiments automatically stop if the Gremlin agent loses connection to the Gremlin platform.
  • Monitoring integration - automatic experiment halting based on your system’s health metrics and existing observability monitors.
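As a rough illustration of the Health Check behavior described above, the sketch below steps through a simulated experiment and halts it as soon as a health check fails. It does not use the Gremlin API or reflect how Gremlin is implemented; `run_experiment`, `error_rate_below_5_percent`, and the 5% threshold are hypothetical names and values chosen for this example.

```python
from typing import Callable, Iterable, Tuple


def run_experiment(
    error_rates: Iterable[float],
    health_check: Callable[[float], bool],
) -> Tuple[str, int]:
    """Step through a simulated experiment, halting on the first failed check.

    Returns the final state and the number of steps completed before stopping.
    """
    steps = 0
    for rate in error_rates:
        if not health_check(rate):
            return "halted", steps  # stop immediately; no further steps run
        steps += 1
    return "completed", steps


def error_rate_below_5_percent(rate: float) -> bool:
    """Hypothetical health check: healthy while the error rate stays under 5%."""
    return rate < 0.05


# Simulated error rates observed at each step of the experiment.
observed = [0.01, 0.02, 0.08, 0.01]
state, steps = run_experiment(observed, error_rate_below_5_percent)
print(f"experiment {state} after {steps} healthy steps")
```

The design point is that the check runs before each step continues, so an unhealthy reading stops the experiment rather than letting it run to completion.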

Getting Started

Does Gremlin have a free tier?

Gremlin offers a free 30-day trial with no credit card required. This includes access to all platform features so you can evaluate the full capabilities with your actual infrastructure.

What's included in the trial?

  • Full access to all chaos engineering experiments
  • Reliability testing and scoring capabilities
  • Integration with your monitoring and observability tools
  • Customer success support for implementation guidance

How long does it take to implement Gremlin?

The Gremlin agent takes only a few minutes to install and register with our SaaS platform. To help teams get started quickly, Gremlin offers a free 30-day trial with no credit card required. For teams with strict security controls, we’ll work with you to onboard your systems in a way that meets your security requirements.

Do we need special expertise to use Gremlin?

No: anyone can use Gremlin to run tests and see results.

Can we start with non-production environments?

Absolutely. Most teams start with staging or development environments to build confidence and expertise before moving to production.

What training resources are available?

Does Gremlin have a certification program?

Yes, Gremlin offers the Enterprise Chaos Engineering Certification, a free virtual course that covers applying Chaos Engineering practices in enterprise environments.

Avoid downtime. Use Gremlin to turn failure into resilience.

Gremlin empowers you to proactively root out failure before it causes downtime. See how you can harness chaos to build resilient systems by requesting a demo of Gremlin.
