Create resilience. Learn from failure. Go beyond Chaos Monkey and the Simian Army.

Gremlin's chaos engineering tools allow devs and SREs to safely, securely, and easily simulate real outages with an ever-growing library of attacks. Run game days with the only Failure-as-a-Service platform.



Minimize the blast radius with precise failure testing. Safely halt and roll back to steady state at a moment’s notice.



Gremlin doesn't require root access, provides SSO & MFA, and undergoes regular third-party security audits.



Install and run attacks in minutes. Works on hosts or containers.


Over a decade of collective experience unleashing chaos at companies like


How it works.

Run disciplined chaos experiments to identify weak points in your system and fix them before they become a problem.

  • I. Plan an Experiment

    Create a hypothesis for what might go wrong.

  • II. Run the smallest version

    Execute a simple test to see how your system responds.

  • III. Scale or squash

    Scale the experiment until you identify a bug. Then squash it.
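
The three steps above can be sketched as a small loop. This is an illustrative outline only, not Gremlin's API: the `Experiment` class and its `inject`, `steady_state_ok`, and `rollback` hooks are hypothetical names standing in for whatever your tooling provides.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Experiment:
    # Step I: the hypothesis states what should stay true under failure.
    hypothesis: str
    # Hypothetical hooks (assumptions for this sketch, not a real API):
    inject: Callable[[float], None]      # apply the failure to a fraction of targets
    steady_state_ok: Callable[[], bool]  # is the system still behaving as hypothesized?
    rollback: Callable[[], None]         # remove the injected failure


def run_experiment(exp: Experiment, radii: List[float]) -> Optional[float]:
    """Run the experiment at each blast radius, smallest first.

    Returns the first radius at which the hypothesis broke, or None if
    it held at every radius.
    """
    for radius in radii:          # Step II starts small, Step III scales up
        exp.inject(radius)
        try:
            healthy = exp.steady_state_ok()
        finally:
            exp.rollback()        # always return to steady state
        if not healthy:
            return radius         # a weak point was found: time to squash it
    return None
```

Calling `run_experiment(exp, [0.01, 0.05, 0.25, 1.0])` would start with 1% of targets and stop at the first radius that breaks the hypothesis.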

We use Gremlin to test various failure scenarios and build confidence in the resiliency of our microservices. The ability to target containerized services with an easy-to-use UI has reduced the amount of time it takes us to do fault injection significantly.

Paul Osman
Senior Engineering Manager at Under Armour

Truly comprehensive fault injection.

Gremlin has a full suite of enterprise-grade failure testing modes so you can find out how resilient your production system is.

Build resilient infrastructure.

Recreate real-world technical and business failures and prepare for the failure of dependencies, internal or external.

Resource exhaustion

Which resource is your bottleneck? CPU, memory, IO, or disk? Find out for certain.

Bad behavior

Processes die, time drifts, instances reboot. Are you ready?

Unreliable networks

You cannot rely on the network, nor on your dependencies being available. What happens when they slow down or disappear altogether?

Build resilient applications.

Recreate application outages to quickly resolve or prevent them in the first place.

Accurately and precisely simulate an outage

Confidently experiment with a tiny blast radius to quickly reduce your failure surface area.

Be resilient to the unknown

Maintain a consistent end user experience by preparing for dependency failure.

Serverless, including Lambda

Understand how your applications behave in the face of failure in a serverless environment.

Surgical precision.

Minimize business impact and maximize learning with precise, fine-grained experiments.

  • Experiment on a single user, device, or <attribute> to begin.

  • Request-level granularity, from 0.01% of traffic up to 100%.

  • Fail or delay any part of your application, from functions to endpoints.
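
One common way to get request-level granularity is to hash a stable request or user identifier into a fixed number of buckets. The sketch below assumes that approach; it is not a description of how Gremlin selects traffic, and `in_experiment` is a hypothetical helper name.

```python
import hashlib


def in_experiment(request_id: str, percent: float) -> bool:
    """Return True if this request falls inside `percent` of traffic.

    `percent` ranges from 0 to 100. Using 10,000 buckets gives a
    resolution of 0.01%, matching the smallest slice mentioned above.
    The same identifier always maps to the same bucket, so a given
    user or request is consistently in or out of the experiment.
    """
    if not 0.0 <= percent <= 100.0:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < round(percent * 100)
```

Because the bucket is fixed per identifier, raising the percentage only adds requests to the experiment; anything selected at 1% is still selected at 5%.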

Never accidentally cause an outage.

If the unexpected happens, Gremlin's failsafes automatically halt your experiment and fall back to steady state.

Gremlin is built to not only cause failure, but to handle it as well.
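
The idea of an automatic halt can be illustrated with a simple watchdog loop. This is a minimal sketch under stated assumptions, not Gremlin's failsafe mechanism: `start`, `stop`, and `healthy` are hypothetical callbacks you would supply, and the time limit and polling interval are arbitrary.

```python
import time
from typing import Callable


def run_with_failsafe(
    start: Callable[[], None],     # hypothetical: begin injecting failure
    stop: Callable[[], None],      # hypothetical: remove the failure
    healthy: Callable[[], bool],   # hypothetical: steady-state health check
    max_seconds: float,
    poll_seconds: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Run an experiment, halting early if the health check fails.

    Returns "halted" if the health check failed before the time limit,
    otherwise "completed". `stop` is always called exactly once, so the
    system returns to steady state on every exit path, including errors.
    """
    start()
    deadline = clock() + max_seconds
    try:
        while clock() < deadline:
            if not healthy():
                return "halted"
            sleep(poll_seconds)
        return "completed"
    finally:
        stop()
```

Putting `stop` in a `finally` block means the rollback runs even if the health check itself raises an exception.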

Secure from the ground up.

Security is a first-class citizen and part of our DNA.

  • Least Permissions

    Gremlin runs on default Linux permissions and doesn’t require root access.

  • Audit Everything

    Every action taken on the platform creates an audit trail.

  • Ready for Production

    Multi-factor authentication, secure single sign-on, and role-based access control. Gremlin is secured to allow experimentation in production.

  • Enterprise Grade

    The Gremlin client, daemon, API, and website undergo regular security auditing by an external auditor.

Staggering simplicity.

Get Gremlin up and running in moments with 3 lines of code.

By engineers, for engineers.

Manage Gremlin from our intuitive interface or the command line.

Gremlin API


Control and automate everything Gremlin does via our API.

Gremlin CLI


Manage Gremlin from the command line.

Avoid downtime. Use Gremlin to turn failure into resilience.

Gremlin empowers you to proactively root out failure before it causes downtime. Try Gremlin for free and see how you can harness chaos to build resilient systems.