The Comprehensive Chaos Engineering Platform

Everything you need to safely, securely, and simply build reliable software through Chaos Engineering.

Improve reliability at every level of your stack

Use Gremlin's comprehensive set of failure modes to experiment across your system, including bare metal, any cloud provider, containerized environments, Kubernetes, applications, and serverless.

Build resilient infrastructure

  • Resource Gremlins
    Throttle CPU, Memory, I/O, and Disk
  • State Gremlins
    Reboot hosts, kill processes, travel in time
  • Network Gremlins
    Introduce latency, blackhole traffic, lose packets, fail DNS

Test for application failure

  • Test for failure in your code
  • Fail or delay serverless functions
  • Narrow the impact to a single user, device, or percentage of traffic
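
As a rough illustration of narrowing impact to a single user or a percentage of traffic, the sketch below hashes a user ID into a stable bucket. It is not Gremlin's implementation; the function name, the salt, and the bucket count are assumptions made for this example.

```python
import hashlib
from typing import Optional


def should_impact(
    user_id: str,
    percent: float = 0.0,
    only_user: Optional[str] = None,
    salt: str = "experiment-1",  # hypothetical per-experiment salt
) -> bool:
    """Decide whether a request from user_id falls inside the experiment."""
    # Single-user targeting: impact exactly one user and nobody else.
    if only_user is not None:
        return user_id == only_user
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    # Hash the salted ID so a given user always lands in the same bucket.
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    # Buckets 0..9999 map to 0.00%..99.99% of traffic.
    return bucket < percent * 100
```

Because the bucket depends only on the salt and the user ID, the same users stay in the experiment as the percentage grows, which keeps results comparable between runs.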

Run chaos experiments in any environment

Test anywhere.

Safely test in production

Gremlin is designed with redundant failsafes that restore your system to a healthy state at the first sign of trouble.
  • Halt all and roll back experiments with a single click
  • Trigger rollbacks based on your monitoring
  • Status Checks prevent experiments from running when systems are unstable
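
The failsafes above can be pictured with a small sketch: check health before starting, watch a metric during the run, and always roll back. This is a simplified model rather than Gremlin's API; `run_guarded`, its callbacks, and the error-rate threshold are all hypothetical names chosen for illustration.

```python
from typing import Callable, Iterable


def run_guarded(
    start_attack: Callable[[], None],
    rollback: Callable[[], None],
    readings: Iterable[float],
    threshold: float,
) -> str:
    """Run an attack while watching a metric; roll back on the first breach."""
    stream = iter(readings)
    # Status check: refuse to start if the system is already unhealthy
    # or if there is no monitoring data at all.
    baseline = next(stream, None)
    if baseline is None or baseline > threshold:
        return "skipped"
    start_attack()
    try:
        for value in stream:
            if value > threshold:
                return "halted"  # monitoring tripped the failsafe
        return "completed"
    finally:
        # Restore the system whether the run finished or was halted.
        rollback()
```

Calling `run_guarded` with a stream of error rates such as `[0.01, 0.02, 0.20]` and a threshold of `0.05` would start the attack, halt at the third reading, and invoke `rollback` once.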

Secure from the ground up

Gremlin is SOC 2 compliant and follows industry-standard security practices.

  • Least Permissions
    Gremlin runs on default Linux permissions and doesn’t require root access
  • Ready for Production
    Multi-factor authentication, secure Single Sign-On, and Role-Based Access Control (RBAC)
  • Audit Trails
    Every action on the platform is tracked for compliance
  • 3rd Party Testing
    Gremlin undergoes regular security audits by a 3rd party

Simple to use

Get up and running in 3 lines of code. Manage Gremlin from our intuitive UI or the command line.
echo "deb non-free" | sudo tee /etc/apt/sources.list.d/gremlin.list
sudo apt-key adv --keyserver key C81FC2F43A48B25808F9583DBFF170F324D41134 9CDB294B29A5B1E2E00C24C022E8EF3461A50EF6
sudo apt-get update && sudo apt-get install -y gremlin gremlin

Chaos Engineering Scenarios

Validate your systems can respond to common failures.

Get started with outage templates

Run pre-configured chaos experiment scenarios based on real-world outages.

Validate autoscaling rules

Increases CPU utilization to test that your autoscaling is properly configured.

Segment AWS autoscale outage

Read the report

Prepare for host failures

Shuts down an increasing percentage of your hosts so you can prepare for inevitable host failure.

Google Compute Engine Persistent Disk issue in europe-west1-b

Read the report

Handle the unreliable network

Adds 2 seconds of latency to a growing number of hosts so you can validate that clients continue to respond without issue.

GitHub Oct 21 Incident Analysis

Read the report

Be resilient to unavailable dependencies

Drops increasing amounts of traffic across a service to ensure your system can still function.

S3 Outage 2017

Read the report

Prepare for region evacuation and disaster recovery

Blackholes traffic to a region so you can demonstrate disaster preparedness.

Netflix project 2013

Read the report

Withstand DNS outages

Blocks internal or external DNS traffic so you can identify single points of failure.

DynDNS Outage 2016

Read the report

Prove you can withstand common failures

Simulate real-world scenarios that can impact performance, uptime, and customer experience. Run pre-built scenarios based on actual outages and be sure your system is resilient to common cloud failures.
  • Verify that your autoscaling works
  • Prepare for host failure
  • Handle a slow, unreliable dependency
  • Perform zone and region evacuations
  • Validate your capacity plan

Build and share your own Scenarios

Configure scenarios based on common outages.

  • Chain attacks together
  • Scale the impact magnitude
  • Increase the blast radius

Safely scale the impact of your experiments

Scenarios let you divide your attacks into incremental steps, mitigating the risk of complex experiments.

Dial up the blast radius over time

Increase the magnitude
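
One way to picture incremental steps is a list in which both the blast radius and the magnitude grow from step to step. The sketch below is illustrative only and does not reflect Gremlin's Scenario format; `Step`, `build_scenario`, and the latency-based magnitude are assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Step:
    blast_radius_percent: int  # share of hosts targeted in this step
    latency_ms: int            # magnitude of the attack in this step


def build_scenario(steps: int, max_radius_percent: int, max_latency_ms: int) -> List[Step]:
    """Split an experiment into evenly growing steps that end at the maximums."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return [
        Step(
            blast_radius_percent=max_radius_percent * i // steps,
            latency_ms=max_latency_ms * i // steps,
        )
        for i in range(1, steps + 1)
    ]
```

For example, `build_scenario(3, 75, 3000)` yields three steps that target 25%, 50%, and 75% of hosts with 1000, 2000, and 3000 ms of latency, so each step can be observed before the next one widens the impact.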

Hypothesize and observe

Record your hypothesis, observe the experiment, and record the results so you can take action and improve the reliability of your system.

Track, share, and schedule experiments

Follow how your experiments perform over time to prevent the drift into failure. Status Checks prevent scheduled experiments from running when the system is in an unsteady state.

Chaos Engineering on Kubernetes

Gain confidence in the reliability of your Kubernetes clusters and train your team.

Choose objects to target

1. Choose a cluster
2. Choose a namespace

Be confident in the reliability of your Kubernetes clusters

  • Filter and control access by cluster and namespace to easily find and harden specific Kubernetes objects
  • Prevent noisy Pods from bringing down your application
  • Ensure you can withstand common Kubernetes failure modes including CPU throttling, DNS issues, and Blackholes
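
Filtering by cluster and namespace before choosing targets could look roughly like the following. It models Pods as plain records and makes no calls to Kubernetes or Gremlin; `Pod`, `select_targets`, and `blast_radius` are hypothetical names used only for this sketch.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Pod:
    cluster: str
    namespace: str
    name: str


def select_targets(pods: Sequence[Pod], cluster: str, namespace: str, limit: int) -> List[Pod]:
    """Return up to `limit` Pods from one cluster and namespace, in name order."""
    matching = sorted(
        (p for p in pods if p.cluster == cluster and p.namespace == namespace),
        key=lambda p: p.name,
    )
    return matching[:limit]


def blast_radius(selected: Sequence[Pod], pods: Sequence[Pod]) -> str:
    """Summarize how much of the inventory an experiment would touch."""
    return f"{len(selected)} of {len(pods)}"
```

Capping the selection with `limit` keeps the first experiment small; the cap can be raised once the smaller run behaves as expected.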

Confidently operate Kubernetes in production and prevent downtime

  • Validate your self-healing and orchestration
  • Be sure your app autoscales as expected
  • Find out what happens when you unexpectedly lose Pods - are your customers negatively impacted?

Develop quickly and safely using Kubernetes

  • Verify your Kubernetes migration is regression free
  • Identify critical bugs lurking within your clusters before they cause an outage
  • Share what you learn with the rest of your organization