Gremlin Chaos Engineering

Build trust in complex systems with Chaos Engineering

Gremlin enables every organization to conduct safe and secure Chaos Engineering experiments. Find reliability risks in any environment—before they impact users.

Trusted by teams worldwide

Industry leaders rely on Gremlin to keep their systems available and their customer experience reliable.
Charter Communications, Grubhub, NAB, SAS, Shipt, Target, Twilio, Walmart, Workiva

Chaos Engineering makes your systems reliable. Gremlin makes it safe, easy, and secure.

Gremlin’s Chaos Engineering platform enables SRE and platform teams to improve uptime, validate system reliability, and cultivate a robust reliability culture.

Confidently recreate incidents and outages

  • Test system reliability by thoughtfully injecting failure into services, hosts, containers, and serverless workloads.
  • Use the experiments library to see how systems respond to a variety of common failure conditions.
  • Start small and scale experiments once you're confident in system stability.
  • Easily and automatically halt and roll back attacks should issues arise.
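
The "start small, scale up, halt and roll back" approach in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Gremlin's API: run_ramped_experiment, RampResult, and the inject, rollback, and is_healthy callbacks are hypothetical names, and the "service" in the usage example is a plain dictionary standing in for a real system.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional


@dataclass
class RampResult:
    """Outcome of a ramped experiment (hypothetical type for this sketch)."""
    completed: List[int] = field(default_factory=list)  # magnitudes that passed
    halted_at: Optional[int] = None  # magnitude at which the health check failed


def run_ramped_experiment(
    magnitudes: Iterable[int],
    inject: Callable[[int], None],
    rollback: Callable[[], None],
    is_healthy: Callable[[], bool],
) -> RampResult:
    """Inject a fault at increasing magnitudes, halting at the first failure."""
    result = RampResult()
    for magnitude in magnitudes:
        inject(magnitude)
        try:
            healthy = is_healthy()
        finally:
            # Always undo the fault, even if the health check itself raises.
            rollback()
        if not healthy:
            result.halted_at = magnitude
            break  # do not escalate further once the system shows distress
        result.completed.append(magnitude)
    return result


if __name__ == "__main__":
    # Simulated service: it counts as healthy while CPU load stays under 80%.
    state = {"cpu_load": 0}
    outcome = run_ramped_experiment(
        magnitudes=[25, 50, 75, 100],
        inject=lambda pct: state.update(cpu_load=pct),
        rollback=lambda: state.update(cpu_load=0),
        is_healthy=lambda: state["cpu_load"] < 80,
    )
    print(outcome)
```

Running the rollback in a finally clause is the key design choice here: the fault is removed after every step, whether the step passed, failed, or the health check crashed.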

Validate systems against any incident scenario

  • Replicate real-world incidents by running multiple Chaos Engineering experiments in sequence.
  • Create your own or use Gremlin's pre-configured library of recommended Scenarios to simulate real-world outages that can impact performance, uptime, and customer experience.
  • Share scenarios across teams to create a stronger culture of reliability.
  • Run scenarios on a schedule to keep your availability high and your incident count low.

Prepare and run effective GameDays

GameDays are organized team events to proactively improve reliability using Chaos Engineering principles. Gremlin makes it easier than ever to prepare, execute, and learn from them.
  • Use GameDay Manager to reduce your internal prep and run time, ensuring a successful event.
  • Automatically analyze and store results so you can review what success or failure looked like. Turn that data into action to improve your system.
  • Integrate with Jira to ensure action items are captured and tracked with your team.
Use Cases

Superpowers for SRE teams

Gremlin helps SRE teams get ahead of reliability issues and avoid late-night pages.

Meet uptime and availability SLOs

Identify hidden reliability risks

Validate and tune monitors

Mitigate dependency failure

Eliminate revenue-impacting outages

Ensure reliable mitigations & launches

Why Gremlin?

The Gremlin Advantage

Only Gremlin has the depth of experience to implement Chaos Engineering at scale in the world’s most demanding environments.
  • Used by 100+ of the Fortune 2000, including 5 of the 7 biggest US banks
  • Hundreds of thousands of hosts safely and securely run Gremlin
  • Over one million chaos engineering experiments and reliability tests run
Fault Injection Library

Stress your systems, any way you choose.

Rely on Gremlin’s comprehensive fault injection library to build resilience to common failure conditions. Run experiments in virtually any environment.
CPU
Generate load across CPU cores. Ensure your systems can withstand stressful conditions caused by high demand or heavy traffic.

Process Killer
Terminate a specific process or set of processes. Prepare for application crashes, Out Of Memory (OOM), and similar events.

Memory
Consume a specific amount of RAM. Validate resilience to memory leaks or resource-intensive applications.

Blackhole
Drop network traffic based on port, network interface, or hostname. Test your ability to fail over during a complete network outage.

IO
Create read/write pressure on I/O devices such as hard disks. Test the performance of your systems when connected to high-latency, low-throughput storage.

Latency
Inject a delay into outbound network traffic. Validate your system's responsiveness under slow network conditions.

Disk
Consume a specific amount of space on a storage device. Ensure your systems keep running when disk space runs low, for example because of growing log files or exhausted storage volumes.

Packet Loss
Drop or corrupt a percentage of outbound network traffic. Ensure you can successfully send and receive data despite poor network conditions.

Shutdown
Shut down (and optionally reboot) the host operating system. Build resilience to host failures.

DNS
Block access to DNS servers. Prepare for DNS outages, test fallback DNS servers, and validate DNS resolver configurations.

Time Travel
Change the system time. Prepare for Daylight Saving Time, clock drift between systems, expiring SSL/TLS certificates, and other time-sensitive events.

Certificate Expiry
Retrieve the certificate chain and validate that no certificates expire in the next 30 days.
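
As a rough illustration of the 30-day check described above, the Python sketch below flags certificates whose expiry falls inside a window. It is not Gremlin's implementation: expiring_certificates and leaf_not_after are hypothetical helper names, and the standard-library call used here returns only the leaf certificate, so checking a full chain would need additional tooling.

```python
import socket
import ssl
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def expiring_certificates(
    not_after_by_name: Dict[str, datetime],
    now: datetime,
    window_days: int = 30,
) -> List[str]:
    """Return names of certificates that expire on or before now + window_days.

    Certificates that have already expired are included as well.
    """
    deadline = now + timedelta(days=window_days)
    return sorted(
        name
        for name, not_after in not_after_by_name.items()
        if not_after <= deadline
    )


def leaf_not_after(host: str, port: int = 443) -> datetime:
    """Fetch the expiry time of a server's leaf certificate (assumption: leaf only).

    The handshake itself fails if the certificate is already invalid, because
    the default context verifies the peer.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    seconds = ssl.cert_time_to_seconds(cert["notAfter"])
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


if __name__ == "__main__":
    # Offline example with made-up expiry dates; no network access required.
    now = datetime(2024, 1, 1, tzinfo=timezone.utc)
    chain = {
        "leaf": now + timedelta(days=10),
        "intermediate": now + timedelta(days=400),
    }
    print(expiring_certificates(chain, now))
```

Keeping the date comparison separate from the network call means the expiry logic can be exercised with fixed dates, while leaf_not_after can be swapped for whatever retrieves the full chain in a given environment.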

Combine experiments for hundreds of pre-built and custom scenarios.
Supported Platforms

Gremlin works where you do