Gremlin Chaos Engineering

Build trust in complex systems with Chaos Engineering

Gremlin enables every organization to conduct safe and secure Chaos Engineering experiments. Find reliability risks in any environment—before they impact users.


Trusted by teams worldwide

Industry leaders rely on Gremlin to keep their systems available and their customer experience reliable.

Take the tour

See how easy it is to run Chaos Engineering experiments using Gremlin Fault Injection.

Chaos Engineering makes your systems reliable. Gremlin makes it safe, easy, and secure.

Gremlin’s Chaos Engineering platform enables SRE and platform teams to improve uptime, validate system reliability, and cultivate a robust reliability culture.

Confidently recreate incidents and outages

Test system reliability by thoughtfully injecting failure into services, hosts, containers, and serverless workloads.
Use the experiments library to see how systems respond to a variety of common failure conditions.
Start small and scale experiments once you're confident in system stability.
Automatically halt and roll back experiments if issues arise.
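
The start-small, scale-up, halt-on-trouble workflow in the list above can be sketched roughly as follows. This is an illustrative Python outline, not Gremlin's API: `run_staged_experiment`, `inject`, `rollback`, and `is_healthy` are hypothetical names, and the stage values stand in for whatever blast radius a team chooses.

```python
from typing import Callable, Iterable


def run_staged_experiment(
    stages: Iterable[int],
    inject: Callable[[int], None],
    rollback: Callable[[], None],
    is_healthy: Callable[[], bool],
) -> int:
    """Widen the blast radius stage by stage; halt and roll back on trouble.

    Returns the largest stage (percent of targets) that completed while
    the system stayed healthy, or 0 if the first stage already failed.
    """
    largest_ok = 0
    try:
        for percent in stages:
            inject(percent)        # hypothetical hook: apply the fault
            if not is_healthy():   # hypothetical hook: check monitors or SLOs
                break              # halt: do not widen any further
            largest_ok = percent
    finally:
        rollback()                 # always undo the fault, even on errors
    return largest_ok
```

Calling it with stages such as `[5, 25, 100]` would stop at the first stage where the health check fails and still run the rollback hook exactly once.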

Learn more about Chaos Engineering

Validate systems against any incident scenario

Replicate real-world incidents by running multiple Chaos Engineering experiments in sequence.
Create your own or use Gremlin's pre-configured library of recommended scenarios to simulate real-world outages that can impact performance, uptime, and customer experience.
Share scenarios across teams to create a stronger culture of reliability.
Run scenarios on a schedule to keep your availability high and your incident count low.

Prepare and run effective GameDays

GameDays are organized team events to proactively improve reliability using Chaos Engineering principles. Gremlin makes it easier than ever to prepare, execute, and learn from them.

Use GameDay Manager to reduce your internal prep and run time, ensuring a successful event.
Automatically analyze and store results so you can review what success or failure looked like. Turn that data into action to improve your system.
Integrate with Jira to ensure action items are captured and tracked with your team.

Learn more about GameDays

Use Cases

Superpowers for SRE teams

Gremlin helps SRE teams get ahead of reliability issues and avoid late-night pages.

Meet uptime and availability SLOs

Identify hidden reliability risks

Validate and tune monitors

Mitigate dependency failure

Eliminate revenue-impacting outages

Ensure reliable migrations & launches

Why Gremlin?

The Gremlin Advantage

Only Gremlin has the depth of experience to implement Chaos Engineering at scale in the world’s most demanding environments.

Used by 100+ of the Fortune 2000, including 5 of the 7 biggest US banks

Hundreds of thousands of hosts safely and securely run Gremlin

Over one million Chaos Engineering experiments and reliability tests run

Fault Injection Library

Stress your systems, any way you choose.

Rely on Gremlin’s comprehensive Fault Injection library to build resilience to common failure conditions. Run experiments in virtually any environment.

CPU

Generate load across CPU cores. Ensure your systems can withstand stressful conditions caused by high demand or heavy traffic.

Process Killer

Terminate a specific process or set of processes. Prepare for application crashes, Out Of Memory (OOM), and similar events.

Memory

Consume a specific amount of RAM. Validate resilience to memory leaks or resource-intensive applications.

Blackhole

Drop network traffic based on port, network interface, or hostname. Test your ability to fail over during a complete network outage.

I/O

Create read/write pressure on I/O devices such as hard disks. Test the performance of your systems when connected to high-latency, low-throughput storage.

Latency

Inject a delay into outbound network traffic. Validate your system's responsiveness under slow network conditions.

Disk

Consume a specific amount of space on a storage device. Ensure your systems keep running when disk space runs low, for example because of growing log files or exhausted storage volumes.

Packet Loss

Drop or corrupt a percentage of outbound network traffic. Ensure you can successfully send and receive data despite poor network conditions.
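
As a rough illustration of what the Latency and Packet Loss experiments exercise, the sketch below wraps a single call with an injected delay and a chance of being dropped. It works at the application level and is only an assumption-laden stand-in: the descriptions above refer to outbound network traffic, and `with_network_faults` and `InjectedPacketLoss` are hypothetical names, not part of Gremlin.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class InjectedPacketLoss(Exception):
    """Raised when the wrapper simulates a dropped request."""


def with_network_faults(
    call: Callable[[], T],
    delay_seconds: float = 0.0,
    loss_rate: float = 0.0,
    rng: Optional[random.Random] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call` after an injected delay, or drop it with probability `loss_rate`."""
    rng = rng or random.Random()
    if rng.random() < loss_rate:      # simulate a lost request
        raise InjectedPacketLoss("simulated dropped request")
    sleep(delay_seconds)              # simulate a slow network
    return call()
```

A caller's retry and timeout logic can then be exercised by passing a non-zero `loss_rate` or `delay_seconds` and observing whether the request still succeeds within its budget.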

Shutdown

Shut down (and optionally reboot) the host operating system. Build resilience to host failures.

DNS

Block access to DNS servers. Prepare for DNS outages, test fallback DNS servers, and validate DNS resolver configurations.

Time Travel

Change the system time. Prepare for Daylight Saving Time, clock drift between systems, expiring SSL/TLS certificates, and other time-sensitive events.

Certificate Expiry

Retrieve the certificate chain and validate that no certificates expire in the next 30 days.
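
The 30-day expiry check described above can be approximated with Python's standard library. This is a minimal sketch, not how Gremlin implements the test: `expires_within` and `leaf_cert_expires_soon` are hypothetical helper names, and the sketch inspects only the server's leaf certificate rather than the full chain.

```python
import socket
import ssl
import time
from typing import Optional


def expires_within(not_after: str, days: int = 30, now: Optional[float] = None) -> bool:
    """True if a certificate's notAfter timestamp falls within `days` from now.

    `not_after` uses the format returned by SSLSocket.getpeercert(),
    for example "Jan  5 09:34:43 2018 GMT". Already-expired certificates
    also return True.
    """
    now = time.time() if now is None else now
    expiry = ssl.cert_time_to_seconds(not_after)
    return expiry - now < days * 86400


def leaf_cert_expires_soon(host: str, port: int = 443, days: int = 30) -> bool:
    """Fetch the server's leaf certificate and check its expiry window."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    return expires_within(cert["notAfter"], days)
```

Keeping the date comparison in its own function makes it possible to test the threshold logic without opening a network connection.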

GPU

Consume computing power to simulate a highly intensive workload and push the GPU to its limit. Stress test your AI, LLM, video encoding, and other GPU workloads.

Combine experiments for hundreds of pre-built and custom scenarios.

Learn more: Gremlin’s Guide to Enterprise Fault Injection

Supported Platforms

Gremlin works where you do

Gremlin is a cloud-native platform that runs in any environment. Gremlin supports all public cloud environments (including AWS, Azure, and GCP) and runs on Linux, Windows, containerized environments like Kubernetes, and even on-prem with Gremlin Private Edition.

Featured Resources

Learn more about Chaos Engineering

How to implement Chaos Engineering

Andre Newman

June 28, 2023

Originally published on December 15, 2020. Teams that use Chaos Engineering report having greater than 99.99% site reliability. However, only 34% of companies using Chaos Engineering run their experiments in production. So how can you…

If you're adopting Kubernetes, you need Chaos Engineering

Andre Newman

January 31, 2022

When Ticketmaster started their Kubernetes migration, they had to address a huge problem: whenever ticket sales opened for a popular event, as many as 150 million visitors flooded their website, effectively causing distributed denial of…

Chaos Engineering tools: myth vs. fact

Gavin Cahill

April 4, 2023

With so many Chaos Engineering tools available, it’s no surprise that SRE and platform leaders are doing their homework when choosing a platform to help them build and scale their Chaos Engineering programs. But like anything else you can research on the internet, there’s a lot of noise and hype that you need to wade through.

See It For Yourself

Ready to tame the chaos in your systems?

Gremlin empowers you to proactively root out failure before it causes downtime.
See how you can leverage chaos to build resilient systems by requesting a demo of Gremlin.

