Gremlin Chaos Engineering

Build trust in complex systems with Chaos Engineering

Gremlin enables every organization to conduct safe and secure Chaos Engineering experiments. Find reliability risks in any environment—before they impact users.

Trusted by teams worldwide

Industry leaders rely on Gremlin to keep their systems available and their customer experience reliable.

Take the tour

See how easy it is to run Chaos Engineering experiments using Gremlin Fault Injection.

Chaos Engineering makes your systems reliable.
Gremlin makes it safe, easy, and secure.

Gremlin’s Chaos Engineering platform enables SRE and platform teams to improve uptime, validate system reliability, and cultivate a robust reliability culture.

Confidently recreate incidents and outages

  • Test system reliability by thoughtfully injecting failure into services, hosts, containers, and serverless workloads.
  • Use the experiments library to see how systems respond to a variety of common failure conditions.
  • Start small and scale experiments once you're confident in system stability.
  • Easily and automatically halt and roll back attacks should issues arise.
Learn more about Chaos Engineering

Validate systems against any incident scenario

  • Replicate real-world incidents by running multiple Chaos Engineering experiments in sequence.
  • Create your own or use Gremlin's pre-configured library of recommended scenarios to simulate real-world outages that can impact performance, uptime, and customer experience.
  • Share scenarios across teams to create a stronger culture of reliability.
  • Run scenarios on a schedule to keep your availability high and your incident count low.

Prepare and run effective GameDays

GameDays are organized team events to proactively improve reliability using Chaos Engineering principles. Gremlin makes it easier than ever to prepare, execute, and learn from them.

  • Use GameDay Manager to reduce your internal prep and run time, ensuring a successful event.
  • Automatically analyze and store results so you can review what success or failure looked like. Turn that data into action to improve your system.
  • Integrate with Jira to ensure action items are captured and tracked with your team.
Learn more about GameDays
Use Cases

Superpowers for SRE teams

Gremlin helps SRE teams get ahead of reliability issues and avoid late-night pages.

Meet uptime and availability SLOs
Identify hidden reliability risks
Validate and tune monitors
Mitigate dependency failure
Eliminate revenue-impacting outages
Ensure reliable migrations & launches
Why Gremlin?

The Gremlin Advantage

Only Gremlin has the depth of experience to implement Chaos Engineering at scale in the world’s most demanding environments.

Used by 100+ of the Fortune 2000, including 5 of the 7 biggest US banks
Hundreds of thousands of hosts safely and securely run Gremlin
Over one million Chaos Engineering experiments and reliability tests run
Fault Injection Library

Stress your systems, any way you choose.

Rely on Gremlin’s comprehensive Fault Injection library to build resilience to common failure conditions. Run experiments in virtually any environment.

CPU

Generate load across CPU cores. Ensure your systems can withstand stressful conditions caused by high demand or heavy traffic.
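As a rough illustration of what "load across CPU cores" means in practice, the sketch below starts one busy-loop worker process per requested core. It is not Gremlin's implementation; the function names `burn` and `generate_cpu_load` are hypothetical and exist only for this example.

```python
import multiprocessing as mp
import time


def burn(seconds: float) -> int:
    """Spin in a tight loop for roughly `seconds`; return the iteration count."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1
    return iterations


def generate_cpu_load(cores: int, seconds: float) -> list:
    """Run one busy-loop worker process per requested core and wait for all of them."""
    with mp.Pool(processes=cores) as pool:
        return pool.map(burn, [seconds] * cores)


if __name__ == "__main__":
    # Hypothetical values: keep two cores busy for half a second.
    counts = generate_cpu_load(cores=2, seconds=0.5)
    print(f"{len(counts)} workers finished")
```

Separate processes are used rather than threads because CPython threads share one interpreter lock and would not keep several cores busy at once.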

Process Killer

Terminate a specific process or set of processes. Prepare for application crashes, out-of-memory (OOM) conditions, and similar events.

Memory

Consume a specific amount of RAM. Validate resilience to memory leaks or resource-intensive applications.

Blackhole

Drop network traffic based on port, network interface, or hostname. Test your ability to fail over during a complete network outage.

IO

Create read/write pressure on I/O devices such as hard disks. Test the performance of your systems when connected to high-latency, low-throughput storage.

Latency

Inject a delay into outbound network traffic. Validate your system's responsiveness under slow network conditions.

Disk

Consume a specific amount of space on a storage device. Ensure your systems keep running when disk space runs low, for example because of growing log files or exhausted storage volumes.

Packet Loss

Drop or corrupt a percentage of outbound network traffic. Ensure you can successfully send and receive data despite poor network conditions.

Shutdown

Shut down (and optionally reboot) the host operating system. Build resilience to host failures.

DNS

Block access to DNS servers. Prepare for DNS outages, test fallback DNS servers, and validate DNS resolver configurations.

Time Travel

Change the system time. Prepare for Daylight Saving Time, clock drift between systems, expiring SSL/TLS certificates, and other time-sensitive events.

Certificate Expiry

Retrieve the certificate chain and validate that no certificates expire in the next 30 days.
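The 30-day check can be sketched in a few lines of Python. This is a minimal illustration, not Gremlin's implementation: it assumes the `notAfter` strings of the chain have already been retrieved (in the text format Python's `ssl.getpeercert()` returns), and the helper names `parse_not_after` and `expiring_soon` are hypothetical.

```python
import ssl
from datetime import datetime, timedelta, timezone


def parse_not_after(not_after: str) -> datetime:
    """Convert a certificate notAfter string to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)


def expiring_soon(not_after_values, now, window_days=30):
    """Return the notAfter strings that expire within `window_days` of `now`,
    including any that have already expired."""
    cutoff = now + timedelta(days=window_days)
    return [value for value in not_after_values if parse_not_after(value) <= cutoff]


if __name__ == "__main__":
    # Made-up expiry dates for a two-certificate chain.
    now = datetime(2024, 1, 1, tzinfo=timezone.utc)
    chain = ["Jan 20 00:00:00 2024 GMT", "Jun  1 00:00:00 2025 GMT"]
    print(expiring_soon(chain, now))  # ['Jan 20 00:00:00 2024 GMT']
```

Passing `now` in explicitly keeps the check deterministic, so the same function can be exercised against a shifted clock.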

GPU

Consume computing power to simulate a highly intensive workload and push the GPU to its limit. Stress test your AI, LLM, video encoding, and other GPU workloads.

Combine experiments for hundreds of pre-built and custom scenarios.
Learn more: Gremlin’s Guide to Enterprise Fault Injection
Supported Platforms

Gremlin works where you do

Gremlin is a cloud-native platform that runs in any environment. Gremlin supports all public cloud environments (including AWS, Azure, and GCP) and runs on Linux, Windows, containerized environments like Kubernetes, and, yes, bare metal, too.

See It For Yourself

Ready to tame the chaos in your systems?

Gremlin empowers you to proactively root out failure before it causes downtime.
See how you can leverage chaos to build resilient systems by requesting a demo of Gremlin.