Build trust in complex systems with Chaos Engineering
Gremlin enables every organization to conduct safe and secure Chaos Engineering experiments. Find reliability risks in any environment—before they impact users.
Trusted by teams worldwide
Industry leaders rely on Gremlin to keep their systems available and their customer experience reliable.
Take the tour
See how easy it is to run Chaos Engineering experiments using Gremlin Fault Injection.
Chaos Engineering makes your systems reliable.
Gremlin makes it safe, easy, and secure.
Gremlin’s Chaos Engineering platform enables SRE and platform teams to improve uptime, validate system reliability, and cultivate a robust reliability culture.
Confidently recreate incidents and outages
- Test system reliability by thoughtfully injecting failure into services, hosts, containers, and serverless workloads.
- Use the experiments library to see how systems respond to a variety of common failure conditions.
- Start small and scale experiments once you're confident in system stability.
- Easily and automatically halt and roll back experiments should issues arise.
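The "start small, scale up, halt on trouble" approach in the list above can be sketched in a few lines of Python. This is an illustrative sketch only, not Gremlin's implementation or API: `run_escalating_experiment`, `inject`, `rollback`, and `is_healthy` are hypothetical names, and the "system" in the demo is a plain dictionary standing in for a real service.

```python
from typing import Callable, List


def run_escalating_experiment(
    magnitudes: List[int],
    inject: Callable[[int], None],
    rollback: Callable[[], None],
    is_healthy: Callable[[], bool],
) -> List[int]:
    """Apply each magnitude in order and halt at the first failed health check.

    Returns the magnitudes the system tolerated while staying healthy.
    All callables are hypothetical hooks supplied by the caller.
    """
    completed: List[int] = []
    for magnitude in magnitudes:
        inject(magnitude)
        try:
            healthy = is_healthy()
        finally:
            # Always undo the fault, even if the health check itself raises.
            rollback()
        if not healthy:
            break  # halt: do not escalate any further
        completed.append(magnitude)
    return completed


if __name__ == "__main__":
    # Simulated system: considered healthy while injected CPU load is <= 50%.
    state = {"cpu_load": 0}
    tolerated = run_escalating_experiment(
        magnitudes=[10, 25, 50, 75],
        inject=lambda pct: state.update(cpu_load=pct),
        rollback=lambda: state.update(cpu_load=0),
        is_healthy=lambda: state["cpu_load"] <= 50,
    )
    print(tolerated)  # prints [10, 25, 50]
```

The rollback runs in a `finally` block so that a failing or crashing health check never leaves the injected fault in place.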
Validate systems against any incident scenario
- Replicate real-world incidents by running multiple Chaos Engineering experiments in sequence.
- Create your own or use Gremlin's pre-configured library of recommended scenarios to simulate real-world outages that can impact performance, uptime, and customer experience.
- Share scenarios across teams to create a stronger culture of reliability.
- Run scenarios on a schedule to keep your availability high and your incident count low.
Prepare and run effective GameDays
GameDays are organized team events to proactively improve reliability using Chaos Engineering principles. Gremlin makes it easier than ever to prepare, execute, and learn from them.
- Use GameDay Manager to reduce your internal prep and run time, ensuring a successful event.
- Automatically analyze and store results so you can review what success or failure looked like. Turn that data into action to improve your system.
- Integrate with Jira to ensure action items are captured and tracked with your team.
Superpowers for SRE teams
Gremlin helps SRE teams get ahead of reliability issues and avoid late-night pages.
The Gremlin Advantage
Only Gremlin has the depth of experience to implement Chaos Engineering at scale in the world’s most demanding environments.
including 5 of the 7 biggest US banks
Chaos Engineering experiments and reliability tests run
Stress your systems, any way you choose.
Rely on Gremlin’s comprehensive Fault Injection library to build resilience to common failure conditions. Run experiments in virtually any environment.
Generate load across CPU cores. Ensure your systems can withstand stressful conditions caused by high demand or heavy traffic.
Terminate a specific process or set of processes. Prepare for application crashes, out-of-memory (OOM) conditions, and similar events.
Consume a specific amount of RAM. Validate resilience to memory leaks or resource-intensive applications.
Drop network traffic based on port, network interface, or hostname. Test your ability to fail over during a complete network outage.
Create read/write pressure on I/O devices such as hard disks. Test the performance of your systems when connected to high-latency, low-throughput storage.
Inject a delay into outbound network traffic. Validate your system's responsiveness under slow network conditions.
Consume a specific amount of space on a storage device. Ensure your systems keep running when disk space runs low, such as when log files grow unchecked or storage volumes fill up.
Drop or corrupt a percentage of outbound network traffic. Ensure you can successfully send and receive data despite poor network conditions.
Shut down (and optionally reboot) the host operating system. Build resilience to host failures.
Block access to DNS servers. Prepare for DNS outages, test fallback DNS servers, and validate DNS resolver configurations.
Change the system time. Prepare for Daylight Saving Time, clock drift between systems, expiring SSL/TLS certificates, and other time-sensitive events.
Retrieve the certificate chain and validate that no certificates expire in the next 30 days.
Consume computing power to simulate a highly intensive workload and push the GPU to its limit. Stress test your AI, LLM, video encoding, and other GPU workloads.
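To make the certificate expiry check described above more concrete, here is a minimal Python sketch of the date comparison behind a "no certificates expire in the next 30 days" rule. It is an illustration under stated assumptions, not Gremlin's implementation: `expires_within` and `fetch_leaf_not_after` are hypothetical helper names, and the sketch inspects only the leaf certificate a server presents rather than the full chain.

```python
import socket
import ssl
import time
from typing import Optional

SECONDS_PER_DAY = 86_400


def expires_within(not_after: str, days: int = 30, now: Optional[float] = None) -> bool:
    """Return True if the certificate expires within `days` of `now`.

    `not_after` uses the format found in ssl.SSLSocket.getpeercert(),
    for example "Jan  5 09:34:43 2018 GMT". An already expired
    certificate also returns True.
    """
    if now is None:
        now = time.time()
    expiry = ssl.cert_time_to_seconds(not_after)
    return expiry - now < days * SECONDS_PER_DAY


def fetch_leaf_not_after(hostname: str, port: int = 443) -> str:
    """Fetch the notAfter field of the leaf certificate (hypothetical helper).

    Requires network access; intermediate certificates are not inspected.
    """
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()["notAfter"]
```

A caller could combine the two helpers as `expires_within(fetch_leaf_not_after("example.com"))`, where the hostname is a placeholder. Passing `now` explicitly keeps the date logic testable without network access or waiting for a real certificate to approach expiry.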
Gremlin works where you do
Gremlin is a cloud-native platform that runs in any environment. Gremlin supports all public cloud environments (including AWS, Azure, and GCP) and runs on Linux, Windows, containerized environments like Kubernetes, and, yes, bare metal, too.
Learn more about Chaos Engineering
How to implement Chaos Engineering
Originally published on December 15, 2020. Teams that use Chaos Engineering report greater than 99.99% site reliability. However, only 34% of companies using Chaos Engineering run their experiments in production. So how can you…
If you're adopting Kubernetes, you need Chaos Engineering
When Ticketmaster started their Kubernetes migration, they had to address a huge problem: whenever ticket sales opened for a popular event, as many as 150 million visitors flooded their website, effectively causing distributed denial of…
Chaos Engineering tools: myth vs. fact
With so many Chaos Engineering tools available, it’s no surprise that SRE and platform leaders are doing their homework when choosing a platform to help them build and scale their Chaos Engineering programs. But like anything else you can research on the internet, there’s a lot of noise and hype that you need to wade through.
Ready to tame the chaos in your systems?
Gremlin empowers you to proactively root out failure before it causes downtime.
See how you can leverage chaos to build resilient systems by requesting a demo of Gremlin.