The most comprehensive Chaos Engineering platform

  • Reliability across your environment
    Test and improve the reliability of distributed systems. Orchestrate chaos experiments across your environment, including cloud platforms, bare metal, containers, and Kubernetes clusters.
  • Track reliability improvements
    Keep track of various experiments run across your environment to help you prioritize which services need attention, and where vulnerabilities may lie.
  • Win customer trust
    Keep customers at the heart of your engineering practices. Test across every variable to uncover and address potential failure points before they disrupt customer experiences and damage your brand.
  • Avoid costly downtime
    Minimize your risk of system failure by proactively testing for weaknesses and addressing them before they become costly, public outages.
  • Reduce your MTTD and MTTR
    Get comfortable with various failure scenarios so that when the unlikely happens (and it will), you're prepared to respond quickly and efficiently.

Confidently run Chaos Engineering experiments

Test system reliability by deliberately injecting failure into services, hosts, or containers with a Gremlin attack. Use the attack library to see how systems respond to a variety of common failure conditions. Scale the blast radius of the attack once you're confident in system stability, and easily halt attacks should issues arise.

Teams who frequently run Chaos Engineering experiments (weekly or monthly) have >99.9% availability. Keep your availability high and your incident count low by setting attacks to run on an automated schedule.
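To put the 99.9% figure in perspective, the sketch below converts an availability target into a downtime budget. The function name and the 30-day window are illustrative assumptions, not part of Gremlin's product.

```python
def downtime_budget_minutes(availability_percent: float, window_days: int = 30) -> float:
    """Return the minutes of downtime allowed in the window at the given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    total_minutes = window_days * 24 * 60
    return total_minutes * (100 - availability_percent) / 100


if __name__ == "__main__":
    # Compare a few common availability targets over a 30-day window.
    for target in (99.0, 99.9, 99.99):
        print(f"{target}% over 30 days: {downtime_budget_minutes(target):.1f} minutes")
```

At 99.9%, a 30-day window leaves roughly 43 minutes of total downtime, which is why catching weaknesses before they cause an outage matters.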

Validate resilience to common failures

Scenarios let you run multiple attacks in sequence and create more complex Chaos Engineering experiments. Create your own, or use Gremlin's pre-configured library of Recommended Scenarios to simulate real-world outages that can impact performance, uptime, and customer experience. Share scenarios across teams to create a stronger culture of reliability.
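As a rough illustration of running attacks in sequence, the sketch below models a scenario as an ordered list of steps with a growing blast radius and stops at the first step the system does not tolerate. The `Step` class, `run_scenario`, and the `run_step` callback are hypothetical names for this example and do not reflect Gremlin's actual schema or API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Step:
    """One attack in a scenario (hypothetical model, not Gremlin's schema)."""
    attack: str          # e.g. "latency" or "cpu"
    target_percent: int  # share of hosts affected (the blast radius)


def run_scenario(steps: List[Step], run_step: Callable[[Step], bool]) -> List[Step]:
    """Run steps in order; stop at the first step the system fails to tolerate.

    Returns the steps that completed successfully.
    """
    completed = []
    for step in steps:
        if not run_step(step):
            break  # halt: do not widen the blast radius further
        completed.append(step)
    return completed


if __name__ == "__main__":
    scenario = [Step("latency", 10), Step("latency", 50), Step("latency", 100)]
    # Stand-in for a real experiment: pretend the system tolerates up to 50%.
    survived = run_scenario(scenario, lambda s: s.target_percent <= 50)
    print([s.target_percent for s in survived])
```

Ordering steps from the smallest blast radius to the largest means a failure is found at the lowest impact level that triggers it.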

Automatic services detection and tracking

Gremlin auto-detects all services in an environment, giving you complete systems visibility and helping you uncover any unknowns. Isolate, target, and attack distributed services no matter where they're running. Track your reliability practice with a full history of all attacks run on a service, and quickly identify and prioritize services that need attention.

Prevent unintended failures

Prevent experiments from running when systems are unstable. Status Checks automatically halt and roll back experiments if systems don't meet expected criteria. Integrate with your preferred monitoring and observability tool to validate conditions and trigger rollbacks if any issues arise.
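The gating behavior described above can be sketched in a few lines. This is a minimal model under stated assumptions, not Gremlin's implementation: `is_healthy`, `start`, and `stop` are hypothetical callbacks, where `is_healthy` would typically query a monitoring or observability tool, and a real check would wait between polls.

```python
from typing import Callable


def run_with_status_check(
    is_healthy: Callable[[], bool],
    start: Callable[[], None],
    stop: Callable[[], None],
    checks_during_run: int = 3,
) -> str:
    """Gate an experiment on a health check and halt it if the check fails mid-run."""
    if not is_healthy():
        return "skipped"  # system already unstable: never start the experiment
    start()
    for _ in range(checks_during_run):
        if not is_healthy():
            stop()  # remove the injected failure immediately
            return "halted"
    stop()  # normal cleanup once all checks have passed
    return "completed"


if __name__ == "__main__":
    # Stand-in health signal: healthy before the run, degraded on the second check.
    readings = iter([True, True, False])
    outcome = run_with_status_check(
        is_healthy=lambda: next(readings),
        start=lambda: print("experiment started"),
        stop=lambda: print("experiment stopped"),
    )
    print(outcome)
```

Checking health before starting, and not only during the run, is what keeps an experiment from adding load to a system that is already degraded.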

An integral part of your testing framework

Use Gremlin's APIs and webhooks to notify monitoring, incident management, or other DevOps tools of new or ongoing experiments. Automate Chaos Engineering practices by integrating experiments into your CI/CD pipeline.

Supported platforms

Gremlin is a cloud-native platform that runs in any environment. Gremlin supports the major public clouds (AWS, Azure, and GCP) and runs on Linux, Windows, containerized environments like Kubernetes, and yes, bare metal too.

Attack library

Rely on Gremlin's comprehensive attack library to build resilience to common failure conditions.
CPU
Generate load across CPU cores. Ensure your systems can withstand stressful conditions caused by high demand or heavy traffic.
Memory
Consume a specific amount of RAM. Validate resilience to memory leaks or resource-intensive applications.
IO
Create read/write pressure on I/O devices such as hard disks. Test the performance of your systems when connected to high-latency, low-throughput storage.
Disk
Consume a specific amount of space on a storage device. Ensure your systems keep running even with disk-hungry workloads (such as growing log files) and exhausted storage volumes.
Shutdown
Shut down (and optionally reboot) the host operating system. Build resilience to host failures.
Time Travel
Change the system time. Prepare for Daylight Saving Time, clock drift between systems, expiring SSL/TLS certificates, and other time-sensitive events.
Process Killer
Terminate a specific process or set of processes. Prepare for application crashes, out-of-memory (OOM) kills, and similar events.
Blackhole
Drop network traffic based on port, network interface, or hostname. Test your ability to fail over during a complete network outage.
Latency
Inject a delay into outbound network traffic. Validate your system's responsiveness under slow network conditions.
Packet Loss
Drop or corrupt a percentage of outbound network traffic. Ensure you can successfully send and receive data despite poor network conditions.
DNS
Block access to DNS servers. Prepare for DNS outages, test fallback DNS servers, and validate DNS resolver configurations.

Enterprise-grade security and compliance

Gremlin is SOC 2 compliant and follows industry-standard security practices.
  • Least Permissions
    Gremlin runs on default Linux permissions and doesn’t require root access
  • Secure user management
    Multi-factor authentication, Secure Single Sign On, and Role-Based Access Control (RBAC)
  • Audit trails
    Every action on the platform is tracked for compliance
  • Third-party testing
    Gremlin undergoes regular security audits by a third party

Trusted by teams worldwide

Leading SRE teams rely on Gremlin to keep their systems available and their customer experience reliable.
Charter Communications, Dreamworks Animation, Expedia, Grubhub, H‑E‑B, JPMorgan Chase & Co, MailChimp, NAB, Qualtrics, SAS, Shipt, Target, Toyota Connect, Twilio, Walmart, Workiva