Recreating the conditions that led to past incidents and outages is key to ensuring resilience to those conditions moving forward. Gremlin allows you to evaluate system reliability by safely injecting failures into services, hosts, containers, and serverless workloads and seeing how systems respond.
With a comprehensive library of common failure conditions at your disposal, you can simulate and evaluate the real-world impact of varying stressors. Experimentation can start small (a single host or a fraction of your traffic) and expand as your confidence in your systems improves. Importantly, Gremlin offers fail-safes that automatically stop and roll back experiments based on real-time system health, ensuring that when systems do fail, they aren’t down for a moment longer than necessary.
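To make the fail-safe idea concrete, here is a minimal sketch of a health-gated experiment loop. It is not Gremlin's API or implementation: the `Experiment` class, the `run_with_failsafe` function, the error-rate metric, and the 5% threshold are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Experiment:
    """Hypothetical stand-in for a running fault-injection experiment."""
    name: str
    blast_radius: float  # fraction of hosts or traffic targeted, 0.0 to 1.0
    running: bool = False
    log: List[str] = field(default_factory=list)

    def start(self) -> None:
        self.running = True
        self.log.append("started")

    def halt_and_roll_back(self) -> None:
        # In a real tool this would also remove the injected fault.
        self.running = False
        self.log.append("halted")


def run_with_failsafe(
    experiment: Experiment,
    read_error_rate: Callable[[], float],
    max_error_rate: float,
    checks: int,
) -> bool:
    """Run an experiment, halting as soon as a health check breaches the threshold.

    Returns True if every health check passed, False if the experiment was halted.
    """
    experiment.start()
    for _ in range(checks):
        if read_error_rate() > max_error_rate:
            experiment.halt_and_roll_back()
            return False
    experiment.running = False
    experiment.log.append("completed")
    return True


# Example: start small (5% of traffic) and stop once the error rate passes 5%.
samples = iter([0.01, 0.02, 0.09])  # assumed health readings, one per check
exp = Experiment(name="latency-injection", blast_radius=0.05)
completed = run_with_failsafe(exp, lambda: next(samples), max_error_rate=0.05, checks=3)
```

In this sketch the third reading breaches the threshold, so the experiment is halted rather than run to completion. Widening `blast_radius` only after a clean run mirrors the start-small-and-expand approach described above.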
True reliability requires a proactive defense against diverse failure scenarios. Gremlin facilitates this by enabling the replication of real-world incidents through orchestrated Chaos Engineering experiments and reliability tests. Gremlin includes an extensive library of pre-configured scenarios and lets you build your own to validate against any type of incident. Need to ensure your customers won’t be impacted by resource saturation, significant latency, or the loss of a data center, availability zone, or cloud provider? Gremlin has you covered with these and more. These scenarios can be shared across teams, fostering an organizational culture that prioritizes reliability. Schedule scenarios and validate deployments to keep availability high and reduce unplanned downtime.
Teams tasked with the daunting responsibility of maintaining optimal system availability often lack the tools to validate that past incidents won’t crop up again. Gremlin's platform provides these teams with the tools necessary to proactively identify and mitigate reliability risks, minimizing incident firefighting and costly late-night pages. Gremlin enables SREs to identify hidden reliability risks, validate and tune monitors, mitigate dependency failures, ensure reliable migrations and launches, and eliminate unplanned revenue-impacting outages. It’s a whole new approach to meeting uptime and availability SLOs.
Gremlin is a cloud-native platform that runs in any environment, so you can enable every team to build more reliable systems, regardless of their stack. Gremlin supports all public cloud environments (including AWS, Azure, and GCP), and runs on Linux, Windows, containerized environments like Kubernetes, serverless infrastructure like AWS Lambda, and, yes, bare metal, too. It integrates with the CI/CD, observability, and performance tools you already use, so it fits into your current tooling and workflows.