Measure and track your reliability
Get a comprehensive, objective measurement of your service’s reliability in minutes.
Hundreds of finance, retail, and technology organizations worldwide trust Gremlin
The world’s first truly objective measure of reliability
Scores are used everywhere in software engineering—QA tests, unit test coverage, uptime, etc. Why not reliability? Gremlin is the only reliability solution that can give you an objective, up-to-date score on how reliable your services are: no configuration needed.
What is a reliability score?
A reliability score is a value that represents how well a service can withstand real-world failures. Gremlin runs a suite of reliability tests on your services, then calculates the score based on the percentage of successful tests. This score ranges from 0 to 100, with 100 indicating that the service passed every test.
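To make the arithmetic concrete, here is a minimal sketch of a pass-percentage score. It assumes every test counts equally, which may not match how Gremlin weights its tests. The function name and the test names are hypothetical and exist only for illustration.

```python
def reliability_score(results: list[bool]) -> float:
    """Return the percentage of passed tests on a 0-100 scale.

    Assumption: every test carries equal weight.
    """
    if not results:
        raise ValueError("at least one test result is required")
    # True counts as 1 and False as 0, so sum() gives the number of passes.
    return 100.0 * sum(results) / len(results)


# Hypothetical test outcomes for a single service (illustrative names only).
outcomes = {
    "scalability_test": True,
    "redundancy_test": True,
    "dependency_failure_test": False,
    "dependency_latency_test": True,
}

print(reliability_score(list(outcomes.values())))  # 75.0
```

In this example, three of the four tests pass, so the service scores 75. Fixing the failing dependency test would bring it to 100.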
Gremlin makes Chaos Engineering easy and seamless. For us, it’s cut down the amount of time involved in designing and executing the chaos experiments, particularly for our Microservices and Kubernetes.
Chaitanya Krant, Engineering Manager at National Australia Bank
Align your engineering organization around a single reliability metric
Engineering teams have varying ideas of what reliability means and how to measure it. Gremlin’s reliability score sets the standard for teams, letting you see how well each team adheres to your organization’s reliability standards. Teams now have a positive, proactive, and self-guiding reliability metric they can use to plan improvements. Contrast this with retrospective meetings, which teams can only run after an incident has already happened.
At its core, Gremlin’s reliability score is built on your observability tool. In other words, you tell us what “reliable” looks like. Gremlin will use your metric of choice—whether it’s a simple responsiveness check, a Datadog metric, a PagerDuty alert, or something more complex.
Track changes to reliability over time
Your reliability score is more than just a point-in-time measure of reliability. Gremlin also tracks your score over time so you can see how the reliability of your service has changed as you continue to test and improve it. This is especially useful for reviewing past test results, determining when you last tested a service, and proving to your manager that you’ve been putting effort into improving your service’s reliability.
Services also change over time. Engineers push new code, services scale up and down, and infrastructure changes. New risks may appear, and old ones may resurface as regressions. Reliability scoring lets teams prove that they’ve kept their services resilient to new and recurring risks. And if a reliability risk does appear, the score acts as a proactive indicator, so you can fix it before it ever reaches production.
Demonstrate improvements to your organization
Historically, teams have struggled to prove the impact of their reliability efforts. Indicators like mean time to detection (MTTD) and mean time to resolution (MTTR) are useful, but they’re reactive and don’t tell the whole story. By the time you collect data on these indicators, the incident or outage has already happened.
With reliability scores, teams now have a proactive metric they can use to show the positive impact of their reliability work.
Shift from observing to improving
Gremlin enables teams to proactively improve reliability at every stage of maturity.
Robust, customizable chaos tests to safely replicate any incident scenario.
Pre-built test suite to cover the most common reliability risks. Get started in minutes.
Standardized scoring tools to identify and prioritize risks, and build reliability programs.