The case for Fault Injection testing in Production
Gremlin Principal Engineer Sam Rossoff explains when to run Fault Injection tests in non-production and Production environments.
How Gremlin's reliability score works
To make reliability improvements tangible, you need a meaningful way to quantify and track the reliability of systems and services. This "reliability score" should indicate at a glance how likely a service is to withstand real-world causes of failure, without waiting for an incident to happen first. Gremlin's Reliability Score feature lets you do just that.
What is Reliability Management?
Measuring and improving the reliability of technical systems has always been challenging. As an industry, we've developed several practices to address reliability concerns, such as incident response, observability, and Chaos Engineering. This led SREs and service owners to measure reliability in a handful of ways:
Four tests to measure and improve reliability: what matters and how it works
Legendary race car driver Carroll Smith once said, "until we have established reliability, there is no sense at all in wasting time trying to make the thing go faster." Even though he was referring to cars, the same goes for technology: no amount of code optimization or new features can replace stable systems. Unfortunately, much like race cars, it's hard to know that a system is unreliable until it blows a tire, the brakes stop working, or the steering wheel comes off the column. By that point, it's too late: you're panicking, you and other engineers are scrambling to fix the issue, and your users are angry.