Ensuring software availability is essential for any SaaS company—including Gremlin. To do that, our teams need to identify the reliability risks hiding in our systems. That’s why our development, platform, and SRE teams use Gremlin regularly to perform Chaos Engineering experiments, run reliability tests, and track the reliability of our systems against our standards. Along the way they’ve picked up a thing or two about how to find and fix reliability risks with Gremlin.
Based on their experience, we’ve put together five best practices you can use to improve your reliability and maximize the impact of your Reliability Management practice.
In the DevOps Handbook, Gene Kim writes, “You have to be able to see a problem to fix it.” This piece of common-sense advice is worth keeping in mind when you look at the reliability of your systems. (In fact, one of our platform engineers, no joke, has this quote on his wall.)
You have to be able to see a problem to fix it.
Setting up observability and alerting properly should be one of your first big goals for reliability, since it lets you see what’s going on. Start by making sure all your systems are instrumented correctly, then verify that your alerts and health checks are set to the right thresholds.
Chaos Engineering experiments are a great way to test whether your alerting thresholds and observability work the way you want them to. With Fault Injection, you can create a scenario that intentionally triggers an alert in your observability platform—like memory usage spiking past a critical level or the loss of a critical network connection—without actually bringing your system down. We’ve had plenty of customers set an alert threshold that seemed reasonable, only to have a test reveal that the application crashed well before the metrics ever reached it.
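That crash-before-alert gap is easy to check for once an experiment has shown you where a service actually falls over. Here’s a minimal sketch of the idea; the function, thresholds, and numbers are invented for illustration, not part of Gremlin’s API:

```python
# Hypothetical sanity check: does the alert fire before the crash point
# observed during a fault-injection experiment? All numbers are examples.

def alert_fires_before_crash(alert_pct: float, crash_pct: float,
                             headroom_pct: float = 5.0) -> bool:
    """Return True if the alert threshold sits at least `headroom_pct`
    percentage points below the utilization level where the service failed."""
    return crash_pct - alert_pct >= headroom_pct

# An alert at 90% memory looks reasonable on paper, but if the experiment
# showed the process crashing at 85%, the alert never fires before the outage.
assert not alert_fires_before_crash(alert_pct=90.0, crash_pct=85.0)

# Lowering the threshold to 75% restores headroom before the crash point.
assert alert_fires_before_crash(alert_pct=75.0, crash_pct=85.0)
```

The same comparison works for any resource metric: the experiment supplies the observed failure level, and the check flags thresholds that would never fire in time.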
Our team uses Gremlin to fine-tune our metrics and alerts to find that perfect balance between noise and signal.
Reliability testing shouldn’t be an afterthought. Every stage of the development process has the potential to introduce bugs or conflicts with existing systems, especially with how complex modern systems are. And the more you test, the more you’re able to find and fix reliability risks before they hit production or impact users.
A good way to start is by shifting reliability testing left into staging and integrating it with your CI/CD pipeline. When SREs join our team, they do their due diligence by first running experiments in the staging environment. Then, once they see how the experiments work and know the right guardrails are in place, they confidently migrate their experiments into production.
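Wired into a pipeline, this usually takes the shape of a gate step that blocks promotion when any staging reliability test failed. A sketch of that gate follows; the result format and test names are made up for the example rather than taken from Gremlin’s actual API:

```python
# Hypothetical CI gate: block promotion to production if any staging
# reliability test failed. The result records here are invented examples.

def gate(results: list[dict]) -> bool:
    """Return True (promote the build) only when every test reports "passed"."""
    return bool(results) and all(r.get("status") == "passed" for r in results)

staging_results = [
    {"test": "cpu-scalability", "status": "passed"},
    {"test": "zone-redundancy", "status": "failed"},
]
# The redundancy failure blocks the deploy.
assert gate(staging_results) is False
```

Treating an empty result set as a failure is a deliberate choice here: a pipeline that skipped its reliability tests shouldn’t silently pass the gate.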
It’s just as essential to move towards testing in production. No matter how much we might try, it’s impossible to make staging a perfect mirror of production. Production has unique data, traffic loads, user interactions, and more. With Gremlin’s health checks and automatic experiment rollback, you can confidently run experiments in production without impacting the health of your service.
Gremlin also helps make testing at every stage simple with pre-built reliability tests and the ability to automatically detect common risks. While you can (and should) still configure custom Fault Injection experiments to fit your systems, these out-of-the-box features help our team automate finding reliability risks at every stage of development.
In Failure Modes and Effects Analysis (FMEA), failure modes are all the ways a system could fail. Every software system has a wide variety of failure modes, many of them unique to the system. But there are a few common failure modes that fit almost every system, such as CPU/memory scalability, host/zone redundancy, and dependency failures.
Unfortunately, at many companies, engineers are left to prioritize testing for specific failure modes themselves, if they test at all. That prioritization is often biased toward past outages, which means basic tests and best practices get missed. Engineers hope their systems can navigate these failure modes, but they don’t know for certain, and that’s enough to keep an SRE up at night.
Matt, a Gremlin platform engineer, had that exact experience. Before he joined the team, he’d spent years knee-deep in the world of reliability. Like many SREs, he’d do everything he could to make sure the systems were resilient, but without Chaos Engineering testing, he couldn’t be sure—and that uncertainty would keep him up late.
But then he joined Gremlin, where we use our own pre-built reliability tests to know for certain how our systems will react to scalability, redundancy, and dependency failure modes.
When I got to Gremlin, I started playing around with Reliability Management, and it was exactly what I wish I would have had earlier. It gives me some comfort and baseline assurances of the reliability of our services. And that's really nice.
If you want to have confidence in your systems, it starts with making sure your basics are covered. At Gremlin, we spread a standard set of basic reliability tests to run throughout the week. This means each of our services is automatically tested weekly against those core, basic failure modes. We’re able to spot reliability risks and add fixes to our sprints, which means Matt and the rest of our platform and engineering teams can rest easier at night.
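Spreading the standard battery across the week can be as simple as round-robin assignment. A small sketch (the service names are invented, and this scheduling helper is ours, not Gremlin’s):

```python
# Sketch: round-robin a standard battery of weekly reliability tests
# across weekdays so every service is exercised once a week.
from itertools import cycle

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri"]

def weekly_schedule(services: list[str]) -> dict[str, str]:
    """Assign each service a weekday, wrapping around as needed."""
    return {svc: day for svc, day in zip(services, cycle(WEEKDAYS))}

# Hypothetical services; the sixth wraps back around to Monday.
sched = weekly_schedule(["api", "auth", "billing", "search", "ingest", "web"])
assert sched["web"] == "Mon"
```

The wrap-around keeps any single day from accumulating every service’s tests, so no one weekday becomes a reliability bottleneck.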
Probably the worst time to get strong, efficient work out of engineers is when they’ve been dragged out of bed to respond to an alert, or pulled away from a completely different project because something went down. Unfortunately, that’s precisely when many engineers are forced to fix issues and make reliability improvements.
But not if you regularly run reliability tests. When you schedule your tests, you can choose to run them at a time when engineers will be available for their best work. And when you do find reliability risks, you can add remediation into sprints on your schedule instead of when there’s an outage.
Most of Gremlin’s tests are set to run once a week during the middle of the day. That way there’s always an on-call engineer awake and ready to do their best work in case anything dire turns up.
While this may seem counterintuitive (after all, most maintenance happens late at night), Gremlin’s automated health checks with quick rollback give our SRE team confidence to perform non-destructive tests in real-world environments. It also lets them see how our systems truly react under the heavier load of midday traffic.
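The pattern that makes this safe is halt-on-unhealthy: advance the experiment step by step, poll a health check between steps, and roll back the moment the check fails. A generic sketch of that loop, using stand-in callbacks rather than Gremlin’s actual implementation:

```python
# Generic sketch of run-with-rollback: execute experiment steps while the
# health check passes; halt and roll back as soon as it fails.
from typing import Callable, Iterable

def run_with_rollback(steps: Iterable[Callable[[], None]],
                      healthy: Callable[[], bool],
                      rollback: Callable[[], None]) -> str:
    """Run each step, checking health after it; roll back on failure."""
    for step in steps:
        step()
        if not healthy():
            rollback()
            return "halted"
    return "completed"

# Toy example: the health check fails after the second step, so the third
# step never runs and the rollback fires instead.
log = []
checks = iter([True, False])
status = run_with_rollback(
    steps=[lambda: log.append("step1"), lambda: log.append("step2"),
           lambda: log.append("step3")],
    healthy=lambda: next(checks),
    rollback=lambda: log.append("rollback"),
)
assert status == "halted" and log == ["step1", "step2", "rollback"]
```

Because the check runs between every step, the blast radius of a failing experiment is bounded by a single step’s worth of impact before rollback kicks in.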
Finding a reliability risk doesn’t help much if you can’t take action to fix the issue. That’s why it’s essential to pair regular testing with a regular meeting cadence to go over your Gremlin dashboard and Reliability Management scores.
At Gremlin, these regular meetings include discussions with service owners and balancing sprint schedules. Dashboards and test results are shared ahead of time, so if your service is flagged with a risk, you can come prepared to answer questions. The Gremlin dashboard even makes an appearance at company-wide meetings, where our engineering team shows the progress they’re making on reliability risks without waiting for incidents and outages.
These meetings aren’t as simple as going down the list and addressing every single reliability risk. Like any other engineering work, remediation has to be weighed against the rest of your tasks and capacity. But with a list of failed tests in hand, your team can make informed decisions about balancing shipping new code with fixing existing issues.
Always remember that the point isn’t to find the teams that messed up and blame them—it’s to enable teams to find and fix reliability risks.
What I like about reliability tests is that they give a good baseline for a service team so you can begin narrowing down which teams need more help than others.
These meetings can also give you a sense of your tech debt creation rate. Every reliability risk represents potential tech debt you’re going to have to address. The lower your score with Reliability Tests, the more tech debt is piling up. By having regular meetings, you can make better decisions to balance your tech debt creation and feature launches.
Reliability isn’t a binary one-and-done task where you flip a switch and suddenly everything works perfectly. Your systems (and the systems they connect to) are always changing, which means new reliability risks are being introduced all the time.
Every single company needs to spend time improving the reliability of their systems—Gremlin included. The question to ask your team is whether you want to spend that time after an outage, frantically trying to fix it and repair the damage, or earlier, testing and fixing issues before they cause problems.
Yes, incidents will still happen, but regular reliability testing will detect reliability risks earlier, which means you can fix them on your schedule and on your terms, so incidents become few and far between.