
3 things you can do to get closer to five nines
5 minutes. That’s how much downtime per year some of the world’s largest enterprises will tolerate.
For most organizations, five nines (99.999%) of availability sounds like a pipe dream. But the trick to increasing availability isn’t massive infrastructure spending or a complex system redesign. All it takes is three key practices that any team can adopt.
In this post, we’ll walk through these practices and how we implement them at Gremlin. You’ll get a sneak peek into what it takes to keep a SaaS platform like Gremlin highly available, and how your organization can benefit from what we’ve learned.
1. Test regularly
Reliability is like backups: not testing it regularly is the same as not having it.
Testing is crucial for understanding how your systems respond to failure conditions, but most teams only test sporadically. The problem is that spotty, one-off tests don’t account for the constant changes in your environment, software stack, or teams. And while observability helps you troubleshoot failures, it’s a reactive practice: it’s most useful after the problem has already occurred.
At Gremlin, we found that weekly testing is the sweet spot. It’s frequent enough to catch issues before they become incidents, but not so frequent that it overwhelms your development cycle. But where do you start with testing?
Gremlin provides a suite of pre-built reliability tests that check your resilience to common failure modes (we’ll sketch one example after the list):
- Resource exhaustion: How does your service perform when CPU, memory, or storage is under pressure?
- Redundancy failures: Can your services survive losing a host, availability zone, or even an entire region?
- Dependency failures: What happens when a critical external API, database, or other service goes down?
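To make that last category concrete, here’s a minimal sketch of a dependency-failure test in Python. The service, endpoint, and fallback list are hypothetical, and this is a hand-rolled illustration rather than one of Gremlin’s pre-built tests; the point is simply to verify that a service degrades gracefully when a dependency is unreachable.

```python
# Hypothetical example: verify a service degrades gracefully when its
# recommendations dependency is unreachable. Names and endpoints are
# illustrative, not Gremlin's pre-built tests.
import requests

FALLBACK_RECOMMENDATIONS = ["top-seller-1", "top-seller-2"]

def get_recommendations(user_id: str, endpoint: str) -> list:
    """Call the recommendations dependency, falling back to a static list."""
    try:
        resp = requests.get(f"{endpoint}/recommendations/{user_id}", timeout=2)
        resp.raise_for_status()
        return resp.json()["items"]
    except requests.RequestException:
        # Dependency is down or slow: degrade gracefully instead of erroring out.
        return FALLBACK_RECOMMENDATIONS

def test_survives_dependency_outage():
    # Point at an address where nothing is listening to simulate an outage.
    result = get_recommendations("user-123", "http://127.0.0.1:9")
    assert result == FALLBACK_RECOMMENDATIONS

if __name__ == "__main__":
    test_survives_dependency_outage()
    print("Service degraded gracefully during the simulated outage")
```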
The key is to make this testing safe and automated. Using Health Checks, Gremlin will monitor your key metrics or alerts during testing. If an alert fires or a metric exceeds your set threshold, the test automatically stops and gets flagged as failed.
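As a rough illustration of that pattern (a generic sketch, not Gremlin’s Health Check implementation or API), a guarded test boils down to polling a key metric while the failure is injected and halting the experiment the moment the metric crosses its threshold. The metric source, threshold, and start/stop hooks below are all assumptions.

```python
# Sketch of a health-check-guarded experiment: poll a metric while the
# failure is injected and abort if it crosses a threshold.
import time
from typing import Callable

def run_guarded_experiment(
    start_experiment: Callable[[], None],
    stop_experiment: Callable[[], None],
    read_metric: Callable[[], float],   # e.g. p99 latency (ms) from your observability tool
    threshold: float,
    duration_s: int = 300,
    poll_interval_s: int = 5,
) -> bool:
    """Return True if the experiment ran to completion, False if it was halted."""
    start_experiment()
    deadline = time.monotonic() + duration_s
    try:
        while time.monotonic() < deadline:
            if read_metric() > threshold:
                return False  # Health check failed: flag the test as failed.
            time.sleep(poll_interval_s)
        return True
    finally:
        stop_experiment()  # Always clean up the injected failure.

if __name__ == "__main__":
    # Fake hooks so the sketch runs standalone.
    passed = run_guarded_experiment(
        start_experiment=lambda: print("injecting CPU pressure..."),
        stop_experiment=lambda: print("halting experiment"),
        read_metric=lambda: 250.0,   # pretend p99 latency holds at 250 ms
        threshold=500.0,             # abort if p99 exceeds 500 ms
        duration_s=10,
        poll_interval_s=2,
    )
    print("passed" if passed else "failed: health check tripped")
```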
If you’re just starting with reliability testing, start small. Pick one critical service and run individual tests to understand how it responds to different failures. Connect Gremlin to your observability tool and create Health Checks around your key metrics. This also helps you validate that your observability tool is accurate and timely. Then, schedule tests to run weekly to catch any reliability issues introduced in the previous sprint.
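For the weekly cadence, a scheduler such as cron or your CI system can trigger a script that runs the suite and records the results. A minimal sketch, assuming a guarded-experiment helper like the one above and an illustrative list of experiments:

```python
# Minimal sketch of a weekly test run. Scheduling is handled externally
# (cron, CI, or Gremlin's own scheduling); this script just runs an
# illustrative suite and writes a report you can compare across weeks.
import datetime
import json

EXPERIMENTS = ["cpu-pressure", "az-loss", "dependency-outage"]

def run_experiment(name: str) -> bool:
    """Placeholder for a guarded experiment like the one sketched earlier."""
    print(f"running {name}...")
    return True  # assume it passed for this illustration

if __name__ == "__main__":
    results = {name: run_experiment(name) for name in EXPERIMENTS}
    report = {"week": datetime.date.today().isoformat(), "results": results}
    with open("reliability-report.json", "w") as f:
        json.dump(report, f, indent=2)
    print(f"{sum(results.values())}/{len(results)} experiments passed")
```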
2. Make reliability everyone’s responsibility through on-call rotations
Every engineer at Gremlin has spent at least one week on call. This isn’t a punishment, but a practice. Today, each engineer is intimately familiar with how our systems work, how they can fail, and how to fix them.
How did we uplevel our engineers? The answer is simple: on-call engineers are responsible for running our reliability tests.
This creates a positive feedback loop: engineers aren’t just responding to problems; they’re actively looking for them before they happen. The more problems they discover, the more incentivized they are to dig deeper. When you’re the person who might get paged at 2 a.m., the motivation to make sure systems don’t fail is very strong.
What naturally emerged from this practice is that engineers compare reliability scores week over week when rotating. If our overall reliability score has dropped since the previous rotation, they investigate the cause. Did a feature release introduce a reliability risk? Did a dependency have an outage? This type of proactive investigation helps the team stay ahead of potential incidents, which in turn helps the entire company.
3. Show your reliability work and take accountability
Metrics are only meaningful when people care about them. For reliability metrics to drive organizational change, they need to be front and center in your team’s regular conversations.
At Gremlin, we pull up a dashboard of our services’ reliability scores during every engineering staff and product operations meeting. This isn’t just a status update—it’s an accountability moment. Engineers get to show off improvements, and if there’s a decline, the entire team has a shared conversation about the cause and possible solutions.
The key is making reliability data-driven and blameless. We’re not looking to single out engineers for letting the score fall by two or three points. Instead, we’re identifying areas for improvement and creating actionable next steps. As a result, engineers can take pride in their reliability efforts, and reliability becomes a shared conversation rather than something that happens in the background.
It comes down to culture, not just technology
Achieving five nines isn't about having perfect infrastructure—it's about building a culture where reliability is as important as feature velocity. The companies that consistently achieve high availability don't just have better tools; they have teams that think about failure as a regular part of operating software systems.
Five nines might still feel ambitious, but with these three practices, you'll be surprised how much closer you can get—and how much more confident you'll feel about your systems' ability to stay up when it matters most.
Ready to start testing your way to higher reliability? Try Gremlin’s automated reliability platform free for 30 days and see what your systems are really capable of. Or, watch Gremlin CEO Kolton Andrus talk about how we keep Gremlin at five nines.
Gremlin's automated reliability platform empowers you to find and fix availability risks before they impact your users. Start finding hidden risks in your systems with a free 30 day trial.