From DevOps to Five Nines: How a Global Enterprise Is Building a World-Class Resilience Program

Using Gremlin, a Fortune 100 SaaS company's DevOps team transformed their approach to reliability, proactively validating a mission-critical AWS platform targeting 99.999% availability for 5,000+ developers worldwide.

Availability bar raised from 99.95% to 99.999%

The team continually proved reliability through measurement, earning executive backing to bring their resilience practice to a new mission-critical AWS platform.

Executive Summary

The company's core DevOps platform supports 5,000+ developers but lacked a formal reliability program, with no consistent way to test, measure, or validate resilience across systems. Rising expectations for five-nines availability exposed gaps in the existing approach. Using Gremlin, the team built a systematic, repeatable reliability practice across the platform, introduced structured testing, measurement, and validation, and established a foundation now being extended to a mission-critical AWS platform targeting 99.999% availability.

“Five nines is just table stakes of operating at the scale and world-class reputation that [the company] has.”

Director of Engineering, Fortune 100 SaaS Company

Six years ago, a Fortune 100 SaaS company's internal technology organization was still in the early stages of maturing its approach to reliability.

“We weren’t really looking at reliability as intently then as we do now,” says the Director of Engineering. “It was just one of those things—we’ll run it on a Kubernetes cluster, and it’s good enough in house. Most people don’t look too hard at the developer toolchain anyways.”

But this wasn’t just any internal system. The DevOps platform is the core toolchain for more than 5,000 developers building and shipping software across a global organization. If the platform wasn’t available, development slowed—or stopped entirely.

Now, that same platform is backed by a disciplined reliability practice—and the team is using Gremlin to validate a brand-new, mission-critical AWS platform targeting 99.999% availability.

The Challenge

The company's Digital Enterprise Technology organization (formerly Business Technology) operates as “customer zero”—the team that customizes the platform for the company and gets first access to new capabilities. Within it, the DevOps group owns the toolchain, delivery pipeline, and shared services that more than 5,000 engineers rely on to build and operate software used by customers around the world.

At the time, development and operations were still separate, and reliability wasn’t a formalized practice. The toolchain, despite underpinning critical engineering workflows, sat at the bottom of the company’s service tier model. It had been deployed on Kubernetes, but there were no structured reliability programs in place: no consistent way to measure, test, or improve resilience.

As the team began improving the platform, one thing became clear: a system that thousands of developers depended on needed to be reliable by design—but there was no framework in place to make that possible.

The Solution

To close that gap, the focus wasn’t just on improving uptime—it was on building a disciplined, repeatable approach to reliability across the platform. The goal was simple: create a system that would seamlessly integrate into developers' existing workflows.

That meant moving beyond ad hoc efforts toward structured practices: defining reliability standards, introducing consistent testing, and creating clear ways to measure performance over time.

To support this shift, the team looked for a solution that could help operationalize these efforts without requiring a large internal investment.

“I liked Gremlin’s user interface, just how the tool itself is put together. It makes it very easy to achieve different goals. And I hadn’t seen anything in the market that was quite as easily used that would not require a lot of investment for me to establish a reliability practice. It has definitely lived up to expectations from the very beginning.”

Director of Engineering, Fortune 100 SaaS Company

Gremlin became a core part of how the team validated reliability—making it possible to safely test failures, uncover weaknesses, and standardize resilience practices across the platform.

The Turning Point

With this foundation in place, reliability became embedded into the platform itself. Using Gremlin, the team could simulate real-world failures and validate how systems behaved under pressure—ensuring the toolchain could support thousands of developers without disruption.

Instead of relying on assumptions, reliability became something they could actively test and prove.

Over time, reliability became a measurable, continuously improving discipline. The team introduced scorecards, SLA tracking, and telemetry to understand how systems were performing and where they needed to improve.

Rather than treating reliability as a fixed target, the team made it an ongoing process. Across more than two years, they steadily raised the bar, using testing and data to push the platform to higher levels of reliability.

As these practices matured, they caught the attention of new leadership.

When a new President of Enterprise and AI Technology stepped into the role, he recognized the progress the team had made—and saw an opportunity to apply that same approach more broadly across the company.

“Five nines is just table stakes of operating at the scale and world-class reputation that the company has. Their mantra is, we want best in class across the board. Doesn’t matter how much it costs because they know that the payoff is there. They want to build apps that can’t go down.”

Director of Engineering, Fortune 100 SaaS Company

With a new mission-critical AWS platform supporting products, AI initiatives, and developer workflows, leadership looked to extend this reliability model to an even broader set of systems.

The new platform would target 99.999% availability—and resilience engineering became an executive-level priority.

With a strong foundation in place, the focus is now on scaling these practices even further—making reliability testing more self-serve for developers, expanding the use of metrics and reporting, and embedding validation directly into the delivery pipeline.

“Five nines: we’re talking minutes per year of downtime. So we really have to get this right in terms of making it easy and automated end to end,” says the Director of Engineering. Gremlin is the tool the team uses to validate it all: testing scalability, auto-scaling, and regional failover across the platform.
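To put "minutes per year" in concrete terms, the downtime budget implied by an availability target can be worked out directly. The sketch below is illustrative only: the function name and the 365-day year are assumptions, not part of the team's tooling. It compares the 99.95% and 99.999% targets mentioned in this story.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def downtime_budget_minutes(availability_pct: float) -> float:
    """Return the downtime allowed per year, in minutes, for an availability target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    # The budget is the fraction of the year the service may be unavailable.
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


for target in (99.95, 99.999):
    print(f"{target}% -> {downtime_budget_minutes(target):.2f} minutes per year")
```

Under these assumptions, 99.95% allows about 262.8 minutes (roughly 4.4 hours) of downtime per year, while 99.999% allows about 5.26 minutes.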

The team is also pushing toward more advanced testing in production, including large-scale disaster recovery scenarios like zone evacuations—continuing to raise the bar for what reliable systems look like at the company's massive scale.

Conclusion

A decade as a developer before moving into management gave the Director of Engineering a clear philosophy: resilience testing can’t be yet another burden on engineers. “What I don’t want them to have to do is learn an entire specialty to write software in this company,” he says. “They should just be given the experience that we want them to have out of the box.”

His vision: the platform itself is resilient. Everything else (compliance, security, observability) is baked in through policy as code. The team is even exploring integrating Gremlin directly into their custom-built CI/CD pipeline, introducing resilience testing as an automated stage of delivery.
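As a rough illustration of what an automated resilience stage could look like, the sketch below gates a delivery pipeline on the outcome of resilience checks. Everything in it is hypothetical: the `ResilienceCheck` structure and the `resilience_gate` function are invented for this example and are not Gremlin's API or the team's actual pipeline. In practice, the results would come from whatever tool runs the experiments.

```python
from dataclasses import dataclass


@dataclass
class ResilienceCheck:
    """Outcome of one resilience test (hypothetical structure, not a Gremlin type)."""
    name: str
    passed: bool


def resilience_gate(checks: list[ResilienceCheck]) -> bool:
    """Allow the pipeline to proceed only if at least one check ran and all passed."""
    # An empty result set counts as a failure so a skipped stage cannot pass silently.
    return bool(checks) and all(check.passed for check in checks)


# Example results; the check names are illustrative.
results = [
    ResilienceCheck("auto-scaling under load", passed=True),
    ResilienceCheck("regional failover", passed=False),
]
failed = [check.name for check in results if not check.passed]
print("proceed" if resilience_gate(results) else f"blocked by: {', '.join(failed)}")
```

Treating "no results" as a failure is a deliberate choice in this sketch: it keeps the stage from passing when the tests were never run.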

“Gremlin has been so much more than just a tool. It’s been a lot of collaboration, a lot of sharing of ideas. It transcends the tool itself.”

Director of Engineering, Fortune 100 SaaS Company

Looking ahead, the Director sees the partnership deepening as a new platform engineering team scales up and his DevOps group transitions into a center of excellence for resilience—pushing best practices across the organization. In three to six months, the goal is a repeatable, structured resilience program: defined scorecards, documented processes, and Gremlin embedded at the center of a platform built for five nines.

“We’re building a world-class experience for developers at this company,” he says.
