From Exploration to Expectation: How a Global Athletic Apparel Leader Built a Culture of Resilience

Gremlin helps the world's leading athletic apparel brand prevent catastrophic holiday outages and build a culture of resilience that scales with evolving infrastructure.

Prevented catastrophic outage

A Gremlin-powered game day uncovered a hidden front-end library issue that would have triggered a repeat holiday failure—freezing order processing during peak shopping.

Culture of resilience

Chaos engineering evolved from an exploratory concept to an organizational expectation—with teams across the company independently planning and running their own experiments.

Executive Summary

A global athletic apparel leader's infrastructure was evolving faster than their ability to ensure its resilience, with no systematic way to validate that fixes actually worked under real failure conditions. After repeated holiday-season outages threatened critical revenue periods, the company partnered with Gremlin to build a chaos engineering practice that has since transformed resilience from a novel concept into an organizational expectation. Today, chaos engineering is embedded across their engineering culture, forming the foundation for continued innovation as their platform architecture evolves.

"Even though people don't plan for failure, it happens all the time. And when you hold the pager and you're responsible for the thing working in production, they're not gonna ask your boss or your boss's boss. They're gonna ask you."
Lead Software Engineer, Global Athletic Apparel Company

How do you know your technical fix actually works?

The leading athletic apparel brand found out the hard way. Or rather, they almost did. After a holiday outage froze order processing, they identified the root cause and built a fix. The team was confident they would have a stress-free holiday season, a crucial time of year for any retailer.

But just to be sure, they used Gremlin to inject the same dependency failure that had caused the original outage… only to discover that the fix didn’t work. 

"It was because of this bizarre issue with a live front-end library that you would have never guessed caused this particular behavior," their Lead Software Engineer says. "Basically, it would have reoccurred- and that woke everybody up.” 

Gremlin's ability to safely recreate real-world failure scenarios in production revealed what confidence and code reviews couldn't: a hidden vulnerability that would have triggered the same cascading failure. The team addressed the real root cause and prevented a catastrophic repeat failure during their most critical business period.

More importantly, it proved something that would reshape the company's entire engineering culture: confidence isn't enough. You have to test.

The Challenge

In 2018, the company faced a challenge familiar to many large enterprises: their infrastructure was evolving faster than their ability to ensure its resilience. The company was in the midst of a massive migration from on-premise data centers to cloud infrastructure, with plans to eventually embrace containerization and serverless architectures.

But the technical transformation was only part of the story. 

"Even though people don't plan for failure, it happens all the time,”  explains the Lead Software Engineer, who had been with the chaos engineering program since its inception. “And when you hold the pager and you're responsible for the thing working in production, they're not gonna ask your boss or your boss's boss. They're gonna ask you."

The Evolving Threat Landscape

That game day success came after years of holiday pain. During the 2019-2020 holiday seasons, the company faced a perfect storm: multiple AWS outages crashed parts of the internet during critical shopping windows, while internal issues caused additional outages at key business moments.

The impact was severe enough to shift their entire engineering culture. An organization that prided itself on continuous deployment adopted deployment freezes during the holiday period. The mantra changed from "ship 365" to "Holiday 365": always preparing for peak season.

But the company's infrastructure kept evolving. Dependencies constantly changed: internal services, third-party APIs, cloud provider services. A service that was resilient yesterday might not be resilient today if a new dependency was added or an upstream service changed its behavior.

"Tech stack diversification has only increased," the Lead Software Engineer explains. "So there’s still a lot more work to be done. Turnover has increased. Teams get really good at resilience, but then there’s rotation-  the maturity model goes up and then back down.” 

The company needed an approach to resilience that could adapt as quickly as their infrastructure evolved.

The Solution

Phase 1: Exploration 

The early days focused on education and experimentation. The team was initially hesitant to even use the word "chaos" in the program name, opting instead for "Resilience Engineering Program."

They started with infrastructure-level experiments on their remaining on-premise servers and early cloud deployments. Teams needed to be convinced of the value. The SRE team offered support, ran experiments, and demonstrated findings.

The value proposition was compelling, yet it required time to truly take hold.

Engineers dealing with production incidents began to see chaos engineering as a way to prevent future pages, rather than just reacting to current ones.

Phase 2: Validation

A critical service had experienced a major outage during the previous holiday season. One of the service's dependencies had failed in a way that triggered a cascading failure, ultimately freezing order processing. In the post-incident review, the team identified the root cause and implemented a fix.

"They said, 'This fix has gotta work next holiday. This is a super important service,'" the Lead Software Engineer explains. "So we said, okay, we'll definitely run a game day this time."

The team was confident the fix would work. They set up a chaos experiment using Gremlin to fail the dependency in the specific way it had failed in production.
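The case study doesn't show the experiment itself, but conceptually a game day pairs a deliberately injected failure with a hypothesis about how the system should respond. Here is a minimal, hypothetical sketch of that idea in Python; all names (OrderService, InventoryTimeout) are invented for illustration, and the company's actual experiment injected the failure with Gremlin rather than in test code:

```python
# Hypothetical sketch of a game-day experiment: deliberately fail a dependency
# the way it failed in production, then test the hypothesis that the fix holds.

class InventoryTimeout(Exception):
    """Stands in for the dependency failure behind the original outage."""

class OrderService:
    def __init__(self, fetch_inventory):
        self.fetch_inventory = fetch_inventory

    def place_order(self, sku):
        try:
            stock = self.fetch_inventory(sku)
        except InventoryTimeout:
            # The post-incident fix: degrade gracefully instead of freezing.
            return {"status": "accepted_pending_inventory", "sku": sku}
        return {"status": "accepted", "sku": sku, "stock": stock}

def failing_dependency(sku):
    # Inject the exact failure mode observed during the incident.
    raise InventoryTimeout(f"inventory lookup timed out for {sku}")

if __name__ == "__main__":
    service = OrderService(fetch_inventory=failing_dependency)
    result = service.place_order("SHOE-123")
    # The experiment's hypothesis: orders keep flowing despite the failure.
    assert result["status"] == "accepted_pending_inventory", "fix did not hold"
    print("hypothesis held:", result)
```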

The fix didn't work.

"It was because of this bizarre issue with a live front-end library that you would have never guessed caused this particular behavior," the Lead Software Engineer says. "Basically, it would have reoccurred- and that woke everybody up.” 

The team quickly addressed the real root cause. The holiday season was a success. And the company took notice.

Phase 3: Standardization

Resilience testing shifted from a nice-to-have to a necessity. Teams that had been hesitant started requesting support. Engineers who had worked with the SRE team on chaos engineering and then moved to different teams brought the practice with them.

"People have come to our team for support when they haven't run an experiment in years with this other team," the Lead Software Engineer says. "They're like, 'Man, that [chaos engineering] was really great. I'm on this new team, and I told my manager about it, and they want to do it now.”

Phase 4: Expectation 

Today, resilience testing isn't optional — it's expected. The journey from exploration to expectation took years, but the cultural shift is undeniable.

"Everyone within the organization knows what it is," the Lead Software Engineer says. "They're planning their own experiments. They're thinking of ways to fail [infrastructure and applications]. They have the mindset now and the expectation."

The company even renamed the entire department to Resilience Engineering, embracing the terminology they once avoided.

Conclusion

As the company's infrastructure evolved, so did their approach to resilience testing – and Gremlin evolved alongside them.

"Kolton [Gremlin’s CEO] was always saying, 'It's all about application fault injection. That's the ultimate goal. And now with Failure Flags, it’s possible — that's the best product out there. It's fantastic."
Lead Software Engineer, Global Athletic Apparel Company

They’re now tackling a new challenge: how do you make resilience accessible to platform teams and all their tenants without requiring each team to become chaos engineering experts?

The solution: bake resilience directly into the platforms.

"Rather than resilience being a place you’ve got to get to, we've shifted to, 'How do we bake it into everything?'" the Lead Software Engineer explains. "Put the cookies on the lower shelf and make it accessible, and then track who opens the cookie jar and extrapolate how much value they got."

The company's approach to resilience now mirrors their approach to security: continuous adaptation, not one-time fixes. And the culture built through their six-year partnership with Gremlin – the mindset shift from reactive to proactive – has become the foundation for their next era of retail innovation.


Avoid downtime. Use Gremlin to turn failure into resilience.

Gremlin empowers you to proactively root out failure before it causes downtime. See how you can harness chaos to build resilient systems by requesting a demo of Gremlin.
