Under Armour, Inc. is a Baltimore-based company that manufactures footwear and sports and casual apparel, with offices in San Francisco and Austin. The Under Armour Connected Fitness division builds and operates the MapMyFitness and MyFitnessPal apps for its community of over 200 million users, who log millions of meals and workouts daily.
Under Armour focuses on empowering their engineers, and Chaos Engineering, the controlled practice of injecting failure into systems to build resiliency, is a key initiative driven by senior engineering manager, Paul Osman. His Site Reliability team has adopted the practice of organizing disciplined, targeted Chaos Experiments to be executed on what are known as "GameDays." This discipline guides their teams as they harden critical systems for the impact of seasonal traffic spikes.
We use Gremlin to test various failure scenarios and build confidence in the resiliency of our microservices. The ability to target containerized services with an easy-to-use UI has reduced the amount of time it takes us to do fault injection significantly.
In order to make Chaos Engineering part of Under Armour's DevOps culture, it was essential for them to have a hosted service that was simple to use, safe for their team, and that offered world-class security. "Do It Yourself" was not an option because of the expertise, time and long-term maintenance required. Paul had learned that any point of friction slows adoption and reduces developer productivity.
Under Armour began running GameDays using Toxiproxy, a network latency simulation tool that a popular internet brand had open sourced. They needed to simulate the heavy usage that the MapMyFitness app experiences every New Year's, when customers rush to pursue their new fitness resolutions. The team was committed to Chaos Engineering, but configuring Toxiproxy for each GameDay took two engineers a full week of their operations time.
To maximize the value of each GameDay, Paul's team wanted to run as many Chaos Experiments as possible, but the configuration challenges with Toxiproxy meant they could run only 2-3 experiments per GameDay.
We were repeating the setup for every single experiment, and even then it was error-prone. We anticipated that if we wanted to get good at Chaos Engineering and make GameDays really repeatable, we'd probably have to write our own tooling. We decided there's got to be a tool out there that helps do this better, and that's when we found Gremlin.
Paul's team of SREs worked with Gremlin's customer success team, all trained Chaos Engineers, to install Gremlin and collaborate on running GameDays. Built to be safe, simple, and secure, Gremlin runs out of the box and is compatible with their Kubernetes environment. Paul's team no longer needed to spend time on repetitive configuration and setup each time they ran a GameDay. They were also able to perform more Chaos Experiments each time because Gremlin's UI and control plane made it simple to start a Chaos Experiment. In addition, their team could test with confidence knowing Gremlin has a Halt All feature to immediately stop any experiment and revert to a steady state.
Working with Gremlin's team helped Under Armour feel more confident about doing Chaos Experiments, and shortened their Chaos Engineering adoption curve.
Having Gremlin's Success team is something I underestimated the value of. It's extra engineers who help us run GameDays. It makes people a little bit more comfortable, especially when I'm making the case for Chaos Engineering internally. I can say, 'Here's Gremlin, they're going to go over and actually be in the office with you when you do this.'
Paul Osman
Senior Engineering Manager
Gremlin saved our team repetitive setup every time we ran a GameDay. Before Gremlin, GameDay prep would usually take two engineers about a week. After adopting Gremlin, it takes the same two engineers one day.
Before Gremlin, we were able to run 2-3 attacks per GameDay, and with Gremlin we're able to run 10 or more.
Gremlin empowers you to proactively root out failure before it causes downtime. See how you can harness chaos to build resilient systems by requesting a demo of Gremlin.