Backcountry Validates Black Friday Readiness

Backcountry's engineers run a safe and secure GameDay in production to validate the resilience of their distribution center robots for Black Friday.

  • Zero Incidents

    during Black Friday 2018

  • MTTD Reduced to 5 minutes

    Down from a multiple-hour MTTD the previous year

Backcountry.com is one of the largest online specialty retailers of clothing and outdoor recreation gear. They have offices in Park City, Salt Lake City, Portland, Virginia, Costa Rica, and Germany. They rely on software-driven machinery to convert online purchases into ready-to-ship packages with little to no human intervention in their distribution center.

As part of their normal ecommerce preparation for Black Friday and Cyber Monday, Backcountry goes through a long period of testing to root out bugs that could cause service disruptions. In 2018, Backcountry's engineering team adopted Chaos Engineering to take a more proactive approach to their Black Friday preparation.

Normally I would not consider destructive testing in production, but Gremlin made it easy and safe.

Alec Wilkins

Head of Engineering and Product

Challenge

Run Production Chaos Experiments in Distribution Warehouse

During Black Friday 2017, traffic loads impacted Backcountry's SLOs even though the team had followed preparation best practices. To prevent a repeat, their forward-looking engineering organization, led by Director of Engineering Gustavo Leiva and Principal Software Engineer Jose Esquivel, sought a new way to test for potential Black Friday incidents.

There was a strong business case for Chaos Engineering as a potential solution because it gives teams the ability to test for real-world outage scenarios and also gives the business confidence in production system resilience. However, Backcountry had no previous experience with Chaos Engineering and would need support to successfully introduce it into their engineering culture.

And because the testing would be in production, Backcountry needed software that they were confident could run safely and securely.

We don't have a shipping warehouse in staging. If we were going to be confident our systems would be stable during peak traffic, we had to test in production.

Gustavo Leiva

Director of Engineering

The Solution

Gremlin's Production Ready Solution

Backcountry's search for Chaos Engineering software included looking at open source options, building their own solution, and using enterprise tooling. They ultimately elected to use Gremlin's hosted solution. Building their own tool would take away from their existing feature roadmap, and they wanted to begin their Chaos Engineering journey as quickly as possible. Their requirements for rigorous safety and security quickly ruled out open source options as the current offerings lack security features and support.

Gustavo and Jose worked with Gremlin's success team to plan a GameDay that would recreate conditions from Black Friday 2017 and proactively look for other gaps as well. Gremlin was easy to install and configure, which allowed their team to get up and running very quickly. The plan also included specific SLO abort criteria: if any of them were reached, the team would use Gremlin's Halt All Attacks feature to restore their warehouse operations to steady state.
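
To make the idea of SLO abort criteria concrete, here is a minimal sketch of how such a check could be expressed. It is an illustration only and not Backcountry's or Gremlin's implementation. The metric names, the thresholds, and the `halt` callback are all hypothetical; in a real GameDay the callback would be wired to whatever mechanism stops the experiment.

```python
from dataclasses import dataclass
from typing import Callable, List, Mapping, Sequence


@dataclass(frozen=True)
class AbortCriterion:
    """One SLO threshold that, once exceeded, should end the experiment."""
    metric: str         # hypothetical metric name, e.g. "order_latency_p99_ms"
    max_allowed: float  # hypothetical limit agreed on before the GameDay


def breached(criteria: Sequence[AbortCriterion],
             observed: Mapping[str, float]) -> List[str]:
    """Return the metrics whose observed value exceeds its limit.

    A metric with no reading counts as breached: without data there is
    no evidence that the SLO is still being met.
    """
    failing = []
    for criterion in criteria:
        value = observed.get(criterion.metric)
        if value is None or value > criterion.max_allowed:
            failing.append(criterion.metric)
    return failing


def check_and_halt(criteria: Sequence[AbortCriterion],
                   observed: Mapping[str, float],
                   halt: Callable[[], None]) -> bool:
    """Call `halt` (a hypothetical stop hook) if any criterion is breached.

    Returns True when the experiment was halted, False otherwise.
    """
    if breached(criteria, observed):
        halt()
        return True
    return False
```

Treating a missing reading as a breach is a deliberately conservative choice for a production experiment: if monitoring goes quiet, the sketch stops the test rather than assuming all is well.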

We considered building our own tooling as well as the available open source tools. Gremlin was the only solution mature enough to make us comfortable running in production.

Gustavo Leiva

Director of Engineering

Results

  • Zero Incidents

    during Black Friday 2018

    The SLO disruptions in 2017 lasted 72 hours. We incorporated Chaos Engineering with Gremlin into our Q4 preparation and we had zero incidents in 2018.

    Jose Esquivel

    Engineering Manager
  • MTTD Reduced to 5 minutes

    Down from a multiple-hour MTTD the previous year

    Diagnosing the SLO issues in 2017 took hours. We used Chaos Engineering to improve Time to Diagnose of the system to less than 5 minutes by testing and tuning our logging, monitoring, and traceability.

    Gustavo Leiva

    Director of Engineering

