Qualtrics prepares for Disaster Recovery service-by-service, without disrupting dev cycles
Qualtrics' Quality Engineering team runs dependency experiments in preparation for Disaster Recovery testing.
prepared for failover without having to coordinate planned downtime
vs. traditional dependency testing
Qualtrics is an Experience Management platform used by more than 10,000 brands to gather customer feedback and build better products. The company has invested heavily in reliability and disaster recovery to mitigate the risk of downtime and ensure an always-on customer experience.
Qualtrics uses Gremlin to prepare for disaster recovery so each team can be ready for region failover. Their Resiliency Quality Engineering team, led by Venki Krishnamurthy along with Quality Engineer Josh Furr and Technical PM Catherine Stocker, helped introduce Chaos Engineering into the organization to prove that disaster recovery plans work as expected and to meet their overall reliability goals.
We wanted to make sure teams were prepared for scenarios where they couldn't connect to their dependencies, particularly during region evacuation.
To achieve their disaster recovery goals with Chaos Engineering, Qualtrics needed an intuitive enterprise solution that they could easily roll out to dozens of teams. The solution had to meet their organization’s security requirements and be simple to operate, so that engineering teams across the organization would trust and use it.
Gremlin allows you to showcase the value of failure testing within just a few hours of usage. This simplicity is what makes the tool so effective.
The QE team scheduled half an hour with each engineering team to test their dependencies. Using Gremlin’s Blackhole attack, the QE team could simulate dependency failures similar to those in a real DR event, while limiting the blast radius to a single service at a time.
Qualtrics tested over 40 teams in less than three months, and the QE team sent a bi-weekly report to the entire org on the faults they uncovered and mitigated. For example, the exercise surfaced a common logging issue; once it was fixed, teams could quickly and clearly detect issues with their dependencies.
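To illustrate the kind of logging behavior described above, here is a minimal Python sketch. It is not Qualtrics' code. It assumes that a blackholed dependency appears to the caller as a timeout or connection error, since dropped traffic never gets a reply. The names `call_with_fallback`, `blackholed_lookup`, `healthy_lookup`, and `profile-service` are hypothetical.

```python
import logging
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger("dependency_check")


def call_with_fallback(name: str, call: Callable[[], T], fallback: T) -> T:
    """Run call(); if the dependency is unreachable, log its name and return fallback."""
    try:
        return call()
    except (TimeoutError, ConnectionError) as exc:
        # Naming the dependency in the log line is what makes the fault easy to spot.
        logger.error("dependency %r unreachable: %s", name, exc)
        return fallback


def blackholed_lookup() -> dict:
    # Hypothetical stand-in for a client call whose traffic is being dropped.
    raise TimeoutError("timed out after 2.0s")


def healthy_lookup() -> dict:
    # Hypothetical stand-in for the same call when the dependency is reachable.
    return {"plan": "enterprise"}


degraded = call_with_fallback("profile-service", blackholed_lookup, {})
normal = call_with_fallback("profile-service", healthy_lookup, {})
```

In this sketch the service keeps running on a fallback value while the dependency is unreachable, and the error log states which dependency failed.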
Additionally, the DR dependency testing effort enabled Qualtrics to verify that each service had the appropriate restart scripts for its tier.
Knowing and taking action from dependency testing is an important part of our disaster recovery preparation. Gremlin made it easy for us to test how our services come back up with a dependency missing and how they perform while waiting for the dependency to come online.
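As a rough illustration of a service waiting for a missing dependency to come online, the sketch below polls a readiness check with exponential backoff. It shows one common pattern under stated assumptions, not the approach Qualtrics used. The function `wait_for_dependency`, its parameters, and the readiness callback are all hypothetical.

```python
import time
from typing import Callable


def wait_for_dependency(
    is_ready: Callable[[], bool],
    attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll is_ready() with exponential backoff.

    Returns True as soon as the dependency reports ready,
    or False once all attempts are used up.
    """
    for attempt in range(attempts):
        if is_ready():
            return True
        if attempt < attempts - 1:
            # Double the wait after each failed check: 0.5s, 1s, 2s, ...
            sleep(base_delay * 2 ** attempt)
    return False
```

Passing `sleep` as a parameter lets a test replace the real delay with a recorder, so the backoff schedule can be checked without waiting.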
Avoid downtime. Use Gremlin to turn failure into resilience.
Gremlin empowers you to proactively root out failure before it causes downtime. See how you can harness chaos to build resilient systems by requesting a demo of Gremlin.