Qualtrics is an Experience Management platform used by more than 10,000 brands to gather customer feedback and build better products. They have invested heavily in reliability and disaster recovery to mitigate the risk of downtime and ensure an always-on customer experience.
Qualtrics uses Gremlin to prepare for disaster recovery so that each team is ready for region failover. Their Resiliency Quality Engineering team, led by Venki Krishnamurthy along with Quality Engineer Josh Furr and Technical PM Catherine Stocker, helped introduce Chaos Engineering into the organization to prove that disaster recovery plans work as expected and to meet their overall reliability goals.
We wanted to make sure teams were prepared for scenarios where they couldn't connect to their dependencies, particularly during region evacuation.
In order to achieve their disaster recovery goals with Chaos Engineering, Qualtrics needed an intuitive, enterprise solution that they could easily roll out to dozens of teams. The solution would have to meet their organization’s security requirements and be simple to operate so engineering teams across the organization would trust and use it.
Qualtrics wanted to test how well they could fail over to a backup data center. Many of their teams run microservices, so testing a failover involves restarting all the dependencies to confirm that all the data flows still work properly in the new environment.
To save developer time, the QE team looked for the most efficient way to see how their services handle a lost connection to their dependencies during a region evacuation, as well as how they handle re-establishing those connections in the backup data center. The old way involved shutting down services to see the impact, which can disrupt dev cycles.
In order to see how teams fail over without Gremlin, you’d have to shut off a service to everyone and watch what happens.
Gremlin makes it simple to blackhole traffic to a dependency without needing to do a full-scale hot swap. In preparation for an organizational DR exercise, Qualtrics ran team-by-team dependency tests, saving hundreds of hours of engineering time.
Gremlin allows you to showcase the value of failure testing within just a few hours of usage. This simplicity is what makes the tool so effective.
Venki Krishnamurthy, Quality Engineering Manager
The QE team scheduled half an hour with each engineering team to test their dependencies. Using Gremlin’s Blackhole attack, the QE team could simulate a dependency failure similar to one in a real DR event, while limiting the blast radius to a single service at a time.
Qualtrics was able to test over 40 teams in less than three months, and the QE team provided a bi-weekly report back to the entire org about the faults they uncovered and mitigated. For example, during the exercise they found a common issue with logging. Fixing it enabled teams to quickly and clearly detect issues with their dependencies.
Additionally, the DR dependency testing effort enabled Qualtrics to verify that each service had the restart scripts appropriate to its tier.
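To make the idea concrete, here is a minimal sketch of the kind of behavior a dependency test exercises: a service probes a dependency with a bounded timeout and logs a clear, named warning when it gets no answer. This is not Gremlin's API or Qualtrics' code. The names `check_dependency` and `make_silent_listener` are hypothetical, and the silent listener is only a rough local stand-in for a blackholed dependency (a real blackhole drops packets, so even the connection attempt would hang rather than just the read).

```python
import logging
import socket

logger = logging.getLogger("dependency_check")


def check_dependency(name, host, port, timeout_s=0.5):
    """Return True if the dependency replies within timeout_s; otherwise log and return False."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as conn:
            conn.sendall(b"ping\n")
            # Blocks until data arrives or the timeout set above expires.
            reply = conn.recv(16)
            return bool(reply)
    except OSError as exc:
        # Timeouts and refused connections are both OSError subclasses.
        # Naming the dependency in the log makes the failure easy to attribute.
        logger.warning("dependency %s at %s:%d unreachable: %r", name, host, port, exc)
        return False


def make_silent_listener():
    """Hypothetical stand-in for a blackholed dependency.

    The kernel completes TCP handshakes into the listen backlog,
    but nothing ever accepts the connection or sends a reply.
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    srv.listen(1)
    return srv
```

For example, pointing the probe at the silent listener returns `False` after roughly the timeout instead of hanging indefinitely:

```python
srv = make_silent_listener()
port = srv.getsockname()[1]
print(check_dependency("inventory-db", "127.0.0.1", port, timeout_s=0.2))
srv.close()
```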
Knowing and taking action from dependency testing is an important part of our disaster recovery preparation. Gremlin made it easy for us to test how our services come back up with a dependency missing and how they perform while waiting for the dependency to come online.
Catherine Stocker, Sr. Technical Product Manager
Using Gremlin shortened the amount of time it took for each team to complete their dependency testing for disaster recovery.
Gremlin empowers you to proactively root out failure before it causes downtime. See how you can harness chaos to build resilient systems by requesting a demo of Gremlin.