
Announcing Disaster Recovery Testing
Today, we’re launching a new approach to running disaster recovery tests, validating failover processes, and ensuring compliance with regulations such as DORA. With Disaster Recovery Testing, you can run zone, region, and datacenter-scale experiments across your entire Gremlin organization simultaneously.
Protect your systems against catastrophic failures
Major cloud provider outages are rare, but they can be devastating when they happen. The AWS us-east-1 region outage in October 2025 is estimated to have impacted approximately 70,000 companies and resulted in $581 million in losses. A Microsoft Azure outage later that month was even more severe, with an estimated impact of between $4.8 billion and $16 billion.
There’s a clear takeaway: cloud organizations with strict availability requirements must show proven resilience against zone and region failures.
Disaster Recovery Testing lets you recreate major outages like these, so you can proactively test your failover systems and processes, validate your disaster recovery and incident response plans, and confirm that automatic zone and region failover works before the next major outage.
Test across your organization
Zone and region outages impact the entire engineering organization, not just individual teams. With Disaster Recovery Testing, you can accurately simulate this impact by running tests on services from across your entire Gremlin organization. Simply select the services you want to run the test on, define your test, and click “Run.”
For additional safety, we’ve added Disaster Recovery Testing Health Checks. These run alongside your service-level Health Checks during a Disaster Recovery Test, giving you more ways to monitor your key metrics and minimize impact. Configure one to watch your application’s most important metrics: if those metrics go outside of your SLA, the Health Check automatically halts the Disaster Recovery Test.
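To make the halt-on-breach idea concrete, here is a minimal sketch of the decision such a Health Check makes. It is not Gremlin's API: the `SlaThreshold` class, the metric names, and the threshold values are all hypothetical and chosen only for illustration. In practice you would configure the Health Check in Gremlin rather than write this logic yourself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlaThreshold:
    """Hypothetical SLA limit for a single metric (illustrative names only)."""
    metric: str
    max_value: float


def breached_metrics(observed: dict[str, float],
                     thresholds: list[SlaThreshold]) -> list[str]:
    """Return the names of metrics that are missing or above their SLA limit."""
    breached = []
    for threshold in thresholds:
        value = observed.get(threshold.metric)
        # Assumption: a missing metric counts as a breach, since a test
        # should not keep running while monitoring is blind.
        if value is None or value > threshold.max_value:
            breached.append(threshold.metric)
    return breached


def should_halt(observed: dict[str, float],
                thresholds: list[SlaThreshold]) -> bool:
    """Halt the test as soon as any metric is outside its SLA."""
    return bool(breached_metrics(observed, thresholds))


# Hypothetical thresholds and readings, for illustration only.
thresholds = [
    SlaThreshold("p99_latency_ms", 500.0),
    SlaThreshold("error_rate", 0.01),
]
healthy = {"p99_latency_ms": 320.0, "error_rate": 0.002}
degraded = {"p99_latency_ms": 810.0, "error_rate": 0.002}

print(should_halt(healthy, thresholds))
print(breached_metrics(degraded, thresholds))
```

The sketch treats a missing metric as a breach. That choice is an assumption, but it reflects a common safety preference: stopping a test too early costs less than letting it run unmonitored.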

Use expert-crafted Scenarios, or bring your own
Disaster Recovery Testing is built around Scenarios, giving you access to Gremlin’s full suite of pre-built Recommended Scenarios. Run experiments designed to test zone redundancy, region evacuation, DNS redundancy, and more, with minimal setup.

Want to run experiments specific to your infrastructure and applications? You can reuse Scenarios you’ve already created, use Scenarios shared from other team members, or create your own from scratch.

Get the 30,000-foot view of your reliability
After completing a Disaster Recovery Test, Gremlin provides a comprehensive analysis of the results, including which services passed and which ones failed. Immediately see which services need additional attention and which team(s) they belong to for easy prioritization.
Engineers can drill down into specific test results, identify the points of failure, and address them long before they cause an incident in production. Then, you can re-run the Disaster Recovery Test to verify that the problems are fixed.

Professional support and expertise
Our team can help you plan and execute your Disaster Recovery Test by providing professional guidance and expertise. We work with dozens of Fortune 1000 companies, including four of the top five U.S. banks, to orchestrate zone and region-level failover tests. We’ve also done this ourselves: our Engineering team ran a zone evacuation test on Gremlin’s production infrastructure and passed with zero failures and no service interruptions (our on-call engineer didn’t even get paged!).
Disaster Recovery Testing is available immediately for all customers. Log into the Gremlin web app and select “Disaster Recovery” from the navigation menu. If you’d like help planning and scheduling a Disaster Recovery Test, talk to one of our reliability experts.