Today, we’re launching a new approach to running disaster recovery tests, validating failover processes, and ensuring compliance with regulations such as DORA. With Disaster Recovery Testing, you can run zone, region, and datacenter-scale experiments across your entire Gremlin organization simultaneously.

Protect your systems against catastrophic failures

Major cloud provider outages are rare, but they can be devastating when they happen. The AWS us-east-1 region outage in October 2025 is estimated to have impacted approximately 70,000 companies and resulted in $581 million in losses. A Microsoft Azure outage later that month was even more severe, with an estimated impact of between $4.8 billion and $16 billion.

There’s a clear takeaway: cloud organizations with strict availability requirements must show proven resilience against zone and region failures.

Disaster Recovery Testing lets you recreate major outages like these, so you can proactively test your failover systems and processes, validate your disaster recovery and incident response plans, confirm that your automatic zone and region failover systems work when needed, and be prepared for the next major outage.

Test across your organization

Zone and region outages impact the entire engineering organization, not just individual teams. With Disaster Recovery Testing, you can accurately simulate this impact by running tests on services from across your entire Gremlin organization. Simply select the services you want to run the test on, define your test, and click “Run.”

For additional safety, we’ve added Disaster Recovery Testing Health Checks. These run alongside your service-level Health Checks during a Disaster Recovery Test, giving you more ways to monitor your key metrics and minimize impact. Configure a Disaster Recovery Testing Health Check to watch your application’s most important metrics, and if those metrics fall outside your SLA, the Health Check automatically halts the Disaster Recovery Test.
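To make the halting behavior concrete, here is a minimal sketch of the idea in plain Python. It is not Gremlin's API: `HealthCheck`, `run_with_health_checks`, the metric name, and the threshold are all hypothetical names chosen for illustration, and the metric reader stands in for whatever your observability tool reports.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class HealthCheck:
    """Hypothetical health check: a metric reader plus an SLA ceiling."""
    name: str
    read_metric: Callable[[], float]  # assumed to query your observability tool
    max_allowed: float                # assumed SLA threshold for the metric

    def passing(self) -> bool:
        # The check passes while the metric stays at or below the ceiling.
        return self.read_metric() <= self.max_allowed


def run_with_health_checks(
    steps: Iterable[Callable[[], None]], checks: List[HealthCheck]
) -> str:
    """Run test steps in order and stop as soon as any check fails."""
    for step in steps:
        step()
        failed = [check.name for check in checks if not check.passing()]
        if failed:
            # Halt immediately so the remaining steps never run.
            return "halted: " + ", ".join(failed)
    return "completed"


if __name__ == "__main__":
    # Simulated p99 latency readings (ms): healthy first, then over the ceiling.
    readings = iter([120.0, 900.0])
    latency = HealthCheck("p99_latency_ms", lambda: next(readings), 500.0)
    steps = [lambda: None, lambda: None, lambda: None]
    print(run_with_health_checks(steps, [latency]))
```

In this sketch the checks are evaluated after every step, so the third step never runs once the second reading breaches the threshold. A real test would poll continuously rather than once per step.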

Easily select a subset (or all) of your organization's services to include in a Disaster Recovery Test.

Use expert-crafted Scenarios, or bring your own

Disaster Recovery Testing is built around Scenarios, giving you access to Gremlin’s full suite of pre-built Recommended Scenarios. Run experiments designed to test zone redundancy, region evacuation, DNS redundancy, and more, with minimal setup.

Choose from pre-built or custom-created Scenarios to run on your services. Integrate with your observability tool to monitor your services and safely halt testing in case of unexpected outcomes.

Want to run experiments specific to your infrastructure and applications? You can reuse Scenarios you’ve already created, use Scenarios shared from other team members, or create your own from scratch.

Create in-depth Scenarios that replicate zone and region outages, cascading failures, and other failure modes.

Get the 30,000-foot view of your reliability

After completing a Disaster Recovery Test, Gremlin provides a comprehensive analysis of the results, including which services passed and which ones failed. Immediately see which services need additional attention and which team(s) they belong to for easy prioritization.
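As a rough illustration of that prioritization step, the sketch below groups failed services by owning team and lists the teams with the most failures first. The `(service, team, passed)` tuple shape, the function name, and the sample services are assumptions made for this example. They do not reflect Gremlin's actual report format.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Assumed result shape: (service name, owning team, whether the test passed).
Result = Tuple[str, str, bool]


def failures_by_team(results: Iterable[Result]) -> List[Tuple[str, List[str]]]:
    """Group failed services by team, teams with the most failures first."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for service, team, passed in results:
        if not passed:
            grouped[team].append(service)
    # Sort by failure count (descending), then team name for a stable order.
    return sorted(grouped.items(), key=lambda item: (-len(item[1]), item[0]))


if __name__ == "__main__":
    sample = [
        ("checkout", "payments", False),
        ("ledger", "payments", False),
        ("search", "discovery", True),
        ("cart", "storefront", False),
    ]
    for team, services in failures_by_team(sample):
        print(f"{team}: {', '.join(services)}")
```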

Engineers can drill down into specific test results, identify the points of failure, and address them long before a real outage exposes them in production. Then, you can re-run the Disaster Recovery Test to verify that the problems are fixed.

Get an auditable report of each test including which services passed and which ones failed. With Reliability Intelligence, Gremlin can tell you why a test failed and how to remediate it.

Professional support and expertise

Our team can help you plan and execute your Disaster Recovery Test by providing professional guidance and expertise. We work with dozens of Fortune 1000 companies, including four of the top five U.S. banks, to orchestrate zone- and region-level failover tests. We’ve also done this ourselves: our Engineering team ran a zone evacuation test on Gremlin’s production infrastructure and passed with zero failures and no service interruptions (our on-call engineer didn’t even get paged!).

Disaster Recovery Testing is available immediately for all customers. Log into the Gremlin web app and select “Disaster Recovery” from the navigation menu. If you’d like help planning and scheduling a Disaster Recovery Test, talk to one of our reliability experts.

Andre Newman
Sr. Reliability Specialist