Outages happen.
In fact, there were over 15,000 outages in 2025, with a global price tag of over $37 billion. And many of them were caused by third-party failures.
The outage could be from a major cloud provider, like when AWS us-east-1 went down, affecting 70,000 companies and costing up to $581 million. Or it could be a major dependency incident, like CrowdStrike's bad config push that hit 8.5 million Windows systems and cost Fortune 500 companies $4.5 billion.
Every organization needs to be ready for when things go catastrophically wrong. Unfortunately, many companies are woefully unprepared.
According to Cockroach Labs' State of Resilience report, only 20% of organizations consider themselves prepared for outages, while only 33% regularly perform disaster recovery simulations. And 71% of organizations don't do any failover testing to make sure outage prevention protocols work.
Engineering leaders know it's important to build and test disaster recovery plans. But when the gap between importance and adoption is this wide, the problem runs deeper than leadership simply failing to prioritize it.
There is something fundamentally broken about the disaster recovery testing process. It needs to be modernized.
Current approaches to disaster recovery testing
Disaster recovery testing is happening at companies, but it's a painful process that doesn't produce the results teams and companies need. Let's look at where the most common methods work and where they fall short.
Individual reporting
This is the most basic method of disaster recovery verification. Individual engineering teams are given a checklist they need to go through in order to comply with regulations or standards. They check the box that a specific action or mechanism exists, but there's no further verification or validation.
Pros
- Minimal disruption: it takes up engineering time but doesn't impact systems.
- Taps into the detailed knowledge of teams who know their systems best.
- Can be done asynchronously.
Cons
- Relies on assumptions and best intentions. No verification.
- Responses and standards can vary between teams.
- Doesn't account for the entire system.
Tabletop exercises
People from various teams get together in a conference room or on a large call to walk through a simulated disaster scenario. For many teams, this is where their disaster recovery testing stops. They'll run through the simulation, go through all the boxes on compliance checklists, and assume everything will go smoothly during an actual disaster.
Pros
- Catches big holes and makes sure everyone knows their roles.
- No system disruption since no failures are actually tested.
- Builds cross-functional awareness.
Cons
- Validates human knowledge, not system behavior.
- Doesn't test recovery mechanisms under real-world conditions.
- Compliance can't be backed up with auditable test data.
Backup restoration tests
A backup isn't a true backup until you test it. In this method, engineering teams test the backups themselves, verifying that the backup systems are set up correctly to spin up within the expected recovery time.
Pros
- Verifies backups are actual backups.
- Safely tests restoration without disrupting core systems.
- Uncovers config and provisioning issues to make sure backups can handle full system loads.
Cons
- Only tests restoration, not whether the failover will be engaged during a disaster scenario.
- Tests in a controlled environment, not real-world conditions.
- Usually done on a team basis rather than the entire system.
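To make "within the expected recovery time" concrete, here is a minimal Python sketch of a restore test that times one restore attempt and compares it to a recovery time objective (RTO). The names `run_restore_test`, `RestoreResult`, and the `restore` callable are hypothetical placeholders, not part of any particular tool; in practice `restore` would wrap whatever actually brings your backup online.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class RestoreResult:
    succeeded: bool
    elapsed_seconds: float
    within_rto: bool


def run_restore_test(
    restore: Callable[[], bool],  # hypothetical hook: performs the restore, returns True on success
    rto_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> RestoreResult:
    """Time one restore attempt and compare it to the recovery time objective."""
    start = clock()
    try:
        succeeded = bool(restore())
    except Exception:
        # A restore that raises counts as a failed test, not a crashed one.
        succeeded = False
    elapsed = clock() - start
    # The test only passes if the restore worked AND finished inside the RTO.
    return RestoreResult(succeeded, elapsed, succeeded and elapsed <= rto_seconds)
```

A restore that succeeds but takes longer than the target still fails this check, which is the difference between having a backup and having a backup you can rely on.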
Failover simulations
Very few companies perform these tests, and with good reason. These tests validate failover by shutting down your primary systems and/or rerouting traffic from them to your backup systems. And since they have a high likelihood of disruption, everyone involved is either in the room or on the call. They're messy, often painful exercises that put your system at risk and often involve coordination with external vendors.
Pros
- Verifies failover mechanisms work.
- Validates redundancy provisioning to make sure recovery and performance standards are met.
- Comprehensive results for compliance and reporting.
Cons
- Extremely disruptive to systems and engineering teams.
- Usually involves controlled shutdown or rerouting, which doesn't accurately simulate a disaster.
- Doesn't compensate for system shifts that happen between annual tests.
- Relies on custom scripts or manual effort, which don't scale and add risk because rollbacks are manual.
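The manual-rollback risk in that last point is worth a small illustration. One way to reduce it is to pair every injected failure with its own revert step, so the rollback runs even if the check in the middle blows up. The sketch below is a toy under stated assumptions: `injected_failure`, `FakeRouter`, and `failover_engages` are hypothetical names, and the router is an in-memory stand-in rather than a real load balancer or DNS layer.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def injected_failure(
    inject: Callable[[], None], revert: Callable[[], None]
) -> Iterator[None]:
    """Apply a failure, then revert it even if the checks in between raise."""
    inject()
    try:
        yield
    finally:
        # Runs once the failure has been injected, whether the check passed or not.
        revert()


class FakeRouter:
    """In-memory stand-in for a traffic router (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self.healthy = {"primary": True, "backup": True}

    def set_health(self, target: str, healthy: bool) -> None:
        self.healthy[target] = healthy

    def route(self) -> str:
        # Prefer the primary; fall back to the backup if the primary is down.
        for target in ("primary", "backup"):
            if self.healthy[target]:
                return target
        raise RuntimeError("no healthy target")


def failover_engages(router: FakeRouter) -> bool:
    """Take the primary down and confirm traffic lands on the backup."""
    with injected_failure(
        inject=lambda: router.set_health("primary", False),
        revert=lambda: router.set_health("primary", True),
    ):
        return router.route() == "backup"
```

After `failover_engages(FakeRouter())` returns, the primary is marked healthy again without anyone having to remember to restore it.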
Engineering organizations need a new approach
What would a different approach to disaster recovery testing look like?
Right now, most organizations with formal disaster recovery plans stop at tabletop exercises or get bogged down in expensive and manual testing. In both cases, the new approach would have to enable them to easily perform actual testing and verification. For organizations already performing failover testing, it would need to remove the disruption and toil while building on the engineering foundation of backup restoration testing.
All told, it would need to be:
- Systematic: The method has to be easily applied to teams across the organization to verify standards. It should also enable engineering teams to repair any issues that come up and validate their fixes on their own between major testing actions.
- Non-disruptive: The testing method should minimize disruption to systems as well as to engineering roadmaps and deployment schedules.
- Accurate: Instead of relying on assumptions or best intentions, the method needs to be able to simulate realistic failure conditions and prove that recovery mechanisms operate correctly.
- Auditable: The testing results need to be gathered and presented in reports that can be used for compliance verification, with enough granularity to tie data back to specific tests.
- Ongoing: Full disaster recovery testing usually happens annually, but systems are constantly changing. Teams need to be able to verify resilience between annual tests, providing better test coverage and making annual testing exercises easier.
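As a rough illustration of the "auditable" requirement, here is one way test results could be captured so each data point ties back to a specific test. This is a sketch only: the record fields, the `AuditRecord` name, and the JSON-lines format are assumptions chosen for the example, not a compliance standard.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class AuditRecord:
    test_id: str            # hypothetical identifier tying results to one test run
    service: str
    scenario: str
    passed: bool
    recovery_seconds: float
    ran_at: str             # ISO 8601 timestamp in UTC


def make_record(
    test_id: str,
    service: str,
    scenario: str,
    passed: bool,
    recovery_seconds: float,
    now: Optional[datetime] = None,
) -> AuditRecord:
    """Build one record, stamping it with the current UTC time unless one is given."""
    ran_at = (now or datetime.now(timezone.utc)).isoformat()
    return AuditRecord(test_id, service, scenario, passed, recovery_seconds, ran_at)


def to_audit_line(record: AuditRecord) -> str:
    """Serialize a record as one JSON line, suitable for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)


def failures_by_service(records: Iterable[AuditRecord]) -> Dict[str, int]:
    """Count failed tests per service for a summary report."""
    return dict(Counter(r.service for r in records if not r.passed))
```

Because every record carries its own test ID and timestamp, the same log can feed both the annual compliance report and the smaller checks teams run between annual tests.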