In Chaos Engineering Through Staged Resiliency - Stage 2, we examined partial automation by implementing a handful of automated resiliency tests.
- Creation of, and agreement on, Disaster Recovery and Dependency Failover Playbooks.
- Completion of Resiliency Stage 0.
- Completion of Resiliency Stage 1.
- Completion of Resiliency Stage 2.
After progressing through Resiliency Stage 2, your team has implemented at least some semi-automated resiliency testing. In this fourth stage, all remaining non-automated resiliency tests must also be integrated into your automated testing suite. If your application features a development or other non-production environment, you can opt to run these automated resiliency tests there, as full-blown production testing isn’t required until Resiliency Stage 4. That said, the earlier the team starts thinking about and practicing implementation within production systems, the smoother the transition will be, and the sooner you’ll see the increase in resilience and drop in support costs that staged resiliency aims to provide.
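One way to picture a resiliency test living inside an ordinary automated suite is the minimal sketch below. It is not the Bookstore application's actual test code: `FakeBookstore`, `db_outage`, and the `APP_ENV` and `RESILIENCY_ALLOW_PROD` environment variables are all hypothetical names invented for illustration. The test injects a failure, checks that the service degrades gracefully, and is skipped in production unless someone explicitly opts in.

```python
import os
import unittest
from contextlib import contextmanager

# Hypothetical: the environment the suite is running against.
ENVIRONMENT = os.environ.get("APP_ENV", "staging")
# Hypothetical opt-in flag for running resiliency tests in production.
ALLOW_PRODUCTION = os.environ.get("RESILIENCY_ALLOW_PROD") == "1"


class FakeBookstore:
    """Stand-in for the service under test; serves from a cache when the DB is down."""

    def __init__(self):
        self.db_available = True
        self.cache = {"42": "cached title"}

    def get_title(self, book_id):
        if self.db_available:
            return "fresh title"
        return self.cache.get(book_id)  # degraded, but still serving


@contextmanager
def db_outage(service):
    """Hypothetical fault injector; a real suite would start and halt an attack here."""
    service.db_available = False
    try:
        yield
    finally:
        # Always restore the dependency, even if the assertion inside fails.
        service.db_available = True


@unittest.skipIf(
    ENVIRONMENT == "production" and not ALLOW_PRODUCTION,
    "production runs are opt-in",
)
class DatabaseOutageTest(unittest.TestCase):
    def test_reads_survive_db_outage(self):
        service = FakeBookstore()
        with db_outage(service):
            self.assertEqual(service.get_title("42"), "cached title")
        # After the outage ends, the service should read from the DB again.
        self.assertEqual(service.get_title("42"), "fresh title")
```

Because the gate is an environment check, the same test can later run in production by flipping the opt-in flag instead of rewriting the suite.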
In spite of everyone’s best efforts, not all disasters can be avoided, so it’s critical that the team implement at least a semi-automated disaster recovery failover script. As with resiliency testing, it’s best to automate as much of the disaster recovery failover process as possible, requiring as little human intervention as feasible. However, depending on the breadth of the system and initial planning throughout the earlier Resiliency Stages, it’s entirely possible your disaster recovery failover will require at least a modicum of human supervision.
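A semi-automated failover script might take roughly the following shape. This is only a sketch under stated assumptions: the step names are hypothetical placeholders (the real ones would come from your Disaster Recovery playbook), and `FailoverStep`, `FailoverRunner`, and `build_example_runner` are illustrative names, not part of any library. Steps run automatically in order, and only the steps marked `needs_approval` pause for a human decision.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class FailoverStep:
    name: str
    action: Callable[[], None]
    needs_approval: bool = False  # True for steps a human must sign off on


@dataclass
class FailoverRunner:
    steps: List[FailoverStep]
    approve: Callable[[str], bool]  # asks the operator; True means proceed
    log: List[str] = field(default_factory=list)

    def run(self) -> bool:
        """Run steps in order; stop at the first step the operator declines."""
        for step in self.steps:
            if step.needs_approval and not self.approve(step.name):
                self.log.append(f"halted before: {step.name}")
                return False
            step.action()
            self.log.append(f"done: {step.name}")
        return True


def build_example_runner(approve):
    """Hypothetical steps; each lambda stands in for a real playbook action."""
    state = []
    steps = [
        FailoverStep("confirm primary is unhealthy", lambda: state.append("checked")),
        FailoverStep(
            "promote standby database",
            lambda: state.append("promoted"),
            needs_approval=True,
        ),
        FailoverStep("repoint traffic to standby", lambda: state.append("repointed")),
    ]
    return FailoverRunner(steps, approve), state
```

In an interactive run, `approve` could be something like `lambda name: input(f"Proceed with '{name}'? [y/N] ").strip().lower() == "y"`. As confidence in the process grows, more steps can drop the `needs_approval` flag until the script runs unattended.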
As the team progresses through this stage, make sure you follow the playbooks the team has already established. If something needs to be changed in a process or playbook, this is the time to suss that out and make those updates.
As will sometimes be the case when your own team is working through each Stage of Resiliency, the Bookstore application has already been configured to automatically perform resiliency testing in non-production environments. In Resiliency Stage 2 we explored Performing a CDN Failure Simulation Test and Performing a DB Failure Simulation Test, which together handle the major resiliency tests for the system by creating Gremlin attacks to sever the connection between the `bookstore-api` instances and the respective CDN/DB endpoints.
To ensure these tests are performed automatically, we can use the Gremlin API or web front-end to schedule attacks that match our testing schedule. Similarly, we’d want to schedule an automatic disaster recovery failover test using a Gremlin Shutdown Attack, as illustrated in Verifying Automated Instance Failover in Resiliency Stage 2. Check out the Gremlin documentation for more details on creating attacks with Gremlin.
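As a rough illustration of driving an attack from code, the sketch below builds (but does not send) an HTTP request for a blackhole attack. Several details are assumptions to check against the Gremlin documentation before use: the `https://api.gremlin.com/v1/attacks/new` endpoint, the `command`/`target` payload shape, the `-l` and `-h` arguments, and the format of the `Authorization` header value. The hostname `cdn.example.com` and the `service` tag are hypothetical placeholders.

```python
import json
import urllib.request

# Assumption: endpoint for creating a one-off attack; verify in the Gremlin docs.
GREMLIN_ATTACK_URL = "https://api.gremlin.com/v1/attacks/new"


def build_blackhole_attack(hostname: str, length_seconds: int, tags: dict) -> dict:
    """Build a payload for a blackhole attack that drops traffic to `hostname`."""
    return {
        "command": {
            "type": "blackhole",
            # Assumption: -l is the attack length in seconds, -h the hostname.
            "args": ["-l", str(length_seconds), "-h", hostname],
        },
        # Assumption: pick a random host among those matching these tags.
        "target": {"type": "Random", "tags": tags},
    }


def build_request(payload: dict, authorization: str) -> urllib.request.Request:
    """Wrap the payload in a POST request; the caller decides when to send it."""
    return urllib.request.Request(
        GREMLIN_ATTACK_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            # Assumption: the caller passes the complete header value.
            "Authorization": authorization,
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical values for illustration only.
payload = build_blackhole_attack("cdn.example.com", 240, {"service": "bookstore-api"})
request = build_request(payload, "Key <your-api-key>")
# To actually launch the attack: urllib.request.urlopen(request)
```

Invoking a script like this from cron or a CI scheduler is one way to run the attack on a fixed cadence. Gremlin's own scheduling, available through the web front-end, is another, and its API request shape is not shown here.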
You’ve automated resiliency testing in a non-production environment (and, ideally, even a bit in production). Your team has also semi-automated its disaster recovery failover procedures so that your service can largely recover on its own after a failure, with minimal human intervention. In the last chapter of this series, Chaos Engineering Through Staged Resiliency - Stage 4, we’ll explore the final steps of fully automating resiliency testing in production, along with CI/CD integration to ensure your service maintains stability throughout every step of the software development life cycle.