When you start practicing Chaos Engineering, make sure you have a plan for monitoring and metrics. Read our Chaos Engineering Monitoring & Metrics Guide to learn more.
The next step is sharing your progress and success with your team. We've created this email template to help you get started. Communicating with your team is critical to the success of your Chaos Engineering practice.
It would be even better if you could share this information publicly for other engineers to learn from. We look forward to hearing about your success!
Subject: Chaos Engineering results in 10x reduction in incidents for Databases Team
Over the past 3 months, the Databases Team has achieved a 10x reduction in incidents by practicing Chaos Engineering. Before we started, it was common to see 400 incidents a week.
We started practicing Chaos Engineering in May. Here is what we did:
- Used the PagerDuty service to export all incidents and obtain a batch dump of all incidents
- Used the Pareto Principle to identify the top 20% of incident types, which caused 80% of all incidents
- Ran 3 Chaos Engineering experiments a week to identify and confirm issues impacting reliability
- Fixed 15 critical tooling bugs which were contributing to the top 20% of incidents
- Ran Chaos Engineering experiments to confirm that the bug fixes improved reliability
- Audited monitoring and alerting and identified the top 10 improvements to alerts (for example, removing outdated alerts, fixing alert thresholds, and adding missing critical alerts)
The Databases Team has been able to achieve this massive reduction in incidents through the use of Chaos Engineering.
If you are interested in learning more about how Chaos Engineering can help your team improve reliability and reduce on-call load, please come along to our internal Tech Talk, which will be held on October 5 at 11am in the auditorium.
Thanks for reading!
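If you want to reproduce the Pareto step from the template on your own incident export, the following is a minimal sketch rather than the exact method the Databases Team used. It assumes the exported incidents have already been loaded as a list of dictionaries and that each one carries a "cause" label. The field name, the `pareto_causes` helper, and the sample data are all hypothetical, so adapt them to whatever your export actually contains.

```python
from collections import Counter


def pareto_causes(incidents, key="cause", threshold=0.8):
    """Return the most frequent causes that together cover `threshold` of incidents.

    `incidents` is assumed to be a list of dicts; `key` is a hypothetical
    field name for the incident's cause or category.
    """
    counts = Counter(incident[key] for incident in incidents)
    total = sum(counts.values())
    selected = []
    covered = 0
    # Walk causes from most to least frequent until the threshold is reached.
    for cause, count in counts.most_common():
        if covered / total >= threshold:
            break
        selected.append((cause, count))
        covered += count
    return selected


# Made-up sample data, for illustration only.
sample = (
    [{"cause": "disk_full"}] * 50
    + [{"cause": "replication_lag"}] * 30
    + [{"cause": "slow_query"}] * 10
    + [{"cause": "cert_expiry"}] * 10
)

for cause, count in pareto_causes(sample):
    print(cause, count)
```

The causes this returns are a reasonable starting list for deciding which Chaos Engineering experiments to run first and which bugs to fix first.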