Email Template: Internal Chaos Engineering Report - 10x Reduction In Incidents

Tammy Butow
Principal SRE
Last Updated:
October 24, 2018
Categories:
Chaos Engineering

Creating Your Own Chaos Engineering Reports

When you start to practice Chaos Engineering, it is important to have a plan for monitoring and metrics. Read our Chaos Engineering Monitoring & Metrics Guide to learn more.

The next step is sharing your progress and success with your team. We've created this email template to help you get started. Communicating with your team is critical to the success of your Chaos Engineering practice.

It would be even better if you could share this information publicly for other engineers to learn from. We look forward to hearing about your success!

Email Template: Internal Chaos Engineering Report - 10x Reduction In Incidents

To: engineering@yourcompany.com

Subject: Chaos Engineering results in 10x reduction in incidents for Databases Team

Cc: databases-team@yourcompany.com

Body:

Over the past 3 months, the Databases Team has achieved a 10x reduction in incidents through the practice of Chaos Engineering. Before we started practicing Chaos Engineering, it was common to see 400 incidents a week.

We started to practice Chaos Engineering in May.

How did we achieve a 10x reduction in incidents using Chaos Engineering?

  • Used the PagerDuty service to export a batch dump of all incidents
  • Used the Pareto Principle to identify the top 20% of incident causes responsible for 80% of the incidents
  • Ran Chaos Engineering experiments three times a week to identify and confirm issues impacting reliability
  • Fixed 15 critical tooling bugs that were contributing to the top 20% of incidents
  • Ran Chaos Engineering experiments to confirm that the bug fixes improved reliability
  • Audited monitoring and alerting and identified the top 10 improvements to alerts (removed outdated alerts, fixed alert thresholds, added missing critical alerts, etc.)
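
The Pareto step in the list above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the team's actual tooling and not PagerDuty's export format: it assumes each exported incident has already been reduced to a record with a hypothetical `cause` field, and the sample data is made up.

```python
from collections import Counter


def pareto_top_causes(incidents, threshold=0.8):
    """Return the most frequent causes that together cover `threshold` of all incidents.

    `incidents` is a list of dicts with a hypothetical "cause" key (an assumption
    about how the exported incident dump has been cleaned up beforehand).
    """
    counts = Counter(incident["cause"] for incident in incidents)
    total = sum(counts.values())
    selected = []
    covered = 0
    # Walk causes from most to least frequent until the threshold is reached.
    for cause, count in counts.most_common():
        if covered / total >= threshold:
            break
        selected.append((cause, count))
        covered += count
    return selected


# Made-up sample data standing in for an exported incident dump.
incidents = (
    [{"cause": "replication lag"}] * 50
    + [{"cause": "disk full"}] * 30
    + [{"cause": "slow query"}] * 10
    + [{"cause": "failed backup"}] * 6
    + [{"cause": "config drift"}] * 4
)

for cause, count in pareto_top_causes(incidents):
    print(f"{cause}: {count} incidents")
```

With this sample, two of the five causes account for 80 of the 100 incidents, so those two would be the first candidates for Chaos Engineering experiments and bug fixes.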

How can your team use Chaos Engineering to reduce incidents?

The Databases Team has been able to achieve this massive reduction in incidents through the use of Chaos Engineering.

If you are interested in learning more about how Chaos Engineering can help your team improve reliability and reduce on-call load, please come along to our internal Tech Talk, which will be held on October 5 at 11am in the auditorium.

Thanks for reading!

Databases Team
