The State of Chaos Engineering in 2021
Five years ago today, our co-founders launched Gremlin with a simple but bold mission: build a more reliable internet. Over those five years, Chaos Engineering has been increasingly adopted as a way to proactively test systems and make them more resilient and reliable. The practice has reached mainstream media (with articles in Politico and Bloomberg), community conferences on the subject have grown from a few hundred attendees to 3,500+ registrants, and Gremlin’s own users have executed nearly half a million Chaos Engineering attacks.
All of this interest got us curious: who is actually practicing Chaos Engineering, what techniques are they using, what problems are they solving, and what are they seeing as a result? To find out, we surveyed the community and examined Gremlin platform data, resulting in the first State of Chaos Engineering Report. The report confirms the growing interest in Chaos Engineering and observes that the nature of software failure modes has shifted.
State of Chaos Engineering Report
With more than 500 responses, primarily from software and site reliability engineers, we identified the ways in which these roles use Chaos Engineering to improve the reliability and resilience of their systems.
The top benefits of Chaos Engineering? Increased availability and decreased mean time to resolution (MTTR). In fact, teams that frequently run Chaos Engineering experiments were more likely to have >99.9% availability, an impressive feat. 23% of teams have an MTTR of under 1 hour, and over 60% of teams have an MTTR of under 12 hours. Not only is Chaos Engineering keeping services up and running, it's also creating more informed dev and ops teams who are better equipped to respond to incidents when they do happen.
Engineering teams across the globe use Chaos Engineering to intentionally inject harm into their systems, monitor the impact, and fix failures before they negatively impact customer experiences. The State of Chaos Engineering Report confirmed that in doing so, they avoid costly outages while reducing MTTR and mean time to detection (MTTD), prepare their teams for the unknown, and protect the customer experience.
However, the fear of testing in production is real: only 34% of respondents run Chaos Engineering experiments in production. Dev and staging are much more common environments for running attacks. Proactive testing in lower environments can provide better confidence in the stability of services without directly impacting customer experiences. While Gremlin continues to advocate for testing across all environments, and we expect the percentage of respondents testing in production to increase over time, we also recognize that not all critical services, such as emergency response systems and autonomous vehicles, can be experimented on in production. It’s important to consider a system’s function and impact on customer experiences.
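To put the 99.9% figure in perspective, here is a small back-of-the-envelope sketch of how much downtime a given availability target leaves room for. The function name and the 30-day month are our own assumptions for illustration; they do not come from the report.

```python
def allowed_downtime_minutes(availability_percent: float, period_days: float = 30.0) -> float:
    """Return the downtime budget, in minutes, implied by an availability target.

    Hypothetical helper for illustration only; the period length is an assumption.
    """
    period_minutes = period_days * 24 * 60
    # The unavailable fraction is whatever the target leaves over.
    return period_minutes * (1 - availability_percent / 100)


monthly_budget = allowed_downtime_minutes(99.9)                    # 30-day month
yearly_budget = allowed_downtime_minutes(99.9, period_days=365)    # 365-day year
print(f"99.9% over 30 days allows about {monthly_budget:.1f} minutes of downtime")
print(f"99.9% over 365 days allows about {yearly_budget / 60:.2f} hours of downtime")
```

At 99.9%, that works out to roughly 43 minutes per 30-day month, which is why a long incident can consume the entire budget on its own.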
The report also uncovers trends in technology environments and looks to the future to explore where Chaos Engineering is heading.
Unleash all the Gremlins
So much of our world has shifted online over the past 12 months, and as a result, significant increases in network traffic are a new reality for businesses in just about every industry. Running latency attacks helps engineers ensure customer experience continuity and improves the overall reliability of services. Latency attacks are the second most popular attack type for Gremlin customers, following closely behind blackhole attacks. A latency attack allows users to intentionally slow down network requests and observe how this affects response time, page load time, application stability, and ultimately the customer experience.
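The idea behind a latency attack can be sketched in a few lines of Python. This is a toy, in-process illustration and not how Gremlin implements its attacks or its API: the `fetch_profile` function, the wrapper names, and the 200 ms delay are all hypothetical stand-ins chosen for the example.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def with_injected_latency(call: Callable[[], T], delay_seconds: float) -> Callable[[], T]:
    """Wrap a call so that every invocation is delayed by a fixed amount."""
    def delayed() -> T:
        time.sleep(delay_seconds)  # simulate a slow network hop
        return call()
    return delayed


def timed(call: Callable[[], T]) -> Tuple[T, float]:
    """Run a call and return its result with the elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = call()
    return result, time.perf_counter() - start


def fetch_profile() -> dict:
    # Hypothetical stand-in for a request to a downstream service.
    return {"user": "example", "status": "ok"}


baseline_result, baseline_seconds = timed(fetch_profile)
slow_result, degraded_seconds = timed(with_injected_latency(fetch_profile, 0.2))

print(f"baseline: {baseline_seconds * 1000:.1f} ms")
print(f"with injected latency: {degraded_seconds * 1000:.1f} ms")
```

In a real experiment the interesting part is what happens around the slow call: whether timeouts fire, retries pile up, or the page still loads in an acceptable time.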
As interest in Chaos Engineering continues to grow, we want to provide teams of all sizes with a safe, secure, and scalable platform to run experiments. We released our Getting Started with Gremlin attacks guide to help teams familiarize themselves with all of the different Gremlin attack types along with their technical and business use cases. You can learn the benefits and best practices of testing for memory leaks, disks filling up, latency, process killers, blocked DNS access, and more. We’ve also expanded our library of Recommended Scenarios, allowing users to easily run Chaos Engineering experiments based on common real-world outages.
Community and Chaos Champions
A few years ago, we launched a Slack community to connect Chaos Engineering practitioners, learn best practices, find mentorship, and build reliable systems, together. Now, that community has nearly 7,000 members with more joining each day. And it’s not just the Slack community that’s growing. Last year, the world’s largest Chaos Engineering conference, Chaos Conf, saw a 440% YoY increase in registrations, and Gremlin’s Chaos Engineering Bootcamps had close to 2,000 registrants for these free, hands-on workshops.
In October 2020, we announced the Gremlin Chaos Champions program to recognize the work practitioners were doing for their teams, community, and the Chaos Engineering field at large. The freshman class consisted of four Chaos Champions, and today we’re excited to expand that group. Please meet the newest members of the Gremlin Chaos Champion program!
Chaos Engineering has allowed us to finally test long-held assumptions about our services and enables us to continuously build more reliable infrastructure.
Chaos can lurk just behind the facade of order. For me, chaos engineering has been an amazing journey for discovering that reliability is based on thousands of tiny failures, and that success of resilience is based on how many times we have failed at something.
Chaos Engineering has helped me a lot in getting a better understanding of our systems and working towards making them resilient. Sometimes chaos is the only way to achieve stability.
If you know someone who deserves to be recognized for their efforts leading Chaos Engineering in their organization with Gremlin, nominate them today for the Gremlin Chaos Champion Program.
We’re excited to see how the practice of Chaos Engineering and the larger reliability space evolve in 2021. We expect the global theme of ‘resilience’ to continue to lead our thinking, as teams across all sectors and industries look for ways to make their organizations, teams, and systems more resilient and better equipped to handle the unexpected.
Gremlin's automated reliability platform empowers you to find and fix availability risks before they impact your users. Start finding hidden risks in your systems with a free 30-day trial.