2021 State of Chaos Engineering

In 2016, Matthew Fornaciari and Kolton Andrus co-founded Gremlin with a simple mission: Build a more reliable internet. We are ecstatic to see how far the practice of Chaos Engineering has come, and are proud to share the results of the inaugural State of Chaos Engineering report that emphasizes the importance of the practice in improving availability.

Over the past twelve years, I’ve had the opportunity to be part of and watch the growth of Chaos Engineering. From its humble origins, most often met with “Why would you want to do that?” to its position today, helping ensure the reliability of the top companies in the world, it’s been quite the journey.

I first began practicing this discipline, years before it had a name, at Amazon, where it was our job to prevent the retail website from going down. As we were having success, Netflix wrote their canonical blog post on Chaos Monkey (ten years ago this July). The idea hit the mainstream and many engineers were hooked. After my tour of duty at Amazon, I rushed to join Netflix to dive deeper into this space. We were able to advance the art even further, building developer-focused solutions that spanned the entire Netflix ecosystem, ultimately resulting in another nine of availability and a world-renowned customer experience.

Five years ago, my co-founder Matthew Fornaciari and I founded Gremlin with a simple mission: Build a more reliable internet. We are both ecstatic to see how far the practice has come in that time. Many within the community have been hungry for more data on how to best leverage this approach, so we are proud to present the inaugural State of Chaos Engineering report.

Engineering teams across the globe use Chaos Engineering to intentionally inject harm into their systems, monitor the impact, and fix failures before they negatively impact customer experiences. In doing so, they avoid costly outages while reducing MTTD and MTTR, prepare their teams for the unknown, and protect the customer experience. In fact, Gartner anticipates that by 2023, 80% of organizations that use Chaos Engineering practices as part of SRE initiatives will reduce their mean time to resolution (MTTR) by 90%. The inaugural State of Chaos Engineering report shows a similar pattern: top-performing Chaos Engineering teams boast four nines of availability with an MTTR of less than one hour.

Kolton Andrus
CEO, Gremlin

Key findings

Increased availability and decreased MTTR are the two most common benefits of Chaos Engineering
Teams who frequently run Chaos Engineering experiments have >99.9% availability
23% of teams had a mean time to resolution (MTTR) of under 1 hour and 60% under 12 hours
Network attacks are the most commonly run experiments, in line with the top failures reported
While still an emerging practice, the majority of respondents (60%) have run at least one Chaos Engineering attack
34% of respondents run Chaos Engineering experiments in production

Things break

From the survey, the top 20% of respondents had services with an availability of more than four nines, an impressive level. 23% of teams had a mean time to resolution (MTTR) of under an hour, with 60% having an MTTR of under 12 hours.
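
For context, here is a back-of-the-envelope sketch (not drawn from the survey data) of what these availability targets mean as annual downtime budgets:

```python
# Rough downtime budgets implied by common availability targets.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

for target in (0.999, 0.9999, 0.99999):
    budget = (1 - target) * MINUTES_PER_YEAR
    print(f"{target:.3%} availability -> ~{budget:,.1f} minutes of downtime per year")

# 99.900% availability -> ~525.6 minutes of downtime per year
# 99.990% availability -> ~52.6 minutes of downtime per year
# 99.999% availability -> ~5.3 minutes of downtime per year
```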

What is the average availability of your service(s)?

Average number of high severity incidents (Sev 0&1) per month

What is your mean time to resolution (MTTR)?

When things do break, the most common causes were bad code pushes and dependency issues. These are not mutually exclusive. A bad code push from one team can cause a service outage for another. In modern systems where teams own independent services, it’s important to test all services for resiliency to failure. Running network-based chaos experiments, such as latency and blackhole, ensures that systems are decoupled and can fail independently, minimizing the impact of a service outage.
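
As an illustration of the idea, and not of any particular tool's implementation, a latency experiment boils down to delaying calls to a dependency and verifying that the caller times out and falls back gracefully. The service names, injected latency, and timeout below are hypothetical:

```python
import random
import time

def fetch_recommendations(user_id, injected_latency_s=0.0):
    """Hypothetical downstream dependency. In a real experiment the latency
    would be injected at the network layer by a fault-injection tool,
    not simulated in application code."""
    time.sleep(injected_latency_s)
    return [f"item-{random.randint(1, 100)}" for _ in range(3)]

def render_homepage(user_id, timeout_s=0.3, injected_latency_s=1.0):
    """Caller under test: it should degrade gracefully when the dependency is slow."""
    start = time.monotonic()
    recs = fetch_recommendations(user_id, injected_latency_s=injected_latency_s)
    if time.monotonic() - start > timeout_s:
        recs = []  # fallback: render the page without recommendations
    return {"user": user_id, "recommendations": recs}

# Experiment assertion: the page still renders while the dependency is degraded.
page = render_homepage("user-42")
assert "user" in page, "homepage failed instead of degrading gracefully"
print("Latency experiment passed: homepage degrades gracefully")
```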

What percent of your incidents (SEV0&1) have been caused by:

Who finds out

Monitoring for availability varies by company. For example, Netflix’s traffic is so consistent that they can use server-side video starts per second to spot an outage: any deviation from the projected pattern signals a problem. Google uses Real User Monitoring combined with windowing to determine whether a single outage had a large impact or multiple small incidents are affecting a service, leading to deeper analysis of the cause of the incident(s). Few companies have traffic patterns as consistent, or statistical models as sophisticated, as Netflix and Google. That’s why a standard uptime-over-total-time calculation fed by synthetic monitoring is the most popular way to monitor the uptime of services, though many organizations use multiple methods and metrics. We were pleasantly surprised that all of the respondents are monitoring availability. This is often the first step teams take to get proactive about improving customer experiences in applications.
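
As a rough sketch of the uptime-over-total-time approach with synthetic monitoring, the probe below polls a health endpoint and divides successful checks by total checks; the endpoint, interval, and success criterion are assumptions for illustration:

```python
import time
import urllib.request

def synthetic_availability(url, checks=10, interval_s=1.0, timeout_s=2.0):
    """Poll a health endpoint and report availability as
    successful checks divided by total checks (uptime over total time)."""
    successes = 0
    for _ in range(checks):
        try:
            with urllib.request.urlopen(url, timeout=timeout_s) as resp:
                if resp.status == 200:
                    successes += 1
        except Exception:
            pass  # timeouts and connection errors count as downtime
        time.sleep(interval_s)
    return successes / checks

# Hypothetical usage:
# print(f"availability: {synthetic_availability('https://example.com/health'):.3%}")
```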

What metric do you use to define availability?

How do you monitor availability?

When looking at who receives reports about availability and performance, it was no surprise that the closer a person is to operating applications, the more likely they are to receive reports. We believe the trend of DevOps bringing operations and development closer together is bringing developers in line with Ops as the build-and-operate mindset becomes pervasive in organizations. We also believe that as digitization increases and online user experience becomes ever more critical, we’ll see an increase in the percentage of C-level staff who receive availability and performance reports.

Who monitors or receives reports on availability?

Who monitors or receives reports on performance?

Top performers

Top performers had 99.99%+ availability and an MTTR of under one hour (highlighted above). To understand how teams achieve these impressive numbers, we looked into the tooling they use. Notably, autoscaling, load balancers, backups, select rollouts of deployments, and monitoring with health checks were all more common in the top availability group. Some of these techniques, such as multi-zone or multi-region deployments, are expensive, while others, such as circuit breakers and select rollouts, are primarily a matter of time and engineering expertise.
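
To make one of the lower-cost items on that list concrete, here is a minimal circuit breaker sketch; the thresholds and structure are illustrative, not a reference implementation:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker sketch: trip after consecutive failures,
    then fail fast until a cool-down period has elapsed."""

    def __init__(self, failure_threshold=3, reset_timeout_s=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout_s:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success resets the failure count
        return result
```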

Teams that consistently run chaos experiments have higher levels of availability than those that have never performed an experiment or only do so ad hoc. That said, ad-hoc experiments remain an important part of the practice, and teams with >99.9% availability perform more of them.

Frequency of Chaos Engineering experiments by availability

Tool use by availability

Autoscaling

DNS failover/elastic IPs

Load balancers

Active-active multi-region, AZ, or DC

Active-passive multi-region, AZ, or DC

Circuit breakers

Backups

DB replication

Retry logic

Select rollouts of deployments (Blue/Green, Canary, feature flags)

Cached static pages when dynamic unavailable

Monitoring with health checks

Evolution of Chaos Engineering

In 2010, Netflix introduced Chaos Monkey into their systems. This pseudo-random termination of nodes was a response to instances and servers failing at random: Netflix wanted teams prepared for these failure modes, so they accelerated the process by demanding resiliency to instance outages. Chaos Monkey both tested existing reliability mechanisms and forced developers to build with failure in mind. Based on the success of the project, Netflix open sourced Chaos Monkey and created a Chaos Engineer role. Chaos Engineering has since evolved to follow the scientific method, and experiments have expanded beyond host failure to test for failures up and down the stack.
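
Purely as an illustration of the pseudo-random selection idea (not Netflix's actual implementation), a Chaos-Monkey-style picker might look like the sketch below; the group names and probability are entirely hypothetical:

```python
import random

def pick_instances_to_terminate(instances_by_group, probability=0.2, rng=None):
    """Illustrative only: pick at most one random instance per group
    to terminate, each group having a given chance of being selected."""
    rng = rng or random.Random()
    victims = []
    for group, instances in instances_by_group.items():
        if instances and rng.random() < probability:
            victims.append((group, rng.choice(instances)))
    return victims

# Hypothetical fleet:
fleet = {"api": ["i-01", "i-02", "i-03"], "worker": ["i-11", "i-12"]}
print(pick_instances_to_terminate(fleet, probability=0.5, rng=random.Random(7)))
```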

Google searches for "Chaos Engineering"

"For every dollar spent in failure, learn a dollar’s worth of lessons."
Jesse Robbins, "Master of Disaster"

In 2020, Chaos Engineering went mainstream and made headlines in Politico and Bloomberg. Gremlin hosted the largest Chaos Engineering event ever, with over 3,500 registrants. GitHub has over 200 Chaos Engineering-related projects with 16K+ stars. And most recently, AWS announced its own public Chaos Engineering offering, AWS Fault Injection Simulator, coming later this year.

Chaos Engineering today

Chaos Engineering is becoming more popular and more mature: 60% of respondents said they have run a Chaos Engineering attack. Netflix and Amazon, the creators of Chaos Engineering, are large, cutting-edge organizations, but we’re also seeing adoption by more traditional enterprises and by smaller teams. The diversity of teams using Chaos Engineering is also growing. What began as an engineering practice was quickly adopted by Site Reliability Engineering (SRE) teams, and now many platform, infrastructure, operations, and application development teams are adopting the practice to improve the reliability of their applications. Host failure, which we categorize as a State type attack, is far less popular than network and resource attacks; we’ve seen an uptick in simulating lost connections to a dependency or a spike in demand for a service. We’re also seeing many more organizations moving their experimentation into production, although this is still in its early days.
459,548 ATTACKS USING THE GREMLIN PLATFORM
68% OF CUSTOMERS USING K8S ATTACKS

How frequently does your organization practice Chaos Engineering?
Daily or more frequent attacks
Weekly attacks
Monthly attacks
Quarterly attacks
Performed ad-hoc attacks
Never performed an attack

What teams are involved in conducting chaos experiments?

What percentage of your organization uses Chaos Engineering?

What environment have you performed chaos experiments on?

Percent of attacks by type

Percent of attacks by target type

Results of chaos experiments

One of the most exciting and rewarding aspects of Chaos Engineering is discovering or verifying a bug. The practice makes it easier to uncover unknown issues before they impact customers and to identify the real cause of an incident, speeding up the patching process. Another major benefit that showed up in the write-in responses to our survey was a better understanding of architectures. Running chaos experiments helps identify tight coupling and unknown dependencies that adversely affect applications and often negate many of the benefits of a microservices architecture. From our own product data, we found that customers were frequently identifying incidents, mitigating the issues, and verifying the fixes with Chaos Engineering. Our survey respondents frequently found that their applications increased in availability while their MTTR decreased.

After using Chaos Engineering, what benefits have you experienced?

Future of Chaos Engineering

What is the biggest inhibitor to adopting/expanding Chaos Engineering?

The biggest inhibitors to adopting Chaos Engineering are a lack of awareness and experience. These are followed closely by ‘other priorities,’ and, interestingly, more than 10% of respondents cited the fear that something might go wrong. It’s true that practicing Chaos Engineering means injecting failure into systems, but by using modern methods that follow scientific principles and methodically isolating experiments to a single service, teams can be intentional about the practice without disrupting customer experiences.
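
One common way to keep that blast radius contained is to pair every experiment with an abort condition tied to monitoring. The sketch below is illustrative; the fault-injection callbacks, metric callback, and thresholds are assumptions:

```python
import time

def run_experiment(inject_fault, revert_fault, get_error_rate,
                   abort_threshold=0.05, duration_s=60, check_interval_s=5):
    """Run a fault-injection experiment, but halt and revert immediately
    if the monitored error rate crosses the abort threshold."""
    inject_fault()
    try:
        elapsed = 0.0
        while elapsed < duration_s:
            if get_error_rate() > abort_threshold:
                print("Abort condition hit: halting experiment")
                return False
            time.sleep(check_interval_s)
            elapsed += check_interval_s
        return True
    finally:
        revert_fault()  # always clean up, whether the experiment passed or aborted

# Hypothetical usage with stand-in callbacks:
# ok = run_experiment(lambda: None, lambda: None, lambda: 0.01, duration_s=10)
```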

We believe the next stage of Chaos Engineering involves opening up this important testing process to a broader audience and making it easier to safely experiment in more environments. As the practice matures and tooling evolves, we expect it to become more accessible and faster for engineers and operators to design and run experiments that improve the reliability of their systems across environments; today, 30% of respondents run chaos experiments in production. We believe chaos experiments will become more targeted and automated, while also becoming more commonplace and frequent.

We’re excited about the future of Chaos Engineering and its role in making systems more reliable.

Demographics

The data sources for this report include a comprehensive survey with 400+ responses and Gremlin’s product data. Survey respondents are from a range of company sizes and industries, primarily in Software and Services. Adoption of Chaos Engineering has hit the enterprise, with nearly 50% of respondents working for companies with more than 1,000 employees, and nearly 20% working for companies with more than 10,000 employees.

The survey highlighted a tipping point in cloud computing, with nearly 60% of respondents running a majority of their workloads in the cloud and using a CI/CD pipeline. Containers and Kubernetes are reaching a similar level of maturity, but the survey confirmed that service meshes are still in their early days. The most common cloud platform is AWS at nearly 40%, with GCP, Azure, and on-premises each following at around 11-12%.

400+ QUALIFIED RESPONDENTS

How many employees work at your company?

How old is your company?

What industry is your company in?

What is your job title?

What percent of production workloads are in the cloud?

What percent of production environment routes leverage service mesh?

In addition to examining the survey results, we also aggregated information about the technical environments of Gremlin users to understand what specific tools and layers of the stack are most often targets of Chaos Engineering experiments. Those findings are below.

What is your cloud provider?

What is your container orchestrator?

What is your messaging provider?

What is your monitoring tool?

What is your database?

Contributors

Dynatrace provides software intelligence to simplify cloud complexity and accelerate digital transformation. With automatic and intelligent observability at scale, our all-in-one platform delivers precise answers about the performance and security of applications, the underlying infrastructure, and the experience of all users to enable organizations to innovate faster, collaborate more efficiently, and deliver more value with dramatically less effort.
Learn more
Epsagon enables teams to instantly visualize, understand and optimize their microservices architecture. With our unique lightweight auto-instrumentation, gaps in data and manual work associated with other APM solutions are eliminated, providing significant reductions in issue detection, root cause analysis and resolution times.
Learn more
Grafana Labs provides an open and composable monitoring and observability platform built around Grafana, the leading open source technology for dashboards and visualization. More than 1,000 customers such as Bloomberg, JP Morgan Chase, eBay, PayPal, and Sony use Grafana Labs, with more than 600,000 active installations of Grafana around the globe. Commercial products include Grafana Cloud, a managed stack that integrates Prometheus & Graphite (metrics), Loki (logs), and Tempo (traces) with Grafana; Grafana Enterprise, an enhanced version of Grafana with enterprise features, plugins, and support; and Grafana Metrics Enterprise, which enables Prometheus-as-a-service for large organizations running at scale.
Learn more
Founded in 2014 by Edith Harbaugh and John Kodumal, LaunchDarkly is the feature management platform that software teams use to build better software, faster with less risk. Development teams use feature management as a best practice to separate code deployments from feature releases. With LaunchDarkly, teams control their entire feature lifecycles from concept to launch to value. Serving over 1 trillion feature flags a day, LaunchDarkly is used by teams at Atlassian, Microsoft, and CircleCI.
Learn more
PagerDuty, Inc. (NYSE:PD) is a leader in digital operations management. In an always-on world, organizations of all sizes trust PagerDuty to help them deliver a perfect digital experience to their customers, every time. Teams use PagerDuty to identify issues and opportunities in real time and bring together the right people to fix problems faster and prevent them in the future. Notable customers include GE, Cisco, Genentech, Electronic Arts, Cox Automotive, Netflix, Shopify, Zoom, DoorDash, Lululemon and more.
Learn more