Over the years, Chaos Engineering has proven its effectiveness time and time again, uncovering risks and saving companies millions they would have lost in painful, brand-impacting outages.

But as Chaos Engineering adoption increased, we found organizations running into the same stumbling blocks when they tried to scale. Individual teams would get great results with Chaos Engineering, then stall as they tried to get more teams involved.

Chaos Engineering as a practice is still essential for uncovering failure modes and preventing outages, but the only way an entire organization can improve reliability at scale is to combine the exploratory testing of Chaos Engineering with a scalable approach built on standards, validation testing, and reporting.

What are the limits of Chaos Engineering?

Chaos Engineering is built around experimentation. By designing experiments and scenarios, you can test specific failure modes to uncover risks, find weaknesses, and see if a service behaves the way it’s supposed to.

It’s an incredibly effective practice, but it requires deep knowledge of both the practice itself and the individual service to interpret the results. Often, the knowledge and capability to perform Chaos Engineering experiments reside within a single team, or even a single individual, that moves around between service owners.

This approach creates a bottleneck that limits the effectiveness and coverage of fault injection testing. The areas that get the SRE team’s attention become more reliable and avoid outages, but other services are left with testing gaps and reliability risks.

Even this limited practice is still incredibly effective, especially if the Chaos Engineering efforts are focused exclusively on high-priority critical systems.

But what about the rest of the services? Even if your critical services function at 99.99% availability, your application as a whole could be dragged down to 99% (or even 98%!) uptime by less resilient services. Unless you expand the program beyond standard Chaos Engineering, it will eventually plateau in effectiveness.

How to build on Chaos Engineering

For an entire application and its underlying systems to have higher reliability, more resilience, and increased uptime, the core learnings and tests of Chaos Engineering need to be scaled across the organization.

Add validation testing built on exploratory testing

A chain of distributed services is only as reliable as its weakest link. You need to bring all of your services to the same level of reliability, and you do that with a library of standard tests that all teams run. These should come from your exploratory Chaos Engineering testing, but also cover common outage causes.

For example, Gremlin includes a pre-built test suite that tests for the failures behind 80% of outages. The most effective customers start with those tests, then add to them based on the outages they’ve experienced to cover the remaining 20%.

The tests verify behaviors like autoscaling, graceful handling of failing dependencies, zone redundancy, and more.

Your goal is to create a set of standards that every service owner can use to validate the reliability of their services.
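
To make this concrete, here’s a minimal sketch of what a shared validation suite could look like, assuming hypothetical inject_fault and check_slo helpers that stand in for whatever tooling your organization uses. The point is that the tests themselves are standardized, declarative, and run the same way by every team.

```python
# Minimal sketch of a standardized validation suite. inject_fault and
# check_slo are hypothetical placeholders for your own tooling; only the
# shared, declarative structure of the suite is the point.
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class ValidationTest:
    name: str
    fault: dict        # what to inject: type, magnitude, target
    expectation: str   # the behavior the service must demonstrate

# Standard tests every service owner runs, seeded from exploratory
# Chaos Engineering experiments and common outage causes.
STANDARD_SUITE = [
    ValidationTest(
        name="scalability",
        fault={"type": "cpu", "percent": 80},
        expectation="autoscaling adds capacity and latency stays within SLO",
    ),
    ValidationTest(
        name="dependency-failure",
        fault={"type": "blackhole", "target": "payments-api"},
        expectation="service degrades gracefully with no cascading failures",
    ),
    ValidationTest(
        name="zone-redundancy",
        fault={"type": "zone-outage", "zone": "us-east-1a"},
        expectation="traffic fails over to the remaining healthy zones",
    ),
]

def run_suite(
    service: str,
    inject_fault: Callable[[str, dict], None],
    check_slo: Callable[[str], bool],
) -> Dict[str, bool]:
    """Run every standard test against a service and record pass/fail."""
    results = {}
    for test in STANDARD_SUITE:
        inject_fault(service, test.fault)        # start the fault (placeholder)
        results[test.name] = check_slo(service)  # did the service hold its SLO?
    return results
```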

Test regularly and work towards automation

Reliability isn’t a one-and-done effort. Your applications and systems are constantly shifting as new features get deployed, external dependencies update, underlying infrastructure shifts, and more. Just because a service worked as expected two weeks ago doesn’t mean it’s going to continue to work, and the last time you want to find out is when there’s an outage.

Once you have a standardized group of tests, make sure you run them regularly, which is where automation really helps. At Gremlin, we run every service through the complete set of test suites every week by scheduling the tests to run automatically throughout the week. In fact, overseeing those tests and checking the results is part of our on-call rotation.
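
As a rough illustration of that kind of automation, the sketch below assigns each service a stable weekday so the whole catalog gets the full suite every week. The SERVICES list and the run_suite runner carried over from the earlier sketch are assumptions, not a real catalog or API.

```python
# Minimal sketch of weekly test automation: deterministically assign each
# service to a weekday so the whole catalog is covered every week.
import datetime
import hashlib

# Hypothetical service catalog; replace with your own inventory.
SERVICES = ["checkout", "search", "payments", "inventory", "notifications"]

def assigned_weekday(service: str) -> int:
    """Map a service name to a stable weekday (0 = Monday ... 4 = Friday)."""
    digest = hashlib.sha256(service.encode()).hexdigest()
    return int(digest, 16) % 5

def services_due_today(services: list, today: datetime.date = None) -> list:
    """Return the services whose weekly validation run falls on today."""
    today = today or datetime.date.today()
    if today.weekday() > 4:  # skip weekends
        return []
    return [s for s in services if assigned_weekday(s) == today.weekday()]

if __name__ == "__main__":
    for service in services_due_today(SERVICES):
        print(f"Running the standard reliability suite against {service}")
        # results = run_suite(service, inject_fault, check_slo)  # from the earlier sketch
```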

The goal of regular testing is to know the actual reliability of your services at any given time. Instead of hoping a service won’t fail, you have a record of tests that shows which failure modes it’s currently resilient to, giving you confidence in how your services will respond. Just as importantly, you have a list of your resiliency gaps so you can schedule fixes on future roadmaps.

Add reporting, processes, and accountability

Once you’re testing regularly, build standard processes around the results. By using the results of the tests over time, you can create a metric of reliability for each service. And once you have a metric, you can create processes for reporting, reviewing, and taking action based on it.
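
For illustration, a simple reliability metric could be the share of standard tests a service currently passes, tracked week over week. The pass-rate formula below is an assumption, not a prescribed method; the results dictionary matches the shape produced by the earlier run_suite sketch.

```python
# Minimal sketch of turning weekly test results into a reliability metric.
# The simple pass-rate formula is an assumption for illustration only.
def reliability_score(results: dict) -> float:
    """Percentage of standard tests the service currently passes."""
    if not results:
        return 0.0
    passed = sum(1 for ok in results.values() if ok)
    return 100.0 * passed / len(results)

# Example: a service that survives dependency failure and zone loss but
# fails the autoscaling test scores roughly 66.7 out of 100.
weekly_results = {
    "scalability": False,
    "dependency-failure": True,
    "zone-redundancy": True,
}
print(f"Reliability score: {reliability_score(weekly_results):.1f}")
```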

This is the reliability goal for any organization: test results are regularly reviewed, and teams are held accountable for them. If the metric goes up, the team is applauded. If it goes down, work is added to the sprint to address the issue. As part of this, you will (and should!) discuss trade-offs between addressing identified gaps and other engineering initiatives.

Accountability from a leadership level is essential at this point. Engineering leaders are already paying attention to availability and uptime as key metrics, but these metrics are hard to track between outages. A reliability metric gives you a way to track progress that directly ties to those metrics. Make that tie obvious by building dashboards and reporting those results to leadership. Time and time again, the difference between success and failure is whether executives visibly make reliability a priority for the organization.

The goal of reporting, processes, and accountability isn’t to blame engineers or create more work. In fact, it’s the exact opposite. Without reporting or measured test results, engineers can’t show the impact of their work, because they’re trying to prove a negative: that their efforts prevented outages that never happened. When you find an issue through testing, fix it, and test again, you have direct proof that your work was effective.

Scaling Chaos Engineering is easy with Gremlin

Gremlin is the safest, most versatile, and easiest to use platform for creating and running Chaos Engineering experiments. But that’s just the beginning.

Our platform includes everything needed to scale Chaos Engineering into a true reliability program across your organization, including Reliability Management test suites, Dependency Discovery, Detected Risks, Reliability Intelligence, and more.

That’s because Chaos Engineering is essential for any company that wants to improve reliability. But it’s also just the beginning.

Gavin Cahill
Sr. Content Manager