Where to automate resilience testing in your SDLC

When organizations begin to deploy resilience testing or Chaos Engineering, there’s a natural question: can we integrate this with our CI/CD pipeline or release automation tools? After all, you’re likely running unit, performance, and integration tests already—is resiliency different?

The short answer is yes to both. Integration is possible, but resiliency is different, so automation is a nuanced conversation. Gremlin has many customers running resilience tests as part of their release automation today, but release automation is not the primary way to automate resilience testing across your SDLC.

Ultimately, most teams want to get to a point where they are running verification tests regularly in production, either automating them post-deployment or on a schedule. Where to automate tests largely depends on your goals, the impact you’re willing to have on your deployment frequency, and how closely your staging environments mirror production.

The difference between exploratory and verification testing

There are two kinds of resilience testing: exploratory and verification. It’s common to lump the two together, but they have different goals and approaches.

Exploratory testing involves venturing into the unknown to better understand our systems. This is classic Chaos Engineering experimentation where the primary objective is learning about our systems. When run in production, especially for the first time, these often involve higher risks, such as unintended outages, direct customer impact, and poorly understood consequences on other services.

We don’t advise automating exploratory tests, because, by definition, they have unknown outcomes. It’s best to start exploring with manual testing and organized Gamedays. Exploratory testing is often performed in lower environments to start, but can be performed in production with the right safeguards, such as limiting the blast radius of an experiment to non-critical services.

Verification testing focuses on scenarios where we know what failure looks like and have established ways to mitigate it. We’re simply looking to verify our systems are working as expected.

Gremlin can absolutely help automate verification testing. In fact, we highly recommend it and make it easy to do. With verification, you’re testing your systems against known, expected failures, so the risk of unintended or unexpected consequences is typically lower. As a safety precaution, some teams start by running verification tests in lower environments, but there is more urgency to start testing in production. Testing in production is the only way to validate both our system's logic and its current operational configuration with 100% certainty.

The cost of testing in your pipeline is speed

A key input to the conversation around where to automate testing is how much latency you’re willing to introduce in your pipeline. It can take up to two hours to run the full suite of resiliency tests that are included in Gremlin’s default recommended Reliability Management test suite—and five to six hours to run the extensive test suites common with Gremlin’s most mature users. That’s because it takes time to see how systems respond to tests, such as scaling up or down and failing over. And these tests shouldn’t be run in parallel (unless by design), so they must be sequenced out. This time doesn’t include the time needed to resolve any issues discovered during these tests.

The bottom line: if you’re releasing multiple times per day or even multiple times per week, it may not be possible to test and mitigate issues in every release without significantly slowing down your pipeline.
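As a rough illustration of that arithmetic, the sketch below sums the durations of tests run one after another and compares the total with the average gap between releases. The test names, their durations, and the helper functions are hypothetical values chosen for illustration, not Gremlin defaults, and time spent fixing discovered issues is left out.

```python
from datetime import timedelta

# Hypothetical test durations; real values depend on your suite and systems.
SUITE = {
    "cpu_scaling": timedelta(minutes=30),
    "zone_failover": timedelta(minutes=45),
    "dependency_latency": timedelta(minutes=25),
    "host_shutdown": timedelta(minutes=20),
}


def suite_duration(suite: dict[str, timedelta]) -> timedelta:
    """Total wall-clock time when tests run sequentially, not in parallel."""
    return sum(suite.values(), timedelta())


def fits_release_cadence(suite: dict[str, timedelta], releases_per_day: float) -> bool:
    """True if the whole suite can finish before the next release starts."""
    gap_between_releases = timedelta(days=1) / releases_per_day
    return suite_duration(suite) <= gap_between_releases


if __name__ == "__main__":
    print(suite_duration(SUITE))            # 2:00:00
    print(fits_release_cadence(SUITE, 1))   # True
    print(fits_release_cadence(SUITE, 16))  # False
```

With a two-hour suite, one release per day leaves plenty of room, while 16 releases per day leaves only 90 minutes between releases, so the suite cannot finish before the next one begins.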

Three approaches to automated testing

We see three primary approaches to automating resilience testing in any given organization’s SDLC. Ultimately, it’s about striking the right balance between pipeline latency, safety, and coverage.

1. Gating release candidates on running tests

Catching defects before production is often the goal with any type of testing, and if you have sufficient QA cycle time, gating promotion can work. But as noted above, resilience tests can take significant time, so there’s a tradeoff between safety and time to delivery.

Gating releases also comes with one major downside: it can build false confidence when tests aren’t replicated in production. Testing in pre-production environments isn’t the same as testing in production, because the underlying infrastructure and network topology are nearly impossible, and very expensive, to replicate. You can mitigate this risk by augmenting your automated testing with periodic tests in production when possible.

Recommendation: Some teams start here and run resilience tests alongside integration and performance tests. This can work provided they are deploying infrequently enough that this doesn’t slow down delivery.

2. Running tests after production deployments

Organizations that deploy very frequently (for example, multiple times a day) often find it more practical to run these tests after each production deployment.

Because risks are exposed in a controlled manner, identifying them after deployment is an acceptable trade-off: you'll typically have monitoring in place and time to either validate quickly or identify unintended impact. Gremlin uses this monitoring to halt running tests with minimal to no impact on customers.
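The general idea of a monitoring-based halt condition can be sketched as follows. This is not Gremlin's implementation or API: the `run_with_halt_condition` function, the error-rate samples, and the 5% threshold are all assumptions made for illustration, and `halt` stands in for whatever mechanism your tooling provides to stop a running test.

```python
from typing import Callable, Iterable


def run_with_halt_condition(
    error_rates: Iterable[float],
    halt: Callable[[], None],
    max_error_rate: float = 0.05,
) -> str:
    """Watch error-rate samples taken from monitoring while a test runs.

    Calls `halt` and returns "halted" as soon as a sample exceeds the
    threshold; returns "completed" if every sample stays within it.
    """
    for rate in error_rates:
        if rate > max_error_rate:
            halt()
            return "halted"
    return "completed"


if __name__ == "__main__":
    halt_calls = []
    samples = [0.01, 0.02, 0.09, 0.01]  # hypothetical monitoring samples
    result = run_with_halt_condition(samples, halt=lambda: halt_calls.append(True))
    print(result, halt_calls)  # halted [True]
```

Stopping at the first sample over the threshold keeps customer impact short, at the cost of occasionally halting on a transient spike.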

Resilience testing is different from many other forms of testing in that it significantly varies based on your underlying infrastructure. Organizations that test post-deployment should be aware of changes to network topology or underlying infrastructure that are decoupled from releases, as these changes introduce risks asynchronous from releases.

Recommendation: Watch your deployment frequency to ensure each deployment has time to run through an entire test suite. Switch to scheduling if deployments become too frequent or you are making out-of-band infrastructure changes.

3. Scheduling tests at regular intervals

Resilience is impacted by both new software releases and changes to the infrastructure and network topology. These changes are dynamic, are often beyond our control, and can happen without warning. Regularly scheduled testing is the most effective way to continuously detect risks introduced by both software releases and infrastructure changes. Scheduling tests has another advantage: because testing is decoupled from the release pipeline, there’s no impact on delivery velocity.

Recommendation: Manually run tests once, then automate tests on a weekly schedule so that regressions can be quickly discovered and resolved.

Tradeoffs for each automation strategy

Strategy	Pros	Cons
Gating release candidates on running tests	Catches some resilience risks before production; fits into existing QA/performance testing cycles	Expensive and difficult to run production-like test environments; can miss infrastructure and network-level risks; can lead to false confidence without also testing in production; slows down QA process
Running tests after production deployments	Catches some resilience risks before production; runs in production, so it can capture software, infrastructure, and network risks; fast cycle time between deployment and risk identification	Slows down deployment process if deploying faster than test cycle length; requires strong monitoring; misses risks introduced through out-of-band infrastructure and network-level changes; some risks will make it to production but should be mitigated quickly
Scheduling tests at regular intervals	Runs in production, so it can capture software, infrastructure, and network risks; decoupled from software release cycle, so no impact on time to deployment	Requires strong monitoring; some risks will make it to production but should be mitigated quickly

Additional considerations for gating releases

When organizations decide to gate releases, we typically see two more questions: should you use the score or test results as a gate, and should you test in canary builds?

Gating releases on the Gremlin score or test results

Either one can work for your organization. If using the score, we recommend a ratchet approach, where releases are only approved if they have at least the same reliability score as they had on the previous release. If using individual test results, you can compare them to past results to ensure tests that have passed continue to do so. With both of these approaches, improvements to your resiliency posture can be made over time.
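Both gates can be expressed as simple comparisons against the previous release. The sketch below is a minimal illustration under stated assumptions: the function names, the numeric score scale, and the test names are hypothetical, and how you fetch scores or results from Gremlin is left out entirely.

```python
def score_gate(previous_score: float, candidate_score: float) -> bool:
    """Ratchet: approve only if the score is at least the previous release's."""
    return candidate_score >= previous_score


def regressed_tests(previous: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Names of tests that passed on the previous release but do not pass now.

    A test missing from the candidate results is treated as a regression.
    """
    return sorted(
        name
        for name, passed in previous.items()
        if passed and not candidate.get(name, False)
    )


def results_gate(previous: dict[str, bool], candidate: dict[str, bool]) -> bool:
    """Approve only if every previously passing test still passes."""
    return not regressed_tests(previous, candidate)


if __name__ == "__main__":
    previous = {"zone_failover": True, "cpu_scaling": True, "dns_outage": False}
    candidate = {"zone_failover": True, "cpu_scaling": False, "dns_outage": True}
    print(score_gate(82, 85))                    # True
    print(score_gate(82, 79))                    # False
    print(regressed_tests(previous, candidate))  # ['cpu_scaling']
    print(results_gate(previous, candidate))     # False
```

Neither gate requires a perfect score or a fully passing suite. Each one only blocks a release that is worse than the last, which is what lets the resiliency posture improve gradually.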

Testing in canary builds

Resilience testing can also be performed during canary builds to test the resilience of your system while it is deploying, rather than the deployment itself. However, this is not a substitute for testing in production: a system in the middle of a deployment is in a different state, so you may see false positives or false negatives compared to your staging or production environments. If you deploy frequently, this can be valuable, because a deployment in progress is a legitimate state to test. Unless your systems are frequently in this state, though, starting with one of the other forms of testing is generally more valuable.

Conclusion: Each approach comes with its own pros and cons

The good news is that there’s no single right answer for where to test that fits every organization. In fact, many organizations will find that a combination of approaches is the best fit for them. The important thing is to get started where you can, and evolve your practice over time as it matures. With Gremlin, you can integrate resiliency testing where it makes sense, whether that’s through the Gremlin API, CI/CD integrations for Jenkins and GitHub Actions, or scheduling capabilities.

To dig into this topic further, we recommend the three posts from Sam Rossoff, a Principal Engineer at Gremlin: The two kinds of failure testing, The case for Fault Injection testing in Production, and Fault Injection in your release automation.

And if you’d like to discuss automating testing in your organization, get in touch with us to have a demo in your environment.
