Fault Injection in your release automation

One of the real successes of the Agile software development movement has been the push toward regular, frequent deployments. This has manifested as build and deployment automation and the general adoption of CI/CD. As engineers automate more of their software release lifecycle, an important question is how to automate Quality Assurance, which includes resilience testing and, more specifically, Fault Injection.

Compared with more traditional forms of testing (e.g. unit tests), Fault Injection experiments come with significant latency and concurrency constraints. There is always a push to make tests fast and parallel, but when we test the performance and resilience of a system as a whole, we have to make large sacrifices on both fronts. Fault Injection experiments run much longer (our resilience mechanisms often operate on the order of minutes) and can only really be run one at a time against a given piece of software (lest we obscure which experiment caused a failure or produce false positives). As a result, Fault Injection fits into your automated processes in different places depending on your release goals. You can't just add a bunch of Fault Injection tests to your CircleCI job and call it a day.
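The one-at-a-time constraint can be enforced by the automation itself. The following is a minimal sketch, assuming a hypothetical `ExperimentRunner` class and experiments modeled as plain Python callables; it is not tied to any particular Fault Injection tool.

```python
import threading


class ExperimentRunner:
    """Allows at most one running experiment per target (hypothetical helper)."""

    def __init__(self):
        self._guard = threading.Lock()  # protects the dictionary below
        self._locks = {}                # target name -> lock for that target

    def run(self, target, experiment):
        """Run `experiment` (a callable) unless `target` is already under test.

        Returns (True, result) if the experiment ran, or (False, None) if it
        was skipped because another experiment currently holds the target.
        """
        with self._guard:
            lock = self._locks.setdefault(target, threading.Lock())
        if not lock.acquire(blocking=False):
            return (False, None)
        try:
            return (True, experiment())
        finally:
            lock.release()
```

Skipping rather than queueing is a design choice in this sketch: a skipped experiment can be picked up by the next scheduled run, while a queued one could start at an unexpected time.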

There are three major strategies we see succeed for integrating Fault Injection into your release automation, and the right one depends strongly on your organization’s specific goals:

Scheduling experiments at regular intervals (decoupled from release cadence)
Running experiments after production deployments
Gating release candidates on experiment results

Scheduling

While new code is often the major source of defects in software, when we look at resilience we have to consider not just software releases but also changes to the infrastructure and network topology. These systems are inherently dynamic: they typically change without warning and sometimes (in the case of the broader internet's network topology) beyond our control. As a result, it's typically necessary to run your Fault Injection tests on a regular basis in order to identify risks caused by new deployments or by factors beyond your control.

Many organizations do this on a yearly or quarterly basis, but that's often not frequent enough. Instead, the typical recommendation is to exercise each failure mode once and then rerun those experiments weekly, so that regressions are discovered and resolved proactively.
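As a rough illustration of the weekly cadence, a scheduler only needs to know when each experiment last ran. This is a sketch under stated assumptions: the function name `due_experiments` and the experiment names are hypothetical, and a real setup would more likely rely on a scheduler built into your tooling or on cron.

```python
from datetime import datetime, timedelta

WEEKLY = timedelta(weeks=1)


def due_experiments(last_runs, now, interval=WEEKLY):
    """Return names of experiments whose last run is at least `interval` old.

    `last_runs` maps an experiment name to the datetime of its last run,
    or to None if it has never been run.
    """
    due = []
    for name, last_run in sorted(last_runs.items()):
        if last_run is None or now - last_run >= interval:
            due.append(name)
    return due


# Hypothetical experiment names and timestamps, for illustration only.
now = datetime(2024, 1, 15, 9, 0)
last_runs = {
    "blackhole-database": datetime(2024, 1, 8, 9, 0),  # exactly a week ago
    "cpu-pressure": datetime(2024, 1, 12, 9, 0),       # three days ago
    "latency-to-cache": None,                          # never run
}
print(due_experiments(last_runs, now))
# prints ['blackhole-database', 'latency-to-cache']
```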

Post deployment

Because many Chaos Engineering experiments take a long time to run, it doesn't usually make sense to run Fault Injection as a gate in your continuous integration pipeline if you deploy very often (for example, multiple times a day). Instead, many organizations find benefit in running these experiments after each production deployment.

Since Fault Injection typically exposes existing risks in a controlled manner, identifying them after deployment is an acceptable trade-off: you'll typically have time to remediate the risk. Environments that push to production this frequently usually have excellent monitoring in place to validate releases quickly, and that same monitoring can be used to detect impact and halt a running Fault Injection experiment with minimal or no impact to customers. Such an environment is typically ideal for testing in production, which makes an automated post-deployment test a solid solution.
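The idea of using monitoring to halt an experiment can be sketched as a loop that checks a health signal after each stage. Everything here is an assumption for illustration: `run_with_halt`, the error-rate signal, and the 5% threshold are hypothetical, and a real experiment would read from your own monitoring system.

```python
def run_with_halt(stages, read_error_rate, rollback, max_error_rate=0.05):
    """Apply fault stages in order, halting when monitoring shows impact.

    stages          -- callables that each inject a (typically larger) fault
    read_error_rate -- callable returning the current error rate (0.0 to 1.0)
    rollback        -- callable that removes all injected faults
    """
    completed = 0
    halted = False
    try:
        for stage in stages:
            stage()
            completed += 1
            if read_error_rate() > max_error_rate:
                halted = True
                break
    finally:
        rollback()  # always clean up, even if a stage raised an exception
    return {"halted": halted, "completed_stages": completed}
```

Running the rollback in a `finally` block means faults are removed whether the experiment finishes, is halted, or fails partway through.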

Gating releases

While identifying problems prior to release is the goal of most continuous integration systems, it's important to consider the trade-offs of running Fault Injection experiments there. These tests can significantly increase the latency of continuous integration and deployment systems. However, for organizations that already have long QA cycles, gating promotion on the results of Fault Injection experiments is a credible option. It's important that this trade-off (time to delivery vs. safety) is explicitly called out and evaluated against other organizational goals.
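A promotion gate can be as simple as a check that every required experiment reported a pass. The sketch below assumes a hypothetical `gate_release` function and a results dictionary produced earlier in the pipeline; it is one possible shape, not a prescribed one.

```python
def gate_release(required, results):
    """Decide whether a release candidate may be promoted.

    required -- names of experiments that must pass
    results  -- maps experiment name to True (passed) or False (failed)

    Returns (promote, blocking), where `blocking` lists the experiments
    that failed or never reported a result.
    """
    blocking = [name for name in required if results.get(name) is not True]
    return (not blocking, blocking)
```

Treating a missing result as blocking is deliberate here: an experiment that never ran gives no evidence of safety. A pipeline step could turn the first return value into its exit status so that a blocked candidate fails the job.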

Testing during deployment

Fault Injection can also take place during a deployment, to test the resilience of your system while it is deploying rather than the deployment itself. This is not unheard of for deployment strategies such as Blue/Green or Canary deployments. During these times the system under test is in a hybrid state that doesn't accurately represent the state before or after the deployment. This means that you may pass (or fail) tests that would turn out differently outside of a deployment.

If you deploy frequently, this can be valuable: the mid-deployment state is one your users are regularly exposed to, so it needs to be resilient as well. However, unless your systems are very frequently in this state, most organizations get more value from starting with one of the other forms of testing, because testing during deployment adds a similar cost on top of them.
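One rough way to judge how often your systems are in this state is to estimate the share of each day spent mid-deployment. The helper below is a back-of-the-envelope sketch with a hypothetical name, and it assumes deployments do not overlap.

```python
def fraction_of_time_deploying(deploys_per_day, minutes_per_deploy):
    """Share of each day (0.0 to 1.0) spent in a mid-deployment state.

    Assumes deployments run one after another and never overlap.
    """
    minutes_per_day = 24 * 60
    return min(1.0, deploys_per_day * minutes_per_deploy / minutes_per_day)
```

For example, 12 deployments a day at 20 minutes each is 240 of 1,440 minutes, or about one sixth of the day, whereas a single weekly deployment of the same length is a small fraction of one percent.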

