The case for Fault Injection testing in Production

Many organizations that are looking to introduce Fault Injection as a testing technique start with non-production environments, but don't always go back and reconsider that choice as they mature beyond initial assessment. However, there's a strong case for running these tests in your live systems. It's important to consider the trade-offs when choosing between production and non-production environments, as that choice can have far-reaching impacts on the efficacy and cost of improving the resilience of software.

As we've discussed previously, testing tends to look very different depending on the maturity of your operational organization and how familiar that organization is with what failure looks like. This split between Exploratory and Verification patterns of testing has immediate implications for the trade-offs associated with running those tests in production vs. non-production environments.

Exploratory testing and non-production environments

Most organizations just starting to improve their resilience are effectively using Fault Injection as an exploratory tool to understand how their system behaves in various failure scenarios. This means that they don't always have the monitoring in place for edge cases, or enough familiarity with the monitoring they have to identify which failure mode their software might be in. In large distributed systems and service-oriented architectures, this software may never have been tested in the organization. This uncertainty can lead organizations to prefer to do this initial testing in non-production environments. This is a totally reasonable choice, but they need to be aware of the shortcomings of this approach.

Test and non-production environments have significant differences from Production environments, some of which are intentional, while others may be due to a lack of resourcing. Non-production environments typically lack the complexity and scale of production environments. This means that they may fail in differing ways, or not have critical components which need testing. They also often differ from production in terms of configuration, hardware, and software setups. These differences can result in false positives, where a system appears resilient in a non-production environment, but fails in production due to the variations.
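One way to make these differences visible before trusting a non-production result is to compare the settings of the two environments directly. The following is a minimal sketch, not a feature of any particular tool: the `config_drift` helper and the example keys (`replicas`, `timeout_ms`, `cache`) are hypothetical, and it assumes each environment's configuration can be exported as a flat dictionary.

```python
def config_drift(prod: dict, nonprod: dict) -> dict:
    """Return every key whose value differs between the two environments,
    including keys that exist in only one of them.

    Hypothetical helper for illustration; assumes flat dicts with string keys.
    """
    missing = object()  # sentinel, so a real None value is not mistaken for absence
    drift = {}
    for key in sorted(prod.keys() | nonprod.keys()):
        p = prod.get(key, missing)
        n = nonprod.get(key, missing)
        if p != n:
            drift[key] = (
                "<absent>" if p is missing else p,
                "<absent>" if n is missing else n,
            )
    return drift


# Illustrative values only: (production value, non-production value) per key.
prod = {"replicas": 12, "timeout_ms": 500, "cache": "redis-cluster"}
staging = {"replicas": 2, "timeout_ms": 500}
print(config_drift(prod, staging))
```

Any key that shows up in the result is a place where a passing test in the non-production environment says less about Production than it appears to.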

The biggest advantage of testing in non-production environments, typically, is the lack of impact to actual customer traffic. However, this, too, is a double-edged sword. Non-production environments typically do not accurately represent the variety and volume of customer traffic that Production systems handle. Even when load testing tools are available, they may not have representative traffic blends to mimic customer behavior, or they may operate on canned data that does not cover the full range of customer use cases. While avoiding customer impact may be desirable, it also means that you are limiting your understanding of the potential impact of failures on actual customers. Without testing in Production, you might miss critical insights into how outages or service disruptions affect those end-users.

These concerns about non-production environments are often dismissed as simply gaps in testing that will eventually be covered with more organizational investment in those environments. It's important to keep in mind, however, that replicating a production-like environment for testing can be costly, both in terms of hardware and operational expenses. For many organizations, creating an exact replica is impractical or financially prohibitive. This means those gaps will likely persist far longer than the tenures of the engineers who work on them. As a result, relying solely on non-production testing can provide a false sense of security. A system that appears robust in a controlled environment may still experience failures when exposed to the realities of Production, and non-production testing may not adequately identify gaps in your mitigation strategies for production environments.

That said, it's important to understand that this form of testing is not without benefit. While being resilient to a given Fault Injection scenario in a test environment might not guarantee that resilience exists in Production, software experiencing failure in non-production environments can still help us understand how that software might fail in Production. This is exceptionally valuable when performing this sort of Exploratory testing. What's more, some customers may have concerns about regulatory compliance and operational burden when running Fault Injection in Production environments. Thus, this sort of coarser-grained testing in non-production environments is still necessary.

Verification testing and Production environments

By comparison, when doing Verification testing for well-understood failure modes, Production is an ideal environment for Fault Injection. The experience and existing mitigations dramatically reduce the concerns about impacting customer traffic. What's more, we know that these failure modes will occur regardless of whether we test them. If we are unwilling to test them in a way where we are in control of the nature and duration of the impact, then that's a strong signal that the environment is a time bomb waiting to go off.
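Being "in control of the nature and duration of the impact" can be sketched as a small experiment loop: inject a fault, watch a health signal, stop early if it degrades, and always revert. This is an illustrative sketch under stated assumptions, not the API of any real fault injection platform. The `run_experiment` function and its `inject`, `revert`, and `is_healthy` callbacks are hypothetical names that you would wire to your own tooling and monitoring.

```python
import time
from typing import Callable


def run_experiment(
    inject: Callable[[], None],
    revert: Callable[[], None],
    is_healthy: Callable[[], bool],
    max_duration_s: float,
    poll_interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Run a time-boxed fault injection with an abort condition.

    Hypothetical helper for illustration. Returns "completed" if the health
    check passed for the whole duration, or "aborted" if it failed and the
    experiment was halted early. The fault is reverted in every case.
    """
    deadline = clock() + max_duration_s  # hard upper bound on impact duration
    inject()
    try:
        while clock() < deadline:
            if not is_healthy():
                return "aborted"  # halt as soon as the signal degrades
            sleep(poll_interval_s)
        return "completed"
    finally:
        revert()  # runs on completion, abort, or an unexpected exception
```

The `clock` and `sleep` parameters exist only so the loop can be exercised with a fake clock in a test. In real use the defaults apply, and `is_healthy` would typically wrap a check against an existing alert or service-level indicator.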

When it comes to Fault Injection testing, the choice between Production and non-production environments is not an either/or decision; it's about striking a balance between safety and real-world insights. By testing Fault Injection in Production, you're embracing the edge where real-world challenges reside. You acknowledge the risks, but also recognize the potential for transformative insights and the creation of more resilient systems.

Both types of testing are needed

As an individual operator, the key takeaway is to leverage both types of testing to ensure your systems are not only robust, but also continuously evolving to meet the demands of an ever-changing landscape.

To be clear, testing in Production is the only way to truly validate the reliability of your systems. However, in order to get there, you may have to spend time testing in lower environments, and while this is valid, your eventual goal should be testing in Production.

Ultimately, it's about improving your systems' reliability, one test at a time.

