Fault Injection is a tool, and like any tool, operators can employ it in a variety of ways. Most uses fall into one of two categories:

  • Exploratory
  • Validation

Exploratory testing

Most teams that are just getting started with Fault Injection, or are unfamiliar with it, use it to explore the failure modes in their software. This often involves injecting various faults and observing how the software responds and how that response shows up in critical metrics.

This sort of Exploratory failure testing is often characterized by a lack of foreknowledge of what failure looks like in the system under test. The purpose is not to be reckless, but to probe a system and surface the unknowns in how it responds to external pressures. In this respect it is not that different from other sorts of resilience testing, such as Load Testing. This kind of testing helps teams understand components that may be part of complex systems, and identify what to look for when that software is struggling or collapsing.

This information can be used to prioritize resilience improvements, prepare operators for on call, or document which metric changes correspond to which failure modes. All of this helps teams operationalize their software.
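The inject-and-observe loop described in this section could be sketched roughly as follows. This is a minimal illustration, not a real Fault Injection tool: `FaultInjector`, `InjectedFault`, and `observe` are hypothetical names, and the "metrics" are just an error rate and a maximum latency collected in-process.

```python
import random
import time
from dataclasses import dataclass, field
from typing import Callable


class InjectedFault(Exception):
    """Raised in place of a real dependency failure (hypothetical)."""


@dataclass
class FaultInjector:
    """Hypothetical wrapper that adds latency and errors to a call."""
    error_rate: float = 0.0        # probability of raising InjectedFault
    added_latency_s: float = 0.0   # delay added before every call
    rng: random.Random = field(default_factory=random.Random)
    sleep: Callable[[float], None] = time.sleep  # replaceable in tests

    def call(self, fn, *args, **kwargs):
        if self.added_latency_s > 0:
            self.sleep(self.added_latency_s)
        if self.rng.random() < self.error_rate:
            raise InjectedFault("injected dependency failure")
        return fn(*args, **kwargs)


def observe(injector, fn, attempts):
    """Run fn through the injector and summarize what an operator would see."""
    errors = 0
    latencies = []
    for _ in range(attempts):
        start = time.perf_counter()
        try:
            injector.call(fn)
        except InjectedFault:
            errors += 1
        latencies.append(time.perf_counter() - start)
    return {
        "attempts": attempts,
        "error_rate": errors / attempts,
        "max_latency_s": max(latencies),
    }
```

In an exploratory session, an operator might run `observe` at a few different `error_rate` and `added_latency_s` settings and note which dashboards or alerts move in response.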

Validation testing

More mature organizations, or programs, often have a different motivation for employing Fault Injection. These groups are attempting to attest to the resilience of a wide array of systems by validating which ones are resilient to which failures.

This is especially common for organizations looking to grow or scale their Availability Programs. Here, failure modes are much better understood, and systems can be relied upon to have strong alerting and observability. This sort of Validation failure testing shares more in common with auditing than with greenfield development. Systems are systematically tested for a set number of known failure modes on a regular cadence, either on a schedule or as part of deployment systems (e.g. CI/CD).

Where the purpose of Exploratory testing is to help individual teams or operators gain confidence in, and an understanding of, various pieces of software, this sort of Validation is often aimed at broader organizations or cross-organizational teams (such as SREs) who are trying to obtain a better understanding of myriad systems. In this way, it shares more in common with traditional integration and acceptance testing.
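One way to picture this audit-style approach is as a matrix of systems against known failure modes, each cell recording a pass or a fail. The sketch below assumes a hypothetical `FaultCheck` type whose `run` callable injects a fault into a named system and reports whether the system stayed within expectations; how that is actually done depends entirely on the tooling in use.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class FaultCheck:
    """One known failure mode and the check that decides pass or fail."""
    failure_mode: str
    # Hypothetical callable: injects the fault into the named system and
    # returns True if the system met its resilience expectations.
    run: Callable[[str], bool]


def validate(systems: List[str], checks: List[FaultCheck]) -> Dict[str, Dict[str, bool]]:
    """Run every known failure mode against every system and record the result."""
    report: Dict[str, Dict[str, bool]] = {}
    for system in systems:
        report[system] = {}
        for check in checks:
            try:
                report[system][check.failure_mode] = bool(check.run(system))
            except Exception:
                # A check that crashes is recorded as a failed validation.
                report[system][check.failure_mode] = False
    return report


def failures(report: Dict[str, Dict[str, bool]]) -> List[Tuple[str, str]]:
    """List the (system, failure_mode) pairs that did not pass."""
    return [
        (system, mode)
        for system, results in report.items()
        for mode, passed in results.items()
        if not passed
    ]
```

The resulting report is the kind of artifact a cross-organizational team could review on each cadence to see which systems are resilient to which failures.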

Integrating Fault Injection with deployment tools

While integrating with deployment tools can be a high-leverage way for teams to drive acceptance of Fault Injection testing, it's worth calling out the trade-offs of doing so. Fragility is not always caused by deployments of the code in question. Changes to dependencies, consumers, infrastructure, and even the network itself can all undermine resilience mechanisms.

Systems that are deployed infrequently (e.g. less often than once a week) may miss serious regressions if they rely solely on deployment mechanisms to drive their Fault Injection testing. Additionally, CI/CD systems exist to deliver software quickly, while for many systems a complete suite of Fault Injection tests can take hours to run. This can put the two at odds.

Depending on the organization, some teams may find it preferable to run these sorts of Validation tests on a weekly cron schedule instead, and to run only the most critical tests as part of deployments. This lets them retain confidence that regressions haven't slipped through, without negatively impacting their development goals.
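The split between deployment-time and scheduled runs might be expressed as a simple selection step. In this sketch the `critical` flag, the `est_minutes` field, and the `"deploy"` and `"weekly"` trigger names are all assumptions made for illustration, not part of any particular CI/CD system.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FaultTest:
    name: str
    critical: bool    # hypothetical flag: must pass before every deploy
    est_minutes: int  # rough expected runtime


def select_tests(tests: List[FaultTest], trigger: str) -> List[FaultTest]:
    """Pick tests for a trigger: 'deploy' runs critical only, 'weekly' runs all."""
    if trigger == "deploy":
        return [t for t in tests if t.critical]
    if trigger == "weekly":
        return list(tests)
    raise ValueError(f"unknown trigger: {trigger!r}")


def total_minutes(tests: List[FaultTest]) -> int:
    """Estimate how long a selected set of tests will take."""
    return sum(t.est_minutes for t in tests)
```

A deployment pipeline would call `select_tests(suite, "deploy")` and stay fast, while a weekly job (for example, one scheduled with a cron expression such as `0 3 * * 1`) would call `select_tests(suite, "weekly")` and absorb the multi-hour runtime.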

Sam Rossoff
Principal Software Engineer