
Failure Flags

Failure Flags lets you run Chaos Engineering experiments, Scenarios, and reliability tests on serverless workloads, containers, and similar managed environments at the application level. Much like how feature flags let you selectively enable or disable application features, Failure Flags lets you test specific areas of your applications and services while providing complete control, safety, and zero performance impact when disabled.

Failure Flags is also extensible, letting you run custom experiments and tests on your applications. You can use Gremlin’s built-in tests, bring your own, or combine them for comprehensive reliability testing.

You can use Failure Flags to simulate real-world failures, such as:

  • Incorrect or corrupted data
  • Errors impacting specific customers or users of your application
  • Lock contention on hot data
  • Breaking API changes
  • Unexpected API responses
  • Partial service failures
  • Message double-delivery or ordering issues

In addition to failure testing, Failure Flags can help:

  • Test observability and alarm configurations
  • Exercise automated recovery systems
  • Isolate experiments in any environment by customer, user, or any other attribute of your application

Tip
You can find examples of how to deploy, configure, and manage Failure Flags in our GitHub repository.

Important note on safety

Like all Gremlin products, Failure Flags is designed for safety. When you’re not running experiments, Failure Flags has no adverse effect on the availability and performance of your applications. Failure Flags fails safely: any misconfigurations, configuration omissions, or Gremlin service outages will only prevent experiments from running, and will not impact your applications.

Further, the Failure Flags SDKs and examples are published under the Apache-2.0 license. You're encouraged to audit those libraries as you see fit. Adopting Failure Flags will not lock your applications into Gremlin.

Supported platforms, languages, and frameworks

Failure Flags runs on any platform or environment that supports multiple processes with a shared localhost. These include most Kubernetes platforms, AWS Lambda, AWS ECS, virtual machines, container platforms with shared network namespaces, and many others. Gremlin officially supports:

For more advanced use cases, Gremlin also provides language-specific SDKs that are released under the Apache-2.0 license:

Note
Gremlin provides executables and packages that can be used on other platforms, but we do not provide support for those.

Architecture and implementation

Deploying Failure Flags to your environment involves three main steps:

  1. Configuring Failure Flags for your environment. This involves configuring your environment with the appropriate environment variables (or configuration options).
  2. Deploying the Failure Flags sidecar or SDK. Gremlin provides two options for deploying Failure Flags: proxying network traffic through a sidecar container (recommended) or integrating the Failure Flags SDK into your code.
    1. Proxying network traffic through a sidecar (recommended). Also called "Failure Flags by proxy," this approach lets you deploy Failure Flags without making any code changes. It works by proxying your application's network traffic through a sidecar container. Gremlin automatically creates fault injection points based on this network traffic, and you can then run experiments on those points.
    2. Integrating an SDK into your application. For more advanced use cases, we provide SDKs for integrating Failure Flags into your application code. This lets you extend Failure Flags with your own custom experiments while giving you full control and customization over where and how experiments affect your application. However, this requires changes to your application’s code.
  3. Running a Failure Flags experiment. Once Gremlin has registered your application instance and its Failure Flags, you can begin running Failure Flags experiments.

Without all three of these steps configured correctly, it's impossible for Gremlin to make any impact on your application. Even if you've deployed and configured the Failure Flags sidecar, Failure Flags will no-op and add no performance impact to your application unless an experiment is actively running.
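
As a rough illustration of this no-op behavior, the sketch below models a fault injection point in plain Python. It is not the Gremlin SDK and does not reflect its API: `invoke_failure_flag`, `fetch_active_experiment`, and the experiment dictionary shape (`labels`, `latency_ms`) are hypothetical names invented for this example.

```python
import time
from typing import Callable, Optional

# Hypothetical experiment shape: labels to match plus a latency to inject.
Experiment = dict


def invoke_failure_flag(
    name: str,
    labels: dict,
    fetch_active_experiment: Callable[[str], Optional[Experiment]],
    enabled: bool = True,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if a fault was injected, False if the call was a no-op."""
    if not enabled:
        return False  # disabled by configuration: do nothing
    try:
        experiment = fetch_active_experiment(name)
    except Exception:
        return False  # fail safe: a lookup error never reaches the application
    if experiment is None:
        return False  # no running experiment: no-op
    selector = experiment.get("labels", {})
    if any(labels.get(key) != value for key, value in selector.items()):
        return False  # the experiment targets different traffic
    # Inject the fault. This sketch only models added latency.
    sleep(experiment.get("latency_ms", 0) / 1000.0)
    return True
```

In this sketch, every path except a matching, actively running experiment returns immediately. That mirrors the idea that misconfiguration or an unreachable service can only prevent an experiment, never affect the application.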

Takeaways

  • It is safe to add Failure Flags to your code and leave them there.
  • It is easy to prevent experimentation in any environment.
  • Using Failure Flags by proxy requires no code changes and can be enabled or disabled using environment variables.
  • The SDKs are licensed under Apache-2.0.
  • Adding Failure Flags will not create vendor lock-in.

Preparing and Next Steps

Before deploying, we recommend taking the following steps:

  1. Identify the application you will instrument. Consider which of your applications you want to start with. If you’re not sure, start with the one most essential to your business, or consider the use cases listed above.
  2. Check firewalls and network routes: The Failure Flags sidecar needs to communicate with Gremlin’s API servers. Make sure that wherever you deploy the sidecar, it can send outbound network requests to api.gremlin.com over port 443 (HTTPS) and receive responses.
  3. Configure any proxies: If that network uses an outbound HTTP or HTTPS proxy, gather its URL, any credentials, and any certificate material it uses. The certificate material should be PEM encoded.
  4. (When using an SDK) Add new library dependencies to your package manager: You will add a library dependency to your project. If your organization uses an internal package/library cache, make sure that you've included the Failure Flags SDK for your application.
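
The firewall check in step 2 can be approximated with a short script run from the same network location where the sidecar will be deployed. This is a minimal sketch using only the Python standard library. The helper name `can_reach` is made up for this example, and it only tests a direct TCP connection, so it will not reflect connectivity that depends on an outbound proxy.

```python
import socket


def can_reach(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a direct TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


# Example usage (requires network access):
#   can_reach("api.gremlin.com", 443)
```

A `True` result shows only that the TCP handshake succeeded. It does not verify TLS or Gremlin credentials.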

See the following pages to get started:

  1. Adding the Failure Flags SDK to your code or using Failure Flags by proxy
  2. Deploying on Kubernetes
  3. Deploying on AWS Lambda
  4. Deploying on AWS ECS
  5. Running experiments using Failure Flags
