Failure Flags


Gremlin Failure Flags lets you run Chaos Engineering experiments and reliability tests on serverless workloads, containers, and similar managed environments. Just like feature flags, Failure Flags let you perform experiments on specific parts of your services and applications with minimal impact to your application code and no performance impact when disabled. Failure Flags are safe to deploy in your application and will default to disabled when you have no actively running experiments.
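
To make the feature-flag analogy concrete, here is a minimal sketch of the pattern in Python. It is not the Gremlin SDK: the `failure_flag` function, the `ACTIVE_EXPERIMENTS` dictionary, and the `FAILURE_FLAGS_ENABLED` variable name are all assumptions made for illustration. It only shows the behavior described above: a flag does nothing unless it is explicitly enabled and an experiment currently targets it.

```python
import os
import time

# Hypothetical in-memory stand-in for the set of running experiments.
# In a real deployment this information would not live in a module-level dict.
ACTIVE_EXPERIMENTS = {}


def failure_flag(name, labels=None, env=None):
    """Return True if a fault was injected, False if the flag was a no-op."""
    env = os.environ if env is None else env

    # Disabled by default: without an explicit opt-in the flag does nothing.
    if env.get("FAILURE_FLAGS_ENABLED", "").lower() not in ("1", "true"):
        return False

    # No running experiment for this flag: still a no-op.
    experiment = ACTIVE_EXPERIMENTS.get(name)
    if experiment is None:
        return False

    # Only affect calls whose labels match every key in the selector.
    labels = labels or {}
    selector = experiment.get("selector", {})
    if any(labels.get(key) != value for key, value in selector.items()):
        return False

    # Inject the configured fault: optional latency, then an optional error.
    time.sleep(experiment.get("latency_ms", 0) / 1000.0)
    if "exception" in experiment:
        raise RuntimeError(experiment["exception"])
    return True
```

Because every early return leaves the caller untouched, a call such as `failure_flag("checkout", {"customer": "test-1"})` can stay in the code path permanently and only has an effect while a matching experiment is running.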


Failure Flags is an application-level fault injection tool. Its use cases cover simulating or realizing failures that either have impact at the application level or target application data. These typically represent the bulk of the issues teams see day to day, such as:

  • Incorrect or corrupt data
  • Customer-specific failures
  • Lock-contention on hot data
  • Breaking API changes
  • Unexpected API responses
  • Partial service failures
  • Message double-delivery or ordering issues

Beyond testing for these issues, Failure Flags can help you:

  • Test observability and alarm configuration
  • Exercise automated recovery systems
  • Isolate experiments in any environment to well-known users or customers

Architecture and Performance Impact

Failure Flags integrates with your applications, so it is critical that you can be confident that adopting it will not adversely affect the availability or performance of those applications outside of experiment parameters. Failure Flags, like other Gremlin products, is designed to fail safely.

Failure Flags is made up of three major components: the Gremlin SaaS API, the Failure Flags Sidecar or Lambda Extension, and one of the SDKs. No impact to your applications is possible unless all three are configured correctly at runtime. Working backwards from your application:

  1. The SDK must be integrated with your application and explicitly enabled via environment variable.
  2. The sidecar or extension must be deployed with your application and use a common localhost interface.
  3. The sidecar or extension must be enabled and provided with current credentials to the Gremlin API via environment variables or other configuration options.
  4. The sidecar or extension must have a stable network route to the Gremlin API and be provided with configuration required to traverse corporate proxies.
  5. Your company Gremlin account must have Failure Flags enabled.
  6. Your team must have created and run an experiment.
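
The checklist above can be read as a chain of preconditions that must all hold before any fault reaches your application. The sketch below models that chain in Python. The `RuntimeState` class and its field names are hypothetical and exist only to mirror the six steps; nothing here is part of a Gremlin SDK or API.

```python
from dataclasses import dataclass


@dataclass
class RuntimeState:
    """Hypothetical snapshot of the six preconditions, in checklist order."""
    sdk_enabled: bool = False              # step 1: SDK integrated and enabled
    sidecar_on_localhost: bool = False     # step 2: sidecar shares localhost
    sidecar_has_credentials: bool = False  # step 3: sidecar enabled with credentials
    api_reachable: bool = False            # step 4: network route to the API
    account_enabled: bool = False          # step 5: account has the feature enabled
    experiment_running: bool = False       # step 6: an experiment was created and run


def first_blocker(state):
    """Return the name of the first unmet precondition, or None if all hold."""
    # Dataclass instances keep their fields in declaration order.
    for field_name, value in vars(state).items():
        if not value:
            return field_name
    return None


def can_inject(state):
    """Faults are only possible when every precondition is satisfied."""
    return first_blocker(state) is None
```

Every field defaults to `False`, so a freshly constructed `RuntimeState()` reports `sdk_enabled` as its first blocker and `can_inject` returns `False`. This mirrors the fail-safe default: a missing piece can only prevent an experiment.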

Any misconfiguration, omitted configuration, or service outage can only prevent experimentation, which minimizes the risk of adverse impact to your applications. In addition, the Failure Flags SDKs are published under the Apache-2.0 license, and you're encouraged to audit those libraries as you see fit. Adopting Failure Flags will in no way lock your applications in to Gremlin.


  • It is safe to add Failure Flags to your code and leave them there
  • It is easy to prevent experimentation in any environment
  • The SDKs are licensed under Apache-2.0
  • Adding Failure Flags will not create lock-in

Supported Platforms

Failure Flags can run on any platform or environment that supports multiple processes with shared localhost. These include most if not all Kubernetes platforms, AWS Lambda, AWS ECS, virtual machines, container platforms with shared network namespaces, and many others (like your laptop). Gremlin currently provides support for:

  • AWS Lambda
  • Kubernetes

Gremlin also provides executables and a variety of packages that can be used on other platforms, but we cannot provide support for those at this time.

Supported Languages and Frameworks

The Failure Flags SDKs are language-specific and released under the Apache-2.0 license. These include support for:

  • JavaScript / TypeScript / NodeJS
  • Python
  • Go
  • Java

Each of these is a minimal SDK and supports similar features and semantics where possible.

Preparing and Next Steps

To prepare for the Failure Flags demo, reach out to your Cloud or Platform Engineering team to gather the following information:

  • If running in a VPC or private network, is a proxy server needed for a Lambda function to communicate with the Internet? If so, please provide the proxy address.
  • Are firewall rules or network security changes required for the application or service to connect with the Gremlin API?
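
One way to answer both questions empirically is to run a small reachability check from the same network as your workload. The following is a minimal sketch using only the Python standard library. The helper names `make_opener` and `can_reach` are hypothetical, and the endpoint and proxy address in the usage note are assumptions; substitute the values your team provides.

```python
import urllib.error
import urllib.request


def make_opener(proxy_url=None):
    """Build a URL opener, routed through an explicit proxy when one is given."""
    if proxy_url:
        handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
        return urllib.request.build_opener(handler)
    # Without an explicit proxy, the default opener honors the usual
    # proxy environment variables if they are set.
    return urllib.request.build_opener()


def can_reach(url, proxy_url=None, timeout=5.0):
    """Return True if an HTTP response of any status comes back from url."""
    opener = make_opener(proxy_url)
    try:
        opener.open(url, timeout=timeout).close()
        return True
    except urllib.error.HTTPError:
        # The server answered with an error status, so the network path works.
        return True
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, or blocked route.
        return False
```

For example, `can_reach("https://api.gremlin.com", proxy_url="http://proxy.internal:3128")` tests the route through a proxy, where the proxy address is a placeholder. A `False` result suggests that a firewall rule or proxy setting still needs attention.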

See the following pages to get started:

  1. Adding Failure Flags to your Code
  2. Deploying on Kubernetes
  3. Deploying on AWS Lambda
  4. Deploying on AWS ECS
  5. Running experiments using Failure Flags