The beta version of Gremlin’s application layer failure injection solution (ALFI) has been closed. At this time the ALFI solution is deprecated and will be replaced with a better alternative once available.
Overview
Why application-level fault injection is useful
Operators think in requests
Most metrics, dashboards, and alerts that we consume are expressed in terms of requests. RPS, error rate, and latency all implicitly use a request as the unit of work. Requests are not a concept available at the infrastructure level. At that level, all we see are streams of packets with IP addresses and ports. By moving up to the application level, we can use all of the request-level metadata in constructing an attack.
Since requests can include identifiers like customer ID, device ID, country, etc., those facets may be used in constructing an attack. With that ability, it is much easier to create a small, well-defined blast radius for your attack. That, in turn, allows for much faster feedback loops and lets you discover latent problems more quickly.
Fault injection without system access
Injecting infrastructure failures requires running a process and accessing other system-level resources. In serverless environments such as AWS Lambda, Google Cloud Functions, and Azure Functions, this access is impossible. In these cases, it is necessary to include the fault-injection mechanism within the application itself. ALFI runs in the JVM as a library, so once you have integrated it into your application, you may use it in any environment.
Examples
Simulate an outage in production by creating an attack on your customer ID only. Then you can look for signs of problems when logged in as yourself, while no other users are even aware an attack is occurring.
Simulate a problem with a specific endpoint. Partial failure in distributed systems is quite common: some endpoints may be unavailable while others are working perfectly. To simulate such a scenario, you can create an attack targeted at only some endpoints and then determine how your system reacts.
Always-on failure testing. If you limit an attack to a set of devices you control, then you can run tests against those devices on a regular basis and evaluate how the user experience works when the system is degraded.
You must add the above repository to your Maven or Gradle file. Otherwise, you will encounter an error message similar to: Could not find artifact com.gremlin:[client]:pom:[version] in central (https://repo.maven.apache.org/maven2)
// If your application is hosted on AWS EC2 or Lambda, use this to integrate with AWS
// (like Parameter Store Configuration support)
implementation group: 'com.gremlin', name: 'alfi-aws', version: '0.5+'
Maven
XML
<!-- If your application is hosted on AWS EC2 or Lambda, use this to integrate with AWS
(like Parameter Store Configuration support) -->
<dependency>
<groupId>com.gremlin</groupId>
<artifactId>alfi-aws</artifactId>
<version>LATEST</version>
</dependency>
In order to authenticate to Gremlin, you must provide the following configuration values to your application.
<span class="code-class-custom">GREMLIN_ALFI_IDENTIFIER</span>: A unique identifier for the application. This will be used to distinguish all of the application instances from one another.
<span class="code-class-custom">GREMLIN_TEAM_ID</span>: The Team ID that this application belongs to. Only users in that team may conduct attacks on it.
<span class="code-class-custom">GREMLIN_TEAM_CERTIFICATE_OR_FILE</span>: Certificate for authenticating to Gremlin. See below for syntax on permissible values.
<span class="code-class-custom">GREMLIN_TEAM_PRIVATE_KEY_OR_FILE</span>: Private key for authenticating to Gremlin. See below for syntax on permissible values.
You may set these as environment variables or in a <span class="code-class-custom">gremlin.properties</span> file on the classpath. Certificates can be downloaded for each team from the Settings Page.
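The two configuration sources mentioned above can be pictured as a small lookup chain. The following is a minimal sketch using only the JDK; it is not ALFI's own resolver. The class name ConfigLookup is hypothetical, and both the precedence (environment first, then the properties file) and the idea that the properties file uses the same key names are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Map;
import java.util.Optional;
import java.util.Properties;

// Hypothetical helper, not part of the ALFI library.
final class ConfigLookup {
    private final Map<String, String> environment;
    private final Properties properties;

    ConfigLookup(Map<String, String> environment, Properties properties) {
        this.environment = environment;
        this.properties = properties;
    }

    // Loads a properties resource from the classpath; returns empty Properties if it is absent.
    static Properties loadFromClasspath(String resourceName) throws IOException {
        Properties props = new Properties();
        try (InputStream in = ConfigLookup.class.getClassLoader().getResourceAsStream(resourceName)) {
            if (in != null) {
                props.load(in);
            }
        }
        return props;
    }

    // Assumed precedence: a non-empty environment variable wins, then the properties file.
    Optional<String> resolve(String key) {
        String fromEnvironment = environment.get(key);
        if (fromEnvironment != null && !fromEnvironment.isEmpty()) {
            return Optional.of(fromEnvironment);
        }
        return Optional.ofNullable(properties.getProperty(key));
    }
}
```

In an application, this sketch could be constructed as new ConfigLookup(System.getenv(), ConfigLookup.loadFromClasspath("gremlin.properties")) and queried with resolve("GREMLIN_TEAM_ID").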
The following keys may be set to tune how ALFI operates.
<span class="code-class-custom">GREMLIN_ALFI_ENABLED</span>: If set to anything other than <span class="code-class-custom">true</span>, all functionality is turned off. This lets you deploy ALFI safely, knowing you have a simple off switch. When the functionality is off, ALFI never injects failures, makes no calls to the API, and logs nothing after configuration time.
<span class="code-class-custom">GREMLIN_REFRESH_INTERVAL_MS</span>: You may optionally provide this value to set how often the library contacts the Gremlin API. The minimum is 1000 (1 second), the maximum is 300000 (5 minutes), and the default is 10000 (10 seconds). This value determines both how quickly your application reacts to attacks being created or halted and how much network traffic the library generates.
<span class="code-class-custom">http_proxy</span>: You may specify a proxy for traffic from the ALFI library back to the Gremlin control plane. This may optionally include basic auth.
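The semantics of the first two keys can be sketched in a few lines of plain Java. This is an illustration of the documented rules, not ALFI's implementation: the class AlfiSettings is hypothetical, case-sensitive comparison with "true" is an assumption, and so is the choice to clamp out-of-range intervals and fall back to the default for missing or unparsable values (the library may instead reject them).

```java
// Hypothetical helper illustrating the documented settings; not ALFI's implementation.
final class AlfiSettings {
    static final long MIN_REFRESH_MS = 1_000L;      // 1 second
    static final long MAX_REFRESH_MS = 300_000L;    // 5 minutes
    static final long DEFAULT_REFRESH_MS = 10_000L; // 10 seconds

    private AlfiSettings() {
    }

    // Anything other than the exact string "true" disables ALFI (case sensitivity is assumed).
    static boolean isEnabled(String rawValue) {
        return "true".equals(rawValue);
    }

    // Assumption: missing or unparsable values use the default; out-of-range values are clamped.
    static long refreshIntervalMs(String rawValue) {
        if (rawValue == null || rawValue.trim().isEmpty()) {
            return DEFAULT_REFRESH_MS;
        }
        try {
            long parsed = Long.parseLong(rawValue.trim());
            return Math.max(MIN_REFRESH_MS, Math.min(MAX_REFRESH_MS, parsed));
        } catch (NumberFormatException e) {
            return DEFAULT_REFRESH_MS;
        }
    }
}
```

A shorter interval makes halting an attack take effect sooner at the cost of more frequent calls to the API, which is the trade-off the refresh setting controls.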
As described above, the default configuration resolution mechanism is to use either properties defined in <span class="code-class-custom">gremlin.properties</span>, or in environment variables where your application runs. If those don't fit your needs, then you can provide an alternate mechanism by subclassing GremlinConfigurationResolver (javadocs) and supplying it to GremlinServiceFactory (javadocs) at construction-time.
Optionally (if using a custom TrafficCoordinates instance) inject the fault using <span class="code-class-custom">com.gremlin.GremlinService#applyImpact(trafficCoordinates)</span>. Add this line of code anywhere in your application where you want the fault to be injected.
Choose a Gremlin attack - Set the amount of latency in ms to apply and optionally throw a <span class="code-class-custom">RuntimeException</span> within your application.
Run the attack - Set the duration in seconds for how long the attack will last.
Test your application to observe the impact of the attack.
This example has been developed for AWS Lambda but could be used in any application one deploys to AWS. Include the alfi-aws jar on your classpath to use this library.
package com.example.rec;

import com.gremlin.ApplicationCoordinates;
import com.gremlin.GremlinCoordinatesProvider;
import com.gremlin.GremlinService;
import com.gremlin.GremlinServiceFactory;
import com.gremlin.http.servlet.GremlinServletFilter;
import org.springframework.boot.web.servlet.FilterRegistrationBean;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class WebConfig {

    @Bean
    public FilterRegistrationBean recommendationsFilterRegistrationBean() {
        FilterRegistrationBean registrationBean = new FilterRegistrationBean();
        registrationBean.setName("recs");

        // Identify this application to Gremlin so that attacks can target it.
        final GremlinCoordinatesProvider alfiCoordinatesProvider = new GremlinCoordinatesProvider() {
            @Override
            public ApplicationCoordinates initializeApplicationCoordinates() {
                return new ApplicationCoordinates.Builder()
                        .withType("local")
                        .withField("service", "recommendations")
                        .build();
            }
        };

        final GremlinServiceFactory alfiFactory = new GremlinServiceFactory(alfiCoordinatesProvider);
        final GremlinService alfi = alfiFactory.getGremlinService();

        // Register the servlet filter so that inbound requests pass through ALFI.
        GremlinServletFilter alfiFilter = new GremlinServletFilter(alfi);
        registrationBean.setFilter(alfiFilter);
        registrationBean.setOrder(1);
        return registrationBean;
    }
}
There is no need to define a TrafficCoordinates when using the GremlinServletFilter. This library takes care of that for you.
This enables you to target any verb and any route hosted by your application! For example, you could narrow the blast radius of an attack to only GET requests to https://somehost/recommendations.
Fill out the Application Query and Traffic Query fields to match the following:
The custom value for the traffic type is hidden behind ellipses in that screenshot. The value is getAllToDos.
Attacks
Integrate the library
To use ALFI, you must first integrate the Gremlin libraries into your application and redeploy. Please see the JVM Installation Guide for more details. Once you have successfully integrated the library, you should see logging like this:
INFO com.gremlin.GremlinServiceFactory - Gremlin enabled for Team abcdefgh-1234-9876-3333-nopqrstuvwxy
Create attacks via the Web UI
Now you can start creating attacks from the Web UI. Here you will see a history of ALFI attacks run by your team.
Once you click <span class="code-class-custom">New ALFI Attack</span>, you will receive a form with <span class="code-class-custom">Application Type</span>, <span class="code-class-custom">Traffic Type</span>, and <span class="code-class-custom">Impact</span> sections.
Application Type
This section provides a way to determine which applications are eligible for the ALFI attack.
Upon application startup, the ALFI code running in each application creates an instance of <span class="code-class-custom">ApplicationCoordinates</span> and passes that to the Gremlin API. Each <span class="code-class-custom">ApplicationCoordinates</span> instance is eligible to pick up an ALFI attack. Please see Application Coordinates Setup for details on how to populate <span class="code-class-custom">ApplicationCoordinates</span>.
The ALFI library comes with two Application Types out of the box: AWS Lambda and AWS EC2. Custom Application Types can also be created from your application, which can then be used in the Web UI with the <span class="code-class-custom">Add Custom Field</span> button. Keep in mind that the most effective chaos experiments start small, so keep your custom Application Types as specific as possible.
Traffic Type
This section provides a way to select individual requests within your application and only impact that set.
Any attribute which you have supplied in a <span class="code-class-custom">TrafficCoordinates</span> is eligible to use in constructing the attack. Please see Traffic Coordinates Setup and Attaching Request Context data to all TrafficCoordinates for details on how to control the data being placed into a <span class="code-class-custom">TrafficCoordinates</span> instance.
The ALFI library includes integrations for the Apache HTTP client and Dynamo DB client (with more to come!). However, you are free to create any Traffic Type you like and use its custom fields as attributes of the attack.
For Traffic Type, you may also supply a <span class="code-class-custom">Percentage of Traffic</span> value. As probability is used to target this percentage, the actual impact may not exactly reflect the value specified.
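To see why the realized share can differ from the requested percentage, consider a minimal sketch in which every request is an independent random draw. This is an illustration only and does not reflect how ALFI samples traffic internally; the class PercentageSampler and its methods are hypothetical.

```java
import java.util.Random;

// Illustrative stand-in for probabilistic targeting; not ALFI's implementation.
final class PercentageSampler {
    private final double percentage; // 0 to 100
    private final Random random;

    PercentageSampler(double percentage, Random random) {
        if (percentage < 0 || percentage > 100) {
            throw new IllegalArgumentException("percentage must be between 0 and 100");
        }
        this.percentage = percentage;
        this.random = random;
    }

    // Each request is an independent draw, so the realized share only approximates the target.
    boolean shouldImpact() {
        return random.nextDouble() * 100.0 < percentage;
    }

    // Counts how many of the given number of simulated requests would be impacted.
    int countImpacted(int requests) {
        int impacted = 0;
        for (int i = 0; i < requests; i++) {
            if (shouldImpact()) {
                impacted++;
            }
        }
        return impacted;
    }
}
```

Under this model, a 50% attack over ten requests can impact anywhere from zero to ten of them, while the observed share settles close to 50% as the number of requests grows.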
Impact
This section provides a way to declare what impact you would like to inject.
You may choose an amount of latency to inject as well as a yes/no switch on whether you want this call to fail. These can also be combined to simulate a slow call which eventually fails. This impact gets applied to all traffic which matches the Traffic Type you've described above on the Application Type you've described above.
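The combination of latency and failure can be sketched as a delay followed by an optional exception. The class SimulatedImpact below is a hypothetical stand-in written against the JDK only; it shows the shape of a slow call that eventually fails and is not the code ALFI runs.

```java
// Illustrative stand-in for an injected impact; not ALFI's implementation.
final class SimulatedImpact {
    private final long latencyMs;
    private final boolean shouldFail;

    SimulatedImpact(long latencyMs, boolean shouldFail) {
        if (latencyMs < 0) {
            throw new IllegalArgumentException("latencyMs must not be negative");
        }
        this.latencyMs = latencyMs;
        this.shouldFail = shouldFail;
    }

    // Delays the calling thread, then optionally fails, mimicking a slow call that eventually errors.
    void apply() {
        if (latencyMs > 0) {
            try {
                Thread.sleep(latencyMs);
            } catch (InterruptedException e) {
                // Preserve the interrupt flag for callers instead of swallowing it.
                Thread.currentThread().interrupt();
            }
        }
        if (shouldFail) {
            throw new RuntimeException("Simulated failure after " + latencyMs + " ms");
        }
    }
}
```

Calling new SimulatedImpact(200, true).apply() at the top of a method blocks for about 200 ms and then throws, which is the behavior a caller's timeout and error handling would need to cope with.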
In this section, you are also required to declare the duration of the attack. For this duration, the attack is active and ALFI-enabled applications are impacted. As soon as the duration elapses, the applications no longer know about the attack and are no longer impacted.
Observe attack results
Once you press the <span class="code-class-custom">Unleash Gremlin</span> button, the attack becomes active and applications will start picking it up. Here you can see all of the attributes used in scoping the attack, as well as what the impact is and the duration of the attack. The attack then starts progressing through different phases of its lifecycle, as described here:
Stage: Description
Pending: Created but no applications have picked up the attack
Distributed: At least one application has picked up the attack, but none have been impacted
Impacted: At least one application has picked up the attack and been impacted
Successful: Impact was applied and duration elapsed
ApplicationNotFound: No application ever picked up the attack and duration elapsed
TrafficNotFound: No application ever applied impact and duration elapsed
Halted: Attack was halted (by UI or API) prior to the duration elapsing
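One way to read the stages listed above is as a function of four observations: whether any application picked up the attack, whether any impact was applied, whether the attack was halted, and whether the duration has elapsed. The enum below is a sketch of that reading, written for illustration; the names and the resolve method are hypothetical and do not come from the Gremlin API.

```java
// Illustrative model of the attack stages; not part of the Gremlin API.
enum AttackStage {
    PENDING, DISTRIBUTED, IMPACTED, SUCCESSFUL, APPLICATION_NOT_FOUND, TRAFFIC_NOT_FOUND, HALTED;

    // Derives the stage from four observations about an attack.
    // Assumption: "halted" is only ever set before the duration elapses.
    static AttackStage resolve(boolean pickedUp, boolean impacted, boolean halted, boolean durationElapsed) {
        if (halted) {
            return HALTED;
        }
        if (durationElapsed) {
            if (!pickedUp) {
                return APPLICATION_NOT_FOUND;
            }
            return impacted ? SUCCESSFUL : TRAFFIC_NOT_FOUND;
        }
        if (!pickedUp) {
            return PENDING;
        }
        return impacted ? IMPACTED : DISTRIBUTED;
    }

    // Terminal stages are the ones an attack can no longer leave.
    boolean isTerminal() {
        return this == SUCCESSFUL || this == APPLICATION_NOT_FOUND
                || this == TRAFFIC_NOT_FOUND || this == HALTED;
    }
}
```

Read this way, ApplicationNotFound points at the Application Type query matching nothing, while TrafficNotFound means an application matched but no request matched the Traffic Type.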
Libraries
Java Client library
alfi-core: Core library required for all ALFI functionality
alfi-aws: Optional AWS integration, providing coordinate discovery for <span class="code-class-custom">AwsLambda</span> and <span class="code-class-custom">AwsEc2</span>
In ALFI, each application has a set of identifying attributes. This set of attributes is named <span class="code-class-custom">ApplicationCoordinates</span> and is used to determine when an application matches an attack.
<span class="code-class-custom">.inferFromEnvironment()</span> will extract the region and name of your Lambda function from your environment and use them as the <span class="code-class-custom">Region</span> and <span class="code-class-custom">Name</span> fields, respectively, in the Gremlin UI.
<span class="code-class-custom">.inferFromEnvironment()</span> will extract the region, availability zone, and instance ID of your EC2 instance from your environment and use them as the <span class="code-class-custom">Region</span>, <span class="code-class-custom">Availability Zone</span>, and <span class="code-class-custom">Instance ID</span> fields, respectively, in the Gremlin UI.
Let's imagine you have an application called TheShop which contains a UserService and a PaymentService. In this case, to uniquely identify each of these services in the Gremlin control plane, you would construct two <span class="code-class-custom">ApplicationCoordinates</span> instances, each with the same value passed to <span class="code-class-custom">withType(...)</span> and a unique value passed to <span class="code-class-custom">withField(...)</span>.
Take notice of the <span class="code-class-custom">withType(...)</span> and <span class="code-class-custom">withField(...)</span> methods. The value passed to <span class="code-class-custom">withType(...)</span> must be entered in the <span class="code-class-custom">Name</span> field of the Gremlin UI (see images below). The value passed to <span class="code-class-custom">withField(...)</span> must be entered in the <span class="code-class-custom">Custom Value</span> field of the Gremlin UI (see images below).
To target both services, configure the UI like this:
To target one of the services, configure the UI like this:
Don't forget to click on the + icon
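The targeting described above (one query that matches both services, another that matches only one) can be pictured with a small matching routine. This is a minimal sketch using JDK types only. CoordinatesSketch and its matching rule are assumptions made for illustration; they are not the ALFI ApplicationCoordinates class or Gremlin's actual matching logic. The field key "service" is borrowed from the servlet filter example earlier in this document.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Illustrative stand-in for coordinate matching; not the ALFI ApplicationCoordinates class.
final class CoordinatesSketch {
    final String type;
    final Map<String, String> fields;

    CoordinatesSketch(String type, Map<String, String> fields) {
        this.type = type;
        this.fields = Collections.unmodifiableMap(new HashMap<>(fields));
    }

    // Assumed rule: the types must be equal, and every field named in the query must be
    // present with the same value on the application. Fields absent from the query match anything.
    static boolean matches(CoordinatesSketch query, CoordinatesSketch application) {
        if (!query.type.equals(application.type)) {
            return false;
        }
        for (Map.Entry<String, String> wanted : query.fields.entrySet()) {
            if (!wanted.getValue().equals(application.fields.get(wanted.getKey()))) {
                return false;
            }
        }
        return true;
    }
}
```

With UserService and PaymentService both registered under the type TheShop, a query that names only the type matches both, and a query that also sets service to PaymentService matches just that one.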
TrafficCoordinates
<span class="code-class-custom">com.gremlin.TrafficCoordinates</span> instances are used to control the blast radius of an ALFI experiment. The blast radius for ALFI could be all or a subset of HTTP verbs, all or a subset of your application's HTTP request paths, or even a specific block of code within your application.
Outbound HTTP Traffic
The <span class="code-class-custom">com.gremlin.TrafficCoordinates</span> instance for Outbound HTTP Traffic will be automatically generated by the <span class="code-class-custom">com.gremlin.http.client.GremlinApacheHttpRequestInterceptor</span> which comes with the alfi-apache-http-client library. This interceptor will give you the ability to impact any HTTP verb or request route within your application. To take advantage of the <span class="code-class-custom">com.gremlin.http.client.GremlinApacheHttpRequestInterceptor</span>, you will need to add an instance of it to <span class="code-class-custom">org.apache.http.impl.client.HttpClientBuilder</span> when you create your <span class="code-class-custom">org.apache.http.client.HttpClient</span> client.
final GremlinApacheHttpRequestInterceptor gremlinInterceptor = new GremlinApacheHttpRequestInterceptor(gremlinService, "alfi-client-demo");
final HttpClientBuilder clientBuilder = HttpClientBuilder.create().addInterceptorFirst(gremlinInterceptor);
The configuration in the screenshot above targets 50% of all HTTP GET traffic to the application.
The second argument to com.gremlin.http.client.GremlinApacheHttpRequestInterceptor is a string ("alfi-client-demo" in the code above) and must match the value defined in the Client Name (required) input field of the Gremlin UI.
Inbound HTTP Traffic
<span class="code-class-custom">com.gremlin.TrafficCoordinates</span> instances are automatically created for you if alfi-http-servlet-filter is on the classpath.
The configuration in the screenshot above targets 50% of all HTTP POST requests to the /payments route.
Dynamo DB Traffic
The <span class="code-class-custom">com.gremlin.TrafficCoordinates</span> instance for Dynamo DB Traffic will be automatically generated by the <span class="code-class-custom">com.gremlin.aws.GremlinDynamoRequestInterceptor</span> which comes with the alfi-aws library. This interceptor will give you the ability to impact any DynamoDB operation (<span class="code-class-custom">Get Item</span>, <span class="code-class-custom">Delete Item</span>, etc...). To take advantage of the <span class="code-class-custom">com.gremlin.aws.GremlinDynamoRequestInterceptor</span>, you will need to add an instance of it to <span class="code-class-custom">com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder</span> when you create your <span class="code-class-custom">com.amazonaws.services.dynamodbv2.AmazonDynamoDB</span> client.
JAVA
final RequestHandler2 gremlinDynamoInterceptor =
        new GremlinDynamoRequestInterceptor(gremlinService(), CLIENT_EXECUTION_TIMEOUT, CLIENT_REQUEST_TIMEOUT);
final AmazonDynamoDB dbClient = AmazonDynamoDBClientBuilder
        .standard()
        .withRegion(region)
        .withClientConfiguration(new ClientConfiguration()
                .withClientExecutionTimeout(CLIENT_EXECUTION_TIMEOUT)
                .withConnectionTimeout(CLIENT_REQUEST_TIMEOUT)
                .withMaxErrorRetry(2))
        .withRequestHandlers(gremlinDynamoInterceptor)
        .build();
The configuration in the screenshot above targets 50% of all Get Item traffic to the application.
Custom Traffic Type
JAVA
final TrafficCoordinates trafficCoordinates = new TrafficCoordinates.Builder()
        .withType("PaymentController")
        .withField("method", "submitPayment")
        .build();

public HttpEntity submitPayment(Payment paymentRequest) {
    this.gremlinService.applyImpact(trafficCoordinates); // Fault injected!
    return paymentService.makePayment(paymentRequest);
}
The configuration in the screenshot above targets 50% of all calls to the PaymentController#submitPayment(Payment paymentRequest) method.
Extend TrafficCoordinates
Often, companies set up their infrastructure to maintain a per-request data structure and use this information to provide logging, monitoring, and observability data points. A common pattern is to set up a <span class="code-class-custom">RequestContext</span> and have authentication filters put information such as <span class="code-class-custom">customerId</span> or <span class="code-class-custom">deviceId</span> into the <span class="code-class-custom">RequestContext</span> object. This object can then be read from any later point, so those attributes are easily available. These attributes are often excellent criteria on which to scope attacks. If your system operates in this way, then you can set up a mapping to populate these values on all <span class="code-class-custom">TrafficCoordinates</span>. This code lives in a concrete subclass of <span class="code-class-custom">GremlinCoordinatesProvider</span>, which you've already seen in: Initialize Application Coordinates.
JAVA
import com.gremlin.GremlinCoordinatesProvider;
import com.gremlin.TrafficCoordinates;

public class MyCoordinatesProvider extends GremlinCoordinatesProvider {
    @Override
    public TrafficCoordinates extendEachTrafficCoordinates(TrafficCoordinates incomingCoordinates) {
        incomingCoordinates.putField("customerId", MyRequestContext.getCustomerId());
        incomingCoordinates.putField("deviceId", MyRequestContext.getDeviceId());
        incomingCoordinates.putField("country", MyRequestContext.getCountry());
        return incomingCoordinates;
    }
}
With this code wired into the construction of your <span class="code-class-custom">GremlinService</span> instance, all <span class="code-class-custom">TrafficCoordinates</span> will now get those 3 attributes and they are eligible to be matched for any type of traffic you'd like to attack.
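The example above relies on a MyRequestContext class that this document does not define. One common way to build such a per-request holder is a ThreadLocal map, sketched below with JDK types only. RequestContextSketch is hypothetical, it assumes one thread handles each request from start to finish, and the extend method only mimics the idea of copying context attributes onto a set of coordinate fields rather than calling any ALFI API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-request context; one possible shape for something like MyRequestContext.
final class RequestContextSketch {
    private static final ThreadLocal<Map<String, String>> CURRENT =
            ThreadLocal.withInitial(HashMap::new);

    private RequestContextSketch() {
    }

    // Typically called by an authentication filter at the start of a request.
    static void put(String key, String value) {
        CURRENT.get().put(key, value);
    }

    static String get(String key) {
        return CURRENT.get().get(key);
    }

    // Call when the request finishes so that pooled threads do not leak attributes
    // from one request into the next.
    static void clear() {
        CURRENT.remove();
    }

    // Returns a copy of the given coordinate fields with the context attributes added.
    static Map<String, String> extend(Map<String, String> coordinateFields) {
        Map<String, String> extended = new HashMap<>(coordinateFields);
        extended.putAll(CURRENT.get());
        return extended;
    }
}
```

If request handling hops between threads (for example with asynchronous executors), a ThreadLocal holder like this would need the context to be propagated explicitly.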
GremlinService
To create a <span class="code-class-custom">com.gremlin.GremlinService</span>, you need a <span class="code-class-custom">com.gremlin.GremlinCoordinatesProvider</span>, which needs a com.gremlin.ApplicationCoordinates.
To construct a GremlinService using the alfi-aws library:
JAVA
final GremlinServiceFactory factory = new GremlinServiceFactory(new GremlinCoordinatesProvider() {
    @Override
    public ApplicationCoordinates initializeApplicationCoordinates() {
        ApplicationCoordinates coords = AwsApplicationCoordinatesResolver.inferFromEnvironment()
                .orElseThrow(IllegalStateException::new);
        return coords;
    }
});
final GremlinService gremlinService = factory.getGremlinService();
Design
com.gremlin.GremlinService should be a singleton.
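The source does not prescribe how to enforce this. In an application without a dependency-injection container, one conventional option is the initialization-on-demand holder idiom, sketched below. SharedService is a hypothetical stand-in for the service object, so that the example compiles without the ALFI library; in a Spring application, a singleton-scoped bean would serve the same purpose.

```java
// Illustrative holder for one shared instance; SharedService stands in for the real service.
final class ServiceHolder {

    // Hypothetical placeholder type so the sketch compiles on its own.
    static final class SharedService {
        private SharedService() {
        }
    }

    private ServiceHolder() {
    }

    // The nested class is loaded on first access, and class initialization is thread-safe,
    // so exactly one instance is created without explicit locking.
    private static final class Lazy {
        static final SharedService INSTANCE = new SharedService();
    }

    static SharedService get() {
        return Lazy.INSTANCE;
    }
}
```

Every caller of ServiceHolder.get() receives the same object, regardless of which thread asks first.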
Injecting fault
Once you have a reference to the <span class="code-class-custom">com.gremlin.GremlinService</span> singleton and have defined your custom com.gremlin.TrafficCoordinates, you can inject a fault like this:
JAVA
gremlinService.applyImpact(trafficCoordinates);
0.7.4
July 7, 2020
Fix: If the gremlin.properties file was on the classpath, Gremlin was not properly using it when resolving configuration.
0.7.3
December 23, 2019
Fix: Change the payload of the authorization header sent to Gremlin API to resolve HTTP 401s from a server-side change that does extra certificate validation
New: Added support for HTTP proxy. Set http_proxy environment variable, and ALFI traffic to Gremlin API will use the specified proxy URL.
0.7.2
April 24, 2019
Fix: Allow certificate parsing to work properly on Windows
Info: Updated dependencies
0.7.1
April 11, 2019
Fix: Much friendlier error messages when installation/setup is unsuccessful
0.7.0
February 9, 2019
New: Addition of Inbound HTTP injections points, both for javax.servlet Filters and JAX-RS Filters
0.6.1
February 21, 2019
Info: Updated dependencies
0.6.0
February 12, 2019
Fix: Allow chaining of property sources, so that a failure to lookup in Parameter Store still allows a lookup from environment variables
0.5.3
January 22, 2019
Info: Release process changes only
0.5.2
January 10, 2019
Info: Change artifact location to maven.gremlin.com
0.5.1
October 23, 2018
Info: The GREMLIN_ALFI_IDENTIFIER is required (previously was optional) when authenticating your application with Gremlin
0.5.0
October 11, 2018
New: Install with Maven now available
New: Client library modules available individually
New: AWS Parameter Store can be used for configuration