How to run a Chaos Engineering experiment on AWS Lambda using Java and Failure Flags

Andre Newman
Sr. Reliability Specialist
Last Updated: March 21, 2024
Categories: Failure Flags, Chaos Engineering
Learn how to improve the resiliency of your Java applications running on AWS Lambda using Gremlin Failure Flags.

In this tutorial, you’ll learn how to run a Chaos Engineering experiment on a Java application running on AWS Lambda using Failure Flags. Failure Flags is a Gremlin feature that lets you inject faults into applications and services running on fully managed serverless environments, such as AWS Lambda, Azure Functions, and Google Cloud Functions. With Failure Flags, you can:

  • Add latency or errors to applications.
  • Inject data into function calls without having to edit or re-deploy source code.
  • Simulate partial outages behind API gateways or reverse proxies.
  • Customize the behavior and impact of experiments.

This tutorial will focus on testing a Java application on AWS Lambda. You can learn about our other supported languages and platforms in the Failure Flags documentation.

Overview

This tutorial will show you how to:

  • Install the Failure Flags Java SDK.
  • Deploy an application and the Failure Flags agent to AWS Lambda.
  • Run a latency experiment using Failure Flags.

Prerequisites

Before starting this tutorial, you’ll need:

  • A Gremlin account (sign up for a free trial here).
  • An AWS account with access to Lambda (you can use the lowest-tier x86 or Arm instance for this tutorial to save on costs).
  • A Java runtime installed on your local machine. This tutorial was written using OpenJDK version 21, the latest supported version on Lambda as of this writing.

Step 1 - Set up your Java application with Failure Flags

In this step, we’ll create a Java application and add a Failure Flag. This is a simple application that responds to HTTP requests with the current timestamp and the time taken to process the response.

First, initialize a new project using Gradle. You can accept the default options for each prompt that appears:

Shell

mkdir failure-flags-java
cd failure-flags-java
gradle init --type java-application

Next, you’ll need to add the Failure Flags Java library as a dependency. Open app/build.gradle.kts and add the following to the repositories section:

Kotlin

maven { url = uri("https://maven.gremlin.com/") }

Then, add the following to the dependencies section. This pulls in all of our dependencies for Failure Flags, Lambda, and Jackson (a popular JSON library for Java):

Kotlin

implementation("com.amazonaws:aws-lambda-java-core:1.1.0")
implementation("com.amazonaws:aws-lambda-java-log4j:1.0.0")
implementation("com.fasterxml.jackson.core:jackson-core:2.8.5")
implementation("com.fasterxml.jackson.core:jackson-databind:2.8.5")
implementation("com.fasterxml.jackson.core:jackson-annotations:2.8.5")
implementation("com.gremlin:failure-flags-java:1.0")

Lastly, add the following to the end of the file. These tell Gradle how to build and package the project into a Zip file:

Kotlin

tasks.register<Zip>("buildZip") {
    archiveFileName.set("app.zip")
    into("lib") {
        from(tasks.jar)
        from(configurations.runtimeClasspath)
    }
}

tasks.build {
    dependsOn("buildZip")
}

The final file should look similar to this:

Kotlin

plugins {
    // Apply the application plugin to add support for building a CLI application in Java.
    application
}

repositories {
    // Use Maven Central for resolving dependencies.
    mavenCentral()
    maven { url = uri("https://maven.gremlin.com/") }
}

dependencies {
    // Use JUnit Jupiter for testing.
    testImplementation(libs.junit.jupiter)

    testRuntimeOnly("org.junit.platform:junit-platform-launcher")

    // This dependency is used by the application.
    implementation(libs.guava)
    implementation("com.amazonaws:aws-lambda-java-core:1.1.0")
    implementation("com.amazonaws:aws-lambda-java-log4j:1.0.0")
    implementation("com.fasterxml.jackson.core:jackson-core:2.8.5")
    implementation("com.fasterxml.jackson.core:jackson-databind:2.8.5")
    implementation("com.fasterxml.jackson.core:jackson-annotations:2.8.5")
    implementation("com.gremlin:failure-flags-java:1.0")
}

// Apply a specific Java toolchain to ease working on different environments.
java {
    toolchain {
        languageVersion = JavaLanguageVersion.of(21)
    }
}

application {
    // Define the main class for the application.
    mainClass = "org.example.App"
}

tasks.named<Test>("test") {
    // Use JUnit Platform for unit tests.
    useJUnitPlatform()
}

tasks.register<Zip>("buildZip") {
    archiveFileName.set("app.zip")

    into("lib") {
        from(tasks.jar)
        from(configurations.runtimeClasspath)
    }
}

tasks.build {
    dependsOn("buildZip")
}

Step 1a - Write the code for the Java application

Since this application is somewhat verbose, it’s easier to split this step into two parts. We already set up Gradle and our dependencies, so now we’ll write the actual application.

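You should have an app/src/main/java/org/example folder in your project containing an App.java file. Rename this file to Handler.java. The code below declares the package com.gremlin.demo.failureflags.demo, which is the package that the Lambda Handler setting in Step 3 refers to; if you want the directory layout to match the package, move the file to app/src/main/java/com/gremlin/demo/failureflags/demo. If Gradle generated an AppTest.java file under app/src/test, it still refers to the original App class, so delete or update it before building. Then enter the following contents in Handler.java: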

Java

package com.gremlin.demo.failureflags.demo;

import java.time.Duration;
import java.time.LocalDateTime;

import java.util.HashMap;
import java.util.Map;

import org.apache.log4j.Logger;

import com.gremlin.failureflags.FailureFlags;
import com.gremlin.failureflags.FailureFlag;
import com.gremlin.failureflags.GremlinFailureFlags;

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;

public class Handler implements RequestHandler<Map<String, Object>, ApiGatewayResponse> {

  private static final Logger LOG = Logger.getLogger(Handler.class);
  private final FailureFlags gremlin;

  public Handler() {
    gremlin = new GremlinFailureFlags();
  }

  @Override
  public ApiGatewayResponse handleRequest(Map<String, Object> input, Context context) {
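    // Record the start time so we can report how long the request took.
    LocalDateTime start = LocalDateTime.now();

    // The Failure Flag: an injection point that an experiment can target by name and labels.
    gremlin.invoke(new FailureFlag("failure-flags-java", Map.of("method", "POST")));

    // Any latency injected at the flag shows up in the measured processing time.
    LocalDateTime end = LocalDateTime.now();
    Duration processingTime = Duration.between(start, end);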

    Response responseBody = new Response(processingTime.toMillis());
    Map<String, String> headers = new HashMap<>();
    headers.put("Content-Type", "application/json");
    return ApiGatewayResponse.builder()
        .setStatusCode(200)
        .setObjectBody(responseBody)
        .setHeaders(headers)
        .build();
  }
}
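The handler above measures processing time by subtracting two LocalDateTime values, which follow the wall clock. As a minimal sketch (not part of the original application), the same measurement could be taken with System.nanoTime(), a monotonic clock that isn't affected by clock adjustments. The ElapsedTimer class and its method are hypothetical names, and the Thread.sleep call is only a stand-in for the Failure Flag invocation.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper: times a block of work using the monotonic clock.
class ElapsedTimer {

  // Runs the task and returns how long it took, in whole milliseconds.
  static long timeMillis(Runnable task) {
    long start = System.nanoTime();
    task.run();
    return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
  }

  public static void main(String[] args) {
    // Stand-in for gremlin.invoke(...): pause for 50 ms.
    long elapsed = timeMillis(() -> {
      try {
        Thread.sleep(50);
      } catch (InterruptedException e) {
        // Restore the interrupt flag so callers can still see it.
        Thread.currentThread().interrupt();
      }
    });
    System.out.println("processingTime=" + elapsed + " ms");
  }
}
```

For a tutorial like this one, either approach reports the injected latency; the monotonic clock mainly matters when measurements need to be compared reliably over time.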

Create a new file in the same directory named Response.java with the following contents:

Java

package com.gremlin.demo.failureflags.demo;

public class Response {
  private final long processingTime;
  public Response(long processingTime) {
    this.processingTime = processingTime;
  }
  // Returned as a String, so the value is serialized as a quoted JSON string.
  public String getProcessingTime() {
    return String.valueOf(this.processingTime);
  }
}

Lastly, create a file named ApiGatewayResponse.java with the following contents:

Java

package com.gremlin.demo.failureflags.demo;

import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Collections;
import java.util.Map;
import org.apache.log4j.Logger;
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;

public class ApiGatewayResponse {

  private final int statusCode;
  private final String body;
  private final Map<String, String> headers;
  private final boolean isBase64Encoded;

  public ApiGatewayResponse(int statusCode, String body, Map<String, String> headers, boolean isBase64Encoded) {
    this.statusCode = statusCode;
    this.body = body;
    this.headers = headers;
    this.isBase64Encoded = isBase64Encoded;
  }

  public int getStatusCode() {
    return statusCode;
  }

  public String getBody() {
    return body;
  }

  public Map<String, String> getHeaders() {
    return headers;
  }

  // The doubled "is" is intentional: the serialized key becomes "isBase64Encoded".
  public boolean isIsBase64Encoded() {
    return isBase64Encoded;
  }

  public static Builder builder() {
    return new Builder();
  }

  public static class Builder {
    private static final Logger LOG = Logger.getLogger(ApiGatewayResponse.Builder.class);
    private static final ObjectMapper objectMapper = new ObjectMapper();
    private int statusCode = 200;
    private Map<String, String> headers = Collections.emptyMap();
    private String rawBody;
    private Object objectBody;
    private byte[] binaryBody;
    private boolean base64Encoded;

    public Builder setStatusCode(int statusCode) {
      this.statusCode = statusCode;
      return this;
    }

    public Builder setHeaders(Map<String, String> headers) {
      this.headers = headers;
      return this;
    }

    public Builder setRawBody(String rawBody) {
      this.rawBody = rawBody;
      return this;
    }

    public Builder setObjectBody(Object objectBody) {
      this.objectBody = objectBody;
      return this;
    }

    public Builder setBinaryBody(byte[] binaryBody) {
      this.binaryBody = binaryBody;
      setBase64Encoded(true);
      return this;
    }

    public Builder setBase64Encoded(boolean base64Encoded) {
      this.base64Encoded = base64Encoded;
      return this;
    }

    public ApiGatewayResponse build() {
      String body = null;
      if (rawBody != null) {
        body = rawBody;
      } else if (objectBody != null) {
        try {
          body = objectMapper.writeValueAsString(objectBody);
        } catch (JsonProcessingException e) {
          LOG.error("failed to serialize object", e);
          throw new RuntimeException(e);
        }
      } else if (binaryBody != null) {
        body = new String(Base64.getEncoder().encode(binaryBody), StandardCharsets.UTF_8);
      }
      return new ApiGatewayResponse(statusCode, body, headers, base64Encoded);
    }
  }
}
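The build() method above picks the response body in a fixed order: a raw string first, then a serialized object, then Base64-encoded binary data. The following sketch isolates that precedence and the Base64 step using only the JDK. The BodySelector class is hypothetical, and the object branch uses String.valueOf as a stand-in for Jackson's ObjectMapper, so it doesn't produce real JSON.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical, JDK-only illustration of the body precedence in build().
class BodySelector {

  // Mirrors the order used by the builder: raw, then object, then binary.
  static String select(String rawBody, Object objectBody, byte[] binaryBody) {
    if (rawBody != null) {
      return rawBody;
    } else if (objectBody != null) {
      // Stand-in for objectMapper.writeValueAsString(objectBody).
      return String.valueOf(objectBody);
    } else if (binaryBody != null) {
      return Base64.getEncoder().encodeToString(binaryBody);
    }
    return null;
  }

  public static void main(String[] args) {
    byte[] bytes = "hi".getBytes(StandardCharsets.UTF_8);
    System.out.println(select(null, null, bytes)); // prints "aGk="
    System.out.println(select("raw", 42, bytes));  // prints "raw"
  }
}
```

In the handler from this tutorial, only the object branch is exercised, because the response is always built with setObjectBody.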

Step 2 - Download your client configuration file

Before you can deploy your application, you’ll need to authenticate it with Gremlin. Gremlin provides a downloadable client configuration file that you can use to authenticate any Gremlin agent, including Failure Flags agents. This file contains your Gremlin team ID and TLS certificates, but you can add additional labels like your application name, version number, region, etc.

  1. Download your client configuration file from the Gremlin web app and save it in the root directory of your project folder as config.yaml.
  2. Optionally, add labels to your configuration file. Labels identify individual deployments of this application, letting you fine-tune which deployments to impact during experiments. For example, the following block identifies your function as part of the us-east-2 region and the failure-flags-java project, so you can target all functions that run in us-east-2 or that belong to the failure-flags-java project:

YAML

labels:
    datacenter: us-east-2
    project: failure-flags-java

The configuration file supports other options, but the defaults are all you need for this tutorial.

Step 3 - Deploy your Java application to Lambda

Next, let’s deploy our app to Lambda.

So far, we’ve configured our application and the Failure Flags SDK. The SDK is what’s responsible for injecting faults into our app, but it doesn’t handle communicating with Gremlin’s backend servers or orchestrating experiments. For that, we need the Failure Flags Lambda layer.

We’ll deploy the Lambda layer alongside our Java app. The specifics of deploying an app to Lambda go beyond the scope of this tutorial, so we’ll link to the AWS docs where necessary.

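  1. Follow the instructions in Deploy Java Lambda functions with .zip or JAR file archives. Remember to include your Gremlin client configuration file (config.yaml) at the root of the deployment package: Lambda extracts the package to /var/task, which is where the GREMLIN_CONFIG_FILE variable set below expects to find it.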
  2. Create a new Lambda function using the instructions in Uploading a deployment package with the Lambda console.
  3. Before deploying the function, we need to add some environment variables. These are necessary for enabling Failure Flags. In the AWS Console, select the Configuration tab, then select Environment Variables. Click Edit, then enter the following variables:
    1. FAILURE_FLAGS_ENABLED=1
    2. GREMLIN_LAMBDA_ENABLED=1
    3. GREMLIN_CONFIG_FILE=/var/task/config.yaml
  4. We’ll also need to change the Handler (the entrypoint of the application). Click on the Code tab, then under Runtime settings, click Edit. Change the Handler field to com.gremlin.demo.failureflags.demo.Handler::handleRequest, then click Save.
  5. Optionally, you can provide additional metadata via environment variables, including your Gremlin team credentials. This isn't necessary for this tutorial, since we're authenticating via a config file. See the Failure Flags installation docs for details.
  6. Click Test to confirm that your application can receive and process requests correctly.
  7. Now we need to add the Failure Flags Lambda layer. Select the Code tab, then scroll down to Layers and click Add a layer:
    1. Under Choose a layer, select Specify an ARN.
    2. Enter one of the ARNs presented in this link, depending on which region and architecture your function is running on. For example, if your Lambda is running in us-east-2 on x86, enter arn:aws:lambda:us-east-2:044815399860:layer:gremlin-lambda-x86_64:13.
    3. Click Verify to confirm that the ARN matches your region and architecture, then click Add.
  8. Publish your Lambda by scrolling to the top of the page, clicking Actions, then clicking Publish new version. Enter a name for this version, then click Publish to push your function live.
  9. Create a new Function URL by following the instructions in Creating and managing Lambda function URLs. Once the URL is created, click the link to see your function's output in a new tab. You should see the response time appear in your browser as JSON. The complete response will look similar to the following:

JSON

{
  "statusCode": 200,
  "body": "{\"processingTime\":\"300\"}",
  "headers": {
    "Content-Type": "application/json"
  },
  "isBase64Encoded": false
}
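Because the three environment variables from this step are necessary for enabling Failure Flags, it can help to verify them before debugging anything else. As a rough sketch, the following JDK-only helper lists which of them are unset or blank. The EnvCheck class and its method are hypothetical and not part of the Failure Flags SDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the Failure Flags SDK.
class EnvCheck {

  // The three variables set in the Lambda console in this step.
  static final List<String> REQUIRED = List.of(
      "FAILURE_FLAGS_ENABLED", "GREMLIN_LAMBDA_ENABLED", "GREMLIN_CONFIG_FILE");

  // Returns the names of required variables that are unset or blank.
  static List<String> missing(Map<String, String> env) {
    List<String> result = new ArrayList<>();
    for (String name : REQUIRED) {
      String value = env.get(name);
      if (value == null || value.isBlank()) {
        result.add(name);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // System.getenv() returns the variables visible to this process.
    System.out.println("Missing variables: " + missing(System.getenv()));
  }
}
```

One way to use this would be to log the result once when the handler is constructed, so a misconfigured function shows up in its logs.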

Step 4 - Run an experiment

Now, let’s run an experiment on our Java application.

To set the context: Lambda functions are sensitive to network conditions. If a network connection becomes congested or slow, throughput drops and response times rise. The question is: how does this added latency affect the overall performance of our Lambda function and of any services that depend on it? To test this, we'll run a latency experiment that adds one full second of latency, then observe what happens to our application.

  1. In the Gremlin web app, select Failure Flags in the navigation pane (or click this link).
  2. Click + Experiment to create a new experiment.
  3. Enter a name for the new experiment.
  4. Under Failure Flag Selector, click the combo box to show a list of active applications with Failure Flags that Gremlin detected. If your app doesn't show up, confirm that it's finished deploying on Lambda and has responded to at least one request.
  5. Optionally, you can add any additional attributes, such as label selectors, in the Attributes box. You can ignore this field for this tutorial.
  6. In the Effects box, specify the impact that you want to have on your app. For example, say we want to add 1000 ms (one second) of latency to each call to this function. We can do this by adding the following JSON to this field:
    1. { "latency": 1000 }
  7. Set the Impact Probability percentage. For now, set it to 100% to ensure that every call to this function gets impacted.
  8. Optionally, change the Experiment Duration to your preferred time. For now, set it to 5 min so you have plenty of time to observe the impact. You can always stop the experiment using Gremlin's Halt button.
  9. Click Save & Run to start the experiment.
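To build intuition for how the Effects and Impact Probability fields interact, here is a purely local sketch. It is a conceptual model and not Gremlin's implementation: the LatencyModel class is hypothetical, it assumes each call is impacted independently, and it only computes how much delay a call would receive rather than actually sleeping.

```java
import java.util.Random;

// Conceptual model only; this is not how the Failure Flags SDK is implemented.
class LatencyModel {
  private final long latencyMillis; // e.g. 1000, as in { "latency": 1000 }
  private final double probability; // e.g. 1.0 for an Impact Probability of 100%
  private final Random random;

  LatencyModel(long latencyMillis, double probability, long seed) {
    this.latencyMillis = latencyMillis;
    this.probability = probability;
    this.random = new Random(seed);
  }

  // Returns the delay a single call would see: the full latency or none.
  long delayForNextCall() {
    return random.nextDouble() < probability ? latencyMillis : 0;
  }

  // Average added delay across a number of simulated calls.
  double averageDelay(int calls) {
    long total = 0;
    for (int i = 0; i < calls; i++) {
      total += delayForNextCall();
    }
    return (double) total / calls;
  }

  public static void main(String[] args) {
    System.out.println(new LatencyModel(1000, 1.0, 42).averageDelay(1000)); // prints 1000.0
    System.out.println(new LatencyModel(1000, 0.5, 42).averageDelay(1000)); // roughly 500
  }
}
```

Under this model, lowering the probability reduces the average delay but not the worst case: an impacted call still waits the full second.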

While the experiment runs, open your Lambda URL in a web browser or a performance testing tool. How is it responding? How noticeable is the latency? Is the amount of latency more than you expected (longer than one second)? If so, why do you think that is? How might you rearchitect this app so the latency doesn't have as big of an impact?
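If you'd rather measure the latency than judge it by eye in a browser, a small client can time a few requests. This sketch uses the JDK's built-in java.net.http.HttpClient. The FunctionUrlTimer class is hypothetical, the sample count of five is arbitrary, and the URL is read from the command line as a placeholder for your own function URL.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Arrays;
import java.util.concurrent.TimeUnit;

// Hypothetical client that times a handful of requests to a function URL.
class FunctionUrlTimer {

  // Returns the median of the samples (the upper median for even counts).
  static long median(long[] samples) {
    long[] sorted = samples.clone();
    Arrays.sort(sorted);
    return sorted[sorted.length / 2];
  }

  // Sends one GET request and returns the round-trip time in milliseconds.
  static long timeRequest(HttpClient client, URI uri) throws Exception {
    HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
    long start = System.nanoTime();
    client.send(request, HttpResponse.BodyHandlers.ofString());
    return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
  }

  public static void main(String[] args) throws Exception {
    if (args.length == 0) {
      System.err.println("usage: java FunctionUrlTimer.java <function-url>");
      return;
    }
    HttpClient client = HttpClient.newHttpClient();
    URI uri = URI.create(args[0]);
    long[] samples = new long[5];
    for (int i = 0; i < samples.length; i++) {
      samples[i] = timeRequest(client, uri);
      System.out.println("request " + (i + 1) + ": " + samples[i] + " ms");
    }
    System.out.println("median: " + median(samples) + " ms");
  }
}
```

Running it once before the experiment and once during it gives a baseline to compare against. The first request may be slower than the rest if the function needs a cold start, which is one reason to look at the median rather than a single sample.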

When you're finished making observations, click Halt this experiment in the Gremlin web app to stop the experiment.

Conclusion

Congratulations on running a serverless Chaos Engineering experiment on AWS Lambda with Gremlin! Now that you have Failure Flags set up, try running different kinds of experiments. Add jitter to your network latency, impact a larger or smaller percentage of traffic, generate exceptions, or combine several effects. For more advanced tests, you can even define your own experiments or inject data into your app. Failure Flags also has language-specific features, but these are currently available only for Go.

If you'd like to try Failure Flags outside Lambda, we also have a sidecar for Kubernetes. Just deploy the sidecar, then define and run your experiment. Remember that Failure Flags has no performance or availability impact on your application when not in use, so don't be afraid to add it to your applications. We also offer SDKs for Node.js, Go, and Python. These are all available on GitHub under an Apache-2.0 license.
