How to run application layer fault injection attacks on Apache Kafka

In a previous tutorial, we showed how to run chaos experiments on a self-managed Apache Kafka cluster. In this tutorial, we’ll use application layer fault injection (ALFI) to run chaos experiments directly in a Kafka application. We’ll show you how to use the Gremlin ALFI client and the Gremlin web app to run attacks on a Java application, with the goal of improving the performance and reliability of our application and of Kafka itself.

What is ALFI?

Application layer fault injection (ALFI) is a method of injecting failure directly into an application. ALFI lets us run chaos experiments on individual applications, services, and even specific requests. For example, we can add latency to specific function calls, generate errors on a percentage of requests, and even experiment in managed or serverless environments such as Confluent Cloud, Amazon MSK, AWS Lambda, and Heroku.

With ALFI, we have much greater control over the blast radius (the scope) of our attacks and can target specific Kafka interactions. For example, we can limit attacks to specific messages, producers, consumers, and topics, whereas in our previous tutorial we were only able to target entire nodes. This makes it much easier to ensure that your applications are production-ready before you deploy to customers.
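To make the idea concrete, here is a minimal sketch of application-level fault injection that uses only the Java standard library. It is not the Gremlin ALFI client: the class name `FaultInjector`, its constructor parameters, and the `call` method are all hypothetical and exist only for illustration. The sketch shows the two ideas described above: scoping a fault to calls that carry specific fields (such as `service=producer`), and impacting only a percentage of those calls with added latency or an exception.

```java
import java.util.Map;
import java.util.function.DoubleSupplier;
import java.util.function.Supplier;

/** Hypothetical illustration of application-level fault injection; NOT the Gremlin ALFI API. */
final class FaultInjector {
    private final Map<String, String> targetFields; // which calls to hit, e.g. {service=producer}
    private final double percentToImpact;           // 0 to 100
    private final long latencyMs;                   // delay added to impacted calls
    private final boolean throwException;           // fail impacted calls after the delay
    private final DoubleSupplier random;            // supplies values in [0, 1)

    FaultInjector(Map<String, String> targetFields, double percentToImpact,
                  long latencyMs, boolean throwException, DoubleSupplier random) {
        this.targetFields = Map.copyOf(targetFields);
        this.percentToImpact = percentToImpact;
        this.latencyMs = latencyMs;
        this.throwException = throwException;
        this.random = random;
    }

    /** A call is in scope only if it carries every targeted field with the same value. */
    boolean matches(Map<String, String> callFields) {
        return callFields.entrySet().containsAll(targetFields.entrySet());
    }

    /** Runs the action, first applying the fault if this call is in scope and selected. */
    <T> T call(Map<String, String> callFields, Supplier<T> action) {
        if (matches(callFields) && random.getAsDouble() * 100.0 < percentToImpact) {
            pause(latencyMs);
            if (throwException) {
                throw new RuntimeException("Injected fault");
            }
        }
        return action.get();
    }

    private static void pause(long millis) {
        if (millis <= 0) {
            return;
        }
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        }
    }
}
```

For example, `new FaultInjector(Map.of("service", "producer"), 100, 500, false, Math::random)` would delay every call tagged `service=producer` by 500 ms and leave calls tagged `service=consumer` untouched. Passing the random source as a parameter keeps the sketch deterministic under test.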

Overview

We’ll demonstrate ALFI using an open source application called Kafka ALFI Demo. This is a Spring Boot application that creates a REST API endpoint and records the IP address and client name for each device that connects to it. A producer class publishes this data to a Kafka topic, while a consumer service pulls entries from the topic and writes them to a log file.

We’ll cover the following topics:

  • Downloading and configuring the Kafka ALFI Demo project.
  • Configuring the Gremlin ALFI library.
  • Running an attack on a producer.
  • Running an attack on a consumer.
  • Halting an attack.


Prerequisites

To complete this tutorial, you’ll need:

  • A running Apache Kafka cluster that your machine can reach.
  • A Gremlin account, along with your team ID, certificate, and private key.
  • Git and a Java development environment that can run the project’s Gradle wrapper.

Step 1 - Download the Kafka ALFI Demo project

First, clone the demo application repository from GitHub. If you’re using an IDE like Eclipse or IntelliJ IDEA, you can import the project as a Gradle project.

bash
git clone https://github.com/8bitbuddhist/kafka-alfi-demo.git

Next, enter your Kafka cluster configuration by opening the src/main/resources/application.yml file. Change the values of bootstrap-servers so that they contain a comma-separated list of your Kafka brokers:

yaml
spring:
  kafka:
    consumer:
      bootstrap-servers: broker1:9092,broker2:9092
    producer:
      bootstrap-servers: broker1:9092,broker2:9092
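A separator typo in the broker list (for example, a period where the colon between host and port should be) is easy to miss and only surfaces when the client tries to connect. The following is a minimal sketch of a format check for a comma-separated broker list, using only the Java standard library. The class `BootstrapServers` and its `parse` method are hypothetical helpers, not part of the demo project, Spring, or the Kafka client.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that checks a comma-separated "host:port" broker list. */
final class BootstrapServers {
    static List<String> parse(String value) {
        List<String> brokers = new ArrayList<>();
        for (String entry : value.split(",")) {
            String broker = entry.trim();
            int colon = broker.lastIndexOf(':');
            // Require a non-empty host before the colon and a non-empty port after it.
            if (colon <= 0 || colon == broker.length() - 1) {
                throw new IllegalArgumentException("Expected host:port but got '" + broker + "'");
            }
            int port;
            try {
                port = Integer.parseInt(broker.substring(colon + 1));
            } catch (NumberFormatException e) {
                throw new IllegalArgumentException("Invalid port in '" + broker + "'", e);
            }
            if (port < 1 || port > 65535) {
                throw new IllegalArgumentException("Port out of range in '" + broker + "'");
            }
            brokers.add(broker);
        }
        return brokers;
    }
}
```

Calling `BootstrapServers.parse("broker1:9092,broker2:9092")` returns both entries, while an entry without a colon is rejected with an `IllegalArgumentException`. The check only validates the format; it does not confirm that the brokers are reachable.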

Step 2 - Configure the Gremlin ALFI library

Next, we need to configure the Gremlin library so that it can authenticate with the Gremlin Control Plane. Follow the instructions in the authentication & configuration documentation and set the required configuration values in the src/main/resources/gremlin.properties file. Your file should look similar to this:

properties
GREMLIN_ALFI_IDENTIFIER=KafkaALFIDemo
GREMLIN_TEAM_ID=[your Gremlin team ID]
GREMLIN_TEAM_PRIVATE_KEY_OR_FILE=file:////path/to/your/private.key
GREMLIN_TEAM_CERTIFICATE_OR_FILE=file:////path/to/your/team.cert
GREMLIN_ALFI_ENABLED=true
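Before starting the application, it can help to confirm that no value was left out of the properties file. The sketch below loads a properties source with `java.util.Properties` and reports any of the five keys shown above that are missing or blank. Treating exactly those five keys as the required set is an assumption based on the example file, and the class `GremlinPropertiesCheck` is a hypothetical helper rather than part of the Gremlin library.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/** Hypothetical pre-flight check for gremlin.properties; not part of the Gremlin library. */
final class GremlinPropertiesCheck {
    // Assumption: the keys from the example file are the ones that must be set.
    static final List<String> REQUIRED_KEYS = List.of(
            "GREMLIN_ALFI_IDENTIFIER",
            "GREMLIN_TEAM_ID",
            "GREMLIN_TEAM_PRIVATE_KEY_OR_FILE",
            "GREMLIN_TEAM_CERTIFICATE_OR_FILE",
            "GREMLIN_ALFI_ENABLED");

    /** Returns the required keys that are absent or blank, in declaration order. */
    static List<String> missingKeys(Reader source) {
        Properties props = new Properties();
        try {
            props.load(source);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> missing = new ArrayList<>();
        for (String key : REQUIRED_KEYS) {
            String value = props.getProperty(key);
            if (value == null || value.isBlank()) {
                missing.add(key);
            }
        }
        return missing;
    }
}
```

In practice you might pass a `java.io.FileReader` for `src/main/resources/gremlin.properties` and fail fast if the returned list is not empty. The check does not verify that the team ID, key, or certificate are valid, only that they are present.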

Now let’s run the application. From the command line, navigate to the project’s root directory and type:

bash
./gradlew bootRun

If you’re using an IDE, create a Gradle configuration and run the bootRun task. Here, we’re using IntelliJ IDEA:

IntelliJ IDEA Gradle configuration

When we run the application, it creates and exposes a REST API endpoint at http://localhost:9000/kafka/publish. You can test this by running the following on your command line prompt:

bash
curl -X POST -F "client=test" -F "ip=0.0.0.0" http://localhost:9000/kafka/publish

This creates a new Kafka topic named “users” containing our message and a consumer group named “group_id”. Next, log into the Gremlin web app, select Clients, and then select Application. The application instance is listed as “KafkaALFIDemo”.

Gremlin clients list showing our application

Now we can start running attacks!

Step 3 - Run an attack on the producer

For our first attack, we want to see the impact that latency has on Kafka’s throughput. Latency can have many different sources: poor network quality, overloaded servers, poor application optimization, or slowdowns in downstream components. Any of these can cause message throughput to drop, which can cascade up the pipeline through our producers and to our applications. With Gremlin, we can inject latency directly into our application to simulate degraded performance and observe the impact that this has on the user experience. This can help with capacity planning and load balancing.

  • In the Gremlin web app, create a new attack.

  • Select the “Applications” tab.

  • Under Application Query, select “Custom Application Type”.

    • In the “Name” field, enter the value you used for GREMLIN_ALFI_IDENTIFIER (“KafkaALFIDemo” by default).
  • Under Traffic Query, select “Custom Traffic Type”.

    • In the “Name” field, enter the value you provided in TrafficCoordinates.Builder().withType() (“KafkaALFIDemo” by default).
    • For “Percent to Impact”, enter 100.
    • For “Custom Value”, enter service for the key and producer for the value.
  • Under “Choose a Gremlin”, set Latency to 500.

  • Change “Duration” to 120 to run the test for 2 minutes.

  • Click “Unleash Gremlin” to run the attack.

  • Open a terminal in your project’s root directory and run the generate-data.sh script to generate requests to the REST API.

Setting up the producer attack in Gremlin

To verify that the attack is running, check your application output for messages like this:

bash
2020-07-07 16:08:11,955 INFO com.gremlin.GremlinService: Gremlin injecting ExperimentImpact{experimentGuid='ec65137e-afba-42ed-a513-7eafbaa2ed38', impact=com.gremlin.Impact{latency_to_add_in_ms="500.0", exception_should_be_thrown="false"}} for : Coordinates{type=KafkaALFIDemo, fields={service=producer}}

The data generation script sends a randomly generated IP address and user client to our application’s REST API endpoint every 100 ms. When the attack starts, you’ll notice that the rate of requests drops significantly. This means that the extra latency in our producer is causing the application to respond more slowly to API requests, which in turn is affecting client performance.
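A rough back-of-the-envelope calculation shows why the drop is so pronounced. The sketch below assumes that the script waits for each response before pausing 100 ms and sending the next request, and that the application's own processing time is negligible; both are simplifying assumptions, not measurements from the demo. The class `ThroughputEstimate` is a hypothetical name.

```java
/** Hypothetical estimate of request rate for a client that sends requests one at a time. */
final class ThroughputEstimate {
    /** Requests per second when each cycle is a fixed pause plus the response time. */
    static double requestsPerSecond(long pauseMs, long responseMs) {
        return 1000.0 / (pauseMs + responseMs);
    }

    public static void main(String[] args) {
        double baseline = requestsPerSecond(100, 0);   // no injected latency
        double attacked = requestsPerSecond(100, 500); // 500 ms injected latency
        System.out.println("baseline req/s: " + baseline);
        System.out.println("attacked req/s: " + attacked);
    }
}
```

Under these assumptions, the rate falls from about 10 requests per second to roughly 1.7, since each cycle grows from 100 ms to 600 ms. Real numbers will differ with network and processing overhead.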

In a real-world application, we might try to mitigate this by creating a queue to buffer API calls, load balancing requests across multiple producers, or making our API calls asynchronous.
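As an illustration of the buffering approach, the sketch below places a bounded queue between the request handler and a slow publish step, so the handler returns as soon as the message is queued. It uses only `java.util.concurrent`; the class `BufferedPublisher`, its methods, and the `slowPublish` callback are hypothetical stand-ins, and a real application would call its Kafka producer inside that callback.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical buffer that decouples request handling from a slow publish step. */
final class BufferedPublisher {
    private final ThreadPoolExecutor worker;
    private final Consumer<String> slowPublish;

    BufferedPublisher(int capacity, Consumer<String> slowPublish) {
        this.slowPublish = slowPublish;
        // One worker thread keeps messages in order; the bounded queue caps memory use.
        this.worker = new ThreadPoolExecutor(1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(capacity), new ThreadPoolExecutor.AbortPolicy());
    }

    /** Queues the message and returns immediately; false means the buffer is full. */
    boolean submit(String message) {
        try {
            worker.execute(() -> slowPublish.accept(message));
            return true;
        } catch (RejectedExecutionException e) {
            return false;
        }
    }

    /** Stops accepting messages and waits for queued ones; true if all were published in time. */
    boolean drain(long timeoutMs) {
        worker.shutdown();
        try {
            return worker.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Returning `false` when the buffer is full lets the API respond with a "try again later" error instead of blocking. The trade-off is that a queued message can be lost if the process stops before the worker publishes it, so this design suits data that tolerates occasional loss or that is also persisted elsewhere.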

Step 4 - Run an attack on the consumer

Now that we’ve injected failure into our producer, let’s do the same with our consumer. This time, instead of injecting latency, we’ll inject errors.

Why would we deliberately cause errors? Imagine our application receives malformed or unexpected data, such as an empty client name or an IP address containing letters. We want our consumer to be able to handle situations like this quickly and effectively, and using ALFI, we can proactively test this.

Here, we’ll generate an error for one in every five messages (20%). We’ll keep an eye on our throughput and log output while the test is running.

  • In the Gremlin web app, create a new attack.

  • Select the “Applications” tab.

  • Under Application Query, select “Custom Application Type”.

    • In the “Name” field, enter the value you used for GREMLIN_ALFI_IDENTIFIER (“KafkaALFIDemo” by default).
  • Under Traffic Query, select “Custom Traffic Type”.

    • In the “Name” field, enter the value you provided in TrafficCoordinates.Builder().withType() (“KafkaALFIDemo” by default).
    • For “Percent to Impact”, enter 20.
    • For “Custom Value”, enter service for the key and consumer for the value.
  • Under “Choose a Gremlin”, set Latency to 0 and enable Throw Exception.

  • Change “Duration” to 120 to run the attack for 2 minutes.

  • Click “Unleash Gremlin” to run the attack.

  • Open a terminal in your project’s root directory and run the generate-data.sh script to generate requests.

Setting up the consumer attack in Gremlin

To verify that the attack is running, check your application output for errors and stack traces:

bash
2020-07-07 16:28:29,905 INFO com.gremlin.GremlinService: Gremlin injecting ExperimentImpact{experimentGuid='371a66f6-7dd0-4632-9a66-f67dd06632d3', impact=com.gremlin.Impact{latency_to_add_in_ms="0.0", exception_should_be_thrown="true"}} for : Coordinates{type=KafkaALFIDemo, fields={service=producer}}
2020-07-07 16:28:29,906 ERROR org.apache.juli.logging.DirectJDKLog: Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception [Request processing failed; nested exception is java.lang.RuntimeException: Fault injected by Gremlin] with root cause
java.lang.RuntimeException: Fault injected by Gremlin
    at com.gremlin.GremlinService.a(SourceFile:133)
    ...

If we look at our log file, we’ll notice a significant number of missing messages. This is because we don’t have any error logging in place, so the message is lost when the exception is thrown. This is a problem: because the consumer successfully fetched the message from Kafka, it won’t try processing the message again, and when Kafka deletes the message, it will be permanently lost. In a real-world application, we might fix this by adding error-handling logic to clean up the message before we try processing it again.
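One common shape for such error handling is to retry a failed message a limited number of times and, if it still fails, set it aside for later inspection instead of dropping it. The following is a minimal standard-library sketch of that pattern. The class `RetryingHandler` and its in-memory dead-letter list are hypothetical; they are not taken from the demo project, and a production consumer would more likely rely on its Kafka client framework's error handling and a dead-letter topic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical wrapper that retries a message and parks it if every attempt fails. */
final class RetryingHandler {
    private final Consumer<String> processor;
    private final int maxAttempts;
    private final List<String> deadLetters = new ArrayList<>();

    RetryingHandler(Consumer<String> processor, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        this.processor = processor;
        this.maxAttempts = maxAttempts;
    }

    /** Returns true if the message was processed, false if it was moved to the dead letters. */
    boolean handle(String message) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                processor.accept(message);
                return true;
            } catch (RuntimeException e) {
                // Swallow the failure and fall through to the next attempt, if any remain.
            }
        }
        deadLetters.add(message); // keep the message so it is not silently lost
        return false;
    }

    /** Snapshot of the messages that failed every attempt. */
    List<String> deadLetters() {
        return List.copyOf(deadLetters);
    }
}
```

With this wrapper in place, an exception injected on one attempt would no longer drop the message: it is either processed on a later attempt or retained in the dead-letter list. A fuller version would also log each failure and wait between attempts.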

Step 5 - Halt the attack

If you’re done experimenting and you want to stop a running attack:

  • Open the Gremlin web app.

  • To halt all active attacks:

    • Click the big red “Halt All Attacks” button in the top-right corner of the screen.
  • To halt a single attack:

    • Open the “Attacks” page.
    • Select the “Application” tab.
    • Under the “Active” list, click on the attack you want to halt.
    • Click the “Halt Attack” button in the top-right corner.

The UI will update to show that the attack is halted, and your application’s traffic patterns will return to normal.

Halting an attack in Gremlin

Conclusion

In this tutorial, we configured the Gremlin ALFI client for a Spring Boot application and ran several ALFI attacks on Kafka. From here, you can try creating your own attacks and integrating ALFI into your own applications. You can find the complete source code for our demo application on GitHub, and you can find more information about installing ALFI in our documentation.

If you want to run infrastructure-level chaos experiments on Kafka, read our tutorial on running chaos experiments on Kafka, as well as our white paper on the first four chaos experiments to run on Apache Kafka. If you have any questions, get support directly from our engineers in the Chaos Engineering Slack.

© 2020 Gremlin Inc. San Jose, CA 95113