Running application-level chaos experiments on Apache Kafka with Gremlin ALFI

In a previous tutorial, we showed how to run chaos experiments on a self-managed Apache Kafka cluster. In this tutorial, we'll use application layer fault injection (ALFI) to run chaos experiments directly in a Kafka application. We'll show you how to use the Gremlin ALFI client and the Gremlin web app to run attacks on a Java application, with the goal of improving the performance and reliability of our application and of Kafka itself.
Application layer fault injection (ALFI) is a method of injecting failure directly into an application. ALFI lets us run chaos experiments on individual applications, services, and even specific requests. For example, we can add latency to specific function calls, generate errors on a percentage of requests, and even experiment in managed or serverless environments such as Confluent Cloud, Amazon MSK, AWS Lambda, and Heroku.
With ALFI, we have much greater control over the blast radius (the scope) of our attacks and can target specific Kafka interactions. For example, we can limit attacks to specific messages, producers, consumers, and topics, whereas in our previous tutorial we were only able to target entire nodes. This makes it much easier to ensure that your applications are production-ready before you deploy to customers.
We'll demonstrate ALFI using an open source application called Kafka ALFI Demo. This is a Spring Boot application that creates a REST API endpoint and records the IP address and client name for each device that connects to it. A producer class publishes this data to a Kafka topic, while a consumer service pulls entries from the topic and writes them to a log file.
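To make that data flow concrete, here is a minimal stand-alone sketch of the same pattern. It uses an in-memory BlockingQueue as a stand-in for the Kafka "users" topic and a plain list as a stand-in for the consumer's log file; the class and method names are hypothetical, not taken from the demo app.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative sketch of the demo app's producer/consumer flow.
// A BlockingQueue stands in for the Kafka topic; a List stands in for the log file.
public class KafkaFlowSketch {
    private final BlockingQueue<String> topic = new LinkedBlockingQueue<>(); // stand-in for topic "users"
    private final List<String> log = new CopyOnWriteArrayList<>();           // stand-in for the consumer's log

    // Producer side: what the REST endpoint does with each incoming request.
    public void publish(String client, String ip) {
        topic.add(client + "," + ip);
    }

    // Consumer side: drain entries from the topic and "log" them.
    public void consumeAll() {
        String record;
        while ((record = topic.poll()) != null) {
            log.add(record);
        }
    }

    public List<String> log() {
        return log;
    }

    public static void main(String[] args) {
        KafkaFlowSketch app = new KafkaFlowSketch();
        app.publish("test", "0.0.0.0"); // same payload as the curl example later in this tutorial
        app.consumeAll();
        System.out.println(app.log()); // [test,0.0.0.0]
    }
}
```

In the real demo app, the publish step goes through a Kafka producer and the consume step runs in a listener, but the shape of the pipeline is the same.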
To complete this tutorial, you'll need:
First, clone the demo application repository from GitHub. If you're using an IDE like Eclipse or IntelliJ IDEA, you can import the project as a Gradle project.
git clone https://github.com/8bitbuddhist/kafka-alfi-demo.git
Next, enter your Kafka cluster configuration by opening the src/main/resources/application.yml file. Change the values of bootstrap-servers so that they contain a comma-separated list of your Kafka brokers:
spring:
  kafka:
    consumer:
      bootstrap-servers: broker1:9092,broker2:9092
    producer:
      bootstrap-servers: broker1:9092,broker2:9092
Next, we need to configure the Gremlin library so that it can authenticate with the Gremlin Control Plane. Follow the instructions in the authentication & configuration documentation and set the required configuration values in the src/main/resources/gremlin.properties file. Your file should look similar to this:
GREMLIN_ALFI_IDENTIFIER=KafkaALFIDemo
GREMLIN_TEAM_ID=[your Gremlin team ID]
GREMLIN_TEAM_PRIVATE_KEY_OR_FILE=file:////path/to/your/private.key
GREMLIN_TEAM_CERTIFICATE_OR_FILE=file:////path/to/your/team.cert
GREMLIN_ALFI_ENABLED=true
Now let's run the application. From the command line, navigate to the project's root directory and type:
./gradlew bootRun
If you're using an IDE, create a Gradle configuration and run the bootRun task. Here, we're using IntelliJ IDEA:
When we run the application, it creates and exposes a REST API endpoint at http://localhost:9000/kafka/publish. You can test this by running the following at your command prompt:
curl -X POST -F "client=test" -F "ip=0.0.0.0" http://localhost:9000/kafka/publish
This creates a new Kafka topic named "users" containing our message and a consumer group named "group_id". Next, log into the Gremlin web app, select Clients, and then select Application. The application instance is listed as "KafkaALFIDemo".
Now we can start running attacks!
For our first attack, we want to see the impact that latency has on Kafka's throughput. Latency can have many different sources: poor network quality, overloaded servers, poor application optimization, or slowdowns in downstream components. Any of these can cause message throughput to drop, which can cascade up the pipeline through our producers and to our applications. With Gremlin, we can inject latency directly into our application to simulate degraded performance and observe the impact that this has on the user experience. This can help with capacity planning and load balancing.
1. In the Gremlin web app, create a new attack.
2. Select the "Applications" tab.
3. Under Application Query, select "Custom Application Type" and enter the value of GREMLIN_ALFI_IDENTIFIER ("KafkaALFIDemo" by default).
4. Under Traffic Query, select "Custom Traffic Type" and enter the value passed to TrafficCoordinates.Build().withType() ("KafkaALFIDemo" by default). Set the percentage of traffic to impact to 100.
5. Add a custom traffic field with service for the key and producer for the value.
6. Under "Choose a Gremlin", set Latency to 500 ms.
7. Change "Duration" to 120 to run the test for 2 minutes.
8. Click "Unleash Gremlin" to run the attack.
9. Open a terminal in your project's root directory and run the generate-data.sh script to generate requests to the REST API.
To verify that the attack is running, check your application output for messages like this:
2020-07-07 16:08:11,955 INFO com.gremlin.GremlinService: Gremlin injecting ExperimentImpact{experimentGuid='ec65137e-afba-42ed-a513-7eafbaa2ed38', impact=com.gremlin.Impact{latency_to_add_in_ms="500.0", exception_should_be_thrown="false"}} for : Coordinates{type=KafkaALFIDemo, fields={service=producer}}
The data generation script sends a randomly generated IP address and client name to our application's REST API endpoint every 100 ms. When the attack starts, you'll notice that the rate of requests drops significantly. This means that the extra latency in our producer is causing the application to respond more slowly to API requests, which in turn is affecting client performance.
In a real-world application, we might try to mitigate this by creating a queue to buffer API calls, load balancing requests across multiple producers, or by making our API calls asynchronous.
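One of those mitigations, making the API call asynchronous, can be sketched in a few lines of plain Java. This is a hypothetical illustration, not code from the demo app: a background executor takes over the (artificially slowed) producer send so the HTTP handler can return immediately, and the 500 ms sleep stands in for the latency Gremlin injects.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch of the "make API calls asynchronous" mitigation: hand the slow
// producer send to a background thread so the caller is not blocked.
public class AsyncPublishSketch {
    private final ExecutorService sender = Executors.newSingleThreadExecutor();

    // Simulated slow producer send, e.g. under a 500 ms latency attack.
    private void slowSend(String message) {
        try {
            Thread.sleep(500);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Synchronous handler: the caller waits out the full send latency.
    public long publishSync(String message) {
        long start = System.nanoTime();
        slowSend(message);
        return (System.nanoTime() - start) / 1_000_000; // elapsed ms
    }

    // Asynchronous handler: queue the send and return right away.
    public long publishAsync(String message) {
        long start = System.nanoTime();
        sender.submit(() -> slowSend(message));
        return (System.nanoTime() - start) / 1_000_000; // elapsed ms
    }

    public void shutdown() {
        sender.shutdown();
    }

    public static void main(String[] args) {
        AsyncPublishSketch app = new AsyncPublishSketch();
        System.out.println("sync took ~" + app.publishSync("test") + " ms");   // roughly the full 500 ms
        System.out.println("async took ~" + app.publishAsync("test") + " ms"); // just the queueing cost
        app.shutdown();
    }
}
```

The trade-off is that the caller no longer learns whether the send ultimately succeeded, so in practice you would pair this with a callback or a retry queue.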
Now that we've injected failure into our producer, let's do the same with our consumer. This time, instead of injecting latency, we'll inject errors.
Why would we deliberately cause errors? Imagine our application receives malformed or unexpected data, such as an empty client name or an IP address containing letters. We want our consumer to be able to handle situations like this quickly and effectively, and using ALFI, we can proactively test this.
Here, we'll generate an error for one in every five messages (20%). We'll keep an eye on our throughput and log output while the test is running.
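To get an intuition for what this attack does, here is a rough stand-alone sketch of probabilistic fault injection. Gremlin's actual selection logic lives inside its client library; this hypothetical wrapper only mirrors the observable behavior of failing a fixed percentage of calls.

```java
import java.util.Random;

// Rough illustration of a percentage-based exception attack: a wrapper that
// throws before the wrapped work runs on ~failurePercent of calls.
public class FaultInjectionSketch {
    private final Random random;
    private final int failurePercent;

    public FaultInjectionSketch(int failurePercent, long seed) {
        this.failurePercent = failurePercent;
        this.random = new Random(seed); // seeded for reproducibility in this demo
    }

    // Wrap one unit of message processing; a slice of calls fail outright.
    public void process(Runnable work) {
        if (random.nextInt(100) < failurePercent) {
            throw new RuntimeException("Fault injected"); // mirrors the injected exception
        }
        work.run();
    }

    public static void main(String[] args) {
        FaultInjectionSketch faults = new FaultInjectionSketch(20, 42L);
        int processed = 0;
        int failed = 0;
        for (int i = 0; i < 1000; i++) {
            try {
                faults.process(() -> { /* consume one message */ });
                processed++;
            } catch (RuntimeException e) {
                failed++;
            }
        }
        // Roughly 1 in 5 calls fail, matching the 20% attack configured below.
        System.out.println("processed=" + processed + " failed=" + failed);
    }
}
```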
1. In the Gremlin web app, create a new attack.
2. Select the "Applications" tab.
3. Under Application Query, select "Custom Application Type" and enter the value of GREMLIN_ALFI_IDENTIFIER ("KafkaALFIDemo" by default).
4. Under Traffic Query, select "Custom Traffic Type" and enter the value passed to TrafficCoordinates.Build().withType() ("KafkaALFIDemo" by default). Set the percentage of traffic to impact to 20.
5. Add a custom traffic field with service for the key and consumer for the value.
6. Under "Choose a Gremlin", set Latency to 0 and enable Throw Exception.
7. Change "Duration" to 120 to run the attack for 2 minutes.
8. Click "Unleash Gremlin" to run the attack.
9. Open a terminal in your project's root directory and run the generate-data.sh script to generate requests.
To verify that the attack is running, check your application output for errors and stack traces:
2020-07-07 16:28:29,905 INFO com.gremlin.GremlinService: Gremlin injecting ExperimentImpact{experimentGuid='371a66f6-7dd0-4632-9a66-f67dd06632d3', impact=com.gremlin.Impact{latency_to_add_in_ms="0.0", exception_should_be_thrown="true"}} for : Coordinates{type=KafkaALFIDemo, fields={service=producer}}
2020-07-07 16:28:29,906 ERROR org.apache.juli.logging.DirectJDKLog: Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception [Request processing failed; nested exception is java.lang.RuntimeException: Fault injected by Gremlin] with root cause
java.lang.RuntimeException: Fault injected by Gremlin
	at com.gremlin.GremlinService.a(SourceFile:133)
	...
If we look at our log file, we'll notice a significant number of missing messages. This is because we don't have any error logging in place, so the message is lost when the exception is thrown. This is a problem: because the consumer successfully fetched the message from Kafka, it won't try processing the message again, and when Kafka deletes the message, it will be permanently lost. In a real-world application, we might fix this by adding error-handling logic to clean up the message before we try processing it again.
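One common shape for that error-handling logic is a dead-letter pattern: catch the failure and park the record somewhere durable instead of letting it vanish. Here is a minimal in-memory sketch of the idea; the class name, the "must contain a comma" validity rule, and the use of plain lists are all illustrative assumptions, not the demo app's implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a consumer that catches processing failures and keeps the failed
// record in a dead-letter list for later inspection or retry.
public class DeadLetterSketch {
    private final List<String> log = new ArrayList<>();         // successfully processed records
    private final List<String> deadLetters = new ArrayList<>(); // records that failed processing

    public void consume(String record) {
        try {
            // Hypothetical validity check standing in for real processing:
            // our records are "client,ip" pairs, so reject anything without a comma.
            if (!record.contains(",")) {
                throw new RuntimeException("Fault injected");
            }
            log.add(record);
        } catch (RuntimeException e) {
            deadLetters.add(record); // the record survives the failure instead of being lost
        }
    }

    public List<String> log() {
        return log;
    }

    public List<String> deadLetters() {
        return deadLetters;
    }

    public static void main(String[] args) {
        DeadLetterSketch consumer = new DeadLetterSketch();
        consumer.consume("test,0.0.0.0"); // processed normally
        consumer.consume("garbage");      // fails, but is captured, not dropped
        System.out.println("log=" + consumer.log());                 // log=[test,0.0.0.0]
        System.out.println("deadLetters=" + consumer.deadLetters()); // deadLetters=[garbage]
    }
}
```

In production you would typically publish dead letters to a separate Kafka topic rather than an in-memory list, so they survive restarts and can be reprocessed.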
If you're done experimenting and you want to stop a running attack:
Open the Gremlin web app.
To halt all active attacks:
To halt a single attack:
The UI will update to show that the attack is halted, and your application's traffic patterns will return to normal.
In this tutorial, we configured the Gremlin ALFI client for a Spring Boot application and ran several ALFI attacks on Kafka. From here, you can try creating your own attacks and integrating ALFI into your own applications. You can find the complete source code for our demo application on GitHub, and you can find more information about installing ALFI in our documentation.
If you want to run infrastructure level chaos experiments on Kafka, read our tutorial on running chaos experiments on Kafka, as well as our white paper on the first four chaos experiments to run on Apache Kafka. If you have any questions, get support directly from our engineers in the Chaos Engineering Slack.
Gremlin empowers you to proactively root out failure before it causes downtime. See how you can harness chaos to build resilient systems by requesting a demo of Gremlin.
Get started