Chaos Engineering using Gremlin on IBM Cloud
Managing the reliability of mission-critical cloud-native applications requires a significantly different approach than managing traditional enterprise/N-tier architectures. Container orchestration platforms like IBM Cloud Kubernetes Service and Red Hat OpenShift on IBM Cloud do much of the heavy lifting to ensure availability and reliability, but they also come with a multitude of architectural and configuration options.
When designing and maintaining mission-critical cloud workloads that require the highest levels of reliability, resiliency, and availability, application owners need new testing methods to validate a system's reliability.
Chaos Engineering is the practice of testing a system's response to turbulent behavior, such as infrastructure failures, unresponsive services, or missing components, by running experiments that create this behavior in a controlled fashion. It applies agile methodologies to identify and rectify a system's weaknesses.
At IBM, we have developed a set of principles and a 10-step methodology to run such experiments.
In this article, we show how Chaos Engineering can be used to manage the reliability of an application on IBM Cloud Kubernetes Service and Red Hat OpenShift on IBM Cloud. We walk through a set of experiments to assess the reliability of these services using Gremlin, a popular Chaos Engineering platform supported on IBM Cloud.
Using Gremlin to run a chaos experiment
Gremlin makes it easy to improve system resilience by helping you find weaknesses in your systems before they cause problems for your customers. Gremlin has an ever-growing library of attacks, including resource attacks, network attacks, and system state attacks. You can run individual attacks on a system, or combine multiple attacks into a Scenario to simulate real-world outages.
Gremlin operates in a SaaS model, where you only need to install a small agent on a host or virtual machine (VM), or as a Kubernetes DaemonSet. Once the agent is installed, the host will appear in the Gremlin web app as a target that you can run experiments on.
Before running a chaos experiment
Before running your first experiment, consider the following:
- Start in a contained, low-risk environment, such as a sandbox or test environment. Use a load testing tool like Locust or Apache JMeter to simulate a baseline user load. This will make it easier to assess the impact of attacks injected via Gremlin.
- Make sure you have proper observability and traceability measures in place. These will let you assess the health and performance of your clusters and applications. Use a logging service such as IBM Log Analysis with LogDNA, a monitoring service such as IBM Cloud Monitoring with Sysdig, and a traceability tool such as Zipkin.
Capture the baseline
Before selecting and launching a chaos experiment from Gremlin’s library, capture the metrics for a baseline user load for the scope of your experiment. It is important to understand the behavior of the application during normal conditions to compare it to the behavior during experiments.
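As a rough illustration of comparing an experiment against a baseline, the sketch below summarizes a window of response times and flags metrics that degraded beyond a tolerance. The function names, the choice of p50/p95 latency and error rate as metrics, and the 20% tolerance are all assumptions made for this example. In practice the numbers would come from your monitoring service rather than from in-memory lists.

```python
import statistics
from dataclasses import dataclass


@dataclass(frozen=True)
class Summary:
    p50_ms: float
    p95_ms: float
    error_rate: float


def summarize(latencies_ms, errors):
    """Summarize one observation window.

    latencies_ms: response times (ms) of the requests in the window.
    errors: how many of those requests failed.
    """
    # quantiles(n=100) returns 99 cut points: index 49 is p50, index 94 is p95.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return Summary(cuts[49], cuts[94], errors / len(latencies_ms))


def regressions(baseline, experiment, tolerance=0.20):
    """Return the names of metrics that got worse by more than `tolerance`."""
    found = []
    if experiment.p95_ms > baseline.p95_ms * (1 + tolerance):
        found.append("p95_ms")
    if experiment.error_rate > baseline.error_rate * (1 + tolerance):
        found.append("error_rate")
    return found
```

Capturing the baseline summary once and reusing it for every experiment keeps the comparisons consistent, provided the simulated user load stays the same.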
Select an experiment, execute, and observe
Next, select the experiment that you want to run. On a Kubernetes platform, one of the simplest experiments is assessing the impact of one container on another within a single pod. While most organizations try to follow a one-container-per-pod deployment model, there are cases where multiple containers need to be in the same pod because of dependencies or the need to share resources.
Run experiments on multiple containers per pod
When running container-level experiments, consider the following:
- When selecting a Kubernetes pod as your target, Gremlin lets you choose which container(s) to run the attack on. If a pod has multiple replicas on different worker nodes, you can narrow the blast radius by choosing which specific pod (or a percentage of pods) to target.
- Run CPU attacks before memory attacks. Most applications exhibit higher latency when available CPU is low but become highly unstable when faced with low memory, so CPU attacks are the lower-risk starting point.
- For pods that require persistent storage, such as file shares and databases, start with IO attacks to observe the impact of large amounts of read/write operations. Gradually add attacks to include concurrent writes, bigger block sizes, and low free space.
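To make the idea of narrowing the blast radius to a percentage of pods concrete, here is a minimal sketch of one way such a selection could work. It is not Gremlin's implementation. The function name `pick_targets`, the round-up rule, and the "at least one pod" floor are assumptions made for illustration.

```python
import math
import random


def pick_targets(pods, percent, seed=None):
    """Pick `percent` of the given pods (at least one) as attack targets.

    pods: names of the replicas that match the selection.
    percent: share of pods to target, in the range (0, 100].
    seed: optional seed so that a selection can be reproduced.
    """
    if not pods:
        raise ValueError("no pods to choose from")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    # Round up so that a small percentage still selects one pod.
    count = max(1, math.ceil(len(pods) * percent / 100))
    rng = random.Random(seed)
    return sorted(rng.sample(list(pods), count))
```

Starting with a low percentage and raising it only after the application behaves as expected follows the same start-small approach as the rest of the experiments.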
Run experiments at the pod level
The next step is to experiment at the pod level and add network experiments. Pod-level experiments open up the opportunity to launch DNS, latency, packet loss, and other attacks.
- Start with lower-risk attacks like injecting delays or introducing packet loss to find out whether your application can withstand poor network conditions. Then test DNS: it plays a critical role in keeping your systems running, and many companies have fallen victim to DNS attacks.
- Use the attack parameters provided by Gremlin (IP address, network interface, protocol, and more) to control the blast radius of the attack.
- To test a complete service outage, use a blackhole attack. Blackhole attacks drop all network traffic to and from the selected pod. Browser-based frontends often use proxies and caching services to improve performance and security, and blackhole attacks let you test the impact on user experience when such services go down.
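Before injecting latency or packet loss, it can help to write down what you expect to happen. The sketch below is a deliberately simplified model for forming that hypothesis. It assumes the injected delay adds directly to each baseline response time, and that each request attempt is lost independently with the configured probability. Real networks behave differently (TCP retransmits lost packets, for example), so treat the results as rough expectations to compare against what you observe. All names are hypothetical.

```python
def share_timing_out(baseline_ms, injected_delay_ms, timeout_ms):
    """Fraction of requests whose baseline latency plus the injected
    delay would exceed the client timeout."""
    if not baseline_ms:
        raise ValueError("need at least one baseline sample")
    late = sum(1 for ms in baseline_ms if ms + injected_delay_ms > timeout_ms)
    return late / len(baseline_ms)


def failure_probability(packet_loss, attempts):
    """Chance that every one of `attempts` independent tries is lost,
    given a loss probability between 0.0 and 1.0."""
    if not 0.0 <= packet_loss <= 1.0:
        raise ValueError("packet_loss must be between 0.0 and 1.0")
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    return packet_loss ** attempts
```

If the observed timeout or failure rate during the attack is far above the modeled value, that gap is a good starting point for investigating retry settings and timeouts.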
Run experiments at the microservice level
Once you get comfortable with isolated container and pod-level attacks, it's time to experiment with attacks on a microservice as a whole. In a microservice architecture, an application is composed of multiple microservices that communicate and transact with each other. Microservices typically span many containers and pods. You can craft a mixture of attacks on different components and schedule them for the same time.
Before running microservice-level attacks:
- Understand your critical path, which is the set of components required for your application to serve its core function. This is important for knowing how transactions move through your application, and how different microservices interact with each other.
- Define a set of experiments to determine the potential impact of attacks on these transactions. Will they fail gracefully and allow users to retry, or will they corrupt data, resulting in financial losses or user attrition?
- Gremlin’s Scenario feature is useful here because you can orchestrate different sets of attacks in different orders. For example, you can use a node shutdown attack to halt a node and test whether your applications properly fail over to other worker nodes. Then, you can increase your blast radius and shut down a second node to make sure you can tolerate larger-scale failures.
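The escalating node shutdown example can be sketched as a sequence of steps that stops as soon as the application no longer meets its health target. This is a conceptual illustration, not Gremlin's Scenario API. `Step`, `run_scenario`, and the simulated cluster are hypothetical stand-ins, and the `inject` and `healthy` hooks would be replaced by real attacks and real health checks.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    inject: Callable[[], None]   # starts the fault (hypothetical hook)
    healthy: Callable[[], bool]  # True if the application still meets its target


def run_scenario(steps):
    """Run steps in order and stop at the first one the system does not survive.

    Returns a list of (step name, passed) pairs for the steps that ran.
    """
    results = []
    for step in steps:
        step.inject()
        passed = step.healthy()
        results.append((step.name, passed))
        if not passed:
            break  # do not widen the blast radius after a failure
    return results


# Simulated cluster standing in for real node shutdown attacks (illustrative only).
cluster = {"nodes_up": 3}
MIN_NODES_NEEDED = 2  # assumed capacity requirement


def shut_down_one_node():
    cluster["nodes_up"] -= 1


def still_serving():
    return cluster["nodes_up"] >= MIN_NODES_NEEDED


steps = [
    Step("shut down first node", shut_down_one_node, still_serving),
    Step("shut down second node", shut_down_one_node, still_serving),
    Step("shut down third node", shut_down_one_node, still_serving),
]
results = run_scenario(steps)
print(results)
```

Halting on the first failed check keeps the blast radius from growing while a known weakness is still unresolved.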
In addition to the application layer, you should plan for broader infrastructure layer attacks. Following the principle of starting small, create experiments that target the CPU, memory, and network components of your infrastructure.
There are many other use cases and experiments that could be used to assess the reliability and resiliency of applications. For example, some application features and functions are sensitive to date and time differences, such as SSL certificates, cookie expiration, ID expiration, and session expiration. Potential failures due to timing errors are often overlooked during normal testing. To uncover these issues, use Gremlin's Time Travel attack to change the system time of a host and observe any impact on your application's usability. Start with smaller offsets to uncover session- and cookie-related issues, then gradually increase the offset to test situations such as Daylight Saving Time.
Chaos Engineering is an essential practice to assess the reliability and resiliency of cloud-native applications. IBM has embraced Chaos Engineering in many of its own cloud and SaaS services, and also when managing the thousands of mission-critical workloads of its clients.
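When planning time offsets, it can be useful to list which time-sensitive items you expect to expire at each offset, so that you know what to look for during the experiment. The sketch below is a planning aid only and does not change any system clock. The item names, lifetimes, and the function `expired_after_offset` are hypothetical values chosen for illustration.

```python
from datetime import datetime, timedelta, timezone


def expired_after_offset(expiries, now, offset):
    """Return the names of items whose expiry is at or before `now + offset`."""
    shifted = now + offset
    return sorted(name for name, expires_at in expiries.items()
                  if expires_at <= shifted)


# Illustrative lifetimes, not taken from any real system.
now = datetime(2024, 1, 1, tzinfo=timezone.utc)
expiries = {
    "session": now + timedelta(minutes=30),
    "cookie": now + timedelta(days=1),
    "tls_certificate": now + timedelta(days=90),
}

# Grow the offset gradually and note what should expire at each step.
for offset in (timedelta(hours=1), timedelta(days=2), timedelta(days=120)):
    print(offset, expired_after_offset(expiries, now, offset))
```

Comparing this expected list with what actually breaks during the attack highlights time dependencies that were not documented.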