Modern applications come in a variety of forms (monoliths, microservices, serverless functions, and containers, to name a few), but at the heart of all of these are processes. Processes are the fundamental unit of execution that we use to run programs, and although we need processes to run our applications, software engineers rarely think about them. We leave it to the operating system to manage them for us, and rather than monitor individual processes for performance and availability, we monitor services as a whole. This doesn’t mean we shouldn’t care about processes, as even one failed process can make an entire system unstable.
In this blog post, we’ll take an in-depth look at the Process Killer attack and how you can use it to build resilience to failed or crashed processes. We’ll explain how it works, ways you can apply it, and how using it will help your team and organization build more reliable systems.
The Process Killer attack sends a signal (SIGKILL by default) to one or more processes on a host. It sends this signal repeatedly, at a specified interval, for the duration of the attack. You can identify the target process by ID (PID), by name, or with a regular expression (regex). Instead of sending a SIGKILL signal, you can choose from a number of other signals.
The attack supports these parameters:
- Length: How long the attack runs.
- Interval: The number of seconds between each send of the signal.
- Process: The process name or process ID to match. Supports regular expressions (regex).
- Group: The group name or ID to match.
- User: The user name or ID to match.
- Exact: If true, the process name must match exactly and not just as a substring (name matches only).
- Kill Children: If true, the children of the matched processes will also be killed.
- Full Match: If true, the match is made against the full command line string that the process was launched with, not just the process name.
- Signal: The signal to send to the target process(es).
- Process Selection: When multiple processes match, this determines whether the newest or oldest matching process will be killed (name matches only).
These attributes are called the magnitude of the attack. The magnitude increases as you target a broader set of processes, and/or send a stronger signal. As with all Gremlin attacks, you can run a process attack on multiple hosts simultaneously. This is called the blast radius. You can also target processes running in containers and Kubernetes Pods. Note that the Exact, Kill Children, Full Match, Signal, and Process Selection options are only visible in the Gremlin Web app by clicking “Show Advanced Options.”
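To make the matching options more concrete, here is a minimal Python sketch of how Process, Exact, Full Match, and Process Selection might interact when choosing targets. This is an illustration only, not Gremlin's implementation: the `ProcessInfo` record, the `select_targets` function, and the choice to return every match when no Process Selection is given are all assumptions made for the example. Matching by PID, Group, and User is left out for brevity.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ProcessInfo:
    """Hypothetical record for one running process (not a Gremlin type)."""
    pid: int
    name: str          # short process name, e.g. "nginx"
    cmdline: str       # full command line the process was launched with
    start_time: float  # larger value = started more recently


def select_targets(
    procs: List[ProcessInfo],
    pattern: str,
    exact: bool = False,
    full_match: bool = False,
    selection: Optional[str] = None,  # "newest", "oldest", or None for all
) -> List[ProcessInfo]:
    """Return the processes that the given options would target."""
    matched = []
    for proc in procs:
        if exact:
            # Exact: the name must equal the pattern, not merely contain it.
            hit = proc.name == pattern
        else:
            # Otherwise treat the pattern as a regex searched within the
            # name, or within the full command line when full_match is set.
            haystack = proc.cmdline if full_match else proc.name
            hit = re.search(pattern, haystack) is not None
        if hit:
            matched.append(proc)

    # Assumption: with no selection, every matching process is targeted.
    if selection == "newest" and matched:
        return [max(matched, key=lambda p: p.start_time)]
    if selection == "oldest" and matched:
        return [min(matched, key=lambda p: p.start_time)]
    return matched


if __name__ == "__main__":
    procs = [
        ProcessInfo(101, "nginx", "nginx: master process /usr/sbin/nginx", 10.0),
        ProcessInfo(102, "nginx", "nginx: worker process", 20.0),
        ProcessInfo(200, "nginx-exporter", "/usr/bin/nginx-exporter --port 9113", 30.0),
    ]
    # A bare substring/regex match also catches the exporter.
    print([p.pid for p in select_targets(procs, "nginx")])
    # Requiring an exact name narrows the magnitude to the two nginx processes.
    print([p.pid for p in select_targets(procs, "nginx", exact=True)])
```

Running a dry selection like this against a list of process names before an attack is one way to check that a pattern does not reach further than intended.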
When running your first Process Killer attack, start small by reducing the magnitude and blast radius as much as possible. Start by targeting a single, non-essential process on a single host to learn how the attack works (for example, a text editor with a blank document open). Use a process monitoring tool like top, htop, atop, nmon, or System Monitor to monitor your target process before, during, and after the attack. Before firing off a kill signal, consider sending a SIGTERM. SIGTERM lets the process go through its shutdown routine and close gracefully, whereas SIGKILL immediately terminates the process. Instead of sending the signal multiple times, send it once by setting both the Length and Interval parameters to 0. If you want to target multiple processes using regex, use a regex validator tool like regex101 or RegExr to check your expression and ensure it doesn’t match on unintended targets.
As you run these experiments, remember to record your observations, discuss the outcomes with your team, and track any changes or improvements made to your systems as a result. This way, you can demonstrate the value of the experiments you’ve run to your team and to the rest of the organization.
We rarely think about the processes running on our systems unless we’re deploying a new instance of an application or service, monitoring resource consumption, or troubleshooting an application or service that has stopped functioning. Yet under the hood, processes can get interrupted or even terminated (for example, if the host runs low on memory). To monitor and recover processes for us, we rely on tools such as Monit, daemontools, Supervisor, or systemd, or on Kubernetes in the case of containerized processes. With a Process Killer attack, we can validate that these systems are working as designed and can quickly detect and recover failed processes.
With Process Killer attacks, we can validate that:
- Watchdog processes can successfully detect and restart a failed process.
- In a containerized environment, the container orchestrator will restart failed containers or Pods.
- Clustered workloads, like a Kubernetes application or Kafka cluster, can continue running even if a key process fails.
This helps us maintain high availability, reduce the risk of downtime, and provide a better overall experience for our customers.
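As a rough picture of the behavior being validated, the following Python sketch shows a toy watchdog that restarts a child process after it has been killed. It is deliberately simplistic and is not how Monit, Supervisor, systemd, or Kubernetes are implemented; the `Watchdog` class is a hypothetical name for this example, and the SIGKILL sent in the usage section stands in for the attack. It assumes a POSIX host.

```python
import signal
import subprocess
import sys


class Watchdog:
    """Toy supervisor: keeps one child process running (illustrative only)."""

    def __init__(self, cmd):
        self.cmd = cmd
        self.restarts = 0
        self.child = subprocess.Popen(cmd)

    def check(self) -> bool:
        """Restart the child if it has exited. Returns True if it restarted."""
        if self.child.poll() is None:
            return False  # still running, nothing to do
        self.child = subprocess.Popen(self.cmd)
        self.restarts += 1
        return True

    def stop(self) -> None:
        """Terminate the supervised child and reap it."""
        self.child.kill()
        self.child.wait()


if __name__ == "__main__":
    wd = Watchdog([sys.executable, "-c", "import time; time.sleep(60)"])
    wd.child.send_signal(signal.SIGKILL)  # stand-in for the attack
    wd.child.wait()
    print("restarted:", wd.check(), "restarts:", wd.restarts)
    wd.stop()
```

A real supervisor would call something like `check` on a timer or react to the child's exit directly; the experiment answers whether that detection and restart actually happen, and how quickly.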
Now that you know how the Process Killer attack works, try running it yourself:
Log into your Gremlin account (or sign up for a free trial account).
Create a new attack and select a host to target. Start with a single host to limit your blast radius.
Under Choose a Gremlin, select the State category, then select Process Killer.
- Set the Length of the attack and the Interval between signals.
- In the Process field, enter the name(s) of the process(es) you want to send the signal to. You can enter a regular expression (regex) in this field.
- Optionally, in the Group and User fields, enter the name or ID of the group or user that the process belongs to.
Optionally, open the Show Advanced Options section and configure the following options:
- Enable Exact to require an exact match on the process name entered in the Process field.
- Enable Kill Children to also send the signal to processes created by the target process.
- Enable Full Match to compare the Process field against the entire command line string, not just the process name.
- Select the Signal to send to the process(es). By default, the KILL signal is sent.
- If the process name matches multiple processes, use the Process Selection to specify which of those processes should be killed (the oldest, or the newest).
Click Unleash Gremlin to start the attack.
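Conceptually, the Length and Interval values behave like a timed loop around a single signal send. The Python sketch below is one way to picture that loop on a POSIX host. It is an assumption-laden illustration rather than Gremlin's code: `send_repeatedly` is a hypothetical function, and treating a non-positive interval as "send once" is a choice made for this example.

```python
import os
import signal
import time


def send_repeatedly(pid: int, sig: int, length: float, interval: float) -> int:
    """Send `sig` to `pid` every `interval` seconds for `length` seconds.

    Returns the number of signals sent. Stops early if the process no
    longer exists.
    """
    deadline = time.monotonic() + length
    sent = 0
    while True:
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            break  # the target is gone, nothing left to signal
        sent += 1
        # Assumption of this sketch: a non-positive interval means "send once".
        if interval <= 0 or time.monotonic() + interval > deadline:
            break
        time.sleep(interval)
    return sent


if __name__ == "__main__":
    import subprocess
    import sys

    # Only ever point this at a throwaway process you started yourself.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print("signals sent:", send_repeatedly(child.pid, signal.SIGKILL, 0, 0))
    print("child exit code:", child.wait())
```

Repeating the signal matters when something restarts the target: each pass through the loop can hit the freshly restarted process, provided the target is looked up again by name rather than by a fixed PID as in this simplified version.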
Make sure to have your process monitor up and running during the attack, and compare your observations to your hypothesis:
- If you’re testing a watchdog process like daemontools, does the process start up again like you expected?
- If you killed a process running in a container, did your container orchestrator detect the failure and automatically restart the process?
- If you sent a signal other than KILL, such as TERM, did your process shut down gracefully? If you sent a KILL signal and restarted the process, was there any data loss or corruption?
- If you were testing a distributed workload, such as a replicated Kubernetes Deployment, did traffic get successfully redirected to other instances with minimal downtime?
- If you were testing the high availability of a cluster (for example, by terminating the kube-controller-manager process on a Kubernetes master), did the cluster maintain its integrity?
Once you’ve validated your initial hypothesis, try increasing the magnitude of your attack by targeting more critical processes, like system processes. You can also increase the blast radius by targeting more hosts simultaneously. How does this impact the stability of your systems and applications? More importantly, if these processes were to fail in production, how would the failures impact your users, and how long would it take you to recover?
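To put a number on recovery time, one option is to poll a health check from the moment the attack fires until the service responds again. The Python sketch below shows the idea under stated assumptions: `time_to_recover` is a hypothetical helper, and `is_healthy` is whatever check fits your system (an HTTP probe, a PID lookup, a queue depth), supplied by you.

```python
import time
from typing import Callable, Optional


def time_to_recover(
    is_healthy: Callable[[], bool],
    timeout: float = 60.0,
    poll_interval: float = 0.5,
) -> Optional[float]:
    """Poll `is_healthy` until it returns True.

    Returns the elapsed seconds at the first healthy result, or None if
    the check never passed within `timeout` seconds.
    """
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        if is_healthy():
            return time.monotonic() - start
        time.sleep(poll_interval)
    return None


if __name__ == "__main__":
    # Fake health check that fails twice and then passes, for demonstration.
    results = iter([False, False, True])
    elapsed = time_to_recover(lambda: next(results), timeout=5, poll_interval=0.1)
    print("recovered" if elapsed is not None else "did not recover", elapsed)
```

The measured value is only as precise as the polling interval, so choose an interval well below the recovery time you expect to see, and record the result alongside your other observations.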
Now that you’ve run the attack, try using a Scenario to run multiple attacks sequentially. You can use a Scenario to target different processes or process groups without complex regex rules, send different signals to different processes, target processes for different users at different times, and more. Try it out in the Gremlin web app, and remember to record your observations!