
Using Gremlin to simulate resource contention is a great way to understand how your application responds at its boundary conditions, to test and validate autoscaling, and to help ensure you have the proper notifications configured in your monitoring system. To do this, Gremlin provides mechanisms for creating CPU load, consuming memory, and consuming your disk I/O and space. While Gremlin does this well, the Linux kernel also does a great job of balancing application load on your systems; often, it’s best to skew the kernel’s ability to balance Gremlin, which is, after all, just another application.
To do that, let’s explore the tools in the Linux OS that can help unbalance your system and test the extremities of what your applications and systems are capable of. Linux provides a great toolbox for this: commands such as nice, chrt, and ionice come to mind, as well as adjustments to the OOM Killer. In this article, we’ll dive into the use and use case for each one as it relates to Gremlin and injecting chaos experiments into your systems.
By default, the Linux scheduler, CFS, weights every process fairly at 0 and uses the SCHED_OTHER scheduling policy. The Gremlin daemon, gremlind, is treated no differently than any other process out of the box. This means that when running a CPU contention experiment, it will at worst simulate the boundaries of a normal runaway process. For most experiments, this is desired and expected. After all, you don’t normally run your processes in a heavily weighted way.
Sometimes, however, we want to smash past those boundaries, the built-in safeguards, and the fairness of the Linux scheduler. To do this, you’ll need to engage two mechanisms to tell the Linux scheduler that gremlind is your priority process, and therefore your system’s priority process.
The nice command sets a process’s priority on a scale from +19 to -20, with 0 being the default. This scale describes how nice we want a process to behave: +19 is very, very nice, meaning the process will readily defer processor time to other applications, while -20 at the other end of the spectrum is a very un-nice process indeed. We want gremlind to be very un-nice to our system; that is, to allow it more CPU time and the ability to preempt other processes in favor of itself.
This command will set gremlind to the highest priority on the machine:
sudo renice -20 `pgrep gremlind`
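If you want to confirm the change took effect, one simple check (assuming gremlind is running as a single process) is to look at the NI column that ps reports:
ps -o pid,ni,comm -p `pgrep gremlind`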
The other mechanism we need to adjust is the scheduling policy for Gremlin. As I mentioned above, the default scheduling policy is SCHED_OTHER. In total, there are five scheduling policies to choose from: SCHED_FIFO, SCHED_BATCH, SCHED_IDLE, SCHED_OTHER, and SCHED_RR. Without diving too deeply into the technical details behind each one, SCHED_BATCH is designed for CPU-intensive workloads. Setting the gremlind process to use this scheduling policy will, in conjunction with making it very un-nice, enable it to consume the majority of the CPU resources available to your host.
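If you’re curious which policies your kernel exposes and the priority range each one accepts, chrt can list them; this is a read-only check and doesn’t change anything:
chrt -m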
To view the current policy, run the following command:
sudo chrt -p `pgrep gremlind`
This command will set gremlind to the SCHED_BATCH policy:
sudo chrt -b -p 0 `pgrep gremlind`
To return gremlind to normal operating conditions, run the following commands:
sudo renice 0 `pgrep gremlind`
sudo chrt -o -p 0 `pgrep gremlind`
Along with the compute scheduler, Linux also has an I/O scheduler, with scheduling policies of its own. The policies of the Linux I/O scheduler are: Idle, Best Effort and Real Time.
The default I/O policy is Best Effort, which actually takes some of its direction from a process’s niceness. Best Effort has a priority scale of 0-7, with 0 being the highest priority. The default equation to determine where a process falls on the priority scale is: io_priority = (cpu_nice + 20) / 5. Therefore, if you’ve already set your niceness to -20, without changing anything you’ve got the best I/O scheduling that Best Effort can afford you. We can do better though.
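As a quick sanity check, ionice can report the I/O class and priority a process is currently getting; with the renice from earlier applied, gremlind should show up in the best-effort class at priority 0 (the exact output format varies by util-linux version):
ionice -p `pgrep gremlind`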
The Real Time policy gets first access to disk, regardless of what else is happening in the system. Like Best Effort, it also has a priority scale of 0-7, 0 being the highest priority.
To set gremlind to the Real Time policy with a priority of 0, run the following command:
sudo ionice -c 1 -n 0 -p `pgrep gremlind`
Post experiment, to return gremlind to normal conditions, run the following command:
sudo ionice -c 2 -n 4 -p `pgrep gremlind`
When Linux starts to run out of memory, it gets a bit defensive. Enter the OOM Killer, a process the kernel uses to free up memory when it starts to hit the limits of memory exhaustion. The OOM Killer works by giving each running process an oom_score; that is, a measure of how likely the kernel is to terminate that process in the case of low or no available memory.
It computes that score proportional to the amount of memory used by the process. The equation is oom_score = 10 * %_of_process_memory. So if your host has 10GB of memory, your application is using around 3GB, another 1GB is being utilized by other tasks, and gremlind is using roughly 5GB, then your app would receive an oom_score of ~300, while gremlind would receive an oom_score of 500. In that case, gremlind would be killed and the system should return to normal.
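If you want to confirm which process the OOM Killer actually chose during an experiment, the kernel log is the place to look. The exact wording of the message varies between kernel versions, but a filter along these lines usually surfaces it:
sudo dmesg -T | grep -i -E 'out of memory|killed process'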
You can see the oom_score of any given process by running the command:
cat /proc/$PID/oom_score
For instance, on a T2.micro at idle state, gremlind has an oom_score of around 8.
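To get a feel for how gremlind stacks up against everything else on the host, one rough way to rank processes by oom_score is to walk /proc. This is only a sketch and assumes a standard /proc layout:
# Print the ten highest oom_score values with their PID and command name
for pid in $(ls /proc | grep -E '^[0-9]+$'); do
  score=$(cat /proc/$pid/oom_score 2>/dev/null) || continue
  echo "$score $pid $(cat /proc/$pid/comm 2>/dev/null)"
done | sort -rn | head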
There are a couple of ways to modify the score. The first is through /proc/$PID/oom_score_adj and the second is through /proc/$PID/oom_adj. The first is a very granular scale, similar to nice, where positive integers make a process more likely to be killed and negative numbers make it less likely. The second method of adjustment is less granular, on a scale of +15 to -17, with -17 having the special meaning of never kill.
To set gremlind to never be killed by the OOM Killer, run the following command:
echo -17 > /proc/`pgrep gremlind`/oom_adj
To return gremlind to normal conditions, run the following command:
echo 0 > /proc/`pgrep gremlind`/oom_adj
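On newer kernels, oom_adj is deprecated in favor of the finer-grained oom_score_adj interface, which runs on a scale from -1000 to +1000, with -1000 meaning the process will never be killed. The equivalent set-and-restore commands are:
echo -1000 > /proc/`pgrep gremlind`/oom_score_adj
echo 0 > /proc/`pgrep gremlind`/oom_score_adj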
Finding the edge cases where our systems break down, and recording what happens in those events, is one of the many use cases for Chaos Engineering. Coupled with the right observability and DevOps practices, you can start to understand what happens at the extreme end of scalability for your applications. Resource contention is one of those extreme cases.
You wouldn’t start here, but you may be able to improve performance further by adjusting the policies around how Linux treats gremlind. By doing so, you’ll be able to experiment with pushing your hosts and applications into those extreme scenarios and prevent big problems should those scenarios occur naturally, while also finding new ways to tweak performance and enhance both reliability and process execution for your system.
Gremlin empowers you to proactively root out failure before it causes downtime. See how you can harness chaos to build resilient systems by requesting a demo of Gremlin.