Getting the most out of Gremlin Resource Experiments

Last Updated: July 10, 2019

Topics: Chaos Engineering, Gremlin


This is an older tutorial

We strive to keep all tutorials current. However, this tutorial has not been updated recently and may contain out-of-date instructions.

Using Gremlin to simulate resource contention is a great way to understand how your application responds under boundary conditions, to test and validate autoscaling, and to ensure your monitoring system sends the proper notifications. To do this, Gremlin provides mechanisms for creating CPU load, consuming memory, and consuming disk I/O and disk space. While Gremlin does this well, the Linux kernel also does a great job of balancing application load across your system. Gremlin is, after all, just another application, so it is often best to skew the kernel’s balancing in Gremlin’s favor.

To do that, let’s explore the tools in Linux that can help unbalance your system and test the extremes of what your applications and systems can handle. Linux provides a great toolbox for this: commands such as nice, chrt, and ionice come to mind, as well as adjustments to the OOM Killer. In this article, we’ll dive into the use and use case for each one as it relates to Gremlin and injecting chaos experiments into your systems.

Pushing the Compute Boundaries

By default, the Linux scheduler, CFS, gives every process a nice value of 0 and uses the SCHED_OTHER scheduling policy. The Gremlin daemon, gremlind, is treated no differently from any other process out of the box. This means that a CPU contention experiment will, at worst, simulate a normal runaway process. For most experiments, this is desired and expected. After all, you don’t normally run your processes with a heavily weighted priority.

Sometimes, however, we want to smash past those boundaries: the built-in safeguards and the fairness of the Linux scheduler. To do this, you’ll need to engage two mechanisms that tell the Linux scheduler that gremlind is your priority process, and therefore your system’s priority process.

The first is nice, which sets a process’s priority on a scale from +19 to -20, with 0 being the default. The scale describes how nice we want a process to be: at +19 a process is very nice, meaning it more readily yields processor time to other applications, while at -20 it is very un-nice indeed. We want gremlind to be very un-nice to our system; that is, we want to give it more CPU time and let it preempt other processes in its own favor.

This command will set gremlind to the highest priority on the machine:

BASH


sudo renice -n -20 -p `pgrep gremlind`
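
To confirm that the change took effect, you can read the nice value back. The following is a minimal sketch, assuming a Linux host with /proc mounted; nice_of is a hypothetical helper name, not part of Gremlin or any standard tool. It is demonstrated on the current shell so that it runs without gremlind installed.

```shell
# nice_of PID: print the nice value of a process (hypothetical helper).
# Field 19 of /proc/PID/stat is the nice value. The command name in
# field 2 may contain spaces, so strip everything up to the last ") "
# first, which shifts the nice value to position 17.
nice_of() {
  sed 's/^.*) //' "/proc/$1/stat" | awk '{ print $17 }'
}

# Demonstrated on the current shell; pass the gremlind PID on a real host.
nice_of "$$"
```

After the renice command above, calling nice_of with the gremlind PID should print -20.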


The other mechanism we need to adjust is the scheduling policy for Gremlin. As mentioned above, the default scheduling policy is SCHED_OTHER. Five policies concern us here: SCHED_FIFO, SCHED_BATCH, SCHED_IDLE, SCHED_OTHER, and SCHED_RR. CFS itself handles SCHED_OTHER, SCHED_BATCH, and SCHED_IDLE, while SCHED_FIFO and SCHED_RR are real-time policies. Without diving too deeply into the technical details behind each one, SCHED_BATCH is designed for CPU-intensive workloads. Setting the gremlind process to use this scheduling policy will, in conjunction with making it very un-nice, enable it to consume the majority of the CPU resources available to your host.

To view the current policy, run the following command:

BASH


sudo chrt -p `pgrep gremlind`
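
If chrt is not available, or you want the policy inside a script, the same information can be read from /proc. This is a minimal sketch, assuming Linux; policy_name and policy_of are hypothetical helper names. The numeric values are the kernel’s SCHED_* constants.

```shell
# policy_name N: map the kernel's numeric policy to its SCHED_* name.
policy_name() {
  case "$1" in
    0) echo SCHED_OTHER ;;
    1) echo SCHED_FIFO ;;
    2) echo SCHED_RR ;;
    3) echo SCHED_BATCH ;;
    5) echo SCHED_IDLE ;;
    6) echo SCHED_DEADLINE ;;
    *) echo "UNKNOWN($1)" ;;
  esac
}

# policy_of PID: print the scheduling policy of a process.
# Field 41 of /proc/PID/stat is the policy; after stripping the
# "pid (comm) " prefix it sits at position 39.
policy_of() {
  policy_name "$(sed 's/^.*) //' "/proc/$1/stat" | awk '{ print $39 }')"
}

# Demonstrated on the current shell; pass the gremlind PID on a real host.
policy_of "$$"
```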


This command will set gremlind to the SCHED_BATCH policy:

BASH


sudo chrt -b -p 0 `pgrep gremlind`


To return gremlind to normal operating conditions, run the following commands:

BASH


sudo renice -n 0 -p `pgrep gremlind`
sudo chrt -o -p 0 `pgrep gremlind`
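
Because the boost and the restore always come as a pair, it can help to wrap them in functions and preview them before touching a live host. The sketch below is one way to do that, not a Gremlin feature; run, boost_cpu, restore_cpu, and the DRY_RUN variable are all hypothetical names, and 1234 is a made-up PID.

```shell
# run CMD...: execute a command, or just print it when DRY_RUN is set.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# boost_cpu PID: make the process very un-nice and switch it to SCHED_BATCH.
boost_cpu() {
  run sudo renice -n -20 -p "$1"
  run sudo chrt -b -p 0 "$1"
}

# restore_cpu PID: return the process to nice 0 and SCHED_OTHER.
restore_cpu() {
  run sudo renice -n 0 -p "$1"
  run sudo chrt -o -p 0 "$1"
}

# Preview the commands without changing anything.
# Remove the DRY_RUN line and pass the real gremlind PID to apply them.
DRY_RUN=1
boost_cpu 1234
restore_cpu 1234
```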


Pushing the I/O Boundaries

Along with the compute scheduler, Linux also has an I/O scheduler, with scheduling policies of its own. The policies of the Linux I/O scheduler are: Idle, Best Effort and Real Time.

The default I/O policy is Best Effort, which takes some of its direction from a process’s niceness. Best Effort has a priority scale of 0-7, with 0 being the highest priority. By default, a process’s position on that scale is determined by: io_priority = (cpu_nice + 20) / 5. Therefore, if you’ve already set the niceness to -20, you already have the best I/O scheduling that Best Effort can give you without changing anything else. We can do better, though.
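
To make the formula concrete, the following sketch evaluates it for a handful of nice values. It only reproduces the arithmetic quoted above, using integer division; io_priority is a hypothetical function name.

```shell
# io_priority NICE: Best Effort I/O priority derived from a CPU nice value,
# using io_priority = (cpu_nice + 20) / 5 with integer division.
io_priority() {
  echo $(( ($1 + 20) / 5 ))
}

for n in -20 -10 0 10 19; do
  echo "nice $n -> io_priority $(io_priority "$n")"
done
```

A nice value of -20 maps to priority 0, the default nice value of 0 maps to 4, and +19 maps to 7.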

The Real Time policy gets first access to disk, regardless of what else is happening in the system. Like Best Effort, it also has a priority scale of 0-7, 0 being the highest priority.

To set gremlind to the Real Time policy with a priority of 0, run the following command:

BASH


sudo ionice -c 1 -n 0 -p `pgrep gremlind`


After the experiment, to return gremlind to normal conditions, run the following command:

BASH


sudo ionice -c 2 -n 4 -p `pgrep gremlind`


Pushing the Memory Boundaries

When Linux starts to run out of memory, it gets a bit defensive. Enter the OOM Killer, a mechanism the kernel uses to free up memory as it approaches memory exhaustion. The OOM Killer works by giving each running process an oom_score; that is, a measure of how likely the process is to be terminated when little or no memory is available.

It computes that score in proportion to the amount of memory the process uses. The equation is oom_score = 10 * %_of_process_memory. So if your host has 10 GB of memory, your application is using around 3 GB, other tasks are using another 1 GB, and gremlind is using roughly 5 GB, then your app would receive an oom_score of ~300, while gremlind would receive an oom_score of ~500. gremlind would be killed, and the system should return to normal.
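
The worked example above can be reproduced with a few lines of shell arithmetic. This sketch mirrors only the simplified equation in this article, not the kernel’s full calculation; oom_score_estimate is a hypothetical function name.

```shell
# oom_score_estimate USED TOTAL: the simplified formula
# 10 * (percentage of memory used), computed as 1000 * used / total
# to stay within integer arithmetic. Both arguments use the same unit.
oom_score_estimate() {
  echo $(( 1000 * $1 / $2 ))
}

total_gb=10
echo "app:      $(oom_score_estimate 3 "$total_gb")"
echo "other:    $(oom_score_estimate 1 "$total_gb")"
echo "gremlind: $(oom_score_estimate 5 "$total_gb")"
```

With 10 GB of total memory this prints 300 for the app, 100 for the other tasks, and 500 for gremlind, matching the figures above.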

You can see the oom_score of any given process by running the following command:

BASH


cat /proc/$PID/oom_score


For instance, on an idle t2.micro, gremlind has an oom_score of around 8.
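
During a memory experiment it is often more useful to see which processes the OOM Killer would pick first than to check one PID at a time. The following is a minimal sketch, assuming a Linux host with /proc mounted; top_oom_scores is a hypothetical helper name.

```shell
# top_oom_scores [N]: print "score pid name" for the N processes with the
# highest oom_score (default 5). Hypothetical helper; Linux /proc only.
top_oom_scores() {
  for dir in /proc/[0-9]*; do
    # Processes can exit mid-scan, so skip any that can no longer be read.
    score=$(cat "$dir/oom_score" 2>/dev/null) || continue
    name=$(cat "$dir/comm" 2>/dev/null) || continue
    printf '%s %s %s\n' "$score" "${dir#/proc/}" "$name"
  done | sort -rn | awk -v limit="${1:-5}" 'NR <= limit'
}

top_oom_scores 5
```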

There are a couple of ways to modify the score. The first is through /proc/$PID/oom_score_adj and the second is through /proc/$PID/oom_adj. The first uses a very granular scale, from -1000 to 1000, and is similar to nice in that positive integers make the process more likely to be killed and negative integers make it less likely. The second is less granular, on a scale of 15 to -17, with -17 being a special value that means never kill.
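
The commands in this article use oom_adj, which current kernels keep only as a deprecated legacy interface. The sketch below shows the same idea through oom_score_adj, where -1000 exempts a process from the OOM Killer. set_oom_score_adj is a hypothetical helper name, and the use of sudo tee assumes you are not already running as root.

```shell
# set_oom_score_adj PID VALUE: write VALUE to /proc/PID/oom_score_adj after
# checking that it is within the kernel's accepted range of -1000 to 1000.
# Hypothetical helper; lowering the value normally requires root.
set_oom_score_adj() {
  if [ "$2" -lt -1000 ] || [ "$2" -gt 1000 ]; then
    echo "oom_score_adj must be between -1000 and 1000" >&2
    return 1
  fi
  echo "$2" | sudo tee "/proc/$1/oom_score_adj" > /dev/null
}

# Read the current adjustment for this shell (typically 0).
cat "/proc/$$/oom_score_adj"

# On a real host, -1000 exempts gremlind from the OOM Killer and 0 restores it:
#   set_oom_score_adj "$(pgrep gremlind)" -1000
#   set_oom_score_adj "$(pgrep gremlind)" 0
```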

To set gremlind to never be killed by OOM Killer, run the following command:

BASH


echo -17 | sudo tee /proc/`pgrep gremlind`/oom_adj


To return gremlind to normal conditions, run the following command:

BASH


echo 0 | sudo tee /proc/`pgrep gremlind`/oom_adj


Conclusion

Finding the edge cases where our systems break down, and recording what happens in those events, is one of the many use cases for Chaos Engineering. Coupled with the right observability and DevOps practices, you can start to understand what happens at the extreme end of scalability for your applications. Resource contention is one of those extremes.

You wouldn’t start here, but you can push your experiments further by adjusting the policies around how Linux treats gremlind. By doing so, you’ll be able to experiment with pushing your hosts and applications into those extreme scenarios and prevent big problems should those scenarios occur naturally, while also finding new ways to tweak performance and enhance both reliability and process execution for your system.
