Getting started with disk attacks
Persistent storage is one of the more difficult aspects of managing distributed systems. When we attach a storage device to a host, whether it’s flash storage, network-attached storage (NAS), or old-fashioned spinning disks, we generally don’t give it much thought until we start running distributed applications or need to increase capacity. But a lot can go wrong with storage, and failures can have unexpected consequences for our systems, services, and applications.
In this blog, we take an in-depth look at disk attacks and explain how they work, their technical use cases, and how to tie these back to business objectives. By reading this blog, you’ll learn why disk attacks are useful and how they can deliver value to your organization.
Why should you run disk attacks?
Storage is a finite resource that, when exhausted, can have unpredictable and potentially catastrophic effects on systems and applications. When applications and services can no longer write to disk, they might stop accepting new requests, go into an automatic suspend state, or crash. If a large file transfer is consuming most of the available IOPS, other applications that use the same disk may slow down, causing system-wide latency. We need to make sure that our applications can tolerate low disk space and high disk latency without failing.
With disk attacks, we can validate that:
- We have enough disk capacity and IOPS to handle a large workload migration, such as a database transfer.
- Our automatic disk cleanup and compression methods are working, such as log rotation.
- Dynamic provisioning systems like database sharding and Kubernetes Dynamic Volume Provisioning are working as intended.
This lets us:
- Lower operating expenses by dynamically adding storage when needed, instead of overprovisioning up front.
- Reduce the mean time to detection (MTTD) for storage-related incidents by adjusting monitoring and alerting thresholds for disk consumption.
How does a disk attack work?
A disk attack consumes disk space on a storage device until disk usage reaches the percentage that you specify. Although the attack is called a “disk” attack, it’s not limited to a traditional spinning hard disk. The attack works by continuously writing text to a directory on the filesystem, so it works on any storage device that can be accessed through a filesystem. This includes HDDs, SSDs, NVMe drives, and even network-attached storage. With a disk attack, you can configure:
- Directory: the root directory where the attack will be executed.
- Workers: the number of concurrent workers writing to the disk.
- Block Size: the number of kilobytes (KB) that are read/written at a time.
- Volume Percentage: the percentage of the target volume to fill.
These attributes are called the magnitude of the attack. As with all Gremlin attacks, you can run a disk attack on multiple systems simultaneously. This is called the blast radius. Note that the Workers and Block Size options are only available by clicking on “Show Advanced Options” under Volume Percentage. Adjusting these options can decrease the amount of time needed for the disk attack to initialize and consume storage space.
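To make the Directory, Workers, and Block Size options more concrete, here is a minimal Python sketch of the general technique: several workers writing fixed-size blocks of filler data into files under one directory. This is an illustration only, not Gremlin's implementation. The function name fill_directory, the filler file names, and the byte budget parameter are all hypothetical, and the example deliberately writes only 64 KB into a temporary directory.

```python
import os
import tempfile
import threading


def fill_directory(directory, total_bytes, workers=1, block_size_kb=4):
    """Write roughly total_bytes of filler data into directory.

    Hypothetical helper for illustration; not Gremlin's implementation.
    Returns the paths of the files created so the caller can delete them.
    """
    if workers < 1 or block_size_kb < 1:
        raise ValueError("workers and block_size_kb must be at least 1")

    block = b"0" * (block_size_kb * 1024)  # one block of filler text
    per_worker = total_bytes // workers    # any remainder is not written
    created = []
    lock = threading.Lock()

    def worker(index):
        path = os.path.join(directory, f"filler-{index}.dat")
        written = 0
        with open(path, "wb") as handle:
            while written < per_worker:
                # Trim the final block so we never exceed the budget.
                chunk = block[: per_worker - written]
                handle.write(chunk)
                written += len(chunk)
        with lock:
            created.append(path)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sorted(created)


if __name__ == "__main__":
    # Small, self-cleaning demo: 64 KB split across two workers.
    with tempfile.TemporaryDirectory() as scratch:
        files = fill_directory(scratch, 64 * 1024, workers=2, block_size_kb=4)
        print(len(files), "files,", sum(os.path.getsize(f) for f in files), "bytes")
```

More workers and larger blocks generally mean fewer, bigger writes happening in parallel, which is why adjusting those options can shorten the time it takes to consume the target amount of space.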
Gremlin will only consume disk space up to the amount specified in Volume Percentage. For example, if you’re currently using 50% of your disk space, and you set the magnitude of your disk attack to 80%, Gremlin will only use an additional 30%. If you instead set your magnitude to a lower amount, like 20%, Gremlin will recognize that you’re currently using more disk space than the magnitude, and won’t do anything. Gremlin will also create and use its own files for the attack, so at no point is your data modified in any way.
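The arithmetic in this example can be sketched in a few lines of Python. The function names below are hypothetical and only model the behavior described above (fill up to the target, never beyond it, and do nothing if usage is already higher); they are not part of any Gremlin API.

```python
def additional_percent(current_used_percent, volume_percentage):
    """Percentage points of the volume still to fill to reach the target."""
    return max(0, volume_percentage - current_used_percent)


def additional_bytes(total_bytes, used_bytes, volume_percentage):
    """Bytes still to write so that usage reaches volume_percentage."""
    target_bytes = total_bytes * volume_percentage // 100
    return max(0, target_bytes - used_bytes)


if __name__ == "__main__":
    GIB = 2**30
    print(additional_percent(50, 80))  # 50% used, 80% target
    print(additional_percent(50, 20))  # target below current usage
    print(additional_bytes(100 * GIB, 50 * GIB, 80) // GIB, "GiB")
```

On a 100 GiB volume that is half full, an 80% target works out to 30 GiB of additional data, while a 20% target works out to nothing at all.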
The primary focus of disk attacks is to consume storage space on a specific volume. Depending on the speed of the device, this process can also stress its throughput, which is commonly measured in input/output operations per second (IOPS). This allows for more complex experiments, such as simulating the copy of a large file like a database or backup.
Unlike with a CPU or Memory attack, Gremlin doesn’t show usage metrics during a disk attack. Instead, use an observability tool, or a simple command-line tool like df or du, to check disk usage before and during the attack.
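If you would rather check usage from a script, a minimal sketch using Python's standard library looks like this. The used_percent helper is a hypothetical name, and the system temporary directory is used here only as an example path. The result may differ slightly from what command-line tools report, since they can account for reserved space differently.

```python
import shutil
import tempfile


def used_percent(path):
    """Return the percentage of the volume containing path that is in use."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return 100 * usage.used / usage.total


if __name__ == "__main__":
    target = tempfile.gettempdir()  # example path; use your attack directory
    print(f"{target}: {used_percent(target):.1f}% used")
```

Running this before the attack gives you a baseline, and running it again while the attack is in progress shows how close the volume is to the percentage you set.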
When running your first disk attack, start small. Avoid using 100% of your disk space at first, as this can have unintended and unexpected consequences for applications and processes running on your target. Instead, only consume enough space to validate the hypothesis you’re testing. For example, if you’re trying to validate a disk monitor that you set up to fire at 80%, set your magnitude to 80%. If you’re using a dynamic or scalable storage solution, consume just enough space to test the scaling process. Once you feel confident that your systems can withstand these conditions, try increasing the magnitude to 80, 90, or even 99%.
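When the hypothesis is about a monitor, it helps to compare the moment usage actually crossed the threshold with the moment the alert fired. The sketch below assumes you have collected (timestamp, used percent) samples yourself during the attack. The function names and the sample values are hypothetical and exist only to illustrate the comparison.

```python
def first_crossing(samples, threshold_percent):
    """Return the timestamp of the first sample at or above the threshold."""
    for timestamp, percent in samples:
        if percent >= threshold_percent:
            return timestamp
    return None  # the threshold was never reached


def detection_delay(samples, threshold_percent, alert_timestamp):
    """Seconds between crossing the threshold and the alert firing."""
    crossed = first_crossing(samples, threshold_percent)
    if crossed is None or alert_timestamp is None:
        return None
    return alert_timestamp - crossed


if __name__ == "__main__":
    # Hypothetical samples: (seconds since attack start, percent used).
    samples = [(0, 50.0), (10, 65.0), (20, 81.0), (30, 85.0)]
    print(first_crossing(samples, 80))
    print(detection_delay(samples, 80, alert_timestamp=45))
```

A long delay, or no alert at all, suggests that the monitor's threshold or evaluation interval needs adjusting before you raise the magnitude.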
As you run these experiments, remember to record your observations in the Gremlin web app, discuss the outcomes with your team, and track any changes or improvements made to your systems as a result. This way, you can demonstrate the value of the experiments you’ve run to your team and to the rest of the organization.
Get started with disk attacks
Now that you know how disk attacks work, try running one for yourself:
- Log into your Gremlin account (or sign up for a free trial).
- Create a new attack and select a host to target. Start with a single host to limit your blast radius.
- Under Choose a Gremlin, select the Resource category, then select Disk.
- Enter the Directory to run the attack in. This defaults to /tmp on Linux systems to prevent any files created by Gremlin from persisting.
- Enter the percentage of disk space to consume in the Volume Percentage field.
- Optionally, in the Advanced Options section, enter the number of Workers simultaneously writing to disk, and the amount of data to write at a time in Block Size.
- Click Unleash Gremlin to start the attack.
Gremlin also provides several Recommended Scenarios to guide you through different use cases, including testing storage volume limits and testing monitoring and alerting.