It's the middle of the night when your phone goes off. You rub your eyes and unlock the screen to see a SEV 1 alert from your incident management tool. The application is down, multiple cloud server instances are offline, and the remaining instances are being overwhelmed by the sudden increase in demand.
You jump out of bed and start trying to troubleshoot. You log into your cloud provider and try to provision systems manually, only to find out you can't. You try opening a remote terminal on the remaining servers, but you can't find any indication of a problem. Shortly after, your remote session gets cut off, and you can no longer access any instances.
You start digging deeper, messaging teammates, and following runbooks, but to no avail. The systems are unresponsive, your cloud platform won't let you provision new instances, and you don't know why. With your head in your hands, you look at the clock, and realize what day it is: February 29th.
As over-the-top as this story sounds, it's happened before. Time-related failure modes can be just as damaging as hardware, network, or even data center failures, if not more so. Time is central to almost every aspect of computing, from network packet synchronization to data integrity to security negotiation, and even slight variations can have significant consequences. The challenge (ironically) is that time-based failures tend to be infrequent: Daylight Saving Time (DST) transitions only happen twice a year, leap years only come around roughly once every four years, and more extreme events like the Year 2038 "end of epoch" problem only come once in a lifetime. How do you plan and test for scenarios that might not happen for months, years, or even decades? The answer is with the Time Travel attack.
In this post, we'll take an in-depth look at the Time Travel attack: how it works, and how you can use it to build systems that are resilient to even the most unexpected and esoteric time-based failure modes.
The Time Travel attack works by modifying the system clock of the target operating system using the settimeofday syscall. It can move the clock forward or backward from the current time by any number of seconds. To prevent systems from automatically resynchronizing their clocks using the Network Time Protocol (NTP), the attack includes an option to block network traffic over the NTP port (123) for the duration of the attack.
The attack requires the SYS_TIME capability, which is enabled by default when installing the Gremlin agent.
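Because settimeofday changes the wall clock but not the monotonic clock, one way for an application to notice a shift is to compare how much time each clock says has elapsed. The following is a minimal sketch of that idea, not part of Gremlin: `ClockSample`, `take_sample`, and `wall_clock_jump` are hypothetical names, and the one-second tolerance is an arbitrary assumption.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class ClockSample:
    """A paired reading of the wall clock and the monotonic clock."""
    wall: float       # seconds since the Unix epoch; follows system clock changes
    monotonic: float  # seconds from an arbitrary start; unaffected by clock changes


def take_sample() -> ClockSample:
    """Read both clocks as close together as possible."""
    return ClockSample(wall=time.time(), monotonic=time.monotonic())


def wall_clock_jump(before: ClockSample, after: ClockSample,
                    tolerance: float = 1.0) -> float:
    """Return the apparent wall-clock jump in seconds, or 0.0 if within tolerance."""
    wall_elapsed = after.wall - before.wall
    real_elapsed = after.monotonic - before.monotonic
    drift = wall_elapsed - real_elapsed
    return drift if abs(drift) > tolerance else 0.0


# Synthetic samples: 5 real seconds passed, but the wall clock advanced 3605 seconds.
before = ClockSample(wall=1000.0, monotonic=100.0)
after = ClockSample(wall=4605.0, monotonic=105.0)
print(wall_clock_jump(before, after))  # 3600.0
```

In a real service, you would call `take_sample()` periodically and log or alert when `wall_clock_jump` returns a non-zero value.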
The Time Travel attack supports these parameters:
- Length: How long the attack runs for.
- NTP: If toggled, block network traffic on port 123 to prevent NTP from correcting the system time.
- Offset: How much to offset the current time (in seconds).
Offset supports both positive and negative values. Positive values move the clock forward, and negative values move the clock backwards. An easy way to convert hours, days, weeks, or longer into seconds is by using a time conversion service like CalculatorSoup.
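If you'd rather compute the offset locally, a few lines of Python do the same conversion. This is a minimal sketch; `offset_seconds` is a hypothetical helper name, and it deliberately leaves out months and years because their lengths vary.

```python
from datetime import timedelta


def offset_seconds(*, weeks=0, days=0, hours=0, minutes=0, seconds=0) -> int:
    """Convert a duration into the whole number of seconds the Offset field expects."""
    delta = timedelta(weeks=weeks, days=days, hours=hours,
                      minutes=minutes, seconds=seconds)
    return int(delta.total_seconds())


print(offset_seconds(hours=1))     # 3600: one hour forward
print(offset_seconds(minutes=-5))  # -300: five minutes backward
print(offset_seconds(days=5))      # 432000: five days forward
```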
The amount of time that the clock is offset is called the magnitude or severity of the attack: the larger the offset, the greater the magnitude. As with all Gremlin attacks, you can run a Time Travel attack on multiple hosts simultaneously. The number of hosts affected is called the blast radius. When running your first Time Travel attack, start small by reducing the magnitude and blast radius as much as possible, while still being able to observe the change. For example, changing the time by five minutes is much more noticeable than changing the time by just a few seconds, but not as impactful as changing the time by five days.
Start by targeting a non-production host, such as a virtual machine instance in a development or test environment. While the attack is running, compare the host's clock to a clock that you know is accurate to confirm that the system time was changed and hasn't been reverted by NTP. Once the clock has changed, start testing your hypotheses about how your applications and services will behave. Do they operate the same way despite the time change, or are they generating errors? Are you noticing any connectivity problems or security alerts? As you run these experiments, remember to record your observations, discuss the outcomes with your team, and track any changes or improvements made to your systems as a result. This way, you can demonstrate the value of the experiments you’ve run to your team and to the rest of the organization.
Note that Time Travel only works on hosts and host-based Services, not containers or Kubernetes resources. This is because containers use the host's system clock. If you want to change the system clock in a container, you'll need to identify the host that it's running on and target it instead. If you're running a Time Travel attack on a virtual machine or cloud compute instance, note that the hypervisor or cloud platform might also manage the instance's time. For example, Amazon EC2 instances running Amazon Linux 2 sync to the Amazon Time Sync Service by default, which uses NTP. To prevent the time from automatically reverting, you can either run the Time Travel attack with the option to disable NTP, or disable the chrony service before the attack.
Time-based failure modes are among the most difficult types of failure modes to test, not only because time plays a vital role in data management, network communications, and general system operation, but also because tracking time in systems is extremely complicated. In addition to the standard concerns like Daylight Saving Time and leap years, do you know if your systems are designed to handle leap seconds, time zones with 30- and 45-minute offsets (with DST), region-specific calendar changes, and other complex cases?
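To make one of these cases concrete, here is a minimal sketch of a classic leap-day bug, the kind of failure from the opening story. The function names are hypothetical, and falling back to February 28 is just one possible policy (some systems prefer March 1).

```python
import calendar
from datetime import date


def add_years_naive(d: date, years: int) -> date:
    # Raises ValueError when d is February 29 and the target year is not a leap year.
    return d.replace(year=d.year + years)


def add_years_safe(d: date, years: int) -> date:
    """Add years, falling back to February 28 when February 29 does not exist."""
    target_year = d.year + years
    if d.month == 2 and d.day == 29 and not calendar.isleap(target_year):
        return d.replace(year=target_year, day=28)
    return d.replace(year=target_year)


leap_day = date(2024, 2, 29)
print(add_years_safe(leap_day, 1))  # 2025-02-28

try:
    add_years_naive(leap_day, 1)
except ValueError as err:
    print(f"naive version failed: {err}")
```

Code like `add_years_naive` works on every other day of the year, which is why this kind of bug tends to surface only when the clock actually reads February 29.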
The Time Travel attack helps teams proactively prepare for these cases by moving systems far enough ahead in time to trigger them. By running a Time Travel attack, we can observe the behavior of our systems at the exact time of the failure, immediately revert them back to the correct time, then implement any necessary fixes. This lets us:
- Prepare for time-based scenarios like DST, date rollover problems like Y2K, and "end of epoch" problems like Year 2038.
- Test the impact of time skew on security and compliance mechanisms, such as TLS certificate expiration dates.
- Test the effectiveness of an NTP implementation.
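As an illustration of the certificate case, the sketch below classifies a certificate's validity window against a skewed clock. It is a simplified model rather than a real TLS check: the 60-day certificate, its dates, and the `cert_status` helper are all hypothetical, and a real client would read `notBefore` and `notAfter` from the certificate itself.

```python
from datetime import datetime, timedelta, timezone


def cert_status(not_before: datetime, not_after: datetime, now: datetime) -> str:
    """Classify a certificate's validity window relative to the clock reading `now`."""
    if now < not_before:
        return "not yet valid"
    if now > not_after:
        return "expired"
    return "valid"


# Hypothetical 60-day certificate issued on January 1, 2022 (UTC).
not_before = datetime(2022, 1, 1, tzinfo=timezone.utc)
not_after = not_before + timedelta(days=60)
real_now = datetime(2022, 1, 25, tzinfo=timezone.utc)

# Offsets in seconds, as they would be entered for a Time Travel attack.
for offset in (0, 3600, 5_256_000, -2_592_000):
    skewed_now = real_now + timedelta(seconds=offset)
    print(offset, cert_status(not_before, not_after, skewed_now))
```

Moving the clock forward far enough makes the certificate look expired, while moving it backward makes it look not yet valid. Either way, the connection can fail.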
Now that you know how the Time Travel attack works, try running it yourself:
- Log into your Gremlin account (or sign up for a free trial).
- Create a new attack and select a host to target. Start with a single host to limit your blast radius.
- Under Choose a Gremlin, select the State category, then select Time Travel.
- Set the Length of the attack.
- Enable NTP to block network communication with NTP servers.
- Change the Offset to 3600. This will move the system clock one hour into the future.
- Click Unleash Gremlin to start the attack.
While the attack is running, monitor the system clock on your target host. An easy way to monitor the current time on a Linux host is to use the date command. The following command runs date every second:

    watch -n 1 'date'

The output looks similar to this:

    Every 1.0s: date    MacBook.local: Tue Jan 25 17:59:33 2022

    Tue Jan 25 17:59:33 EST 2022
As you run the attack, keep an eye on the terminal output. In a few seconds, the date will jump an hour ahead. When the attack completes, the date will jump back to the real time.
Now, re-run the attack, but increase the offset so that the new time passes an event like a DST transition, a leap day, or the end of the epoch. For example, DST in the U.S. begins on the second Sunday in March. This post was published in January 2022, so in order to test our system's compatibility with DST, we'd need to move the system clock forward by roughly two months. If we use CalculatorSoup to convert two months to seconds, we get 5256000. We can copy this into our offset, re-run the attack, and now the host clock will move ahead to late March. Now we can thoroughly test our services, applications, and systems to make sure they can continue operating reliably. This includes checking newly created or modified data to make sure the correct date is stored, checking for expiring or expired security certificates, and checking for any crashed or terminated processes.
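If you want to land just before a specific moment rather than somewhere past it, you can compute the offset from two timestamps. The sketch below is one way to do that under a few assumptions: `offset_to_reach` is a hypothetical helper, the starting time is taken from the sample terminal output above, and a fixed UTC-5 offset stands in for U.S. Eastern Standard Time to keep the example self-contained.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC-5 offset standing in for U.S. Eastern Standard Time (an assumption;
# a real script would more likely use zoneinfo with a named time zone).
EST = timezone(timedelta(hours=-5), "EST")


def offset_to_reach(now: datetime, target: datetime) -> int:
    """Return the Offset, in whole seconds, that moves the clock from `now` to `target`."""
    if now.tzinfo is None or target.tzinfo is None:
        raise ValueError("both datetimes must be timezone-aware")
    return int((target - now).total_seconds())


now = datetime(2022, 1, 25, 17, 59, 33, tzinfo=EST)
# One minute before 2:00 a.m. on Sunday, March 13, 2022, when U.S. clocks spring forward.
target = datetime(2022, 3, 13, 1, 59, 0, tzinfo=EST)

print(offset_to_reach(now, target))  # 4003167
```

Using an offset like this lets you watch the system cross the DST boundary while the attack is running, instead of only observing it after the change has already happened.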
Now that you’ve run an attack, try using a Scenario. Scenarios allow you to run multiple attacks sequentially, as well as monitor the availability of the target system(s) using Status Checks. Gremlin includes Recommended Scenarios designed to help test for specific use cases, such as TLS/SSL certificate expiration. A link to this Scenario is available below.