Imagine this: you're in the middle of an important presentation when all of a sudden your video feed starts to stutter. You hear other people speaking, but their words are choppy. A message comes through Slack from one of your co-workers: "I think your connection cut out." You scramble to try different solutions—restarting your videoconferencing application, checking your Internet connection, switching to your phone—but ultimately, your presentation gets cut short.
If you've been working remotely, this is probably a familiar scenario. When data is sent over a network, it gets split into discrete units of data called packets, which then get reassembled at their destination. If any of these packets are lost or become corrupt, the reassembled data is incomplete, resulting in stuttering video, latency, and dropped calls. As streaming becomes more commonplace, especially with the growing number of remote workers, testing the behavior of your systems and services under packet loss conditions is critical. This is one of the reasons why Gremlin provides a dedicated Packet Loss attack.
In this blog, we'll take a deep dive into the Packet Loss attack. We'll look at how it works, how to use it, and how it can help you build responsive, fault-tolerant applications and systems.
The Packet Loss attack works by dropping (or corrupting) a percentage of outbound network packets from a host or container. On Linux, it uses the kernel's existing Quality of Service (QoS) and Differentiated Services (diffserv) facilities to emulate packet loss; on Windows, it uses a custom driver. This way, it doesn't need to modify any firewall rules or iptables rulesets.
By default, the attack drops 1% of all outbound network traffic (except for DNS traffic and traffic to api.gremlin.com). You can configure the attack to drop a larger percentage of traffic (up to 100%), corrupt packets instead of dropping them, or only impact specific types of traffic based on port, IP address, hostname, and other parameters. Gremlin also provides a list of pre-defined third-party services that you can select for easy targeting, such as AWS EC2 regions.
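To make the default behavior concrete, here is a minimal Python sketch of the decision described above: drop each outbound packet with a fixed probability, but never touch DNS traffic or traffic to api.gremlin.com. This is only a conceptual model and not how Gremlin is implemented; the `Packet` type and the `should_drop` function are hypothetical names introduced for illustration.

```python
import random
from dataclasses import dataclass

# Destinations exempt by default, per the description above.
EXEMPT_HOSTS = {"api.gremlin.com"}
DNS_PORT = 53


@dataclass(frozen=True)
class Packet:
    """Hypothetical stand-in for an outbound packet."""
    dest_host: str
    dest_port: int


def should_drop(packet: Packet, percent: float, rng: random.Random) -> bool:
    """Return True if this packet would be dropped at the given loss percentage."""
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    if packet.dest_host in EXEMPT_HOSTS or packet.dest_port == DNS_PORT:
        return False  # exempt traffic is never dropped
    # rng.random() is in [0, 1), so percent=100 drops every non-exempt packet.
    return rng.random() * 100 < percent


rng = random.Random(42)  # seeded so the run is repeatable
packets = [Packet("example.com", 443)] * 10_000
dropped = sum(should_drop(p, 20, rng) for p in packets)
print(f"dropped {dropped / len(packets):.1%} of packets")
```

With a large enough sample, the observed drop rate should settle close to the configured percentage, which is the behavior you would expect to see from the attack at the network level.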
The attack supports these parameters:
Length: How long the attack runs for.
IP Addresses: Restricts the attack to specific IP address(es). This field supports CIDR values (e.g.
Device: The network interface to impact traffic on. If left blank, Gremlin will target all network interfaces.
Hostnames: Only impacts traffic to these hostnames.
Remote ports: Only impacts traffic to these destination ports.
Local ports: Only impacts traffic originating from these local ports.
Percent: The percentage of packets to drop. The number entered corresponds to a percentage: for example, a value of
Providers: Which external service provider(s) to impact, if any. To access this option in the Gremlin web app, click "Show Advanced Options".
Tags: If specified, the attack will only run on Gremlin agents associated with these tags.
Protocol: Which protocol to impact (TCP, UDP, or ICMP). By default, all protocols will be impacted.
Corrupt: If true, corrupt packets instead of dropping them.
These parameters make up what's called the magnitude of the attack. As you increase the percentage and target a wider range of network traffic, the magnitude increases. As with all Gremlin attacks, you can run a Packet Loss attack on multiple hosts simultaneously. The number of systems you target is called the blast radius. You can also run a Packet Loss attack on containers, Kubernetes resources, and Services.
By default, the attack excludes traffic to api.gremlin.com and DNS traffic (port 53) so the Gremlin agent can communicate with the Gremlin Control Plane. Removing these exceptions and running the attack could trigger a failsafe mechanism that automatically halts the attack.
When running your first Packet Loss attack, start small by reducing the magnitude and blast radius to a single non-production host and a single port number or service. For example, if you're running a web server such as Apache or Nginx, only drop packets on port 80 (or port 443 if using TLS). While the attack is running, monitor the availability and throughput of your service using a network monitoring tool to observe the impact on bandwidth, performance, and data integrity.
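A dedicated monitoring tool is the better choice, but a small probe loop is enough to see the effect. The Python sketch below requests a URL at a fixed interval and records whether each request succeeded and how long it took. The URL, interval, and timeout are placeholders, and `fetch`, `probe`, and `summarize` are hypothetical helper names.

```python
import http.client
import time
import urllib.request
from typing import Callable, List, Tuple

Sample = Tuple[bool, float]  # (succeeded, elapsed seconds)


def fetch(url: str, timeout: float) -> None:
    """Issue one GET request; raises on connection errors, timeouts, or HTTP errors."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()


def probe(url: str, count: int, interval: float = 1.0, timeout: float = 5.0,
          fetcher: Callable[[str, float], None] = fetch) -> List[Sample]:
    """Request `url` `count` times and record the outcome of each attempt."""
    samples: List[Sample] = []
    for _ in range(count):
        start = time.monotonic()
        try:
            fetcher(url, timeout)
            ok = True
        except (OSError, http.client.HTTPException):
            # URLError, HTTPError, and socket timeouts all derive from OSError.
            ok = False
        samples.append((ok, time.monotonic() - start))
        time.sleep(interval)
    return samples


def summarize(samples: List[Sample]) -> Tuple[float, float]:
    """Return (availability as a fraction, mean latency of successful requests)."""
    if not samples:
        raise ValueError("no samples collected")
    successes = [elapsed for ok, elapsed in samples if ok]
    availability = len(successes) / len(samples)
    mean_latency = sum(successes) / len(successes) if successes else float("nan")
    return availability, mean_latency
```

For example, calling `summarize(probe("http://your-endpoint/", count=30))` once before the attack and once while it is running gives you two availability and latency pairs to compare.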
As you run these experiments, remember to record your observations, discuss the outcomes with your team, and track any changes or improvements made to your systems as a result. This way, you can demonstrate the value of the experiments you’ve run to your team and to the rest of the organization.
Despite networks having many different resilience mechanisms, things can still go wrong. For example:
- A faulty switch or ethernet cable is dropping valid packets.
- Too many devices are using the same network access point (e.g. a WiFi router) at the same time, causing congestion and collisions.
- A customer is using your service over an unreliable network, such as a smartphone in a remote location with poor connectivity.
- Large physical distances between your systems and your customers are increasing the risk of packet loss or corruption.
When developing and testing services, developers are likely to use highly performant and stable networks. In the real world, networks are often slow, unreliable, and error-prone. If we don't test our services under these conditions, we won't know how they'll perform once we deploy them to production. With Packet Loss attacks, we can validate that:
- Our applications can work well on oversaturated, unstable, or slow networks.
- We can reliably serve customers on unstable networks without risk of data loss.
This is especially important in the streaming era, where consumers expect to stream large volumes of data continuously and with minimal interruptions. If a media consumer can't watch their favorite show after work, or a gamer can't play a multiplayer match, or a football fan can't watch the Super Bowl, how long do you think it will take for them to get frustrated? Packet Loss attacks let us test and prepare for these scenarios in advance, so we can mitigate them before deploying to our users. This way, we can provide the best possible user experience regardless of network conditions.
Now that you know how Packet Loss attacks work, try running one yourself:
Log into your Gremlin account (or sign up for a free trial).
Create a new attack and select a host to target. Start with a single host to limit your blast radius.
Under Choose a Gremlin, select the Network category, then select Packet Loss.
Set the Length of the attack.
Change Percent to the percentage of packets that Gremlin should drop (or corrupt). This is 1 by default, and can range from 1 to 100.
Optionally, enter the IP Addresses or Hostnames to drop traffic to, the network Device to impact, and the Remote or Local Ports to drop traffic on. For convenience, you can select an external service Provider to target, as well as target specific hosts by Tags.
- If you leave all of these options at their default values, Gremlin will drop packets from all outbound traffic.
Optionally, open the Show Advanced Options section and select the network Protocol to impact traffic on. By default, all protocols will be impacted.
- If you want to corrupt packets instead of dropping them, enable Corrupt.
Click Unleash Gremlin to start the attack.
Before starting the attack, gather information about the performance of your network using a network monitoring tool. Something as simple as ping will work, but for a better understanding of your network, use a tool like Apache Bench. You can use Apache Bench to measure the responsiveness of an endpoint simply by entering its URL. You can also run it for a specific period of time. For example, here we run Apache Bench against an endpoint for 30 seconds:
ab -t 30 http://your-endpoint/
Requests per second: 26.87 [#/sec] (mean)
Time per request: 37.222 [ms] (mean)
Time per request: 37.222 [ms] (mean, across all concurrent requests)
Transfer rate: 136.19 [Kbytes/sec] received
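If you plan to compare runs, it helps to capture these numbers programmatically. The following Python sketch pulls the headline metrics out of Apache Bench's text report with regular expressions. It assumes the report lines look like the sample above; `parse_ab_output` is a hypothetical helper, not part of Apache Bench.

```python
import re
from typing import Dict

PATTERNS = {
    "requests_per_second": r"^Requests per second:\s+([\d.]+)",
    # ab prints two "Time per request" lines; this matches only the plain mean.
    "time_per_request_ms": r"^Time per request:\s+([\d.]+) \[ms\] \(mean\)\s*$",
    "transfer_rate_kbytes": r"^Transfer rate:\s+([\d.]+)",
}


def parse_ab_output(text: str) -> Dict[str, float]:
    """Extract the metrics named in PATTERNS from an ab report."""
    metrics: Dict[str, float] = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, text, flags=re.MULTILINE)
        if match:
            metrics[name] = float(match.group(1))
    return metrics


baseline = parse_ab_output("""\
Requests per second: 26.87 [#/sec] (mean)
Time per request: 37.222 [ms] (mean)
Time per request: 37.222 [ms] (mean, across all concurrent requests)
Transfer rate: 136.19 [Kbytes/sec] received
""")
print(baseline)
```

Saving the parsed baseline alongside the results of each later run makes it easier to track how performance changes as you adjust the attack.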
Let's run a Packet Loss attack with Percent set to 20 (i.e. drop 20% of all packets, or one in every five). Once the attack is running, we'll re-run Apache Bench and measure the difference in performance. Already there's a noticeable change:
Requests per second: 6.00 [#/sec] (mean)
Time per request: 166.720 [ms] (mean)
Time per request: 166.720 [ms] (mean, across all concurrent requests)
Transfer rate: 30.32 [Kbytes/sec] received
Unsurprisingly, the throughput of our second test is roughly a fifth (22%) of our baseline throughput. Now that we have our observations and data, let's compare them to our hypothesis. Were our original assumptions correct, or were they refuted? We should also try to answer questions about our systems, such as:
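The 22% figure comes straight from the two "Requests per second" values. As a quick check in Python, using only the numbers reported above:

```python
baseline_rps = 26.87  # requests per second before the attack
attack_rps = 6.00     # requests per second with 20% packet loss

baseline_ms = 37.222  # mean time per request before the attack
attack_ms = 166.720   # mean time per request during the attack

ratio = attack_rps / baseline_rps
slowdown = attack_ms / baseline_ms

print(f"throughput retained: {ratio:.0%}")                       # 22%
print(f"time per request grew by a factor of {slowdown:.1f}")    # 4.5
```

Note that dropping 20% of packets did not cost 20% of throughput; it cost roughly 78%. A likely explanation is that each lost packet also triggers timeouts and retransmissions, so the effect on request rate is much larger than the raw loss percentage.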
- Were we able to maintain a connection to our service despite the drop in packets?
- Was any of the data we sent lost, or were the dropped packets successfully retransmitted?
- Did resending packets have a noticeable impact on network utilization? If so, how might this affect our systems during peak hours?
- Do our systems drop corrupt packets, or do they try (and fail) to process them? Does this cause any unexpected behaviors, like process crashes or unhandled exceptions?
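For the question about resending packets, a back-of-the-envelope estimate helps set expectations. If we assume each transmission is lost independently with probability p and a lost packet is resent until it arrives (a simplification, since real TCP behavior also involves timeouts and congestion control), the expected number of transmissions per delivered packet is 1 / (1 - p). A short Python check, with `expected_transmissions` as a hypothetical helper name:

```python
def expected_transmissions(loss_percent: float) -> float:
    """Expected sends per delivered packet, assuming independent losses."""
    if not 0 <= loss_percent < 100:
        raise ValueError("loss_percent must be at least 0 and below 100")
    p = loss_percent / 100
    return 1 / (1 - p)


for percent in (1, 20, 50):
    extra = expected_transmissions(percent) - 1
    print(f"{percent}% loss: about {extra:.0%} more packets sent")
```

Under these assumptions, 20% loss means roughly 25% more packets on the wire for the same amount of delivered data, which is worth factoring into capacity planning for peak hours.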
Once you’ve validated (or refuted) your initial hypothesis, try increasing the magnitude of your attack by increasing the percentage of dropped packets, or by adding additional port numbers, IP addresses, or hostnames. Target commonly used system ports like port 22 (SSH), 53 (DNS), 67 and 68 (DHCP), or 80 and 443 (HTTP and HTTPS). Target connections between your services and their critical dependencies like caches, load balancers, and remote file storage servers. If your service communicates with a database, run an attack on your outbound database traffic and see if that affects the integrity or consistency of your data. If running this attack causes incomplete or missing data in your database, you now know that you need to add stronger integrity checks to your database and/or your service.
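What an integrity check looks like depends heavily on your stack, but the basic idea can be sketched in a few lines of Python: the sender attaches a digest of the payload, and the receiver refuses to store anything whose digest doesn't match. The `seal` and `unseal` functions and the message layout here are hypothetical, chosen only for illustration.

```python
import hashlib

DIGEST_SIZE = hashlib.sha256().digest_size  # 32 bytes


def seal(payload: bytes) -> bytes:
    """Prefix the payload with its SHA-256 digest."""
    return hashlib.sha256(payload).digest() + payload


def unseal(message: bytes) -> bytes:
    """Return the payload if its digest matches; raise ValueError otherwise."""
    if len(message) < DIGEST_SIZE:
        raise ValueError("message is too short to contain a digest")
    digest, payload = message[:DIGEST_SIZE], message[DIGEST_SIZE:]
    if hashlib.sha256(payload).digest() != digest:
        raise ValueError("payload failed integrity check")
    return payload


message = seal(b"order=1234;status=paid")
print(unseal(message))  # round-trips when nothing was lost or altered
```

A truncated or altered message raises an error instead of being written, so a partial transfer shows up as a failed request that can be retried rather than as silently incomplete data.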
Once you feel comfortable running Packet Loss attacks on a single host or service, increase the blast radius by targeting more hosts simultaneously. This lets you simulate large-scale network problems, such as Denial of Service (DoS) attacks, API Gateway failures, slow caches or Content Delivery Networks (CDNs), and bandwidth-limited networks. If you want to apply a Packet Loss attack to a specific region or cloud service, remember that you can use the Providers drop-down to select an AWS service provider that you want to impact traffic to.
Now that you’ve run the attack, try using a Scenario. Scenarios allow you to run multiple attacks sequentially, as well as monitor the availability of the target system(s) using Health Checks. Health Checks can periodically contact a monitor that you provide before, during, and after a Scenario. If the monitor reports a failed state, returns an unsuccessful response, or doesn't respond within a set window of time, the Scenario will automatically halt. You can use this to set an upper bound on the impact of an experiment, for example to prevent latency from increasing too much.
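The halting behavior can be pictured with a short Python sketch. This is not Gremlin's implementation or API; `run_with_health_check`, the `monitor` callable, and the step functions are hypothetical stand-ins that only illustrate the control flow: check the monitor before each step and once more at the end, and stop as soon as it fails or takes too long.

```python
import time
from typing import Callable, Sequence, Tuple


def is_healthy(monitor: Callable[[], bool], max_seconds: float) -> bool:
    """Treat an exception, a falsy result, or a slow response as unhealthy."""
    start = time.monotonic()
    try:
        ok = bool(monitor())
    except Exception:
        return False
    # Simplification: the duration is measured after the call returns,
    # whereas a real check would enforce the timeout on the request itself.
    return ok and (time.monotonic() - start) <= max_seconds


def run_with_health_check(steps: Sequence[Callable[[], None]],
                          monitor: Callable[[], bool],
                          max_seconds: float = 2.0) -> Tuple[int, bool]:
    """Run steps in order.

    Returns (steps completed, True if every health check passed).
    """
    completed = 0
    for step in steps:
        if not is_healthy(monitor, max_seconds):
            return completed, False  # halt: remaining steps are skipped
        step()
        completed += 1
    # Final check after the last step has finished.
    return completed, is_healthy(monitor, max_seconds)
```

In this sketch each step might be an attack with a progressively higher loss percentage, so a failing monitor stops the sequence before the larger attacks ever run.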
If you're not sure where to start, try using one of our Recommended Scenarios. Gremlin includes over nine pre-built Packet Loss Scenarios designed by our reliability experts for testing conditions such as cascading network degradation between nodes, testing for split brain conflicts in clustered workloads, and testing connectivity in a hybrid cloud.