Whenever an online service goes down, you're likely to hear three words: "it was DNS!" Blaming DNS might be a running joke among network admins and engineers, but it's one rooted in experience. DNS problems are known for causing massive, Internet-wide outages such as the 2021 Akamai outage that temporarily made the websites for Delta Air Lines, American Express, Airbnb, and others unreachable. Since DNS is a critical component of modern networks, outages can have a huge impact, so teams must design their systems to be capable of withstanding and recovering from DNS problems.
In this blog, we'll take a deep dive into Gremlin's DNS attack. We'll look at how it works, how to use it, and how it can help you build responsive, fault-tolerant applications and systems.
DNS—short for the Domain Name System—is a distributed system used to identify networked resources by name. It's most commonly used to map IP addresses to human-friendly names. For example, DNS is how you can access the Gremlin website by typing gremlin.com into your browser instead of an IP address. This abstraction also lets you map multiple systems or resources to a single DNS name for load balancing requests, proxying and routing requests, and assigning static names to systems with dynamic IP addresses.
The Gremlin DNS attack works by blocking all outgoing DNS traffic over the standard DNS port (port 53). This is effectively the same as running a Blackhole attack on port 53, except that the DNS attack includes a built-in exception for api.gremlin.com. This exception ensures that the Gremlin agent can communicate with the Gremlin Control Plane while the attack is running. Without it, the agent would lose its connection, which would in turn trigger a failsafe and automatically halt the attack. Like the other Gremlin network attacks, this attack uses traffic management tools built into the operating system and doesn't modify firewall or iptables rules.
This attack requires the NET_ADMIN capability when installing the Gremlin agent (enabled by default).
By default, the attack drops all DNS traffic from the target. You can configure the attack to only impact traffic to specific DNS servers (by IP address), network devices, network protocols (TCP, UDP, or ICMP), and service providers (such as Amazon Route 53). The attack supports these parameters:
Length: How long the attack runs for.
IP Addresses: Restricts the attack to specific IP address(es). This field supports CIDR values.
Device: The network interface to impact traffic on. If left blank, Gremlin will automatically determine which device to target.
Protocol: Which network protocol(s) to impact. The options are TCP, UDP, ICMP, or all.
Providers: Which external service provider(s) to impact, if any.
Tags: If one or more tags are selected, the attack will only impact traffic to the targets associated with those tags.
An example of using tags would be if you had dedicated DNS servers. After installing the Gremlin agent on each server, you could add the following to its /etc/gremlin/config.yaml file. This tags each agent with the name service and the value dns, making it easy to target all of your DNS servers at once:
    # /etc/gremlin/config.yaml
    tags:
      service: dns
The above parameters make up what's called the magnitude of the attack. As with all Gremlin attacks, you can run a DNS attack on multiple hosts simultaneously. This is called the blast radius. You can also run a DNS attack on containers, Kubernetes resources, and Services.
When running your first DNS attack, start small by reducing the magnitude and blast radius to a single non-production host and a single DNS server. Keep in mind that a host may have multiple DNS servers configured, and you can target one or more of these servers individually by adding them to the IP Addresses field. If you're not sure which DNS server(s) your target device is using, you can check its network configuration using the following commands.
For Windows, run ipconfig /all. Scroll down to your active network adapter, then look for the line starting with DNS Servers:
    DNS Servers . . . . . . . . . . . : 184.108.40.206 220.127.116.11 192.168.1.1
For Mac and Linux, run cat /etc/resolv.conf:
    nameserver 18.104.22.168
    nameserver 22.214.171.124
    nameserver 192.168.1.1
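If you want to collect this list programmatically (for example, to decide what to paste into the IP Addresses field), here is a minimal Python sketch. It assumes a resolv.conf-style file, and the function name parse_nameservers is our own, not part of any Gremlin tooling.

```python
from typing import List


def parse_nameservers(resolv_conf_text: str) -> List[str]:
    """Return the nameserver addresses found in resolv.conf-style text, in order."""
    servers = []
    for line in resolv_conf_text.splitlines():
        # Drop comments (resolv.conf uses '#' or ';') and surrounding whitespace.
        line = line.split("#", 1)[0].split(";", 1)[0].strip()
        fields = line.split()
        # Only "nameserver <address>" lines are of interest here.
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers
```

On a Linux or macOS host, you could call it with the contents of /etc/resolv.conf, such as parse_nameservers(open("/etc/resolv.conf").read()). The order of the returned list matches the order in which the resolver typically tries the servers.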
Try running a DNS attack against one of these servers by adding its IP address to the IP Addresses field. While the attack is running, try sending a network request from the target to a domain name (such as example.com). If the request is successful, then your system successfully failed over to the secondary DNS server. If not, your secondary might not be configured correctly, or it might be unavailable. In either case, try changing to a different DNS server (such as Cloudflare's 1.1.1.1), restart the attack, then repeat your test to see if that addresses the problem.
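To see which individual DNS servers are still answering during the attack, you can query each one directly instead of going through the system resolver. The sketch below is one way to do that using only Python's standard library. It assumes IPv4 servers and plain UDP on port 53, and the names build_query and server_answers are hypothetical helpers for illustration.

```python
import random
import socket
import struct


def build_query(hostname: str, query_id: int) -> bytes:
    """Build a minimal DNS query packet asking for the A record of hostname."""
    # Header: ID, flags (0x0100 = recursion desired), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: each label prefixed by its length, ended by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE 1 = A record, QCLASS 1 = IN (Internet).
    return header + qname + struct.pack("!HH", 1, 1)


def server_answers(server_ip: str, hostname: str, timeout: float = 2.0) -> bool:
    """Return True if server_ip replies to a UDP DNS query on port 53 in time."""
    query_id = random.randint(0, 0xFFFF)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(hostname, query_id), (server_ip, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:
            # Timeouts and unreachable-network errors both count as "no answer".
            return False
    # A matching transaction ID means the server responded to our query.
    return len(reply) >= 2 and struct.unpack("!H", reply[:2])[0] == query_id
```

Calling server_answers once per configured server should return False for the server you listed in the IP Addresses field and True for the others. The check only confirms that a reply arrived, not that the reply contains a usable record.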
As you run these experiments, remember to record your observations, discuss the outcomes with your team, and track any changes or improvements made to your systems as a result. This way, you can demonstrate the value of the experiments you’ve run to your team and to the rest of the organization.
If "it's always DNS" as the old adage goes, how can running DNS attacks help mitigate DNS-related issues? First, let's consider how DNS can fail (here's a quick introduction to the different types of DNS servers):
- A recursive resolver is down, causing DNS queries to time out or return errors.
- Your DNS provider's nameserver is down, preventing customers from resolving your website's address.
- Network saturation (or worse, a DDoS attack) is slowing down DNS queries or causing them to drop.
- A misconfigured Quality of Service (QoS) rule is causing the network to de-prioritize DNS traffic.
There are different ways of mitigating, avoiding, and recovering from DNS-related issues, such as:
- Configuring your systems with fallback DNS servers.
- Using multiple DNS providers.
- If you're using a cloud architecture, rerouting traffic to a different availability zone, region, or virtual private cloud (VPC).
Running DNS attacks lets you verify that these methods are successful at preventing outages.
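The first mitigation, fallback DNS servers, boils down to "try the next server if this one fails." Here is a minimal sketch of that logic in Python. The function resolve_with_fallback and its query parameter are hypothetical. In practice the operating system's resolver usually handles this for you, so treat this as an illustration of the behavior you're testing rather than something you need to write yourself.

```python
from typing import Callable, Iterable, Optional


def resolve_with_fallback(
    hostname: str,
    servers: Iterable[str],
    query: Callable[[str, str], Optional[str]],
) -> Optional[str]:
    """Try each DNS server in order and return the first answer, or None."""
    for server in servers:
        try:
            answer = query(server, hostname)
        except OSError:
            # Treat timeouts and network errors as "this server is unavailable".
            answer = None
        if answer is not None:
            return answer
    # Every server failed, which is the outage a DNS attack is meant to expose.
    return None
```

Passing the query function in as a parameter keeps the sketch independent of any particular DNS library and makes it easy to simulate a blocked primary server in a unit test.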
Now that you know how DNS attacks work, try running one yourself:
1. Log into your Gremlin account (or sign up for a free trial).
2. Create a new attack and select a host to target. Start with a single host to limit your blast radius.
3. Under Choose a Gremlin, select the Network category, then select DNS.
4. Set the Length of the attack.
5. Optionally, enter the IP Addresses to drop traffic to, the network Device to impact, and the Protocol to impact (TCP, UDP, or ICMP). For convenience, you can also select an external service Provider to target, as well as target specific hosts by Tags.
   - If you leave all of these options at their default values, Gremlin will block all DNS traffic on the target.
6. Click Unleash Gremlin to start the attack.
Measuring and observing the outcome of a DNS attack is pretty straightforward. While the attack is running, try making a network request from the target to another system using its DNS name. Even basic command-line tools like ping will work for this:
    ping -c 3 gremlin.com
If we run this before the attack, we get the following output:
    PING gremlin.com (188.8.131.52) 56(84) bytes of data.
    64 bytes from acd89244c803f7181.awsglobalaccelerator.com (184.108.40.206): icmp_seq=1 ttl=105 time=21.4 ms
    64 bytes from acd89244c803f7181.awsglobalaccelerator.com (220.127.116.11): icmp_seq=2 ttl=105 time=23.2 ms
    64 bytes from acd89244c803f7181.awsglobalaccelerator.com (18.104.22.168): icmp_seq=3 ttl=105 time=23.6 ms

    --- gremlin.com ping statistics ---
    3 packets transmitted, 3 received, 0% packet loss, time 2003ms
    rtt min/avg/max/mdev = 21.407/22.732/23.564/0.947 ms
If we run it during the attack, the command appears to hang for a few moments before displaying this:
    ping: gremlin.com: Temporary failure in name resolution
This confirms that the attack has blocked all DNS traffic, which is what we'd expect. Now, let's test falling back to a secondary DNS server. First, we need to know the IP address of our primary server, which we can find by running cat /etc/resolv.conf or by using the nslookup command. The output's Address fields will contain the IP address of the DNS server:
    Server: 192.168.122.1
    Address: 192.168.122.1#53

    Non-authoritative answer:
    Name: gremlin.com
    Address: 22.214.171.124
Now that we know our DNS server is at 192.168.122.1, let's re-run the attack, only this time we'll put our DNS server's IP address in the IP Addresses field.
Now if we run the attack and run nslookup, we get the "Temporary failure in name resolution" message again. This means our secondary DNS server (assuming we had one configured) did not work. As a result, this simulated failure of our primary DNS server resulted in our target not being able to resolve DNS queries at all, which is a big problem unless all of our outbound network traffic uses IP addresses and not hostnames (which is extremely unlikely).
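If you'd rather record these observations than eyeball them, a small script can capture both whether a lookup succeeded and how long it took (the "hang" we saw with ping). The following is a minimal sketch in Python. The timed_lookup name and the injectable resolver parameter are our own choices, and by default it uses the system resolver through socket.getaddrinfo.

```python
import socket
import time
from typing import Callable, Tuple


def timed_lookup(
    hostname: str,
    resolver: Callable[..., object] = socket.getaddrinfo,
) -> Tuple[bool, float]:
    """Resolve hostname and return (succeeded, seconds_elapsed)."""
    start = time.monotonic()
    try:
        resolver(hostname, None)
        succeeded = True
    except OSError:
        # socket.gaierror (a subclass of OSError) signals a failed name lookup.
        succeeded = False
    return succeeded, time.monotonic() - start
```

Running timed_lookup("gremlin.com") before and during the attack gives you two data points to compare: a quick success beforehand, and during the attack either a failure after a delay or a success via a fallback server.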
Depending on how you've configured your DNS settings, you may want to try different test cases or scenarios. For example, if your systems cache DNS entries locally, only add your external DNS servers to the IP Addresses field. If you've configured your local cache correctly, your systems should still be able to resolve hostnames even while the attack is running.
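To make the caching behavior concrete, here is a minimal sketch of a time-to-live (TTL) cache in front of a lookup function. It is a simplified model, not how any particular caching resolver is implemented, and the CachingResolver class, its fixed ttl_seconds value, and the injectable clock are all assumptions made for illustration.

```python
import time


class CachingResolver:
    """Wrap a lookup function and reuse answers until their TTL expires."""

    def __init__(self, lookup, ttl_seconds=300.0, clock=time.monotonic):
        self._lookup = lookup
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}  # hostname -> (address, expiry time)

    def resolve(self, hostname):
        cached = self._cache.get(hostname)
        if cached is not None and cached[1] > self._clock():
            return cached[0]  # Fresh entry: no upstream DNS traffic needed.
        # Missing or expired entry: this call fails if upstream DNS is blocked.
        address = self._lookup(hostname)
        self._cache[hostname] = (address, self._clock() + self._ttl)
        return address
```

This also shows why the length of your attack matters. If the attack is shorter than the cache's TTL, cached hostnames keep resolving and you may not observe any failure at all. Hostnames that were never cached, or whose entries expire mid-attack, will still fail.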
Once you feel comfortable running DNS attacks on a single host or service, increase the blast radius by selecting more targets. Gremlin also makes it easy to run DNS attacks targeting specific cloud DNS services, like Amazon Route 53. While configuring the attack, use the Providers drop-down to select the Route 53 service and region that you want to impact traffic to.
Now that you’ve run the attack, try using a Scenario. Scenarios allow you to run multiple attacks sequentially, as well as monitor the availability of the target system(s) using Golden Signals. Golden Signals can periodically contact a monitor that you provide before, during, and after a Scenario, and if the monitor returns a failed state or fails to respond successfully or within a window of time, then the Scenario will automatically halt. You can use this to set an upper bound and prevent latency from increasing too much. Gremlin also includes a Recommended Scenario for testing DNS outages in a Kubernetes cluster.
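The halt-on-failure idea can be sketched in a few lines of Python. This is not Gremlin's implementation, and it is simplified to check health only before each step rather than continuously. The run_until_unhealthy function, the steps list, and the is_healthy callback are all hypothetical names.

```python
def run_until_unhealthy(steps, is_healthy):
    """Run (name, action) steps in order and return the names that completed.

    Stops before the next step as soon as is_healthy() returns False.
    """
    completed = []
    for name, action in steps:
        if not is_healthy():
            break  # The monitor reported a failure, so halt the sequence.
        action()
        completed.append(name)
    return completed
```

In a real setup, is_healthy would wrap a call to your monitoring tool, and each action would start an attack with a larger magnitude or blast radius than the one before.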