Whenever an online service goes down, you're likely to hear three words: "it was DNS!" Blaming DNS might be a running joke among network admins and engineers, but it's one rooted in experience. DNS problems are known for causing massive, Internet-wide outages such as the 2021 Akamai outage that temporarily made the websites for Delta Air Lines, American Express, Airbnb, and others unreachable. Since DNS is a critical component of modern networks, outages can have a huge impact, so teams must design their systems to be capable of withstanding and recovering from DNS problems.

In this post, we'll take a deep dive into Gremlin's DNS attack. We'll look at how it works, how to use it, and how it can help you build responsive, fault-tolerant applications and systems.

How does a DNS attack work?

DNS, short for the Domain Name System, is a distributed system used to identify networked resources by name. It's most commonly used to map human-friendly names to IP addresses. For example, DNS is how you can access the Gremlin website by typing gremlin.com into your browser instead of an IP address. This abstraction also lets you map multiple systems or resources to a single DNS name for load balancing requests, proxying and routing requests, and assigning static names to systems with dynamic IP addresses.

The Gremlin DNS attack works by blocking all outgoing DNS traffic over the standard DNS port (port 53). This is effectively the same as running a Blackhole attack on port 53, except that the DNS attack includes a built-in exception for <span class="code-class-custom">api.gremlin.com</span>. This exception lets the Gremlin agent keep communicating with the Gremlin Control Plane while the attack is running. Without it, the agent would lose its connection, which would trigger a failsafe and automatically halt the attack. Like the other Gremlin network attacks, this attack uses traffic management tools built into the operating system and doesn't modify firewall or iptables rules.

Note
This attack requires the NET_ADMIN capability when installing the Gremlin agent (enabled by default).

By default, the attack drops all DNS traffic from the target. You can configure the attack to only impact traffic to specific DNS servers (by IP address), network devices, network protocols (TCP, UDP, or ICMP), and service providers (such as Amazon Route 53). The attack supports these parameters:

  • <span class="code-class-custom">Length</span>: How long the attack runs for.
  • <span class="code-class-custom">IP Addresses</span>: Restricts the attack to specific IP address(es). This field supports CIDR values (e.g. 10.0.0.0/24).
  • <span class="code-class-custom">Device</span>: The network interface to impact traffic on. If left blank, Gremlin will target all network interfaces.
  • <span class="code-class-custom">Protocol</span>: Which network protocol(s) to impact. The options are TCP, UDP, ICMP, or all.
  • <span class="code-class-custom">Providers</span>: Which external service provider(s) to impact, if any.
  • <span class="code-class-custom">Tags</span>: If one or more tags are selected, the attack will only impact traffic to the targets associated with those tags.

An example of using tags would be if you had dedicated DNS servers. After installing the Gremlin agent on each one, you could add the following to its <span class="code-class-custom">/etc/gremlin/config.yaml</span> file. This adds a tag with the name <span class="code-class-custom">service</span> and the value <span class="code-class-custom">dns</span> to that agent. Once every DNS server carries the same tag, you can target all of them at once:

YAML

# /etc/gremlin/config.yaml
tags:
  service: dns

The above parameters make up what's called the magnitude of the attack. As with all Gremlin attacks, you can run a DNS attack on multiple hosts simultaneously. The number of targets an attack affects is called its blast radius. You can also run a DNS attack on containers, Kubernetes resources, and Services.

When running your first DNS attack, start small by reducing the magnitude and blast radius to a single non-production host and a single DNS server. Keep in mind that a host may have multiple DNS servers configured, and you can target one or more of these servers individually by adding them to the IP Addresses field. If you're not sure which DNS server(s) your target device is using, you can check its network configuration using the following commands.

For Windows:

POWERSHELL

ipconfig /all

Scroll down to your active network adapter, then look for the line starting with <span class="code-class-custom">DNS Servers</span>:

POWERSHELL

DNS Servers . . . . . . . . . . . : 8.8.8.8
                                    8.8.4.4
                                    192.168.1.1

For Mac and Linux:

BASH

cat /etc/resolv.conf

SHELL

nameserver 8.8.8.8
nameserver 8.8.4.4
nameserver 192.168.1.1
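If you'd rather collect these addresses with a script (for example, to decide what goes in the IP Addresses field), a few lines of Python are enough. This is a minimal sketch that assumes the standard resolv.conf layout shown above; the function name `parse_nameservers` is made up for illustration.

```python
def parse_nameservers(resolv_conf_text):
    """Return the nameserver addresses found in resolv.conf-style text, in order."""
    servers = []
    for line in resolv_conf_text.splitlines():
        stripped = line.strip()
        # Skip blank lines and comment lines, which start with '#' or ';'.
        if not stripped or stripped.startswith(("#", ";")):
            continue
        fields = stripped.split()
        # Only "nameserver <address>" lines matter here; ignore search, options, etc.
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


# Sample input mirroring the output shown above.
sample = """\
# example resolv.conf contents
nameserver 8.8.8.8
nameserver 8.8.4.4
nameserver 192.168.1.1
"""

print(parse_nameservers(sample))  # ['8.8.8.8', '8.8.4.4', '192.168.1.1']
```

To read the real file, pass in `open("/etc/resolv.conf").read()`. Note that on hosts using a local stub resolver, the file may list only a loopback address rather than the upstream servers.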

Try running a DNS attack against one of these servers by adding its IP address to the IP Addresses field. While the attack is running, send a network request from the target to a domain name (such as example.com). If the request succeeds, your system most likely failed over to a secondary DNS server (or answered from a local cache). If it fails, your secondary might not be configured correctly, or it might be unavailable. In either case, try changing to a different DNS server (such as Cloudflare's <span class="code-class-custom">1.1.1.1</span>), restart the attack, then repeat your test to see if that addresses the problem.
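To see which individual servers are reachable from the target during the attack, you can send a query straight to each one instead of going through the system resolver. The sketch below is one way to do that with only the Python standard library. It builds a bare-bones DNS query for an A record by hand, sends it over UDP to port 53, and reports whether any reply with a matching ID came back. The function names (`build_query`, `server_answers`) are hypothetical, and the code doesn't parse the answer or retry over TCP.

```python
import random
import socket
import struct


def build_query(hostname, query_id):
    """Build a minimal DNS query packet asking for the A record of hostname."""
    # Header: ID, flags (recursion desired), 1 question, 0 answer/authority/additional.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: each label prefixed by its length, ended by a zero byte.
    labels = hostname.rstrip(".").split(".")
    qname = b"".join(bytes([len(label)]) + label.encode("ascii") for label in labels)
    # QTYPE 1 = A record, QCLASS 1 = IN (Internet).
    return header + qname + b"\x00" + struct.pack("!HH", 1, 1)


def server_answers(server_ip, hostname="example.com", timeout=2.0):
    """Return True if server_ip replies to a DNS query within timeout seconds."""
    query_id = random.randint(0, 0xFFFF)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(hostname, query_id), (server_ip, 53))
            data, _ = sock.recvfrom(512)
        except OSError:
            # Timeouts and network errors both mean "no usable reply".
            return False
    # A genuine reply echoes the query ID in its first two bytes.
    return data[:2] == struct.pack("!H", query_id)
```

Calling `server_answers("8.8.8.8")` and `server_answers("8.8.4.4")` from the target while the attack is running should return `False` for the server you blocked and `True` for one you left alone, assuming both were reachable beforehand.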

As you run these experiments, remember to record your observations, discuss the outcomes with your team, and track any changes or improvements made to your systems as a result. This way, you can demonstrate the value of the experiments you’ve run to your team and to the rest of the organization.

Why should you run DNS attacks?

If "it's always DNS" as the old adage goes, how can running DNS attacks help mitigate DNS-related issues? First, let's consider how DNS can fail:

  • A recursive resolver is down, causing DNS queries to time out or return errors.
  • Your DNS provider's nameserver is down, preventing customers from resolving your website's address.
  • Network saturation (or worse, a DDoS attack) is slowing down DNS queries or causing them to drop.
  • A misconfigured Quality of Service (QoS) rule is causing the network to de-prioritize DNS traffic.

There are different ways of mitigating, avoiding, and recovering from DNS-related issues, such as:

  • Configuring your systems with fallback DNS servers.
  • Using multiple DNS providers.
  • If you're using a cloud architecture, rerouting traffic to a different availability zone, region, or virtual private cloud (VPC).

Running DNS attacks lets you verify that these methods are successful at preventing outages.
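The first mitigation, fallback DNS servers, comes down to a try-in-order pattern. Operating system resolvers generally handle this for you when more than one nameserver is configured, so the sketch below is only meant to illustrate the behavior a DNS attack lets you verify. The function name `resolve_with_fallback` and the idea of passing resolvers as plain callables are assumptions made for the example.

```python
def resolve_with_fallback(hostname, resolvers):
    """Try each resolver callable in order and return the first successful answer."""
    errors = []
    for resolve in resolvers:
        try:
            return resolve(hostname)
        except OSError as exc:
            # socket.gaierror and socket timeouts are both OSError subclasses.
            errors.append(exc)
    raise OSError(f"all {len(resolvers)} resolvers failed for {hostname}: {errors}")


# Stand-in resolvers: the primary simulates an outage, the secondary answers.
def primary_down(hostname):
    raise OSError("primary DNS server unreachable")


def secondary_up(hostname):
    return "75.2.60.5"


print(resolve_with_fallback("gremlin.com", [primary_down, secondary_up]))  # 75.2.60.5
```

If every resolver in the list fails, the function raises an error that includes each failure, which is the outcome to watch for when an experiment blocks more servers than you have fallbacks.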

Get started with DNS attacks

Now that you know how DNS attacks work, try running one yourself:

  1. Log into your Gremlin account (or sign up for a free trial).
  2. Create a new attack and select a host to target. Start with a single host to limit your blast radius.
  3. Under Choose a Gremlin, select the Network category, then select DNS.
  4. Set the Length of the attack.
  5. Optionally, enter the IP Addresses to drop traffic to, the network Device to impact, and the Protocol to impact (TCP or UDP). For convenience, you can select an external service Provider to target, as well as target specific hosts by Tags. If you leave all of these options at their default values, Gremlin will block all DNS traffic on the target.
  6. Click Unleash Gremlin to start the attack.

Measuring and observing the outcome of a DNS attack is pretty straightforward. While the attack is running, try making a network request from the target to another system using its DNS name. Even basic command-line tools like <span class="code-class-custom">ping</span> or <span class="code-class-custom">curl</span> will work for this:

BASH

ping -c 3 gremlin.com

If we run this before the attack, we get the following output:

SHELL

PING gremlin.com (75.2.60.5) 56(84) bytes of data.
64 bytes from acd89244c803f7181.awsglobalaccelerator.com (75.2.60.5): icmp_seq=1 ttl=105 time=21.4 ms
64 bytes from acd89244c803f7181.awsglobalaccelerator.com (75.2.60.5): icmp_seq=2 ttl=105 time=23.2 ms
64 bytes from acd89244c803f7181.awsglobalaccelerator.com (75.2.60.5): icmp_seq=3 ttl=105 time=23.6 ms

--- gremlin.com ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
rtt min/avg/max/mdev = 21.407/22.732/23.564/0.947 ms

If we run it during the attack, the command appears to hang for a few moments before displaying this:

SHELL

ping: gremlin.com: Temporary failure in name resolution

This confirms that the attack has blocked all DNS traffic, which is what we'd expect. Now, let's test falling back to a secondary DNS server. First, we need to know the IP address of our primary server, which we can find by running <span class="code-class-custom">cat /etc/resolv.conf</span> or by using the <span class="code-class-custom">nslookup</span> command. The <span class="code-class-custom">Server</span> and <span class="code-class-custom">Address</span> fields in the output contain the IP address of the DNS server:

BASH

nslookup gremlin.com

SHELL

Server:     192.168.122.1
Address:    192.168.122.1#53

Non-authoritative answer:
Name:   gremlin.com
Address: 75.2.60.5

Now that we know our DNS server is at <span class="code-class-custom">192.168.122.1</span>, let's re-run the attack, only this time we'll put our DNS server's IP address in the IP Addresses field:

Entering a specific DNS server IP address in the Gremlin web app

Now if we run the attack and run <span class="code-class-custom">nslookup</span>, name resolution fails again. This means our secondary DNS server (assuming we had one configured) did not take over. As a result, the simulated failure of our primary DNS server left our target unable to resolve DNS queries at all, which is a big problem unless all of our outbound network traffic uses IP addresses instead of hostnames (which is extremely unlikely).
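When resolution fails like this, it's also worth recording how long the failure takes, since that is how long anything waiting on the lookup will be blocked. Here's a small sketch that times a lookup through the system resolver using Python's `socket.getaddrinfo`. The helper name `time_resolution` is hypothetical, and the `resolver` parameter exists only so the function can be exercised without touching the network.

```python
import socket
import time


def time_resolution(hostname, resolver=socket.getaddrinfo):
    """Resolve hostname and return (succeeded, seconds_elapsed)."""
    start = time.monotonic()
    try:
        resolver(hostname, None)
        succeeded = True
    except socket.gaierror:
        # Raised by getaddrinfo when the name cannot be resolved.
        succeeded = False
    return succeeded, time.monotonic() - start
```

Running `time_resolution("gremlin.com")` on the target before and during the attack gives you two data points to compare: whether the lookup succeeded, and how long the caller had to wait either way.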

Note
If you manage your own DNS server (e.g. using BIND), an alternative to this experiment would be to run a DNS or Blackhole attack directly on the DNS server. This would impact all hosts that rely on that server, and more accurately simulate a DNS provider outage.

Depending on how you've configured your DNS settings, you may want to try different test cases or scenarios. For example, if your systems cache DNS entries locally, only add your external DNS servers to the IP Addresses field. If you've configured your local cache correctly, your systems should still be able to resolve hostnames even while the attack is running.
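To make the caching case concrete, here is a rough sketch of one way a cache can ride out an upstream outage: it serves fresh entries without asking upstream and, if the upstream lookup fails, falls back to the last known answer. This is an illustration of the idea only, not how any particular caching resolver works. The class name `StaleOnErrorCache`, the 60-second default TTL, and the choice to serve expired entries on error are all assumptions.

```python
import time


class StaleOnErrorCache:
    """Cache lookups for ttl_seconds and serve the old answer if upstream fails."""

    def __init__(self, resolve, ttl_seconds=60.0, clock=time.monotonic):
        self._resolve = resolve      # callable: hostname -> address, raises OSError on failure
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so expiry can be tested without sleeping
        self._entries = {}           # hostname -> (address, time stored)

    def lookup(self, hostname):
        now = self._clock()
        cached = self._entries.get(hostname)
        if cached is not None and now - cached[1] < self._ttl:
            return cached[0]         # still fresh: no upstream query needed
        try:
            address = self._resolve(hostname)
        except OSError:
            if cached is not None:
                return cached[0]     # upstream is down: serve the expired entry
            raise                    # nothing cached, so the failure surfaces
        self._entries[hostname] = (address, now)
        return address
```

With a cache like this in front of your external DNS servers, names that were looked up before the attack keep resolving, while names that were never cached fail. That's the distinction to look for when you run this experiment.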

Once you feel comfortable running DNS attacks on a single host or service, increase the blast radius by selecting more targets. Gremlin also makes it easy to run DNS attacks targeting specific cloud DNS services, like Amazon Route 53. While configuring the attack, use the Providers drop-down to select the Route 53 service and region that you want to impact traffic to:

Selecting Amazon Route 53 using the Providers drop-down

Now that you've run the attack, try using a Scenario. Scenarios let you run multiple attacks sequentially and monitor the availability of the target system(s) using Health Checks. A Health Check periodically contacts a monitor that you provide before, during, and after a Scenario. If the monitor returns a failed state, or fails to respond successfully within a set window of time, the Scenario automatically halts. You can use this to set an upper bound on impact, for example to halt the Scenario if response times climb too high. Gremlin also includes a Recommended Scenario for testing DNS outages in a Kubernetes cluster, summarized below.

☸️ ✅ Kubernetes - Availability - DNS outage

This is an availability scenario for Kubernetes. This scenario will cause a DNS outage. We expect that the application will still be able to serve user traffic and operate as expected due to DNS failover. If DNS failover is not set up correctly, we expect an outage to occur.

Length: 1 step
Attack Type: DNS

Andre Newman
Sr. Reliability Specialist