DNS

Most "intermittent network problems" eventually trace to DNS. The protocol is older than most engineers using it, and it's hard to debug because the answer can come from a different layer of caching than the one you suspect. Once you have the mental model — a tree of zones, each delegated, each cached at every layer between client and authoritative server — the failures stop being mysterious.

Zones, delegation, and the tree

DNS is a tree of zones. The root is "."; below it are the top-level domains (com., org., uk.); below those are second-level domains (example.com.); below those are subdomains. Each zone is owned by one or more authoritative nameservers — the source of truth for that part of the tree.

A parent zone delegates a child zone by publishing NS records that point to the child's authoritative nameservers. To look up www.example.com:

1. Start at the root. Ask one of the 13 root nameservers for com.'s NS records.
2. Ask one of com.'s nameservers for example.com.'s NS records.
3. Ask one of example.com.'s nameservers for the A record for www.example.com.

In practice, your client doesn't do this walk. It asks a recursive resolver (typically your ISP's, or 8.8.8.8, or 1.1.1.1) which does the walk on your behalf and caches the result for everyone in its catchment.
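The walk above can be sketched with a toy in-memory tree. This is only an illustration of following referrals: the server labels, the naive suffix match, and the address (from a documentation range) are all assumptions, and no real DNS traffic is involved.

```python
# Toy delegation tree: each "server" holds either referrals or final answers.
# All names and the address below are illustrative, not real DNS data.
SERVERS = {
    ".": {"referrals": {"com.": "com-servers"}},
    "com-servers": {"referrals": {"example.com.": "example-servers"}},
    "example-servers": {"answers": {"www.example.com.": "192.0.2.10"}},
}


def resolve(name, start="."):
    """Follow referrals from the root until a server holds the answer."""
    server, hops = start, []
    while True:
        hops.append(server)
        data = SERVERS[server]
        if name in data.get("answers", {}):
            return data["answers"][name], hops
        # Naive delegation match: take the referral whose zone is a suffix of the name.
        for zone, next_server in data.get("referrals", {}).items():
            if name.endswith(zone):
                server = next_server
                break
        else:
            raise LookupError(f"no answer or referral for {name} at {server}")


address, hops = resolve("www.example.com.")
print(address, hops)
```

The list of hops is the part worth looking at: one entry per level of the tree, which is exactly the work a recursive resolver takes off the client's hands.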

The reason the system is built as a tree, rather than one giant table, is that no single organisation could own or update the whole name space, and no single server could answer the whole internet's queries. Delegation hands each branch to whoever runs it. The owner of example.com edits their own zone without asking anyone, and the root and the TLD operators only have to know who to point at, not what the answers are. That split is also why the load spreads: the root and TLD servers see only the top of each name, and the bulk of the per-record traffic lands on the authoritative servers for each individual zone, plus the recursive resolvers' caches sitting in front of them.

The recursive walk. Each level hands back a referral ("ask these servers instead") until the authoritative server returns the record itself.

Recursion vs iteration

The two words sound alike but name two different behaviours, and mixing them up causes a lot of confused debugging. Your stub resolver sends a recursive query: it sets a flag that means "do the whole job and hand me back the final answer." The recursive resolver honours that by doing iterative queries down the tree: it asks the root, gets a referral, asks the TLD, gets another referral, asks the authoritative, and only then has an answer to give back. Authoritative servers, by design, refuse recursion. Ask one for a name it doesn't serve and it either says "not my zone" or hands you a referral; it will not go fetch the answer for you.
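The flag in question is a single bit in the DNS header. As a rough sketch (the two sample headers are hand-built for illustration, and the ID 0x1234 is arbitrary), decoding the flags field of RFC 1035 shows how a stub's query differs from an authoritative server's reply:

```python
import struct

# Bit masks for the 16-bit flags field of a DNS header (RFC 1035, section 4.1.1).
QR, AA, TC, RD, RA = 0x8000, 0x0400, 0x0200, 0x0100, 0x0080


def header_flags(message: bytes) -> dict:
    """Decode the flag bits from the first 12 bytes of a DNS message."""
    _id, flags, *_counts = struct.unpack("!6H", message[:12])
    return {
        "response": bool(flags & QR),
        "authoritative": bool(flags & AA),
        "truncated": bool(flags & TC),
        "recursion_desired": bool(flags & RD),
        "recursion_available": bool(flags & RA),
        "rcode": flags & 0x000F,
    }


# A stub's query: RD set, one question.
stub_query = struct.pack("!6H", 0x1234, RD, 1, 0, 0, 0)
# An authoritative server's reply: QR and AA set, RD echoed back, RA clear.
auth_reply = struct.pack("!6H", 0x1234, QR | AA | RD, 1, 1, 0, 0)

print(header_flags(stub_query))
print(header_flags(auth_reply))
```

A reply with `recursion_available` clear is the server telling you it will not walk the tree for you, whatever the query asked for.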

This is why the resolver in the middle does all the work and holds all the cache. It is the only party that talks to every level of the tree, so it is the natural place to remember what each level said. The stub on your laptop keeps almost nothing; the authoritative servers keep nothing about other zones. The resolver is the hinge, which is also why picking a good one — close, fast, well-cached — changes the feel of the whole network for every device behind it.

One subtlety: the resolver does not start at the root for every name. It caches referrals too, so once it has learned the .com nameservers it skips the root for the next .com lookup, and once it knows example.com's nameservers it skips the TLD as well. The full three-hop walk in the diagram above is the cold-cache case. In steady state most lookups answer from cache at the first hop, and only new names cost the trip down the tree. You can watch the difference directly with dig +trace against a cold resolver versus a warm one.
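One way to picture the referral cache is as a lookup for the deepest zone the resolver already knows. The function below is a hypothetical sketch of that idea only; real resolvers also track TTLs and server addresses per zone.

```python
def closest_known_zone(name, cached_zones):
    """Return the deepest cached zone that encloses `name`, falling back to the root."""
    labels = name.rstrip(".").split(".")
    # Try the name itself, then strip one label at a time from the left.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:]) + "."
        if candidate in cached_zones:
            return candidate
    return "."


cold = set()
warm = {"com.", "example.com."}

print(closest_known_zone("www.example.com.", cold))  # nothing cached: start at the root
print(closest_known_zone("www.example.com.", warm))  # start at example.com.'s servers
print(closest_known_zone("www.example.org.", warm))  # org. is unknown: back to the root
```

The warm resolver skips two hops for the first name and none for the third, which is why a new TLD or a new domain is noticeably slower than a new hostname under a domain already seen.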

The record types you'll actually use

Type	Maps	Notes
A	name → IPv4 address	The basic forward lookup.
AAAA	name → IPv6 address	Same role, IPv6.
CNAME	name → another name	Aliasing. Cannot coexist with other records at the same name.
MX	name → mail server priority + name	Used by SMTP; rarely touches non-mail engineers.
TXT	name → arbitrary text	SPF, DKIM, domain verification, ACME challenges. The catch-all.
SRV	name → priority + weight + port + target	Service location. Used by Kubernetes, SIP, XMPP. Largely supplanted by DNS-based service discovery layers.
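CAA	name → which CAs may issue certs	Compliant CAs must check it before issuing, so it blocks mis-issuance by any CA you have not listed.
NS	name → nameserver name	Delegation. Published in the parent zone (the delegation) and again at the child's apex (the authoritative copy).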
SOA	per-zone metadata	Refresh, retry, expire times for secondary nameservers.

TTLs and caching

Every record carries a TTL — the number of seconds a recursor (and its downstream caches) may serve the answer without re-asking. TTLs are advisory: nothing forces a cache to honour them, and many cache too long.

The practical implications:

The TTL you publish bounds your fastest possible cutover. A 24h TTL on an A record means clients can keep using the old IP for up to 24h after you change it.
Lower TTLs cost more queries. A 60-second TTL means every recursor that serves the name re-asks the authoritative servers as often as once a minute. For high-traffic zones, this matters.
Lower the TTL before a planned change, not at the change. If your normal TTL is 24h, drop it to 5 minutes a day before; once the change is done and stable, raise it again.
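The timing in the last point is simple arithmetic, sketched here under the assumption that caches honour TTLs. The function name and the example date are made up for illustration.

```python
from datetime import datetime, timedelta


def plan_cutover(change_at, normal_ttl, low_ttl):
    """Return the key timestamps for a TTL-lowering cutover (TTLs in seconds)."""
    return {
        # Caches that fetched the record just before the TTL was lowered can
        # hold it for the full normal TTL, so lower it at least that far ahead.
        "lower_ttl_by": change_at - timedelta(seconds=normal_ttl),
        "change_at": change_at,
        # After the change, well-behaved caches drop the old answer within low_ttl.
        "old_answer_gone_by": change_at + timedelta(seconds=low_ttl),
    }


plan = plan_cutover(datetime(2026, 3, 10, 9, 0), normal_ttl=86_400, low_ttl=300)
for step, when in plan.items():
    print(f"{step:>20}: {when:%Y-%m-%d %H:%M}")
```

With a 24h normal TTL and a 5-minute low TTL, the TTL has to be lowered a full day ahead, and the old answer should be gone from well-behaved caches five minutes after the change.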

Negative answers are cached too. NXDOMAIN responses are cached for the lesser of the SOA record's own TTL and the SOA's minimum field (RFC 2308). Misconfigure that, and a brief "this name doesn't exist" gets remembered for hours.
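RFC 2308 takes the negative-cache lifetime from two SOA values, using whichever is smaller. A one-line sketch, with made-up example numbers:

```python
def negative_cache_ttl(soa_record_ttl: int, soa_minimum: int) -> int:
    """Seconds a resolver may cache NXDOMAIN under RFC 2308: the lesser of the two."""
    return min(soa_record_ttl, soa_minimum)


# A zone with a 1h SOA TTL but a 1-day minimum field: the SOA TTL wins.
print(negative_cache_ttl(soa_record_ttl=3600, soa_minimum=86_400))
# Both set to a day: a name that was briefly missing stays "missing" for up to 24h.
print(negative_cache_ttl(soa_record_ttl=86_400, soa_minimum=86_400))
```

Resolvers may also apply their own upper bound on negative TTLs, so treat the result as what the zone asks for rather than a guarantee.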

There is no central record of who has cached what. A query passes through several caches on its way to an answer, and each one keeps its own copy with its own remaining TTL. When you change a record, the change is instant at the authoritative server, but every cache that already holds the old answer keeps serving it until its own copy expires. That is the whole of "DNS propagation": there is nothing propagating outward, only old copies timing out at different moments. Two people on the same street can see different answers for an hour because their resolvers cached the old record at different times.
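A small simulation makes the "old copies timing out" point concrete. Everything here is illustrative: the clock is a plain number of seconds passed in by hand, the addresses come from documentation ranges, and the cache is far simpler than a real resolver's.

```python
class TtlCache:
    """A resolver-style cache: each entry expires on its own clock."""

    def __init__(self):
        self._entries = {}  # name -> (value, expires_at)

    def get(self, name, now):
        entry = self._entries.get(name)
        if entry and now < entry[1]:
            return entry[0]
        return None

    def put(self, name, value, ttl, now):
        self._entries[name] = (value, now + ttl)


def lookup(cache, name, authoritative, ttl, now):
    """Serve from cache if still fresh; otherwise ask the authoritative source."""
    cached = cache.get(name, now)
    if cached is not None:
        return cached
    value = authoritative[name]
    cache.put(name, value, ttl, now)
    return value


zone = {"www.example.com.": "192.0.2.10"}
resolver_a, resolver_b = TtlCache(), TtlCache()

lookup(resolver_a, "www.example.com.", zone, ttl=3600, now=0)  # A caches the old IP at t=0
zone["www.example.com."] = "198.51.100.20"                     # the record changes
seen_by_b = lookup(resolver_b, "www.example.com.", zone, ttl=3600, now=200)
seen_by_a = lookup(resolver_a, "www.example.com.", zone, ttl=3600, now=200)
print(seen_by_a, seen_by_b)  # A still serves the old IP; B already has the new one
```

Nothing was pushed to either resolver. B sees the new address only because it had no copy to begin with, and A catches up when its own entry expires at t=3600.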

The cache layers a query crosses. A hit anywhere on the left short-circuits the rest; the leftmost layer that still holds the old answer is usually the one you forgot to flush.

Recursive vs authoritative

Two roles. Different software, different tuning.

Role	Job	Examples
Recursive	Walks the tree on behalf of clients; caches answers; serves the next request from cache	Unbound, BIND in recursor mode, 8.8.8.8, 1.1.1.1; dnsmasq and systemd-resolved are usually run as caching forwarders in front of one
Authoritative	Holds the truth for one or more zones; answers when asked but never queries upstream	NSD, BIND in primary/secondary mode, PowerDNS, Cloudflare, Route 53, NS1

Public recursors (1.1.1.1, 8.8.8.8) are operationally interesting for two reasons: they're heavily cached so they answer most queries immediately, and they support newer transports (DoH, DoT) that your ISP's resolver may not.

DNSSEC, briefly

DNS as designed has no authentication. An attacker on the path, whether between you and the recursor or between the recursor and the authoritative server, can lie about what the authoritative server said. DNSSEC adds signatures to records so the recursor (or, in theory, the client) can verify the answer.

Each zone signs its records with a Zone-Signing Key. The Zone-Signing Key is signed with a Key-Signing Key. The Key-Signing Key's hash is published in the parent zone as a DS record. Walk the chain from the root and any tampering shows up.
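The link between a child's Key-Signing Key and the parent's DS record comes down to two small computations: a 16-bit key tag (RFC 4034, Appendix B) and a digest over the owner name plus the DNSKEY data. The sketch below uses a made-up two-byte "key" so the numbers stay readable; it verifies no signatures and is not a validator.

```python
import hashlib
import struct


def dnskey_rdata(flags: int, algorithm: int, public_key: bytes) -> bytes:
    """DNSKEY RDATA: flags, protocol (always 3), algorithm, then the key bytes."""
    return struct.pack("!HBB", flags, 3, algorithm) + public_key


def key_tag(rdata: bytes) -> int:
    """Key tag checksum from RFC 4034, Appendix B."""
    total = 0
    for i, byte in enumerate(rdata):
        total += (byte << 8) if i % 2 == 0 else byte
    total += (total >> 16) & 0xFFFF
    return total & 0xFFFF


def wire_name(name: str) -> bytes:
    """Lower-cased, length-prefixed labels ending in the zero-length root label."""
    labels = [label for label in name.lower().rstrip(".").split(".") if label]
    return b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"


def ds_digest_sha256(owner: str, rdata: bytes) -> bytes:
    """DS digest type 2: SHA-256 over the owner name followed by the DNSKEY RDATA."""
    return hashlib.sha256(wire_name(owner) + rdata).digest()


# Flags 257 marks a key-signing key; algorithm 8 is RSA/SHA-256. The key bytes are fake.
ksk = dnskey_rdata(flags=257, algorithm=8, public_key=b"\x01\x02")
print(key_tag(ksk), ds_digest_sha256("example.com.", ksk).hex())
```

A validating resolver does the same digest over the DNSKEY it fetched from the child and compares it with the DS it got from the parent; a mismatch breaks the chain.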

Adoption is partial. Most TLDs are signed; most production zones are not. Where it's deployed, it works; where it isn't, it doesn't help. For most clients, "authenticated DNS" is more likely to arrive through DoH/DoT plus careful transport security than through full DNSSEC validation in every stub resolver.

DoH and DoT

DNS over HTTPS (RFC 8484) and DNS over TLS (RFC 7858) carry DNS queries inside an encrypted tunnel between the client and the recursor. Two motivations: privacy from the network (your ISP can't passively read your queries), and authentication of the recursor (the encrypted tunnel terminates at a known endpoint, not at whatever intercepts UDP port 53).

Practically:

DoT (TCP/853). Cleaner, lighter, easier to operate as a sysadmin. Used by Android by default if the network supports it.
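DoH (HTTPS/443). Indistinguishable from any other HTTPS traffic; harder to block. On by default in Firefox in some regions (the US first); macOS and iOS support it through configuration profiles and apps rather than by default.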
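On the wire, a DoH GET request is an ordinary DNS message, base64url-encoded into a `dns` query parameter. A minimal sketch of building that URL follows; the endpoint is a placeholder, and the request is only constructed, not sent.

```python
import base64
import struct


def build_query(name: str, qtype: int = 1) -> bytes:
    """Encode a one-question DNS query. ID 0 follows RFC 8484's cache-friendly advice."""
    header = struct.pack("!6H", 0, 0x0100, 1, 0, 0, 0)  # RD set, one question
    labels = name.rstrip(".").split(".")
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)  # qtype, then class IN


def doh_get_url(endpoint: str, name: str) -> str:
    """RFC 8484 GET form: the message is base64url-encoded with padding removed."""
    encoded = base64.urlsafe_b64encode(build_query(name)).rstrip(b"=").decode("ascii")
    return f"{endpoint}?dns={encoded}"


# The endpoint is a placeholder; substitute your resolver's documented DoH URL.
print(doh_get_url("https://dns.example/dns-query", "www.example.com"))
```

For `www.example.com` the parameter comes out as `AAABAAABAAAAAAAAA3d3dwdleGFtcGxlA2NvbQAAAQAB`. A DoT client would send the same 33-byte message over TLS on port 853, prefixed with a two-byte length, instead of wrapping it in HTTP.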

Both move the privacy trade-off: you're now trusting the resolver operator (Cloudflare, Google, Quad9) instead of the path. That's a deliberate choice, and whether it's better depends on which side you trust more.

Anycast and how 1.1.1.1 works

"1.1.1.1" isn't a server — it's an IP address that hundreds of servers around the world advertise into BGP. Routing delivers your packet to whichever instance is closest to you. This is anycast, and it's how every major public DNS service operates.

Anycast gives you two free things: low latency (closest instance), and resilience (any instance can drop without breaking the IP — routing reconverges to the next closest). It's how Cloudflare's edge, Google's frontends, and most CDN load balancers present a single IP that's actually a swarm of servers.

Anycast is also what lets a handful of advertised addresses carry the whole internet's root traffic. There are 13 root server identities, named a through m, but each one is anycast: the single address for each is announced from hundreds of physical sites worldwide. So "13 root servers" really means well over a thousand machines, and a query for a TLD's nameservers lands at whichever root instance is nearest. The same trick keeps a public resolver fast everywhere at once. Without anycast, DNS at the top of the tree could not be both close to every user and resistant to a flood aimed at any one location, which is exactly the combination the root and the big resolvers need.

EDNS Client Subnet — the geo wrinkle

A CDN that wants to send US visitors to a US edge and EU visitors to an EU edge needs to know where the visitor is. Without help, all it sees is the recursor's IP — and that recursor (8.8.8.8, 1.1.1.1) might be on a different continent than the user.

ECS (RFC 7871) lets the recursor forward a truncated version of the client's IP to the authoritative server. The authoritative server can then return a region-specific answer. Most authoritative DNS providers used for CDN bootstrapping support it; some recursors strip it for privacy.
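"Truncated" has a precise meaning here: the recursor sends only the leading bits of the client address. A sketch of encoding the option as RFC 7871 lays it out, with a documentation-range address standing in for the client and /24 as an assumed prefix length:

```python
import ipaddress
import struct


def ecs_option(client_ip: str, prefix_len: int) -> bytes:
    """Encode an EDNS Client Subnet option (code 8) carrying a truncated client address."""
    network = ipaddress.ip_network(f"{client_ip}/{prefix_len}", strict=False)
    family = 1 if network.version == 4 else 2
    # Only the bytes covered by the prefix are sent; host bits are already zeroed.
    address = network.network_address.packed[: (prefix_len + 7) // 8]
    body = struct.pack("!HBB", family, prefix_len, 0) + address  # scope is 0 in queries
    return struct.pack("!HH", 8, len(body)) + body


print(ecs_option("203.0.113.57", 24).hex())
```

The final octet of the client address (57) never leaves the recursor: the authoritative server learns the /24, which is enough for regional steering without identifying a single host.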

A practical consequence: if your CDN says "the user got the wrong region", check whether their resolver propagates ECS, and whether their authoritative supports it. This is one of the most common reasons CDN steering fails.

Tools

Tool	Use for
`dig +trace example.com`	Walk the tree from the root. The first thing to run when DNS is misbehaving.
`dig @8.8.8.8 example.com`	Ask a specific recursor; bypass your local one to compare answers.
`dig @ns1.example.com example.com`	Ask the authoritative directly; bypass all caching.
`dig +short MX example.com`	Just the answer, for scripts.
`dnsperf` / `resperf`	Load-test a recursor or authoritative server.
dnsviz.net	Web tool that visualises a zone's DNSSEC chain. The right way to debug DNSSEC.
`tcpdump -i any -n port 53`	Watch DNS go by. Useful when you can't trust application logs.

A query, packet by packet

What happens when your laptop looks up example.com:

1. Stub resolver call. The application calls getaddrinfo(); glibc reads /etc/resolv.conf for the configured DNS servers (typically your router or 1.1.1.1 / 8.8.8.8), constructs a query, sends it over UDP port 53.

2. Recursive resolver receives. Your ISP's resolver (or 1.1.1.1) checks its cache. Cache miss → starts iterative resolution.

3. Root query. Resolver asks one of the 13 root server clusters (a-m.root-servers.net) for example.com. Root returns NS records for .com servers plus glue records (their IPs).

4. TLD query. Resolver asks one of the .com servers (a-m.gtld-servers.net) for example.com. .com returns NS records for example.com's authoritative servers.

5. Authoritative query. Resolver asks one of those authoritative servers for the A record of example.com. Authoritative returns the IP.

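6. Cache and return. Resolver caches the answer and the referrals it learned along the way (each with its own TTL), then returns the answer to your laptop. glibc's stub resolver does not cache on its own; any local caching comes from a daemon such as systemd-resolved or nscd, or from the application (browsers keep their own short-lived cache). Application gets the IP.

Steps 3-5 are the "iterative" part: the resolver does the work; the authoritative servers do not chain. From a cold start, this typically takes 50-200ms. From a warm resolver cache, a few milliseconds. From a local cache, nearly zero.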

dig +trace example.com    # walks the chain explicitly, printing each referral
# abridged shape of the result, not literal dig output:
# . → root → .com → ns.example.com → 93.184.216.34

dig @1.1.1.1 example.com  # query a specific resolver; run it twice and compare "Query time" to see the cache effect
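Step 1 depends on what /etc/resolv.conf says, so when behaviour differs between machines it is worth reading that file the way the stub does. The parser below is a simplified sketch: it handles only the `nameserver`, `search`, and `options` keywords and works on sample text rather than the real file.

```python
def parse_resolv_conf(text: str) -> dict:
    """Extract nameservers, search domains, and options from resolv.conf text."""
    config = {"nameservers": [], "search": [], "options": []}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].split(";", 1)[0].strip()  # drop comments
        if not line:
            continue
        keyword, *values = line.split()
        if keyword == "nameserver" and values:
            config["nameservers"].append(values[0])
        elif keyword == "search":
            config["search"] = values  # a later search line replaces an earlier one
        elif keyword == "options":
            config["options"].extend(values)
    return config


sample = """\
# illustrative contents, not read from this machine
nameserver 1.1.1.1
nameserver 8.8.8.8
search corp.example
options timeout:2 attempts:3
"""
print(parse_resolv_conf(sample))
```

To inspect a real machine, pass in the contents of /etc/resolv.conf. If the only nameserver listed is a loopback address such as 127.0.0.53, a local caching daemon is sitting between the stub and the recursor.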

Why DNS is the place outages start

A short list of major outages whose root cause was DNS:

Facebook (Oct 2021). A BGP misconfiguration removed Facebook's authoritative DNS servers from the internet. Every Facebook property became unresolvable for six hours. Engineers' own access to remote management depended on the DNS that had disappeared. Lesson: out-of-band management exists for a reason.

Akamai (June and July 2021, separate incidents). Akamai had two large outages about a month apart: the June incident hit its Prolexic DDoS-protection platform, and the July incident hit its Edge DNS service. Between them they affected hundreds of customers including airlines, banks, and major streaming services. Lesson: a single DNS provider is a single point of failure.

Dyn (Oct 2016). The Mirai botnet DDoS'd Dyn's DNS service, taking out Twitter, Reddit, GitHub, Spotify, Netflix and dozens more for hours. Lesson: many customers concentrate on the same DNS vendor; the multi-provider DNS pattern (with redundant NS records pointing at two unrelated providers) is the mitigation.

Cloudflare (July 2020). An internal BGP error caused some Cloudflare edge nodes to advertise routes they shouldn't, then to withdraw them. DNS resolution through 1.1.1.1 degraded globally for ~30 minutes. Lesson: even the resolver providers are subject to their own outages.

The pattern: DNS sits at the bottom of every modern service's request path. When DNS fails, the failure looks like every service is broken — because every service is unreachable. The teams that survive these incidents have multiple DNS providers, out-of-band access to their authoritative DNS, and runbooks that don't depend on the thing being recovered.

DNS for service discovery

Beyond resolving www.example.com, DNS plays a quiet but central role in service discovery inside many clusters:

Kubernetes CoreDNS. Every service in the cluster gets a DNS name like my-service.my-namespace.svc.cluster.local. Pods resolve service names to cluster IPs through CoreDNS. The DNS resolution latency budget is sub-millisecond; CoreDNS caches aggressively and runs as a Deployment with multiple replicas.

Consul DNS. Consul exposes a DNS interface on port 8600; services registered with Consul are queryable as service-name.service.consul. A/SRV records return the IPs and ports of healthy instances. Used as a service-discovery layer for non-Kubernetes deployments.

SRV records. The often-forgotten record type that returns a target hostname and a port, plus weight and priority. Used by Active Directory for domain controller discovery, by Kerberos, by SIP. Underused for general service discovery despite being a perfect fit.

Headless services in Kubernetes. Services with clusterIP: None resolve to the actual pod IPs (one DNS A record per pod) instead of a virtual cluster IP. Used for StatefulSets where each pod has a stable identity, and for service-mesh setups where the mesh handles load balancing itself.
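To show how a client might use the priority and weight fields of SRV records, here is a simplified selection sketch. It picks one target rather than producing the full ordering RFC 2782 describes, and the record values are invented.

```python
import random
from typing import NamedTuple


class Srv(NamedTuple):
    priority: int
    weight: int
    port: int
    target: str


def pick_srv(records, rng=random):
    """Choose a target: lowest priority wins, weight splits load within that tier."""
    best = min(r.priority for r in records)
    tier = [r for r in records if r.priority == best]
    weights = [r.weight for r in tier]
    if not any(weights):
        return rng.choice(tier)  # all-zero weights: treat the tier as equal
    return rng.choices(tier, weights=weights, k=1)[0]


# Hypothetical records: two primaries sharing load 60/40 and a lower-preference backup.
records = [
    Srv(10, 60, 5060, "sip1.example.com."),
    Srv(10, 40, 5060, "sip2.example.com."),
    Srv(20, 0, 5060, "backup.example.com."),
]
chosen = pick_srv(records)
print(f"{chosen.target}:{chosen.port}")
```

The backup is only considered once every priority-10 record has been removed from the list, which in practice means after the client has tried and failed to reach them.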

Performance — what DNS actually costs

A typical web page load in 2026 does 8-20 DNS resolutions (main domain + CDN + analytics + third-party scripts + fonts). Each adds a step to the critical path before content starts loading.

DNS prefetch. <link rel="dns-prefetch" href="//cdn.example"> tells the browser to start the DNS lookup early, before the resource is needed.

Preconnect. <link rel="preconnect"> goes further: DNS + TCP + TLS handshake done in advance. The next request to that origin starts at the application layer. Critical for performance-sensitive sites.

HTTP/2 and HTTP/3 connection reuse. Once a connection is established to an origin, subsequent requests to that origin reuse it. The DNS lookup is a one-time cost per origin per session. Sharding resources across many domains was an HTTP/1.1 optimisation that becomes counterproductive in HTTP/2.

On the server side, every new TCP connection to a backend named by hostname starts with a DNS lookup (or a local cache hit). Connection pooling (HTTP keep-alive, gRPC's persistent channel) is what keeps that cost off the request path: without it, every request pays for a fresh lookup.

Common mistakes

Hard-coding TTLs that don't match your operational tempo. If you deploy weekly, a 1-hour TTL is fine. If you cut over IPs in incident response, you need 60s already in place.
CNAME at the apex. The DNS spec forbids a CNAME at the same name as other records, and the apex (example.com) needs an SOA. Most DNS providers offer "ALIAS" or "ANAME" pseudo-records that work around this; some don't.
Forgetting CAA records. Without one, any CA is permitted to issue a cert for your domain. Publish CAA records naming the CAs you actually use; compliant CAs must check them before issuing.
Trusting nslookup. ISC marked it deprecated in early BIND 9 releases (and later reversed that), and it behaves differently from dig in subtle ways. Use dig or kdig.
Caching at every layer and forgetting which one to flush. The stub resolver, the recursor, the application's resolver library, the OS resolver, the JVM's networkaddress.cache.ttl, the load balancer's resolution cache. When a record changes and "the new IP doesn't propagate", it's almost always one of these.
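The CNAME rule in the list above is easy to check mechanically before a zone is published. A hedged sketch follows: the tuple format is an assumption rather than any provider's export format, and DNSSEC record types, which are allowed to sit next to a CNAME, are not modelled.

```python
from collections import defaultdict


def cname_conflicts(records):
    """Return names that hold a CNAME alongside any other record type."""
    types_by_name = defaultdict(set)
    for name, rtype, _value in records:
        types_by_name[name.lower()].add(rtype.upper())
    return sorted(
        name for name, types in types_by_name.items()
        if "CNAME" in types and len(types) > 1
    )


# Hypothetical zone contents as (name, type, value) tuples.
zone = [
    ("example.com.", "SOA", "(fields omitted)"),
    ("example.com.", "NS", "ns1.example.com."),
    ("example.com.", "CNAME", "lb.example.net."),      # apex CNAME: collides with SOA and NS
    ("www.example.com.", "CNAME", "lb.example.net."),  # fine: nothing else at this name
]
print(cname_conflicts(zone))
```

The apex is flagged because it must carry SOA and NS records, which is the case ALIAS and ANAME pseudo-records exist to work around.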
