Debugging Kubernetes DNS Issues

“This guide is part of the Production Kubernetes Debugging Handbook — a complete reference for debugging production Kubernetes clusters.”

Why Kubernetes DNS Failures Are Hard to Spot

DNS failures in Kubernetes are deceptive. The symptom is almost always the same — a connection timeout or “service not found” error — but the root cause can be anywhere across three or four layers. A developer reports that their service cannot connect to the database. You check the pods, they are running. You check the Service, it exists. You check the endpoints, they are bound. Everything looks fine. But DNS is silently failing.

This is the pattern that makes Kubernetes DNS debugging frustrating. The failure is invisible until you know exactly where to look.

Kubernetes uses CoreDNS as its internal DNS resolver. Every pod in the cluster is configured to use CoreDNS automatically. When you reference a service as my-service.my-namespace.svc.cluster.local, CoreDNS resolves it to the Service ClusterIP. When CoreDNS is degraded, all service-to-service communication that relies on DNS names breaks — which is essentially every microservice architecture.

This guide walks through the DNS failure patterns you are most likely to encounter in production, from CoreDNS pod crashes to NetworkPolicy blocks to external DNS resolution failures.


How Kubernetes DNS Works

Before debugging, you need a clear mental model of the DNS resolution path. Most debugging mistakes come from not knowing which layer to blame.

Pod makes DNS request
        |
        v
/etc/resolv.conf inside the pod
  nameserver 10.96.0.10      <- CoreDNS ClusterIP
  search default.svc.cluster.local svc.cluster.local cluster.local
        |
        v
CoreDNS Service (kube-dns) in kube-system namespace
        |
        v
CoreDNS Pods (usually 2 replicas)
        |
        +-- Internal name? -> Kubernetes API for Service/Endpoint lookup
        |
        +-- External name? -> Upstream DNS (your cloud provider or 8.8.8.8)

The search domain list matters. When a pod queries the short name my-service, the pod's resolver appends each search domain in turn, so CoreDNS sees these queries in order:

  1. my-service.default.svc.cluster.local
  2. my-service.svc.cluster.local
  3. my-service.cluster.local
  4. my-service (external lookup)

This is why you can use short names like my-service inside the same namespace. To reach a service in another namespace, include the namespace: my-service.other-namespace resolves via the svc.cluster.local search domain, and the full FQDN my-service.other-namespace.svc.cluster.local always works.
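The expansion can be sketched in plain shell. This is a toy model of the resolver's behavior, not a real resolver; the search list and ndots value are the defaults shown above:

```shell
#!/bin/sh
# Toy model: how the resolver expands a name through the search list.
# Names with fewer dots than ndots are tried against each search domain
# first; a trailing dot marks the name as absolute and skips the list.
expand_query() {
  name="$1"
  ndots=5
  search="default.svc.cluster.local svc.cluster.local cluster.local"
  case "$name" in
    *.) printf '%s\n' "${name%.}"; return ;;   # trailing dot: no expansion
  esac
  dots=$(printf '%s' "$name" | tr -cd '.' | wc -c)
  if [ "$dots" -lt "$ndots" ]; then
    for domain in $search; do
      printf '%s.%s\n' "$name" "$domain"       # tried in search-list order
    done
  fi
  printf '%s\n' "$name"                        # bare name tried last
}

expand_query my-service
# my-service.default.svc.cluster.local
# my-service.svc.cluster.local
# my-service.cluster.local
# my-service
```

The same model shows why a trailing dot helps: expand_query my-service.my-namespace.svc.cluster.local. produces a single query.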


Step 1 — Confirm DNS Is Actually the Problem

Before chasing CoreDNS, confirm DNS is what is failing — not the Service, not the NetworkPolicy, not the application.

bash

# Run a temporary debug pod with DNS tools
kubectl run dns-debug --image=busybox:1.28 --restart=Never -it --rm \
  -- /bin/sh

# Inside the debug pod:

# Test 1: Can you resolve the Kubernetes API itself?
nslookup kubernetes.default
# Expected: Address: 10.96.0.1 (the first IP of your cluster's service CIDR)
# If this fails: CoreDNS is completely broken

# Test 2: Can you resolve a service in the same namespace?
nslookup <service-name>
# If this fails but Test 1 works: Service does not exist or wrong namespace

# Test 3: Can you resolve a service in another namespace?
nslookup <service-name>.<namespace>.svc.cluster.local
# If this fails but Test 2 works: the Service does not exist in that namespace, or the FQDN has a typo

# Test 4: Can you resolve an external hostname?
nslookup google.com
# If this fails but internal names work: upstream DNS is misconfigured

Work through these four tests in order. The first one that fails tells you exactly which layer to investigate.
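As a sketch, that decision table can be encoded in a small triage helper (the function name and messages here are illustrative; the pass/fail results are typed in by hand after running the four tests):

```shell
#!/bin/sh
# Map the first failing test (1-4, "pass"/"fail") to the layer to investigate.
dns_triage() {
  case "$1:$2:$3:$4" in
    fail:*)              echo "CoreDNS itself: check pods, logs, and NetworkPolicies" ;;
    pass:fail:*)         echo "Service missing in this namespace, or selector mismatch" ;;
    pass:pass:fail:*)    echo "Service missing in the target namespace, or FQDN typo" ;;
    pass:pass:pass:fail) echo "Upstream DNS: check forward config and node resolv.conf" ;;
    *)                   echo "DNS is healthy: investigate the network layer instead" ;;
  esac
}

dns_triage pass pass pass fail
# Upstream DNS: check forward config and node resolv.conf
```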


Step 2 — Check CoreDNS Pod Health

CoreDNS runs as a Deployment in the kube-system namespace. Two replicas by default.

bash

# Check CoreDNS pods are running
kubectl get pods -n kube-system -l k8s-app=kube-dns

# Expected output:
# NAME                       READY   STATUS    RESTARTS   AGE
# coredns-5d78c9869d-7rp6x   1/1     Running   0          5d
# coredns-5d78c9869d-kx9xz   1/1     Running   0          5d

# If pods are not running or in CrashLoopBackOff:
kubectl describe pod <coredns-pod> -n kube-system
kubectl logs <coredns-pod> -n kube-system
kubectl logs <coredns-pod> -n kube-system --previous

Common CoreDNS Log Errors

SERVFAIL responses

[ERROR] plugin/errors: 2 SERVFAIL

CoreDNS is returning SERVFAIL — it cannot resolve the query. Usually caused by upstream DNS being unreachable or a misconfigured Corefile.

Timeout reaching upstream

[ERROR] plugin/forward: no upstream
HINFO: read udp: i/o timeout

CoreDNS cannot reach its upstream DNS servers. Check network connectivity from the CoreDNS pods to the upstream DNS IPs.

Plugin errors

[ERROR] Failed to list *v1.Endpoints

CoreDNS cannot reach the Kubernetes API to look up service endpoints. Check API server health and RBAC permissions for the CoreDNS service account.

OOMKilled CoreDNS

In large clusters with many services, CoreDNS can be OOMKilled if its memory limit is too low. Check:

bash

kubectl describe pod <coredns-pod> -n kube-system | grep -A5 "Last State"
# Reason: OOMKilled

# Increase CoreDNS memory limit
kubectl edit deployment coredns -n kube-system
# Increase: limits.memory from 170Mi to 512Mi or higher

Step 3 — Inspect the CoreDNS ConfigMap

CoreDNS behavior is controlled by a ConfigMap called coredns in kube-system. A misconfigured Corefile is a common cause of DNS failures after cluster upgrades or manual edits.

bash

kubectl get configmap coredns -n kube-system -o yaml

A healthy default Corefile looks like this:

.:53 {
    errors
    health {
       lameduck 5s
    }
    ready
    kubernetes cluster.local in-addr.arpa ip6.arpa {
       pods insecure
       fallthrough in-addr.arpa ip6.arpa
       ttl 30
    }
    prometheus :9153
    forward . /etc/resolv.conf {
       max_concurrent 1000
    }
    cache 30
    loop
    reload
    loadbalance
}

What to check:

  • kubernetes cluster.local — this must match your cluster’s domain. If your cluster uses a custom domain, this line needs updating.
  • forward . — this is where external DNS queries are forwarded. /etc/resolv.conf means CoreDNS inherits the node’s upstream DNS. If the node DNS is broken, external resolution fails.
  • cache 30 — TTL for cached responses. If set too high, stale records can cause issues after service updates.
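For example, if the node's resolv.conf turns out to be the broken link, the forward block can point at explicit upstreams instead. This is a sketch; 1.1.1.1 and 8.8.8.8 are placeholders for whatever resolvers your network actually permits:

```
forward . 1.1.1.1 8.8.8.8 {
   max_concurrent 1000
}
```

Keep in mind that hardcoding upstreams bypasses any split-horizon DNS the node's resolver configuration relied on.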

Editing the Corefile:

bash

kubectl edit configmap coredns -n kube-system

# The reload plugin in the Corefile picks up edits within a couple of minutes;
# to apply them immediately, restart CoreDNS
kubectl rollout restart deployment/coredns -n kube-system

Step 4 — Check the Pod’s DNS Configuration

Sometimes the issue is not CoreDNS itself but how a specific pod is configured to use DNS.

bash

# Check a pod's resolv.conf
kubectl exec -it <pod-name> -n <namespace> -- cat /etc/resolv.conf

# Expected output:
# nameserver 10.96.0.10
# search default.svc.cluster.local svc.cluster.local cluster.local
# options ndots:5

What each line means:

  • nameserver 10.96.0.10 — CoreDNS ClusterIP. Must match kubectl get svc kube-dns -n kube-system
  • search — the search domain list used for short name resolution
  • options ndots:5 — if a name has fewer than 5 dots, DNS tries the search domains first before treating it as a fully qualified name

ndots:5 performance issue: Because almost every service name has fewer than 5 dots, each lookup is tried against every search domain before the name as-is (up to four queries per record type with the default three-entry search list). In high-throughput microservices, this multiplies DNS query volume significantly. Use fully qualified names ending with a dot to bypass search domain expansion:

bash

# Instead of:
curl http://my-service

# Use FQDN with trailing dot to skip search domain expansion:
curl http://my-service.my-namespace.svc.cluster.local.

Custom dnsConfig on a pod:

yaml

spec:
  dnsConfig:
    options:
    - name: ndots
      value: "2"    # reduce search attempts for better performance
  dnsPolicy: ClusterFirst   # default — use CoreDNS

Step 5 — Diagnose Common DNS Failure Patterns

Pattern 1: Service Not Resolving in the Same Namespace

Symptom: nslookup my-service from a pod in the same namespace returns NXDOMAIN (not found).

Diagnosis:

bash

# Verify the Service exists
kubectl get svc my-service -n <namespace>

# Verify it has endpoints
kubectl get endpoints my-service -n <namespace>
# If Endpoints shows <none>, the Service selector does not match any pods

# Try the full FQDN
nslookup my-service.<namespace>.svc.cluster.local

Most likely cause: The Service does not exist, or exists in a different namespace. DNS returns NXDOMAIN for non-existent services — it does not mean CoreDNS is broken.

Pattern 2: Service Resolves But Connection Still Fails

Symptom: nslookup my-service succeeds and returns the ClusterIP. But curl http://my-service times out.

Diagnosis:

bash

# DNS is working — problem is at the network layer
# Check if the Service has endpoints
kubectl get endpoints my-service -n <namespace>

# Test direct pod-to-pod connectivity (bypassing the Service)
kubectl exec -it <source-pod> -- curl http://<target-pod-ip>:<port>

# Check for NetworkPolicy blocking the traffic
kubectl get networkpolicy -n <namespace>

DNS resolving correctly but connections failing means the problem is kube-proxy, iptables rules, or NetworkPolicy — not DNS.

Pattern 3: External DNS Not Resolving

Symptom: Internal service names resolve fine. External hostnames like api.github.com return NXDOMAIN or time out.

Diagnosis:

bash

# Test external resolution from inside a pod
kubectl exec -it <pod> -- nslookup google.com

# If it fails, test from CoreDNS pods directly
kubectl exec -it <coredns-pod> -n kube-system -- nslookup google.com

# Check what upstream DNS CoreDNS is forwarding to
kubectl get configmap coredns -n kube-system -o yaml | grep forward

# Check the node's DNS configuration
cat /etc/resolv.conf  # on the node via SSH

Common causes:

  • The node’s /etc/resolv.conf points to an unreachable DNS server
  • A firewall rule blocks UDP port 53 from the CoreDNS pods to the upstream server
  • In AKS: the virtual network DNS settings override the default Azure DNS

Pattern 4: Intermittent DNS Failures Under Load

Symptom: DNS works most of the time but fails intermittently during high traffic periods. Applications retry and eventually succeed.

Diagnosis:

bash

# Check CoreDNS CPU and memory under load
kubectl top pods -n kube-system | grep coredns

# Check CoreDNS metrics for errors and latency
kubectl port-forward -n kube-system svc/kube-dns 9153:9153
# Then visit: http://localhost:9153/metrics
# Look for: coredns_dns_request_duration_seconds (latency)
# Look for: coredns_dns_responses_total{rcode="SERVFAIL"} (errors)

# Check CoreDNS logs for timeout patterns
kubectl logs -n kube-system -l k8s-app=kube-dns | grep -i "timeout\|error"
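Once the metrics endpoint is reachable, the error counters can be pulled out of the scraped text with standard tools. A sketch over sample metric lines (the metric name matches what CoreDNS exports; the sample values here are invented):

```shell
#!/bin/sh
# In practice the input would come from: curl -s http://localhost:9153/metrics
metrics='coredns_dns_responses_total{server="dns://:53",rcode="NOERROR"} 152340
coredns_dns_responses_total{server="dns://:53",rcode="NXDOMAIN"} 4210
coredns_dns_responses_total{server="dns://:53",rcode="SERVFAIL"} 87'

# Sum SERVFAIL responses across all server blocks
printf '%s\n' "$metrics" \
  | awk '/rcode="SERVFAIL"/ { sum += $2 } END { print "SERVFAIL responses:", sum+0 }'
# SERVFAIL responses: 87
```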

Common causes:

CoreDNS under-resourced: Default CoreDNS resource limits are conservative. In clusters with many services and high request rates, CoreDNS can be CPU throttled, causing slow responses that applications interpret as failures.

yaml

# Increase CoreDNS resources
resources:
  limits:
    memory: 512Mi
    cpu: 500m
  requests:
    cpu: 100m
    memory: 70Mi

Too few CoreDNS replicas: Default is 2 replicas. In large clusters with 50+ nodes, scale CoreDNS horizontally:

bash

kubectl scale deployment coredns --replicas=4 -n kube-system

NodeLocal DNSCache: For high-scale clusters, deploy NodeLocal DNSCache — it runs a DNS cache on every node and intercepts DNS queries before they reach CoreDNS, dramatically reducing CoreDNS load:

bash

# The upstream manifest contains __PILLAR__ placeholder variables (kube-dns
# ClusterIP, link-local listen address, cluster domain) that must be
# substituted before applying; see the NodeLocal DNSCache docs for the steps
kubectl apply -f https://k8s.io/examples/admin/dns/nodelocaldns.yaml

Pattern 5: DNS Works From Some Pods But Not Others

Symptom: Identical applications in different namespaces or nodes have different DNS behavior.

Diagnosis:

bash

# Compare resolv.conf between a working and failing pod
kubectl exec -it <working-pod> -- cat /etc/resolv.conf
kubectl exec -it <failing-pod> -- cat /etc/resolv.conf

# Check if a NetworkPolicy is blocking DNS traffic (UDP port 53)
# CoreDNS is in kube-system -- pods need egress to kube-system:53
kubectl get networkpolicy -n <namespace-of-failing-pod>

NetworkPolicy blocking DNS is an extremely common gotcha. If a namespace has a default-deny egress policy, pods in that namespace cannot reach CoreDNS — even though CoreDNS is a cluster system component.

yaml

# Add this NetworkPolicy to allow DNS egress from any namespace with a default-deny policy
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: <your-namespace>
spec:
  podSelector: {}      # applies to all pods in the namespace
  policyTypes:
  - Egress
  egress:
  - ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
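A tighter variant restricts DNS egress to the kube-system namespace only. This sketch assumes the standard kubernetes.io/metadata.name label, which Kubernetes applies to namespaces automatically since v1.21:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress-kube-system
  namespace: <your-namespace>
spec:
  podSelector: {}      # applies to all pods in the namespace
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
```

If your cluster runs NodeLocal DNSCache, verify where pods actually send port-53 traffic before tightening the destination.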

Step 6 — Real Production Example

Scenario: After applying a new NetworkPolicy to the payments namespace for PCI compliance, all pods in the namespace start failing to connect to downstream services. The pods are running, Services exist, Endpoints are bound — but connections time out.

bash

# Test DNS from inside a payments pod
kubectl exec -it payment-svc-7d9f-xp2k -n payments -- nslookup kubernetes.default
# ;; connection timed out; no servers could be reached

DNS is completely failing — not just one service, all DNS resolution.

bash

# Check NetworkPolicies in the namespace
kubectl get networkpolicy -n payments

# NAME                  POD-SELECTOR   AGE
# pci-default-deny      <none>         4m

kubectl describe networkpolicy pci-default-deny -n payments
# Spec:
#   PodSelector: <none> (Allowing the specific traffic to all pods in this namespace)
#   PolicyTypes: Ingress, Egress
#   Egress: <none>   <- all egress denied, including DNS

The new policy denied all egress including UDP/53 to CoreDNS.

bash

# Fix: add DNS egress allowance
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: payments
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
EOF

# Test immediately
kubectl exec -it payment-svc-7d9f-xp2k -n payments -- nslookup kubernetes.default
# Server: 10.96.0.10
# Address: 10.96.0.10:53
# Name: kubernetes.default.svc.cluster.local
# Address: 10.96.0.1

Time to resolution: 11 minutes. Lesson: Any default-deny egress NetworkPolicy must explicitly allow UDP/TCP port 53 to CoreDNS. Add this as a required step in your NetworkPolicy runbook.


Quick Reference

bash

# Run DNS debug pod
kubectl run dns-debug --image=busybox:1.28 --restart=Never -it --rm -- /bin/sh

# Test internal resolution
nslookup kubernetes.default
nslookup <service>.<namespace>.svc.cluster.local

# Test external resolution
nslookup google.com

# Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns

# Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns

# Check CoreDNS ConfigMap
kubectl get configmap coredns -n kube-system -o yaml

# Restart CoreDNS
kubectl rollout restart deployment/coredns -n kube-system

# Check pod resolv.conf
kubectl exec -it <pod> -- cat /etc/resolv.conf

# Check kube-dns Service ClusterIP
kubectl get svc kube-dns -n kube-system

# Scale CoreDNS for high load
kubectl scale deployment coredns --replicas=4 -n kube-system

# Check NetworkPolicies that may block DNS
kubectl get networkpolicy -n <namespace>

Summary

Kubernetes DNS failures almost always fit one of five patterns:

  1. CoreDNS pods not running — restart them, check OOM and resource limits
  2. Corefile misconfigured — inspect the CoreDNS ConfigMap, check upstream forward settings
  3. Service does not exist — NXDOMAIN means the Service is missing or in the wrong namespace, not a DNS bug
  4. NetworkPolicy blocking port 53 — any default-deny egress policy breaks DNS unless you explicitly allow it
  5. CoreDNS overloaded — scale replicas, increase resource limits, or deploy NodeLocal DNSCache

DNS resolving but connection still failing means the problem is at the network layer — see the Debugging Kubernetes Networking guide for kube-proxy and NetworkPolicy diagnosis.

