“This guide is part of the Production Kubernetes Debugging Handbook — a complete reference for debugging production Kubernetes clusters.”
Why Kubernetes DNS Failures Are Hard to Spot
DNS failures in Kubernetes are deceptive. The symptom is almost always the same — a connection timeout or “service not found” error — but the root cause can be anywhere across three or four layers. A developer reports that their service cannot connect to the database. You check the pods, they are running. You check the Service, it exists. You check the endpoints, they are bound. Everything looks fine. But DNS is silently failing.
This is the pattern that makes Kubernetes DNS debugging frustrating. The failure is invisible until you know exactly where to look.
Kubernetes uses CoreDNS as its internal DNS resolver. Every pod in the cluster is configured to use CoreDNS automatically. When you reference a service as my-service.my-namespace.svc.cluster.local, CoreDNS resolves it to the Service ClusterIP. When CoreDNS is degraded, all service-to-service communication that relies on DNS names breaks — which is essentially every microservice architecture.
This guide walks through every DNS failure pattern you will encounter in production, from CoreDNS pod crashes to NetworkPolicy blocks to external DNS resolution failures.
How Kubernetes DNS Works
Before debugging, you need a clear mental model of the DNS resolution path. Most debugging mistakes come from not knowing which layer to blame.
Pod makes DNS request
|
v
/etc/resolv.conf inside the pod
nameserver 10.96.0.10 <- CoreDNS ClusterIP
search default.svc.cluster.local svc.cluster.local cluster.local
|
v
CoreDNS Service (kube-dns) in kube-system namespace
|
v
CoreDNS Pods (usually 2 replicas)
|
+-- Internal name? -> Kubernetes API for Service/Endpoint lookup
|
+-- External name? -> Upstream DNS (your cloud provider or 8.8.8.8)
The search domain list matters. When a pod queries the short name my-service, the pod’s resolver tries these names in order:
1. my-service.default.svc.cluster.local
2. my-service.svc.cluster.local
3. my-service.cluster.local
4. my-service (external lookup)
This is why you can use short names like my-service inside the same namespace, but must use the full FQDN my-service.other-namespace.svc.cluster.local to reach services in other namespaces.
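The expansion rule can be sketched as a small shell function. This is a hypothetical helper, not part of Kubernetes, and it is simplified: it assumes the standard search list of a pod in the default namespace, and it omits the case where a name with ndots or more dots falls back to the search list after the absolute lookup fails.

```shell
# expand_query: print the candidate names the pod's resolver will try,
# in order, for a given query. Hypothetical helper; assumes the default
# namespace's search list and the standard ndots:5.
expand_query() {
  name="$1"; ndots="${2:-5}"
  search="default.svc.cluster.local svc.cluster.local cluster.local"
  case "$name" in
    *.) echo "${name%.}"; return ;;  # trailing dot: absolute name, no search
  esac
  dots=$(printf '%s' "$name" | tr -cd '.' | wc -c)
  if [ "$dots" -lt "$ndots" ]; then
    for d in $search; do echo "$name.$d"; done
  fi
  echo "$name"
}

# expand_query my-service
# -> my-service.default.svc.cluster.local
#    my-service.svc.cluster.local
#    my-service.cluster.local
#    my-service
```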
Step 1 — Confirm DNS Is Actually the Problem
Before chasing CoreDNS, confirm DNS is what is failing — not the Service, not the NetworkPolicy, not the application.
bash
# Run a temporary debug pod with DNS tools
kubectl run dns-debug --image=busybox:1.28 --restart=Never -it --rm \
-- /bin/sh
# Inside the debug pod:
# Test 1: Can you resolve the Kubernetes API itself?
nslookup kubernetes.default
# Expected: Address: 10.96.0.1 (or your cluster's service CIDR)
# If this fails: CoreDNS is completely broken
# Test 2: Can you resolve a service in the same namespace?
nslookup <service-name>
# If this fails but Test 1 works: Service does not exist or wrong namespace
# Test 3: Can you resolve a service in another namespace?
nslookup <service-name>.<namespace>.svc.cluster.local
# If this fails but the short name works: typo in the FQDN or wrong namespace
# Test 4: Can you resolve an external hostname?
nslookup google.com
# If this fails but internal names work: upstream DNS is misconfigured
Work through these four tests in order. The first one that fails tells you exactly which layer to investigate.
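The four tests can also be wrapped into one helper that stops at the first failing layer. This is a hypothetical convenience script meant to run inside the debug pod above; the RESOLVE variable exists only so the control flow can be exercised without a cluster.

```shell
# dns_triage: run the four DNS tests in order, report the first failure.
# Hypothetical helper -- run inside a debug pod that has nslookup.
RESOLVE="${RESOLVE:-nslookup}"

dns_triage() {
  svc="${1:-my-service}"; ns="${2:-default}"
  set -- \
    "kubernetes.default|CoreDNS is completely broken" \
    "$svc|Service does not exist or wrong namespace" \
    "$svc.$ns.svc.cluster.local|check the FQDN spelling" \
    "google.com|upstream DNS is misconfigured"
  for t in "$@"; do
    target="${t%%|*}"; hint="${t##*|}"
    if $RESOLVE "$target" >/dev/null 2>&1; then
      echo "PASS: $target"
    else
      echo "FAIL: $target -> $hint"
      return 1
    fi
  done
}

# Inside the debug pod:
#   dns_triage <service-name> <namespace>
```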
Step 2 — Check CoreDNS Pod Health
CoreDNS runs as a Deployment in the kube-system namespace. Two replicas by default.
bash
# Check CoreDNS pods are running
kubectl get pods -n kube-system -l k8s-app=kube-dns
# Expected output:
# NAME READY STATUS RESTARTS AGE
# coredns-5d78c9869d-7rp6x 1/1 Running 0 5d
# coredns-5d78c9869d-kx9xz 1/1 Running 0 5d
# If pods are not running or in CrashLoopBackOff:
kubectl describe pod <coredns-pod> -n kube-system
kubectl logs <coredns-pod> -n kube-system
kubectl logs <coredns-pod> -n kube-system --previous
Common CoreDNS Log Errors
SERVFAIL responses
[ERROR] plugin/errors: 2 SERVFAIL
CoreDNS is returning SERVFAIL — it cannot resolve the query. Usually caused by upstream DNS being unreachable or a misconfigured Corefile.
Timeout reaching upstream
[ERROR] plugin/forward: no upstream
HINFO: read udp: i/o timeout
CoreDNS cannot reach its upstream DNS servers. Check network connectivity from the CoreDNS pods to the upstream DNS IPs.
Plugin errors
[ERROR] Failed to list *v1.Endpoints
CoreDNS cannot reach the Kubernetes API to look up service endpoints. Check API server health and RBAC permissions for the CoreDNS service account.
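A quick way to check the RBAC side is kubectl auth can-i while impersonating the CoreDNS service account. This sketch assumes the conventional service account name coredns in kube-system; adjust for your distribution. Note that recent CoreDNS versions list EndpointSlices, while older ones (as in the log line above) list Endpoints.

```shell
# check_coredns_rbac: verify the CoreDNS service account can read the
# objects it needs for service discovery. Assumes the service account
# is named "coredns" in kube-system (an assumption; adjust as needed).
check_coredns_rbac() {
  sa="system:serviceaccount:kube-system:coredns"
  for res in endpointslices services namespaces; do
    printf '%-15s ' "$res"
    kubectl auth can-i list "$res" --as="$sa" --all-namespaces
  done
}

# check_coredns_rbac
# Every line should end in "yes"; a "no" means the coredns ClusterRole
# or ClusterRoleBinding is missing or was modified.
```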
OOMKilled CoreDNS
In large clusters with many services, CoreDNS can be OOMKilled if its memory limit is too low. Check:
bash
kubectl describe pod <coredns-pod> -n kube-system | grep -A5 "Last State"
# Reason: OOMKilled
# Increase CoreDNS memory limit
kubectl edit deployment coredns -n kube-system
# Increase: limits.memory from 170Mi to 512Mi or higher
Step 3 — Inspect the CoreDNS ConfigMap
CoreDNS behavior is controlled by a ConfigMap called coredns in kube-system. A misconfigured Corefile is a common cause of DNS failures after cluster upgrades or manual edits.
bash
kubectl get configmap coredns -n kube-system -o yaml
A healthy default Corefile looks like this:
.:53 {
errors
health {
lameduck 5s
}
ready
kubernetes cluster.local in-addr.arpa ip6.arpa {
pods insecure
fallthrough in-addr.arpa ip6.arpa
ttl 30
}
prometheus :9153
forward . /etc/resolv.conf {
max_concurrent 1000
}
cache 30
loop
reload
loadbalance
}
What to check:
- kubernetes cluster.local — this must match your cluster’s domain. If your cluster uses a custom domain, this line needs updating.
- forward . — this is where external DNS queries are forwarded. /etc/resolv.conf means CoreDNS inherits the node’s upstream DNS. If the node DNS is broken, external resolution fails.
- cache 30 — TTL for cached responses. If set too high, stale records can cause issues after service updates.
Editing the Corefile:
bash
kubectl edit configmap coredns -n kube-system
# After editing, force CoreDNS to reload
kubectl rollout restart deployment/coredns -n kube-system
Step 4 — Check the Pod’s DNS Configuration
Sometimes the issue is not CoreDNS itself but how a specific pod is configured to use DNS.
bash
# Check a pod's resolv.conf
kubectl exec -it <pod-name> -n <namespace> -- cat /etc/resolv.conf
# Expected output:
# nameserver 10.96.0.10
# search default.svc.cluster.local svc.cluster.local cluster.local
# options ndots:5
What each line means:
- nameserver 10.96.0.10 — CoreDNS ClusterIP. Must match kubectl get svc kube-dns -n kube-system
- search — the search domain list used for short name resolution
- options ndots:5 — if a name has fewer than 5 dots, DNS tries the search domains first before treating it as a fully qualified name
ndots:5 performance issue: Almost every name has fewer than 5 dots, so each lookup walks the entire search list before trying the name as-is. With the default three search domains that is up to four queries per lookup, typically doubled again for A and AAAA records, which multiplies DNS query volume significantly in high-throughput microservices. Use fully qualified names ending with a dot to bypass search:
bash
# Instead of:
curl http://my-service
# Use FQDN with trailing dot to skip search domain expansion:
curl http://my-service.my-namespace.svc.cluster.local.
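The multiplication is easy to quantify. This hypothetical helper counts the queries a single lookup can generate under the ndots rule (before A/AAAA doubling), assuming the standard three-entry cluster search list; it also shows the gotcha that the full service FQDN has only 4 dots, so even it gets expanded unless you add the trailing dot:

```shell
# query_count: how many DNS queries one lookup can generate under the
# ndots rule. Hypothetical helper; assumes 3 search domains, and
# simplifies the dots >= ndots case to a single query.
query_count() {
  name="$1"; ndots="${2:-5}"; searchdomains=3
  case "$name" in *.) echo 1; return ;; esac   # absolute name: one query
  dots=$(printf '%s' "$name" | tr -cd '.' | wc -c)
  if [ "$dots" -lt "$ndots" ]; then
    echo $((searchdomains + 1))   # each search domain, then the name as-is
  else
    echo 1
  fi
}

# query_count my-service                                 -> 4
# query_count my-service.my-namespace.svc.cluster.local  -> 4 (only 4 dots!)
# query_count my-service.my-namespace.svc.cluster.local. -> 1
```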
Custom dnsConfig on a pod:
yaml
spec:
dnsConfig:
options:
- name: ndots
value: "2" # reduce search attempts for better performance
dnsPolicy: ClusterFirst # default — use CoreDNS
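In context, a complete minimal pod manifest with the tuned dnsConfig might look like this; the pod name, image, and ndots value are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: dns-tuned        # illustrative name
spec:
  dnsPolicy: ClusterFirst   # default -- resolve through CoreDNS
  dnsConfig:
    options:
      - name: ndots
        value: "2"          # names with 2+ dots skip the search list
  containers:
    - name: app
      image: busybox:1.28   # illustrative image
      command: ["sleep", "3600"]
```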
Step 5 — Diagnose Common DNS Failure Patterns
Pattern 1: Service Not Resolving in the Same Namespace
Symptom: nslookup my-service from a pod in the same namespace returns NXDOMAIN (not found).
Diagnosis:
bash
# Verify the Service exists
kubectl get svc my-service -n <namespace>
# Verify it has endpoints
kubectl get endpoints my-service -n <namespace>
# If Endpoints shows <none>, the Service selector does not match any pods
# Try the full FQDN
nslookup my-service.<namespace>.svc.cluster.local
Most likely cause: The Service does not exist, or exists in a different namespace. DNS returns NXDOMAIN for non-existent services — it does not mean CoreDNS is broken.
Pattern 2: Service Resolves But Connection Still Fails
Symptom: nslookup my-service succeeds and returns the ClusterIP. But curl http://my-service times out.
Diagnosis:
bash
# DNS is working — problem is at the network layer
# Check if the Service has endpoints
kubectl get endpoints my-service -n <namespace>
# Test direct pod-to-pod connectivity (bypassing the Service)
kubectl exec -it <source-pod> -- curl http://<target-pod-ip>:<port>
# Check for NetworkPolicy blocking the traffic
kubectl get networkpolicy -n <namespace>
DNS resolving correctly but connections failing means the problem is kube-proxy, iptables rules, or NetworkPolicy — not DNS.
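To separate DNS from the connection path explicitly, hit the ClusterIP directly. A sketch, as a hypothetical helper; it assumes the source pod has wget available (e.g. busybox):

```shell
# check_clusterip: request a Service by ClusterIP from inside a pod,
# bypassing DNS entirely. Hypothetical helper; assumes wget in the pod.
check_clusterip() {
  ns="$1"; svc="$2"; port="$3"; from_pod="$4"
  ip=$(kubectl get svc "$svc" -n "$ns" -o jsonpath='{.spec.clusterIP}')
  echo "ClusterIP of $svc: $ip"
  # Success here while the DNS name fails points back at DNS or app-side
  # caching; failure here too points at kube-proxy, iptables, or NetworkPolicy.
  kubectl exec -n "$ns" "$from_pod" -- wget -qO- -T 3 "http://$ip:$port/"
}

# check_clusterip <namespace> <service> <port> <source-pod>
```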
Pattern 3: External DNS Not Resolving
Symptom: Internal service names resolve fine. External hostnames like api.github.com return NXDOMAIN or timeout.
Diagnosis:
bash
# Test external resolution from inside a pod
kubectl exec -it <pod> -- nslookup google.com
# If it fails, test from CoreDNS pods directly
kubectl exec -it <coredns-pod> -n kube-system -- nslookup google.com
# Check what upstream DNS CoreDNS is forwarding to
kubectl get configmap coredns -n kube-system -o yaml | grep forward
# Check the node's DNS configuration
cat /etc/resolv.conf # on the node via SSH
Common causes:
- The node’s /etc/resolv.conf points to an unreachable DNS server
- A firewall rule blocks UDP port 53 from the CoreDNS pods to the upstream server
- In AKS: the virtual network DNS settings override the default Azure DNS
Pattern 4: Intermittent DNS Failures Under Load
Symptom: DNS works most of the time but fails intermittently during high traffic periods. Applications retry and eventually succeed.
Diagnosis:
bash
# Check CoreDNS CPU and memory under load
kubectl top pods -n kube-system | grep coredns
# Check CoreDNS metrics for errors and latency
kubectl port-forward -n kube-system svc/kube-dns 9153:9153
# Then visit: http://localhost:9153/metrics
# Look for: coredns_dns_request_duration_seconds (latency)
# Look for: coredns_dns_responses_total{rcode="SERVFAIL"} (errors)
# Check CoreDNS logs for timeout patterns
kubectl logs -n kube-system -l k8s-app=kube-dns | grep -i "timeout\|error"
Common causes:
CoreDNS under-resourced: Default CoreDNS resource limits are conservative. In clusters with many services and high request rates, CoreDNS can be CPU throttled, causing slow responses that applications interpret as failures.
yaml
# Increase CoreDNS resources
resources:
limits:
memory: 512Mi
cpu: 500m
requests:
cpu: 100m
memory: 70Mi
Too few CoreDNS replicas: Default is 2 replicas. In large clusters with 50+ nodes, scale CoreDNS horizontally:
bash
kubectl scale deployment coredns --replicas=4 -n kube-system
NodeLocal DNSCache: For high-scale clusters, deploy NodeLocal DNSCache — it runs a DNS cache on every node and intercepts DNS queries before they reach CoreDNS, dramatically reducing CoreDNS load:
bash
# NodeLocal DNSCache reduces DNS latency by ~50% and CoreDNS load by ~70%
# in clusters with heavy DNS traffic
# Note: the manifest contains __PILLAR__ variables that must be substituted
# for your cluster before applying (see the NodeLocal DNSCache docs)
kubectl apply -f https://k8s.io/examples/admin/dns/nodelocaldns.yaml
Pattern 5: DNS Works From Some Pods But Not Others
Symptom: Identical applications in different namespaces or nodes have different DNS behavior.
Diagnosis:
bash
# Compare resolv.conf between a working and failing pod
kubectl exec -it <working-pod> -- cat /etc/resolv.conf
kubectl exec -it <failing-pod> -- cat /etc/resolv.conf
# Check if a NetworkPolicy is blocking DNS traffic (UDP port 53)
# CoreDNS is in kube-system -- pods need egress to kube-system:53
kubectl get networkpolicy -n <namespace-of-failing-pod>
NetworkPolicy blocking DNS is an extremely common gotcha. If a namespace has a default-deny egress policy, pods in that namespace cannot reach CoreDNS — even though CoreDNS is a cluster system component.
yaml
# Add this NetworkPolicy to allow DNS egress from any namespace with a default-deny policy
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-dns-egress
namespace: <your-namespace>
spec:
podSelector: {} # applies to all pods in the namespace
policyTypes:
- Egress
egress:
- ports:
- protocol: UDP
port: 53
- protocol: TCP
port: 53
Step 6 — Real Production Example
Scenario: After applying a new NetworkPolicy to the payments namespace for PCI compliance, all pods in the namespace start failing to connect to downstream services. The pods are running, Services exist, Endpoints are bound — but connections time out.
bash
# Test DNS from inside a payments pod
kubectl exec -it payment-svc-7d9f-xp2k -n payments -- nslookup kubernetes.default
# ;; connection timed out; no servers could be reached
DNS is completely failing — not just one service, all DNS resolution.
bash
# Check NetworkPolicies in the namespace
kubectl get networkpolicy -n payments
# NAME POD-SELECTOR AGE
# pci-default-deny <none> 4m
kubectl describe networkpolicy pci-default-deny -n payments
# Spec:
# PodSelector: <none> (Allowing the specific traffic to all pods in this namespace)
# PolicyTypes: Ingress, Egress
# Egress: <none> <- all egress denied, including DNS
The new policy denied all egress including UDP/53 to CoreDNS.
bash
# Fix: add DNS egress allowance
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-dns-egress
namespace: payments
spec:
podSelector: {}
policyTypes:
- Egress
egress:
- ports:
- protocol: UDP
port: 53
- protocol: TCP
port: 53
EOF
# Test immediately
kubectl exec -it payment-svc-7d9f-xp2k -n payments -- nslookup kubernetes.default
# Server: 10.96.0.10
# Address: 10.96.0.10:53
# Name: kubernetes.default.svc.cluster.local
# Address: 10.96.0.1
Time to resolution: 11 minutes. Lesson: Any default-deny egress NetworkPolicy must explicitly allow UDP/TCP port 53 to CoreDNS. Add this as a required step in your NetworkPolicy runbook.
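That runbook step can be roughly automated: flag namespaces that have NetworkPolicies but none referencing port 53. This is a heuristic grep over rendered YAML, not a real policy evaluator, so treat a WARN as a prompt to inspect, not proof of breakage:

```shell
# check_dns_egress: heuristic runbook check. A namespace with policies
# but no mention of port 53 may be blocking DNS. Hypothetical helper.
check_dns_egress() {
  ns="$1"
  names=$(kubectl get networkpolicy -n "$ns" -o name 2>/dev/null)
  if [ -z "$names" ]; then
    echo "OK: $ns has no NetworkPolicies"
    return
  fi
  if kubectl get networkpolicy -n "$ns" -o yaml | grep -q "port: 53"; then
    echo "OK: a policy in $ns references port 53"
  else
    echo "WARN: policies exist in $ns but none reference port 53"
  fi
}

# for ns in $(kubectl get ns -o jsonpath='{.items[*].metadata.name}'); do
#   check_dns_egress "$ns"
# done
```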
Quick Reference
bash
# Run DNS debug pod
kubectl run dns-debug --image=busybox:1.28 --restart=Never -it --rm -- /bin/sh
# Test internal resolution
nslookup kubernetes.default
nslookup <service>.<namespace>.svc.cluster.local
# Test external resolution
nslookup google.com
# Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns
# Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns
# Check CoreDNS ConfigMap
kubectl get configmap coredns -n kube-system -o yaml
# Restart CoreDNS
kubectl rollout restart deployment/coredns -n kube-system
# Check pod resolv.conf
kubectl exec -it <pod> -- cat /etc/resolv.conf
# Check kube-dns Service ClusterIP
kubectl get svc kube-dns -n kube-system
# Scale CoreDNS for high load
kubectl scale deployment coredns --replicas=4 -n kube-system
# Check NetworkPolicies that may block DNS
kubectl get networkpolicy -n <namespace>
Summary
Kubernetes DNS failures always fit one of five patterns:
- CoreDNS pods not running — restart them, check OOM and resource limits
- Corefile misconfigured — inspect the CoreDNS ConfigMap, check upstream forward settings
- Service does not exist — NXDOMAIN means the Service is missing or in the wrong namespace, not a DNS bug
- NetworkPolicy blocking port 53 — any default-deny egress policy breaks DNS unless you explicitly allow it
- CoreDNS overloaded — scale replicas, increase resource limits, or deploy NodeLocal DNSCache
DNS resolving but connection still failing means the problem is at the network layer — see the Debugging Kubernetes Networking guide for kube-proxy and NetworkPolicy diagnosis.