Introduction: Why Kubernetes Debugging Is a Different Beast
Kubernetes is powerful. It is also one of the most complex systems a DevOps engineer will ever operate in production.
When something breaks — and it will break — the failure is rarely obvious. A pod disappears. A service stops responding. A node silently goes NotReady at 2 AM. The symptom you see on the surface is almost never where the actual problem lives.
Unlike debugging a single server, Kubernetes failures span multiple layers. A single application outage can involve container runtime issues, scheduler decisions, network policy conflicts, resource exhaustion, and control plane delays. Each layer has its own signals, tools, and failure modes.
This is what makes Kubernetes debugging hard. Not the complexity of any single component, but the sheer number of components that interact — and the fact that in production, you are debugging under pressure, with real users affected and stakeholders watching.
This handbook exists to give you a systematic approach, linking concepts across pod, node, networking, storage, and control plane troubleshooting.
After managing AKS clusters running 500+ cores for Fortune 500 clients, I have learned one lesson above all: a solid production Kubernetes debugging workflow is not about memorizing error messages. It is about knowing where to look, in what order, and what each signal means. That requires Kubernetes cluster observability — logs, events, metrics, and state — working together.
What You Will Learn
This handbook covers the major pod-level failures and cluster-level issues you will encounter in real production environments:
- Pod failures — CrashLoopBackOff, ImagePullBackOff, OOMKilled, Pending pods
- Node issues — NotReady nodes, disk pressure, memory pressure
- Networking and DNS — services unreachable, DNS failures, NetworkPolicy blocks
- Resource and scheduling — misconfigured limits, unschedulable pods
- Storage failures — PVCs stuck in Pending, volume mount errors
- Control plane issues — etcd problems, API server degradation
- Real incident walkthroughs — symptom to resolution
Who This Is For
DevOps engineers, SREs, and platform engineers operating Kubernetes in production. It assumes you know what a pod is — not that you have seen every failure mode.
Tools You Will Need
- kubectl — your primary interface for everything
- stern — multi-pod log tailing across namespaces
- k9s — terminal UI for real-time cluster navigation
- kubectx / kubens — fast context and namespace switching
- Prometheus or Azure Monitor — metrics and alerting
Tip: If you do not have stern and k9s installed, start there. They cut debugging time significantly compared to running repeated kubectl logs commands manually.
How This Handbook Is Structured
Each section follows the same pattern: what the failure means, how to diagnose it step by step, and a concrete fix checklist. Use the links below to jump directly to the section you need:
Table of Contents
- Debugging CrashLoopBackOff and Pod Failures
- Debugging Node NotReady Issues
- Debugging Kubernetes Networking and DNS
- Debugging Resource and Scheduling Problems
- Debugging Storage and Persistent Volume Issues
- Debugging Control Plane Failures
- Real Production Incident Walkthroughs
This handbook is a living reference. Use it alongside the Kubernetes Guide for full topic coverage.
Section 1 — Debugging CrashLoopBackOff and Pod Failures
Pod failures are the most common class of Kubernetes issues you will encounter in production. The challenge is that the same symptom — a pod not running — can have a dozen different root causes. This section gives you a systematic approach to diagnose and resolve each one.
The 5-Minute Pod Triage Checklist
Before diving into specific failure types, always start with these four commands. They give you 80% of the information you need in under five minutes.
# 1. What is the pod status?
kubectl get pods -n <namespace>
# 2. What events have occurred?
kubectl describe pod <pod-name> -n <namespace>
# 3. What do the logs say?
kubectl logs <pod-name> -n <namespace>
# 4. If the pod has already crashed, check previous logs
kubectl logs <pod-name> -n <namespace> --previous
Run these in order, every time. Do not skip steps. Events in describe often reveal the root cause before you even check logs.
CrashLoopBackOff
What it means
CrashLoopBackOff is not an error itself — it is Kubernetes telling you that your container keeps starting and immediately crashing. Kubernetes applies an exponential backoff between each restart attempt (10s, 20s, 40s, up to 5 minutes), which is where the name comes from.
Common causes
- Application error on startup (misconfigured environment variable, missing config file)
- Failed database or dependency connection on boot
- Incorrect entrypoint or command in the container image
- Liveness probe failing immediately after container starts
- OOMKilled on startup (not enough memory for the process to initialise)
How to diagnose
# Check restart count and last state
kubectl describe pod <pod-name> -n <namespace>
# Look for Exit Code in Last State section
# Exit Code 1 → application error
# Exit Code 137 → OOMKilled (killed by kernel)
# Exit Code 143 → SIGTERM (graceful shutdown signal)
# Exit Code 127 → command or entrypoint not found
# Get logs from the crashed container
kubectl logs <pod-name> -n <namespace> --previous
What to look for in describe output
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Sun, 08 Mar 2026 01:00:00 +0000
Finished: Sun, 08 Mar 2026 01:00:02 +0000
A container that exits within 2 seconds almost always means the application failed to start — check your startup logs and environment variables first.
Fix checklist
- Verify all required environment variables are set and correct
- Check if referenced ConfigMaps or Secrets exist in the same namespace
- Confirm the container image entrypoint is correct
- Review liveness probe configuration — add initialDelaySeconds if the app needs time to boot
- Check resource limits — if memory limit is too low, increase it
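If the crashes line up with probe failures, a more forgiving liveness probe often resolves the loop. A minimal sketch, assuming an HTTP health endpoint — the container name, image, /healthz path, and timings are placeholders to adapt:

```yaml
containers:
- name: myapp                  # placeholder name
  image: myrepo/myapp:v1.2     # placeholder image
  livenessProbe:
    httpGet:
      path: /healthz           # assumed health endpoint
      port: 8080
    initialDelaySeconds: 30    # give the app time to boot before the first check
    periodSeconds: 10
    failureThreshold: 3        # require 3 consecutive failures before a restart
```

For apps with genuinely slow startup, a dedicated startupProbe is often a better fit than a long initialDelaySeconds, since it only guards the boot phase.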
ImagePullBackOff / ErrImagePull
What it means
Kubernetes cannot pull the container image from the registry. ErrImagePull is the first attempt failure. ImagePullBackOff is Kubernetes backing off after repeated failures.
Common causes
- Image name or tag is incorrect
- Image does not exist in the registry
- Private registry credentials missing or expired
- Network connectivity issue to the registry
- Rate limiting from Docker Hub
How to diagnose
kubectl describe pod <pod-name> -n <namespace>
# Look for Events section:
# Failed to pull image "myrepo/myapp:v1.2": ...
# unauthorized: authentication required
# not found
Fix checklist
# Verify the image exists
docker pull <image-name>:<tag>
# Check if imagePullSecret is configured
kubectl get pod <pod-name> -o yaml | grep imagePullSecret
# Check the secret exists
kubectl get secret <secret-name> -n <namespace>
# Re-create the pull secret if needed
kubectl create secret docker-registry regcred \
--docker-server=<registry-url> \
--docker-username=<username> \
--docker-password=<password> \
-n <namespace>
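Creating the secret is not enough on its own: the pod spec must also reference it. A minimal sketch, with the pod name and image as placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  imagePullSecrets:
  - name: regcred              # must match the secret created above
  containers:
  - name: myapp
    image: <registry-url>/myapp:v1.2
```

Alternatively, attaching the secret to the namespace's default ServiceAccount applies it to every pod in that namespace without editing each spec.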
OOMKilled
What it means
The Linux kernel killed your container because it exceeded its memory limit. Exit code will be 137.
How to diagnose
kubectl describe pod <pod-name> -n <namespace>
# Look for:
# Last State: Terminated
# Reason: OOMKilled
# Exit Code: 137
Fix checklist
- Increase the memory limit in your pod spec
- Profile the application to understand actual memory usage
- Check for memory leaks in the application
- If using Java, set the -Xmx heap size below your container memory limit
resources:
requests:
memory: "256Mi"
limits:
memory: "512Mi"
Important: always set requests and limits together. A container without a memory limit can consume all node memory and cause node-level failures affecting other workloads.
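For JVM workloads, one way to keep the heap safely below the container limit is an environment variable the JVM picks up at startup. A sketch with illustrative values — image name and sizing are placeholders:

```yaml
containers:
- name: java-app
  image: myrepo/java-app:latest     # placeholder image
  env:
  - name: JAVA_TOOL_OPTIONS
    value: "-Xmx400m"               # keep heap well below the 512Mi container limit
  resources:
    requests:
      memory: "256Mi"
    limits:
      memory: "512Mi"
```

Recent JVMs also support percentage-based sizing (-XX:MaxRAMPercentage), which adapts automatically if you later change the container limit.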
Pending Pods
What it means
The pod has been accepted by the API server but the scheduler cannot place it on any node.
Common causes
- Insufficient CPU or memory across all nodes
- Node selector or affinity rules that no node satisfies
- Taints on nodes that the pod does not tolerate
- PersistentVolumeClaim not bound
How to diagnose
kubectl describe pod <pod-name> -n <namespace>
# Events section will tell you exactly why:
# 0/5 nodes are available: 5 Insufficient memory
# 0/5 nodes are available: 5 node(s) had taint {key:value} that the pod did not tolerate
# 0/5 nodes are available: 5 pod has unbound immediate PersistentVolumeClaims
Fix checklist
# Check node capacity
kubectl describe nodes | grep -A5 "Allocated resources"
# Check node taints
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
# Check PVC status
kubectl get pvc -n <namespace>
Quick Reference — Pod Exit Codes
| Exit Code | Meaning | First place to check |
|---|---|---|
| 0 | Clean exit | Check liveness/readiness probe |
| 1 | Application error | Application logs |
| 137 | OOMKilled | Increase memory limit |
| 139 | Segfault | Application or library bug |
| 143 | SIGTERM received | Check preStop hooks |
| 127 | Command not found | Check image CMD/ENTRYPOINT |
Section 2 — Debugging Node NotReady Issues
Node failures are more severe than pod failures. When a node goes NotReady, every workload running on it is at risk. In a production cluster, a single NotReady node can trigger a cascade — pods get evicted, rescheduled onto already-strained nodes, and suddenly you have a cluster-wide resource pressure event instead of a single node issue.
The key is catching it early and diagnosing the root cause before it spreads.
The 5-Minute Node Triage Checklist
# 1. Check which nodes are affected
kubectl get nodes
# 2. Get detailed status and conditions
kubectl describe node <node-name>
# 3. Check system-level pods on the node (kubelet, kube-proxy)
kubectl get pods -n kube-system -o wide | grep <node-name>
# 4. Check node resource usage
kubectl top node <node-name>
# 5. SSH into the node and check kubelet status
systemctl status kubelet
journalctl -u kubelet -n 100 --no-pager
Run these in order before doing anything else. The describe node output will almost always tell you which condition is failing.
Understanding Node Conditions
When you run kubectl describe node, look for the Conditions section. Each condition tells you a specific story:
| Condition | Status | Meaning |
|---|---|---|
| Ready | True | Node is healthy |
| Ready | False | Node is NotReady — kubelet has reported a problem |
| Ready | Unknown | Node controller lost contact with the node |
| MemoryPressure | True | Node is low on memory |
| DiskPressure | True | Node is low on disk space |
| PIDPressure | True | Too many processes running on the node |
| NetworkUnavailable | True | Node network is not configured correctly |
Warning: A Ready: Unknown status means the node controller has not received a heartbeat from the kubelet within the node-monitor-grace-period (40 seconds by default). This usually means the node is completely unreachable — network failure, VM crash, or the kubelet process died.
Node NotReady — Kubelet Failure
What it means
The kubelet process on the node has stopped reporting to the control plane. This is the most common cause of NotReady.
How to diagnose
# SSH into the affected node
ssh <node-ip>
# Check kubelet service status
systemctl status kubelet
# View kubelet logs
journalctl -u kubelet -n 200 --no-pager
# Common log patterns to look for:
# "failed to get node info"
# "PLEG is not healthy"
# "container runtime is down"
# "failed to load kubeconfig"
PLEG (Pod Lifecycle Event Generator) errors are particularly important. If you see PLEG is not healthy, it means the kubelet cannot communicate with the container runtime (containerd or Docker). This often points to a container runtime crash.
# Check container runtime status
systemctl status containerd
# or
systemctl status docker
# Restart container runtime if crashed
systemctl restart containerd
Fix checklist
# Restart kubelet
systemctl restart kubelet
# If kubelet fails to start, check config
cat /etc/kubernetes/kubelet.conf
cat /var/lib/kubelet/config.yaml
# Check certificates — expired certs cause kubelet to fail
openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -dates
Node NotReady — Memory Pressure
What it means
The node is running out of available memory. Kubernetes will stop scheduling new pods onto this node and may begin evicting existing pods.
How to diagnose
kubectl describe node <node-name> | grep -A5 "Conditions"
# Look for: MemoryPressure: True
# Check actual memory usage on the node
kubectl top node <node-name>
# SSH into node and check memory
free -h
cat /proc/meminfo
Common causes in production
- A pod without memory limits is consuming all available node memory
- A memory leak in a long-running application
- Too many pods scheduled onto a single node
Fix checklist
# Find the top memory-consuming pods on the node
kubectl top pods --all-namespaces --sort-by=memory | head -20
# Identify pods without memory limits
kubectl get pods --all-namespaces -o json | \
jq '.items[] | select(.spec.containers[].resources.limits.memory == null) | .metadata.name'
# Cordon the node to stop new scheduling while you investigate
kubectl cordon <node-name>
Tip: Always set memory requests and limits on every container in production. A single pod without limits can bring down an entire node.
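One way to enforce that rule by default is a namespace-level LimitRange, so containers that declare nothing still get sane values. A sketch with illustrative sizes — the namespace and numbers are placeholders:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-memory-limits
  namespace: production          # placeholder namespace
spec:
  limits:
  - type: Container
    defaultRequest:
      memory: "256Mi"            # applied when a container sets no request
    default:
      memory: "512Mi"            # applied when a container sets no limit
```

This is a safety net, not a substitute for tuning: workloads with real memory profiles should still set their own values.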
Node NotReady — Disk Pressure
What it means
The node is running low on disk space. Kubernetes will evict pods and stop scheduling new workloads onto the node.
How to diagnose
kubectl describe node <node-name> | grep DiskPressure
# DiskPressure: True
# SSH into the node
ssh <node-ip>
# Check disk usage
df -h
# Find what is consuming disk space
du -sh /var/lib/docker/* # if using Docker
du -sh /var/lib/containerd/* # if using containerd
du -sh /var/log/*
Most common culprits
- Unused container images accumulating on the node
- Container logs growing without rotation
- Large volumes mounted at unusual paths
- Core dumps from crashing processes
Fix checklist
# Clean up unused images (safe to run on any node)
crictl rmi --prune
# or for Docker nodes:
docker image prune -af
# Clean up stopped containers
crictl rm $(crictl ps -a -q --state exited)
# Check and configure log rotation
cat /etc/docker/daemon.json
# Add: "log-opts": {"max-size": "100m", "max-file": "3"}
# For containerd, check log rotation in kubelet config
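On containerd-based nodes, container log rotation is handled by the kubelet rather than a daemon.json. A sketch of the relevant KubeletConfiguration fields (values are illustrative):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
containerLogMaxSize: "100Mi"   # rotate when a container log file reaches this size
containerLogMaxFiles: 3        # keep at most 3 rotated files per container
```

On managed clusters (AKS, EKS, GKE) these are typically set through the provider's node pool configuration rather than by editing the file directly.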
Node NotReady — Network Unavailable
What it means
The CNI (Container Network Interface) plugin has not configured networking correctly on the node. New pods cannot get IP addresses.
How to diagnose
kubectl describe node <node-name> | grep NetworkUnavailable
# NetworkUnavailable: True
# Check CNI plugin pods
kubectl get pods -n kube-system | grep -E "calico|flannel|cilium|weave"
# Check CNI logs
kubectl logs -n kube-system <cni-pod-name>
# On the node, check CNI config
ls /etc/cni/net.d/
cat /etc/cni/net.d/<config-file>
Fix checklist
# Restart the CNI plugin pod on the affected node
kubectl delete pod -n kube-system <cni-pod-on-node>
# If CNI config is missing, re-apply the CNI manifest
kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml
Draining vs Cordoning — When to Use Each
| Action | Command | When to use |
|---|---|---|
| Cordon | kubectl cordon <node> | Stop new pods scheduling while you investigate. Existing pods keep running. |
| Drain | kubectl drain <node> --ignore-daemonsets --delete-emptydir-data | Safely evict all pods before maintenance or node replacement. |
| Uncordon | kubectl uncordon <node> | Return node to schedulable state after fix is confirmed. |
Warning: kubectl drain will evict all pods from the node. Make sure your workloads have enough replicas running on other nodes before draining, or you will cause downtime.
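A PodDisruptionBudget gives you a guardrail here: drain respects it and pauses eviction rather than dropping below your minimum. A minimal sketch, with the app label as a placeholder:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb
spec:
  minAvailable: 2          # drain waits rather than go below 2 ready replicas
  selector:
    matchLabels:
      app: myapp           # placeholder label for your workload
```

Note that a PDB only protects against voluntary disruptions (drain, eviction API), not node crashes.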
Quick Reference — Node Conditions and First Actions
| Condition | First Command | Most Likely Cause |
|---|---|---|
| Ready: Unknown | systemctl status kubelet | Node unreachable / kubelet crashed |
| MemoryPressure | kubectl top node | Pod without memory limits |
| DiskPressure | df -h on node | Accumulated images or logs |
| NetworkUnavailable | Check CNI pod logs | CNI plugin failure |
| PIDPressure | ps aux | wc -l on node | Fork bomb or runaway process |
Section 3 — Debugging Kubernetes Networking and DNS
Networking failures in Kubernetes are among the hardest to debug. The system has multiple layers — pod networking, Services, DNS, Ingress, and NetworkPolicies — and a failure in any one of them produces similar symptoms: requests time out, connections are refused, or DNS names do not resolve. The trick is isolating which layer is failing before you start making changes.
The 5-Minute Network Triage Checklist
# 1. Can the pod reach the internet?
kubectl exec -it <pod-name> -n <namespace> -- curl -I https://google.com
# 2. Can the pod reach another pod by IP?
kubectl exec -it <pod-name> -n <namespace> -- curl http://<pod-ip>:<port>
# 3. Can the pod reach a Service by ClusterIP?
kubectl exec -it <pod-name> -n <namespace> -- curl http://<cluster-ip>:<port>
# 4. Can the pod reach a Service by DNS name?
kubectl exec -it <pod-name> -n <namespace> -- curl http://<service-name>.<namespace>.svc.cluster.local:<port>
# 5. Does DNS resolution work at all?
kubectl exec -it <pod-name> -n <namespace> -- nslookup kubernetes.default
Work through these in order. If step 2 works but step 3 fails, the problem is in kube-proxy or iptables rules. If step 3 works but step 4 fails, the problem is DNS. This narrows your diagnosis before you spend time looking in the wrong place.
Service Not Reachable
What it means
A pod cannot connect to a Kubernetes Service by its ClusterIP or DNS name, even though the target pods are running and healthy.
How to diagnose
# Check the Service exists and has the right port
kubectl get svc <service-name> -n <namespace>
kubectl describe svc <service-name> -n <namespace>
# Check if the Service has Endpoints (this is the most important check)
kubectl get endpoints <service-name> -n <namespace>
# If Endpoints shows <none>, the selector does not match any pods
The most common cause: selector mismatch
# Check what labels the Service is selecting
kubectl get svc <service-name> -n <namespace> -o yaml | grep selector -A5
# Check what labels your pods actually have
kubectl get pods -n <namespace> --show-labels
# They must match exactly — including case
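For reference, a matching pair looks like this; the Service selector must equal the pod template labels exactly. Names, labels, and ports below are placeholders:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp            # must match the pod labels below, case-sensitively
  ports:
  - port: 80
    targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 2
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp        # the labels the Service selector matches against
    spec:
      containers:
      - name: myapp
        image: myrepo/myapp:v1.2   # placeholder image
        ports:
        - containerPort: 8080
```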
Fix checklist
# Verify kube-proxy is running on all nodes
kubectl get pods -n kube-system | grep kube-proxy
# Check iptables rules are being applied (on the node)
iptables -t nat -L KUBE-SERVICES | grep <service-cluster-ip>
# Test direct pod-to-pod connectivity to rule out Service layer
kubectl exec -it <source-pod> -- curl http://<target-pod-ip>:<port>
Tip: If kubectl get endpoints shows no addresses, your pods are either not running, not passing readiness probes, or the Service selector does not match the pod labels. Fix the selector first — it is the most common cause by far.
DNS Resolution Failures
What it means
Pods cannot resolve Kubernetes service names or external hostnames. This usually points to CoreDNS.
How to diagnose
# Check CoreDNS pods are running
kubectl get pods -n kube-system | grep coredns
# Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns
# Test DNS from inside a pod
kubectl exec -it <pod-name> -n <namespace> -- nslookup kubernetes.default
kubectl exec -it <pod-name> -n <namespace> -- nslookup <service-name>.<namespace>.svc.cluster.local
# Check the pod's DNS config
kubectl exec -it <pod-name> -n <namespace> -- cat /etc/resolv.conf
Common CoreDNS log errors
[ERROR] plugin/errors: 2 SERVFAIL
[ERROR] Failed to list *v1.Endpoints
HINFO: read udp: i/o timeout
Fix checklist
# Restart CoreDNS pods
kubectl rollout restart deployment/coredns -n kube-system
# Check CoreDNS ConfigMap for misconfigurations
kubectl get configmap coredns -n kube-system -o yaml
# Check if DNS requests are hitting CoreDNS at all
kubectl get svc kube-dns -n kube-system
# Verify ClusterIP matches what pods have in /etc/resolv.conf
NetworkPolicy Blocking Traffic
What it means
A NetworkPolicy is explicitly denying traffic between pods or from external sources. This is a common gotcha in production clusters where NetworkPolicies have been applied incrementally.
How to diagnose
# List all NetworkPolicies in the namespace
kubectl get networkpolicy -n <namespace>
# Describe a specific policy to see its rules
kubectl describe networkpolicy <policy-name> -n <namespace>
# Check if any policy applies to your pod
kubectl get networkpolicy -n <namespace> -o yaml | grep -A10 podSelector
NetworkPolicy logic to remember
- If no NetworkPolicy selects a pod → all traffic is allowed (default open)
- If any NetworkPolicy selects a pod → only explicitly allowed traffic passes
- If a direction (ingress or egress) is listed in policyTypes but has no rules, that direction is fully denied
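The "selected means default-deny" behaviour is easiest to see with the common baseline policy below: it selects every pod in the namespace and allows nothing, so all ingress is blocked until other policies add explicit allows (the namespace is a placeholder):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: <namespace>
spec:
  podSelector: {}          # empty selector = every pod in the namespace
  policyTypes:
  - Ingress                # Ingress listed with no rules = all ingress denied
```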
Fix checklist
# Temporarily test by deleting a restrictive policy (non-production only)
kubectl delete networkpolicy <policy-name> -n <namespace>
# Add an explicit allow rule for the traffic you need
# Example: allow ingress from another namespace
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-from-monitoring
namespace: <namespace>
spec:
podSelector:
matchLabels:
app: myapp
ingress:
- from:
- namespaceSelector:
matchLabels:
name: monitoring
EOF
Ingress and LoadBalancer Troubleshooting
Ingress is where external traffic enters your cluster. When it breaks, users cannot reach your application even if every pod inside the cluster is healthy. Failures fall into three categories: misconfigured routing rules, TLS certificate issues, and Ingress controller failures.
The 5-Minute Ingress Triage Checklist
# 1. Check Ingress and whether an address is assigned
kubectl get ingress -n <namespace>
# ADDRESS column should show an IP -- if empty, controller has not processed it
# 2. Describe the Ingress for rule details
kubectl describe ingress <ingress-name> -n <namespace>
# 3. Check Ingress controller pods
kubectl get pods -n ingress-nginx
# 4. Check Ingress controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx --tail=50
# 5. Verify backend Service and Endpoints
kubectl get endpoints <backend-service> -n <namespace>
Common Misconfiguration: Wrong Path Rules
# Use Prefix for flexible matching -- not Exact
spec:
rules:
- host: api.opscart.com
http:
paths:
- path: /api
pathType: Prefix # matches /api, /api/v1, /api/v2
backend:
service:
name: api-service
port:
number: 8080
Common mistakes: trailing slash mismatch, Ingress and backend Service in different namespaces, or backend port not matching what the application actually listens on.
TLS Certificate Issues
# Verify the TLS secret has correct keys
kubectl get secret <tls-secret-name> -n <namespace> -o yaml
# Must contain: tls.crt and tls.key
# Check certificate expiry
kubectl get secret <tls-secret-name> -n <namespace> \
-o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -dates
# If using cert-manager
kubectl get certificate -n <namespace>
kubectl describe certificate <cert-name> -n <namespace>
# Look for Ready: True
# If ACME HTTP-01 challenge is stuck (domain must be publicly reachable on port 80)
kubectl get challenges -n <namespace>
kubectl describe challenge <challenge-name> -n <namespace>
Gotcha: A wildcard cert (*.opscart.com) does not cover second-level subdomains (api.k8s.opscart.com). This catches teams off-guard when clusters use nested subdomain structures.
External DNS vs Internal DNS Hairpinning
When internal pods resolve an external hostname and get routed through the public load balancer instead of staying inside the cluster, you get unnecessary latency and potential firewall issues. Always use internal Service DNS (service.namespace.svc.cluster.local) for pod-to-pod communication and reserve external hostnames for traffic entering from outside.
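In practice this means wiring internal clients to the Service DNS name rather than the public hostname. A sketch of how that might look in application config — the ConfigMap name, key, and service names are placeholders:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: frontend-config
data:
  # Internal traffic stays inside the cluster:
  API_URL: "http://api-service.backend.svc.cluster.local:8080"
  # Not "https://api.opscart.com", which would hairpin through the public LB
```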
Calico and Cilium — Common NetworkPolicy Gotchas
Calico
# Check Calico node pods
kubectl get pods -n kube-system -l k8s-app=calico-node
# Inspect active policies
calicoctl get networkpolicy --all-namespaces
calicoctl get globalnetworkpolicy
Gotcha: Calico’s GlobalNetworkPolicy applies cluster-wide and overrides namespace-level policies. A deny rule in a global policy will block traffic even if your namespace NetworkPolicy explicitly allows it. Always check global policies when namespace rules look correct but traffic is still blocked.
Cilium
# Check Cilium agents
kubectl get pods -n kube-system -l k8s-app=cilium
# Test connectivity
cilium connectivity test
# Inspect policies
kubectl exec -n kube-system <cilium-pod> -- cilium endpoint list
kubectl exec -n kube-system <cilium-pod> -- cilium policy get
Gotcha: CiliumNetworkPolicy CRDs silently have no effect on clusters running Calico. If you migrate CNI plugins and carry over Cilium-specific policies, they are ignored without any error.
Quick Reference — Network Failure Layers
| Symptom | Layer | First Command |
|---|---|---|
| Pod to Pod by IP fails | CNI / Node networking | ping <pod-ip> from source pod |
| Pod to ClusterIP fails | kube-proxy / iptables | kubectl get endpoints <svc> |
| Pod to Service DNS fails | CoreDNS | nslookup kubernetes.default |
| Ingress returns 404 | Wrong path rules | kubectl describe ingress |
| Ingress returns 503 | Backend Endpoints empty | kubectl get endpoints <svc> |
| TLS handshake fails | Cert missing or expired | kubectl get certificate |
| NetworkPolicy blocking | Policy selector mismatch | kubectl get networkpolicy -A |
| Intermittent timeouts | CNI bug or node networking | Check CNI pod logs |
Section 4 — Debugging Resource and Scheduling Problems
Scheduling failures and resource misconfigurations are silent killers in production Kubernetes. Pods sit in Pending state while on-call engineers scramble, or applications get throttled and slow down without any obvious error. This section covers how to quickly identify what is blocking scheduling and how to fix resource configuration before it causes an incident.
The 5-Minute Scheduling Triage Checklist
# 1. Why is the pod Pending?
kubectl describe pod <pod-name> -n <namespace>
# Read the Events section — the scheduler always explains why it cannot place the pod
# 2. What resources are available on each node?
kubectl describe nodes | grep -A8 "Allocated resources"
# 3. What are the current resource requests across the cluster?
kubectl top nodes
# 4. Are there any taints blocking placement?
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
# 5. Check resource quotas in the namespace
kubectl describe resourcequota -n <namespace>
Pods Stuck in Pending — Insufficient Resources
What it means
The scheduler cannot find a node with enough CPU or memory to satisfy the pod’s resource requests.
How to diagnose
kubectl describe pod <pod-name> -n <namespace>
# Events section will show:
# 0/5 nodes are available: 5 Insufficient memory.
# 0/5 nodes are available: 3 Insufficient cpu, 2 node(s) had taint...
Fix checklist
# Find nodes with available capacity
kubectl describe nodes | grep -A5 "Allocated resources"
# Check if requests are set too high on the pending pod
kubectl get pod <pod-name> -o yaml | grep -A10 resources
# Option 1: Lower the resource requests if they are over-provisioned
# Option 2: Scale up the node pool (in cloud environments)
# Option 3: Remove idle pods consuming resources unnecessarily
# Identify resource-heavy pods
kubectl top pods --all-namespaces --sort-by=cpu
kubectl top pods --all-namespaces --sort-by=memory
Tip: Resource requests determine scheduling — not limits. A pod with a 4Gi memory request needs a node with 4Gi free, even if it only ever uses 512Mi. Audit your requests regularly.
CPU Throttling — Pod Running But Slow
What it means
The pod is running but its CPU usage is being throttled because it has hit its CPU limit. The application slows down without any crash or visible error. This is one of the hardest production issues to spot without metrics.
How to diagnose
# Check CPU usage vs limits
kubectl top pod <pod-name> -n <namespace>
# Look for throttling in metrics (if using Prometheus)
# Metric: container_cpu_cfs_throttled_seconds_total
# Describe the pod to see configured limits
kubectl get pod <pod-name> -o yaml | grep -A10 resources
Fix checklist
- Increase the CPU limit in the pod spec
- Or remove the CPU limit entirely and rely on requests only (acceptable for non-critical workloads)
- Investigate if the application has a genuine CPU spike (code issue) vs being under-provisioned
resources:
requests:
cpu: "250m"
limits:
cpu: "1000m" # Increase this if throttling is confirmed
Taints and Tolerations Blocking Scheduling
What it means
Nodes have taints applied that prevent pods from being scheduled unless the pod explicitly tolerates those taints. Common in clusters with dedicated node pools (GPU nodes, spot nodes, system nodes).
How to diagnose
# Check node taints
kubectl describe node <node-name> | grep Taints
# Or check all nodes at once
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
# Check if your pod has the required tolerations
kubectl get pod <pod-name> -o yaml | grep -A10 tolerations
Fix checklist
# Add the appropriate toleration to your pod spec
# Example: tolerate a spot instance taint
tolerations:
- key: "kubernetes.azure.com/scalesetpriority"
operator: "Equal"
value: "spot"
effect: "NoSchedule"
ResourceQuota Blocking Pod Creation
What it means
A namespace-level ResourceQuota has been hit. New pods cannot be created even if the cluster has capacity.
How to diagnose
kubectl describe resourcequota -n <namespace>
# Output shows used vs hard limits:
# Name: default-quota
# Resource Used Hard
# -------- --- ---
# limits.cpu 7800m 8000m
# limits.memory 14Gi 16Gi
# pods 48 50
Fix checklist
# Option 1: Increase the quota
kubectl edit resourcequota <quota-name> -n <namespace>
# Option 2: Delete unused pods to free up quota
kubectl get pods -n <namespace> | grep Completed
kubectl delete pod <completed-pod> -n <namespace>
# Option 3: Reduce limits on existing pods if over-provisioned
Quick Reference — Scheduling Failure Reasons
| Scheduler Message | Root Cause | Fix |
|---|---|---|
| Insufficient cpu | CPU requests exceed available | Lower requests or scale nodes |
| Insufficient memory | Memory requests exceed available | Lower requests or scale nodes |
| node(s) had taint | Pod missing toleration | Add toleration to pod spec |
| pod has unbound PVCs | PVC not bound | Fix PVC / StorageClass |
| node(s) didn’t match Pod’s node affinity | Affinity rules too strict | Relax affinity or label nodes |
| exceeded quota | ResourceQuota hit | Increase quota or delete idle pods |
HPA and VPA Misconfigurations
Horizontal Pod Autoscaler (HPA)
HPA scales pod count based on metrics — most commonly CPU utilization. When it does not scale as expected, the issue is almost always one of three things: metrics are not being collected, the target utilization is set incorrectly, or resource requests are missing (HPA uses requests as the baseline for utilization calculations).
# Check HPA status
kubectl get hpa -n <namespace>
# Describe HPA for detailed status
kubectl describe hpa <hpa-name> -n <namespace>
# Common output when HPA is stuck:
# Conditions:
# Type Status Reason
# AbleToScale True SucceededGetScale
# ScalingActive False FailedGetResourceMetric
# ScalingLimited False DesiredWithinRange
#
# ScalingActive: False with FailedGetResourceMetric means
# the metrics server cannot read CPU/memory for your pods
HPA not scaling up — fix checklist
# 1. Verify metrics-server is running
kubectl get pods -n kube-system | grep metrics-server
# 2. Check if metrics are available for your pods
kubectl top pods -n <namespace>
# If this command fails, metrics-server is the problem
# 3. Verify resource requests are set on the pod
# HPA calculates utilization as: actual usage / request
# Without requests, HPA cannot calculate a percentage
kubectl get pod <pod-name> -o yaml | grep -A5 resources
# 4. Check the HPA min/max replicas -- max may already be reached
kubectl describe hpa <hpa-name> -n <namespace> | grep -E "Min|Max|Current"
Common gotcha: HPA targets 50% CPU utilization but the pod has no CPU request set. HPA simply will not work. Every pod managed by HPA must have CPU requests defined.
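As a minimal sketch (names and values are illustrative), here is the pairing that makes CPU-based HPA work — the container sets a CPU request, and the HPA targets utilization relative to that request:

```yaml
# Hypothetical deployment fragment -- the CPU request is the HPA's baseline
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: myapp
        image: myregistry/myapp:1.0   # illustrative image
        resources:
          requests:
            cpu: "250m"               # required for CPU-based HPA
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50   # 50% of the 250m request, i.e. 125m actual usage
```

With this pair, "50% utilization" is well defined; remove the request and the HPA reports unknown targets.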
Vertical Pod Autoscaler (VPA)
VPA adjusts CPU and memory requests automatically based on actual usage. It is powerful but has important constraints in production.
# Check VPA status
kubectl get vpa -n <namespace>
kubectl describe vpa <vpa-name> -n <namespace>
# VPA recommendations appear under Status.Recommendation
# Lower bound: minimum safe request
# Target: recommended request
# Upper bound: maximum before VPA considers scaling up node
Warning: VPA in Auto mode will evict and restart pods to apply new resource values. Never run VPA in Auto mode alongside an HPA targeting CPU/memory on the same deployment; they conflict. Use Off mode to get recommendations only, then apply them manually.
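A recommendation-only VPA looks like this (the target name is illustrative). With updateMode set to "Off", VPA computes recommendations but never evicts pods:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: myapp-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp          # illustrative target
  updatePolicy:
    updateMode: "Off"    # recommend only; never evict or restart pods
```

Read the recommendations with kubectl describe vpa myapp-vpa and apply them to the deployment manifest yourself.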
Node Autoscaler Delays in Cloud Environments
In AKS, EKS, and GKE, when the cluster autoscaler needs to provision a new node, there is a provisioning delay — typically 2 to 5 minutes. During this window, pods sit in Pending state and your HPA may be trying to scale up but has nowhere to put the new pods.
# Check if Cluster Autoscaler is making decisions
kubectl get events -n kube-system | grep -i "scale\|autoscal"
# Check Cluster Autoscaler logs
kubectl logs -n kube-system -l app=cluster-autoscaler --tail=50
# Common log messages:
# "Scale up triggered" -- CA decided to add a node
# "Scale down triggered" -- CA decided to remove a node
# "Pod is unschedulable" -- Pod waiting for new node
# "Node group has reached maximum size" -- Cannot scale further
Fix: Use Pod Disruption Budgets and over-provisioning
To reduce the impact of cold-start delays, keep a small buffer of over-provisioned capacity using a low-priority placeholder deployment:
# Placeholder pods that get evicted when real workloads need the capacity
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-buffer
spec:
  replicas: 2
  selector:                 # selector and matching labels are required fields
    matchLabels:
      app: cluster-buffer
  template:
    metadata:
      labels:
        app: cluster-buffer
    spec:
      priorityClassName: low-priority-buffer
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9   # k8s.gcr.io is frozen; use registry.k8s.io
        resources:
          requests:
            cpu: "1"
            memory: "1Gi"
When real pods need scheduling, they evict the buffer pods, which triggers the autoscaler to add a node while existing workloads continue running.
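For the eviction to work, the low-priority-buffer class referenced above must actually exist and sit below every real workload's priority. A sketch of the matching PriorityClass:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority-buffer
value: -10                  # below any real workload, so buffer pods are preempted first
globalDefault: false
preemptionPolicy: Never     # buffer pods never preempt anything themselves
description: "Placeholder capacity evicted when real workloads need room"
```

Any pod without an explicit priorityClassName defaults to priority 0, which outranks the buffer.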
Pod Anti-Affinity Surprises
Anti-affinity rules prevent pods from being scheduled on the same node or zone as other pods. In small clusters, strict anti-affinity can make pods permanently unschedulable.
# Check anti-affinity rules
kubectl get pod <pod-name> -o yaml | grep -A20 affinity
# If pods are stuck in Pending with this message:
# 0/3 nodes are available: 3 node(s) didn't match pod anti-affinity rules
# You have more replicas than nodes that satisfy the anti-affinity constraint
Required vs Preferred anti-affinity
# REQUIRED -- pod will never schedule if constraint cannot be satisfied
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:   # hard rule
    - labelSelector:
        matchLabels:
          app: myapp
      topologyKey: kubernetes.io/hostname
# PREFERRED -- scheduler tries to satisfy but will proceed if not possible
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:  # soft rule
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: myapp
        topologyKey: kubernetes.io/hostname
Production rule of thumb: Use required anti-affinity only when you truly cannot tolerate two replicas on the same node (stateful workloads with local storage). Use preferred for everything else; it achieves the same distribution goal without making your deployment fragile in smaller clusters.
Section 5 — Debugging Storage and Persistent Volume Issues
Storage failures are disruptive. A pod stuck waiting for a volume cannot start, and if that pod is part of a StatefulSet, the entire set stalls. Storage issues also tend to have longer blast radius — a PVC stuck in Terminating can block namespace deletion for hours if you do not know what to look for.
The 5-Minute Storage Triage Checklist
# 1. Check PVC status
kubectl get pvc -n <namespace>
# 2. Check PV status
kubectl get pv
# 3. Check StorageClass
kubectl get storageclass
# 4. Describe the stuck PVC
kubectl describe pvc <pvc-name> -n <namespace>
# 5. Check events for volume errors
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | grep -i volume
PVC Stuck in Pending
What it means
A PersistentVolumeClaim has been created but no PersistentVolume has been bound to it. The pod waiting for this PVC cannot start.
Common causes
- No StorageClass is defined or the referenced StorageClass does not exist
- The StorageClass provisioner is not running
- No PV available that matches the PVC’s access mode and size requirements
- The requested storage size is larger than any available PV
How to diagnose
kubectl describe pvc <pvc-name> -n <namespace>
# Look for events like:
# no persistent volumes available for this claim and no storage class is set
# storageclass.storage.k8s.io "fast-ssd" not found
# waiting for a volume to be created
# Check StorageClass exists
kubectl get storageclass
kubectl describe storageclass <storageclass-name>
# Check if the provisioner pod is running
kubectl get pods -n kube-system | grep provisioner
Fix checklist
# If StorageClass is missing, create or reference the correct one
kubectl get storageclass # list available classes
# For Azure (AKS), common StorageClasses:
# managed-premium → Azure Premium SSD
# managed → Azure Standard SSD
# azurefile → Azure File Share
# Check if provisioner is healthy
kubectl logs -n kube-system <provisioner-pod>
# If using static provisioning, manually create a PV that matches the PVC
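For the static-provisioning path, the manually created PV must match the PVC's size, access mode, and storageClassName or binding will never happen. A sketch for an Azure managed disk (the PV name and disk URI are placeholders):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: manual-pv-01
spec:
  capacity:
    storage: 100Gi          # must be >= the PVC's requested size
  accessModes:
  - ReadWriteOnce           # must match the PVC's access mode
  storageClassName: managed-premium   # must match the PVC's storageClassName
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: disk.csi.azure.com
    volumeHandle: /subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Compute/disks/<disk-name>
```

Once applied, the Pending PVC should bind within seconds; if it does not, kubectl describe pvc will say which field failed to match.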
PVC Stuck in Terminating
What it means
A PVC has been deleted but is stuck in Terminating state. This commonly blocks namespace deletion.
How to diagnose
kubectl get pvc -n <namespace>
# STATUS: Terminating
kubectl describe pvc <pvc-name> -n <namespace>
# Look for finalizers:
# Finalizers: [kubernetes.io/pvc-protection]
What is happening: Kubernetes uses a finalizer kubernetes.io/pvc-protection to prevent PVC deletion while a pod is still using it. If the pod has been deleted but the PVC is still stuck, the finalizer needs to be removed manually.
Fix checklist
# Remove the finalizer to force deletion
kubectl patch pvc <pvc-name> -n <namespace> \
-p '{"metadata":{"finalizers":[]}}' \
--type=merge
# Verify deletion
kubectl get pvc -n <namespace>
Warning: Only remove finalizers when you are certain no pod is actively using the volume. Forcing deletion while a pod has an active mount can cause data corruption.
Volume Mount Failures
What it means
The PVC is bound but the pod cannot mount the volume. The pod stays in ContainerCreating state.
How to diagnose
kubectl describe pod <pod-name> -n <namespace>
# Look for Events like:
# Unable to attach or mount volumes
# timed out waiting for the condition
# Multi-Attach error for volume — volume is already exclusively attached to one node
kubectl get events -n <namespace> | grep -i "mount\|attach\|volume"
Multi-Attach Error
This is common with ReadWriteOnce volumes in cloud environments. The volume is still attached to a previous node (after a node failure or pod rescheduling) and cannot be attached to a new node simultaneously.
# Find which node the volume is still attached to (NODE column)
kubectl get volumeattachment | grep <pv-name>
# Force detach by deleting the stale VolumeAttachment object
kubectl delete volumeattachment <attachment-name>
Quick Reference — Storage States
| State | Meaning | First Action |
|---|---|---|
| PVC: Pending | No matching PV or provisioner issue | Check StorageClass and provisioner |
| PVC: Bound | Healthy — PV assigned | No action needed |
| PVC: Terminating | Stuck on finalizer | Remove finalizer if pod is gone |
| PV: Released | PV was used but PVC deleted | Reclaim or delete manually |
| PV: Failed | Provisioner error | Check provisioner logs |
| Pod: ContainerCreating | Volume not mounted | Check events for attach errors |
Section 6 — Debugging Control Plane Failures
Control plane failures are the most severe class of Kubernetes issues. When the API server, etcd, or scheduler becomes degraded, the entire cluster stops responding to changes. Existing workloads may continue running (pods do not die immediately when the API server goes down), but you lose the ability to deploy, scale, or recover from failures.
In managed Kubernetes environments like AKS, EKS, or GKE, you have limited direct access to control plane components. This section covers what you can diagnose from within the cluster and what to escalate to your cloud provider.
The 5-Minute Control Plane Triage Checklist
# 1. Can you reach the API server at all?
kubectl cluster-info
# 2. Check control plane component health
kubectl get componentstatuses  # deprecated since v1.19; output may be limited on newer clusters
# 3. Check system pods
kubectl get pods -n kube-system
# 4. Check API server response time
time kubectl get nodes
# 5. Check etcd health (self-managed clusters)
kubectl exec -n kube-system etcd-<node> -- etcdctl endpoint health \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
API Server Slow or Unresponsive
What it means
kubectl commands hang or time out. New deployments cannot be created. Existing pods continue running because kubelet operates independently, but the cluster cannot be managed.
How to diagnose
# Test API server response
kubectl get nodes --request-timeout=5s
# Check API server pod logs (self-managed)
kubectl logs -n kube-system kube-apiserver-<node>
# For AKS — check Azure portal for control plane health
# az aks show --resource-group <rg> --name <cluster> --query "provisioningState"
# Check for high API request rates
kubectl get --raw /metrics | grep apiserver_request_total
Common causes
- etcd is slow or overloaded (API server is backed by etcd)
- Too many LIST/WATCH requests flooding the API server (runaway controllers or operators)
- Certificates have expired
- Admission webhooks timing out and blocking all requests
Fix checklist
# Check for slow admission webhooks
kubectl get validatingwebhookconfigurations
kubectl get mutatingwebhookconfigurations
# A single failing webhook with no timeout set can block ALL API requests
# Set a failurePolicy and timeoutSeconds on all webhooks
# failurePolicy: Ignore ← safe default for non-critical webhooks
# timeoutSeconds: 5
# For AKS — if API server is unresponsive, open a support ticket immediately
# This is a cloud provider responsibility in managed clusters
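As a sketch of what those two settings look like in place (the webhook name, service, and rules are illustrative): failurePolicy: Ignore lets requests through when the webhook is down, and a short timeoutSeconds stops a slow webhook from hanging every API call.

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: example-policy-webhook      # illustrative name
webhooks:
- name: validate.example.com
  failurePolicy: Ignore             # admit the request if the webhook is unreachable
  timeoutSeconds: 5                 # never block API calls longer than 5s
  clientConfig:
    service:
      name: policy-webhook          # illustrative backing service
      namespace: webhooks
      path: /validate
  rules:
  - apiGroups: [""]
    apiVersions: ["v1"]
    operations: ["CREATE", "UPDATE"]
    resources: ["pods"]
  sideEffects: None
  admissionReviewVersions: ["v1"]
```

Reserve failurePolicy: Fail for webhooks that enforce security-critical policy; everything else should degrade gracefully.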
etcd Issues (Self-Managed Clusters)
What it means
etcd is the key-value store that holds all cluster state. If etcd is degraded, the API server cannot read or write cluster state. This is catastrophic.
How to diagnose
# Check etcd pod status
kubectl get pods -n kube-system | grep etcd
# Check etcd cluster health
etcdctl endpoint health --cluster
# Check etcd leader
etcdctl endpoint status --cluster -w table
# Check etcd disk latency (high disk I/O = slow etcd)
etcdctl check perf
Warning signs in etcd logs
took too long to execute
failed to send out heartbeat on time
server is likely overloaded
These messages indicate etcd is under I/O pressure. etcd is extremely sensitive to disk latency — it requires fast SSDs in production.
Fix checklist
# Check disk performance on etcd node
iostat -x 1 10
# etcd should be on dedicated fast SSD (not shared with other workloads)
# Recommended: <10ms disk write latency
# Compact and defragment etcd if it has grown large
etcdctl compact $(etcdctl endpoint status --write-out="json" | jq '.[0].Status.header.revision')
etcdctl defrag
# Take an etcd snapshot before any risky operations
etcdctl snapshot save /backup/etcd-snapshot.db
Scheduler and Controller Manager Failures
What it means
The scheduler is not placing new pods. The controller manager is not reconciling deployments, replicasets, or other controllers.
How to diagnose
# Check scheduler and controller manager pods
kubectl get pods -n kube-system | grep -E "scheduler|controller"
# Check logs
kubectl logs -n kube-system kube-scheduler-<node>
kubectl logs -n kube-system kube-controller-manager-<node>
# Verify scheduler is active (HA clusters have leader election)
kubectl get lease -n kube-system
Quick Reference — Control Plane Components
| Component | Role | What breaks without it |
|---|---|---|
| API Server | Central management endpoint | All kubectl commands fail |
| etcd | Cluster state storage | API server cannot read/write state |
| Scheduler | Pod placement decisions | New pods stay Pending forever |
| Controller Manager | Reconciles deployments, replicasets | Scaling and self-healing stop working |
| Cloud Controller | Cloud provider integration | Load balancers and node registration break |
Section 7 — Real Production Incident Walkthroughs
Theory is useful. Production incidents are where you actually learn. This section walks through three real-world Kubernetes incident patterns — the kind that appear at 2 AM — covering the full journey from initial alert to root cause and resolution.
Each walkthrough follows the same format: what the alert looked like, how the investigation progressed, where the actual root cause was found, and what the fix was.
Incident 1 — CrashLoopBackOff After a Deployment
The alert
PagerDuty fires at 11:45 PM. The payment service pod count has dropped from 5 to 0 healthy pods. All 5 pods are in CrashLoopBackOff. The deployment was pushed 12 minutes ago.
Initial triage
kubectl get pods -n payments
# NAME READY STATUS RESTARTS AGE
# payment-svc-7d9f6b-xk2p9 0/1 CrashLoopBackOff 4 8m
# payment-svc-7d9f6b-mn3q1 0/1 CrashLoopBackOff 4 8m
kubectl logs payment-svc-7d9f6b-xk2p9 --previous
# Error: failed to connect to database: connection refused
# dial tcp 10.0.1.45:5432: connect: connection refused
Investigation
The logs show a database connection failure. But the database has not changed — why would a new deployment break the DB connection?
kubectl describe pod payment-svc-7d9f6b-xk2p9 -n payments | grep -A20 Environment
# DB_HOST: postgres-svc
# DB_PORT: 5432
# DB_NAME: payments_prod
# DB_PASSWORD: <set to the key 'password' in secret 'db-credentials-v2'>
kubectl get secret db-credentials-v2 -n payments
# Error from server (NotFound): secrets "db-credentials-v2" not found
Root cause
The new deployment referenced db-credentials-v2 — a new secret that had been created in the infrastructure namespace but not in the payments namespace. The previous deployment used db-credentials-v1. The secret name was updated in the deployment manifest but the secret was never created in the correct namespace.
Fix
# Copy the secret to the correct namespace
kubectl get secret db-credentials-v2 -n infrastructure -o yaml | \
sed 's/namespace: infrastructure/namespace: payments/' | \
kubectl apply -f -
# Restart the deployment
kubectl rollout restart deployment/payment-svc -n payments
# Verify
kubectl get pods -n payments
# All 5 pods Running within 45 seconds
Time to resolution: 18 minutes. Lesson: Secret references must be validated as part of the deployment pipeline before rollout. Add a pre-deploy check that verifies all referenced Secrets and ConfigMaps exist in the target namespace.
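One way to implement that pre-deploy check (a sketch; the manifest path, the awk-based extraction, and the payments namespace are assumptions about your pipeline): pull every secretKeyRef name out of the rendered manifest, then verify each with kubectl before rollout.

```shell
# Sample rendered manifest (stand-in for your pipeline's real output)
cat > /tmp/deploy.yaml <<'EOF'
        env:
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: db-credentials-v2
              key: password
EOF

# Extract the "name:" line that follows each secretKeyRef: key
refs=$(awk '/secretKeyRef:/{grab=1;next} grab && /name:/{print $2; grab=0}' /tmp/deploy.yaml | sort -u)
echo "$refs"

# In the pipeline, fail the deploy if any referenced secret is missing:
# for s in $refs; do
#   kubectl get secret "$s" -n payments >/dev/null 2>&1 || { echo "missing secret: $s"; exit 1; }
# done
```

Run this as a CI step after manifest rendering and before kubectl apply; it would have caught Incident 1 in seconds.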
Incident 2 — Node NotReady Causing Cluster-Wide Scheduling Pressure
The alert
Monday morning. Prometheus alert fires: two nodes in the production cluster have entered NotReady state within 5 minutes of each other. Pod scheduling is failing across 12 namespaces. Restart counts are spiking on multiple services.
Initial triage
kubectl get nodes
# NAME STATUS ROLES AGE
# aks-nodepool-001 NotReady agent 45d
# aks-nodepool-002 NotReady agent 45d
# aks-nodepool-003 Ready agent 45d
kubectl describe node aks-nodepool-001 | grep -A5 Conditions
# MemoryPressure: False
# DiskPressure: True
# Ready: False
kubectl describe node aks-nodepool-002 | grep -A5 Conditions
# DiskPressure: True
# Ready: False
Investigation
Both nodes show DiskPressure. Two nodes hitting disk pressure simultaneously points to a shared root cause, not random failure.
# SSH into node 1
df -h
# /dev/sda1 99% /
du -sh /var/lib/containerd/*
# 47G /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs
# Check what images are stored
crictl images | wc -l
# 312 images
Root cause
A nightly batch job had been pulling a new 8GB ML model container image each run but never cleaning up previous versions. Over 3 weeks, 312 images had accumulated on each node, filling the disk. Both nodes hit the threshold at roughly the same time because both had been running since the same cluster upgrade.
Fix
# Immediate: clean up images on affected nodes
crictl rmi --prune
# Freed 43GB on each node
# Uncordon the nodes
kubectl uncordon aks-nodepool-001
kubectl uncordon aks-nodepool-002
# Long-term fix: add image garbage collection policy to kubelet config
# /var/lib/kubelet/config.yaml
imageGCHighThresholdPercent: 75
imageGCLowThresholdPercent: 60
Time to resolution: 31 minutes. Lesson: Image garbage collection thresholds should be set explicitly in kubelet config. The default thresholds (85% high, 80% low) are too conservative for nodes running batch jobs with large images.
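In full, that fragment belongs in a KubeletConfiguration object (on managed clusters like AKS/EKS/GKE, the same knobs are usually exposed through the node pool's kubelet configuration options rather than edited on the node):

```yaml
# /var/lib/kubelet/config.yaml (self-managed nodes)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
imageGCHighThresholdPercent: 75   # start image GC when disk usage crosses 75%
imageGCLowThresholdPercent: 60    # keep deleting images until usage drops to 60%
```

A kubelet restart is required for the change to take effect on self-managed nodes.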
Incident 3 — Intermittent 503 Errors from Ingress
The alert
Customer support reports intermittent 503 errors on the public API. The errors appear for 5–10 seconds and then resolve. This has been happening every 15–20 minutes for the past two hours. No deployment has occurred in the last 6 hours.
Initial triage
kubectl get pods -n api-gateway
# All pods Running — nothing obviously wrong
kubectl get ingress -n api-gateway
# Ingress exists, host and rules look correct
# Check Ingress controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx --tail=100
# Spotting repeated entries:
# upstream timed out (110: Connection timed out) while reading response header
# connect() failed (111: Connection refused) while connecting to upstream
Investigation
The errors are upstream timeouts — the Ingress controller can reach the pods but the pods are not responding in time. The intermittent pattern (every 15–20 minutes) suggests something is triggering pod restarts or temporary unavailability on a schedule.
kubectl get pods -n api-gateway
# NAME READY RESTARTS AGE
# api-svc-6b8d9f-xp2k1 1/1 Running 8m ← recently restarted
# api-svc-6b8d9f-mn7q2 1/1 Running 6m ← recently restarted
kubectl describe pod api-svc-6b8d9f-xp2k1 -n api-gateway
# Liveness probe failed: HTTP probe failed with statuscode: 503
# Killing container with id ...: pod "api-svc..." container "api-svc" is unhealthy
# Check what the liveness probe is hitting
kubectl get pod api-svc-6b8d9f-xp2k1 -o yaml | grep -A15 livenessProbe
# path: /health
# periodSeconds: 10
# failureThreshold: 3
# initialDelaySeconds: 5
Root cause
The /health endpoint was performing a database health check on every probe call. Every 15–20 minutes, the database connection pool was briefly exhausted during a scheduled reporting job, causing the /health endpoint to return 503. After 3 consecutive failures (30 seconds), Kubernetes killed the pod and restarted it — during which time live traffic hit the other pods, which were also failing their probes.
Fix
# Short-term: Separate liveness from readiness probe
# Liveness → simple check (is the process alive?)
# Readiness → deep check (is the service ready for traffic?)
livenessProbe:
  httpGet:
    path: /ping     # returns 200 immediately, no DB check
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 20
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /health   # full DB check — removes pod from LB if failing
    port: 8080
  periodSeconds: 10
  failureThreshold: 2
Time to resolution: 47 minutes. Lesson: Liveness probes should never perform deep dependency checks. A liveness probe that calls a database can cause cascading pod restarts during any database hiccup. Use /ping or /alive for liveness and reserve deep checks for readiness probes.
Incident 4 — HPA Not Scaling During Traffic Spike
The alert
At 3:15 PM on a Friday, the e-commerce platform starts returning slow responses. API latency climbs from 120ms to 4,200ms. The SRE on call checks the dashboard and sees CPU at 94% on all pods. The HPA should have kicked in 20 minutes ago.
Initial triage
kubectl get hpa -n ecommerce
# NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS
# api-hpa Deployment/api-svc <unknown>/50% 3 20 3
# TARGETS shows <unknown> -- HPA cannot read metrics
Investigation
kubectl describe hpa api-hpa -n ecommerce
# Conditions:
# ScalingActive: False
# Reason: FailedGetResourceMetric
# Message: unable to get metrics for resource cpu: unable to fetch metrics
# from resource metrics API: the server is currently unable to handle the request
# Check metrics-server
kubectl get pods -n kube-system | grep metrics-server
# metrics-server-7d9c8b-xp2k 0/1 CrashLoopBackOff 8 35m
kubectl logs -n kube-system metrics-server-7d9c8b-xp2k --previous
# E0308 15:01:22 Failed to scrape node "aks-nodepool-003"
# x509: certificate signed by unknown authority
Root cause
The metrics-server was failing its TLS verification against the kubelet API. A cluster certificate rotation had been performed two days earlier, but the metrics-server deployment had not been updated with the --kubelet-insecure-tls flag (acceptable in managed AKS) or the new CA certificate. The metrics-server had been in CrashLoopBackOff for two days — but because existing pods were handling the load, nobody noticed until traffic spiked.
Fix
# Immediate: patch metrics-server to skip TLS verification (AKS managed clusters)
kubectl patch deployment metrics-server -n kube-system --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"}]'
# Verify metrics-server recovers
kubectl get pods -n kube-system | grep metrics-server
# Running after ~60 seconds
# HPA immediately starts reading metrics
kubectl get hpa -n ecommerce
# TARGETS: 94%/50% -- HPA begins scaling up within 30 seconds
# Manually scale to recover faster while HPA catches up
kubectl scale deployment api-svc --replicas=12 -n ecommerce
Time to resolution: 22 minutes. Lesson: Monitor metrics-server health as a first-class concern. An HPA that cannot read metrics provides zero protection during traffic spikes. Add a Prometheus alert on metrics-server pod restarts and kube_horizontalpodautoscaler_status_condition for ScalingActive=False.
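Assuming Prometheus Operator and kube-state-metrics are installed, the alerts that lesson describes could be sketched as a PrometheusRule (names, thresholds, and durations are illustrative):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: metrics-server-health
spec:
  groups:
  - name: autoscaling
    rules:
    # metrics-server restarting repeatedly -- HPA is flying blind
    - alert: MetricsServerRestarting
      expr: increase(kube_pod_container_status_restarts_total{namespace="kube-system",pod=~"metrics-server.*"}[15m]) > 2
      for: 5m
      labels:
        severity: critical
    # HPA has stopped actively scaling (e.g. FailedGetResourceMetric)
    - alert: HPAScalingInactive
      expr: kube_horizontalpodautoscaler_status_condition{condition="ScalingActive",status="false"} == 1
      for: 10m
      labels:
        severity: critical
```

Either alert firing means autoscaling protection is degraded even though all application pods look healthy.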
Incident 5 — StatefulSet Pods Stuck After Node Replacement
The alert
During a planned AKS node pool upgrade (rolling node replacement), 3 out of 5 pods in a Kafka StatefulSet enter Pending state and do not recover after 25 minutes. The node pool upgrade appears to have completed successfully.
Initial triage
kubectl get pods -n messaging -l app=kafka
# NAME READY STATUS RESTARTS
# kafka-0 1/1 Running 0
# kafka-1 1/1 Running 0
# kafka-2 0/1 Pending 0
# kafka-3 0/1 Pending 0
# kafka-4 0/1 Pending 0
kubectl describe pod kafka-2 -n messaging
# Events:
# Warning FailedScheduling pod has unbound immediate PersistentVolumeClaims
Investigation
kubectl get pvc -n messaging
# NAME STATUS VOLUME CAPACITY ACCESS MODES
# data-kafka-0 Bound ... 100Gi RWO
# data-kafka-1 Bound ... 100Gi RWO
# data-kafka-2 Pending <none> 100Gi RWO
# data-kafka-3 Pending <none> 100Gi RWO
# data-kafka-4 Pending <none> 100Gi RWO
kubectl describe pvc data-kafka-2 -n messaging
# Events:
# Warning ProvisioningFailed: storageclass.storage.k8s.io "premium-zrs" not found
Root cause
The old node pool used a StorageClass called premium-zrs (zone-redundant storage). During the node pool upgrade, the team had switched to a new node pool in a different availability zone configuration. The premium-zrs StorageClass had been removed as part of a cleanup task three weeks earlier — but because the existing PVCs were already bound, nobody noticed. When the StatefulSet pods were evicted and rescheduled during the upgrade, they attempted to provision new PVCs using the deleted StorageClass.
Fix
# Recreate the StorageClass
kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: premium-zrs
provisioner: disk.csi.azure.com
parameters:
  skuName: Premium_ZRS
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
EOF
# Trigger PVC reconciliation by deleting and recreating stuck PVCs
# WARNING: Only safe if the underlying Azure disk still exists
kubectl delete pvc data-kafka-2 data-kafka-3 data-kafka-4 -n messaging
# StatefulSet controller automatically recreates PVCs
# Pods recover as PVCs bind to new or existing volumes
kubectl get pods -n messaging -w
Time to resolution: 41 minutes. Lesson: StorageClass deletion is irreversible and silently breaks StatefulSets that rely on it for new volume provisioning. Before deleting any StorageClass, audit all PVCs and StatefulSets that reference it. Add a pre-deletion check to your runbooks: kubectl get pvc --all-namespaces -o json | jq '.items[] | select(.spec.storageClassName=="<class>") | .metadata.name'
Summary — Incident Patterns and Prevention
| Incident | Root Cause | Prevention |
|---|---|---|
| CrashLoopBackOff after deploy | Secret missing in target namespace | Pre-deploy secret validation in CI pipeline |
| Node DiskPressure | Accumulated container images | Set kubelet image GC thresholds explicitly |
| Intermittent 503 from Ingress | Liveness probe doing DB health check | Separate liveness and readiness probe paths |
| HPA not scaling during spike | metrics-server in CrashLoopBackOff | Alert on metrics-server health, not just HPA |
| StatefulSet stuck after node replacement | StorageClass deleted mid-lifecycle | Audit StorageClass references before deletion |
Appendix — Production Kubernetes Debugging Cheatsheet
Use this section as your rapid-reference during incidents. Every command you need, organized by failure type.
Pod Failures
# Get pod status across all namespaces
kubectl get pods --all-namespaces | grep -v Running
# Describe pod (events, state, conditions)
kubectl describe pod <pod-name> -n <namespace>
# Logs from running container
kubectl logs <pod-name> -n <namespace>
# Logs from previously crashed container
kubectl logs <pod-name> -n <namespace> --previous
# Follow logs in real time
kubectl logs -f <pod-name> -n <namespace>
# Multi-pod log tailing with stern
stern <deployment-name> -n <namespace>
# Check exit code of last container run
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
# Get all events sorted by time
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
Node Issues
# List all nodes with status
kubectl get nodes
# Describe node (conditions, allocated resources, events)
kubectl describe node <node-name>
# Check resource usage per node
kubectl top nodes
# Check all system pods on a specific node
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node-name>
# Cordon node (stop new scheduling)
kubectl cordon <node-name>
# Drain node (evict all pods safely)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# Uncordon node (return to schedulable)
kubectl uncordon <node-name>
# Check kubelet status on node (run via SSH)
systemctl status kubelet
journalctl -u kubelet -n 100 --no-pager
# Check disk usage on node (run via SSH)
df -h
du -sh /var/lib/containerd/*
# Clean up unused images on node (run via SSH)
crictl rmi --prune
Networking and DNS
# Test pod-to-pod connectivity
kubectl exec -it <source-pod> -n <namespace> -- curl http://<target-pod-ip>:<port>
# Test Service reachability by ClusterIP
kubectl exec -it <pod> -n <namespace> -- curl http://<cluster-ip>:<port>
# Test DNS resolution
kubectl exec -it <pod> -n <namespace> -- nslookup kubernetes.default
kubectl exec -it <pod> -n <namespace> -- nslookup <service>.<namespace>.svc.cluster.local
# Check Service endpoints
kubectl get endpoints <service-name> -n <namespace>
# Check Service selector vs pod labels
kubectl get svc <service-name> -o yaml | grep selector -A5
kubectl get pods --show-labels -n <namespace>
# Check CoreDNS pods and logs
kubectl get pods -n kube-system | grep coredns
kubectl logs -n kube-system -l k8s-app=kube-dns
# Check NetworkPolicies affecting a namespace
kubectl get networkpolicy -n <namespace>
# Check Ingress and its address
kubectl get ingress -n <namespace>
kubectl describe ingress <ingress-name> -n <namespace>
# Check TLS certificate expiry
kubectl get secret <tls-secret> -n <namespace> -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -dates
Resource and Scheduling
# Why is pod Pending? (always check Events section)
kubectl describe pod <pod-name> -n <namespace>
# Check node allocated resources
kubectl describe nodes | grep -A8 "Allocated resources"
# Top resource consumers
kubectl top pods --all-namespaces --sort-by=cpu
kubectl top pods --all-namespaces --sort-by=memory
# Check HPA status
kubectl get hpa -n <namespace>
kubectl describe hpa <hpa-name> -n <namespace>
# Check ResourceQuota
kubectl describe resourcequota -n <namespace>
# Check node taints
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
# Check metrics-server health
kubectl get pods -n kube-system | grep metrics-server
kubectl top nodes # if this fails, metrics-server is down
Storage
# Check PVC status
kubectl get pvc --all-namespaces
# Describe stuck PVC
kubectl describe pvc <pvc-name> -n <namespace>
# Check PV status
kubectl get pv
# Check StorageClass
kubectl get storageclass
# Find all PVCs using a specific StorageClass
kubectl get pvc --all-namespaces -o json | jq '.items[] | select(.spec.storageClassName=="<class>") | .metadata.name'
# Remove stuck PVC finalizer (force delete)
kubectl patch pvc <pvc-name> -n <namespace> -p '{"metadata":{"finalizers":[]}}' --type=merge
# Check VolumeAttachments (for multi-attach errors)
kubectl get volumeattachment
kubectl delete volumeattachment <attachment-name>
Control Plane
# Check API server connectivity
kubectl cluster-info
# Check component statuses
kubectl get componentstatuses  # deprecated; output may be limited on newer clusters
# Check system pods
kubectl get pods -n kube-system
# Check API server response time
time kubectl get nodes
# Check Cluster Autoscaler decisions
kubectl get events -n kube-system | grep -i autoscal
kubectl logs -n kube-system -l app=cluster-autoscaler --tail=50
# Check admission webhooks (can block all API requests if timing out)
kubectl get validatingwebhookconfigurations
kubectl get mutatingwebhookconfigurations
# etcd health check (self-managed clusters)
etcdctl endpoint health --cluster
etcdctl endpoint status --cluster -w table
Pod Exit Code Quick Reference
| Exit Code | Meaning | First Check |
|---|---|---|
| 0 | Clean exit | Liveness/readiness probe config |
| 1 | Application error | kubectl logs --previous |
| 137 | OOMKilled | Increase memory limit |
| 139 | Segmentation fault | Application or library bug |
| 143 | SIGTERM received | Check preStop hooks |
| 127 | Command not found in image | Check image CMD/ENTRYPOINT |
Node Condition Quick Reference
| Condition | Status | Meaning | First Action |
|---|---|---|---|
| Ready | True | Node healthy | No action |
| Ready | False | Kubelet reporting failure | systemctl status kubelet |
| Ready | Unknown | Node unreachable (40s timeout) | Check node connectivity |
| MemoryPressure | True | Low on memory | kubectl top node, find memory hog |
| DiskPressure | True | Low on disk | df -h on node, crictl rmi --prune |
| PIDPressure | True | Too many processes | ps aux on node |
| NetworkUnavailable | True | CNI not configured | Check CNI plugin pods |
Debugging Flow — Production Outage Decision Tree
Alert fires
|
v
kubectl get pods --all-namespaces | grep -v Running
|
+-- Pods failing? ---------> kubectl describe pod / kubectl logs --previous
| Check exit code, check events, check secrets/configmaps
|
+-- Nodes NotReady? -------> kubectl describe node
| Check: MemoryPressure / DiskPressure / kubelet status
|
+-- Pods Pending? ---------> kubectl describe pod (read Events)
| Insufficient CPU/memory? Taints? PVC unbound?
|
+-- Network timeouts? -----> kubectl get endpoints <svc>
| Selector match? CoreDNS healthy? NetworkPolicy?
|
+-- All pods Running but
app is slow/erroring? -> kubectl top pods (CPU throttling?)
kubectl describe hpa (scaling blocked?)
kubectl logs (application-level errors?)
Return to the Kubernetes Guide for the full topic cluster.
Explore related labs: Kubernetes Troubleshooting Labs
Closing — Your Debugging Mindset
Every Kubernetes incident you work through teaches you something the next one will test. The engineers who debug fastest are not the ones who have memorised the most commands — they are the ones who have a systematic approach and follow it consistently under pressure.
The framework is always the same:
- Observe — what is the symptom? What layer is failing?
- Isolate — narrow the problem to one component
- Diagnose — gather evidence before making changes
- Fix — apply the smallest change that resolves the issue
- Verify — confirm the fix and check for side effects
- Document — write the postmortem so the team learns
Use this handbook as your starting point. Over time, your own production incidents will fill in the gaps that no handbook can cover.