
Introduction: Why Kubernetes Debugging Is a Different Beast

Kubernetes is powerful. It is also one of the most complex systems a DevOps engineer will ever operate in production.

When something breaks — and it will break — the failure is rarely obvious. A pod disappears. A service stops responding. A node silently goes NotReady at 2 AM. The symptom you see on the surface is almost never where the actual problem lives.

Unlike debugging a single server, Kubernetes failures span multiple layers. A single application outage can involve container runtime issues, scheduler decisions, network policy conflicts, resource exhaustion, and control plane delays. Each layer has its own signals, tools, and failure modes.

This is what makes Kubernetes debugging hard. Not the complexity of any single component, but the sheer number of components that interact — and the fact that in production, you are debugging under pressure, with real users affected and stakeholders watching.

This handbook exists to give you a systematic approach, linking concepts across pod, node, networking, storage, and control plane troubleshooting.

After years of managing AKS clusters running 500+ cores for Fortune 500 clients, the most important lesson I can pass on is this: a solid production debugging workflow is not about memorizing error messages. It is about knowing where to look, in what order, and what each signal means. That requires observability — logs, events, metrics, and state — working together.


What You Will Learn

This handbook covers the major pod-level failures and cluster-level issues you will encounter in real production environments:

  • Pod failures — CrashLoopBackOff, ImagePullBackOff, OOMKilled, Pending pods
  • Node issues — NotReady nodes, disk pressure, memory pressure
  • Networking and DNS — services unreachable, DNS failures, NetworkPolicy blocks
  • Resource and scheduling — misconfigured limits, unschedulable pods
  • Storage failures — PVCs stuck in Pending, volume mount errors
  • Control plane issues — etcd problems, API server degradation
  • Real incident walkthroughs — symptom to resolution

Who This Is For

DevOps engineers, SREs, and platform engineers operating Kubernetes in production. It assumes you know what a pod is — not that you have seen every failure mode.

Tools You Will Need

  • kubectl — your primary interface for everything
  • stern — multi-pod log tailing across namespaces
  • k9s — terminal UI for real-time cluster navigation
  • kubectx / kubens — fast context and namespace switching
  • Prometheus or Azure Monitor — metrics and alerting

Tip: If you do not have stern and k9s installed, start there. They cut debugging time significantly compared to running repeated kubectl logs commands manually.


How This Handbook Is Structured

Each section follows the same pattern: what the failure means, how to diagnose it step by step, and a concrete fix checklist.


This handbook is a living reference. Use it alongside the Kubernetes Guide for full topic coverage.


Section 1 — Debugging CrashLoopBackOff and Pod Failures

Pod failures are the most common class of Kubernetes issues you will encounter in production. The challenge is that the same symptom — a pod not running — can have a dozen different root causes. This section gives you a systematic approach to diagnose and resolve each one.


The 5-Minute Pod Triage Checklist

Before diving into specific failure types, always start with these four commands. They give you 80% of the information you need in under five minutes.

# 1. What is the pod status?
kubectl get pods -n <namespace>

# 2. What events have occurred?
kubectl describe pod <pod-name> -n <namespace>

# 3. What do the logs say?
kubectl logs <pod-name> -n <namespace>

# 4. If the pod has already crashed, check previous logs
kubectl logs <pod-name> -n <namespace> --previous

Run these in order, every time. Do not skip steps. Events in describe often reveal the root cause before you even check logs.


CrashLoopBackOff

What it means

CrashLoopBackOff is not an error itself — it is Kubernetes telling you that your container keeps starting and immediately crashing. Kubernetes applies an exponential backoff between each restart attempt (10s, 20s, 40s, up to 5 minutes), which is where the name comes from.
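The exact timings are a kubelet implementation detail, but the doubling-with-a-cap behaviour can be sketched in plain shell:

```shell
# Sketch of the CrashLoopBackOff delay pattern: roughly doubles
# after each crash, capped at 300 seconds (5 minutes)
delay=10
for attempt in 1 2 3 4 5 6; do
  echo "restart $attempt: next backoff ${delay}s"
  delay=$(( delay * 2 ))
  if [ "$delay" -gt 300 ]; then delay=300; fi
done
```

This is why a crash-looping pod can sit "doing nothing" for minutes at a time: the kubelet is deliberately waiting before the next restart attempt.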

Common causes

  • Application error on startup (misconfigured environment variable, missing config file)
  • Failed database or dependency connection on boot
  • Incorrect entrypoint or command in the container image
  • Liveness probe failing immediately after container starts
  • OOMKilled on startup (not enough memory for the process to initialise)

How to diagnose

# Check restart count and last state
kubectl describe pod <pod-name> -n <namespace>

# Look for Exit Code in Last State section
# Exit Code 1   → application error
# Exit Code 137 → killed by SIGKILL (usually OOMKilled)
# Exit Code 143 → SIGTERM (graceful shutdown signal)
# Exit Code 127 → command or entrypoint not found

# Get logs from the crashed container
kubectl logs <pod-name> -n <namespace> --previous

What to look for in describe output

Last State:     Terminated
  Reason:       Error
  Exit Code:    1
  Started:      Sun, 08 Mar 2026 01:00:00 +0000
  Finished:     Sun, 08 Mar 2026 01:00:02 +0000

A container that exits within 2 seconds almost always means the application failed to start — check your startup logs and environment variables first.

Fix checklist

  • Verify all required environment variables are set and correct
  • Check if referenced ConfigMaps or Secrets exist in the same namespace
  • Confirm the container image entrypoint is correct
  • Review liveness probe configuration — add initialDelaySeconds if the app needs time to boot
  • Check resource limits — if memory limit is too low, increase it

ImagePullBackOff / ErrImagePull

What it means

Kubernetes cannot pull the container image from the registry. ErrImagePull is the first attempt failure. ImagePullBackOff is Kubernetes backing off after repeated failures.

Common causes

  • Image name or tag is incorrect
  • Image does not exist in the registry
  • Private registry credentials missing or expired
  • Network connectivity issue to the registry
  • Rate limiting from Docker Hub

How to diagnose

kubectl describe pod <pod-name> -n <namespace>

# Look for Events section:
# Failed to pull image "myrepo/myapp:v1.2": ...
# unauthorized: authentication required
# not found

Fix checklist

# Verify the image exists
docker pull <image-name>:<tag>

# Check if imagePullSecret is configured
kubectl get pod <pod-name> -o yaml | grep imagePullSecret

# Check the secret exists
kubectl get secret <secret-name> -n <namespace>

# Re-create the pull secret if needed
kubectl create secret docker-registry regcred \
  --docker-server=<registry-url> \
  --docker-username=<username> \
  --docker-password=<password> \
  -n <namespace>

OOMKilled

What it means

The Linux kernel killed your container because it exceeded its memory limit. Exit code will be 137.

How to diagnose

kubectl describe pod <pod-name> -n <namespace>

# Look for:
# Last State: Terminated
#   Reason: OOMKilled
#   Exit Code: 137

Fix checklist

  • Increase the memory limit in your pod spec
  • Profile the application to understand actual memory usage
  • Check for memory leaks in the application
  • If using Java, set -Xmx heap size below your container memory limit

resources:
  requests:
    memory: "256Mi"
  limits:
    memory: "512Mi"

Important: always set requests and limits together. A container without a memory limit can consume all node memory and cause node-level failures affecting other workloads.


Pending Pods

What it means

The pod has been accepted by the API server but the scheduler cannot place it on any node.

Common causes

  • Insufficient CPU or memory across all nodes
  • Node selector or affinity rules that no node satisfies
  • Taints on nodes that the pod does not tolerate
  • PersistentVolumeClaim not bound

How to diagnose

kubectl describe pod <pod-name> -n <namespace>

# Events section will tell you exactly why:
# 0/5 nodes are available: 5 Insufficient memory
# 0/5 nodes are available: 5 node(s) had taint {key:value} that the pod did not tolerate
# 0/5 nodes are available: 5 pod has unbound immediate PersistentVolumeClaims

Fix checklist

# Check node capacity
kubectl describe nodes | grep -A5 "Allocated resources"

# Check node taints
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints

# Check PVC status
kubectl get pvc -n <namespace>

Quick Reference — Pod Exit Codes

Exit Code   Meaning               First place to check
0           Clean exit            Check liveness/readiness probe
1           Application error     Application logs
137         OOMKilled (SIGKILL)   Increase memory limit
139         Segfault (SIGSEGV)    Application or library bug
143         SIGTERM received      Check preStop hooks
127         Command not found     Check image CMD/ENTRYPOINT
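The signal-derived codes in this table follow the Unix convention of 128 plus the signal number, which you can verify in any shell:

```shell
# A process killed by a signal exits with 128 + the signal number,
# which is where 137, 139 and 143 come from
sigkill=$((128 + 9))    # SIGKILL, sent by the kernel OOM killer
sigsegv=$((128 + 11))   # SIGSEGV, segmentation fault
sigterm=$((128 + 15))   # SIGTERM, graceful shutdown request
echo "SIGKILL=$sigkill SIGSEGV=$sigsegv SIGTERM=$sigterm"
```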



Section 2 — Debugging Node NotReady Issues

Node failures are more severe than pod failures. When a node goes NotReady, every workload running on it is at risk. In a production cluster, a single NotReady node can trigger a cascade — pods get evicted, rescheduled onto already-strained nodes, and suddenly you have a cluster-wide resource pressure event instead of a single node issue.

The key is catching it early and diagnosing the root cause before it spreads.


The 5-Minute Node Triage Checklist

# 1. Check which nodes are affected
kubectl get nodes

# 2. Get detailed status and conditions
kubectl describe node <node-name>

# 3. Check system-level pods on the node (kubelet, kube-proxy)
kubectl get pods -n kube-system -o wide | grep <node-name>

# 4. Check node resource usage
kubectl top node <node-name>

# 5. SSH into the node and check kubelet status
systemctl status kubelet
journalctl -u kubelet -n 100 --no-pager

Run these in order before doing anything else. The describe node output will almost always tell you which condition is failing.


Understanding Node Conditions

When you run kubectl describe node, look for the Conditions section. Each condition tells you a specific story:

Condition            Status    Meaning
Ready                True      Node is healthy
Ready                False     Node is NotReady — kubelet has reported a problem
Ready                Unknown   Node controller lost contact with the node
MemoryPressure       True      Node is low on memory
DiskPressure         True      Node is low on disk space
PIDPressure          True      Too many processes running on the node
NetworkUnavailable   True      Node network is not configured correctly

Warning: A Ready: Unknown status means the node controller has not received a heartbeat from the kubelet in the last 40 seconds (default node-monitor-grace-period). This usually means the node is completely unreachable — network failure, VM crash, or kubelet process died.


Node NotReady — Kubelet Failure

What it means

The kubelet process on the node has stopped reporting to the control plane. This is the most common cause of NotReady.

How to diagnose

# SSH into the affected node
ssh <node-ip>

# Check kubelet service status
systemctl status kubelet

# View kubelet logs
journalctl -u kubelet -n 200 --no-pager

# Common log patterns to look for:
# "failed to get node info"
# "PLEG is not healthy"
# "container runtime is down"
# "failed to load kubeconfig"

PLEG (Pod Lifecycle Event Generator) errors are particularly important. If you see PLEG is not healthy, it means the kubelet cannot communicate with the container runtime (containerd or Docker). This often points to a container runtime crash.

# Check container runtime status
systemctl status containerd
# or
systemctl status docker

# Restart container runtime if crashed
systemctl restart containerd

Fix checklist

# Restart kubelet
systemctl restart kubelet

# If kubelet fails to start, check config
cat /etc/kubernetes/kubelet.conf
cat /var/lib/kubelet/config.yaml

# Check certificates — expired certs cause kubelet to fail
openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -dates

Node NotReady — Memory Pressure

What it means

The node is running out of available memory. Kubernetes will stop scheduling new pods onto this node and may begin evicting existing pods.

How to diagnose

kubectl describe node <node-name> | grep -A5 "Conditions"
# Look for: MemoryPressure: True

# Check actual memory usage on the node
kubectl top node <node-name>

# SSH into node and check memory
free -h
cat /proc/meminfo

Common causes in production

  • A pod without memory limits is consuming all available node memory
  • A memory leak in a long-running application
  • Too many pods scheduled onto a single node

Fix checklist

# Find the top memory-consuming pods (cluster-wide; kubectl top cannot filter by node)
kubectl top pods --all-namespaces --sort-by=memory | head -20

# Identify pods without memory limits
kubectl get pods --all-namespaces -o json | \
  jq '.items[] | select(.spec.containers[].resources.limits.memory == null) | .metadata.name'

# Cordon the node to stop new scheduling while you investigate
kubectl cordon <node-name>

Tip: Always set memory requests and limits on every container in production. A single pod without limits can bring down an entire node.
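One way to enforce that tip at the namespace level is a LimitRange, which injects default requests and limits into any container that omits them. A minimal sketch (the name, namespace, and values are illustrative):

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-memory-limits
  namespace: production          # illustrative namespace
spec:
  limits:
  - type: Container
    defaultRequest:              # applied when a container sets no request
      memory: "256Mi"
    default:                     # applied when a container sets no limit
      memory: "512Mi"
```

With this in place, a pod spec that forgets resources still lands on the node with a bounded memory footprint.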


Node NotReady — Disk Pressure

What it means

The node is running low on disk space. Kubernetes will evict pods and stop scheduling new workloads onto the node.

How to diagnose

kubectl describe node <node-name> | grep DiskPressure
# DiskPressure: True

# SSH into the node
ssh <node-ip>

# Check disk usage
df -h

# Find what is consuming disk space
du -sh /var/lib/docker/*    # if using Docker
du -sh /var/lib/containerd/* # if using containerd
du -sh /var/log/*

Most common culprits

  • Unused container images accumulating on the node
  • Container logs growing without rotation
  • Large volumes mounted at unusual paths
  • Core dumps from crashing processes

Fix checklist

# Clean up unused images (safe to run on any node)
crictl rmi --prune
# or for Docker nodes:
docker image prune -af

# Clean up stopped containers
crictl rm $(crictl ps -a -q)

# Check and configure log rotation
cat /etc/docker/daemon.json
# Add: "log-opts": {"max-size": "100m", "max-file": "3"}

# For containerd, check log rotation in kubelet config

Node NotReady — Network Unavailable

What it means

The CNI (Container Network Interface) plugin has not configured networking correctly on the node. New pods cannot get IP addresses.

How to diagnose

kubectl describe node <node-name> | grep NetworkUnavailable
# NetworkUnavailable: True

# Check CNI plugin pods
kubectl get pods -n kube-system | grep -E "calico|flannel|cilium|weave"

# Check CNI logs
kubectl logs -n kube-system <cni-pod-name>

# On the node, check CNI config
ls /etc/cni/net.d/
cat /etc/cni/net.d/<config-file>

Fix checklist

# Restart the CNI plugin pod on the affected node
kubectl delete pod -n kube-system <cni-pod-on-node>

# If CNI config is missing, re-apply the CNI manifest
kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml

Draining vs Cordoning — When to Use Each

Action     Command                                                            When to use
Cordon     kubectl cordon <node>                                              Stop new pods scheduling while you investigate; existing pods keep running
Drain      kubectl drain <node> --ignore-daemonsets --delete-emptydir-data    Safely evict all pods before maintenance or node replacement
Uncordon   kubectl uncordon <node>                                            Return the node to schedulable state after the fix is confirmed

Warning: kubectl drain will evict all pods from the node. Make sure your workloads have enough replicas running on other nodes before draining, or you will cause downtime.
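A PodDisruptionBudget gives drain a guardrail: eviction requests that would drop the workload below the budget are refused until replacement pods are ready elsewhere. A minimal sketch (name and labels are illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb                  # illustrative name
spec:
  minAvailable: 2                # drain will not evict below 2 ready replicas
  selector:
    matchLabels:
      app: api                   # must match the workload's pod labels
```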


Quick Reference — Node Conditions and First Actions

Condition            First Command              Most Likely Cause
Ready: Unknown       systemctl status kubelet   Node unreachable / kubelet crashed
MemoryPressure       kubectl top node           Pod without memory limits
DiskPressure         df -h on node              Accumulated images or logs
NetworkUnavailable   Check CNI pod logs         CNI plugin failure
PIDPressure          ps aux | wc -l on node     Fork bomb or runaway process



Section 3 — Debugging Kubernetes Networking and DNS

Networking failures in Kubernetes are among the hardest to debug. The system has multiple layers — pod networking, Services, DNS, Ingress, and NetworkPolicies — and a failure in any one of them produces similar symptoms: requests time out, connections are refused, or DNS names do not resolve. The trick is isolating which layer is failing before you start making changes.


The 5-Minute Network Triage Checklist

# 1. Can the pod reach the internet?
kubectl exec -it <pod-name> -n <namespace> -- curl -I https://google.com

# 2. Can the pod reach another pod by IP?
kubectl exec -it <pod-name> -n <namespace> -- curl http://<pod-ip>:<port>

# 3. Can the pod reach a Service by ClusterIP?
kubectl exec -it <pod-name> -n <namespace> -- curl http://<cluster-ip>:<port>

# 4. Can the pod reach a Service by DNS name?
kubectl exec -it <pod-name> -n <namespace> -- curl http://<service-name>.<namespace>.svc.cluster.local:<port>

# 5. Does DNS resolution work at all?
kubectl exec -it <pod-name> -n <namespace> -- nslookup kubernetes.default

Work through these in order. If step 2 works but step 3 fails, the problem is in kube-proxy or iptables rules. If step 3 works but step 4 fails, the problem is DNS. This narrows your diagnosis before you spend time looking in the wrong place.


Service Not Reachable

What it means

A pod cannot connect to a Kubernetes Service by its ClusterIP or DNS name, even though the target pods are running and healthy.

How to diagnose

# Check the Service exists and has the right port
kubectl get svc <service-name> -n <namespace>
kubectl describe svc <service-name> -n <namespace>

# Check if the Service has Endpoints (this is the most important check)
kubectl get endpoints <service-name> -n <namespace>

# If Endpoints shows <none>, the selector does not match any pods

The most common cause: selector mismatch

# Check what labels the Service is selecting
kubectl get svc <service-name> -n <namespace> -o yaml | grep selector -A5

# Check what labels your pods actually have
kubectl get pods -n <namespace> --show-labels

# They must match exactly — including case

Fix checklist

# Verify kube-proxy is running on all nodes
kubectl get pods -n kube-system | grep kube-proxy

# Check iptables rules are being applied (on the node)
iptables -t nat -L KUBE-SERVICES | grep <service-cluster-ip>

# Test direct pod-to-pod connectivity to rule out Service layer
kubectl exec -it <source-pod> -- curl http://<target-pod-ip>:<port>

Tip: If kubectl get endpoints shows no addresses, your pods are either not running, not passing readiness probes, or the Service selector does not match the pod labels. Fix the selector first — it is the most common cause by far.
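As a reference point, a correctly wired pair looks like this; the Service's spec.selector must equal the pod template labels exactly (all names here are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp            # must equal the pod template labels below
  ports:
  - port: 80
    targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 2
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp        # these labels populate the Service's Endpoints
    spec:
      containers:
      - name: myapp
        image: myrepo/myapp:v1   # illustrative image
        ports:
        - containerPort: 8080
```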


DNS Resolution Failures

What it means

Pods cannot resolve Kubernetes service names or external hostnames. This usually points to CoreDNS.

How to diagnose

# Check CoreDNS pods are running
kubectl get pods -n kube-system | grep coredns

# Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns

# Test DNS from inside a pod
kubectl exec -it <pod-name> -n <namespace> -- nslookup kubernetes.default
kubectl exec -it <pod-name> -n <namespace> -- nslookup <service-name>.<namespace>.svc.cluster.local

# Check the pod's DNS config
kubectl exec -it <pod-name> -n <namespace> -- cat /etc/resolv.conf

Common CoreDNS log errors

[ERROR] plugin/errors: 2 SERVFAIL
[ERROR] Failed to list *v1.Endpoints
HINFO: read udp: i/o timeout

Fix checklist

# Restart CoreDNS pods
kubectl rollout restart deployment/coredns -n kube-system

# Check CoreDNS ConfigMap for misconfigurations
kubectl get configmap coredns -n kube-system -o yaml

# Check if DNS requests are hitting CoreDNS at all
kubectl get svc kube-dns -n kube-system
# Verify ClusterIP matches what pods have in /etc/resolv.conf

NetworkPolicy Blocking Traffic

What it means

A NetworkPolicy is explicitly denying traffic between pods or from external sources. This is a common gotcha in production clusters where NetworkPolicies have been applied incrementally.

How to diagnose

# List all NetworkPolicies in the namespace
kubectl get networkpolicy -n <namespace>

# Describe a specific policy to see its rules
kubectl describe networkpolicy <policy-name> -n <namespace>

# Check if any policy applies to your pod
kubectl get networkpolicy -n <namespace> -o yaml | grep -A10 podSelector

NetworkPolicy logic to remember

  • If no NetworkPolicy selects a pod → all traffic is allowed (default open)
  • If any NetworkPolicy selects a pod → only explicitly allowed traffic passes, in the directions listed under policyTypes
  • A policy that lists a direction in policyTypes but defines no rules for it denies that direction entirely
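This logic is why a "default deny" policy is just an empty pod selector; once applied, every pod in the namespace needs explicit allow rules. A common sketch:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
spec:
  podSelector: {}          # empty selector = every pod in the namespace
  policyTypes:
  - Ingress
  - Egress                 # listed with no rules, so both directions are denied
```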

Fix checklist

# Temporarily test by deleting a restrictive policy (non-production only)
kubectl delete networkpolicy <policy-name> -n <namespace>

# Add an explicit allow rule for the traffic you need
# Example: allow ingress from another namespace
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-monitoring
  namespace: <namespace>
spec:
  podSelector:
    matchLabels:
      app: myapp
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: monitoring
EOF

Ingress and LoadBalancer Troubleshooting

Ingress is where external traffic enters your cluster. When it breaks, users cannot reach your application even if every pod inside the cluster is healthy. Failures fall into three categories: misconfigured routing rules, TLS certificate issues, and Ingress controller failures.

The 5-Minute Ingress Triage Checklist

# 1. Check Ingress and whether an address is assigned
kubectl get ingress -n <namespace>
# ADDRESS column should show an IP -- if empty, controller has not processed it

# 2. Describe the Ingress for rule details
kubectl describe ingress <ingress-name> -n <namespace>

# 3. Check Ingress controller pods
kubectl get pods -n ingress-nginx

# 4. Check Ingress controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx --tail=50

# 5. Verify backend Service and Endpoints
kubectl get endpoints <backend-service> -n <namespace>

Common Misconfiguration: Wrong Path Rules

# Use Prefix for flexible matching -- not Exact
spec:
  rules:
  - host: api.opscart.com
    http:
      paths:
      - path: /api
        pathType: Prefix        # matches /api, /api/v1, /api/v2
        backend:
          service:
            name: api-service
            port:
              number: 8080

Common mistakes: trailing slash mismatch, Ingress and backend Service in different namespaces, or backend port not matching what the application actually listens on.

TLS Certificate Issues

# Verify the TLS secret has correct keys
kubectl get secret <tls-secret-name> -n <namespace> -o yaml
# Must contain: tls.crt and tls.key

# Check certificate expiry
kubectl get secret <tls-secret-name> -n <namespace> \
  -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -dates

# If using cert-manager
kubectl get certificate -n <namespace>
kubectl describe certificate <cert-name> -n <namespace>
# Look for Ready: True

# If ACME HTTP-01 challenge is stuck (domain must be publicly reachable on port 80)
kubectl get challenges -n <namespace>
kubectl describe challenge <challenge-name> -n <namespace>

Gotcha: A wildcard cert (*.opscart.com) does not cover second-level subdomains (api.k8s.opscart.com). This catches teams off-guard when clusters use nested subdomain structures.

External DNS vs Internal DNS Hairpinning

When internal pods resolve an external hostname and get routed through the public load balancer instead of staying inside the cluster, you get unnecessary latency and potential firewall issues. Always use internal Service DNS (service.namespace.svc.cluster.local) for pod-to-pod communication and reserve external hostnames for traffic entering from outside.
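In practice that means wiring internal clients to the Service DNS name rather than the public hostname (the service name, namespace, and domain here are illustrative):

```yaml
# Deployment env snippet: point internal consumers at the in-cluster Service
env:
- name: API_URL
  value: "http://api-service.prod.svc.cluster.local:8080"  # stays inside the cluster
# not https://api.opscart.com, which would hairpin through the public load balancer
```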


Calico and Cilium — Common NetworkPolicy Gotchas

Calico

# Check Calico node pods
kubectl get pods -n kube-system -l k8s-app=calico-node

# Inspect active policies
calicoctl get networkpolicy --all-namespaces
calicoctl get globalnetworkpolicy

Gotcha: Calico’s GlobalNetworkPolicy applies cluster-wide and overrides namespace-level policies. A deny rule in a global policy will block traffic even if your namespace NetworkPolicy explicitly allows it. Always check global policies when namespace rules look correct but traffic is still blocked.

Cilium

# Check Cilium agents
kubectl get pods -n kube-system -l k8s-app=cilium

# Test connectivity
cilium connectivity test

# Inspect policies
kubectl exec -n kube-system <cilium-pod> -- cilium endpoint list
kubectl exec -n kube-system <cilium-pod> -- cilium policy get

Gotcha: CiliumNetworkPolicy CRDs silently have no effect on clusters running Calico. If you migrate CNI plugins and carry over Cilium-specific policies, they are ignored without any error.


Quick Reference — Network Failure Layers

Symptom                    Layer                      First Command
Pod to pod by IP fails     CNI / node networking      ping <pod-ip> from source pod
Pod to ClusterIP fails     kube-proxy / iptables      kubectl get endpoints <svc>
Pod to Service DNS fails   CoreDNS                    nslookup kubernetes.default
Ingress returns 404        Wrong path rules           kubectl describe ingress
Ingress returns 503        Backend Endpoints empty    kubectl get endpoints <svc>
TLS handshake fails        Cert missing or expired    kubectl get certificate
NetworkPolicy blocking     Policy selector mismatch   kubectl get networkpolicy -A
Intermittent timeouts      CNI bug or node network    Check CNI pod logs



Section 4 — Debugging Resource and Scheduling Problems

Scheduling failures and resource misconfigurations are silent killers in production Kubernetes. Pods sit in Pending state while on-call engineers scramble, or applications get throttled and slow down without any obvious error. This section covers how to quickly identify what is blocking scheduling and how to fix resource configuration before it causes an incident.


The 5-Minute Scheduling Triage Checklist

# 1. Why is the pod Pending?
kubectl describe pod <pod-name> -n <namespace>
# Read the Events section — the scheduler always explains why it cannot place the pod

# 2. What resources are available on each node?
kubectl describe nodes | grep -A8 "Allocated resources"

# 3. What are the current resource requests across the cluster?
kubectl top nodes

# 4. Are there any taints blocking placement?
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints

# 5. Check resource quotas in the namespace
kubectl describe resourcequota -n <namespace>

Pods Stuck in Pending — Insufficient Resources

What it means

The scheduler cannot find a node with enough CPU or memory to satisfy the pod’s resource requests.

How to diagnose

kubectl describe pod <pod-name> -n <namespace>

# Events section will show:
# 0/5 nodes are available: 5 Insufficient memory.
# 0/5 nodes are available: 3 Insufficient cpu, 2 node(s) had taint...

Fix checklist

# Find nodes with available capacity
kubectl describe nodes | grep -A5 "Allocated resources"

# Check if requests are set too high on the pending pod
kubectl get pod <pod-name> -o yaml | grep -A10 resources

# Option 1: Lower the resource requests if they are over-provisioned
# Option 2: Scale up the node pool (in cloud environments)
# Option 3: Remove idle pods consuming resources unnecessarily

# Identify resource-heavy pods
kubectl top pods --all-namespaces --sort-by=cpu
kubectl top pods --all-namespaces --sort-by=memory

Tip: Resource requests determine scheduling — not limits. A pod with request: 4Gi memory needs a node with 4Gi free, even if it only ever uses 512Mi. Audit your requests regularly.
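To make the distinction concrete, the scheduler reserves the full request regardless of actual usage (values illustrative):

```yaml
resources:
  requests:
    memory: "4Gi"    # the scheduler needs a node with 4Gi unallocated to place this pod
  limits:
    memory: "6Gi"    # enforced only at runtime; plays no part in scheduling
```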


CPU Throttling — Pod Running But Slow

What it means

The pod is running but its CPU usage is being throttled because it has hit its CPU limit. The application slows down without any crash or visible error. This is one of the hardest production issues to spot without metrics.

How to diagnose

# Check CPU usage vs limits
kubectl top pod <pod-name> -n <namespace>

# Look for throttling in metrics (if using Prometheus)
# Metric: container_cpu_cfs_throttled_seconds_total

# Describe the pod to see configured limits
kubectl get pod <pod-name> -o yaml | grep -A10 resources
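If Prometheus is scraping cAdvisor metrics from the kubelet, a throttling ratio close to 1.0 confirms the pod is CPU-bound by its limit. A sketch of the query (assumes the standard cAdvisor metric names; the pod label is illustrative):

```promql
# Fraction of CPU scheduling periods in which the container was throttled
rate(container_cpu_cfs_throttled_periods_total{pod="my-pod"}[5m])
  /
rate(container_cpu_cfs_periods_total{pod="my-pod"}[5m])
```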

Fix checklist

  • Increase the CPU limit in the pod spec
  • Or remove the CPU limit entirely and rely on requests only (acceptable for non-critical workloads)
  • Investigate if the application has a genuine CPU spike (code issue) vs being under-provisioned

resources:
  requests:
    cpu: "250m"
  limits:
    cpu: "1000m"   # Increase this if throttling is confirmed

Taints and Tolerations Blocking Scheduling

What it means

Nodes have taints applied that prevent pods from being scheduled unless the pod explicitly tolerates those taints. Common in clusters with dedicated node pools (GPU nodes, spot nodes, system nodes).

How to diagnose

# Check node taints
kubectl describe node <node-name> | grep Taints

# Or check all nodes at once
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints

# Check if your pod has the required tolerations
kubectl get pod <pod-name> -o yaml | grep -A10 tolerations

Fix checklist

# Add the appropriate toleration to your pod spec
# Example: tolerate a spot instance taint
tolerations:
- key: "kubernetes.azure.com/scalesetpriority"
  operator: "Equal"
  value: "spot"
  effect: "NoSchedule"

ResourceQuota Blocking Pod Creation

What it means

A namespace-level ResourceQuota has been hit. New pods cannot be created even if the cluster has capacity.

How to diagnose

kubectl describe resourcequota -n <namespace>

# Output shows used vs hard limits:
# Name:            default-quota
# Resource         Used    Hard
# --------         ---     ---
# limits.cpu       7800m   8000m
# limits.memory    14Gi    16Gi
# pods             48      50

Fix checklist

# Option 1: Increase the quota
kubectl edit resourcequota <quota-name> -n <namespace>

# Option 2: Delete unused pods to free up quota
kubectl get pods -n <namespace> | grep Completed
kubectl delete pod <completed-pod> -n <namespace>

# Option 3: Reduce limits on existing pods if over-provisioned

Quick Reference — Scheduling Failure Reasons

Scheduler Message                           Root Cause                         Fix
Insufficient cpu                            CPU requests exceed available      Lower requests or scale nodes
Insufficient memory                         Memory requests exceed available   Lower requests or scale nodes
node(s) had taint                           Pod missing toleration             Add toleration to pod spec
pod has unbound PVCs                        PVC not bound                      Fix PVC / StorageClass
node(s) didn't match Pod's node affinity    Affinity rules too strict          Relax affinity or label nodes
exceeded quota                              ResourceQuota hit                  Increase quota or delete idle pods

HPA and VPA Misconfigurations

Horizontal Pod Autoscaler (HPA)

HPA scales pod count based on metrics — most commonly CPU utilization. When it does not scale as expected, the issue is almost always one of three things: metrics are not being collected, the target utilization is set incorrectly, or resource requests are missing (HPA uses requests as the baseline for utilization calculations).

# Check HPA status
kubectl get hpa -n <namespace>

# Describe HPA for detailed status
kubectl describe hpa <hpa-name> -n <namespace>

# Common output when HPA is stuck:
# Conditions:
#   Type            Status  Reason
#   AbleToScale     True    SucceededGetScale
#   ScalingActive   False   FailedGetResourceMetric
#   ScalingLimited  False   DesiredWithinRange
#
# ScalingActive: False with FailedGetResourceMetric means
# the metrics server cannot read CPU/memory for your pods

HPA not scaling up — fix checklist

# 1. Verify metrics-server is running
kubectl get pods -n kube-system | grep metrics-server

# 2. Check if metrics are available for your pods
kubectl top pods -n <namespace>
# If this command fails, metrics-server is the problem

# 3. Verify resource requests are set on the pod
# HPA calculates utilization as: actual usage / request
# Without requests, HPA cannot calculate a percentage
kubectl get pod <pod-name> -o yaml | grep -A5 resources

# 4. Check the HPA min/max replicas -- max may already be reached
kubectl describe hpa <hpa-name> -n <namespace> | grep -E "Min|Max|Current"

Common gotcha: HPA targets 50% CPU utilization but the pod has no CPU request set. HPA simply will not work. Every pod managed by HPA must have CPU requests defined.
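A minimal working pair, as a sketch (the Deployment name myapp and the numbers are illustrative; the key point is that the container declares a CPU request for the HPA to divide against):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50   # percentage of the CPU *request* below
---
# The target Deployment's container must set the request the HPA divides by:
#   resources:
#     requests:
#       cpu: "500m"
```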

Vertical Pod Autoscaler (VPA)

VPA adjusts CPU and memory requests automatically based on actual usage. It is powerful but has important constraints in production.

# Check VPA status
kubectl get vpa -n <namespace>
kubectl describe vpa <vpa-name> -n <namespace>

# VPA recommendations appear under Status.Recommendation
# Lower bound: minimum safe request
# Target: recommended request
# Upper bound: maximum before VPA considers scaling up node

Warning: VPA in Auto mode will evict and restart pods to apply new resource values. Never use VPA Auto mode with HPA on the same deployment targeting CPU/memory — they conflict. Use VPA Off mode to get recommendations only, then apply them manually.
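A recommendation-only VPA looks like this (the VerticalPodAutoscaler CRD ships with the Kubernetes autoscaler project; the target name is illustrative):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: myapp-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  updatePolicy:
    updateMode: "Off"   # recommendations only; never evicts pods
```

Check kubectl describe vpa myapp-vpa periodically and fold the Target values into your manifests during normal deploys.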


Node Autoscaler Delays in Cloud Environments

In AKS, EKS, and GKE, when the cluster autoscaler needs to provision a new node, there is a provisioning delay — typically 2 to 5 minutes. During this window, pods sit in Pending state and your HPA may be trying to scale up but has nowhere to put the new pods.

# Check if Cluster Autoscaler is making decisions
kubectl get events -n kube-system | grep -i "scale\|autoscal"

# Check Cluster Autoscaler logs
kubectl logs -n kube-system -l app=cluster-autoscaler --tail=50

# Common log messages:
# "Scale up triggered"         -- CA decided to add a node
# "Scale down triggered"       -- CA decided to remove a node
# "Pod is unschedulable"       -- Pod waiting for new node
# "Node group has reached maximum size" -- Cannot scale further

Fix: Over-provision with low-priority placeholder pods

To reduce the impact of cold-start delays, keep a small buffer of over-provisioned capacity using a low-priority placeholder deployment:

# Placeholder pods that get evicted when real workloads need the capacity
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-buffer
spec:
  replicas: 2
  selector:
    matchLabels:
      app: cluster-buffer
  template:
    metadata:
      labels:
        app: cluster-buffer
    spec:
      priorityClassName: low-priority-buffer   # must reference an existing PriorityClass
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9   # k8s.gcr.io is deprecated/frozen
        resources:
          requests:
            cpu: "1"
            memory: "1Gi"

When real pods need scheduling, they evict the buffer pods, which triggers the autoscaler to add a node while existing workloads continue running.
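The priorityClassName referenced above must exist as a PriorityClass. A sketch with a negative priority so the buffer pods always lose preemption (the name and value are illustrative):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority-buffer
value: -10              # below the default 0, so any normal pod preempts these
globalDefault: false
description: "Placeholder capacity; evicted whenever real workloads need the node"
```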


Pod Anti-Affinity Surprises

Anti-affinity rules prevent pods from being scheduled on the same node or zone as other pods. In small clusters, strict anti-affinity can make pods permanently unschedulable.

# Check anti-affinity rules
kubectl get pod <pod-name> -o yaml | grep -A20 affinity

# If pods are stuck in Pending with this message:
# 0/3 nodes are available: 3 node(s) didn't match pod anti-affinity rules
# You have more replicas than nodes that satisfy the anti-affinity constraint

Required vs Preferred anti-affinity

# REQUIRED -- pod will never schedule if constraint cannot be satisfied
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:  # hard rule
    - labelSelector:
        matchLabels:
          app: myapp
      topologyKey: kubernetes.io/hostname

# PREFERRED -- scheduler tries to satisfy but will proceed if not possible
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:  # soft rule
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: myapp
        topologyKey: kubernetes.io/hostname

Production rule of thumb: Use required anti-affinity only when you truly cannot tolerate two replicas on the same node (stateful workloads with local storage). Use preferred for everything else — it achieves the same distribution goal without making your deployment fragile in smaller clusters.
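On recent Kubernetes versions, topology spread constraints express the same distribution goal more directly than anti-affinity. A sketch using the standard pod-spec fields:

```yaml
# Spread replicas evenly across nodes; ScheduleAnyway makes this a soft rule,
# equivalent in spirit to preferred anti-affinity
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: ScheduleAnyway
  labelSelector:
    matchLabels:
      app: myapp
```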


⬆ Back to Table of Contents


Section 5 — Debugging Storage and Persistent Volume Issues

Storage failures are disruptive. A pod stuck waiting for a volume cannot start, and if that pod is part of a StatefulSet, the entire set stalls. Storage issues also tend to have a wider blast radius — a PVC stuck in Terminating can block namespace deletion for hours if you do not know what to look for.


The 5-Minute Storage Triage Checklist

# 1. Check PVC status
kubectl get pvc -n <namespace>

# 2. Check PV status
kubectl get pv

# 3. Check StorageClass
kubectl get storageclass

# 4. Describe the stuck PVC
kubectl describe pvc <pvc-name> -n <namespace>

# 5. Check events for volume errors
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | grep -i volume

PVC Stuck in Pending

What it means

A PersistentVolumeClaim has been created but no PersistentVolume has been bound to it. The pod waiting for this PVC cannot start.

Common causes

  • No StorageClass is defined or the referenced StorageClass does not exist
  • The StorageClass provisioner is not running
  • No PV available that matches the PVC’s access mode and size requirements
  • The requested storage size is larger than any available PV

How to diagnose

kubectl describe pvc <pvc-name> -n <namespace>

# Look for events like:
# no persistent volumes available for this claim and no storage class is set
# storageclass.storage.k8s.io "fast-ssd" not found
# waiting for a volume to be created

# Check StorageClass exists
kubectl get storageclass
kubectl describe storageclass <storageclass-name>

# Check if the provisioner pod is running
kubectl get pods -n kube-system | grep provisioner

Fix checklist

# If StorageClass is missing, create or reference the correct one
kubectl get storageclass   # list available classes

# For Azure (AKS), common built-in StorageClasses (CSI driver era):
# managed-csi          → Azure Standard SSD
# managed-csi-premium  → Azure Premium SSD
# azurefile-csi        → Azure File Share

# Check if provisioner is healthy
kubectl logs -n kube-system <provisioner-pod>

# If using static provisioning, manually create a PV that matches the PVC

PVC Stuck in Terminating

What it means

A PVC has been deleted but is stuck in Terminating state. This commonly blocks namespace deletion.

How to diagnose

kubectl get pvc -n <namespace>
# STATUS: Terminating

kubectl describe pvc <pvc-name> -n <namespace>
# Look for finalizers:
# Finalizers: [kubernetes.io/pvc-protection]

What is happening: Kubernetes uses a finalizer kubernetes.io/pvc-protection to prevent PVC deletion while a pod is still using it. If the pod has been deleted but the PVC is still stuck, the finalizer needs to be removed manually.

Fix checklist

# Remove the finalizer to force deletion
kubectl patch pvc <pvc-name> -n <namespace> \
  -p '{"metadata":{"finalizers":[]}}' \
  --type=merge

# Verify deletion
kubectl get pvc -n <namespace>

Warning: Only remove finalizers when you are certain no pod is actively using the volume. Forcing deletion while a pod has an active mount can cause data corruption.
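One way to confirm that before patching is to list every pod in the namespace that still mounts the claim. A sketch (the kubectl line is the real-world usage; the jq filter is demonstrated against a minimal sample of its JSON output):

```shell
# Which pods still mount this claim? Real usage:
#   kubectl get pods -n <namespace> -o json | jq -r --arg pvc "<pvc-name>" \
#     '.items[] | select(.spec.volumes[]?.persistentVolumeClaim.claimName == $pvc) | .metadata.name'
#
# The same filter against sample output — only the pod mounting the PVC is printed:
cat <<'EOF' | jq -r --arg pvc "data-kafka-2" \
  '.items[] | select(.spec.volumes[]?.persistentVolumeClaim.claimName == $pvc) | .metadata.name'
{"items":[
  {"metadata":{"name":"kafka-2"},
   "spec":{"volumes":[{"persistentVolumeClaim":{"claimName":"data-kafka-2"}}]}},
  {"metadata":{"name":"web-1"},
   "spec":{"volumes":[{"emptyDir":{}}]}}
]}
EOF
# prints: kafka-2
```

If this prints nothing, no pod references the claim and removing the finalizer is safe.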


Volume Mount Failures

What it means

The PVC is bound but the pod cannot mount the volume. The pod stays in ContainerCreating state.

How to diagnose

kubectl describe pod <pod-name> -n <namespace>

# Look for Events like:
# Unable to attach or mount volumes
# timed out waiting for the condition
# Multi-Attach error for volume — volume is already exclusively attached to one node

kubectl get events -n <namespace> | grep -i "mount\|attach\|volume"

Multi-Attach Error

This is common with ReadWriteOnce volumes in cloud environments. The volume is still attached to a previous node (after a node failure or pod rescheduling) and cannot be attached to a new node simultaneously.

# Find which node the volume is attached to
kubectl get pv <pv-name> -o yaml | grep nodeAffinity

# Force detach by deleting the VolumeAttachment object
kubectl get volumeattachment
kubectl delete volumeattachment <attachment-name>

Quick Reference — Storage States

State                  | Meaning                             | First Action
PVC: Pending           | No matching PV or provisioner issue | Check StorageClass and provisioner
PVC: Bound             | Healthy — PV assigned               | No action needed
PVC: Terminating       | Stuck on finalizer                  | Remove finalizer if pod is gone
PV: Released           | PV was used but PVC deleted         | Reclaim or delete manually
PV: Failed             | Provisioner error                   | Check provisioner logs
Pod: ContainerCreating | Volume not mounted                  | Check events for attach errors

⬆ Back to Table of Contents


Section 6 — Debugging Control Plane Failures

Control plane failures are the most severe class of Kubernetes issues. When the API server, etcd, or scheduler becomes degraded, the entire cluster stops responding to changes. Existing workloads may continue running (pods do not die immediately when the API server goes down), but you lose the ability to deploy, scale, or recover from failures.

In managed Kubernetes environments like AKS, EKS, or GKE, you have limited direct access to control plane components. This section covers what you can diagnose from within the cluster and what to escalate to your cloud provider.


The 5-Minute Control Plane Triage Checklist

# 1. Can you reach the API server at all?
kubectl cluster-info

# 2. Check control plane component health
# (componentstatuses is deprecated since v1.19 and may return no data on newer clusters)
kubectl get componentstatuses

# 3. Check system pods
kubectl get pods -n kube-system

# 4. Check API server response time
time kubectl get nodes

# 5. Check etcd health (self-managed clusters)
kubectl exec -n kube-system etcd-<node> -- etcdctl endpoint health \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

API Server Slow or Unresponsive

What it means

kubectl commands hang or time out. New deployments cannot be created. Existing pods continue running because kubelet operates independently, but the cluster cannot be managed.

How to diagnose

# Test API server response
kubectl get nodes --request-timeout=5s

# Check API server pod logs (self-managed)
kubectl logs -n kube-system kube-apiserver-<node>

# For AKS — check Azure portal for control plane health
# az aks show --resource-group <rg> --name <cluster> --query "provisioningState"

# Check for high API request rates
kubectl get --raw /metrics | grep apiserver_request_total

Common causes

  • etcd is slow or overloaded (API server is backed by etcd)
  • Too many LIST/WATCH requests flooding the API server (runaway controllers or operators)
  • Certificates have expired
  • Admission webhooks timing out and blocking all requests

Fix checklist

# Check for slow admission webhooks
kubectl get validatingwebhookconfigurations
kubectl get mutatingwebhookconfigurations

# A single failing webhook with no timeout set can block ALL API requests
# Set a failurePolicy and timeoutSeconds on all webhooks
# failurePolicy: Ignore   ← safe default for non-critical webhooks
# timeoutSeconds: 5

# For AKS — if API server is unresponsive, open a support ticket immediately
# This is a cloud provider responsibility in managed clusters
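The failurePolicy and timeoutSeconds mentioned above live on the webhook configuration itself. A fragment as a sketch (all names here are illustrative):

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: example-policy
webhooks:
- name: validate.example.com
  failurePolicy: Ignore    # requests proceed if the webhook is unreachable
  timeoutSeconds: 5        # cap how long the API server waits (v1 default is 10s)
  clientConfig:
    service:
      name: example-webhook
      namespace: default
      path: /validate
  rules:
  - apiGroups: [""]
    apiVersions: ["v1"]
    operations: ["CREATE"]
    resources: ["pods"]
  admissionReviewVersions: ["v1"]
  sideEffects: None
```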

etcd Issues (Self-Managed Clusters)

What it means

etcd is the key-value store that holds all cluster state. If etcd is degraded, the API server cannot read or write cluster state. This is catastrophic.

How to diagnose

# Check etcd pod status
kubectl get pods -n kube-system | grep etcd

# Check etcd cluster health
etcdctl endpoint health --cluster

# Check etcd leader
etcdctl endpoint status --cluster -w table

# Check etcd disk latency (high disk I/O = slow etcd)
etcdctl check perf

Warning signs in etcd logs

took too long to execute
failed to send out heartbeat on time
server is likely overloaded

These messages indicate etcd is under I/O pressure. etcd is extremely sensitive to disk latency — it requires fast SSDs in production.

Fix checklist

# Check disk performance on etcd node
iostat -x 1 10

# etcd should be on dedicated fast SSD (not shared with other workloads)
# Recommended: <10ms disk write latency

# Compact and defragment etcd if it has grown large
etcdctl compact $(etcdctl endpoint status --write-out="json" | jq '.[0].Status.header.revision')
etcdctl defrag

# Take an etcd snapshot before any risky operations
etcdctl snapshot save /backup/etcd-snapshot.db

Scheduler and Controller Manager Failures

What it means

The scheduler is not placing new pods. The controller manager is not reconciling deployments, replicasets, or other controllers.

How to diagnose

# Check scheduler and controller manager pods
kubectl get pods -n kube-system | grep -E "scheduler|controller"

# Check logs
kubectl logs -n kube-system kube-scheduler-<node>
kubectl logs -n kube-system kube-controller-manager-<node>

# Verify scheduler is active (HA clusters have leader election)
kubectl get lease -n kube-system

Quick Reference — Control Plane Components

Component          | Role                                 | What breaks without it
API Server         | Central management endpoint          | All kubectl commands fail
etcd               | Cluster state storage                | API server cannot read/write state
Scheduler          | Pod placement decisions              | New pods stay Pending forever
Controller Manager | Reconciles deployments, replicasets  | Scaling and self-healing stop working
Cloud Controller   | Cloud provider integration           | Load balancers and node registration break

⬆ Back to Table of Contents


Section 7 — Real Production Incident Walkthroughs

Theory is useful. Production incidents are where you actually learn. This section walks through five real-world Kubernetes incident patterns — the kind that appear at 2 AM — covering the full journey from initial alert to root cause and resolution.

Each walkthrough follows the same format: what the alert looked like, how the investigation progressed, where the actual root cause was found, and what the fix was.


Incident 1 — CrashLoopBackOff After a Deployment

The alert

PagerDuty fires at 11:45 PM. The payment service pod count has dropped from 5 to 0 healthy pods. All 5 pods are in CrashLoopBackOff. The deployment was pushed 12 minutes ago.

Initial triage

kubectl get pods -n payments
# NAME                        READY   STATUS             RESTARTS   AGE
# payment-svc-7d9f6b-xk2p9   0/1     CrashLoopBackOff   4          8m
# payment-svc-7d9f6b-mn3q1   0/1     CrashLoopBackOff   4          8m

kubectl logs payment-svc-7d9f6b-xk2p9 --previous
# Error: failed to connect to database: connection refused
# dial tcp 10.0.1.45:5432: connect: connection refused

Investigation

The logs show a database connection failure. But the database has not changed — why would a new deployment break the DB connection?

kubectl describe pod payment-svc-7d9f6b-xk2p9 -n payments | grep -A20 Environment
# DB_HOST: postgres-svc
# DB_PORT: 5432
# DB_NAME: payments_prod
# DB_PASSWORD: <set to the key 'password' in secret 'db-credentials-v2'>

kubectl get secret db-credentials-v2 -n payments
# Error from server (NotFound): secrets "db-credentials-v2" not found

Root cause

The new deployment referenced db-credentials-v2 — a new secret that had been created in the infrastructure namespace but not in the payments namespace. The previous deployment used db-credentials-v1. The secret name was updated in the deployment manifest but the secret was never created in the correct namespace.

Fix

# Copy the secret to the correct namespace
kubectl get secret db-credentials-v2 -n infrastructure -o yaml | \
  sed 's/namespace: infrastructure/namespace: payments/' | \
  kubectl apply -f -

# Restart the deployment
kubectl rollout restart deployment/payment-svc -n payments

# Verify
kubectl get pods -n payments
# All 5 pods Running within 45 seconds

Time to resolution: 18 minutes.

Lesson: Secret references must be validated as part of the deployment pipeline before rollout. Add a pre-deploy check that verifies all referenced Secrets and ConfigMaps exist in the target namespace.
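One way to implement that pre-deploy check in CI, as a sketch (in real usage you would pipe your rendered manifests in and run kubectl get against the target namespace; here the extraction step is demonstrated against a minimal manifest fragment):

```shell
# Extract Secret names referenced via secretKeyRef, then verify each exists.
# Real usage:
#   helm template ... | grep -A2 'secretKeyRef:' | awk '/name:/ {print $2}' | sort -u \
#     | xargs -I{} kubectl get secret {} -n payments
cat <<'EOF' | grep -A2 'secretKeyRef:' | awk '/name:/ {print $2}' | sort -u
        env:
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: db-credentials-v2
              key: password
EOF
# prints: db-credentials-v2
```

This would have turned the midnight page into a failed CI job: kubectl get secret db-credentials-v2 -n payments fails before the rollout ever starts.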


Incident 2 — Node NotReady Causing Cluster-Wide Scheduling Pressure

The alert

Monday morning. Prometheus alert fires: two nodes in the production cluster have entered NotReady state within 5 minutes of each other. Pod scheduling is failing across 12 namespaces. Restart counts are spiking on multiple services.

Initial triage

kubectl get nodes
# NAME           STATUS     ROLES   AGE
# aks-nodepool-001   NotReady   agent   45d
# aks-nodepool-002   NotReady   agent   45d
# aks-nodepool-003   Ready      agent   45d

kubectl describe node aks-nodepool-001 | grep -A5 Conditions
# MemoryPressure: False
# DiskPressure:   True
# Ready:          False

kubectl describe node aks-nodepool-002 | grep -A5 Conditions
# DiskPressure:   True
# Ready:          False

Investigation

Both nodes show DiskPressure. Two nodes hitting disk pressure simultaneously points to a shared root cause, not random failure.

# SSH into node 1
df -h
# /dev/sda1   99%   /

du -sh /var/lib/containerd/*
# 47G  /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs

# Check what images are stored
crictl images | wc -l
# 312 images

Root cause

A nightly batch job had been pulling a new 8GB ML model container image each run but never cleaning up previous versions. Over 3 weeks, 312 images had accumulated on each node, filling the disk. Both nodes hit the threshold at roughly the same time because both had been running since the same cluster upgrade.

Fix

# Immediate: clean up images on affected nodes
crictl rmi --prune
# Freed 43GB on each node

# Uncordon the nodes
kubectl uncordon aks-nodepool-001
kubectl uncordon aks-nodepool-002

# Long-term fix: add image garbage collection policy to kubelet config
# /var/lib/kubelet/config.yaml
imageGCHighThresholdPercent: 75
imageGCLowThresholdPercent: 60

Time to resolution: 31 minutes.

Lesson: Image garbage collection thresholds should be set explicitly in kubelet config. The default thresholds (85% high, 80% low) are too conservative for nodes running batch jobs with large images.


Incident 3 — Intermittent 503 Errors from Ingress

The alert

Customer support reports intermittent 503 errors on the public API. The errors appear for 5–10 seconds and then resolve. This has been happening every 15–20 minutes for the past two hours. No deployment has occurred in the last 6 hours.

Initial triage

kubectl get pods -n api-gateway
# All pods Running — nothing obviously wrong

kubectl get ingress -n api-gateway
# Ingress exists, host and rules look correct

# Check Ingress controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx --tail=100

# Spotting repeated entries:
# upstream timed out (110: Connection timed out) while reading response header
# connect() failed (111: Connection refused) while connecting to upstream

Investigation

The errors are upstream timeouts — the Ingress controller can reach the pods but the pods are not responding in time. The intermittent pattern (every 15–20 minutes) suggests something is triggering pod restarts or temporary unavailability on a schedule.

kubectl get pods -n api-gateway
# NAME                    READY   RESTARTS   AGE
# api-svc-6b8d9f-xp2k1   1/1     Running    8m     ← recently restarted
# api-svc-6b8d9f-mn7q2   1/1     Running    6m     ← recently restarted

kubectl describe pod api-svc-6b8d9f-xp2k1 -n api-gateway
# Liveness probe failed: HTTP probe failed with statuscode: 503
# Killing container with id ...: pod "api-svc..." container "api-svc" is unhealthy

# Check what the liveness probe is hitting
kubectl get pod api-svc-6b8d9f-xp2k1 -o yaml | grep -A15 livenessProbe
# path: /health
# periodSeconds: 10
# failureThreshold: 3
# initialDelaySeconds: 5

Root cause

The /health endpoint was performing a database health check on every probe call. Every 15–20 minutes, the database connection pool was briefly exhausted during a scheduled reporting job, causing the /health endpoint to return 503. After 3 consecutive failures (30 seconds), Kubernetes killed the pod and restarted it — during which time live traffic hit the other pods, which were also failing their probes.

Fix

# Short-term: Separate liveness from readiness probe
# Liveness → simple check (is the process alive?)
# Readiness → deep check (is the service ready for traffic?)

livenessProbe:
  httpGet:
    path: /ping      # returns 200 immediately, no DB check
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 20
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /health    # full DB check — removes pod from LB if failing
    port: 8080
  periodSeconds: 10
  failureThreshold: 2

Time to resolution: 47 minutes.

Lesson: Liveness probes should never perform deep dependency checks. A liveness probe that calls a database can cause cascading pod restarts during any database hiccup. Use /ping or /alive for liveness and reserve deep checks for readiness probes.


Incident 4 — HPA Not Scaling During Traffic Spike

The alert

At 3:15 PM on a Friday, the e-commerce platform starts returning slow responses. API latency climbs from 120ms to 4,200ms. The SRE on call checks the dashboard and sees CPU at 94% on all pods. The HPA should have kicked in 20 minutes ago.

Initial triage

kubectl get hpa -n ecommerce
# NAME          REFERENCE             TARGETS         MINPODS   MAXPODS   REPLICAS
# api-hpa       Deployment/api-svc   <unknown>/50%   3         20        3

# TARGETS shows <unknown> -- HPA cannot read metrics

Investigation

kubectl describe hpa api-hpa -n ecommerce

# Conditions:
#   ScalingActive: False
#   Reason: FailedGetResourceMetric
#   Message: unable to get metrics for resource cpu: unable to fetch metrics
#   from resource metrics API: the server is currently unable to handle the request

# Check metrics-server
kubectl get pods -n kube-system | grep metrics-server
# metrics-server-7d9c8b-xp2k   0/1   CrashLoopBackOff   8   35m

kubectl logs -n kube-system metrics-server-7d9c8b-xp2k --previous
# E0308 15:01:22 Failed to scrape node "aks-nodepool-003"
# x509: certificate signed by unknown authority

Root cause

The metrics-server was failing its TLS verification against the kubelet API. A cluster certificate rotation had been performed two days earlier, but the metrics-server deployment had not been updated with the --kubelet-insecure-tls flag (acceptable in managed AKS) or the new CA certificate. The metrics-server had been in CrashLoopBackOff for two days — but because existing pods were handling the load, nobody noticed until traffic spiked.

Fix

# Immediate: patch metrics-server to skip TLS verification (AKS managed clusters)
kubectl patch deployment metrics-server -n kube-system --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"}]'

# Verify metrics-server recovers
kubectl get pods -n kube-system | grep metrics-server
# Running after ~60 seconds

# HPA immediately starts reading metrics
kubectl get hpa -n ecommerce
# TARGETS: 94%/50% -- HPA begins scaling up within 30 seconds

# Manually scale to recover faster while HPA catches up
kubectl scale deployment api-svc --replicas=12 -n ecommerce

Time to resolution: 22 minutes.

Lesson: Monitor metrics-server health as a first-class concern. An HPA that cannot read metrics provides zero protection during traffic spikes. Add a Prometheus alert on metrics-server pod restarts and kube_horizontalpodautoscaler_status_condition for ScalingActive=False.
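A sketch of such an alert, assuming kube-state-metrics is installed (the metric and label names follow its conventions; verify them against your installed version):

```yaml
groups:
- name: autoscaling-health
  rules:
  - alert: HPAScalingInactive
    expr: kube_horizontalpodautoscaler_status_condition{condition="ScalingActive",status="false"} == 1
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "HPA {{ $labels.horizontalpodautoscaler }} cannot read metrics"
```

Pair this with a standard pod-restart alert scoped to the metrics-server deployment so a two-day-old CrashLoopBackOff never goes unnoticed again.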


Incident 5 — StatefulSet Pods Stuck After Node Replacement

The alert

During a planned AKS node pool upgrade (rolling node replacement), 3 out of 5 pods in a Kafka StatefulSet enter Pending state and do not recover after 25 minutes. The node pool upgrade appears to have completed successfully.

Initial triage

kubectl get pods -n messaging -l app=kafka
# NAME      READY   STATUS    RESTARTS
# kafka-0   1/1     Running   0
# kafka-1   1/1     Running   0
# kafka-2   0/1     Pending   0
# kafka-3   0/1     Pending   0
# kafka-4   0/1     Pending   0

kubectl describe pod kafka-2 -n messaging
# Events:
#   Warning  FailedScheduling  pod has unbound immediate PersistentVolumeClaims

Investigation

kubectl get pvc -n messaging
# NAME              STATUS    VOLUME   CAPACITY   ACCESS MODES
# data-kafka-0      Bound     ...      100Gi      RWO
# data-kafka-1      Bound     ...      100Gi      RWO
# data-kafka-2      Pending   <none>   100Gi      RWO
# data-kafka-3      Pending   <none>   100Gi      RWO
# data-kafka-4      Pending   <none>   100Gi      RWO

kubectl describe pvc data-kafka-2 -n messaging
# Events:
#   Warning ProvisioningFailed: storageclass.storage.k8s.io "premium-zrs" not found

Root cause

The old node pool used a StorageClass called premium-zrs (zone-redundant storage). During the node pool upgrade, the team had switched to a new node pool in a different availability zone configuration. The premium-zrs StorageClass had been removed as part of a cleanup task three weeks earlier — but because the existing PVCs were already bound, nobody noticed. When the StatefulSet pods were evicted and rescheduled during the upgrade, they attempted to provision new PVCs using the deleted StorageClass.

Fix

# Recreate the StorageClass
kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: premium-zrs
provisioner: disk.csi.azure.com
parameters:
  skuName: Premium_ZRS
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
EOF

# Trigger PVC reconciliation by deleting and recreating stuck PVCs
# WARNING: Only safe if the underlying Azure disk still exists
kubectl delete pvc data-kafka-2 data-kafka-3 data-kafka-4 -n messaging

# StatefulSet controller automatically recreates PVCs
# Pods recover as PVCs bind to new or existing volumes
kubectl get pods -n messaging -w

Time to resolution: 41 minutes.

Lesson: Deleting a StorageClass silently breaks StatefulSets that rely on it for new volume provisioning. Before deleting any StorageClass, audit all PVCs and StatefulSets that reference it. Add a pre-deletion check to your runbooks:

kubectl get pvc --all-namespaces -o json | \
  jq '.items[] | select(.spec.storageClassName=="<class>") | .metadata.name'


Summary — Incident Patterns and Prevention

Incident                                 | Root Cause                           | Prevention
CrashLoopBackOff after deploy            | Secret missing in target namespace   | Pre-deploy secret validation in CI pipeline
Node DiskPressure                        | Accumulated container images         | Set kubelet image GC thresholds explicitly
Intermittent 503 from Ingress            | Liveness probe doing DB health check | Separate liveness and readiness probe paths
HPA not scaling during spike             | metrics-server in CrashLoopBackOff   | Alert on metrics-server health, not just HPA
StatefulSet stuck after node replacement | StorageClass deleted mid-lifecycle   | Audit StorageClass references before deletion


Appendix — Production Kubernetes Debugging Cheatsheet

Use this section as your rapid-reference during incidents. Every command you need, organized by failure type.


Pod Failures

# Get pod status across all namespaces
kubectl get pods --all-namespaces | grep -v Running

# Describe pod (events, state, conditions)
kubectl describe pod <pod-name> -n <namespace>

# Logs from running container
kubectl logs <pod-name> -n <namespace>

# Logs from previously crashed container
kubectl logs <pod-name> -n <namespace> --previous

# Follow logs in real time
kubectl logs -f <pod-name> -n <namespace>

# Multi-pod log tailing with stern
stern <deployment-name> -n <namespace>

# Check exit code of last container run
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'

# Get all events sorted by time
kubectl get events -n <namespace> --sort-by='.lastTimestamp'

Node Issues

# List all nodes with status
kubectl get nodes

# Describe node (conditions, allocated resources, events)
kubectl describe node <node-name>

# Check resource usage per node
kubectl top nodes

# Check all system pods on a specific node
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node-name>

# Cordon node (stop new scheduling)
kubectl cordon <node-name>

# Drain node (evict all pods safely)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Uncordon node (return to schedulable)
kubectl uncordon <node-name>

# Check kubelet status on node (run via SSH)
systemctl status kubelet
journalctl -u kubelet -n 100 --no-pager

# Check disk usage on node (run via SSH)
df -h
du -sh /var/lib/containerd/*

# Clean up unused images on node (run via SSH)
crictl rmi --prune

Networking and DNS

# Test pod-to-pod connectivity
kubectl exec -it <source-pod> -n <namespace> -- curl http://<target-pod-ip>:<port>

# Test Service reachability by ClusterIP
kubectl exec -it <pod> -n <namespace> -- curl http://<cluster-ip>:<port>

# Test DNS resolution
kubectl exec -it <pod> -n <namespace> -- nslookup kubernetes.default
kubectl exec -it <pod> -n <namespace> -- nslookup <service>.<namespace>.svc.cluster.local

# Check Service endpoints
kubectl get endpoints <service-name> -n <namespace>

# Check Service selector vs pod labels
kubectl get svc <service-name> -o yaml | grep selector -A5
kubectl get pods --show-labels -n <namespace>

# Check CoreDNS pods and logs
kubectl get pods -n kube-system | grep coredns
kubectl logs -n kube-system -l k8s-app=kube-dns

# Check NetworkPolicies affecting a namespace
kubectl get networkpolicy -n <namespace>

# Check Ingress and its address
kubectl get ingress -n <namespace>
kubectl describe ingress <ingress-name> -n <namespace>

# Check TLS certificate expiry
kubectl get secret <tls-secret> -n <namespace> \
  -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -dates

Resource and Scheduling

# Why is pod Pending? (always check Events section)
kubectl describe pod <pod-name> -n <namespace>

# Check node allocated resources
kubectl describe nodes | grep -A8 "Allocated resources"

# Top resource consumers
kubectl top pods --all-namespaces --sort-by=cpu
kubectl top pods --all-namespaces --sort-by=memory

# Check HPA status
kubectl get hpa -n <namespace>
kubectl describe hpa <hpa-name> -n <namespace>

# Check ResourceQuota
kubectl describe resourcequota -n <namespace>

# Check node taints
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints

# Check metrics-server health
kubectl get pods -n kube-system | grep metrics-server
kubectl top nodes   # if this fails, metrics-server is down

Storage

# Check PVC status
kubectl get pvc --all-namespaces

# Describe stuck PVC
kubectl describe pvc <pvc-name> -n <namespace>

# Check PV status
kubectl get pv

# Check StorageClass
kubectl get storageclass

# Find all PVCs using a specific StorageClass
kubectl get pvc --all-namespaces -o json | \
  jq '.items[] | select(.spec.storageClassName=="<class>") | .metadata.name'

# Remove stuck PVC finalizer (force delete)
kubectl patch pvc <pvc-name> -n <namespace> \
  -p '{"metadata":{"finalizers":[]}}' --type=merge

# Check VolumeAttachments (for multi-attach errors)
kubectl get volumeattachment
kubectl delete volumeattachment <attachment-name>

Control Plane

# Check API server connectivity
kubectl cluster-info

# Check component statuses
# (deprecated since v1.19; may return no data on newer clusters)
kubectl get componentstatuses

# Check system pods
kubectl get pods -n kube-system

# Check API server response time
time kubectl get nodes

# Check Cluster Autoscaler decisions
kubectl get events -n kube-system | grep -i autoscal
kubectl logs -n kube-system -l app=cluster-autoscaler --tail=50

# Check admission webhooks (can block all API requests if timing out)
kubectl get validatingwebhookconfigurations
kubectl get mutatingwebhookconfigurations

# etcd health check (self-managed clusters)
etcdctl endpoint health --cluster
etcdctl endpoint status --cluster -w table

Pod Exit Code Quick Reference

Exit Code | Meaning             | First Check
0         | Clean exit          | Liveness/readiness probe config
1         | Application error   | kubectl logs --previous
127       | Command not found   | Check image CMD/ENTRYPOINT
137       | OOMKilled (SIGKILL) | Increase memory limit
139       | Segmentation fault  | Application or library bug
143       | SIGTERM received    | Check preStop hooks

Node Condition Quick Reference

| Condition | Status | Meaning | First Action |
|---|---|---|---|
| Ready | True | Node healthy | No action |
| Ready | False | Kubelet reporting failure | `systemctl status kubelet` |
| Ready | Unknown | Node unreachable (40s timeout) | Check node connectivity |
| MemoryPressure | True | Low on memory | `kubectl top node`, find memory hog |
| DiskPressure | True | Low on disk | `df -h` on node, `crictl rmi --prune` |
| PIDPressure | True | Too many processes | `ps aux` on node |
| NetworkUnavailable | True | CNI not configured | Check CNI plugin pods |
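On large clusters it helps to surface only the nodes with an active pressure condition. A hedged sketch: `flag_pressure` is an illustrative filter, not a standard tool, and the custom-columns query in the comment assumes the condition names from the table above.

```shell
# Keep the header row plus any node row where a pressure condition reads True.
flag_pressure() {
  awk 'NR==1 || /True/'
}

# Live usage (requires a cluster):
#   kubectl get nodes -o custom-columns='NAME:.metadata.name,MEM:.status.conditions[?(@.type=="MemoryPressure")].status,DISK:.status.conditions[?(@.type=="DiskPressure")].status,PID:.status.conditions[?(@.type=="PIDPressure")].status' | flag_pressure

# Offline check against sample output:
printf 'NAME    MEM    DISK   PID\nnode-a  False  False  False\nnode-b  False  True   False\n' | flag_pressure
```

Healthy nodes drop out of the listing, leaving only the header and any node under pressure.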

Debugging Flow — Production Outage Decision Tree

Alert fires
    |
    v
kubectl get pods --all-namespaces | grep -v Running
    |
    +-- Pods failing? ---------> kubectl describe pod / kubectl logs --previous
    |                            Check exit code, check events, check secrets/configmaps
    |
    +-- Nodes NotReady? -------> kubectl describe node
    |                            Check: MemoryPressure / DiskPressure / kubelet status
    |
    +-- Pods Pending? ---------> kubectl describe pod (read Events)
    |                            Insufficient CPU/memory? Taints? PVC unbound?
    |
    +-- Network timeouts? -----> kubectl get endpoints <svc>
    |                            Selector match? CoreDNS healthy? NetworkPolicy?
    |
    +-- All pods Running but
        app is slow/erroring? -> kubectl top pods (CPU throttling?)
                                 kubectl describe hpa (scaling blocked?)
                                 kubectl logs (application-level errors?)

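The first pass of the tree above can be scripted so the initial sweep is the same on every incident. A minimal sketch, assuming `kubectl` access: `triage` is a hypothetical wrapper, and passing `echo` in place of `kubectl` previews the commands without touching a cluster.

```shell
#!/usr/bin/env bash
# First-pass triage following the decision tree: failing pods,
# node readiness, and Pending pods in one sweep.
triage() {
  local kc="${1:-kubectl}"   # pass 'echo' to preview the commands without a cluster
  echo "== Pods not Running =="
  "$kc" get pods --all-namespaces | grep -v Running || true
  echo "== Node readiness =="
  "$kc" get nodes || true
  echo "== Pending pods (read Events via describe) =="
  "$kc" get pods --all-namespaces --field-selector=status.phase=Pending || true
}

# Dry-run preview of what would be executed:
triage echo
```

From here, each branch of the tree narrows to the `describe`/`logs` commands listed above.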
Return to the Kubernetes Guide for the full topic cluster.

Explore related labs: Kubernetes Troubleshooting Labs

Closing — Your Debugging Mindset

Every Kubernetes incident you work through teaches you something the next one will test. The engineers who debug fastest are not the ones who have memorised the most commands — they are the ones who have a systematic approach and follow it consistently under pressure.

The framework is always the same:

  1. Observe — what is the symptom? What layer is failing?
  2. Isolate — narrow the problem to one component
  3. Diagnose — gather evidence before making changes
  4. Fix — apply the smallest change that resolves the issue
  5. Verify — confirm the fix and check for side effects
  6. Document — write the postmortem so the team learns

Use this handbook as your starting point. Over time, your own production incidents will fill in the gaps that no handbook can cover.
