“This guide is part of the Production Kubernetes Debugging Handbook — a complete reference for debugging production Kubernetes clusters.”
Why Real Examples Matter
Kubernetes documentation teaches you what each component does. Tutorials show you how to deploy applications. Neither prepares you for a 2 AM alert where the symptom and the root cause are three layers apart.
Real production incidents are messy. The first thing you see is never the actual problem. A pod shows CrashLoopBackOff — but the real cause is a deleted secret in a different namespace. A service is returning 503 — but the root cause is a liveness probe hitting a database health check that is throttled by a scheduled report. Nodes enter DiskPressure — but the cause is a batch job that has been silently accumulating container images for three weeks.
This guide documents seven real Kubernetes production incidents in full detail — the alert, the investigation path, the wrong turns, the root cause, the fix, and the lesson. Each one is a pattern you will encounter in your own clusters.
How to Read These Examples
Each incident follows the same structure:
- The alert — what triggered the on-call response
- Initial triage — the first commands run and what they showed
- Investigation — the path from symptom to root cause, including wrong turns
- Root cause — what was actually wrong
- Fix — the commands that resolved it
- Time to resolution — how long it took
- Lesson — what to change to prevent recurrence
Incident 1 — CrashLoopBackOff After Deployment (Missing Secret)
The alert: PagerDuty fires at 11:45 PM. The payment service pod count has dropped from 5 to 0 healthy pods. All 5 pods are in CrashLoopBackOff. The deployment was pushed 12 minutes ago.
Initial triage:
kubectl get pods -n payments
# payment-svc-7d9f6b-xk2p9 0/1 CrashLoopBackOff 4 8m
kubectl logs payment-svc-7d9f6b-xk2p9 -n payments --previous
# Error: failed to connect to database: connection refused
# dial tcp 10.0.1.45:5432: connect: connection refused
Database connection failure. But the database had not changed — why would a new deployment break the DB connection?
Investigation:
kubectl describe pod payment-svc-7d9f6b-xk2p9 -n payments | grep -A20 Environment
# DB_HOST: postgres-svc
# DB_PORT: 5432
# DB_PASSWORD: <set to the key 'password' in secret 'db-credentials-v2'>
kubectl get secret db-credentials-v2 -n payments
# Error from server (NotFound): secrets "db-credentials-v2" not found
Wrong turn: The first instinct was to check the database itself — querying the PostgreSQL pod, checking connection logs. Spent 4 minutes there before realizing the database was healthy.
Root cause: The new deployment bumped the secret reference from db-credentials-v1 to db-credentials-v2. The new secret had been created in the infrastructure namespace for testing but was never deployed to the payments namespace.
Fix:
kubectl get secret db-credentials-v2 -n infrastructure -o yaml | \
  sed 's/namespace: infrastructure/namespace: payments/' | \
  kubectl apply -f -
# (In practice, also strip uid, resourceVersion, and creationTimestamp
# from the copied manifest, or the apply may be rejected.)
kubectl rollout restart deployment/payment-svc -n payments
kubectl get pods -n payments -w
# All 5 pods Running within 45 seconds
Time to resolution: 18 minutes.
Lesson: Secret references must be validated before rollout. Add a pre-deployment check to your CI pipeline:
# Pre-deploy validation script
for secret in $(kubectl get deployment "$APP" -n "$NAMESPACE" -o json | \
    jq -r '.. | .secretKeyRef?.name // empty' | sort -u); do
  kubectl get secret "$secret" -n "$NAMESPACE" > /dev/null 2>&1 || \
    echo "ERROR: Secret $secret not found in namespace $NAMESPACE"
done
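The jq path above only catches env-var secretKeyRef references. A broader sweep also catches envFrom and secret volumes. This is a sketch, not the incident's actual tooling; the function name is illustrative, and the input is the parsed JSON from kubectl get deployment -o json:

```python
def collect_secret_refs(obj):
    """Recursively collect every Secret name a manifest references via
    env (secretKeyRef), envFrom (secretRef), or volumes (secret.secretName)."""
    refs = set()
    if isinstance(obj, dict):
        for key, val in obj.items():
            if key == "secretKeyRef" and isinstance(val, dict) and "name" in val:
                refs.add(val["name"])        # env: valueFrom.secretKeyRef
            elif key == "secretRef" and isinstance(val, dict) and "name" in val:
                refs.add(val["name"])        # envFrom: secretRef
            elif key == "secret" and isinstance(val, dict) and "secretName" in val:
                refs.add(val["secretName"])  # volumes: secret.secretName
            refs |= collect_secret_refs(val)
    elif isinstance(obj, list):
        for item in obj:
            refs |= collect_secret_refs(item)
    return refs

# Pair with: kubectl get deployment $APP -n $NAMESPACE -o json,
# then verify each returned name with kubectl get secret, as in the loop above.
```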
Incident 2 — Node DiskPressure Cascade (Accumulated Images)
The alert: Monday morning. Prometheus fires two simultaneous KubeNodeNotReady alerts: two of the five production nodes have gone NotReady, and pod scheduling failures cascade across 12 namespaces.
Initial triage:
kubectl get nodes
# aks-nodepool-001 NotReady agent 45d
# aks-nodepool-002 NotReady agent 45d
kubectl describe node aks-nodepool-001 | grep -A5 Conditions
# DiskPressure: True
kubectl describe node aks-nodepool-002 | grep -A5 Conditions
# DiskPressure: True
Both nodes with DiskPressure simultaneously pointed to a shared root cause, not independent failures.
Investigation:
# SSH into node 1
df -h
# /dev/sda1 99% /
du -sh /var/lib/containerd/*
# 47G /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs
crictl images | wc -l
# 312 images
# Inspect the accumulated images
crictl images | tail -5
# Nearly all belong to the same batch job image, one new version pulled daily for 3 weeks
Wrong turn: Initially suspected a logging issue — checked /var/log/pods first. Logs were large but not the primary cause. The image directory was 10x larger.
Root cause: A nightly ML inference batch job pulled a new 8GB model container image on every run and never removed previous versions. Over 22 days, stale versions accumulated until each node held 312 images. Both nodes had been provisioned on the same day during a cluster upgrade, so they crossed the 85% disk threshold at almost the same time.
Fix:
# Immediate cleanup on both nodes
crictl rmi --prune
# Freed 43GB on node 1, 41GB on node 2
kubectl uncordon aks-nodepool-001
kubectl uncordon aks-nodepool-002
# Verify nodes are accepting pods
kubectl get nodes
# Both Ready
Long-term fix:
# /var/lib/kubelet/config.yaml — lower GC thresholds
imageGCHighThresholdPercent: 75
imageGCLowThresholdPercent: 60
# Note: a preStop hook inside the job container cannot prune node images,
# because crictl talks to the node's container runtime socket, which
# application containers cannot reach. Rely on the kubelet GC settings
# above, or run a privileged cleanup DaemonSet if per-job pruning is needed.
Time to resolution: 31 minutes.
Lesson: The default kubelet image GC thresholds (85% high, 80% low) kick in too late for nodes running batch jobs with large images. Lower them in every production cluster, and add a Prometheus alert on node disk usage at 70%, not 85%.
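One possible shape for that 70% alert, assuming node-exporter metrics are available. The alert name, threshold, and duration are suggestions, not from the incident:

```yaml
- alert: NodeRootDiskUsageHigh
  expr: (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) > 0.70
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "Root filesystem on {{ $labels.instance }} is over 70% full"
```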
Incident 3 — Intermittent 503 Caused by Liveness Probe
The alert: Customer support reports intermittent 503 errors on the public API. Errors appear for 5–10 seconds every 15–20 minutes. No deployment has occurred in the past 6 hours.
Initial triage:
kubectl get pods -n api-gateway
# All pods Running — nothing obviously wrong
kubectl get ingress -n api-gateway
# Ingress exists, looks correct
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx --tail=100
# upstream timed out (110: Connection timed out) while reading response header
# connect() failed (111: Connection refused) while connecting to upstream
Upstream timeouts — the Ingress controller can reach the pods but pods are not responding.
Investigation:
kubectl get pods -n api-gateway
# api-svc-6b8d9f-xp2k1 1/1 Running 2 14m <- 2 recent restarts
# api-svc-6b8d9f-mn7q2 1/1 Running 2 12m
kubectl describe pod api-svc-6b8d9f-xp2k1 -n api-gateway | grep -A5 "Last State"
# Reason: Completed
# Exit Code: 143 <- SIGTERM, Kubernetes killed it
kubectl describe pod api-svc-6b8d9f-xp2k1 -n api-gateway | grep -A10 livenessProbe
# httpGet path: /health
# periodSeconds: 10
# failureThreshold: 3
# initialDelaySeconds: 5
kubectl logs api-svc-6b8d9f-xp2k1 -n api-gateway --previous | tail -20
# [INFO] GET /health 503 Service Unavailable 1240ms
# [INFO] Database connection pool exhausted (12/12 connections in use)
Wrong turn: Spent 8 minutes investigating the Ingress controller configuration, checking SSL settings and upstream timeouts. The issue was downstream, not in the Ingress layer.
Root cause: The /health endpoint performed a database connection check on every liveness probe call. Every 15–20 minutes, a scheduled reporting job exhausted the database connection pool for 5–10 seconds. During this window, /health returned 503. After 3 consecutive failures (30 seconds), Kubernetes killed the pod and restarted it. During the restart window, traffic hit remaining pods which were also failing their health checks.
Fix:
# Separate liveness from readiness probe
livenessProbe:
  httpGet:
    path: /ping          # returns 200 immediately, no DB check
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 20
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /health        # full DB check, removes pod from LB if failing
    port: 8080
  periodSeconds: 10
  failureThreshold: 2
Time to resolution: 47 minutes.
Lesson: Liveness probes must never check external dependencies. A liveness probe that calls a database will restart your pods during any database hiccup. Add this rule to your Kubernetes deployment standards and enforce it in policy (OPA Gatekeeper or Kyverno).
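The application side of this rule is a liveness endpoint that never consults a dependency. A minimal, framework-free sketch of the routing logic; the function and paths are illustrative, not the incident's actual code:

```python
def health_status(path: str, db_ok: bool) -> int:
    """Map a probe path to an HTTP status code.

    /ping   (liveness): always 200 while the process runs; no dependencies.
    /health (readiness): reflects the DB check, so a failing dependency
    removes the pod from the load balancer instead of restarting it.
    """
    if path == "/ping":
        return 200
    if path == "/health":
        return 200 if db_ok else 503
    return 404
```

The key property: a database outage changes the readiness answer but never the liveness answer, so Kubernetes stops routing traffic without killing pods.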
Incident 4 — HPA Silent Failure During Traffic Spike
The alert: Friday 3:15 PM. API latency climbs from 120ms to 4,200ms. CPU on all pods is at 94%. HPA should have scaled up 20 minutes ago.
Initial triage:
kubectl get hpa -n ecommerce
# NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS
# api-hpa Deployment/api <unknown>/50% 3 20 3
# <unknown> means HPA cannot read metrics
Investigation:
kubectl describe hpa api-hpa -n ecommerce
# Conditions:
# ScalingActive: False
# Reason: FailedGetResourceMetric
# Message: unable to get metrics for resource cpu
kubectl get pods -n kube-system | grep metrics-server
# metrics-server-7d9c8b-xp2k 0/1 CrashLoopBackOff 8 35m
kubectl logs -n kube-system metrics-server-7d9c8b-xp2k --previous
# E0308 Failed to scrape node "aks-nodepool-003"
# x509: certificate signed by unknown authority
Wrong turn: Initially checked the HPA configuration itself, suspecting a misconfigured target utilization. The HPA config was correct — the problem was the metrics source, not the HPA.
Root cause: A cluster certificate rotation had been performed two days earlier. After it, metrics-server could no longer verify the kubelets' new serving certificates and had been in CrashLoopBackOff for two days, silently. Because existing workloads were handling normal traffic, nobody noticed. The HPA appeared healthy on dashboards because it still reported a replica count, not the fact that its scaling was non-functional.
Fix:
# Immediate: patch metrics-server to bypass TLS verification
kubectl patch deployment metrics-server -n kube-system --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"}]'
# Verify metrics-server recovers
kubectl get pods -n kube-system | grep metrics-server
# Running after 60 seconds
# HPA immediately reads metrics and begins scaling
kubectl get hpa -n ecommerce
# TARGETS: 94%/50%
# Manually scale to recover faster while HPA catches up
kubectl scale deployment api -n ecommerce --replicas=12
# Monitor recovery
kubectl get hpa -n ecommerce -w
Time to resolution: 22 minutes.
Lesson: Add these two Prometheus alerts:
# Alert 1: metrics-server is down
- alert: MetricsServerDown
  expr: absent(up{job="metrics-server"} == 1)
  for: 5m
# Alert 2: HPA scaling is disabled
- alert: HPAScalingDisabled
  expr: kube_horizontalpodautoscaler_status_condition{condition="ScalingActive", status="false"} == 1
  for: 10m
An HPA that cannot scale provides zero protection. Alert on the condition, not just on the symptom. And treat --kubelet-insecure-tls as a stopgap: the durable fix is to give metrics-server the rotated CA so certificate verification works again.
Incident 5 — StatefulSet Stuck After Node Pool Upgrade
The alert: After a planned AKS node pool upgrade, 3 of 5 Kafka pods enter Pending and do not recover after 25 minutes.
Initial triage:
kubectl get pods -n messaging -l app=kafka
# kafka-0 Running
# kafka-1 Running
# kafka-2 Pending
# kafka-3 Pending
# kafka-4 Pending
kubectl describe pod kafka-2 -n messaging
# Warning FailedScheduling: pod has unbound immediate PersistentVolumeClaims
kubectl get pvc -n messaging
# data-kafka-2 Pending
# data-kafka-3 Pending
# data-kafka-4 Pending
kubectl describe pvc data-kafka-2 -n messaging
# ProvisioningFailed: storageclass.storage.k8s.io "premium-zrs" not found
Wrong turn: Initially suspected the node pool upgrade had changed something about disk attachment. Spent 10 minutes checking VolumeAttachments and node labels before realizing the StorageClass was simply missing.
Root cause: The premium-zrs StorageClass had been deleted during an infrastructure cleanup three weeks earlier. Existing bound PVCs were unaffected — but when the upgrade evicted and rescheduled the StatefulSet pods, they needed to provision new PVCs. The provisioner could not find the StorageClass.
Fix:
# Recreate the StorageClass
kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: premium-zrs
provisioner: disk.csi.azure.com
parameters:
  skuName: Premium_ZRS
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
EOF
# PVCs bind within 2 minutes
kubectl get pvc -n messaging -w
# Pods recover automatically
kubectl get pods -n messaging
# All 5 Running within 5 minutes
Time to resolution: 41 minutes.
Lesson: Add a pre-deletion check to your runbooks for any StorageClass removal:
# Run this before deleting any StorageClass
kubectl get pvc --all-namespaces -o json | \
jq --arg sc "premium-zrs" \
'.items[] | select(.spec.storageClassName==$sc) |
{namespace: .metadata.namespace, pvc: .metadata.name}'
StatefulSets that currently have bound PVCs will survive the deletion, but any future scaling, rescheduling, or new pod provisioning will fail at PVC creation time, and you will discover it at 2 AM during an upgrade.
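The PVC check above misses one case: a StatefulSet whose existing PVCs are all bound but whose volumeClaimTemplates still name the class, so any future scale-up would need it. A companion sketch (function name illustrative) over the parsed output of kubectl get statefulsets -A -o json:

```python
def statefulsets_using_class(sts_list, storage_class):
    """Return 'namespace/name' for every StatefulSet whose
    volumeClaimTemplates reference the given StorageClass."""
    hits = []
    for sts in sts_list.get("items", []):
        templates = sts.get("spec", {}).get("volumeClaimTemplates") or []
        if any(t.get("spec", {}).get("storageClassName") == storage_class
               for t in templates):
            meta = sts["metadata"]
            hits.append(f"{meta['namespace']}/{meta['name']}")
    return hits
```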
Incident 6 — DNS Failure Caused by NetworkPolicy
The alert: After a security team deploys new NetworkPolicies across all namespaces for PCI DSS compliance, the payments microservice cluster begins failing all inter-service calls. The failure is total — no service can reach any other service.
Initial triage:
kubectl get pods -n payments
# All pods Running
kubectl exec -it payment-svc-7d9f-xp2k -n payments -- curl http://fraud-detection-svc
# curl: (6) Could not resolve host: fraud-detection-svc
Could not resolve host — DNS failure.
Investigation:
kubectl exec -it payment-svc-7d9f-xp2k -n payments -- nslookup kubernetes.default
# ;; connection timed out; no servers could be reached
CoreDNS itself is unreachable from the payments namespace.
kubectl get networkpolicy -n payments
# pci-default-deny <none> 5m
kubectl describe networkpolicy pci-default-deny -n payments
# PolicyTypes: Ingress, Egress
# Egress: <none> <- all egress denied including UDP/53 to CoreDNS
Wrong turn: Initially restarted CoreDNS pods thinking they were unhealthy. CoreDNS was perfectly healthy — the issue was that the pods in payments could not reach it.
Root cause: The new default-deny egress policy blocked all outbound traffic from payments pods — including DNS queries to CoreDNS on UDP port 53. Without DNS, every hostname lookup fails, making all service-to-service communication impossible regardless of what other NetworkPolicy rules allow.
Fix:
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: payments
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
EOF
# Test DNS immediately
kubectl exec -it payment-svc-7d9f-xp2k -n payments -- nslookup kubernetes.default
# Name: kubernetes.default.svc.cluster.local
# Address: 10.96.0.1
Time to resolution: 11 minutes.
Lesson: Every default-deny egress NetworkPolicy must include a DNS allowance as a standard first rule. Add this to your NetworkPolicy templates and make it a required check in your security policy review process. Without DNS, no application in that namespace can function regardless of how well you configure everything else.
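A tighter variant of that template scopes the allowance to the cluster DNS pods rather than port 53 anywhere. This sketch assumes the common upstream labels (kubernetes.io/metadata.name on the namespace, k8s-app: kube-dns on the CoreDNS pods); verify the labels in your cluster before adopting it:

```yaml
egress:
- to:
  - namespaceSelector:
      matchLabels:
        kubernetes.io/metadata.name: kube-system
    podSelector:
      matchLabels:
        k8s-app: kube-dns
  ports:
  - protocol: UDP
    port: 53
  - protocol: TCP
    port: 53
```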
Incident 7 — Memory Leak Causing Gradual Node Exhaustion
The alert: There is no single alert. A senior engineer notices on Monday morning that node memory usage has been climbing steadily for 9 days on a cluster monitoring dashboard. Three nodes sit between 85% and 89% memory utilization; one more week at this rate and they will hit MemoryPressure.
Initial triage:
kubectl top nodes
# aks-nodepool-001 CPU: 34% Memory: 89%
# aks-nodepool-002 CPU: 31% Memory: 87%
# aks-nodepool-003 CPU: 28% Memory: 85%
kubectl top pods --all-namespaces --sort-by=memory | head -10
# analytics   analytics-worker-6b8d-xp2k1   2150Mi
# analytics   analytics-worker-6b8d-mn3q2   1945Mi
# analytics   analytics-worker-6b8d-kx9p3   1843Mi
Three analytics workers consuming nearly 6Gi between them, and the usage is still growing.
Investigation:
# Check their memory limits
kubectl get pod analytics-worker-6b8d-xp2k1 -n analytics -o yaml | grep -A10 resources
# resources:
# requests:
# memory: 512Mi
# limits: <- no memory limit set
# cpu: 500m
# Check how long they have been running
kubectl get pods -n analytics
# analytics-worker-6b8d-xp2k1 Running 9d <- 9 days without restart
# Check memory growth over time via Prometheus
# container_memory_working_set_bytes{namespace="analytics"} increasing linearly for 9 days
Root cause: An analytics worker that processed streaming data had a Python memory leak — objects were being appended to an in-memory list without ever being flushed. The pod had no memory limit, so Kubernetes could not OOMKill it. It grew slowly from 512Mi to over 2Gi over 9 days. With no memory limit, the pod would continue growing until the node itself ran out of memory.
Fix:
# Immediate: restart the analytics workers to reclaim memory
kubectl rollout restart deployment/analytics-worker -n analytics
# Verify memory returns to baseline
kubectl top pods -n analytics
# All three pods now consuming ~520Mi
# Add memory limits to prevent recurrence
kubectl set resources deployment analytics-worker -n analytics --limits=memory=1Gi
# With a 1Gi limit, a leaking pod is OOMKilled and restarted
# long before it can exhaust node memory
Long-term fix: The application itself needed patching: a deque with maxlen instead of an unbounded list, or periodic flushing. Memory limits buy time but do not cure the underlying leak.
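A minimal sketch of that application-side change. The buffer size is illustrative, not the incident's actual value:

```python
from collections import deque

MAX_BUFFERED = 10_000  # illustrative cap; tune to the workload

# Unbounded list (the leak): grows forever unless explicitly flushed.
# Bounded deque (the fix): appending past maxlen silently drops the
# oldest entry, so memory is capped however long the worker runs.
recent_records = deque(maxlen=MAX_BUFFERED)

def ingest(record):
    recent_records.append(record)
```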
Time to resolution: 25 minutes (for immediate fix), several days for application fix.
Lesson: Pods without memory limits are a cluster reliability risk. Enforce memory limits via LimitRange or OPA policy in every production namespace. A weekly review of kubectl top pods --sort-by=memory catches memory leaks before they become 2 AM incidents.
# Weekly check: find pods with no memory limit on any container
kubectl get pods --all-namespaces -o json | \
  jq '.items[] | select(any(.spec.containers[]; .resources.limits.memory == null)) |
      {namespace: .metadata.namespace, name: .metadata.name}'
Patterns Across All Incidents
Looking at these seven incidents together, four patterns emerge:
The symptom is always one layer above the root cause. CrashLoopBackOff was a missing secret. 503 errors were a liveness probe. DNS failure was a NetworkPolicy. The instinct to investigate the symptom layer first costs time — train yourself to ask “what could cause this symptom from one layer below?” before diving in.
Gradual failures are the hardest to catch. The DiskPressure cascade and the memory leak both built up over days or weeks. Point-in-time monitoring is not enough — track trends. A node at 70% disk usage is not an alert, but disk usage trending up 3% per day is.
Silent failures are more dangerous than loud ones. The HPA was not scaling for two days. The StorageClass was missing for three weeks. Neither produced an alert until an incident triggered the path that needed them. Audit your cluster’s assumed capabilities regularly — not just its current health.
Every incident has a preventable postmortem action. Missing secret validation belongs in the CI pipeline. DNS egress belongs in every NetworkPolicy template. Memory limits belong in every LimitRange. The goal of the postmortem is not to explain what happened — it is to make the same incident impossible to repeat.
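The trend-not-level idea can be expressed directly in PromQL. A sketch using predict_linear, assuming node-exporter metrics; the alert name, 6-hour window, and 7-day horizon are suggestions:

```yaml
- alert: NodeDiskFillingUp
  expr: predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 7 * 24 * 3600) < 0
  for: 1h
  annotations:
    summary: "Root filesystem on {{ $labels.instance }} projected to fill within 7 days"
```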
Quick Reference — Incident Diagnosis Starting Points
| Symptom | First command | Most likely layer |
|---|---|---|
| CrashLoopBackOff | kubectl logs --previous | Application config, missing secret |
| Pods Pending | kubectl describe pod (Events section) | Scheduling, resources, PVC |
| Node NotReady | kubectl describe node (Conditions) | Kubelet, disk, memory, CNI |
| DNS failure | nslookup kubernetes.default from a pod | CoreDNS, NetworkPolicy |
| 503 from Ingress | Ingress controller logs | Backend health, probe config |
| HPA not scaling | kubectl get hpa (TARGETS column) | metrics-server health |
| StatefulSet stuck | kubectl get pvc (STATUS column) | StorageClass, provisioner |
| Slow API server | time kubectl get nodes | Admission webhooks, etcd |