Real Kubernetes Production Debugging Examples

“This guide is part of the Production Kubernetes Debugging Handbook — a complete reference for debugging production Kubernetes clusters.”

Why Real Examples Matter

Kubernetes documentation teaches you what each component does. Tutorials show you how to deploy applications. Neither prepares you for a 2 AM alert where the symptom and the root cause are three layers apart.

Real production incidents are messy. The first thing you see is never the actual problem. A pod shows CrashLoopBackOff — but the real cause is a deleted secret in a different namespace. A service is returning 503 — but the root cause is a liveness probe hitting a database health check that is throttled by a scheduled report. Nodes enter DiskPressure — but the cause is a batch job that has been silently accumulating container images for three weeks.

This guide documents seven real Kubernetes production incidents in full detail — the alert, the investigation path, the wrong turns, the root cause, the fix, and the lesson. Each one is a pattern you will encounter in your own clusters.


How to Read These Examples

Each incident follows the same structure:

  • The alert — what triggered the on-call response
  • Initial triage — the first commands run and what they showed
  • Investigation — the path from symptom to root cause, including wrong turns
  • Root cause — what was actually wrong
  • Fix — the commands that resolved it
  • Time to resolution — how long it took
  • Lesson — what to change to prevent recurrence

Incident 1 — CrashLoopBackOff After Deployment (Missing Secret)

The alert: PagerDuty fires at 11:45 PM. The payment service pod count has dropped from 5 to 0 healthy pods. All 5 pods are in CrashLoopBackOff. The deployment was pushed 12 minutes ago.

Initial triage:

kubectl get pods -n payments
# payment-svc-7d9f6b-xk2p9   0/1   CrashLoopBackOff   4   8m

kubectl logs payment-svc-7d9f6b-xk2p9 -n payments --previous
# Error: failed to connect to database: connection refused
# dial tcp 10.0.1.45:5432: connect: connection refused

Database connection failure. But the database had not changed — why would a new deployment break the DB connection?

Investigation:

kubectl describe pod payment-svc-7d9f6b-xk2p9 -n payments | grep -A20 Environment

# DB_HOST:     postgres-svc
# DB_PORT:     5432
# DB_PASSWORD: <set to the key 'password' in secret 'db-credentials-v2'>

kubectl get secret db-credentials-v2 -n payments
# Error from server (NotFound): secrets "db-credentials-v2" not found

Wrong turn: The first instinct was to check the database itself — querying the PostgreSQL pod, checking connection logs. Spent 4 minutes there before realizing the database was healthy.

Root cause: The new deployment bumped the secret reference from db-credentials-v1 to db-credentials-v2. The new secret had been created in the infrastructure namespace for testing but was never deployed to the payments namespace.
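A fast way to confirm where a missing secret actually lives is to search every namespace for it. A sketch, assuming cluster-wide list permission on secrets and jq installed:

```shell
# List every namespace that contains a secret with this name
kubectl get secrets --all-namespaces -o json | jq -r '
  .items[]
  | select(.metadata.name == "db-credentials-v2")
  | .metadata.namespace'
```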

Fix:

# Copy the secret across namespaces, stripping fields the API server
# rejects on create (uid, resourceVersion, creationTimestamp)
kubectl get secret db-credentials-v2 -n infrastructure -o json | \
  jq 'del(.metadata.uid, .metadata.resourceVersion, .metadata.creationTimestamp) |
      .metadata.namespace = "payments"' | \
  kubectl apply -f -

kubectl rollout restart deployment/payment-svc -n payments

kubectl get pods -n payments -w
# All 5 pods Running within 45 seconds

Time to resolution: 18 minutes.

Lesson: Secret references must be validated before rollout. Add a pre-deployment check to your CI pipeline:

# Pre-deploy validation script
for secret in $(kubectl get deployment $APP -o json | jq -r '.. | .secretKeyRef?.name // empty' | sort -u); do
  kubectl get secret $secret -n $NAMESPACE > /dev/null 2>&1 || \
    echo "ERROR: Secret $secret not found in namespace $NAMESPACE"
done

Incident 2 — Node DiskPressure Cascade (Accumulated Images)

The alert: Monday morning. Prometheus fires simultaneous KubeNodeNotReady alerts for two nodes. Two out of five production nodes enter NotReady. Pod scheduling failures cascade across 12 namespaces.

Initial triage:

kubectl get nodes
# aks-nodepool-001   NotReady   agent   45d
# aks-nodepool-002   NotReady   agent   45d

kubectl describe node aks-nodepool-001 | grep -A5 Conditions
# DiskPressure: True

kubectl describe node aks-nodepool-002 | grep -A5 Conditions
# DiskPressure: True

Both nodes with DiskPressure simultaneously pointed to a shared root cause, not independent failures.
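When several nodes are suspect, describing them one by one is slow. A sketch that prints the DiskPressure condition for every node in one pass, assuming jq is installed:

```shell
# Print each node's DiskPressure status in one pass
kubectl get nodes -o json | jq -r '
  .items[]
  | .metadata.name + "  DiskPressure=" +
    (.status.conditions[] | select(.type == "DiskPressure") | .status)'
```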

Investigation:

# SSH into node 1
df -h
# /dev/sda1   99%   /

du -sh /var/lib/containerd/*
# 47G   /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs

crictl images | wc -l
# 312 images

# Inspect the image list for patterns
crictl images | tail -5
# Most images are from the same batch job, pulled daily for 3 weeks

Wrong turn: Initially suspected a logging issue — checked /var/log/pods first. Logs were large but not the primary cause. The image directory was 10x larger.

Root cause: A nightly ML inference batch job pulled a new 8GB model container image each run without ever cleaning up previous versions. Over 22 days, 312 image versions had accumulated on both nodes. Both nodes were provisioned on the same day during a cluster upgrade and hit the 85% disk threshold at roughly the same time.
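To see which images are actually consuming the disk, per-image sizes can be ranked on the node. A sketch; the JSON field names (`images`, `size`, `repoTags`) are assumed from crictl's output and may vary by version:

```shell
# Largest container images first (run on the node, needs jq)
crictl images -o json | jq -r '
  .images[] | "\(.size)\t\(.repoTags[0] // "<untagged>")"' |
  sort -rn | head -10
```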

Fix:

# Immediate cleanup on both nodes
crictl rmi --prune
# Freed 43GB on node 1, 41GB on node 2

# If the nodes were cordoned during triage, re-enable scheduling
# (the DiskPressure taint itself clears automatically after cleanup)
kubectl uncordon aks-nodepool-001
kubectl uncordon aks-nodepool-002

# Verify nodes are accepting pods
kubectl get nodes
# Both Ready

Long-term fix:

# /var/lib/kubelet/config.yaml — lower image GC thresholds
imageGCHighThresholdPercent: 75
imageGCLowThresholdPercent: 60

A preStop hook that runs crictl rmi --prune inside the batch job container is tempting but does not work: crictl and the containerd socket live on the host, not inside the container. Image cleanup has to happen at the node level, which is exactly what kubelet garbage collection does once the thresholds are lowered.

Time to resolution: 31 minutes.

Lesson: Default kubelet image GC thresholds (85%/80%) are too conservative for nodes running batch jobs with large images. Lower them in every production cluster. Add a Prometheus alert on node disk usage at 70% — not 85%.
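Until that alert exists, disk usage can be spot-checked across nodes without SSH via the kubelet stats summary endpoint proxied through the API server. A sketch, assuming the /stats/summary endpoint is reachable and jq is installed:

```shell
# Warn on any node whose root filesystem is above 70% full
for node in $(kubectl get nodes -o name | cut -d/ -f2); do
  pct=$(kubectl get --raw "/api/v1/nodes/${node}/proxy/stats/summary" |
    jq -r '(.node.fs.usedBytes / .node.fs.capacityBytes * 100) | floor')
  [ "$pct" -ge 70 ] && echo "WARN: ${node} root fs at ${pct}%"
done
```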


Incident 3 — Intermittent 503 Caused by Liveness Probe

The alert: Customer support reports intermittent 503 errors on the public API. Errors appear for 5–10 seconds every 15–20 minutes. No deployment has occurred in the past 6 hours.

Initial triage:

kubectl get pods -n api-gateway
# All pods Running — nothing obviously wrong

kubectl get ingress -n api-gateway
# Ingress exists, looks correct

kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx --tail=100
# upstream timed out (110: Connection timed out) while reading response header
# connect() failed (111: Connection refused) while connecting to upstream

Upstream timeouts — the Ingress controller can reach the pods but pods are not responding.

Investigation:

kubectl get pods -n api-gateway
# api-svc-6b8d9f-xp2k1   1/1   Running   2   14m  <- 2 recent restarts
# api-svc-6b8d9f-mn7q2   1/1   Running   2   12m

kubectl describe pod api-svc-6b8d9f-xp2k1 -n api-gateway | grep -A5 "Last State"
# Reason: Completed
# Exit Code: 143    <- SIGTERM, Kubernetes killed it

kubectl describe pod api-svc-6b8d9f-xp2k1 -n api-gateway | grep -A10 livenessProbe
# httpGet path: /health
# periodSeconds: 10
# failureThreshold: 3
# initialDelaySeconds: 5

kubectl logs api-svc-6b8d9f-xp2k1 -n api-gateway --previous | tail -20
# [INFO] GET /health 503 Service Unavailable 1240ms
# [INFO] Database connection pool exhausted (12/12 connections in use)

Wrong turn: Spent 8 minutes investigating the Ingress controller configuration, checking SSL settings and upstream timeouts. The issue was downstream, not in the Ingress layer.

Root cause: The /health endpoint performed a database connection check on every liveness probe call. Every 15–20 minutes, a scheduled reporting job exhausted the database connection pool for 5–10 seconds. During this window, /health returned 503. After 3 consecutive failures (30 seconds), Kubernetes killed the pod and restarted it. During the restart window, traffic hit remaining pods which were also failing their health checks.

Fix:

# Separate liveness from readiness probe
livenessProbe:
  httpGet:
    path: /ping           # returns 200 immediately, no DB check
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 20
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /health         # full DB check — removes pod from LB if failing
    port: 8080
  periodSeconds: 10
  failureThreshold: 2

Time to resolution: 47 minutes.

Lesson: Liveness probes must never check external dependencies. A liveness probe that calls a database will restart your pods during any database hiccup. Add this rule to your Kubernetes deployment standards and enforce it in policy (OPA Gatekeeper or Kyverno).
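Before a policy engine is in place, a one-off audit can at least surface every HTTP liveness probe path for manual review. A sketch; it requires jq, and a path name alone cannot prove a database dependency:

```shell
# List every deployment's liveness probe path for review
kubectl get deployments --all-namespaces -o json | jq -r '
  .items[] as $d
  | $d.spec.template.spec.containers[]
  | select(.livenessProbe.httpGet != null)
  | "\($d.metadata.namespace)/\($d.metadata.name): \(.livenessProbe.httpGet.path)"'
```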


Incident 4 — HPA Silent Failure During Traffic Spike

The alert: Friday 3:15 PM. API latency climbs from 120ms to 4,200ms. CPU on all pods is at 94%. HPA should have scaled up 20 minutes ago.

Initial triage:

kubectl get hpa -n ecommerce
# NAME      REFERENCE         TARGETS         MINPODS   MAXPODS   REPLICAS
# api-hpa   Deployment/api    <unknown>/50%   3         20        3

# <unknown> means HPA cannot read metrics

Investigation:

kubectl describe hpa api-hpa -n ecommerce
# Conditions:
#   ScalingActive: False
#   Reason: FailedGetResourceMetric
#   Message: unable to get metrics for resource cpu

kubectl get pods -n kube-system | grep metrics-server
# metrics-server-7d9c8b-xp2k   0/1   CrashLoopBackOff   412   2d

kubectl logs -n kube-system metrics-server-7d9c8b-xp2k --previous
# E0308 Failed to scrape node "aks-nodepool-003"
# x509: certificate signed by unknown authority

Wrong turn: Initially checked the HPA configuration itself, suspecting a misconfigured target utilization. The HPA config was correct — the problem was the metrics source, not the HPA.

Root cause: A cluster certificate rotation had been performed two days earlier. The metrics-server had not been updated with the --kubelet-insecure-tls flag. It had been in CrashLoopBackOff for two days, silently. Because existing workloads were handling normal traffic, nobody noticed. The HPA appeared healthy on dashboards because it showed a replica count — not that its scaling was non-functional.

Fix:

# Immediate: patch metrics-server to bypass TLS verification
kubectl patch deployment metrics-server -n kube-system --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-",
  "value":"--kubelet-insecure-tls"}]'

# Verify metrics-server recovers
kubectl get pods -n kube-system | grep metrics-server
# Running after 60 seconds

# HPA immediately reads metrics and begins scaling
kubectl get hpa -n ecommerce
# TARGETS: 94%/50%

# Manually scale to recover faster while HPA catches up
kubectl scale deployment api -n ecommerce --replicas=12

# Monitor recovery
kubectl get hpa -n ecommerce -w

Time to resolution: 22 minutes.

Lesson: Add these two Prometheus alerts:

# Alert 1: metrics-server is down
- alert: MetricsServerDown
  expr: absent(up{job="metrics-server"} == 1)
  for: 5m

# Alert 2: HPA scaling is disabled
- alert: HPAScalingDisabled
  expr: kube_horizontalpodautoscaler_status_condition{condition="ScalingActive", status="false"} == 1
  for: 10m

An HPA that cannot scale provides zero protection. Alert on the condition, not just on the symptom.
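Until those alerts exist, the same condition can be checked ad hoc from the CLI. A sketch, assuming jq is installed:

```shell
# Find every HPA whose ScalingActive condition is False
kubectl get hpa --all-namespaces -o json | jq -r '
  .items[]
  | select(any(.status.conditions[]?; .type == "ScalingActive" and .status == "False"))
  | "\(.metadata.namespace)/\(.metadata.name): scaling inactive"'
```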


Incident 5 — StatefulSet Stuck After Node Pool Upgrade

The alert: After a planned AKS node pool upgrade, 3 of 5 Kafka pods enter Pending and are still Pending 25 minutes later.

Initial triage:

kubectl get pods -n messaging -l app=kafka
# kafka-0   Running
# kafka-1   Running
# kafka-2   Pending
# kafka-3   Pending
# kafka-4   Pending

kubectl describe pod kafka-2 -n messaging
# Warning FailedScheduling: pod has unbound immediate PersistentVolumeClaims

kubectl get pvc -n messaging
# data-kafka-2   Pending
# data-kafka-3   Pending
# data-kafka-4   Pending

kubectl describe pvc data-kafka-2 -n messaging
# ProvisioningFailed: storageclass.storage.k8s.io "premium-zrs" not found

Wrong turn: Initially suspected the node pool upgrade had changed something about disk attachment. Spent 10 minutes checking VolumeAttachments and node labels before realizing the StorageClass was simply missing.

Root cause: The premium-zrs StorageClass had been deleted during an infrastructure cleanup three weeks earlier. Existing bound PVCs were unaffected — but when the upgrade evicted and rescheduled the StatefulSet pods, they needed to provision new PVCs. The provisioner could not find the StorageClass.

Fix:

# Recreate the StorageClass
kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: premium-zrs
provisioner: disk.csi.azure.com
parameters:
  skuName: Premium_ZRS
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
EOF

# PVCs bind within 2 minutes
kubectl get pvc -n messaging -w

# Pods recover automatically
kubectl get pods -n messaging
# All 5 Running within 5 minutes

Time to resolution: 41 minutes.

Lesson: Add a pre-deletion check to your runbooks for any StorageClass removal:

# Run this before deleting any StorageClass
kubectl get pvc --all-namespaces -o json | \
  jq --arg sc "premium-zrs" \
  '.items[] | select(.spec.storageClassName==$sc) |
  {namespace: .metadata.namespace, pvc: .metadata.name}'

StatefulSets that currently have bound PVCs will survive the deletion — but any future scaling, rescheduling, or new pod provisioning will fail silently until you are debugging it at 2 AM during an upgrade.
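The inverse check is also worth automating: list every StorageClass referenced by a StatefulSet volumeClaimTemplate that no longer exists in the cluster. A sketch, assuming bash (for process substitution) and jq:

```shell
# StorageClasses referenced by StatefulSets but missing from the cluster
comm -23 \
  <(kubectl get statefulsets --all-namespaces -o json |
      jq -r '.items[].spec.volumeClaimTemplates[]?.spec.storageClassName // empty' |
      sort -u) \
  <(kubectl get storageclass -o name | cut -d/ -f2 | sort -u)
```

Any name this prints is a provisioning failure waiting for the next rescheduling event.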


Incident 6 — DNS Failure Caused by NetworkPolicy

The alert: After the security team deploys new NetworkPolicies across all namespaces for PCI DSS compliance, services in the payments namespace begin failing every inter-service call. The failure is total: no service can reach any other service.

Initial triage:

kubectl get pods -n payments
# All pods Running

kubectl exec -it payment-svc-7d9f-xp2k -n payments -- curl http://fraud-detection-svc
# curl: (6) Could not resolve host: fraud-detection-svc

Could not resolve host — DNS failure.

Investigation:

kubectl exec -it payment-svc-7d9f-xp2k -n payments -- nslookup kubernetes.default
# ;; connection timed out; no servers could be reached

CoreDNS itself is unreachable from the payments namespace.

kubectl get networkpolicy -n payments
# pci-default-deny   <none>   5m

kubectl describe networkpolicy pci-default-deny -n payments
# PolicyTypes: Ingress, Egress
# Egress: <none>   <- all egress denied including UDP/53 to CoreDNS

Wrong turn: Initially restarted CoreDNS pods thinking they were unhealthy. CoreDNS was perfectly healthy — the issue was that the pods in payments could not reach it.

Root cause: The new default-deny egress policy blocked all outbound traffic from payments pods — including DNS queries to CoreDNS on UDP port 53. Without DNS, every hostname lookup fails, making all service-to-service communication impossible regardless of what other NetworkPolicy rules allow.

Fix:

kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: payments
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
EOF

# Test DNS immediately
kubectl exec -it payment-svc-7d9f-xp2k -n payments -- nslookup kubernetes.default
# Name: kubernetes.default.svc.cluster.local
# Address: 10.96.0.1

Time to resolution: 11 minutes.

Lesson: Every default-deny egress NetworkPolicy must include a DNS allowance as a standard first rule. Add this to your NetworkPolicy templates and make it a required check in your security policy review process. Without DNS, no application in that namespace can function regardless of how well you configure everything else.
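A rough audit for the same gap: list every egress NetworkPolicy that never mentions port 53. A sketch, assuming jq; it can report false positives when DNS is allowed by a separate policy in the same namespace:

```shell
# Egress policies with no port-53 rule anywhere
kubectl get networkpolicy --all-namespaces -o json | jq -r '
  .items[]
  | select(.spec.policyTypes // [] | index("Egress"))
  | select([.spec.egress[]?.ports[]? | select(.port == 53)] | length == 0)
  | "\(.metadata.namespace)/\(.metadata.name)"'
```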


Incident 7 — Memory Leak Causing Gradual Node Exhaustion

The alert: No single alert — a senior engineer notices on Monday morning that node memory usage has been climbing steadily for 9 days on a cluster monitoring dashboard. Three nodes are between 85% and 89% memory utilization. One more week and they will hit MemoryPressure.

Initial triage:

kubectl top nodes
# aks-nodepool-001   CPU: 34%   Memory: 89%
# aks-nodepool-002   CPU: 31%   Memory: 87%
# aks-nodepool-003   CPU: 28%   Memory: 85%

kubectl top pods --all-namespaces --sort-by=memory | head -10
# NAMESPACE   NAME                          CPU(cores)   MEMORY(bytes)
# analytics   analytics-worker-6b8d-xp2k1   112m         2150Mi
# analytics   analytics-worker-6b8d-mn3q2   108m         1945Mi
# analytics   analytics-worker-6b8d-kx9p3   104m         1843Mi

Three analytics workers consuming nearly 6Gi between them, and the numbers are growing.

Investigation:

# Check their memory limits
kubectl get pod analytics-worker-6b8d-xp2k1 -o yaml | grep -A10 resources
# resources:
#   requests:
#     memory: 512Mi
#   limits:          <- no memory limit set
#     cpu: 500m

# Check how long they have been running
kubectl get pods -n analytics
# analytics-worker-6b8d-xp2k1   Running   9d    <- 9 days without restart

# Check memory growth over time via Prometheus
# container_memory_working_set_bytes{namespace="analytics"} increasing linearly for 9 days

Root cause: An analytics worker that processed streaming data had a Python memory leak — objects were being appended to an in-memory list without ever being flushed. The pod had no memory limit, so Kubernetes could not OOMKill it. It grew slowly from 512Mi to over 2Gi over 9 days. With no memory limit, the pod would continue growing until the node itself ran out of memory.

Fix:

# Immediate: restart the analytics workers to reclaim memory
kubectl rollout restart deployment/analytics-worker -n analytics

# Verify memory returns to baseline
kubectl top pods -n analytics
# All three pods now consuming ~520Mi

# Add memory limits to prevent recurrence
kubectl edit deployment analytics-worker -n analytics
# Add:
# limits:
#   memory: 1Gi    <- will OOMKill and restart before consuming node memory

Long-term fix: The Python application needed a proper fix — using a deque with maxlen instead of a list, or periodic flushing. Memory limits buy time but do not fix the underlying leak.

Time to resolution: 25 minutes (for immediate fix), several days for application fix.

Lesson: Pods without memory limits are a cluster reliability risk. Enforce memory limits via LimitRange or OPA policy in every production namespace. A weekly review of kubectl top pods --sort-by=memory catches memory leaks before they become 2 AM incidents.

# Weekly check: find pods that have no memory limits
kubectl get pods --all-namespaces -o json | \
  jq '.items[] | select(any(.spec.containers[]; .resources.limits.memory == null)) |
  {namespace: .metadata.namespace, name: .metadata.name}'
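The LimitRange enforcement mentioned in the lesson can be as small as a default memory limit per container, applied namespace by namespace. A sketch; the name, namespace, and sizes below are illustrative:

```shell
kubectl apply -f - <<EOF
apiVersion: v1
kind: LimitRange
metadata:
  name: default-memory-limits
  namespace: analytics
spec:
  limits:
  - type: Container
    default:
      memory: 1Gi          # applied when a container sets no limit
    defaultRequest:
      memory: 512Mi        # applied when a container sets no request
EOF
```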

Patterns Across All Incidents

Looking at these seven incidents together, four patterns emerge:

The symptom is always one layer above the root cause. CrashLoopBackOff was a missing secret. 503 errors were a liveness probe. DNS failure was a NetworkPolicy. The instinct to investigate the symptom layer first costs time — train yourself to ask “what could cause this symptom from one layer below?” before diving in.

Gradual failures are the hardest to catch. The DiskPressure cascade and the memory leak both built up over days or weeks. Point-in-time monitoring is not enough — track trends. A node at 70% disk usage is not an alert, but disk usage trending up 3% per day is.

Silent failures are more dangerous than loud ones. The HPA was not scaling for two days. The StorageClass was missing for three weeks. Neither produced an alert until an incident triggered the path that needed them. Audit your cluster’s assumed capabilities regularly — not just its current health.

Every incident has a preventable postmortem action. Missing secret validation belongs in the CI pipeline. DNS egress belongs in every NetworkPolicy template. Memory limits belong in every LimitRange. The goal of the postmortem is not to explain what happened — it is to make the same incident impossible to repeat.


Quick Reference — Incident Diagnosis Starting Points

Symptom            | First command                           | Most likely layer
CrashLoopBackOff   | kubectl logs --previous                 | Application config, missing secret
Pods Pending       | kubectl describe pod (Events)           | Scheduling, resources, PVC
Node NotReady      | kubectl describe node (Conditions)      | Kubelet, disk, memory, CNI
DNS failure        | nslookup kubernetes.default from a pod  | CoreDNS, NetworkPolicy
503 from Ingress   | Ingress controller logs                 | Backend health, probe config
HPA not scaling    | kubectl get hpa (TARGETS column)        | metrics-server health
StatefulSet stuck  | kubectl get pvc (STATUS column)         | StorageClass, provisioner
Slow API server    | time kubectl get nodes                  | Admission webhooks, etcd
