How to Debug CrashLoopBackOff in Kubernetes

This guide is part of the Production Kubernetes Debugging Handbook, a complete reference for debugging production Kubernetes clusters.

What is CrashLoopBackOff?

If you have worked with Kubernetes for more than a week, you have seen it. A pod that should be running instead shows this:

NAME                        READY   STATUS             RESTARTS   AGE
payment-svc-7d9f6b-xk2p9   0/1     CrashLoopBackOff   6          12m

CrashLoopBackOff is not an error in itself — it is Kubernetes telling you that your container keeps starting and immediately crashing, and that it has applied a backoff delay before trying again.

The “BackOff” part is key. Kubernetes uses an exponential backoff strategy between each restart attempt:

Restart Attempt | Wait Before Next Restart
1st             | 10 seconds
2nd             | 20 seconds
3rd             | 40 seconds
4th             | 80 seconds
5th+            | Up to 5 minutes

This is why CrashLoopBackOff pods can sit broken for a long time without anyone noticing — restart attempts slow down dramatically, so the pod looks “stable” in a broken state.
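The schedule above is just a doubling delay with a cap. As an illustration only (this mimics the kubelet's documented behaviour, it is not kubelet code):

```bash
# Illustrative only: the restart delay doubles each attempt, capped at 300s.
delay=10
for attempt in 1 2 3 4 5 6 7; do
  echo "attempt ${attempt}: wait ${delay}s before next restart"
  delay=$(( delay * 2 ))
  if [ "$delay" -gt 300 ]; then delay=300; fi
done
```

Note that the kubelet resets this backoff once a container has run successfully for ten minutes, which is why a pod that crashes only occasionally may never show a high restart delay.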


Why CrashLoopBackOff Happens

CrashLoopBackOff is a symptom, not a root cause. The actual problem is always inside the container. Here are the most common reasons, roughly in order of frequency in production:

1. Application error on startup. The most common cause: your application starts, hits an unhandled exception, and exits with a non-zero code. This could be a missing config file, a failed database connection, or a bug in initialization code.

2. Missing or incorrect environment variables. The application expects DATABASE_URL but the pod spec references a Secret that does not exist, or references the wrong key name. The app fails the moment it reads the missing variable.

3. Incorrect container entrypoint or command. The CMD or ENTRYPOINT in the Dockerfile, or the command/args in the pod spec, points to a binary that does not exist in the image, or passes arguments the binary does not accept.

4. OOMKilled on startup. The memory limit is too low for the application to even initialise. The kernel's OOM killer terminates the container before it finishes starting. This is especially common with Java applications that allocate a large JVM heap during initialization.

5. Liveness probe failing too early. The liveness probe starts checking before the application has finished starting up, so Kubernetes kills the container as "unhealthy" before it has had a chance to become healthy. This is a configuration problem, not an application problem.

6. Init container failure. If one of a pod's init containers fails, the main container never starts. The pod keeps restarting the init container, which surfaces as Init:CrashLoopBackOff in the pod status.


How to Diagnose CrashLoopBackOff — Step by Step

Step 1 — Confirm the Status and Restart Count

bash

kubectl get pods -n <namespace>

Look at two things: the STATUS column and the RESTARTS column. A high restart count with a recent AGE tells you the pod is crashing fast and often.
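In a busy namespace it helps to filter down to only the crashing pods. The STATUS field is the third column of the default `kubectl get pods` output, so a simple `awk` filter works; here sample text stands in for live cluster output:

```bash
# Filter `kubectl get pods` output down to CrashLoopBackOff pods.
# The third whitespace-separated field is STATUS.
pods_output='payment-svc-7d9f6b-xk2p9    0/1   CrashLoopBackOff   6   12m
checkout-svc-5f4b8c-aa1b2   1/1   Running            0   3h'

printf '%s\n' "$pods_output" | awk '$3 == "CrashLoopBackOff"'

# Against a live cluster, the same filter is:
#   kubectl get pods -n <namespace> --no-headers | awk '$3 == "CrashLoopBackOff"'
```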

Step 2 — Check the Exit Code

The exit code tells you how the container died. This is the fastest way to narrow down the cause.

bash

kubectl describe pod <pod-name> -n <namespace>

Look for the Last State section in the output:

Last State:     Terminated
  Reason:       Error
  Exit Code:    1
  Started:      Sun, 08 Mar 2026 02:00:00 +0000
  Finished:     Sun, 08 Mar 2026 02:00:02 +0000

What exit codes mean:

Exit Code | Meaning                          | Where to Look Next
1         | Application error                | kubectl logs --previous
137       | OOMKilled (SIGKILL from kernel)  | Increase memory limit
139       | Segmentation fault (SIGSEGV)     | Application or library bug
143       | SIGTERM received                 | Check preStop hooks and liveness probe
127       | Entrypoint or command not found  | Check image CMD or pod spec command

A container that exits in under 2 seconds with exit code 1 almost always means the application failed on startup — go straight to logs.
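Exit codes above 128 follow the Unix convention "128 + signal number": 137 is SIGKILL (128 + 9) and 143 is SIGTERM (128 + 15). A small helper to decode them (a convenience sketch, not a kubectl feature; `kill -l N` prints the signal name for signal N in bash):

```bash
# Decode a container exit code: >128 means "killed by signal (code - 128)".
decode_exit_code() {
  code=$1
  if [ "$code" -gt 128 ]; then
    echo "killed by signal $(( code - 128 )) (SIG$(kill -l $(( code - 128 ))))"
  else
    echo "application exited with code $code"
  fi
}

decode_exit_code 137   # killed by signal 9 (SIGKILL)
decode_exit_code 1     # application exited with code 1
```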

Step 3 — Read the Previous Container Logs

This is the most important step. The logs from the crashed container tell you exactly what went wrong.

bash

# Logs from the previously crashed container
kubectl logs <pod-name> -n <namespace> --previous

Without --previous, you get logs from the current container run — which may be empty if the container just started and crashed before logging anything.

bash

# For pods with multiple containers
kubectl logs <pod-name> -n <namespace> -c <container-name> --previous

# Tail the last 50 lines only
kubectl logs <pod-name> -n <namespace> --previous --tail=50

Step 4 — Read Pod Events

Events give you context that logs cannot — things Kubernetes observed about the pod from the outside.

bash

kubectl describe pod <pod-name> -n <namespace>

Scroll to the bottom and read the Events section:

Events:
  Type     Reason   Age                From     Message
  ----     ------   ----               ----     -------
  Normal   Created  13m (x4 over 13m)  kubelet  Created container payment-svc
  Normal   Started  13m (x4 over 13m)  kubelet  Started container payment-svc
  Warning  BackOff  2m (x8 over 12m)   kubelet  Back-off restarting failed container

The (x4 over 13m) annotation shows the container has been created and started 4 times in 13 minutes. The Back-off restarting failed container warning confirms CrashLoopBackOff.

Step 5 — Check Referenced Secrets and ConfigMaps

A very common production cause: the pod references a Secret or ConfigMap that does not exist in the same namespace.

bash

# Check what the pod references
kubectl get pod <pod-name> -n <namespace> -o yaml | \
  grep -E "secretKeyRef|configMapKeyRef|secretName|configMapRef"

# Verify each one exists in the correct namespace
kubectl get secret <secret-name> -n <namespace>
kubectl get configmap <configmap-name> -n <namespace>

If a referenced Secret does not exist, the pod will fail immediately with an event like:

Warning  Failed  Error: secret "db-credentials-v2" not found

Debugging Init Container Failures

If your pod has init containers, check them separately. Init container failures show a slightly different status:

bash

kubectl get pod <pod-name> -n <namespace>
# STATUS: Init:CrashLoopBackOff  or  Init:Error

# Get init container logs
kubectl logs <pod-name> -n <namespace> -c <init-container-name> --previous

# Describe shows init container state separately
kubectl describe pod <pod-name> -n <namespace>
# Look for the Init Containers section near the top of the output

Common init container failure causes:

  • Database migration script fails — DB not reachable yet, or wrong credentials
  • Wait-for script times out waiting for a dependency service
  • File permission setup fails due to wrong user context
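The "wait-for" pattern behind the second bullet is usually a retry loop like the one below. This is a generic sketch, not a standard script: the host, port, and timeout are placeholders, and it uses bash's /dev/tcp so it needs no extra tooling in the image.

```bash
# Generic init-container wait-for loop: retry until the dependency accepts
# TCP connections, or give up after the timeout and return non-zero --
# a non-zero exit is what surfaces as Init:Error / Init:CrashLoopBackOff.
wait_for() {
  host=$1; port=$2; timeout=$3
  deadline=$(( $(date +%s) + timeout ))
  until (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "timed out waiting for ${host}:${port}" >&2
      return 1
    fi
    sleep 2
  done
  echo "${host}:${port} is reachable"
}

# In an init container you would call it and exit with its status, e.g.:
#   wait_for db.internal 5432 120 || exit 1
```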

Fixing the Most Common CrashLoopBackOff Causes

Fix 1 — Missing Environment Variable or Secret

bash

# Find what the app is complaining about
kubectl logs <pod-name> --previous | grep -iE "env|variable|not found|undefined"

# Check what keys the secret actually has
kubectl get secret <secret-name> -n <namespace> -o jsonpath='{.data}' | python3 -m json.tool

# Re-create the secret with the correct keys
kubectl create secret generic db-credentials \
  --from-literal=password=mypassword \
  --from-literal=username=myuser \
  -n <namespace>

Fix 2 — OOMKilled (Exit Code 137)

bash

# Confirm OOMKilled
kubectl describe pod <pod-name> | grep -A3 "Last State"
# Reason: OOMKilled

# Check current memory limit
kubectl get pod <pod-name> -o yaml | grep -A5 resources

# Patch deployment with higher memory limit
kubectl patch deployment <deployment-name> -n <namespace> --type=json \
  -p='[{"op":"replace","path":"/spec/template/spec/containers/0/resources/limits/memory","value":"512Mi"}]'

For Java applications specifically: the JVM allocates heap aggressively on startup. If your container memory limit is 256Mi but the JVM tries to allocate 512Mi of heap, the container gets OOMKilled before the application even starts.

yaml

# Set JVM max heap below your container memory limit
env:
- name: JAVA_OPTS
  value: "-Xms128m -Xmx384m"   # For a 512Mi container limit

# Or use JVM container awareness (Java 11+)
- name: JAVA_OPTS
  value: "-XX:MaxRAMPercentage=75.0"   # Use 75% of container memory for heap
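The two settings above agree with each other: for a 512Mi limit, 75% is 384Mi, the same ceiling as -Xmx384m. A quick sanity check of that arithmetic:

```bash
# MaxRAMPercentage math: heap ceiling = container limit * percentage
limit_mib=512
heap_pct=75
echo "max heap: $(( limit_mib * heap_pct / 100 ))Mi"   # max heap: 384Mi
```

Leaving the remaining ~25% of the limit for metaspace, thread stacks, and off-heap allocations is the usual rule of thumb.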

Fix 3 — Liveness Probe Killing the Container Too Early

bash

# Check current liveness probe config
kubectl get pod <pod-name> -o yaml | grep -A10 livenessProbe

If initialDelaySeconds is too low, Kubernetes starts checking before the app is ready and kills it:

yaml

# Too aggressive -- kills app before it can start
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 5      # Not enough for most apps
  periodSeconds: 10
  failureThreshold: 3

# Better -- give the app time to initialize
livenessProbe:
  httpGet:
    path: /ping               # Simple endpoint, no DB check
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 15
  failureThreshold: 3

# Best for apps with variable startup time -- use startupProbe
startupProbe:
  httpGet:
    path: /ping
    port: 8080
  failureThreshold: 30        # 30 x 10s = up to 5 minutes to start
  periodSeconds: 10

Important: Use a startupProbe for applications that have long or variable startup times — Java apps, apps that run DB migrations on boot, or anything that loads large datasets into memory. The startup probe disables the liveness probe until it passes, preventing premature kills.
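When tuning these numbers, the total time a startupProbe grants is failureThreshold × periodSeconds; check that your values cover the slowest startup you have observed:

```bash
# Startup budget = failureThreshold * periodSeconds
failureThreshold=30
periodSeconds=10
echo "startup budget: $(( failureThreshold * periodSeconds ))s"   # startup budget: 300s
```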

Fix 4 — Wrong Container Entrypoint (Exit Code 127)

bash

# Check what the image actually defines as its entrypoint
docker inspect <image-name>:<tag> | jq '.[0].Config.Entrypoint'
docker inspect <image-name>:<tag> | jq '.[0].Config.Cmd'

# Check what your pod spec is overriding it with
kubectl get pod <pod-name> -o yaml | grep -A5 -E "command:|args:"

# Debug by running the image interactively
docker run -it --entrypoint /bin/sh <image-name>:<tag>
# Then manually run the command to see what error you get

Real Production Example — CrashLoopBackOff After Namespace Migration

The situation: A team migrated a payment microservice from the staging namespace to production. Within minutes of the rollout, all pods entered CrashLoopBackOff with restart counts climbing fast.

Diagnosis:

bash

kubectl logs payment-svc-7d9f6b-xk2p9 -n production --previous
# Error: failed to connect to database: connection refused
# dial tcp 10.0.1.45:5432: connect: connection refused

kubectl describe pod payment-svc-7d9f6b-xk2p9 -n production | grep -A5 "Environment"
# DB_PASSWORD: <set to the key 'password' in secret 'db-credentials'>

kubectl get secret db-credentials -n production
# Error from server (NotFound): secrets "db-credentials" not found

Root cause: The Secret db-credentials existed in staging but had never been created in production. The migration checklist covered the Deployment and Service manifests — but not the Secrets they depended on.

Fix:

bash

# Copy the secret from staging to production
# (drop instance metadata so it can be created fresh in the new namespace)
kubectl get secret db-credentials -n staging -o yaml | \
  sed -e 's/namespace: staging/namespace: production/' \
      -e '/resourceVersion:/d' -e '/uid:/d' -e '/creationTimestamp:/d' | \
  kubectl apply -f -

# Restart the deployment
kubectl rollout restart deployment/payment-svc -n production

# Verify pods recover
kubectl get pods -n production -w
# All 5 pods Running within 45 seconds

Prevention: Add a pre-deployment validation step to your CI pipeline that checks all referenced Secrets and ConfigMaps exist in the target namespace before applying manifests.
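A minimal sketch of such a check, covering only `secretName:` volume references for brevity (a real check would also handle secretKeyRef, configMapKeyRef, and envFrom; the manifest path and namespace below are placeholders):

```bash
# List secretName: references in a rendered manifest (naive: volumes only).
list_secret_refs() {
  awk '/secretName:/ {print $2}' "$1" | sort -u
}

# Verify each referenced Secret exists in the target namespace
# before applying manifests (needs cluster access).
check_secrets() {
  ns=$1; manifest=$2
  for s in $(list_secret_refs "$manifest"); do
    kubectl get secret "$s" -n "$ns" >/dev/null 2>&1 || {
      echo "MISSING secret in ${ns}: ${s}" >&2
      return 1
    }
  done
}

# Example CI usage:
#   check_secrets production rendered/all.yaml || exit 1
```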


CrashLoopBackOff Debugging Cheatsheet

bash

# 1. Check status and restart count
kubectl get pods -n <namespace>

# 2. Get exit code from last crash
kubectl describe pod <pod-name> -n <namespace> | grep -A5 "Last State"

# 3. Read logs from crashed container
kubectl logs <pod-name> -n <namespace> --previous

# 4. Read pod events
kubectl describe pod <pod-name> -n <namespace>

# 5. Check secrets and configmaps exist in namespace
kubectl get secret <name> -n <namespace>
kubectl get configmap <name> -n <namespace>

# 6. Check init container logs separately
kubectl logs <pod-name> -c <init-container-name> --previous -n <namespace>

# 7. Force a fresh restart after fixing
kubectl rollout restart deployment/<deployment-name> -n <namespace>

Summary

CrashLoopBackOff always has a root cause inside the container. The debugging process is always the same: check the exit code first, then read the previous logs, then read events, then check configuration.

The five most common fixes in production:

  1. Create the missing Secret or ConfigMap in the correct namespace
  2. Increase the memory limit for OOMKilled containers
  3. Add initialDelaySeconds or a startupProbe to liveness probe config
  4. Fix environment variable references pointing to non-existent Secret keys
  5. Correct the container entrypoint or command in the pod spec


Next in the series: How to Fix Kubernetes Node NotReady
