Debugging Kubernetes Storage (PV/PVC)

“This guide is part of the Production Kubernetes Debugging Handbook — a complete reference for debugging production Kubernetes clusters.”

Why Storage Failures Are Different

Most Kubernetes failures are stateless — a crashing pod gets replaced, a misconfigured deployment gets rolled back, and the cluster recovers. Storage failures are different. They are stateful, they block pod startup, and they carry real data loss risk if handled incorrectly.

A PersistentVolumeClaim stuck in Pending means the pod waiting for it cannot start. A ReadWriteOnce volume still attached to a dead node means the replacement pod cannot mount it. A PVC stuck in Terminating can block namespace deletion for hours.

These failures require a different approach — slower, more deliberate, with verification at each step before taking destructive actions.

This guide walks through the storage failure patterns you are most likely to hit in production, from PVC provisioning failures to StatefulSet volume issues to multi-attach errors.


How Kubernetes Storage Works

A clear mental model prevents most storage debugging mistakes.

Pod spec references a PVC by name
        |
        v
PersistentVolumeClaim (PVC)
  Requests storage: 50Gi, ReadWriteOnce, StorageClass: premium-ssd
        |
        v
StorageClass
  Provisioner: disk.csi.azure.com
  Parameters: skuName: Premium_LRS
        |
        v
Provisioner creates the physical storage (Azure Disk, AWS EBS, GCP PD)
        |
        v
PersistentVolume (PV)
  Bound to the PVC
        |
        v
CSI driver mounts the volume onto the node
        |
        v
Volume mounted into the container at the specified path

Key concepts:

  • PVC is what your pod references. It is a request for storage.
  • PV is the actual storage resource. It can be provisioned dynamically (StorageClass) or manually.
  • StorageClass defines how PVs are provisioned and what type of storage to use.
  • Binding is the one-to-one relationship between a PVC and a PV.
  • AccessModes define how the volume can be mounted: ReadWriteOnce (one node), ReadOnlyMany (many nodes, read only), ReadWriteMany (many nodes, read/write).
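These concepts map directly onto a PVC manifest. A minimal sketch — the names, size, and StorageClass below are illustrative, not taken from a real cluster:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-postgres            # the name the pod's volume references
spec:
  accessModes:
    - ReadWriteOnce              # mountable by one node at a time
  storageClassName: premium-ssd  # must match an existing StorageClass
  resources:
    requests:
      storage: 50Gi              # the storage request the provisioner fulfills
```

Apply it with kubectl apply -f pvc.yaml, then watch it move from Pending to Bound with kubectl get pvc -w.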

Step 1 — The 5-Minute Storage Triage Checklist

bash

# 1. Check PVC status across all namespaces
kubectl get pvc --all-namespaces

# 2. Check PV status
kubectl get pv

# 3. Check available StorageClasses
kubectl get storageclass

# 4. Describe the stuck PVC for events
kubectl describe pvc <pvc-name> -n <namespace>

# 5. Check pod events for volume mount errors
kubectl describe pod <pod-name> -n <namespace> | grep -A20 Events

PVC Status meanings:

Status        Meaning
Pending       No PV bound yet — provisioner is working or has failed
Bound         Healthy — PVC is bound to a PV
Lost          The bound PV no longer exists
Terminating   Deletion in progress — may be stuck on a finalizer

PV Status meanings:

Status      Meaning
Available   PV exists but is not yet bound to any PVC
Bound       PV is bound to a PVC — healthy
Released    PVC was deleted but the PV still exists with old data
Failed      Provisioner encountered an error

Cause 1 — PVC Stuck in Pending

Symptom:

bash

kubectl get pvc -n <namespace>
# NAME            STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS
# data-postgres   Pending   <none>   100Gi      RWO            premium-ssd

The PVC has been created but no PV has been bound to it. The pod waiting for this PVC stays in ContainerCreating or Pending.

How to Diagnose

bash

kubectl describe pvc <pvc-name> -n <namespace>

# Look for Events at the bottom:
# storageclass.storage.k8s.io "premium-ssd" not found
# no persistent volumes available for this claim and no storage class is set
# waiting for a volume to be created either by the external provisioner
# or manually by a system administrator

Step 1: Does the StorageClass exist?

bash

kubectl get storageclass
# If your StorageClass is not listed, it was deleted or never created

Step 2: Is the provisioner running?

bash

kubectl get pods -n kube-system | grep -i "provisioner\|csi"

# For AKS — check the Azure Disk CSI driver
kubectl get pods -n kube-system | grep csi-azuredisk

Step 3: Does the PVC request match what the StorageClass can provide?

bash

kubectl describe storageclass <storageclass-name>
# Check: AllowVolumeExpansion, VolumeBindingMode, Provisioner

# VolumeBindingMode: WaitForFirstConsumer
# means PV is not created until a pod is actually scheduled
# The PVC will stay Pending until you create a pod that uses it

Common Causes and Fixes

StorageClass deleted or named incorrectly:

bash

# List available StorageClasses
kubectl get storageclass

# In AKS, default StorageClasses:
# managed-csi         <- Standard SSD (default)
# managed-csi-premium <- Premium SSD
# azurefile-csi       <- Azure Files (ReadWriteMany)
# azurefile-csi-premium

# Fix: update the PVC to use an existing StorageClass
# (you must delete and recreate the PVC — its storageClassName field is immutable after creation)
kubectl delete pvc <pvc-name> -n <namespace>
kubectl apply -f pvc-with-correct-storageclass.yaml

WaitForFirstConsumer binding mode — expected behavior:

bash

# If VolumeBindingMode is WaitForFirstConsumer, the PVC intentionally stays
# Pending until a pod using it is scheduled to a node.
# This is normal — create the pod and the PVC will bind.
kubectl describe storageclass <name> | grep VolumeBindingMode
# VolumeBindingMode: WaitForFirstConsumer   <- this is expected behavior
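With WaitForFirstConsumer, binding is triggered by scheduling a consumer. A minimal sketch of a pod that mounts the claim and unblocks provisioning — pod name, image, and claim name are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pvc-consumer
spec:
  containers:
    - name: app
      image: busybox:1.36
      command: ["sleep", "3600"]
      volumeMounts:
        - name: data
          mountPath: /data       # where the volume appears in the container
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: data-postgres # the Pending PVC to bind
```

Once this pod is scheduled to a node, the provisioner creates a PV in that node's zone and the PVC binds.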

Provisioner pod not running:

bash

kubectl logs -n kube-system <provisioner-pod>
# Look for authentication errors, quota errors, or API errors

# Restart the provisioner
kubectl rollout restart deployment/<provisioner-deployment> -n kube-system

Cloud quota exhausted:

In AKS, disk provisioning fails silently when your Azure subscription has hit its managed disk quota. Check Azure portal or CLI:

bash

az vm list-usage --location eastus --query "[?name.value=='Disks']"

Cause 2 — PVC Stuck in Terminating

Symptom:

bash

kubectl get pvc -n <namespace>
# NAME            STATUS        VOLUME          CAPACITY
# data-postgres   Terminating   pvc-7d9f-xp2k   100Gi

The PVC was deleted but is stuck and will not go away. This commonly blocks namespace deletion.

How to Diagnose

bash

kubectl describe pvc <pvc-name> -n <namespace>

# Look for Finalizers:
# Finalizers: [kubernetes.io/pvc-protection]

What is happening: Kubernetes uses a pvc-protection finalizer to prevent a PVC from being deleted while a pod is actively using it. If the pod has been deleted but the PVC is still stuck, the finalizer was not removed automatically — usually because the pod is stuck in Terminating too.

bash

# Check if any pod is still using the PVC
kubectl get pods --all-namespaces -o json | \
  jq '.items[] | select(.spec.volumes[]?.persistentVolumeClaim.claimName=="<pvc-name>") |
  .metadata.name'

Fix

bash

# Step 1: Confirm no pod is actively using the PVC
# Only proceed if all pods using this PVC are fully terminated

# Step 2: Remove the finalizer to force deletion
kubectl patch pvc <pvc-name> -n <namespace> \
  -p '{"metadata":{"finalizers":[]}}' \
  --type=merge

# Step 3: Verify deletion completes
kubectl get pvc -n <namespace>

Warning: Only remove the pvc-protection finalizer when you are certain no pod has an active mount on the volume. Forcing PVC deletion while a pod is mounting it can cause data corruption and filesystem errors inside the container.


Cause 3 — Volume Mount Failure (ContainerCreating Forever)

Symptom:

bash

kubectl get pods -n <namespace>
# NAME           READY   STATUS              RESTARTS
# db-pod-xp2k1   0/1     ContainerCreating   0          15m

The PVC is Bound but the pod cannot mount the volume. It sits in ContainerCreating without progressing.

How to Diagnose

bash

kubectl describe pod <pod-name> -n <namespace>

# Look for Events like:
# Warning  FailedMount   Unable to attach or mount volumes:
#          timed out waiting for the condition
# Warning  FailedAttach  Multi-Attach error for volume "pvc-7d9f-xp2k":
#          Volume is already exclusively attached to one node and
#          can't be attached to another

Multi-Attach Error

This is the most common mount failure in production. A ReadWriteOnce volume can only be attached to one node at a time. If the pod was rescheduled to a different node after a node failure, the volume is still attached to the old (possibly dead) node.

bash

# Find which node the volume is currently attached to
kubectl get pv <pv-name> -o yaml | grep -A10 nodeAffinity

# Check VolumeAttachments
kubectl get volumeattachment

# Find the attachment for your volume
kubectl get volumeattachment -o json | \
  jq '.items[] | select(.spec.source.persistentVolumeName=="<pv-name>") | 
  {name: .metadata.name, node: .spec.nodeName, attached: .status.attached}'

Fix:

bash

# Delete the stale VolumeAttachment to force re-attach on the new node
kubectl delete volumeattachment <attachment-name>

# The pod should proceed to mount the volume within 30-60 seconds
kubectl get pods -n <namespace> -w

Warning: Only delete a VolumeAttachment when you are certain the old node is no longer running — either it is NotReady, drained, or deleted. Deleting an active attachment can corrupt the filesystem.

CSI Driver Not Running on the Node

bash

# Check if the CSI driver DaemonSet pod is running on the affected node
kubectl get pods -n kube-system -o wide | grep csi | grep <node-name>

# If missing, restart the DaemonSet
kubectl rollout restart daemonset/<csi-daemonset> -n kube-system

Cause 4 — StatefulSet Volume Issues

StatefulSets manage their own PVCs via volumeClaimTemplates. Each pod in the StatefulSet gets its own PVC named <template-name>-<statefulset-name>-<ordinal>.
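As a sketch, a volumeClaimTemplates entry like the one below (names and sizes illustrative) produces PVCs named data-kafka-0, data-kafka-1, and so on:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
spec:
  serviceName: kafka
  replicas: 5
  # ... selector and pod template omitted ...
  volumeClaimTemplates:
    - metadata:
        name: data                       # the template name
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: premium-zrs
        resources:
          requests:
            storage: 100Gi
# PVC for pod kafka-2 is named: data-kafka-2
```

Because the PVC name is derived from the template name, the StatefulSet name, and the pod ordinal, a replacement pod with the same ordinal reattaches to the same PVC and keeps its data.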

PVC Not Created for a StatefulSet Pod

bash

# Check if PVCs exist for all StatefulSet pods
kubectl get pvc -n <namespace> | grep <statefulset-name>

# If kafka-3 and kafka-4 PVCs are missing but kafka-0, kafka-1, kafka-2 exist:
# The StatefulSet controller could not provision new PVCs
# Usually caused by missing StorageClass or quota exhaustion

Recovering a StatefulSet After PVC Loss

If the PVC for a StatefulSet pod is accidentally deleted, the replacement pod gets stuck in Pending. The StatefulSet controller recreates the PVC, but if the underlying data is gone, provisioning produces a fresh, empty volume.

bash

# Check which PVCs are missing
kubectl get pvc -n <namespace>

# If the underlying disk still exists (cloud snapshot or manually retained PV),
# create a PV that references the existing disk and pre-bind it to the PVC

# Example: restore from Azure Disk snapshot
az snapshot create --resource-group <rg> --name <snapshot> \
  --source <original-disk-id>

az disk create --resource-group <rg> --name <new-disk> \
  --source <snapshot-id>

Scaling Down Does Not Delete PVCs

This is intentional. When you scale down a StatefulSet, its PVCs are not deleted automatically. Kubernetes protects StatefulSet data by retaining PVCs even when pods are gone.
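On newer Kubernetes releases, the StatefulSet API lets you opt into automatic PVC cleanup via persistentVolumeClaimRetentionPolicy (the StatefulSetAutoDeletePVC feature; check that your cluster version supports it). A sketch:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
spec:
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Retain   # keep PVCs when the StatefulSet itself is deleted
    whenScaled: Delete    # delete PVCs for pods removed by a scale-down
  # ... rest of the spec ...
```

The default for both fields is Retain, which matches the protective behavior described above.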

bash

# After scaling down, orphaned PVCs remain
kubectl get pvc -n <namespace>
# data-kafka-4   Bound   ...   # pod kafka-4 is gone but PVC remains

# To clean up manually after confirming data is no longer needed
kubectl delete pvc data-kafka-4 -n <namespace>

Cause 5 — PV Released But Not Reclaimed

Symptom:

bash

kubectl get pv
# NAME           STATUS     CLAIM                    STORAGECLASS
# pv-legacy-01   Released   payments/data-postgres   manual

The PVC was deleted but the PV still exists in Released state. A new PVC cannot bind to it even though the PV is available.

Why: The PV has a claimRef pointing to the old PVC. Kubernetes will not bind it to a new PVC automatically to prevent accidental data access.

Fix

bash

# Option 1: Delete the PV and let the provisioner create a new one
kubectl delete pv <pv-name>

# Option 2: Manually patch the PV to remove the claimRef and make it Available again
kubectl patch pv <pv-name> -p '{"spec":{"claimRef": null}}'

# The PV status changes to Available and can bind to a new PVC
kubectl get pv <pv-name>
# STATUS: Available

Real Production Example — StatefulSet Stuck After Node Pool Upgrade

Scenario: After a planned AKS node pool upgrade, 3 of 5 Kafka pods enter Pending state and do not recover after 25 minutes.

bash

kubectl get pvc -n messaging
# data-kafka-2   Pending   <none>   100Gi   RWO   premium-zrs
# data-kafka-3   Pending   <none>   100Gi   RWO   premium-zrs
# data-kafka-4   Pending   <none>   100Gi   RWO   premium-zrs

kubectl describe pvc data-kafka-2 -n messaging
# Events:
#   Warning ProvisioningFailed:
#   storageclass.storage.k8s.io "premium-zrs" not found

The premium-zrs StorageClass had been deleted during a cleanup task three weeks earlier. Existing PVCs were already bound so nobody noticed — until the node pool upgrade evicted the StatefulSet pods and they tried to provision new PVCs.

bash

# Recreate the StorageClass
kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: premium-zrs
provisioner: disk.csi.azure.com
parameters:
  skuName: Premium_ZRS
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
EOF

# PVCs bind to newly provisioned PVs within 2 minutes
kubectl get pvc -n messaging -w
# data-kafka-2   Bound   pvc-new-001   100Gi   RWO   premium-zrs

# Pods recover automatically once PVCs are bound
kubectl get pods -n messaging

Time to resolution: 41 minutes. Lesson: Before deleting any StorageClass, audit all PVCs that reference it. Existing bound PVCs are not affected — but StatefulSets that need to provision new PVCs during node replacements or scaling will break silently.

bash

# Pre-deletion check — always run this before removing a StorageClass
kubectl get pvc --all-namespaces -o json | \
  jq '.items[] | select(.spec.storageClassName=="<class-to-delete>") |
  {namespace: .metadata.namespace, name: .metadata.name}'

Quick Reference

bash

# Check all PVC status
kubectl get pvc --all-namespaces

# Describe stuck PVC
kubectl describe pvc <pvc-name> -n <namespace>

# Check PV status
kubectl get pv

# Check StorageClasses
kubectl get storageclass

# Check VolumeAttachments (for multi-attach errors)
kubectl get volumeattachment

# Force delete stuck PVC (remove finalizer)
kubectl patch pvc <pvc-name> -n <namespace> \
  -p '{"metadata":{"finalizers":[]}}' --type=merge

# Find pods using a specific PVC
kubectl get pods --all-namespaces -o json | \
  jq '.items[] | select(.spec.volumes[]?.persistentVolumeClaim.claimName=="<pvc>") |
  .metadata.name'

# Audit PVCs using a StorageClass before deleting it
kubectl get pvc --all-namespaces -o json | \
  jq '.items[] | select(.spec.storageClassName=="<class>") |
  {namespace: .metadata.namespace, name: .metadata.name}'

# Delete stale VolumeAttachment
kubectl delete volumeattachment <attachment-name>

Summary

Storage failures require caution because they involve real data. The diagnosis path:

  1. PVC Pending — check StorageClass exists, provisioner is running, no quota exhaustion
  2. PVC Terminating — check no pod is actively using the volume, then remove the finalizer
  3. ContainerCreating forever — check for Multi-Attach error, delete stale VolumeAttachment
  4. StatefulSet stuck — check PVC names and StorageClass availability
  5. PV Released — patch claimRef to null to make it Available again

Always verify no pod is actively using a volume before taking destructive action. One wrong kubectl delete on a PVC with active data is not recoverable without a backup.

