“This guide is part of the Production Kubernetes Debugging Handbook — a complete reference for debugging production Kubernetes clusters.”
Why Storage Failures Are Different
Most Kubernetes failures are stateless — a crashing pod gets replaced, a misconfigured deployment gets rolled back, and the cluster recovers. Storage failures are different. They are stateful, they block pod startup, and they carry real data loss risk if handled incorrectly.
A PersistentVolumeClaim stuck in Pending means the pod waiting for it cannot start. A ReadWriteOnce volume still attached to a dead node means the replacement pod cannot mount it. A PVC stuck in Terminating can block namespace deletion for hours.
These failures require a different approach — slower, more deliberate, with verification at each step before taking destructive actions.
This guide walks through every storage failure pattern you will encounter in production, from PVC provisioning failures to StatefulSet volume issues to multi-attach errors.
How Kubernetes Storage Works
A clear mental model prevents most storage debugging mistakes.
```
Pod spec references a PVC by name
        |
        v
PersistentVolumeClaim (PVC)
  Requests storage: 50Gi, ReadWriteOnce, StorageClass: premium-ssd
        |
        v
StorageClass
  Provisioner: disk.csi.azure.com
  Parameters: skuName: Premium_LRS
        |
        v
Provisioner creates the physical storage (Azure Disk, AWS EBS, GCP PD)
        |
        v
PersistentVolume (PV)
  Bound to the PVC
        |
        v
CSI driver mounts the volume onto the node
        |
        v
Volume mounted into the container at the specified path
```
Key concepts:
- PVC is what your pod references. It is a request for storage.
- PV is the actual storage resource. It can be provisioned dynamically (StorageClass) or manually.
- StorageClass defines how PVs are provisioned and what type of storage to use.
- Binding is the one-to-one relationship between a PVC and a PV.
- AccessModes define how the volume can be mounted: ReadWriteOnce (one node), ReadOnlyMany (many nodes, read-only), ReadWriteMany (many nodes, read/write).
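The chain above maps directly onto manifests. A minimal sketch — names, class, image, and mount path are all illustrative:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-postgres            # the name the pod references
spec:
  accessModes:
    - ReadWriteOnce              # one node at a time
  storageClassName: premium-ssd  # selects the provisioner
  resources:
    requests:
      storage: 50Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: postgres
spec:
  containers:
    - name: postgres
      image: postgres:16
      volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: data-postgres # ties the pod to the PVC
```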
Step 1 — The 5-Minute Storage Triage Checklist
```bash
# 1. Check PVC status across all namespaces
kubectl get pvc --all-namespaces

# 2. Check PV status
kubectl get pv

# 3. Check available StorageClasses
kubectl get storageclass

# 4. Describe the stuck PVC for events
kubectl describe pvc <pvc-name> -n <namespace>

# 5. Check pod events for volume mount errors
kubectl describe pod <pod-name> -n <namespace> | grep -A20 Events
```
PVC Status meanings:
| Status | Meaning |
|---|---|
| Pending | No PV bound yet — provisioner is working or has failed |
| Bound | Healthy — PVC is bound to a PV |
| Lost | The bound PV no longer exists |
| Terminating | Deletion in progress — may be stuck on a finalizer |
PV Status meanings:
| Status | Meaning |
|---|---|
| Available | PV exists but not yet bound to any PVC |
| Bound | PV is bound to a PVC — healthy |
| Released | PVC was deleted but PV still exists with old data |
| Failed | Provisioner encountered an error |
Cause 1 — PVC Stuck in Pending
Symptom:
```bash
kubectl get pvc -n <namespace>
# NAME            STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS
# data-postgres   Pending   <none>   100Gi      RWO            premium-ssd
```
The PVC has been created but no PV has been bound to it. The pod waiting for this PVC stays in ContainerCreating or Pending.
How to Diagnose
```bash
kubectl describe pvc <pvc-name> -n <namespace>

# Look for Events at the bottom:
#   storageclass.storage.k8s.io "premium-ssd" not found
#   no persistent volumes available for this claim and no storage class is set
#   waiting for a volume to be created either by the external provisioner
#   or manually by a system administrator
```
Step 1: Does the StorageClass exist?
```bash
kubectl get storageclass
# If your StorageClass is not listed, it was deleted or never created
```
Step 2: Is the provisioner running?
```bash
kubectl get pods -n kube-system | grep -i "provisioner\|csi"

# For AKS — check the Azure Disk CSI driver
kubectl get pods -n kube-system | grep csi-azuredisk
```
Step 3: Does the PVC request match what the StorageClass can provide?
```bash
kubectl describe storageclass <storageclass-name>
# Check: AllowVolumeExpansion, VolumeBindingMode, Provisioner

# VolumeBindingMode: WaitForFirstConsumer
#   means the PV is not created until a pod is actually scheduled —
#   the PVC will stay Pending until you create a pod that uses it
```
Common Causes and Fixes
StorageClass deleted or named incorrectly:
```bash
# List available StorageClasses
kubectl get storageclass

# In AKS, the default StorageClasses are:
#   managed-csi             <- Standard SSD (default)
#   managed-csi-premium     <- Premium SSD
#   azurefile-csi           <- Azure Files (ReadWriteMany)
#   azurefile-csi-premium

# Fix: update the PVC to use an existing StorageClass
# (you must delete and recreate the PVC — storageClassName is
# immutable once the PVC is created)
kubectl delete pvc <pvc-name> -n <namespace>
kubectl apply -f pvc-with-correct-storageclass.yaml
```
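A pvc-with-correct-storageclass.yaml along these lines would work — shown here with the AKS default managed-csi-premium as an assumed target class, and illustrative name and namespace:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-postgres
  namespace: payments                     # illustrative
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: managed-csi-premium   # must appear in `kubectl get storageclass`
  resources:
    requests:
      storage: 100Gi
```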
WaitForFirstConsumer binding mode — expected behavior:
```bash
# If VolumeBindingMode is WaitForFirstConsumer, the PVC intentionally stays
# Pending until a pod using it is scheduled to a node.
# This is normal — create the pod and the PVC will bind.
kubectl describe storageclass <name> | grep VolumeBindingMode
# VolumeBindingMode:  WaitForFirstConsumer   <- expected behavior
```
Provisioner pod not running:
```bash
kubectl logs -n kube-system <provisioner-pod>
# Look for authentication errors, quota errors, or API errors

# Restart the provisioner
kubectl rollout restart deployment/<provisioner-deployment> -n kube-system
```
Cloud quota exhausted:
In AKS, disk provisioning fails silently when your Azure subscription has hit its managed disk quota. Check Azure portal or CLI:
```bash
az vm list-usage --location eastus --query "[?name.value=='Disks']"
```
Cause 2 — PVC Stuck in Terminating
Symptom:
```bash
kubectl get pvc -n <namespace>
# NAME            STATUS        VOLUME          CAPACITY
# data-postgres   Terminating   pvc-7d9f-xp2k   100Gi
```
The PVC was deleted but is stuck and will not go away. This commonly blocks namespace deletion.
How to Diagnose
```bash
kubectl describe pvc <pvc-name> -n <namespace>
# Look for Finalizers:
#   Finalizers: [kubernetes.io/pvc-protection]
```
What is happening: Kubernetes uses a pvc-protection finalizer to prevent a PVC from being deleted while a pod is actively using it. If the pod has been deleted but the PVC is still stuck, the finalizer was not removed automatically — usually because the pod is stuck in Terminating too.
```bash
# Check if any pod is still using the PVC
kubectl get pods --all-namespaces -o json | \
  jq '.items[] | select(.spec.volumes[]?.persistentVolumeClaim.claimName=="<pvc-name>") |
    .metadata.name'
```
Fix
```bash
# Step 1: Confirm no pod is actively using the PVC
# Only proceed if all pods using this PVC are fully terminated

# Step 2: Remove the finalizer to force deletion
kubectl patch pvc <pvc-name> -n <namespace> \
  -p '{"metadata":{"finalizers":[]}}' \
  --type=merge

# Step 3: Verify deletion completes
kubectl get pvc -n <namespace>
```
Warning: Only remove the pvc-protection finalizer when you are certain no pod has an active mount on the volume. Forcing PVC deletion while a pod is mounting it can cause data corruption and filesystem errors inside the container.
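The check-then-patch sequence can be wrapped in a small guard before running the patch. A sketch — `safe_to_remove_finalizer` is a hypothetical helper that takes the output of the jq pod lookup shown above:

```bash
# Hypothetical guard: decide whether finalizer removal is safe, given the
# newline-separated pod names still referencing the PVC (the jq query output).
safe_to_remove_finalizer() {
  if [ -z "$1" ]; then
    echo "safe"      # no pods reference the PVC
  else
    echo "unsafe"    # still referenced somewhere — do not patch
  fi
}

safe_to_remove_finalizer ""       # -> safe
safe_to_remove_finalizer "db-0"   # -> unsafe
```

Only run the `kubectl patch` when the guard reports safe; anything else means a pod object still references the claim.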
Cause 3 — Volume Mount Failure (ContainerCreating Forever)
Symptom:
```bash
kubectl get pods -n <namespace>
# NAME           READY   STATUS              RESTARTS   AGE
# db-pod-xp2k1   0/1     ContainerCreating   0          15m
```
The PVC is Bound but the pod cannot mount the volume. It sits in ContainerCreating without progressing.
How to Diagnose
```bash
kubectl describe pod <pod-name> -n <namespace>

# Look for Events like:
#   Warning  FailedMount   Unable to attach or mount volumes:
#     timed out waiting for the condition
#   Warning  FailedAttach  Multi-Attach error for volume "pvc-7d9f-xp2k":
#     Volume is already exclusively attached to one node and
#     can't be attached to another
```
Multi-Attach Error
This is the most common mount failure in production. A ReadWriteOnce volume can only be attached to one node at a time. If the pod was rescheduled to a different node after a node failure, the volume is still attached to the old (possibly dead) node.
```bash
# Inspect the PV's node affinity (topology constraints — not the attachment itself)
kubectl get pv <pv-name> -o yaml | grep -A10 nodeAffinity

# VolumeAttachments show which node the volume is actually attached to
kubectl get volumeattachment

# Find the attachment for your volume
kubectl get volumeattachment -o json | \
  jq '.items[] | select(.spec.source.persistentVolumeName=="<pv-name>") |
    {name: .metadata.name, node: .spec.nodeName, attached: .status.attached}'
```
Fix:
```bash
# Delete the stale VolumeAttachment to force re-attach on the new node
kubectl delete volumeattachment <attachment-name>

# The pod should proceed to mount the volume within 30-60 seconds
kubectl get pods -n <namespace> -w
```
Warning: Only delete a VolumeAttachment when you are certain the old node is no longer running — either it is NotReady, drained, or deleted. Deleting an active attachment can corrupt the filesystem.
CSI Driver Not Running on the Node
```bash
# Check if the CSI driver DaemonSet pod is running on the affected node
kubectl get pods -n kube-system -o wide | grep csi | grep <node-name>

# If missing, restart the DaemonSet
kubectl rollout restart daemonset/<csi-daemonset> -n kube-system
```
Cause 4 — StatefulSet Volume Issues
StatefulSets manage their own PVCs via volumeClaimTemplates. Each pod in the StatefulSet gets its own PVC named <template-name>-<statefulset-name>-<ordinal>.
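The naming rule can be seen with a quick loop — template name, StatefulSet name, and replica count here are illustrative:

```bash
# PVC names for a StatefulSet: <template-name>-<statefulset-name>-<ordinal>
template=data
sts=kafka
replicas=3
for ordinal in $(seq 0 $((replicas - 1))); do
  echo "${template}-${sts}-${ordinal}"
done
# data-kafka-0
# data-kafka-1
# data-kafka-2
```

This is why the PVCs in the examples below are named data-kafka-0 through data-kafka-4.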
PVC Not Created for a StatefulSet Pod
```bash
# Check if PVCs exist for all StatefulSet pods
kubectl get pvc -n <namespace> | grep <statefulset-name>

# If kafka-3 and kafka-4 PVCs are missing but kafka-0, kafka-1, kafka-2 exist:
# the StatefulSet controller could not provision new PVCs —
# usually caused by a missing StorageClass or quota exhaustion
```
Recovering a StatefulSet After PVC Loss
If a PVC for a StatefulSet pod is accidentally deleted, the pod gets stuck in Pending. The StatefulSet controller will recreate the PVC, but if the underlying data is gone, the new claim provisions a fresh, empty volume.
```bash
# Check which PVCs are missing
kubectl get pvc -n <namespace>

# If the underlying disk still exists (cloud snapshot or manually retained PV),
# create a PV that references the existing disk and pre-bind it to the PVC

# Example: restore from an Azure Disk snapshot
az snapshot create --resource-group <rg> --name <snapshot> \
  --source <original-disk-id>
az disk create --resource-group <rg> --name <new-disk> \
  --source <snapshot-id>
```
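With the disk restored, a statically provisioned PV can be pre-bound to the StatefulSet's expected PVC name via spec.claimRef. A sketch for the Azure Disk CSI driver — the PV name, class, namespace, and disk path are placeholders:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-kafka-3-restored
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: premium-zrs
  claimRef:                      # pre-bind to the StatefulSet's PVC name
    namespace: messaging
    name: data-kafka-3
  csi:
    driver: disk.csi.azure.com
    volumeHandle: /subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Compute/disks/<new-disk>
```

When the StatefulSet controller recreates the PVC, it binds to this PV instead of provisioning an empty one.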
Scaling Down Does Not Delete PVCs
This is intentional. When you scale down a StatefulSet, its PVCs are not deleted automatically. Kubernetes protects StatefulSet data by retaining PVCs even when pods are gone.
```bash
# After scaling down, orphaned PVCs remain
kubectl get pvc -n <namespace>
# data-kafka-4   Bound   ...   # pod kafka-4 is gone but the PVC remains

# To clean up manually after confirming the data is no longer needed:
kubectl delete pvc data-kafka-4 -n <namespace>
```
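If you want this cleanup to happen automatically, recent Kubernetes releases support a StatefulSet-level retention policy (persistentVolumeClaimRetentionPolicy, beta since 1.27 — check your cluster version before relying on it):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
spec:
  persistentVolumeClaimRetentionPolicy:
    whenScaled: Delete     # delete PVCs for pods removed by scale-down
    whenDeleted: Retain    # keep PVCs if the StatefulSet itself is deleted
  # ... remainder of the StatefulSet spec
```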
Cause 5 — PV Released But Not Reclaimed
Symptom:
```bash
kubectl get pv
# NAME           STATUS     CLAIM                    STORAGECLASS
# pv-legacy-01   Released   payments/data-postgres   manual
```
The PVC was deleted but the PV still exists in the Released state. A new PVC cannot bind to it, even though the underlying storage is intact.
Why: The PV has a claimRef pointing to the old PVC. Kubernetes will not bind it to a new PVC automatically to prevent accidental data access.
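The stale reference is visible in the PV object — abridged here, with an illustrative uid:

```yaml
# kubectl get pv pv-legacy-01 -o yaml (abridged)
spec:
  claimRef:
    kind: PersistentVolumeClaim
    namespace: payments
    name: data-postgres
    uid: 3b1e...           # uid of the deleted PVC — prevents re-binding
```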
Fix
```bash
# Option 1: Delete the PV and let the provisioner create a new one
kubectl delete pv <pv-name>

# Option 2: Patch the PV to remove the claimRef and make it Available again
kubectl patch pv <pv-name> -p '{"spec":{"claimRef": null}}'

# The PV status changes to Available and it can bind to a new PVC
kubectl get pv <pv-name>
# STATUS: Available
```
Real Production Example — StatefulSet Stuck After Node Pool Upgrade
Scenario: After a planned AKS node pool upgrade, 3 of 5 Kafka pods enter Pending state and do not recover after 25 minutes.
```bash
kubectl get pvc -n messaging
# data-kafka-2   Pending   <none>   100Gi   RWO   premium-zrs
# data-kafka-3   Pending   <none>   100Gi   RWO   premium-zrs
# data-kafka-4   Pending   <none>   100Gi   RWO   premium-zrs

kubectl describe pvc data-kafka-2 -n messaging
# Events:
#   Warning  ProvisioningFailed:
#     storageclass.storage.k8s.io "premium-zrs" not found
```
The premium-zrs StorageClass had been deleted during a cleanup task three weeks earlier. Existing PVCs were already bound so nobody noticed — until the node pool upgrade evicted the StatefulSet pods and they tried to provision new PVCs.
```bash
# Recreate the StorageClass
kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: premium-zrs
provisioner: disk.csi.azure.com
parameters:
  skuName: Premium_ZRS
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
EOF

# PVCs bind to newly provisioned PVs within 2 minutes
kubectl get pvc -n messaging -w
# data-kafka-2   Bound   pvc-new-001   100Gi   RWO   premium-zrs

# Pods recover automatically once PVCs are bound
kubectl get pods -n messaging
```
Time to resolution: 41 minutes. Lesson: Before deleting any StorageClass, audit all PVCs that reference it. Existing bound PVCs are not affected — but StatefulSets that need to provision new PVCs during node replacements or scaling will break silently.
```bash
# Pre-deletion check — always run this before removing a StorageClass
kubectl get pvc --all-namespaces -o json | \
  jq '.items[] | select(.spec.storageClassName=="<class-to-delete>") |
    {namespace: .metadata.namespace, name: .metadata.name}'
```
Quick Reference
```bash
# Check all PVC status
kubectl get pvc --all-namespaces

# Describe a stuck PVC
kubectl describe pvc <pvc-name> -n <namespace>

# Check PV status
kubectl get pv

# Check StorageClasses
kubectl get storageclass

# Check VolumeAttachments (for multi-attach errors)
kubectl get volumeattachment

# Force delete a stuck PVC (remove finalizer)
kubectl patch pvc <pvc-name> -n <namespace> \
  -p '{"metadata":{"finalizers":[]}}' --type=merge

# Find pods using a specific PVC
kubectl get pods --all-namespaces -o json | \
  jq '.items[] | select(.spec.volumes[]?.persistentVolumeClaim.claimName=="<pvc>") |
    .metadata.name'

# Audit PVCs using a StorageClass before deleting it
kubectl get pvc --all-namespaces -o json | \
  jq '.items[] | select(.spec.storageClassName=="<class>") |
    {namespace: .metadata.namespace, name: .metadata.name}'

# Delete a stale VolumeAttachment
kubectl delete volumeattachment <attachment-name>
```
Summary
Storage failures require caution because they involve real data. The diagnosis path:
- PVC Pending — check StorageClass exists, provisioner is running, no quota exhaustion
- PVC Terminating — check no pod is actively using the volume, then remove the finalizer
- ContainerCreating forever — check for Multi-Attach error, delete stale VolumeAttachment
- StatefulSet stuck — check PVC names and StorageClass availability
- PV Released — patch claimRef to null to make it Available again
Always verify no pod is actively using a volume before taking destructive action. One wrong kubectl delete on a PVC with active data is not recoverable without a backup.