Debugging Kubernetes Scheduling Problems

This guide is part of the Production Kubernetes Debugging Handbook — a complete reference for debugging production Kubernetes clusters.

Why Kubernetes Scheduling Fails

The Kubernetes scheduler is responsible for one job: deciding which node a pod should run on. When it cannot find a suitable node, the pod stays in Pending state indefinitely. No error message is thrown at the application level. The pod simply never starts.

Scheduling failures are particularly frustrating because they are invisible from the application’s perspective. A deployment shows 0 of 3 replicas ready. An alert fires. Engineers check pod logs — but there are no logs, because the container never started. The clue lives in the scheduler’s reasoning, and you have to know where to find it.

bash

kubectl get pods -n production
# NAME                    READY   STATUS    RESTARTS   AGE
# api-svc-7d9f-xp2k1      0/1     Pending   0          8m
# api-svc-7d9f-mn3q2      0/1     Pending   0          8m
# api-svc-7d9f-kx9p3      0/1     Pending   0          8m

Three pods, all Pending, no restarts — this is a scheduling failure, not a crash.

This guide covers the major scheduling failure modes: insufficient resources, taints and tolerations, affinity rules, ResourceQuota limits, HPA and VPA misconfigurations, and cloud node autoscaler delays.


How the Kubernetes Scheduler Works

The scheduler filters and scores nodes in two phases:

Filtering removes nodes that cannot run the pod. A node fails filtering if it has insufficient CPU or memory, the wrong labels for a node selector, taints the pod does not tolerate, or a PVC the pod needs that cannot be bound in that zone.

Scoring ranks the remaining nodes. The scheduler prefers nodes with the best resource fit, correct zone distribution, and pod affinity alignment.

If zero nodes pass the filtering phase, the pod stays Pending. The scheduler logs its reasoning in the pod events — and that is exactly where you start.


Step 1 — Read the Scheduler’s Reasoning

The single most important command for scheduling failures:

bash

kubectl describe pod <pod-name> -n <namespace>

Go directly to the Events section at the bottom. The scheduler always explains why it could not place the pod:

Events:
  Type     Reason             Age   From               Message
  Warning  FailedScheduling   8m    default-scheduler  0/5 nodes are available:
                                                        2 Insufficient cpu,
                                                        2 node(s) had taint {dedicated:gpu}
                                                        that the pod did not tolerate,
                                                        1 node(s) didn't match
                                                        Pod's node affinity/selector.

This single message tells you exactly what is wrong: two nodes are out of CPU, two have a taint the pod cannot tolerate, and one does not match the affinity rules. You do not need to guess — the scheduler already did the analysis.

The format is always: 0/N nodes are available: <reason1>, <reason2>...


Step 2 — The 5-Minute Scheduling Triage Checklist

bash

# 1. What is blocking scheduling? (read Events section)
kubectl describe pod <pod-name> -n <namespace>

# 2. What resources are available across nodes?
kubectl describe nodes | grep -A8 "Allocated resources"

# 3. What are nodes actually consuming right now?
kubectl top nodes

# 4. Are there taints on nodes that could block this pod?
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints

# 5. Is there a ResourceQuota limiting the namespace?
kubectl describe resourcequota -n <namespace>

Cause 1 — Insufficient CPU or Memory

Symptom:

0/5 nodes are available: 5 Insufficient memory.

or

0/5 nodes are available: 3 Insufficient cpu, 2 Insufficient memory.

How it works: The scheduler uses requests — not limits — to determine if a node has capacity. A pod with requests.memory: 4Gi needs a node with 4Gi of unallocated memory, even if the pod only ever uses 512Mi at runtime.
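A minimal pod spec (hypothetical names and image) showing where requests and limits live — the scheduler reads only the requests block when deciding placement:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: api-svc                        # hypothetical name
spec:
  containers:
  - name: api
    image: registry.example.com/api:1.0   # placeholder image
    resources:
      requests:          # used by the scheduler: node must have this much free
        cpu: "500m"
        memory: "4Gi"
      limits:            # enforced at runtime; ignored during scheduling
        cpu: "1"
        memory: "4Gi"
```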

How to Diagnose

bash

# See how much of each node is already allocated
kubectl describe nodes | grep -A6 "Allocated resources"

# Example output:
# Allocated resources:
#   Resource           Requests      Limits
#   cpu                3800m (95%)   4200m (105%)
#   memory             6Gi   (85%)   8Gi   (106%)

# Check the resource request on the pending pod
kubectl get pod <pod-name> -o yaml | grep -A10 resources

# Find the biggest resource consumers right now
kubectl top pods --all-namespaces --sort-by=cpu | head -15
kubectl top pods --all-namespaces --sort-by=memory | head -15

Fix Checklist

bash

# Option 1: Reduce the pod's resource requests if they are over-provisioned
# Check actual usage vs requests — if usage is consistently 20% of requests, lower them

# Option 2: Find and remove idle workloads consuming cluster capacity
kubectl get pods --all-namespaces | grep -E "Completed|Evicted"
kubectl delete pods --field-selector=status.phase=Succeeded --all-namespaces

# Option 3: Scale up the node pool (cloud environments)
# AKS:
az aks nodepool scale \
  --resource-group <rg> \
  --cluster-name <cluster> \
  --name <nodepool> \
  --node-count 6

# Option 4: Check for namespace ResourceQuota blocking new pods
kubectl describe resourcequota -n <namespace>

Important: Resource requests determine scheduling. A pod with requests: 4Gi occupies 4Gi of scheduling capacity on its node even if it only uses 200Mi. Audit your requests regularly — over-provisioned requests are the most common cause of cluster resource starvation.


Cause 2 — Taints and Tolerations

Symptom:

0/5 nodes are available: 5 node(s) had taint {dedicated:gpu} that the pod did not tolerate.

Taints are applied to nodes to repel pods that do not explicitly tolerate them. Common in clusters with dedicated node pools — GPU nodes, spot/preemptible nodes, system-only nodes.

How to Diagnose

bash

# Check all node taints
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints

# Or check a specific node
kubectl describe node <node-name> | grep Taints

# Check what tolerations the pod has
kubectl get pod <pod-name> -o yaml | grep -A15 tolerations

Understanding taint effects:

  Effect            | Behavior
  ------------------|---------------------------------------------------------------------------
  NoSchedule        | Pod will not be scheduled on this node unless it tolerates the taint
  PreferNoSchedule  | Scheduler tries to avoid this node but will use it if no better option exists
  NoExecute         | Existing pods without the toleration are evicted; new pods are not scheduled

Fix Checklist

yaml

# Add the appropriate toleration to your pod spec
# Example: tolerate a spot instance taint in AKS
tolerations:
- key: "kubernetes.azure.com/scalesetpriority"
  operator: "Equal"
  value: "spot"
  effect: "NoSchedule"

# Example: tolerate any value of a specific taint key (operator: Exists)
tolerations:
- key: "dedicated"
  operator: "Exists"
  effect: "NoSchedule"

Taint a node (adding taints):

bash

# Add a taint to a node
kubectl taint nodes <node-name> dedicated=gpu:NoSchedule

# Remove a taint from a node
kubectl taint nodes <node-name> dedicated=gpu:NoSchedule-

Cause 3 — Node Affinity and Anti-Affinity

Symptom:

0/5 nodes are available: 5 node(s) didn't match Pod's node affinity/selector.

Node affinity rules tell the scheduler which nodes a pod can or prefers to run on, based on node labels.

How to Diagnose

bash

# Check the pod's affinity rules
kubectl get pod <pod-name> -o yaml | grep -A30 affinity

# Check what labels nodes actually have
kubectl get nodes --show-labels

# Check if any node matches the required labels
kubectl get nodes -l <key>=<value>

Required vs Preferred Affinity

yaml

affinity:
  nodeAffinity:
    # HARD RULE — pod will never schedule if no node matches
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.kubernetes.io/zone
          operator: In
          values:
          - eastus-1
          - eastus-2

    # SOFT RULE — scheduler prefers but does not require
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      preference:
        matchExpressions:
        - key: node-type
          operator: In
          values:
          - high-memory

Pod Anti-Affinity Surprises

Pod anti-affinity prevents pods from being co-located on the same node or zone. In small clusters, strict anti-affinity makes pods permanently unschedulable.

0/3 nodes are available: 3 node(s) didn't match pod anti-affinity rules.

This means you have more replicas than nodes that satisfy the anti-affinity constraint — for example, 5 replicas with required anti-affinity on a 3-node cluster.

bash

# Check anti-affinity config
kubectl get pod <pod-name> -o yaml | grep -A20 podAntiAffinity

Fix: Switch from required to preferred anti-affinity unless you have an absolute requirement that replicas never share a node:

yaml

affinity:
  podAntiAffinity:
    # Use preferred unless you truly cannot tolerate co-location
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: myapp
        topologyKey: kubernetes.io/hostname

Cause 4 — ResourceQuota Blocking Pod Creation

Symptom: The cluster has available capacity but pods still cannot be created. The error appears as an event or API rejection:

Error from server (Forbidden): pods "api-svc-7d9f-xp2k1" is forbidden:
exceeded quota: production-quota, requested: limits.cpu=500m,
used: limits.cpu=7800m, limited: limits.cpu=8000m
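For reference, a quota that would produce the rejection above might look like this (hard values taken from the error message; the requests and pods lines are illustrative):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
  namespace: production
spec:
  hard:
    limits.cpu: "8"        # 8000m — the ceiling hit in the error above
    limits.memory: 16Gi
    requests.cpu: "4"
    pods: "50"
```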

How to Diagnose

bash

kubectl describe resourcequota -n <namespace>

# Output shows used vs hard limits:
# Resource          Used     Hard
# --------          ---      ---
# limits.cpu        7800m    8000m     <- almost at limit
# limits.memory     14Gi     16Gi
# pods              48       50
# requests.cpu      3900m    4000m

Fix Checklist

bash

# Option 1: Increase the quota limit
kubectl edit resourcequota <quota-name> -n <namespace>

# Option 2: Find and clean up completed or idle pods
kubectl get pods -n <namespace> | grep -E "Completed|Evicted|Error"
kubectl delete pod --field-selector=status.phase=Succeeded -n <namespace>

# Option 3: Reduce limits on over-provisioned pods
# Find pods with high limits but low actual usage
kubectl top pods -n <namespace> --sort-by=cpu

Cause 5 — HPA Not Scaling — Metrics Server Down

The Horizontal Pod Autoscaler depends entirely on metrics-server. If metrics-server is down, HPA silently stops scaling — and you will not notice until a traffic spike arrives.
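For context, a typical autoscaling/v2 HPA (hypothetical names) targeting average CPU utilization — the averageUtilization figure is exactly what shows up as `<unknown>` when metrics-server is down:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50   # computed from metrics-server data
```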

How to Diagnose

bash

# Check HPA status
kubectl get hpa -n <namespace>

# TARGETS showing <unknown> means metrics are not available
# NAME       REFERENCE         TARGETS         MINPODS   MAXPODS   REPLICAS
# api-hpa    Deployment/api    <unknown>/50%   3         20        3

kubectl describe hpa <hpa-name> -n <namespace>
# Conditions:
#   ScalingActive: False
#   Reason: FailedGetResourceMetric
#   Message: unable to get metrics for resource cpu

# Check metrics-server
kubectl get pods -n kube-system | grep metrics-server
kubectl top nodes   # if this fails, metrics-server is broken

Fix Checklist

bash

# Restart metrics-server
kubectl rollout restart deployment/metrics-server -n kube-system

# If metrics-server keeps crashing, check logs
kubectl logs -n kube-system -l k8s-app=metrics-server

# Common fix for AKS: add --kubelet-insecure-tls flag
kubectl patch deployment metrics-server -n kube-system --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"}]'

# Verify HPA starts reading metrics after fix
kubectl get hpa -n <namespace> -w

Production rule: Add a Prometheus alert for kube_horizontalpodautoscaler_status_condition{condition="ScalingActive",status="false"}. A silent HPA failure during off-hours will not be noticed until peak traffic the next day.
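One way to wire that alert, assuming the Prometheus Operator and kube-state-metrics are installed (namespace and thresholds are illustrative):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: hpa-scaling-inactive
  namespace: monitoring            # hypothetical namespace
spec:
  groups:
  - name: hpa.rules
    rules:
    - alert: HPAScalingInactive
      # Fires when an HPA reports it can no longer compute scaling decisions
      expr: kube_horizontalpodautoscaler_status_condition{condition="ScalingActive",status="false"} == 1
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "HPA {{ $labels.horizontalpodautoscaler }} in {{ $labels.namespace }} has stopped scaling"
```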


Cause 6 — Node Autoscaler Delays

In cloud environments (AKS, EKS, GKE), when all nodes are full and a new pod cannot be scheduled, the cluster autoscaler provisions a new node. This takes 2 to 5 minutes. During this window, pods sit in Pending and your application may be degraded.

How to Diagnose

bash

# Check if autoscaler is working
kubectl get events -n kube-system | grep -i "scale\|autoscal"

# Check autoscaler logs
kubectl logs -n kube-system -l app=cluster-autoscaler --tail=50

# Common log messages:
# "Scale up triggered" — CA decided to add a node
# "Node group has reached maximum size" — cannot scale further, check max node count
# "No candidates for node removal" — scale down did not happen

Fix: Pre-Warm With Buffer Pods

To eliminate cold-start delays, keep buffer capacity using low-priority placeholder pods that get evicted when real workloads need the space:

yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-capacity-buffer
  namespace: kube-system
spec:
  replicas: 3
  template:
    spec:
      priorityClassName: low-priority-buffer
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            cpu: "500m"
            memory: "1Gi"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority-buffer
value: -10
globalDefault: false

When a real high-priority pod needs scheduling, it evicts the buffer pods. Those buffer pods then trigger the autoscaler to add a new node — while your application traffic is already being served.


Real Production Example — Anti-Affinity Blocking Scale-Out

Scenario: Black Friday traffic spike. HPA tries to scale the checkout service from 3 to 15 replicas. After 10 minutes, only 3 replicas are running. 12 pods are stuck in Pending.

bash

kubectl describe pod checkout-svc-6b8d-xp2k1 -n ecommerce

# Events:
# Warning FailedScheduling
# 0/3 nodes are available:
# 3 node(s) didn't match pod anti-affinity rules.

The checkout service had required pod anti-affinity set months ago to ensure high availability. With 3 nodes and 3 existing replicas, the scheduler cannot place any new replica because every node already runs one — and the anti-affinity rule forbids it.

bash

# Check the anti-affinity rule
kubectl get deployment checkout-svc -o yaml | grep -A15 podAntiAffinity

# requiredDuringSchedulingIgnoredDuringExecution:
# topologyKey: kubernetes.io/hostname
# Hard rule — one pod per node, maximum

# Fix: change to preferred anti-affinity
kubectl edit deployment checkout-svc -n ecommerce
# Change: requiredDuringSchedulingIgnoredDuringExecution
# To:     preferredDuringSchedulingIgnoredDuringExecution

# Pods begin scheduling immediately
kubectl get pods -n ecommerce -w

Time to resolution: 9 minutes. Lesson: required anti-affinity is a hard ceiling on your replica count equal to the number of matching nodes. Review all anti-affinity rules before Black Friday and any expected traffic spike. Switch to preferred unless co-location genuinely causes a correctness issue.
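A related option worth knowing (not part of the incident above, sketched under the assumption that your pods carry an `app` label): topologySpreadConstraints with whenUnsatisfiable: ScheduleAnyway spread replicas across nodes without imposing a hard ceiling the way required anti-affinity does:

```yaml
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway   # soft: spread when possible, never block scheduling
    labelSelector:
      matchLabels:
        app: checkout-svc               # hypothetical label
```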


Quick Reference

bash

# Why is the pod Pending? (always start here)
kubectl describe pod <pod-name> -n <namespace>

# Check node available capacity
kubectl describe nodes | grep -A6 "Allocated resources"

# Check node taints
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints

# Check node labels
kubectl get nodes --show-labels

# Check ResourceQuota
kubectl describe resourcequota -n <namespace>

# Check HPA status
kubectl get hpa -n <namespace>
kubectl describe hpa <hpa-name> -n <namespace>

# Check metrics-server
kubectl top nodes

# Check cluster autoscaler
kubectl logs -n kube-system -l app=cluster-autoscaler --tail=30

# Clean up completed pods consuming quota
kubectl delete pod --field-selector=status.phase=Succeeded --all-namespaces

Summary

Scheduling failures always have a specific reason that the scheduler records in pod events. The diagnosis path is:

  1. Read the Events section first — kubectl describe pod tells you exactly why scheduling failed
  2. Insufficient resources — check allocated vs available with kubectl describe nodes
  3. Taints — check node taints and add tolerations to the pod spec
  4. Affinity rules — switch required to preferred unless co-location is a correctness issue
  5. ResourceQuota — check used vs hard limits, clean up completed pods
  6. HPA silent failure — check metrics-server health, alert on ScalingActive=False
  7. Autoscaler delay — use buffer pods to pre-warm capacity for traffic spikes
