Debugging Kubernetes Control Plane Failures

This guide is part of the Production Kubernetes Debugging Handbook, a complete reference for debugging production Kubernetes clusters.

Why Control Plane Failures Are the Hardest to Debug

Every other failure in Kubernetes is isolated. A crashing pod affects one workload. A NotReady node affects the pods on that node. A DNS failure affects name resolution. Control plane failures affect everything simultaneously.

When the API server goes down, kubectl stops working. When etcd is degraded, the API server cannot read or write cluster state. When the scheduler fails, no new pods can be placed. When the controller manager stops, scaling and self-healing stop working entirely. The cluster becomes a read-only snapshot of its last known state — existing workloads may continue running, but nothing can be changed, recovered, or deployed.

This is what makes control plane debugging both critical and difficult. You are often debugging the system you rely on to debug everything else.

bash

kubectl get nodes
# The connection to the server was refused —
# did you specify the right host or port?

That error means the API server is unreachable. From that point forward, your standard kubectl toolbox is gone and you need to work at a lower level.
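What "a lower level" looks like in practice: getting onto a control plane node and bypassing kubectl entirely. A minimal sketch, assuming a kubeadm-style cluster (static pod manifests, API server on port 6443) and a containerd runtime with crictl available — adjust paths and ports for your distribution:

```shell
# Run these on the control plane node itself (as root).

# Is the API server container actually running?
crictl ps -a | grep kube-apiserver

# What is it logging? (grab the most recent container ID)
crictl logs "$(crictl ps -a --name kube-apiserver -q | head -1)" 2>&1 | tail -30

# Does the local health endpoint respond, bypassing any load balancer?
curl -sk --max-time 5 https://localhost:6443/livez

# Is kubelet (which runs the static pods) reporting errors?
journalctl -u kubelet --no-pager --since "10 minutes ago" | tail -30
```

If crictl shows the API server container restarting, its logs (or the kubelet journal) usually name the cause: a bad flag, an expired certificate, or an unreachable etcd.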

This guide covers every major control plane failure: API server degradation, etcd issues, scheduler and controller manager failures, admission webhook timeouts, and certificate expiry — with diagnosis and recovery steps for both self-managed and managed (AKS/EKS/GKE) clusters.


Control Plane Architecture — What Can Fail

kubectl / CI/CD pipelines
        |
        v
API Server  <—————————————————> etcd (cluster state)
        |
        +——> Scheduler          (places pods on nodes)
        |
        +——> Controller Manager (reconciles desired vs actual state)
        |
        +——> Cloud Controller   (manages cloud resources: LBs, nodes)
        |
        v
kubelet on each node

Each component has a specific failure signature:

Component           | What breaks when it fails
--------------------|----------------------------------------------------------
API Server          | All kubectl commands fail. Deployments cannot be created or updated.
etcd                | API server cannot read or write state. Cluster appears frozen.
Scheduler           | New pods stay Pending indefinitely.
Controller Manager  | Deployments do not scale. Failed pods are not replaced. ReplicaSets are not reconciled.
Cloud Controller    | New LoadBalancers not provisioned. Node registration may fail.
Admission Webhooks  | All API requests of certain types fail or are delayed.

Step 1 — The 5-Minute Control Plane Triage

bash

# 1. Can you reach the API server at all?
kubectl cluster-info

# 2. How long does a simple request take?
time kubectl get nodes

# 3. Check control plane component health
kubectl get componentstatuses
# Note: componentstatuses is deprecated since v1.19 and may show
# nothing useful on newer clusters; rely on the pod checks below

# 4. Check system pods
kubectl get pods -n kube-system

# 5. Check recent cluster events for error patterns
kubectl get events -n kube-system --sort-by='.lastTimestamp' | tail -20

If kubectl cluster-info times out or returns a connection error, the API server is unreachable. Everything else in this guide depends on API server access — if it is completely down in a managed cluster (AKS, EKS, GKE), open a support ticket immediately. The control plane is the cloud provider’s responsibility in managed clusters.
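While the API server is still answering, its own health endpoints can narrow the problem further: with ?verbose they break readiness down by subsystem, so a failing etcd or informer check shows up by name. A quick sketch:

```shell
# Structured health checks, broken down by subsystem
kubectl get --raw='/livez?verbose'
kubectl get --raw='/readyz?verbose'

# Healthy checks print [+], failing ones [-], for example:
#   [+]ping ok
#   [-]etcd failed: reason withheld
# A failing etcd check points you at etcd, not the API server itself.
```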


Cause 1 — API Server Slow or Unresponsive

Symptoms:

  • kubectl commands hang for 10–30 seconds before responding
  • Deployments take minutes to roll out instead of seconds
  • kubectl get pods returns stale data or times out
  • CI/CD pipelines time out during deployment steps

How to Diagnose

bash

# Measure API server response time
time kubectl get pods --all-namespaces > /dev/null
# Normal: under 1 second
# Degraded: 5–30 seconds
# Down: timeout or connection refused

# Check API server request metrics
kubectl get --raw /metrics | grep apiserver_request_duration_seconds_bucket | head -20

# Check for request queue depth
kubectl get --raw /metrics | grep apiserver_current_inflight_requests

# Check API server pod logs (self-managed clusters)
kubectl logs -n kube-system kube-apiserver-<node> --tail=50

# For AKS — check control plane diagnostics
az aks show --resource-group <rg> --name <cluster> \
  --query "provisioningState"

Common Causes of API Server Slowness

Admission webhook timeouts

A single misconfigured admission webhook with no timeout can block all API requests of a specific type. Webhooks are called synchronously — if the webhook server is slow or down, every matching API request waits.

bash

# List all admission webhooks
kubectl get validatingwebhookconfigurations
kubectl get mutatingwebhookconfigurations

# Describe a specific webhook to check its configuration
kubectl describe validatingwebhookconfiguration <webhook-name>

# Look for:
# TimeoutSeconds: <not set or very high>  <- dangerous
# FailurePolicy: Fail                     <- blocks requests if webhook is down

Fix: add timeouts and safe failure policies:

yaml

webhooks:
- name: my-webhook.example.com
  timeoutSeconds: 5          # never leave this unset
  failurePolicy: Ignore      # safe default for non-critical webhooks
  # failurePolicy: Fail      # only use for security-critical webhooks

bash

# Emergency: disable a failing webhook temporarily
kubectl delete validatingwebhookconfiguration <webhook-name>
# Or patch failurePolicy to Ignore
kubectl patch validatingwebhookconfiguration <webhook-name> \
  --type=json \
  -p='[{"op":"replace","path":"/webhooks/0/failurePolicy","value":"Ignore"}]'
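To find these time bombs before they fire, you can sweep every webhook configuration for the dangerous combination: failurePolicy: Fail together with a missing or long timeout. A sketch using jq; the 10-second threshold is a judgment call, not an API rule:

```shell
# List webhooks that would block API requests if their backend died:
# failurePolicy Fail with timeoutSeconds unset or above 10 seconds.
for kind in validatingwebhookconfigurations mutatingwebhookconfigurations; do
  kubectl get "$kind" -o json | jq -r '
    .items[]
    | .metadata.name as $cfg
    | .webhooks[]
    | select(.failurePolicy == "Fail" and ((.timeoutSeconds // 30) > 10))
    | "\($cfg)/\(.name): failurePolicy=Fail timeoutSeconds=\(.timeoutSeconds // "unset")"'
done
```

Anything this prints deserves either an explicit low timeoutSeconds or a deliberate decision that it is security-critical enough to justify Fail.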

Too many LIST/WATCH requests flooding the API server

Runaway controllers, operators, or CI/CD tools that repeatedly list all resources can saturate the API server. This is called a “list bomb.”

bash

# Check audit logs for high-frequency clients (self-managed)
# Look for clients making hundreds of LIST requests per minute

# Check current inflight requests by verb
kubectl get --raw /metrics | grep 'apiserver_current_inflight_requests{request_kind="mutating"}'
kubectl get --raw /metrics | grep 'apiserver_current_inflight_requests{request_kind="readOnly"}'

# Identify which service accounts are making the most requests
# Requires audit log access — available in AKS via Azure Monitor
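If you do have the audit log on disk (self-managed clusters with --audit-log-path configured), a few lines of jq will surface the noisiest clients. The log path below is an assumption; use whatever your API server flags point at:

```shell
# Count LIST requests per client identity from a JSON-lines audit log
AUDIT_LOG=/var/log/kubernetes/audit/audit.log   # assumed path; adjust

if [ -r "$AUDIT_LOG" ]; then
  jq -r 'select(.verb == "list") | .user.username' "$AUDIT_LOG" \
    | sort | uniq -c | sort -rn | head -10
fi
```

A service account showing hundreds of list entries per minute is your prime suspect for a list bomb.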

etcd latency causing API server slowness

The API server stores all state in etcd. If etcd is slow, every read and write operation on the API server is slow. This is covered in detail in the etcd section below.


Cause 2 — etcd Degradation (Self-Managed Clusters)

etcd is the key-value store that holds all Kubernetes state. Every object — pods, deployments, configmaps, secrets — is stored in etcd. If etcd is degraded, the API server cannot function correctly.

Note: In managed clusters (AKS, EKS, GKE), you do not have direct access to etcd. If you suspect etcd issues in a managed cluster, escalate to your cloud provider immediately.

How to Diagnose

bash

# Check etcd pod status
kubectl get pods -n kube-system | grep etcd

# Check etcd cluster health
kubectl exec -n kube-system etcd-<node> -- etcdctl endpoint health \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Check etcd cluster membership and leader status
kubectl exec -n kube-system etcd-<node> -- etcdctl endpoint status \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  -w table

Warning Signs in etcd Logs

bash

kubectl logs -n kube-system etcd-<node> | grep -E "slow|timeout|overload|leader"

Critical messages:

took too long (200ms) to execute
failed to send out heartbeat on time; took too long, leader is overloaded
server is likely overloaded

These indicate etcd is under I/O pressure. etcd is extremely sensitive to disk latency — it requires SSDs with less than 10ms write latency. Spinning disks or overloaded cloud volumes will degrade etcd significantly.

bash

# Check disk performance on the etcd node
iostat -x 1 10
# Look for: await (average I/O wait) — should be under 10ms
# If await is 50ms+, your disk is too slow for etcd
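iostat shows current load, but the number etcd actually cares about is fdatasync latency. The etcd project suggests benchmarking it with fio using write sizes that mimic etcd's write-ahead log; a sketch (point --directory at a scratch path on the etcd data disk, not at the live /var/lib/etcd):

```shell
# Benchmark fsync latency the way etcd's WAL experiences it:
# small sequential writes with an fdatasync after each one.
fio --name=etcd-disk-check \
    --directory=/var/lib/etcd-disk-test \
    --rw=write --bs=2300 --size=22m \
    --ioengine=sync --fdatasync=1

# In the output, find the fsync/fdatasync percentiles section.
# The 99th percentile should be under 10000 usec (10ms) for etcd.
```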

etcd Maintenance Operations

Compaction and defragmentation — over time etcd accumulates old revisions and its database grows. Regular compaction keeps it healthy:

bash

# Get the current revision
REV=$(kubectl exec -n kube-system etcd-<node> -- etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint status -w json | jq '.[0].Status.header.revision')

# Compact old revisions
kubectl exec -n kube-system etcd-<node> -- etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  compact $REV

# Defragment to reclaim disk space
kubectl exec -n kube-system etcd-<node> -- etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  defrag
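Defragmentation matters most when etcd approaches its space quota (2 GB by default): past quota, etcd raises a NOSPACE alarm and rejects all writes until the alarm is cleared. Worth checking before and after maintenance; a sketch reusing the TLS flags from above:

```shell
ETCD_POD="etcd-<node>"   # substitute your control plane node name
ETCDCTL_FLAGS="--endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key"

# Database size and quota headroom (DB SIZE column)
kubectl exec -n kube-system "$ETCD_POD" -- etcdctl $ETCDCTL_FLAGS endpoint status -w table

# Any active alarms? NOSPACE means etcd is rejecting writes
kubectl exec -n kube-system "$ETCD_POD" -- etcdctl $ETCDCTL_FLAGS alarm list

# After compaction + defrag has reclaimed space, clear the alarm:
kubectl exec -n kube-system "$ETCD_POD" -- etcdctl $ETCDCTL_FLAGS alarm disarm
```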

Taking an etcd snapshot before risky operations:

bash

kubectl exec -n kube-system etcd-<node> -- etcdctl snapshot save /tmp/etcd-backup.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Copy snapshot off the node
# (kubectl cp needs tar inside the container; distroless etcd images
# lack it, so if this fails, copy the file from the node's filesystem)
kubectl cp kube-system/etcd-<node>:/tmp/etcd-backup.db ./etcd-backup-$(date +%Y%m%d).db

Rule: Always take an etcd snapshot before performing cluster upgrades, etcd compaction, or any control plane maintenance. This is your recovery path if something goes wrong.
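For completeness, here is the other half of that recovery path. Restoring is destructive and version-sensitive, so treat this as a sketch for a single-node kubeadm control plane (multi-member clusters must restore every member with matching --initial-cluster flags; follow the etcd restore documentation):

```shell
# Run ON the control plane node. Stop etcd and the API server first
# by moving their static pod manifests out of the watched directory.
mv /etc/kubernetes/manifests/etcd.yaml /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/

# Restore the snapshot into a fresh data directory
# (older clusters: 'etcdctl snapshot restore' works the same way).
etcdutl snapshot restore ./etcd-backup.db --data-dir /var/lib/etcd-restored

# Point etcd at the restored data: edit /tmp/etcd.yaml so its hostPath
# volume uses /var/lib/etcd-restored, then restore the manifests.
mv /tmp/etcd.yaml /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/
```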


Cause 3 — Scheduler Not Placing Pods

Symptoms:

  • New pods stay in Pending indefinitely even though nodes have available capacity
  • kubectl describe pod shows no scheduling events at all (not even “FailedScheduling”)
  • No kubectl get events entries from default-scheduler

How to Diagnose

bash

# Check scheduler pod
kubectl get pods -n kube-system | grep scheduler

# Check scheduler logs
kubectl logs -n kube-system kube-scheduler-<node> --tail=50

# Verify scheduler has a leader (HA clusters)
kubectl get lease -n kube-system kube-scheduler
kubectl describe lease kube-scheduler -n kube-system
# Look for: HolderIdentity — the current leader

In HA clusters, the scheduler uses leader election: only the current lease holder actively schedules pods. Normally, if the leader crashes, another instance takes over once the lease expires (leaseDurationSeconds, typically 15 seconds). If that hand-off is stuck (the lease shows a holder but is no longer being renewed, and no instance takes over), you can force a fresh election.
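Before force-deleting anything, it helps to see how stale the lease actually is. A sketch that prints the holder and last renewal time (jq assumed):

```shell
# Who holds the scheduler lease, and when was it last renewed?
kubectl get lease kube-scheduler -n kube-system -o json \
  | jq -r '.spec | "holder=\(.holderIdentity)  renewed=\(.renewTime)"'

# Compare against the current time:
date -u +%Y-%m-%dT%H:%M:%SZ
# A healthy leader renews every few seconds; a renewTime minutes in
# the past while pods sit Pending means the election is stuck.
```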

bash

# Force new leader election by deleting the lease
kubectl delete lease kube-scheduler -n kube-system
# All scheduler instances will race to acquire the new lease

Fix

bash

# Restart the scheduler (self-managed)
# If running as a static pod, delete the pod — kubelet recreates it automatically
kubectl delete pod kube-scheduler-<node> -n kube-system

# For managed clusters — scheduler is managed by the cloud provider
# If new pods are consistently not being scheduled, open a support ticket

Cause 4 — Controller Manager Failures

Symptoms:

  • Deployments do not scale up or down
  • Crashed pods are not replaced (ReplicaSet not reconciling)
  • Jobs are not creating new pods
  • CronJobs are not firing

How to Diagnose

bash

# Check controller manager pod
kubectl get pods -n kube-system | grep controller-manager

# Check controller manager logs
kubectl logs -n kube-system kube-controller-manager-<node> --tail=50

# Test if reconciliation is working
# Create a simple deployment and check if ReplicaSet is created
kubectl create deployment test-cm --image=nginx --replicas=2 -n default
kubectl get replicaset -n default
# If no ReplicaSet appears after 30 seconds, controller manager is not working
kubectl delete deployment test-cm -n default

# Check leader election lease
kubectl get lease kube-controller-manager -n kube-system

Fix

bash

# Restart the controller manager (self-managed)
kubectl delete pod kube-controller-manager-<node> -n kube-system

# Force new leader election
kubectl delete lease kube-controller-manager -n kube-system

Cause 5 — Certificate Expiry

Kubernetes components communicate using mutual TLS. Every component — the API server, kubelet, controller manager, scheduler, and etcd — has certificates that expire. When they expire, components refuse to communicate with each other and the cluster enters a failure mode that looks like network issues until you check the certificates.

How to Diagnose

bash

# Check certificate expiry for all control plane certs (self-managed with kubeadm)
kubeadm certs check-expiration

# Output shows expiry dates for each certificate:
# CERTIFICATE                EXPIRES                  RESIDUAL TIME
# admin.conf                 Jan 10, 2026 00:00 UTC   89d
# apiserver                  Jan 10, 2026 00:00 UTC   89d
# apiserver-etcd-client      Jan 10, 2026 00:00 UTC   89d
# etcd-healthcheck-client    Jan 10, 2026 00:00 UTC   89d

# Check a specific certificate manually
openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -dates

# Check kubelet certificate on a node
openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -dates
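On clusters that were not built with kubeadm, a plain openssl sweep gives the same overview, and openssl's -checkend flag makes a handy cron check. The paths below assume kubeadm-style locations; adjust as needed:

```shell
# Print the expiry date of every control plane certificate
for crt in /etc/kubernetes/pki/*.crt /etc/kubernetes/pki/etcd/*.crt; do
  printf '%-55s %s\n' "$crt" "$(openssl x509 -in "$crt" -noout -enddate 2>/dev/null)"
done

# Exit non-zero if a cert expires within 30 days (cron-friendly)
openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout \
  -checkend $((30 * 24 * 3600)) \
  || echo "WARNING: apiserver.crt expires within 30 days"
```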

Fix — Renewing Certificates

bash

# Renew all certificates at once (kubeadm clusters)
kubeadm certs renew all

# Renew a specific certificate
kubeadm certs renew apiserver

# After renewal, restart the control plane components
# For static pod deployments:
crictl ps | grep -E "kube-apiserver|kube-controller|kube-scheduler|etcd"
crictl stop <container-id>
# kubelet automatically restarts static pod containers

Production rule: Set a calendar reminder 30 days before certificate expiry. If you monitor control plane nodes with Prometheus, alert on certificate expiry as well (for example, the API server's apiserver_client_certificate_expiration_seconds metric, or a scheduled export of kubeadm certs check-expiration output). Certificate expiry is entirely preventable — it is always an operational oversight.


Real Production Example — Admission Webhook Blocking All Deployments

Scenario: On a Tuesday morning, every deployment across the cluster fails. CI/CD pipelines are reporting errors. No new pods can be created. Existing running pods are unaffected.

bash

kubectl create deployment test --image=nginx -n default
# Error from server (InternalError): Internal error occurred:
# failed calling webhook "pod-policy.example.com":
# Post "https://policy-webhook.kube-system.svc:443/validate-pods?timeout=30s":
# context deadline exceeded

The error names the webhook directly.

bash

# Check if the webhook service is running
kubectl get pods -n kube-system | grep policy-webhook
# policy-webhook-7d9f-xp2k   0/1   CrashLoopBackOff   12   2h

# The webhook pod has been crashing for 2 hours
# With failurePolicy: Fail and no explicit timeoutSeconds, every pod
# creation request waits out the full timeout (30s here; the
# admissionregistration v1 API defaults to 10s) and then fails

Fix:

bash

# Immediate: patch failurePolicy to Ignore to unblock the cluster
kubectl patch validatingwebhookconfiguration pod-policy \
  --type=json \
  -p='[{"op":"replace","path":"/webhooks/0/failurePolicy","value":"Ignore"}]'

# Deployments immediately succeed
kubectl create deployment test --image=nginx -n default
# deployment.apps/test created

# Then fix the webhook pod
kubectl describe pod policy-webhook-7d9f-xp2k -n kube-system
# Found: missing Secret reference after a namespace cleanup

kubectl create secret generic webhook-tls \
  --from-file=tls.crt=webhook.crt \
  --from-file=tls.key=webhook.key \
  -n kube-system

kubectl rollout restart deployment/policy-webhook -n kube-system

# Once webhook is healthy, restore failurePolicy: Fail
kubectl patch validatingwebhookconfiguration pod-policy \
  --type=json \
  -p='[{"op":"replace","path":"/webhooks/0/failurePolicy","value":"Fail"}]'

Time to resolution: 14 minutes. Lesson: Every admission webhook must have timeoutSeconds set (5 seconds is a safe default) and a deliberate failurePolicy. A webhook with failurePolicy: Fail and no timeout is a cluster-wide single point of failure. Audit all webhooks during every cluster review.


Quick Reference

bash

# Check API server reachability
kubectl cluster-info
time kubectl get nodes

# Check control plane pods
kubectl get pods -n kube-system | grep -E "apiserver|etcd|scheduler|controller"

# Check admission webhooks
kubectl get validatingwebhookconfigurations
kubectl get mutatingwebhookconfigurations

# Check etcd health (self-managed)
etcdctl endpoint health --cluster

# Check etcd leader
etcdctl endpoint status --cluster -w table

# Take etcd snapshot
etcdctl snapshot save /tmp/backup.db

# Check certificate expiry (kubeadm)
kubeadm certs check-expiration

# Renew all certificates
kubeadm certs renew all

# Check scheduler and controller manager leases
kubectl get lease -n kube-system

# Force new leader election
kubectl delete lease kube-scheduler -n kube-system
kubectl delete lease kube-controller-manager -n kube-system

# Patch webhook failurePolicy to Ignore (emergency)
kubectl patch validatingwebhookconfiguration <name> \
  --type=json \
  -p='[{"op":"replace","path":"/webhooks/0/failurePolicy","value":"Ignore"}]'

Summary

Control plane failures require a methodical approach because you are often debugging with limited tooling:

  1. API server unreachable — check admission webhooks first, then etcd health
  2. API server slow — measure latency, check webhook timeouts, check etcd I/O
  3. etcd degraded — check disk latency, compact and defragment, take snapshot before any action
  4. Scheduler not placing pods — check lease holder, restart pod or force leader election
  5. Controller manager not reconciling — check lease holder, test with a simple deployment
  6. Certificate expiry — run kubeadm certs check-expiration monthly, renew 30 days before expiry

In managed clusters, your responsibility ends at the API server boundary. Anything below — etcd, scheduler internals, certificate rotation — is the cloud provider’s job. Escalate quickly rather than spending hours on something you cannot access.

