This guide is part of the Production Kubernetes Debugging Handbook — a complete reference for debugging production Kubernetes clusters.
Why Control Plane Failures Are the Hardest to Debug
Every other failure in Kubernetes is isolated. A crashing pod affects one workload. A NotReady node affects the pods on that node. A DNS failure affects name resolution. Control plane failures affect everything simultaneously.
When the API server goes down, kubectl stops working. When etcd is degraded, the API server cannot read or write cluster state. When the scheduler fails, no new pods can be placed. When the controller manager stops, scaling and self-healing stop working entirely. The cluster becomes a read-only snapshot of its last known state — existing workloads may continue running, but nothing can be changed, recovered, or deployed.
This is what makes control plane debugging both critical and difficult. You are often debugging the system you rely on to debug everything else.
```bash
kubectl get nodes
# The connection to the server was refused -
# did you specify the right host or port?
```
That error means the API server is unreachable. From that point forward, your standard kubectl toolbox is gone and you need to work at a lower level.
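When kubectl is unusable, you can still probe the API server's health endpoints directly with curl. A minimal sketch — the `APISERVER` address is an assumption (the kubeadm default on a control plane node); substitute your cluster's endpoint:

```bash
# Probe the API server health endpoints directly, bypassing kubectl.
# APISERVER is an assumption -- substitute your cluster's endpoint.
APISERVER="${APISERVER:-https://127.0.0.1:6443}"

probe() {
  local path="$1" out
  # -k skips TLS verification (we only care about reachability here);
  # --max-time keeps a dead endpoint from hanging the probe.
  if out=$(curl -sk --max-time 5 "${APISERVER}${path}" 2>/dev/null); then
    echo "${path}: ${out}"
  else
    echo "${path}: unreachable"
  fi
}

probe /healthz   # overall health
probe /readyz    # is it ready to serve requests?
probe /livez     # should it be restarted?
```

Appending `?verbose` to `/readyz` breaks the result down by individual check, which can narrow an unhealthy verdict to a specific subsystem such as etcd.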
This guide covers every major control plane failure: API server degradation, etcd issues, scheduler and controller manager failures, admission webhook timeouts, and certificate expiry — with diagnosis and recovery steps for both self-managed and managed (AKS/EKS/GKE) clusters.
Control Plane Architecture — What Can Fail
```
kubectl / CI/CD pipelines
        |
        v
   API Server <----------------> etcd (cluster state)
        |
        +---> Scheduler (places pods on nodes)
        |
        +---> Controller Manager (reconciles desired vs actual state)
        |
        +---> Cloud Controller (manages cloud resources: LBs, nodes)
        |
        v
   kubelet on each node
```
Each component has a specific failure signature:
| Component | What breaks when it fails |
|---|---|
| API Server | All kubectl commands fail. Deployments cannot be created or updated. |
| etcd | API server cannot read or write state. Cluster appears frozen. |
| Scheduler | New pods stay Pending indefinitely. |
| Controller Manager | Deployments do not scale. Failed pods are not replaced. ReplicaSets are not reconciled. |
| Cloud Controller | New LoadBalancers not provisioned. Node registration may fail. |
| Admission Webhooks | All API requests of certain types fail or are delayed. |
Step 1 — The 5-Minute Control Plane Triage
```bash
# 1. Can you reach the API server at all?
kubectl cluster-info

# 2. How long does a simple request take?
time kubectl get nodes

# 3. Check control plane component health
# (componentstatuses is deprecated since v1.19 and may be empty on newer clusters)
kubectl get componentstatuses

# 4. Check system pods
kubectl get pods -n kube-system

# 5. Check recent cluster events for error patterns
kubectl get events -n kube-system --sort-by='.lastTimestamp' | tail -20
```
If kubectl cluster-info times out or returns a connection error, the API server is unreachable. Everything else in this guide depends on API server access — if it is completely down in a managed cluster (AKS, EKS, GKE), open a support ticket immediately. The control plane is the cloud provider’s responsibility in managed clusters.
Cause 1 — API Server Slow or Unresponsive
Symptoms:
- `kubectl` commands hang for 10–30 seconds before responding
- Deployments take minutes to roll out instead of seconds
- `kubectl get pods` returns stale data or times out
- CI/CD pipelines time out during deployment steps
How to Diagnose
```bash
# Measure API server response time
time kubectl get pods --all-namespaces > /dev/null
# Normal: under 1 second
# Degraded: 5–30 seconds
# Down: timeout or connection refused

# Check API server request metrics
kubectl get --raw /metrics | grep apiserver_request_duration_seconds_bucket | head -20

# Check for request queue depth
kubectl get --raw /metrics | grep apiserver_current_inflight_requests

# Check API server pod logs (self-managed clusters)
kubectl logs -n kube-system kube-apiserver-<node> --tail=50

# For AKS — check control plane diagnostics
az aks show --resource-group <rg> --name <cluster> \
  --query "provisioningState"
```
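The latency thresholds above can be wrapped in a small helper so a script or cron'd health check classifies API server state automatically. A sketch — the bucket boundaries are rough approximations of the rules of thumb above, and `date +%s%3N` assumes GNU date:

```bash
# Bucket an API round-trip time (in milliseconds) using rough thresholds:
# under 1 s is normal, up to 30 s is degraded, beyond that treat it as down.
classify_ms() {
  local ms=$1
  if   [ "$ms" -lt 1000 ];  then echo normal
  elif [ "$ms" -lt 30000 ]; then echo degraded
  else                           echo down
  fi
}

# Measure one kubectl call in milliseconds (GNU date provides %3N).
start=$(date +%s%3N)
kubectl get nodes >/dev/null 2>&1
end=$(date +%s%3N)
echo "API latency: $((end - start)) ms -> $(classify_ms $((end - start)))"
```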
Common Causes of API Server Slowness
Admission webhook timeouts
A single misconfigured admission webhook with no timeout can block all API requests of a specific type. Webhooks are called synchronously — if the webhook server is slow or down, every matching API request waits.
```bash
# List all admission webhooks
kubectl get validatingwebhookconfigurations
kubectl get mutatingwebhookconfigurations

# Describe a specific webhook to check its configuration
kubectl describe validatingwebhookconfiguration <webhook-name>
# Look for:
#   TimeoutSeconds: <not set or very high>   <- dangerous
#   FailurePolicy:  Fail                     <- blocks requests if webhook is down
```
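Auditing every webhook for that dangerous combination by hand gets tedious; a small loop can flag them all in one pass. A sketch — it assumes `jq` is installed, and the `flag_webhook` helper name is mine:

```bash
# Flag webhooks that can block API requests: failurePolicy=Fail combined
# with no timeout (30 s default applies) or a long one.
flag_webhook() {
  local name=$1 policy=$2 timeout=$3
  if [ "$policy" = "Fail" ] && { [ -z "$timeout" ] || [ "$timeout" -gt 10 ]; }; then
    echo "DANGER: $name (failurePolicy=Fail, timeoutSeconds=${timeout:-unset})"
  else
    echo "ok: $name"
  fi
}

# Feed it every validating and mutating webhook in the cluster:
kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations -o json \
  | jq -r '.items[].webhooks[]? | "\(.name) \(.failurePolicy) \(.timeoutSeconds // "")"' \
  | while read -r name policy timeout; do flag_webhook "$name" "$policy" "$timeout"; done
```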
Fix: add timeouts and safe failure policies:
```yaml
webhooks:
  - name: my-webhook.example.com
    timeoutSeconds: 5        # never leave this unset
    failurePolicy: Ignore    # safe default for non-critical webhooks
    # failurePolicy: Fail    # only use for security-critical webhooks
```
```bash
# Emergency: disable a failing webhook temporarily
kubectl delete validatingwebhookconfiguration <webhook-name>

# Or patch failurePolicy to Ignore
kubectl patch validatingwebhookconfiguration <webhook-name> \
  --type=json \
  -p='[{"op":"replace","path":"/webhooks/0/failurePolicy","value":"Ignore"}]'
```
Too many LIST/WATCH requests flooding the API server
Runaway controllers, operators, or CI/CD tools that repeatedly list all resources can saturate the API server. This is called a “list bomb.”
```bash
# Check audit logs for high-frequency clients (self-managed)
# Look for clients making hundreds of LIST requests per minute

# Check current inflight requests by verb
kubectl get --raw /metrics | grep 'apiserver_current_inflight_requests{request_kind="mutating"}'
kubectl get --raw /metrics | grep 'apiserver_current_inflight_requests{request_kind="readOnly"}'

# Identify which service accounts are making the most requests
# Requires audit log access — available in AKS via Azure Monitor
```
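If you do have audit logs on disk (self-managed), a short pipeline surfaces the noisiest LIST clients. A sketch assuming `jq` and the standard `audit.k8s.io` JSON-lines event format; the log path in the usage comment is an assumption:

```bash
# Count LIST requests per user in a JSON-lines audit log read from stdin,
# noisiest clients first.
top_listers() {
  jq -r 'select(.verb == "list") | .user.username' \
    | sort | uniq -c | sort -rn | head
}

# Usage -- point it at wherever your audit policy writes events:
# top_listers < /var/log/kubernetes/audit.log
```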
etcd latency causing API server slowness
The API server stores all state in etcd. If etcd is slow, every read and write operation on the API server is slow. This is covered in detail in the etcd section below.
Cause 2 — etcd Degradation (Self-Managed Clusters)
etcd is the key-value store that holds all Kubernetes state. Every object — pods, deployments, configmaps, secrets — is stored in etcd. If etcd is degraded, the API server cannot function correctly.
Note: In managed clusters (AKS, EKS, GKE), you do not have direct access to etcd. If you suspect etcd issues in a managed cluster, escalate to your cloud provider immediately.
How to Diagnose
```bash
# Check etcd pod status
kubectl get pods -n kube-system | grep etcd

# Check etcd cluster health
kubectl exec -n kube-system etcd-<node> -- etcdctl endpoint health \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Check etcd cluster membership and leader status
kubectl exec -n kube-system etcd-<node> -- etcdctl endpoint status \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  -w table
```
Warning Signs in etcd Logs
```bash
kubectl logs -n kube-system etcd-<node> | grep -E "slow|timeout|overload|leader"
```
Critical messages:
```
took too long (200ms) to execute
failed to send out heartbeat on time; took too long, leader is overloaded
server is likely overloaded
```
These indicate etcd is under I/O pressure. etcd is extremely sensitive to disk latency — it requires SSDs with less than 10ms write latency. Spinning disks or overloaded cloud volumes will degrade etcd significantly.
```bash
# Check disk performance on the etcd node
iostat -x 1 10
# Look for: await (average I/O wait) — should be under 10ms
# If await is 50ms+, your disk is too slow for etcd
```
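If `iostat` is unavailable, or you want a number closer to etcd's actual write pattern, a crude synchronous-write check with `dd` gives a quick read. A rough sketch — `oflag=dsync` assumes GNU coreutils, and the data directory is the kubeadm default; this is a spot check, not a benchmark (use fio for real numbers):

```bash
# Average synchronous 8k write latency in a directory -- a rough stand-in
# for etcd's WAL fsync pattern.
check_sync_latency() {
  local dir=$1 n=100 start end
  start=$(date +%s%3N)
  dd if=/dev/zero of="$dir/latency-test" bs=8k count=$n oflag=dsync 2>/dev/null
  end=$(date +%s%3N)
  rm -f "$dir/latency-test"
  echo "$(( (end - start) / n )) ms avg per 8k sync write"
}

check_sync_latency "${ETCD_DIR:-/var/lib/etcd}"
# Healthy SSD: well under 10 ms. 50 ms+ means the disk is too slow for etcd.
```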
etcd Maintenance Operations
Compaction and defragmentation — over time etcd accumulates old revisions and its database grows. Regular compaction keeps it healthy:
```bash
# Get the current revision
REV=$(kubectl exec -n kube-system etcd-<node> -- etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint status -w json | jq '.[0].Status.header.revision')

# Compact old revisions
kubectl exec -n kube-system etcd-<node> -- etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  compact $REV

# Defragment to reclaim disk space
kubectl exec -n kube-system etcd-<node> -- etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  defrag
```
Taking an etcd snapshot before risky operations:
```bash
kubectl exec -n kube-system etcd-<node> -- etcdctl snapshot save /tmp/etcd-backup.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Copy snapshot off the node
kubectl cp kube-system/etcd-<node>:/tmp/etcd-backup.db ./etcd-backup-$(date +%Y%m%d).db
```
Rule: Always take an etcd snapshot before performing cluster upgrades, etcd compaction, or any control plane maintenance. This is your recovery path if something goes wrong.
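If you script these snapshots on a schedule, pair them with simple retention so the backup directory does not grow forever. A sketch — the directory layout and `etcd-backup-*.db` filename pattern mirror the kubectl cp example above and are assumptions:

```bash
# Delete all but the newest N snapshots matching the naming pattern above.
prune_backups() {
  local dir=$1 keep=$2
  # ls -1t sorts newest first; everything past line $keep gets removed.
  ls -1t "$dir"/etcd-backup-*.db 2>/dev/null | tail -n +$((keep + 1)) | xargs -r rm -f
}

# e.g. after each nightly snapshot copy:
# prune_backups ./backups 7   # keep one week of dailies
```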
Cause 3 — Scheduler Not Placing Pods
Symptoms:
- New pods stay in `Pending` indefinitely even though nodes have available capacity
- `kubectl describe pod` shows no scheduling events at all (not even "FailedScheduling")
- No `kubectl get events` entries from `default-scheduler`
How to Diagnose
```bash
# Check scheduler pod
kubectl get pods -n kube-system | grep scheduler

# Check scheduler logs
kubectl logs -n kube-system kube-scheduler-<node> --tail=50

# Verify scheduler has a leader (HA clusters)
kubectl get lease -n kube-system kube-scheduler
kubectl describe lease kube-scheduler -n kube-system
# Look for: Holder Identity — the current leader
```
In HA clusters, the scheduler uses leader election. Only the leader actively schedules pods. If the leader election lease is stale (old leader crashed without releasing the lease), no instance becomes the new leader.
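A stale lease is easy to spot by checking how long ago `renewTime` was updated — a healthy leader renews every few seconds. A sketch assuming `jq` and GNU date; the `lease_age` helper name is mine:

```bash
# Report the lease holder and seconds since it last renewed.
# Reads one coordination.k8s.io/v1 Lease object as JSON on stdin.
lease_age() {
  local json holder renew age
  json=$(cat)
  holder=$(printf '%s' "$json" | jq -r '.spec.holderIdentity // empty')
  renew=$(printf '%s' "$json" | jq -r '.spec.renewTime // empty')
  if [ -z "$holder" ]; then echo "no lease data"; return 0; fi
  age=$(( $(date +%s) - $(date -d "$renew" +%s) ))
  echo "$holder last renewed ${age}s ago"
}

kubectl get lease kube-scheduler -n kube-system -o json | lease_age
# A healthy leader renews every few seconds; an age of minutes means a stale lease.
```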
```bash
# Force new leader election by deleting the lease
kubectl delete lease kube-scheduler -n kube-system
# All scheduler instances will race to acquire the new lease
```
Fix
```bash
# Restart the scheduler (self-managed)
# If running as a static pod, delete the pod — kubelet recreates it automatically
kubectl delete pod kube-scheduler-<node> -n kube-system

# For managed clusters — the scheduler is managed by the cloud provider.
# If new pods are consistently not being scheduled, open a support ticket
```
Cause 4 — Controller Manager Failures
Symptoms:
- Deployments do not scale up or down
- Crashed pods are not replaced (ReplicaSet not reconciling)
- Jobs are not creating new pods
- CronJobs are not firing
How to Diagnose
```bash
# Check controller manager pod
kubectl get pods -n kube-system | grep controller-manager

# Check controller manager logs
kubectl logs -n kube-system kube-controller-manager-<node> --tail=50

# Test if reconciliation is working:
# create a simple deployment and check whether a ReplicaSet is created
kubectl create deployment test-cm --image=nginx --replicas=2 -n default
kubectl get replicaset -n default
# If no ReplicaSet appears after 30 seconds, the controller manager is not working
kubectl delete deployment test-cm -n default

# Check leader election lease
kubectl get lease kube-controller-manager -n kube-system
```
Fix
```bash
# Restart the controller manager (self-managed)
kubectl delete pod kube-controller-manager-<node> -n kube-system

# Force new leader election
kubectl delete lease kube-controller-manager -n kube-system
```
Cause 5 — Certificate Expiry
Kubernetes components communicate using mutual TLS. Every component — the API server, kubelet, controller manager, scheduler, and etcd — has certificates that expire. When they expire, components refuse to communicate with each other and the cluster enters a failure mode that looks like network issues until you check the certificates.
How to Diagnose
```bash
# Check certificate expiry for all control plane certs (self-managed with kubeadm)
kubeadm certs check-expiration
# Output shows expiry dates for each certificate:
# CERTIFICATE                EXPIRES                  RESIDUAL TIME
# admin.conf                 Jan 10, 2026 00:00 UTC   89d
# apiserver                  Jan 10, 2026 00:00 UTC   89d
# apiserver-etcd-client      Jan 10, 2026 00:00 UTC   89d
# etcd-healthcheck-client    Jan 10, 2026 00:00 UTC   89d

# Check a specific certificate manually
openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -dates

# Check kubelet certificate on a node
openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -dates
```
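The manual openssl checks can be looped over every control plane certificate so nothing is missed. A sketch — the pki paths are the kubeadm defaults, and GNU date is assumed for parsing openssl's date output:

```bash
# Days until a certificate expires (negative means already expired).
days_left() {
  local end now
  end=$(date -d "$(openssl x509 -in "$1" -noout -enddate | cut -d= -f2)" +%s)
  now=$(date +%s)
  echo $(( (end - now) / 86400 ))
}

# Warn on every control plane certificate expiring within 30 days:
for crt in /etc/kubernetes/pki/*.crt /etc/kubernetes/pki/etcd/*.crt; do
  [ -f "$crt" ] || continue
  d=$(days_left "$crt")
  if [ "$d" -lt 30 ]; then
    echo "WARNING: $crt expires in $d days"
  fi
done
```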
Fix — Renewing Certificates
```bash
# Renew all certificates at once (kubeadm clusters)
kubeadm certs renew all

# Renew a specific certificate
kubeadm certs renew apiserver

# After renewal, restart the control plane components.
# For static pod deployments:
crictl ps | grep -E "kube-apiserver|kube-controller|kube-scheduler|etcd"
crictl stop <container-id>
# kubelet automatically restarts static pod containers
```
Production rule: Set a calendar reminder 30 days before certificate expiry. Add a Prometheus alert on certificate expiry — for example, the API server's `apiserver_client_certificate_expiration_seconds` metric — if you have monitoring of control plane nodes. Certificate expiry is entirely preventable — it is always an operational oversight.
Real Production Example — Admission Webhook Blocking All Deployments
Scenario: On a Tuesday morning, every deployment across the cluster fails. CI/CD pipelines are reporting errors. No new pods can be created. Existing running pods are unaffected.
```bash
kubectl create deployment test --image=nginx -n default
# Error from server (InternalError): Internal error occurred:
# failed calling webhook "pod-policy.example.com":
# Post "https://policy-webhook.kube-system.svc:443/validate-pods?timeout=30s":
# context deadline exceeded
```
The error names the webhook directly.
```bash
# Check if the webhook service is running
kubectl get pods -n kube-system | grep policy-webhook
# policy-webhook-7d9f-xp2k   0/1   CrashLoopBackOff   12   2h

# The webhook pod has been crashing for 2 hours.
# With failurePolicy: Fail and no timeoutSeconds set,
# every pod creation request waits the full 30-second default timeout,
# then fails
```
Fix:
```bash
# Immediate: patch failurePolicy to Ignore to unblock the cluster
kubectl patch validatingwebhookconfiguration pod-policy \
  --type=json \
  -p='[{"op":"replace","path":"/webhooks/0/failurePolicy","value":"Ignore"}]'

# Deployments immediately succeed
kubectl create deployment test --image=nginx -n default
# deployment.apps/test created

# Then fix the webhook pod
kubectl describe pod policy-webhook-7d9f-xp2k -n kube-system
# Found: missing Secret reference after a namespace cleanup
kubectl create secret generic webhook-tls \
  --from-file=tls.crt=webhook.crt \
  --from-file=tls.key=webhook.key \
  -n kube-system
kubectl rollout restart deployment/policy-webhook -n kube-system

# Once the webhook is healthy, restore failurePolicy: Fail
kubectl patch validatingwebhookconfiguration pod-policy \
  --type=json \
  -p='[{"op":"replace","path":"/webhooks/0/failurePolicy","value":"Fail"}]'
```
Time to resolution: 14 minutes. Lesson: Every admission webhook must have `timeoutSeconds` set (5 seconds is a safe default) and a deliberate `failurePolicy`. A webhook with `failurePolicy: Fail` and no timeout is a cluster-wide single point of failure. Audit all webhooks during every cluster review.
Quick Reference
```bash
# Check API server reachability
kubectl cluster-info
time kubectl get nodes

# Check control plane pods
kubectl get pods -n kube-system | grep -E "apiserver|etcd|scheduler|controller"

# Check admission webhooks
kubectl get validatingwebhookconfigurations
kubectl get mutatingwebhookconfigurations

# Check etcd health (self-managed; add the --endpoints/--cacert/--cert/--key
# flags shown in the etcd section)
etcdctl endpoint health --cluster

# Check etcd leader
etcdctl endpoint status --cluster -w table

# Take etcd snapshot
etcdctl snapshot save /tmp/backup.db

# Check certificate expiry (kubeadm)
kubeadm certs check-expiration

# Renew all certificates
kubeadm certs renew all

# Check scheduler and controller manager leases
kubectl get lease -n kube-system

# Force new leader election
kubectl delete lease kube-scheduler -n kube-system
kubectl delete lease kube-controller-manager -n kube-system

# Patch webhook failurePolicy to Ignore (emergency)
kubectl patch validatingwebhookconfiguration <name> \
  --type=json \
  -p='[{"op":"replace","path":"/webhooks/0/failurePolicy","value":"Ignore"}]'
```
Summary
Control plane failures require a methodical approach because you are often debugging with limited tooling:
- API server unreachable — check admission webhooks first, then etcd health
- API server slow — measure latency, check webhook timeouts, check etcd I/O
- etcd degraded — check disk latency, compact and defragment, take snapshot before any action
- Scheduler not placing pods — check lease holder, restart pod or force leader election
- Controller manager not reconciling — check lease holder, test with a simple deployment
- Certificate expiry — run `kubeadm certs check-expiration` monthly, renew 30 days before expiry
In managed clusters, your responsibility ends at the API server boundary. Anything below — etcd, scheduler internals, certificate rotation — is the cloud provider’s job. Escalate quickly rather than spending hours on something you cannot access.