The Contradiction
At 3:47 AM, your monitoring dashboard shows a healthy Kubernetes cluster—99.97% availability. Your customers report a complete outage.
Ninety seconds later, the pod has self-healed. Metrics look normal. The restart counter reads “1.” But why it restarted—what actually happened—has vanished.
This isn’t a tooling failure. The system simply recovered faster than a human could observe.
The Experiment
To study this 90-second diagnostic gap, I ran a controlled experiment. The goal wasn’t full-scale production, but to recreate the timing behavior of failures that self-heal quickly. The failure type—OOMKill followed by pod restart—behaves the same in small and large clusters; only the impact scale differs.
Environment:
- Cluster: 3-node Minikube (Kubernetes v1.31)
- Pod: 128Mi memory limit
- Monitoring: Prometheus + Grafana
- Event retention: default (1 hour or 1000 events, whichever comes first)
The scenario: A pod gradually leaks memory until the kernel OOMKills it. Kubernetes restarts the pod automatically. An engineer investigates 90 seconds later, simulating realistic alert propagation, context switching, and initial triage.
Timeline:
- T+0s: Pod running, healthy baseline
- T+3s: Memory hits limit, kernel OOMKills the container
- T+5s: Kubernetes restarts the pod
- T+90s: Engineer begins investigation
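For reference, a minimal manifest in the spirit of this setup might look like the sketch below. The image, args, and allocation rate are illustrative (borrowed from the common stress-based memory examples); the lab's actual workload leaked memory more gradually.
$ kubectl create namespace lab01-test
$ cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: memory-leak-test
  namespace: lab01-test
spec:
  containers:
  - name: leaker
    image: polinux/stress          # illustrative memory-hungry workload
    resources:
      requests:
        memory: "64Mi"
      limits:
        memory: "128Mi"            # matches the experiment's limit
    command: ["stress"]
    args: ["--vm", "1", "--vm-bytes", "200M", "--vm-hang", "1"]   # allocates past the limit
EOF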
Key Finding: The failure lasted only 3 seconds; Kubernetes recovered in 2. By the time a human arrived, all critical Kubernetes decision context had vanished, even though application logs remained.
This experiment focuses on Kubernetes decision context, not application log retention. Even organizations with mature centralized logging face this diagnostic gap.
What the Engineer Sees
The following artifacts represent what an on-call engineer can reasonably observe 90 seconds after recovery.
Pod status:
$ kubectl get pod memory-leak-test -n lab01-test
NAME READY STATUS RESTARTS AGE
memory-leak-test 1/1 Running 1 (90s ago) 2m
The pod appears healthy. Restart count is visible. But why did it restart?
Pod details:
$ kubectl describe pod memory-leak-test -n lab01-test
...
Last State: Terminated
Reason: OOMKilled
Exit Code: 1
Started: Sat, 10 Jan 2026 23:19:42 -0500
Finished: Sat, 10 Jan 2026 23:19:57 -0500
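The same last-state fields can also be pulled programmatically, which is useful in alert annotations or capture scripts. A sketch (the field path assumes a single-container pod):
$ kubectl get pod memory-leak-test -n lab01-test \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'
# prints: OOMKilled (for this scenario)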
Good—the engineer can see it was OOMKilled. But this raises more questions than it answers:
- What was the memory usage pattern before the kill?
- What triggered the memory spike?
- Has this happened before?
- What else was happening on that node?
Events:
$ kubectl get events -n lab01-test | grep -i oom
# No results
The OOM event has already rotated out. In this experiment, it disappeared in under 90 seconds—faster than a realistic human response time.
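One fallback that sometimes outlives the event is the node's kernel log, where the OOM killer records its decision. This requires node access, and the message wording varies by kernel and cgroup version, so treat it as a best-effort check:
$ minikube ssh "sudo dmesg" | grep -i "out of memory"
# on a multi-node cluster, add -n <node-name> to target the node that ran the pod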
Previous logs:
$ kubectl logs memory-leak-test -n lab01-test --previous
# May or may not be available
In this case, previous container logs were accessible. But this varies by configuration—many clusters lose terminated container logs immediately.
The diagnostic questions that remain unanswered:
- What was memory usage 30 seconds before OOMKill?
- What code path triggered the allocation?
- Were there ConfigMap changes before the spike?
- What was the node resource state at failure time?
- Is this a pattern, or a one-time event?
These questions require manual correlation across Prometheus, logs, and platform state—none of which preserve the temporal context needed for efficient diagnosis.
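Even the first question, memory usage in the 30 seconds before the kill, means hand-writing a historical query. A sketch against the Prometheus HTTP range-query API (the service address is illustrative; the metric assumes a standard kubelet/cAdvisor scrape):
$ curl -s 'http://prometheus.monitoring.svc:9090/api/v1/query_range' \
    --data-urlencode 'query=container_memory_working_set_bytes{namespace="lab01-test",pod="memory-leak-test"}' \
    --data-urlencode 'start=2026-01-10T23:19:15Z' \
    --data-urlencode 'end=2026-01-10T23:20:00Z' \
    --data-urlencode 'step=5s'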
The Missing Diagnostic Primitive
What Kubernetes lacks is time-bounded state correlation—the ability to query “what was true at time T” and correlate signals across system boundaries within a temporal window.
This isn’t a missing feature. It’s a missing architectural capability.
Databases preserve write-ahead logs that enable point-in-time recovery. Distributed tracing systems preserve request context across service boundaries. Kubernetes preserves intent (desired state in etcd), but not execution history—what decisions were made, what resources were available, what constraints applied.
Without this primitive, diagnosis becomes archaeology: reconstructing past state from fragmentary evidence rather than querying preserved truth.
How the Gap Breaks in Practice
This single architectural gap creates three distinct failure modes in production diagnostics.
Temporal Decay
Without time-bounded queries, Kubernetes forgets failures faster than humans can observe them.
Events rotate by default after 1 hour or 1000 events, whichever comes first. In active clusters, 1000 events can accumulate in minutes. In our experiment, OOM evidence disappeared in under 90 seconds.
Container logs from terminated pods have variable retention—some clusters preserve them briefly, others lose them immediately. Current metrics show post-recovery state, not failure state.
Kubernetes treats failures as transient implementation details, not first-class historical events.
When an engineer arrives to investigate, they find a healthy system with a restart counter but no diagnostic context. The evidence that would explain why the restart occurred has already been garbage collected.
Snapshot Absence
Without historical state snapshots, Kubernetes can only explain what exists—never what existed.
Databases preserve write-ahead logs. Kubernetes preserves intent, but not execution history.
What’s missing:
- kubectl describe pod --at-time="2026-01-10T23:20:00Z" doesn’t exist
- ConfigMap and Secret contents at failure time are not queryable
- Scheduler decision reasoning is not preserved
- Node resource state when placement occurred is not retrievable
An engineer can see the current pod specification. They cannot see what the specification was when the failure occurred, or what the surrounding platform state looked like at that moment.
This is not about observability tooling—it’s about architectural capability. The system state needed to explain Kubernetes decisions is ephemeral by design.
Correlation Gap
Without a shared temporal frame, signals cannot be correlated—and ownership fragments by default.
Each system answers a different diagnostic question—but incidents require answers to questions no system is responsible for.
Consider our OOMKill scenario:
- Memory spike: Prometheus metric (metrics team)
- OOMKill event: Kubernetes events (platform team)
- Application error: Container logs (app team)
- Network latency spike: CNI logs (network team)
None of these systems reference each other. No shared transaction ID. No temporal correlation mechanism. The data exists in isolation.
Even with perfect retention in each system, correlation is the engineer’s responsibility—performed manually, after the fact, under time pressure during an incident.
But We Have Centralized Logging—Doesn’t That Fix This?
A common response to this experiment is: “We have centralized logging—this isn’t a problem for us.”
Logs tell you what the application said—not what the platform decided.
Centralized logging certainly preserves application output, and it is necessary. But it does not preserve Kubernetes decision context.
What logs capture:
- Application stdout/stderr
- Container output
What logs don’t capture:
- Pod spec, ConfigMap, or Secret versions at failure time
- Node resource state
- Scheduler or kubelet decisions
- Cgroup enforcement context
The result: metrics, logs, and events exist in isolation. Engineers must manually correlate them across time and systems, reconstructing what Kubernetes actually did. Evidence exists—but the explanation does not.
What This Means for SRE Teams
Most “root cause analyses” are reconstructions, not observations.
Typical incident workflow:
- Pod restarts, alert fires
- Engineer responds (2-5 minutes elapsed)
- kubectl shows a healthy pod with a restart count
- Engineer searches for clues:
  - Check events (may have rotated)
  - Query Prometheus (manual historical query needed)
  - Check logs (may be unavailable for terminated container)
  - Ask “has this happened before?” (no pattern detection available)
- Spend time reconstructing timeline across disconnected systems
- Maybe find root cause, maybe document “transient issue”
The pod failed and recovered in 5 seconds. The engineer spent 30-45 minutes reconstructing what happened. This time tax applies to every unexpected restart, every OOMKill, every eviction.
The broader impacts:
- Repeat incidents go undetected (no historical pattern matching)
- New team members struggle (tribal knowledge required for manual correlation)
- Post-mortems lack complete data (evidence gaps prevent true root cause)
- On-call fatigue increases (every incident requires archaeological investigation)
The Missing Diagnostic Primitives
The primitives Kubernetes needs are direct responses to the failure modes documented above.
Primitive 1: Time-Bounded State Queries
Addresses: Temporal decay
What it is: The ability to query historical Kubernetes state at a specific timestamp.
# Hypothetical command
kubectl describe pod memory-leak-test --at-time="2026-01-10T23:20:00Z"
This would return:
- Pod specification as it existed at that moment
- ConfigMap and Secret contents referenced by the pod
- Node resources available when scheduled
- Events that existed at that timestamp
Why observability tools don’t substitute: Prometheus preserves metric history, but not Kubernetes object state. You can graph memory usage over time, but you cannot query “what was in this ConfigMap when the pod started?” Metrics show symptoms; state snapshots show conditions.
The architectural gap: Kubernetes etcd stores current state efficiently, but historical state requires deliberate preservation with queryable indexes. This is a design choice—optimization for operational efficiency over diagnostic capability.
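The closest workaround today is deliberate, periodic snapshotting of object state. It is crude, but it approximates "state at time T" for the objects you choose to capture. A sketch (scope and schedule are up to the team; include Secrets only if the snapshot store is protected accordingly):
$ kubectl get pod,configmap -n lab01-test -o yaml \
    > "lab01-test-state-$(date -u +%Y%m%dT%H%M%SZ).yaml"
# run on a schedule (e.g., a CronJob or CI job) and keep the timestamped files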
Primitive 2: Cross-System Temporal Correlation
Addresses: Correlation gap
What it is: A shared temporal frame with correlation identifiers across metrics, logs, events, and platform state.
# Hypothetical command
kubectl correlate --pod=memory-leak-test \
--time-range="23:19:30 to 23:20:30" \
--include=events,metrics,logs
This would return a unified timeline showing:
- What changed in Prometheus metrics
- Which Kubernetes events fired
- What appeared in container logs
- Which platform decisions were made
All anchored to a shared timestamp window with correlation identifiers.
Why observability tools don’t substitute: Distributed tracing solves this for requests flowing through applications. But platform-level decisions—scheduling, eviction, resource enforcement—don’t participate in trace contexts. Each system maintains its own timeline with its own timestamp precision and retention policy.
The architectural gap: Correlation requires cooperation from components that were never designed to coordinate. The kubelet doesn’t emit trace IDs. The kernel doesn’t tag OOMKills with pod UIDs. Events don’t reference metric timestamps. This coordination layer doesn’t exist.
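The manual stand-in today is to pull each signal for the same window and line the timestamps up by hand. For the event side, something like the sketch below works, with the earlier Prometheus range query supplying the metric side. It only helps while the events still exist, which is exactly the problem:
$ kubectl get events -n lab01-test \
    --field-selector involvedObject.name=memory-leak-test \
    --sort-by=.lastTimestamp \
    -o custom-columns=TIME:.lastTimestamp,REASON:.reason,MESSAGE:.message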
Primitive 3: Intent vs Outcome Tracking
Addresses: Both temporal decay and snapshot absence
What it is: Preserved decision history showing what Kubernetes tried to do, what constraints it faced, and what actually happened.
# Hypothetical command
kubectl explain failure memory-leak-test --restart=1
This would show:
- What the HPA wanted (desired replica count)
- What the scheduler attempted (placement decisions)
- What constraints applied (resource quotas, node selectors, taints)
- What succeeded and what failed
- Why specific decisions were made
Why observability tools don’t substitute: Controller logs show what actions were taken, but not the reasoning or alternatives considered. You can see “scaled to 3 replicas” but not “wanted 5, only 3 nodes had capacity, quota prevented more.” The decision context is never serialized.
The architectural gap: Controllers operate on a reconciliation loop, comparing desired and actual state. The intermediate reasoning—what was attempted, what constraints blocked it, what alternatives were considered—exists only in memory during execution and is never persisted.
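What exists today are fragments: scheduling failures surface as events, and controller reasoning only as unstructured log lines. For example (the label selector assumes a kubeadm-style control plane, which includes Minikube):
$ kubectl get events -A --field-selector reason=FailedScheduling
$ kubectl logs -n kube-system -l component=kube-scheduler --tail=50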
Living With the Gap (For Now)
Until these primitives exist, teams compensate with partial mitigations. These are treatments for symptoms, not solutions to the underlying architectural gap.
Short-term compensations:
- Increase event retention to 24 hours (postpones rotation, doesn’t eliminate it; see the example after this list)
- Enable terminated container log retention (when platform supports it)
- Create Prometheus recording rules for common diagnostic queries
- Build incident runbooks that codify manual correlation steps
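For example, event TTL is an API server flag, so on Minikube it can be raised at cluster start; on kubeadm clusters, set the same flag in the kube-apiserver static pod manifest. Note that 24 hours is a tradeoff against etcd growth:
$ minikube start --nodes=3 --extra-config=apiserver.event-ttl=24h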
Medium-term mitigations:
- Implement centralized logging (ELK, Splunk, Loki) for application output
- Deploy distributed tracing (Jaeger, Tempo) for request-level correlation
- Use event exporters (kube-eventer) to forward events to durable storage
- Create custom diagnostic capture workflows
Complementary tooling:
- Minimal diagnostic capture script (bundles pod specs, events, node state, and logs at specific time boundaries); a sketch follows this list.
- Production cluster-wide health snapshot: see kubectl-health-snapshot.
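A minimal version of such a capture script might look like the sketch below. It is illustrative, not the linked script itself, and assumes kubectl access plus metrics-server for kubectl top:
$ cat > capture.sh <<'EOF'
#!/usr/bin/env bash
# Capture point-in-time diagnostic state for a pod: spec, events, node state, logs.
# Usage: ./capture.sh <namespace> <pod>
set -euo pipefail
NS="$1"; POD="$2"
DIR="capture-${NS}-${POD}-$(date -u +%Y%m%dT%H%M%SZ)"
mkdir -p "$DIR"
kubectl get pod "$POD" -n "$NS" -o yaml              > "$DIR/pod.yaml"
kubectl describe pod "$POD" -n "$NS"                 > "$DIR/describe.txt"
kubectl get events -n "$NS" --sort-by=.lastTimestamp > "$DIR/events.txt"
kubectl logs "$POD" -n "$NS" --previous              > "$DIR/logs-previous.txt" 2>&1 || true
kubectl get nodes -o wide                            > "$DIR/nodes.txt"
kubectl top node                                     > "$DIR/node-usage.txt" 2>&1 || true
echo "captured to $DIR"
EOF
$ chmod +x capture.sh && ./capture.sh lab01-test memory-leak-test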
Conclusion
Kubernetes is optimized for self-healing, not for explaining its decisions.
The 90-second evidence gap is architectural, not a tooling bug.
Without time-bounded state queries, cross-system correlation, and intent tracking:
- Engineers spend far more time reconstructing incidents than machines take to recover.
- Repeat failures often go undetected.
- Post-mortems are incomplete; on-call fatigue rises.
The proposed primitives address real-world failure modes, not hypothetical scenarios.
Until these primitives exist, incident response remains an exercise in archaeology.
Next Steps:
- Reproduce the experiment: kubernetes-diagnostic-primitives repo
- Labs 2 & 3 in progress: deep dive into correlation fragmentation and intent opacity.
- Stay tuned for practical workflows for cross-system incident correlation.

