The Contradiction
At 3:47 AM, your monitoring dashboard shows a healthy Kubernetes cluster—99.97% availability. Your customers report a complete outage.
Ninety seconds later, the pod has self-healed. Metrics look normal. The restart counter reads “1.” But why it restarted—what actually happened—has vanished.
This isn’t a tooling failure. The system simply recovered faster than a human could observe.
The Experiment
To study this 90-second diagnostic gap, I ran a controlled experiment. The goal wasn’t full-scale production, but to recreate the timing behavior of failures that self-heal quickly. The failure type—OOMKill followed by pod restart—behaves the same in small and large clusters; only the impact scale differs.
Environment:
- Cluster: 3-node Minikube (Kubernetes v1.31)
- Pod: 128Mi memory limit
- Monitoring: Prometheus + Grafana
- Event retention: default (1 hour or 1000 events, whichever comes first)
The scenario: A pod gradually leaks memory until the kernel OOMKills it. Kubernetes restarts the pod automatically. An engineer investigates 90 seconds later, simulating realistic alert propagation, context switching, and initial triage.
Timeline:
- T+0s: Pod running, healthy baseline
- T+3s: Memory hits limit, kernel OOMKills the container
- T+5s: Kubernetes restarts the pod
- T+90s: Engineer begins investigation
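For reference, a minimal manifest in the spirit of this setup might look like the sketch below. The image, args, and allocation rate are illustrative (borrowed from the common stress-based memory examples); the lab's actual workload leaked memory more gradually.
$ kubectl create namespace lab01-test
$ cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: memory-leak-test
  namespace: lab01-test
spec:
  containers:
  - name: leaker
    image: polinux/stress          # illustrative memory-hungry workload
    resources:
      requests:
        memory: "64Mi"
      limits:
        memory: "128Mi"            # matches the experiment's limit
    command: ["stress"]
    args: ["--vm", "1", "--vm-bytes", "200M", "--vm-hang", "1"]   # allocates past the limit
EOF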
Key Finding: The failure lasted only 3 seconds; Kubernetes recovered in 2. By the time a human arrived, all critical Kubernetes decision context had vanished, even though application logs remained.
This experiment focuses on Kubernetes decision context, not application log retention. Even organizations with mature centralized logging face this diagnostic gap.
What the Engineer Sees
The following artifacts represent what an on-call engineer can reasonably observe 90 seconds after recovery.
Pod status:
$ kubectl get pod memory-leak-test -n lab01-test
NAME READY STATUS RESTARTS AGE
memory-leak-test 1/1 Running 1 (90s ago) 2m
The pod appears healthy. Restart count is visible. But why did it restart?
Pod details:
$ kubectl describe pod memory-leak-test -n lab01-test
...
Last State: Terminated
Reason: OOMKilled
Exit Code: 1
Started: Sat, 10 Jan 2026 23:19:42 -0500
Finished: Sat, 10 Jan 2026 23:19:57 -0500
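The same last-state fields can also be pulled programmatically, which is useful in alert annotations or capture scripts. A sketch (the field path assumes a single-container pod):
$ kubectl get pod memory-leak-test -n lab01-test \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'
# prints: OOMKilled (for this scenario)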
Good—the engineer can see it was OOMKilled. But this raises more questions than it answers:
- What was the memory usage pattern before the kill?
- What triggered the memory spike?
- Has this happened before?
- What else was happening on that node?
Events:
$ kubectl get events -n lab01-test | grep -i oom
# No results
The OOM event has already rotated out. In this experiment, it disappeared in under 90 seconds—faster than a realistic human response time.
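One fallback that sometimes outlives the event is the node's kernel log, where the OOM killer records its decision. This requires node access, and the message wording varies by kernel and cgroup version, so treat it as a best-effort check:
$ minikube ssh "sudo dmesg" | grep -i "out of memory"
# on a multi-node cluster, add -n <node-name> to target the node that ran the pod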
Previous logs:
$ kubectl logs memory-leak-test -n lab01-test --previous
# May or may not be available
In this case, previous container logs were accessible. But this varies by configuration—many clusters lose terminated container logs immediately.
The diagnostic questions that remain unanswered:
- What was memory usage 30 seconds before OOMKill?
- What code path triggered the allocation?
- Were there ConfigMap changes before the spike?
- What was the node resource state at failure time?
- Is this a pattern, or a one-time event?
These questions require manual correlation across Prometheus, logs, and platform state—none of which preserve the temporal context needed for efficient diagnosis.
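Even the first question, memory usage in the 30 seconds before the kill, means hand-writing a historical query. A sketch against the Prometheus HTTP range-query API (the service address is illustrative; the metric assumes a standard kubelet/cAdvisor scrape):
$ curl -s 'http://prometheus.monitoring.svc:9090/api/v1/query_range' \
    --data-urlencode 'query=container_memory_working_set_bytes{namespace="lab01-test",pod="memory-leak-test"}' \
    --data-urlencode 'start=2026-01-10T23:19:15Z' \
    --data-urlencode 'end=2026-01-10T23:20:00Z' \
    --data-urlencode 'step=5s'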
The Missing Diagnostic Primitive
What Kubernetes lacks is time-bounded state correlation—the ability to query “what was true at time T” and correlate signals across system boundaries within a temporal window.
This isn’t a missing feature. It’s a missing architectural capability.
Databases preserve write-ahead logs that enable point-in-time recovery. Distributed tracing systems preserve request context across service boundaries. Kubernetes preserves intent (desired state in etcd), but not execution history—what decisions were made, what resources were available, what constraints applied.
Without this primitive, diagnosis becomes archaeology: reconstructing past state from fragmentary evidence rather than querying preserved truth.
How the Gap Breaks in Practice
This single architectural gap creates three distinct failure modes in production diagnostics.
Temporal Decay
Without time-bounded queries, Kubernetes forgets failures faster than humans can observe them.
Events rotate by default after 1 hour or 1000 events, whichever comes first. In active clusters, 1000 events can accumulate in minutes. In our experiment, OOM evidence disappeared in under 90 seconds.
Container logs from terminated pods have variable retention—some clusters preserve them briefly, others lose them immediately. Current metrics show post-recovery state, not failure state.
Kubernetes treats failures as transient implementation details, not first-class historical events.
When an engineer arrives to investigate, they find a healthy system with a restart counter but no diagnostic context. The evidence that would explain why the restart occurred has already been garbage collected.
Snapshot Absence
Without historical state snapshots, Kubernetes can only explain what exists—never what existed.
Databases preserve write-ahead logs. Kubernetes preserves intent, but not execution history.
What’s missing:
- kubectl describe pod --at-time="2026-01-10T23:20:00Z" doesn’t exist
- ConfigMap and Secret contents at failure time are not queryable
- Scheduler decision reasoning is not preserved
- Node resource state when placement occurred is not retrievable
An engineer can see the current pod specification. They cannot see what the specification was when the failure occurred, or what the surrounding platform state looked like at that moment.
This is not about observability tooling—it’s about architectural capability. The system state needed to explain Kubernetes decisions is ephemeral by design.
Correlation Gap
Without a shared temporal frame, signals cannot be correlated—and ownership fragments by default.
Each system answers a different diagnostic question—but incidents require answers to questions no system is responsible for.
Consider our OOMKill scenario:
- Memory spike: Prometheus metric (metrics team)
- OOMKill event: Kubernetes events (platform team)
- Application error: Container logs (app team)
- Network latency spike: CNI logs (network team)
None of these systems reference each other. No shared transaction ID. No temporal correlation mechanism. The data exists in isolation.
Even with perfect retention in each system, correlation is the engineer’s responsibility—performed manually, after the fact, under time pressure during an incident.
But We Have Centralized Logging—Doesn’t That Fix This?
A common response to this experiment is: “We have centralized logging—this isn’t a problem for us.”
Logs tell you what the application said—not what the platform decided.
Centralized logging certainly preserves application output, and it is necessary. But it does not preserve Kubernetes decision context.
What logs capture:
- Application stdout/stderr
- Container output
What logs don’t capture:
- Pod spec, ConfigMap, or Secret versions at failure time
- Node resource state
- Scheduler or kubelet decisions
- Cgroup enforcement context
The result: metrics, logs, and events exist in isolation. Engineers must manually correlate them across time and systems, reconstructing what Kubernetes actually did. Evidence exists—but the explanation does not.
What This Means for SRE Teams
Most “root cause analyses” are reconstructions, not observations.
Typical incident workflow:
- Pod restarts, alert fires
- Engineer responds (2-5 minutes elapsed)
- kubectl shows a healthy pod with a restart count
- Engineer searches for clues:
  - Check events (may have rotated)
  - Query Prometheus (manual historical query needed)
  - Check logs (may be unavailable for terminated container)
  - Ask “has this happened before?” (no pattern detection available)
- Spend time reconstructing timeline across disconnected systems
- Maybe find root cause, maybe document “transient issue”
The pod failed and recovered in 5 seconds. The engineer spent 30-45 minutes reconstructing what happened. This time tax applies to every unexpected restart, every OOMKill, every eviction.
The broader impacts:
- Repeat incidents go undetected (no historical pattern matching)
- New team members struggle (tribal knowledge required for manual correlation)
- Post-mortems lack complete data (evidence gaps prevent true root cause)
- On-call fatigue increases (every incident requires archaeological investigation)
The Missing Diagnostic Primitives
The primitives Kubernetes needs are direct responses to the failure modes documented above.
Primitive 1: Time-Bounded State Queries
Addresses: Temporal decay
What it is: The ability to query historical Kubernetes state at a specific timestamp.
# Hypothetical command
kubectl describe pod memory-leak-test --at-time="2026-01-10T23:20:00Z"
This would return:
- Pod specification as it existed at that moment
- ConfigMap and Secret contents referenced by the pod
- Node resources available when scheduled
- Events that existed at that timestamp
Why observability tools don’t substitute: Prometheus preserves metric history, but not Kubernetes object state. You can graph memory usage over time, but you cannot query “what was in this ConfigMap when the pod started?” Metrics show symptoms; state snapshots show conditions.
The architectural gap: Kubernetes etcd stores current state efficiently, but historical state requires deliberate preservation with queryable indexes. This is a design choice—optimization for operational efficiency over diagnostic capability.
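The closest workaround today is deliberate, periodic snapshotting of object state. It is crude, but it approximates "state at time T" for the objects you choose to capture. A sketch (scope and schedule are up to the team; include Secrets only if the snapshot store is protected accordingly):
$ kubectl get pod,configmap -n lab01-test -o yaml \
    > "lab01-test-state-$(date -u +%Y%m%dT%H%M%SZ).yaml"
# run on a schedule (e.g., a CronJob or CI job) and keep the timestamped files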
Primitive 2: Cross-System Temporal Correlation
Addresses: Correlation gap
What it is: A shared temporal frame with correlation identifiers across metrics, logs, events, and platform state.
# Hypothetical command
kubectl correlate --pod=memory-leak-test \
--time-range="23:19:30 to 23:20:30" \
--include=events,metrics,logs
This would return a unified timeline showing:
- What changed in Prometheus metrics
- Which Kubernetes events fired
- What appeared in container logs
- Which platform decisions were made
All anchored to a shared timestamp window with correlation identifiers.
Why observability tools don’t substitute: Distributed tracing solves this for requests flowing through applications. But platform-level decisions—scheduling, eviction, resource enforcement—don’t participate in trace contexts. Each system maintains its own timeline with its own timestamp precision and retention policy.
The architectural gap: Correlation requires cooperation from components that were never designed to coordinate. The kubelet doesn’t emit trace IDs. The kernel doesn’t tag OOMKills with pod UIDs. Events don’t reference metric timestamps. This coordination layer doesn’t exist.
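The manual stand-in today is to pull each signal for the same window and line the timestamps up by hand. For the event side, something like the sketch below works, with the earlier Prometheus range query supplying the metric side. It only helps while the events still exist, which is exactly the problem:
$ kubectl get events -n lab01-test \
    --field-selector involvedObject.name=memory-leak-test \
    --sort-by=.lastTimestamp \
    -o custom-columns=TIME:.lastTimestamp,REASON:.reason,MESSAGE:.message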
Primitive 3: Intent vs Outcome Tracking
Addresses: Both temporal decay and snapshot absence
What it is: Preserved decision history showing what Kubernetes tried to do, what constraints it faced, and what actually happened.
# Hypothetical command
kubectl explain failure memory-leak-test --restart=1
This would show:
- What the HPA wanted (desired replica count)
- What the scheduler attempted (placement decisions)
- What constraints applied (resource quotas, node selectors, taints)
- What succeeded and what failed
- Why specific decisions were made
Why observability tools don’t substitute: Controller logs show what actions were taken, but not the reasoning or alternatives considered. You can see “scaled to 3 replicas” but not “wanted 5, only 3 nodes had capacity, quota prevented more.” The decision context is never serialized.
The architectural gap: Controllers operate on a reconciliation loop, comparing desired and actual state. The intermediate reasoning—what was attempted, what constraints blocked it, what alternatives were considered—exists only in memory during execution and is never persisted.
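What exists today are fragments: scheduling failures surface as events, and controller reasoning only as unstructured log lines. For example (the label selector assumes a kubeadm-style control plane, which includes Minikube):
$ kubectl get events -A --field-selector reason=FailedScheduling
$ kubectl logs -n kube-system -l component=kube-scheduler --tail=50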
Living With the Gap (For Now)
Until these primitives exist, teams compensate with partial mitigations. These are treatments for symptoms, not solutions to the underlying architectural gap.
Short-term compensations:
- Increase event retention to 24 hours (postpones rotation, doesn’t eliminate it; see the example after this list)
- Enable terminated container log retention (when platform supports it)
- Create Prometheus recording rules for common diagnostic queries
- Build incident runbooks that codify manual correlation steps
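For example, event TTL is an API server flag, so on Minikube it can be raised at cluster start; on kubeadm clusters, set the same flag in the kube-apiserver static pod manifest. Note that 24 hours is a tradeoff against etcd growth:
$ minikube start --nodes=3 --extra-config=apiserver.event-ttl=24h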
Medium-term mitigations:
- Implement centralized logging (ELK, Splunk, Loki) for application output
- Deploy distributed tracing (Jaeger, Tempo) for request-level correlation
- Use event exporters (kube-eventer) to forward events to durable storage
- Create custom diagnostic capture workflows
Complementary tooling:
- Minimal diagnostic capture script (bundles pod specs, events, node state, and logs at specific time boundaries); a sketch follows this list.
- Production cluster-wide health snapshot: see kubectl-health-snapshot.
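A minimal version of such a capture script might look like the sketch below. It is illustrative, not the linked script itself, and assumes kubectl access plus metrics-server for kubectl top:
$ cat > capture.sh <<'EOF'
#!/usr/bin/env bash
# Capture point-in-time diagnostic state for a pod: spec, events, node state, logs.
# Usage: ./capture.sh <namespace> <pod>
set -euo pipefail
NS="$1"; POD="$2"
DIR="capture-${NS}-${POD}-$(date -u +%Y%m%dT%H%M%SZ)"
mkdir -p "$DIR"
kubectl get pod "$POD" -n "$NS" -o yaml              > "$DIR/pod.yaml"
kubectl describe pod "$POD" -n "$NS"                 > "$DIR/describe.txt"
kubectl get events -n "$NS" --sort-by=.lastTimestamp > "$DIR/events.txt"
kubectl logs "$POD" -n "$NS" --previous              > "$DIR/logs-previous.txt" 2>&1 || true
kubectl get nodes -o wide                            > "$DIR/nodes.txt"
kubectl top node                                     > "$DIR/node-usage.txt" 2>&1 || true
echo "captured to $DIR"
EOF
$ chmod +x capture.sh && ./capture.sh lab01-test memory-leak-test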
Conclusion
Kubernetes is optimized for self-healing, not for explaining its decisions.
The 90-second evidence gap is architectural, not a tooling bug.
Without time-bounded state queries, cross-system correlation, and intent tracking:
- Engineers spend far more time reconstructing incidents than machines take to recover.
- Repeat failures often go undetected.
- Post-mortems are incomplete; on-call fatigue rises.
The proposed primitives address real-world failure modes, not hypothetical scenarios.
Until these primitives exist, incident response remains an exercise in archaeology.
Next Steps:
- Reproduce the experiment: kubernetes-diagnostic-primitives repo
- Labs 2 & 3 in progress: deep dive into correlation fragmentation and intent opacity.
- Stay tuned for practical workflows for cross-system incident correlation.

