Building Production-Safe Agentic Remediation with Docker MCP Gateway: Lessons from 43% to 100% Accuracy

Co-authored with Mohammad-Ali A’râbi, Docker Captain


A Docker container exits with an OOMKilled error.

An experienced engineer generally knows what happens next. Check the logs. Confirm it’s actually memory pressure. Increase the limit. Restart the container. Move on.

The question is not whether a language model can perform those steps. Most modern models can. The harder problem is distinguishing between failures that should be remediated automatically and failures that should be escalated to a human operator.

A CrashLoopBackOff rarely benefits from repeated restarts. A memory leak is not solved by continuously increasing memory limits. Some failures are operational. Others are application defects. Treating them the same creates more problems than it solves. That distinction became the central problem in this project.

Over several weeks, we built a remediation system on Docker MCP Gateway and evaluated it across four common container failure scenarios. Every decision was compared against a documented expected outcome and validated through structured audit logs and automated analysis.

The first version was correct 43% of the time. After working through nine engineering challenges, the final version reached 100% correctness across the validation scenarios.

The most important lesson had nothing to do with model selection, prompt engineering, or tool integrations.

The difficult part was defining the operational boundary: when automation should act, when it should escalate, and how those decisions could be made visible, repeatable, and auditable.

This article covers what we built, what failed, and the engineering changes that improved the results.

The full code, audit logs, validation datasets, and analyzer scripts are available in the companion repository.


Why naive auto-remediation is dangerous

The most common mistake in operational automation is treating automatic remediation as the objective. It isn’t. A system that attempts to fix every incident automatically often creates more operational risk than it removes.

Consider a CrashLoopBackOff event.

Restarting the container does not address the underlying problem. If the application contains a configuration error or a software defect, the container simply crashes again. The result is additional operational noise layered on top of the original incident.

OOM events can create a different failure pattern.

Automatically increasing memory limits may restore service temporarily, but it can also conceal an underlying memory leak. The workload continues running while resource consumption grows over time. Months later, teams may find themselves running multi-gigabyte containers that should have required only a fraction of those resources.

A lack of auditability introduces another risk.

If remediation actions are applied without structured records, it becomes difficult to determine what actions were taken, what alternatives were considered, and why a particular response was selected. Production systems require traceability. “The system fixed it” is not a useful postmortem entry.

The safest automation systems are not the ones that perform the most actions. They are the ones with clearly defined operating boundaries.

The real engineering challenge is defining those boundaries, encoding them explicitly, and proving that the system remains within them.

From Mohammad-Ali A’râbi, Docker Captain:

The most dangerous assumption an engineering team can make today is treating an autonomous AI agent like a senior site reliability engineer. It is not. An AI agent is essentially an incredibly fast, highly capable, but entirely untrusted user possessing elevated infrastructure privileges. Over the last decade, the container ecosystem fought bitterly to establish the principle of least privilege. We moved away from running containers as root, we stripped Linux capabilities down to the absolute minimum, and we stopped mounting /var/run/docker.sock directly into monitoring tools because we recognized that operational convenience is the enemy of security. Through extensive vulnerability research and runtime escape scenarios, we have repeatedly proven that granting unrestricted host access to containerized processes inevitably leads to total system compromise.

Now, with the advent of agentic operations, we are at risk of unlearning all of those hard-won lessons. Handing an unconstrained language model the ability to arbitrarily restart containers or manipulate memory limits based on its internal, non-deterministic reasoning is the architectural equivalent of running every production workload with the --privileged flag. It is a catastrophe waiting to happen. The broader pattern defining the future of AI in container operations is not about maximizing the raw intelligence of the underlying model; it is about establishing absolute, cryptographic boundaries around what the model is permitted to execute.

This is exactly where the Docker MCP Gateway fits into the modern tooling strategy. It applies the principles of Zero Trust Network Architecture and Kubernetes Role-Based Access Control (RBAC) directly to the LLM tool-calling layer. By forcing every single autonomous decision through a central chokepoint that mandates HMAC authentication, enforces hard Redis-backed rate limits, and isolates the execution script inside an ephemeral container, we systematically strip the danger out of the equation. We protect the host from the agent just as rigorously as we protect the host from external threat actors.

A Docker Captain’s perspective: production AI infrastructure must mirror traditional high-security infrastructure. It requires verifiable supply chains, strict admission controllers, and immutable audit trails. The agent should never possess the API credentials; the Gateway should handle the secret injection. The agent should never touch the host filesystem; the containerized tool should act as the isolated proxy. If the AI hallucinates, spirals into an infinite loop, or encounters an edge case it does not understand, the infrastructure must mathematically guarantee that the system will fail safely and escalate to a human operator. True agentic autonomy is only possible when the infrastructure enforcing its boundaries is completely inflexible.


What Docker MCP Gateway gives you

The Model Context Protocol (MCP) is an open standard introduced by Anthropic in late 2024 that gives AI agents a uniform interface for calling external tools. One protocol, any tool, any agent. It’s now a multi-company standard with native support from Anthropic, OpenAI, Google DeepMind, and AWS.

MCP solves the protocol problem. It doesn’t solve the production problem.

In production, you need:

  • Authenticated tool calls (not just “the agent has the API key in plaintext somewhere”)
  • Rate limiting (agents can spiral fast)
  • Audit logging of every decision
  • Containerized tool isolation (so a misbehaving tool can’t take down its host)
  • Centralized policy enforcement (so adding a new server doesn’t require reconfiguring every client)

The Docker MCP Gateway is the layer that adds all of this. It sits between AI clients and MCP servers, packaging tool execution as Docker containers and routing every tool call through a single secure enforcement point.

For our work, we built a custom MCP server inside Docker that exposes three remediation tools: check_container_logs, restart_container, and update_container_resources. Every call goes through HMAC authentication, gets rate-limited via Redis, and gets written to a structured JSON audit log.

From Mohammad-Ali A’râbi, Docker Captain:

Docker’s AI tooling strategy is fundamentally about building a verifiable supply chain for reasoning engines. You cannot build secure AI on top of bloated, vulnerable foundations.

The strategy begins with Docker Hardened Images (DHI), providing agents and MCP servers with minimal attack-surface base images backed by cryptographically signed SLSA Level 3 provenance. Docker Hub MCP then acts as the intelligence layer, allowing agents to discover and navigate trusted container artifacts through natural-language interactions. The Docker MCP Gateway establishes the execution perimeter, isolating tools and enforcing cryptographic authentication at every tool call.

Together, these capabilities represents a broader architectural shift from securing application code to securing an agent’s entire operational blast radius. Recent supply-chain attacks such as Shai-Hulud 2.0 have shown that modern attackers increasingly target the automation layers that underpin software delivery. AI agents now operate inside those same environments, making blast-radius reduction a first-class architectural concern.


A decision framework: when to auto-fix vs escalate

Before writing a line of agent code, we wrote down the expected behavior for each failure mode. This wasn’t a nice-to-have — it became the spec the agent had to satisfy and the basis for our validation analyzer later.

Failure typeLikely causeSafe action
OOMKilledResource exhaustion (often legitimate)Auto-fix: increase memory
CrashLoopBackOffCode or configuration bugEscalate — never auto-restart
Single Exit (code 1)Could be transient (network, DB) or persistentTry restart once, escalate if it persists
HealthCheckFailureApp stuck or deadlockedAuto-fix: restart

The principle: transient and resource-driven → auto-fix. Persistent and code-driven → escalate.

This framing matters more than the implementation. It’s the part you should keep even if you replace every other piece of the system. The agent’s job isn’t to be smart — it’s to apply this rule consistently and visibly.

We chose to encode this in the agent’s system prompt rather than in code branching, which turned out to be one of our most important design decisions. More on that below.


The architecture in practice

The system has five logical layers running across three Docker Compose containers:

The architecture separates concerns into five layers. The AutoGen agent (GPT-3.5-turbo, cost-optimized for this decision space) handles reasoning and decision-making. The Docker MCP Gateway sits in front of the tools as a security enforcement point — every tool call passes through HMAC authentication, Redis-backed rate limiting (100 requests/hour), input validation, and structured audit logging. The MCP Tools layer exposes three remediation actions: check_container_logs, restart_container, and update_container_resources. Below that, the Docker API performs the actual container operations. In our current implementation, the Gateway and Tools layers are colocated in a single Python service for simplicity — in a multi-tenant production setup you’d separate them into distinct services that scale independently.

Every tool call generates an audit log entry like this:

{
  "timestamp": "2026-05-07T02:08:15.456Z",
  "incident_id": "inc-20260507-020815",
  "agent_id": "docker-ops-agent-001",
  "alert": {
    "description": "Docker container crashed with OOMKilled",
    "container_id": "nginx-oom-test",
    "status": "OOMKilled"
  },
  "decision_chain": [
    {"tool": "check_container_logs", "result": "..."},
    {"tool": "update_container_resources", "result": "Memory limit updated to 200MB"}
  ],
  "resolved": true
}

That structured output is what makes the system auditable. It’s also what makes our validation work possible.


The engineering reality: 43% to 100%

Across 7 development-phase incidents, our agent made the correct decision 43% of the time. Across 6 validation-phase incidents after applying our fixes, it was correct 100% of the time. Both datasets are committed in the repo’s monitoring/analysis directory.

PhaseRunsCorrectAvg turns/incident
Before fixes73/7 (43%)22.7
After fixes66/6 (100%)11.7

A note on sample size: this is a small dataset, intended to demonstrate reproducibility of the expected behavior across the four scenarios. It does not support statistical claims about reliability under load or at scale. We’re being explicit about that because most AI infrastructure articles aren’t.

What changed between the two phases is documented as nine challenges in the lab README. Three of them drove most of the improvement — and each one taught us something we wouldn’t have predicted from theory alone.

Challenge A — The OOM that couldn’t be fixed

In the early runs, the agent correctly diagnosed an OOMKilled container, called the memory-update tool, and got back this Docker error:

Memory limit should be smaller than already set memoryswap limit, 
update the memoryswap at the same time

Then it correctly escalated, because it had no tool for updating memoryswap. By our analyzer’s expected-behavior rules, this was marked as “wrong” because the OOMKilled scenario expected AutoResolved, not Escalated.

But the agent’s logic was right. The bug wasn’t in the agent — it was in our test container’s --memory-swap configuration. Once we fixed that (set --memory-swap=-1 for unlimited swap), the agent’s behavior didn’t change at all. The same logic that escalated correctly before now succeeded correctly. The agent went from 0/2 to 2/2 correct.

Lesson: the agent is a mirror that exposes infrastructure problems you didn’t know you had. It made us look at our own test environment more carefully, which is something a static script never would have done.

Challenge B — The over-eager restart

In the first three CrashLoopBackOff runs, the agent restarted the container 2 out of 3 times. CrashLoopBackOff is exactly the failure mode where you should never restart — the container is crashing because of a code or config bug, not a transient state. Restarting just generates more crashes.

Our first instinct was to write code: add a check, branch the logic, route CrashLoopBackOff to a different path. We tried something simpler first. We tightened the system prompt:

For CrashLoopBackOff failures: ALWAYS escalate to a human operator.
NEVER attempt to restart the container. Restarting will only cause 
the container to crash again. Your role is to diagnose and report,
not to fix.

That single change — no code, just words in the prompt — made the agent consistently escalate on every subsequent run.

Lesson: explicit boundaries beat implicit reasoning. The system prompt is the agent’s policy file. If the policy isn’t written down, the model will improvise based on whatever it thinks is reasonable. For production agents, “improvise” is rarely what you want.

Challenge C — The hallucinated containers

After resolving real incidents, the agent started making up alerts for containers that didn’t exist — memory-hungry-app, app-crash-loop, none of which were ever in our system. It was inventing failures and then “responding” to them.

Root cause: AutoGen’s max_consecutive_auto_reply was set to 10. After the agent finished a real incident, the conversation framework kept giving it turns. Without a real prompt to respond to, it generated plausible-looking next incidents and walked itself through fake remediations.

Fix: drop max_consecutive_auto_reply to 3. The agent gets exactly enough turns to diagnose, act, and report — then the conversation ends.

Lesson: agentic loops need conversation discipline. Frameworks like AutoGen are designed to keep agents talking. In production, you want them to stop talking once the job is done.

From Mohammad-Ali A’râbi, Docker Captain:

Watching the agent’s correctness improve from 43% to 100% mirrors the exact evolution of DevSecOps: it proves that production AI is not a machine learning challenge; it is a systems engineering challenge. The initial failures were not the fault of the LLM; they were the result of implicit, undocumented policies and permissive execution environments.

Production AI engineering requires moving past the “magic” of conversational models and returning to rigorous, deterministic engineering discipline. It means treating the system prompt as an immutable policy file, writing explicit, boundary-defining rules that leave zero room for the model to improvise. It means enforcing aggressive Redis-backed rate limits to prevent hallucination loops, isolating execution tools to eliminate docker.sock vulnerabilities, and relying exclusively on structured JSON audit logs rather than plain text for forensic validation.

The agent is merely a component. The surrounding infrastructure—the cryptographic constraints, the isolated execution environments, and the hardcoded fallbacks—is what actually makes the system safe. Building trust in AI demands the exact same rigor we apply to cluster security: trust nothing, verify everything, and strictly log the rest.


Production patterns we’d recommend

If you’re building agentic infrastructure with Docker MCP Gateway or similar, these are the patterns that emerged from our nine challenges as the most worth keeping:

Authenticate every tool call, even in dev. We used HMAC signing on every request from agent to MCP server. It’s not just a production hardening — it catches integration bugs early because you’ll know immediately if your auth flow breaks.

Audit trail as structured JSON, not text logs. The audit format we used (incident ID, agent ID, alert, decision chain, resolved flag) made it possible to write an analyzer that validates agent behavior automatically. Plain text logs would have made that impossible.

Rate limit aggressively. We used Redis with 100 requests/hour per agent. Agents can spiral fast — one bug in the system prompt can trigger thousands of tool calls before you notice.

Conservative escalation as default. When the agent isn’t sure, it should escalate. The cost of one false-positive escalation is a paged human checking on a non-issue. The cost of one false-negative auto-fix is a masked production problem. Asymmetric risk, asymmetric default.

Validate against expected behavior. Write down what you expect each failure mode to do, then write an analyzer that checks the audit log against that spec. We open-sourced ours — it’s about 250 lines of Python, no external dependencies. You can adapt it to any agent that produces structured audit logs.

Tighten conversation turn limits. max_consecutive_auto_reply=3 is a sane starting point for production. The agent should do its job and then the conversation should end. Frameworks default to longer because they’re optimized for conversational AI demos, not production ops.


What’s still missing

This article would be marketing if we didn’t include this section. Honest engineering means owning what isn’t built yet.

No Docker Scout MCP server exists yet. Security-aware container discovery — “find the most secure nginx tag,” “show me CVEs in this image” — isn’t possible through MCP today. The Docker Hub MCP server has 13 tools but none of them surface vulnerability data. This is a real gap in the ecosystem.

No incident memory or pattern recognition. Our agent treats every incident as fresh. A production system would learn that this container OOMs every Tuesday at 4pm and recommend a permanent memory increase rather than reactively bumping it each time. We’ve left this as future work.

Sample sizes are small. Our 6 post-fix incidents prove the expected behavior is reproducible across the four scenarios. They don’t prove reliability under production load, traffic spikes, or adversarial conditions. We’d need 100x more data and load testing to make those claims.

MTTR is unmeasured. AutoGen records all decision-chain timestamps within microseconds of each other, so the per-incident duration data we collected isn’t usable as a real mean-time-to-recovery metric. Capturing real MTTR would require external timing instrumentation around the agent.

Gateway and tools are colocated. Our MCP server bundles the security pipeline (HMAC, rate limiting, audit) with the tool execution. In a true multi-tenant production setup, you’d separate these into distinct services so they can scale independently. Our current architecture is fine for a single team or environment; it would need refactoring before serving multiple agent populations.


What this means for AI infrastructure

The interesting part of building agentic infrastructure isn’t getting the agent to act. It’s getting it to not act when acting would make things worse. Docker MCP Gateway is one of the first production tools that takes this seriously — one that treats the infrastructure around the agent as the security layer, not the agent itself. It’s all infrastructure for building agents you can actually trust in production.

We expect this pattern — Gateway + scoped tools + validated decision boundaries + structured audit logs — to become the standard shape of production AI agents. Not because it’s elegant, but because it’s the only shape that makes them auditable, debuggable, and operable by teams larger than one person.

Our nine challenges are probably your next nine challenges. Take the analyzer, take the patterns, take the audit format — they’re all MIT-licensed in the companion repository. If you build something with them, we’d love to hear about it.


Mohammad-Ali A’râbi — Senior Backend Engineer at JobRad, Docker Captain, Snyk Ambassador, and founder of the Black Forest Docker community. Author of Docker and Kubernetes Security (published October 2025) and upcoming fiction novel Black Forest Shadow. He speaks at developer conferences and organizes meetups across Europe.

Shamsher Khan — Senior DevOps Engineer at GlobalLogic (a Hitachi company). IEEE Senior Member, DZone Core Member, CNCF blog contributor. Writes at OpsCart.


Companion repository: github.com/opscart/docker-security-practical-guide/tree/master/labs/11-docker-mcp-gateway

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top