Description
Disclosed in #2190.
Here's the original report to [email protected]:
Hi all,
an attacker who controls the container image for two containers that share a volume can race volume mounts during container initialization, by adding a symlink to the rootfs that points to a directory on the volume. The second container won't be able to see the actual mount, but it can race it by modifying the mount point on the volume.
This can be exploited for a full container breakout by racing readonly/mask mounts, allowing writes to dangerous paths like /proc/sys/kernel/core_pattern.
Example:
- The rootfs of container A has a symlink /proc -> /evil/level1
- Container A specifies a named volume mounted to /evil
- Container B, started before container A, shares this named volume and repeatedly swaps /evil/level1 and /evil/level1~
- Container A mounts procfs to /evil/level1~/level2, but when it remounts /proc/sys, it does so at /evil/level1/level2/sys.
This can reliably be reproduced using runc and podman on Fedora 30 (takes about 0-5s to win the race for me): https://gist.github.com/leoluk/82965ad9df58247202aa0e1878439092
SELinux would ordinarily prevent the exploit by disallowing container_t from writing usermodehelper_t, but it can be disabled by symlinking /proc/self/task/1/attr/exec to something benign like /proc/self/sched (bypassing the procfs check). AppArmor can be disabled similarly.
Docker specifies the mounts in a different order and mounts procfs after it mounts the volumes, mounting over the /proc symlink, which appears to prevent at least the /proc approach. I haven't tested other runc usage scenarios, for instance, k8s+cri-o might be vulnerable as well.
Fabian of Cure53 (in CC) created a minimal PoC that uses runc directly: https://gist.github.com/LiveOverflow/c937820b688922eb127fb760ce06dab9
There are other container init steps after the volume mount that can be raced, obvious ones being utils.CloseExecFrom and the AppArmor/SELinux attrs but there might be others, especially in mountToRootfs (like tricking remount into mounting the rootfs as rshared if there's another volume that specifies the flag, but I haven't tried that).
This is similar to the vulnerability I reported that Adam Iwaniuk disclosed during their Dragon Sector CTF (#2128) and a similar crun one (containers/crun#111).
The fix for the mounts is probably what Aleksa outlined here, using /proc/self/fd to resolve the path: containers/crun#111 (comment)
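As a rough illustration of the /proc/self/fd idea (a sketch under assumptions, not runc's or crun's actual code): open the mount target once, ask the kernel where that file descriptor really points, reject it if it lies outside the rootfs, and then use /proc/self/fd/N for every later operation instead of resolving the path string again. The function names `open_inside_root` and `fd_path` are hypothetical, and the example is Linux-only because it relies on `O_PATH` and procfs.

```python
import os
import tempfile


def open_inside_root(root, relpath):
    """Open relpath under root and verify where the descriptor really points.

    Returns an O_PATH file descriptor, or raises ValueError if symlink
    resolution led outside of root.
    """
    root_real = os.path.realpath(root)
    fd = os.open(os.path.join(root_real, relpath), os.O_PATH | os.O_CLOEXEC)
    try:
        # The kernel reports the path of the object the descriptor refers to.
        actual = os.readlink(f"/proc/self/fd/{fd}")
    except OSError:
        os.close(fd)
        raise
    if actual != root_real and not actual.startswith(root_real + os.sep):
        os.close(fd)
        raise ValueError(f"{relpath!r} resolved outside the rootfs: {actual}")
    return fd


def fd_path(fd):
    """Path that refers to the already-opened object, not to a path string."""
    return f"/proc/self/fd/{fd}"


if __name__ == "__main__":
    base = tempfile.mkdtemp()
    rootfs = os.path.join(base, "rootfs")
    volume = os.path.join(base, "volume")
    os.makedirs(os.path.join(rootfs, "sys"))
    os.makedirs(volume)
    # A hostile image ships /proc as a symlink pointing onto the shared volume.
    os.symlink(volume, os.path.join(rootfs, "proc"))

    fd = open_inside_root(rootfs, "sys")
    print("sys is usable via", fd_path(fd))
    os.close(fd)
    try:
        open_inside_root(rootfs, "proc")
    except ValueError as err:
        print("rejected:", err)
```

A mount helper would pass `fd_path(fd)` as the target for the mount and any later remount, so both steps act on the same directory even if the path is swapped in between.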