For years now, containers have been a hot topic everywhere. There are many container implementations, such as Docker, Linux Containers (LXC) and Singularity. It’s hard to say one understands what containers are without diving into all their gory details, so I decided to go on this exploration myself.

The actual motivation was (quite) a bit different, though: I was a TA for Operating Systems (H) in Spring 2020, and I wanted to bring a wave of innovation into the course labs, so I worked this out very early.

The contents of this article are listed in the Table of Contents at the top of this page. My implementation in my GitHub repository and the original lab documents (also written primarily by me, in Chinese) are linked right above.

My test environment is Ubuntu 18.04 LTS (Kernel 5.3, HWE 18.04). If yours differs, you can consult Google for details.

If you want to find out the exact system calls involved in a command-line tool, strace is your friend.

Code samples have a different license than this article

While this article is licensed under the CC BY-SA 4.0 license, code samples and snippets are taken from the GitHub repository, which is licensed under the GPL-3.0 license.

Experimenting with isolation

Before we jump straight to writing code, let’s warm ourselves up by playing with an existing, minimal container implementation, to get a better idea of our target.

Preparing the root filesystem

To keep things simple, we’re going to use the system images from the LXC project. Grab the latest Ubuntu image from https://images.linuxcontainers.org/images/ubuntu/, extract it somewhere convenient for you, and this part is almost done.

If you’re on a “modern” distro like latest Ubuntu, Debian or Fedora, you need to populate the /etc/machine-id file in the container image with a valid “machine ID”, because systemd needs it. A simple way to do this is

systemd-machine-id-setup --root=/path/to/your/rootfs

If you’re running systemd 240 or later, there’s a neater tool for this job:

systemd-id128 new > /path/to/your/rootfs/etc/machine-id

Playing with chroot

chroot is an old way to limit the directory tree a process (and its subprocesses) can see to a specific subtree. Under normal circumstances, processes cannot see anything outside the chroot’d directory. This is called a chroot jail. Understanding the concepts behind chroot is an important first step to understanding containers, even though a typical container does not use chroot (more on this below).

Using chroot is very easy:

chroot /path/to/your/rootfs

You now get a shell inside the chroot jail. You can perform file-based operations like running “regular” commands and editing system files. All changes are saved in this “container rootfs”. You can even try apt update and apt install vim and see if they work.

As you’re probably aware, chroot is just too simple and sometimes naive to be secure. You can try the following commands, but be sure to save your work. Proceed with caution!

reboot
mount
dd if=/dev/sda of=test bs=4k count=1
echo $$

Playing with systemd-nspawn

As you can see, chroot lacks too many security constraints. systemd-nspawn, on the other hand, is a complete container implementation, and is thus secure against random programs.

Using systemd-nspawn is equally easy:

cd /path/to/your/rootfs
systemd-nspawn

Now repeat your experiments in the chroot section and carefully observe the differences.

The base program

With your rootfs ready to rock, we’ll start with a fairly simple chroot-based program and modify it step by step until it becomes the container we want.

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h> // For wait(2)
#include <sys/wait.h>  // For wait(2)

const char *usage =
"Usage: %s <directory> <command> [args...]\n"
"\n"
"  Run <directory> as a container and execute <command>.\n";

void error_exit(int code, const char *message) {
    perror(message);
    _exit(code);
}

int main(int argc, char **argv) {
    if (argc < 3) {
        fprintf(stderr, usage, argv[0]);
        return 1;
    }
    if (chdir(argv[1]) == -1)
        error_exit(1, argv[1]);

    pid_t pid = fork();
    if (pid == 0) {
        // Child goes for target program
        if (chroot(".") == -1)
            error_exit(1, "chroot");
        execvp(argv[2], argv + 2);
        error_exit(255, "exec");
    }

    // Parent waits for child
    int status, ecode = 0;
    wait(&status);
    if (WIFEXITED(status)) {
        printf("Exited with status %d\n", WEXITSTATUS(status));
        ecode = WEXITSTATUS(status);
    } else if (WIFSIGNALED(status)) {
        printf("Killed by signal %d\n", WTERMSIG(status));
        ecode = -WTERMSIG(status);
    }
    return ecode;
}

Namespaces

Namespaces are a fundamental aspect of Linux containers. They provide isolation for a variety of mission-critical system resources like process IDs, hostnames, network stacks and inter-process communication. They are the key to making containers “look independent” from the host system.

As of Linux kernel 5.6 released in April 2020, there are 8 kinds of namespaces present:

  • Mount namespace isolates mount points (visibility) from the parent. New mount activity in the parent namespace won’t be visible in child namespaces; making mounts propagate between namespaces involves a separate mechanism called “mount propagation”. First appeared in 2002, Linux 2.4.19.
  • UTS namespace provides isolated hostnames. UTS stands for “UNIX Time-Sharing System”. First appeared in 2006, Linux 2.6.19.
  • IPC namespace isolates traditional System V-style IPC methods. First appeared in 2006, Linux 2.6.19.
  • PID namespace provides a separate set of process IDs so that a process may look different inside. This is important for certain programs to function properly, most notably the init process, which must be PID 1. First appeared in 2008, Linux 2.6.24.
  • Network namespace provides a full network stack of its own, suitable for creating isolated network environments for containers. First appeared in 2009, Linux 2.6.29.
  • User namespace allows mapping UIDs / GIDs from the container to the host, so that unprivileged users can perform certain tasks that normally require superuser privileges, without actually elevating themselves or posing risks to the host. First appeared in 2013, Linux 3.8.
  • Cgroup namespace provides isolated cgroup hierarchies so containers can safely utilize cgroup functionalities without affecting the host. First appeared in 2016, Linux 4.6.
  • Time namespace allows different processes to “see” different system times. First appeared in 2020, Linux 5.6.

There are two ways to get namespaces isolated: unshare() and clone(). The brief difference is that unshare isolates namespaces for the calling process (except the PID namespace - check the manual for details), while clone creates a new process with isolated namespaces. We’ll go for clone, because it’s the system call underneath Go’s exec.Command, and Go is used for popular container software like Docker and Singularity.

To utilize the clone system call, we need a few adaptations, the most notable of which are the entry function and the child stack (allocated with mmap() - I had problems with malloc() in my early testing). The rest is covered pretty well in the manual, so there’s no need to repeat it here (e.g. the SIGCHLD appearing in the flags parameter).

int child(void *arg) {
    printf("My name is %s\n", (char *)arg);
    return 3;
}

int main() {
    char name[] = "child";

#define STACK_SIZE (1024 * 1024) // 1 MiB
    void *child_stack = mmap(NULL, STACK_SIZE,
                             PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK,
                             -1, 0);
    // Assume stack grows downwards
    void *child_stack_start = child_stack + STACK_SIZE;

    int ch = clone(child, child_stack_start, SIGCHLD, name);
    int status;
    wait(&status);
    printf("Child exited with code %d\n", WEXITSTATUS(status));
    return 0;
}

And here are the required include headers:

#define _GNU_SOURCE    // Required for enabling clone(2)
#include <stdio.h>
#include <sched.h>     // For clone(2)
#include <signal.h>    // For SIGCHLD constant
#include <sys/mman.h>  // For mmap(2)
#include <sys/types.h> // For wait(2)
#include <sys/wait.h>  // For wait(2)

Now that we have clone ready, adding support for namespace isolation is as simple as adding flags to the parameters.

int ch = clone(child, child_stack_start, CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWPID | CLONE_NEWCGROUP | SIGCHLD, name);

Mounts

Traditionally, mounting is a way to map raw disks to accessible filesystems. Its usage has since evolved to support much more than disk mapping. We’re particularly interested in using special filesystems like /proc (the FS that provides runtime information like processes and kernel parameters), /sys (system settings, device information etc.), /tmp (a temporary filesystem backed by RAM) etc., without which a container won’t function properly.

For a minimal example, we’ll mount 4 essential filesystems with correct mount options for our container. They are the three mentioned above plus /dev as a tmpfs. We’ll also create a few device nodes under /dev so things can go smoothly when they’re needed (e.g. some_command > /dev/null).

We’re not using devtmpfs here

If you examine the current mounts on your host system, you’ll probably see that /dev is mounted as devtmpfs. While it may appear straightforward to use that, it’s unacceptable for a container: it exposes all device nodes to the container, which defeats the purpose of isolation. See this answer on Unix & Linux Stack Exchange.

To do this manually, you’ll issue the following commands in a shell.

mount -t tmpfs tmpfs /dev
mount -t proc proc /proc
mount -t sysfs sysfs /sys
mount -t tmpfs tmpfs /tmp

You can then run mount without arguments to see the mount results.

sysfs on /sys type sysfs (rw,relatime)
proc on /proc type proc (rw,relatime)
tmpfs on /dev type tmpfs (rw,relatime)
tmpfs on /tmp type tmpfs (rw,relatime)

If you compare this with the mount points in your host system, you may notice something different.

sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
udev on /dev type devtmpfs (rw,nosuid,noexec,relatime,size=65895752k,nr_inodes=16473938,mode=755)
tmpfs on /run type tmpfs (rw,nosuid,nodev,noexec,relatime,size=13191916k,mode=755)

The extra flags (nosuid,nodev,noexec) control the behavior of the mount point. For example, nosuid means set-uid bits are ignored for files under the mount point, while noexec prevents executing any program from inside it.

Now we’re going to do it in C. The system call is also named mount, and has the following signature:

int mount(const char *source, const char *target,
          const char *filesystemtype, unsigned long mountflags,
          const void *data);

It should be intuitive enough what the first three parameters are for, so for now we can just write

mount("tmpfs", "/dev", "tmpfs", 0, NULL);
mount("proc", "/proc", "proc", 0, NULL);
mount("sysfs", "/sys", "sysfs", 0, NULL);
mount("tmpfs", "/tmp", "tmpfs", 0, NULL);

The fourth parameter corresponds to the flags we discussed above. All applicable flags can be found in the man page for mount(2).
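
If we want to mimic the host’s stricter options, we can OR the corresponding flags together. A minimal sketch (the constants come from <sys/mount.h>):

mount("proc", "/proc", "proc", MS_NOSUID | MS_NODEV | MS_NOEXEC, NULL);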

Keep in mind, however, that the last parameter isn’t entirely useless. It’s simply not used for now, but it’ll play a role later. (Actually, you may have noticed that already. Good job.)

Creating device nodes

Now that we have an empty /dev directory, we should populate it with some device nodes so that software expecting their presence can work. At a minimum, we need null, zero, random and urandom, but you can add tty and console if you want (these two are a bit different - you have been warned).

Device nodes are created with mknod(2), whose prototype is:

int mknod(const char *path, mode_t mode, dev_t dev);

With a little research effort, we know we’ll call it like this:

mknod("/dev/something", S_IFCHR | 0666, makedev(MAJOR, MINOR));

To determine the device node numbers, you can take a look at the same nodes on the host system, using ls -l or stat. Don’t worry - the numbers for these special devices remain the same across Linux distros, unlike on BSD systems. It shouldn’t take long before you come to this (note the relative paths - we’re still inside the rootfs directory):

mknod("dev/null", S_IFCHR | 0666, makedev(1, 3));
mknod("dev/zero", S_IFCHR | 0666, makedev(1, 5));
mknod("dev/random", S_IFCHR | 0666, makedev(1, 8));
mknod("dev/urandom", S_IFCHR | 0666, makedev(1, 9));

pivot_root

We’re ready with mounts, so now we can take a look at switching the root filesystem for our container.

The base program used chroot() for the time being, but when it comes to a container with even baseline security, chroot is terrible. We have to resort to another Linux feature, pivot_root, for this purpose.

Let’s first take a look at its man page to determine its prototype.

int pivot_root(const char *new_root, const char *put_old);

This system call is special enough that we must also take care of its notes and requirements. For example, new_root must be a mount point. While the man page does offer the trick of bind-mounting the directory on top of itself, that’s too error-prone for us to adopt. Instead, we’ll create a temporary directory and bind-mount our rootfs onto it to serve as the mount point (sketched below).

const char *newroot = "/tmp/ispawn";
mkdir(newroot, 0755);
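
A directory by itself isn’t a mount point, though, so we bind-mount the container rootfs onto it. A minimal sketch, assuming we’re still chdir()’d into the rootfs directory as in the base program (the exact sequence in the repository may differ):

// On systemd-based hosts, mounts are "shared" by default, which pivot_root
// refuses, so first make everything private within our mount namespace
mount("none", "/", NULL, MS_REC | MS_PRIVATE, NULL);
// Turn the rootfs (our current directory) into a mount point at newroot
mount(".", newroot, NULL, MS_BIND | MS_REC, NULL);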

We also need a value for the second parameter to pivot_root, the put_old directory. The manual says the following:

put_old must be at or underneath new_root

A direct interpretation is that put_old must be a subpath under new_root, which means we can simply create a directory (or reuse an existing one) under new_root.

const char *put_old = "/tmp/ispawn/oldroot";
mkdir(put_old, 0755);

And now we can perform the pivot with the directories we just set up. One catch: glibc provides no wrapper for pivot_root, so we invoke it through syscall(2) (add <sys/syscall.h> to the includes):

syscall(SYS_pivot_root, newroot, put_old);
chdir("/"); // recommended by the man page: make sure our cwd is in the new root

If everything so far is correct, we should now be running inside the new root tree. The “old root”, or the root filesystem of the host system, is now available at /oldroot.

Obviously, a container shouldn’t be able to access the host filesystem without explicit grants, so we’re going to “hide” the old root. From the view within the container, it’s an ordinary mount point that we can just unmount. However, as there (definitely) are other processes on the host still using that filesystem, it can’t be unmounted directly.

There’s a technique called “lazy unmounting”, where existing processes continue to use the filesystem as usual, while other processes see it disappear. It could be dangerous, but as we’re the one and only process inside the container, we know it’s safe for us.

With all of that said, the actual code is really simple:

umount2("/oldroot", MNT_DETACH);

We’re using the umount2 system call because we need to pass it the extra flag. Now that the host filesystem is gone, we can remove the now-empty directory (remember, we’re doing clean-up jobs):

rmdir("/oldroot");

We’ve isolated our container’s filesystem from the host system, and can now proceed to securing and fortifying our container.

Capabilities

In the traditional UNIX era, there were only two privilege levels - privileged (root) and unprivileged (non-root) - where a privileged process has every privilege to alter the system, while an unprivileged process has none. Since Linux 2.2 in 1999, capabilities have been part of the kernel, so that unprivileged processes may acquire certain abilities needed for some task, while privileged processes may drop the capabilities they don’t need, allowing privilege control at a finer granularity. A ping process doesn’t need any privilege beyond sending ICMP packets, and a web server (probably) doesn’t need any privilege beyond binding to a low port (1 to 1023), do they?

With capabilities, unprivileged processes can be granted access to selected system functionalities, while privileged processes can be deprived of selected ones. For example, CAP_NET_BIND_SERVICE is the capability to bind to TCP or UDP ports between 1 and 1023, and CAP_CHOWN enables the use of chown(2).

Now let’s turn our focus back to containers. Without privilege separation, a “root” process inside a container can still do dangerous things, like scanning the hard drive where the host filesystem resides and manipulating it. This is definitely not expected, so we’re going to limit the capabilities the container can have as a whole.

The system calls behind capability manipulation are very complicated, so unlike in previous sections, we’re going to use wrapper libraries to aid with this. There are two options available, libcap and libcap-ng, of which the latter is easier to understand and use. Documentation is available for both. Note that since they’re “external” libraries, extra flags need to be supplied when compiling: for libcap you’ll add -lcap to the compilation command, and similarly for libcap-ng you’ll add -lcap-ng.

As an easier starting point, we’ll use Docker’s capabilities set to avoid having to sort everything out by ourselves. Before we start, there’s another thing to learn - the different “sets” of capabilities of a process. In a few short words,

  • The bounding set restricts the maximum possible set of capabilities a process (and all its descendants) can have
  • The effective set is what a process currently has in effect
  • The permitted set contains capabilities a process may raise into its effective set when “asked for” (using the appropriate system calls)

Naturally, we want to limit all three sets for the container. Using libcap-ng, the code is very simple:

capng_clear(CAPNG_SELECT_BOTH);
capng_updatev(CAPNG_ADD, (capng_type_t)(CAPNG_EFFECTIVE | CAPNG_PERMITTED | CAPNG_BOUNDING_SET),
    CAP_SETPCAP,
    // ...
    CAP_SETFCAP,
    -1);
capng_apply(CAPNG_SELECT_BOTH);

With capng_clear, we clear all capabilities from our pending changes, and add whitelisted capabilities, before finally applying the changes.

Achieving the same with libcap, however, is slightly more complicated, as there’s no direct “clear all” function - you have to list the capabilities yourself. Here’s an older version of my attempt if you want to study it. Nevertheless, it’s never bad to learn more.

SecComp

SecComp (Secure Computing) is a security facility of the Linux kernel that lets a process make a one-way transition into a “secure state” where no system call other than read(), write(), _exit() and sigreturn() is allowed. It’s easy to see that this feature alone is too strict for building anything useful, and seccomp-bpf is an extension to the rescue.

Seccomp BPF extends the seccomp module with the Berkeley Packet Filter (BPF), an embedded instruction set that allows highly customized seccomp rules to be deployed. With BPF, you can create custom logic for system call filtering, including matching and testing individual system call arguments.
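
For example, with libseccomp (the library we’ll use below), a rule can test individual argument values. A hypothetical rule that allows socket(2) only for UNIX-domain sockets might look like this, where ctx is a filter context as created in the next section (assuming <seccomp.h> and <sys/socket.h> are included):

// Allow socket(2) only when argument 0 (the address family) equals AF_UNIX
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(socket), 1,
                 SCMP_A0(SCMP_CMP_EQ, AF_UNIX));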

System call filtering

To ensure full control, we’re using a whitelist of system calls, which means any system call not on the list will be rejected. So we’ll start by creating a new “SecComp filter context” and setting the default action to “reject” - that is, returning “permission denied” whenever a process makes a call that isn’t whitelisted.

scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ERRNO(1));

The SCMP_ACT_ERRNO(1) refers exactly to “respond with EPERM”, which will be hit if no other filters apply.

System call whitelist

We’ll now add each “safe” system call to our filter and set it to “allowed”. To save some time scratching your head examining each system call, we’ll adopt Docker’s syscall whitelist. Each system call will be wrapped in SCMP_SYS so it’s turned into a suitable number used inside SecComp.

We need to add the whole big list of “general” system calls, plus some platform- or scenario-specific ones - namely, two special system calls for the amd64 platform, and a few others for system administration, since we’ve allowed CAP_SYS_ADMIN inside the container.

Use your favorite text-processing toolset to get the big list into a C array that we can loop over, like this:

int allowed_syscalls[] = {
    SCMP_SYS(accept),
    SCMP_SYS(accept4),
    SCMP_SYS(access),
    // Many, many more...
    SCMP_SYS(waitpid),
    SCMP_SYS(write),
    SCMP_SYS(writev),

And then append these special ones we want to include as well:

    // amd64-specific required syscalls
    SCMP_SYS(arch_prctl),
    SCMP_SYS(modify_ldt),

    // CAP_SYS_ADMIN-specific syscalls
    SCMP_SYS(bpf),
    SCMP_SYS(clone),
    SCMP_SYS(fanotify_init),
    SCMP_SYS(lookup_dcookie),
    SCMP_SYS(mount),
    SCMP_SYS(name_to_handle_at),
    SCMP_SYS(perf_event_open),
    SCMP_SYS(quotactl),
    SCMP_SYS(setdomainname),
    SCMP_SYS(sethostname),
    SCMP_SYS(setns),
    SCMP_SYS(syslog),
    SCMP_SYS(umount),
    SCMP_SYS(umount2),
    SCMP_SYS(unshare)
};

Find the number of items included, and save it for easier later use.

size_t allowed_syscalls_len = sizeof(allowed_syscalls) / sizeof(allowed_syscalls[0]);

We can then add each system call to our new SecComp filter as “allowed” with a simple loop:

for (int i = 0; i < allowed_syscalls_len; i++) {
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, allowed_syscalls[i], 0);
}

Loading SecComp filter

After our filter has been constructed, we can load it onto our process for it to take effect.

seccomp_load(ctx);

And finally, release the workspace to avoid memory leaks.

seccomp_release(ctx);

Caveats

Incompatible system calls

As I worked this out in an Ubuntu 18.04 environment, some newer system calls weren’t available in my system headers, like the io_uring-related ones introduced in Linux 5.1. You can safely comment out any of them that your compiler complains about not recognizing. There shouldn’t be too many of them if your environment is up to date, though.

Precautionary checking

As it’s all too common for one of these function calls to fail, I’ve added sanity checks for them. Here’s the complete code for this part.

int filter_syscall(void) {
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ERRNO(1));
    if (ctx == NULL) {
        return -1;
    }
    for (int i = 0; i < allowed_syscalls_len; i++) {
        // libseccomp returns a negative errno value on failure
        if ((errno = -seccomp_rule_add(ctx, SCMP_ACT_ALLOW, allowed_syscalls[i], 0)) > 0) {
            seccomp_release(ctx);
            return -1;
        }
    }
    if ((errno = -seccomp_load(ctx)) > 0) {
        seccomp_release(ctx);
        return -1;
    }
    seccomp_release(ctx);
    return 0;
}

Resource restriction

The last part we’ll visit is restricting container resources. Surely we don’t want a container to overuse system resources like CPU or RAM and make the host system less stable. Linux control groups (cgroups) are designed exactly for this kind of resource constraint, and that’s what we’re going to make use of. There are cgroup subsystems for different aspects of system resources, including CPU, RAM and even disk I/O. Looks pretty neat, right?

Unlike the other parts we’ve built so far, cgroups don’t use system calls for setup and configuration, but a filesystem-based interface instead, like those in /proc or /sys. In fact, the cgroup control interface resides exactly under /sys, at /sys/fs/cgroup. With this interface, we read and write “files” to change configuration values, and create and delete directories to add or remove structures.

There are multiple cgroup “controllers” working on different aspects of system resources, each having a distinct tree structure under /sys/fs/cgroup. So first we’ll examine what cgroup controllers are available:

root@ubuntu:~# ls /sys/fs/cgroup
blkio        cpuacct  freezer  net_cls           perf_event  systemd
cpu          cpuset   hugetlb  net_cls,net_prio  pids        unified
cpu,cpuacct  devices  memory   net_prio          rdma

Here we’re interested in some of them, namely, blkio, cpu, memory and pids.

Let’s first take a look at pids. We’ll create our own subtree to start with:

root@ubuntu:~# mkdir /sys/fs/cgroup/pids/ispawn
root@ubuntu:~# ls /sys/fs/cgroup/pids/ispawn
cgroup.clone_children  notify_on_release  pid.events  tasks
cgroup.procs           pid.current        pid.max

As its name suggests, pid.max controls the maximum number of PIDs in this subtree, so let’s write something to it:

root@ubuntu:~# echo 16 > /sys/fs/cgroup/pids/ispawn/pid.max

To verify that it’s working, let’s make an attempt to exceed the limit. Open another shell and find its PID with echo $$. Write the number you see (the PID of the new shell) to /sys/fs/cgroup/pids/ispawn/cgroup.procs. You can verify that the new process has been added to the subtree by reading that cgroup.procs file back: you’ll see the PID you just wrote.

Now switch to the new shell and try spawning a lot of subprocesses, for example:

for i in {1..20}; do /bin/sleep 10 & done

You’ll see the shell report Operation not permitted five or six times. This means it has hit the PID cap and fails to spawn more processes.

In our C-based container program, we’ll do this in the parent process. The code is intuitively simple.

mkdir("/sys/fs/cgroup/pids/ispawn", 0777);
FILE *fp = fopen("/sys/fs/cgroup/pids/ispawn/pid.max", "w");
fprintf(fp, "%d", 16);
fclose(fp);
FILE *fp = fopen("/sys/fs/cgroup/pids/ispawn/cgroup.procs", "w");
fprintf(fp, "%d", pid); // pid of the child process
fclose(fp);

We can now proceed to setting other limits (a combined code sketch follows this list):

  • To reduce CPU shares, we write to cpu/cpu.shares. Because CPU shares are relative to each other and the system default is usually 1024, setting the value to 256 for our container gives it 1/4 as much CPU as other processes when the system load goes up. (It still gets more CPU when needed and when the system is more idle.)
  • To limit memory usage, we write to memory/memory.limit_in_bytes (for userspace memory) and memory/memory.kmem.limit_in_bytes (for kernel memory).
    • However, this limits only physical memory usage, so when swap is present, memory gets swapped out onto disk when it hits the limit. To completely disable swap for our container, set memory/memory.swappiness to zero.
  • To reduce disk I/O priority, we write to blkio/blkio.weight. This is relative to 100, so writing 50 will reduce its disk I/O priority to half.
  • The last thing to note is that the tree hierarchies are independent among different cgroup controllers, so you have to create the same ispawn directory in each of them, and write cgroup.procs inside each of them.
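
Putting this list into code, here’s a combined sketch. It assumes the key names from my Ubuntu 18.04 environment (see the heads-up below), that pid holds the child’s PID, and a cgroup_write() helper that is purely hypothetical (not part of the repository):

// Hypothetical helper: write a string value to a cgroup key
void cgroup_write(const char *path, const char *value) {
    FILE *fp = fopen(path, "w");
    if (fp != NULL) {
        fputs(value, fp);
        fclose(fp);
    }
}

// Each controller has its own hierarchy, so create our subtree in all of them
const char *controllers[] = {"blkio", "cpu", "memory", "pids"};
char path[128];
for (int i = 0; i < 4; i++) {
    snprintf(path, sizeof(path), "/sys/fs/cgroup/%s/ispawn", controllers[i]);
    mkdir(path, 0755);
}

// Apply the limits discussed above
cgroup_write("/sys/fs/cgroup/cpu/ispawn/cpu.shares", "256");  // 1/4 of the usual 1024 default
cgroup_write("/sys/fs/cgroup/memory/ispawn/memory.limit_in_bytes", "134217728");      // 128 MiB
cgroup_write("/sys/fs/cgroup/memory/ispawn/memory.kmem.limit_in_bytes", "134217728"); // 128 MiB
cgroup_write("/sys/fs/cgroup/memory/ispawn/memory.swappiness", "0");  // disable swapping
cgroup_write("/sys/fs/cgroup/blkio/ispawn/blkio.weight", "50");       // reduced I/O priority

// Finally, move the child into each subtree
char pidstr[16];
snprintf(pidstr, sizeof(pidstr), "%d", pid);
for (int i = 0; i < 4; i++) {
    snprintf(path, sizeof(path), "/sys/fs/cgroup/%s/ispawn/cgroup.procs", controllers[i]);
    cgroup_write(path, pidstr);
}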

Heads up

The course lab at the time was based on Ubuntu 18.04 with Linux kernel 5.3 (18.04 HWE). The cgroup controllers in newer kernels may differ considerably from what’s presented in this article. For example, with Linux 5.4 on Ubuntu 20.04, the keys in the PID cgroup begin with pids. instead of pid., and blkio has a completely different set of available keys. Make sure you examine the cgroup directories before copying and pasting code.

Mounting cgroup controllers inside the container

To enable applications in our container to use cgroup controllers, we must mount them inside. Like how we mounted /sys, /tmp and other filesystems, we check the output of mount to determine how we’re going to call mount(2).

cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)

Everything looks similar to what we’ve just done, except for one thing: there’s no mount flag for pids.

Recall that we skipped the last parameter of mount - now it’s time to pick it back up. Fortunately, it isn’t too complicated: for our use case, we just pass the string "pids" as that parameter, and swap the string for another one to mount a different cgroup controller. You can read the man page for cgroups(7) about this - look for “Mounting v1 controllers”.

To mimic the mount points on our host system, we additionally mount a tmpfs at /sys/fs/cgroup, and remount that mountpoint as read-only after adding the controllers. The final result looks like this:

void mount_cgroup(void) {
    int cgmountflags = MS_NOSUID | MS_NODEV | MS_NOEXEC | MS_RELATIME;
    // Mount a tmpfs first
    mount("none", "sys/fs/cgroup", "tmpfs", cgmountflags, "mode=755");

    // Prepare mount points
    mkdir("sys/fs/cgroup/blkio", 0755);
    mkdir("sys/fs/cgroup/cpu,cpuacct", 0755);
    mkdir("sys/fs/cgroup/memory", 0755);
    mkdir("sys/fs/cgroup/pids", 0755);

    // Mount cgroup subsystems
    mount("cgroup", "sys/fs/cgroup/blkio", "cgroup", cgmountflags, "blkio");
    mount("cgroup", "sys/fs/cgroup/cpu,cpuacct", "cgroup", cgmountflags, "cpu,cpuacct");
    mount("cgroup", "sys/fs/cgroup/memory", "cgroup", cgmountflags, "memory");
    mount("cgroup", "sys/fs/cgroup/pids", "cgroup", cgmountflags, "pids");

    // cpu and cpuacct need symlinks
    symlink("cpu,cpuacct", "sys/fs/cgroup/cpu");
    symlink("cpu,cpuacct", "sys/fs/cgroup/cpuacct");

    // Remount the tmpfs as R/O
    mount(NULL, "sys/fs/cgroup", NULL, MS_REMOUNT | MS_RDONLY | cgmountflags, NULL);
}

A small problem with cgroup namespace

During my experiments, I noticed a strange issue where I could see the host cgroup hierarchies inside my container implementation. It turns out that the cgroup “root” inside a cgroup namespace is the subtree the process belongs to at the moment the cgroup namespace is created. Once the namespace is created, its root is determined and fixed, even if the “root” process is moved into another subtree later.

This means the child process must be “moved” to the desired cgroup subtree before the cgroup namespace is isolated. This leaves us with two options:

  1. The parent process moves itself to the target cgroup subtree before calling clone() with CLONE_NEWCGROUP
  2. The parent process calls clone() without CLONE_NEWCGROUP, moves the child process to the target cgroup subtree, and then tells the child process to isolate the cgroup namespace.

It should be noted that with the second option, some kind of “syncing” is needed to keep the child process from racing ahead and isolating its cgroup namespace before the parent finishes its job. It’s easy to come up with a solution that just works: we can create a pipe between the processes, through which the parent tells the child when it’s ready.

With this in mind, the second option is actually easier to implement, since there’s another system call for isolating namespaces in place (i.e. without creating a new process) that we put aside earlier: unshare(2). It’s simple to use, too - just call unshare(CLONE_NEWCGROUP) when ready.
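
Here’s a minimal sketch of that synchronization. The wiring and names are illustrative, not taken verbatim from the repository:

int sync_pipe[2]; // create with pipe(sync_pipe) before calling clone()

int child(void *arg) {
    char buf;
    // Block until the parent signals that the cgroup setup is done
    read(sync_pipe[0], &buf, 1);
    // Only now is it safe to isolate the cgroup namespace in-place
    unshare(CLONE_NEWCGROUP);
    // ... continue with the rest of the container setup ...
    return 0;
}

// In the parent, after clone() (without CLONE_NEWCGROUP) and after writing
// the child's PID to the ispawn cgroup.procs files:
//     write(sync_pipe[1], "X", 1);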

To verify that this issue is handled correctly, check the content in /proc/1/cgroup. The correct result should look like this, where every line ends with a single /:

12:cpuset:/
11:rdma:/
10:blkio:/
9:pids:/
8:devices:/
7:net_cls,net_prio:/
6:memory:/
5:hugetlb:/
4:perf_event:/
3:cpu,cpuacct:/
2:freezer:/
1:name=systemd:/
0::/

With an incorrectly written container, certain lines may have an unexpected value, generally starting with /../, for example:

12:cpuset:/
11:rdma:/
10:blkio:/../user.slice
9:pids:/../user.slice/user-0.slice/
8:devices:/
7:net_cls,net_prio:/
6:memory:/../user.slice/user-0.slice/
5:hugetlb:/
4:perf_event:/
3:cpu,cpuacct:/../user.slice
2:freezer:/
1:name=systemd:/
0::/

As explained above, these paths are “paths to the cgroup location of PID 1, relative to the ‘root’ of the cgroup namespace”. When done properly, PID 1 should be at the root of every cgroup hierarchy.

Don’t be surprised to see such inconsistency among the lines of /proc/1/cgroup - a process can be at different locations in different cgroup controllers.

Conclusion

At this point, we’ve gone through all the technologies required for a functional and secure Linux container, although our “container” isn’t necessarily either. It would be hard work to examine and patch every loophole for maximum security, should you wish to, but the fundamentals have been covered already, so there won’t be anything fundamentally new.

There are two namespaces we skipped in the beginning (three if you count CLONE_NEWTIME). They are slightly more complicated to set up and aren’t strictly necessary for a container - by default, Docker doesn’t use user namespaces, and systemd-nspawn doesn’t use network namespaces.

There’s also more to consider if you want multiple containers to run simultaneously. One notable thing is that each should have a separate cgroup subtree. Avoiding mount point conflicts under race conditions is another thing to take into account.

Should you want a ready-to-use example to play with, here’s the complete code that I wrote, with some bells and whistles added: iBug/iSpawn. Keep in mind that it was written for Ubuntu 18.04 and things may have changed drastically since, so it may not work on your system.
