A docker-like container management using systemd

April 07, 2015

This article is an overview of how you can manage docker-like containers using systemd.

As you may know, the famous container manager docker is not the only container technology. It was originally based on LXC and, although it embeds additional functionality, it mostly provides a high-level API for container management. All of this has been made possible by a set of fairly recent kernel features such as namespaces, control groups and union filesystems.

As I was reading Lennart Poettering's excellent series of articles on systemd for administrators, I found out that systemd is actually able to handle its own containers.

It is not so surprising actually, since systemd is close to the kernel and so provides an abstraction over a lot of its features. But anyway, I did not know it was so easy to use!

I present in this article the basics of systemd-nspawn, the enhanced chroot that systemd brought to us to build containers.

Our first container

Let's start with a practical example, directly inspired by one of Poettering's articles. We are going to run a simple Debian inside the most basic container.

Set up files

We use debootstrap to pull a minimal Debian onto our filesystem. This command is available on most Linux distributions (well, I checked for apt, yum and yaourt, using AUR).

debootstrap --arch=amd64 unstable debian-tree/

It downloads the Debian unstable files into the local debian-tree directory.

NB: I use Debian as an example, but using yum or pacstrap, you can install Fedora or Arch respectively just as easily.

Run container

Just run systemd-nspawn (as root if necessary):

systemd-nspawn --directory=debian-tree/

NB: You can use -D instead of --directory=

It performs a kind of chroot and opens a shell inside the hosted system. It not only isolates the local filesystem from its host, but also abstracts its interfaces. It is much more powerful than a simple chroot.

NB: By default /bin/sh is called inside the container, but you can specify the command you want systemd to run in the container as an extra argument. In that case, arguments specified after the command are forwarded to it and not interpreted by systemd-nspawn.

But actually we have only mounted the Debian filesystem, and nothing more. If you call ps, you will not see any process but the shell (even if you are root).

root@debian-tree:~# ps
  PID TTY          TIME CMD
    1 ?        00:00:00 bash
    6 ?        00:00:00 ps

The Debian OS has not really started. Before we start it, remember to set a password for the root user; you will need it later:

root@debian-tree:~# passwd
Enter new UNIX password: 
Retype new UNIX password: 
passwd: password updated successfully

Booting container

Call systemd-nspawn with the --boot option (or -b). Alternatively, you can ask systemd-nspawn to run /sbin/init:

systemd-nspawn -D debian-tree/ --boot
# or
systemd-nspawn -D debian-tree/ /sbin/init

You should see boot messages, with lines beginning with [ OK ]. Also, if you are hosting a distribution that itself relies on systemd, e.g. a recent release of Debian, you will see some interesting messages early in the boot:

Spawning container debian-tree on /your/current/path/debian-tree.
Press ^] three times within 1s to kill container.
systemd 215 running in system mode. (+PAM +AUDIT +SELINUX +IMA +SYSVINIT +LIBCRYPTSETUP +GCRYPT +ACL +XZ -SECCOMP -APPARMOR)
Detected virtualization 'systemd-nspawn'.
Detected architecture 'x86-64'.

[...]

You can see that the hosted systemd detects that it has been run inside a systemd-nspawn container! You can use systemd-detect-virt at any time to get this information.

Also note that you can escape the container (and so kill it) by pressing ^] (Control key plus ]) three times quickly, which can be useful, especially if you forgot to set any root password.
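The systemd-detect-virt check mentioned above can also be used from scripts. Here is a minimal sketch, assuming the tool prints a short identifier such as systemd-nspawn, or none when nothing is detected. The function name describe_virt is made up for this example.

```shell
# Map the identifier printed by systemd-detect-virt to a short description.
# describe_virt is an illustrative helper, not part of systemd.
describe_virt() {
    case "$1" in
        systemd-nspawn) echo "inside a systemd-nspawn container" ;;
        none|"")        echo "no virtualization detected" ;;
        *)              echo "other virtualization: $1" ;;
    esac
}

# Typical use, inside the guest or on the host:
#   describe_virt "$(systemd-detect-virt)"
```

Keeping the mapping separate from the call to systemd-detect-virt makes the helper usable on machines where the tool is not installed.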

Managing containers

Docker provides a nice API to manage your containers. While systemd is not as exhaustive, it also has an interface for this, through the machinectl command (as in "machine control").

$ machinectl
MACHINE                          CONTAINER SERVICE         
debian-tree                      container nspawn          

1 machines listed.

You can get more information about one particular container using machinectl status <machine>:

$ machinectl status debian-tree
debian-tree
           Since: mar. 2015-04-07 08:46:37 CEST; 55min ago
          Leader: 24553 (systemd)
         Service: nspawn; class container
            Root: /home/system/local/debian-tree
         Address: 192.168.1.108
                  fe80::2ae3:47ff:fe04:f134
              OS: Debian GNU/Linux 8 (jessie)
            Unit: machine-debian\x2dtree.scope
                  ├─24553 /lib/systemd/systemd
                  └─system.slice
                    ├─cron.service
                    │ └─24618 /usr/sbin/cron -f
                    ├─systemd-journald.service
                    │ └─24571 /lib/systemd/systemd-journald
                    ├─console-getty.service
                    │ ├─14910 man machinectl
                    │ ├─14921 pager -s
                    │ ├─24625 /bin/login --
                    │ └─24656 -bash
                    └─rsyslog.service
                      └─24620 /usr/sbin/rsyslogd -n

NB: If you want to process machinectl status output, please consider using machinectl show instead. It has been designed with this goal in mind.
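As a rough sketch of such processing, assuming machinectl show prints one KEY=VALUE pair per line, a small helper can extract a single property. The name machine_prop and the sample properties in the usage line are only illustrative.

```shell
# Print the value of one property from KEY=VALUE lines read on stdin.
# machine_prop is a hypothetical helper name.
machine_prop() {
    # $1: property name; only the first matching line is printed
    awk -F= -v key="$1" '$1 == key { sub(/^[^=]*=/, ""); print; exit }'
}

# Typical use on the host:
#   machinectl show debian-tree | machine_prop Leader
```

The value is taken as everything after the first = sign, so values that themselves contain = are kept whole.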

If the container's content supports it, you can use machinectl reboot, machinectl poweroff or machinectl login for example. You can still use machinectl terminate to simply kill the whole container, whatever is running inside it.

About security

It is well known that LXC containers are still not secure enough to be used as absolute jails. This is due to possible exploits of kernel weaknesses, which exist because container technologies were introduced only recently. If you need full isolation, consider using a solution such as KVM.

And as you can see using systemd-cgtop (systemd's real-time cgroup visualizer), some information about the host system leaks into the guest machine:

On the guest

Path                                     Tasks   %CPU   Memory  Input/s Output/s

/                                            -  144.6     2.3G       0B    94.5K
/machine.slice                               -    1.5    30.9M        -        -
/machine.sli...e-debian\x2dtree.scope        7    1.5    12.3M        -        -
/system.slice                                -      -    71.0M        -        -

On the host

Path                                     Tasks   %CPU   Memory  Input/s Output/s

/                                            -  144.6     2.3G       0B    94.5K
/machine.slice                               -    1.5    30.9M        -        -
/machine.sli...e-debian\x2dtree.scope        7    1.5    12.3M        -        -
/system.slice                                -      -    71.0M        -        -
/system.slice/NetworkManager.service         2      -        -        -        -
/system.slice/accounts-daemon.service        1      -        -        -        -
/system.slice/avahi-daemon.service           2      -        -        -        -
/system.slice/bluetooth.service              1      -        -        -        -
/system.slice/colord.service                 1      -        -        -        -
/system.slice/dbus.service                   1      -        -        -        -
/system.slice/gdm.service                    2      -        -        -        -
/system.slice/geoclue.service                1      -        -        -        -
/system.slice/httpd.service                  7      -        -        -        -
/system.slice/ipython-notebook.service       1      -        -        -        -
/system.slice/itorch-notebook.service        2      -        -        -        -
/system.slice/polkit.service                 1      -        -        -        -
/system.slice/postgresql.service             6      -        -        -        -
/system.slice/redis.service                  1      -        -        -        -
/system.slice/rtkit-daemon.service           1      -        -        -        -
/system.slic...lice/getty@tty2.service       1      -        -        -        -

For instance, from the contained system you can see the resources used by the host! You can also see all the other machines that have been started, since by default they are in the same cgroup slice, namely /machine.slice.

Another important point is that by default the network interface is not virtualized: the container accesses the same interfaces as the host (try ip link). The good news is that we can easily isolate it, and we can even add bridges between host and guest interfaces.

Network isolation

Complete isolation

If you do not need any network connection, you can completely isolate the container from its host's network using the --private-network option:

systemd-nspawn -bD debian-tree/ --private-network

If you look at the available network interfaces, you will see only a loopback:

guest$ ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

And actually, this loopback itself is isolated from the host's loopback interface, so there is absolutely no communication between host and guest (subject to the previously mentioned security limitations, of course).
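To check this isolation from a script rather than by eye, one option is to list the interface names seen by the guest. The sketch below assumes the one-line-per-interface format of ip -o link. The helper name iface_names is invented for this example.

```shell
# List interface names from `ip -o link` output read on stdin.
# iface_names is an illustrative helper, not a standard command.
iface_names() {
    # Lines look like "1: lo: <LOOPBACK,...> ..."; the name is the second field.
    # A peer suffix such as "@if5" is stripped from the name.
    awk -F': ' '{ name = $2; sub(/@.*/, "", name); print name }'
}

# Typical use inside the guest:
#   ip -o link | iface_names
```

With --private-network and no other network option, the only name printed should be lo.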

Adding bridges

Anyway, a container commonly needs to listen on some port. For instance, it may host a web server or some other socket-served service. So we need to be able to expose some interface to the container.

Providing Internet interface

The simplest solution is to provide the container with a whole network interface.

systemd-nspawn -bD debian-tree/ --network-interface=eth0

Now you can see eth0 from within the container. But the main drawback is that this interface is no longer accessible from the host: it is moved from the host namespace to the guest namespace.

guest$ ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether **:**:**:**:**:** brd ff:ff:ff:ff:ff:ff

Note that the host loopback cannot be moved to the guest. Also note that you can add further --network-interface arguments to use several network interfaces.

NB: As soon as you specify a network interface to share, the other ones are automatically hidden, as if --private-network had been specified.

You can then create bridged or VLAN interfaces on the host and provide them to the container. Fortunately, systemd-nspawn comes with some options that simplify this process.

Virtual Ethernet link

You can use --network-veth to create a virtual Ethernet link between host and container.

systemd-nspawn -bD debian-tree/ --network-veth

You will see a new interface on the host machine, named ve-debian-tree (as in "virtual Ethernet to the debian-tree container"), and the container will be given a host0 interface.

In order to communicate, you must then add an IP address to each of these new interfaces. For instance, you could use addresses from the private 10.0.0.0/8 block:

# Add address 10.0.0.1 to ve-debian-tree interface of host system
host$ ip addr add 10.0.0.1/24 broadcast 10.0.0.255 dev ve-debian-tree
host$ ip link set dev ve-debian-tree up

# Add address 10.0.0.2 to host0 interface of guest system
guest$ ip addr add 10.0.0.2/24 broadcast 10.0.0.255 dev host0
guest$ ip link set dev host0 up

You can then check your setup using ping 10.0.0.1 on the guest and ping 10.0.0.2 on the host.
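Since the same two commands are needed on each side of the link, they can be generated from the interface name and address. This is only a sketch: veth_setup_cmds is a made-up name, it assumes a /24 network as in the example above, and it prints the commands instead of running them so that you can review them first.

```shell
# Print the ip commands that give one end of the veth pair a /24 address.
# veth_setup_cmds is a hypothetical helper; it only prints, it changes nothing.
# $1: interface name, $2: IPv4 address
veth_setup_cmds() {
    iface=$1
    addr=$2
    net=${addr%.*}    # drop the last octet: 10.0.0.1 -> 10.0.0
    printf 'ip addr add %s/24 broadcast %s.255 dev %s\n' "$addr" "$net" "$iface"
    printf 'ip link set dev %s up\n' "$iface"
}

# Typical use, as root:
#   veth_setup_cmds ve-debian-tree 10.0.0.1 | sh    # on the host
#   veth_setup_cmds host0 10.0.0.2 | sh             # in the guest
```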

Now that host and guest can communicate, you may want to forward some external ports of the host to the guest system and block the others, using iptables for instance.

Mount host directory within guest system

Unlike docker containers, a systemd-nspawn container is persistent by default (modifications are not forgotten at reboot), although it can be made non-persistent using --volatile=yes. Either way, it can be useful to bind a host directory to a guest directory.

You can achieve this with --bind=/my/host/directory:/my/guest/directory.

# Boot container and mount host's /home into it
systemd-nspawn -bD debian-tree --bind=/home

Note that when the host and guest paths are identical, you can omit the second one.

You can also mount it as read-only, using --bind-ro. Alternatively, you can use machinectl bind.
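The rules above can be summed up in a small helper that builds the option for you. It is a sketch that assumes the colon-separated host:guest form described in the systemd-nspawn man page. The name bind_arg and its "ro" flag are invented for this example.

```shell
# Build a --bind or --bind-ro argument for systemd-nspawn.
# bind_arg is a hypothetical helper name.
# $1: host path, $2: guest path (optional), $3: "ro" for read-only (optional)
bind_arg() {
    host=$1
    guest=${2:-$1}
    opt=--bind
    if [ "${3:-}" = ro ]; then
        opt=--bind-ro
    fi
    if [ "$host" = "$guest" ]; then
        # Identical paths: the guest path can be omitted.
        printf '%s=%s\n' "$opt" "$host"
    else
        printf '%s=%s:%s\n' "$opt" "$host" "$guest"
    fi
}

# Typical use:
#   systemd-nspawn -bD debian-tree "$(bind_arg /srv/www /var/www ro)"
```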

Dealing with images

I will end this overview by talking about images, although reading the man page of systemd-nspawn would show you a lot more settings.

I have not investigated images much, but there seems to be complete image management available.

systemd-nspawn can take a container image as argument instead of a filesystem directory, using --image (or -i). These images can be, for instance, simple tarballs, but also docker images, and machinectl can help you pull images with commands such as machinectl pull-tar.

So I still have to look further into how I can use this power, but it seems that systemd provides us with a simple but powerful container solution that can be deeply integrated with its other units.