This article is an overview of how you can manage Docker-like containers using systemd.
As you may know, the famous container manager Docker is not the only container technology. It was originally based on LXC, and although it embeds additional functionality, it mostly provides a high-level API for container management. It was made possible by a set of fairly recent kernel features such as namespaces, control groups and union filesystems.
As I was reading the excellent series of Lennart Poettering's articles about the use of systemd for administrators, I found out that systemd is actually able to handle its own containers.
This is not so surprising, since systemd is close to the kernel and provides an abstraction over many of its features. Still, I did not know it was so easy to use!
In this article I present the basics of systemd-nspawn, the enhanced chroot that systemd brought to us to build containers.
First, we use debootstrap to pull a minimal Debian onto our filesystem. This command is available on most Linux distributions (well, I checked for yaourt, using AUR).
debootstrap --arch=amd64 unstable debian-tree/
It downloads the Debian unstable files into the local debian-tree/ directory.
NB: I use Debian as an example, but using pacstrap, you can install Arch just as easily.
We can now enter this tree with systemd-nspawn (possibly as root), pointing it at the directory.
NB: You can use -D instead of --directory.
It performs a kind of chroot and opens a shell inside the hosted system. It not only isolates the local filesystem from its host, but also abstracts its interfaces. It is much more powerful than a simple chroot.
NB: By default /bin/sh is called inside the container, but you can specify the command you want systemd to run in the container as an extra argument. In that case, arguments specified after the command are forwarded to it and not processed by systemd-nspawn.
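As a rough illustration of the argument forwarding described above, the sketch below uses a hypothetical helper, nspawn_cmd, that only prints the command line instead of running it (starting a container usually needs root). The helper name and the ls example are assumptions made for this illustration, not part of systemd.

```shell
#!/bin/sh
# nspawn_cmd is a hypothetical dry-run helper (not part of systemd):
# it prints the systemd-nspawn invocation instead of executing it.
nspawn_cmd() {
    tree=$1
    shift
    # Everything after the directory is the command to run in the container,
    # followed by that command's own arguments.
    echo "systemd-nspawn -D $tree $*"
}

# Run "ls -l /etc" inside the container instead of the default shell.
nspawn_cmd debian-tree/ /bin/ls -l /etc
```

Here -l and /etc belong to ls, not to systemd-nspawn. To actually run the command, paste the printed line into a root shell.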
But so far we have only mounted the Debian filesystem, and nothing more. If you call ps, you will not see any process but the shell (even if you are root).
root@debian-tree:~# ps
  PID TTY          TIME CMD
    1 ?        00:00:00 bash
    6 ?        00:00:00 ps
The Debian OS has not really started. Before we start it, remember to set a password for your root user, since you will need it later:
root@debian-tree:~# passwd
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully
To actually boot the system, use the --boot option (or -b). Alternatively, you can ask systemd-nspawn to run the init binary directly:
systemd-nspawn -D debian-tree/ --boot
# or
systemd-nspawn -D debian-tree/ /sbin/init
You should see boot messages, with lines beginning with [ OK ]. Also, if you are hosting a distribution that itself relies on systemd, e.g. a recent release of Debian, you will see some interesting messages early in the boot:
Spawning container debian-tree on /your/current/path/debian-tree.
Press ^] three times within 1s to kill container.
systemd 215 running in system mode. (+PAM +AUDIT +SELINUX +IMA +SYSVINIT +LIBCRYPTSETUP +GCRYPT +ACL +XZ -SECCOMP -APPARMOR)
Detected virtualization 'systemd-nspawn'.
Detected architecture 'x86-64'.
[...]
Notice that the hosted systemd detects that it has been run inside a systemd-nspawn container! You can use systemd-detect-virt at any time to get this information.
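A minimal sketch of how a script might use this detection. The wrapper name container_kind and the "unknown" fallback are assumptions made for this example. Inside the container booted above, it should print systemd-nspawn, matching the boot message.

```shell
#!/bin/sh
# container_kind is a hypothetical wrapper name.
# systemd-detect-virt --container prints the detected container technology,
# or "none" (with a non-zero exit status) when there is none.
# The "unknown" fallback covers systems where the tool is missing.
container_kind() {
    kind=$(systemd-detect-virt --container 2>/dev/null) || true
    echo "${kind:-unknown}"
}

container_kind
```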
Also note that you can escape the container (and thereby kill it) by pressing ^] (Control key plus ]) three times quickly, which is useful, especially if you forgot to set a root password.
Docker provides a nice API to manage your containers. While systemd is not as exhaustive, it also has an interface for this, through the machinectl command (like "machine control").
$ machinectl
MACHINE      CONTAINER  SERVICE
debian-tree  container  nspawn

1 machines listed.
You can get more information about one particular container using machinectl status <machine>:
$ machinectl status debian-tree
debian-tree
           Since: mar. 2015-04-07 08:46:37 CEST; 55min ago
          Leader: 24553 (systemd)
         Service: nspawn; class container
            Root: /home/system/local/debian-tree
         Address: 192.168.1.108
                  fe80::2ae3:47ff:fe04:f134
              OS: Debian GNU/Linux 8 (jessie)
            Unit: machine-debian\x2dtree.scope
                  ├─24553 /lib/systemd/systemd
                  └─system.slice
                    ├─cron.service
                    │ └─24618 /usr/sbin/cron -f
                    ├─systemd-journald.service
                    │ └─24571 /lib/systemd/systemd-journald
                    ├─console-getty.service
                    │ ├─14910 man machinectl
                    │ ├─14921 pager -s
                    │ ├─24625 /bin/login --
                    │ └─24656 -bash
                    └─rsyslog.service
                      └─24620 /usr/sbin/rsyslogd -n
NB: If you want to process the output of machinectl status, please consider using machinectl show instead. It has been designed with this goal in mind.
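To sketch why machinectl show is easier to process: it emits one KEY=VALUE pair per line, so a single property can be extracted with standard tools. The helper name machine_prop is hypothetical, and the sample input below is illustrative (built from the values in the status output above), not captured output.

```shell
#!/bin/sh
# machine_prop is a hypothetical helper: it reads KEY=VALUE lines on stdin
# and prints the value of the requested key. The key is inserted into a
# sed pattern as-is, so it is only meant for simple alphanumeric names.
machine_prop() {
    sed -n "s/^$1=//p"
}

# Real usage would be: machinectl show debian-tree | machine_prop Leader
# The sample below is illustrative input, not captured output.
printf 'Name=debian-tree\nClass=container\nService=nspawn\nLeader=24553\n' |
    machine_prop Leader
```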
If the container's content supports it, you can use machinectl poweroff or machinectl login, for example. In any case, you can use machinectl terminate to simply kill the whole container, whatever is running inside it.
It is well known that LXC containers are still not secure enough to be used as absolute jails. This is due to possible exploits of kernel weaknesses, which exist because container technologies were introduced only recently. If you need full isolation, consider using a solution such as KVM.
And as you can see using systemd-cgtop (systemd's real-time cgroup visualizer), some information about the host system leaks into the guest machine:
On the guest
Path                                          Tasks   %CPU  Memory  Input/s  Output/s
/                                                 -  144.6    2.3G       0B     94.5K
/machine.slice                                    -    1.5   30.9M        -         -
/machine.sli...e-debian\x2dtree.scope             7    1.5   12.3M        -         -
/system.slice                                     -      -   71.0M        -         -
On the host
Path                                          Tasks   %CPU  Memory  Input/s  Output/s
/                                                 -  144.6    2.3G       0B     94.5K
/machine.slice                                    -    1.5   30.9M        -         -
/machine.sli...e-debian\x2dtree.scope             7    1.5   12.3M        -         -
/system.slice                                     -      -   71.0M        -         -
/system.slice/NetworkManager.service              2      -       -        -         -
/system.slice/accounts-daemon.service             1      -       -        -         -
/system.slice/avahi-daemon.service                2      -       -        -         -
/system.slice/bluetooth.service                   1      -       -        -         -
/system.slice/colord.service                      1      -       -        -         -
/system.slice/dbus.service                        1      -       -        -         -
/system.slice/gdm.service                         2      -       -        -         -
/system.slice/geoclue.service                     1      -       -        -         -
/system.slice/httpd.service                       7      -       -        -         -
/system.slice/ipython-notebook.service            1      -       -        -         -
/system.slice/itorch-notebook.service             2      -       -        -         -
/system.slice/polkit.service                      1      -       -        -         -
/system.slice/postgresql.service                  6      -       -        -         -
/system.slice/redis.service                       1      -       -        -         -
/system.slice/rtkit-daemon.service                1      -       -        -         -
/system.slic...firstname.lastname@example.org     1      -       -        -         -
For instance, from the contained system you can see the resources used by the host! You can also see all the other machines that have been started, since by default they are in the same cgroup slice, namely machine.slice.
Another important point is that by default the network interface is not virtualized: the container accesses the same interfaces as the host (try ip link). The good news is that we can easily isolate it, and we can even add bridges between host and guest interfaces.
If you do not need any network connection, you can completely isolate the container from its host's network using the --private-network option:
systemd-nspawn -bD debian-tree/ --private-network
If you look at the available network interfaces, you will see only a loopback:
guest$ ip link
1: lo: mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
This loopback is itself isolated from the host's loopback interface, so there is no communication at all between host and guest (subject to the security limitations mentioned earlier, of course).
That said, a container commonly needs to listen on some port. For instance, it may host a web server or another socket-based service. So we need to be able to expose some interface to the container.
The simplest solution is to provide the container with a whole network interface.
systemd-nspawn -bD debian-tree/ --network-interface=eth0
Now you can see eth0 from within the container. The main drawback is that this interface is no longer accessible from the host: it is moved from the host namespace to the guest namespace.
guest$ ip link
1: lo: mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether **:**:**:**:**:** brd ff:ff:ff:ff:ff:ff
Note that the host loopback cannot be moved to the guest. Also note that you can add further --network-interface arguments to pass several network interfaces.
NB: As soon as you pass a network interface to the container, all the other interfaces are automatically hidden, as if --private-network had been specified.
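As a small sketch of passing several interfaces, the hypothetical helper below (netif_opts, a name invented for this example) builds one --network-interface option per interface name. The second interface, eth1, is an assumption, and the helper only prints the options.

```shell
#!/bin/sh
# netif_opts is a hypothetical helper: it prints one --network-interface
# option for each interface name given as argument.
netif_opts() {
    opts=""
    for iface in "$@"; do
        opts="$opts --network-interface=$iface"
    done
    # Strip the leading space before printing
    printf '%s\n' "${opts# }"
}

netif_opts eth0 eth1
```

The result could then be used as: systemd-nspawn -bD debian-tree/ $(netif_opts eth0 eth1). Keep in mind that both interfaces would disappear from the host while the container runs.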
You can then create bridged or VLAN interfaces on the host and provide them to the container. Fortunately, systemd-nspawn comes with some options that simplify this process.
You can use --network-veth to create a virtual Ethernet link between host and container.
systemd-nspawn -bD debian-tree/ --network-veth
You will see a new interface on the host machine, named ve-debian-tree (like "virtual ethernet to debian-tree container"), and the container will be provided with a host0 interface.
In order to communicate, you must then add an IP address to each of these new interfaces. For instance, you could use addresses from the reserved 10.0.0.0/8 private block:
# Add address 10.0.0.1 to the ve-debian-tree interface of the host system
host$ ip addr add 10.0.0.1/24 broadcast 10.0.0.255 dev ve-debian-tree
host$ ip link set dev ve-debian-tree up

# Add address 10.0.0.2 to the host0 interface of the guest system
guest$ ip addr add 10.0.0.2/24 broadcast 10.0.0.255 dev host0
guest$ ip link set dev host0 up
You can then check your setup using ping 10.0.0.1 and ping 10.0.0.2 on guest and host respectively.
Now that you can communicate between host and guest, you may want to forward some host external ports toward guest system and block the other ones using for instance
Unlike Docker containers, a systemd-nspawn container is persistent (modifications are not forgotten at reboot), although it can be made non-persistent using --volatile=yes. Either way, it can be useful to bind a host directory to a guest directory.
You can achieve this with --bind:
# Boot the container and mount the host's /home into it
systemd-nspawn -bD debian-tree --bind=/home
Note that when the host and guest paths are identical, you can omit the second one.
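When the two paths differ, --bind takes them as host-path:guest-path. The sketch below uses a hypothetical helper, bind_opt, to show both forms. The /srv/www and /var/www paths are assumptions chosen for the example.

```shell
#!/bin/sh
# bind_opt is a hypothetical helper that builds the --bind argument.
# $1 = host path, $2 = guest path (optional when identical to the host path)
bind_opt() {
    if [ -n "${2:-}" ] && [ "$2" != "$1" ]; then
        printf '%s\n' "--bind=$1:$2"
    else
        printf '%s\n' "--bind=$1"
    fi
}

bind_opt /home                # same path on both sides
bind_opt /srv/www /var/www    # host /srv/www appears as /var/www in the guest
```

It could be combined with the boot command as: systemd-nspawn -bD debian-tree $(bind_opt /srv/www /var/www).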
You can also mount it as read-only, using
--bind-ro. Alternatively, you can use
I will end this overview by talking about images, although reading the man page of systemd-nspawn would show you a lot more settings.
I have not investigated images much, but there seems to be complete image management available.
systemd-nspawn can take a container image as argument instead of a filesystem directory, using --image (or -i). These images can be for instance simple tarballs, but also Docker images, and machinectl can help you pull images with commands such as
I still have to look further into how to use all this power, but it seems that systemd provides us with a simple yet powerful container solution that can be deeply integrated with its other units.
Article published on 7 April 2015 by