monochromatic blog:

commit ed6b12aac58c951e9924bdd1325c2fc9ea68a0b3
parent 076c73eb2cf52b5b1fdac70165a64c1566c4b053
Author: z3bra <willyatmailoodotorg>
Date:   Thu, 24 Mar 2016 22:08:53 +0000

Finish & publish container blogpost

2016/03/hand-crafted-containers.txt | 248+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----
 | 2+-
css/monochrome.css | 22++++------------------
index.txt | 1+
4 files changed, 242 insertions(+), 31 deletions(-)

diff --git a/2016/03/hand-crafted-containers.txt b/2016/03/hand-crafted-containers.txt
@@ -1,7 +1,16 @@
-# [Hand-made containers](#)
+# [Hand-crafted containers](#)
 ## &mdash; 18 March, 2016

-### 0. intro
+### tl;dr
+
+    # CTNAME=blah
+    # mkdir -p /ns/$CTNAME/bin /ns/$CTNAME/lib
+    # ldd /bin/echo | grep '/' | cut -d'>' -f2 | awk '{print $1}' | xargs -I% cp % /ns/$CTNAME/lib/
+    # cp /bin/echo /ns/$CTNAME/bin/
+    # ip netns add $CTNAME
+    # ip netns exec $CTNAME unshare -fpium --mount-proc env -i container=handcraft chroot /ns/$CTNAME /bin/echo 'Hello, world!'
+
+### 0. Intro

 Containers are the latest trend, for a good reason: they leave room for new
 ideas in terms of security, flexibility, performance and much more.
@@ -23,7 +32,7 @@
 an application (a complex one). In this regard, there is only a single type of
 containers. We can now focus on what's really important: how do they work?

-### 1. namespaces
+### 1. Namespaces

 That's a keyword, so let's ask our internet god what it means:

@@ -37,7 +46,7 @@
 to a process. When a namespace is created for a process, all its children will
 be created within this namespace, and inherit the "limitations" of the parent.

-#### mount
+#### Mount
 The process will be able to mount and unmount filesystems without affecting the
 rest of the system. For example, if you unmount a partition within the
 namespace, all the processes within it will see it as unmounted, while it
@@ -52,7 +61,7 @@
 This namespace concerns shared memory, System V message queues and semaphores.
 Processes in the namespace will be unable to communicate with the host's
 processes this way.

-#### network
+#### Network
 Processes will have their own network stack. This includes the routing table,
 firewall rules, sockets, and so on.
@@ -60,16 +69,231 @@
 Processes' IDs will get a different mapping than they have on the host. They
 will get renumbered, starting from 1.

-#### user
+#### User
 The namespaces will have their own set of user and group IDs.

-### 2. making containers
+### 2. Making containers

 Now that we know what containers are and how they work, it's time to make
-some!
+one!
+For the purpose of this article, we will try and build the simplest container
+capable of printing "Hello, world!".
+
+Here is the program:
+
+    $ more <<EOF> hello.c
+    #include <unistd.h>
+    int
+    main(int argc, char **argv)
+    {
+        write(1, "Hello, world!\n", 14);
+        return 0;
+    }
+    EOF
+    $ cc hello.c -o hello
+
+#### 2.0 `chroot(1)`
+This one is an old tool that will run a command or spawn an interactive
+shell after changing the root directory.
+It is used to isolate a process, or group of processes, from the host's
+filesystem tree. This has long been used for security purposes
+(see [chroot jail](, but escaping from
+chroot is rather easy for someone with root (UID 0) access.
+This is why `chroot` alone cannot be considered secure, but coupled with user
+namespace and privilege dropping, one can turn a chroot into a real jail.
+
+Back to the topic. Let's copy our `hello` binary into the chroot, and try to
+run it:
+
+    $ mkdir rootfs
+    $ cp ./hello ./rootfs/hello
+    # chroot ./rootfs ./hello
+    chroot: failed to run command "./hello": No such file or directory
+
+This is the worst error message you can get. Of course `./hello` exists!
+We just copied it. But what does this error mean then? Let's take a closer
+look at this binary:
+
+    $ file ./hello
+    ./hello: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib/, for GNU/Linux 3.12.0, not stripped
+
+The output may differ slightly depending on your system, but the important
+part here is the following:
+
+> dynamically linked, interpreter /lib/
+
+Dynamically linked binaries cannot be run on their own. Long story short,
+`/lib/` is a program that is implicitly called to run all
+the dynamic binaries on a Linux system; it's called the
+[linker]( So in order to have a
+binary run in the chroot, you need to copy over the linker AND all the libraries
+your binary links to. To get a list of these libraries, use the `ldd` command:
+
+    $ ldd hello
+    (0x00007ffd3e7dc000)
+    => /lib/ (0x00007fdc1a482000)
+    /lib/ (0x00007fdc1a82a000)
+
+You can ignore the [`vdso`](
+line as it's handled by the C library.
+Our `hello` binary depends on two files: `/lib/`, the linker,
+and `/lib/`, the C library (containing system calls like `write(2)`).
+
+In order to run our `hello` program, we'll have to copy them over in place. After
+that, our program should run totally fine:
+
+    $ mkdir -p rootfs/lib
+    $ cp /lib/ /lib/ ./rootfs/lib
+    # chroot ./rootfs ./hello
+    Hello, world!
+
+TADAAAA!! That was easy, right?
+Another option is to simply compile our program *statically*. It means that all the
+needed objects from libraries will be compiled into the program, removing the need
+for a linker and libc in the chroot:
+
+    $ mkdir rootfs
+    $ cc hello.c -o hello -static -s
+    $ cp hello ./rootfs
+    # chroot ./rootfs ./hello
+    Hello, world!
+
+Let's take a look at the size of this "container". For scale, the
+"[Smallest possible docker container]("
+weighs 3.6MiB...
+
+    $ du -sh rootfs
+    720K rootfs
+
+That's most likely the lightest container you've seen, right?
+
+#### 2.1 env
+To isolate our process from the host, we'll have to clear the environment
+of all its variables, to make sure the container won't know anything about its
+host. We can do this with the `env` command:
+
+    $ export FOO="bar"
+    $ env -i /bin/sh
+    $ env # we are now in a subshell
+    PWD=/home/z3bra
+
+You can see that the subprocess doesn't have the `$FOO` variable in its
+environment, even though it has been exported earlier.
+You can set the environment by passing variables AFTER the `env -i` command.
+This is useful to set the `$container` variable, which has been "standardized" as
+a way to tell processes they are running inside a container.
+
+We now have a way to isolate our `hello` process from the host's environment.
+
+    # env -i container="handcraft" chroot ./rootfs ./hello
+
+#### 2.2 `unshare(1)`
+This tool is the one that will actually isolate containers. It has been created
+especially for this purpose, and will let you run a process unshared from
+different namespaces: mount, user, network, PID, IPC and UTS.
+In the same order, each flag will separate your `command` from the given
+namespace. See `unshare(1)` for more information:
+
+    unshare -m -U -n -p -i -u <command>
+
+We can actually leave the `-n` flag out, as some tools provide a better
+approach to network isolation (see `ip-netns(1)`, described later in this post).
+
+Another point worth mentioning is that if you want to isolate the process from
+the PID namespace, you should consider using the options `--fork --mount-proc`,
+so that the process will see a "virtualized" `/proc` that will represent the
+namespace, and not the host. For example:
+
+    # unshare -p --fork --mount-proc ps -faux
+    USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
+    root         1  0.0  0.0  13012  2276 pts/2    R+   23:57   0:00 ps -faux
+
+We just found a way to isolate our program a bit more:
+
+    # unshare -fpiumU --mount-proc env -i container="handcraft" chroot ./rootfs ./hello
+
+For the curious, you can check the `nsenter(1)` program, which will help you
+run a process within another process's namespaces.
+
+#### 2.3 `ip-netns(1)`
+
+The `ip(1)` command includes a `netns` subcommand to manage network namespaces.
+It is useful to give network access to a process while keeping it away from the
+host's network stack.
+
+You need to be familiar with the concept of
+[bridges](\(networking\)), and
+[virtual network interfaces](
+(veth) pairs here.
+Virtual ethernet device pairs act like both ends of a tube: when a packet is
+written on one end, it is also written on the other. This simple concept will
+help us get internet access *inside* the container, while using the network
+stack of the host.
+
+The process is easy: we will create a `veth` pair, move one end inside the
+container, and bridge the other side with a physical interface.
+Let's assume your physical interface is named `eth0`. We will create a bridge
+`br0`, add `eth0` on this bridge, and request an IP for this interface:
+
+    # brctl addbr br0
+    # brctl addif br0 eth0
+    # dhcpcd br0
+
+Then, we create a network namespace, a veth pair and move one end of this
+pair inside the namespace (we will name it "handcraft"):
+
+    # ip netns add handcraft
+    # ip link add veth1 type veth peer name eth1
+    # ip link set eth1 netns handcraft
+
+Now that our namespace has an interface able to communicate with the outside
+world, we can bridge it together with `eth0` and request an IP:
+
+    # brctl addif br0 veth1
+    # ip link set veth1 up
+    # ip netns exec handcraft dhcpcd eth1
+
+We now have a namespace 100% isolated from the host, that can reach the
+outside world over ethernet!
+You can run any command inside this namespace, and it will use the network
+stack we just created. For example:
+
+    # ip netns exec handcraft curl -s
+
+We can now run our `hello` program with its own network stack (even though
+it doesn't make any sense!):
+
+    # ip netns exec handcraft unshare -fpiuUm --mount-proc env -i container="handcraft" chroot ./rootfs ./hello
+
+Don't feel ashamed of such a long-ass command, because that is what `lxc`,
+`docker`, and other container applications do behind your back!
+
+### 3. Bonus: cgroups
+
+Control groups are a feature of the kernel used to limit the resources
+used by a process, or a group of processes. Cgroups can limit CPU
+shares, RAM, network usage, disk I/O, ...
+
+I will not cover their usage here, as this article is already long, but
+they are totally worth mentioning as an improvement over our containers.
+
+### 4. Congratz
+
+Containers are a truly awesome concept. They make great use of new
+technologies, and all the tools presented above allow the standard users
+to exploit them in many different ways.
+Applications like LXC and docker both recreate a full operating system,
+even though they are used to run a single process (web server, database, ...).
+
+By knowing how this works under the hood, we will be able to use the
+container technology to isolate the application in a smarter way than
+shipping it along with a full operating system.
+
+For further reading, check out these links:

-2.0 chroot
-2.1 unshare / nsenter
-2.2 ip-netns
+* [](
+* [](
+* [](
+* [](

-3. cgroups
+Now get out there, and make some containers!
diff --git a/ b/
@@ -1,4 +1,4 @@
-MD = ./markdown
+MD = markdown
 NAME = monochromatic

 PREFIX = /var/www/
diff --git a/css/monochrome.css b/css/monochrome.css
@@ -85,27 +85,13 @@ header h1 a:hover {
 /* }}} */

 /* Coding style (<code>) {{{ */
-code, pre {
-    color: inherit;
+pre {
+    color: #eee;
     font-family: monospace;
     font-size: 90%;
-    padding: 2px;
-    background-color: #eee;
-    border: 1px solid #bbb;
+    background-color: #333;
+    border: 1px solid #eee;
     border-radius: 4px;
-}
-
-/*
- * code:before, code:after {
- *     content: "`";
- * }
- */
-
-pre code:before, pre code:after {
-    content: none;
-}
-
-pre {
     padding: 10px;
     overflow-x: auto;
     overflow-y: hidden;
diff --git a/index.txt b/index.txt
@@ -1,3 +1,4 @@
+* 0x001b - [Hand-crafted containers](/2016/03/hand-crafted-containers.html)
 * 0x001a - [Make your own distro](/2016/01/make-your-own-distro.html)
 * 0x0019 - [Install Alpine at](/2015/08/install-alpine-at-onlinenet.html)
 * 0x0018 - [cross-compiling with PCC and musl](/2015/08/cross-compiling-with-pcc-and-musl.html)
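
The post added in this commit copies a binary and the libraries reported by `ldd` into a root directory by hand. As a rough sketch of that one step, the shell function below does the same for any binary. The name `populate_rootfs` and its two arguments are hypothetical and do not come from the post, and unlike the post's tl;dr (which puts every library in one `lib/` directory) this version keeps each file at its host path inside the new root. It assumes `ldd` prints absolute paths for the dependencies it finds, which is the usual output on Linux but not guaranteed everywhere.

```shell
#!/bin/sh
# Sketch only: populate a minimal root directory for one dynamically linked
# binary. `populate_rootfs` is a hypothetical helper, not part of the post.

populate_rootfs() {
    bin=$1      # absolute path of the binary on the host, e.g. /bin/echo
    root=$2     # directory that will later be used as the chroot

    # Keep the binary at the same path inside the new root.
    mkdir -p "$root$(dirname "$bin")"
    cp "$bin" "$root$bin"

    # Print every field of the ldd output that is an absolute path. This
    # picks up the dynamic linker and the libraries, and skips the vdso
    # line, which has no file on disk.
    ldd "$bin" 2>/dev/null |
    awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^\//) print $i }' |
    while read -r lib; do
        mkdir -p "$root$(dirname "$lib")"
        cp -L "$lib" "$root$lib"    # -L: copy the file a symlink points to
    done
}
```

With a root populated this way, the post's command line applies unchanged, for example `chroot "$root" /bin/echo 'Hello, world!'` run as root. A statically linked binary makes `ldd` print nothing useful, in which case only the binary itself is copied.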