
      1 # [Hand-crafted containers](#)
      2 ## — 18 March, 2016
      3 
      4 ### tl;dr
      5 
      6 	# CTNAME=blah
      7 	# mkdir -p /ns/$CTNAME/bin /ns/$CTNAME/lib
      8 	# ldd /bin/echo | grep '/' | cut -d'>' -f2 | awk '{print $1}' | xargs -I% cp % /ns/$CTNAME/lib/
      9 	# cp /bin/echo /ns/$CTNAME/bin/
     10 	# ip netns add $CTNAME
     11 	# ip netns exec $CTNAME unshare -fpium --mount-proc env -i container=handcraft chroot /ns/$CTNAME /bin/echo 'Hello, world!'
     12 
     13 ### 0. Intro
     14 
     15 Containers are the latest trend, for a good reason: they leave room for new
     16 ideas in terms of security, flexibility, performance and much more.
     17 
     18 But what is a container? It is a group of processes isolated together from the
     19 host operating system. This isolation can happen in different places
     20 (namespaces), be it in the network, the filesystem, the process tree, or all of
     21 them (there are more, in fact; see below).
     22 
     23 We can differentiate three types of containers:
     24 
     25 + operating system containers
     26 + application containers
     27 + I LIED!
     28 
     29 If we think about it, an operating system is a process `/sbin/init` that will
     30 spawn other subprocesses. This way, an operating system is nothing more than
     31 an application (a complex one). In this regard, there is only a single type of
     32 container.  
     33 We can now focus on what's really important: how do they work?
     34 
     35 ### 1. Namespaces
     36 
     37 That's a keyword, so let's ask our internet god what it means:
     38 
     39 > In computing, a namespace is a set of symbols that are used to organize
     40 > objects of various kinds, so that these objects may be referred to by name.
     41 >
     42 > -- sincerely, [wikipedia](https://en.wikipedia.org/wiki/Namespace)
     43 
     44 In Linux terms, a namespace wraps a system resource so that the processes
     45 inside it see their own isolated copy of that resource.  
     46 When a namespace is created for a process, all its children will be created
     47 within this namespace, and inherit the "limitations" of the parent.
     48 
     49 #### Mount
     50 The process will be able to mount and unmount filesystems without affecting
     51 the rest of the system. For example, if you unmount a partition within the
     52 namespace, all the processes within it will see it as unmounted, while it
     53 will remain mounted for all other processes on the host.
     54 
     55 #### UTS (Unix Time-Sharing)
     56 This will give the ability to change the host and domain name in the namespace
     57 without changing it on the host.
     58 
     59 #### IPC (Inter-Process Communication)
     60 This namespace concerns shared memory, System V message queues and semaphores.
     61 Processes in the namespace will be unable to communicate with the host's
     62 processes this way.
     63 
     64 #### Network
     65 Processes will have their own network stack. This includes the routing table,
     66 firewall rules, sockets, and so on.
     67 
     68 #### PID (Process IDentification)
     69 Process IDs will get a different mapping than the one they have on the host. They
     70 will get renumbered, starting from 1.
     71 
     72 #### User
     73 The namespace will have its own set of user and group IDs.
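
If you want to see these namespaces for real, the kernel exposes them as
symlinks under `/proc/<pid>/ns/`. Here is a minimal sketch that prints which
namespace a process lives in, and checks whether two processes share one. The
`ns_id` and `same_ns` helpers are names made up for this example, not
existing tools:

```shell
# ns_id TYPE [PID]: print the namespace identifier of a process,
# in the form "type:[inode]". Defaults to the calling process.
ns_id() {
	readlink "/proc/${2:-self}/ns/$1"
}

# same_ns TYPE PID1 PID2: succeed if both processes share that namespace.
same_ns() {
	[ "$(ns_id "$1" "$2")" = "$(ns_id "$1" "$3")" ]
}

# List the namespaces of the current shell, one per line.
for t in mnt uts ipc net pid user; do
	printf '%s\t%s\n' "$t" "$(ns_id "$t" $$)"
done
```

Two processes are in the same namespace when the identifiers match, which is
the case for every child process until something explicitly moves it elsewhere.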
     74 
     75 ### 2. Making containers
     76 
     77 Now that we know what containers are and how they work, it's time to make
     78 one!
     79 For the purpose of this article, we will try and build the simplest container
     80 capable of printing "Hello, world!".
     81 
     82 Here is the program:
     83 
     84 	$ cat <<EOF > hello.c
     85 	#include <unistd.h>
     86 	int
     87 	main(int argc, char **argv)
     88 	{
     89 		write(1, "Hello, world!\n", 14);
     90 		return 0;
     91 	}
     92 	EOF
     93 	$ cc hello.c -o hello
     94 
     95 #### 2.0 `chroot(1)`
     96 This one is an old tool that will run a command or spawn an interactive
     97 shell after changing the root directory.
     98 It is used to isolate a process, or group of processes from the host's
     99 filesystem tree. This has long been used for security purposes
    100 (see [chroot jail](https://en.wikipedia.org/wiki/Chroot)), but escaping from
    101 a chroot is rather easy for someone with root (UID 0) access.
    102 This is why `chroot` alone cannot be considered secure, but coupled with a user
    103 namespace and privilege dropping, one can turn a chroot into a real jail.
    104 
    105 Back to the topic. Let's copy our `hello` binary into the chroot, and try to
    106 run it:
    107 
    108 	$ mkdir rootfs
    109 	$ cp ./hello ./rootfs/hello
    110 	# chroot ./rootfs ./hello
    111 	chroot: failed to run command "./hello": No such file or directory
    112 
    113 This is the worst error message you can get. Of course `./hello` exists!
    114 We just copied it. But what does this error mean then? Let's take a closer
    115 look at this binary:
    116 
    117 	$ file ./hello
    118 	./hello: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux-x86-64.so.2, for GNU/Linux 3.12.0, not stripped
    119 
    120 The output may differ slightly depending on your system, but the important
    121 part here is the following: 
    122 
    123 > dynamically linked, interpreter /lib/ld-linux-x86-64.so.2
    124 
    125 Dynamically linked binaries cannot be run on their own. Long story short,
    126 `/lib/ld-linux-x86-64.so.2` is a program that is implicitly called to run all
    127 the dynamic binaries on a Linux system; it's called the
    128 [dynamic linker](https://en.wikipedia.org/wiki/Dynamic_linker). It is missing from the chroot, hence the error. So in order to have a
    129 binary run in the chroot, you need to copy over the linker AND all the libraries
    130 your binary links to. To get a list of these libraries, use the `ldd` command:
    131 
    132 	$ ldd hello
    133 	linux-vdso.so.1 (0x00007ffd3e7dc000)
    134 	libc.so.6 => /lib/libc.so.6 (0x00007fdc1a482000)
    135 	/lib/ld-linux-x86-64.so.2 (0x00007fdc1a82a000)
    136 
    137 You can ignore the [`vdso`](http://man7.org/linux/man-pages/man7/vdso.7.html)
    138 line: it is a virtual library that the kernel maps into every process, not a file on disk.
    139 Our `hello` binary depends on two files: `/lib/ld-linux-x86-64.so.2`, the linker,
    140 and `/lib/libc.so.6`, the C library (containing wrappers for system calls like `write(2)`).
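
For binaries with more than two dependencies, reading the `ldd` output by eye
gets old quickly. Here is a rough sketch of how the list could be extracted and
copied mechanically. It assumes glibc-style `ldd` output, an absolute path to
the binary, and no whitespace in library paths; `list_libs` and
`copy_with_libs` are hypothetical helper names:

```shell
# list_libs BINARY: print the absolute path of every file BINARY needs
# at run time (shared libraries and the dynamic linker).
list_libs() {
	ldd "$1" | awk '
		$2 == "=>" && $3 ~ /^\// { print $3 }	# libfoo.so => /path/libfoo.so
		$1 ~ /^\//               { print $1 }	# /lib/ld-linux-x86-64.so.2
	'
}

# copy_with_libs BINARY ROOT: copy BINARY and its dependencies into ROOT,
# keeping the same paths they have on the host.
copy_with_libs() {
	bin=$1 root=$2
	for f in "$bin" $(list_libs "$bin"); do
		mkdir -p "$root$(dirname "$f")" || return 1
		cp "$f" "$root$f" || return 1
	done
}
```

Something like `copy_with_libs /bin/echo ./rootfs` would then populate the
chroot in one go. Keeping the host paths matters because the path to the
linker is hardcoded in the binary.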
    141 
    142 In order to run our `hello` program, we'll have to copy them into the chroot, at the
    143 same paths. After that, our program should run totally fine:
    144 
    145 	$ mkdir -p rootfs/lib
    146 	$ cp /lib/ld-linux-x86-64.so.2 /lib/libc.so.6 ./rootfs/lib
    147 	# chroot ./rootfs ./hello
    148 	Hello, world!
    149 
    150 TADAAAA!! That was easy, right?
    151 Another option is to simply compile our program *statically*. It means that all the
    152 needed objects from libraries will be compiled into the program, removing the need
    153 for a linker and libc in the chroot:
    154 
    155 	$ rm -r rootfs && mkdir rootfs
    156 	$ cc hello.c -o hello -static -s
    157 	$ cp hello ./rootfs
    158 	# chroot ./rootfs ./hello
    159 	Hello, world!
    160 
    161 Let's take a look at the size of this "container". For scale, the
    162 "[Smallest possible docker container](https://docs.docker.com/articles/baseimages/#creating-a-simple-base-image-using-scratch)"
    163 weighs 3.6 MiB...
    164 
    165 	$ du -sh rootfs
    166 	720K    rootfs
    167 
    168 That's most likely the lightest container you've seen, right?
    169 
    170 #### 2.1 env
    171 To isolate our process from the host, we'll have to clear the environment
    172 of all its variables, to make sure the container won't know anything about its
    173 host. We can do this with the `env` command:
    174 
    175 	$ export FOO="bar"
    176 	$ env -i /bin/sh
    177 	$ env # we are now in a subshell
    178 	PWD=/home/z3bra
    179 
    180 You can see that the subprocess doesn't have the `$FOO` variable in its
    181 environment, even though it has been exported earlier.
    182 You can set the environment by passing variables AFTER the `env -i` command.
    183 This is useful to set the `$container` variable, which has been "standardized" as
    184 a way to tell processes they are running inside a container.
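
To convince yourself, here is a tiny sketch that wraps this in a function and
checks what the child process actually sees. `run_clean` is a made-up name, and
the value `handcraft` is just the one used throughout this article:

```shell
# run_clean COMMAND [ARGS...]: run COMMAND with an empty environment,
# except for the $container variable.
run_clean() {
	env -i container="handcraft" "$@"
}

# FOO is exported on the host side...
FOO="bar"
export FOO

# ...but the child only sees $container.
run_clean /bin/sh -c 'echo "FOO=${FOO:-unset} container=${container}"'
```

This should print `FOO=unset container=handcraft`. The command is given as an
absolute path because `$PATH` is wiped along with everything else.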
    185 
    186 We now have a way to isolate our `hello` process from the host's environment. 
    187 
    188 	# env -i container="handcraft" chroot ./rootfs ./hello
    189 
    190 #### 2.2 `unshare(1)`
    191 This tool is the one that will actually isolate containers. It has been created
    192 especially for this purpose, and will let you run a process unshared from
    193 different namespaces: mount, user, network, PID, IPC and UTS.  
    194 In the same order, each flag will separate your `command` from the given
    195 namespace. See `unshare(1)` for more information:
    196 
    197 	unshare -m -U -n -p -i -u <command>
    198 
    199 We can actually leave out the `-n` flag, as some tools provide a better
    200 approach to network isolation (see `ip-netns(1)`, described later in this post).
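
To watch a single flag at work, here is a small sketch using the UTS namespace.
It assumes a kernel that allows unprivileged user namespaces and a `unshare`
that supports `-r` (`--map-root-user`), so that it can be tried without being
root on the host. `uts_demo` is a hypothetical name:

```shell
# uts_demo NAME: set the hostname to NAME inside a new UTS namespace and
# print it. The hostname of the host itself is left untouched.
uts_demo() {
	unshare -r -u sh -c 'hostname "$1" && hostname' sh "$1"
}
```

Running `uts_demo handcraft` followed by a plain `hostname` should show two
different names: the change never left the namespace.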
    201 
    202 Another point worth mentioning is that if you want to put the process in its
    203 own PID namespace, you should consider using the options `--fork --mount-proc`,
    204 so that the process will see a "virtualized" `/proc` that will represent the
    205 namespace, and not the host. For example:
    206 
    207 	# unshare -p --fork --mount-proc ps -faux
    208 	USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
    209 	root         1  0.0  0.0  13012  2276 pts/2    R+   23:57   0:00 ps -faux
    210 
    211 We just found a way to isolate our program a bit more:
    212 
    213 	# unshare -fpiumU --mount-proc env -i container="handcraft" chroot ./rootfs ./hello
    214 
    215 For the curious, you can check the `nsenter(1)` program, which will help you
    216 run a process within the namespaces of another process.
    217 
    218 #### 2.3 `ip-netns(1)`
    219 
    220 The `ip(1)` command includes a `netns` subcommand to manage network namespaces.
    221 It is useful to give network access to a process while keeping it away from the
    222 host's network stack.
    223 
    224 You need to be familiar with the concept of
    225 [bridges](https://en.wikipedia.org/wiki/Bridging_\(networking\)), and 
    226 [virtual network interfaces](https://en.wikipedia.org/wiki/Virtual_network_interface)
    227 (veth) pairs here.  
    228 Virtual ethernet device pairs act like both ends of a tube: a packet
    229 written on one end comes out of the other. This simple concept will
    230 help us get internet access *inside* the container, while going through a
    231 physical interface of the host.
    232 
    233 The process is easy: we will create a `veth` pair, move one end inside the
    234 container, and bridge the other side with a physical interface.  
    235 Let's assume your physical interface is named `eth0`. We will create a bridge
    236 `br0`, add `eth0` to this bridge, and request an IP address for the bridge:
    237 
    238 	# brctl addbr br0
    239 	# brctl addif br0 eth0
    240 	# dhcpcd br0
    241 
    242 Then, we create a network namespace, a veth pair and move one end of this
    243 pair inside the namespace (we will name it "handcraft"):
    244 
    245 	# ip netns add handcraft
    246 	# ip link add veth1 type veth peer name eth1
    247 	# ip link set eth1 netns handcraft
    248 
    249 Now that our namespace has an interface able to communicate with the outside
    250 world, we can bridge it together with `eth0` and request an IP:
    251 
    252 	# brctl addif br0 veth1
    253 	# ip link set veth1 up
    254 	# ip netns exec handcraft dhcpcd eth1
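
For reference, here are the steps above collected into a single function. This
is only a sketch: it uses `ip link` in place of `brctl` (my substitution, the
two should be equivalent here), and the `run` and `netns_up` helpers as well
as the `DRY_RUN` variable are made up for this example. With `DRY_RUN=1` the
commands are printed instead of executed, which is handy when you are not root:

```shell
# run COMMAND [ARGS...]: execute a command, or only print it if DRY_RUN=1.
run() {
	if [ "${DRY_RUN:-0}" = 1 ]; then
		printf '%s\n' "$*"
	else
		"$@"
	fi
}

# netns_up NS PHYS: bridge the physical interface PHYS with a veth pair
# whose other end lives in the network namespace NS.
# br0, veth1 and eth1 are the names used in this article.
netns_up() {
	ns=$1 phys=$2
	run ip link add name br0 type bridge &&
	run ip link set dev "$phys" master br0 &&
	run dhcpcd br0 &&
	run ip netns add "$ns" &&
	run ip link add veth1 type veth peer name eth1 &&
	run ip link set dev eth1 netns "$ns" &&
	run ip link set dev veth1 master br0 &&
	run ip link set dev veth1 up &&
	run ip netns exec "$ns" dhcpcd eth1
}
```

`DRY_RUN=1 netns_up handcraft eth0` prints the nine commands in order, so you
can review them before running anything against your real network setup.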
    255 
    256 We now have a namespace with its own network stack, isolated from the host's,
    257 that can reach the outside world over ethernet!
    258 You can run any command inside this namespace, and they will use the network
    259 stack we just created. For example:
    260 
    261 	# ip netns exec handcraft curl -s z3bra.org/slj
    262 
    263 We can now run our `hello` program with its own network stack (even though
    264 it doesn't make any sense!):
    265 
    266 	# ip netns exec handcraft unshare -fpiuUm --mount-proc env -i container="handcraft" chroot ./rootfs ./hello
    267 
    268 Don't feel ashamed of such a long-ass command, because that is what `lxc`,
    269 `docker`, and other container applications do behind your back!
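
If typing it gets tedious, the whole thing fits in a small function. This is
a sketch, not a real tool: `ctrun` and `DRY_RUN` are hypothetical names, and
the flags are simply copied from the command above. It needs root to actually
run; with `DRY_RUN=1` it only prints what it would execute:

```shell
# ctrun NETNS ROOTFS COMMAND [ARGS...]: run COMMAND inside the network
# namespace NETNS, unshared from the host, chrooted into ROOTFS.
ctrun() {
	netns=$1 rootfs=$2
	shift 2

	# Assemble the full command line, outermost tool first.
	set -- ip netns exec "$netns" \
		unshare -fpiuUm --mount-proc \
		env -i container="handcraft" \
		chroot "$rootfs" "$@"

	if [ "${DRY_RUN:-0}" = 1 ]; then
		printf '%s\n' "$*"
	else
		"$@"
	fi
}
```

For example, `ctrun handcraft ./rootfs ./hello` is the long command above,
minus the typing.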
    270 
    271 ### 3. Bonus: cgroups
    272 
    273 Control groups are a feature of the kernel used to limit the resources
    274 used by a process, or a group of processes. Cgroups can limit CPU 
    275 shares, RAM, network usage, disk I/O, ...
    276 
    277 I will not cover their usage here, as this article is already long, but
    278 they are totally worth mentioning as an improvement over our containers.
    279 
    280 ### 4. Congratz
    281 
    282 ... for reading this far.
    283 
    284 Containers are a truly awesome concept. They make great use of new
    285 technologies, and all the tools presented above allow ordinary users
    286 to exploit them in many different ways.  
    287 Applications like LXC and docker usually ship a full operating system,
    288 even though they are used to run a single process (web server, database, ...).
    289 
    290 By knowing how this works under the hood, we will be able to use the
    291 container technology to isolate the application in a smarter way than
    292 shipping it along with a full operating system.
    293 
    294 For further reading, check out these links:
    295 
    296 * [http://doger.io](http://doger.io)
    297 * [http://git.r-36.net/ns-tools](http://git.r-36.net/ns-tools)
    298 * [https://github.com/arachsys/containers](https://github.com/arachsys/containers)
    299 * [https://github.com/p8952/bocker](https://github.com/p8952/bocker)
    300 
    301 Now get out there, and make some containers!