It is helpful to remember that containers are just normal Unix processes with two special tricks.
Normal Unix Processes
Unix starts processes by performing a fork() system call to create a new child process. The child process still contains the same program as the parent process, so the parent processes program still has control over the child. It usually performs a number of operations within the context of the new child, preparing the environment for the new program, from within.
Then, after the environment is complete, the parent program within the child processes context replaces itself by calling execve(). This system call unloads the current program in a process and reuses the process to load a new program into it.
When the program in a process has finished, it terminates itself with an exit() system call. The parameter to exit() is the status code. The parent process waits for the end of the child program using a wait() system call. When the child exits, the parents wait() call returns, and delivers the status code of the child’s exit() invokation to the parent.
Special Trick: Isolation
Containers are Linux processes that are set up two special kinds of resource barriers:
- Linux Namespaces, which limit which kind of system objects a process can see. There are seven Linux Namespaces. Six of them are explained in a good series on Namespaces in LWN, and the Linux Namespaces Manpage explains all seven of them, including the new cgroup Namespace.Namespaces can provide a process with its own set of network resources, including a private ethernet interface with addresses, a local routing table instance and a local iptables rules instance. Namespaces can provide processes with its own view of the filesystem as well, limiting it to a certain subtree of the host filesystem and also limiting the effects of mount() system calls executed by the process.
- Control Groups (cgroups), which limit how much of a given resource that is visible to a process this process can consume. Control Groups have been subject to a major redesign in the most recent kernels, see this talk.
Both resource barriers set up processes in a way that they cannot reach beyond the established borders: Two processes in different namespaces that are correctly set up cannot see each other and can believe they are alone on a machine. At the same time they cannot deplete each other of host machine resources, because the control groups prevent that if correctly set up and sized, minimising crosstalk between the two processes.
The combination of Namespaces and Cgroups forming a resource barrier is called a Pod in Kubernetes. For all practical purposes, the Pod has the same resource-isolation effect that a VM has, but unlike a VM it does consume practically no resources.
Inside a Pod, zero or more processes may be running. Processes running inside a Pod are still normal Linux processes executing normal system calls (within the limits that namespaces and cgroups impose on them), so they will be just as fast as regular processes.
Also, Processes sharing a Pod can see each other and touch each other, just like regular processes running on a physical or virtual machine. This allows co-scheduling of processes inside a Pod – Kubernetes calls containers doing this “Composite Containers” and there are a number of deployment patterns and use-cases for these.
Special Trick: Images
If you want to execute something in a filesystem namespace in a Pod, you need to build a stub operating system inside the subtree that is visible inside the Pod.
By default, a filesystem namespace will limit the Pod and all processes inside the Pod to certain subdirectory, not unlike a chroot() environment. By default this subdirectory will be empty. In order to be able to run things with this as a root directory, the necessary minimum of files needs to be present.
This can be done by copying the necessary files into the environment, or by installing a stub operating system from packages into the environment, but both processes may end up copying a few hundred or even thousands of files into the target environment.
Filesystem images are files that contain the block structure of a device inside a file. Using loop mounts, the file can be interpreted by the Operating System as a device, creating a file tree. With this concept, an entire environment can be copies into place using a single file, and then mounting that.
Building these files can be pretty complicated, though, and these files can also become quite large – and if you have multiple of these environments, they will also duplicate quite a bit of content: Each container will require a copy of the C standard library and many other system files that are simply shared between almost all containers.
Container systems often address this problem by defining the contents of an image as an assembly of Layers. A layer is simply a tar file that is being downloaded and unpacked on top of other, earlier downloads so that the total resulting image is the union of all layers stacked on top of each other.
Because it can become necessary for a higher layer to effectively delete files or directories contained in lower layers, there needs to be a mechanism that transports this information.
Docker images use an awfully designed mechanism called “whiteout files” for that. Basically, if the tar contains a dotfile starting with the four letters “.wh.” (that’s for “whiteout”, Tippex), a file of the same name as the suffix of the whiteout file will be blocked out from a lower layer. So if you unpack /etc/.wh.passwd, this will block out /etc/passwd from a lower layer.
Images can be written to. Writes will always go into the topmost layer.
Putting it all together: Kubernetes
Original Docker is a tool that has been created with developer concerns in mind and zero regard for operations. Docker can run containers on a single host, and has no idea about networking at all – it uses the single IP address a host has, and exposes ports of processes on the inside using iptables rules that merrily map ports between the physical host and the view of the world inside containers.
Docker builds images rather quickly by un-tar-ing layers on top of each other, and because writes are always going to the topmost layer and that layer can be discarded, it is also easily possible to undo all writes after a run and get back the pristine image for a new run. This makes it ideal for running tests.
Docker in production is another matter – usually production is larger than a single host, and usually production workloads are being addressed by proper networking equipment which does not so much deal with port numbers as with IP addresses instead.
So production docker usually needs an IP address per host, some kind of routing and networking isolation layer for all of that, and it needs a thing that takes requests to run images and assigns a physical host to each image.
Images are handy – they package a piece of software with all dependencies, and allow us to start processes on any host in a multihost cluster without installing additional software.
So the following can becomes possible:
We have a cluster with a number of physical hosts, all of which are capable of retrieving images and launching containers. Multiple cluster masters exist, one of which has been elected leader and executes the actual scheduling function.
The cluster accepts scheduling requests, asking the cluster to pick up a certain image which has the following requirements on CPU, memory, disk and network I/O, and find a spot to run it.
The scheduler surveys the cluster, finds such a spot and then asks the cluster node to do just that: Set up a pod, download and assemble the image and then launch the process with the image inside that Pod.
The end result is a process that sees one single, private network interface and runs inside a set of Resource Barriers that form the Pod. The program code for the process as well as all libraries or other data is container inside the image.
By installing software including all dependencies inside the image we are not polluting our base operating system image. Once the execution of that workload finishes, we can “uninstall” everything simply by deleting the image (and its constituent layers).
We can create a cluster that represents a large amount of compute power in the form of CPU cores, memory, disk space and network I/O and have one single scheduler to run arbitrary applications that have been packaged in images on it. The cluster scheduler takes care of workload placement, and as long as we have sufficient capacity we can hand out compute power as needed without caring where exactly in our datacenter that capacity is available – if our datacenter is built in a way that can support this kind of use and if our application is built in a way that it can handle such an environment.