Controller and worker nodes
In an Operos cluster, there are two types of machines - a controller and workers. The controller is the “brain” of the system. It runs etcd, the Kubernetes master components (API server, controller manager, and scheduler), the Ceph monitor, DNS, and the UI. Regular workloads don’t generally run on the controller.
The worker nodes comprise the resource pool in the cluster. Their CPU, memory, and storage are pooled together and made available to user workloads by Kubernetes.
Build and install process
Most general-purpose Linux distributions, such as Red Hat or Ubuntu, are assembled during installation and then configured after they’re running. The installer for these distros copies individual packages to the target machine - the kernel, GNU tools, and applications. After the OS is provisioned, it’s common to use a configuration management tool, such as Puppet, Chef, Salt, or Ansible, to install additional packages and apply the configuration that prepares the machine for its actual role in the cluster. This allows for maximum flexibility; however, it also means a more complex, lengthy, and error-prone provisioning and configuration process.
Operos works differently. Because it is a special-purpose OS, it comes pre-bundled with most of the software and configuration. This bundling happens when Operos is built, during the continuous integration phase. The artifacts of this phase are a set of layered SquashFS images that contain the complete filesystem of the target machine. The installer’s job is therefore much simpler - it only has to partition the disk, copy the images to the machine, and set up the bootloader.
Under the hood, Operos uses Arch Linux. The build process is heavily based on the Archiso workflow.
Layered OS images
For Operos to function, three complete image sets are needed, one for each of: the controller, worker, and installer. There are many similarities between the images. For example, both the controller and the worker must have Kubelet and Calico installed, and all of the images need the base OS packages. To make this possible without exploding the size of the ISO, the build produces a set of SquashFS images that can be layered on top of each other using OverlayFS. At the very top of the overlay stack on each of the nodes is a copy-on-write layer that enables the root file system to be writable.
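The way the layers combine into a single root filesystem can be illustrated with a short sketch. The layer names and paths below are hypothetical (the real image names are not shown in this document); what matters is the shape of the OverlayFS mount options: `lowerdir` lists the read-only SquashFS layers from top-most to bottom-most, while `upperdir` supplies the writable copy-on-write layer.

```python
# Sketch of assembling OverlayFS mount options from layered SquashFS
# images. Layer names and paths are illustrative, not Operos's actual ones.

def overlay_options(layers, upper, work):
    """Build the options string for: mount -t overlay overlay -o <options> /new_root

    `layers` is ordered top-most first, which is the order OverlayFS
    expects for lowerdir.
    """
    lowerdir = ":".join(f"/run/layers/{name}" for name in layers)
    return f"lowerdir={lowerdir},upperdir={upper},workdir={work}"

# A worker's stack: role-specific layer on top, shared base at the bottom.
opts = overlay_options(
    ["worker", "kubernetes", "base"],
    upper="/run/cow/upper",
    work="/run/cow/work",
)
print(opts)
```

The early-boot scripts would then pass a string like this to the overlay mount, after which the combined tree becomes the root filesystem; writes land in `upperdir` while the SquashFS layers stay untouched.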
The following diagram shows all of the layers and how they are related to each other.
When a node boots, the early stage initialization scripts (in the initcpio image) mount each SquashFS image, and layer them on top of each other via OverlayFS. Then they switch the root filesystem to the newly created overlay stack and kick off systemd.
Worker provisioning

In the spirit of making things as simple as possible for the administrators, the process of provisioning a worker is designed to be fully automatic once the machine is powered on. No additional configuration management or provisioning tools are needed. The workers’ OS runs entirely in RAM - there is no installation step.
The workers boot across the network from the controller. To enable this, the controller runs a few services:
- dhcpd: issues IP addresses and boot instructions
- tftpd: serves the SYSLINUX binaries
- HTTP file server (nginx): serves the kernel, initcpio, and SquashFS images
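To make the division of labor concrete, a minimal dhcpd configuration for this setup might look something like the fragment below. The addresses and the bootloader filename are invented for illustration; only the structure (`next-server` pointing at the controller’s tftpd, `filename` naming the SYSLINUX binary) reflects the flow described above.

```
subnet 10.0.0.0 netmask 255.255.255.0 {
  range 10.0.0.100 10.0.0.200;   # pool for worker nodes
  next-server 10.0.0.1;          # controller, which runs tftpd
  filename "pxelinux.0";         # SYSLINUX network bootloader
}
```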
When instructed to boot over the network, the worker obtains an IP and BOOTP instructions from the dhcpd server, downloads the SYSLINUX binaries from tftpd, and executes them. SYSLINUX presents the user with the boot menu. Once the user proceeds with booting, SYSLINUX starts the kernel with the appropriate initcpio image. The initcpio scripts fetch the SquashFS images from nginx, then kick off the mount/overlay/boot process.
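The boot menu itself could be described by a PXELINUX configuration along these lines. The label, file names, and the kernel parameter pointing at the nginx image server are all hypothetical; the real parameter names used by the Operos initcpio scripts are not shown in this document.

```
DEFAULT operos-worker
PROMPT 1
TIMEOUT 50

LABEL operos-worker
  KERNEL vmlinuz-operos
  INITRD initcpio-worker.img
  APPEND images_url=http://10.0.0.1/images/
```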
Node identity

Operos needs a way to identify worker nodes that is stable across reboots. Usually this is done by tracking the individual OS installation (e.g. via systemd’s machine-id). Since Operos workers run entirely in RAM, this approach does not work.
Instead, Operos implements a special fingerprinting algorithm that is based on the identities of various hardware components in the machine. This algorithm produces a UUID for the machine that is stable even if some components change.
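The core idea - deriving a deterministic UUID from hardware identifiers - can be sketched as follows. This is a simplified illustration, not the actual Operos algorithm: the component strings are made up, and unlike the real algorithm, this naive version changes its output if any single component changes.

```python
import uuid

def node_id(components):
    """Derive a deterministic UUID from hardware component identifiers.

    `components` might contain a motherboard serial, NIC MAC addresses,
    disk serials, etc. Sorting makes the result independent of the order
    in which the hardware was probed.
    """
    material = "\n".join(sorted(components))
    # uuid5 produces a deterministic, namespaced, SHA-1-based UUID.
    return uuid.uuid5(uuid.NAMESPACE_DNS, material)

a = node_id(["mb:ABC123", "nic:00:11:22:33:44:55", "disk:WD-XYZ"])
b = node_id(["disk:WD-XYZ", "nic:00:11:22:33:44:55", "mb:ABC123"])
print(a == b)  # same hardware enumerated in a different order -> same ID
```

A UUID produced this way can serve directly as the worker’s hostname and as its identity within Ceph, as the next paragraph describes.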
This ID is used as the hostname of the worker node. It is also used in identifying the machine to the Ceph subsystem.
Upgrades

Note: the upgrade system is still a work in progress and will be ready in the next release, v0.3.
Because Operos uses immutable images as its root filesystem, the upgrade process is conceptually simple and relatively foolproof.
On the controller, upgrading is a matter of replacing the OS images with new versions, updating the bootloader’s kernel arguments to point to the new images, and rebooting. On the first boot of the new version, any data in the overlay copy-on-write partition is migrated forward.
Rolling back to the previous version is similarly simple. Whenever it is upgraded, Operos retains the previous version of the SquashFS images, and leaves a boot option to use those images.
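On disk, this could look like a bootloader configuration with one entry per retained version. The version numbers, paths, and the kernel parameter naming the image directory are invented for illustration; the point is that the previous version’s images remain selectable at boot.

```
DEFAULT operos-v0.2

LABEL operos-v0.2
  MENU LABEL Operos v0.2 (current)
  KERNEL /boot/v0.2/vmlinuz
  APPEND images=/boot/v0.2

LABEL operos-v0.1
  MENU LABEL Operos v0.1 (previous version - rollback)
  KERNEL /boot/v0.1/vmlinuz
  APPEND images=/boot/v0.1
```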
The workers pull the OS from the controller each time they boot, so a reboot automatically means that the OS is “upgraded”. Once the controller has been upgraded, each of the worker machines will be power-cycled one by one. In the future, Operos will use a more powerful algorithm that takes into account the pods and services running on each machine. This algorithm will be tuned to minimize the disruption to the cluster.
Worker storage

Operos automatically takes advantage of all disk storage on worker nodes. Each disk is split into two partitions:
- Local ephemeral storage. Docker images, container data, and local volumes are stored here.
- Distributed storage. Each of these partitions is bound to a Ceph OSD.