Our goal for Operos 1.0

That a single IT generalist can deploy a complete system for running cloud native applications on premises in minutes and that this system requires only a small fraction of that person’s time in ongoing maintenance.

We believe that we are well on our way to achieving the first part of this goal with the early releases of Operos. Much of the roadmap is consequently focused on the second part: reducing the maintenance burden as much as possible.

The Road Map

Let’s walk through the roadmap, roughly in order of importance, starting with the items that have specific milestones attached.

Software Updates (slated for Operos 0.3)

Meeting our goal of minimal intervention during operation means solving the software update problem. It is the single most important item on this roadmap and has influenced many of our design decisions for Operos, for example:

  • Operos workers run entirely in memory, meaning that an upgrade is just a matter of rebooting the worker node.

  • Taking inspiration from containers, we use layers of SquashFS files to store the code and then overlay a writable filesystem to store persistent controller data.
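The filesystem layering described above can be sketched with standard Linux mounts (the image name and paths here are illustrative, not the actual Operos layout):

```shell
# Mount the read-only OS image (SquashFS), then stack a writable
# overlay on top of it for persistent state. Paths are hypothetical.
mount -t squashfs -o loop,ro /boot/operos-os.squashfs /mnt/lower
mount -t overlay overlay \
    -o lowerdir=/mnt/lower,upperdir=/var/state/upper,workdir=/var/state/work \
    /newroot
```

Because the code layers are read-only images, an upgrade amounts to swapping in a new SquashFS and rebooting; the writable overlay carries the persistent data forward.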

The innovation in our automatic update process is the degree of automation. Our goal is that the administrator of an Operos cluster will only need to schedule a system upgrade from Waterfront, at a convenient time or on an as-needed basis. Once the upgrade is activated, the controller is patched and rebooted. When it resumes operation, it moves through the cluster, selecting which worker nodes to update and when. The key innovation is that Operos will use metrics from Kubernetes and Ceph to pick which node to update - and when - to minimize the turbulence your applications experience while the cluster is upgrading.
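A minimal sketch of this kind of metric-driven selection follows. The field names and the rule itself are hypothetical stand-ins; the real scheduler would draw on live Kubernetes and Ceph data:

```python
# Pick the next worker to upgrade: prefer the node with the lowest
# Kubernetes pod load and least Ceph placement-group activity, and
# refuse to act while the storage cluster is rebalancing.
# All fields here are illustrative, not actual Operos metrics.

def pick_next_node(nodes, ceph_rebalancing):
    """nodes: list of dicts with 'name', 'pods', 'ceph_pgs_active', 'upgraded'."""
    if ceph_rebalancing:
        return None  # wait for Ceph to settle before taking a node down
    pending = [n for n in nodes if not n.get("upgraded")]
    if not pending:
        return None  # every node is already on the new version
    # Least-loaded first: fewest running pods, then fewest active PGs.
    return min(pending, key=lambda n: (n["pods"], n["ceph_pgs_active"]))

nodes = [
    {"name": "worker-a", "pods": 12, "ceph_pgs_active": 40, "upgraded": False},
    {"name": "worker-b", "pods": 3,  "ceph_pgs_active": 55, "upgraded": False},
    {"name": "worker-c", "pods": 7,  "ceph_pgs_active": 10, "upgraded": True},
]
print(pick_next_node(nodes, ceph_rebalancing=False)["name"])  # → worker-b
```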

Authentication/Authorization (slated for Operos 0.4)

Most deployments will require some form of authentication and authorization. Kubernetes solves authorization through its role-based access control system, but leaves authentication up to the user. For 0.4, we plan to integrate an open-source identity provider that brokers to a variety of backends, such as LDAP, OpenID Connect, and SAML. We also plan to implement granular delegation of access to Operos cluster management functions in Waterfront.
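For reference, the upstream Kubernetes API server already accepts OIDC tokens once it is pointed at an issuer, so an integrated identity provider mainly has to stand behind flags like these (the issuer URL and client ID are placeholders):

```
--oidc-issuer-url=https://idp.cluster.internal/auth
--oidc-client-id=kubernetes
--oidc-username-claim=email
--oidc-groups-claim=groups
```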

Metrics/logging forwarding and routing (slated for Operos 0.4)

The Operos controller collects system metrics and logs generated during the operation of the cluster, which are displayed in Waterfront. The Operos metric collection system is not meant, however, to replace a larger organizational business intelligence system, nor is it intended for long-term retention. What Operos will offer is the ability to configure its system metrics and logging pipeline to forward this data to a wide variety of systems.

Policy-driven egress/ingress and load balancing

The most significant core component still missing from Operos is policy-driven ingress/egress for service communication outside the cluster. This is a general problem in the Kubernetes ecosystem: even managed offerings in the public cloud require extra orchestration steps. Operos already reserves a portion of the cluster network which can be used by networking devices, such as a load balancer/firewall/router appliance, to route traffic into the cluster. The downside to this approach is that administrators must keep these devices in sync with the addresses of Kube ingress controllers as new ones are created or old ones are removed. As part of our roadmap for Operos, we will add ingress controllers that automate this for common networking appliances. We will also be adding a novel solution for organizations that don’t have (or don’t want) an appliance. We will detail this in a subsequent blog post.
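The synchronization problem itself is simple to state: compute the set of backend addresses the appliance should route to from the current set of ingress controller endpoints, and reconcile any drift. A toy sketch of that reconciliation (the data shapes are hypothetical; a real controller would watch the Kubernetes API):

```python
# Reconcile an external appliance's backend list against the addresses
# of the cluster's ingress controllers. Shapes are illustrative only.

def desired_backends(ingress_node_ips, port=443):
    """ingress_node_ips: node IPs currently running an ingress controller."""
    return {f"{ip}:{port}" for ip in ingress_node_ips}

def reconcile(current, desired):
    """Return (to_add, to_remove) so `current` converges on `desired`."""
    return sorted(desired - current), sorted(current - desired)

current = {"10.0.1.5:443", "10.0.1.9:443"}          # what the appliance has
desired = desired_backends(["10.0.1.5", "10.0.1.12"])  # what the cluster runs
print(reconcile(current, desired))
# → (['10.0.1.12:443'], ['10.0.1.9:443'])
```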

Waterfront improvements

On the UI front, we will be improving the dashboards. We describe the dashboards as “curated” on the Operos landing page. By this we mean that we are using our experience and standard methods - for example, the USE method - to select the metrics that are necessary for troubleshooting. More effort will also go into the inventory of worker hardware. For example, you will be able to shut nodes down, decommission them, and put them into maintenance mode.

opsctl

We believe all great systems need great command line tools. As such, we will be creating a tool called ‘opsctl’ (for “Operos control”) that enables administrators to perform every action they can through the web UI from the command line. These tools will utilize an open API that will be documented, so you can also write your own custom tools. Whether you need to integrate with a complex ERP system, or your ERP system is Python scripts and spreadsheets, we want to make sure you are covered.

Advanced cluster network topologies

At the moment, the controller has a very simple network setup: a user-facing interface (Waterfront/Kube API) and a cluster data/control plane interface, which the installer calls the private interface. We plan to modify the installer to allow the administrator to split the cluster data/control plane across multiple physical interfaces on the controller. Common reasons for splitting traffic are to dedicate channels - the control plane (PXE boot, image delivery), the Ceph data plane, the inter-pod communication fabric - based on bandwidth needs. We also wish to allow these to include virtual interfaces (e.g., VLANs), so that a controller could potentially need only one physical interface.
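With standard Linux tooling, such a tagged virtual interface might look like this (the interface name, VLAN ID, and addressing are illustrative):

```shell
# Carry a second cluster plane over the same physical NIC via VLAN 100.
ip link add link eno1 name eno1.100 type vlan id 100
ip addr add 10.55.0.1/24 dev eno1.100
ip link set eno1.100 up
```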

Dynamic Ceph tuning

Managing Ceph is a specialty unto itself. We intend to improve our current system so that, hopefully, you don’t have to become that specialist. Currently, Operos sets up Ceph for the initial cluster deployment of one controller and two worker nodes. You can add as many nodes as you want, but Operos will need to tune the minimum replica sizes and the number of placement groups to make the best use of these nodes. Additionally, we want to take advantage of layer 2 networking information, such as LLDP, to modify the Ceph crushmap and ensure fault domains are defined correctly in larger multi-rack clusters.
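A commonly cited Ceph rule of thumb targets on the order of 100 placement groups per OSD, divided by the pool’s replica count and rounded up to a power of two. A sketch of that arithmetic (a stand-in for whatever heuristics Operos ultimately uses):

```python
# Rule-of-thumb PG count: ~100 PGs per OSD, divided by the pool's
# replica count, rounded up to the nearest power of two.

def suggested_pg_count(osds, replicas, target_per_osd=100):
    raw = (osds * target_per_osd) / replicas
    power = 1
    while power < raw:
        power *= 2
    return power

print(suggested_pg_count(osds=6, replicas=3))   # → 256
print(suggested_pg_count(osds=40, replicas=3))  # → 2048
```

As nodes (and therefore OSDs) join, the suggested count grows in discrete steps, which is one reason retuning has to happen as the cluster expands.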

Full Disk Encryption

We plan on making full disk encryption the default on all worker nodes, and optional on the controllers. Why optional? The Operos worker runtime is stateless: it gets its encryption keys from the controller. The controller, however, has no place to store its own encryption keys securely, which means it must be unlocked by an administrator after a reboot. In some scenarios, relying on the physical security of the controller’s disks may be less of a burden than the outage period while the controller waits to be unlocked.

The beginning of the end of the beginning

So, that’s the roadmap for Operos 1.0 - with the caveat that, like any plan, it’s subject to revision at any time as we onboard more pilot users, receive feedback, and field testing shifts priorities. If you are interested in contributing, we are tracking work in progress as GitHub issues. Feel free to add feature requests, comment, or contribute!

If you are interested in discussing the timeline or in piloting Operos in your organization, please send us an email at info@paxautoma.com or join us on the Operos Slack.

Thanks,

The Pax Automa Team.