Updated as of 2021-02-20

As the 2nd anniversary of the Dotfile “endeavor” has recently passed, I have begun to grapple with a fact that has been true since I first launched Dotfile ServerStack: I’m doing it wrong.

The current platform, on which almost all Dotfile services run, is a clever (if hacked-together) Docker Swarm-based quasi-cluster. It provides an extremely easy method of deployment and management and was a viable option at the time given my relatively limited resources and experience. While Kubernetes already occupied the vast majority of the server orchestration space back then, it did not have the monolithic presence it has today. It was also kneecapped by the fact that it was not easy to deploy at any scale larger than “3 Raspberry Pis” but smaller than “supermassive data center”. Thus Docker Swarm was chosen instead. As time went on and K8s got bigger and better, high-quality application support for Docker Swarm became harder and harder to find, so a lot of Dotfile services were not as good or as available as I would have liked them to be.

However, in 2021 I am a smarter person, or at least in a position to seriously invest in my server lab. After the first in what will soon be a series of arguably poor financial decisions, I have purchased 2 ‘new’ servers. Over the next while (and the next few posts) I will architect, trial, and ultimately upgrade Dotfile ServerStack into what I’m calling [email protected].

Outline and Goals

My end goals for this project, in short, are:

  1. Use industry-standard Kubernetes to provide high availability on-prem cloud orchestration which can easily be scaled and managed from a single location (be it CLI tools, a web panel, or a single SSH node).
  2. Use Virtual Machines with host hardware passthrough to provide high levels of cluster availability with minimal overhead.
  3. Architect the cluster so that replicas of as many components as possible are made to the point that with enough resources the failure of any single component has minimal to no impact on the system as a whole.
  4. Build with support in mind for potential hybrid cloud solutions through providers like AWS or DigitalOcean.
  5. Harden the cluster and secure user data more than before.
    • Do not for any reason allow unencrypted user data to touch external clouds.

After a lot of thought I have chosen to use virtual machines over bare metal to run nodes. While I will certainly take a performance hit from the overhead, the ease of setting up new nodes and having easily redundant control planes is worth it in my view. Plus, hardware passthrough will significantly minimize said overhead. This also makes it easy to build images for external cloud providers. If the machines are distributed well, this feeds into goal #3 by significantly mitigating the problems that could occur if a single server were to fail.

Underlying Architecture

While the concept of a K8s cluster is very exciting and cool, first I have to address what the cluster will run on. To do this I have split the discussion into multiple subheadings, ordered from largest scale to smallest.

Components

Federation

The system as a whole will be designed to work as a federated group, with each member (or ‘site’) potentially located anywhere in the world on either bare metal or an external cloud. I will focus on on-prem sites within this post; DigitalOcean and AWS will be discussed in a separate post.

The federation will be managed via kubefed with one cluster per site. This means that the failure points within the system on this level are limited to:

  1. Internet-side DNS failure.
    • This is highly unlikely and even if it did happen it would be outside of the scope of what I can control.
  2. All (or enough) sites fail simultaneously.
    • Ok, this is a bit more likely and a bit more complex than that. Depending on the storage medium each running application uses, the method of distributing its data can vary greatly. While a lot of this will be discussed when talking about the clusters and nodes themselves, it’s important to note that different services may have different breaking points. With enough sites a federation-wide failure should be avoidable, and everything is probably fine so long as a majority of sites don’t fail.
    • Sadly, I only have one site going into this, so if that fails, well… there’s not much that can be done. This is why I put the “with enough resources” caveat in my goals; there’s never a question of if something will fail, it’s a question of when. As my house only has one internet connection, if the telephone pole in my front yard gets knocked over I can’t really recover, and I don’t have the money for “in case of emergency” DigitalOcean clusters just yet.

Otherwise, DNS will be configured to round-robin requests across all clusters that are online.

The Watchers Crown

The federation is maintained by a single master cluster, which I will lovingly dub the Panopticon.

The Panopticon handles the discovery of new clusters and the deployment and propagation of new services across the federation. The upside to this is that if the Panopticon fails, the only consequence is that no new clusters can be discovered and no changes to the services can be made.

The other clusters will still be able to talk among themselves and bring the Panopticon up to speed when it’s restored, and users will still be able to access services from the internet via the other clusters.
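As a rough sketch of how a new site would be brought into the fold with kubefed (the cluster and context names here are placeholders, not my real topology), a member cluster gets registered against the Panopticon roughly like so:

    # Sketch: register a hypothetical member cluster "site1" with the host
    # cluster "panopticon"; all names are placeholders
    kubefedctl join site1 \
        --cluster-context site1 \
        --host-cluster-context panopticon \
        --v=2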

The Sites

The sites, as stated above, are groups of physical servers and routers. The servers all run the Xen hypervisor. Again, this is done so that if an entire server (or several) fails, the remaining ones can pick up the slack. Each site is designed so that its machines can load balance over as little as a single external IP, but more can be used if available.

The single point of failure here is the reason a federation has to be formed: if the router fails, that’s it. The entire site would be considered to have failed; even if the machines can still process connections, no incoming connections can be made in the first place. There is really no workaround that isn’t something gruesomely expensive from the ISP, a concept I loathe to even consider.

The Machines

The servers host a number of virtual machines that all access a site-wide router. The machines are given resources appropriate to their role in the ecosystem: either being part of the cluster proper or routing traffic of some kind.

Machines that run worker cluster nodes will generally be given the majority of the resources compared to the ECAMs (discussed below) or the cluster control planes.

The Software

Machines that are not part of the cluster are instead what I’ve dubbed ECAMs, short for “External Cluster Administrative Machines”. An ECAM acts as an external reverse proxy in front of the cluster’s redundant control plane endpoints. Traffic from the workers to the control planes goes through an ECAM via a virtual IP (VIP), which round-robins it off to the available control planes.

The machines themselves run Alpine Linux with HAProxy as the reverse proxy and Keepalived to hold a VIP on the network. This provides redundancy: if the currently ‘active’ ECAM fails, a ‘passive’ one can take its place and take control of the VIP. There will be one ECAM per server to keep the system as resilient as possible. However, I keep them separate from the control plane cluster nodes to prevent potential weirdness, at the cost of a bit more overhead.
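To make that concrete, here is a minimal sketch of what an ECAM’s configuration might look like. The addresses, interface, and VIP below are placeholders rather than my actual topology:

    # /etc/haproxy/haproxy.cfg (sketch): spread Kubernetes API traffic across
    # the control planes; addresses are placeholders
    frontend kube-apiserver
        bind *:6443
        mode tcp
        default_backend control-planes

    backend control-planes
        mode tcp
        balance roundrobin
        option tcp-check
        server cp0 10.0.0.10:6443 check
        server cp1 10.0.0.11:6443 check
        server cp2 10.0.0.12:6443 check

    # /etc/keepalived/keepalived.conf (sketch): hold the shared VIP on whichever
    # ECAM is currently active
    vrrp_instance ECAM_VIP {
        state MASTER                 # BACKUP on the passive ECAMs
        interface eth0
        virtual_router_id 51
        priority 100                 # lower priority on the passive ECAMs
        virtual_ipaddress {
            10.0.0.5/24              # the VIP the worker nodes point at
        }
    }

The workers never need to care which control plane is actually healthy; they only ever talk to the VIP.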

Now then, I can finally talk about how I plan to design a Kubernetes cluster.

Cluster Architecture

I’m not going to go into depth on how exactly Kubernetes itself works; there are mountains of smarter people to explain that. Instead I will discuss the technologies, backends, and plugins I have decided to use to get this thing to work.

The cluster nodes, specifically the control plane nodes, will run a trimmed-down version of Debian Buster; along with providing redundant control planes they will also provide coupled (stacked) etcd storage. I coupled the two for simplicity’s sake as it reduces the need for even more cluster-external architecture. I will use kubeadm to initialize the cluster as it appears to be the ‘official’ solution and follows the UNIX philosophy of doing one job and doing it well. After that I can hand off the extraneous maintenance of the cluster (i.e. things Kubernetes doesn’t do itself) to other plugins.
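As a sketch of what initializing the first control plane might look like with kubeadm (the version, VIP, and pod subnet below are placeholder assumptions, not final values), the key detail is that the cluster’s endpoint points at the ECAM-managed VIP rather than at any single machine:

    # kubeadm-config.yaml (sketch); values are placeholders
    apiVersion: kubeadm.k8s.io/v1beta2
    kind: ClusterConfiguration
    kubernetesVersion: v1.20.4
    controlPlaneEndpoint: "10.0.0.5:6443"   # the ECAM VIP from the earlier sketch
    networking:
      podSubnet: "192.168.0.0/16"           # the pod CIDR Calico's default manifest expects

The first control plane would then be brought up with something like kubeadm init --config kubeadm-config.yaml --upload-certs, and the remaining control planes joined with kubeadm join and the --control-plane flag.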

Networking

A majority of this post has been dedicated to networking, but there is more that has to be discussed at the cluster level.

Internally, the cluster will use Calico as a CNI due to its potential when functioning at scale and easy compatibility with load balancers.

Externally, connections will be handled by Traefik and MetalLB. Traefik will act as an L7 reverse proxy and ingress controller, while MetalLB will act as a network-level load balancer (not something otherwise easily doable for on-prem clusters) and allow the creation of Services of type LoadBalancer, which is very useful for certain services that cannot be as easily handled by Traefik.
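As a sketch of the MetalLB side (the address range is a placeholder, and I am assuming layer 2 mode here; BGP is also an option), the configuration is essentially just a pool of site-local IPs that MetalLB is allowed to hand out to LoadBalancer Services:

    # metallb-config.yaml (sketch); addresses are placeholders
    apiVersion: v1
    kind: ConfigMap
    metadata:
      namespace: metallb-system
      name: config
    data:
      config: |
        address-pools:
        - name: default
          protocol: layer2
          addresses:
          - 10.0.0.240-10.0.0.250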

Storage

The other key aspect of the cluster is storage. The cluster will use Longhorn by Rancher for persistent volume storage. Longhorn has been in the works for a while, and now that it has gone GA I feel justified in relying on it. For databases, CockroachDB will be preferred where available to allow for easy distributed storage with PostgreSQL compatibility.
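For a sense of what this looks like from a service’s perspective, a persistent volume claim simply targets Longhorn’s StorageClass (the claim name and size below are placeholders):

    # pvc.yaml (sketch): a volume backed by Longhorn; name and size are placeholders
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: example-data
    spec:
      accessModes:
        - ReadWriteOnce
      storageClassName: longhorn
      resources:
        requests:
          storage: 10Gi

Longhorn then handles replicating that volume across nodes behind the scenes.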

Finally, in the name of sanity, I will be much more rigorous about backups. The etcd state will be backed up via rsync or similar to discrete, cluster-external storage and kept offline, and volume snapshots will also be taken regularly via Velero to preserve pod data.
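On the Velero side, a recurring backup can be declared as a Schedule resource. This is a sketch with placeholder timing and retention, not a settled policy:

    # velero-schedule.yaml (sketch): nightly cluster backup, kept for 30 days
    apiVersion: velero.io/v1
    kind: Schedule
    metadata:
      name: nightly
      namespace: velero
    spec:
      schedule: "0 4 * * *"     # every night at 04:00
      template:
        ttl: 720h0m0s           # keep each backup for 30 days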

Services

The function of this project is to create a platform which allows for an easy, maintainable, and unified method of deploying new services. As such, listing out all the services I could potentially want to run on it is well outside the scope of this particular series; each service will get its own write-up when I deploy it. However, there are a couple of exceptions I do want to discuss here.

SSO

One of the other main issues with ServerStack is the lack of a coherent sign-in method. [email protected], [email protected], [email protected], [email protected], etc. all have different local sign-in methods, which makes usage a real chore.

I plan to rectify this with [email protected], which will provide single sign-on for all other services. More on this when I have a proper cluster.

Dashboard

All dashboards for the different services will be unified under “[email protected]” and configured so that each controller gets its own prefix (so Traefik would get control.dotfile.sh/traefik/[whatever]).
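As a sketch of how one of those prefixes could be wired up with a Traefik IngressRoute (here assuming Longhorn’s UI service as the example; the route itself is an assumption on my part, not a settled config):

    # control-route.yaml (sketch): give Longhorn's dashboard its own prefix
    # under the control host; strip the prefix before handing traffic over
    apiVersion: traefik.containo.us/v1alpha1
    kind: Middleware
    metadata:
      name: strip-longhorn-prefix
      namespace: longhorn-system
    spec:
      stripPrefix:
        prefixes:
          - /longhorn
    ---
    apiVersion: traefik.containo.us/v1alpha1
    kind: IngressRoute
    metadata:
      name: longhorn-dashboard
      namespace: longhorn-system
    spec:
      entryPoints:
        - websecure
      routes:
        - match: Host(`control.dotfile.sh`) && PathPrefix(`/longhorn`)
          kind: Rule
          middlewares:
            - name: strip-longhorn-prefix
          services:
            - name: longhorn-frontend    # Longhorn's UI service
              port: 80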

TODO: How does each control panel handle the federation?? Can these things be unified cross-cluster? Do I have to put a DNS rule in place to isolate individual sites?

For example, consider:

control.site`[0-9]+`.dotfile.sh

or something to that effect, wherein each site would get its own subdomain.

  • This could also be implemented federation-wide, as it could help isolate service errors to a particular cluster.
  • It would need to be configured at the Traefik level so that if site n got a request for subdomain.site[n].domain.tld, it would route it to the same place it would a request for subdomain.domain.tld. A rough sketch of such a rule follows this list.
    • The reason it would have to be implemented at the Traefik level is that otherwise the service config would differ for each cluster. Every service that used this would also need an entirely different Traefik rule for its siteN subdomain to boot.
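One way such a rule could look, purely as a sketch of the idea (the service name “example” and its port are placeholders, and this is an open question rather than a settled design), is a single IngressRoute that answers for both the global hostname and any per-site hostname:

    # site-aware-route.yaml (sketch): match example.dotfile.sh and any
    # example.siteN.dotfile.sh with one rule; "example" is a placeholder service
    apiVersion: traefik.containo.us/v1alpha1
    kind: IngressRoute
    metadata:
      name: example-site-aware
    spec:
      entryPoints:
        - websecure
      routes:
        - match: HostRegexp(`example.dotfile.sh`, `example.{site:site[0-9]+}.dotfile.sh`)
          kind: Rule
          services:
            - name: example
              port: 80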

Next Steps

With this rough outline done, my next step is to implement it on a small scale to prove that this model is theoretically functional. As time goes on and I learn more this design document will be revised to reflect the current architectural plans. See the datestamp at the top to learn when the last revision was.

EOF