Matthias Winzeler

Founder

Building the Cloud Native Data Center – Part 3: Choosing Your On-Prem Kubernetes Stack


Running Kubernetes on-prem starts with two choices: where your nodes run (VMs or metal) and how you get the platform (build or buy).

In the last article in this series, we learned why on-prem Kubernetes is difficult: HA control planes (LBs, etcd, quorum), ongoing day-2 ops (patching, cert rotation, upgrades) and cluster sprawl.

In this part, we’ll explore two important choices every on-prem Kubernetes endeavour starts with:

  1. Where to run it: on a virtualization platform or directly on bare metal
  2. How to get it: build it yourself or buy a platform

Running on Virtual Machines

Since most orgs already have virtualization in place, this is where most start: Kubernetes clusters run as VMs on vSphere, OpenStack or Proxmox. The virtualization layer is often managed by a dedicated infrastructure team and abstracts away hardware, networking, storage, capacity management and monitoring.

Kubernetes on Virtual Machines

Pros & Cons

Pros of VMs
  • Abstracts networking, storage, capacity, monitoring
  • Clear team boundaries (infra vs platform)
  • Mature operational model
  • Easy to split large hosts into smaller nodes
  • Often achieves higher hardware utilization by consolidating many VMs on shared hosts
  • Built-in scaling and self-service via VM APIs
Cons of VMs
  • Hypervisor/Virtualization is another control plane/orchestrator to operate
  • More complexity and failure surface
  • Licensing and operational overhead
  • Constrained for GPU/SmartNIC-heavy workloads
  • Slight performance overhead

Tackling the hard parts

The last article identified several challenges that this approach helps solve:

  • Load balancers & HA: Often available via LBaaS; easy to put kube-apiserver behind a VIP (see the sketch after this list).
  • Day-2 operations: VM templates/snapshots simplify rollouts and blue-green control plane upgrades.
  • Bootstrap: Straightforward: spin up VMs from a template.
  • Cluster sprawl: Easier to spin up more clusters (but still your platform team’s burden).
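To make the VIP point concrete, here's a minimal sketch of how the load-balanced endpoint typically shows up in a kubeadm-based setup; the hostname and IP are placeholders, not values from the article:

```yaml
# kubeadm ClusterConfiguration sketch: the LBaaS VIP in front of the control
# plane VMs becomes the cluster's stable API endpoint (values are placeholders).
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.30.0
controlPlaneEndpoint: "k8s-api.example.internal:6443"   # DNS name of the VIP
apiServer:
  certSANs:
    - "k8s-api.example.internal"
    - "10.0.10.5"                                        # the VIP itself
```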

When it fits

You already have IaaS and licenses, and/or you plan to buy a K8s platform that integrates with IaaS APIs. This is usually the fastest and lowest-risk path to start.

Running on Bare Metal

Running Kubernetes directly on servers removes the virtualization layer and makes Kubernetes itself the baseline infrastructure platform (meaning you could, for example, also run VMs on top of it).

It simplifies the stack, but usually also shifts responsibility for hardware and lifecycle onto the Kubernetes team.

Kubernetes on Bare Metal

Be aware that Kubernetes on Bare Metal comes with its own interesting challenges which we wrote about in a dedicated blog post: Kubernetes on Bare Metal: The Four Hard Problems

Importantly, you still need three dedicated control plane nodes per cluster, which wastes capacity on large servers. This is where Hosted Control Planes become almost a necessity: they let you pack many control planes onto shared management nodes instead of dedicating whole servers to each one.
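To make the Hosted Control Plane idea a bit more tangible, here's roughly what a hosted control plane looks like when expressed as a custom resource on a management cluster. The sketch uses Kamaji's TenantControlPlane purely as an illustration; names and sizing are placeholders and the exact schema should be checked against the Kamaji docs:

```yaml
# Illustrative sketch: one tenant control plane running as regular pods on the
# management cluster; bare metal workers join it via the exposed endpoint.
apiVersion: kamaji.clastix.io/v1alpha1
kind: TenantControlPlane
metadata:
  name: team-a-prod
  namespace: tenants
spec:
  controlPlane:
    deployment:
      replicas: 3               # API server/controller pods, not dedicated servers
    service:
      serviceType: LoadBalancer # how workers and users reach the API server
  kubernetes:
    version: v1.30.0
  networkProfile:
    port: 6443
  addons:
    coreDNS: {}
    kubeProxy: {}
```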

So you might end up with a setup like this:

Kubernetes on Bare Metal with HCPs

Choose Your Worker Server Profile Wisely

With virtualization, selecting the right server size is easy: the infra team can buy a few beefy hosts and let teams slice them up into VMs of various sizes. On bare metal, on the other hand, your Kubernetes nodes are the physical servers themselves, so you have to make sure the servers you order match your workload and cluster profiles.

When picking bare metal servers for your Kubernetes workers, keep these guidelines in mind:

  • Avoid large servers: As outlined in Problem 1 of our blog post about bare metal problems, you don’t want to run more than 500 pods per node for stability reasons (see the kubelet sketch after this list). For example, if each pod averages 1GB of used RAM (which we found a good rule of thumb for enterprise environments), you won’t ever use more than 512GB RAM per server. Buying more is wasted money and electricity.
  • Avoid very small servers: Small servers (below 128GB RAM) often don’t make efficient use of rack space, power, and networking ports. Each physical node carries a fixed cost in terms of CPU, cabling, switch ports, and management overhead. To use your space and power budget effectively, we found that servers below 128GB rarely make sense.
  • Respect HA & maintenance requirements: Most customers require the cluster to be spread over at least two racks or sites for high availability. So for each cluster, you will need at least two servers (one per rack/site). If you have site- or rack-local persistent storage (a common architecture), you’ll need at least two servers per site, otherwise you can’t take one server down for maintenance while keeping PodDisruptionBudgets (PDBs) happy. In practice, most customers require at least four physical servers per cluster.
  • Multiply per cluster: Once you’ve established the baseline for each cluster (e.g. four servers of 256GB RAM each), multiply that by the number of clusters you plan to run. This quickly adds up to a significant hardware footprint.
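To make the pod-density guideline concrete, here's a hedged sketch of how the per-node cap is typically enforced on the kubelet side; the numbers are illustrative, not a recommendation:

```yaml
# Illustrative KubeletConfiguration: cap pod density and reserve headroom so a
# large bare metal worker doesn't get packed beyond what it can run reliably.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
maxPods: 250              # stays well below the ~500 pods/node stability ceiling
systemReserved:
  cpu: "1"
  memory: 4Gi
kubeReserved:
  cpu: "1"
  memory: 2Gi
evictionHard:
  memory.available: "1Gi"
```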

Given the constraints above, it’s better to scale with many smaller nodes rather than a few large ones. We found that a fleet of 128GB-256GB RAM servers is a good sweet spot for many use cases.

Cluster-as-a-Service Becomes Hard

Since each new cluster now requires at least a few physical servers, the Cluster-as-a-Service model (where developers can self-serve new clusters on demand, or platform teams create dedicated clusters for certain projects/teams) becomes much more difficult. You can mitigate that to some degree by keeping spares (especially when using many smaller servers), but in general, if you expect many dedicated, on-demand clusters, bare metal workers may not be the best fit.

A Note on Costs

Comparing the cost of virtualization vs bare metal for our customers, we found some interesting cost dynamics:

  • Since the base cost per server is high (CPUs in particular are expensive), many small servers (e.g. 256GB RAM) for a bare metal build quickly become more expensive than the fewer, larger servers (e.g. 4TB RAM) you’d purchase when going virtualized.
  • If you expect to end up with a big fleet of smaller servers, also factor in the per-server cost of rack space, power & cooling and networking. We found that these “hidden” costs (especially expensive switch ports) can double the total in many cases.
  • Virtualization not only lets you consolidate many workloads on fewer, larger servers (= cheaper) but also enables better hardware utilization by packing many VMs onto shared hosts. As a result, you’ll need fewer physical servers overall - and this saving on hardware can outweigh the licensing of the virtualization layer (even the crazy Broadcom quotes).

We strongly recommend running the numbers for your own use case, hardware vendor and virtualization licensing costs.

Pros & Cons

Pros of bare metal
  • Simpler overall stack with fewer layers to manage
  • Lower licensing costs
  • Enables use cases that need direct hardware access (GPUs, DPUs, SR-IOV)
  • Establishes Kubernetes as your infrastructure platform
  • Supports KubeVirt to run VMs inside Kubernetes
Cons of bare metal
  • Requires knowledge in areas usually abstracted away by virtualization: hardware failures, firmware, RAID, NICs, PXE provisioning
  • Most likely requires Hosted Control Planes and their management cluster
  • Need to choose server sizes that fit your workload profile
  • Many small servers are usually more expensive than few large ones
  • Higher operational effort to keep machines consistent and healthy
  • Cluster-as-a-Service is difficult due to physical server requirements

Tackling the hard parts

Here’s how this approach relates to the challenges identified in the last article:

  • Control plane HA/LB/etcd: Hosted Control Planes give you many advantages (orchestration, lifecycle management, LBs, …), but they also introduce new complexity such as a separate management cluster.
  • Day-2 operations: Without VM snapshots or templates, rolling out patches and upgrades means touching real machines. A neat way to work around this is immutable, image-based workers. For more on this, you might want to read our article From Metal to Kubernetes Worker.
  • Bootstrap: You start from bare metal. Every server needs to be provisioned, imaged and enrolled before it can join the cluster (a sketch of that enrollment follows below).
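As one possible shape of that enrollment step, here's a hedged sketch using Metal3's BareMetalHost resource, which drives provisioning through the server's BMC; all addresses, credentials and image URLs are placeholders, and this is just one of several tools that can fill this role:

```yaml
# Illustrative Metal3 BareMetalHost: enroll a physical server via its BMC and
# have it imaged with the worker OS (all values are placeholders).
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: rack1-server03
  namespace: metal
spec:
  online: true
  bootMACAddress: "aa:bb:cc:dd:ee:03"
  bmc:
    address: redfish://10.0.20.13/redfish/v1/Systems/1
    credentialsName: rack1-server03-bmc-secret
  image:
    url: https://images.example.internal/worker-os-1.2.3.raw
    checksum: https://images.example.internal/worker-os-1.2.3.raw.sha256sum
```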

When it fits

You want to get rid of virtualization, or you want Kubernetes to become your infrastructure rather than sit on top of it.

Virtual Workers on Bare Metal: Best of Both Worlds?

Some benefits of virtualization are extremely compelling: cost-wise, its consolidation advantages; feature-wise, its ability to dynamically carve small nodes out of larger servers, which is essential for “Cluster-as-a-Service” setups.

Thus, we see the industry moving towards virtualizing Kubernetes workers inside bare metal Kubernetes: using KubeVirt or similar projects to run worker VMs inside the outer bare metal cluster.

This leads to an architecture like the following:

Kubernetes on Bare Metal with HCPs and Virtual Workers

Examples include:

In short, the industry is reintegrating virtualization as a native building block of the Kubernetes platform (rather than as a separate layer underneath it).
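To illustrate the pattern, here's a hedged sketch of a KubeVirt VirtualMachine that an inner cluster could use as a worker node; the image and sizing are placeholders, and real setups typically stamp these out via automation (for example Cluster API's KubeVirt provider) rather than by hand:

```yaml
# Illustrative KubeVirt VirtualMachine acting as a worker node of an inner,
# virtualized cluster; the container disk image and sizing are placeholders.
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: inner-cluster-worker-01
  namespace: inner-cluster
spec:
  running: true
  template:
    spec:
      domain:
        cpu:
          cores: 8
        resources:
          requests:
            memory: 32Gi
        devices:
          disks:
            - name: rootdisk
              disk:
                bus: virtio
      volumes:
        - name: rootdisk
          containerDisk:
            image: registry.example.internal/worker-node-image:1.2.3
```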

Build or Buy the Platform

Once you’ve decided where your nodes run, the question remains how you’ll get Kubernetes.

  • Build it yourself: You assemble the platform. You’ll pick OS, bootstrap tooling (kubeadm/Talos), wire CNI/CSI/Ingress, stand up GitOps/monitoring/backups, automate upgrades, and integrate with load balancers, PKI, IAM, and image pipelines.
  • Buy a platform: You adopt a pre-integrated distribution. You run the installer, connect to your IaaS or metal, and get opinionated defaults for networking, storage, upgrades, and consoles. You focus on landing zones, guardrails, and enablement; the vendor provides lifecycles and support.

Build it Yourself

You bootstrap clusters with kubeadm or Talos, wire all components yourself, and automate upgrades, observability, backups, and multi-cluster tooling.

Tackling the hard parts

Possible approaches for handling the challenges identified in our last article:

  • Control plane HA/LB/etcd: You design and automate it (LB VIPs, etcd backup/restore, quorum).
  • Day-2 ops: You build the upgrade train (OS, kubelet, etcd, control plane).
  • Cluster sprawl: You script multi-cluster with CAPI & GitOps; still your responsibility (a sketch follows below).
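As a sketch of what that multi-cluster scripting can look like, here's a Cluster API Cluster object; the referenced control plane and infrastructure objects (a kubeadm control plane and a vSphere cluster in this example, both chosen for illustration) are defined separately and would be templated by your GitOps tooling:

```yaml
# Illustrative Cluster API resource: each cluster becomes a declarative object
# that GitOps tooling can stamp out; the referenced objects live alongside it.
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: team-a-prod
  namespace: clusters
spec:
  clusterNetwork:
    pods:
      cidrBlocks: ["10.244.0.0/16"]
    services:
      cidrBlocks: ["10.96.0.0/12"]
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: team-a-prod-control-plane
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: VSphereCluster
    name: team-a-prod
```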

You can adopt advanced patterns like Hosted Control Planes, immutable workers or worker slicing with KubeVirt as needed if you are prepared to integrate and operate them yourself.

Pros & Cons

Pros
  • Maximum control over components and roadmaps
  • No vendor lock-in, easy to swap parts
  • Deep internal expertise and ownership
  • Flexible to unusual networking or security needs
  • Can adopt advanced patterns (Hosted Control Planes, immutable workers) on your own timeline
Cons
  • Several FTEs required to build the whole platform
  • Higher risk and longer time to readiness
  • You own HA/LB/etcd, upgrades, and incident response
  • Fleet and multi-cluster tooling to build and maintain
  • You must integrate patterns like Hosted CPs or immutable workers yourself

Buy a Platform

You install a supported distribution (e.g., OpenShift, Tanzu, Rancher, Canonical) and get opinionated defaults, supported lifecycles, and integrated multi-cluster tooling out of the box.

Tackling the hard parts

Typically, the platform will ship with solutions to the challenges identified in our last article:

  • Control plane HA & upgrades: Standardized and supported; fewer bespoke scripts
  • Day-2 ops: Vendors provide patch cadence, cert rotation, and health checks
  • Cluster sprawl: Built-in fleet/multi-cluster management reduces toil (not zero, but less)
  • Bootstrap: Still your job to provide infra (LB, storage, network), but installers simplify the process

Note: Many modern platforms include patterns like Hosted Control Planes, immutable worker images or worker slicing with virtualization out of the box.

Pros & Cons

Pros
  • Fastest path to production with opinionated defaults
  • Supported upgrades, cert rotation, and lifecycle
  • Integrated networking, storage, monitoring, and RBAC
  • Vendor certifications and ecosystem integrations
  • Many include Hosted Control Planes, immutable worker models, or virtual worker pools
Cons
  • Licensing and subscription costs
  • Less flexibility for low-level components
  • Platform boundaries may constrain advanced use cases
  • Risk of vendor dependence over time
  • If advanced patterns are missing, you depend on the vendor’s roadmap

Future Options to Watch

The on-prem Kubernetes space is evolving quickly. In just the past year, several new players and patterns have emerged aiming to make it feel more like the cloud:

  • Omni (by Sidero): brings Talos-based Kubernetes clusters to bare metal with a cloud-like control plane
  • Spectro Cloud Palette: full-stack Kubernetes lifecycle with multi-cluster governance
  • Canonical MicroCloud: a lightweight cluster platform built around LXD
  • Kamaji and HyperShift: open-source projects pushing the “Hosted Control Planes” pattern forward

This is still a fast-moving market, with new approaches appearing and maturing quickly. It’s worth watching if you’re planning a longer-term Kubernetes strategy.

…and of course, us!

At meltcloud, we were surprised that none of the existing platforms fully solve all of these challenges. So we’re building it ourselves: a cloud-like Kubernetes platform for your own hardware.

We combine several patterns we’ve talked about:

  • Hosted Control Planes to avoid control plane waste on bare metal, hosted on an appliance to solve the bootstrap/management problem
  • Immutable Workers to remove day-2 pain on bare metal (how it works)
  • Elastic Pools to slice up large bare metal nodes into virtual workers for multi-cluster scenarios (docs)

The goal: make Kubernetes behave like GKE, AKS or EKS, but in your own data center. If that sounds interesting, here’s our Platform Overview.

Wrapping up

These two choices (where your nodes run and how you get Kubernetes) shape how your teams work with the platform every day. There’s no right answer - it depends on your people, skills and goals.

Continue reading

This article is part of a series. Next up: Part 4: Taming the network jungle: CNI choices, L2/L3 realities, firewalls, east-west vs north-south, and other fun topics.