Matthias Winzeler

Founder

Kubernetes on Bare Metal: The Four Hard Problems


Kubernetes doesn't need virtualization. On bare metal you cut out the hypervisor tax, talk directly to hardware, and run accelerators like GPUs or SmartNICs at full speed. That's why many teams are asking: why run Kubernetes through virtualization, when you could run it on the metal?

After years of layering Kubernetes on top of virtualization platforms, teams are starting to look at bare metal again. Let’s look at why this can be a good idea – and at the four hard nuts that need cracking along the way.

Why Bare Metal Kubernetes?

The reasons are both technical and economic:

  • Eliminate the virtualization tax
  • Get direct access to GPUs and SmartNICs
  • Run both Containers and VMs within Kubernetes
  • Regain sovereignty over infrastructure decisions

Basically, bare metal runs Kubernetes without the extra layers in between. Let’s drill down to learn what that means:

No Virtualization Tax and Complexity

For years, most teams ran Kubernetes on top of hypervisors by default. But that adds its own overhead:

  • Operational overhead: Teams run two control planes: the hypervisor’s and Kubernetes’. Both need patching, upgrades, and monitoring.
  • Resource duplication: CPU, RAM, and storage are allocated to the hypervisor and then again to the workloads inside Kubernetes.
  • Vendor lock-in: Infrastructure lifecycles are dictated by hypervisor release schedules and licensing terms.

The recent Broadcom–VMware licensing changes were a wake-up call. Many organizations discovered that their Kubernetes strategy was indirectly controlled by a proprietary hypervisor vendor. Moving Kubernetes directly onto bare metal removes this dependency. Kubernetes becomes the only control plane you have to manage.

Virtual vs Bare Metal Kubernetes Node

GPU-Accelerated Workloads for ML and AI

Bare metal gives containers direct access to NVIDIA GPUs without a hypervisor layer. This is critical for GenAI and ML training and inference, where even a small overhead translates to wasted GPU hours.

  • The NVIDIA device plugin exposes GPUs to pods via resource requests (nvidia.com/gpu under resources.limits).
  • MIG (Multi-Instance GPU) partitions an A100 or H100 into multiple isolated instances, each schedulable as a distinct GPU resource in Kubernetes.
  • The NVIDIA GPU Operator automates driver, runtime, and monitoring installation across the cluster.

If you want to run your own GPU platform with NVIDIA, you probably want to run your Kubernetes on the metal.
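
Put together, a pod that claims a GPU looks roughly like this – a minimal sketch, assuming the device plugin (or GPU Operator) is already installed; the image tag is just an example:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04  # example image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1  # with MIG enabled, request a profile instead, e.g. nvidia.com/mig-3g.40gb: 1
```

The device plugin advertises nvidia.com/gpu as an extended resource, so the scheduler places the pod only on nodes that actually have a free GPU.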

Latency-Sensitive Networking Workloads

Network-intensive applications such as 5G packet processing, real-time trading, or edge routing require predictable latency. Bare metal makes it possible to use specialized hardware directly:

  • SR-IOV exposes NIC queues directly to pods, avoiding the overhead of virtual bridges.
  • SmartNICs and DPUs can offload functions like encryption, packet filtering, or storage acceleration and be consumed by workloads as Kubernetes resources.

These approaches are only practical on bare metal, since virtualization introduces jitter and hides direct device access.
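
As a sketch, a pod attaching an SR-IOV virtual function might look like this – assuming Multus and the SR-IOV device plugin are installed; the network name sriov-net and the resource name are hypothetical and depend on your device plugin configuration:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: fast-path-app
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-net  # secondary interface via Multus (hypothetical name)
spec:
  containers:
    - name: app
      image: registry.example.com/dpdk-app:latest  # placeholder image
      resources:
        limits:
          intel.com/sriov_netdevice: "1"  # VF resource; name depends on SR-IOV device plugin config
```

The VF is handed to the pod as a real PCI device, so the data path bypasses the software bridge entirely.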

Cloud-Native Virtualization

If you are on bare metal, you can run more than just containers: With KubeVirt, you can run VMs alongside pods, managed by the same scheduler and storage stack.

The difference from a traditional hypervisor is that Kubernetes becomes the platform. Virtualization is just another workload it orchestrates.
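
A minimal VirtualMachine manifest shows what that looks like – a sketch assuming KubeVirt is installed; the container disk image is a public example:

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: demo-vm
spec:
  running: true
  template:
    spec:
      domain:
        resources:
          requests:
            cpu: "1"
            memory: 2Gi
        devices:
          disks:
            - name: rootdisk
              disk:
                bus: virtio
      volumes:
        - name: rootdisk
          containerDisk:
            image: quay.io/containerdisks/fedora:latest  # example image
```

kubectl get vm then lists it next to your deployments; scheduling, networking, and storage go through the same Kubernetes machinery as pods.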

Want to start moving off VMware? Now is a good time to begin, as KubeVirt continues to mature.

The Four Hard Problems

While bare metal is powerful, it also introduces new problems. Let’s dissect them:

1: Supersized Nodes Max Out Kubernetes Before Hardware

Modern servers pack dozens of cores and terabytes of RAM. They are designed to host hundreds of VMs under a hypervisor. Kubernetes, by contrast, tends to hit limits based on the number of pods per node long before the hardware runs out.

  • Kubernetes recommends ~110 pods per node.
  • Tuning kubelet (--max-pods) and switching to eBPF CNIs like Cilium to replace kube-proxy can raise this into the hundreds or thousands.
  • But scaling kubelet, systemd, and scheduling loops becomes the real bottleneck long before hardware is fully used.
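
Raising the limit itself is simple – it’s the consequences that hurt. A kubelet configuration sketch (values are illustrative):

```yaml
# /var/lib/kubelet/config.yaml (illustrative values)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
maxPods: 500  # default is 110
# At this density, also verify that each node's pod CIDR has enough addresses
# (controller-manager --node-cidr-mask-size) and that the CNI can keep up.
```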

Let’s take a look at some tests we recently did on dual-socket Skylake servers (64 cores, 1 TB RAM):

  • kubelet resource use grew almost linearly with pod count, but cAdvisor response times grew exponentially, leading to unpredictable behaviour across the cluster.
  • systemd bottlenecks surfaced under bulk pod creation, as its single-threaded cgroup manager hit timeouts.
      ...
      "Error syncing pod, skipping" err="failed to ensure that the pod: 3d1387a1-5b79-468f-af46-b41cd6c4b1cf
      cgroups exist and are correctly applied: failed to create container for [kubepods besteffort pod3d1387a1-5b79-468f-af46-b41cd6c4b1cf] :
      unable to start unit \"kubepods-besteffort-pod3d1387a1_5b79_468f_af46_b41cd6c4b1cf.slice\" :
      Timeout waiting for systemd to create kubepods-besteffort-pod3d1387a1_5b79_468f_af46_b41cd6c4b1cf.slice"
      ...
    
    These timeouts pile up in systemd and leave the node unstable.
  • Scheduling latency rose sharply beyond ~1,200 pods: pod creation slowed noticeably even though deletion time stayed constant.
  • Drain storms: In steady state, the node ran 2,000 pods happily. But draining it – descheduling 2,000 pods at once – blocked the node completely, rendering it unresponsive. Those 2,000 pods then had to be rescheduled all at once onto the remaining nodes, putting significant load on them. As the saying goes: when you throw a big stone into a pond, you get a big splash.

Takeaway: The hardware could, in theory, host thousands of pods, but Kubernetes’ control loops and systemd make such densities fragile. In practice, you probably don’t want to run more than 500 pods per node.

Possible Remedies

  • Use smaller servers: Multi-node chassis like the Dell C6620, HPE Apollo 2000, or Supermicro BigTwin are an interesting option here, since they pack four independent servers into 2U.
  • Virtualize: Split large machines into multiple VMs, each acting as a smaller Kubernetes node (see Problem 4, where this could also be a remedy)
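
Independent of node sizing, PodDisruptionBudgets take some sting out of drain storms: kubectl drain uses the eviction API, which respects PDBs, so evictions proceed gradually instead of all at once. A sketch (the label selector is hypothetical):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  maxUnavailable: 10  # at most 10 pods of this app are evicted at a time
  selector:
    matchLabels:
      app: web  # hypothetical label
```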

2: Control Planes Waste Bare Metal Resources

Kubernetes clusters traditionally dedicate three nodes to control plane components. This architecture ensures high availability, but on bare metal it wastes resources:

  • Light workload, heavy footprint: Control plane processes typically consume only a few vCPUs and a few gigabytes of RAM. Dedicating entire bare metal servers to them wastes most of the hardware – even small hardware configurations run at only a fraction of their capacity.

    Example Control Plane Utilization on Bare Metal
  • Cost scales with cluster count: Each new cluster needs its own control plane set. In multi-cluster environments, this multiplies idle hardware and wastes rack space.

  • Lifecycle operations: Even if underutilized, control plane servers must be patched, monitored, and upgraded like production systems.

Bare metal clusters work fine, but dedicating full servers to control planes wastes a lot of hardware and effort.

Possible Remedies

Projects like Kamaji or HyperShift have come up with a smart approach: Hosted Control Planes, where Kubernetes control planes are themselves hosted within Kubernetes. Instead of dedicating entire servers per cluster, control planes run as workloads inside an existing cluster – just another set of pods, which reduces waste and aligns lifecycles.
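
With Kamaji, for example, a tenant control plane is declared as a custom resource and materializes as pods in the management cluster – a sketch only; field names may differ between Kamaji versions:

```yaml
apiVersion: kamaji.clastix.io/v1alpha1
kind: TenantControlPlane
metadata:
  name: team-a
spec:
  kubernetes:
    version: v1.30.0
  controlPlane:
    deployment:
      replicas: 2  # API server replicas as pods, not dedicated machines
    service:
      serviceType: LoadBalancer
```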

However, this introduces another challenge – the classic bootstrap problem: In an empty data center, how do you create the first Kubernetes cluster to host these control planes?

3: Managing Bare Metal Workers Is Tricky

Even with control planes offloaded to Hosted Control Planes, the worker nodes remain an operational challenge: Unlike in the cloud or with virtualization, where workers are disposable VMs, bare metal workers are physical servers that must be provisioned, updated, and kept in sync over time.

The challenges include:

  • Enrollment complexity: Each machine needs firmware prep, partitioning, OS install, kubelet bootstrap, and runtime configuration.
  • Configuration drift: If handled manually or with ad-hoc scripts, nodes fall out of sync. Some miss patches, others fail after reboot.
  • Two separate lifecycles: The OS (kernel, systemd, drivers) and Kubernetes (kubelet, runtime) must both be updated and aligned.
  • Vendor sprawl: Hardware-specific firmware and drivers introduce another layer of dependencies.

Cloud platforms solved this long ago with golden images and immutability.

But on bare metal, there is no native mechanism for image-based provisioning.

Possible Remedies

One way forward is to run bare metal workers like containers: immutable, stateless, and easy to replace.

At meltcloud, we build on Unified Kernel & System Images with ISO-based enrollment and atomic updates – a topic we’ll cover in an upcoming blog post – while other projects like Talos Linux or Bottlerocket follow similar principles.

4: Cluster-as-a-Service Is Hard With Physical Servers

While it remains desirable to run a few large, shared clusters, there are reasons to dedicate clusters to individual teams:

  • Operators needing higher privileges
  • Teams unable to align on Kubernetes version upgrades
  • Workloads requiring root access or custom configurations

In these cases, a cluster-as-a-service model becomes necessary, giving each team its own isolated Kubernetes environment.

On bare metal, however, this runs into the same challenges we had in the pre-virtualization era: static, physical servers that cannot be provisioned elastically. The smallest possible cluster (with more than one worker for high availability) already consumes significant hardware, and each additional cluster multiplies the cost.

Cluster as a Service with Physical Servers
Very expensive, very static clusters

The result: what feels lightweight in the cloud becomes heavyweight on bare metal, where dedicated clusters mean dedicated machines.

Possible Remedies

Two approaches could bring flexibility back to bare metal:

  • Virtual clusters → tools like vCluster and its vNode add virtual control planes and isolated worker capacity inside a shared cluster.
  • KubeVirt workers → run lightweight VMs as worker nodes to carve up resources dynamically, for example using our Elastic Pools (Preview) feature, which builds on KubeVirt and Karpenter to create virtual worker pools on bare metal.

On bare metal, it may make more sense to run a few large clusters and use virtual clusters or virtual nodes to mimic many clusters without the hardware overhead.

Where Do We Go From Here?

Running Kubernetes on bare metal simplifies your stack and gives you full control of your hardware, but it also exposes challenges that the cloud usually hides.

  • Supersized nodes overload kubelet, systemd, and scheduling long before the hardware is fully used.
  • Control planes consume more bare metal capacity than they need when deployed in the traditional way.
  • Worker lifecycle management gets complex without image-based, immutable approaches.
  • Cluster-as-a-service is harder to achieve when each cluster requires dedicated physical servers.

None of this rules out bare metal, but it does mean you can’t run it the same way you run cloud clusters.

How are you approaching these challenges? Are you already experimenting with Hosted Control Planes? Are you running workers as immutable images or even as KubeVirt VMs?

We’d love to hear how you’re running Kubernetes on bare metal.

In our next post, we’ll dive deep into how we turn bare metal servers into Kubernetes workers: using immutable images, stateless design, and atomic updates.