
VPA for Workloads with Heterogeneous Resource Requirements

Exploring new use cases for Vertical Pod Autoscaling in Kubernetes

My first Kubernetes Enhancement Proposal (KEP) is officially underway!

Background

The key way that the Google Cloud Managed Service for Prometheus scales is by being deployed as a DaemonSet. This gives every workload in the cluster that exposes metrics a Prometheus instance local to its node to scrape them. That makes scraping pretty reliable, and it's an architectural decision that my team has been happy with.

We deploy our managed Prometheus to a wide variety of environments with different requirements, so setting resource requests and limits is a confounding issue. Some of our customers are just getting started and the Prometheus pods sit mostly idle, with few metrics to scrape. Others consider metrics an essential part of observing very large-scale applications, and push those little Prometheus collector pods to their limits. Because my team is responsible for setting up the DaemonSet in question, we have to decide the resource requests and limits to specify in the manifest, and they apply to all of our customers.
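To make the problem concrete, here is a minimal sketch of what that kind of one-size-fits-all resources stanza looks like in a DaemonSet manifest. The names, image, and values are illustrative placeholders, not our actual defaults.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: collector                  # hypothetical name for the per-node Prometheus collector
spec:
  selector:
    matchLabels:
      app: collector
  template:
    metadata:
      labels:
        app: collector
    spec:
      containers:
        - name: prometheus
          image: example.com/prometheus-collector:latest   # placeholder image
          resources:
            requests:
              cpu: 100m            # applied identically on every node,
              memory: 200Mi        # regardless of how many metrics that node serves
            limits:
              memory: 2Gi
```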

When the actual requirements of the customer’s workloads are different from our generic solution, this can lead to issues. A few examples:

  • If the memory requests are set too low, the scheduler can overcommit the node’s actual memory capacity, and workloads get killed with Out Of Memory (OOM) errors.
  • If the CPU request or limit is too low, our collectors get throttled, or customer workloads don’t get the CPU capacity they need.
  • If the CPU or memory requests are set too high, the scheduler will displace workloads that could otherwise fit on the node, wasting capacity.
  • If memory limits are set too low, our Prometheus instances will hit the limit and crash with an OOM error.

We have explored several options to attempt to meet the needs of our customers. But most of them ultimately amount to us guessing or making the customer guess at the level of resources to allocate. That is, except for our new approach of using Vertical Pod Autoscaling.

Vertical Pod Autoscaling (VPA)

With VPA, the resource usage of the DaemonSet’s pods is observed, and their resource allocations are adjusted periodically. For many customers, this is an effective solution, even with some known limitations.
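For reference, wiring VPA up to a DaemonSet amounts to pointing a VerticalPodAutoscaler object at it. This is a minimal sketch, assuming hypothetical object and container names rather than our shipped configuration.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: collector-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: DaemonSet
    name: collector              # the DaemonSet whose pods VPA should right-size
  updatePolicy:
    updateMode: "Auto"           # let VPA apply its recommendations
  resourcePolicy:
    containerPolicies:
      - containerName: prometheus
        controlledResources: ["cpu", "memory"]
```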

One limitation that is not currently addressed is workloads that have heterogeneous resource requirements under the same controller. In our case, when one node hosts a disproportionate share of the metrics being collected, that Prometheus instance will require more resources than its “peer” pods on other nodes. The current behavior of VPA is that all pods under the same controller get the same resources allocated.
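You can see why in the shape of the VPA object itself: recommendations in its status are keyed by container name, so every pod the DaemonSet creates receives the same values. The numbers below are made up purely for illustration.

```yaml
status:
  recommendation:
    containerRecommendations:
      - containerName: prometheus
        lowerBound:
          cpu: 25m
          memory: 64Mi
        target:                  # one recommendation for *all* collector pods,
          cpu: 100m              # whether their node serves 100 series or 100,000
          memory: 256Mi
        upperBound:
          cpu: 1
          memory: 1Gi
```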

Some applications, like Kube State Metrics, are known to expose many metrics, which in turn generates load for the collector pod local to the Kube State Metrics instance. In large clusters, it’s easy to reach tens of thousands of time series from KSM alone. This is more than enough to unbalance the collection load for the Prometheus instances.

The specifics of the differences in load will vary, but we observe Prometheus instances that consume about 32MB of memory on the lower end and several gigabytes of memory on the higher end. This is a big enough range to wreak havoc on the scheduler, for the same reasons described above that led us to VPA in the first place.

Improving VPA

I propose that, instead of uniformly applying the same resource requests and limits to all the pods, each pod be considered individually. With this enhancement, we could still use VPA to address the disparate resource requirements of our customers, without making the flawed assumption that metrics load is evenly distributed across the cluster.
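Purely to illustrate the idea, per-pod recommendations might conceptually look something like the sketch below, keyed by pod as well as by container. To be clear, this is not the KEP’s API; the actual shape is one of the open design questions.

```yaml
# Hypothetical sketch only; the real API shape is still an open design question.
status:
  recommendation:
    podRecommendations:
      - podName: collector-abcde        # pod on a quiet node
        containerRecommendations:
          - containerName: prometheus
            target:
              cpu: 50m
              memory: 64Mi
      - podName: collector-fghij        # pod on the node hosting kube-state-metrics
        containerRecommendations:
          - containerName: prometheus
            target:
              cpu: 500m
              memory: 2Gi
```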

As of this writing, I am just getting started and many design decisions remain to be made, but I am excited to begin this project! If you think you might find this feature useful, please share your experience with me so I can get a broader perspective on how to meet the needs of other Kubernetes users.

The KEP

Check out the progress of the proposal on GitHub!
