Cutting Kubernetes Cloud Costs 35% With These Tweaks
Most Kubernetes cost-cutting guides tell you to right-size your pods and use spot instances. You’ve heard this. You’ve probably done it. And you’re still overpaying.
The 35% savings I’m about to describe didn’t come from the obvious levers. They came from one specific, under-discussed area: the gap between what your cluster requests and what it actually uses during real production workloads. This gap is where cloud providers make their margin on Kubernetes customers, and it’s where you can claw that money back.
The Request-Usage Gap Is Your Biggest Leak
Here’s the uncomfortable truth about Kubernetes resource management: developers set CPU and memory requests based on worst-case scenarios, load tests run once, or pure guesswork. These requests become your billing floor.
In a recent engagement, we analyzed a 47-node production cluster running a mix of microservices. The headline numbers:
- Average CPU utilization: 23%
- Average memory utilization: 41%
- Requested vs. used ratio: 3.2x for CPU, 1.8x for memory
That 3.2x CPU ratio means they were paying for more than three times the compute they actually consumed. Not during traffic spikes, but as a baseline, around the clock.
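To make the ratio concrete, here is a minimal sketch of the arithmetic. The core counts are hypothetical figures chosen only to reproduce a 3.2x ratio; they are not measurements from the cluster above, and the function names are mine.

```python
def request_usage_ratio(requested: float, used: float) -> float:
    """How many times larger the request is than the observed usage."""
    if used <= 0:
        raise ValueError("used must be positive")
    return requested / used


def idle_share_of_requests(requested: float, used: float) -> float:
    """Fraction of the requested capacity that is reserved but not consumed."""
    if requested <= 0:
        raise ValueError("requested must be positive")
    return 1.0 - used / requested


# Hypothetical totals (cores), picked to match a 3.2x ratio.
cpu_requested_cores = 320.0
cpu_used_cores = 100.0

ratio = request_usage_ratio(cpu_requested_cores, cpu_used_cores)
idle = idle_share_of_requests(cpu_requested_cores, cpu_used_cores)
print(f"ratio: {ratio:.1f}x, idle share of requests: {idle:.0%}")
```

At a 3.2x ratio, a little over two thirds of the requested CPU is reserved but never used, which is the slack the rest of this article goes after.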
The standard advice here is “use the Vertical Pod Autoscaler.” This is correct but incomplete. VPA’s recommendation mode is genuinely useful for understanding where you’re over-provisioned. But its auto-update mode is rarely production-safe — it evicts pods to resize them, which can cause cascading failures in tightly-coupled services. More importantly, VPA treats each deployment in isolation when your cost problem is actually a cluster-wide bin-packing issue.
The real optimization happens at the intersection of pod sizing and node selection, which brings us to the tweak that moved the needle most.
Cluster Autoscaler Tuning Nobody Talks About
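The Cluster Autoscaler has a parameter called --scale-down-utilization-threshold. The default is 0.5, meaning a node is considered underutilized and eligible for removal when its requested utilization falls below 50%. Requested utilization is the larger of the summed CPU requests and the summed memory requests on the node, each divided by the node’s allocatable capacity. It is computed from requests, not from observed usage.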
This default is conservative for good reason: removing nodes is risky. But it’s also why your cluster probably runs 20-40% more nodes than necessary.
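A simplified sketch of that eligibility check helps when reasoning about which way to move the threshold. The Node class and the sample nodes are hypothetical, and the real autoscaler applies further conditions (PDBs, pods that cannot be moved, cooldown timers) before it removes anything.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    cpu_allocatable: float  # cores
    mem_allocatable: float  # GiB
    cpu_requested: float    # sum of pod CPU requests, cores
    mem_requested: float    # sum of pod memory requests, GiB


def requested_utilization(node: Node) -> float:
    """Larger of the CPU and memory request fractions for the node."""
    return max(node.cpu_requested / node.cpu_allocatable,
               node.mem_requested / node.mem_allocatable)


def scale_down_candidates(nodes, threshold=0.5):
    """Names of nodes whose requested utilization is below the threshold."""
    return [n.name for n in nodes if requested_utilization(n) < threshold]


# Hypothetical nodes, not taken from the cluster described in this article.
nodes = [
    Node("node-a", 8, 32, cpu_requested=2.4, mem_requested=8),     # about 0.30
    Node("node-b", 8, 32, cpu_requested=3.6, mem_requested=14.4),  # about 0.45
    Node("node-c", 8, 32, cpu_requested=4.8, mem_requested=12),    # about 0.60
]

for t in (0.35, 0.5, 0.65):
    print(t, scale_down_candidates(nodes, threshold=t))
```

In this sketch the candidate list grows as the threshold rises: only node-a qualifies at 0.35, node-a and node-b at 0.5, and all three at 0.65.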
In the 47-node cluster I mentioned, we found:
- 11 nodes consistently at 30-45% utilization
- These nodes hosted pods that could fit on other nodes if the autoscaler were more aggressive
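- Inflated requests kept these nodes at or above the 0.5 threshold, which is evaluated against requests rather than usage, so they were “safe” and never considered for removal
We raised the threshold so that these nodes qualified for removal (lowering it, for example to 0.35, would do the opposite and shrink the candidate set) and lengthened --scale-down-delay-after-add to 15 minutes from the default 10 to prevent thrashing. The result: the cluster stabilized at 34 nodes during normal operation. That’s a 28% reduction in node count (47 to 34) from two flag changes.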
The second autoscaler tweak that matters is --expander. The default random expander picks arbitrarily between node groups when scaling up. If you’re using multiple instance types — which you should be for cost optimization — this randomness defeats the purpose. Switch to least-waste or priority, and configure your node groups so cheaper instance types are preferred when they can accommodate the pending pods.
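The idea behind least-waste can be sketched in a few lines. This is a simplified illustration, not the autoscaler’s actual scoring code: the NodeGroup class, the instance sizes, and the scoring formula are assumptions, and it treats the pending requests as one divisible lump rather than scheduling individual pods.

```python
import math
from dataclasses import dataclass


@dataclass
class NodeGroup:
    name: str
    cpu_per_node: float  # cores
    mem_per_node: float  # GiB


def waste_score(group, pending_cpu, pending_mem):
    """Idle CPU fraction plus idle memory fraction after a scale-up."""
    nodes_needed = max(1,
                       math.ceil(pending_cpu / group.cpu_per_node),
                       math.ceil(pending_mem / group.mem_per_node))
    total_cpu = nodes_needed * group.cpu_per_node
    total_mem = nodes_needed * group.mem_per_node
    return ((total_cpu - pending_cpu) / total_cpu
            + (total_mem - pending_mem) / total_mem)


def pick_least_waste(groups, pending_cpu, pending_mem):
    """Node group that leaves the smallest idle share for the pending load."""
    return min(groups, key=lambda g: waste_score(g, pending_cpu, pending_mem))


# Hypothetical node groups and pending requests.
groups = [
    NodeGroup("small-4cpu", cpu_per_node=4, mem_per_node=16),
    NodeGroup("large-16cpu", cpu_per_node=16, mem_per_node=64),
]

best = pick_least_waste(groups, pending_cpu=3, pending_mem=10)
print(best.name)
```

With 3 cores and 10 GiB pending, the small group wins because a single large node would sit mostly empty. A random choice picks the large group about half the time.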
This combination — aggressive scale-down thresholds plus intelligent node selection — accounted for roughly 20% of the total savings.
The Pod Disruption Budget Trap
Here’s where it gets counterintuitive. Tight Pod Disruption Budgets protect availability, but they also protect your cloud bill from optimization.
When PDBs prevent pods from being evicted, they anchor workloads to specific nodes. This blocks both the cluster autoscaler and manual defragmentation efforts. The autoscaler will mark a node as removable, attempt to drain it, hit the PDB wall, and give up. The node stays running.
We found 8 deployments with maxUnavailable: 0 PDBs that were effectively pinning nodes in place. These weren’t critical services — they were batch processors and internal tools that someone had copy-pasted production PDB settings onto years ago.
The fix wasn’t removing PDBs entirely. It was auditing them against actual availability requirements:
- Services with sub-second latency SLAs: keep tight PDBs
- Async workers with retry logic: relax to maxUnavailable: 1 or higher
- Development and staging workloads: remove PDBs entirely
This alone freed 4 additional nodes for removal during off-peak hours.
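An audit like this can start from a small script. The sketch below flags budgets that allow no voluntary eviction at the current replica count, assuming all pods are healthy. The workload names and replica counts are hypothetical, and the input is plain dictionaries rather than a live cluster query; in practice the spec fields would come from the cluster’s PDB objects.

```python
import math


def resolve(value, replicas):
    """Turn an int or a percentage string such as "50%" into a pod count."""
    if isinstance(value, str) and value.endswith("%"):
        # Kubernetes rounds PDB percentages up to the nearest whole pod.
        return math.ceil(replicas * int(value[:-1]) / 100)
    return int(value)


def pins_every_pod(pdb_spec, replicas):
    """True if the budget allows no voluntary eviction at this replica count."""
    if "maxUnavailable" in pdb_spec:
        return resolve(pdb_spec["maxUnavailable"], replicas) == 0
    if "minAvailable" in pdb_spec:
        return resolve(pdb_spec["minAvailable"], replicas) >= replicas
    return False


# Hypothetical workloads: (name, PDB spec fields, replica count).
workloads = [
    ("batch-processor", {"maxUnavailable": 0}, 3),
    ("internal-tool", {"minAvailable": 1}, 1),
    ("checkout-api", {"minAvailable": "50%"}, 4),
    ("async-worker", {"maxUnavailable": 1}, 5),
]

blocking = [name for name, spec, replicas in workloads
            if pins_every_pod(spec, replicas)]
print(blocking)
```

Note the second case: a single-replica workload with minAvailable: 1 pins its node just as firmly as maxUnavailable: 0, and it is easy to miss when you only search for zeros.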
Resource Quotas as a Cost Lever
Namespace-level resource quotas are typically framed as governance tools — preventing one team from starving others. But they’re also your most underused cost control mechanism.
Without quotas, developers default to over-requesting resources because there’s no feedback loop. The request “costs” them nothing within the organization, even though it costs the organization plenty on the cloud bill.
The approach that worked: implement quotas that start at 110% of current usage, then ratchet down monthly. This creates incremental pressure without breaking deployments.
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: team-payments
spec:
  hard:
    requests.cpu: "40"
    requests.memory: "80Gi"
    limits.cpu: "60"
    limits.memory: "120Gi"
```
When a namespace approaches its quota, developers face a choice: optimize their resource requests or justify the overage. In practice, most teams find 20-30% headroom in their requests when they’re forced to look.
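The ratchet itself is simple to plan ahead of time. The sketch below generates a month-by-month quota schedule starting at 110% of a baseline. The 5% monthly cut and the floor are assumptions for illustration; they are not values from the engagement described here, and the right step size depends on how quickly your teams can act on it.

```python
def quota_schedule(baseline, months, monthly_cut=0.05, floor=None):
    """Quota for each month: start at 110% of baseline, shrink by a fixed share.

    baseline    -- the figure you treat as current usage (for example, cores)
    monthly_cut -- assumed fractional reduction applied each month
    floor       -- optional lower bound the quota never drops below
    """
    quota = baseline * 1.10
    schedule = []
    for _ in range(months):
        schedule.append(quota)
        quota = quota * (1 - monthly_cut)
        if floor is not None:
            quota = max(floor, quota)
    return schedule


# Hypothetical namespace: baseline of 100 cores, never below 95.
plan = quota_schedule(baseline=100.0, months=4, monthly_cut=0.05, floor=95.0)
for month, quota in enumerate(plan):
    print(f"month {month}: {quota:.1f} cores")
```

Publishing the schedule in advance matters as much as the numbers: teams that know next month’s quota can plan their request cleanup instead of discovering the limit through a failed deployment.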
The key is visibility. Deploy kube-resource-report or similar tooling to show teams their request efficiency alongside their quota consumption. When developers can see they’re requesting 4 CPU cores but averaging 0.8 cores of actual usage, the conversation shifts from “we need more quota” to “let’s fix these requests.”
The 35% In Aggregate
Here’s how the savings broke down:
- Cluster Autoscaler threshold tuning: ~20%
- PDB audit and node liberation: ~8%
- Request right-sizing driven by quota pressure: ~7%
Those rounded figures already add up to roughly 35%. Accumulated small wins contributed at the margins: removing orphaned PVCs, consolidating redundant monitoring exporters, and deleting test deployments that had been running in production namespaces for 18 months.
None of these changes required architectural rework, application rewrites, or exotic tooling. They required someone to look at the gap between Kubernetes’ view of the cluster and the cloud provider’s bill, then systematically close that gap.

What This Means for Your Cluster
The specific numbers won’t transfer directly to your environment. A cluster running mostly memory-bound workloads will see different ratios than one dominated by CPU-intensive services. Stateful workloads with strict anti-affinity rules limit how aggressively you can bin-pack.
But the diagnostic approach transfers completely: measure the gap between requested and used resources, identify what’s preventing the autoscaler from eliminating slack, and create organizational mechanisms that push resource requests toward reality.
If you’re managing significant Kubernetes spend and haven’t audited these specific parameters recently, there’s almost certainly savings waiting. The infrastructure isn’t broken — it’s just configured for a level of caution that’s costing you money.
We help teams find and close these gaps. If your cloud bill is growing faster than your traffic, we should talk.