Beginner’s Guide to Kubernetes Horizontal Pod Autoscaling (HPA)

Kubernetes Pod Autoscaling: HPA and VPA Complete Guide

Modern cloud-native applications face a fundamental challenge: demand is unpredictable. One moment your cluster is idle; the next, a product launch or viral campaign sends traffic through the roof. Without an intelligent scaling strategy, you are either over-provisioning resources and burning budget, or under-provisioning and delivering a broken experience to users.

Kubernetes solves this with two powerful pod autoscaling mechanisms — Horizontal Pod Autoscaling (HPA) and Vertical Pod Autoscaling (VPA). While they address different dimensions of scaling, they are most effective when understood together. This guide covers how each works, how to configure and test them, when to use one over the other, and how they complement each other in a production environment.

If you are also managing node-level capacity, be sure to read our guide on Node and Cluster Autoscaling alongside this one.

What Is Kubernetes Pod Autoscaling?

Kubernetes pod autoscaling is the process of automatically adjusting the resources allocated to your running workloads based on real-time demand — without manual intervention. It operates at the pod level and is available in two forms:

Horizontal Pod Autoscaling (HPA) adds or removes pod replicas in response to demand. When traffic spikes, more pods are created to share the load. When it drops, excess pods are removed.

Vertical Pod Autoscaling (VPA) increases or decreases the CPU and memory resources allocated to each individual pod. Rather than adding more pods, it makes existing pods stronger.

Both tools serve the same ultimate goal — efficient Kubernetes resource management — but they achieve it in fundamentally different ways. Understanding when to use each is the key to a well-tuned cluster.

Horizontal Pod Autoscaling (HPA): Scaling Out

How HPA Works

HPA automatically adjusts the number of pods in a Deployment or StatefulSet based on real-time resource metrics. When CPU utilization or memory consumption exceeds a defined threshold, Kubernetes adds more pods to distribute the load. When demand drops, it scales back down — all without manual intervention.

Think of it as a self-adjusting load balancer: instead of over-provisioning infrastructure to handle worst-case traffic, HPA lets your cluster breathe naturally with demand.

Why HPA Matters

Without autoscaling, you face a hard choice: provision for peak load and waste money on idle resources during quiet periods, or provision lean and risk degraded performance during spikes. HPA largely removes that tradeoff by matching the replica count to measured demand.

The core benefits are:

Efficient resource utilization — the cluster uses exactly what it needs at any given moment. No idle pods, no unnecessary spend.

Improved application performance — traffic spikes are absorbed by adding pods, keeping response times stable for end users.

Cost savings at scale — dynamic provisioning replaces the guesswork of manual capacity planning, making infrastructure leaner and more agile.

This is why HPA is foundational to any serious DevOps and cloud strategy.

How HPA Works Internally

1. Periodic Metrics Collection

The Metrics Server collects CPU and memory usage for all running pods from the kubelet on each node, every 15 seconds by default. It exposes that data through the Kubernetes Metrics API, which forms the basis for resource-based autoscaling decisions.

2. Autoscaler Logic and Scaling Formula

Once metrics are available, the HPA controller compares current usage against the target threshold you have defined. The desired number of replicas is calculated as:

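Desired Replicas = ceil(Current Replicas × (Current Utilization / Target Utilization))

The result is rounded up to a whole number of pods. The controller also skips scaling when the ratio of current to target utilization is within a small tolerance of 1.0 (10% by default), and it always keeps the result between minReplicas and maxReplicas.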

For example, if your pods are running at 100% CPU and your target is 50%, HPA will double the pod count to restore balance.
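The calculation above can be sketched as a small shell function. This is only an illustration of the arithmetic: the function name and argument order are hypothetical, utilization values are whole percentages, and the tolerance band is ignored.

```shell
# desired_replicas CURRENT_REPLICAS CURRENT_UTIL TARGET_UTIL MIN MAX
# Integer sketch of ceil(current_replicas * current_util / target_util),
# clamped to the [MIN, MAX] replica range. Hypothetical helper, not part of kubectl.
desired_replicas() {
  replicas=$1; current=$2; target=$3; min=$4; max=$5
  # Ceiling division with integer arithmetic: (a + b - 1) / b
  desired=$(( (replicas * current + target - 1) / target ))
  if [ "$desired" -lt "$min" ]; then desired=$min; fi
  if [ "$desired" -gt "$max" ]; then desired=$max; fi
  echo "$desired"
}

desired_replicas 4 100 50 1 10   # prints 8: utilization is twice the target
desired_replicas 3 80 50 1 10    # prints 5: 4.8 is rounded up
desired_replicas 6 100 50 1 10   # prints 10: 12 is capped at the maximum
```

The second call shows why rounding up matters: 4.8 pods is not schedulable, and rounding down would leave the workload above its target.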

3. Stabilization and Cooldown

To prevent "thrashing" (constant up-and-down scaling in response to minor fluctuations), Kubernetes applies a stabilization window. By default, scale-down decisions look back over the last 5 minutes of recommendations, while scale-up has no stabilization delay. This keeps HPA from removing pods on every brief dip and gives the cluster time to settle after a scaling event.
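As a rough sketch of how scale-down stabilization behaves, assume the controller keeps the replica recommendations it computed during the window and acts on the highest one. The function name below is hypothetical and the inputs are made-up sample values.

```shell
# stabilized_replicas REC1 REC2 ...
# Given the desired-replica recommendations computed during the
# stabilization window, print the highest one. A scale-down only
# happens once every recommendation in the window has dropped.
stabilized_replicas() {
  highest=$1
  for rec in "$@"; do
    if [ "$rec" -gt "$highest" ]; then
      highest=$rec
    fi
  done
  echo "$highest"
}

# Load dipped briefly (recommendations of 6, 4, 5) but 8 replicas were
# recommended earlier in the window, so the workload stays at 8 for now.
stabilized_replicas 8 6 4 5   # prints 8
```

Once the recommendation of 8 ages out of the window, the next evaluation would settle on the highest remaining value.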

Key Components of HPA

Metrics Server — collects CPU and memory data across all pods and exposes it via the Kubernetes API. Installing it is the first step to enabling HPA:

bash

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

Autoscaler Controller: a built-in control loop that runs inside kube-controller-manager. It watches your workloads, compares usage against the target, and updates the replica count. By default it re-evaluates every 15 seconds (the --horizontal-pod-autoscaler-sync-period setting).

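Target Workload: the Deployment or StatefulSet that HPA acts on. For utilization-based targets, each container must declare resource requests, because utilization is calculated as a percentage of the requested amount. Limits are optional as far as HPA is concerned.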

Custom Metrics — beyond CPU and memory, HPA can respond to application-level metrics such as requests per second, queue depth, or any metric exposed via a custom metrics adapter. This enables precise, business-aware scaling decisions.
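To illustrate the custom metrics option, here is a minimal sketch of an HPA that scales on a per-pod request rate. It assumes a custom metrics adapter (for example, Prometheus Adapter) is already installed and serving a metric named http_requests_per_second; that metric name, the file name, and the my-app Deployment are placeholders, not values from a real cluster.

```shell
# Write a hypothetical HPA manifest that targets a per-pod custom metric.
cat > hpa-custom-metric.yaml <<'EOF'
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-rps-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second   # assumed metric from your adapter
        target:
          type: AverageValue
          averageValue: "100"              # aim for ~100 requests/s per pod
EOF

# Apply it once the adapter is serving the metric:
#   kubectl apply -f hpa-custom-metric.yaml
```

With a Pods metric, HPA averages the value across all pods of the target and applies the same replica formula it uses for CPU.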

Configuring HPA: Step-by-Step

Prerequisites

Before configuring HPA, confirm you have:

  • A running Kubernetes cluster
  • kubectl installed on your local machine
  • An application with defined resource requests (limits are recommended, but HPA only needs requests)
  • Metrics Server installed in the cluster
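If you do not yet have a workload that meets these prerequisites, the following sketch writes a minimal Deployment with resource requests set. The name my-app matches the examples in this guide; the nginx image, the label, the file name, and the request and limit values are placeholder assumptions to adjust for your own application.

```shell
# Write a minimal Deployment manifest with CPU and memory requests.
cat > my-app-deployment.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: nginx            # placeholder image
          resources:
            requests:
              cpu: 100m           # HPA utilization is a percentage of this
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 256Mi
EOF

# Create the workload in your cluster:
#   kubectl apply -f my-app-deployment.yaml
```

With a 100m CPU request, a 50% utilization target means HPA starts adding pods once average usage per pod rises above roughly 50m.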

Option 1: Quick Setup via kubectl

bash

kubectl autoscale deployment my-app --cpu-percent=50 --min=1 --max=10

This tells Kubernetes to target an average CPU utilization of 50% (measured against each pod's CPU request), with a minimum of 1 pod and a maximum of 10.

Option 2: YAML Configuration

yaml

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50

Apply with:

bash

kubectl apply -f hpa.yaml

Simulating Load to Trigger HPA

Run a BusyBox pod and generate traffic:

bash

kubectl run -it --rm busybox --image=busybox -- /bin/sh

Inside the shell:

bash

while true; do wget -q -O- http://<your-service-ip>; done

Watch your deployment scale up automatically as HPA detects increased CPU usage.

Monitoring HPA

bash

kubectl top pod           # Real-time CPU/memory per pod

kubectl get hpa           # Check if HPA is actively monitoring

kubectl describe hpa <name>  # Details on triggers and scaling decisions

Troubleshooting HPA

If pods are not scaling, check:

  • Is Metrics Server installed and running?
  • Is the workload actually under enough stress?
  • Is resources.requests.cpu defined in your Deployment YAML?
  • Is the target utilization threshold realistic?

Use tools like hey or ab for reliable and repeatable load generation during testing.

HPA Best Practices

Choose metrics that match your app’s bottleneck. CPU is the default, but queue length or request rate often provides more meaningful scaling signals for web applications and event-driven services.

Set min and max replicas thoughtfully. minReplicas should guarantee baseline availability. maxReplicas should reflect real infrastructure capacity — not just a safe-sounding number — to prevent runaway scaling during a DDoS or memory leak.

Use cooldowns to avoid instability. Downscale delays (e.g., 5 minutes) prevent premature pod removal when traffic briefly dips. Upscale delays (e.g., 1 minute) give the cluster time to absorb a new wave of pods before reacting again.

Combine with Cluster Autoscaler. When your cluster runs out of node capacity for new pods, the Cluster Autoscaler adds nodes automatically. HPA and Cluster Autoscaler together form a complete horizontal scaling strategy.
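The cooldown advice above maps onto the behavior field of an autoscaling/v2 HPA. The sketch below writes a manifest that uses the example delays mentioned earlier (5 minutes for scale-down, 1 minute for scale-up). The file name, replica bounds, and the 50%-per-minute removal policy are illustrative assumptions rather than recommended values.

```shell
# Write an HPA manifest with explicit scale-up and scale-down behavior.
cat > hpa-behavior.yaml <<'EOF'
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # wait 5 minutes before removing pods
      policies:
        - type: Percent
          value: 50                     # remove at most 50% of pods...
          periodSeconds: 60             # ...per minute
    scaleUp:
      stabilizationWindowSeconds: 60    # wait 1 minute before adding more pods
EOF

# Apply it in place of the basic HPA:
#   kubectl apply -f hpa-behavior.yaml
```

A longer scale-up window trades responsiveness for stability, so latency-sensitive services often leave it at the default of 0 and only slow down scale-down.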

When to Use HPA

HPA is the right choice when:

  • Your application handles variable traffic patterns — peaks during business hours, sales events, or viral moments
  • You are running public APIs or web servers where traffic is unpredictable
  • You have microservices in a CI/CD pipeline where each service should scale independently
  • You are running event-driven or batch processing workloads that react to queues or file triggers
  • You need to simulate real-world scaling behavior in staging or load testing environments

Vertical Pod Autoscaling (VPA): Scaling Up

How VPA Works

While HPA adds more pods, VPA makes each pod more capable. Vertical Pod Autoscaling automatically adjusts the CPU and memory requests of running pods based on their actual usage over time. Instead of horizontal expansion, it focuses on right-sizing — ensuring each pod has exactly the resources it needs, no more and no less.

A useful analogy: HPA is like hiring more staff to handle a busy restaurant. VPA is like training each existing staff member to work more efficiently.

VPA is particularly valuable for applications where the number of pods is fixed but resource consumption fluctuates — and it pairs naturally with a mature quality assurance process that monitors application behavior over time.

Prerequisites for VPA

Before setting up VPA, ensure you have:

A running Kubernetes cluster — Minikube works well for local testing.

A deployed application — VPA needs something to monitor. A basic deployment with CPU and memory activity is sufficient.

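VPA components installed: VPA is not part of a default Kubernetes installation. You must install its three core components yourself using the manifests in the vertical-pod-autoscaler directory of the Kubernetes Autoscaler GitHub repository (the repository's hack/vpa-up.sh script applies them for you). The Recommender reads usage from the Metrics API, so Metrics Server should be running as well.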

VPA Metrics

VPA bases its decisions on several types of usage data:

CPU usage — tracks processing power consumption per pod over time. Consistently high or low CPU triggers VPA to adjust requests accordingly, preventing throttling or over-provisioning.

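Memory usage: monitors memory consumption and raises or lowers the memory request when usage consistently differs from what was originally requested. By default VPA also scales limits proportionally so the original request-to-limit ratio is preserved; set controlledValues: RequestsOnly in the container's resource policy to leave limits untouched.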

Historical usage patterns — VPA does not rely solely on current metrics. It analyzes historical data to predict future needs, producing more stable and intelligent recommendations over time.

Granular data collection — frequent sampling gives VPA fine-grained visibility into pod behavior, allowing it to detect sudden spikes or sustained drops and act on them reliably.

Core Components of VPA

Recommender — the intelligence layer of VPA. It continuously analyzes CPU and memory usage and produces optimal resource recommendations based on real-time and historical data.

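Updater: compares running pods against the Recommender's suggestions and evicts pods whose requests are too far off, so that their controller recreates them and the new values are applied at creation. It only acts when the update mode is Auto or Recreate.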

Admission Controller — injects recommended resource values at pod creation time. Even if the updater is inactive, new pods start with optimized settings from day one.

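Update Policies: the update mode controls how recommendations are applied. This guide covers three modes (VPA also defines a fourth, Recreate, which currently behaves like Auto):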

  • Off — recommendations are generated but never applied. Useful for monitoring only.
  • Initial — recommendations are applied at pod creation but not during runtime.
  • Auto — recommendations are applied dynamically, restarting pods as needed.

Resource Policy Support — you can define minimum and maximum bounds per container, preventing VPA from scaling too far in either direction.
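A low-risk way to start is the Off mode described above, which produces recommendations without touching running pods. The following sketch writes such a manifest; the VPA name and file name are hypothetical, and myapp stands in for your own Deployment.

```shell
# Write a recommendation-only VPA manifest (update mode "Off").
cat > vpa-off.yaml <<'EOF'
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: myapp-vpa-recommend-only
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  updatePolicy:
    updateMode: "Off"   # compute recommendations, never evict or patch pods
EOF

# Apply it, give the Recommender time to observe the workload, then inspect:
#   kubectl apply -f vpa-off.yaml
#   kubectl describe vpa myapp-vpa-recommend-only
```

Comparing the reported recommendation with your current requests is a reasonable first check before switching a workload to Initial or Auto.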

Configuring VPA

VPA is configured via a YAML file that points to your target deployment. A basic configuration looks like this:

yaml

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: myapp-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: myapp
  updatePolicy:
    updateMode: "Auto"

For tighter control, include resource policy bounds:

yaml

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: nginx-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: nginx-deployment
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: nginx
        minAllowed:
          cpu: 100m
          memory: 200Mi
        maxAllowed:
          cpu: 800m
          memory: 1Gi

This configuration keeps the nginx container’s resources within defined bounds while still allowing intelligent adjustment based on actual usage — a critical safety net in shared environments and budget-constrained clusters.

Apply with:

bash

kubectl apply -f vpa-config.yaml

Make sure the VPA components (recommender, updater, and admission controller) are deployed in your cluster before applying the config.

Deploying and Testing VPA

Once your VPA config is applied, verify it is working:

bash

kubectl describe vpa nginx-vpa

This shows current CPU and memory recommendations. To stress-test the system and observe VPA in action, run a load generator:

bash

kubectl run -it --rm loadgen --image=busybox -- /bin/sh

Run loops or request scripts from inside the shell and watch how pod resource requests change over time. VPA will automatically adjust allocations without any manual input — like a smart thermostat that continuously calibrates based on activity.

Monitoring and Debugging VPA

Check VPA status:

bash

kubectl get vpa

kubectl describe vpa

Common issues to look for:

Insufficient historical data — the recommender needs steady workload traffic over time before making reliable recommendations. New or lightly used deployments may not receive suggestions immediately.

No active load — VPA cannot recommend changes if no resource usage is being recorded.

Admission controller misconfiguration — if new pods are not receiving injected values, verify the admission controller is installed and running correctly.

Conflicts with HPA — if both VPA and HPA attempt to manage the same resource (e.g., both targeting CPU), they will interfere with each other. The safe approach is to assign different resources to each: HPA on CPU, VPA on memory.
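One way to keep the two autoscalers apart is to restrict VPA to memory with controlledResources, leaving CPU to HPA. The sketch below assumes an HPA already targets CPU on a Deployment named my-app; the VPA name and file name are placeholders, and containerName "*" applies the policy to every container in the pod.

```shell
# Write a VPA manifest that only manages memory, leaving CPU to HPA.
cat > vpa-memory-only.yaml <<'EOF'
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-memory-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        controlledResources: ["memory"]   # VPA leaves CPU requests alone
EOF

# Apply alongside the CPU-based HPA:
#   kubectl apply -f vpa-memory-only.yaml
```

Because HPA utilization is calculated against the CPU request, keeping VPA away from CPU stops the two controllers from shifting each other's baseline.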

When to Use VPA

VPA is the right choice for:

  • Batch processing applications with predictable but heavy workloads
  • Legacy services that cannot scale horizontally due to architecture constraints
  • Applications with pod limits where adding replicas is not possible
  • ML model serving and data pipelines — workloads that are slow to start but resource-intensive during execution

If you are running stateful services or workloads with strict pod count constraints, VPA is often the only viable autoscaling option.

HPA vs VPA: Choosing the Right Strategy

| | HPA | VPA |
| --- | --- | --- |
| What it scales | Number of pod replicas | CPU/memory requests per pod |
| Best for | Variable traffic, stateless apps | Fixed pod count, resource-heavy workloads |
| Requires restart? | No | Yes (pods are evicted and recreated in Auto mode) |
| Metrics Server needed? | Yes | Yes (the Recommender reads usage from the Metrics API) |
| Can be combined? | Yes (on different resources) | Yes (with HPA targeting different metrics) |

In most production environments, HPA and VPA are not mutually exclusive. You can run HPA targeting CPU utilization while VPA manages memory right-sizing — giving you a comprehensive, multi-dimensional autoscaling strategy. For node-level capacity to support either, pair both with the Cluster Autoscaler.


Conclusion

Kubernetes pod autoscaling — through HPA and VPA — is one of the most powerful capabilities available to modern engineering teams. HPA handles traffic variability by adding and removing pod replicas in real time. VPA handles resource variability by continuously right-sizing the CPU and memory of each running pod. Together, they eliminate the inefficiency of manual capacity planning and ensure your applications remain performant, resilient, and cost-effective under any workload condition.

The key is knowing which tool fits which problem. Use HPA for stateless, traffic-variable workloads. Use VPA for resource-intensive services with stable pod counts. Combine both, with careful metric assignment, for the most complete autoscaling coverage your cluster can have.

FAQs

1. What is the difference between HPA and VPA in Kubernetes?

HPA (Horizontal Pod Autoscaling) adjusts the number of pod replicas based on demand — adding pods when traffic increases and removing them when it drops. VPA (Vertical Pod Autoscaling) adjusts the CPU and memory resources allocated to each individual pod based on actual usage. HPA scales out; VPA scales up.

2. Can HPA and VPA be used together in the same Kubernetes cluster?

Yes, but with one important constraint: they should not both target the same resource on the same workload. The recommended approach is to configure HPA to manage CPU-based scaling and VPA to manage memory right-sizing. Running both on the same metric (e.g., both managing CPU) causes conflicts and unstable behavior.

3. Do I need Metrics Server for both HPA and VPA?

Metrics Server is required for HPA's CPU and memory targets: without it, HPA has no resource data to base scaling decisions on. VPA relies on it as well. The Recommender reads current usage from the Metrics API and builds its own usage history on top, so Metrics Server should be installed alongside VPA's three core components (Recommender, Updater, Admission Controller).

4. What are the VPA update modes and when should I use each?

This guide covers three update modes. Off generates recommendations but never applies them, which is useful for observing what VPA would do without affecting production. Initial applies recommendations only at pod creation, not during runtime, which suits controlled environments. Auto applies recommendations dynamically and recreates pods as needed, which suits fully automated resource optimization. VPA also defines a Recreate mode, which currently behaves like Auto.

5. What happens if VPA restarts a pod? Will it cause downtime?

In Auto mode, the VPA Updater evicts pods so they are recreated with new resource values. For applications with multiple replicas this is generally safe, because evictions go through the Kubernetes eviction API and respect any PodDisruptionBudget you define. For single-replica or stateful applications, a restart may cause brief unavailability. Using Initial mode avoids runtime restarts by only applying recommendations when pods are created.
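For multi-replica workloads, a PodDisruptionBudget is a common way to limit how many pods can be evicted at once. The sketch below is a minimal example that assumes your pods carry the label app: myapp; the label, the budget name, and the file name are placeholders.

```shell
# Write a PodDisruptionBudget that keeps at least one pod available
# while other pods are being evicted.
cat > myapp-pdb.yaml <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb
spec:
  minAvailable: 1          # never evict the last available pod
  selector:
    matchLabels:
      app: myapp           # assumed pod label
EOF

# Apply it before enabling Auto mode:
#   kubectl apply -f myapp-pdb.yaml
```

Note that with a single replica and minAvailable: 1, evictions are blocked entirely, so VPA in Auto mode would not be able to apply new values to that pod.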

6. What is the HPA scaling formula?

The core HPA formula is: Desired Replicas = ceil(Current Replicas × (Current Utilization ÷ Target Utilization)). For example, if pods are running at 80% CPU utilization and the target is 40%, HPA will double the replica count to redistribute load and restore balance.

7. When should I use Cluster Autoscaler alongside HPA or VPA?

Cluster Autoscaler adds or removes nodes in your Kubernetes cluster based on pod scheduling demand. If HPA tries to add more pods but there is no node capacity to place them on, Cluster Autoscaler provisions new nodes automatically. It is a critical complement to both HPA and VPA in production environments. You can learn more in the Node and Cluster Autoscaling guide.
