Tag Archives: Autoscaling with Kubernetes


Autoscaling with Kubernetes

Autoscaling with Kubernetes

Customers using Kubernetes respond to end user requests swiftly and ship software faster than ever before. But what happens when the user builds a service that is even more popular than planned for, and run out of compute? In Kubernetes 1.3, we are proud to announce that we have a solution: autoscaling. On Google Compute Engine (GCE) and Google Container Engine (GKE) and on AWS, Kubernetes will automatically scale up the cluster as soon as user need it, and scale it back down to save money when the user doesn’t need. Let us explore this article and know more about autoscaling in Kubernetes.


In the context of Kubernetes cluster, there are typically two things one want to scale as a user:

Pods: For a given application let’s say you are running X replicas, if more requests come then the group of X pods can handle, it is a good idea to scale to more than X replicas for that application. For this to work faultlessly, the nodes should have enough available resources so that those extra pods can be scheduled and executed successfully.

Nodes: Capacity of all nodes putting together characterizes cluster’s capacity. If the workload demand goes beyond this capacity, then the user would have to add nodes to the cluster and make sure the workload can be scheduled and executed effectively. If the PODs keep scaling, at some point the resources that nodes have available will run out and the user will have to add more nodes to increase overall resources available at the cluster level.

When to Scale?

The choice of when to scale has two parts, one is measuring a certain metric continuously and when the metric crosses a threshold value, then acting on it by scaling a certain resource. For instance, if the user wants to measure the average CPU consumption of their pods and then trigger a scale operation if the CPU consumption crosses 80%. But one metric does not fit all use cases and for different kind of applications, the metric might vary.

So far we only considered the scale-up part, but when the workload usage drops, there should be a way to scale down with poise and without affecting the existing requests being processed.


In case of pods, simply changing the number of replicas in replication controller is enough. In case of nodes, there should be a way to call the cloud provider’s API, create a new instance and make it a part of the cluster, which is relatively non-trivial operation and may take more time comparatively.


With this understanding of autoscaling, let’s discuss about detailed implementation and technical details of Kubernetes autoscaling.


gfhCluster autoscaler is used in Kubernetes to scale cluster specifically nodes dynamically. It watches the pods continuously and if it finds that a pod cannot be scheduled, then based on the PodCondition, it chooses to scale up. This is far more effective than looking at the CPU percentage of nodes in aggregate. Since a node creation can take up to a minute or more depending on the cloud provider and other factors, it may take some time till the pod can be scheduled. Within a cluster, the user might have multiple node pools. Also, the nodes can be spread across AZs in a region and how the user scale might vary based on topology. Cluster Autoscaler provides various flags and ways to pull the node scaling behaviour.

For scaling down, it looks at average utilization on that node, but there are other factors which come into play. For instance, if a pod with pod disruption budget is running on a node which cannot be re-scheduled then the node cannot be removed from the cluster. Cluster autoscaler provides a way to terminate nodes and gives up to 10 minutes for pods to reposition.


Horizontal pod autoscaler (HPA) is a control loop which viewpoints and scales a pod in the deployment. This can be done by creating an HPA object that refers to a deployment controller. The user can also define the threshold and minimum and maximum scale to which the deployment should scale. The original version of HPA which is GA (autoscaling/v1) only supports CPU as a metric that can be monitored. The current version of HPA which is in beta supports memory and other custom metrics. Once the user creates an HPA object and it is able to query the metrics for that pod, the user can see it reporting the details:

$ kubectl get hpa


helloetst-ownay28d Deployment/helloetst-ownay28d 8% / 60% 1 4 1 23h