Architecting Kubernetes clusters — choosing the best autoscaling strategy
Daniele Polencic

June 2021
<p class="lh-copy measure-wide f4"><em class="i">TL;DR: Scaling pods and nodes in a Kubernetes cluster can take several minutes with the default settings. Learn how to size your cluster nodes, configure the Horizontal Pod Autoscaler and the Cluster Autoscaler, and overprovision your cluster for faster scaling.</em></p><p class="lh-copy measure-wide f4"><strong class="b">Table of contents:</strong></p><ul><li class="lh-copy f4 mv1 measure-wide"><a href="#when-autoscaling-pods-goes-wrong" target="_self" class="link navy underline hover-sky">When autoscaling pods goes wrong</a></li><li class="lh-copy f4 mv1 measure-wide"><a href="#how-the-cluster-autoscaler-works-in-kubernetes" target="_self" class="link navy underline hover-sky">How the Cluster Autoscaler works in Kubernetes</a></li><li class="lh-copy f4 mv1 measure-wide"><a href="#exploring-pod-autoscaling-lead-time" target="_self" class="link navy underline hover-sky">Exploring pod autoscaling lead time</a></li><li class="lh-copy f4 mv1 measure-wide"><a href="#choosing-the-optimal-instance-size-for-a-kubernetes-node" target="_self" class="link navy underline hover-sky">Choosing the optimal instance size for a Kubernetes node</a></li><li class="lh-copy f4 mv1 measure-wide"><a href="#overprovisioning-nodes-in-your-kubernetes-cluster" target="_self" class="link navy underline hover-sky">Overprovisioning nodes in your Kubernetes cluster</a></li><li class="lh-copy f4 mv1 measure-wide"><a href="#selecting-the-correct-memory-and-cpu-requests-for-your-pods" target="_self" class="link navy underline hover-sky">Selecting the correct memory and CPU requests for your Pods</a></li><li class="lh-copy f4 mv1 measure-wide"><a href="#what-about-downscaling-a-cluster-" target="_self" class="link navy underline hover-sky">What about downscaling a cluster?</a></li><li class="lh-copy f4 mv1 measure-wide"><a href="#why-not-autoscaling-based-on-memory-or-cpu-" target="_self" class="link navy underline hover-sky">Why not autoscaling based on memory or
CPU?</a></li></ul><p class="lh-copy measure-wide f4">In Kubernetes, several things are referred to as "autoscaling", including:</p><ul><li class="lh-copy f4 mv1 measure-wide"><a href="" target="_blank" rel="noreferrer" class="link navy underline hover-sky">Horizontal Pod Autoscaler</a>.</li><li class="lh-copy f4 mv1 measure-wide"><a href="" target="_blank" rel="noreferrer" class="link navy underline hover-sky">Vertical Pod Autoscaler</a>.</li><li class="lh-copy f4 mv1 measure-wide"><a href="" target="_blank" rel="noreferrer" class="link navy underline hover-sky">Cluster Autoscaler</a>.</li></ul><p class="lh-copy measure-wide f4">Those autoscalers belong to different categories because they address different concerns.</p><p class="lh-copy measure-wide f4">The <strong class="b">Horizontal Pod Autoscaler (HPA)</strong> is designed to increase the number of replicas in your deployments.</p><p class="lh-copy measure-wide f4">As your application receives more traffic, the autoscaler can adjust the number of replicas to handle more requests.</p><div class="relative overflow-hidden"><ul class="pl0 list"><li class="mv3"><input type="radio" name="carousel-1" id="carousel-1-0" checked class="dn o-0 absolute bottom-0 left-0"><div class="dn checked-reveal"><div class="aspect-ratio aspect-ratio--4x3"><img src="" alt="The Horizontal Pod Autoscaler (HPA) inspects metrics such as memory and CPU at a regular interval."
class="aspect-ratio--object" loading="lazy"></div><div class="bt b-solid bw2 b--black-70 relative mt0"><div class="bg-black-10 br1 pa1 dib mt2 absolute bottom-1 left-0 z-999"><span class="b black-60">1</span><span class="f7 black-50">/2</span></div></div><div class="flex items-start justify-between bg-evian ph2"><div class="w-20"></div><div class="f5 lh-copy black-90 w-60 center"><p class="lh-copy measure-wide f5">The Horizontal Pod Autoscaler (HPA) inspects metrics such as memory and CPU at a regular interval.</p></div><label class="db f6 b black-50 pv3 pointer w-20 tr ttu" for="carousel-1-1">Next <svg viewBox="0 0 10 16" xmlns="" class="pagination-icon ml2"><polyline fill="none" vector-effect="non-scaling-stroke" points="2,2 8,8 2,14"></polyline></svg></label></div></div></li><li class="mv3"><input type="radio" name="carousel-1" id="carousel-1-1" class="dn o-0 absolute bottom-0 left-0"><div class="dn checked-reveal"><div class="aspect-ratio aspect-ratio--4x3"><img src="" alt="If the metrics pass a user-defined threshold, the autoscaler creates more Pods." 
class="aspect-ratio--object" loading="lazy"></div><div class="bt b-solid bw2 b--black-70 relative mt0"><div class="bg-black-10 br1 pa1 dib mt2 absolute bottom-1 left-0 z-999"><span class="b black-60">2</span><span class="f7 black-50">/2</span></div></div><div class="flex items-start justify-between bg-evian ph2"><label class="db f6 b black-50 pv3 pointer w-20 ttu" for="carousel-1-0"><svg viewBox="0 0 10 16" xmlns="" class="pagination-icon mr2"><polyline fill="none" vector-effect="non-scaling-stroke" points="8,2 2,8 8,14"></polyline></svg> Previous</label><div class="f5 lh-copy black-90 w-60 center"><p class="lh-copy measure-wide f5">If the metrics pass a user-defined threshold, the autoscaler creates more Pods.</p></div><div class="w-20"></div></div></div></li></ul></div><p class="lh-copy measure-wide f4">The <strong class="b">Vertical Pod Autoscaler (VPA)</strong> is useful when you can't create more copies of your Pods, but you still need to handle more traffic.</p><p class="lh-copy measure-wide f4">As an example, you can't scale a database (easily) only by adding more Pods.</p><p class="lh-copy measure-wide f4">A database might require sharding or configuring read-only replicas.</p><p class="lh-copy measure-wide f4">But you can make a database handle more connections by increasing the memory and CPU available to it.</p><p class="lh-copy measure-wide f4">That's precisely the purpose of the vertical autoscaler — increasing the size of the Pod.</p><div class="relative overflow-hidden"><ul class="pl0 list"><li class="mv3"><input type="radio" name="carousel-2" id="carousel-2-0" checked class="dn o-0 absolute bottom-0 left-0"><div class="dn checked-reveal"><div class="aspect-ratio aspect-ratio--4x3"><img src="" alt="You can't only increase the number of replicas to scale a database in Kubernetes." 
class="aspect-ratio--object" loading="lazy"></div><div class="bt b-solid bw2 b--black-70 relative mt0"><div class="bg-black-10 br1 pa1 dib mt2 absolute bottom-1 left-0 z-999"><span class="b black-60">1</span><span class="f7 black-50">/2</span></div></div><div class="flex items-start justify-between bg-evian ph2"><div class="w-20"></div><div class="f5 lh-copy black-90 w-60 center"><p class="lh-copy measure-wide f5">You can't only increase the number of replicas to scale a database in Kubernetes.</p></div><label class="db f6 b black-50 pv3 pointer w-20 tr ttu" for="carousel-2-1">Next <svg viewBox="0 0 10 16" xmlns="" class="pagination-icon ml2"><polyline fill="none" vector-effect="non-scaling-stroke" points="2,2 8,8 2,14"></polyline></svg></label></div></div></li><li class="mv3"><input type="radio" name="carousel-2" id="carousel-2-1" class="dn o-0 absolute bottom-0 left-0"><div class="dn checked-reveal"><div class="aspect-ratio aspect-ratio--4x3"><img src="" alt="But you can create a pod that has bigger resources assigned to it. The Vertical Pod Autoscaler can do that automatically." class="aspect-ratio--object" loading="lazy"></div><div class="bt b-solid bw2 b--black-70 relative mt0"><div class="bg-black-10 br1 pa1 dib mt2 absolute bottom-1 left-0 z-999"><span class="b black-60">2</span><span class="f7 black-50">/2</span></div></div><div class="flex items-start justify-between bg-evian ph2"><label class="db f6 b black-50 pv3 pointer w-20 ttu" for="carousel-2-0"><svg viewBox="0 0 10 16" xmlns="" class="pagination-icon mr2"><polyline fill="none" vector-effect="non-scaling-stroke" points="8,2 2,8 8,14"></polyline></svg> Previous</label><div class="f5 lh-copy black-90 w-60 center"><p class="lh-copy measure-wide f5">But you can create a pod that has bigger resources assigned to it. 
The Vertical Pod Autoscaler can do that automatically.</p></div><div class="w-20"></div></div></div></li></ul></div><p class="lh-copy measure-wide f4">Lastly, there's the <strong class="b">Cluster Autoscaler (CA)</strong>.</p><p class="lh-copy measure-wide f4">When your cluster runs low on resources, the Cluster Autoscaler provisions a new compute unit and adds it to the cluster.</p><p class="lh-copy measure-wide f4">If there are too many empty nodes, the Cluster Autoscaler removes them to reduce costs.</p><div class="relative overflow-hidden"><ul class="pl0 list"><li class="mv3"><input type="radio" name="carousel-3" id="carousel-3-0" checked class="dn o-0 absolute bottom-0 left-0"><div class="dn checked-reveal"><div class="aspect-ratio aspect-ratio--4x3"><img src="" alt="When you scale your pods in Kubernetes, you might run out of space." class="aspect-ratio--object" loading="lazy"></div><div class="bt b-solid bw2 b--black-70 relative mt0"><div class="bg-black-10 br1 pa1 dib mt2 absolute bottom-1 left-0 z-999"><span class="b black-60">1</span><span class="f7 black-50">/3</span></div></div><div class="flex items-start justify-between bg-evian ph2"><div class="w-20"></div><div class="f5 lh-copy black-90 w-60 center"><p class="lh-copy measure-wide f5">When you scale your pods in Kubernetes, you might run out of space.</p></div><label class="db f6 b black-50 pv3 pointer w-20 tr ttu" for="carousel-3-1">Next <svg viewBox="0 0 10 16" xmlns="" class="pagination-icon ml2"><polyline fill="none" vector-effect="non-scaling-stroke" points="2,2 8,8 2,14"></polyline></svg></label></div></div></li><li class="mv3"><input type="radio" name="carousel-3" id="carousel-3-1" class="dn o-0 absolute bottom-0 left-0"><div class="dn checked-reveal"><div class="aspect-ratio aspect-ratio--4x3"><img src="" alt="The Cluster Autoscaler is designed to increase the node count in your cluster."
class="aspect-ratio--object" loading="lazy"></div><div class="bt b-solid bw2 b--black-70 relative mt0"><div class="bg-black-10 br1 pa1 dib mt2 absolute bottom-1 left-0 z-999"><span class="b black-60">2</span><span class="f7 black-50">/3</span></div></div><div class="flex items-start justify-between bg-evian ph2"><label class="db f6 b black-50 pv3 pointer w-20 ttu" for="carousel-3-0"><svg viewBox="0 0 10 16" xmlns="" class="pagination-icon mr2"><polyline fill="none" vector-effect="non-scaling-stroke" points="8,2 2,8 8,14"></polyline></svg> Previous</label><div class="f5 lh-copy black-90 w-60 center"><p class="lh-copy measure-wide f5">The Cluster Autoscaler is designed to increase the node count in your cluster.</p></div><label class="db f6 b black-50 pv3 pointer w-20 tr ttu" for="carousel-3-2">Next <svg viewBox="0 0 10 16" xmlns="" class="pagination-icon ml2"><polyline fill="none" vector-effect="non-scaling-stroke" points="2,2 8,8 2,14"></polyline></svg></label></div></div></li><li class="mv3"><input type="radio" name="carousel-3" id="carousel-3-2" class="dn o-0 absolute bottom-0 left-0"><div class="dn checked-reveal"><div class="aspect-ratio aspect-ratio--4x3"><img src="" alt="You can keep scaling your pods without worrying about the underlying nodes. The cluster autoscaler will create more automatically." 
class="aspect-ratio--object" loading="lazy"></div><div class="bt b-solid bw2 b--black-70 relative mt0"><div class="bg-black-10 br1 pa1 dib mt2 absolute bottom-1 left-0 z-999"><span class="b black-60">3</span><span class="f7 black-50">/3</span></div></div><div class="flex items-start justify-between bg-evian ph2"><label class="db f6 b black-50 pv3 pointer w-20 ttu" for="carousel-3-1"><svg viewBox="0 0 10 16" xmlns="" class="pagination-icon mr2"><polyline fill="none" vector-effect="non-scaling-stroke" points="8,2 2,8 8,14"></polyline></svg> Previous</label><div class="f5 lh-copy black-90 w-60 center"><p class="lh-copy measure-wide f5">You can keep scaling your pods without worrying about the underlying nodes. The cluster autoscaler will create more automatically.</p></div><div class="w-20"></div></div></div></li></ul></div><p class="lh-copy measure-wide f4">While these components all "autoscale" something, they are entirely unrelated to each other.</p><p class="lh-copy measure-wide f4">They all address very different use cases and use different concepts and mechanisms.</p><p class="lh-copy measure-wide f4">And they are developed in separate projects and can be used independently from each other.</p><p class="lh-copy measure-wide f4"><strong class="b">However, scaling your cluster requires fine-tuning the settings of the autoscalers so that they work in concert.</strong></p><p class="lh-copy measure-wide f4">Let's have a look at an example.</p><h2 class="f2 pt5 pb2 mt3" id="when-autoscaling-pods-goes-wrong">When autoscaling pods goes wrong</h2><p class="lh-copy measure-wide f4">Imagine having an application that requires and uses 1.5GB of memory and 0.25 vCPU at all times.</p><p class="lh-copy measure-wide f4">You provisioned a cluster with a single node of 8GB and 2 vCPU. It should be able to fit four pods perfectly (and have a little bit of extra space left).</p><img class="db pv3 center" src="" alt="A single node cluster with 8GB of memory and 2 vCPU" loading="lazy"><p
class="lh-copy measure-wide f4">You deploy a single Pod and set up:</p><ol><li class="lh-copy f4 mv1 measure-wide">A <strong class="b">Horizontal Pod Autoscaler</strong> that adds a replica for every 10 concurrent requests (i.e. if you have 40 concurrent requests, it should scale to 4 replicas).</li><li class="lh-copy f4 mv1 measure-wide">A <strong class="b">Cluster Autoscaler</strong> to create more nodes when resources are low.</li></ol><blockquote class="pl3 mh2 bl bw2 b--blue bg-evian pv1 ph4"><p class="lh-copy measure-wide f4">The Horizontal Pod Autoscaler can scale the replicas in your deployment using Custom Metrics such as the queries per second (QPS) from an Ingress controller.</p></blockquote><p class="lh-copy measure-wide f4">You start driving 30 concurrent requests to your cluster and observe the following:</p><ol><li class="lh-copy f4 mv1 measure-wide">The <strong class="b">Horizontal Pod Autoscaler</strong> starts scaling the Pods.</li><li class="lh-copy f4 mv1 measure-wide">Two more Pods are created.</li><li class="lh-copy f4 mv1 measure-wide">The <strong class="b">Cluster Autoscaler</strong> doesn't trigger: no new node is created in the cluster.</li></ol><p class="lh-copy measure-wide f4">That makes sense: by your estimate, the node still has enough space for one more Pod.</p><img class="db pv3 center" src="" alt="Scaling three replicas in a single node" loading="lazy"><p class="lh-copy measure-wide f4">You further increase the traffic to 40 concurrent requests and observe again:</p><ol><li class="lh-copy f4 mv1 measure-wide">The <strong class="b">Horizontal Pod Autoscaler</strong> creates one more Pod.</li><li class="lh-copy f4 mv1 measure-wide">The Pod is pending and cannot be deployed.</li><li class="lh-copy f4 mv1 measure-wide">The <strong class="b">Cluster Autoscaler</strong> triggers the creation of a new node.</li><li class="lh-copy f4 mv1 measure-wide">The node is provisioned in 4 minutes.
After that, the pending Pod is deployed.</li></ol><div class="relative overflow-hidden"><ul class="pl0 list"><li class="mv3"><input type="radio" name="carousel-4" id="carousel-4-0" checked class="dn o-0 absolute bottom-0 left-0"><div class="dn checked-reveal"><div class="aspect-ratio aspect-ratio--4x3"><img src="" alt="When you scale to four replicas, the fourth replica isn't deployed on the first node. Instead, it stays &quot;Pending&quot;." class="aspect-ratio--object" loading="lazy"></div><div class="bt b-solid bw2 b--black-70 relative mt0"><div class="bg-black-10 br1 pa1 dib mt2 absolute bottom-1 left-0 z-999"><span class="b black-60">1</span><span class="f7 black-50">/2</span></div></div><div class="flex items-start justify-between bg-evian ph2"><div class="w-20"></div><div class="f5 lh-copy black-90 w-60 center"><p class="lh-copy measure-wide f5">When you scale to four replicas, the fourth replica isn't deployed on the first node. Instead, it stays <em class="i">"Pending"</em>.</p></div><label class="db f6 b black-50 pv3 pointer w-20 tr ttu" for="carousel-4-1">Next <svg viewBox="0 0 10 16" xmlns="" class="pagination-icon ml2"><polyline fill="none" vector-effect="non-scaling-stroke" points="2,2 8,8 2,14"></polyline></svg></label></div></div></li><li class="mv3"><input type="radio" name="carousel-4" id="carousel-4-1" class="dn o-0 absolute bottom-0 left-0"><div class="dn checked-reveal"><div class="aspect-ratio aspect-ratio--4x3"><img src="" alt="The autoscaler creates a new node, and the pod is finally deployed."
class="aspect-ratio--object" loading="lazy"></div><div class="bt b-solid bw2 b--black-70 relative mt0"><div class="bg-black-10 br1 pa1 dib mt2 absolute bottom-1 left-0 z-999"><span class="b black-60">2</span><span class="f7 black-50">/2</span></div></div><div class="flex items-start justify-between bg-evian ph2"><label class="db f6 b black-50 pv3 pointer w-20 ttu" for="carousel-4-0"><svg viewBox="0 0 10 16" xmlns="" class="pagination-icon mr2"><polyline fill="none" vector-effect="non-scaling-stroke" points="8,2 2,8 8,14"></polyline></svg> Previous</label><div class="f5 lh-copy black-90 w-60 center"><p class="lh-copy measure-wide f5">The autoscaler creates a new node, and the pod is finally deployed.</p></div><div class="w-20"></div></div></div></li></ul></div><p class="lh-copy measure-wide f4"><em class="i">Why is the fourth Pod not deployed on the first node?</em></p><p class="lh-copy measure-wide f4">Pods deployed in your Kubernetes cluster consume resources such as memory, CPU and storage.</p><p class="lh-copy measure-wide f4">However, on the same node, <strong class="b">the operating system and the kubelet require memory and CPU too.</strong></p><p class="lh-copy measure-wide f4">A Kubernetes worker node's memory and CPU are divided into:</p><ol><li class="lh-copy f4 mv1 measure-wide"><strong class="b">Resources needed to run the operating system and system daemons</strong> such as SSH, systemd, etc.</li><li class="lh-copy f4 mv1 measure-wide"><strong class="b">Resources necessary to run Kubernetes agents</strong> such as the Kubelet, the container runtime, <a href="" target="_blank" rel="noreferrer" class="link navy underline hover-sky">node problem detector</a>, etc.</li><li class="lh-copy f4 mv1 measure-wide"><strong class="b">Resources available to Pods.</strong></li><li class="lh-copy f4 mv1 measure-wide"><strong class="b">Resources reserved for the <a href="" target="_blank" rel="noreferrer" class="link navy underline hover-sky">eviction
threshold</a></strong>.</li></ol><img class="db pv3 center" src="" alt="Resources in a Kubernetes cluster are consumed by Pods, the operating system, kubelet and eviction threshold" loading="lazy"><p class="lh-copy measure-wide f4">As you can guess, <a href="" target="_blank" rel="noreferrer" class="link navy underline hover-sky">all of those quotas are customisable</a>, but you need to account for them.</p><p class="lh-copy measure-wide f4">In an 8GB and 2 vCPU virtual machine, you can expect:</p><ul><li class="lh-copy f4 mv1 measure-wide">100MB of memory and 0.1 vCPU to be reserved for the operating system.</li><li class="lh-copy f4 mv1 measure-wide">1.8GB of memory and 0.07 vCPU to be reserved for the Kubelet.</li><li class="lh-copy f4 mv1 measure-wide">100MB of memory for the eviction threshold.</li></ul><p class="lh-copy measure-wide f4"><strong class="b">The remaining ~6GB of memory and 1.83 vCPU are usable by the Pods.</strong></p><p class="lh-copy measure-wide f4">If your cluster runs a DaemonSet such as kube-proxy, you should further reduce the available memory and CPU.</p><p class="lh-copy measure-wide f4">Considering kube-proxy has requests of 128MB and 0.1 vCPU, only ~5.9GB of memory and 1.73 vCPU are available to run Pods.</p><p class="lh-copy measure-wide f4">Running a CNI like Flannel and a log collector such as Fluentd will further reduce the resources available to your Pods.</p><p class="lh-copy measure-wide f4">After accounting for all the extra resources, you have space left for only three pods: four Pods would request 6GB of memory, just over the ~5.9GB available.</p><div class="mv4 mv5-l"><header class="bg-light-gray flex pv2 pl1 br--top br2 relative"><div class="w1 h1 ml1 br-100 bg-dark-red"></div><div class="w1 h1 ml1 br-100 bg-green"></div><div class="w1 h1 ml1 br-100 bg-yellow"></div></header><pre class="code-light-theme relative overflow-auto mv0 br2 br--bottom"><code class="code lh-copy"><span class="standard">OS                  100MB, 0.1 vCPU +
Kubelet             1.8GB, 0.07 vCPU +
Eviction threshold  100MB, 0 vCPU +
Daemonsets          128MB, 0.1 vCPU +
======================================
Used                2.1GB, 0.27 vCPU

======================================
Available to Pods   5.9GB, 1.73 vCPU

Pod requests        1.5GB, 0.25 vCPU
======================================
Total (4 Pods)      6GB, 1 vCPU</span></code></pre></div><p class="lh-copy measure-wide f4">The fourth Pod stays "Pending" unless it can be deployed on another node.</p><p class="lh-copy measure-wide f4"><em class="i">Since the Cluster Autoscaler knows that there's no space for a fourth Pod, why doesn't it provision a new node?</em></p><p class="lh-copy measure-wide f4"><em class="i">Why does it wait for the Pod to be pending before it triggers creating a node?</em></p><h2 class="f2 pt5 pb2 mt3" id="how-the-cluster-autoscaler-works-in-kubernetes">How the Cluster Autoscaler works in Kubernetes</h2><p class="lh-copy measure-wide f4"><strong class="b">The Cluster Autoscaler doesn't look at the available memory or CPU when it triggers the autoscaling.</strong></p><p class="lh-copy measure-wide f4">Instead, the Cluster Autoscaler reacts to events and checks for any unschedulable Pods every 10 seconds.</p><p class="lh-copy measure-wide f4">A pod is unschedulable when the scheduler is unable to find a node that can accommodate it.</p><p class="lh-copy measure-wide f4">For example, when a Pod requests 1 vCPU but no node in the cluster has more than 0.5 vCPU available, the scheduler marks the Pod as unschedulable.</p><p class="lh-copy measure-wide f4"><strong class="b">That's when the Cluster Autoscaler initiates creating a new node.</strong></p><p class="lh-copy measure-wide f4">The Cluster Autoscaler scans the current cluster and <a href="" target="_blank" rel="noreferrer" class="link navy underline hover-sky">checks if any of the unschedulable pods would fit on a new node.</a></p><p class="lh-copy measure-wide f4">If you have a cluster with several node types (often also referred to as node groups or node pools), the Cluster Autoscaler will pick one of them using one of the following
strategies:</p><ul><li class="lh-copy f4 mv1 measure-wide"><strong class="b">Random</strong>: picks a node type at random. This is the default strategy.</li><li class="lh-copy f4 mv1 measure-wide"><strong class="b">Most pods</strong>: selects the node group that would schedule the most pods.</li><li class="lh-copy f4 mv1 measure-wide"><strong class="b">Least waste</strong>: selects the node group with the least idle CPU after scale-up.</li><li class="lh-copy f4 mv1 measure-wide"><strong class="b">Price</strong>: selects the node group that will cost the least (only works for GCP at the moment).</li><li class="lh-copy f4 mv1 measure-wide"><strong class="b">Priority</strong>: selects the node group with the highest priority (and you manually assign priorities).</li></ul><p class="lh-copy measure-wide f4">Once the node type is identified, the Cluster Autoscaler will call the relevant API to provision a new compute resource.</p><p class="lh-copy measure-wide f4">If you're using AWS, the Cluster Autoscaler will provision a new EC2 instance.</p><p class="lh-copy measure-wide f4">On Azure, it will create a new Virtual Machine and on GCP, a new Compute Engine instance.</p><p class="lh-copy measure-wide f4">It may take some time before the created nodes appear in Kubernetes.</p><p class="lh-copy measure-wide f4">Once the compute resource is ready, the <a href="" target="_blank" rel="noreferrer" class="link navy underline hover-sky">node is initialised</a> and added to the cluster, where the unscheduled Pods can be deployed.</p><p class="lh-copy measure-wide f4"><strong class="b">Unfortunately, provisioning new nodes is usually slow.</strong></p><p class="lh-copy measure-wide f4">It might take several minutes to provision a new compute unit.</p><p class="lh-copy measure-wide f4"><em class="i">But let's dive into the numbers.</em></p><h2 class="f2 pt5 pb2 mt3" id="exploring-pod-autoscaling-lead-time">Exploring pod autoscaling lead time</h2><p class="lh-copy measure-wide f4">The time it
takes to create a new Pod on a new Node is determined by four major factors:</p><ol><li class="lh-copy f4 mv1 measure-wide"><strong class="b">Horizontal Pod Autoscaler reaction time.</strong></li><li class="lh-copy f4 mv1 measure-wide"><strong class="b">Cluster Autoscaler reaction time.</strong></li><li class="lh-copy f4 mv1 measure-wide"><strong class="b">Node provisioning time.</strong></li><li class="lh-copy f4 mv1 measure-wide"><strong class="b">Pod creation time.</strong></li></ol><p class="lh-copy measure-wide f4">By default, <a href="" target="_blank" rel="noreferrer" class="link navy underline hover-sky">pods' CPU and memory usage is scraped by the kubelet every 10 seconds.</a></p><p class="lh-copy measure-wide f4"><a href="" target="_blank" rel="noreferrer" class="link navy underline hover-sky">Every minute, the Metrics Server will aggregate those metrics</a> and expose them to the rest of the Kubernetes API.</p><p class="lh-copy measure-wide f4">The Horizontal Pod Autoscaler controller is in charge of checking the metrics and deciding whether to scale your replicas up or down.</p><p class="lh-copy measure-wide f4">By default, the <a href="" target="_blank" rel="noreferrer" class="link navy underline hover-sky">Horizontal Pod Autoscaler checks Pod metrics every 15 seconds.</a></p><p class="lh-copy measure-wide f4">In the worst-case scenario, the Horizontal Pod Autoscaler can take up to 1 minute and a half to trigger the autoscaling (i.e.
10s + 60s + 15s = 85s).</p><img class="db pv3 center" src="" alt="The Horizontal Pod Autoscaler can take up to 1 minute and a half to trigger the autoscaling" loading="lazy"><p class="lh-copy measure-wide f4"><a href="" target="_blank" rel="noreferrer" class="link navy underline hover-sky">The Cluster Autoscaler checks for unschedulable Pods in the cluster every 10 seconds.</a></p><p class="lh-copy measure-wide f4">Once one or more unschedulable Pods are detected, it will run an algorithm to decide:</p><ol><li class="lh-copy f4 mv1 measure-wide"><strong class="b">How many nodes</strong> are necessary to deploy all pending Pods.</li><li class="lh-copy f4 mv1 measure-wide"><strong class="b">What type of node group</strong> should be created.</li></ol><p class="lh-copy measure-wide f4">The entire process should take:</p><ul><li class="lh-copy f4 mv1 measure-wide"><strong class="b">No more than 30 seconds</strong> on clusters with <strong class="b">fewer than 100 nodes</strong> with up to 30 pods each. The average latency should be about 5 seconds.</li><li class="lh-copy f4 mv1 measure-wide"><strong class="b">No more than 60 seconds</strong> on clusters with <strong class="b">100 to 1000 nodes</strong>.
The average latency should be about 15 seconds.</li></ul><img class="db pv3 center" src="" alt="The Cluster Autoscaler takes 30 seconds to decide to create a small node" loading="lazy"><p class="lh-copy measure-wide f4">Then, there's the Node provisioning time, which depends mainly on the cloud provider.</p><p class="lh-copy measure-wide f4"><strong class="b">It's pretty standard for a new compute resource to be provisioned in 3 to 5 minutes.</strong></p><img class="db pv3 center" src="" alt="Creating a virtual machine on a cloud provider could take several minutes" loading="lazy"><p class="lh-copy measure-wide f4">Lastly, the Pod has to be created by the container runtime.</p><p class="lh-copy measure-wide f4">Launching a container shouldn't take more than a few milliseconds, but <strong class="b">downloading the container image could take several seconds.</strong></p><p class="lh-copy measure-wide f4">If you're not caching your container images, downloading an image from the container registry could take from a couple of seconds up to a minute, depending on the size and number of layers.</p><img class="db pv3 center" src="" alt="Downloading a container image could take time and affect scaling" loading="lazy"><p class="lh-copy measure-wide f4">So the total time to scale when there is no space in the current cluster is made up of:</p><ol><li class="lh-copy f4 mv1 measure-wide">The Horizontal Pod Autoscaler might take up to 1m30s to increase the number of replicas.</li><li class="lh-copy f4 mv1 measure-wide">The Cluster Autoscaler should take less than 30 seconds for a cluster with fewer than 100 nodes and less than a minute for a cluster with more than 100 nodes.</li><li class="lh-copy f4 mv1 measure-wide">The cloud provider might take 3 to 5 minutes to create the compute resource.</li><li class="lh-copy f4 mv1 measure-wide">The container runtime could take up to 30 seconds to download the container image.</li></ol><p class="lh-copy measure-wide f4">In the
worst case, with a small cluster, you have:</p><div class="mv4 mv5-l"><header class="bg-light-gray flex pv2 pl1 br--top br2 relative"><div class="w1 h1 ml1 br-100 bg-dark-red"></div><div class="w1 h1 ml1 br-100 bg-green"></div><div class="w1 h1 ml1 br-100 bg-yellow"></div></header><pre class="code-light-theme relative overflow-auto mv0 br2 br--bottom"><code class="code lh-copy"><span class="standard">HPA delay:          1m30s +
CA delay:           0m30s +
Cloud provider:     4m    +
Container runtime:  0m30s +
=========================
Total               6m30s</span></code></pre></div><p class="lh-copy measure-wide f4">With a cluster with more than 100 nodes, the total delay could be up to 7 minutes.</p><p class="lh-copy measure-wide f4"><em class="i">Are you happy to wait for 7 minutes before you have more Pods to handle a sudden surge in traffic?</em></p><p class="lh-copy measure-wide f4"><em class="i">How can you tune the autoscaling to reduce the 7-minute scaling time if you need a new node?</em></p><p class="lh-copy measure-wide f4">You could change:</p><ul><li class="lh-copy f4 mv1 measure-wide">The refresh time for the Horizontal Pod Autoscaler (controlled by the <code class="code f5 lh-copy bg-near-white br2 pv1 ph2 fs-normal word-nowrap-even-on-whitespace">--horizontal-pod-autoscaler-sync-period</code> flag, default 15 seconds).</li><li class="lh-copy f4 mv1 measure-wide">The interval for metrics scraping in the Metrics Server (controlled by the <code class="code f5 lh-copy bg-near-white br2 pv1 ph2 fs-normal word-nowrap-even-on-whitespace">--metric-resolution</code> flag, default 60 seconds).</li><li class="lh-copy f4 mv1 measure-wide">How frequently the cluster autoscaler scans for unscheduled Pods (controlled by the <code class="code f5 lh-copy bg-near-white br2 pv1 ph2 fs-normal word-nowrap-even-on-whitespace">--scan-interval</code> flag, default 10 seconds).</li><li class="lh-copy f4 mv1 measure-wide">How you cache the image on the local node (<a href="" target="_blank" rel="noreferrer"
class="link navy underline hover-sky">with a tool such as kube-fledged</a>).</li></ul><p class="lh-copy measure-wide f4">But even if you were to tune those settings to a tiny number, you would still be limited by the cloud provider's provisioning time.</p><p class="lh-copy measure-wide f4"><em class="i">So, how could you fix that?</em></p><p class="lh-copy measure-wide f4">Since you can't change the provisioning time, you will need a workaround this time.</p><p class="lh-copy measure-wide f4">You could try two things:</p><ol><li class="lh-copy f4 mv1 measure-wide"><strong class="b">Avoid creating new nodes,</strong> if possible.</li><li class="lh-copy f4 mv1 measure-wide"><strong class="b">Create nodes proactively</strong> so that they are already provisioned when you need them.</li></ol><p class="lh-copy measure-wide f4"><em class="i">Let's have a look at the options one at a time.</em></p><h2 class="f2 pt5 pb2 mt3" id="choosing-the-optimal-instance-size-for-a-kubernetes-node">Choosing the optimal instance size for a Kubernetes node</h2><p class="lh-copy measure-wide f4"><strong class="b">Choosing the right instance type for your cluster has dramatic consequences on your scaling strategy.</strong></p><p class="lh-copy measure-wide f4"><em class="i">Consider the following scenario.</em></p><p class="lh-copy measure-wide f4">You have an application that requests 1GB of memory and 0.1 vCPU.</p><p class="lh-copy measure-wide f4">You provision a node that has 4GB of memory and 1 vCPU.</p><p class="lh-copy measure-wide f4">After reserving memory and CPU for the kubelet, operating system and eviction threshold, you are left with ~2.5GB of memory and 0.7 vCPU that can be used for running Pods.</p><img class="db pv3 center" src="" alt="Choosing a smaller instance can affect scaling" loading="lazy"><p class="lh-copy measure-wide f4">Your node has space for only two Pods.</p><p class="lh-copy measure-wide f4">Every time you scale your replicas, you are likely to incur up
to 7 minutes delay (the lead time to trigger the Horizontal Pod Autoscaler, Cluster Autoscaler and provisioning the compute resource on the cloud provider).</p><p class="lh-copy measure-wide f4"><em class="i">Let's have a look at what happens if you decide to use a 64GB memory and 16 vCPU node instead.</em></p><p class="lh-copy measure-wide f4">After reserving memory and CPU for the kubelet, operating system and eviction threshold, you are left with ~58.32GB of memory and 15.8 vCPU that can be used for running Pods.</p><p class="lh-copy measure-wide f4"><strong class="b">The available space can host 58 Pods, and you are likely to need a new node only when you have more than 58 replicas.</strong></p><img class="db pv3 center" src="" alt="Choosing a larger instance can affect scaling" loading="lazy"><p class="lh-copy measure-wide f4">Also, every time a node is added to the cluster, several pods can be deployed.</p><p class="lh-copy measure-wide f4">There is less chance to trigger <em class="i">again</em> the Cluster Autoscaler (and provisioning new compute units on the cloud provider).</p><p class="lh-copy measure-wide f4">Choosing large instance types also has another benefit.</p><p class="lh-copy measure-wide f4"><strong class="b">The ratio between resource reserved for Kubelet, operating system and eviction threshold and available resources to run Pods is greater.</strong></p><p class="lh-copy measure-wide f4">Have a look at this graph that pictures the memory available to pods.</p><img class="db pv3 center" src="" alt="Memory available to pods based on different instance types" loading="lazy"><p class="lh-copy measure-wide f4">As the instance size increase, you can notice that (in proportion) the resources available to pods increase.</p><p class="lh-copy measure-wide f4">In other words, you are utilising your resources more efficiently than having two instances of half of the size.</p><p class="lh-copy measure-wide f4"><em class="i">Should you select the biggest 
instance all the time?</em></p><p class="lh-copy measure-wide f4"><strong class="b">There's a peak in efficiency dictated by how many Pods you can have on the node.</strong></p><p class="lh-copy measure-wide f4">Some cloud providers limit the number of Pods per node to 110 (e.g. GKE). Others have limits dictated by the underlying network on a per-instance basis (e.g. AWS).</p><blockquote class="pl3 mh2 bl bw2 b--blue bg-evian pv1 ph4"><p class="lh-copy measure-wide f4"><a href="" target="_blank" rel="noreferrer" class="link navy underline hover-sky">You can inspect the limits from most cloud providers here.</a></p></blockquote><p class="lh-copy measure-wide f4"><strong class="b">And choosing a larger instance type is not always a good option.</strong></p><p class="lh-copy measure-wide f4">You should also consider:</p><ol><li class="lh-copy f4 mv1 measure-wide"><strong class="b">Blast radius</strong>: if you have only a few nodes, then the impact of a failing node is bigger than if you have many nodes.</li><li class="lh-copy f4 mv1 measure-wide"><strong class="b">Autoscaling is less cost-effective</strong> as the next increment is a (very) large node.</li></ol><p class="lh-copy measure-wide f4">Assuming you have selected the right instance type for your cluster, you might still face a delay in provisioning the new compute unit.</p><p class="lh-copy measure-wide f4"><em class="i">How can you work around that?</em></p><p class="lh-copy measure-wide f4"><em class="i">What if, instead of creating a new node when it's time to scale, you create the same node ahead of time?</em></p><h2 class="f2 pt5 pb2 mt3" id="overprovisioning-nodes-in-your-kubernetes-cluster">Overprovisioning nodes in your Kubernetes cluster</h2><p class="lh-copy measure-wide f4">If you can afford to have a spare node available at all times, you could:</p><ol><li class="lh-copy f4 mv1 measure-wide">Create a node and leave it empty.</li><li class="lh-copy f4 mv1 measure-wide">As soon as there's a Pod in the empty 
node, you create another empty node.</li></ol><p class="lh-copy measure-wide f4"><strong class="b">In other words, you teach the autoscaler always to have a spare empty node if you need to scale.</strong></p><div class="relative overflow-hidden"><ul class="pl0 list"><li class="mv3"><input type="radio" name="carousel-5" id="carousel-5-0" checked class="dn o-0 absolute bottom-0 left-0"><div class="dn checked-reveal"><div class="aspect-ratio aspect-ratio--4x3"><img src="" alt="When you decide to overprovision a cluster, a node is always empty and ready to deploy Pods." class="aspect-ratio--object" loading="lazy"></div><div class="bt b-solid bw2 b--black-70 relative mt0"><div class="bg-black-10 br1 pa1 dib mt2 absolute bottom-1 left-0 z-999"><span class="b black-60">1</span><span class="f7 black-50">/3</span></div></div><div class="flex items-start justify-between bg-evian ph2"><div class="w-20"></div><div class="f5 lh-copy black-90 w-60 center"><p class="lh-copy measure-wide f5">When you decide to overprovision a cluster, a node is always empty and ready to deploy Pods.</p></div><label class="db f6 b black-50 pv3 pointer w-20 tr ttu" for="carousel-5-1">Next <svg viewBox="0 0 10 16" xmlns="" class="pagination-icon ml2"><polyline fill="none" vector-effect="non-scaling-stroke" points="2,2 8,8 2,14"></polyline></svg></label></div></div></li><li class="mv3"><input type="radio" name="carousel-5" id="carousel-5-1" class="dn o-0 absolute bottom-0 left-0"><div class="dn checked-reveal"><div class="aspect-ratio aspect-ratio--4x3"><img src="" alt="As soon as a Pod is created in the empty node, the Cluster Autoscaler creates a new node." 
class="aspect-ratio--object" loading="lazy"></div><div class="bt b-solid bw2 b--black-70 relative mt0"><div class="bg-black-10 br1 pa1 dib mt2 absolute bottom-1 left-0 z-999"><span class="b black-60">2</span><span class="f7 black-50">/3</span></div></div><div class="flex items-start justify-between bg-evian ph2"><label class="db f6 b black-50 pv3 pointer w-20 ttu" for="carousel-5-0"><svg viewBox="0 0 10 16" xmlns="" class="pagination-icon mr2"><polyline fill="none" vector-effect="non-scaling-stroke" points="8,2 2,8 8,14"></polyline></svg> Previous</label><div class="f5 lh-copy black-90 w-60 center"><p class="lh-copy measure-wide f5">As soon as a Pod is created in the empty node, the Cluster Autoscaler creates a new node.</p></div><label class="db f6 b black-50 pv3 pointer w-20 tr ttu" for="carousel-5-2">Next <svg viewBox="0 0 10 16" xmlns="" class="pagination-icon ml2"><polyline fill="none" vector-effect="non-scaling-stroke" points="2,2 8,8 2,14"></polyline></svg></label></div></div></li><li class="mv3"><input type="radio" name="carousel-5" id="carousel-5-2" class="dn o-0 absolute bottom-0 left-0"><div class="dn checked-reveal"><div class="aspect-ratio aspect-ratio--4x3"><img src="" alt="Since creating the Node happens in the background, you will likely not notice the lead time to provision a cloud machine." 
class="aspect-ratio--object" loading="lazy"></div><div class="bt b-solid bw2 b--black-70 relative mt0"><div class="bg-black-10 br1 pa1 dib mt2 absolute bottom-1 left-0 z-999"><span class="b black-60">3</span><span class="f7 black-50">/3</span></div></div><div class="flex items-start justify-between bg-evian ph2"><label class="db f6 b black-50 pv3 pointer w-20 ttu" for="carousel-5-1"><svg viewBox="0 0 10 16" xmlns="" class="pagination-icon mr2"><polyline fill="none" vector-effect="non-scaling-stroke" points="8,2 2,8 8,14"></polyline></svg> Previous</label><div class="f5 lh-copy black-90 w-60 center"><p class="lh-copy measure-wide f5">Since creating the Node happens in the background, you will likely not notice the lead time to provision a cloud machine.</p></div><div class="w-20"></div></div></div></li></ul></div><p class="lh-copy measure-wide f4"><strong class="b">It's a trade-off: you incur an extra cost (one empty compute unit available at all times), but you gain in speed.</strong></p><p class="lh-copy measure-wide f4">With this strategy, you can scale your fleet much quicker.</p><p class="lh-copy measure-wide f4"><em class="i">But there's bad and good news.</em></p><p class="lh-copy measure-wide f4">The bad news is that the Cluster Autoscaler doesn't have this functionality built-in.</p><p class="lh-copy measure-wide f4"><strong class="b">It cannot be configured to be proactive, and there is no flag to "always provision an empty node".</strong></p><p class="lh-copy measure-wide f4">The good news is that you can still fake it.</p><p class="lh-copy measure-wide f4"><em class="i">Let me explain.</em></p><p class="lh-copy measure-wide f4"><strong class="b">You could run a Deployment with enough requests to reserve an entire node.</strong></p><p class="lh-copy measure-wide f4">You could think about this pod as a placeholder — it is meant to reserve space, not use any resource.</p><p class="lh-copy measure-wide f4">As soon as a real Pod is created, you could evict 
the placeholder and deploy the Pod.</p><div class="relative overflow-hidden"><ul class="pl0 list"><li class="mv3"><input type="radio" name="carousel-6" id="carousel-6-0" checked class="dn o-0 absolute bottom-0 left-0"><div class="dn checked-reveal"><div class="aspect-ratio aspect-ratio--4x3"><img src="" alt="In an overprovisioned cluster you have a Pod as a placeholder with low priority." class="aspect-ratio--object" loading="lazy"></div><div class="bt b-solid bw2 b--black-70 relative mt0"><div class="bg-black-10 br1 pa1 dib mt2 absolute bottom-1 left-0 z-999"><span class="b black-60">1</span><span class="f7 black-50">/3</span></div></div><div class="flex items-start justify-between bg-evian ph2"><div class="w-20"></div><div class="f5 lh-copy black-90 w-60 center"><p class="lh-copy measure-wide f5">In an overprovisioned cluster you have a Pod as a placeholder with low priority.</p></div><label class="db f6 b black-50 pv3 pointer w-20 tr ttu" for="carousel-6-1">Next <svg viewBox="0 0 10 16" xmlns="" class="pagination-icon ml2"><polyline fill="none" vector-effect="non-scaling-stroke" points="2,2 8,8 2,14"></polyline></svg></label></div></div></li><li class="mv3"><input type="radio" name="carousel-6" id="carousel-6-1" class="dn o-0 absolute bottom-0 left-0"><div class="dn checked-reveal"><div class="aspect-ratio aspect-ratio--4x3"><img src="" alt="As soon as you create more replicas, the scheduler evicts the placeholder pod and deploys the new Pod. The placeholder pod is unschedulable and triggers the Cluster Autoscaler." 
class="aspect-ratio--object" loading="lazy"></div><div class="bt b-solid bw2 b--black-70 relative mt0"><div class="bg-black-10 br1 pa1 dib mt2 absolute bottom-1 left-0 z-999"><span class="b black-60">2</span><span class="f7 black-50">/3</span></div></div><div class="flex items-start justify-between bg-evian ph2"><label class="db f6 b black-50 pv3 pointer w-20 ttu" for="carousel-6-0"><svg viewBox="0 0 10 16" xmlns="" class="pagination-icon mr2"><polyline fill="none" vector-effect="non-scaling-stroke" points="8,2 2,8 8,14"></polyline></svg> Previous</label><div class="f5 lh-copy black-90 w-60 center"><p class="lh-copy measure-wide f5">As soon as you create more replicas, the scheduler evicts the placeholder pod and deploys the new Pod. The placeholder pod is unschedulable and triggers the Cluster Autoscaler.</p></div><label class="db f6 b black-50 pv3 pointer w-20 tr ttu" for="carousel-6-2">Next <svg viewBox="0 0 10 16" xmlns="" class="pagination-icon ml2"><polyline fill="none" vector-effect="non-scaling-stroke" points="2,2 8,8 2,14"></polyline></svg></label></div></div></li><li class="mv3"><input type="radio" name="carousel-6" id="carousel-6-2" class="dn o-0 absolute bottom-0 left-0"><div class="dn checked-reveal"><div class="aspect-ratio aspect-ratio--4x3"><img src="" alt="In the background, the new node is provisioned and the placeholder pod is deployed into it." 
class="aspect-ratio--object" loading="lazy"></div><div class="bt b-solid bw2 b--black-70 relative mt0"><div class="bg-black-10 br1 pa1 dib mt2 absolute bottom-1 left-0 z-999"><span class="b black-60">3</span><span class="f7 black-50">/3</span></div></div><div class="flex items-start justify-between bg-evian ph2"><label class="db f6 b black-50 pv3 pointer w-20 ttu" for="carousel-6-1"><svg viewBox="0 0 10 16" xmlns="" class="pagination-icon mr2"><polyline fill="none" vector-effect="non-scaling-stroke" points="8,2 2,8 8,14"></polyline></svg> Previous</label><div class="f5 lh-copy black-90 w-60 center"><p class="lh-copy measure-wide f5">In the background, the new node is provisioned and the placeholder pod is deployed into it.</p></div><div class="w-20"></div></div></div></li></ul></div><p class="lh-copy measure-wide f4">Notice how, this time, you still have to wait 5 minutes for the node to be added to the cluster, but you can keep using the current node.</p><p class="lh-copy measure-wide f4">In the meantime, a new node is provisioned in the background.</p><p class="lh-copy measure-wide f4"><em class="i">How can you achieve that?</em></p><p class="lh-copy measure-wide f4"><strong class="b">Overprovisioning can be configured using a Deployment that runs a pod that sleeps forever.</strong></p><div class="mv4 mv5-l"><header class="bg-light-gray flex pv2 pl1 br--top br2 relative"><div class="w1 h1 ml1 br-100 bg-dark-red"></div><div class="w1 h1 ml1 br-100 bg-green"></div><div class="w1 h1 ml1 br-100 bg-yellow"></div><p class="code f6 mv0 black-60 w-100 tc absolute top-0 left-0 h1 pv2">overprovision.yaml</p></header><pre class="code-light-theme relative overflow-auto mv0 br2 br--bottom"><code class="code lh-copy"><span class="standard"><span class="token key atrule">apiVersion</span><span class="token punctuation">:</span> apps/v1 <span class="token key atrule">kind</span><span class="token punctuation">:</span> Deployment <span class="token key atrule">metadata</span><span 
class="token punctuation">:</span> <span class="token key atrule">name</span><span class="token punctuation">:</span> overprovisioning <span class="token key atrule">spec</span><span class="token punctuation">:</span> <span class="token key atrule">replicas</span><span class="token punctuation">:</span> <span class="token number">1</span> <span class="token key atrule">selector</span><span class="token punctuation">:</span> <span class="token key atrule">matchLabels</span><span class="token punctuation">:</span> <span class="token key atrule">run</span><span class="token punctuation">:</span> overprovisioning <span class="token key atrule">template</span><span class="token punctuation">:</span> <span class="token key atrule">metadata</span><span class="token punctuation">:</span> <span class="token key atrule">labels</span><span class="token punctuation">:</span> <span class="token key atrule">run</span><span class="token punctuation">:</span> overprovisioning <span class="token key atrule">spec</span><span class="token punctuation">:</span> <span class="token key atrule">containers</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> <span class="token key atrule">name</span><span class="token punctuation">:</span> pause </span><span class="highlight"> <span class="token key atrule">image</span><span class="token punctuation">:</span> </span><span class="standard"> <span class="token key atrule">resources</span><span class="token punctuation">:</span> </span><span class="highlight"> <span class="token key atrule">requests</span><span class="token punctuation">:</span> <span class="token key atrule">cpu</span><span class="token punctuation">:</span> <span class="token string">'1739m'</span> <span class="token key atrule">memory</span><span class="token punctuation">:</span> <span class="token string">'5.9G'</span></span></code></pre></div><p class="lh-copy measure-wide f4"><strong class="b">You should pay extra attention to the 
memory and CPU requests.</strong></p><p class="lh-copy measure-wide f4">The scheduler uses those values to decide where to deploy a Pod.</p><p class="lh-copy measure-wide f4">In this particular case, they are used to reserve the space.</p><p class="lh-copy measure-wide f4">You could provision a single large pod whose requests roughly match the available node resources.</p><p class="lh-copy measure-wide f4"><strong class="b">Please make sure that you account for resources consumed by the kubelet, operating system, kube-proxy, etc.</strong></p><p class="lh-copy measure-wide f4">If your node instance is 2 vCPU and 8GB of memory and the available space for pods is 1.73 vCPU and ~5.9GB of memory, your pause pod should match the latter.</p><img class="db pv3 center" src="" alt="Sizing sleep pod in overprovisioned clusters" loading="lazy"><p class="lh-copy measure-wide f4">To make sure that the Pod is evicted as soon as a real Pod is created, you can use <a href="" target="_blank" rel="noreferrer" class="link navy underline hover-sky">Priorities and Preemptions.</a></p><p class="lh-copy measure-wide f4"><strong class="b">Pod Priority indicates the importance of a Pod relative to other Pods.</strong></p><p class="lh-copy measure-wide f4">When a Pod cannot be scheduled, the scheduler tries to preempt (evict) lower priority Pods to schedule the Pending pod.</p><p class="lh-copy measure-wide f4">You can configure Pod Priorities in your cluster with a PriorityClass:</p><div class="mv4 mv5-l"><header class="bg-light-gray flex pv2 pl1 br--top br2 relative"><div class="w1 h1 ml1 br-100 bg-dark-red"></div><div class="w1 h1 ml1 br-100 bg-green"></div><div class="w1 h1 ml1 br-100 bg-yellow"></div><p class="code f6 mv0 black-60 w-100 tc absolute top-0 left-0 h1 pv2">priority.yaml</p></header><pre class="code-light-theme relative overflow-auto mv0 br2 br--bottom"><code class="code lh-copy"><span class="standard"><span class="token key atrule">apiVersion</span><span 
class="token punctuation">:</span> <span class="token key atrule">kind</span><span class="token punctuation">:</span> PriorityClass <span class="token key atrule">metadata</span><span class="token punctuation">:</span> <span class="token key atrule">name</span><span class="token punctuation">:</span> overprovisioning <span class="token key atrule">value</span><span class="token punctuation">:</span> <span class="token number">-1</span> <span class="token key atrule">globalDefault</span><span class="token punctuation">:</span> <span class="token boolean important">false</span> <span class="token key atrule">description</span><span class="token punctuation">:</span> <span class="token string">'Priority class used by overprovisioning.'</span></span></code></pre></div><p class="lh-copy measure-wide f4">Since the default priority for a Pod is <code class="code f5 lh-copy bg-near-white br2 pv1 ph2 fs-normal word-nowrap-even-on-whitespace">0</code> and the <code class="code f5 lh-copy bg-near-white br2 pv1 ph2 fs-normal word-nowrap-even-on-whitespace">overprovisioning</code> PriorityClass has a value of <code class="code f5 lh-copy bg-near-white br2 pv1 ph2 fs-normal word-nowrap-even-on-whitespace">-1</code>, those Pods are the first to be evicted when the cluster runs out of space.</p><p class="lh-copy measure-wide f4">PriorityClass also has two optional fields: <code class="code f5 lh-copy bg-near-white br2 pv1 ph2 fs-normal word-nowrap-even-on-whitespace">globalDefault</code> and <code class="code f5 lh-copy bg-near-white br2 pv1 ph2 fs-normal word-nowrap-even-on-whitespace">description</code>.</p><ul><li class="lh-copy f4 mv1 measure-wide">The <code class="code f5 lh-copy bg-near-white br2 pv1 ph2 fs-normal word-nowrap-even-on-whitespace">description</code> is a human-readable memo of what the PriorityClass is about.</li><li class="lh-copy f4 mv1 measure-wide">The <code class="code f5 lh-copy bg-near-white br2 pv1 ph2 fs-normal 
word-nowrap-even-on-whitespace">globalDefault</code> field indicates that the value of this PriorityClass should be used for Pods without a <code class="code f5 lh-copy bg-near-white br2 pv1 ph2 fs-normal word-nowrap-even-on-whitespace">priorityClassName</code>. Only one PriorityClass with <code class="code f5 lh-copy bg-near-white br2 pv1 ph2 fs-normal word-nowrap-even-on-whitespace">globalDefault</code> set to <code class="code f5 lh-copy bg-near-white br2 pv1 ph2 fs-normal word-nowrap-even-on-whitespace">true</code> can exist in the system.</li></ul><p class="lh-copy measure-wide f4">You can assign the priority to your sleep Pod with:</p><div class="mv4 mv5-l"><header class="bg-light-gray flex pv2 pl1 br--top br2 relative"><div class="w1 h1 ml1 br-100 bg-dark-red"></div><div class="w1 h1 ml1 br-100 bg-green"></div><div class="w1 h1 ml1 br-100 bg-yellow"></div><p class="code f6 mv0 black-60 w-100 tc absolute top-0 left-0 h1 pv2">overprovision.yaml</p></header><pre class="code-light-theme relative overflow-auto mv0 br2 br--bottom"><code class="code lh-copy"><span class="standard"><span class="token key atrule">apiVersion</span><span class="token punctuation">:</span> apps/v1 <span class="token key atrule">kind</span><span class="token punctuation">:</span> Deployment <span class="token key atrule">metadata</span><span class="token punctuation">:</span> <span class="token key atrule">name</span><span class="token punctuation">:</span> overprovisioning <span class="token key atrule">spec</span><span class="token punctuation">:</span> <span class="token key atrule">replicas</span><span class="token punctuation">:</span> <span class="token number">1</span> <span class="token key atrule">selector</span><span class="token punctuation">:</span> <span class="token key atrule">matchLabels</span><span class="token punctuation">:</span> <span class="token key atrule">run</span><span class="token punctuation">:</span> overprovisioning <span class="token key 
atrule">template</span><span class="token punctuation">:</span> <span class="token key atrule">metadata</span><span class="token punctuation">:</span> <span class="token key atrule">labels</span><span class="token punctuation">:</span> <span class="token key atrule">run</span><span class="token punctuation">:</span> overprovisioning <span class="token key atrule">spec</span><span class="token punctuation">:</span> </span><span class="highlight"> <span class="token key atrule">priorityClassName</span><span class="token punctuation">:</span> overprovisioning </span><span class="standard"> <span class="token key atrule">containers</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> <span class="token key atrule">name</span><span class="token punctuation">:</span> reserve<span class="token punctuation">-</span>resources <span class="token key atrule">image</span><span class="token punctuation">:</span> <span class="token key atrule">resources</span><span class="token punctuation">:</span> <span class="token key atrule">requests</span><span class="token punctuation">:</span> <span class="token key atrule">cpu</span><span class="token punctuation">:</span> <span class="token string">'1739m'</span> <span class="token key atrule">memory</span><span class="token punctuation">:</span> <span class="token string">'5.9G'</span></span></code></pre></div><p class="lh-copy measure-wide f4"><em class="i">The setup is complete!</em></p><p class="lh-copy measure-wide f4">When there are not enough resources in the cluster, the pause pod is preempted, and new pods take its place.</p><p class="lh-copy measure-wide f4">Since the pause pod becomes unschedulable, it forces the Cluster Autoscaler to add more nodes to the cluster.</p><p class="lh-copy measure-wide f4"><em class="i">Now that you're ready to overprovision your cluster, it's worth having a look at optimising your applications for scaling.</em></p><h2 class="f2 pt5 pb2 mt3" 
id="selecting-the-correct-memory-and-cpu-requests-for-your-pods">Selecting the correct memory and CPU requests for your Pods</h2><p class="lh-copy measure-wide f4"><strong class="b">The cluster autoscaler makes scaling decisions based on the presence of pending pods.</strong></p><p class="lh-copy measure-wide f4">The Kubernetes scheduler assigns (or not) a Pod to a Node based on its memory and CPU requests.</p><p class="lh-copy measure-wide f4">Hence, it's essential to set the correct requests on your workloads, or you might be triggering your autoscaler too late (or too early).</p><p class="lh-copy measure-wide f4"><em class="i">Let's have a look at an example.</em></p><p class="lh-copy measure-wide f4">You decide to profile an application, and you find out that:</p><ul><li class="lh-copy f4 mv1 measure-wide">Under average load, the application consumes 512MB of memory and 0.25 vCPU.</li><li class="lh-copy f4 mv1 measure-wide">At peak, the application can consume up to 4GB of memory and 1 vCPU.</li></ul><img class="db pv3 center" src="" alt="Setting the right memory and CPU requests" loading="lazy"><p class="lh-copy measure-wide f4">The limit for your container is 4GB of memory and 1 vCPU.</p><p class="lh-copy measure-wide f4"><em class="i">However, what about the requests?</em></p><p class="lh-copy measure-wide f4">The scheduler uses the Pod's memory and CPU requests to select the best node before creating the Pod.</p><p class="lh-copy measure-wide f4">So you could:</p><ol><li class="lh-copy f4 mv1 measure-wide"><strong class="b">Set requests lower than the actual average usage.</strong></li><li class="lh-copy f4 mv1 measure-wide">Be conservative and <strong class="b">assign requests closer to the limit.</strong></li><li class="lh-copy f4 mv1 measure-wide"><strong class="b">Set requests to match the actual limits.</strong></li></ol><div class="relative overflow-hidden"><ul class="pl0 list"><li class="mv3"><input type="radio" name="carousel-7" 
id="carousel-7-0" checked class="dn o-0 absolute bottom-0 left-0"><div class="dn checked-reveal"><div class="aspect-ratio aspect-ratio--4x3"><img src="" alt="You could assign requests that are lower than the average app consumption." class="aspect-ratio--object" loading="lazy"></div><div class="bt b-solid bw2 b--black-70 relative mt0"><div class="bg-black-10 br1 pa1 dib mt2 absolute bottom-1 left-0 z-999"><span class="b black-60">1</span><span class="f7 black-50">/3</span></div></div><div class="flex items-start justify-between bg-evian ph2"><div class="w-20"></div><div class="f5 lh-copy black-90 w-60 center"><p class="lh-copy measure-wide f5">You could assign requests that are <strong class="b">lower</strong> than the average app consumption.</p></div><label class="db f6 b black-50 pv3 pointer w-20 tr ttu" for="carousel-7-1">Next <svg viewBox="0 0 10 16" xmlns="" class="pagination-icon ml2"><polyline fill="none" vector-effect="non-scaling-stroke" points="2,2 8,8 2,14"></polyline></svg></label></div></div></li><li class="mv3"><input type="radio" name="carousel-7" id="carousel-7-1" class="dn o-0 absolute bottom-0 left-0"><div class="dn checked-reveal"><div class="aspect-ratio aspect-ratio--4x3"><img src="" alt="You could assign requests that match the actual usage." 
class="aspect-ratio--object" loading="lazy"></div><div class="bt b-solid bw2 b--black-70 relative mt0"><div class="bg-black-10 br1 pa1 dib mt2 absolute bottom-1 left-0 z-999"><span class="b black-60">2</span><span class="f7 black-50">/3</span></div></div><div class="flex items-start justify-between bg-evian ph2"><label class="db f6 b black-50 pv3 pointer w-20 ttu" for="carousel-7-0"><svg viewBox="0 0 10 16" xmlns="" class="pagination-icon mr2"><polyline fill="none" vector-effect="non-scaling-stroke" points="8,2 2,8 8,14"></polyline></svg> Previous</label><div class="f5 lh-copy black-90 w-60 center"><p class="lh-copy measure-wide f5">You could assign requests that match the actual usage.</p></div><label class="db f6 b black-50 pv3 pointer w-20 tr ttu" for="carousel-7-2">Next <svg viewBox="0 0 10 16" xmlns="" class="pagination-icon ml2"><polyline fill="none" vector-effect="non-scaling-stroke" points="2,2 8,8 2,14"></polyline></svg></label></div></div></li><li class="mv3"><input type="radio" name="carousel-7" id="carousel-7-2" class="dn o-0 absolute bottom-0 left-0"><div class="dn checked-reveal"><div class="aspect-ratio aspect-ratio--4x3"><img src="" alt="You could set the requests so high that they match the limits of your app." 
class="aspect-ratio--object" loading="lazy"></div><div class="bt b-solid bw2 b--black-70 relative mt0"><div class="bg-black-10 br1 pa1 dib mt2 absolute bottom-1 left-0 z-999"><span class="b black-60">3</span><span class="f7 black-50">/3</span></div></div><div class="flex items-start justify-between bg-evian ph2"><label class="db f6 b black-50 pv3 pointer w-20 ttu" for="carousel-7-1"><svg viewBox="0 0 10 16" xmlns="" class="pagination-icon mr2"><polyline fill="none" vector-effect="non-scaling-stroke" points="8,2 2,8 8,14"></polyline></svg> Previous</label><div class="f5 lh-copy black-90 w-60 center"><p class="lh-copy measure-wide f5">You could set the requests so high that they match the limits of your app.</p></div><div class="w-20"></div></div></div></li></ul></div><p class="lh-copy measure-wide f4"><strong class="b">Defining requests lower than the actual usage is problematic since your nodes will often be overcommitted.</strong></p><p class="lh-copy measure-wide f4">As an example, suppose you assign 256MB as the memory request, half of the 512MB the app uses on average.</p><p class="lh-copy measure-wide f4">The Kubernetes scheduler can then fit twice as many Pods on each node.</p><p class="lh-copy measure-wide f4">However, Pods use twice as much memory in practice, so they start competing for resources (CPU) and being evicted (not enough memory on the Node).</p><img class="db pv3 center" src="" alt="Overcommitting nodes" loading="lazy"><p class="lh-copy measure-wide f4"><strong class="b">Overcommitting nodes can lead to excessive evictions, more work for the kubelet and a lot of rescheduling.</strong></p><p class="lh-copy measure-wide f4"><em class="i">What happens if you set the request to the same value as the limit?</em></p><p class="lh-copy measure-wide f4">You can set requests and limits to the same values.</p><p class="lh-copy measure-wide f4">In Kubernetes, this is often referred to as <a href="" target="_blank" rel="noreferrer" class="link navy underline hover-sky">Guaranteed Quality of Service 
class</a> and refers to the fact that it's improbable that the pod will be terminated and evicted.</p><p class="lh-copy measure-wide f4">The Kubernetes scheduler will reserve the entire CPU and memory for the Pod on the assigned node.</p><p class="lh-copy measure-wide f4"><strong class="b">Pods with Guaranteed Quality of Service are stable but also inefficient.</strong></p><p class="lh-copy measure-wide f4">If your app uses 512MB of memory on average, but you reserve 4GB for it, you have 3.5GB unused most of the time.</p><img class="db pv3 center" src="" alt="Overcommitting nodes" loading="lazy"><p class="lh-copy measure-wide f4"><em class="i">Is it worth it?</em></p><p class="lh-copy measure-wide f4">If you want extra stability, yes.</p><p class="lh-copy measure-wide f4">If you want efficiency, you might want to lower the requests and leave a gap between those and the limit.</p><p class="lh-copy measure-wide f4">This is often referred to as <strong class="b">Burstable Quality of Service class</strong> and refers to the fact that the Pod can occasionally burst above its baseline consumption and use more memory and CPU.</p><p class="lh-copy measure-wide f4">When your requests match the app's actual usage, the scheduler will pack your pods onto your nodes efficiently.</p><p class="lh-copy measure-wide f4"><strong class="b">Occasionally, the app might require more memory or CPU.</strong></p><ol><li class="lh-copy f4 mv1 measure-wide">If there are resources in the Node, the app will use them before returning to the baseline consumption.</li><li class="lh-copy f4 mv1 measure-wide">If the node is low on resources, the pod will compete for resources (CPU), and the kubelet might try to evict the Pod (memory).</li></ol><p class="lh-copy measure-wide f4"><em class="i">Should you use Guaranteed or Burstable Quality of Service?</em></p><p class="lh-copy measure-wide f4"><em class="i">It depends.</em></p><ol><li class="lh-copy f4 mv1 measure-wide"><strong class="b">Use Guaranteed 
Quality of Service (requests equal to limits) when you want to minimise rescheduling and evictions for the Pod.</strong> An excellent example is a Pod for a database.</li><li class="lh-copy f4 mv1 measure-wide"><strong class="b">Use Burstable Quality of Service (requests that match actual average usage) when you want to optimise your cluster and use the resources wisely.</strong> If you have a web application or a REST API, you might want to use a Burstable Quality of Service.</li></ol><p class="lh-copy measure-wide f4"><em class="i">How do you select the correct requests and limits values?</em></p><p class="lh-copy measure-wide f4"><strong class="b">You should profile the application and measure memory and CPU consumption when idle, under load and at peak.</strong></p><p class="lh-copy measure-wide f4">A more straightforward strategy consists of deploying the Vertical Pod Autoscaler and waiting for it to suggest the correct values.</p><p class="lh-copy measure-wide f4">The Vertical Pod Autoscaler collects the data from the Pod and applies a regression model to extrapolate requests and limits.</p><p class="lh-copy measure-wide f4"><a href="" target="_blank" rel="noreferrer" class="link navy underline hover-sky">You can learn more about how to do that in this article.</a></p><h2 class="f2 pt5 pb2 mt3" id="what-about-downscaling-a-cluster-">What about downscaling a cluster?</h2><p class="lh-copy measure-wide f4"><strong class="b">Every 10 seconds, the Cluster Autoscaler checks for nodes to remove, and it considers a node only when its request utilization falls below 50%.</strong></p><p class="lh-copy measure-wide f4">In other words, for all the Pods on the same node, it sums the CPU and memory requests.</p><p class="lh-copy measure-wide f4">If they are lower than half of the node's capacity, the Cluster Autoscaler will consider the current node for downscaling.</p><blockquote class="pl3 mh2 bl bw2 b--blue bg-evian pv1 ph4"><p class="lh-copy measure-wide f4">It's worth noting that the Cluster 
Autoscaler does not consider actual CPU and memory usage or limits and instead only looks at resource requests.</p></blockquote><p class="lh-copy measure-wide f4">Before the node is removed, the Cluster Autoscaler executes:</p><ul><li class="lh-copy f4 mv1 measure-wide"><a href="" target="_blank" rel="noreferrer" class="link navy underline hover-sky">Pod checks</a> to make sure that the Pods can be moved to other nodes.</li><li class="lh-copy f4 mv1 measure-wide"><a href="" target="_blank" rel="noreferrer" class="link navy underline hover-sky">Node checks</a> to prevent nodes from being destroyed prematurely.</li></ul><p class="lh-copy measure-wide f4">If the checks pass, the Cluster Autoscaler will remove the node from the cluster.</p><h2 class="f2 pt5 pb2 mt3" id="why-not-autoscaling-based-on-memory-or-cpu-">Why not autoscaling based on memory or CPU?</h2><p class="lh-copy measure-wide f4"><strong class="b">CPU or memory-based cluster autoscalers don't care about Pods when scaling up and down.</strong></p><p class="lh-copy measure-wide f4">Imagine having a cluster with a single node and setting up the autoscaler to add a new node when the CPU reaches 80% of the total capacity.</p><p class="lh-copy measure-wide f4">You decide to create a Deployment with 3 replicas.</p><p class="lh-copy measure-wide f4">The combined resource usage for the three Pods reaches 85% of the CPU.</p><p class="lh-copy measure-wide f4">A new node is provisioned.</p><p class="lh-copy measure-wide f4"><em class="i">What if you don't need any more Pods?</em></p><p class="lh-copy measure-wide f4">You have a full node idling, which is not great.</p><p class="lh-copy measure-wide f4"><strong class="b">Using these types of autoscalers with Kubernetes is discouraged.</strong></p><h2 class="f2 pt5 pb2 mt3" id="summary">Summary</h2><p class="lh-copy measure-wide f4">Defining and implementing a successful scaling strategy in Kubernetes requires you to master several subjects:</p><ul><li class="lh-copy f4 
mv1 measure-wide">Allocatable resources in Kubernetes nodes.</li><li class="lh-copy f4 mv1 measure-wide">Fine-tuning refresh intervals for the Metrics Server, Horizontal Pod Autoscaler and Cluster Autoscaler.</li><li class="lh-copy f4 mv1 measure-wide">Architecting cluster and node instance sizes.</li><li class="lh-copy f4 mv1 measure-wide">Container image caching.</li><li class="lh-copy f4 mv1 measure-wide">Application benchmarking and profiling.</li></ul><p class="lh-copy measure-wide f4">With the proper monitoring tool, you can iteratively test your scaling strategy and tune the speed and costs of your cluster.</p></article><div class="mb4 mb5-l mw8 center"><ul 