CPU Limits, CPU Requests, and Aggressive Throttling in Kubernetes
What is CPU Throttling
CPU throttling means that an application's CPU time is artificially constrained once its usage approaches the container's CPU limit. In some cases, a container is throttled even when CPU utilization is nowhere near the limit, due to bugs in the Linux kernel.
Consider a single-threaded application that needs 200ms of CPU time to process a single request. The following diagram shows the application completing the request:
Now consider an application with a CPU limit of 0.4 CPUs. The application will only receive about 40ms of runtime in each 100ms period. Instead of completing the request in 200ms, it now takes a total of 440ms: the application is being CPU throttled.
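To make the arithmetic concrete, here is a minimal Go sketch of this throttling model (the function name and the model itself are illustrative simplifications, not part of Kubernetes or the kernel):

```go
package main

import (
	"fmt"
	"math"
)

// wallClockMS estimates how long a single-threaded task takes under a CFS
// quota. workMS is the CPU time the task needs, limitCores is the CPU limit
// (e.g. 0.4), and periodMS is the enforcement period (100ms by default).
func wallClockMS(workMS, limitCores, periodMS float64) float64 {
	quotaMS := limitCores * periodMS // runtime granted per period, e.g. 40ms
	elapsed := 0.0
	for {
		run := math.Min(workMS, quotaMS) // CPU time received in this period
		workMS -= run
		if workMS <= 0 {
			return elapsed + run // finishes partway through the period
		}
		elapsed += periodMS // throttled for the remainder of the period
	}
}

func main() {
	// 200ms of work with a 0.4 CPU limit: four full periods plus 40ms.
	fmt.Printf("%.0fms\n", wallClockMS(200, 0.4, 100)) // prints 440ms
}
```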
Preventing Errors by Detecting Containers Without CPU Limits
The first step in setting appropriate Kubernetes resource limits is to discover containers without limits.
Finding containers without CPU limits by namespace
Use this query to discover containers without CPU limits in a specific namespace.
sum by (namespace)
(count by (namespace,pod,container)(kube_pod_container_info{container!=""})
unless sum by (namespace,pod,container)(kube_pod_container_resource_limits{resource="cpu"}))
Finding containers with tight CPU limits
This technique aims to avoid CPU throttling by identifying containers that have CPU limits close to their actual utilization.
Use this query to find containers with CPU utilization close to the limit:
(sum by
(namespace,pod,container)(rate(container_cpu_usage_seconds_total{container!=""}[5m])) /
sum by(namespace,pod,container)(kube_pod_container_resource_limits{resource="cpu"})) > 0.8
Checking if the cluster has enough capacity
Kubernetes schedules a pod onto a node only if that node has enough resources to cover the aggregate resource requests of all the pod's containers. This also means that the node commits to each container the CPU and memory specified in its resource request.
Consider a Kubernetes cluster where the sum of all resource limits is greater than the resources actually available in the cluster. This is known as “overcommitting”. When the cluster is overcommitted, pods may work fine under normal circumstances, but under high load, containers can start consuming CPU and memory up to their limits. This can cause some pods to be evicted, and in extreme cases nodes can die due to resource starvation in the cluster.
To check for CPU overcommits in the cluster, use the following query:
100 * sum(kube_pod_container_resource_limits{container!="",resource="cpu"} ) /
sum(kube_node_status_capacity_cpu_cores)
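If you prefer to sanity-check the same arithmetic outside Prometheus, here is a minimal Go sketch of the overcommit calculation with made-up numbers (the function and values are purely illustrative):

```go
package main

import "fmt"

// overcommitPercent mirrors the query above: the sum of all container CPU
// limits as a percentage of the cluster's total CPU capacity.
func overcommitPercent(limits []float64, capacityCores float64) float64 {
	var total float64
	for _, l := range limits {
		total += l
	}
	return 100 * total / capacityCores
}

func main() {
	limits := []float64{2, 2, 1.5, 0.5, 4, 4}              // 14 cores of limits
	fmt.Printf("%.1f%%\n", overcommitPercent(limits, 8))   // 175% on an 8-core cluster
}
```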
Understanding Kubernetes request and limit
CPU “request” is used only for scheduling: it is the container’s wish list, used mainly to find the node best suited for it. CPU “limit” is the rental contract: once we find a node for the container, it absolutely cannot go over the limit.
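As a concrete illustration of where these two values live, here is a minimal sketch that builds a container spec with the Kubernetes Go API types (the container name, image, and values are placeholders; it assumes the k8s.io/api and k8s.io/apimachinery modules):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	container := corev1.Container{
		Name:  "app",                 // placeholder name
		Image: "example/app:latest",  // placeholder image
		Resources: corev1.ResourceRequirements{
			// Request: what the scheduler uses to pick a node.
			Requests: corev1.ResourceList{
				corev1.ResourceCPU:    resource.MustParse("500m"),
				corev1.ResourceMemory: resource.MustParse("256Mi"),
			},
			// Limit: the hard ceiling enforced by the kernel via CFS quota.
			Limits: corev1.ResourceList{
				corev1.ResourceCPU:    resource.MustParse("1"),
				corev1.ResourceMemory: resource.MustParse("512Mi"),
			},
		},
	}
	fmt.Println(container.Resources.Limits.Cpu().String()) // "1"
}
```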
How Kubernetes request & limit is implemented
Kubernetes uses kernel throttling to enforce CPU limits. If an application goes above its limit, it gets throttled (i.e., it receives fewer CPU cycles). Memory requests and limits are implemented differently, and violations are easier to detect: you only need to check whether your pod’s last restart status is OOMKilled. CPU throttling, however, is not easy to identify, because k8s only exposes usage metrics and not cgroup-related metrics.
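As a rough sketch of that OOMKilled check, here is a minimal client-go example that lists pods and reports containers whose last restart was an OOM kill (the namespace and kubeconfig path are placeholders; it assumes the k8s.io/client-go module):

```go
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from a local kubeconfig (path is a placeholder).
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	// List pods in a namespace and report containers whose last
	// termination reason was OOMKilled.
	pods, err := clientset.CoreV1().Pods("default").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, pod := range pods.Items {
		for _, cs := range pod.Status.ContainerStatuses {
			t := cs.LastTerminationState.Terminated
			if t != nil && t.Reason == "OOMKilled" {
				fmt.Printf("%s/%s restarted after OOMKilled (%d restarts)\n",
					pod.Name, cs.Name, cs.RestartCount)
			}
		}
	}
}
```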
CPU Request
For the sake of simplicity, let’s look at how it is organized on a four-core machine.
The k8s uses cgroups to control resource allocation (for both memory and CPU). Cgroups form a hierarchy, and a child can only use the resources allocated to its parent. The details are stored in a virtual filesystem (/sys/fs/cgroup). In the case of CPU, it’s /sys/fs/cgroup/cpu,cpuacct/*.
The k8s uses the cpu.shares file to allocate CPU resources. In this case, the root cgroup gets 4096 CPU shares, which represent 100% of the available CPU power (1 core = 1024; this is a fixed value). The root cgroup allocates its shares proportionally based on its children’s cpu.shares, and they do the same with their children, and so on. On a typical Kubernetes node there are three cgroups under the root cgroup: system.slice, user.slice, and kubepods. The first two are used to allocate resources to critical system workloads and non-k8s user-space programs. The last one, kubepods, is created by k8s to allocate resources to pods.
If you look at the above graph, you can see that the first and second cgroups have 1024 shares each, and kubepods has 4096. You may be wondering: there are only 4096 shares available at the root, yet the children’s shares add up to more than that (6144). The answer is that this value is logical; the Linux scheduler (CFS) uses it to allocate CPU proportionally. In this case, the first two cgroups each get roughly 680 shares (16.6% of 4096), and kubepods gets the remaining 2736. When idle, though, the first two cgroups will not be using all of their allocated resources, and the scheduler has a mechanism to avoid wasting unused CPU shares: it releases unused CPU to a global pool so it can be allocated to cgroups demanding more CPU power (it does this in batches to avoid the accounting penalty). The same workflow applies to all grandchildren as well.
This mechanism will make sure that CPU power is shared fairly, and no one can steal the CPU from others.
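Here is a small Go sketch of that proportional split, using the same numbers as above (the function is illustrative; the real accounting is done by CFS in the kernel):

```go
package main

import "fmt"

// effectiveShares shows how CFS turns cpu.shares values into a proportional
// split of the parent's CPU: each child's share divided by the sum of all
// children's shares, applied to what the parent actually has (4096 shares,
// i.e. 4 cores, in this example).
func effectiveShares(children map[string]float64, parentShares float64) map[string]float64 {
	var total float64
	for _, s := range children {
		total += s
	}
	out := make(map[string]float64)
	for name, s := range children {
		out[name] = s / total * parentShares // a weight, not a hard guarantee
	}
	return out
}

func main() {
	children := map[string]float64{
		"system.slice": 1024,
		"user.slice":   1024,
		"kubepods":     4096,
	}
	for name, share := range effectiveShares(children, 4096) {
		fmt.Printf("%-12s ~%.0f shares\n", name, share)
	}
	// system.slice and user.slice come out to roughly 680 each,
	// kubepods to roughly 2730, matching the proportions above.
}
```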
CPU Limit
Even though the k8s configuration for limits and requests looks similar, the implementation is entirely different; this is the most misleading and least documented part.
The k8s uses CFS’s quota mechanism to implement the limit. The limit is configured in two files, cpu.cfs_period_us and cpu.cfs_quota_us (next to cpu.shares), under the cgroup directory.
Unlike cpu.shares, the quota is based on a time period, not on available CPU power. cpu.cfs_period_us defines the length of that period; it is always 100000us (100ms). k8s has an option to change this value, but it is still alpha and feature gated. The scheduler uses this period to reset the used quota. The second file, cpu.cfs_quota_us, specifies the quota allowed within each period.
Note that it is also configured in us. The quota can exceed the period, which means you can configure a quota of more than 100ms.
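The translation from a CPU limit to a quota is simple arithmetic; here is a minimal Go sketch of it (illustrative only; the kubelet and container runtime do the real conversion when they write the cgroup files):

```go
package main

import "fmt"

// quotaMicros converts a CPU limit expressed in cores into the
// cpu.cfs_quota_us value written into the container's cgroup,
// given the enforcement period (100000us by default).
func quotaMicros(limitCores float64, periodMicros int64) int64 {
	return int64(limitCores * float64(periodMicros))
}

func main() {
	period := int64(100000)               // 100ms, the default cpu.cfs_period_us
	fmt.Println(quotaMicros(2, period))   // 200000us: a 2-core limit
	fmt.Println(quotaMicros(0.5, period)) // 50000us: a 500m limit
	fmt.Println(quotaMicros(2.5, period)) // 250000us: the quota may exceed the period
}
```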
Let’s discuss two scenarios on 16 core machines (Omio’s most common machine type).
Let’s say you have configured a CPU limit of 2 cores; the k8s will translate this to a 200ms quota. That means the container can use a maximum of 200ms of CPU time per period without getting throttled.
And here is where all the misunderstanding starts. As I said above, the allowed quota is 200ms. This means that if you are running ten parallel threads on a 12-core machine (see the second figure) where all other pods are idle, the quota will be exhausted after 20ms (10 * 20ms = 200ms), and all threads running under that pod will be throttled for the next 80ms (stop the world). To make the situation worse, the scheduler has a bug that causes unnecessary throttling and prevents the container from reaching the allowed quota.
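Here is a small Go sketch of the same scenario, to make the 20ms/80ms split explicit (a simplified model, not a scheduler simulation):

```go
package main

import "fmt"

// throttledPerPeriod models the scenario above: with quotaMS of CPU time per
// period and `threads` runnable threads all burning CPU, the quota is consumed
// in quotaMS/threads milliseconds of wall-clock time, and every thread is then
// frozen for the rest of the period.
func throttledPerPeriod(quotaMS float64, threads int, periodMS float64) (runMS, stallMS float64) {
	runMS = quotaMS / float64(threads) // wall-clock time until the quota is gone
	if runMS > periodMS {
		return periodMS, 0 // quota outlasts the period; no throttling
	}
	return runMS, periodMS - runMS
}

func main() {
	// 2-core limit (200ms quota) with 10 busy threads:
	run, stall := throttledPerPeriod(200, 10, 100)
	fmt.Printf("runs %.0fms, throttled %.0fms of every 100ms period\n", run, stall)
	// prints: runs 20ms, throttled 80ms of every 100ms period
}
```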
Checking the throttling rate of your pods
Just log in to the pod and run cat /sys/fs/cgroup/cpu/cpu.stat.
- nr_periods — total number of scheduling periods
- nr_throttled — number of those periods in which the cgroup was throttled
- throttled_time — total throttled time, in ns
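Here is a minimal Go sketch that parses that file and reports the throttling rate (it assumes the cgroup v1 path shown above):

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strconv"
	"strings"
)

func main() {
	// Read the cgroup v1 cpu.stat file from inside the container.
	f, err := os.Open("/sys/fs/cgroup/cpu/cpu.stat")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	stats := map[string]int64{}
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text()) // e.g. "nr_throttled 421"
		if len(fields) != 2 {
			continue
		}
		if v, err := strconv.ParseInt(fields[1], 10, 64); err == nil {
			stats[fields[0]] = v
		}
	}
	if stats["nr_periods"] > 0 {
		pct := 100 * float64(stats["nr_throttled"]) / float64(stats["nr_periods"])
		fmt.Printf("throttled in %.1f%% of periods, %dms total\n",
			pct, stats["throttled_time"]/1_000_000) // throttled_time is in ns
	}
}
```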
So what really happens?
We ended up with a high throttle rate on multiple applications — up to 50% more than what we assumed the limits were set for!
This cascaded into various errors — readiness probe failures, container stalls, network disconnections and timeouts within service calls — all in all leading to increased latency and higher error rates.
Fix and Impact
Simple. We disabled CPU limits until the latest kernel with the bug fix was deployed across all our clusters.
Immediately, we found a huge reduction in error rates (HTTP 5xx) of our services:
(Charts: HTTP error rates (5xx), p95 response time, utilization costs.)
What’s the catch?
This is the catch. We risk some containers hogging all the CPUs on a machine. If you have a good application stack in place (e.g. proper JVM tuning, Go tuning, Node VM tuning), then this is not a problem and you can live with it for a long time. But if you have applications that are poorly optimized, or simply not optimized at all (FROM java:latest), then the results can backfire. At Omio we have automated base Dockerfiles with sane defaults for our primary language stacks, so this was not an issue for us.
Please do monitor USE (Utilization, Saturation and Errors) metrics, API latencies and error rates, and make sure your results match expectations.
Which Issues Can Occur if You Don’t Specify the CPU Limit in Kubernetes?
If you do not specify a CPU limit, the container can use all the CPU resources available on the node. Containers with high CPU utilization can then slow down other containers on the same node, consume all available CPU, and even cause Kubernetes components such as the kubelet to become unresponsive. The node then enters a NotReady state, and its pods are rescheduled onto another node.
By setting limits on all containers, you can avoid most of the following problems:
- Out of Memory (OOM) issues — can cause a node to go down, affecting the stability of the cluster. For example, applications with memory leaks can cause OOM problems. However, memory limits on containers can prevent memory leaks within a container from affecting the node.
- CPU starvation — applications that are too CPU-intensive can affect all applications on the same node. Other applications can slow down or become unresponsive.
- Pod eviction — when a node runs out of resources, the node initiates an eviction process that terminates pods. The first pods evicted are those that have no resource requests.
- Financial waste — if the cluster runs fine without any resource requests or limits and there are no errors, this probably means the cluster is over-provisioned and you are overpaying for hardware resources.