When you specify a Pod, you can optionally specify how much CPU and memory (RAM) each Container needs. When Containers have resource requests specified, the scheduler can make better decisions about which nodes to place Pods on. And when Containers have their limits specified, contention for resources on a node can be handled in a specified manner. For more details about the difference between requests and limits, see Resource QoS.
CPU and memory are each a resource type. A resource type has a base unit. CPU is specified in units of cores, and memory is specified in units of bytes.
CPU and memory are collectively referred to as compute resources, or just resources. Compute resources are measurable quantities that can be requested, allocated, and consumed. They are distinct from API resources. API resources, such as Pods and Services, are objects that can be read and modified through the Kubernetes API server.
Each Container of a Pod can specify one or more of the following:
spec.containers[].resources.limits.cpu
spec.containers[].resources.limits.memory
spec.containers[].resources.requests.cpu
spec.containers[].resources.requests.memory
Although requests and limits can only be specified on individual Containers, it is convenient to talk about Pod resource requests and limits. A Pod resource request/limit for a particular resource type is the sum of the resource requests/limits of that type for each Container in the Pod.
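Limits and requests for CPU resources are measured in cpu units. One cpu, in Kubernetes, is equivalent to 1 AWS vCPU, 1 GCP Core, 1 Azure vCore, or 1 Hyperthread on a bare-metal Intel processor with Hyperthreading.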
Fractional requests are allowed. A Container with spec.containers[].resources.requests.cpu of 0.5 is guaranteed half as much CPU as one that asks for 1 CPU. The expression 0.1 is equivalent to the expression 100m, which can be read as “one hundred millicpu”. Some people say “one hundred millicores”, and this is understood to mean the same thing. A request with a decimal point, like 0.1, is converted to 100m by the API, and precision finer than 1m is not allowed. For this reason, the form 100m might be preferred.
CPU is always requested as an absolute quantity, never as a relative quantity; 0.1 is the same amount of CPU on a single-core, dual-core, or 48-core machine.
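The equivalence between 0.1 and 100m can be illustrated with a short Python sketch. The function name cpu_to_millicpu is hypothetical, and raising an error for precision finer than 1m is an assumption of this sketch, not a description of how the API server itself reacts.

```python
from fractions import Fraction


def cpu_to_millicpu(quantity: str) -> int:
    """Convert a CPU quantity such as "0.1", "100m", or "2" to whole millicpu.

    Hypothetical helper for illustration only; it is not part of any
    Kubernetes client library.
    """
    if quantity.endswith("m"):
        # Already expressed in millicpu, e.g. "100m".
        milli = Fraction(quantity[:-1])
    else:
        # Expressed in cpu units, e.g. "0.1" or "2".
        milli = Fraction(quantity) * 1000
    if milli.denominator != 1:
        # Assumption: this sketch rejects anything finer than 1m.
        raise ValueError(f"{quantity!r} is finer than 1m")
    return int(milli)


print(cpu_to_millicpu("0.1"), cpu_to_millicpu("100m"))
```

Exact fractions are used instead of floats so that a value like 0.1 converts to exactly 100 millicpu without rounding error.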
Limits and requests for memory are measured in bytes. You can express memory as a plain integer or as a fixed-point number using one of these SI suffixes: E, P, T, G, M, k. You can also use the power-of-two equivalents: Ei, Pi, Ti, Gi, Mi, Ki. For example, the following represent roughly the same value:
128974848, 129e6, 129M, 123Mi
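To see why these four spellings are close to each other, the following Python sketch converts each one to bytes. The function memory_to_bytes and the two lookup tables are hypothetical names for this illustration. The decimal kilo suffix is written as a lowercase k here, which is the spelling the Kubernetes quantity format accepts.

```python
from fractions import Fraction

# Decimal (SI) and power-of-two multipliers for memory quantities.
DECIMAL = {"k": 10**3, "M": 10**6, "G": 10**9,
           "T": 10**12, "P": 10**15, "E": 10**18}
BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30,
          "Ti": 2**40, "Pi": 2**50, "Ei": 2**60}


def memory_to_bytes(quantity: str) -> int:
    """Convert a memory quantity string to a whole number of bytes."""
    # Check two-letter binary suffixes first so "Mi" is not mistaken for "M".
    for suffix, factor in BINARY.items():
        if quantity.endswith(suffix):
            return int(Fraction(quantity[:-2]) * factor)
    for suffix, factor in DECIMAL.items():
        if quantity.endswith(suffix):
            return int(Fraction(quantity[:-1]) * factor)
    # Plain integers and exponent notation such as "129e6".
    return int(Fraction(quantity))


for q in ["128974848", "129e6", "129M", "123Mi"]:
    print(q, memory_to_bytes(q))
```

The first and last values are exactly equal (123 × 2^20 = 128974848), while 129e6 and 129M are both 129000000.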
Here’s an example. The following Pod has two Containers. Each Container has a request of 0.25 cpu and 64MiB (2^26 bytes) of memory. Each Container has a limit of 0.5 cpu and 128MiB of memory. You can say the Pod has a request of 0.5 cpu and 128MiB of memory, and a limit of 1 cpu and 256MiB of memory.
apiVersion: v1
kind: Pod
metadata:
  name: frontend
spec:
  containers:
  - name: db
    image: mysql
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"
  - name: wp
    image: wordpress
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"
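The Pod-level figures quoted for this example are sums over its Containers. A minimal Python sketch of that arithmetic follows. The data layout and the pod_total function are hypothetical, with CPU held in millicpu and memory in Mi to keep the numbers readable.

```python
# Per-Container values copied from the "frontend" Pod spec.
containers = [
    {"name": "db",
     "requests": {"cpu_m": 250, "memory_mi": 64},
     "limits": {"cpu_m": 500, "memory_mi": 128}},
    {"name": "wp",
     "requests": {"cpu_m": 250, "memory_mi": 64},
     "limits": {"cpu_m": 500, "memory_mi": 128}},
]


def pod_total(containers, kind):
    """Sum one kind of value ("requests" or "limits") over all Containers."""
    total = {"cpu_m": 0, "memory_mi": 0}
    for container in containers:
        for resource, amount in container[kind].items():
            total[resource] += amount
    return total


print(pod_total(containers, "requests"))
print(pod_total(containers, "limits"))
```

The totals are 500m CPU and 128Mi of memory for requests, and 1000m CPU and 256Mi of memory for limits.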
When you create a Pod, the Kubernetes scheduler selects a node for the Pod to run on. Each node has a maximum capacity for each of the resource types: the amount of CPU and memory it can provide for Pods. The scheduler ensures that, for each resource type, the sum of the resource requests of the scheduled Containers is less than the capacity of the node. Note that even if actual memory or CPU usage on a node is very low, the scheduler still refuses to place a Pod on that node if the capacity check fails. This protects against a resource shortage on a node when resource usage later increases, for example, during a daily peak in request rate.
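The capacity check can be pictured with the following Python sketch. It is a simplified model rather than the scheduler's actual code: the fits function, the dictionary layout, and the example numbers are all hypothetical, and treating a sum exactly equal to capacity as acceptable is an assumption of the sketch.

```python
def fits(node_capacity, scheduled_requests, pod_request):
    """Return True if adding pod_request keeps every resource within capacity.

    Only requests are compared. Actual usage on the node plays no part.
    """
    for resource, capacity in node_capacity.items():
        already_requested = sum(r.get(resource, 0) for r in scheduled_requests)
        if already_requested + pod_request.get(resource, 0) > capacity:
            return False
    return True


# Hypothetical node with 2 cpu and 4096Mi available to Pods.
node = {"cpu_m": 2000, "memory_mi": 4096}
scheduled = [{"cpu_m": 1500, "memory_mi": 1024}]

print(fits(node, scheduled, {"cpu_m": 500, "memory_mi": 128}))
print(fits(node, scheduled, {"cpu_m": 600, "memory_mi": 128}))
```

The second Pod is rejected on CPU alone, even though the node has plenty of memory left and might be nearly idle in practice.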
When the kubelet starts a Container of a Pod, it passes the CPU and memory limits to the container runtime.
When using Docker:
The spec.containers[].resources.requests.cpu is converted to its core value, which is potentially fractional, and multiplied by 1024. This number is used as the value of the --cpu-shares flag in the docker run command.
The spec.containers[].resources.limits.cpu is converted to its millicore value, multiplied by 100000, and then divided by 1000. This number is used as the value of the --cpu-quota flag in the docker run command. The --cpu-period flag is set to 100000, which represents the default 100ms period for measuring quota usage. The kubelet enforces cpu limits if it is started with the --cpu-cfs-quota flag set to true. As of Kubernetes version 1.2, this flag defaults to true.
The spec.containers[].resources.limits.memory is converted to an integer, and used as the value of the --memory flag in the docker run command.
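The three conversions described for Docker can be worked through in a few lines of Python. This is a sketch of the arithmetic only. The docker_flags function is hypothetical, and truncating fractional results with integer division is an assumption rather than documented kubelet behavior.

```python
CPU_PERIOD_US = 100000  # the 100ms period, in microseconds


def docker_flags(request_millicpu, limit_millicpu, limit_memory_bytes):
    """Derive docker run resource flags from a Container's requests and limits."""
    # Core value (millicpu / 1000) multiplied by 1024.
    cpu_shares = request_millicpu * 1024 // 1000
    # Millicore value multiplied by 100000, then divided by 1000.
    cpu_quota = limit_millicpu * CPU_PERIOD_US // 1000
    return [
        "--cpu-shares", str(cpu_shares),
        "--cpu-quota", str(cpu_quota),
        "--cpu-period", str(CPU_PERIOD_US),
        "--memory", str(limit_memory_bytes),
    ]


# One Container of the "frontend" Pod: 250m request, 500m and 128Mi limits.
print(" ".join(docker_flags(250, 500, 128 * 2**20)))
```

For that Container the sketch yields 256 CPU shares, a quota of 50000 microseconds per 100000 microsecond period (half a core), and a memory limit of 134217728 bytes.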
If a Container exceeds its memory limit, it might be terminated. If it is restartable, the kubelet will restart it, as with any other type of runtime failure.
If a Container exceeds its memory request, it is likely that its Pod will be evicted whenever the node runs out of memory.
A Container might or might not be allowed to exceed its CPU limit for extended periods of time. However, it will not be killed for excessive CPU usage.
To determine whether a Container cannot be scheduled or is being killed due to resource limits, see the Troubleshooting section.
The resource usage of a Pod is reported as part of the Pod status.
If optional monitoring is configured for your cluster, then Pod resource usage can be retrieved from the monitoring system.
If the scheduler cannot find any node where a Pod can fit, the Pod remains unscheduled until a place can be found. An event is produced each time the scheduler fails to find a place for the Pod, like this:
$ kubectl describe pod frontend | grep -A 3 Events
Events:
  FirstSeen  LastSeen  Count  From          SubobjectPath  Reason            Message
  36s        5s        6      {scheduler }                 FailedScheduling  Failed for reason PodExceedsFreeCPU and possibly others
In the preceding example, the Pod named “frontend” fails to be scheduled due to insufficient CPU resource on the node. Similar error messages can also suggest failure due to insufficient memory (PodExceedsFreeMemory). In general, if a Pod is pending with a message of this type, there are several things to try:
Add more nodes to the cluster.
Terminate unneeded Pods to make room for pending Pods.
Check that the Pod is not larger than all the nodes. For example, if all the nodes have a capacity of cpu: 1, then a Pod with a request of cpu: 1.1 will never be scheduled.
You can check node capacities and amounts allocated with the kubectl describe nodes command. For example:
$ kubectl.sh describe nodes e2e-test-minion-group-4lw4
Name:            e2e-test-minion-group-4lw4
[ ... lines removed for clarity ...]
Capacity:
 alpha.kubernetes.io/nvidia-gpu:    0
 cpu:                               2
 memory:                            7679792Ki
 pods:                              110
Allocatable:
 alpha.kubernetes.io/nvidia-gpu:    0
 cpu:                               1800m
 memory:                            7474992Ki
 pods:                              110
[ ... lines removed for clarity ...]
Non-terminated Pods:        (5 in total)
  Namespace    Name                                  CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ---------    ----                                  ------------  ----------  ---------------  -------------
  kube-system  fluentd-gcp-v1.38-28bv1               100m (5%)     0 (0%)      200Mi (2%)       200Mi (2%)
  kube-system  kube-dns-3297075139-61lj3             260m (13%)    0 (0%)      100Mi (1%)       170Mi (2%)
  kube-system  kube-proxy-e2e-test-...               100m (5%)     0 (0%)      0 (0%)           0 (0%)
  kube-system  monitoring-influxdb-grafana-v4-z1m12  200m (10%)    200m (10%)  600Mi (8%)       600Mi (8%)
  kube-system  node-problem-detector-v0.1-fj7m3      20m (1%)      200m (10%)  20Mi (0%)        100Mi (1%)
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ------------  ----------  ---------------  -------------
  680m (34%)    400m (20%)  920Mi (12%)      1070Mi (14%)
In the preceding output, you can see that if a Pod requests more than 1120m CPUs or 6.23Gi of memory, it will not fit on the node.
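The 1120m and 6.23Gi figures follow from subtracting the summed requests from the node's Allocatable values. The Python sketch below repeats that arithmetic with the numbers from the output. The variable names are chosen for this illustration.

```python
# Allocatable values from the node description.
allocatable_cpu_m = 1800
allocatable_mem_ki = 7474992

# Per-Pod requests from the "Non-terminated Pods" table.
cpu_requests_m = [100, 260, 100, 200, 20]
mem_requests_mi = [200, 100, 0, 600, 20]

requested_cpu_m = sum(cpu_requests_m)    # matches "680m (34%)"
requested_mem_mi = sum(mem_requests_mi)  # matches "920Mi (12%)"

free_cpu_m = allocatable_cpu_m - requested_cpu_m
free_mem_gi = (allocatable_mem_ki / 1024 - requested_mem_mi) / 1024

print(f"{free_cpu_m}m CPU, {free_mem_gi:.2f}Gi memory")
# prints: 1120m CPU, 6.23Gi memory
```

Note that the percentages in the table are relative to the 2-cpu Capacity, while the headroom is computed against Allocatable.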
By looking at the Pods section, you can see which Pods are taking up space on the node.
The amount of resources available to Pods is less than the node capacity, because system daemons use a portion of the available resources. The allocatable field of NodeStatus gives the amount of resources that are available to Pods. For more information, see Node Allocatable Resources.
The resource quota feature can be configured to limit the total amount of resources that can be consumed. If used in conjunction with namespaces, it can prevent one team from hogging all the resources.
Your Container might get terminated because it is resource-starved. To check whether a Container is being killed because it is hitting a resource limit, call kubectl describe pod on the Pod of interest:
[12:54:41] $ ./cluster/kubectl.sh describe pod simmemleak-hra99
Name: simmemleak-hra99
Namespace: default
Image(s): saadali/simmemleak
Node: kubernetes-node-tf0f/10.240.216.66
Labels: name=simmemleak
Status: Running
Reason:
Message:
IP: 10.244.2.75
Replication Controllers: simmemleak (1/1 replicas created)
Containers:
  simmemleak:
    Image:  saadali/simmemleak
    Limits:
      cpu:                      100m
      memory:                   50Mi
    State:                      Running
      Started:                  Tue, 07 Jul 2015 12:54:41 -0700
    Last Termination State:     Terminated
      Exit Code:                1
      Started:                  Tue, 07 Jul 2015 12:54:30 -0700
      Finished:                 Tue, 07 Jul 2015 12:54:33 -0700
    Ready:                      False
    Restart Count:              5
Conditions:
  Type      Status
  Ready     False
Events:
FirstSeen LastSeen Count From SubobjectPath Reason Message
Tue, 07 Jul 2015 12:53:51 -0700 Tue, 07 Jul 2015 12:53:51 -0700 1 {scheduler } scheduled Successfully assigned simmemleak-hra99 to kubernetes-node-tf0f
Tue, 07 Jul 2015 12:53:51 -0700 Tue, 07 Jul 2015 12:53:51 -0700 1 {kubelet kubernetes-node-tf0f} implicitly required container POD pulled Pod container image "gcr.io/google_containers/pause:0.8.0" already present on machine
Tue, 07 Jul 2015 12:53:51 -0700 Tue, 07 Jul 2015 12:53:51 -0700 1 {kubelet kubernetes-node-tf0f} implicitly required container POD created Created with docker id 6a41280f516d
Tue, 07 Jul 2015 12:53:51 -0700 Tue, 07 Jul 2015 12:53:51 -0700 1 {kubelet kubernetes-node-tf0f} implicitly required container POD started Started with docker id 6a41280f516d
Tue, 07 Jul 2015 12:53:51 -0700 Tue, 07 Jul 2015 12:53:51 -0700 1 {kubelet kubernetes-node-tf0f} spec.containers{simmemleak} created Created with docker id 87348f12526a
In the preceding example, the Restart Count: 5 indicates that the simmemleak Container in the Pod was terminated and restarted five times.
You can call kubectl get pod with the -o go-template=... option to fetch the status of previously terminated Containers:
[13:59:01] $ ./cluster/kubectl.sh get pod -o go-template='{{range.status.containerStatuses}}{{"Container Name: "}}{{.name}}{{"\r\nLastState: "}}{{.lastState}}{{end}}' simmemleak-60xbc
Container Name: simmemleak
LastState: map[terminated:map[exitCode:137 reason:OOM Killed startedAt:2015-07-07T20:58:43Z finishedAt:2015-07-07T20:58:43Z containerID:docker://0e4095bba1feccdfe7ef9fb6ebffe972b4b14285d5acdec6f0d3ae8a22fad8b2]]
You can see that the Container was terminated because of reason:OOM Killed, where OOM stands for Out Of Memory.
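The same check can be made programmatically on a Container status. The Python sketch below uses a hand-written dictionary that mirrors the lastState output above. Both function names are hypothetical. Accepting the reason with or without a space, and reading exit codes above 128 as 128 plus a signal number, are assumptions based on common convention.

```python
def was_oom_killed(container_status):
    """Return True if the previous run of the Container ended in an OOM kill."""
    terminated = container_status.get("lastState", {}).get("terminated")
    if not terminated:
        return False
    # Assumption: tolerate both "OOM Killed" and "OOMKilled" spellings.
    return terminated.get("reason", "").replace(" ", "") == "OOMKilled"


def signal_from_exit_code(exit_code):
    """Return the signal number implied by an exit code above 128, else None."""
    return exit_code - 128 if exit_code > 128 else None


# Values copied from the LastState line in the output above.
status = {
    "name": "simmemleak",
    "lastState": {"terminated": {"exitCode": 137, "reason": "OOM Killed"}},
}

print(was_oom_killed(status))
print(signal_from_exit_code(status["lastState"]["terminated"]["exitCode"]))
```

Exit code 137 corresponds to signal 9 (SIGKILL on Linux), which is consistent with the process being killed rather than exiting on its own.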
Kubernetes version 1.5 introduces opaque integer resources. Opaque integer resources allow cluster operators to advertise new node-level resources that would be otherwise unknown to the system.
Users can consume these resources in Pod specs just like CPU and memory. The scheduler takes care of the resource accounting so that no more than the available amount is simultaneously allocated to Pods.
Note: Opaque integer resources are Alpha in Kubernetes version 1.5. Only resource accounting is implemented; node-level isolation is still under active development.
Opaque integer resources are resources that begin with the prefix pod.alpha.kubernetes.io/opaque-int-resource-. The API server restricts quantities of these resources to whole numbers. Examples of valid quantities are 3, 3000m and 3Ki. Examples of invalid quantities are 0.5 and 1500m.
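Whether a quantity counts as a whole number depends on its value after the suffix is applied, which is why 3000m is valid and 1500m is not. The Python sketch below illustrates that rule. The is_whole_quantity function and the suffix table are hypothetical and cover only a few suffixes.

```python
from fractions import Fraction

# A small subset of quantity suffixes, enough for the examples in the text.
SUFFIXES = {
    "Ki": Fraction(2**10), "Mi": Fraction(2**20),
    "m": Fraction(1, 1000), "k": Fraction(10**3), "M": Fraction(10**6),
}


def is_whole_quantity(quantity: str) -> bool:
    """Return True if the quantity evaluates to a whole number."""
    # Try longer suffixes first so "Mi" is matched before "M".
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if quantity.endswith(suffix):
            value = Fraction(quantity[:-len(suffix)]) * SUFFIXES[suffix]
            return value.denominator == 1
    return Fraction(quantity).denominator == 1


for q in ["3", "3000m", "3Ki", "0.5", "1500m"]:
    print(q, is_whole_quantity(q))
```

Here 3000m evaluates to 3 and 3Ki to 3072, while 0.5 and 1500m (1.5) are fractional.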
There are two steps required to use opaque integer resources. First, the cluster operator must advertise a per-node opaque resource on one or more nodes. Second, users must request the opaque resource in Pods.
To advertise a new opaque integer resource, the cluster operator should submit a PATCH HTTP request to the API server to specify the available quantity in the status.capacity for a node in the cluster. After this operation, the node’s status.capacity will include a new resource. The status.allocatable field is updated automatically with the new resource asynchronously by the kubelet. Note that because the scheduler uses the node status.allocatable value when evaluating Pod fitness, there may be a short delay between patching the node capacity with a new resource and the first Pod that requests the resource being scheduled on that node.
Example:
Here is an HTTP request that advertises five “foo” resources on node k8s-node-1.
PATCH /api/v1/nodes/k8s-node-1/status HTTP/1.1
Accept: application/json
Content-Type: application/json-patch+json
Host: k8s-master:8080

[
  {
    "op": "add",
    "path": "/status/capacity/pod.alpha.kubernetes.io~1opaque-int-resource-foo",
    "value": "5"
  }
]
Note: In the preceding request, ~1 is the encoding for the character / in the patch path. The operation path value in JSON-Patch is interpreted as a JSON-Pointer. For more details, see IETF RFC 6901, section 3.
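The ~1 encoding is easy to get wrong by hand, so the following Python sketch builds the patch body from a resource name. The function names are hypothetical. The escaping rule itself (~ becomes ~0, then / becomes ~1) comes from RFC 6901. The sketch only produces the JSON document and does not send any request.

```python
import json

OPAQUE_PREFIX = "pod.alpha.kubernetes.io/opaque-int-resource-"


def escape_pointer_token(token: str) -> str:
    """Escape one JSON-Pointer reference token as described in RFC 6901."""
    # "~" must be escaped first so the "~1" produced for "/" is left alone.
    return token.replace("~", "~0").replace("/", "~1")


def advertise_patch(name: str, count: int) -> str:
    """Build a JSON-Patch body that adds an opaque resource to node capacity."""
    path = "/status/capacity/" + escape_pointer_token(OPAQUE_PREFIX + name)
    return json.dumps([{"op": "add", "path": path, "value": str(count)}])


print(advertise_patch("foo", 5))
```

For the name foo and a count of 5, the path in the printed body matches the one in the request above.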
To consume an opaque resource in a Pod, include the name of the opaque resource as a key in the spec.containers[].resources.requests map.
The Pod is scheduled only if all of the resource requests are satisfied, including cpu, memory and any opaque resources. The Pod will remain in the PENDING state as long as the resource request cannot be met by any node.
Example:
The Pod below requests 2 cpus and 1 “foo” (an opaque resource).
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    image: myimage
    resources:
      requests:
        cpu: 2
        pod.alpha.kubernetes.io/opaque-int-resource-foo: 1
Kubernetes version 1.5 only allows resource quantities to be specified on a Container. It is planned to improve accounting for resources that are shared by all Containers in a Pod, such as emptyDir volumes.
Kubernetes version 1.5 only supports Container requests and limits for CPU and memory. It is planned to add new resource types, including a node disk space resource, and a framework for adding custom resource types.
Kubernetes supports overcommitment of resources by supporting multiple levels of Quality of Service.
In Kubernetes version 1.5, one unit of CPU means different things on different cloud providers, and on different machine types within the same cloud providers. For example, on AWS, the capacity of a node is reported in ECUs, while in GCE it is reported in logical cores. We plan to revise the definition of the cpu resource to allow for more consistency across providers and platforms.
Get hands-on experience assigning CPU and RAM resources to a container.