This tutorial demonstrates Apache Zookeeper on Kubernetes using StatefulSets, PodDisruptionBudgets, and PodAntiAffinity.
After this tutorial, you will know the following.
Before starting this tutorial, you should be familiar with the following Kubernetes concepts.
You will require a cluster with at least four nodes, and each node will require at least 2 CPUs and 4 GiB of memory. In this tutorial you will cordon and drain the cluster’s nodes. This means that all Pods on the cluster’s nodes will be terminated and evicted, and the nodes will, temporarily, become unschedulable. You should use a dedicated cluster for this tutorial, or you should ensure that the disruption you cause will not interfere with other tenants.
This tutorial assumes that your cluster is configured to dynamically provision PersistentVolumes. If your cluster is not configured to do so, you will have to manually provision three 20 GiB volumes prior to starting this tutorial.
Apache ZooKeeper is a distributed, open-source coordination service for distributed applications. ZooKeeper allows you to read, write, and observe updates to data. Data are organized in a file system like hierarchy and replicated to all ZooKeeper servers in the ensemble (a set of ZooKeeper servers). All operations on data are atomic and sequentially consistent. ZooKeeper ensures this by using the Zab consensus protocol to replicate a state machine across all servers in the ensemble.
The ensemble uses the Zab protocol to elect a leader, and data can not be written until a leader is elected. Once a leader is elected, the ensemble uses Zab to ensure that all writes are replicated to a quorum before they are acknowledged and made visible to clients. Without respect to weighted quorums, a quorum is a majority component of the ensemble containing the current leader. For instance, if the ensemble has three servers, a component that contains the leader and one other server constitutes a quorum. If the ensemble can not achieve a quorum, data can not be written.
ZooKeeper servers keep their entire state machine in memory, but every mutation is written to a durable WAL (Write Ahead Log) on storage media. When a server crashes, it can recover its previous state by replaying the WAL. In order to prevent the WAL from growing without bound, ZooKeeper servers will periodically snapshot their in memory state to storage media. These snapshots can be loaded directly into memory, and all WAL entries that preceded the snapshot may be safely discarded.
The manifest below contains a Headless Service, a ConfigMap, a PodDisruptionBudget, and a StatefulSet.
zookeeper.yaml
|
---|
|
Open a command terminal, and use
kubectl create
to create the
manifest.
kubectl create -f http://k8s.io/docs/tutorials/stateful-application/zookeeper.yaml
This creates the zk-headless
Headless Service, the zk-config
ConfigMap,
the zk-budget
PodDisruptionBudget, and the zk
StatefulSet.
service "zk-headless" created
configmap "zk-config" created
poddisruptionbudget "zk-budget" created
statefulset "zk" created
Use kubectl get
to watch the
StatefulSet controller create the StatefulSet’s Pods.
kubectl get pods -w -l app=zk
Once the zk-2
Pod is Running and Ready, use CRTL-C
to terminate kubectl.
NAME READY STATUS RESTARTS AGE
zk-0 0/1 Pending 0 0s
zk-0 0/1 Pending 0 0s
zk-0 0/1 ContainerCreating 0 0s
zk-0 0/1 Running 0 19s
zk-0 1/1 Running 0 40s
zk-1 0/1 Pending 0 0s
zk-1 0/1 Pending 0 0s
zk-1 0/1 ContainerCreating 0 0s
zk-1 0/1 Running 0 18s
zk-1 1/1 Running 0 40s
zk-2 0/1 Pending 0 0s
zk-2 0/1 Pending 0 0s
zk-2 0/1 ContainerCreating 0 0s
zk-2 0/1 Running 0 19s
zk-2 1/1 Running 0 40s
The StatefulSet controller creates three Pods, and each Pod has a container with a ZooKeeper 3.4.9 server.
As there is no terminating algorithm for electing a leader in an anonymous network, Zab requires explicit membership configuration in order to perform leader election. Each server in the ensemble needs to have a unique identifier, all servers need to know the global set of identifiers, and each identifier needs to be associated with a network address.
Use kubectl exec
to get the hostnames
of the Pods in the zk
StatefulSet.
for i in 0 1 2; do kubectl exec zk-$i -- hostname; done
The StatefulSet controller provides each Pod with a unique hostname based on its
ordinal index. The hostnames take the form <statefulset name>-<ordinal index>
.
As the replicas
field of the zk
StatefulSet is set to 3
, the Set’s
controller creates three Pods with their hostnames set to zk-0
, zk-1
, and
zk-2
.
zk-0
zk-1
zk-2
The servers in a ZooKeeper ensemble use natural numbers as unique identifiers, and
each server’s identifier is stored in a file called myid
in the server’s
data directory.
Examine the contents of the myid
file for each server.
for i in 0 1 2; do echo "myid zk-$i";kubectl exec zk-$i -- cat /var/lib/zookeeper/data/myid; done
As the identifiers are natural numbers and the ordinal indices are non-negative integers, you can generate an identifier by adding one to the ordinal.
myid zk-0
1
myid zk-1
2
myid zk-2
3
Get the FQDN (Fully Qualified Domain Name) of each Pod in the zk
StatefulSet.
for i in 0 1 2; do kubectl exec zk-$i -- hostname -f; done
The zk-headless
Service creates a domain for all of the Pods,
zk-headless.default.svc.cluster.local
.
zk-0.zk-headless.default.svc.cluster.local
zk-1.zk-headless.default.svc.cluster.local
zk-2.zk-headless.default.svc.cluster.local
The A records in Kubernetes DNS resolve the FQDNs to the Pods’ IP addresses. If the Pods are rescheduled, the A records will be updated with the Pods’ new IP addresses, but the A record’s names will not change.
ZooKeeper stores its application configuration in a file named zoo.cfg
. Use
kubectl exec
to view the contents of the zoo.cfg
file in the zk-0
Pod.
kubectl exec zk-0 -- cat /opt/zookeeper/conf/zoo.cfg
For the server.1
, server.2
, and server.3
properties at the bottom of
the file, the 1
, 2
, and 3
correspond to the identifiers in the
ZooKeeper servers’ myid
files. They are set to the FQDNs for the Pods in
the zk
StatefulSet.
clientPort=2181
dataDir=/var/lib/zookeeper/data
dataLogDir=/var/lib/zookeeper/log
tickTime=2000
initLimit=10
syncLimit=2000
maxClientCnxns=60
minSessionTimeout= 4000
maxSessionTimeout= 40000
autopurge.snapRetainCount=3
autopurge.purgeInteval=0
server.1=zk-0.zk-headless.default.svc.cluster.local:2888:3888
server.2=zk-1.zk-headless.default.svc.cluster.local:2888:3888
server.3=zk-2.zk-headless.default.svc.cluster.local:2888:3888
Consensus protocols require that the identifiers of each participant be unique. No two participants in the Zab protocol should claim the same unique identifier. This is necessary to allow the processes in the system to agree on which processes have committed which data. If two Pods were launched with the same ordinal, two ZooKeeper servers would both identify themselves as the same server.
When you created the zk
StatefulSet, the StatefulSet’s controller created
each Pod sequentially, in the order defined by the Pods’ ordinal indices, and it
waited for each Pod to be Running and Ready before creating the next Pod.
kubectl get pods -w -l app=zk
NAME READY STATUS RESTARTS AGE
zk-0 0/1 Pending 0 0s
zk-0 0/1 Pending 0 0s
zk-0 0/1 ContainerCreating 0 0s
zk-0 0/1 Running 0 19s
zk-0 1/1 Running 0 40s
zk-1 0/1 Pending 0 0s
zk-1 0/1 Pending 0 0s
zk-1 0/1 ContainerCreating 0 0s
zk-1 0/1 Running 0 18s
zk-1 1/1 Running 0 40s
zk-2 0/1 Pending 0 0s
zk-2 0/1 Pending 0 0s
zk-2 0/1 ContainerCreating 0 0s
zk-2 0/1 Running 0 19s
zk-2 1/1 Running 0 40s
The A records for each Pod are only entered when the Pod becomes Ready. Therefore,
the FQDNs of the ZooKeeper servers will only resolve to a single endpoint, and that
endpoint will be the unique ZooKeeper server claiming the identity configured
in its myid
file.
zk-0.zk-headless.default.svc.cluster.local
zk-1.zk-headless.default.svc.cluster.local
zk-2.zk-headless.default.svc.cluster.local
This ensures that the servers
properties in the ZooKeepers’ zoo.cfg
files
represents a correctly configured ensemble.
server.1=zk-0.zk-headless.default.svc.cluster.local:2888:3888
server.2=zk-1.zk-headless.default.svc.cluster.local:2888:3888
server.3=zk-2.zk-headless.default.svc.cluster.local:2888:3888
When the servers use the Zab protocol to attempt to commit a value, they will either achieve consensus and commit the value (if leader election has succeeded and at least two of the Pods are Running and Ready), or they will fail to do so (if either of the aforementioned conditions are not met). No state will arise where one server acknowledges a write on behalf of another.
The most basic sanity test is to write some data to one ZooKeeper server and to read the data from another.
Use the zkCli.sh
script to write world
to the path /hello
on the zk-0
Pod.
kubectl exec zk-0 zkCli.sh create /hello world
This will write world
to the /hello
path in the ensemble.
WATCHER::
WatchedEvent state:SyncConnected type:None path:null
Created /hello
Get the data from the zk-1
Pod.
kubectl exec zk-1 zkCli.sh get /hello
The data that you created on zk-0
is available on all of the servers in the
ensemble.
WATCHER::
WatchedEvent state:SyncConnected type:None path:null
world
cZxid = 0x100000002
ctime = Thu Dec 08 15:13:30 UTC 2016
mZxid = 0x100000002
mtime = Thu Dec 08 15:13:30 UTC 2016
pZxid = 0x100000002
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 5
numChildren = 0
As mentioned in the ZooKeeper Basics section, ZooKeeper commits all entries to a durable WAL, and periodically writes snapshots in memory state, to storage media. Using WALs to provide durability is a common technique for applications that use consensus protocols to achieve a replicated state machine and for storage applications in general.
Use kubectl delete
to delete the
zk
StatefulSet.
kubectl delete statefulset zk
statefulset "zk" deleted
Watch the termination of the Pods in the StatefulSet.
get pods -w -l app=zk
When zk-0
if fully terminated, use CRTL-C
to terminate kubectl.
zk-2 1/1 Terminating 0 9m
zk-0 1/1 Terminating 0 11m
zk-1 1/1 Terminating 0 10m
zk-2 0/1 Terminating 0 9m
zk-2 0/1 Terminating 0 9m
zk-2 0/1 Terminating 0 9m
zk-1 0/1 Terminating 0 10m
zk-1 0/1 Terminating 0 10m
zk-1 0/1 Terminating 0 10m
zk-0 0/1 Terminating 0 11m
zk-0 0/1 Terminating 0 11m
zk-0 0/1 Terminating 0 11m
Reapply the manifest in zookeeper.yaml
.
kubectl apply -f http://k8s.io/docs/tutorials/stateful-application/zookeeper.yaml
The zk
StatefulSet will be created, but, as they already exist, the other API
Objects in the manifest will not be modified.
statefulset "zk" created
Error from server (AlreadyExists): error when creating "zookeeper.yaml": services "zk-headless" already exists
Error from server (AlreadyExists): error when creating "zookeeper.yaml": configmaps "zk-config" already exists
Error from server (AlreadyExists): error when creating "zookeeper.yaml": poddisruptionbudgets.policy "zk-budget" already exists
Watch the StatefulSet controller recreate the StatefulSet’s Pods.
kubectl get pods -w -l app=zk
Once the zk-2
Pod is Running and Ready, use CRTL-C
to terminate kubectl.
NAME READY STATUS RESTARTS AGE
zk-0 0/1 Pending 0 0s
zk-0 0/1 Pending 0 0s
zk-0 0/1 ContainerCreating 0 0s
zk-0 0/1 Running 0 19s
zk-0 1/1 Running 0 40s
zk-1 0/1 Pending 0 0s
zk-1 0/1 Pending 0 0s
zk-1 0/1 ContainerCreating 0 0s
zk-1 0/1 Running 0 18s
zk-1 1/1 Running 0 40s
zk-2 0/1 Pending 0 0s
zk-2 0/1 Pending 0 0s
zk-2 0/1 ContainerCreating 0 0s
zk-2 0/1 Running 0 19s
zk-2 1/1 Running 0 40s
Get the value you entered during the sanity test,
from the zk-2
Pod.
kubectl exec zk-2 zkCli.sh get /hello
Even though all of the Pods in the zk
StatefulSet have been terminated and
recreated, the ensemble still serves the original value.
WATCHER::
WatchedEvent state:SyncConnected type:None path:null
world
cZxid = 0x100000002
ctime = Thu Dec 08 15:13:30 UTC 2016
mZxid = 0x100000002
mtime = Thu Dec 08 15:13:30 UTC 2016
pZxid = 0x100000002
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 5
numChildren = 0
The volumeClaimTemplates
field, of the zk
StatefulSet’s spec
, specifies a
PersistentVolume that will be provisioned for each Pod.
volumeClaimTemplates:
- metadata:
name: datadir
annotations:
volume.alpha.kubernetes.io/storage-class: anything
spec:
accessModes: [ "ReadWriteOnce" ]
resources:
requests:
storage: 20Gi
The StatefulSet controller generates a PersistentVolumeClaim for each Pod in the StatefulSet.
Get the StatefulSet’s PersistentVolumeClaims.
kubectl get pvc -l app=zk
When the StatefulSet recreated its Pods, the Pods’ PersistentVolumes were remounted.
NAME STATUS VOLUME CAPACITY ACCESSMODES AGE
datadir-zk-0 Bound pvc-bed742cd-bcb1-11e6-994f-42010a800002 20Gi RWO 1h
datadir-zk-1 Bound pvc-bedd27d2-bcb1-11e6-994f-42010a800002 20Gi RWO 1h
datadir-zk-2 Bound pvc-bee0817e-bcb1-11e6-994f-42010a800002 20Gi RWO 1h
The volumeMounts
section of the StatefulSet’s container template
causes the
PersistentVolumes to be mounted to the ZooKeeper servers’ data directories.
volumeMounts:
- name: datadir
mountPath: /var/lib/zookeeper
When a Pod in the zk
StatefulSet is (re)scheduled, it will always have the
same PersistentVolume mounted to the ZooKeeper server’s data directory.
Even when the Pods are rescheduled, all of the writes made to the ZooKeeper
servers’ WALs, and all of their snapshots, remain durable.
As noted in the Facilitating Leader Election and Achieving Consensus sections, the servers in a ZooKeeper ensemble require consistent configuration in order to elect a leader and form a quorum. They also require consistent configuration of the Zab protocol in order for the protocol to work correctly over a network. You can use ConfigMaps to achieve this.
Get the zk-config
ConfigMap.
kubectl get cm zk-config -o yaml
apiVersion: v1
data:
client.cnxns: "60"
ensemble: zk-0;zk-1;zk-2
init: "10"
jvm.heap: 2G
purge.interval: "0"
snap.retain: "3"
sync: "5"
tick: "2000"
The env
field of the zk
StatefulSet’s Pod template
reads the ConfigMap
into environment variables. These variables are injected into the containers
environment.
env:
- name : ZK_ENSEMBLE
valueFrom:
configMapKeyRef:
name: zk-config
key: ensemble
- name : ZK_HEAP_SIZE
valueFrom:
configMapKeyRef:
name: zk-config
key: jvm.heap
- name : ZK_TICK_TIME
valueFrom:
configMapKeyRef:
name: zk-config
key: tick
- name : ZK_INIT_LIMIT
valueFrom:
configMapKeyRef:
name: zk-config
key: init
- name : ZK_SYNC_LIMIT
valueFrom:
configMapKeyRef:
name: zk-config
key: tick
- name : ZK_MAX_CLIENT_CNXNS
valueFrom:
configMapKeyRef:
name: zk-config
key: client.cnxns
- name: ZK_SNAP_RETAIN_COUNT
valueFrom:
configMapKeyRef:
name: zk-config
key: snap.retain
- name: ZK_PURGE_INTERVAL
valueFrom:
configMapKeyRef:
name: zk-config
key: purge.interval
The entry point of the container invokes a bash script, zkGenConfig.sh
, prior to
launching the ZooKeeper server process. This bash script generates the
ZooKeeper configuration files from the supplied environment variables.
command:
- sh
- -c
- zkGenConfig.sh && zkServer.sh start-foreground
Examine the environment of all of the Pods in the zk
StatefulSet.
for i in 0 1 2; do kubectl exec zk-$i env | grep ZK_*;echo""; done
All of the variables populated from zk-config
contain identical values. This
allows the zkGenConfig.sh
script to create consistent configurations for all
of the ZooKeeper servers in the ensemble.
ZK_ENSEMBLE=zk-0;zk-1;zk-2
ZK_HEAP_SIZE=2G
ZK_TICK_TIME=2000
ZK_INIT_LIMIT=10
ZK_SYNC_LIMIT=2000
ZK_MAX_CLIENT_CNXNS=60
ZK_SNAP_RETAIN_COUNT=3
ZK_PURGE_INTERVAL=0
ZK_CLIENT_PORT=2181
ZK_SERVER_PORT=2888
ZK_ELECTION_PORT=3888
ZK_USER=zookeeper
ZK_DATA_DIR=/var/lib/zookeeper/data
ZK_DATA_LOG_DIR=/var/lib/zookeeper/log
ZK_LOG_DIR=/var/log/zookeeper
ZK_ENSEMBLE=zk-0;zk-1;zk-2
ZK_HEAP_SIZE=2G
ZK_TICK_TIME=2000
ZK_INIT_LIMIT=10
ZK_SYNC_LIMIT=2000
ZK_MAX_CLIENT_CNXNS=60
ZK_SNAP_RETAIN_COUNT=3
ZK_PURGE_INTERVAL=0
ZK_CLIENT_PORT=2181
ZK_SERVER_PORT=2888
ZK_ELECTION_PORT=3888
ZK_USER=zookeeper
ZK_DATA_DIR=/var/lib/zookeeper/data
ZK_DATA_LOG_DIR=/var/lib/zookeeper/log
ZK_LOG_DIR=/var/log/zookeeper
ZK_ENSEMBLE=zk-0;zk-1;zk-2
ZK_HEAP_SIZE=2G
ZK_TICK_TIME=2000
ZK_INIT_LIMIT=10
ZK_SYNC_LIMIT=2000
ZK_MAX_CLIENT_CNXNS=60
ZK_SNAP_RETAIN_COUNT=3
ZK_PURGE_INTERVAL=0
ZK_CLIENT_PORT=2181
ZK_SERVER_PORT=2888
ZK_ELECTION_PORT=3888
ZK_USER=zookeeper
ZK_DATA_DIR=/var/lib/zookeeper/data
ZK_DATA_LOG_DIR=/var/lib/zookeeper/log
ZK_LOG_DIR=/var/log/zookeeper
One of the files generated by the zkGenConfig.sh
script controls ZooKeeper’s logging.
ZooKeeper uses Log4j, and, by default,
it uses a time and size based rolling file appender for its logging configuration.
Get the logging configuration from one of Pods in the zk
StatefulSet.
kubectl exec zk-0 cat /usr/etc/zookeeper/log4j.properties
The logging configuration below will cause the ZooKeeper process to write all of its logs to the standard output file stream.
zookeeper.root.logger=CONSOLE
zookeeper.console.threshold=INFO
log4j.rootLogger=${zookeeper.root.logger}
log4j.appender.CONSOLE=org.apache.log4j.ConsoleAppender
log4j.appender.CONSOLE.Threshold=${zookeeper.console.threshold}
log4j.appender.CONSOLE.layout=org.apache.log4j.PatternLayout
log4j.appender.CONSOLE.layout.ConversionPattern=%d{ISO8601} [myid:%X{myid}] - %-5p [%t:%C{1}@%L] - %m%n
This is the simplest possible way to safely log inside the container. As the application’s logs are being written to standard out, Kubernetes will handle log rotation for you. Kubernetes also implements a sane retention policy that ensures application logs written to standard out and standard error do not exhaust local storage media.
Use kubectl logs
to retrieve the last
few log lines from one of the Pods.
kubectl logs zk-0 --tail 20
Application logs that are written to standard out or standard error are viewable
using kubectl logs
and from the Kubernetes Dashboard.
2016-12-06 19:34:16,236 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@827] - Processing ruok command from /127.0.0.1:52740
2016-12-06 19:34:16,237 [myid:1] - INFO [Thread-1136:NIOServerCnxn@1008] - Closed socket connection for client /127.0.0.1:52740 (no session established for client)
2016-12-06 19:34:26,155 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /127.0.0.1:52749
2016-12-06 19:34:26,155 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@827] - Processing ruok command from /127.0.0.1:52749
2016-12-06 19:34:26,156 [myid:1] - INFO [Thread-1137:NIOServerCnxn@1008] - Closed socket connection for client /127.0.0.1:52749 (no session established for client)
2016-12-06 19:34:26,222 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /127.0.0.1:52750
2016-12-06 19:34:26,222 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@827] - Processing ruok command from /127.0.0.1:52750
2016-12-06 19:34:26,226 [myid:1] - INFO [Thread-1138:NIOServerCnxn@1008] - Closed socket connection for client /127.0.0.1:52750 (no session established for client)
2016-12-06 19:34:36,151 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /127.0.0.1:52760
2016-12-06 19:34:36,152 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@827] - Processing ruok command from /127.0.0.1:52760
2016-12-06 19:34:36,152 [myid:1] - INFO [Thread-1139:NIOServerCnxn@1008] - Closed socket connection for client /127.0.0.1:52760 (no session established for client)
2016-12-06 19:34:36,230 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /127.0.0.1:52761
2016-12-06 19:34:36,231 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@827] - Processing ruok command from /127.0.0.1:52761
2016-12-06 19:34:36,231 [myid:1] - INFO [Thread-1140:NIOServerCnxn@1008] - Closed socket connection for client /127.0.0.1:52761 (no session established for client)
2016-12-06 19:34:46,149 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /127.0.0.1:52767
2016-12-06 19:34:46,149 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@827] - Processing ruok command from /127.0.0.1:52767
2016-12-06 19:34:46,149 [myid:1] - INFO [Thread-1141:NIOServerCnxn@1008] - Closed socket connection for client /127.0.0.1:52767 (no session established for client)
2016-12-06 19:34:46,230 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /127.0.0.1:52768
2016-12-06 19:34:46,230 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@827] - Processing ruok command from /127.0.0.1:52768
2016-12-06 19:34:46,230 [myid:1] - INFO [Thread-1142:NIOServerCnxn@1008] - Closed socket connection for client /127.0.0.1:52768 (no session established for client)
Kubernetes also supports more powerful, but more complex, logging integrations with Google Cloud Logging and ELK. For cluster level log shipping and aggregation, you should consider deploying a sidecar container to rotate and ship your logs.
The best practices with respect to allowing an application to run as a privileged user inside of a container are a matter of debate. If your organization requires that applications be run as a non-privileged user you can use a SecurityContext to control the user that the entry point runs as.
The zk
StatefulSet’s Pod template
contains a SecurityContext.
securityContext:
runAsUser: 1000
fsGroup: 1000
In the Pods’ containers, UID 1000 corresponds to the zookeeper user and GID 1000 corresponds to the zookeeper group.
Get the ZooKeeper process information from the zk-0
Pod.
kubectl exec zk-0 -- ps -elf
As the runAsUser
field of the securityContext
object is set to 1000,
instead of running as root, the ZooKeeper process runs as the zookeeper user.
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
4 S zookeep+ 1 0 0 80 0 - 1127 - 20:46 ? 00:00:00 sh -c zkGenConfig.sh && zkServer.sh start-foreground
0 S zookeep+ 27 1 0 80 0 - 1155556 - 20:46 ? 00:00:19 /usr/lib/jvm/java-8-openjdk-amd64/bin/java -Dzookeeper.log.dir=/var/log/zookeeper -Dzookeeper.root.logger=INFO,CONSOLE -cp /usr/bin/../build/classes:/usr/bin/../build/lib/*.jar:/usr/bin/../share/zookeeper/zookeeper-3.4.9.jar:/usr/bin/../share/zookeeper/slf4j-log4j12-1.6.1.jar:/usr/bin/../share/zookeeper/slf4j-api-1.6.1.jar:/usr/bin/../share/zookeeper/netty-3.10.5.Final.jar:/usr/bin/../share/zookeeper/log4j-1.2.16.jar:/usr/bin/../share/zookeeper/jline-0.9.94.jar:/usr/bin/../src/java/lib/*.jar:/usr/bin/../etc/zookeeper: -Xmx2G -Xms2G -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.local.only=false org.apache.zookeeper.server.quorum.QuorumPeerMain /usr/bin/../etc/zookeeper/zoo.cfg
By default, when the Pod’s PersistentVolume is mounted to the ZooKeeper server’s data directory, it is only accessible by the root user. This configuration prevents the ZooKeeper process from writing to its WAL and storing its snapshots.
Get the file permissions of the ZooKeeper data directory on the zk-0
Pod.
kubectl exec -ti zk-0 -- ls -ld /var/lib/zookeeper/data
As the fsGroup
field of the securityContext
object is set to 1000,
the ownership of the Pods’ PersistentVolumes is set to the zookeeper group,
and the ZooKeeper process is able to successfully read and write its data.
drwxr-sr-x 3 zookeeper zookeeper 4096 Dec 5 20:45 /var/lib/zookeeper/data
The ZooKeeper documentation documentation indicates that “You will want to have a supervisory process that manages each of your ZooKeeper server processes (JVM).” Utilizing a watchdog (supervisory process) to restart failed processes in a distributed system is a common pattern. When deploying an application in Kubernetes, rather than using an external utility as a supervisory process, you should use Kubernetes as the watchdog for your application.
Restart Policies control how Kubernetes handles process failures for the entry point of the container in a Pod. For Pods in a StatefulSet, the only appropriate RestartPolicy is Always, and this is the default value. For stateful applications you should never override the default policy.
Examine the process tree for the ZooKeeper server running in the zk-0
Pod.
kubectl exec zk-0 -- ps -ef
The command used as the container’s entry point has PID 1, and the ZooKeeper process, a child of the entry point, has PID 23.
UID PID PPID C STIME TTY TIME CMD
zookeep+ 1 0 0 15:03 ? 00:00:00 sh -c zkGenConfig.sh && zkServer.sh start-foreground
zookeep+ 27 1 0 15:03 ? 00:00:03 /usr/lib/jvm/java-8-openjdk-amd64/bin/java -Dzookeeper.log.dir=/var/log/zookeeper -Dzookeeper.root.logger=INFO,CONSOLE -cp /usr/bin/../build/classes:/usr/bin/../build/lib/*.jar:/usr/bin/../share/zookeeper/zookeeper-3.4.9.jar:/usr/bin/../share/zookeeper/slf4j-log4j12-1.6.1.jar:/usr/bin/../share/zookeeper/slf4j-api-1.6.1.jar:/usr/bin/../share/zookeeper/netty-3.10.5.Final.jar:/usr/bin/../share/zookeeper/log4j-1.2.16.jar:/usr/bin/../share/zookeeper/jline-0.9.94.jar:/usr/bin/../src/java/lib/*.jar:/usr/bin/../etc/zookeeper: -Xmx2G -Xms2G -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.local.only=false org.apache.zookeeper.server.quorum.QuorumPeerMain /usr/bin/../etc/zookeeper/zoo.cfg
In one terminal watch the Pods in the zk
StatefulSet.
kubectl get pod -w -l app=zk
In another terminal, kill the ZooKeeper process in Pod zk-0
.
kubectl exec zk-0 -- pkill java
The death of the ZooKeeper process caused its parent process to terminate. As the RestartPolicy of the container is Always, the parent process was relaunched.
NAME READY STATUS RESTARTS AGE
zk-0 1/1 Running 0 21m
zk-1 1/1 Running 0 20m
zk-2 1/1 Running 0 19m
NAME READY STATUS RESTARTS AGE
zk-0 0/1 Error 0 29m
zk-0 0/1 Running 1 29m
zk-0 1/1 Running 1 29m
If your application uses a script (such as zkServer.sh) to launch the process that implements the application’s business logic, the script must terminate with the child process. This ensures that Kubernetes will restart the application’s container when the process implementing the application’s business logic fails.
Configuring your application to restart failed processes is not sufficient to keep a distributed system healthy. There are many scenarios where a system’s processes can be both alive and unresponsive, or otherwise unhealthy. You should use liveness probes in order to notify Kubernetes that your application’s processes are unhealthy and should be restarted.
The Pod template
for the zk
StatefulSet specifies a liveness probe.
livenessProbe:
exec:
command:
- "zkOk.sh"
initialDelaySeconds: 15
timeoutSeconds: 5
The probe calls a simple bash script that uses the ZooKeeper ruok
four letter
word to test the server’s health.
ZK_CLIENT_PORT=${ZK_CLIENT_PORT:-2181}
OK=$(echo ruok | nc 127.0.0.1 $ZK_CLIENT_PORT)
if [ "$OK" == "imok" ]; then
exit 0
else
exit 1
fi
In one terminal window, watch the Pods in the zk
StatefulSet.
kubectl get pod -w -l app=zk
In another window, delete the zkOk.sh
script from the file system of Pod zk-0
.
kubectl exec zk-0 -- rm /opt/zookeeper/bin/zkOk.sh
When the liveness probe for the ZooKeeper process fails, Kubernetes will automatically restart the process for you, ensuring that unhealthy processes in the ensemble are restarted.
kubectl get pod -w -l app=zk
NAME READY STATUS RESTARTS AGE
zk-0 1/1 Running 0 1h
zk-1 1/1 Running 0 1h
zk-2 1/1 Running 0 1h
NAME READY STATUS RESTARTS AGE
zk-0 0/1 Running 0 1h
zk-0 0/1 Running 1 1h
zk-0 1/1 Running 1 1h
Readiness is not the same as liveness. If a process is alive, it is scheduled and healthy. If a process is ready, it is able to process input. Liveness is a necessary, but not sufficient, condition for readiness. There are many cases, particularly during initialization and termination, when a process can be alive but not ready.
If you specify a readiness probe, Kubernetes will ensure that your application’s processes will not receive network traffic until their readiness checks pass.
For a ZooKeeper server, liveness implies readiness. Therefore, the readiness
probe from the zookeeper.yaml
manifest is identical to the liveness probe.
readinessProbe:
exec:
command:
- "zkOk.sh"
initialDelaySeconds: 15
timeoutSeconds: 5
Even though the liveness and readiness probes are identical, it is important to specify both. This ensures that only healthy servers in the ZooKeeper ensemble receive network traffic.
ZooKeeper needs a quorum of servers in order to successfully commit mutations to data. For a three server ensemble, two servers must be healthy in order for writes to succeed. In quorum based systems, members are deployed across failure domains to ensure availability. In order to avoid an outage, due to the loss of an individual machine, best practices preclude co-locating multiple instances of the application on the same machine.
By default, Kubernetes may co-locate Pods in a StatefulSet on the same node. For the three server ensemble you created, if two servers reside on the same node, and that node fails, the clients of your ZooKeeper service will experience an outage until at least one of the Pods can be rescheduled.
You should always provision additional capacity to allow the processes of critical
systems to be rescheduled in the event of node failures. If you do so, then the
outage will only last until the Kubernetes scheduler reschedules one of the ZooKeeper
servers. However, if you want your service to tolerate node failures with no downtime,
you should use a PodAntiAffinity
annotation.
Get the nodes for Pods in the zk
Stateful Set.
for i in 0 1 2; do kubectl get pod zk-$i --template {{.spec.nodeName}}; echo ""; done
All of the Pods in the zk
StatefulSet are deployed on different nodes.
kubernetes-minion-group-cxpk
kubernetes-minion-group-a5aq
kubernetes-minion-group-2g2d
This is because the Pods in the zk
StatefulSet contain a
PodAntiAffinity annotation.
scheduler.alpha.kubernetes.io/affinity: >
{
"podAntiAffinity": {
"requiredDuringSchedulingRequiredDuringExecution": [{
"labelSelector": {
"matchExpressions": [{
"key": "app",
"operator": "In",
"values": ["zk-headless"]
}]
},
"topologyKey": "kubernetes.io/hostname"
}]
}
}
The requiredDuringSchedulingRequiredDuringExecution
field tells the
Kubernetes Scheduler that it should never co-locate two Pods from the zk-headless
Service in the domain defined by the topologyKey
. The topologyKey
kubernetes.io/hostname
indicates that the domain is an individual node. Using
different rules, labels, and selectors, you can extend this technique to spread
your ensemble across physical, network, and power failure domains.
In this section you will cordon and drain nodes. If you are using this tutorial on a shared cluster, be sure that this will not adversely affect other tenants.
The previous section showed you how to spread your Pods across nodes to survive unplanned node failures, but you also need to plan for temporary node failures that occur due to planned maintenance.
Get the nodes in your cluster.
kubectl get nodes
Use kubectl cordon
to
cordon all but four of the nodes in your cluster.
kubectl cordon < node name >
Get the zk-budget
PodDisruptionBudget.
kubectl get poddisruptionbudget zk-budget
The min-available
field indicates to Kubernetes that at least two Pods from
zk
StatefulSet must be available at any time.
NAME MIN-AVAILABLE ALLOWED-DISRUPTIONS AGE
zk-budget 2 1 1h
In one terminal, watch the Pods in the zk
StatefulSet.
kubectl get pods -w -l app=zk
In another terminal, get the nodes that the Pods are currently scheduled on.
for i in 0 1 2; do kubectl get pod zk-$i --template {{.spec.nodeName}}; echo ""; done
kubernetes-minion-group-pb41
kubernetes-minion-group-ixsl
kubernetes-minion-group-i4c4
Use kubectl drain
to cordon and
drain the node on which the zk-0
Pod is scheduled.
kubectl drain $(kubectl get pod zk-0 --template {{.spec.nodeName}}) --ignore-daemonsets --force --delete-local-data
node "kubernetes-minion-group-pb41" cordoned
WARNING: Deleting pods not managed by ReplicationController, ReplicaSet, Job, or DaemonSet: fluentd-cloud-logging-kubernetes-minion-group-pb41, kube-proxy-kubernetes-minion-group-pb41; Ignoring DaemonSet-managed pods: node-problem-detector-v0.1-o5elz
pod "zk-0" deleted
node "kubernetes-minion-group-pb41" drained
As there are four nodes in your cluster, kubectl drain
, succeeds and the
zk-0
is rescheduled to another node.
NAME READY STATUS RESTARTS AGE
zk-0 1/1 Running 2 1h
zk-1 1/1 Running 0 1h
zk-2 1/1 Running 0 1h
NAME READY STATUS RESTARTS AGE
zk-0 1/1 Terminating 2 2h
zk-0 0/1 Terminating 2 2h
zk-0 0/1 Terminating 2 2h
zk-0 0/1 Terminating 2 2h
zk-0 0/1 Pending 0 0s
zk-0 0/1 Pending 0 0s
zk-0 0/1 ContainerCreating 0 0s
zk-0 0/1 Running 0 51s
zk-0 1/1 Running 0 1m
Keep watching the StatefulSet’s Pods in the first terminal and drain the node on which
zk-1
is scheduled.
kubectl drain $(kubectl get pod zk-1 --template {{.spec.nodeName}}) --ignore-daemonsets --force --delete-local-data "kubernetes-minion-group-ixsl" cordoned
WARNING: Deleting pods not managed by ReplicationController, ReplicaSet, Job, or DaemonSet: fluentd-cloud-logging-kubernetes-minion-group-ixsl, kube-proxy-kubernetes-minion-group-ixsl; Ignoring DaemonSet-managed pods: node-problem-detector-v0.1-voc74
pod "zk-1" deleted
node "kubernetes-minion-group-ixsl" drained
The zk-1
Pod can not be scheduled. As the zk
StatefulSet contains a
PodAntiAffinity
annotation preventing co-location of the Pods, and as only
two nodes are schedulable, the Pod will remain in a Pending state.
kubectl get pods -w -l app=zk
NAME READY STATUS RESTARTS AGE
zk-0 1/1 Running 2 1h
zk-1 1/1 Running 0 1h
zk-2 1/1 Running 0 1h
NAME READY STATUS RESTARTS AGE
zk-0 1/1 Terminating 2 2h
zk-0 0/1 Terminating 2 2h
zk-0 0/1 Terminating 2 2h
zk-0 0/1 Terminating 2 2h
zk-0 0/1 Pending 0 0s
zk-0 0/1 Pending 0 0s
zk-0 0/1 ContainerCreating 0 0s
zk-0 0/1 Running 0 51s
zk-0 1/1 Running 0 1m
zk-1 1/1 Terminating 0 2h
zk-1 0/1 Terminating 0 2h
zk-1 0/1 Terminating 0 2h
zk-1 0/1 Terminating 0 2h
zk-1 0/1 Pending 0 0s
zk-1 0/1 Pending 0 0s
Continue to watch the Pods of the stateful set, and drain the node on which
zk-2
is scheduled.
kubectl drain $(kubectl get pod zk-2 --template {{.spec.nodeName}}) --ignore-daemonsets --force --delete-local-data
node "kubernetes-minion-group-i4c4" cordoned
WARNING: Deleting pods not managed by ReplicationController, ReplicaSet, Job, or DaemonSet: fluentd-cloud-logging-kubernetes-minion-group-i4c4, kube-proxy-kubernetes-minion-group-i4c4; Ignoring DaemonSet-managed pods: node-problem-detector-v0.1-dyrog
WARNING: Ignoring DaemonSet-managed pods: node-problem-detector-v0.1-dyrog; Deleting pods not managed by ReplicationController, ReplicaSet, Job, or DaemonSet: fluentd-cloud-logging-kubernetes-minion-group-i4c4, kube-proxy-kubernetes-minion-group-i4c4
There are pending pods when an error occurred: Cannot evict pod as it would violate the pod's disruption budget.
pod/zk-2
Use CRTL-C
to terminate to kubectl.
You can not drain the third node because evicting zk-2
would violate zk-budget
. However,
the node will remain cordoned.
Use zkCli.sh
to retrieve the value you entered during the sanity test from zk-0
.
kubectl exec zk-0 zkCli.sh get /hello
The service is still available because its PodDisruptionBudget is respected.
WatchedEvent state:SyncConnected type:None path:null
world
cZxid = 0x200000002
ctime = Wed Dec 07 00:08:59 UTC 2016
mZxid = 0x200000002
mtime = Wed Dec 07 00:08:59 UTC 2016
pZxid = 0x200000002
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 5
numChildren = 0
Use kubectl uncordon
to uncordon the first node.
kubectl uncordon kubernetes-minion-group-pb41
node "kubernetes-minion-group-pb41" uncordoned
zk-1
is rescheduled on this node. Wait until zk-1
is Running and Ready.
kubectl get pods -w -l app=zk
NAME READY STATUS RESTARTS AGE
zk-0 1/1 Running 2 1h
zk-1 1/1 Running 0 1h
zk-2 1/1 Running 0 1h
NAME READY STATUS RESTARTS AGE
zk-0 1/1 Terminating 2 2h
zk-0 0/1 Terminating 2 2h
zk-0 0/1 Terminating 2 2h
zk-0 0/1 Terminating 2 2h
zk-0 0/1 Pending 0 0s
zk-0 0/1 Pending 0 0s
zk-0 0/1 ContainerCreating 0 0s
zk-0 0/1 Running 0 51s
zk-0 1/1 Running 0 1m
zk-1 1/1 Terminating 0 2h
zk-1 0/1 Terminating 0 2h
zk-1 0/1 Terminating 0 2h
zk-1 0/1 Terminating 0 2h
zk-1 0/1 Pending 0 0s
zk-1 0/1 Pending 0 0s
zk-1 0/1 Pending 0 12m
zk-1 0/1 ContainerCreating 0 12m
zk-1 0/1 Running 0 13m
zk-1 1/1 Running 0 13m
Attempt to drain the node on which zk-2
is scheduled.
kubectl drain $(kubectl get pod zk-2 --template {{.spec.nodeName}}) --ignore-daemonsets --force --delete-local-data
node "kubernetes-minion-group-i4c4" already cordoned
WARNING: Deleting pods not managed by ReplicationController, ReplicaSet, Job, or DaemonSet: fluentd-cloud-logging-kubernetes-minion-group-i4c4, kube-proxy-kubernetes-minion-group-i4c4; Ignoring DaemonSet-managed pods: node-problem-detector-v0.1-dyrog
pod "heapster-v1.2.0-2604621511-wht1r" deleted
pod "zk-2" deleted
node "kubernetes-minion-group-i4c4" drained
This time kubectl drain
succeeds.
Uncordon the second node to allow zk-2
to be rescheduled.
kubectl uncordon kubernetes-minion-group-ixsl
node "kubernetes-minion-group-ixsl" uncordoned
You can use kubectl drain
in conjunction with PodDisruptionBudgets to ensure that your service
remains available during maintenance. If drain is used to cordon nodes and evict pods prior to
taking the node offline for maintenance, services that express a disruption budget will have that
budget respected. You should always allocate additional capacity for critical services so that
their Pods can be immediately rescheduled.
kubectl uncordon
to uncordon all the nodes in your cluster.