K8s – Stacked etcd to External – Zero Downtime

Because sometimes you start off with stacked etcd nodes, and then decide you really wanted external.

First and foremost, my blog disclaimer is in full effect, especially on this one. Use this info at your own risk! In this post, I’ll cover the steps to convert a Kubeadm deployed stacked etcd cluster into one consuming external etcd nodes, with no downtime.

While this guide is based on Kubeadm clusters, the process can be applied to any cluster if you have access to the etcd config and certs/keys.

Note: This will split the etcd service away from the Kubeadm upgrade process. You will need to manage etcd upgrades and version compatibility manually. You will also need to update your kubeadm-config ConfigMap to tell Kubeadm that etcd is external.
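
As a rough sketch of that last step, the ClusterConfiguration stored in the kubeadm-config ConfigMap needs its etcd stanza switched from local to external. The endpoints and client cert paths below are illustrative placeholders, not values from my cluster.

```bash
# Open the ClusterConfiguration that kubeadm stores in-cluster.
kubectl -n kube-system edit configmap kubeadm-config

# Inside ClusterConfiguration, replace the "etcd: local:" stanza with an
# external stanza along these lines (endpoints and cert paths are examples):
#
#   etcd:
#     external:
#       endpoints:
#         - https://10.0.1.11:2379
#         - https://10.0.1.12:2379
#         - https://10.0.1.13:2379
#       caFile: /etc/kubernetes/pki/etcd/ca.crt
#       certFile: /etc/kubernetes/pki/apiserver-etcd-client.crt
#       keyFile: /etc/kubernetes/pki/apiserver-etcd-client.key
```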

To lay some groundwork, let’s review the implementation details of a stacked etcd cluster deployed by Kubeadm. On each control plane node, Kubeadm will create a static pod manifest in the /etc/kubernetes/manifests directory, named etcd.yaml.

This pod has two hostPath volumeMounts, /var/lib/etcd and /etc/kubernetes/pki/etcd. The /var/lib/etcd directory is where the etcd service will store the db and supporting files.
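
You can spot both mounts quickly in the generated manifest; the output shown in comments is approximate and will vary a bit by kubeadm version.

```bash
# Quick look at the two hostPath volumes in the stacked etcd manifest.
grep -A 3 'hostPath:' /etc/kubernetes/manifests/etcd.yaml
# - hostPath:
#     path: /etc/kubernetes/pki/etcd
#     type: DirectoryOrCreate
#   name: etcd-certs
# - hostPath:
#     path: /var/lib/etcd
#     type: DirectoryOrCreate
#   name: etcd-data
```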

The /etc/kubernetes/pki/etcd directory is where Kubeadm places a ca.key and ca.crt (a separate CA from the one used for the kube-apiserver is used for etcd self-signed certs). Kubeadm then uses this CA to generate the key pairs and certificates for each node. The server key and cert provide transport layer security. The client and peer certificates/keys enable node-to-node and user-to-node authentication. We configure etcd to trust/authenticate any certificate that has been issued by the trusted CA.
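
On a stacked control plane node, the contents of that directory look like this (the file names are the kubeadm defaults; the apiserver-etcd-client pair lives one level up in /etc/kubernetes/pki).

```bash
# The etcd PKI material kubeadm lays down on each stacked control plane node.
ls /etc/kubernetes/pki/etcd
# ca.crt  ca.key  healthcheck-client.crt  healthcheck-client.key
# peer.crt  peer.key  server.crt  server.key
```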

The etcd CA cert must be the same on each node. So with a common etcd CA signing cert/key, each node receives a generated server, client, and peer certificate that all other etcd nodes trust.
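
One way to mint those certs on a new external node is to copy just the etcd CA pair over and let kubeadm’s cert phases sign that node’s certs from it. This is only a sketch: the hostnames, IPs, SANs, and config apiVersion are placeholders and may differ for your kubeadm version.

```bash
# On the new external etcd host: bring over only the etcd CA pair
# from an existing control plane node (cp1 is a placeholder name).
mkdir -p /etc/kubernetes/pki/etcd
scp cp1:/etc/kubernetes/pki/etcd/ca.crt /etc/kubernetes/pki/etcd/
scp cp1:/etc/kubernetes/pki/etcd/ca.key /etc/kubernetes/pki/etcd/

# Minimal kubeadm config so the cert phases know this node's SANs.
cat > /root/etcd-certs.yaml <<EOF
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
etcd:
  local:
    serverCertSANs: ["etcd-ext-1", "10.0.1.11"]
    peerCertSANs: ["etcd-ext-1", "10.0.1.11"]
EOF

# Generate this node's server, peer, and client certs signed by the etcd CA.
kubeadm init phase certs etcd-server --config=/root/etcd-certs.yaml
kubeadm init phase certs etcd-peer --config=/root/etcd-certs.yaml
kubeadm init phase certs etcd-healthcheck-client --config=/root/etcd-certs.yaml
```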

Ok, that’s the TLS side of how stacked etcd is set up by Kubeadm. With this understanding, we can use the etcd CA cert and key to configure external nodes so they are also authenticated as members of the etcd cluster.

The second detail to look into is how the stacked etcd members form a cluster. Each time we add a control plane node with Kubeadm, it reads the etcd cluster member list and generates an --initial-cluster config parameter for the new etcd instance (the --initial-* options are ignored by etcd after first init). If we look in the static pod etcd manifest for the first control plane in our cluster, we’ll see --initial-cluster contains just that first node. When we add another cp node, its --initial-cluster will have both the first node and the new node listed, and so on.
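
To make that concrete, here is roughly what the flag looks like in the generated manifests; the node names and IPs are made up.

```bash
# First control plane node's /etc/kubernetes/manifests/etcd.yaml lists only itself:
#   --initial-cluster=cp1=https://10.0.0.11:2380
#
# A second control plane joined later gets both members, plus the "existing" state:
#   --initial-cluster=cp1=https://10.0.0.11:2380,cp2=https://10.0.0.12:2380
#   --initial-cluster-state=existing
```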

I considered two options for this exercise: either snapshot and restore with a hard cutover, or add external etcd nodes to the existing cluster and then phase out the stacked members. I decided that the second option would be the least disruptive (but it also carries the most risk, so I will take a snapshot just before beginning the task).
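
The safety snapshot is a one-liner against any healthy member; the endpoint and output path below are just examples.

```bash
# Take a snapshot from the local stacked member before changing membership.
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save /root/etcd-snapshot-before-migration.db
```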

I’ll create three external etcd nodes, configure them with certificates signed by the kubeadm generated etcd CA cert, and join them to the existing etcd cluster. They will sync with the existing nodes. Then I’ll configure the kube-apiserver static pod manifests to point to the external etcd nodes.
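
Joining each external node boils down to a member add against the existing cluster, then starting etcd on the new host with the initial-cluster values the command prints back. The member name and addresses below are placeholders.

```bash
# From an existing stacked member, register the first external node.
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  member add etcd-ext-1 --peer-urls=https://10.0.1.11:2380
# etcdctl prints the ETCD_INITIAL_CLUSTER / ETCD_INITIAL_CLUSTER_STATE values
# to start the new member with. Start etcd on etcd-ext-1 using those values,
# wait for it to sync, then repeat for the remaining external nodes.
```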

(Another interesting Kubeadm stacked implementation detail is that each control plane node only communicates with its colocated etcd node, via the localhost address. The etcd client used by Kubernetes implements client-side load balancing, so we can provide multiple etcd node addresses to it. I’m still up in the air on whether or not a managed LB in front of etcd nodes would be better.)
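
When it’s time to repoint, the change is just a few flags in each kube-apiserver static pod manifest; the addresses below are placeholders for the three external nodes.

```bash
# In /etc/kubernetes/manifests/kube-apiserver.yaml on each control plane node,
# swap the localhost etcd endpoint for the external members (one node at a time;
# the kubelet restarts the apiserver when the manifest changes):
#   --etcd-servers=https://10.0.1.11:2379,https://10.0.1.12:2379,https://10.0.1.13:2379
#   --etcd-cafile=/etc/kubernetes/pki/etcd/ca.crt
#   --etcd-certfile=/etc/kubernetes/pki/apiserver-etcd-client.crt
#   --etcd-keyfile=/etc/kubernetes/pki/apiserver-etcd-client.key
```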

Once kube-apiserver is pointed at the external nodes only, I’ll remove the stacked members from the etcd cluster, and remove the static pod manifests for etcd from the control plane hosts. The result is zero downtime reconfiguration of the Kubernetes cluster etcd infrastructure.
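
Retiring the stacked members then looks roughly like this; the member IDs come from member list, and the paths are the kubeadm defaults.

```bash
# From an external member, find and remove each stacked member by its ID.
ETCDCTL_API=3 etcdctl \
  --endpoints=https://10.0.1.11:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  member list
ETCDCTL_API=3 etcdctl \
  --endpoints=https://10.0.1.11:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  member remove <stacked-member-id>

# On each control plane host, moving the manifest out of the static pod
# directory makes the kubelet stop the local etcd pod.
mv /etc/kubernetes/manifests/etcd.yaml /root/etcd.yaml.retired
```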

Link to repo with directions and more details.