Beyond the K8s Downward API – Expanding runtime cluster values for containers

In this post, I’m taking on the task of getting cluster object properties into a container at initialization, specifically values not exposed by the downward api. I am addressing a very specific need to add region- and AZ-specific values to a replicated database instance, but this pattern can be used in many other cases.

I am working with a replicated YugaByte redis database in a K8s cluster spanning AWS and GCP, with a micro-service app leveraging it for its shopping cart service. If you’ve worked with YugaByte, or other replicated databases, you’ll likely be familiar with conveying the AZ, rack, or host that the workload is running on via metadata. In this case, I wanted to dynamically add metadata to each pod at scheduling time to convey the CSP, region, and availability zone it was scheduled to.

This is not an overly complex exercise, but it does require some knowledge of basic K8s constructs. For example, I won’t go into what a configmap is or how to use jq. There are plenty of online resources if there is a step you’re unfamiliar with. There are a few posts here and there about this pattern, but you’re hard-pressed to find one detailing each step. In this post I’ll explain exactly what I did, with all of the steps you need to replicate this for any cluster value.

While we have the downward api to help with this type of task, it is limited in what data it can inject; node labels are not included. But by combining values that are available via the downward api with api server requests, we can get any value we desire.

K8s is region and zone aware, so that information exists by default via node labels. The SaaS offering I used to set this cluster up adds a CSP topology label to the nodes, so I have all the info I need. The basic pattern is to leverage a configmap for the database container command string (stored as a shell script), modify it from an init container, and then consume it from the main container.

  1. Create required role binding and a configmap to store shell script
  2. Create statefulset (or whatever you’re using for your pods), set the serviceaccount and define volumes for the configmap and emptydir
  3. Define an init container
  4. Mount configmap and an emptydir volume
  5. Copy shell script from configmap to emptydir volume
  6. Perform whatever operations (e.g. api server queries, string editing, etc.) you need
  7. Make changes to the shell script on the emptydir mount
  8. Mount same emptydir to main container and execute the shell script
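
Since everything hinges on the node labels, it’s worth a quick check that the labels you need are actually present on your nodes before you start. The standard topology labels are shown below; the CSP label in my cluster comes from the SaaS offering, so your label keys may differ:

kubectl get nodes --show-labels
kubectl get nodes -L topology.kubernetes.io/region,topology.kubernetes.io/zone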

Let’s look at the full process in detail. For this use case:

  1. The fields that define CSP/region/zone in the command string are pre-populated with placeholders (%CSP%, %REGION%, %ZONE%).
  2. Mount an emptydir volume in the init container plus the configmap as a volume, and configure the main container to mount the same emptydir.
  3. Within the init container, install jq for json parsing, curl for api calls, and sed for string manipulation (in this case I only installed jq, as the other two are included in the image I used).
  4. Copy the configmap shell script to the emptydir volume (configmap volumes are not writeable, so we copy the script to a volume that is).
  5. Use the node name exposed via the downward api to decide which node to retrieve labels for.
  6. Curl the node’s labels via the api server (this requires some RBAC setup, explained in the detailed steps below).
  7. Pipe the result into jq to extract the value we want and assign it to a variable.
  8. Sed the shell script to replace the placeholder text with the value stored in the variable.
  9. The init container is done; the main container now uses the updated shell script from the shared emptydir as its command entry point.

First we need a few things set up for this to work. We’ll begin by creating a service account, clusterrole, and clusterrolebinding, giving the service account just enough permissions to get the info we need. In K8s, every namespace automatically has a default service account that gets assigned to pods. Modifying default isn’t a great idea, so we create a new one and then assign it to the pod. Below you see the serviceaccount and the clusterrole privileges required; a clusterrole (rather than a namespaced role) is required because nodes are cluster-scoped.

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: nodereader
  namespace: yugabyte
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: node-reader
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  - pods
  verbs:
  - get
  - list
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: crb-read-nodes
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: node-reader
subjects:
- kind: ServiceAccount
  name: nodereader
  namespace: yugabyte
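
Once these are applied, a quick way to confirm the permissions took effect (an optional check, not part of the original steps) is kubectl’s auth can-i:

kubectl auth can-i get nodes --as=system:serviceaccount:yugabyte:nodereader

This should return yes.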

The next part we need is our configmap. We are not using the configmap for key/value pairs here. We are using it to store a shell script that will be the basis of our main container’s entry command.

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: yb-master
data:
  init-yb.sh: |
    #!/bin/sh
    /home/yugabyte/bin/yb-master --fs_data_dirs=/mnt/disk0,/mnt/disk1 \
    --rpc_bind_addresses=${HOSTNAME}.yb-masters.${NAMESPACE}.svc.cluster.local \
    --server_broadcast_addresses=${HOSTNAME}.yb-masters.${NAMESPACE}.svc.cluster.local:7100 \
    --webserver_interface=0.0.0.0 \
    --master_addresses=yb-master-0.yb-masters:7100,yb-master-1.yb-masters:7100,yb-master-2.yb-masters:7100 \
    --replication_factor=3 \
    --enable_ysql=true \
    --metric_node_name=${HOSTNAME} \
    --memory_limit_hard_bytes=1824522240 \
    --stderrthreshold=0 \
    --num_cpus=1 \
    --undefok=num_cpus,enable_ysql \
    --default_memory_limit_to_ram_ratio=0.85 \
    --placement_cloud=%CSP% \
    --placement_region=%REGION% \
    --placement_zone=%ZONE%

This script is used to start a YugaByte database cluster. We could take this config further and provide additional configuration (e.g. dynamically set the replication factor based on the statefulset replicas value, add rpc_bind_addresses, etc.).

You can see environment variables like HOSTNAME and NAMESPACE being used. Note the syntax: in a container’s command or args field, Kubernetes expands variables written as $(VAR), but because this is a shell script the shell does the expansion, so we use the ${VAR} form. At the bottom of the script you see the placeholders I will find and replace with sed.
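
Once the init container has done the find/replace described below, those last three lines will end up looking something like this (the values shown are just examples for an AWS node):

    --placement_cloud=aws \
    --placement_region=us-east-1 \
    --placement_zone=us-east-1a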

Ok, we have our RBAC requirements set up and a templated shell script in a configmap. Next we’ll configure our statefulset to use these pieces. I’m not showing the entire statefulset below, just the parts involved.

Starting with the serviceAccountName in the statefulset pod spec: this is what gives our init container permission to retrieve node information via the clusterrolebinding we created. When a pod is created in this statefulset it is associated with this service account, and the service account’s token and the cluster CA certificate are mounted into each container’s filesystem at a predictable location.

---
apiVersion: apps/v1
kind: StatefulSet
spec:
  template:
    metadata:
      labels:
        app: "yb-master"
    spec:
      serviceAccountName: nodereader
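
That predictable location is /var/run/secrets/kubernetes.io/serviceaccount. From inside any container running with this service account you can list it and see the pieces we use later on:

ls /var/run/secrets/kubernetes.io/serviceaccount

You’ll find ca.crt, namespace, and token in there; the token and ca.crt are what the init container’s curl command uses.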

The next pertinent part of our sts is the init container. For this I used a Photon OS image (because it’s small) and then installed jq via curl. Obviously we’d want to build an image with the required binaries baked in if using this for production. The init container mounts both the shared emptydir volume and the configmap as a volume, then copies the configmap shell script to the writeable shared volume. Notice that the key under .data in the configmap becomes the file name we reference. In this case, I’ve mounted the configmap as a volume with mountpath /cm-yb/ and the emptydir volume with mountpath /tmp/env/.
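
The volume definitions themselves aren’t shown in the snippets below; under the pod spec they would look roughly like this (the volume names are my own placeholders):

      volumes:
      - name: cm-yb
        configMap:
          name: yb-master
      - name: shared-env
        emptyDir: {}

And in the init container both get mounted (the main container only needs the emptydir, shown later):

        volumeMounts:
        - name: cm-yb
          mountPath: /cm-yb
        - name: shared-env
          mountPath: /tmp/env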

      initContainers:
      - name: init-topology
        image: "photon:3.0-20210108"
        imagePullPolicy: IfNotPresent
        command:
          - "sh"
          - "-c"
          - |
            cp /cm-yb/init-yb.sh /tmp/env/init-yb.sh
            # make sure the copied script is executable (configmap mounts default to 0644)
            chmod +x /tmp/env/init-yb.sh

Next we curl down jq, a tool for working with JSON, which is used in the following step to strip out the values I need from the api server’s JSON responses.

            curl -L'#' -o /usr/bin/jq https://github.com/stedolan/jq/releases/download/jq-1.5/jq-linux64 && chmod +x /usr/bin/jq

For brevity, I am just showing the first value (CSP). Here we create an env var called CSP and set its value to the result of an api server get on our node. Curl is used with -s (silent) rather than -v so that request details, including the service account’s token in the Authorization header, don’t end up in the container logs. You could use kubectl for this rather than combining curl and jq, and the api call output would automatically be suppressed. The container’s service account (nodereader) token is inserted into the request header via cat from the known location in the container’s file structure, which authenticates the request with the required permissions. We pipe the api server’s response into jq (with -r so the value comes back without surrounding quotes) to retrieve just the label value we want.

[Tip: Add -v=8 (or 6, 7, 9) to your kubectl commands, e.g. kubectl get nodes -v=8, to see verbose output including the exact api request URLs when you’re working out api get/post strings]

            export CSP=$(curl -s --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" https://$KUBERNETES_PORT_443_TCP_ADDR:$KUBERNETES_PORT_443_TCP_PORT/api/v1/nodes/$K8S_NODE | jq -r '.metadata.labels."topology.SaaS/csp"')

Now we use sed to find/replace the %CSP% in the copied shell script with the env var we created.

            sed -i "s/%CSP%/$CSP/g" /tmp/env/init-yb.sh

You can see in the curl commands above that I am using a number of environment variables to build the api server path and to refer to the correct K8s node for our labels. The KUBERNETES_PORT_* variables are injected automatically by Kubernetes into every container; the node name comes from the downward api and is defined in the container spec like this:

        env:
        - name: K8S_NODE
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName 
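
The configmap script also references ${NAMESPACE}, which isn’t shown in the spec above. It would be defined the same way via the downward api, while HOSTNAME needs no definition since the kubelet sets it to the pod name automatically:

        - name: NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace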

Finally, we mount the shared emptydir in our main container, and run the updated shell script as the entry point to launch our database server.

      containers:
      - name: "yb-master"
        image: "yugabytedb/yugabyte:2.5.1.0-b153"
        command:
          - "sh"
          - "-c"
          - |
            /tmp/env/init-yb.sh
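
As with the init container, the main container needs the shared emptydir mounted at the same path. Using the placeholder volume name from the earlier sketch, that is simply:

        volumeMounts:
        - name: shared-env
          mountPath: /tmp/env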

That’s it. This pattern can be used to get any value in the cluster and inject it into your container at creation.
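
If you want to verify the substitution worked, one option is to exec into a running pod and check the rendered script on the shared volume (the pod and container names here match the examples above):

kubectl -n yugabyte exec yb-master-0 -c yb-master -- grep placement /tmp/env/init-yb.sh

You should see the placement flags populated with the csp, region, and zone of the node the pod was scheduled to.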