Kubernetes Networking – Nodes and Pods

I’ve been procrastinating on preparing for my CNCF Certified Kubernetes Administrator certification. I figured it was time to get to it, and thought a series of blog posts on the various topics I’m digging into would be of interest to others.

Beginning with K8s networking, I’ll go into the details of the various layers of networking, how they come together, and how k8s leverages them to provide us with functioning container services.

This is a big topic, so I’ve decided to take a multi-post approach with it. I’ll start with basic networking of nodes and pods. I’ll cover network policies in another post. And then I’ll cover services, deployments, service discovery, etc. in a final post.

Note: Ultimately, I’d like to tie these posts into a corresponding lab where you can follow along with tasks to reinforce the concepts through hands-on exercises. Working with the VMware CNABU, we hope to have an Essential PKS HOL pod where you’ll be able to spin up a full Kubernetes environment and leverage exercises we post to our GitHub repo. Stay tuned for that. If you’d like to check out the Essential PKS HOL, it’s available at https://labs.hol.vmware.com/HOL/catalogs/lab/6639. This lab will spin up an Essential PKS environment on vSphere in ~30 seconds.

Kubernetes imposes the following fundamental requirements on any networking implementation (barring any intentional network segmentation policies [more on network policies later]):

  • pods on a node can communicate with all pods on all nodes without NAT
  • agents on a node (e.g. system daemons, kubelet) can communicate with all pods on that node
  • the IP that a Pod sees itself as is the same IP that others see it as

At a minimum, k8s requires three non-overlapping IPv4 CIDR ranges (I won’t cover IPv6 in this post): one for the nodes (masters, etcd, and workers), one for the pods, and one for service addresses (more on services in another post).
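
As a quick sanity check, the three ranges can be validated with Python’s standard ipaddress module. The node and pod ranges below are illustrative examples I made up; only 10.96.0.0/12 is an actual kubeadm default (for services):

```python
# Sanity-check that the three cluster CIDR ranges don't overlap.
# Node and pod ranges here are illustrative; 10.96.0.0/12 is the
# kubeadm default service CIDR.
import ipaddress
from itertools import combinations

cidrs = {
    "nodes":    ipaddress.ip_network("192.168.10.0/24"),  # masters, etcd, workers
    "pods":     ipaddress.ip_network("10.244.0.0/16"),    # pod network
    "services": ipaddress.ip_network("10.96.0.0/12"),     # clusterIP range
}

# Compare every pair of ranges; overlaps() catches containment in
# either direction.
for (name_a, net_a), (name_b, net_b) in combinations(cidrs.items(), 2):
    assert not net_a.overlaps(net_b), f"{name_a} overlaps {name_b}"

print("no overlapping ranges")
```

Run this with your own candidate ranges before you init the cluster; fixing an overlap afterward is far more painful.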

For the node network, we’re implementing networking for the hosts that run the k8s services. This network will have IP addresses assigned to physical and/or VM NICs (depending on whether you plan to run k8s on bare metal, VMs, or both). This is no different from the networking connecting any other hosts running any service or application on the network.

The number of node addresses you’ll need will vary with the desired architecture of your final cluster. The simplest architecture would be a single node with a single stacked control plane. That is, a single master node running the etcd service and configured to schedule pods on the master. This is only acceptable for a dev/test implementation.

In a production deployment of k8s, you’d likely deploy a multi-master architecture, perhaps with external etcd nodes. This requires a load balancer in front of the control plane nodes. And finally, you’d have a number of worker nodes to schedule pods on.

K8s cluster design and architecture aren’t the focus of this post, so I’ll address them in a post dedicated to design and deployment (there are a couple of pitfalls to avoid when using kubeadm to init your first control-plane node if you intend to expand to multi-master in the future). Here, I’ll focus on a simple, single-node stacked control plane with a couple of worker nodes. In this case, we’d need three node IP addresses (one master plus two worker nodes), with full network connectivity among all machines in the cluster.

There is really no special configuration required, other than making sure your node IP addresses don’t conflict with other addresses, and accounting for a load-balanced control plane during installation. For this simple config, we just need our hosts configured with a container runtime and configured via kubeadm to initialize or join the cluster. The configured IP addresses of the nodes will be recorded in the k8s etcd database for services like kube-scheduler to look up. Note: if we were to deploy a multi-master cluster, we would need to take the load balancer address for the cluster API into account.

For pod networking, we have a number of options, thanks to the Container Network Interface (CNI). Pods use ‘software-defined’ networking that is implemented by a CNI plugin. To understand this better, it’s good to start with the basics of container networking and the Linux network namespace.

When we create a container, we create a named Linux network namespace. This namespace provides a new network stack that is oblivious to the root network namespace. A process started inside the new network namespace does not see the processes in the root namespace; it can only communicate with the root namespace over its namespace’s network interface. Essentially, the process in the created network namespace sees itself as being on a different host than the processes running in the root namespace.

A caveat of creating a network namespace is that it starts with only a loopback interface. So if we look at how k8s creates a pod, the pause container should make sense. When k8s creates a pod, it starts a pause container (which runs nothing other than a paused process) to bring up and hold the pod’s eth0 interface. This also allows k8s to restart any and all containers in a pod (other than pause) without having the network stack torn down.

So a pod is a process or processes (one or more containers) running in a network namespace (there are other namespaces and cgroups involved as well), which has a single IP address assigned to its eth0 interface. A corresponding virtual Ethernet (veth) adapter is created in the root network namespace; the pod’s eth0 is the other end of this veth pair, which patches the pod into the root namespace’s networking.
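
The pod-as-shared-namespace idea can be reduced to a toy model. Nothing below is a real k8s API; the class and field names are purely illustrative:

```python
# Toy model of a pod: every container joins the same network
# namespace, so they all report the pod's single IP address.
from dataclasses import dataclass, field

@dataclass
class NetNamespace:
    ip: str  # the one address assigned to the namespace's eth0

@dataclass
class Pod:
    netns: NetNamespace
    # The pause container is created first and holds the namespace open.
    containers: list = field(default_factory=lambda: ["pause"])

    def add_container(self, name: str) -> None:
        self.containers.append(name)  # joins the existing namespace

    def container_ip(self, name: str) -> str:
        return self.netns.ip  # shared by every container in the pod

pod = Pod(NetNamespace("10.244.1.7"))
pod.add_container("app")
pod.add_container("sidecar")
print(pod.container_ip("app") == pod.container_ip("sidecar"))  # one IP per pod
```

The point of the sketch: restarting "app" or "sidecar" doesn’t touch the namespace, because pause is what keeps it alive.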

Note: You can use Linux CLI commands to visually inspect all of these constructs on your k8s nodes. As per the note at the beginning of this post about a lab environment with exercises, I hope to put together some exercises to walk through those tasks.

How this connection from container eth0 -> corresponding veth -> root network namespace eth0 occurs depends on the CNI plugin and its mode. In some cases, the veth interface connects to a virtual bridge. This bridge connects all containers (pods, in the case of k8s) to each other on the local host and provides a default gateway path to the root namespace eth0 for external communication. In other cases, the veth connects directly to the root namespace eth0 interface; here, a more complex routing implementation is being used. Calico with BGP is an example of this.

I found both of these resources helpful to better understand what’s actually going on inside container networking:
http://ifeanyi.co/posts/linux-namespaces-part-4/
https://www.youtube.com/watch?v=6v_BDHIgOY8&feature=youtu.be

I found this Linux Foundation collateral useful for understanding Flannel and Calico a bit more deeply. Calico’s IP-in-IP mode is an interesting model with some odd approaches that just work. https://events.linuxfoundation.org/wp-content/uploads/2018/07/Packet_Walks_In_Kubernetes-v4.pdf

So, how does k8s assign IP addresses to pods on multiple hosts that are unique and allow connections between all pods without NAT? This again comes down to the plugin being used.

Some will create one big mesh between all of the nodes in the cluster and implement a flat network space. Others will assign a subset of a larger IP CIDR to each host and then create routing entries at the virtual bridge. Others will create overlay networks (VXLAN, Geneve, etc.) between the nodes. NSX-T creates an IP range per k8s namespace: again, a more complex implementation, but with the inherent benefit of aligning with the intent of k8s namespaces. There are more than a handful of ways plugins make these k8s directives true:

  • pods on a node can communicate with all pods on all nodes without NAT
  • agents on a node (e.g. system daemons, kubelet) can communicate with all pods on that node
  • the IP that a Pod sees itself as is the same IP that others see it as

The above k8s directives define the base requirements for a CNI plugin, but they don’t restrict it. There are plugins that only cover L2, others that go to L3, and still others that go higher. At a minimum, a network plugin needs to address the directives. How it does so is not k8s’s concern, which allows for great diversity and innovation in this area.
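
The "subset of a larger CIDR per host" scheme mentioned above can be sketched with the ipaddress module. The pod CIDR and node names here are made up for illustration:

```python
# Carve a cluster-wide pod CIDR into one /24 per node, as per-host
# IPAM schemes do. The range and node names are illustrative.
import ipaddress

pod_cidr = ipaddress.ip_network("10.244.0.0/16")
nodes = ["master-0", "worker-0", "worker-1"]

# subnets(new_prefix=24) yields 10.244.0.0/24, 10.244.1.0/24, ...
per_node = dict(zip(nodes, pod_cidr.subnets(new_prefix=24)))

for node, subnet in per_node.items():
    print(f"{node}: {subnet}")
```

Because every pod IP now falls inside exactly one node’s block, a single route entry per node is enough to reach all of its pods, and no NAT is needed anywhere.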

If you follow along with Kelsey Hightower’s Kubernetes the Hard Way, you’ll see that during the bootstrapping of the worker nodes, he configures the CNI and supplies a pre-populated pod CIDR block. This is specific to running k8s in the Google cloud. If you are installing on-prem with kubeadm, you’d init your cluster and then install the CNI. In this example from a previous post, you can see I apply Weave. Each overlay has its own way of configuring the pod CIDR; Weave’s default is 10.32.0.0/12. If that range would interfere with other CIDR ranges on the local network, it must be changed.
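
Checking Weave’s default against your local ranges is the same kind of overlap test as before. The on-prem range below is hypothetical:

```python
# Does Weave's default pod CIDR collide with an existing local range?
import ipaddress

weave_default = ipaddress.ip_network("10.32.0.0/12")  # Weave's default pod range
local_range = ipaddress.ip_network("10.40.0.0/16")    # hypothetical existing network

# 10.32.0.0/12 spans 10.32.0.0 - 10.47.255.255, so this overlaps:
print(weave_default.overlaps(local_range))  # True -> override Weave's range
```

A /12 is a big block; it’s easy to forget it swallows everything from 10.32 through 10.47.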

As services will be covered in a future post, I won’t dig deep into the topic right now. I will mention here that the service CIDR block has potential implications as well. For now, know that you should avoid overlap between the cluster service CIDR block and any other network CIDR blocks (you can change the service CIDR block at install time to avoid such a conflict).

If installing with kubeadm, you can pass the following flag to the init procedure to define the cluster service (VIP) IP CIDR block:

kubeadm init --service-cidr [string] (The default, when installing via kubeadm, is 10.96.0.0/12.)
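
To get a feel for how large that default range is, a quick check:

```python
# Size of kubeadm's default service CIDR.
import ipaddress

svc = ipaddress.ip_network("10.96.0.0/12")
print(svc.num_addresses)  # 2**20 = 1048576 possible cluster VIPs
```

That is far more service VIPs than most clusters will ever use, which is why shrinking the range to dodge a conflict is usually harmless.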

So we have three CIDR blocks to plan for at a minimum: the node IP block, the pod IP block, and the clusterIP service block. Pods receive an IP address that allows the containers running in them to communicate with containers in other pods. The nodes are able to communicate with each other and the pods they house, and there is no overlap in the IP ranges, so routing issues are avoided.

As you can see, a pod is the construct that allows multiple containers to share an IP address (as though they were on the same host). It’s really just a matter of creating a network namespace and then running those containers in the same namespace.

While k8s networking imperatives state that all pods should be able to communicate with all others in the cluster sans NAT, we’ll see that modern-day k8s doesn’t drive us to have pods communicate directly with each other, and in fact tends to encourage communication via virtual addresses that utilize DNAT and SNAT translation. That will be covered under the topic of kube-proxy and services.

That’s it for this post. Next post, I’ll cover services and service discovery. Then I’ll follow up with network policies, kube-proxy, and iptables, and finally wrap up by tying it all together.