Kubernetes Networking – Nodes and Pods – Sample Packet Walk

In my previous post, I covered the basics of Kubernetes networking. I thought it would be useful to follow up with a visual walk-through of a packet’s path from a pod to a remote node. In this example, I’m using K8s deployed with one master, three worker nodes, and NSX-T networking at both the node and pod level.

You can click the images in this post to open a larger view in a new tab. As you can see, I have three worker nodes with IP addresses 172.15.0.3 through 172.15.0.5. The node IP addressing for this cluster is configured using the 172.15.0.0/24 CIDR.

Next, let’s look at one of the pods I have running. This is the kube-dns pod. As you can see, it’s running in the kube-system namespace on worker node 172.15.0.3 (as seen in the list of worker nodes above), has an assigned IP address of 172.16.1.2, and hosts a container called kubedns.

The kube-dns pod has multiple containers, but we will use the kubedns container to initiate a traceroute from within the pod. The NSX-T network container plugin implements a subnet per namespace. So we can assume that any pod running in the kube-system namespace will reside on the same subnet as our kube-dns pod.

Let’s take a closer look at our worker node at 172.15.0.3. First we’ll use the Kubernetes API server to describe the state of the node, then we’ll ssh into the node and examine its runtime config. From the image below (kubectl describe node 172.15.0.3), you can see there are nine pods scheduled on it. Among them we see kube-dns, as expected from the previous describe pod operation.

From the node console, you will see that it has a number of network interfaces: one physical, the rest virtual. (Since the node is actually a VM, all the interfaces are virtual at one layer or another, but the OS in the VM thinks eth0 is physical.)

With the ‘ip addr’ command, we see the usual loopback and eth0 interfaces for the node. We also see a number of virtual interfaces. Given that the CNI plugin in use here is the NSX NCP, there are a few odd interfaces that exist to enable the methods it employs. They aren’t super important for understanding how a packet travels from pod to pod or pod to node, but I’ll explain them a bit.

The veth labeled ‘nsx-container’ with the same IP and MAC as eth0 represents the NSX node agent connection from something called the hyperbus on the ESXi host. Hyperbus and node agent work together to assign network addresses to pods, derived from the NSX control plane. You’ll see one veth on VLAN 4094 that represents the hyperbus interface on the ESXi host (VMK) and one veth on VLAN 4094 for the node agent interface in the node.

Then, we have the virtual interface ‘nsx-agent-outer’ and nine additional virtual ethernet interfaces. In my previous post, I referred to the veth pairs formed from the eth0 inside a pod to a virtual port in the root network namespace. As you’ll recall from our describe node above, this node has nine pods scheduled on it. The last nine interfaces listed above are the root net namespace ends of the veth pairs for the nine pods.

The use of VLAN tagging per pod allows NSX’s distributed firewall to micro-segment down to the pod level, even when pods reside on the same subnet and on the same worker node. Namespace to namespace would be a simple L3 E/W rule. Where most other CNI plugins use iptables to enforce firewalling, the NSX plugin utilizes its distributed firewall.

If you look at the NSX logical switch and corresponding port, it will show no VLAN ID configured. But if you look at the attached VIF, you’ll see Traffic Tag, and that is what represents the VLAN ID that is conveyed in the OVS MAC table. The VLAN is only required for the final leg of the circuit, from overlay termination at the VM NIC to the pods in the VM. So the VLAN tagging occurs when the packet leaves the NSX logical port and traverses the OVS switch. In the other direction, the tag is removed as the packet passes from the OVS switch back to the NSX logical port.

The nsx-agent-outer interface is the link from the OVS switch to the node eth0 interface. (Note: this is my interpretation of the nsx-agent-outer interface. If I hear otherwise, I will update this description.) In the end, I didn’t intend for this to be a post on NSX container networking, but a walk through the path a packet takes from pods to nodes, or pods to pods. I just happen to have NSX configured in this cluster.

So, we have our worker nodes running on the NSX overlay with 172.15.0.0/24 addresses, and our pods running in namespaces that are given a subset of the 172.16.0.0/16 CIDR block. If we need to get traffic out of the overlay, we would traverse through an uplink edge connection that ties physical and overlay together. For communication between pods and nodes, we will always stay within the overlay network.
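To make the two address ranges concrete, here is a minimal Python sketch using the standard-library `ipaddress` module. The CIDRs come from this cluster; the `classify` helper and its labels are my own names for illustration and are not part of Kubernetes or NSX-T. The 192.0.2.10 address is just a documentation-range stand-in for "something that is neither a node nor a pod".

```python
import ipaddress

# CIDRs taken from this cluster as described in the post.
NODE_NET = ipaddress.ip_network("172.15.0.0/24")
POD_NET = ipaddress.ip_network("172.16.0.0/16")


def classify(addr: str) -> str:
    """Label an address by which range contains it (hypothetical helper)."""
    ip = ipaddress.ip_address(addr)
    if ip in NODE_NET:
        return "node network"
    if ip in POD_NET:
        return "pod network"
    return "neither node nor pod network"


for addr in ("172.15.0.4", "172.16.1.2", "192.0.2.10"):
    print(addr, "->", classify(addr))
```

Anything that lands in the first two buckets stays inside the overlay; only the last bucket would need the edge uplink.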

Let’s take a look inside a container running inside the kube-system namespace. First, we can see the container sees its IP address as 172.16.1.2. Since all pods in the kube-system namespace receive IP addresses from the 172.16.1.0/24 CIDR, this is expected. Another pod running in this namespace would receive, for example, 172.16.1.3. 172.16.1.1 shows as a neighbor; this is the default gateway for this subnet and is an endpoint on an NSX tier 1 router. All pods on this subnet will be connected to an NSX virtual switch that is also connected to a tier 1 router. The tier 1 router will then be connected to a tier 0 router, which is connected to other subnets and to the external physical network via a VLAN uplink interface.
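The reason the traceroute later starts at 172.16.1.1 is the usual on-link versus gateway decision. Here is a small sketch of that decision in Python, using the pod address, subnet, and gateway from this post. The `next_hop` function is a hypothetical helper of mine and simplifies what the pod’s kernel routing table actually does.

```python
import ipaddress

# Values from the post: the kube-dns pod and its namespace subnet.
pod_if = ipaddress.ip_interface("172.16.1.2/24")
gateway = ipaddress.ip_address("172.16.1.1")  # interface on the NSX tier 1 router


def next_hop(dst: str):
    """Return the address the pod would hand the packet to (hypothetical helper)."""
    dst_ip = ipaddress.ip_address(dst)
    # Destinations on the pod's own subnet are reached directly;
    # everything else is sent to the default gateway.
    return dst_ip if dst_ip in pod_if.network else gateway


print(next_hop("172.16.1.3"))  # another pod in kube-system -> 172.16.1.3
print(next_hop("172.15.0.4"))  # a worker node on another subnet -> 172.16.1.1
```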

Ok, we have a pod running in its own netns on the worker node. It has an IP of 172.16.1.2 (take note of the MAC address as well). The pod interface patches through to the virtual ethernet port assigned to the Open vSwitch (OVS) in the root net namespace. The OVS switch has a virtual port whose interfaces are VLAN-mapped to the host TEP. Simple, right?

Essentially, NSX-T has logical (overlay) switches on each ESXi host, and those switches have ports assigned to pods. Each logical switch port is tagged with a unique VLAN and mapped into the OVS switch inside the k8s node. The OVS switch port then maps to a veth that is paired with the pod’s eth0 interface.

We can correlate this by evaluating the MAC table on the OVS switch with the VIF ports assigned to the k8s namespace NSX switch to see how VLAN tagging is being applied. I’m using ovs-appctl on the worker node to dump the MAC table and then showing the corresponding VIF for the pod on NSX manager. We can see the NSX port assigned to the pod is associated by the MAC address and VLAN cached on the OVS switch interface.

Above is the MAC table on the OVS switch located on the worker node. This is the point where the pod eth0 pairs with the veth in the root namespace. Recall the node interfaces I printed out above; those are connected to the OVS to form the veth pairs from the pods. The MAC 02:50:56:56:44:52 is the NSX distributed router, so we see one of those entries for each pod MAC. That is the link from pod to OVS to distributed router to logical switch.
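If you want to do this correlation without squinting at a screenshot, a rough sketch like the one below groups the MAC table entries by VLAN. Note the assumptions: the sample text is made up for illustration (only the router MAC comes from this post; the ports, VLANs, pod MACs, and ages are invented), and the four-column layout is what I’d expect from `ovs-appctl fdb/show`, so adjust the parser if your output differs. `parse_fdb` is my own helper name.

```python
from collections import defaultdict

ROUTER_MAC = "02:50:56:56:44:52"  # NSX distributed router MAC from this post

# Illustrative sample only: ports, VLANs, pod MACs and ages are invented.
SAMPLE = """\
 port  VLAN  MAC                Age
    3    10  02:50:56:00:aa:01    4
    1    10  02:50:56:56:44:52    4
    4    11  02:50:56:00:aa:02    9
    1    11  02:50:56:56:44:52    9
"""


def parse_fdb(text):
    """Group learned (port, MAC) entries by VLAN tag, skipping the header row."""
    by_vlan = defaultdict(list)
    for line in text.splitlines()[1:]:
        fields = line.split()
        if len(fields) != 4:
            continue  # ignore blank or unexpected lines
        port, vlan, mac, _age = fields
        by_vlan[int(vlan)].append((port, mac.lower()))
    return dict(by_vlan)


for vlan, entries in sorted(parse_fdb(SAMPLE).items()):
    pod_macs = [mac for _port, mac in entries if mac != ROUTER_MAC]
    router_seen = any(mac == ROUTER_MAC for _port, mac in entries)
    print(f"VLAN {vlan}: pod MACs {pod_macs}, router seen: {router_seen}")
```

Each VLAN should show one pod MAC plus the distributed router MAC, which matches the per-pod tagging described above.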

Above is the corresponding port in NSX. So the pod IP address is correlated all the way from the pod, through the OVS switch, to the NSX switch with the same IP and MAC.

In the traceroute, I am sending traffic from the pod to another worker node. The reported path is 172.16.1.1 -> 100.64.96.10 -> 100.64.96.5 -> destination. As we can see, the first hop is 172.16.1.1 (the default gateway, as we’re attempting to reach an IP off of the pod network). As mentioned, the 172.16.1.1 address is on an interface of a t1 router. The TEP stretches the network from the host location of the t1 router to the shared connection on the node OVS switch via VLAN tagging, so the packet is sent to the virtual wire, traverses OVS, and reaches the t1 router on the host at interface 172.16.1.1.

The NSX manager interface above shows the t1 router that was automatically created when the kube-system namespace was created. There are two interfaces on this router. You’ll recognize 172.16.1.1; the 100.64.96.11/31 port is an interconnect to the t0 router I discussed previously. This is how we traverse up, and then back down or out. And this is the flow you see in the traceroute path above. The packet goes from 172.16.1.2 (pod) -> 172.16.1.1 (default gateway address) -> 100.64.96.10. Once we traverse to the physical network, the transport zone node (ESXi host in this case) will encapsulate the packet and send it to the next transport zone node (the destination ESXi host where the 172.15.0.4 k8s node is running), and then that host will strip off the encapsulation to route via the overlay again.
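If it isn’t obvious why the t1 port shows 100.64.96.11 while the traceroute hop shows 100.64.96.10, it helps to remember that a /31 holds exactly two addresses, one for each end of a point-to-point link. A quick check with Python’s `ipaddress` module, using only the interface address from the screenshot above:

```python
import ipaddress

# The t1 router's link interface, as shown in NSX Manager in this post.
t1_link = ipaddress.ip_interface("100.64.96.11/31")

# A /31 contains exactly two addresses: one per end of the link.
ends = list(t1_link.network)
peer = next(ip for ip in ends if ip != t1_link.ip)

print(t1_link.network)  # 100.64.96.10/31
print(peer)             # 100.64.96.10, the t0 side seen in the traceroute
```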

Picking up at the 100.64.96.10 hop (172.16.1.1 -> 100.64.96.10 -> 100.64.96.5 -> destination), we see the next hop is 100.64.96.5, which takes us down from the t0 router to another t1 router. This one connects to the 172.15.0.0/24 subnet (the one that we are trying to get to).

Note: I’ve also highlighted the t0_uplink_1 interface above. Had we been sending to an address off of the overlay, this is where we would have seen the traffic flow to.

Below is the t1 router we land on via the 100.64.96.5 hop before finally being routed to the 172.15.0.4 node’s subnet we are addressing.
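To tie the whole path together, here is a small sketch that labels each traceroute hop by the range it falls in. The hop list and subnets are the ones from this post; the labels and the `describe` helper are mine. The 100.64.0.0/10 entry is the RFC 6598 shared address space, which both router-link addresses in this traceroute fall into.

```python
import ipaddress

# Ranges from this post, plus the RFC 6598 shared address space that the
# router-to-router link addresses in the traceroute belong to.
RANGES = [
    (ipaddress.ip_network("172.16.1.0/24"), "kube-system pod subnet"),
    (ipaddress.ip_network("100.64.0.0/10"), "router-to-router link"),
    (ipaddress.ip_network("172.15.0.0/24"), "node subnet"),
]

# Traceroute path from the post, ending at the destination worker node.
PATH = ["172.16.1.1", "100.64.96.10", "100.64.96.5", "172.15.0.4"]


def describe(hop: str) -> str:
    """Return the label of the first range containing the hop (hypothetical helper)."""
    ip = ipaddress.ip_address(hop)
    for net, label in RANGES:
        if ip in net:
            return label
    return "unknown"


for n, hop in enumerate(PATH, start=1):
    print(n, hop, describe(hop))
```

Read top to bottom, that is pod subnet gateway (t1), up to t0, down to the second t1, and finally the node subnet.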

That’s it. It’s somewhat complicated if you don’t have a background in Kubernetes and/or NSX-T, but really not too bad overall. The advantage of other CNI plugins is the reduced complexity in setting up and configuring them as compared to NSX-T. The control plane is really where the largest differences lie, and there is a lot of automation gained once NSX is configured. All of the switches, routers, subnet assignments, etc. above are created and decommissioned automatically.

Whichever CNI plugin you go with, the end result comes back to the few simple K8s directives on pod and node communication I covered in the previous post. K8s doesn’t dictate how those directives are implemented. The container networking is pretty much the same at the pod level given any plugin: a separate net namespace for the pod, with eth0 patched to the root namespace via a veth pair. From there, things can go in many directions.