      Kubernetes Networking Under the Hood


      Introduction

      Kubernetes is a powerful container orchestration system that can manage the deployment and operation of containerized applications across clusters of servers. In addition to coordinating container workloads, Kubernetes provides the infrastructure and tools necessary to maintain reliable network connectivity between your applications and services.

      The Kubernetes cluster networking documentation states that the basic requirements of a Kubernetes network are:

      • all containers can communicate with all other containers without NAT
      • all nodes can communicate with all containers (and vice-versa) without NAT
      • the IP that a container sees itself as is the same IP that others see it as

      In this article we will discuss how Kubernetes satisfies these networking requirements within a cluster: how data moves inside a pod, between pods, and between nodes.

      We will also show how a Kubernetes Service can provide a single static IP address and DNS entry for an application, easing communication with services that may be distributed among multiple constantly scaling and shifting pods.

      If you are unfamiliar with the terminology of Kubernetes pods and nodes or other basics, our article An Introduction to Kubernetes covers the general architecture and components involved.

      Let’s first take a look at the networking situation within a single pod.

      Pod Networking

      In Kubernetes, a pod is the most basic unit of organization: a group of tightly-coupled containers that are all closely related and perform a single function or service.

Networking-wise, Kubernetes treats pods similarly to a traditional virtual machine or a single bare-metal host: each pod receives a single unique IP address, and all containers within the pod share that address and communicate with each other over the lo loopback interface using the localhost hostname. This is achieved by assigning all of the pod’s containers to the same network stack.

This situation should feel familiar to anybody who has deployed multiple services on a single host before the days of containerization. Each service must listen on its own unique port, but otherwise communication is uncomplicated and has low overhead.
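As a quick illustration, assume a hypothetical pod named my-pod with two containers: app (listening on port 8080) and sidecar (with curl available). The sidecar can reach the app over localhost, and both containers report the same pod IP:

      • kubectl exec my-pod -c sidecar -- curl -s http://localhost:8080
      • kubectl exec my-pod -c app -- hostname -i
      • kubectl exec my-pod -c sidecar -- hostname -i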

      Pod to Pod Networking

Most Kubernetes clusters will need to deploy multiple pods per node. Pod to pod communication may happen between two pods on the same node, or between pods on two different nodes.

      Pod to Pod Communication on One Node

      On a single node you can have multiple pods that need to communicate directly with each other. Before we trace the route of a packet between pods, let’s inspect the networking setup of a node. The following diagram provides an overview, which we will walk through in detail:

      Networking overview of a single Kubernetes node

      Each node has a network interface – eth0 in this example – attached to the Kubernetes cluster network. This interface sits within the node’s root network namespace. This is the default namespace for networking devices on Linux.

      Just as process namespaces enable containers to isolate running applications from each other, network namespaces isolate network devices such as interfaces and bridges. Each pod on a node is assigned its own isolated network namespace.

      Pod namespaces are connected back to the root namespace with a virtual ethernet pair, essentially a pipe between the two namespaces with an interface on each end (here we’re using veth1 in the root namespace, and eth0 within the pod).

      Finally, the pods are connected to each other and to the node’s eth0 interface via a bridge, br0 (your node may use something like cbr0 or docker0). A bridge essentially works like a physical ethernet switch, using either ARP (address resolution protocol) or IP-based routing to look up other local interfaces to direct traffic to.
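To make this concrete, here is a minimal sketch of the plumbing a typical network plugin performs when connecting a new pod to an existing bridge br0. The namespace name, interface names, and pod IP below are hypothetical:

      ip netns add pod1                                        # create the pod's network namespace
      ip link add veth1 type veth peer name eth0 netns pod1    # veth pair: veth1 in the root namespace, eth0 inside the pod
      ip link set veth1 master br0                             # attach the root end of the pair to the bridge
      ip link set veth1 up
      ip netns exec pod1 ip addr add 10.244.1.2/24 dev eth0    # assign the pod's IP address (hypothetical)
      ip netns exec pod1 ip link set eth0 up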

      Let’s trace a packet from pod1 to pod2 now:

      • pod1 creates a packet with pod2’s IP as its destination
      • The packet travels over the virtual ethernet pair to the root network namespace
      • The packet continues to the bridge br0
      • Because the destination pod is on the same node, the bridge sends the packet to pod2’s virtual ethernet pair
      • The packet travels through the virtual ethernet pair, into pod2’s network namespace and the pod’s eth0 network interface

      Now that we’ve traced a packet from pod to pod within a node, let’s look at how pod traffic travels between nodes.

      Pod to Pod Communication Between Two Nodes

      Because each pod in a cluster has a unique IP, and every pod can communicate directly with all other pods, a packet moving between pods on two different nodes is very similar to the previous scenario.

      Let’s trace a packet from pod1 to pod3, which is on a different node:

      Networking diagram between two Kubernetes nodes

      • pod1 creates a packet with pod3’s IP as its destination
      • The packet travels over the virtual ethernet pair to the root network namespace
      • The packet continues to the bridge br0
      • The bridge finds no local interface to route to, so the packet is sent out the default route toward eth0
      • Optional: if your cluster requires a network overlay to properly route packets to nodes, the packet may be encapsulated in a VXLAN packet (or another network virtualization technique) before heading to the network. Alternatively, the network itself may be set up with the proper static routes (see the example routes after this list), in which case the packet travels to eth0 and out to the network unaltered.
      • The packet enters the cluster network and is routed to the correct node.
      • The packet enters the destination node on eth0
      • Optional: if your packet was encapsulated, it will be de-encapsulated at this point
      • The packet continues to the bridge br0
      • The bridge routes the packet to the destination pod’s virtual ethernet pair
      • The packet passes through the virtual ethernet pair to the pod’s eth0 interface
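For example, in a cluster that relies on static routes rather than an overlay, each node's routing table might contain entries like the following, where the pod CIDRs and node IPs are hypothetical:

      ip route add 10.244.1.0/24 via 192.168.0.11    # pod subnet hosted on node1
      ip route add 10.244.2.0/24 via 192.168.0.12    # pod subnet hosted on node2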

      Now that we are familiar with how packets are routed via pod IP addresses, let’s take a look at Kubernetes services and how they build on top of this infrastructure.

      Pod to Service Networking

      It would be difficult to send traffic to a particular application using just pod IPs, as the dynamic nature of a Kubernetes cluster means pods can be moved, restarted, upgraded, or scaled in and out of existence. Additionally, some services will have many replicas, so we need some way to load balance between them.

      Kubernetes solves this problem with Services. A Service is an API object that maps a single virtual IP (VIP) to a set of pod IPs. Additionally, Kubernetes provides a DNS entry for each service’s name and virtual IP, so services can be easily addressed by name.

      The mapping of virtual IPs to pod IPs within the cluster is coordinated by the kube-proxy process on each node. This process sets up either iptables or IPVS to automatically translate VIPs into pod IPs before sending the packet out to the cluster network. Individual connections are tracked so packets can be properly de-translated when they return. IPVS and iptables can both do load balancing of a single service virtual IP into multiple pod IPs, though IPVS has much more flexibility in the load balancing algorithms it can use.
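As a rough illustration of the iptables mode, the rules kube-proxy generates amount to a chain of DNAT rules similar to this hand-written sketch. The chain names, IPs, ports, and probability here are hypothetical; real kube-proxy chains are auto-generated but follow the same pattern:

      iptables -t nat -N KUBE-SVC-EXAMPLE
      iptables -t nat -N KUBE-SEP-POD1
      iptables -t nat -N KUBE-SEP-POD2
      # match the service virtual IP and jump to the service chain
      iptables -t nat -A KUBE-SERVICES -d 10.32.0.50/32 -p tcp --dport 80 -j KUBE-SVC-EXAMPLE
      # pick a backend at random: 50% to pod1, the remainder to pod2
      iptables -t nat -A KUBE-SVC-EXAMPLE -m statistic --mode random --probability 0.5 -j KUBE-SEP-POD1
      iptables -t nat -A KUBE-SVC-EXAMPLE -j KUBE-SEP-POD2
      # DNAT the virtual IP to the chosen pod IP and port
      iptables -t nat -A KUBE-SEP-POD1 -p tcp -j DNAT --to-destination 10.244.1.2:8080
      iptables -t nat -A KUBE-SEP-POD2 -p tcp -j DNAT --to-destination 10.244.2.3:8080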

Note: this translation and connection tracking happens entirely in the Linux kernel. kube-proxy reads from the Kubernetes API and updates iptables or IPVS, but it is not in the data path for individual packets. This is more efficient and performs better than previous versions of kube-proxy, which functioned as a user-land proxy.

      Let’s follow the route a packet takes from a pod, pod1 again, to a service, service1:

      Networking diagram between two Kubernetes nodes, showing DNAT translation of virtual IPs

      • pod1 creates a packet with service1’s IP as its destination
      • The packet travels over the virtual ethernet pair to the root network namespace
      • The packet continues to the bridge br0
      • The bridge finds no local interface to route the packet to, so the packet is sent out the default route toward eth0
      • Iptables or IPVS rules, set up by kube-proxy, match the packet’s destination IP and translate it from the virtual IP to one of the service’s pod IPs, using whichever load balancing algorithm is available or specified
      • Optional: your packet may be encapsulated at this point, as discussed in the previous section
      • The packet enters the cluster network and is routed to the correct node.
      • The packet enters the destination node on eth0
      • Optional: if your packet was encapsulated, it will be de-encapsulated at this point
      • The packet continues to the bridge br0
      • The packet is sent to the virtual ethernet pair via veth1
      • The packet passes through the virtual ethernet pair and enters the pod network namespace via its eth0 network interface

When the packet returns to node1, the VIP to pod IP translation will be reversed, and the packet will travel back through the bridge and virtual ethernet pair to the correct pod.

      Conclusion

      In this article we’ve reviewed the internal networking infrastructure of a Kubernetes cluster. We’ve discussed the building blocks that make up the network, and detailed the hop-by-hop journey of packets in different scenarios.

      For more information about Kubernetes, take a look at our Kubernetes tutorials tag and the official Kubernetes documentation.




      How To Inspect Kubernetes Networking


      Introduction

      Kubernetes is a container orchestration system that can manage containerized applications across a cluster of server nodes. Maintaining network connectivity between all the containers in a cluster requires some advanced networking techniques. In this article, we will briefly cover some tools and techniques for inspecting this networking setup.

      These tools may be useful if you are debugging connectivity issues, investigating network throughput problems, or exploring Kubernetes to learn how it operates.

      If you want to learn more about Kubernetes in general, our guide An Introduction to Kubernetes covers the basics. For a networking-specific overview of Kubernetes, please read Kubernetes Networking Under the Hood.

      Getting Started

      This tutorial will assume that you have a Kubernetes cluster, with kubectl installed locally and configured to connect to the cluster.

      The following sections contain many commands that are intended to be run on a Kubernetes node. They will look like this:

      • echo 'this is a node command'

      Commands that should be run on your local machine will have the following appearance:

      • echo 'this is a local command'

Note: Most of the commands in this tutorial will need to be run as the root user. If you instead use a sudo-enabled user on your Kubernetes nodes, please add sudo to the commands when necessary.

      Finding a Pod’s Cluster IP

      To find the cluster IP address of a Kubernetes pod, use the kubectl get pod command on your local machine, with the option -o wide. This option will list more information, including the node the pod resides on, and the pod’s cluster IP.
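      • kubectl get pod -o wide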

      Output

NAME                           READY     STATUS    RESTARTS   AGE       IP            NODE
hello-world-5b446dd74b-7c7pk   1/1       Running   0          22m       10.244.18.4   node-one
hello-world-5b446dd74b-pxtzt   1/1       Running   0          22m       10.244.3.4    node-two

      The IP column will contain the internal cluster IP address for each pod.

      If you don't see the pod you're looking for, make sure you're in the right namespace. You can list all pods in all namespaces by adding the flag --all-namespaces.

      Finding a Service's IP

      We can find a Service IP using kubectl as well. In this case we will list all services in all namespaces:

      • kubectl get service --all-namespaces

      Output

NAMESPACE     NAME                       TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)         AGE
default       kubernetes                 ClusterIP   10.32.0.1       <none>        443/TCP         6d
kube-system   csi-attacher-doplugin      ClusterIP   10.32.159.128   <none>        12345/TCP       6d
kube-system   csi-provisioner-doplugin   ClusterIP   10.32.61.61     <none>        12345/TCP       6d
kube-system   kube-dns                   ClusterIP   10.32.0.10      <none>        53/UDP,53/TCP   6d
kube-system   kubernetes-dashboard       ClusterIP   10.32.226.209   <none>        443/TCP         6d

      The service IP can be found in the CLUSTER-IP column.

      Finding and Entering Pod Network Namespaces

      Each Kubernetes pod gets assigned its own network namespace. Network namespaces (or netns) are a Linux networking primitive that provide isolation between network devices.

It can be useful to run commands from within a pod's netns, to check DNS resolution or general network connectivity. To do so, we first need to look up the process ID of one of the containers in a pod. For Docker, we can do that with two commands. First, list the containers running on a node:
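      • docker ps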

      Output

CONTAINER ID   IMAGE                              COMMAND                  CREATED      STATUS      PORTS   NAMES
173ee46a3926   gcr.io/google-samples/node-hello   "/bin/sh -c 'node se…"   9 days ago   Up 9 days           k8s_hello-world_hello-world-5b446dd74b-pxtzt_default_386a9073-7e35-11e8-8a3d-bae97d2c1afd_0
11ad51cb72df   k8s.gcr.io/pause-amd64:3.1         "/pause"                 9 days ago   Up 9 days           k8s_POD_hello-world-5b446dd74b-pxtzt_default_386a9073-7e35-11e8-8a3d-bae97d2c1afd_0
. . .

      Find the container ID or name of any container in the pod you're interested in. In the above output we're showing two containers:

      • The first container is the hello-world app running in the hello-world pod
      • The second is a pause container running in the hello-world pod. This container exists solely to hold onto the pod's network namespace

      To get the process ID of either container, take note of the container ID or name, and use it in the following docker command:

      • docker inspect --format '{{ .State.Pid }}' container-id-or-name

      Output

      14552

      A process ID (or PID) will be output. Now we can use the nsenter program to run a command in that process's network namespace:

      • nsenter -t your-container-pid -n ip addr

      Be sure to use your own PID, and replace ip addr with the command you'd like to run inside the pod's network namespace.

      Note: One advantage of using nsenter to run commands in a pod's namespace – versus using something like docker exec – is that you have access to all of the commands available on the node, instead of the typically limited set of commands installed in containers.

      Finding a Pod's Virtual Ethernet Interface

      Each pod's network namespace communicates with the node's root netns through a virtual ethernet pipe. On the node side, this pipe appears as a device that typically begins with veth and ends in a unique identifier, such as veth77f2275 or veth01. Inside the pod this pipe appears as eth0.

      It can be useful to correlate which veth device is paired with a particular pod. To do so, we will list all network devices on the node, then list the devices in the pod's network namespace. We can then correlate device numbers between the two listings to make the connection.

First, run ip addr in the pod's network namespace using nsenter. Refer to the previous section Finding and Entering Pod Network Namespaces for details on how to do this:

      • nsenter -t your-container-pid -n ip addr

      Output

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
10: eth0@if11: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
    link/ether 02:42:0a:f4:03:04 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.244.3.4/24 brd 10.244.3.255 scope global eth0
       valid_lft forever preferred_lft forever

      The command will output a list of the pod's interfaces. Note the if11 number after eth0@ in the example output. This means this pod's eth0 is linked to the node's 11th interface. Now run ip addr in the node's default namespace to list out its interfaces:
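      • ip addr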

      Output

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
. . .
7: veth77f2275@if6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master docker0 state UP group default
    link/ether 26:05:99:58:0d:b9 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::2405:99ff:fe58:db9/64 scope link
       valid_lft forever preferred_lft forever
9: vethd36cef3@if8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master docker0 state UP group default
    link/ether ae:05:21:a2:9a:2b brd ff:ff:ff:ff:ff:ff link-netnsid 1
    inet6 fe80::ac05:21ff:fea2:9a2b/64 scope link
       valid_lft forever preferred_lft forever
11: veth4f7342d@if10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master docker0 state UP group default
    link/ether e6:4d:7b:6f:56:4c brd ff:ff:ff:ff:ff:ff link-netnsid 2
    inet6 fe80::e44d:7bff:fe6f:564c/64 scope link
       valid_lft forever preferred_lft forever

      The 11th interface is veth4f7342d in this example output. This is the virtual ethernet pipe to the pod we're investigating.

      Inspecting Conntrack Connection Tracking

      Prior to version 1.11, Kubernetes used iptables NAT and the conntrack kernel module to track connections. To list all the connections currently being tracked, use the conntrack command:
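      • conntrack -L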

      To watch continuously for new connections, use the -E flag:
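      • conntrack -E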

      To list conntrack-tracked connections to a particular destination address, use the -d flag:

      • conntrack -L -d 10.32.0.1

      If your nodes are having issues making reliable connections to services, it's possible your connection tracking table is full and new connections are being dropped. If that's the case you may see messages like the following in your system logs:

      /var/log/syslog

      Jul 12 15:32:11 worker-528 kernel: nf_conntrack: table full, dropping packet.
      

      There is a sysctl setting for the maximum number of connections to track. You can list out your current value with the following command:

      • sysctl net.netfilter.nf_conntrack_max

      Output

      net.netfilter.nf_conntrack_max = 131072

      To set a new value, use the -w flag:

      • sysctl -w net.netfilter.nf_conntrack_max=198000

      To make this setting permanent, add it to the sysctl.conf file:

      /etc/sysctl.conf

      . . .
      net.netfilter.nf_conntrack_max = 198000
      

      Inspecting Iptables Rules

      Prior to version 1.11, Kubernetes used iptables NAT to implement virtual IP translation and load balancing for Service IPs.

      To dump all iptables rules on a node, use the iptables-save command:
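      • iptables-save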

      Because the output can be lengthy, you may want to pipe to a file (iptables-save > output.txt) or a pager (iptables-save | less) to more easily review the rules.

      To list just the Kubernetes Service NAT rules, use the iptables command and the -L flag to specify the correct chain:

      • iptables -t nat -L KUBE-SERVICES

      Output

Chain KUBE-SERVICES (2 references)
target                     prot opt source    destination
KUBE-SVC-TCOU7JCQXEZGVUNU  udp  --  anywhere  10.32.0.10      /* kube-system/kube-dns:dns cluster IP */ udp dpt:domain
KUBE-SVC-ERIFXISQEP7F7OF4  tcp  --  anywhere  10.32.0.10      /* kube-system/kube-dns:dns-tcp cluster IP */ tcp dpt:domain
KUBE-SVC-XGLOHA7QRQ3V22RZ  tcp  --  anywhere  10.32.226.209   /* kube-system/kubernetes-dashboard: cluster IP */ tcp dpt:https
. . .

      Querying Cluster DNS

      One way to debug your cluster DNS resolution is to deploy a debug container with all the tools you need, then use kubectl to exec nslookup on it. This is described in the official Kubernetes documentation.

      Another way to query the cluster DNS is using dig and nsenter from a node. If dig is not installed, it can be installed with apt on Debian-based Linux distributions:
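      • apt install dnsutils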

      First, find the cluster IP of the kube-dns service:

      • kubectl get service -n kube-system kube-dns

      Output

NAME       TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)         AGE
kube-dns   ClusterIP   10.32.0.10   <none>        53/UDP,53/TCP   15d

The cluster IP is shown in the CLUSTER-IP column above. Next we'll use nsenter to run dig inside a container's network namespace. Refer to the section Finding and Entering Pod Network Namespaces for more information on this:

      • nsenter -t 14346 -n dig kubernetes.default.svc.cluster.local @10.32.0.10

This dig command looks up the Service's full domain name, service-name.namespace.svc.cluster.local, and specifies the IP of the cluster DNS service to query (@10.32.0.10).

      Looking at IPVS Details

      As of Kubernetes 1.11, kube-proxy can configure IPVS to handle the translation of virtual Service IPs to pod IPs. You can list the translation table of IPs with ipvsadm:
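      • ipvsadm -Ln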

      Output

IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  100.64.0.1:443 rr
  -> 178.128.226.86:443           Masq    1      0          0
TCP  100.64.0.10:53 rr
  -> 100.96.1.3:53                Masq    1      0          0
  -> 100.96.1.4:53                Masq    1      0          0
UDP  100.64.0.10:53 rr
  -> 100.96.1.3:53                Masq    1      0          0
  -> 100.96.1.4:53                Masq    1      0          0

      To show a single Service IP, use the -t option and specify the desired IP:

      • ipvsadm -Ln -t 100.64.0.10:53

      Output

Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  100.64.0.10:53 rr
  -> 100.96.1.3:53                Masq    1      0          0
  -> 100.96.1.4:53                Masq    1      0          0

      Conclusion

      In this article we’ve reviewed some commands and techniques for exploring and inspecting the details of your Kubernetes cluster's networking. For more information about Kubernetes, take a look at our Kubernetes tutorials tag and the official Kubernetes documentation.
