Tunneling inter-host networking through a Docker Swarm Overlay network

Building on Laurent Bernaille’s excellent 3-part deep-dive series on Docker’s overlay networks, I wanted to experiment with how we could utilise Docker’s overlay networking to tunnel traffic between the nodes in a Swarm outside of the container environment, essentially treating the overlay network as a mesh VPN between host nodes.

Prepare the host

The first step is to symlink Docker’s netns mount points to a location where we can use the iproute2 utilities to interact with them:

ln -s /var/run/docker/netns /var/run/netns

We can now view Docker’s network namespaces:

ip netns

Comparing this output with Docker’s network list, we can identify the relevant netns, as the namespaces are named after Docker’s network IDs (look for or1wj1px3q below):

~# docker network ls
NETWORK ID          NAME                DRIVER              SCOPE
ed31264a1f4f        bridge              bridge              local
5ef35596d5b1        docker_gwbridge     bridge              local
or1wj1px3q8b        testoverlay         overlay             swarm <-
bf9f478ebd5d        host                host                local
scybkysot08x        ingress             overlay             swarm
a9871bc3532d        none                null                local
~# ip netns
fe5b42ad2e7e (id: 3)
1-or1wj1px3q (id: 2) <-
lb_or1wj1px3
1-scybkysot0 (id: 0)
ingress_sbox (id: 1)

Now that we’ve identified the correct netns for the overlay network, we need to add a virtual interface to the host to give us access into the overlay network’s namespace.

Accessing the Overlay network from the Docker host

Create a new veth pair

veth interfaces are created in pairs and allow us to pass traffic between namespaces. First, we’ll create a new veth pair and move one end into the relevant namespace for the Docker overlay network we want to connect into:

ip link add dev veth1 type veth peer name veth2
ip link set dev veth2 netns 1-or1wj1px3q
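
If you want to sanity-check this step, you can list each end of the pair on its respective side (purely a verification, not required for the setup; interface indexes in the output will differ):

ip link show dev veth1
ip netns exec 1-or1wj1px3q ip link show dev veth2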

Assign an IP address and set the MAC address on the host interface

First, we allocate our host interface (veth1) an IP address in the overlay network’s subnet:

ip address add 10.0.0.100/24 dev veth1

For consistency, we use the same MAC address scheme as Docker: the last 4 bytes of the MAC address match the interface’s IP address and the second byte is the VXLAN ID:

ip link set dev veth1 address 02:42:0a:00:00:64
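
If you’re scripting this, the MAC can be derived from the interface IP. Here’s a minimal sketch that keeps the same leading bytes (02:42) as the example above; the function name is purely illustrative, and you may need to adjust the leading bytes if your network’s scheme differs:

# ip_to_mac: prints a MAC built from the leading bytes used above (02:42) plus the IP in hex
ip_to_mac() {
  printf '02:42:%02x:%02x:%02x:%02x\n' $(echo "$1" | tr '.' ' ')
}
ip_to_mac 10.0.0.100   # -> 02:42:0a:00:00:64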

Add veth2 to Docker’s bridge device

Bridge devices act like small switches, so to connect us into the Overlay network we need to plug one end of our veth pair (veth2 — the one we migrated into Docker’s netns) into Docker’s bridge device.

ip netns exec 1-or1wj1px3q ip link set dev veth2 master br0

This is the same method used by Docker to connect containers into the local Overlay network on each node.
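
To confirm that veth2 is now a port on the bridge, we can list the interfaces enslaved to br0 inside the namespace (you should also see Docker’s vxlan0 device and the veth ends of any attached containers):

ip netns exec 1-or1wj1px3q ip link show master br0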

Fix MTU

To allow for the size of the VXLAN headers (50 bytes of encapsulation overhead on a standard 1500-byte Ethernet frame), we need to drop the MTU on our interfaces to match the overlay:

ip netns exec 1-or1wj1px3q ip link set dev veth2 mtu 1450
ip link set dev veth1 mtu 1450

Bring the interfaces up

Now everything’s configured, the last step for connecting our local host into the overlay network is to bring up the virtual interfaces:

ip netns exec 1-or1wj1px3q ip link set up dev veth2
ip link set up dev veth1

If everything is configured correctly, we can now directly access containers that are running on our local host and attached to the overlay network. You can test this by firing up a container attached to the network (assuming the network was created with the attachable flag) and pinging its assigned IP directly from the host.
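
For example, a quick test might look something like this (the image and container name are purely illustrative; the network is the testoverlay network identified earlier):

docker run -d --rm --name overlaytest --network testoverlay nginx:alpine
docker inspect -f '{{(index .NetworkSettings.Networks "testoverlay").IPAddress}}' overlaytest
ping -c 3 <container-ip>   # substitute the IP returned by docker inspect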

One of the key features of Docker’s Overlay networks is the ability for containers to communicate across different host nodes, so the next step is to enable forwarding over the VXLAN tunnel for our new interfaces.

Configure forwarding over VXLAN overlay

Once all of the nodes in the Swarm have been configured to access the overlay network locally (as above), we need to tell the kernel how traffic should be passed between hosts. Usually Swarm manages this for us, but since we’re working outside of the Swarm context we need to configure this manually.

Create permanent ARP entries

As we don’t have a Layer 2 network spanning the network namespaces on our nodes, we need to manually create ARP table entries to tell the kernel through which device to send traffic destined for the other side of our VXLAN tunnel. Here we add the IP and MAC address of the veth1 (host side) interface on the remote side of the tunnel, and tell the kernel to forward traffic via the vxlan0 device. This device is created by Docker as part of the Swarm overlay network.

ip netns exec 1-or1wj1px3q ip neighbour add 10.0.0.101 lladdr 02:42:0a:00:00:65 nud permanent dev vxlan0

This needs to be completed on each of the nodes in the Swarm, with an entry added for every other node. Note that since the overlay network is part of the distributed Swarm config, it has the same network ID on every node, and hence the network namespace shares a common name across host nodes.
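
For example, on the node that was allocated 10.0.0.101, the mirror-image entry pointing back at the first node (10.0.0.100 / 02:42:0a:00:00:64) would be:

ip netns exec 1-or1wj1px3q ip neighbour add 10.0.0.100 lladdr 02:42:0a:00:00:64 nud permanent dev vxlan0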

Configure forwarding

With the kernel now forwarding traffic to the vxlan0 device, we just need to configure our bridge’s forwarding database to pass the traffic to the correct remote host over the VXLAN tunnel.

Again, for each node we need to configure the correct remote endpoint to forward our traffic to. Here we tell the forwarding database that MAC address 02:42:0a:00:00:65 (the veth1 interface on the remote host with overlay IP address 10.0.0.101) resides on the remote host with IP x.x.x.x, where x.x.x.x is the IP address used as Swarm’s advertise-addr:

ip netns exec 1-or1wj1px3q bridge fdb add 02:42:0a:00:00:65 dev vxlan0 dst x.x.x.x self permanent
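
And the matching entry on that remote node, pointing back at the first node’s underlay address (y.y.y.y here is simply a placeholder for the first node’s advertise-addr):

ip netns exec 1-or1wj1px3q bridge fdb add 02:42:0a:00:00:64 dev vxlan0 dst y.y.y.y self permanent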

We can now test connectivity by pinging across the tunnel; here, from the node we assigned 10.0.0.101, we ping the first node’s overlay IP:

~# ping 10.0.0.100
PING 10.0.0.100 (10.0.0.100) 56(84) bytes of data.
64 bytes from 10.0.0.100: icmp_seq=1 ttl=64 time=10.4 ms
64 bytes from 10.0.0.100: icmp_seq=2 ttl=64 time=10.0 ms

Summary

We’ve demonstrated how Docker’s Overlay networks use VXLANs and network namespaces to provide container network isolation and inter-host communication, as well as how we can manipulate these to tunnel traffic between host nodes across one of those overlay networks.

Note that bridging a host interface into an overlay network is definitely not the recommended way to expose containers to the Docker host, or to the Docker host’s network. There are also other factors to consider, such as IP addresses conflicting with Docker’s IPAM, and persistence: Docker is clever enough to only configure an overlay network on a host when it’s required by a running container, so it’s not necessarily safe to rely on the existence of the overlay network on any given node.
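
If you do need the overlay namespace to stick around on a node, one rough workaround (not something Docker endorses) is to keep a placeholder container attached to the overlay network; the container name and image below are purely illustrative:

docker run -d --restart unless-stopped --name overlay-anchor --network testoverlay alpine tail -f /dev/null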