Docker networking on Linux

When you run an application under docker, you can choose from a few different mechanisms to provide network connectivity.

This article digs into some of the details of two of the most common mechanisms, while trying to estimate the cost of each.

Using -p

The most common way to provide network connectivity to a docker container is to use the -p parameter to docker run. For example, by running:

docker run --rm -d -p 10000:10000 envoyproxy/envoy

you have exposed port 10000 of an envoy container on port 10000 of your host machine.

Let's see how this works. As root, from your host, run:

netstat -ntlp

and look for port 10000. You'll probably see something like:

[...]
tcp6   0  0 :::10000    :::*   LISTEN   31541/docker-proxy  
[...]

This means that port 10000 is held open by a process called docker-proxy, not envoy.

As the name implies, docker-proxy is a networking proxy similar to many others: a userspace application that listens on a port, forwarding bytes and connections back and forth as necessary.

From the standpoint of the networking stack, this is significantly different from what happens with a non-dockerized application. Instead of network card -> kernel -> application, we now have network card -> kernel -> docker-proxy -> kernel -> application.

You can see the benchmarks below but, unsurprisingly, this not only introduces a significant performance bottleneck, it also costs much more CPU and memory just to get packets in and out of a container.

Faster -p

docker-proxy is unsurprisingly one of the least loved components of docker. On Linux, modern versions of docker support using iptables instead of a proxy. The idea is simple: rather than a userspace application proxying the connections on behalf of your container, the kernel is configured to modify them through NAT rules and route them appropriately.

This feature, however, is not enabled by default. Judging from the history of related bugs, it seems it can trigger bugs on older kernels, and there are some corner cases in which it does not work correctly.

In any case, you can enable the option by creating (or editing) /etc/docker/daemon.json to have:

{
    "userland-proxy": false,
    "iptables": true
}

If not done automatically, it may also be necessary to run:

/sbin/sysctl net.ipv4.conf.docker0.route_localnet=1
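Before restarting the daemon, it can help to confirm that the file really disables the proxy. The following is a minimal sketch: the function name userland_proxy_disabled is made up for this example, and it uses a plain grep rather than a JSON parser, so it assumes the key and its value sit on the same line as in the snippet above.

```shell
# Succeeds if the given daemon.json sets "userland-proxy" to false.
# Hypothetical helper: a plain grep, not a JSON parser, so it assumes the key
# and its value are on the same line. Defaults to /etc/docker/daemon.json.
userland_proxy_disabled() {
    grep -Eq '"userland-proxy"[[:space:]]*:[[:space:]]*false' \
        "${1:-/etc/docker/daemon.json}"
}

# Usage (not run here):
#   userland_proxy_disabled && echo "docker-proxy is disabled in the config"
```

After restarting the daemon (for example with systemctl restart docker, assuming a systemd host), running iptables -t nat -L DOCKER -n as root should list the DNAT rules created for published ports.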

--network host

Another way to provide network connectivity to your container is to not use the -p parameter at all, and instead use --network host.

While -p starts a proxy to forward connections and data back and forth between the host and your application, --network host tells docker that you want your container to share the network configuration of the host.

This means that if the docker container opens port 9000, port 9000 will be open on your host directly - with nothing in between.

The main drawback is port conflicts. With -p x:y, two containers can listen on the same container port, as long as they are mapped to different ports on the host. With --network host, no two containers can use the same port.

So you need to be careful to only use containers that don't have conflicting port numbers. Further, you don't want to accidentally expose ports that your containers may have opened.
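One way to catch a conflict before starting a host-networked container is to check whether the port is already taken on the host. This is only a sketch: port_in_use is a hypothetical helper that reads netstat -ntl output on stdin and looks at the fourth column (the local address), which is where both the IPv4 and IPv6 listeners show up.

```shell
# Hypothetical helper: succeeds if some line on stdin (in `netstat -ntl`
# format) has a local address, column 4, whose port equals the argument.
port_in_use() {
    awk -v port="$1" '
        {
            # The port is whatever follows the last ":" of the local address,
            # which handles both 0.0.0.0:9000 and :::9000.
            n = split($4, parts, ":")
            if (parts[n] == port) found = 1
        }
        END { exit !found }
    '
}

# Usage (not run here):
#   netstat -ntl | port_in_use 9000 && echo "port 9000 is already taken"
```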

There are a few tricks you can use here. For example:

  • With the envoy image, you can supply a different configuration file with whatever port you like, either by creating a derived image (recommended) or by using -v to override the config file at container run time (-v /path/to/envoy.yaml:/etc/envoy/envoy.yaml; the host path must be absolute, otherwise docker treats it as a volume name).

  • You can use environment variables in the Dockerfile or your scripts, to pass down a port number (and address) to bind to. For example, docker run -e PORT=9000 will provide a $PORT variable with the number 9000 in it. If you consistently use it, you can easily move docker containers to use different ports.

  • To avoid accidentally exposing ports, you can also bind to 127.0.0.1. Instead of having your applications listen on 0.0.0.0 for connections arriving from anywhere, it is a good habit to configure them to only accept connections on 127.0.0.1, unless you want them to be exposed. This works really well with something like envoy, where the main ports, 80 and 443, are exposed, while all other "backend ports" are bound to 127.0.0.1.
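The last two tricks combine naturally in a container entrypoint script. The fragment below is a sketch under assumptions: PORT matches the variable used in the example above, while BIND_ADDR and the listen_address function are names invented here for illustration.

```shell
# Hypothetical entrypoint fragment. PORT and BIND_ADDR come from the
# environment (e.g. docker run -e PORT=9000); BIND_ADDR is an illustrative name.
listen_address() {
    # Default to loopback, so a port is only exposed when asked for explicitly.
    printf '%s:%s\n' "${BIND_ADDR:-127.0.0.1}" "${PORT:-9000}"
}

# Usage (not run here), with a made-up server binary and flag:
#   exec my-server --listen "$(listen_address)"
```

With this in place, moving a container to another port is a matter of changing -e PORT=..., and exposing it to the outside requires an explicit -e BIND_ADDR=0.0.0.0.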

Some benchmarks

Just to get an idea of the cost of each method, I fired up a quick iperf3 on my laptop.

The full results are below, but in short:

  • Plain -p (with docker-proxy) yields ~42 Gbps with 14% system idle.
  • Fast -p (with iptables, userland-proxy: false) yields ~54 Gbps with 49% system idle.
  • Host networking yields ~70 Gbps with 49% system idle.

Adjusting for CPU load (throughput divided by the non-idle share of CPU), host networking is ~2.8 times more efficient than docker-proxy, and ~1.3 times more efficient than -p using iptables rules.
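The ratios can be reproduced from the rounded figures in the list above. This sketch assumes that "efficiency" means throughput divided by the non-idle CPU percentage, which is my reading of the numbers; the function and variable names are illustrative.

```shell
# Throughput per percentage point of busy CPU: Gbps / (100 - %idle).
efficiency() {
    LC_ALL=C awk -v gbps="$1" -v idle="$2" \
        'BEGIN { printf "%.3f\n", gbps / (100 - idle) }'
}

proxy=$(efficiency 42 14)   # plain -p with docker-proxy
nat=$(efficiency 54 49)     # -p with "userland-proxy": false
host=$(efficiency 70 49)    # --network host

LC_ALL=C awk -v h="$host" -v p="$proxy" -v n="$nat" 'BEGIN {
    printf "host vs docker-proxy: %.1fx\n", h / p
    printf "host vs iptables:     %.1fx\n", h / n
}'
# prints:
#   host vs docker-proxy: 2.8x
#   host vs iptables:     1.3x
```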

It is important to note that this test was ingress heavy, while traffic is often egress heavy, and that I only used 10 connections.

Plain -p (with docker-proxy)

# iperf3 -c 0 -P 10 -p 9000 -t 60
...
[SUM]  39.00-40.00  sec  4.88 GBytes  41.9 Gbits/sec    0

# mpstat -P ALL 1
...
08:02:56 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
08:02:57 PM  all    5.63    0.00   63.17    0.26    0.00   16.88    0.00    0.00    0.00   14.07
08:02:57 PM    0    7.29    0.00   52.08    1.04    0.00   13.54    0.00    0.00    0.00   26.04
08:02:57 PM    1    0.00    0.00   80.00    0.00    0.00   20.00    0.00    0.00    0.00    0.00
08:02:57 PM    2    3.03    0.00   68.69    0.00    0.00   19.19    0.00    0.00    0.00    9.09
08:02:57 PM    3   12.50    0.00   51.04    0.00    0.00   14.58    0.00    0.00    0.00   21.88

Using fast -p ("userland-proxy": false)

# iperf3 -c 0 -P 10 -p 9000 -t 60
...
[SUM]  58.00-59.00  sec  6.30 GBytes  54.1 Gbits/sec    0

# mpstat -P ALL 1
...
07:58:21 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
07:58:22 PM  all    1.26    0.00   39.70    0.00    0.00    9.80    0.00    0.00    0.00   49.25
07:58:22 PM    0    2.00    0.00   70.00    0.00    0.00   28.00    0.00    0.00    0.00    0.00
07:58:22 PM    1    2.00    0.00   87.00    0.00    0.00   11.00    0.00    0.00    0.00    0.00
07:58:22 PM    2    1.01    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   98.99
07:58:22 PM    3    0.00    0.00    1.01    0.00    0.00    0.00    0.00    0.00    0.00   98.99

Host networking

# iperf3 -c 0 -P 10 -p 9000 -t 60
...
[SUM]   9.00-10.00  sec  8.15 GBytes  70.0 Gbits/sec    0

# mpstat -P ALL 1
...
08:05:07 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
08:05:08 PM  all    1.01    0.00   42.96    0.25    0.00    6.78    0.00    0.00    0.00   48.99
08:05:08 PM    0    0.00    0.00    0.00    1.03    0.00    0.00    0.00    0.00    0.00   98.97
08:05:08 PM    1    0.00    0.00   88.00    0.00    0.00   12.00    0.00    0.00    0.00    0.00
08:05:08 PM    2    3.00    0.00   82.00    0.00    0.00   15.00    0.00    0.00    0.00    0.00
08:05:08 PM    3    0.99    0.00    0.99    0.00    0.00    0.00    0.00    0.00    0.00   98.02

Notes

expose vs publish

docker run supports -p (--publish) and --expose. There is conflicting information online on the exact meaning of expose and publish.

From my understanding, --expose (EXPOSE in the Dockerfile) just declares that a service is running on a specific port. This is used by -P (to publish all ports - how does it know which ones are all ports? Thanks to EXPOSE!), and by service bindings.

It does not seem to affect container-to-container communication. At least on Linux, each container gets an IP address, and with the default bridge network and no other specific setting, there do not seem to be iptables rules or other configurations impeding the communication.

Getting the IP address

To get the IP address of a container, you can use:

docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' name

or just look at the output of:

docker inspect name

Finding out port owners with docker-proxy

Given that netstat -ntlp just shows docker-proxy, how can you know which container has the port open?

One simple way is to just run:

docker ps

and peek at the PORTS column. It will show you which ports are mapped to which container:

CONTAINER ID  IMAGE             COMMAND   CREATED       STATUS       PORTS     NAMES
614c350d87dc  envoyproxy/envoy  "envoy"   4 hours ago   Up 4 hours   9000/tcp  friendly_almeida

Given that there is a docker-proxy instance per published port, with ps aux you can also peek at the command line to see the IP and ports a docker-proxy instance is tied to:

# ps aux
root     11747  0.0  0.0 474160  8216 ?        Sl   16:18   0:00 /usr/sbin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 9000 -container-ip 172.17.0.2 -container-port 9000
root     11836  0.0  0.0 547892  6228 ?        Sl   16:18   0:00 /usr/sbin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 9001 -container-ip 172.17.0.3 -container-port 9000
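If there are many such lines, a small filter can turn them into a host port to container mapping. This is a sketch that assumes the flag names shown in the ps output above (-host-port, -container-ip, -container-port); proxy_mappings is a name made up for this example.

```shell
# Hypothetical helper: reads `ps aux` output on stdin and prints one
# "HOST_PORT -> CONTAINER_IP:CONTAINER_PORT" line per docker-proxy process.
proxy_mappings() {
    awk '/docker-proxy/ {
        host = ""; cip = ""; cport = ""
        # Each flag is followed by its value in the next field.
        for (i = 1; i < NF; i++) {
            if ($i == "-host-port")      host  = $(i + 1)
            if ($i == "-container-ip")   cip   = $(i + 1)
            if ($i == "-container-port") cport = $(i + 1)
        }
        # Skip lines that mention docker-proxy without the expected flags.
        if (host != "") print host " -> " cip ":" cport
    }'
}

# Usage (not run here):
#   ps aux | proxy_mappings
```

For the two processes above, this would print 9000 -> 172.17.0.2:9000 and 9001 -> 172.17.0.3:9000.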

You can also peek at what's happening in the networking layer of the container by 1) discovering the namespace id used by the container, and 2) running commands in it.

A good way to do so is to run:

docker inspect -f '{{.State.Pid}}' friendly_almeida

where friendly_almeida is the name of the container, followed by:

nsenter -t 655 -n netstat -ntlp
nsenter -t 655 -n ip a show

Here 655 stands for the PID printed by the previous command, and the nsenter commands must be run as root. nsenter is particularly handy as it lets you run arbitrary commands from your host inside the namespaces of the container, like:

nsenter -t 655 -a ps aux
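The two steps can be wrapped in a small function, so that you don't have to copy the PID around. This is only a sketch: in_container_net is a hypothetical name, and it simply chains the docker inspect and nsenter invocations shown above.

```shell
# Hypothetical helper: run a command inside a container's network namespace.
# Usage: in_container_net CONTAINER COMMAND [ARGS...]   (run as root)
in_container_net() {
    # Resolve the PID of the container's main process; give up if that fails.
    pid=$(docker inspect -f '{{.State.Pid}}' "$1") || return 1
    shift
    # Enter only the network namespace (-n) of that PID and run the command.
    nsenter -t "$pid" -n "$@"
}

# Usage (not run here):
#   in_container_net friendly_almeida netstat -ntlp
#   in_container_net friendly_almeida ip a show
```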

dockerd listening on the port

Even when using "userland-proxy": false, with netstat -ntlp you can see dockerd listening on the ports you pass with -p.

This was extremely confusing to me, but after a bit of research, it turns out it does so only to allocate the port, so host applications will not be able to listen on it - which is a good idea, given that iptables is configured to modify that traffic and get it delivered to the container instead.

Something even more confusing here is that dockerd will listen on the IPv6 address ::1 (which also works for IPv4), while iptables rules will be installed for IPv4 only.

If you test with ::1 instead of 127.0.0.1, you'll land on this proxy (which does nothing) instead of the real port number.

