When you run an application under docker, you can choose from a few different mechanisms to provide network connectivity.
This article digs into some of the details of two of the most common mechanisms, while trying to estimate the cost of each.
The most common way to provide network connectivity to a docker container is to use the -p parameter to docker run. For example, by running:
docker run --rm -d -p 10000:10000 envoyproxy/envoy
you have exposed port 10000 of an envoy container on port 10000 of your host machine.
Let's see how this works. As root, from your host, run:
netstat -ntlp
and look for port 10000. You'll probably see something like:
[...] tcp6 0 0 :::10000 :::* LISTEN 31541/docker-proxy [...]
This means that port 10000 was opened by a process called docker-proxy, not by envoy.
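If the listing is long, a small awk filter can pull out the owner of a given port. This is only a sketch: port_owner is a name made up for this example, and it assumes the usual netstat -ntlp column layout (local address in column 4, state in column 6, PID/program in column 7).

```shell
# port_owner PORT: read `netstat -ntlp` output on stdin and print the
# "PID/program" column of every socket listening on PORT.
# (port_owner is a hypothetical helper, not part of docker or net-tools.)
port_owner() {
    awk -v port="$1" '
        $6 == "LISTEN" {
            # Local address looks like ":::10000" or "0.0.0.0:10000";
            # the port is whatever follows the last colon.
            n = split($4, parts, ":")
            if (parts[n] == port) print $7
        }'
}

# Usage (as root, so that the PID/program column is filled in):
#   netstat -ntlp | port_owner 10000
```

Fed the netstat line shown above, it prints 31541/docker-proxy.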
As the name implies, docker-proxy is a networking proxy similar to many others: a userspace application that listens on a port, forwarding bytes and connections back and forth as necessary.
From the standpoint of the networking stack, this is significantly different from what happens with a non-dockerized application. Instead of having: network card -> kernel -> application, we now have network card -> kernel -> application (docker-proxy) -> kernel -> application.
You can see the benchmarks below, but unsurprisingly, this not only introduces a significant performance bottleneck, it also costs much more CPU and memory just to get packets in and out of a container.
docker-proxy is, unsurprisingly, one of the least loved components of docker. On Linux, modern versions of docker support using iptables instead of a proxy. The idea is simple: rather than a userspace application proxying the connections on behalf of your container, the kernel is configured to modify them through NAT rules and route them appropriately.
This feature, however, is not enabled by default. Judging from the history of related bugs, it seems it can trigger bugs on older kernels, and there are some corner cases in which it does not always work correctly.
In any case, you can enable the option by creating (or editing) /etc/docker/daemon.json to have:
{ "userland-proxy": false, "iptables": true }
If not done automatically, it may also be necessary to run:
/sbin/sysctl net.ipv4.conf.docker0.route_localnet=1
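The daemon only reads this file at startup, so the change has no effect until docker is restarted. A rough way to double check the setting before and after is sketched below. The helper name userland_proxy_disabled is made up for this example, it does a plain text match rather than parsing JSON, and the restart command assumes a systemd-based host.

```shell
# userland_proxy_disabled [FILE]: succeed if FILE (default:
# /etc/docker/daemon.json) contains "userland-proxy": false.
# Hypothetical helper; a text match, not a real JSON parser.
userland_proxy_disabled() {
    grep -Eq '"userland-proxy"[[:space:]]*:[[:space:]]*false' \
        "${1:-/etc/docker/daemon.json}"
}

# Usage, as root, assuming systemd manages docker:
#   userland_proxy_disabled && echo "docker-proxy disabled in config"
#   systemctl restart docker
#   pgrep -a docker-proxy    # expected to print nothing once containers are back
```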
Another way to provide network connectivity to your container is to not use the -p parameter at all, and instead use --network host.
While -p starts a proxy to forward connections and data back and forth between the host and your application, --network host tells docker that you want your container to share the network configuration of the host.
This means that if the docker container opens port 9000, port 9000 will be open on your host directly - with nothing in between.
The main problem with this is port conflicts. With -p x:y, two containers can use the same port, as long as they get mapped to different ports on the host. With --network host, instead, no two containers can use the same port.
So you need to be careful to only use containers that don't have conflicting port numbers. Further, you don't want to accidentally expose ports that your containers may have opened.
There are a few tricks you can use here. For example:
With the envoy image, you can supply a different configuration file, with whatever port you like, either by creating a derived image (recommended), or by supplying -v to override the config file (-v local/envoy.yaml:/etc/envoy/envoy.yaml) at container run time.
You can use environment variables in the Dockerfile or your scripts, to pass down a port number (and address) to bind to. For example, docker run -e PORT=9000 will provide a $PORT variable with the number 9000 in it. If you consistently use it, you can easily move docker containers to use different ports.
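As a minimal sketch of that idea, a container start script could compute its listen address from the environment. PORT comes from the example above; BIND_ADDR, listen_address, my-server and my-image are names invented here for illustration.

```shell
# listen_address: print the "address:port" a service should bind to.
# BIND_ADDR is a hypothetical variable name. Unset values fall back to
# loopback and port 9000, so nothing is exposed unless explicitly requested.
listen_address() {
    printf '%s:%s\n' "${BIND_ADDR:-127.0.0.1}" "${PORT:-9000}"
}

# Usage inside the container's start script (my-server is a placeholder):
#   exec my-server --listen "$(listen_address)"
# and from the host (my-image is a placeholder):
#   docker run --network host -e PORT=9001 -e BIND_ADDR=0.0.0.0 my-image
```

Defaulting to 127.0.0.1 is a design choice, not a requirement: it means a container started with --network host stays private until someone deliberately sets a wider address.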
To avoid accidentally exposing ports, you can also bind to 127.0.0.1. Instead of having your applications listen for connections arriving on any address (0.0.0.0), it is a good habit to configure them to only accept connections on 127.0.0.1, unless you want them to be exposed.
This works really well with something like envoy, where the main ports, 80 and 443, are exposed, while all other "backend ports" are bound to 127.0.0.1.
Just to get an idea of the cost of each method, I fired up a quick iperf3 on my laptop.
The full results are below, but in short:
Adjusting for CPU load, host networking is ~2.8 times more efficient than docker-proxy, and ~1.3 times more efficient than -p using iptables rules.
It is important to note that this test was ingress heavy, while traffic is often egress heavy, and that I only used 10 connections.
# iperf3 -c 0 -P 10 -p 9000 -t 60
...
[SUM]  39.00-40.00  sec  4.88 GBytes  41.9 Gbits/sec    0

# mpstat -P ALL 1
...
08:02:56 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
08:02:57 PM  all    5.63    0.00   63.17    0.26    0.00   16.88    0.00    0.00    0.00   14.07
08:02:57 PM    0    7.29    0.00   52.08    1.04    0.00   13.54    0.00    0.00    0.00   26.04
08:02:57 PM    1    0.00    0.00   80.00    0.00    0.00   20.00    0.00    0.00    0.00    0.00
08:02:57 PM    2    3.03    0.00   68.69    0.00    0.00   19.19    0.00    0.00    0.00    9.09
08:02:57 PM    3   12.50    0.00   51.04    0.00    0.00   14.58    0.00    0.00    0.00   21.88
# iperf3 -c 0 -P 10 -p 9000 -t 60
...
[SUM]  58.00-59.00  sec  6.30 GBytes  54.1 Gbits/sec    0

# mpstat -P ALL 1
...
07:58:21 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
07:58:22 PM  all    1.26    0.00   39.70    0.00    0.00    9.80    0.00    0.00    0.00   49.25
07:58:22 PM    0    2.00    0.00   70.00    0.00    0.00   28.00    0.00    0.00    0.00    0.00
07:58:22 PM    1    2.00    0.00   87.00    0.00    0.00   11.00    0.00    0.00    0.00    0.00
07:58:22 PM    2    1.01    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   98.99
07:58:22 PM    3    0.00    0.00    1.01    0.00    0.00    0.00    0.00    0.00    0.00   98.99
# iperf3 -c 0 -P 10 -p 9000 -t 60
...
[SUM]   9.00-10.00  sec  8.15 GBytes  70.0 Gbits/sec    0

# mpstat -P ALL 1
...
08:05:07 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
08:05:08 PM  all    1.01    0.00   42.96    0.25    0.00    6.78    0.00    0.00    0.00   48.99
08:05:08 PM    0    0.00    0.00    0.00    1.03    0.00    0.00    0.00    0.00    0.00   98.97
08:05:08 PM    1    0.00    0.00   88.00    0.00    0.00   12.00    0.00    0.00    0.00    0.00
08:05:08 PM    2    3.00    0.00   82.00    0.00    0.00   15.00    0.00    0.00    0.00    0.00
08:05:08 PM    3    0.99    0.00    0.99    0.00    0.00    0.00    0.00    0.00    0.00   98.02
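For the curious, here is one way to arrive at the "adjusted for CPU load" figures: divide the throughput by the share of CPU that was busy (100 minus %idle on the "all" row). The sketch assumes the three runs above are, in order, docker-proxy, -p with iptables, and host networking. The labels did not survive in the text, but this is the ordering that reproduces the ratios quoted earlier. The efficiency helper is a name made up for this example.

```shell
# efficiency GBITS IDLE_PCT: Gbits/sec delivered per percent of CPU busy,
# where busy = 100 - %idle from the mpstat "all" row.
# (Hypothetical helper, used only for this back-of-the-envelope check.)
efficiency() {
    awk -v gbits="$1" -v idle="$2" \
        'BEGIN { printf "%.3f\n", gbits / (100 - idle) }'
}

proxy=$(efficiency 41.9 14.07)   # first run, assumed to be docker-proxy
nat=$(efficiency 54.1 49.25)     # second run, assumed to be -p with iptables
host=$(efficiency 70.0 48.99)    # third run, assumed to be --network host

awk -v h="$host" -v p="$proxy" -v n="$nat" \
    'BEGIN { printf "host vs docker-proxy: %.1fx, host vs iptables: %.1fx\n", h / p, h / n }'
# prints: host vs docker-proxy: 2.8x, host vs iptables: 1.3x
```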
docker run supports -p (--publish) and --expose. There is conflicting information online on the exact meaning of expose and publish.
From my understanding, --expose (EXPOSE in the Dockerfile) just declares that a service is running on a specific port. This is used by -P (to publish all ports; how does it know which ones are all ports? Thanks to EXPOSE!), and by service bindings.
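One way to see what -P actually did is to ask docker which host ports it picked, using docker port. A rough sketch follows, assuming docker port prints lines of the form "10000/tcp -> 0.0.0.0:32768". The port numbers are made up, and published_host_port is a hypothetical helper.

```shell
# published_host_port PORT/PROTO: read `docker port CONTAINER` output on
# stdin and print the host port(s) that the given container port maps to.
# (Hypothetical helper; assumes lines like "10000/tcp -> 0.0.0.0:32768".)
published_host_port() {
    awk -v want="$1" '
        $1 == want && $2 == "->" {
            # Host side looks like "0.0.0.0:32768" or "[::]:32768";
            # the port is whatever follows the last colon.
            n = split($3, a, ":")
            print a[n]
        }'
}

# Usage:
#   docker run -d -P --name web envoyproxy/envoy
#   docker port web | published_host_port 10000/tcp
```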
It does not seem to affect container-to-container communication. At least on Linux, each container gets an IP address, and with the default bridge network and no other specific setting, there do not seem to be iptables rules or other configurations impeding the communication.
To get the ip address of a container, you can use:
docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' name
or just look at the output of:
docker inspect name
Given that netstat -ntlp just shows docker-proxy, how can you know which container has the port open?
One simple way is to just run:
docker ps
and peek at the PORTS column. It will show you which ports are mapped to which container:
CONTAINER ID   IMAGE              COMMAND   CREATED       STATUS       PORTS      NAMES
614c350d87dc   envoyproxy/envoy   "envoy"   4 hours ago   Up 4 hours   9000/tcp   friendly_almeida
Given that there is a docker-proxy instance for each published port, with ps aux you can also peek at the command line to see the IP and ports a docker-proxy instance is tied to:
# ps aux
root 11747 0.0 0.0 474160 8216 ? Sl 16:18 0:00 /usr/sbin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 9000 -container-ip 172.17.0.2 -container-port 9000
root 11836 0.0 0.0 547892 6228 ? Sl 16:18 0:00 /usr/sbin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 9001 -container-ip 172.17.0.3 -container-port 9000
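With many containers, reading those command lines by eye gets tedious. The sketch below condenses them into one mapping per line. It assumes docker-proxy keeps using the -host-ip, -host-port, -container-ip and -container-port flags seen in the output above, and proxy_mappings is a name made up for this example.

```shell
# proxy_mappings: read `ps aux` output on stdin and print one
# "host-ip:host-port -> container-ip:container-port" line per docker-proxy.
# (Hypothetical helper; relies on the flag names shown in the ps output.)
proxy_mappings() {
    awk '/docker-proxy/ {
        hip = hport = cip = cport = ""
        for (i = 1; i < NF; i++) {
            if ($i == "-host-ip")        hip   = $(i + 1)
            if ($i == "-host-port")      hport = $(i + 1)
            if ($i == "-container-ip")   cip   = $(i + 1)
            if ($i == "-container-port") cport = $(i + 1)
        }
        # Skip lines that mention docker-proxy but carry no port flags.
        if (hport != "") print hip ":" hport " -> " cip ":" cport
    }'
}

# Usage:
#   ps aux | proxy_mappings
```

For the two processes listed above, this prints 0.0.0.0:9000 -> 172.17.0.2:9000 and 0.0.0.0:9001 -> 172.17.0.3:9000. The container IP can then be matched against the docker inspect output shown earlier.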
You can also peek at what's happening in the networking layer of the container by 1) discovering the PID of the container's main process, which identifies the namespaces it runs in, and 2) running commands in those namespaces.
A good way to do so is to run:
docker inspect -f '{{.State.Pid}}' friendly_almeida
where friendly_almeida is the name of the container, followed by:
nsenter -t 655 -n netstat -ntlp
nsenter -t 655 -n ip a show
for example, run as root, where 655 is the PID printed by the previous command. nsenter is particularly handy as it lets you run arbitrary commands from your host inside the container, like:
nsenter -t 655 -a ps aux
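The two steps are easy to glue together. A minimal sketch follows, assuming docker and nsenter are on the PATH and that you run it as root. The function name in_container_netns is made up for this example.

```shell
# in_container_netns CONTAINER CMD...: run CMD from the host inside the
# network namespace of CONTAINER. Hypothetical helper; needs root.
in_container_netns() {
    name=$1
    shift
    # Step 1: PID of the container's main process (0 if it is not running).
    pid=$(docker inspect -f '{{.State.Pid}}' "$name") || return 1
    if [ "${pid:-0}" -le 0 ]; then
        echo "container $name is not running" >&2
        return 1
    fi
    # Step 2: enter only the network namespace (-n) of that process.
    nsenter -t "$pid" -n "$@"
}

# Usage:
#   in_container_netns friendly_almeida netstat -ntlp
#   in_container_netns friendly_almeida ip a show
```

Since only the network namespace is entered, netstat and ip are the host's binaries, which is convenient when the container image does not ship them.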
Even when using "userland-proxy": false, with netstat -ntlp you can see dockerd listening on the ports you pass with -p.
This was extremely confusing to me, but after a bit of research, it turns out it does so only to allocate the port, so host applications will not be able to listen on it - which is a good idea, given that iptables is configured to modify that traffic and get it delivered to the container instead.
Something even more confusing here is that dockerd will listen on the ipv6 address ::1 (which also works for ipv4), while iptables rules will be installed for ipv4 only.
If you test with ::1 instead of 127.0.0.1, you'll land on this proxy (which does nothing) instead of the real port number.