TCP connections between docker containers time out after ~10000 connections

Quinn · April 28, 2022, 5:35pm

I have 2 containers defined in a docker-compose.yml file, and am using them to run some rspec tests in a CI pipeline - the first container executes the tests, the second is an nginx container that is configured to perform redirects for many different paths... but I'm running into some unexpected behavior when executing them.

The tests consist of roughly 12000 URLs, and the first container performs a GET request to the nginx container for each of them, checking to see that it redirects to the expected location.

Strangeness occurs when ~10000 of the URLs have successfully been checked: the remainder of the attempted connections to the nginx container die with a read timeout. Looking at the nginx logs reveals it thinks nothing is wrong; it serves the expected responses up until the read timeouts, and then there will be 2 entries for the first failed test URL, and nothing after it. Network connectivity between the containers simply stops working past that point.

When I execute docker-compose up on my local machine, the tests pass, and everything is roses. When I run the same command with the same docker-compose.yml on one of my CI agents, I run into the above problem: ~10000 test successes, and then read timeouts between the containers.

So far, I've tried redefining the test so that the requests don't all hit nginx within 10 seconds (e.g. a 0.05 second delay between each request, or a 1 second delay every 100 requests)... I expected this might give docker time to recycle some connections, but all it did was make the build take longer to fail at the same point. I've also made sure that the latest version of docker is running on my CI agents (18.09), as I saw some GitHub issues describing what I thought were similar problems to mine that were resolved by upgrading docker to a current version. That didn't help in my case, it seems.

I'm simply not sure where to look next - having upgraded the CI agents with the latest docker, and redefining the test several times, I'm running out of ideas on what could be causing this. It certainly seems docker related, since things work fine outside a docker context, but the docker logs don't appear to indicate anything unexpected either.

My questions: has anyone else run into this kind of thing before? Where should I look next to find the root cause of this?

Thanks in advance.

yodo.me · April 28, 2022, 5:40pm

OK, so it turns out the problem is with docker-compose, specifically this issue: https://github.com/docker/compose/issues/6018

I was able to get around the issue by directing all of nginx's log output to /dev/null rather than stdout/stderr.
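
For anyone landing here later, a rough sketch of what sending nginx's logs to /dev/null can look like. This assumes the logging is controlled by the access_log and error_log directives in nginx.conf; the actual config file from this setup isn't shown in the thread, so the layout below is only an assumption and the redirect rules are left out.

```nginx
# Main context: send error output to /dev/null instead of stderr.
# error_log has no "off" value, so it is pointed at /dev/null,
# with the level raised to "crit" to reduce what gets produced at all.
error_log /dev/null crit;

events {}

http {
    # Send access log entries to /dev/null instead of stdout.
    access_log /dev/null;

    # The existing server blocks with the redirect rules would go here (assumed).
}
```

Using `access_log off;` should have the same effect for the access log. Either way, the point is that the nginx container stops writing a log line per request to stdout/stderr, which is the output docker-compose was collecting.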