Grafana Part 2 - waiting sockets

So the next step was to write a simple loop in bash that fetches the data every second. That seemed nice, but it had one big problem: you can’t continuously pipe data into a network connection. At least not without some help. Yikes…

A 2021 note on this: that is not quite true about the pipe. I only learned recently that pipes buffer aggressively. My data was simply held back until a buffer threshold was reached, and with so little data it takes quite some time to reach it. I just didn’t know that back then.
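For completeness, here is roughly what such a loop looks like. This is a sketch, not my original script: the metric path is made up, and the Graphite host and port in the comment are placeholders. The pipe into the network tool is exactly where the buffering described above bites.

```shell
#!/usr/bin/env bash
# Sketch of a once-per-second metric loop (hypothetical metric path).
# Each line follows Graphite's plaintext protocol:
#   <metric.path> <value> <unix-timestamp>
emit_metric() {
  printf 'local.demo.value %s %s\n' "$RANDOM" "$(date +%s)"
}

# In the real setup this pipe fed a TCP connection, e.g.:
#   while true; do emit_metric; sleep 1; done | nc graphite.example.com 2003
# Note: writes into a pipe are block-buffered, so these tiny lines can sit
# in the buffer for a long time before anything reaches the network.
for _ in 1 2 3; do
  emit_metric
done
```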

Back to the topic: my data was sent to the Graphite service, and I was happy. Here is the catch: for some reason the connection started to hang. It happened seldom, but it happened. It took me a moment to figure out why, until I checked the output of netstat. Something odd was waiting for me there: my list of connections was filled up with a TCP state unknown to me at the time: TIME_WAIT.
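If you want to see the same thing on your machine, a one-liner counts sockets per state. I use ss here, the modern replacement for netstat (with `netstat -tan` the state sits in column 6 instead of column 1):

```shell
# Count TCP sockets per state; a pile of TIME-WAIT entries shows up here.
# Guarded so it only runs where ss is available.
if command -v ss >/dev/null 2>&1; then
  ss -tan | awk 'NR > 1 { count[$1]++ } END { for (s in count) print s, count[s] }'
fi
```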

The kernel needs to wait

I did some searching and found some useful things. The kernel has a security mechanism to prevent TCP port-reuse attacks. It works by keeping a closed connection around for a while (60 seconds on Linux) before the kernel fully releases the port.

There is a great Stack Exchange post that explains this in more detail[0], and a blog post[1] too.

When you learn about TCP connections you learn the various states. However, they do not tell you the full truth, because in reality the OS has to cope with more failure states than the textbook ones. So TIME_WAIT is a rather uncommon sight, but it is necessary.

The reason this feature exists is security: an attacker might try to take control of the ports used for connections made from the host to another system. By exhausting the ports, he tries to guess the ports in use and potentially forge connections. With TIME_WAIT the kernel avoids this, because a given port is only re-used after a given period of time. I think the Kaminsky attack is an example of that. It also helps in coping with TCP TIME_WAIT assassination[2].

Do not mix this up with a SYN flood attack. In such an attack the connection is left unclosed (the attacker avoids sending a FIN packet), so the kernel has to wait until the connection times out. The important difference here is that there is a client (or clients) that creates many connections to a socket and closes them! Meaning the kernel has finished the connection, BUT after completing it, the socket remains ‘occupied’ because the kernel keeps it in the TIME_WAIT state.

In short, TIME_WAIT does not describe a half-open connection, but an already completed one. The kernel just waits a while before it allows the socket to be re-used. We can see this in the state diagram below:

How long does the kernel wait before it allows the re-use? That depends on the Maximum Segment Lifetime (MSL): the wait is twice the MSL, which usually ends up around 1 to 2 minutes; some even write it is something like 3 minutes. On Linux the TIME_WAIT period is hard-coded to 60 seconds. The closest entry I found on my system is /proc/sys/net/ipv4/tcp_fin_timeout, which is set to 60 seconds (though strictly speaking it governs the FIN_WAIT_2 state, not TIME_WAIT).
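You can read that value straight from procfs. A quick check, guarded so it only runs on Linux:

```shell
# FIN timeout in seconds (Linux default: 60). This sysctl governs how long
# orphaned connections stay in FIN_WAIT_2; the TIME_WAIT period itself is
# compiled into the kernel (TCP_TIMEWAIT_LEN, 60 s) and has no runtime knob.
f=/proc/sys/net/ipv4/tcp_fin_timeout
if [ -r "$f" ]; then
  cat "$f"
fi
```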

Implications

While this seems good to know, I feared that my router might shut down because it couldn’t handle any new connections. So I started to dig into some alternatives. At first sight, the tcp_tw_recycle sysctl seemed to be the right thing. However, it is a bad idea and was removed in more recent kernels (Linux 4.12). After searching further, I decided to compose a more proper solution to the problem: instead of handling it on the server side, why not just fix my shit? But more on this in the next post.

So far,
Akendo

[0] https://networkengineering.stackexchange.com/questions/19581/what-is-the-purpose-of-time-wait-in-tcp-connection-tear-down
[1] https://vincent.bernat.ch/en/blog/2014-tcp-time-wait-state-linux#purpose
[2] https://tools.ietf.org/html/rfc1337
[3] http://www.serverframework.com/asynchronousevents/2011/01/time-wait-and-its-design-implications-for-protocols-and-scalable-servers.html
[4] https://serverfault.com/questions/23385/huge-amount-of-time-wait-connections-says-netstat