poor performance on TCP/IP socket (Ubuntu Lucid) on Linode

I'm running a custom TCP/IP socket application which listens on port 8000 by default. The app works but I'm getting poor throughput on that socket (~650ms for 8k transfer).

I'm not seeing any other obvious performance issues with my Linode; e.g. Ubuntu repository access and file downloads are fine.

Is there some throttling that Ubuntu or the Linode imposes that I'm not aware of?

Thanks in advance for your response.

bob

19 Replies

It's most probably your app… Does it work better on a different computer?

@icanbob:

Is there some throttling that Ubuntu or the Linode imposes that I'm not aware of?
None that you're hitting, by those numbers. There are a couple of safety valves in place, but it takes work to hit them.

@hybinet:

It's most probably your app… Does it work better on a different computer?

This is a mature app that has been used in networking situations for 10 years. To give a feel for how far off these numbers are, the exact same app running against the Linode TCP/IP stack (i.e. connecting to localhost) yields a result of 30 usec. Local messaging without using the TCP/IP stack on the same Linode gives 8 usec for 8k messages.

Connecting across the Net to the Linode, one would expect performance lower than localhost, possibly by an order of magnitude. My results are showing 4 orders of magnitude slower than the local Linode TCP/IP stack.

bob

@icanbob:

My results are showing 4 orders of magnitude slower than the local Linode TCP/IP stack.
Um…no.

An "order of magnitude" is TEN TIMES the compared number.

http://en.wikipedia.org/wiki/Order_of_magnitude

And how do you measure 8/1,000,000 seconds or 30/1,000,000 seconds with any type of accuracy on a VPS system?

@vonskippy:

Um…no.
The OP is right about four orders of magnitude. 650 milliseconds is roughly 21,700 times 30 microseconds.

OP: Where are you accessing this socket from? What is the latency between the two locations? Or is this on the same machine, just using the public interface instead of localhost? If the difference between localhost and the public interface is that big, you should probably open a ticket.

Oopsie - I thought he was saying 30 usec was 4 orders of magnitude greater than 8 usec.

Sure it's not just because there's a lot of distance in the middle? I could see ~600 ms being a minimum amount of time for a full TCP session, once handshake, slow start, and shutdown are accounted for.

Thinking about this a bit more, it would be very interesting to see tcpdump output from both the server and client end (preferably with synchronized clocks)… assuming both are quiet,

tcpdump -n -i eth0 port not 22

That will hopefully pinpoint exactly where the delay is coming in.

@hoopycat:

Thinking about this a bit more, it would be very interesting to see tcpdump output from both the server and client end (preferably with synchronized clocks)… assuming both are quiet,

tcpdump -n -i eth0 port not 22

That will hopefully pinpoint exactly where the delay is coming in.

My client side is an older MEPIS system and it appears that only libpcap is installed. I did install tcpdump on the Linode and reran a test with an exchange of 10 8K messages. The resulting Linode-side tcpdump file is here (wget www.icanprogram.com/downloads/dump.gz). I am not experienced in analysing this tcpdump output, so any help is appreciated.

Interestingly, the 10 8k messages took an average of 185 msec round trip today (a little better than the 650 msec I was seeing yesterday, but still very slow compared to the Linode loopback test).

Also, if I use the domain name assigned to the Linode to set up the socket, the time goes up by a factor of 2, even though I'm timing the transaction from after the socket is opened (at least the ftime() call is after the socket-opening section in the code).

Thanks in advance for all your help.

bob

If your test works like this:

1) Establish a connection

2) Send 8KB of data

3) Close the connection

4) Repeat

then you're going to get hundreds of milliseconds of latency, yes. You're effectively using a connection-oriented protocol to do the job of a datagram-oriented protocol. There are various reliable datagram protocols available, some of them even built into the Linux kernel (like RDS).

Here are the round-trips that would be required to request a server to send a small amount of data to a client. I'll assume 50ms of latency in each direction (100ms round-trip):

1) Client host sends SYN packet to remote host. 50ms pass.

2) Server receives SYN and replies with SYN/ACK. 50ms pass.

3) Client receives SYN/ACK and sends ACK. 50ms pass.

4) Server receives ACK. What is called the "three-way handshake" is complete. The server starts sending data. A minimum of 50ms pass for the first data to reach the client.

5) The client sends one or more ACK packets for the data that it received. 50ms pass.

6) The server receives the final ACK, and knowing that all data has been sent, sends a FIN packet to terminate the connection. 50ms pass.

7) The client receives the FIN packet, and sends an ACK to acknowledge it, and sends its own FIN packet to close the local side of the connection. 50ms pass.

8) The server receives the ACK to its FIN, and receives the client's FIN request. The server sends its own ACK in reply. 50ms pass.

9) The client receives the server's ACK. The connection is now closed, and the client can consider the connection properly terminated.

The total time required for this procedure is 8x50ms, or 400ms. This is because TCP is designed for long-lived connections, not short datagram bursts. Technically, in the above exchange, the client knows that it has all data after only 50ms, and can actually go and use the data, even though it won't know if the server has closed its side of the connection for another 50ms.
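As a rough illustration of where that time goes, here is a minimal Python sketch (assumptions: a hypothetical echo-style server at HOST:PORT that sends 8 KB back; this is not the OP's SIMPL code) that times the connect, the data exchange, and the close separately:

```
import socket
import time

HOST, PORT = "192.0.2.10", 8000      # assumption: placeholder address and port
payload = b"x" * 8192                # 8 KB message, as in the test being discussed

t0 = time.time()
s = socket.create_connection((HOST, PORT))   # three-way handshake: roughly one round trip
t1 = time.time()

s.sendall(payload)                           # send the request
received = 0
while received < len(payload):               # assume the server echoes 8 KB back
    chunk = s.recv(65536)
    if not chunk:
        break
    received += len(chunk)
t2 = time.time()

s.close()   # close() returns quickly; the kernel finishes the FIN/ACK exchange in the background
t3 = time.time()

print("connect:  %.1f ms" % ((t1 - t0) * 1000))
print("exchange: %.1f ms" % ((t2 - t1) * 1000))
print("close:    %.1f ms" % ((t3 - t2) * 1000))
```

Run against localhost, all three numbers are tiny; run across the Internet, the connect alone costs a full round trip before any payload moves, and with a connect-per-message pattern that cost is paid every single time.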

Here is how an unreliable (in that there is no delivery guarantee) UDP exchange would work (again, 50ms latency each way, 100ms round-trip):

1) Client sends a datagram requesting data. 50ms pass.

2) Server sends data in reply, with the data tagged with some sort of EOF to let the client know there's no more data. 50ms pass.

3) Client receives data.

Total elapsed time is 100ms, much lower than the 400ms that TCP required.
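And a matching sketch of the single-round-trip UDP pattern (again a hypothetical HOST:PORT; no retransmission or ordering handling is shown, which is exactly what you give up):

```
import socket
import time

HOST, PORT = "192.0.2.10", 8000      # assumption: placeholder address and port

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.settimeout(5)                      # without a timeout, a single lost datagram would hang forever

t0 = time.time()
s.sendto(b"request", (HOST, PORT))   # one datagram out...
reply, addr = s.recvfrom(65536)      # ...one datagram back: a single round trip
print("round trip: %.1f ms" % ((time.time() - t0) * 1000))
```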

It's possible to do reliable UDP with no increase in latency over normal UDP (so long as there is no packetloss).

You never actually mentioned what the round-trip latency between your two hosts is, which is why I picked 50ms of latency in each direction. But it's important to understand that, because you're (mis)using TCP, there is not just one round trip involved here. There are multiple round trips required to establish/use/tear down the TCP session. So the difference you're seeing between localhost (where the round-trip latency is near zero) and the internet (where the round-trip latency is immensely higher) makes perfect sense. That's how networking works.

Relative to 13:31:40.548003, here's how the times look:

13:31:40.548003 (0.000000): SYN received

13:31:40.548093 (0.000090): SYN+ACK sent

13:31:40.660762 (0.112759): ACK received

13:31:40.660987 (0.112984): Payload 1 received (168 bytes)

13:31:40.661060 (0.113057): Payload 1 ack sent

13:31:40.661828 (0.113825): Payload 2 sent (40 bytes)

13:31:40.758238 (0.210235): Payload 2 ack received

I gotta run, so I'm not going to compute beyond there, but it's taking about 100ms for each round-trip and we're already 210ms into it with only 208 payload bytes exchanged. The window will eventually expand such that it won't wait for an ack before sending more payload, but it will start off slow to avoid overdoing it.

Consider moving closer to your Linode, or use UDP instead of TCP, or keep the connections up longer :-)
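To put numbers on the "keep the connections up longer" option, here is a sketch (same hypothetical 8 KB echo server assumed earlier in the thread) that compares a connection per message with one reused connection:

```
import socket
import time

HOST, PORT = "192.0.2.10", 8000      # assumption: placeholder address and port
payload = b"x" * 8192
N = 10

def exchange(sock):
    sock.sendall(payload)
    received = 0
    while received < len(payload):   # assume the server echoes 8 KB back
        chunk = sock.recv(65536)
        if not chunk:
            break
        received += len(chunk)

# One connection per message: the handshake and slow start are paid N times.
t0 = time.time()
for _ in range(N):
    with socket.create_connection((HOST, PORT)) as s:
        exchange(s)
per_message = time.time() - t0

# One persistent connection: the handshake is paid once and the congestion
# window stays warmed up between messages.
t0 = time.time()
with socket.create_connection((HOST, PORT)) as s:
    for _ in range(N):
        exchange(s)
persistent = time.time() - t0

print("connect per message: %.1f ms total" % (per_message * 1000))
print("persistent:          %.1f ms total" % (persistent * 1000))
```

Over a high-latency path, the second number should come out well below the first.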

@Guspaz:

If your test works like this:

1) Establish a connection

2) Send 8KB of data

3) Close the connection

4) Repeat

then you're going to get hundreds of milliseconds of latency, yes.

The code is open source and is available from http://www.icanprogram.com/simpl. The test is performed in the following manner (a simplified sketch follows the list):

a) open a TCP/IP socket

b) mark the time

c) loop N times

  • write the 8k message from client side

  • write the 8k response from Linode side

d) mark time on loop exit

e) close the socket

f) report the results
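In rough Python terms (a simplified sketch of the steps above, not the actual SIMPL surrogate code; host and port are placeholders), only the message loop is inside the timed window:

```
import socket
import time

HOST, PORT = "mylinode.example.com", 8000   # placeholders for the Linode and the app's port
MSG = b"x" * 8192                           # the 8k message
N = 10

s = socket.create_connection((HOST, PORT))  # (a) open the TCP/IP socket -- not timed

t0 = time.time()                            # (b) mark the time
for _ in range(N):                          # (c) loop N times
    s.sendall(MSG)                          #     write the 8k message from the client side
    got = 0
    while got < len(MSG):                   #     read back the 8k response from the Linode side
        chunk = s.recv(65536)
        if not chunk:
            break
        got += len(chunk)
t1 = time.time()                            # (d) mark time on loop exit

s.close()                                   # (e) close the socket
print("avg round trip: %.1f ms" % ((t1 - t0) / N * 1000))   # (f) report the results
```

So the socket setup and teardown are deliberately outside the timed section.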

Allow me to explain.

My test consists of 4 processes (2 on each end of the TCP/IP socket). Two of those processes are SIMPL modules (from simpl/benchmarks/src in the source above): the sender is on my end; the receiver is on the Linode end. SIMPL is a Send/Receive/Reply (QNX-style) blocking protocol, so the sender is blocked while the message is transported over the TCP/IP socket and the response is transported back. The other two processes are surrogates (simpl/surrogates/tcp/src in the source above). The one on my side "pretends" to be the receiver. The one on the Linode side "pretends" to be the sender.

The loopback test, which exchanges 8k messages in 30 usec, uses exactly the same 4 processes. The only difference is that the socket is opened internally to the Linode.

In the test in question, the socket is opened between my desktop and the Linode and by necessity involves many other network hops. However, so does a file transfer between the Ubuntu repository and the Linode. In my test very little data is transferred in comparison (8k x 2), yet it appears to take hundreds of milliseconds per transaction.

I have used SIMPL in many different networked configurations (mostly captive networks) over the past 10 years. To my knowledge this is the first time it has been tried from a Linode, and the first time I'm seeing this kind of latency. I'd love to understand its source.

Thanks again in advance for all your help.

bob

@hoopycat:

Relative to 13:31:40.548003, here's how the times look:

13:31:40.548003 (0.000000): SYN received

13:31:40.548093 (0.000090): SYN+ACK sent

13:31:40.660762 (0.112759): ACK received

13:31:40.660987 (0.112984): Payload 1 received (168 bytes)

13:31:40.661060 (0.113057): Payload 1 ack sent

13:31:40.661828 (0.113825): Payload 2 sent (40 bytes)

13:31:40.758238 (0.210235): Payload 2 ack received

I gotta run, so I'm not going to compute beyond there, but it's taking about 100ms for each round-trip and we're already 210ms into it with only 208 payload bytes exchanged. The window will eventually expand such that it won't wait for an ack before sending more payload, but it will start off slow to avoid overdoing it.

Consider moving closer to your Linode, or use UDP instead of TCP, or keep the connections up longer :-)

This sounds like the section of code responsible for setting up the socket and surrogates and not the section inside my timing loop. The tcpdump covers the entire time window including the socket setup and shutdown, not just my timing loop. Those payloads should be 8k in size.

Can you help me see those in the tcpdump output?

Thanks again in advance for all your help.

bob

@icanbob:

Can you help me see those in the tcpdump output?

Most likely, the client->server burst is at 13:31:40.774513 through 13:31:40.939390:

13:31:40.774513 IP 208.83.215.196.4910 > 74.207.236.47.8001: Flags [.], seq 169:1617, ack 41, win 1460, options [nop,nop,TS val 2900805 ecr 771000001], length 1448
13:31:40.782050 IP 208.83.215.196.4910 > 74.207.236.47.8001: Flags [.], seq 1617:3065, ack 41, win 1460, options [nop,nop,TS val 2900805 ecr 771000001], length 1448
13:31:40.782114 IP 74.207.236.47.8001 > 208.83.215.196.4910: Flags [.], ack 3065, win 396, options [nop,nop,TS val 771000121 ecr 2900805], length 0
13:31:40.859548 IP 208.83.215.196.4910 > 74.207.236.47.8001: Flags [P.], seq 3065:4513, ack 41, win 1460, options [nop,nop,TS val 2900885 ecr 771000121], length 1448
13:31:40.877071 IP 208.83.215.196.4910 > 74.207.236.47.8001: Flags [.], seq 4513:5961, ack 41, win 1460, options [nop,nop,TS val 2900885 ecr 771000121], length 1448
13:31:40.877138 IP 74.207.236.47.8001 > 208.83.215.196.4910: Flags [.], ack 5961, win 577, options [nop,nop,TS val 771000216 ecr 2900885], length 0
13:31:40.882025 IP 208.83.215.196.4910 > 74.207.236.47.8001: Flags [.], seq 5961:7409, ack 41, win 1460, options [nop,nop,TS val 2900885 ecr 771000121], length 1448
13:31:40.921460 IP 74.207.236.47.8001 > 208.83.215.196.4910: Flags [.], ack 7409, win 667, options [nop,nop,TS val 771000261 ecr 2900885], length 0
13:31:40.939390 IP 208.83.215.196.4910 > 74.207.236.47.8001: Flags [P.], seq 7409:8293, ack 41, win 1460, options [nop,nop,TS val 2900970 ecr 771000216], length 884

Lines 1 and 2 are the first hunk (~3k) of data arriving at the Linode, and line 3 is the acknowledgement leaving your Linode. Note the pause between lines 3 and 4: that is the sender waiting to get the acknowledgment from the Linode. If you have ntpd running and the clocks are well-synced on both ends, take a look at tcpdumps on both ends simultaneously: you'll be able to see the delay.

For completeness, the server->client burst is at 13:31:40.939764 through 13:31:41.011796:

13:31:40.939764 IP 74.207.236.47.8001 > 208.83.215.196.4910: Flags [.], seq 41:2937, ack 8293, win 758, options [nop,nop,TS val 771000279 ecr 2900970], length 2896
13:31:40.939799 IP 74.207.236.47.8001 > 208.83.215.196.4910: Flags [P.], seq 2937:5833, ack 8293, win 758, options [nop,nop,TS val 771000279 ecr 2900970], length 2896
13:31:41.011725 IP 208.83.215.196.4910 > 74.207.236.47.8001: Flags [.], ack 2937, win 2908, options [nop,nop,TS val 2901046 ecr 771000279], length 0
13:31:41.011782 IP 74.207.236.47.8001 > 208.83.215.196.4910: Flags [.], seq 5833:7281, ack 8293, win 758, options [nop,nop,TS val 771000351 ecr 2901046], length 1448
13:31:41.011796 IP 74.207.236.47.8001 > 208.83.215.196.4910: Flags [P.], seq 7281:8165, ack 8293, win 758, options [nop,nop,TS val 771000351 ecr 2901046], length 884

Haven't looked at the code yet… have to get back to the grind for a bit.

@hoopycat:

@icanbob:

Can you help me see those in the tcpdump output?

Most likely, the client->server burst is at 13:31:40.774513 through 13:31:40.939390:

If I understand the dump, my 8k message got broken into chunks somewhere between my desktop and the Linode. As an experiment I retried my test using a 512-byte message size. The result was between 100-150 msec for a round trip of this 512 bytes to and from the Linode. It would seem that the multiple TCP/IP packets are not significantly responsible for the latency here.

Presumably the Ubuntu repository transfer stimulated by "aptitude install" would be broken into multiple packets as well. If it took 100msec to transfer 0.5k I'd have waited a very long time for any package.

I'll try to see if I can get a system with tcpdump installed so that I can capture the sender end as well.

Thanks once again for all the assistance so far.

bob

@icanbob:

If I understand the dump, my 8k message got broken into chunks somewhere between my desktop and the Linode. As an experiment I retried my test using a 512-byte message size. The result was between 100-150 msec for a round trip of this 512 bytes to and from the Linode. It would seem that the multiple TCP/IP packets are not significantly responsible for the latency here.

IP is packet-based, so any sufficiently long message will need to get split up. The limit for how much can be crammed into a single packet is the MTU, and it's usually somewhere around 1500 bytes.

100 to 150 ms is about what I'd expect, in your case.

> Presumably the Ubuntu repository transfer stimulated by "aptitude install" would be broken into multiple packets as well. If it took 100msec to transfer 0.5k I'd have waited a very long time for any package.

There are a couple of advantages to an HTTP GET of a large file. Once the TCP handshake is done, there's (usually) only one payload packet from the client to the server, and then the server can send the reply without any further interaction from the client at the HTTP level.

Also, there's a TCP congestion avoidance feature that's working against you, called slow-start. It takes a few round trips for a TCP session to get up to full speed, so as to avoid overloading the network to the point of losing packets. Considering aptitude is pulling in fairly large .deb files and it reuses the same TCP session for sequential downloads, it will get to full speed pretty fast.
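To make the slow-start effect concrete, here is a back-of-the-envelope sketch (assumptions: an initial congestion window of 3 segments of 1448 bytes and an idealized doubling of the window every round trip; real kernels are more nuanced):

```
def rtts_to_send(total_bytes, mss=1448, initial_window=3):
    """Rough count of round trips needed to push total_bytes
    under idealized slow start (window doubles every RTT)."""
    window, sent, rtts = initial_window, 0, 0
    while sent < total_bytes:
        sent += window * mss
        window *= 2
        rtts += 1
    return rtts

print(rtts_to_send(8 * 1024))          # the 8k message: about 2 RTTs of data transfer
print(rtts_to_send(10 * 1024 * 1024))  # a 10 MB .deb: about 12 RTTs before it's all handed to the wire
```

In reality, loss and the receiver's window cap the growth, but the point stands: a short-lived exchange spends its whole life below full speed, while a big download spends almost all of its life at full speed.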

What sort of Internet connection are you using, where are you located, and where is your Linode located? There is an absolute lower limit on RTT due to the speed of light through fiber optic cable (roughly 15 ms per 1000 miles), but you'll always see more than that across the Internet. Nothing ever takes a straight line path, packets get queued in various places, there's modulation/demodulation delays, etc.
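For reference, the "roughly 15 ms per 1000 miles" figure falls out of light travelling at about two-thirds of c in fiber; a quick sketch of that floor:

```
C_FIBER_KM_PER_S = 200000        # speed of light in fiber, roughly 2/3 of c
MILES_TO_KM = 1.609

def min_rtt_ms(miles):
    """Theoretical minimum round-trip time over a straight fiber path of the given length."""
    one_way_s = miles * MILES_TO_KM / C_FIBER_KM_PER_S
    return 2 * one_way_s * 1000

print("%.0f ms" % min_rtt_ms(1000))   # ~16 ms round trip per 1000 miles
print("%.0f ms" % min_rtt_ms(2500))   # e.g. roughly a US coast-to-coast path: ~40 ms floor
```

Anything you measure on top of that is routing detours, queueing, and equipment delays.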

mtr is a very awesome tool, and installing/using mtr to look at the network path between your two endpoints will likely be very enlightening. And you'll find at least a half-dozen more questions ;-)

I don't know exactly how ping works with respect to setup and tear down of sockets. On my system ping yields between 90-160msec. So I guess that is what it is.

Would IPv6 help? As far as I understand, the actual socket coding isn't all that different.

Thanks in advance for your help.

bob

@icanbob:

I don't know exactly how ping works with respect to setup and tear down of sockets. On my system ping yields between 90-160msec. So I guess that is what it is.
ping uses a separate protocol, ICMP, that is datagram oriented and a very thin layer above IP itself (so more like UDP than TCP). So there's no "session" or corresponding setup/teardown. A single ICMP packet (IP datagram) is transmitted to the target host, which then responds with another single ICMP packet. So by and large it's a reasonably low level measurement of network latency (as a round trip time).

– David

IPv6 probably won't help in your situation; if it takes upwards of 150 ms for a round-trip exchange between two points, you're probably doing as well as you can with TCP.

There is a tradeoff between low latency and high reliability, and TCP is designed for high reliability. It might not be fast, but you know you're getting all the packets in the right order, so it is much simpler to use. UDP would get you lower latency, but lower reliability too: packets could arrive in the wrong order, duplicated, corrupted, or missing. This might be OK for your application, if it has ways to detect and correct these conditions.

For example, DNS queries almost always use UDP:

11:00:03.714420 IP 192.168.1.103.33013 > 65.19.178.10.53: 10301+ A? hoopycat.com. (30)
11:00:03.821254 IP 65.19.178.10.53 > 192.168.1.103.33013: 10301*- 1/6/6 A 97.107.134.213 (283)

That was 106.8 ms and two packets. This is the same query, done over TCP:

11:00:10.450164 IP 192.168.1.103.56251 > 65.19.178.10.53: Flags [S], seq 2417835502, win 5840, options [mss 1460,sackOK,TS val 20470529 ecr 0,nop,wscale 6], length 0
11:00:10.544152 IP 65.19.178.10.53 > 192.168.1.103.56251: Flags [S.], seq 1131116109, ack 2417835503, win 5792, options [mss 1460,sackOK,TS val 1364495880 ecr 20470529,nop,wscale 6], length 0
11:00:10.544232 IP 192.168.1.103.56251 > 65.19.178.10.53: Flags [.], ack 1, win 92, options [nop,nop,TS val 20470553 ecr 1364495880], length 0
11:00:10.544533 IP 192.168.1.103.56251 > 65.19.178.10.53: Flags [P.], seq 1:33, ack 1, win 92, options [nop,nop,TS val 20470553 ecr 1364495880], length 3246500+[|domain]
11:00:10.637761 IP 65.19.178.10.53 > 192.168.1.103.56251: Flags [.], ack 33, win 91, options [nop,nop,TS val 1364495974 ecr 20470553], length 0
11:00:10.637931 IP 65.19.178.10.53 > 192.168.1.103.56251: Flags [P.], seq 1:286, ack 33, win 91, options [nop,nop,TS val 1364495974 ecr 20470553], length 28546500*-[|domain]
11:00:10.637952 IP 192.168.1.103.56251 > 65.19.178.10.53: Flags [.], ack 286, win 108, options [nop,nop,TS val 20470576 ecr 1364495974], length 0
11:00:10.638829 IP 192.168.1.103.56251 > 65.19.178.10.53: Flags [F.], seq 33, ack 286, win 108, options [nop,nop,TS val 20470576 ecr 1364495974], length 0
11:00:10.733954 IP 65.19.178.10.53 > 192.168.1.103.56251: Flags [F.], seq 286, ack 34, win 91, options [nop,nop,TS val 1364496069 ecr 20470576], length 0
11:00:10.734024 IP 192.168.1.103.56251 > 65.19.178.10.53: Flags [.], ack 287, win 108, options [nop,nop,TS val 20470600 ecr 1364496069], length 0

283.9 ms, 10 packets.  (But I had my answer in 187.8 ms; the rest was just tearing down the TCP session.)

This is probably a best-case scenario for both TCP and UDP: a single-packet query and an immediate single-packet response. Any additional interactivity will slow it down. Just changing the response to two packets (1806 bytes) adds another 39 ms to TCP. (Hooray for the ol' AXFR test)
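For anyone who wants to reproduce a comparison like this, here is a minimal Python sketch (assumptions: a reachable recursive resolver, with 8.8.8.8 used as an example, and a hand-built A query for example.com; a real client would validate and fully read the replies):

```
import socket
import struct
import time

RESOLVER = ("8.8.8.8", 53)    # assumption: any recursive resolver you're allowed to query

def build_query(name, qid=0x1234):
    # Minimal DNS query: 12-byte header plus one A-record question
    header = struct.pack(">HHHHHH", qid, 0x0100, 1, 0, 0, 0)
    qname = b"".join(bytes([len(label)]) + label.encode() for label in name.split("."))
    question = qname + b"\x00" + struct.pack(">HH", 1, 1)   # QTYPE=A, QCLASS=IN
    return header + question

query = build_query("example.com")

# UDP: one datagram out, one back -- a single round trip
u = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
u.settimeout(5)
t0 = time.time()
u.sendto(query, RESOLVER)
u.recvfrom(512)
print("UDP query: %.1f ms" % ((time.time() - t0) * 1000))

# TCP: handshake, 2-byte length-prefixed query, response, teardown
t0 = time.time()
t = socket.create_connection(RESOLVER, timeout=5)
t.sendall(struct.pack(">H", len(query)) + query)
t.recv(4096)
t.close()
print("TCP query: %.1f ms" % ((time.time() - t0) * 1000))
```

The UDP number should land near one round trip; the TCP number carries the handshake on top (the teardown after close() finishes in the background, so it isn't included in the measurement).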
