Slow SCP through network
The main issue is that I see large performance swings with "scp". One time a transfer takes 8 seconds, another time 4 minutes. I am using "scp" to copy data back and forth between a node in Germany and Linode.
Here's an example. These two commands were executed within a minute of each other:
/srv# time scp -p -r -P 1234 -i /root/a.pem file1 node.in.europe:/tmp/test
file1 100% 2184KB 2.1MB/s 00:01
real 0m6.368s
user 0m0.078s
sys 0m0.033s
/srv# time scp -p -r -P 1234 -i /root/a.pem file1 node.in.europe:/tmp/test
file1 100% 2184KB 546.1KB/s 00:04
real 1m5.746s
user 0m0.062s
sys 0m0.048s
On the latter one, SCP would go from 0% to 100% within seconds, and then hang at 100% for minutes. Running SCP with -vvv gives me:
file1 100% 2184KB 728.1KB/s 00:03
debug3: Wrote 28960 bytes for a total of 156975
debug3: Wrote 28960 bytes for a total of 185935
debug3: Wrote 27512 bytes for a total of 213447
debug3: Wrote 28960 bytes for a total of 242407
debug2: channel 0: rcvd adjust 114688
debug3: Wrote 27512 bytes for a total of 269919
debug3: Wrote 24616 bytes for a total of 294535
debug3: Wrote 24616 bytes for a total of 319151
[...]
Transferred: sent 2241120, received 2712 bytes, in 59.2 seconds
Bytes per second: sent 37832.0, received 45.8
debug1: Exit status 0
real 1m1.056s
user 0m0.084s
sys 0m0.046s
for an instance where it takes long, and
file1 100% 2184KB 2.1MB/s 00:01
debug3: Wrote 94120 bytes for a total of 400239
debug3: Wrote 77848 bytes for a total of 478087
debug2: channel 0: rcvd adjust 114688
debug3: Wrote 131264 bytes for a total of 609351
debug3: Wrote 32816 bytes for a total of 642167
debug3: Wrote 32816 bytes for a total of 674983
debug2: channel 0: rcvd adjust 131072
debug3: Wrote 65632 bytes for a total of 740615
debug3: Wrote 32816 bytes for a total of 773431
[...]
Transferred: sent 2241120, received 2712 bytes, in 3.8 seconds
Bytes per second: sent 592957.3, received 717.5
debug1: Exit status 0
real 0m5.235s
user 0m0.086s
sys 0m0.028s
for an instance where it went fast. From that, I conclude that SCP can get the data ready very quickly, and spends most of its time waiting to be able to push the data out of the network interface.
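As a quick check of the two runs above, the "Transferred:" summary lines can be compared directly. This is only a sketch: the two strings are copied from the debug output above, and the helper name `send_rate` is made up for illustration. The rates it computes differ slightly from the "Bytes per second" lines because the elapsed time in the summary is rounded.

```python
import re

# Summary lines printed by ssh in the slow and the fast run above.
SLOW = "Transferred: sent 2241120, received 2712 bytes, in 59.2 seconds"
FAST = "Transferred: sent 2241120, received 2712 bytes, in 3.8 seconds"

PATTERN = re.compile(r"sent (\d+), received (\d+) bytes, in ([\d.]+) seconds")


def send_rate(line):
    """Return (bytes_sent, seconds, bytes_per_second) for one summary line."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("not a 'Transferred:' summary line")
    sent = int(match.group(1))
    seconds = float(match.group(3))
    return sent, seconds, sent / seconds


slow = send_rate(SLOW)
fast = send_rate(FAST)
# Same payload in both runs, so the ratio of elapsed times is the slowdown.
slowdown = slow[1] / fast[1]
print(f"slow: {slow[2]:.0f} B/s, fast: {fast[2]:.0f} B/s, slowdown: {slowdown:.1f}x")
```

Both runs send exactly the same number of bytes, so the difference is entirely in how long the connection takes to drain, not in how much data is sent.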
With that, I ran a couple of traceroutes, and I'm getting different routes every time:
/srv# traceroute node.in.europe
traceroute to node.in.europe (87.230.ooo.ooo), 30 hops max, 60 byte packets
1 a1.7.1243.static.theplanet.com (67.18.7.161) 0.551 ms 0.658 ms 0.645 ms
2 xe-2-0-0.car03.dllstx2.networklayer.com (67.18.7.89) 0.178 ms 0.206 ms 0.191 ms
3 po101.dsr02.dllstx2.networklayer.com (70.87.254.77) 0.582 ms 0.661 ms 0.611 ms
4 te4-3.dsr02.dllstx3.networklayer.com (70.87.255.129) 0.768 ms 0.760 ms te3-2.dsr02.dllstx3.networklayer.com (70.87.253.133) 0.812 ms
5 ae17.bbr02.eq01.dal03.networklayer.com (173.192.18.230) 50.695 ms 50.724 ms ae17.bbr01.eq01.dal03.networklayer.com (173.192.18.226) 0.477 ms
6 dls-bb1-link.telia.net (213.248.102.173) 0.490 ms 0.548 ms 0.534 ms
7 ash-bb1-link.telia.net (213.155.133.178) 60.992 ms 60.413 ms ae2-20G.scr2.DAL1.gblx.net (67.16.141.237) 5.790 ms
8 ldn-bb1-link.telia.net (80.91.246.69) 109.003 ms * po6.ar4.AMS2.gblx.net (67.17.107.174) 124.616 ms
9 ldn-b5-link.telia.net (80.91.248.216) 109.141 ms 109.125 ms *
10 * * *
11 xe-0-0-1.dr-master.r1.cgn3.hosteurope.de (176.28.4.14) 130.149 ms xe-0-2-0.cr-merak.fra2.hosteurope.de (176.28.4.2) 123.437 ms xe-0-0-1.dr-master.r1.cgn3.hosteurope.de (176.28.4.14) 128.728 ms
12 xe-2-2-0.cr-pollux.cgn3.hosteurope.de (80.237.129.169) 128.345 ms 128.334 ms 128.287 ms
I can repeat the traceroute, and it'll be different hosts every time; however, the dropouts are usually close to the Germany node, in the telia.net network.
Now, a traceroute from the Germany node to Linode uses an entirely different route, avoiding telia.net altogether:
# traceroute node.in.the.us
traceroute to node.in.the.us (66.228.ooo.ooo), 30 hops max, 40 byte packets
1 * * *
2 xe-3-3-0.cr-pollux.cgn3.hosteurope.de (176.28.4.9) 0.232 ms 0.231 ms 0.213 ms
3 xe-0-2-0.cr-antares.ams1.hosteurope.de (80.237.129.182) 4.509 ms 4.514 ms xe-0-3-0.cr-antares.ams1.hosteurope.de (80.237.129.118) 4.523 ms
4 tengigabitethernet6-2.ar4.ams2.gblx.net (206.165.75.1) 4.854 ms 4.816 ms 4.809 ms
5 ar4.scr4.AMS2.gblx.net (67.17.107.173) 4.656 ms 4.643 ms 4.634 ms
6 ae13.scr4.NYC1.gblx.net (67.16.166.214) 83.645 ms 83.511 ms 83.483 ms
7 e5-1-30G.ar9.NYC1.gblx.net (67.16.142.54) 82.889 ms 80.544 ms 89.375 ms
8 softlayer-technologies-inc.ethernet11-3.ar9.nyc1.gblx.net (206.165.75.234) 79.368 ms 79.391 ms 79.359 ms
9 ae7.bbr02.tl01.nyc01.networklayer.com (173.192.18.177) 86.988 ms 86.961 ms 86.947 ms
10 ae1.bbr01.eq01.chi01.networklayer.com (173.192.18.132) 106.135 ms 106.122 ms 106.080 ms
11 ae20.bbr01.eq01.dal03.networklayer.com (173.192.18.136) 125.737 ms 125.731 ms 125.687 ms
12 po31.dsr01.dllstx3.networklayer.com (173.192.18.225) 121.404 ms 118.953 ms 122.796 ms
13 te4-4.dsr02.dllstx2.networklayer.com (70.87.255.134) 125.490 ms * te2-1.dsr01.dllstx2.networklayer.com (70.87.255.66) 129.907 ms
14 po2.car01.dllstx2.networklayer.com (70.87.254.78) 126.373 ms po1.car01.dllstx2.networklayer.com (70.87.254.74) 125.014 ms po2.car01.dllstx2.networklayer.com (70.87.254.78) 168.917 ms
15 5a.7.1243.static.theplanet.com (67.18.7.90) 128.726 ms 125.728 ms 129.486 ms
I guess my question basically is - what can I do? Is routing into the telia.net network something any of these providers can influence? Or am I barking up the wrong tree altogether and this isn't the real reason I'm getting these differences in performance?
8 Replies
@stw:
With that, I ran a couple of traceroutes, and I'm getting different routes every time:
/srv# traceroute node.in.europe
traceroute to node.in.europe (87.230.ooo.ooo), 30 hops max, 60 byte packets
1 a1.7.1243.static.theplanet.com (67.18.7.161) 0.551 ms 0.658 ms 0.645 ms
2 xe-2-0-0.car03.dllstx2.networklayer.com (67.18.7.89) 0.178 ms 0.206 ms 0.191 ms
3 po101.dsr02.dllstx2.networklayer.com (70.87.254.77) 0.582 ms 0.661 ms 0.611 ms
4 te4-3.dsr02.dllstx3.networklayer.com (70.87.255.129) 0.768 ms 0.760 ms te3-2.dsr02.dllstx3.networklayer.com (70.87.253.133) 0.812 ms
5 ae17.bbr02.eq01.dal03.networklayer.com (173.192.18.230) 50.695 ms 50.724 ms ae17.bbr01.eq01.dal03.networklayer.com (173.192.18.226) 0.477 ms
6 dls-bb1-link.telia.net (213.248.102.173) 0.490 ms 0.548 ms 0.534 ms
7 ash-bb1-link.telia.net (213.155.133.178) 60.992 ms 60.413 ms ae2-20G.scr2.DAL1.gblx.net (67.16.141.237) 5.790 ms
8 ldn-bb1-link.telia.net (80.91.246.69) 109.003 ms * po6.ar4.AMS2.gblx.net (67.17.107.174) 124.616 ms
9 ldn-b5-link.telia.net (80.91.248.216) 109.141 ms 109.125 ms *
10 * * *
11 xe-0-0-1.dr-master.r1.cgn3.hosteurope.de (176.28.4.14) 130.149 ms xe-0-2-0.cr-merak.fra2.hosteurope.de (176.28.4.2) 123.437 ms xe-0-0-1.dr-master.r1.cgn3.hosteurope.de (176.28.4.14) 128.728 ms
12 xe-2-2-0.cr-pollux.cgn3.hosteurope.de (80.237.129.169) 128.345 ms 128.334 ms 128.287 ms
I can repeat the traceroute, and it'll be different hosts every time; however, the dropouts are usually close to the Germany node, in the telia.net network.
Looks like the varying routes start already in the networklayer.com network, as the traffic sometimes seems to go out through telia.net and sometimes through gblx.net? (Based on the varying hosts at the same hop count in the trace above.)
Speculation follows:
Seems like it's networklayer.com that, for whatever reason, switches back and forth between these two… Which quite possibly is because the route to whichever one they prefer is flapping, or something to that effect.
It's unclear whether your problem is related to which path is used, i.e. whether one is actually notably better than the other, or whether the problem is the switching back and forth itself.
@stw:
I don't want to bug Linode's staff too much with support tickets
And yet Linode's Accounting Dept bugs me every month wanting payment.
You pay for service - don't be afraid to use it. The worst that can happen is they say it's not a hardware/infrastructure problem, so you're on your own.
Try laying down a ping while doing the scp, or maybe even mtr. Whatever is causing the routing to change is probably also dropping packets for a few seconds.
@hawk7000:
Looks like the varying routes start already in the networklayer.com network, as the traffic sometimes seems to go out through telia.net and sometimes through gblx.net? (Based on the varying hosts at the same hop count in the trace above.)
That's true - I noticed some of the replies came from different hosts, but I didn't see a pattern in there. I agree with you, networklayer.com is probably switching routes between telia.net and gblx.net - which makes it hard to tell what network (telia.net or gblx.net) makes SCP take this long - or whether it's the switching altogether that causes it.
It still puzzles me that sometimes SCP finishes within seconds, and sometimes only after minutes have passed - yet the routes change within a traceroute. I'd assume that at some point, the speed of SCP would pick up if the route changes, or at least that it remains consistently low if route flapping itself is the issue - but instead the speed is either consistently slow, or consistently fast during a transfer.
@hoopycat:
This smells a lot like inconvenient packet loss.
Try laying down a ping while doing the scp, or maybe even mtr. Whatever is causing the routing to change is probably also dropping packets for a few seconds.
That's a good idea - it's funny how I use traceroute and all, and then forget to use one of the most basic tools. I guess I neglected that because I assumed that commercial connections could not possibly have packet loss, and that it'd be a problem confined to poor ADSL lines.
Anyway, you were correct; I ran a ping during the scp, and it would have a packet loss of around 20%.
For a 25 second SCP, I got
--- node.in.europe ping statistics ---
26 packets transmitted, 20 received, 23% packet loss, time 25026ms
rtt min/avg/max/mdev = 127.911/129.334/131.692/0.947 ms
and for a 2 minute SCP I got
--- node.in.europe ping statistics ---
140 packets transmitted, 114 received, 18% packet loss, time 139113ms
rtt min/avg/max/mdev = 126.246/128.959/133.576/1.009 ms
None of the replies were delivered out of order. During the "fast" scp (8 seconds) I had no packet loss. I tried running SCP over a different port (1235) to see whether port 1234 was being throttled, but I got the same figures.
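For anyone repeating this test, the loss percentage can be recomputed from the ping summary line instead of being read off by eye. A minimal sketch, using the two summary lines above as input; the helper name `loss_percent` is hypothetical and it assumes the Linux iputils summary format shown here.

```python
import re

# Matches the counters in an iputils-style ping summary line.
SUMMARY = re.compile(r"(\d+) packets transmitted, (\d+) received")


def loss_percent(summary_line):
    """Return packet loss in percent, computed from the raw counters."""
    match = SUMMARY.search(summary_line)
    if match is None:
        raise ValueError("not a ping summary line")
    sent = int(match.group(1))
    received = int(match.group(2))
    return 100.0 * (sent - received) / sent


# Summary lines copied from the two slow transfers above.
runs = {
    "25 second scp": "26 packets transmitted, 20 received, 23% packet loss, time 25026ms",
    "2 minute scp": "140 packets transmitted, 114 received, 18% packet loss, time 139113ms",
}

for label, line in runs.items():
    print(f"{label}: {loss_percent(line):.1f}% loss")
```

Computing from the counters rather than from the printed percentage keeps the fractional part, which ping drops in its own summary.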
So, thanks to you I realized that the problem is (pretty significant?) packet loss, even though I am not sure, and probably can't find out, whether it's caused by congestion or by the route switching. With that, is there anything I can do? Is this something Linode (or the German hoster) has any influence over? As in, could (and would) Linode choose a different route, or is it out of their hands anyway (in which case I wouldn't bother asking), because it's no longer in their network?
Telia.net is in Stockholm; networklayer.com's whois information is proxied (Domains By Proxy), which strikes me as a bit odd, seeing as hiding contact information is something I would only expect individuals to do. Still, who would have more influence over the route - Linode or the provider in Germany?
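To gauge whether roughly 20% loss is enough to explain the slow transfers, one rough yardstick is the Mathis et al. approximation for steady-state TCP throughput, rate ≈ (MSS / RTT) × (C / √p) with C = √(3/2). The sketch below plugs in the ping figures from this thread. The MSS of 1448 bytes is an assumption, inferred from the write sizes in the -vvv output (28960 = 20 × 1448). The model was derived for small loss rates and ignores retransmission timeouts, so at this loss level it only indicates an order of magnitude.

```python
import math


def mathis_bytes_per_second(mss_bytes, rtt_seconds, loss_rate, c=math.sqrt(1.5)):
    """Rough steady-state TCP throughput estimate (Mathis et al. approximation)."""
    return (mss_bytes / rtt_seconds) * (c / math.sqrt(loss_rate))


MSS = 1448   # bytes; assumption inferred from the debug3 write sizes
RTT = 0.129  # seconds; average rtt from the ping summaries

estimates = {}
for loss in (0.18, 0.23):  # loss rates measured during the two slow transfers
    estimates[loss] = mathis_bytes_per_second(MSS, RTT, loss)
    print(f"loss {loss:.0%}: about {estimates[loss]:.0f} bytes/s")
```

The estimates land in the same range as the roughly 37 KB/s that ssh reported for the slow run, which is consistent with packet loss, rather than raw link capacity, limiting the transfer.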
This is probably worth tickets from both ends. In general, contact the party/parties with whom you have a business relationship. Neither Softlayer (theplanet.com, networklayer.com) nor Global Crossing nor Telia will deal with you directly, so start from the ends. (This is also handy because, in all but the most trivial cases, the return path will be totally different than the forward path, and packet loss could occur on either with similar effect. Did the packet get lost on the way there, or did the acknowledgement get lost on the way here?)
@hoopycat:
… (This is also handy because, in all but the most trivial cases, the return path will be totally different than the forward path, and packet loss could occur on either with similar effect. Did the packet get lost on the way there, or did the acknowledgement get lost on the way here?)
…Use 2ping and find out!
@hoopycat:
(scp really crams a lot into the sendq.)
If you use the -l (lower-case L) option, it seems to prevent scp from doing that. I've found it to be useful for getting more honest status reports from scp when transferring small files (which scp would normally just report 100% completion on immediately, as it's dumped the entire file into the send queue).
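For reference, scp's -l option takes its limit in Kbit/s, so a limit thought of in kilobytes per second has to be multiplied by 8. The following is a small sketch that builds such a command line; the helper name `scp_command` and the 500 KB/s figure are made up for illustration, and the other arguments are copied from the command at the top of the thread.

```python
import shlex


def scp_command(source, destination, limit_kbytes_per_s, port=None, identity=None):
    """Build an scp argument list with a bandwidth limit (-l is in Kbit/s)."""
    limit_kbit = int(limit_kbytes_per_s * 8)  # kilobytes/s -> kilobits/s
    argv = ["scp", "-p", "-r", "-l", str(limit_kbit)]
    if port is not None:
        argv += ["-P", str(port)]
    if identity is not None:
        argv += ["-i", identity]
    argv += [source, destination]
    return argv


# Hypothetical limit of 500 KB/s; pick a value below what the path can sustain.
argv = scp_command("file1", "node.in.europe:/tmp/test", 500,
                   port=1234, identity="/root/a.pem")
print(" ".join(shlex.quote(arg) for arg in argv))
```

The resulting list can be passed to `subprocess.run(argv)` as is; printing it first makes it easy to check the limit before starting a long transfer.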
Hi Folks, I have found a solution for this, but only when using the Windows app WinSCP. Still, maybe it will give you a direction for the Linux scp command (although I did not find any matching option to tweak there, or for the ssh command):
I found this - https://winscp.net/forum/viewtopic.php?t=25705, meaning the issue is at the SCP/SSH level.
So I disabled "Connection -> Optimize connection buffer size" in the WinSCP GUI, under Site Manager > select the needed site > Edit > Advanced > Connection pane.
This raised my download speed (from the Linode server to my PC) to a peak of 50 Mbps and an average of about 35-40 Mbps, whereas without this change it was about 12-13 Mbps.
https://winscp.net/eng/docs/ui_login_connection
FYI.