Are my neighbors screwing me or is it the data center?
The 512 is in Atlanta; the 1024 is in Newark.
They are both running Ubuntu 10.04 with nginx + apc + memcached + php + mysql. Same installations, same versions, same config files, same database.
I set up a test PHP script that does some DB reads/writes, then ran ApacheBench ("ab") against both Linodes. There are big differences between the outcomes.
The 512 is outperforming the 1024 by a factor of 2 or more consistently.
Why is this? Are my neighbors on the 1024 hurting my performance? Or does the Newark datacenter run older hardware or have other latencies?
I ran both ab tests from the same Media Temple server, which I believe is in Virginia.
Here are the results.
Linode 512
Server Software: nginx
Server Hostname: 66.228.62.177
Server Port: 80
Document Path: /test.php
Document Length: 3843 bytes
Concurrency Level: 200
Time taken for tests: 10.373 seconds
Complete requests: 4617
Failed requests: 3280
(Connect: 0, Length: 3280, Exceptions: 0)
Write errors: 0
Keep-Alive requests: 0
Total transferred: 19718113 bytes
HTML transferred: 18240353 bytes
Requests per second: 461.68 [#/sec] (mean)
Time per request: 433.198 [ms] (mean)
Time per request: 2.166 [ms] (mean, across all concurrent requests)
Transfer rate: 1925.43 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 14 21 139.5 15 3015
Processing: 30 351 576.4 270 9815
Waiting: 30 351 576.5 269 9815
Total: 46 373 592.3 285 9836
Linode 1024
Server Software: nginx
Server Hostname: 173.255.232.118
Server Port: 80
Document Path: /test.php
Document Length: 3857 bytes
Concurrency Level: 200
Time taken for tests: 10.132372 seconds
Complete requests: 621
Failed requests: 520
(Connect: 0, Length: 520, Exceptions: 0)
Write errors: 0
Keep-Alive requests: 0
Total transferred: 2655233 bytes
HTML transferred: 2456193 bytes
Requests per second: 61.29 [#/sec] (mean)
Time per request: 3263.244 [ms] (mean)
Time per request: 16.316 [ms] (mean, across all concurrent requests)
Transfer rate: 255.91 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 7 7 0.5 7 9
Processing: 349 1656 1416.8 973 9577
Waiting: 347 1655 1416.8 972 9577
Total: 356 1663 1416.9 980 9585
UPDATE:
It appears that everything is slower with the 1024. It takes longer to SFTP files, SSH to the box, etc.
Do you get the same results running ab from the tested host or preferably another host in the data centre?
Are both databases hot (i.e., have they been running and accessing the same data under test for a while)?
Any error log messages (nginx or system)?
32-bit or 64-bit? On both?
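A quick way to check on each box, if needed:
uname -m   # i686 = 32-bit, x86_64 = 64-bit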
@danblack:
Do you get the same results running ab from the tested host or preferably another host in the data centre?
Are both databases hot (i.e., have they been running and accessing the same data under test for a while)?
Any error log messages (nginx or system)?
32-bit or 64-bit? On both?
32-bit on both machines. The databases have little data in them (but the same data). Roughly 10 tables with 10 rows each.
This is the only error I could see in the various logs that looked interesting. But this error was in php-fpm.log on BOTH machines, so I don't think it would account for the performance differences.
[16-Feb-2012 19:47:28] WARNING: [pool www] server reached pm.max_children setting (5), consider raising it
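For reference, that limit lives in the FPM pool config; the exact path varies by install (/etc/php5/fpm/pool.d/www.conf is a common location, so treat it as an assumption), and the numbers below are illustrative rather than a recommendation:
; www.conf (FPM pool config)
pm.max_children = 20       ; hard cap on simultaneous PHP worker processes
pm.start_servers = 5       ; used when pm = dynamic
pm.min_spare_servers = 5
pm.max_spare_servers = 10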
I will try running the ab tests locally on each Linode (the 512 testing itself, and the 1024 testing itself) and let you know how it goes.
ab -k -c 200 -t 10 http://{IP-ADDRESS}/test.php
As expected, the performance of both went down. But the 512 is still outperforming the 1024.
Also, my previous statement still holds true. It takes noticeably longer to do anything on the 1024 vs the 512. SFTPing files, vi editing, running MySQL commands, SSHing in, etc.
Linode 512
Server Software: nginx
Server Hostname: 66.228.62.177
Server Port: 80
Document Path: /test.php
Document Length: 3857 bytes
Concurrency Level: 200
Time taken for tests: 10.003 seconds
Complete requests: 3517
Failed requests: 2518
(Connect: 0, Receive: 0, Length: 2518, Exceptions: 0)
Write errors: 0
Keep-Alive requests: 0
Total transferred: 15026536 bytes
HTML transferred: 13901096 bytes
Requests per second: 351.61 [#/sec] (mean)
Time per request: 568.816 [ms] (mean)
Time per request: 2.844 [ms] (mean, across all concurrent requests)
Transfer rate: 1467.05 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 1 2.2 0 16
Processing: 9 376 651.9 259 9929
Waiting: 9 375 651.8 259 9928
Total: 25 376 652.3 259 9939
Linode 1024
Server Software: nginx
Server Hostname: 173.255.232.118
Server Port: 80
Document Path: /test.php
Document Length: 3842 bytes
Concurrency Level: 200
Time taken for tests: 10.001 seconds
Complete requests: 1713
Failed requests: 1245
(Connect: 0, Receive: 0, Length: 1245, Exceptions: 0)
Write errors: 0
Keep-Alive requests: 0
Total transferred: 7323934 bytes
HTML transferred: 6775774 bytes
Requests per second: 171.28 [#/sec] (mean)
Time per request: 1167.659 [ms] (mean)
Time per request: 5.838 [ms] (mean, across all concurrent requests)
Transfer rate: 715.16 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 3 8.7 0 34
Processing: 18 789 767.8 597 9531
Waiting: 18 788 767.8 596 9531
Total: 48 791 768.0 598 9555
Both servers are running the same software (OpenSSH_5.3p1) with the same settings (protocol 2, etc.). In verbose mode, connecting to the 1024 hangs at the last line shown below for about 20 seconds.
Not sure if this is related, but wanted to add it.
OpenSSH_5.6p1, OpenSSL 0.9.8r 8 Feb 2011
debug1: Reading configuration data /Users/Blake/.ssh/config
debug1: Applying options for pix
debug1: Reading configuration data /etc/ssh_config
debug1: Applying options for *
debug1: Connecting to 173.255.232.118 [173.255.232.118] port 22.
debug1: Connection established.
debug1: identity file /Users/Blake/.ssh/id_pix type 1
debug1: identity file /Users/Blake/.ssh/id_pix-cert type -1
debug1: Remote protocol version 2.0, remote software version OpenSSH_5.3p1 Debian-3ubuntu7
debug1: match: OpenSSH_5.3p1 Debian-3ubuntu7 pat OpenSSH*
debug1: Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-OpenSSH_5.6
debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
debug1: kex: server->client aes128-ctr hmac-md5 none
debug1: kex: client->server aes128-ctr hmac-md5 none
debug1: SSH2_MSG_KEX_DH_GEX_REQUEST(1024<1024<8192) sent
debug1: expecting SSH2_MSG_KEX_DH_GEX_GROUP
debug1: SSH2_MSG_KEX_DH_GEX_INIT sent
debug1: expecting SSH2_MSG_KEX_DH_GEX_REPLY
debug1: Host '[173.255.232.118]:22' is known and matches the RSA host key.
debug1: Found key in /Users/Blake/.ssh/known_hosts:1
debug1: ssh_rsa_verify: signature correct
debug1: SSH2_MSG_NEWKEYS sent
debug1: expecting SSH2_MSG_NEWKEYS
debug1: SSH2_MSG_NEWKEYS received
debug1: Roaming not allowed by server
debug1: SSH2_MSG_SERVICE_REQUEST sent
debug1: SSH2_MSG_SERVICE_ACCEPT received
@bperdue:
Both servers are running the same software (OpenSSH_5.3p1) with the same settings (protocol 2, etc.). In verbose mode, connecting to the 1024 hangs at the last line shown below for about 20 seconds.
Is it possibly hanging trying to do a DNS lookup of your remote IP? Are the resolvers the same on both Linodes, and the 'UseDNS' setting in sshd_config?
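For example, on each Linode (assuming the stock file locations):
cat /etc/resolv.conf                     # compare the nameserver entries
grep -i usedns /etc/ssh/sshd_config      # no match means the default, which is yes for this OpenSSH vintage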
Holy s***, a concurrency level of 200 to test an FPM pool with 5 children? ab itself is going to use a lot of resources at that scale. But of course, that's irrelevant in this case.
If everything is slower on the 1024 node, it is quite possible that something is wrong. Maybe Linode is trying to hot-swap a failed disk on the node? Or it could be a bunch of really nasty neighbors. On the other hand, if only some things are slow…
What are the load averages on both nodes? Is there any glaring difference in the "top" output?
@sleddog:
Is it possibly hanging trying to do a DNS lookup of your remote IP? Are the resolvers the same on both Linodes, and the 'UseDNS' setting in sshd_config?
I'm not sure. There is no UseDNS line in sshd_config, but I'm not sure about the resolvers; how do I check this?
@hybinet:
Holy s***, a concurrency level of 200 to test an FPM pool with 5 children? ab itself is going to use a lot of resources at that scale. But of course, that's irrelevant in this case.
If everything is slower on the 1024 node, it is quite possible that something is wrong. Maybe Linode is trying to hot-swap a failed disk on the node? Or it could be a bunch of really nasty neighbors. On the other hand, if only some things are slow…
What are the load averages on both nodes? Is there any glaring difference in the "top" output?
Nothing unusual in top. I can do smaller concurrency tests and post them if helpful.
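For example, something closer to the pool size, reusing the earlier command:
ab -k -c 5 -t 10 http://{IP-ADDRESS}/test.php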
Everything is slower on the 1024. Here are the graphs for the two nodes:
1024 = [graph]
512 = [graph]
I'm thinking of spinning up a new 1024 to compare. If the new 1024 is also slow, it would tell me there's something wrong with my 1024 (settings, etc.). If the new 1024 runs well, it tells me it was my neighbors or the datacenter.
@bperdue:
Nothing unusual in top. I can do smaller concurrency tests and post them if helpful.
One other thing to look for is any significant difference in your I/O wait percentage (you can just leave a "vmstat 1" running on both hosts during a comparable time period). If the 1024 is seeing heavier I/O contention, that could easily explain the general feeling of slowness and hurt performance, and it's largely outside of your control (unless, of course, you're the one doing all the I/O).
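For example, on both hosts over a comparable window:
vmstat 1 30    # one-second samples; compare the 'wa' (I/O wait) column between the two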
In that case it could well be your "neighborhood", or as suggested earlier perhaps some maintenance going on. But if you see a steady pattern of high I/O wait even when you aren't doing all that much I/O (again, a running vmstat could reflect this), I'd probably open a support ticket to at least have Linode verify that the host itself isn't seeing any issues.
I had one Linode (a 512, as it turns out) where even just trying to send a few hundred KB to the disk would get an effective I/O rate of less than 25 KB/s (~256 KB over ~12 seconds, with I/O wait at 70+%). That (to me) was clearly less than a conservative estimate even if all 40 guests were going full bore against a typical disk. So you might see if you're getting I/O rates below what 20 users sharing a disk at full bore ought to be able to get, which could indicate a problem beyond pure sharing among the guests.
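A crude way to spot-check that (illustrative, not a rigorous benchmark; the file path is arbitrary):
dd if=/dev/zero of=/tmp/iotest bs=64k count=4 conv=fsync   # writes ~256 KB synced to disk and reports the effective rate
rm /tmp/iotest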
The suggestion of trying a different 1024 is pretty good too, I think, since if that improves things, you can just kill the first one and essentially do your own migration away from the load.
– David
I created a new 1024 in Atlanta using a clone of my 1024 in Newark. After doing so, I realized that the SSH and other delays were due to having set a hostname that is a domain name without listing that domain name in /etc/hosts (i.e., mapping 127.0.0.1 to domainname.com).
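The fix was an entry along these lines (domainname.com standing in for the actual hostname):
# /etc/hosts
127.0.0.1   localhost
127.0.0.1   domainname.com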
After I made that change, the SSH delay was gone. I then retested all three Linodes: the 512 in Atlanta, the old 1024 in Newark, and the new 1024 in Atlanta.
The 1024 in Atlanta is on par with the 512 (although, strangely, the 512 seems to be slightly faster in my benchmarks). The Newark 1024 is still darn slow, delivering only about 20-50% of the reqs/s that the Atlanta Linodes are doing.
So, I'm just going to kill the Newark 1024 and call it a day.
@bperdue:
The 1024 in Atlanta is on par with the 512 (although, strangely, the 512 seems to be slightly faster in my benchmarks).
That's not surprising. Bigger nodes get more RAM, HDD, and transfer, but the burstable CPU allocation is more or less the same.