Diagnosing Lag Spikes
I can't find anything in the MUD itself that would be causing the lag spikes, so I'm turning to the VPS itself, or something else running on it. Even in a shell over SSH I get random bursts of lag, so I can confirm the problem affects both my shell sessions and my MUD players.
I've been looking at "top" to measure CPU and memory usage, but I'm not finding anything out of the ordinary.
Are there any tools or utilities out there I can use to keep track of various system stats over time?
Anyone have any suggestions on how to otherwise track down or diagnose problems like this?
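To give a sense of what I mean by keeping track of stats over time, even something as crude as the sketch below would do: it just samples /proc/loadavg and /proc/meminfo every few seconds and appends a timestamped line to a log file, so I could look back at what the box was doing when a spike hit. The interval and log path are just placeholders.
[code]
#!/usr/bin/env python
# Rough sketch: append a timestamped line of load average and free memory
# to a log file every INTERVAL seconds. Interval and path are placeholders.
import time

INTERVAL = 5                       # seconds between samples (placeholder)
LOGFILE = "/tmp/statlog.txt"       # placeholder path

def read_loadavg():
    # /proc/loadavg looks like "0.52 0.41 0.30 1/123 4567"; keep the first three fields
    with open("/proc/loadavg") as f:
        return f.read().split()[:3]

def read_memfree_kb():
    # /proc/meminfo lines look like "MemFree:   301964 kB"
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemFree:"):
                return int(line.split()[1])
    return 0

while True:
    load1, load5, load15 = read_loadavg()
    memfree = read_memfree_kb()
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    with open(LOGFILE, "a") as log:
        log.write("%s load=%s/%s/%s memfree_kb=%d\n"
                  % (stamp, load1, load5, load15, memfree))
    time.sleep(INTERVAL)
[/code]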
9 Replies
You wouldn't happen to be on dallas152?
@crazylane:
You wouldn't happen to be on dallas152?
zunzun.com is on dallas105 and is seeing very bad lag spikes and lost packets. I thought this was on my end at first, but not according to my tests here - it's only Linode.
James
@crazylane:
You wouldn't happen to be on dallas152?
Nope, atlanta8.
I haven't done much digging into network stats yet. I can't find any hints in CPU usage, memory usage, or file I/O, though, even though my SSH connection lags at the same time my MUD players complain, so something is definitely up.
top - 18:21:03 up 1 day, 9:22, 1 user, load average: 14.26, 10.75, 7.53
Tasks: 190 total, 2 running, 187 sleeping, 0 stopped, 1 zombie
Cpu(s): 1.2%us, 0.9%sy, 0.1%ni, 41.4%id, 56.4%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 1549180k total, 1247216k used, 301964k free, 153940k buffers
Swap: 524280k total, 4180k used, 520100k free, 765600k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
12017 apache 20 0 34708 17m 3716 R 4.6 1.2 0:00.23 httpd
2889 mysql 20 0 246m 95m 5340 S 3.3 6.3 64:45.48 mysqld
10947 apache 20 0 35316 18m 4752 S 0.7 1.2 0:01.29 httpd
11478 apache 20 0 34156 17m 4292 S 0.7 1.2 0:00.55 httpd
12018 apache 20 0 34164 17m 3812 S 0.7 1.1 0:00.22 httpd
9488 root 20 0 13408 9928 2380 S 0.3 0.6 0:00.51 backup.pl
10686 root 20 0 13140 10m 1560 S 0.3 0.7 0:28.26 lfd
11622 apache 20 0 34208 17m 4412 S 0.3 1.2 0:00.38 httpd
22437 root 39 19 1968 648 284 S 0.3 0.0 1:43.58 gzip
25285 root 20 0 28604 14m 4724 S 0.3 1.0 0:28.19 httpd
28122 root 20 0 2416 1184 828 R 0.3 0.1 0:22.29 top
1 root 20 0 2152 604 560 S 0.0 0.0 0:00.38 init
I'm on newark10.
Try running one of the line-per-second monitoring tools (maybe in a screen session?), like 'vmstat 1' or 'dstat -c', and look for large amounts of CPU time spent in iowaits when the stalls happen.
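If you want something you can leave running and grep later, the same iowait figure vmstat reports can be read straight out of /proc/stat. A rough sketch (the one-second interval and the 50% warning threshold are arbitrary):
[code]
#!/usr/bin/env python
# Rough sketch: sample the aggregate "cpu" line of /proc/stat once a second
# and print the percentage of time spent in iowait between samples.
import time

def read_cpu_times():
    # First line of /proc/stat: "cpu user nice system idle iowait irq softirq ..."
    with open("/proc/stat") as f:
        fields = f.readline().split()
    return [int(x) for x in fields[1:]]

prev = read_cpu_times()
while True:
    time.sleep(1)
    cur = read_cpu_times()
    deltas = [c - p for c, p in zip(cur, prev)]
    prev = cur
    total = sum(deltas)
    if total == 0:
        continue
    iowait_pct = 100.0 * deltas[4] / total   # iowait is the 5th cpu field
    flag = "  <-- stalled on disk?" if iowait_pct > 50 else ""
    print("%s iowait %.1f%%%s" % (time.strftime("%H:%M:%S"), iowait_pct, flag))
[/code]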
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 5 3956 270452 170008 785656 0 0 23 16 3 5 1 0 75 24 0
0 6 3956 265384 170044 785632 0 0 36 0 824 401 1 1 30 69 0
0 7 3956 261408 170088 785700 0 0 32 92 370 164 0 0 42 57 0
0 9 3956 254920 170120 785668 0 0 32 0 508 168 1 1 9 89 0
0 9 3956 252740 170160 785700 0 0 40 0 212 91 0 0 35 65 0
1 9 3956 243120 170172 785700 0 0 12 0 406 162 1 1 0 98 0
0 10 3956 236460 170188 785700 0 0 16 0 972 501 1 1 41 56 0
0 11 3956 236460 170224 785664 0 0 32 236 166 82 0 0 0 100 0
0 11 3956 236460 170260 785700 0 0 36 0 43 71 0 0 38 62 0
0 11 3956 236460 170284 785676 0 0 24 0 45 54 0 0 0 100 0
0 11 3956 236540 170320 785704 0 0 36 0 69 71 0 0 40 60 0
0 11 3956 236540 170356 785668 0 0 32 8 75 60 0 0 0 100 0
0 11 3956 236572 170368 785704 0 0 12 4 44 49 0 0 40 60 0
0 10 3956 236572 170380 785692 0 0 12 0 47 43 0 0 0 100 0
0 10 3956 236760 170404 785704 0 0 24 0 45 58 0 0 39 61 0
1 10 3956 236760 170420 785688 0 0 16 32 48 48 0 0 12 88 0
0 10 3956 236760 170440 785704 0 0 20 0 49 48 0 0 35 65 0
0 10 3956 235140 170468 785676 0 0 36 40 169 56 0 0 0 99 0
0 6 3956 235320 170504 785712 0 0 36 0 87 83 0 0 38 62 0
0 6 3956 235196 170528 785688 0 0 24 0 61 54 0 0 0 100 0
0 5 3956 238056 170572 785708 0 0 44 0 224 114 0 0 40 60 0
1 5 3956 236320 170608 785712 0 0 36 0 233 118 0 0 0 99 0
Yep, iowait. You can try complaining so the Linode staff can track down whoever the disk hog is and convince them to stop, or accept the host-switch offer.
And I thought I had it bad with 25-30% in waits… you seem to be hitting 60-100%… >.>
PS: the [ code ] tag is useful.