Webserver Load Roundup Question Session
This is also the first time I've ever set foot outside the entry-level Linode plan, but I didn't see myself getting decent performance out of the basic plan.
So here's the question roundup. Any pointers would be helpful.
1) I'm running Gentoo, with Lighttpd and the latest PHP with APC. I believe this gives me the best chance to squeeze every last bit of performance out of the server. Are there any good tricks for this stack that I might be able to leverage to keep my site's performance up?
2) I have three 32 MB "slices" set up for APC's bytecode cache (the relevant php.ini lines are sketched just after this list). After hitting the server with 20 simultaneous clients with WAPT (a load testing program), the bytecode seems to sit quite comfortably in a cache of that size. Am I overlooking anything here?
3) I was using Munin for graphs, per the suggestions a few threads down. It seemed to work great until about 8 o'clock tonight, when I noticed the server taking minutes to serve pages. After watching top and playing with the WAPT load tester, I noticed that 20 simultaneous users walking through the entire site with 30-45 seconds between clicks would give me a system load between 1.50 and 3.00 according to top. Then munin-graph would kick in, the PHP tasks would stack up, system load would shoot up to 12.00-15.00, and it would spend the next five minutes getting back down to 7-8 before getting hit with munin-graph again. Munin is just a bit too heavyweight for my needs… I really just want system load, memory, and network graphs like on the Linode panel. Any ideas? Also, I finally got the Lighttpd graphs working in Munin… I'll miss those, but customers come before pretty graphs.
4) I have absolutely no clue how to properly load test this application (WordPress MU running Webcomic), or what kind of performance to expect from it on a Linode. Does anyone run decently sized applications on their Linodes? What kind of performance do you expect? Is a Linode 540 "good enough" (not counting a Digg front page)? Will I need to find room in my tight budget to expand? Or are there stupid things I might be doing in my setup that could cause the site to underperform through no fault of Xen?
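For reference (and in case I'm mis-configuring something basic), the "slices" from (2) are just these php.ini lines. The shm values are what I'm actually running; the rest is the bog-standard way of loading APC as far as I know, so treat it as a sketch:
; APC bytecode cache settings (question 2 above)
extension = apc.so
apc.enabled = 1
apc.shm_segments = 3     ; three shared-memory "slices"
apc.shm_size = 32        ; size per segment; old APC reads this as a bare number of MB
apc.stat = 1             ; re-stat scripts on each request (can be turned off for a little more speed)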
If you guys have any questions about my setup, configuration, state of mind, etc. that would allow you to give me better advice, please ask. I don't know what details to include here to get good advice. I've run tons of Linux desktops of many distro flavors, and I've had Linodes off and on… but those other Linodes were hobbies. This one I'm actually expecting real work from, and I'm not sure I'm "doing it right."
12 Replies
Seems like you should be able to handle more than 20 concurrent users with a 540 though.
@BarkerJr:
I don't run Munin, but re #3, you should set Munin to generate graphs on-demand, rather than via cron job.
Or separate out the munin-graphing from the production website.
I have no idea if this is possible in Munin, but if it is, then what I'd do is have the data collector process run on the production server and send the results to a second server used for monitoring and reporting. (In my case it'd probably be at home, since I have a full-time Linux install running there, and I don't need nice pretty domain names or port 80 access for the monitoring reports.)
And, yes, Munin has the ability to move the graph drawing off-server, but what I really wanted wasn't a huge sheaf of graphs at home, but a simple password protected page that would allow me to gauge system load, etc, when my friend calls me in a panic to let me know that OMG THE SITE IS SLOW FIX IT NAO. I could load up the page and go "It's just a traffic spike" or "Dear Lord! Check Digg and Penny Arcade!".
I might just build a simple script in PHP that uses some output from /proc to build the exact display I want, with the bonus that the PHP script can use the APC cache, unlike the Perl-based Munin.
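Something along these lines is probably all I need (completely untested, just sketching it out; it only reads the standard /proc files, and network stats could be pulled from /proc/net/dev the same way):
<?php
// Dead-simple status page: load average, memory, and swap straight from /proc.
// Stick it behind HTTP auth in lighttpd; it's only meant as a quick sanity check.
$load = explode(' ', trim(file_get_contents('/proc/loadavg')));
$mem = array();
foreach (file('/proc/meminfo') as $line) {
    if (preg_match('/^(\w+):\s+(\d+)\s+kB/', $line, $m)) {
        $mem[$m[1]] = (int) $m[2];   // values are in kB
    }
}
header('Content-Type: text/plain');
printf("Load (1/5/15 min): %s %s %s\n", $load[0], $load[1], $load[2]);
printf("Memory: %d MB used of %d MB\n",
    ($mem['MemTotal'] - $mem['MemFree'] - $mem['Buffers'] - $mem['Cached']) / 1024,
    $mem['MemTotal'] / 1024);
printf("Swap: %d MB used of %d MB\n",
    ($mem['SwapTotal'] - $mem['SwapFree']) / 1024,
    $mem['SwapTotal'] / 1024);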
CPU is barely being used, but all these processes are stacking up. So I found this thread:
And it showed a *#&$load of lighttpd PHP processes just waiting in queue. I decided "screw this" and tried to restart lighttpd. It kept responding with "* Stopping lighttpd [!!]" and bombing out.
So I tried a graceful stop; it said [ ok ], and then "Starting lighttpd…" just hung.
I tried to open a new terminal and it hung as well. I couldn't even see the login prompt.
So now I'm yanking the Linode's power to reboot.
I must be missing something obvious; having the session explode like that and lock up is something I expect from Windows, not Linux.
I read somewhere that the culprit may be swap. Judging from the summary line in top, I'm using about 500 MB of swap. I've also noticed that Lighttpd seems to be spawning a *#&$load of FastCGI PHP processes… might that be it?
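For anyone following along, this is roughly how I've been eyeballing it from the shell (php-cgi is what the Gentoo lighttpd FastCGI setup spawns here; the process name may differ on other setups):
# count the PHP FastCGI children lighttpd has spawned
# (the [p] keeps grep from counting itself)
ps ax | grep -c '[p]hp-cgi'
# quick look at memory and swap usage, in MB
free -m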
PROBLEM (should have been obvious):
Linode using tons of swap (about 500 MB)
Any time a long-running process kicks in, all hell breaks loose
The APC cache has to be kept uselessly small
Runaway system load: if it ever gets above 5, it shoots up to 20, 30, 40!
REASON:
Gentoo ships (for some reason) the configuration from the lighttpd FAQ entry for "if your server dies a twitching, painful, screaming death, this config file may be the culprit" (read as: "Don't let this happen to you.")
max-procs set to 4
PHP_FCGI_CHILDREN set to 16.
4 max-procs * (16 FastCGI children + 1 watcher proc) = 68 PHP processes
Each watcher proc carries its own APC cache: 64 MB * 4 = 256 MB of duplicated cache in memory.
THE FIX:
"max-procs" => "2",
Set max-procs in the fcgi config file to 2. This gives you redundancy if a master proc crashes (the other can handle the load until the first is resurrected), and it means you only keep two copies of the APC cache in memory.
"PHPFCGICHILDREN" => "8",
Set PHP_FCGI_CHILDREN to 8. This gives you 16 PHP workers in total, which should be fine for general load. The value can be goosed up if customers are timing out, but don't raise max-procs instead: each extra master proc means 9 more PHP processes (8 children plus the watcher) and another duplicate APC cache in memory!
Finally, according to the lighty docs, PHP FastCGI children can sometimes hang after 500 requests, causing a race condition where the child is in the middle of dying but lighty is still feeding requests to it. To fix this, set:
"PHPFCGIMAX_REQUESTS" => "500"
Setting this explicitly makes PHP exit cleanly at the 500th request so lighty can spawn a fresh child, sidestepping the bug.
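Putting it all together, the relevant chunk of the fastcgi include ends up looking more or less like this (socket and bin-path are whatever your distro put there, so don't copy those two blindly):
fastcgi.server = ( ".php" =>
  ( "localhost" =>
    (
      "socket"   => "/var/run/lighttpd/php-fastcgi.socket",
      "bin-path" => "/usr/bin/php-cgi",
      # two masters = two APC caches, no more
      "max-procs" => 2,
      "bin-environment" => (
        "PHP_FCGI_CHILDREN"     => "8",
        "PHP_FCGI_MAX_REQUESTS" => "500"
      )
    )
  )
)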
RESULTS:
No more runaway PHP. No more infinite load blowing up the server when PHP blocks up. System loads under 1 no matter what I hurl at it. Plenty of room in memory for new processes if need be. No longer abusing swap. Everyone wins!
MAKEOPTS='-j 1'
PORTAGE_NICENESS='15'
I believed this might allow me to run an emerge (even if really slowly) without having to bring down lighttpd. No, no it didn't. Load went up to 8.5 and the webserver stopped responding. As a bonus, however, the cpp compiler hung as well. In this case no swap was being used; the OS just kinda "seized up." Since I need the updated ImageMagick, I stopped lighttpd, and now the emerge is flying along like there's no tomorrow.
So what gives? My understanding of MAKEOPTS was that it would keep make to one compiler process at a time, and PORTAGE_NICENESS would give every process Portage spins off a niceness of 15. My expectation was that fcgi-php, lighttpd, mysql… all those processes would be given priority in the scheduler, and the install would just run a bit slower.
But that's not what happened; the whole server seized up like an engine sans oil. Is my understanding of the Linux scheduler wrong? Am I just expecting too much out of a 500-series Linode?
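For what it's worth, if anyone wants to see whether the niceness actually takes effect during an emerge, the NI column from ps shows it; something along the lines of:
# top CPU consumers with their nice values; cc1/cc1plus are gcc's actual compiler processes
ps -eo pid,ni,pcpu,comm --sort=-pcpu | head -n 15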
It's either/or now. I can either have a stinking fast emerge on my Linode (MAKEOPTS at 8 jobs, niceness 0), or I can have a stinking fast webserver. But if I try to do both (expecting one to slow down in deference to the other), both sides seize up and die.
@AutoDMCLabs:
MAKEOPTS='-j 1'
I've noticed (I think) that the command line parser seems "flawed", and "-j4" != "-j 4"…
@make(1):
If the -j option is given without an argument, make will not limit the number of jobs that can run simultaneously.
@deadwalrus:
I haven't tried out a setup like yours. …
Seems like you should be able to handle more than 20 concurrent users with a 540 though.
I'm getting BoingBoing'ed for the first time right now. I just took this screenshot:
[screenshot]
and check out top:
[screenshot of top]
You can see the site's doing great. No slowdown. I'm on a 720, running Apache entirely in SSL/https mode (Not just login screens), with mod_rails, Sphinx search, and two Wordpress instances.
@umdenken:
You can see the site's doing great. No slowdown. I'm on a 720, running Apache entirely in SSL/https mode (Not just login screens), with mod_rails, Sphinx search, and two Wordpress instances.
Nice - just curious, are you using the worker or prefork MPM? What do you have for ThreadsPerChild and MaxClients?