Performance tuning a Rails application

I could use some advice. I'm fairly new to Rails, Linode and Debian.

I built a RoR application that uses Nginx, clustered Mongrel servers, memcached, and a MySQL database. My setup is a Linode 360, so pretty low-end from a performance standpoint.

In performance testing, my throughput maxes out at around 13 requests/second (~1.1m page views/day if sustained). During a burst, CPU spikes to around 70-75%, and memory pushes slightly into swap. The test hits my most expensive dynamic page, so this is worst-case throughput.
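For anyone checking the math: converting requests/second to page views/day is just a multiplication by 86,400 seconds. Here is a minimal sketch (the function name is made up for illustration). It assumes the peak rate is sustained around the clock, which real traffic never does, so treat the result as an upper bound.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def page_views_per_day(requests_per_second):
    """Page views per day if the given rate were sustained for a full 24 hours."""
    return requests_per_second * SECONDS_PER_DAY


if __name__ == "__main__":
    # 13 requests/second, held constant all day
    print(page_views_per_day(13))
```

At 13 requests/second that works out to about 1.12 million page views/day.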

I'm struggling to identify the current bottleneck, though. I suspect it is either memory (e.g. going into swap) or database performance, but I'm not sure how to determine which one is the root cause.

Advice is welcome. Also, any advice on the best way to scale horizontally would be much appreciated. My inclination is to use something like MySQL replication to set up two identical instances and use round-robin DNS to distribute requests between them.


How much RAM and swap are you using? ( free -m )

How much RAM is MySQL configured to consume? Is it appropriate for the size of your database? (The indexes, as well as the most commonly accessed data, should fit in RAM.)

Does your "most expensive dynamic page" really need to be up to date to the last tenth of a second? You said you're using memcached. Why not cache some of those query results and page snippets, even for a short duration?
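To make the short-duration caching idea concrete, here is a rough sketch of a fetch-or-compute cache with a time-to-live. It is an in-process stand-in written in Python, not memcached and not Rails code; the class name, the `fetch` method and the `clock` parameter are all invented for illustration. Rails' own cache store offers a fetch-with-block call that works along similar lines.

```python
import time


class ShortLivedCache:
    """In-process stand-in for memcached: entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be exercised without sleeping
        self._store = {}    # key -> (expires_at, value)

    def fetch(self, key, compute):
        """Return the cached value for key, calling compute() on a miss or after expiry."""
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = compute()
        self._store[key] = (now + self.ttl, value)
        return value


if __name__ == "__main__":
    cache = ShortLivedCache(ttl_seconds=5)
    # The lambda stands in for the expensive query or page render.
    page = cache.fetch("expensive_page", lambda: "<html>rendered once</html>")
    print(page)
```

Even a TTL of a few seconds means the expensive work runs once per window instead of once per request, which matters most during exactly the kind of burst being tested here.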

Rails is great and all, but you really want to know what kind of SQL it's generating under the hood. If you're running hundreds of queries for that one page (e.g. 1+N pattern), there's your culprit. If not, run EXPLAIN on some of your SQL queries to find out which query takes the longest. Optimize that first, and then throw the cache at it.
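If the 1+N pattern is unfamiliar, here is a small self-contained illustration. It uses Python with an in-memory SQLite database rather than Rails and MySQL, and the `posts`/`comments` tables and helper names are hypothetical, so treat it as a sketch of the pattern and not as what your application does. In ActiveRecord the usual fix is eager loading the association.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE comments (id INTEGER PRIMARY KEY, post_id INTEGER, body TEXT);
    INSERT INTO posts (id, title) VALUES (1, 'first'), (2, 'second'), (3, 'third');
    INSERT INTO comments (post_id, body) VALUES (1, 'a'), (1, 'b'), (2, 'c');
""")

executed = []  # every statement issued through run(), like tailing the SQL in a dev log


def run(sql, params=()):
    """Execute one statement, record it, and return all rows."""
    executed.append(sql)
    return conn.execute(sql, params).fetchall()


def comments_n_plus_one():
    """One query for the posts, then one more per post: 1 + N round trips."""
    result = {}
    for post_id, title in run("SELECT id, title FROM posts ORDER BY id"):
        rows = run("SELECT body FROM comments WHERE post_id = ? ORDER BY id", (post_id,))
        result[title] = [body for (body,) in rows]
    return result


def comments_single_join():
    """The same data in a single round trip, using a LEFT JOIN."""
    result = {}
    rows = run(
        "SELECT p.title, c.body FROM posts p "
        "LEFT JOIN comments c ON c.post_id = p.id ORDER BY p.id, c.id"
    )
    for title, body in rows:
        result.setdefault(title, [])
        if body is not None:  # posts without comments yield one row with a NULL body
            result[title].append(body)
    return result


if __name__ == "__main__":
    comments_n_plus_one()
    print(len(executed), "queries for the 1+N version")
```

With three posts the difference is small, but the first version grows by one query per row, which is how a single page ends up issuing hundreds of queries.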

I THINK I identified the root cause: my CPU slice. Just Nginx serving up static pages maxes out around 20 pages/second.

Is there any documentation on how Linode manages CPU sharing on a host? I'm beginning to think it may not handle bursts well.


It's standard Xen, IIRC. You should have access to a ton of CPU power for bursts. You're on an eight-core host, and you have four virtual cores that are scheduled on those. It's rare that you can't get the equivalent of a full quad-core Xeon worth of power.

As for the time slices, I assume (but I'm guessing) that they run at a high rate, like 1000 or so; it would make more sense in a virtualization environment to accept higher overhead for more fine-grained sharing.

It's possible that your host has a high CPU load from other nodes (although it's rare). You could always open a ticket to ask for an investigation, and possibly a migration to a different host if this is the case.

I'd think that, unless you're maxing out the virtual CPUs, you're not running into CPU issues. Regardless of the time slice granularity, pending requests would be queued, and you'd be able to saturate it anyhow.

Best-case, you'll get four cores worth of CPU. Worst-case, you'll get a fraction of one CPU. Most Linodes use relatively little CPU, so the average case is pretty close to the best case – I'm not sure anything close to the worst case has been seen in the real world. Disk and network I/O tend to be the more common bottlenecks.

Try using "pbzip2" or another parallelizing compression tool on a large test file, while watching "htop"… outside of disk I/O, compression is a very CPU-intensive task and you can probably get darned close to 400% most of the time.

How are you testing your pages/second capacity? 20 pages/second sounds really low for serving static files. Here's what I get for a small static HTML page using ab from the server itself:

rtucker@framboise:~$ ab -n 10000 -c 100
This is ApacheBench, Version 2.0.40-dev <$Revision: 1.146 $> apache-2.0

Server Software:        lighttpd/1.4.19
Document Length:        2260 bytes
Time taken for tests:   1.417060 seconds
Requests per second:    7056.86 [#/sec] (mean)
Time per request:       14.171 [ms] (mean)
Time per request:       0.142 [ms] (mean, across all concurrent requests)
Transfer rate:          17387.41 [Kbytes/sec] received

Running it against my PHP-based blog (note drop in -n and -c; it degrades considerably for -c larger than 10, which I'm OK with):

rtucker@framboise:~$ ab -n 1000 -c 10

Document Length:        88334 bytes
Time taken for tests:   3.907850 seconds
Requests per second:    255.90 [#/sec] (mean)
Time per request:       39.079 [ms] (mean)
Time per request:       3.908 [ms] (mean, across all concurrent requests)
Transfer rate:          22156.17 [Kbytes/sec] received

That's on a Linode 360, running Ubuntu 8.04, lighttpd, php via fastcgi (tcp), xcache, and b2evolution 3.3.3.

Of course, running ab from my house is a lot worse and causes my NAT router to glow red.
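In case the ab summary lines look mysterious, they all follow from three inputs: the number of requests, the concurrency level and the total time. A quick sketch of the arithmetic (the function name is made up), which reproduces the figures from both runs above:

```python
def ab_summary(total_requests, concurrency, seconds):
    """Recompute ab's headline figures from -n, -c and the total time taken.

    Returns (requests_per_second, ms_per_request, ms_per_request_across_all).
    """
    requests_per_second = total_requests / seconds
    # Mean time each request took, as seen by one of the concurrent clients.
    ms_per_request = concurrency * seconds / total_requests * 1000
    # Mean time per request across all concurrent clients combined.
    ms_across_all = seconds / total_requests * 1000
    return requests_per_second, ms_per_request, ms_across_all


if __name__ == "__main__":
    for n, c, t in [(10000, 100, 1.417060), (1000, 10, 3.907850)]:
        rps, per_request, across_all = ab_summary(n, c, t)
        print(f"-n {n} -c {c}: {rps:.2f} req/s, {per_request:.3f} ms, {across_all:.3f} ms")
```

The useful takeaway is that "Time per request (mean)" scales with -c, so it only makes sense to compare it between runs that use the same concurrency.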

Personally, I would move away from the clustered Mongrels and use Nginx with Passenger and Ruby Enterprise Edition. You'll save a good chunk of memory, and you don't have to worry about managing Mongrels that way. Way simpler imo.

The 20 requests/second is the performance of my most expensive dynamic page (with most interaction to db and memcache). So yes, performance of static pages is much better. A quick test of my static pages shows them being served up at >600 requests/second (!).

It took me about an hour to bring up a second instance. The throughput on the second instance is much better since there is no db, but it is still not where it should be. The good news is that I have a good workaround to scale the application, at least until I hit a database bottleneck.

I think I'm going to take a look at Passenger next to see if it gives me some additional performance.

Ah, I misunderstood the post I replied to. Yeah, you might need to revisit that code if you're expecting it to handle more than 20 requests/second. CPU is a very difficult bottleneck… unlike RAM, you can't just double the amount of CPU power in a server, so you're stuck buying servers with more and more cores, or buying multiple servers and adapting your application to scale horizontally. You can only do so much per clock cycle, and clocks aren't getting any faster.

And from your original post… is that 70-75% figure based on max=100% or max=400%? Usually, vmstat outputs 0 to 100%, but most everything else (including the dashboard charts) outputs 0 to n*100%, where n is 4 in this case. It's confusing, but both methods are correct in their own little way…
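To show why the scale matters, here is a small sketch of the conversion between the two conventions (function names are made up; the four cores come from the four virtual cores mentioned earlier in the thread):

```python
def share_of_whole_machine(reported_percent, cores):
    """Convert a reading on the 0..cores*100 scale to a 0..100 share of all cores."""
    return reported_percent / cores


def per_core_scale(share_percent, cores):
    """Convert a 0..100 share of all cores to the 0..cores*100 scale."""
    return share_percent * cores


if __name__ == "__main__":
    cores = 4
    for reading in (70, 75):
        print(
            f"{reading}% read on the 0-400 scale is "
            f"{share_of_whole_machine(reading, cores)}% of the machine; "
            f"read on the 0-100 scale it is {per_core_scale(reading, cores)} on the 0-400 scale"
        )
```

So 70-75% on the 0 to 400% scale means the machine is under 19% busy overall, while the same number on the 0 to 100% scale means roughly three of the four cores are fully occupied. Those are two very different diagnoses.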


