408 Request Timed-out - Node Balancer -> 3x Nginx+Django
I'm having a difficult time diagnosing an intermittent issue, where our site users are occasionally getting a 408 Request Timed-out page. Our situation is a cluster with 3 nginx+gunicorn+django nodes with a Linode Node Balancer in front. (More details below)
My boss, who uses the site for several hours per day, sees it happen once a day, but not with any discernible consistency or pattern - it's not occurring on the same page/time/user action.
Our site isn't live yet, so our traffic is pretty much just some part-time testers, at most around 6 concurrent users. I have load-tests that show a single server can handle up to 30 very active users, and Linode's graphs show that all servers had light loads at the time of the most recent incident (<10% CPU load, <100 blocks/sec IO), and also that NodeBalancer considered all nodes to be UP. I didn't see anything interesting in the nginx or app logs at the time.
Reading up on 408 errors, they are reported when a client opens a connection with the web server, but takes too long to finish sending its request. It may also happen if the socket closes early. (But I don't see how the error page would be delivered at all then.)
Strange thing is that apparently the 408 error page comes up immediately, without any waiting from the user, which doesn't fit the timeout behaviour I'd expect from the relevant nginx settings (config below).
Until 2 weeks ago, we hosted the whole site from a single server for almost a year, and we never saw the 408 error until we migrated to the cluster we have now.
Our architecture:
NodeBalancer:
Port: 80
Protocol: HTTP
Algorithm: Round Robin
Session Stickiness: None
Health Check Type: HTTP Body Regex
Check Interval: 15
Check Timeout: 5
Check Attempts: 1
Check HTTP Path: /heartbeat
Expected HTTP Body: Server OK
Configuration for the 3 web nodes: using private IPs, Weight: 100, Mode: Accept
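For debugging, the balancer's health check above can be mimicked from any machine that can reach a backend. A minimal sketch in Python; the URL is an assumption, substitute a node's private IP:

```python
import re
import urllib.request

# Hypothetical backend address; replace with a real node's private IP.
HEARTBEAT_URL = "http://192.168.166.41/heartbeat"
EXPECTED_BODY = re.compile(r"Server OK")

def body_matches(body: str, pattern: re.Pattern = EXPECTED_BODY) -> bool:
    """Mimic the 'HTTP Body Regex' check: pass if the expected
    pattern appears anywhere in the response body."""
    return pattern.search(body) is not None

def check_backend(url: str = HEARTBEAT_URL, timeout: float = 5.0) -> bool:
    """Fetch /heartbeat with the same 5s limit as 'Check Timeout'
    and apply the body regex to the result."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status == 200 and body_matches(resp.read().decode())
```

Running `check_backend()` against each node in a loop (every 15 seconds, matching the check interval) would show whether a backend ever fails the same test the balancer runs.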
3 Web servers - Linode 512mb (Fremont), Ubuntu 10.04, all configured identically:
Nginx (config below), 4 workers
Gunicorn (Python app server, like unicorn or mongrel), running 4 workers
Django 1.3 app
1 DB server - Linode 512mb (Fremont), Ubuntu 10.04:
MySQL
Runs celery (task queue) for our app
And we're using a CDN for (most) static media files.
So, this problem has me stumped. Since NodeBalancer is a managed service, I can't check logs or run any diagnostics there (and since my network-fu / unix-fu is rather weak, I wouldn't know what to look for anyway), so I was hoping someone might have a clue or suggestion to get me started in the right direction.
Is there some extra logging I can turn on, to give me more clues when it occurs again?
I could try to build another Linode with a manually configured nginx setup as a balancer, to act as a control, but this would be the first time I'd be doing so - I'd also rather hoped to leverage the fact that Linode would be better at setting up a balancer than me.
I have load-tests which I am in the middle of updating, so my next step is probably to check if the load-tests can trigger the issue. I'm rather concerned I'm not able to simulate the full range of interactions users have with the system, and may likely not trigger the issue, but it's all I have right now.
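Since a 408 means the client took too long to finish sending its request, one thing a load-test can do is deliberately behave like a slow client. A sketch (host, port, and delay are assumptions to adjust; this is a probe, not part of the original test suite):

```python
import socket
import time

def build_partial_request(host: str) -> bytes:
    """First half of an HTTP request: request line and Host header,
    but no terminating blank line, so the server keeps waiting."""
    return (f"GET / HTTP/1.1\r\nHost: {host}\r\n").encode()

def parse_status(raw: bytes) -> int:
    """Extract the numeric status code from a raw HTTP response."""
    return int(raw.split(b" ", 2)[1])

def slow_request(host: str, port: int = 80, delay: float = 15.0) -> int:
    """Open a connection, send only part of the request, stall for
    `delay` seconds, then finish it. A balancer with a short
    client-header timeout should answer 408 instead of proxying."""
    with socket.create_connection((host, port), timeout=delay + 30) as s:
        s.sendall(build_partial_request(host))
        time.sleep(delay)                        # stall mid-request
        s.sendall(b"Connection: close\r\n\r\n")  # complete the request
        return parse_status(s.recv(4096))
```

If `slow_request(...)` with a long delay reliably returns 408 while a short delay returns a normal response, that would point at a client-side timeout somewhere in front of the app servers rather than at the app itself.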
Thanks all, I really appreciate any advice you can give me.
Cheers,
-asavoy
/etc/nginx/nginx.conf
user www-data;
worker_processes 4;

error_log /var/log/nginx/error.log;

events {
    worker_connections 1024;
    # multi_accept on;
}

http {
    include /etc/nginx/mime.types;

    access_log /var/log/nginx/access.log;

    sendfile on;
    #tcp_nopush on;

    #keepalive_timeout 0;
    keepalive_timeout 65;
    tcp_nodelay on;

    gzip on;
    gzip_http_version 1.1;
    gzip_vary on;
    gzip_comp_level 6;
    gzip_types text/plain text/css application/json text/javascript application/x-javascript application/xml;
    gzip_disable "MSIE [1-6]\.(?!.*SV1)";

    include /etc/nginx/sites-enabled/*;
}
/etc/nginx/sites-enabled/example.com
server {
    listen 80;
    server_name example.com 192.168.166.41 "";
    root /var/www/example.com;

    access_log /var/www/example.com/log/nginx_access.log;
    error_log /var/www/example.com/log/nginx_error.log;

    client_max_body_size 4G;
    keepalive_timeout 5;

    # Prevents proxied content from being written to temp files on disk;
    # should improve nginx speed and may help resolve 502 Bad Gateway
    # errors in some cases.
    proxy_buffers 32 16k;

    if ($host = 'www.example.com') {
        rewrite ^/(.*)$ http://example.com/$1 permanent;
    }

    location / {
        if (-f /var/www/example.com/maintenance/maintenance.html) {
            return 503;
        }

        location /favicon.ico {
            access_log off;
            empty_gif;
        }

        location /media/ {
            access_log off;
            alias /var/www/example.com/application/public/media/;
        }

        location /admin-media/ {
            access_log off;
            alias /var/www/example.com/application/public/admin-media/;
        }

        location /static/ {
            access_log off;
            alias /var/www/example.com/application/public/static/;
        }

        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header Host $http_host;
        proxy_redirect off;

        if (!-f $request_filename) {
            proxy_pass http://127.0.0.1:8000;
            break;
        }

        error_page 503 @maintenance;
        error_page 502 @error502;
        error_page 504 @error504;
    }

    # 503 Service unavailable
    location @maintenance {
        rewrite ^(.*)$ /maintenance/maintenance.html break;
    }

    # 502 Bad Gateway error
    location @error502 {
        rewrite ^(.*)$ /maintenance/error502.html break;
    }

    # 504 Gateway Timeout error
    location @error504 {
        rewrite ^(.*)$ /maintenance/error504.html break;
    }
}
9 Replies
> I could try to build another Linode with a manually configured nginx setup as a balancer, to act as a control, but this would be the first time I'd be doing so - I'd also rather hoped to leverage the fact that Linode would be better at setting up a balancer than me.
You basically already have this configured in nginx, which "load balances" as a reverse proxy against a single node (via proxy_pass). Just fire up a Linode with nginx and add the individual gunicorn nodes as upstreams to reverse-proxy to.
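The suggestion above can be sketched as an nginx config on a dedicated balancer node. Everything here is illustrative: the file path, the private IPs, and the 90s timeouts are assumptions, not the poster's actual setup:

```nginx
# /etc/nginx/sites-enabled/balancer  (hypothetical path)

upstream app_cluster {
    # Private IPs of the three nginx+gunicorn nodes (example addresses)
    server 192.168.166.41;
    server 192.168.166.42;
    server 192.168.166.43;
}

server {
    listen 80;
    server_name example.com;

    # Generous client timeouts, to rule out premature 408s at the balancer
    client_header_timeout 90s;
    client_body_timeout   90s;

    location / {
        proxy_pass http://app_cluster;
        proxy_set_header Host $http_host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```

By default nginx balances an `upstream` block round-robin, matching the NodeBalancer algorithm in use here.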
We are experiencing exactly the same problem.
We have a Node Balancer in front of 2-5 application servers (Apache Tomcat). The 408 errors occur very occasionally, and are not related to service load.
Like you, we previously ran the service for 2 years on a single server, with no 408 errors.
Did you get to the bottom of the problem?
Ric Searle
–
Yellowbrick Tracking Ltd
Unfortunately, I never could get to the bottom of the problem: it happened far too sporadically, and to progress I really needed to have access to the node balancer server, beyond what's offered by the web interface.
Apart from this forum post, I didn't request support from Linode either - as I didn't feel I had enough information to prove their service was at fault, and it seemed strange to me that no-one else reported such an issue yet.
So I ended up rolling my own load balancer, using a $20 Linode running HAProxy. Turns out that NodeBalancer's configuration maps suspiciously closely to HAProxy's configuration.
The upside of roll-your-own: finer control, possibility to add SSL endpoint, and no more 408 errors since!
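A minimal HAProxy configuration mapping the NodeBalancer settings from the original post might look roughly like this. This is a sketch, not the poster's actual config: the IPs, server names, and the 90-second timeout are assumptions:

```haproxy
# /etc/haproxy/haproxy.cfg (sketch; IPs and names are examples)
defaults
    mode http
    timeout connect 5s
    timeout client  90s   # much larger than NodeBalancer's client timeout
    timeout server  90s

frontend www
    bind *:80
    default_backend app_cluster

backend app_cluster
    balance roundrobin                     # "Algorithm: Round Robin"
    option httpchk GET /heartbeat          # "Check HTTP Path"
    http-check expect string Server OK     # "Expected HTTP Body"
    server web1 192.168.166.41:80 check inter 15s   # "Check Interval: 15"
    server web2 192.168.166.42:80 check inter 15s
    server web3 192.168.166.43:80 check inter 15s
```

The `timeout client` setting is the one that governs how long HAProxy waits for a slow client before returning a 408.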
Hit me up if you have questions about how we set it up.
-Chris
I can certainly promise that it will not fix whatever the underlying issue external to your NodeBalancer is, whether that be on the network between your browser and it or something strange going on with your backends. I unfortunately do not have enough information to speculate what that might be.
There are two solutions to this issue:
- Raise the timeouts, which I have done.
- Switch to TCP mode instead of HTTP mode. In TCP mode, NodeBalancer's HTTP special-casing is turned off, and each incoming connection is merely seen as a connection instead of an HTTP request. Cookie handling is turned off, request and response validation is turned off, and, notably, the "wait for HTTP request" timeout is completely taken out of the picture. If you continue running into this 408 issue but cannot identify where the problem lies (it's not the NodeBalancer, I'm afraid), switching to TCP mode might be worth looking into.
At any rate, to receive the new timeouts, trigger a rebuild of your NodeBalancer's configuration; the easiest way to do this is to adjust your NodeBalancer's check interval or add/remove a backend. Feel free to file a ticket and request that this be done for your NodeBalancer, if you wish (I'll take care of it).
Hoopy is right, by the way; NodeBalancer is built upon a cluster of machines (more than two), and you are getting far more than $20/month of value out of each one. A single Linode is indeed a single point of failure with numerous opportunities to take your Web site offline. While Linodes are themselves reliable, two or more Linodes are far more reliable for high-availability purposes. If your Web site is uptime-sensitive, I would definitely prefer the value of a NodeBalancer over creating a single point of failure that I have to administer myself. That opinion isn't due to me working here; I'd say the same thing if you were built out at a competitor with a similar product. It's simply far smarter to let us administer it for you.
-Jed
Thanks,
-Chris
Given the timeout of 10 seconds, it's then possible that the 408s were triggered by distance or flaky connections (Fremont DC to Australia and the Middle East, both getting 408s), and this would explain why our custom balancer doesn't have the same issue, since we're using a much larger 90-second timeout.
Didn't know that Node Balancer was hosted on a HA cluster, that's nice, and you guys should make that clearer in the Node Balancer docs. That's quite important to know when weighing up against a DIY solution.
Do you have any suggestions for handling HTTPS with Node Balancer? I'd like to move back but I don't want to migrate again without understanding our options there a bit better.
@asavoy:
Do you have any suggestions for handling HTTPS with Node Balancer? I'd like to move back but I don't want to migrate again without understanding our options there a bit better.
In TCP mode, data will be passed through without inspection or manipulation. If your backends serve SSL, it will be passed through with no issue; you'd probably want the same certificate on every backend in this scenario.
This prevents you from using cookie stickiness and such, since NodeBalancer cannot peek inside an intact SSL stream. However, many customers use NodeBalancer to deliver SSL in this way.
-Jed
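The TCP-mode SSL passthrough Jed describes would terminate SSL on each backend rather than on the balancer. On the nginx side of each backend that might look like the sketch below; the certificate paths are hypothetical, and each of the three nodes would carry the same certificate:

```nginx
# On each backend node (sketch; cert paths are examples)
server {
    listen 443 ssl;
    server_name example.com;

    # The same certificate/key pair deployed to all three backends
    ssl_certificate     /etc/ssl/example.com.crt;
    ssl_certificate_key /etc/ssl/example.com.key;

    location / {
        proxy_pass http://127.0.0.1:8000;
        proxy_set_header Host $http_host;
        # Let Django know the original request was HTTPS
        proxy_set_header X-Forwarded-Proto https;
    }
}
```

The trade-off, as noted above, is that the balancer can no longer see inside the stream, so HTTP-level features like cookie stickiness are unavailable in this mode.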