408 Request Timed-out - Node Balancer -> 3x Nginx+Django

Hi guys,

I'm having a difficult time diagnosing an intermittent issue where our site users occasionally get a 408 Request Timed-out page. Our setup is a cluster of 3 nginx+gunicorn+django nodes with a Linode NodeBalancer in front. (More details below.)

My boss, who uses the site for several hours per day, sees it happen about once a day, but without any discernible consistency or pattern - it doesn't occur on the same page, at the same time, or on the same user action.

Our site isn't live yet, so our traffic is pretty much just some part-time testers - at most around 6 users at a time. I have load tests showing that a single server can handle up to 30 very active users, and Linode's graphs show that all servers had light loads at the time of the most recent incident (<10% CPU load, <100 blocks/sec IO) and that NodeBalancer considered all nodes to be UP. I didn't see anything interesting in the nginx or app logs at the time.

From what I've read, a 408 is returned when a client opens a connection to the web server but takes too long to finish sending its request. It may also happen if the socket closes early (though I don't see how the error page would be delivered at all in that case).

The strange thing is that the 408 error page apparently comes up immediately, without the user waiting at all - yet the relevant nginx settings, client_body_timeout and client_header_timeout, both default to 60s. Since the user isn't forced to wait anywhere near that long, I guess that suggests the socket is getting disconnected prematurely?
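For reference, these are the directives in question, shown here at their default values (we haven't overridden them anywhere):

client_header_timeout 60s;  # max time allowed for the client to send the complete request header
client_body_timeout 60s;    # max gap allowed between two successive reads of the request body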

Until 2 weeks ago, we had hosted the whole site on a single server for almost a year and never saw a 408 error; it only started after we migrated to the cluster we have now.

Our architecture:

NodeBalancer:

  • Port: 80

  • Protocol: HTTP

  • Algorithm: Round Robin

  • Session Stickiness: None

  • Health Check Type: HTTP Body Regex

  • Check Interval: 15

  • Check Timeout: 5

  • Check Attempts: 1

  • Check HTTP Path: /heartbeat

  • Expected HTTP Body: Server OK

  • Configuration for the 3 web nodes: private IPs, Weight: 100, Mode: Accept

3 Web servers - Linode 512MB (Fremont), Ubuntu 10.04, all configured identically:

  • Nginx (config below), 4 workers

  • Gunicorn (a Python app server, similar to Unicorn or Mongrel), running 4 workers

  • Django 1.3 app

1 DB server - Linode 512MB (Fremont), Ubuntu 10.04:

  • MySQL

  • Runs celery (task queue) for our app

And we're using a CDN for (most) static media files.

So, this problem has me stumped. Since NodeBalancer is a managed service, I can't check logs or run diagnostics there (and with my rather weak network-fu / unix-fu, I wouldn't know what to look for anyway), so I was hoping someone might have a clue or a suggestion to point me in the right direction.

  • Is there some extra logging I can turn on to give me more clues when it happens again? (See the sketch after this list for the kind of thing I mean.)

  • I could build another Linode with a manually configured nginx balancer, to act as a control, but it would be my first time doing so - and I'd rather hoped to leverage the fact that Linode would be better at setting up a balancer than I am.

  • I have load tests which I'm in the middle of updating, so my next step is probably to see whether they can trigger the issue. I'm concerned that I can't simulate the full range of interactions users have with the system, and so may well fail to trigger it, but it's all I have right now.
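On the first point, this is the kind of thing I have in mind - a sketch of an nginx log_format (it would go in the http block of nginx.conf) that appends per-request and upstream timings to the access log; the format name "timed" is just my placeholder:

# Combined-style access log with request/upstream timings appended.
log_format timed '$remote_addr - $remote_user [$time_local] "$request" '
                 '$status $body_bytes_sent "$http_referer" "$http_user_agent" '
                 'rt=$request_time urt=$upstream_response_time';

access_log /var/log/nginx/access.log timed;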

Thanks all, I really appreciate any advice you can give me.

Cheers,

-asavoy

/etc/nginx/nginx.conf

user www-data;
worker_processes 4;

error_log /var/log/nginx/error.log;

events {
    worker_connections  1024;
    # multi_accept on;
}

http {
    include       /etc/nginx/mime.types;

    access_log    /var/log/nginx/access.log;

    sendfile        on;
    #tcp_nopush     on;

    #keepalive_timeout  0;
    keepalive_timeout  65;
    tcp_nodelay        on;

    gzip  on;
    gzip_http_version  1.1;
    gzip_vary on;
    gzip_comp_level 6;
    gzip_types  text/plain text/css application/json text/javascript application/x-javascript application/xml;
    gzip_disable "MSIE [1-6]\.(?!.*SV1)";

    include /etc/nginx/sites-enabled/*;
}

/etc/nginx/sites-enabled/example.com

server {
    listen 80;
    server_name example.com 192.168.166.41 "";
    root /var/www/example.com;

    access_log /var/www/example.com/log/nginx_access.log;
    error_log  /var/www/example.com/log/nginx_error.log;

    client_max_body_size 4G;
    keepalive_timeout 5;

    # Larger proxy buffers reduce the chance of proxied content being
    # written to temp files on disk; should improve nginx speed and may
    # help resolve 502 Bad Gateway errors in some cases.
    proxy_buffers 32 16k;

    if ($host = 'www.example.com' ) {
        rewrite  ^/(.*)$  http://example.com/$1  permanent;
    }

    location / {
        if (-f /var/www/example.com/maintenance/maintenance.html) {
            return 503;
        }

        location /favicon.ico {
            access_log off;
            empty_gif;
        }

        location /media/ {
            access_log off;
            alias /var/www/example.com/application/public/media/;
        }

        location /admin-media/ {
            access_log off;
            alias /var/www/example.com/application/public/admin-media/;
        }

        location /static/ {
            access_log off;
            alias /var/www/example.com/application/public/static/;
        }

        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header Host $http_host;
        proxy_redirect off;

        if (!-f $request_filename) {
            proxy_pass http://127.0.0.1:8000;
            break;
        }

        error_page 503 @maintenance;
        error_page 502 @error502;
        error_page 504 @error504;
    }

    # 503 Service unavailable
    location @maintenance {
        rewrite ^(.*)$ /maintenance/maintenance.html break;
    }

    # 502 Bad Gateway error
    location @error502 {
        rewrite ^(.*)$ /maintenance/error502.html break;
    }

    # 504 Gateway Timeout error
    location @error504 {
        rewrite ^(.*)$ /maintenance/error504.html break;
    }

}

9 Replies

I wonder why you are getting a 408 and not a 504. Is it originating on the balancer or on its backend nginx nodes? What does netstat say - is the number of open sockets consistent with your config, or are there too many of them?

> I could try to build another Linode with a manually configured nginx setup as a balancer, to act as a control, but this would be the first time I'd be doing so - I'd also rather hoped to leverage the fact that Linode would be better at setting up a balancer than me.

You basically already have this configured in nginx, which "load balances" as a reverse proxy against a single node (via proxy_pass). Just fire up a Linode running nginx and add the individual gunicorn nodes as upstreams to reverse proxy to - see the sketch below the link.

http://wiki.nginx.org/LoadBalanceExample
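Something like this minimal sketch, say (the private IPs are hypothetical placeholders for your three web nodes; nginx's default round robin matches your current NodeBalancer algorithm):

upstream webnodes {
    # Hypothetical private IPs - substitute your three nginx+gunicorn nodes.
    server 192.168.166.41:80;
    server 192.168.166.42:80;
    server 192.168.166.43:80;
}

server {
    listen 80;

    location / {
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header Host $http_host;
        proxy_pass http://webnodes;  # round-robins across the upstream servers
    }
}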

Hi,

We are experiencing exactly the same problem.

We have a Node Balancer in front of 2-5 application servers (Apache Tomcat). The 408 errors occur very occasionally, and are not related to service load.

Like you, we previously ran the service for 2 years on a single server, with no 408 errors.

Did you get to the bottom of the problem?

Ric Searle

Yellowbrick Tracking Ltd

It's somewhat gratifying to learn that someone else is experiencing the same problem (and that perhaps I'm not crazy/doing something horribly wrong).

Unfortunately, I never did get to the bottom of it: it happened far too sporadically, and to make progress I really needed access to the node balancer itself, beyond what's offered through the web interface.

Apart from this forum post, I didn't request support from Linode either, as I didn't feel I had enough information to prove their service was at fault, and it seemed strange to me that no one else had reported such an issue yet.

So I ended up rolling my own node balancer, using a $20 Linode running HAProxy. It turns out that NodeBalancer's configuration maps suspiciously closely to HAProxy's :) - but it still took a significant amount of time to get it working right, time I had hoped a prepackaged solution like NodeBalancer would save me. Sorry, Linode folks, this one was a complete strike-out for me.

The upside of rolling your own: finer control, the option of adding an SSL endpoint, and no more 408 errors since! :D

Hit me up if you have questions about how we set it up.

Your solution is also a single point of failure, unless you meant "two" when you said "a". Roll-your-own high-availability starts at $40/mo.

Linode Staff

You guys really should open tickets for this stuff, so we know about it and can get fixes out. I believe we have a fix inbound for this.

-Chris

I have deployed a patch to NodeBalancer to raise timeouts across the board. The 408 you are seeing is, in fact, NodeBalancer getting tired of waiting for a request from the client. There is no other circumstance where that status will be emitted, and it is one of only two statuses that NodeBalancer will emit itself (the other being a 503 when all backends are down). This timeout was 10 seconds; I have raised it to 60 seconds in an attempt to alleviate this issue for the two of you.

What I can certainly promise is that it will not fix whatever underlying issue exists external to your NodeBalancer, whether that's somewhere on the network between your browser and it or something strange going on with your backends. Unfortunately, I do not have enough information to speculate about what that might be.

There are two solutions to this issue:

  • Raise the timeouts, which I have done.

  • Switch to TCP mode instead of HTTP mode. In TCP mode, NodeBalancer's HTTP special-casing is turned off, and each incoming connection is merely seen as a connection instead of an HTTP request. Cookie handling is turned off, request and response validation is turned off, and, notably, the "wait for HTTP request" timeout is taken completely out of the picture. If you continue running into this 408 issue but cannot identify where the problem lies (it's not the NodeBalancer, I'm afraid), switching to TCP mode might be worth looking into.

At any rate, to receive the new timeouts, trigger a rebuild of your NodeBalancer's configuration; the easiest way to do this is to adjust your NodeBalancer's check interval or add/remove a backend. Feel free to file a ticket and request that this be done for your NodeBalancer, if you wish (I'll take care of it).

Hoopy is right, by the way: NodeBalancer is built upon a cluster of machines (more than two), and you are getting far more than $20/month of value out of each one. A single Linode is indeed a single point of failure, with numerous opportunities to take your Web site offline. While Linodes are themselves reliable, two or more Linodes are far more reliable for high-availability purposes. If your Web site is uptime-sensitive, I would definitely prefer the value of a NodeBalancer over creating a single point of failure that I have to administer myself. That opinion isn't due to my working here - I'd say the same thing if you were built out at a competitor with a similar product. It's simply far smarter to let us administer it for you.

-Jed

Linode Staff

Thanks Jed - can you guys give this a shot and let us know if it helps fix the problem for you?

Thanks,

-Chris

Thanks guys for looking at this problem.

Given the 10-second timeout, it's quite possible the 408s were triggered by distance or flaky connections - our users hit the Fremont DC from Australia and the Middle East (both getting 408s) - and this would explain why our custom balancer doesn't have the same issue, since we use a much larger 90-second timeout.

I didn't know that NodeBalancer was hosted on an HA cluster - that's nice, and you guys should make it clearer in the NodeBalancer docs. It's quite important to know when weighing it up against a DIY solution.

Do you have any suggestions for handling HTTPS with Node Balancer? I'd like to move back but I don't want to migrate again without understanding our options there a bit better.

@asavoy:

> Do you have any suggestions for handling HTTPS with Node Balancer? I'd like to move back but I don't want to migrate again without understanding our options there a bit better.

In TCP mode, data will be passed through without inspection or manipulation. If your backends serve SSL, it will be passed through with no issue; you'd probably want the same certificate on every backend in this scenario.

This prevents you from using cookie stickiness and such, since NodeBalancer cannot peek inside an intact SSL stream. However, many customers use NodeBalancer to deliver SSL in this way.
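On the backend side, that just means each of your nginx nodes terminates SSL itself - roughly like this sketch (the certificate paths are placeholders; the same cert/key pair would be deployed to every node):

server {
    listen 443 ssl;
    server_name example.com;

    # Same certificate/key pair on every backend, since NodeBalancer in
    # TCP mode passes the encrypted stream through untouched.
    ssl_certificate     /etc/ssl/certs/example.com.crt;
    ssl_certificate_key /etc/ssl/private/example.com.key;

    location / {
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header Host $http_host;
        proxy_pass http://127.0.0.1:8000;
    }
}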

-Jed
