Custom kernel 2.6.35-rc3 and issues with FastCGI/Django?
I'm currently bouncing between kernel versions because over the last few weeks, I haven't found one that is issue-free for me.
2.6.33-linode24: The best of the bunch, but twice now the time has "frozen", stopping all CPU timers, cronjobs, etc. This may be fixed now thanks to some advice from the Linode staff (relates to Xen and clocksources), but it's very hard to tell because it only happened somewhat randomly, and only after 10-15 days of uptime. Probably the most frustrating kind of bug – intermittent, non-reproducible, and totally fatal! (Nevertheless, this is the kernel that I'm sticking with at the moment.)
2.6.32.12-linode25: Every few hours -- seemingly randomly and uncorrelated to cronjobs or external load -- we'd see a huge spike in load average up to 30-40 or so, just for a few seconds, but enough to set off our monitoring software and for the server to be barely responsive to other tasks for 30 to 60 seconds or so. No noticeable spikes on any of the Linode graphs during these events.
2.6.35-rc3: My latest attempt was to compile a custom kernel, stock from kernel.org, using the .config file from "Running Custom Kernels with PV-GRUB".
Everything seems to run great on 2.6.35-rc3, including the web server (lighttpd), database (mysql), e-mail (qmail), asterisk, etc… with one major exception: my Python Django FastCGI processes will run, but will only seem to take one or two requests from lighttpd. After that, lighttpd continues to try to pass them requests (via tcp localhost:3303), but there's no answer. In the lighttpd logs, I get:
"establishing connection failed: Connection timed out socket: tcp:127.0.0.1:3303"
The python processes continue running and don't seem to use any CPU.
Since the userland was exactly the same, and only the kernel was changing, my only thought so far was that it might be firewall related (arno-iptables-firewall). However, I tried disabling the firewall entirely, but still had identical results!
Any ideas / clues? Hoping to make the custom kernel work. Why would this one piece out of everything be affected so dramatically by a new kernel? Thanks in advance!
Mike
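One way to narrow down the lighttpd timeout described above is to check whether anything is still listening on the FastCGI port. The following is only a minimal sketch, not something from the original post: "is_listening" is a hypothetical helper name, and it assumes the usual Linux /proc/net/tcp layout (local address and port in hex in the second column, state 0A meaning LISTEN). It only looks at IPv4 sockets.

```shell
#!/bin/sh
# Report whether any IPv4 socket is in LISTEN state on a TCP port.
# $1 = port number; $2 = socket table to read (defaults to /proc/net/tcp).
# Hypothetical helper for illustration; not part of lighttpd or Django.
is_listening() {
    port_hex=$(printf '%04X' "$1")
    awk -v port="$port_hex" '
        NR > 1 {
            # Field 2 is the local "ADDR:PORT" in hex; field 4 is the state.
            n = split($2, parts, ":")
            if (parts[n] == port && $4 == "0A") found = 1
        }
        END { exit(found ? 0 : 1) }
    ' "${2:-/proc/net/tcp}"
}

# Example: is the FastCGI backend from the lighttpd log still listening?
# is_listening 3303 && echo "listening" || echo "not listening"
```

If the port still shows as listening while lighttpd times out, that may suggest the Python processes are holding the socket open but no longer accepting connections, rather than the firewall dropping traffic.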
43 Replies
Re: 2.6.32.12-linode25: I had a different issue. For me, nginx under this kernel had page faults.
So I've gone the custom kernel route: I'm using Ubuntu 9.10 with the 2.6.31-307-ec2 kernel and I have no problems.
What distro are you using?
Thanks for the reply – very interesting to hear that you've experienced similar issues with those two kernels (2.6.33-linode24 and 2.6.32.12-linode25). I thought I was going nuts!
Still surprised to see that there are such issues that are so dependent on the kernel version.
I'm using Debian stable (lenny).
I'll have to try a handful of different versions of custom kernels and see if it's specific to this 2.6.35-rc3.
Thanks again,
Mike
-Chris
@caker:
OK - 2.6.34-linode26 and 2.6.34-x86_64-linode13 are out there. Test away.
-Chris
Sounds good. I'm going to be unavailable for a few days; compumike, if you get a chance to test it, I'd be interested in your results.
Thanks for your quick replies! Currently spending a few hundred bucks trying out reddit.com ads today, so this isn't a good time to experiment, but my plan is to give it a shot early tomorrow. Will write in and let you know how it goes.
Just curious – when you built the kernel, did you do anything other than get the stock kernel from kernel.org, copy in a .config from one of your recent linode kernels (like 2.6.32.12-linode25), run "make oldconfig" and answer with the defaults, and then build? Any patches to apply or special config options? Just trying to track down my issues with 2.6.35-rc3.
(Also noticed that
(One more "quickie" --
Thanks again,
Mike
I'll push up the tarballs once I know the kernels will stick around for more than a few days. For now, zcat /proc/config.gz
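For anyone who wants to reuse the running kernel's configuration as a starting point, the zcat approach above can be wrapped up roughly as follows. This is a sketch under assumptions: "seed_kernel_config" is a made-up helper name, the kernel source directory is whatever you unpacked from kernel.org, and /proc/config.gz only exists when the running kernel was built with that option enabled.

```shell
#!/bin/sh
# Copy a gzip-compressed kernel config into a source tree as .config.
# $1 = kernel source directory; $2 = compressed config (defaults to /proc/config.gz).
# Hypothetical helper for illustration only.
seed_kernel_config() {
    src_dir=$1
    config_gz=${2:-/proc/config.gz}
    [ -d "$src_dir" ] || { echo "no such directory: $src_dir" >&2; return 1; }
    [ -r "$config_gz" ] || { echo "cannot read: $config_gz" >&2; return 1; }
    # zcat writes the uncompressed config to stdout.
    zcat < "$config_gz" > "$src_dir/.config"
}

# Typical follow-up inside the source tree (not run here):
#   make oldconfig   # asks only about options missing from the seeded .config
#   make
```

Whether the resulting kernel behaves like the Linode-built one is a separate question, since any patches applied to the official kernels would not be captured by the config alone.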
I fixed the /irc/ folder permissions on one of our loadbalancers. Thanks for the heads-up.
-Chris
Just booted with the 2.6.34-linode26 kernel (32-bit) and everything seems to be fine! (All userspace services working properly. No issue with the lighttpd/django fastcgi intercommunication, even when stressed via "ab".)
It is yet to be seen whether the issues I experienced with 2.6.33-linode24 (with CPU timers stopping) and with 2.6.32.12-linode25 (with random load average spikes) will repeat themselves, as those seemed to happen over the course of weeks and hours respectively. But for at least 15 minutes of uptime, I can safely say that it's actually working, which was definitely not the case for my attempt to build 2.6.35-rc3.
Will watch it carefully this weekend and report back if there are any issues.
Thanks!
Mike
At this point I would have expected to experience the 2.6.32.12-linode25 load average spike issue, so it's very good news that I haven't.
Mike
Now at 3 days, 18 hours of uptime, and it's been the most rock-solid I've experienced. Finally had a nice quiet weekend without issues – seriously, this has made a tremendous change.
Still will have to see if it has the same once-every-few-weeks "timer stopped" fatal issues as with 2.6.33-linode24, but I'm hoping not!
I recommend trying 2.6.34-linode26 if you are having issues with one of the other kernels similar to those I've described earlier in this thread.
Thanks!
Mike
Good to know. Mine is still OK at the moment. Hopefully we can try to track this down! Let's collect some basic information that might be helpful to those more knowledgeable about the clock / timer freezing issue. Here are some questions I thought might be important, as well as my personal answers.
stefantalpalaru and obs, and anyone else who has had the clock freeze issue, can you both respond to these questions so we can make sure we're on the same page?
1) Can you confirm that the nature of the clock freeze is identical to what I described in my first post? (System is still reachable and serves web pages, but cronjobs don't run, load average statistics don't update, problems with interactive ssh sessions, non-working lish console. Running "ssh me@mylinode uptime" shows some fields that do update (number of users, amount of uptime) and some fields that don't (current time, load averages). Running "ssh me@mylinode date" still updates with the correct time.)
2) What distribution are you running? (I'm on Debian lenny / stable)
3) What Linode plan? (I'm on the Linode 1024, which was 720 when I had my two clock freezes)
4) What kind of load was that kernel seeing? (Based on Linode graphs, I'm typically around 4-5% CPU, sometimes bursting to maybe 110% for some batch jobs. I use almost zero swap space and make a serious effort to keep everything in RAM. This includes vm.swappiness=0, use of memcached, and making strategic use of "tmpfs" RAM-backed filesystems for certain parts of my application.)
5) Are you running ntpd? (I have seen clock freezes on 2.6.33-linode24 both with and without ntpd, but just for the record…)
6) Have you had this issue with other kernels too? (So far, I've only experienced it with 2.6.33-linode24 -- not yet with 2.6.34-linode26.)
7) How much uptime did the box have before the clocks froze? (I had roughly 10-15 days uptime on both occurrences.)
8) Have you been able to make a correlation / guess as to whether the issue occurs with high CPU usage, high IO usage, high network usage, etc? Any unusual log messages from those incidents? (I have not been able to find anything that I thought might be related.)
9) Do you have any way to quickly / controllably reproduce the clock freeze? (Unfortunately I don't.)
10) What datacenter are you in? (I'm in Newark)
Mike
@stefantalpalaru:
2.6.34-linode26 has frozen the clock for me on 2 different linodes.
Hrmm, well, that's not good. Maybe I didn't test for long enough. Have you raised a ticket with support?
1. the system responds to ping and I can ssh into it, but can't input anything in the interactive session. The time I see in the shell prompt is way off. The web server doesn't work, CPU usage is at 100%.
2. Debian unstable and Gentoo ~x86
3. Linode 1024 and Linode 4096
4. very low CPU usage. see the munin graphs:
5. yes, ntpd runs on both linodes
6. yes, all the paravirt kernels I've tried, with varying periods of time between clock freezes (some of them lasted for more than a month). But I need the latest DRBD version so I keep trying to stabilize it. Most of the kernels had custom configs (booted with PV_GRUB) so I was pretty much on my own, but now I see the same problem with the official config.
7. last uptimes: 5 and 2 days
8. no, but I suspect it's all triggered by the clock as presented by Xen
9. no
10. Dallas and Newark
1) Yes, the web server still serves pages (but not for long, since the firewall detects a SYN flood and blocks connections). SSH, however, doesn't work: it locks up if the session is already connected, and hangs on connection if not.
2) Ubuntu 9.10 32 bit.
3) 512 which was 360 when the freeze happened.
4) Not sure since it was a long time ago, but if I was to hazard a guess probably around 5-10%.
5) Yes - logs didn't show NTPD trying to change the time if memory serves.
6) No (I'm currently using pv_grub with Ubuntu's ec2 kernel)
7) 12-24 hours at most, the server is in use almost 24/7 and people tended to yell at me as soon as it locked up
8) Absolutely nothing. I delved into my logs and could find diddly squat; it seemed completely random.
9) Again no sadly.
10) Dallas
Two days ago you said, in reference to 2.6.34-linode26:
> It's been running for 2 days no issues! This kernel's a keeper!
Is that box still running – now up to 4 days / 96 hours? If so, that seems like a significant departure from your
> 12-24 hours at most
description about earlier kernels.
Mike
Can anyone suggest useful things to try to record in the event that it does freeze again (before rebooting)? Catting particular files within /proc perhaps?
With great speculation: this may be load triggered in some way, either as a cumulative load in some kernel variable that isn't getting reset properly, or as an instantaneous load that causes some virtual interrupt to get missed or something like that. Alternatively, it may be host-triggered.
However, the fact that Obs suggests that it happens rather consistently on a 12-24 hour time span is really interesting. Obs, does this mean you're still manually rebooting it every 12-24 hours at this point? When you made a clone to test with the new 2.6.34-linode26 kernel for two days, was that clone taking any of your client load, or was it unloaded?
Since mine only occurs on the time period of weeks on the 2.6.33-linode24 kernel, it's just about impossible for me to do any testing. But if I had a setup that I knew would lock up within a period of minutes or hours, then I think we could really get to the bottom of this.
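On the question of what to record before rebooting: one option is to copy a handful of kernel state files somewhere safe while the box is still reachable. The sketch below is a guess at a reasonable starting set, not a known-good diagnostic recipe. "snapshot_state" is a hypothetical helper name, and the default file list is an assumption (for example, /proc/timer_list may be missing or root-only depending on the kernel).

```shell
#!/bin/sh
# Save copies of a few kernel state files into a timestamped directory.
# $1 = output base directory; remaining args = files to capture.
# The default list is an assumption about what might be relevant.
snapshot_state() {
    out_base=$1
    shift
    [ "$#" -gt 0 ] || set -- /proc/uptime /proc/loadavg /proc/interrupts \
        /proc/timer_list \
        /sys/devices/system/clocksource/clocksource0/current_clocksource
    # If the clock really is frozen, repeated runs may reuse the same name.
    out_dir="$out_base/freeze-$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$out_dir" || return 1
    for f in "$@"; do
        # Flatten the path into a file name, e.g. /proc/uptime -> proc_uptime.
        name=$(printf '%s' "$f" | sed 's|^/||; s|/|_|g')
        cat "$f" > "$out_dir/$name" 2>/dev/null || echo "unreadable: $f" >&2
    done
    echo "$out_dir"
}

# Example, run over a non-interactive ssh command since interactive sessions hang:
# snapshot_state /root/freeze-logs
```

Taking one snapshot on a healthy system and one during a freeze would at least give something to diff.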
The clone was under load, I asked my users to use the server as normal and they did.
One thing I'm pretty sure it's not related to is network load, since my backups are sent to S3 storage every night around the same time and it never crashed during that. It really was quite random.
I can tell you what runs on the box.
Nginx takes the brunt of the web serving (static files) and passes PHP requests back to Apache. MySQL is running as the database, and there's also an nsd DNS server running. The usual system utilities (logrotate, munin, etc.) are running.
Now, if memory serves, the last lockup was around 4:37pm. With it being such an odd time, no cron jobs or backups were running. I checked my nginx, mysql, munin etc. logs from when it happened and couldn't spot anything unusual.
I'll add a few additional notes about my setup on the off chance it helps get to the bottom of this as there's nothing obvious in my logs to say what happened:
* Webserver still serves pages for a short period. SSH lets me connect but doesn't allow me to issue commands.
* OS is Ubuntu 10.04 32-bit running the 2.6.34 kernel
* Linode is a 512 (formerly a 360)
* Load on the server is very low
* NTPD didn't try to correct the time as far as I can tell
* The linode is located in Newark
Did you have any cron jobs running? Can you provide a timestamp from when it happened? Maybe the linode guys can check the host for any weirdness at that time.
echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource
-Chris
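For those trying the command above, a slightly more defensive version checks that the kernel actually offers the clocksource before switching. This is just a sketch: "set_clocksource" is a hypothetical helper name, and it assumes the standard sysfs files "available_clocksource" and "current_clocksource" sit next to each other in the directory from the command above. Writing to them needs root.

```shell
#!/bin/sh
# Switch the kernel clocksource, but only if the requested one is offered.
# $1 = clocksource name (e.g. tsc)
# $2 = sysfs directory (defaults to the path used in this thread).
# Hypothetical helper for illustration only.
set_clocksource() {
    want=$1
    dir=${2:-/sys/devices/system/clocksource/clocksource0}
    available=$(cat "$dir/available_clocksource") || return 1
    case " $available " in
        *" $want "*) ;;
        *) echo "clocksource '$want' not available (have: $available)" >&2
           return 1 ;;
    esac
    echo "$want" > "$dir/current_clocksource" || return 1
    # Print what the kernel reports after the write.
    cat "$dir/current_clocksource"
}

# Example (as root):
# set_clocksource tsc
```

The change made this way does not survive a reboot, so it would need to be reapplied at boot (for instance from a startup script) if it turns out to help.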
I am at 11 days, 12 hours uptime (and still have not had any issue with 2.6.34-linode26).
I'm guessing that you have a good reason to believe that tsc is a winner, so I have now switched to the tsc clocksource (previously was xen) without rebooting. Ntpd appears to be maintaining time fine after the switch.
(Of course, I expect the more useful testing results to come from the other users who have had this issue with greater frequency – looking forward to seeing your test results!)
Mike
I've changed the clocksource and will report back if it crashes.
-Chris
That's with the change of clocksource, seems to be doing better. Anyone else had success/problems with tsc?
Good to hear that switching to tsc might be a solution!
21:29:09 up 5 days, 1:37, 1 user, load average: 0.08, 0.02, 0.01
I've never had uptime that long on that kernel before so I'd say tsc seems to fix it.
Now it's up to the linode guys to figure out why!
2.6.34-linode26, now up 32 days. (I had switched to the TSC clocksource on day 11 as per my earlier post in this thread.)
Summary for those just joining this thread:
use 2.6.34-linode26
set clocksource to tsc
seems to avoid this issue!
@compumike:
As what will hopefully be a conclusion to this thread:
2.6.34-linode26, now up 32 days. (I had switched to the TSC clocksource on day 11 as per my earlier post in this thread.)
Summary for those just joining this thread:
use 2.6.34-linode26
set clocksource to tsc
seems to avoid this issue!
2.6.34-linode27 solves another related issue, which might be even more stable for you.
Hard coding a specific kernel version in your config profile is just asking for trouble down the road.
-Chris
Is there any news on when one can expect the "Latest 2.6 Paravirt" entry to point to a recent 32bit kernel? I've seen that you're already up to 2.6.35 for the x86_64 kernels. One _could_ run this on a 32bit system as well if carefully deployed but I'd rather avoid that.
Any update would be great.
Thanks a lot,
matthew
My main and only concern is security. There is unfortunately too little official documentation/feedback from Linode about how they maintain their kernels wrt security leaks.
I could naturally fight myself through countless git backlogs, security reports and so on to see if their 2.6.32.16 based kernel is reasonably safe or if there are open issues. But I honestly lack the time and I don't think that would be my job to do as I am not the maintainer of those kernels.
Don't get me wrong, I love Linode and I think their service is simply fabulous. It's just that I would like to get more insight on their kernel maintenance and see where we stand wrt security. That's all.
So long,
matthew