Rebooted and Now Can't Even Connect
You can see the dashboard graphs here; something clearly went wrong overnight:
The first thing I did was reboot the Linode, and that seems to have been completely the wrong move. Now I can't even connect to it over SSH anymore.
Does anyone have any ideas what I should do? I'm really at a loss now.
12 Replies
Lish
Trying to restart Apache is giving an error now that's causing it to fail. Reading more about it online now:
> (30)Read-only file system: apache2: could not open error log file /var/log/apache2/error.log.
The few results I've found seem to say it's something to do with filesystem errors, but I can't work out what could have changed to make this suddenly appear.
But indeed, it does sound filesystem-related.
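One way to confirm that, as a rough sketch: on Linux the kernel lists every mount and its options in /proc/mounts, and a filesystem that has been flipped to read-only shows "ro" in the options column. The function name and the sample text below are made up for illustration; on the real box you would feed it the contents of /proc/mounts instead.

```python
def read_only_mounts(mounts_text):
    """Return the mount points whose option list contains 'ro'.

    Expects text in /proc/mounts format:
    device mount_point fs_type options dump pass
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, options = fields[1], fields[3]
        if "ro" in options.split(","):
            result.append(mount_point)
    return result


if __name__ == "__main__":
    # Hypothetical sample; on a live system use open("/proc/mounts").read()
    sample = (
        "/dev/xvda / ext3 ro,relatime 0 0\n"
        "proc /proc proc rw,nosuid,nodev,noexec 0 0\n"
    )
    print(read_only_mounts(sample))
```

If "/" shows up in the result, the root filesystem really is mounted read-only, which would match the Apache error above.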
I may have things sorted temporarily.
The problem was that Ubuntu had remounted the filesystem read-only. The most common reason given for that happening seems to be that it detected a disk issue. Given the crazy stats in the graphs I posted, though, I'm guessing something on my server was the cause of it.
Running the "fsck" command was enough to fix it though.
With that said, it's only been fixed for about half an hour now. Will be watching those graphs to see if the issue comes back.
Does anyone know how I could find out what is causing the massive load? (if it does come back)
Basically if you run out of memory then your system starts to swap. This is normal. But if you're REALLY short of memory then it can swap a lot. So much that the system spends nearly all of the time swapping pages in/out. Response is almost zero, I/O activity is through the roof… it almost looks like the machine has crashed.
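If you want a quick number to watch, here is a minimal sketch that works out how much of your swap is in use from /proc/meminfo-style text. The function names and the sample figures are invented for the example. Keep in mind that high swap usage on its own is only a hint; what really marks swap hell is sustained swap-in/swap-out activity (the si/so columns in vmstat).

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: kilobytes}."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def swap_used_fraction(info):
    """Return the fraction of swap in use (0.0 if no swap is configured)."""
    total = info.get("SwapTotal", 0)
    if total == 0:
        return 0.0
    return (total - info.get("SwapFree", 0)) / total


if __name__ == "__main__":
    # Made-up figures; on a live system use open("/proc/meminfo").read()
    sample = (
        "MemTotal:       368640 kB\n"
        "MemFree:          4096 kB\n"
        "SwapTotal:      262144 kB\n"
        "SwapFree:        65536 kB\n"
    )
    print(swap_used_fraction(parse_meminfo(sample)))
```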
Now if, at this point, you told the control panel to reboot your machine, it would attempt a graceful shutdown. BUT if your Linode was in swap hell then it might not have been able to complete it, so the control panel may have switched to a more aggressive reboot method and effectively pressed the "Reset" button.
This is an unclean shutdown and can leave filesystems needing an fsck afterwards; the machine doesn't fully reboot and the only way of accessing it is via Lish.
If this is what happened then you need to look into why your machine started taking up so much memory. Are you running MySQL or similar? If so, check the dozens of threads here on how to ensure MySQL never explodes like this. Similarly, there are threads on how to tune Apache.
Basically, you just need to tune all your application processes so they can live happily in memory and not cause swap hell.
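As a back-of-the-envelope sketch of what "live happily in memory" means for Apache: take your total RAM, subtract what you want to leave for MySQL and the OS, and divide by the size of one Apache process. The result is a ceiling for Apache's MaxClients setting (prefork). All the numbers below are placeholders, not measurements from your box; measure your own per-process size with ps or top.

```python
def max_apache_workers(total_ram_mb, reserved_mb, per_worker_mb):
    """Estimate how many Apache workers fit in RAM without swapping.

    total_ram_mb  -- RAM on the machine
    reserved_mb   -- RAM set aside for MySQL, the OS, and everything else
    per_worker_mb -- resident size of a single Apache process
    """
    available = total_ram_mb - reserved_mb
    if available <= 0 or per_worker_mb <= 0:
        return 0
    return int(available // per_worker_mb)


if __name__ == "__main__":
    # Placeholder numbers: 360 MB total, 160 MB reserved, 25 MB per worker
    print(max_apache_workers(360, 160, 25))  # prints 8
```

If the estimate comes out lower than your current MaxClients, Apache alone can push the machine into swap under load.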
I'll take a look at the things you mentioned.
The strange part is that my sites are only running fairly standard scripts: WordPress, phpBB, and Coppermine Photo Gallery. I'll take a look at them all (and any mods/plugins especially) like you said though; hopefully I'll be able to avoid a repeat!
Thanks again for your detailed reply, really helps to get an understanding of what happened!
You might want to look at
The sites have hit the exact same issue again, but I haven't restarted the server this time.
What's the best way to handle this for now?
I have 2 active sites on this. If I disable one (just using a2dissite), will that be enough to stop it from causing any more trouble if that site is the culprit? Or would I need a different way of detecting it?
Update, looks like you were right about MySQL being the issue! Lish shows this as soon as I load it:
fsck from util-linux-ng 2.16
/dev/xvda: clean, 62959/770048 files, 2408634/3072000 blocks
FATAL: Module nf_conntrack_ftp not found.
FATAL: Module nf_nat_ftp not found.
FATAL: Module nf_conntrack_irc not found.
FATAL: Module nf_nat_irc not found.
* Setting preliminary keymap...
* Setting up console font and keymap...
* Stopping NTP server ntpd
* Starting OpenBSD Secure Shell server sshd
* Starting NTP server ntpd
* Starting MySQL database server mysqld ...done.
* Checking for corrupt, not cleanly closed and upgrade needing tables.
* Starting Postfix Mail Transport Agent postfix * Starting NTP server ntpd
* Starting web server apache2
Ubuntu 9.10 merlin hvc0
merlin login: Out of memory: kill process 2338 (mysqld_safe) score 224717 or a child
Killed process 2446 (mysqld)
Out of memory: kill process 2873 (apache2) score 197820 or a child
Killed process 6182 (apache2)
Out of memory: kill process 2873 (apache2) score 197787 or a child
Killed process 6464 (apache2)
Out of memory: kill process 2873 (apache2) score 93674 or a child
Killed process 6492 (apache2)
Out of memory: kill process 2338 (mysqld_safe) score 110602 or a child
Killed process 7010 (mysqld)
Out of memory: kill process 2873 (apache2) score 98892 or a child
Killed process 6649 (apache2)
Out of memory: kill process 2873 (apache2) score 98406 or a child
Killed process 8311 (apache2)
Option one is significantly easier than option two.
@Michael-Martin:
> Update, looks like you were right about MySQL being the issue! Lish shows this as soon as I load it:
Well, not necessarily. The Out of memory (OOM) killer doesn't necessarily kill the program that's exploding.
But given past experience in these forums it probably is a combination of Apache instances (too many?) and MySQL going mad (it normally is).
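To see who is actually getting killed, here is a small sketch that tallies the "Killed process" lines from console output like yours. The function name is made up; the sample lines are copied from the log you posted. Remember the caveat above: the count shows the victims, not necessarily the process that ate the memory.

```python
import re
from collections import Counter

# Matches kernel lines such as: "Killed process 2446 (mysqld)"
KILLED = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def tally_oom_kills(log_text):
    """Count how many times each process name was killed by the OOM killer."""
    return Counter(name for _pid, name in KILLED.findall(log_text))


if __name__ == "__main__":
    sample = (
        "Out of memory: kill process 2338 (mysqld_safe) score 224717 or a child\n"
        "Killed process 2446 (mysqld)\n"
        "Out of memory: kill process 2873 (apache2) score 197820 or a child\n"
        "Killed process 6182 (apache2)\n"
        "Out of memory: kill process 2873 (apache2) score 197787 or a child\n"
        "Killed process 6464 (apache2)\n"
    )
    for name, count in tally_oom_kills(sample).most_common():
        print(name, count)
```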
As for how to resolve the problem… depends on how urgent your needs are.
If you need "working web site now!" then upgrade to a bigger Linode (heck, even a 2880). Then work very hard on getting your footprint down to a reasonable size, and downgrade to the smallest Linode that meets your needs. Because of how Linode prorates usage, you'll get a credit back (not a refund) for the unused 2880 period and this can be used to pay for the smaller Linode.
(Umm, I think I'm right; I'm sure Linode staff will correct me if I've misstated the billing/refund policies.)
If this is still in the "I don't care if it's down" stage, then work on reducing the footprint and accept the outages.
I'm looking into optimizing things now while it's back up. Do you know of any good resources (online, or even books!) I should start with?
Thanks again for the help!