Any help with 2nd Kernel Panic After Migration?
We have had two kernel panics and/or CPU stalls (are they the same thing or different? I don't know) since our forced migration to a new server a couple of months ago. Before that, our system had been stable since the famous Christmas 2015 incident. Linode Support was unable to provide guidance, so I'm reaching out to the community in hopes of finding some answers.
Unfortunately, I'm a newbie on the server side of things, so I have no idea what could cause such issues. Can anyone offer some guidance on how to trace back what may have caused them, or otherwise help me prevent panics from happening again?
FWIW, we are running a basic LAMP stack with Apache 2.2, PHP 7.2, and MySQL 5.5 on CentOS 6.8.
6 Replies
A kernel panic can be caused by a multitude of factors. By definition, it is a safety precaution taken by your OS's kernel when it detects a fatal error it cannot recover from safely, or when it determines that continuing to run would risk major data loss.
A kernel panic typically happens during a boot job, which is why it can happen during a migration: in order to migrate your Linode, it needs to be shut down and then rebooted on the new host.
Most of the time, when you see a kernel panic in your console log, it will indicate what caused it. You can access your console log by using LISH. When you see the kernel panic listed in that console, it may look something like this (this is just one example of what you could find there):
Kernel panic - not syncing: No working init found.
You'll see that it gives you the cause of the kernel panic. The best resources are searching for the error online or reaching out to Support. We may not be able to fix it for you, but we can usually point you in the right direction.
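If you'd rather not watch the live console, you can also pull the stored console log over SSH. A rough sketch (the username, data center hostname, and Linode label below are placeholders, so substitute your own; the exact Lish prompt behavior can vary):

# Open a Lish session over SSH (example user, region, and Linode label)
ssh -t exampleUser@lish-newark.linode.com my-linode

# From the Lish prompt, logview prints the saved console output,
# which is usually where the kernel panic text ends up
logview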
Preventing a kernel panic is not an exact science, because they can be caused by so many factors. Some things you can do to defend against them are:
Make sure that your distribution and your applications are always up to date.
Run a periodic file system check from Rescue Mode.
Run periodic scans of your system for malware. We have a guide on using ClamAV; example commands are sketched below.
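On CentOS 6, that roughly translates to commands along these lines (a sketch only; exact ClamAV package names vary by EPEL release, and the scan paths are just examples):

# Keep the distribution and packages current (CentOS 6 uses yum)
yum update -y

# Install ClamAV from EPEL and scan a few likely directories
yum install -y epel-release clamav
freshclam                          # refresh virus definitions
clamscan -r -i /home /var/www      # recursive scan, list only infected files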
We have documentation in our library that pertains to your Linode's kernel that you might want to check out as well.
Unfortunately, when we went to the console, we just saw a lot of scrolling output that I'm unable to interpret. The cause was not visible and had likely scrolled away long before. Here is a screenshot: https://imgur.com/a/k4OdKHN.
We do keep things up to date as much as possible, including our kernel version and our associated software. I'll do a ClamAV scan if I'm able. Unfortunately, this is a production server, so doing a file system check in Rescue Mode isn't really possible until we move off the server.
Redacted; I figured out the answer to the question that was in this post. I still have no idea what is causing the issue or where to look. It seems no log file is written for these events.
I've tried installing kdump to track the crashes, per the directions at https://www.thegeekdiary.com/centos-rhel-6-how-to-configure-kdump/, but it does not seem to be running. When my Linode boots, does it look at the grub.conf file, or is there some other approach I should be using to set up kdump?
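For reference, the steps I followed from that guide boil down to roughly this (my paraphrase; the crashkernel size is just what the guide suggests, and grub.conf is the stock CentOS 6 GRUB Legacy path, which I don't know whether my Linode actually reads):

# Install the kdump tooling
yum install -y kexec-tools

# Reserve memory for the crash kernel by appending crashkernel= to the
# kernel line in /boot/grub/grub.conf, e.g.:
#   kernel /vmlinuz-2.6.32-642.el6.x86_64 ro root=/dev/xvda crashkernel=128M

# Enable the service and reboot so the memory reservation takes effect
chkconfig kdump on
reboot

# After the reboot, confirm it is actually running
service kdump status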
Uhh.. sorry if this is off base or stupid, but have you tried dmesg as well? You can pipe dmesg into less, or into something like pastebinit, if scrolling is the problem. Just what popped into my head when I saw this thread - don't shoot the messenger :)
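Something like this, just as a rough idea (pastebinit needs installing first; I think it comes from EPEL on CentOS, but double-check that):

# Page through the kernel ring buffer instead of watching it scroll
dmesg | less

# Or dump it to a file / paste service so you can read it later
dmesg > /tmp/dmesg-$(date +%F).txt
dmesg | pastebinit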
PS: There is also a description of a system recovery technique here, if it's of any use to you, but it's just a really basic outline of a procedure and you'd need to research the commands to use.
You might also have a /var/log/boot.log worth checking.
hth
Jake
Thanks Jake. Unfortunately, those tools only seem to show information from boot time. Our issue occurs after the server has been up for a couple of weeks, so after rebooting from the CPU stall (or whatever is happening), all of the debugging information is gone. I need something that can help us track down what is causing the CPU stall in the first place. I was hoping kdump would help with that the next time it happens.
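If I ever get kdump working, my understanding is that after the next crash I'd look for a dump under /var/crash and open it with the crash utility, something like this (the vmcore path is just an example of what kdump writes, and kernel-debuginfo comes from the debuginfo repo, which may need enabling first):

# See whether kdump captured anything after the crash
ls /var/crash/

# Open the dump with the crash utility (needs the matching kernel-debuginfo)
yum install -y crash kernel-debuginfo
crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux \
      /var/crash/127.0.0.1-2019-01-01-00:00:00/vmcore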
Unfortunately, we have absolutely no access to the server when this happens other than logging on to Weblish and watching the scrolling messages. The only way out is to reboot, and the reboot doesn't happen for approximately 5 minutes after pressing the button. I've heard this is because it first tries a clean reboot, and then finally gives up and essentially pulls the plug.