Why did my Linode go down hard today?

I have a Linode which went down hard and was restarted automatically by Lassie today, and I cannot explain why.

Linode Support says that there was no information in console logs regarding this (I hope they mean hypervisor logs and not Lish logs?) and that there’s nothing else they can do and I should ask about the incident here. So here I am!

This Linode was running normally until ~16:01:24, at which point all logs terminate abruptly with none of the usual termination messages. Then there was a Lassie-initiated boot, and the Linode was running again by 16:02:34.

Some things that Linode Support suggested, which are not possible:

  • A cron or script reboot. There is no cron task on this system which would reboot it (and that wouldn’t cause a hard reboot anyway).
  • A manual reboot. last shows that no users were logged in, and I am the only one with root access anyway (via sudo). Again, that would not cause a hard reboot.

Other things that I checked myself:

  • Authentication logs show no login attempts before the VM’s death, nor any uses of sudo.
  • System logs show no OOMs, segfaults, or other malfunctions or unusual activity prior to the VM’s death.
  • Lish’s console log doesn’t show a kernel panic or any error messages at all.
  • I received no notifications of any logins to the Linode Manager, and the job queue in the Linode Manager reports only the automatic restart, not a shutdown.
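
For anyone who wants to repeat the log check, here is a minimal Python sketch that finds the largest silence between consecutive entries in a traditional syslog-format file. It assumes each line starts with a `Mon DD HH:MM:SS` timestamp (which carries no year, so the year is passed in); the function names are my own, not part of any tool.

```python
from datetime import datetime


def parse_syslog_time(line, year):
    """Parse the 'Mon DD HH:MM:SS' prefix of a traditional syslog line.

    Returns None for lines that do not start with such a timestamp.
    """
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


def largest_gap(lines, year):
    """Return (last_before, first_after) around the longest silence, or None."""
    stamps = [t for t in (parse_syslog_time(l, year) for l in lines) if t is not None]
    best = None
    for prev, cur in zip(stamps, stamps[1:]):
        if best is None or cur - prev > best[1] - best[0]:
            best = (prev, cur)
    return best
```

Passing an open log file (for example `/var/log/syslog`) and the relevant year returns the two timestamps on either side of the longest gap. A hard reset should show up as a gap whose first entry after it is a kernel boot message, with no shutdown messages before it.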

After the system came back online, I noticed a significant amount of CPU steal (30–50%) and hardware interrupt time (10–20%). Linode Support says that this was because of a noisy neighbour, whom they moved, and that it was coincidental. However, this is the only thing that I can identify as being weird around the time of the VM’s spontaneous combustion.
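
For reference, steal and interrupt shares like these can be computed from two samples of the aggregate `cpu` line in `/proc/stat`. This is only a rough sketch, assuming the standard Linux field order (user, nice, system, idle, iowait, irq, softirq, steal); the helper names are made up for illustration.

```python
FIELDS = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]


def read_cpu_line(path="/proc/stat"):
    """Return the first line of /proc/stat, the aggregate 'cpu' counters."""
    with open(path) as f:
        return f.readline()


def cpu_shares(before, after):
    """Return each state's share (percent) of CPU time between two samples.

    Guest time is already counted inside 'user', so only the first eight
    counters are summed.
    """
    a = [int(x) for x in before.split()[1:9]]
    b = [int(x) for x in after.split()[1:9]]
    deltas = [y - x for x, y in zip(a, b)]
    total = sum(deltas)
    if total <= 0:
        return {name: 0.0 for name in FIELDS}
    return {name: 100.0 * d / total for name, d in zip(FIELDS, deltas)}
```

Calling `read_cpu_line()` twice a few seconds apart and passing both results to `cpu_shares` gives a dictionary whose `steal` and `irq` entries correspond to the figures that top reports as `st` and `hi`.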

Other recent changes include a migration to new hardware last week, and (at the same time) a kernel update from 4.17.17-x86_64-linode112 to 4.17.17-x86_64-linode116. The VM is running (32-bit) Debian oldstable with the default systemd init system.

The only other times this Linode has ever gone down hard were due to DC power failures.

So, what could have happened? What else can I look at that I haven’t looked at already? At this point I can only think of these possibilities:

  • A sysop accidentally killed the wrong instance when trying to deal with the noisy neighbour (this is fine, mistakes happen);
  • There is some serious problem with the hypervisor or hardware, possibly triggered by the noisy neighbour;
  • Something related to the L1TF mitigations;
  • Unknown unknowns.

Thanks in advance for your thoughts!

1 Reply

Hey there,

I know this is coming in a bit late, but I noticed something that may be related to why your Linode was hard restarted. You stated that the kernel you are running is 4.17.17-x86_64-linode116; however, you were running 4.17.17-x86_64-linode112 at the time of the incident. Both are 64-bit kernels running on your 32-bit OS.

It is possible that these 64-bit kernels were attempting to pull in more RAM than the 32-bit OS could handle, and that this is what caused the instability. You also mentioned that there was CPU steal on the host after you rebooted. It is possible that the contention increased the load and resource usage beyond what a 32-bit distribution would expect. Have you tried booting your Linode into a 32-bit kernel? Our latest 32-bit kernel is 4.17.17-x86-linode135.
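
If you want a quick way to confirm the mismatch from inside the Linode, something along these lines should work. It is only a sketch: it treats the Python interpreter's pointer size as a stand-in for the userland's bitness, the set of machine names is an assumption of mine, and `platform.machine()` can report a 32-bit name if the process runs under a 32-bit personality.

```python
import platform
import struct

# Machine names treated as 64-bit here (an illustrative, non-exhaustive set).
SIXTY_FOUR_BIT_MACHINES = {"x86_64", "amd64", "aarch64"}


def is_mixed_arch(kernel_machine, userland_bits):
    """True when a 64-bit kernel is paired with a 32-bit userland."""
    return kernel_machine.lower() in SIXTY_FOUR_BIT_MACHINES and userland_bits == 32


def current_state():
    """Return (kernel machine name, pointer size in bits of this interpreter)."""
    return platform.machine(), struct.calcsize("P") * 8
```

Running `is_mixed_arch(*current_state())` with the distribution's own Python would be expected to return `True` on a 32-bit install booted with an x86_64 kernel, and `False` once a 32-bit kernel is selected.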

Hope this helps!
