Sigh! Another Scheduled Maintenance? Another Unreachable Server
So I wake up this morning to find that all my websites, across both my Linodes, are down. However, when I log in to Linode Manager, it shows both as running and there's nothing on the Service Status pages.
Given that two different servers suddenly became inaccessible from the network at the same time, though, I think it's safe to assume that Linode have been tinkering with their infrastructure again.
Anyway, I rebooted both servers and, thankfully, one of them is now accessible again with its sites up and running. However, the other one won't come back online. It's unreachable via SSH, and when I log into it via LISH, I get the following:
You are in emergency mode. After logging in, type "journalctl -xb" to view
system logs, "systemctl reboot" to reboot, "systemctl default" or "exit"
to boot into default mode.
Give root password for maintenance
(or press Control-D to continue):
I've tried following the troubleshooting guide at https://www.linode.com/community/questions/323/my-linode-is-unreachable-after-maintenance but I'm not getting any forr'arder.
Following the various suggestions given on that page results in the following:
cat /etc/network/interfaces
returns:
# /etc/network/interfaces
auto lo
iface lo inet loopback
source /etc/network/interfaces.d/*
auto eth0
allow-hotplug eth0
iface eth0 inet6 auto
iface eth0 inet static
address 85.90.245.13/24
gateway 85.90.245.1
ip a
gives me:
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether ae:f4:ad:ba:56:73 brd ff:ff:ff:ff:ff:ff
3: eth0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether f2:3c:91:df:ef:7a brd ff:ff:ff:ff:ff:ff
4: teql0: <NOARP> mtu 1500 qdisc noop state DOWN group default qlen 100
link/void
5: tunl0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1000
link/ipip 0.0.0.0 brd 0.0.0.0
6: gre0@NONE: <NOARP> mtu 1476 qdisc noop state DOWN group default qlen 1000
link/gre 0.0.0.0 brd 0.0.0.0
7: gretap0@NONE: <BROADCAST,MULTICAST> mtu 1476 qdisc noop state DOWN group default qlen 1000
link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
8: erspan0@NONE: <BROADCAST,MULTICAST> mtu 1464 qdisc noop state DOWN group default qlen 1000
link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
9: ip_vti0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1000
link/ipip 0.0.0.0 brd 0.0.0.0
10: ip6_vti0@NONE: <NOARP> mtu 1364 qdisc noop state DOWN group default qlen 1000
link/tunnel6 :: brd :: permaddr 5ae6:403b:a31b::
11: sit0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1000
link/sit 0.0.0.0 brd 0.0.0.0
12: ip6tnl0@NONE: <NOARP> mtu 1452 qdisc noop state DOWN group default qlen 1000
link/tunnel6 :: brd :: permaddr 3204:ee4b:803c::
13: ip6gre0@NONE: <NOARP> mtu 1448 qdisc noop state DOWN group default qlen 1000
link/gre6 :: brd :: permaddr a84:f5ed:6073::
ip r
doesn't return any output at all
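(For what it's worth, the interface itself can be brought up by hand with iproute2, bypassing ifupdown's hook scripts entirely, using the values from the interfaces file above. Obviously that wouldn't fix whatever is actually broken.)
ip link set eth0 up                      # bring the link up
ip addr add 85.90.245.13/24 dev eth0     # static address from /etc/network/interfaces
ip route add default via 85.90.245.1     # default gateway from the same file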
If I run sudo systemctl status networking.service -l --no-pager --full
I get this:
Sep 16 10:32:54 vostok systemd[1]: Starting Raise network interfaces...
Sep 16 10:32:54 vostok sudo[517]: root : PWD=/ ; USER=root ; COMMAND=/bin/sh -c /sbin/iptables-restore < /etc/iptabe
Sep 16 10:32:54 vostok sudo[517]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=0)
Sep 16 10:32:54 vostok ifup[519]: sh: 1: /sbin/iptables-restore: not found
Sep 16 10:32:54 vostok sudo[517]: pam_unix(sudo:session): session closed for user root
Sep 16 10:32:54 vostok ifup[515]: run-parts: /etc/network/if-pre-up.d/iptables exited with return code 127
Sep 16 10:32:54 vostok ifup[513]: ifup: pre-up script failed
Sep 16 10:32:54 vostok systemd[1]: networking.service: Main process exited, code=exited, status=1/FAILURE
Sep 16 10:32:54 vostok systemd[1]: networking.service: Failed with result 'exit-code'.
Sep 16 10:32:54 vostok systemd[1]: Failed to start Raise network interfaces.
So, the problem seems to lie with iptables. However, when I try the solution provided for that:
sudo mv /etc/network/if-up.d/iptables ~
it doesn't exist:
mv: cannot stat '/etc/network/if-up.d/iptables': No such file or director
If I try the next listed command anyway:
ifdown -a && ifup -a
I get:
sh: 1: /sbin/iptables-restore: not found
run-parts: /etc/network/if-pre-up.d/iptables exited with return code 127
ifup: pre-up script failed
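I'm guessing 'not found' means either the iptables package has gone walkabout or the binary has moved (on newer Debian it lives under /usr/sbin and is managed by update-alternatives). A few checks that should narrow it down, just a sketch assuming standard Debian packaging:
dpkg -l iptables                          # is the package still installed?
ls -l /sbin/iptables-restore /usr/sbin/iptables-restore
update-alternatives --display iptables    # where the alternatives links point, if they exist
apt-get install --reinstall iptables      # would put the binaries back, but needs working networking; chicken and egg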
So I'm kind of stuck now. Where do I go from here, in order to get my Linode networking up again?
EDIT: In case it's relevant, both my Linodes are hosted in Frankfurt and the problematic one is running Debian – Latest 64 bit (5.13.4-x86_64-linode146)
10 Replies
OK. So it looks like loads of services on the server are dead. I managed to get networking to come up [I think!] by:
ln -s /etc/alternatives/ip6tables-restore /sbin/iptables-restore
and then
ifup eth0
Now systemctl status networking.service -l --no-pager --full
says the network has come up properly:
Sep 16 22:07:12 vostok sudo[1025]: root : PWD=/ ; USER=root ; COMMAND=/bin/sh -c /sbin/iptables-restore < /etc/iptae
Sep 16 22:07:12 vostok sudo[1025]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=0)
Sep 16 22:07:12 vostok sudo[1025]: pam_unix(sudo:session): session closed for user root
Sep 16 22:07:12 vostok sudo[1032]: root : PWD=/ ; USER=root ; COMMAND=/bin/sh -c /sbin/iptables-restore < /etc/iptae
Sep 16 22:07:12 vostok sudo[1032]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=0)
Sep 16 22:07:12 vostok sudo[1032]: pam_unix(sudo:session): session closed for user root
Sep 16 22:07:12 vostok sudo[1064]: root : PWD=/ ; USER=root ; COMMAND=/bin/sh -c /sbin/iptables-restore < /etc/iptae
Sep 16 22:07:12 vostok sudo[1064]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=0)
Sep 16 22:07:12 vostok sudo[1064]: pam_unix(sudo:session): session closed for user root
Sep 16 22:07:12 vostok systemd[1]: Finished Raise network interfaces.
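(In hindsight that symlink is a bodge, and it points the IPv4 restore command at the IPv6 binary. Once networking is up, the cleaner fix would presumably be to put the packaged binaries and alternatives links back; something along these lines, assuming the standard Debian iptables package:)
rm /sbin/iptables-restore                 # remove my hand-made symlink
apt-get install --reinstall iptables      # restore the real binaries and alternatives links
update-alternatives --display iptables    # confirm iptables-restore resolves properly again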
However, I still can't bring my Caddy webserver up or start SSH to let me log in normally. If I do a systemctl start caddy.service
or systemctl start ssh.service
I just get dumped back to the:
You are in emergency mode. After logging in, type "journalctl -xb" to view
system logs, "systemctl reboot" to reboot, "systemctl default" or "exit"
to boot into default mode.
Give root password for maintenance
(or press Control-D to continue):
prompt each time. I don't know what the heck's going on here. This server has been quietly running unattended for months and I've not touched it in a similar amount of time. Then, suddenly, overnight, it [and my other Linode hosted in the same datacentre] goes down, and this one is completely wrecked, acting as though its services are broken and looking in the wrong places for their support files.
Anyone from Linode got anything to report on goings on at Frankfurt? It's a bit of a coincidence that both my Linodes in that datacentre suddenly died at the same time, when I've not logged into either, or changed anything on them in weeks, if not months.
So, every service is dead and trying to start any of them dumps me back to that annoying 'emergency mode' prompt:
systemctl list-units --type=service --state=dead
UNIT LOAD ACTIVE SUB DESCRIPTION
acpid.service loaded inactive dead ACPI event daemon
apparmor.service loaded inactive dead Load AppArmor pr>
apt-daily-upgrade.service loaded inactive dead Daily apt upgrad>
apt-daily.service loaded inactive dead Daily apt downlo>
atd.service loaded inactive dead Deferred executi>
● auditd.service not-found inactive dead auditd.service
bind9.service loaded inactive dead LSB: Start and s>
caddy.service loaded inactive dead Caddy
certbot.service loaded inactive dead Certbot
● clamav-daemon.service not-found inactive dead clamav-daemon.se>
clamav-freshclam.service loaded inactive dead ClamAV virus dat>
● connman.service not-found inactive dead connman.service
cron.service loaded inactive dead Regular backgrou>
dbus.service loaded inactive dead D-Bus System Mes>
● display-manager.service not-found inactive dead display-manager.>
dovecot.service loaded inactive dead Dovecot IMAP/POP>
e2scrub_all.service loaded inactive dead Online ext4 Meta>
e2scrub_reap.service loaded inactive dead Remove Stale Onl>
exim4.service loaded inactive dead LSB: exim Mail T>
fail2ban.service loaded inactive dead Fail2Ban Service
fancontrol.service loaded inactive dead fan speed regula>
● firewalld.service not-found inactive dead firewalld.service
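From what I've read, being dropped into emergency mode means emergency.target was reached because something early in the boot failed, very often a mount, and everything else simply never got started. So rather than staring at the dead units, these ought to show the actual trigger (I'm guessing at the usual systemd/journal setup here):
systemctl --failed                        # units that actually failed, not just ones that never started
journalctl -b -p err                      # errors from this boot only
systemctl status local-fs.target          # failed mounts are the classic cause of emergency mode
findmnt --verify                          # sanity-check the entries in /etc/fstab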
UPDATE: 17th Sept and it's the same scenario again. Both Linodes became inaccessible again sometime overnight, UK time. I rebooted both; one came back online, while the other is still inaccessible, with all services dead and the LISH console constantly booting me out and back into 'emergency mode' whenever I try to start any services.
Have opened a ticket: #16241056
Let's see if Linode can shed any light on this!
EDIT: This is what happens when I'm logged in via LISH. All the vital services are down, but if I try to start any of them, I just get dumped back to the 'emergency mode' prompt and have to log in again. Extremely frustrating and tedious!
ᐅ service sshd status
● ssh.service - OpenBSD Secure Shell server
Loaded: loaded (/lib/systemd/system/ssh.service; enabled; vendor preset: e>
Active: inactive (dead)
Docs: man:sshd(8)
man:sshd_config(5)
ᐅ service sshd start
You are in emergency mode. After logging in, type "journalctl -xb" to view
system logs, "systemctl reboot" to reboot, "systemctl default" or "exit"
to boot into default mode.
Give root password for maintenance
(or press Control-D to continue):
@madra --
A couple of things…
I've been a Linode customer since 2013. Linode never, ever does scheduled maintenance or "tinkering with their infrastructure" without telling you first.
If it happens two days/nights in a row, it's something else…not "scheduled maintenance".
From the transcript of your lish session, the culprit is some part of your Linux configuration…not anything to do with your Linode itself. The fact that you got to "emergency mode" in the first place indicates the system booted fine. Unless there's something wrong with networking (which is extremely unlikely), Linode's responsibility ends there.
That's not to say that they won't try to help you out…but, be advised that Linode support folks do not have access to the guts of your distro.
Today's modern Linux distros are little more than a kernel and a collection of systemd(1) services. It looks like you're entering the ring for a cage match with systemd(1). I hope you've stocked up on your blood pressure meds…you're going to need them.
All that being said, there are some indications from the stuff you've posted that your firewall configuration may be the culprit. Can you turn off your firewall completely (for a while)?
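For example, a rough sketch, and I'm assuming your rules are only loaded by that if-pre-up.d hook you showed, not by a separate firewall service:
mv /etc/network/if-pre-up.d/iptables /root/   # take the hook out of the boot path
iptables -P INPUT ACCEPT                      # open the default policies back up
iptables -P FORWARD ACCEPT
iptables -P OUTPUT ACCEPT
iptables -F                                   # flush whatever rules did get loaded
(If the iptables binary itself is missing, skip the flush…there are no rules loaded anyway.)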
-- sw
From the transcript of your lish session, the culprit is some part of your Linux configuration…not anything to do with your Linode itself.
Why is everyone [including Linode support] ignoring the salient facts here? Let me state them again:
This happened to two different Linodes simultaneously
Before this downtime, I had not logged into either Linode for at least a couple of weeks, so I had not changed anything in their setup
The two Linodes are running different OSes [one on Debian, one on Ubuntu]
So how can a configuration error spontaneously manifest itself on two different Linodes, running two different OSes, at the same time, when both had previously been running for months without issue? And all of this happened WITHOUT ANY USER INPUT?
Anyway, some progress has been made, though how long it lasts is anybody's guess.
Searching online, I found a few threads in which people advised that being continually dumped into 'emergency mode' indicates that something has gone wrong with /etc/fstab, so that the drives are not being mounted properly at boot time.
I checked my /etc/fstab and found that the drives in it were identified by device name rather than by UUID. So I ran blkid, edited /etc/fstab to use UUIDs instead, and then ran mount -a. This time I got a message:
/var/opt/bin/s3fs: /usr/lib/i386-linux-gnu/libcurl.so.4: version `CURL_OPENSSL_3' not found (required by /var/opt/bin/s3fs)
/var/opt/bin/s3fs: /usr/lib/i386-linux-gnu/libcurl.so.4: version `CURL_OPENSSL_3' not found (required by /var/opt/bin/s3fs)
which relates to two Amazon S3FS drives that I also mount as storage on my Linode. I commented out those lines, ran mount -a again and, hey presto!, my Linode was back online and my websites were up again.
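For anyone else who lands here: the edit was just swapping the device names for the UUIDs that blkid reports, roughly like this (the UUID below is a made-up placeholder, not one of mine):
blkid /dev/sda
# /dev/sda: UUID="0example-uuid-from-blkid" TYPE="ext4"
# /etc/fstab, before:
#   /dev/sda    /    ext4    noatime,errors=remount-ro    0  1
# after:
UUID=0example-uuid-from-blkid    /    ext4    noatime,errors=remount-ro    0  1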
So, as I said, we'll have to see whether they stay up this time. But I'm still curious as to what went wrong here…
Did Linode replace some hardware, which broke my /etc/fstab since it wasn't using blkid UUIDs?… or
Did CURL_OPENSSL_3 [whatever that is] get broken somehow, thus breaking /etc/fstab as regards mounting the S3FS mountpoints?
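I still don't really know what CURL_OPENSSL_3 is, beyond being a versioned symbol that the s3fs binary expects libcurl to export. If I wanted to dig, something like this should show whether the installed libcurl changed out from under it (paths taken from the error message above; objdump needs binutils installed):
file /var/opt/bin/s3fs                                               # confirm it's a 32-bit (i386) binary, as the library path suggests
ldd /var/opt/bin/s3fs | grep -i curl                                 # which libcurl it actually links against
objdump -T /usr/lib/i386-linux-gnu/libcurl.so.4 | grep CURL_OPENSSL  # version symbols the installed libcurl provides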
Again to re-emphasise for about the 3rd time!
Before they both went down, I had not touched either of these Linodes in weeks. So whatever changed to break them [whether hardware or software] was not anything I did.
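One thing I can at least do is check whether anything changed on the box without me, assuming Debian's standard log locations:
zgrep -h " upgrade \| install " /var/log/dpkg.log* | tail -n 50   # recent package changes, if there were any
ls -l /var/log/apt/history.log*                                   # timestamps of any apt runs
last -Fa | head -n 20                                             # recent logins, to rule out anyone else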
Why is everyone [including Linode support] ignoring the salient facts here? Let me state them again:
I dunno…s**t happens… I was just trying to help you out. Ranting will only make you feel better. Frankly, it doesn't do anything for me…
Again to re-emphasise for about the 3rd time!
Before they both went down, I had not touched either of these Linodes in weeks. So whatever changed to break them
It doesn't have to be related to anything that "you did"… Do you have disgruntled ex-employees? Do you have cron(8) jobs to install distro updates automagically? Do you scan your security logs for unauthorized logins and take appropriate countermeasures?
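Some quick ways to answer those for yourself (I'm assuming a stock Debian layout here):
ls /etc/cron.d /etc/cron.daily /etc/cron.weekly    # anything installed that runs unattended?
systemctl list-timers --all                        # systemd timers, including the apt-daily ones
grep "Accepted" /var/log/auth.log | tail -n 20     # recent successful SSH logins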
-- sw
Ranting will only make you feel better. Frankly, it doesn't do anything for me…..
Why did you think I was ranting?
It's incredibly frustrating when you ask for support somewhere, taking the time to point out the relevant facts, and people respond in a way that suggests they didn't read past the subject line.
And I wasn't singling you out, either. I got similar 'advice' from Linode support, suggesting I run fsck on my Linode or update the kernel, when I had already emphasised to them that this problem arose simultaneously on two separate Linodes running two separate OSes, neither of which had been touched by me in weeks, if not months.
Do you have disgruntled ex-employees?
No. I work for myself
Do you have cron(8) jobs to install distro updates automagically?
Now that, at last, is a useful suggestion. However, I've not set up any such cron jobs, nor do I have unattended-upgrades configured.
Do you scan your security logs for unauthorized logins and take appropriate countermeasures?
Not yet. But I think unauthorised logins are highly unlikely, since I log in via public key and have root login and password login disabled.
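(For the record, this is how I'd confirm those settings are actually in effect; sshd -T prints the running daemon's effective configuration:)
sshd -T | grep -Ei "permitrootlogin|passwordauthentication|pubkeyauthentication"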
It's hard to say what caused your problem (if, in fact, you still have a problem)…or who the culprit is.
What I can say is that, in my many years of dealing with Linux, Linux as a platform for doing what I need/want to do has become increasingly unstable. It seems to get worse with every release of every "mainstream" distro. You can probably think of as many reasons as I can why this might be true. In my experience, 9 times out of 10, the culprit has been systemd(1).
To mitigate the fact that systemd(1) is a bloated, not-POSIX-compliant, inescapable POS, I've abandoned use of Linux altogether…everywhere in my personal computing infrastructure (even at home). Consequently, my BP is several points lower.
Maybe that's not the answer for you but you should maybe at least think about that. There are pros and cons of doing this of course so you'll have to weigh those yourself.
Linux is a one-man show (and it's a well-known fact that this guy has a temper)…and I think problems like yours really expose why this engineering model doesn't work.
-- sw
I think it's safe to assume that Linode have been tinkering with their infrastructure again.
The only two things that could happen on Linode's end to cause your VPSes to become unavailable would be:
The physical server itself has an emergency issue. A ticket is automatically opened in this case.
A serious network outage or DDoS attack which, again, would automatically generate an alert or ticket.
Keep in mind that their service is partially unmanaged, so you are responsible for configuration issues on your end.