Sigh! Another Scheduled Maintenance? Another Unreachable Server
So I wake up this morning to find that all my websites, across both my Linodes, are down. However, when I log in to Linode Manager, it shows both as running and there's nothing on the Service Status pages.
Given that two different servers suddenly became inaccessible from the network at the same time, though, I think it's safe to assume that Linode have been tinkering with their infrastructure again.
Anyway, I rebooted both servers and, thankfully, one of them is now accessible again with its sites up and running. However, the other one won't come back online. It's unreachable via SSH, and when I log into it via LISH, I get the following:
You are in emergency mode. After logging in, type "journalctl -xb" to view
system logs, "systemctl reboot" to reboot, "systemctl default" or "exit"
to boot into default mode.
Give root password for maintenance
(or press Control-D to continue):
I've tried following the troubleshooting guide at https://www.linode.com/community/questions/323/my-linode-is-unreachable-after-maintenance but I'm not getting any forr'arder.
Following the various suggestions given on that page results in the following:
cat /etc/network/interfaces
returns:
# /etc/network/interfaces
auto lo
iface lo inet loopback
source /etc/network/interfaces.d/*
auto eth0
allow-hotplug eth0
iface eth0 inet6 auto
iface eth0 inet static
address 85.90.245.13/24
gateway 85.90.245.1
ip a
gives me:
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether ae:f4:ad:ba:56:73 brd ff:ff:ff:ff:ff:ff
3: eth0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether f2:3c:91:df:ef:7a brd ff:ff:ff:ff:ff:ff
4: teql0: <NOARP> mtu 1500 qdisc noop state DOWN group default qlen 100
link/void
5: tunl0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1000
link/ipip 0.0.0.0 brd 0.0.0.0
6: gre0@NONE: <NOARP> mtu 1476 qdisc noop state DOWN group default qlen 1000
link/gre 0.0.0.0 brd 0.0.0.0
7: gretap0@NONE: <BROADCAST,MULTICAST> mtu 1476 qdisc noop state DOWN group default qlen 1000
link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
8: erspan0@NONE: <BROADCAST,MULTICAST> mtu 1464 qdisc noop state DOWN group default qlen 1000
link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
9: ip_vti0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1000
link/ipip 0.0.0.0 brd 0.0.0.0
10: ip6_vti0@NONE: <NOARP> mtu 1364 qdisc noop state DOWN group default qlen 1000
link/tunnel6 :: brd :: permaddr 5ae6:403b:a31b::
11: sit0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1000
link/sit 0.0.0.0 brd 0.0.0.0
12: ip6tnl0@NONE: <NOARP> mtu 1452 qdisc noop state DOWN group default qlen 1000
link/tunnel6 :: brd :: permaddr 3204:ee4b:803c::
13: ip6gre0@NONE: <NOARP> mtu 1448 qdisc noop state DOWN group default qlen 1000
link/gre6 :: brd :: permaddr a84:f5ed:6073::
ip r
doesn't return any output at all
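(For what it's worth, the interface itself can be brought up by hand with iproute2, bypassing ifupdown's hook scripts entirely, using the values from the interfaces file above. Obviously that wouldn't fix whatever is actually broken.)
ip link set eth0 up                      # bring the link up
ip addr add 85.90.245.13/24 dev eth0     # static address from /etc/network/interfaces
ip route add default via 85.90.245.1     # default gateway from the same file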
If I run sudo systemctl status networking.service -l --no-pager --full
I get this:
Sep 16 10:32:54 vostok systemd[1]: Starting Raise network interfaces...
Sep 16 10:32:54 vostok sudo[517]: root : PWD=/ ; USER=root ; COMMAND=/bin/sh -c /sbin/iptables-restore < /etc/iptabe
Sep 16 10:32:54 vostok sudo[517]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=0)
Sep 16 10:32:54 vostok ifup[519]: sh: 1: /sbin/iptables-restore: not found
Sep 16 10:32:54 vostok sudo[517]: pam_unix(sudo:session): session closed for user root
Sep 16 10:32:54 vostok ifup[515]: run-parts: /etc/network/if-pre-up.d/iptables exited with return code 127
Sep 16 10:32:54 vostok ifup[513]: ifup: pre-up script failed
Sep 16 10:32:54 vostok systemd[1]: networking.service: Main process exited, code=exited, status=1/FAILURE
Sep 16 10:32:54 vostok systemd[1]: networking.service: Failed with result 'exit-code'.
Sep 16 10:32:54 vostok systemd[1]: Failed to start Raise network interfaces.
So, the problem seems to lie with iptables. However, when I try the solution provided for that:
sudo mv /etc/network/if-up.d/iptables ~
it doesn't exist:
mv: cannot stat '/etc/network/if-up.d/iptables': No such file or director
If I try the next listed command anyway:
ifdown -a && ifup -a
I get:
sh: 1: /sbin/iptables-restore: not found
run-parts: /etc/network/if-pre-up.d/iptables exited with return code 127
ifup: pre-up script failed
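I'm guessing 'not found' means either the iptables package has gone walkabout or the binary has moved (on newer Debian it lives under /usr/sbin and is managed by update-alternatives). A few checks that should narrow it down, just a sketch assuming standard Debian packaging:
dpkg -l iptables                          # is the package still installed?
ls -l /sbin/iptables-restore /usr/sbin/iptables-restore
update-alternatives --display iptables    # where the alternatives links point, if they exist
apt-get install --reinstall iptables      # would put the binaries back, but needs working networking; chicken and egg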
So I'm kind of stuck now. Where do I go from here, in order to get my Linode networking up again?
EDIT: In case it's relevant, both my Linodes are hosted in Frankfurt and the problematic one is running Debian – Latest 64 bit (5.13.4-x86_64-linode146)
10 Replies
OK. So it looks like loads of services on the server are dead. I managed to get networking to come up [I think!] by:
ln -s /etc/alternatives/ip6tables-restore /sbin/iptables-restore
and then
ifup eth0
Now systemctl status networking.service -l --no-pager --full
says the network has come up properly:
Sep 16 22:07:12 vostok sudo[1025]: root : PWD=/ ; USER=root ; COMMAND=/bin/sh -c /sbin/iptables-restore < /etc/iptae
Sep 16 22:07:12 vostok sudo[1025]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=0)
Sep 16 22:07:12 vostok sudo[1025]: pam_unix(sudo:session): session closed for user root
Sep 16 22:07:12 vostok sudo[1032]: root : PWD=/ ; USER=root ; COMMAND=/bin/sh -c /sbin/iptables-restore < /etc/iptae
Sep 16 22:07:12 vostok sudo[1032]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=0)
Sep 16 22:07:12 vostok sudo[1032]: pam_unix(sudo:session): session closed for user root
Sep 16 22:07:12 vostok sudo[1064]: root : PWD=/ ; USER=root ; COMMAND=/bin/sh -c /sbin/iptables-restore < /etc/iptae
Sep 16 22:07:12 vostok sudo[1064]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=0)
Sep 16 22:07:12 vostok sudo[1064]: pam_unix(sudo:session): session closed for user root
Sep 16 22:07:12 vostok systemd[1]: Finished Raise network interfaces.
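(In hindsight that symlink is a bodge, and it points the IPv4 restore command at the IPv6 binary. Once networking is up, the cleaner fix would presumably be to put the packaged binaries and alternatives links back; something along these lines, assuming the standard Debian iptables package:)
rm /sbin/iptables-restore                 # remove my hand-made symlink
apt-get install --reinstall iptables      # restore the real binaries and alternatives links
update-alternatives --display iptables    # confirm iptables-restore resolves properly again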
However, I still can't bring my Caddy webserver up or start SSH to let me log in normally. If I do a systemctl start caddy.service
or systemctl start ssh.service
I just get dumped back to the:
You are in emergency mode. After logging in, type "journalctl -xb" to view
system logs, "systemctl reboot" to reboot, "systemctl default" or "exit"
to boot into default mode.
Give root password for maintenance
(or press Control-D to continue):
prompt each time. I don't know what the heck's going on here. This server has been quietly running unattended for months and I've not touched it in a similar amount of time. Then, suddenly, overnight, it [and my other Linode hosted in the same datacentre] goes down, and this one is completely wrecked, acting as though its services are broken and looking in the wrong places for their support files.
Anyone from Linode got anything to report on goings on at Frankfurt? It's a bit of a coincidence that both my Linodes in that datacentre suddenly died at the same time, when I've not logged into either, or changed anything on them in weeks, if not months.
So, every service is dead and trying to start any of them dumps me back to that annoying 'emergency mode' prompt:
systemctl list-units --type=service --state=dead
UNIT LOAD ACTIVE SUB DESCRIPTION
acpid.service loaded inactive dead ACPI event daemon
apparmor.service loaded inactive dead Load AppArmor pr>
apt-daily-upgrade.service loaded inactive dead Daily apt upgrad>
apt-daily.service loaded inactive dead Daily apt downlo>
atd.service loaded inactive dead Deferred executi>
● auditd.service not-found inactive dead auditd.service
bind9.service loaded inactive dead LSB: Start and s>
caddy.service loaded inactive dead Caddy
certbot.service loaded inactive dead Certbot
● clamav-daemon.service not-found inactive dead clamav-daemon.se>
clamav-freshclam.service loaded inactive dead ClamAV virus dat>
● connman.service not-found inactive dead connman.service
cron.service loaded inactive dead Regular backgrou>
dbus.service loaded inactive dead D-Bus System Mes>
● display-manager.service not-found inactive dead display-manager.>
dovecot.service loaded inactive dead Dovecot IMAP/POP>
e2scrub_all.service loaded inactive dead Online ext4 Meta>
e2scrub_reap.service loaded inactive dead Remove Stale Onl>
exim4.service loaded inactive dead LSB: exim Mail T>
fail2ban.service loaded inactive dead Fail2Ban Service
fancontrol.service loaded inactive dead fan speed regula>
● firewalld.service not-found inactive dead firewalld.service
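From what I've read, being dropped into emergency mode means emergency.target was reached because something early in the boot failed, very often a mount, and everything else simply never got started. So rather than staring at the dead units, these ought to show the actual trigger (I'm guessing at the usual systemd/journal setup here):
systemctl --failed                        # units that actually failed, not just ones that never started
journalctl -b -p err                      # errors from this boot only
systemctl status local-fs.target          # failed mounts are the classic cause of emergency mode
findmnt --verify                          # sanity-check the entries in /etc/fstab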
UPDATE: 17th Sept and it's the same scenario again. Both Linodes became inaccessible again sometime overnight, UK time. I rebooted both; one came back online, while the other is still inaccessible, with all services dead and the LISH console constantly booting me out and back into 'emergency mode' whenever I try to start any services.
Have opened a ticket: #16241056
Let's see if Linode can shed any light on this!
EDIT: This is what happens when I'm logged in via LISH. All the vital services are down, but if I try to start any of them, I just get dumped back to the 'emergency mode' prompt and have to log in again. Extremely frustrating and tedious!
ᐅ service sshd status
● ssh.service - OpenBSD Secure Shell server
Loaded: loaded (/lib/systemd/system/ssh.service; enabled; vendor preset: e>
Active: inactive (dead)
Docs: man:sshd(8)
man:sshd_config(5)
ᐅ service sshd start
You are in emergency mode. After logging in, type "journalctl -xb" to view
system logs, "systemctl reboot" to reboot, "systemctl default" or "exit"
to boot into default mode.
Give root password for maintenance
(or press Control-D to continue):
@madra --
A couple of things…
I've been a Linode customer since 2013. Linode never, ever does scheduled maintenance or "tinkering with their infrastructure" without telling you first.
If it happens two days/nights in a row, it's something else…not "scheduled maintenance".
From the transcript of your lish session, the culprit is some part of your Linux configuration…not anything to do with your Linode itself. The fact that you got to "emergency mode" in the first place indicates the system booted fine. Unless there's something wrong with networking (which is extremely unlikely), Linode's responsibility ends there.
That's not to say that they won't try to help you out…but, be advised that Linode support folks do not have access to the guts of your distro.
Today's modern Linux distros are little more than a kernel and a collection of systemd(1) services. It looks like you're entering the ring for a cage match with systemd(1). I hope you've stocked up on your blood pressure meds…you're going to need them.
All that being said, there are some indications from the stuff you've posted that your firewall configuration may be the culprit. Can you turn off your firewall completely (for a while)?
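For example, a rough sketch, and I'm assuming your rules are only loaded by that if-pre-up.d hook you showed, not by a separate firewall service:
mv /etc/network/if-pre-up.d/iptables /root/   # take the hook out of the boot path
iptables -P INPUT ACCEPT                      # open the default policies back up
iptables -P FORWARD ACCEPT
iptables -P OUTPUT ACCEPT
iptables -F                                   # flush whatever rules did get loaded
(If the iptables binary itself is missing, skip the flush…there are no rules loaded anyway.)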
-- sw
From the transcript of your lish session, the culprit is some part of your Linux configuration…not anything to do with your Linode itself.
Why is everyone [including Linode support] ignoring the salient facts here? Let me state them again:
This happened to two different Linodes simultaneously
Before this downtime, I had not logged into either Linode for at least a couple of weeks, so I had not changed anything in their setup
The two Linodes are running different OSes [one on Debian, one on Ubuntu]
So how can a configuration error spontaneously manifest itself on two different Linodes, running two different OSes, at the same time, when both had previously been running for months without issue? And all of this happened WITHOUT ANY USER INPUT?
Anyway, some progress has been made, though how long it lasts is anybody's guess.
Searching online, I found a few threads in which people advised that being continually dumped into 'emergency mode' indicates that something has gone wrong with /etc/fstab, so that the drives are not being mounted properly at boot time.
I checked my /etc/fstab and found that the drives in it were identified by device name rather than by UUID. So I ran blkid, edited /etc/fstab to use UUIDs instead, and then ran mount -a. This time I got a message:
/var/opt/bin/s3fs: /usr/lib/i386-linux-gnu/libcurl.so.4: version `CURL_OPENSSL_3' not found (required by /var/opt/bin/s3fs)
/var/opt/bin/s3fs: /usr/lib/i386-linux-gnu/libcurl.so.4: version `CURL_OPENSSL_3' not found (required by /var/opt/bin/s3fs)
which relates to two Amazon S3FS drives that I also mount as storage on my Linode. I commented out those lines, ran mount -a again and, hey presto!, my Linode was back online and my websites were up again.
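For anyone else who lands here: the edit was just swapping the device names for the UUIDs that blkid reports, roughly like this (the UUID below is a made-up placeholder, not one of mine):
blkid /dev/sda
# /dev/sda: UUID="0example-uuid-from-blkid" TYPE="ext4"
# /etc/fstab, before:
#   /dev/sda    /    ext4    noatime,errors=remount-ro    0  1
# after:
UUID=0example-uuid-from-blkid    /    ext4    noatime,errors=remount-ro    0  1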
So, as I said, we'll have to see whether they stay up this time. But I'm still curious as to what went wrong here…
Did Linode replace some hardware, which broke my /etc/fstab since it wasn't using blkid UUIDs?… or
Did CURL_OPENSSL_3 [whatever that is] get broken somehow, thus breaking /etc/fstab as regards mounting the S3FS mountpoints?
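I still don't really know what CURL_OPENSSL_3 is, beyond being a versioned symbol that the s3fs binary expects libcurl to export. If I wanted to dig, something like this should show whether the installed libcurl changed out from under it (paths taken from the error message above; objdump needs binutils installed):
file /var/opt/bin/s3fs                                               # confirm it's a 32-bit (i386) binary, as the library path suggests
ldd /var/opt/bin/s3fs | grep -i curl                                 # which libcurl it actually links against
objdump -T /usr/lib/i386-linux-gnu/libcurl.so.4 | grep CURL_OPENSSL  # version symbols the installed libcurl provides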
Again to re-emphasise for about the 3rd time!
Before they both went down, I had not touched either of these Linodes in weeks. So whatever changed to break them [whether hardware or software] was not anything I did.
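One thing I can at least do is check whether anything changed on the box without me, assuming Debian's standard log locations:
zgrep -h " upgrade \| install " /var/log/dpkg.log* | tail -n 50   # recent package changes, if there were any
ls -l /var/log/apt/history.log*                                   # timestamps of any apt runs
last -Fa | head -n 20                                             # recent logins, to rule out anyone else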
Why is everyone [including Linode support] ignoring the salient facts here? Let me state them again:
I dunno…s**t happens… I was just trying to help you out. Ranting will only make you feel better. Frankly, it doesn't do anything for me…
Again to re-emphasise for about the 3rd time!
Before they both went down, I had not touched either of these Linodes in weeks. So whatever changed to break them
It doesn't have to be related to anything that "you did"… Do you have disgruntled ex-employees? Do you have cron(8) jobs to install distro updates automagically? Do you scan your security logs for unauthorized logins and take appropriate countermeasures?
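Some quick ways to answer those for yourself (I'm assuming a stock Debian layout here):
ls /etc/cron.d /etc/cron.daily /etc/cron.weekly    # anything installed that runs unattended?
systemctl list-timers --all                        # systemd timers, including the apt-daily ones
grep "Accepted" /var/log/auth.log | tail -n 20     # recent successful SSH logins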
-- sw
Ranting will only make you feel better. Frankly, it doesn't do anything for me…..
Why did you think I was ranting?
It's incredibly frustrating when you ask for support somewhere, taking the time to point out the relevant facts, and people respond in a way that suggests they didn't read past the subject line.
And I wasn't singling you out, either. I got similar 'advice' from Linode support, suggesting I run fsck on my Linode or update the kernel, when I had already emphasised to them that this problem arose simultaneously on two separate Linodes running two separate OSes, neither of which had been touched by me in weeks, if not months.
Do you have disgruntled ex-employees?
No. I work for myself
Do you have cron(8) jobs to install distro updates automagically?
Now that, at last, is a useful suggestion. However, I've not set up any such cron jobs, nor do I have unattended-upgrades configured.
Do you scan your security logs for unauthorized logins and take appropriate countermeasures?
Not yet. But I think unauthorised logins are highly unlikely, since I log in via public key and have root login and password login disabled.
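(For the record, this is how I'd confirm those settings are actually in effect; sshd -T prints the running daemon's effective configuration:)
sshd -T | grep -Ei "permitrootlogin|passwordauthentication|pubkeyauthentication"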
It's hard to say what caused your problem (if, in fact, you still have a problem)…or who the culprit is.
What I can say is that, in my many years of dealing with Linux, Linux as a platform for doing what I need/want to do has become increasingly unstable. It seems to get worse with every release of every "mainstream" distro. You can probably think of as many reasons as I can why this might be true. In my experience, 9 times out of 10, the culprit has been systemd(1).
To mitigate the fact that systemd(1) is a bloated, not-POSIX-compliant, inescapable POS, I've abandoned use of Linux altogether…everywhere in my personal computing infrastructure (even at home). Consequently, my BP is several points lower.
Maybe that's not the answer for you but you should maybe at least think about that. There are pros and cons of doing this of course so you'll have to weigh those yourself.
Linux is a one-man show (and it's a well-known fact that this guy has a temper)…and I think problems like yours really expose why this engineering model doesn't work.
-- sw
I think it's safe to assume that Linode have been tinkering with their infrastructure again.
The only two things that could happen on Linode's end to cause your VPSes to become unavailable would be:
The physical server itself has an emergency issue. A ticket is automatically opened in this case.
A serious network outage or DDoS attack which, again, would automatically generate an alert or ticket.
Keep in mind that their service is partially unmanaged, so you are responsible for configuration issues on your end.