What's the most reliable way of diagnosing high and seemingly random CPU spikes that cripple my machine?
I have a small Linode VPS with 1GB of RAM running various things. A few Ghost sites, an uptime monitor, a Caddy webserver, a Discord bot, and whatever I happen to be tinkering with at the moment.
Generally, the system idles at about 5-10% CPU usage. Recently, however, CPU usage will occasionally spike to 150% for no apparent reason. When this happens at night (always at midnight UTC, but not every night), it generally lasts 1-2 hours before returning to normal. When I am there to witness it, which is not always, as it sometimes prevents both me and Lish from connecting over SSH, top always shows kswapd0 as the main CPU user. I am aware that kswapd0 handles swap, and while I would have expected its high CPU usage to mean memory is at a premium, there are generally no stand-out culprits for high memory usage when the issue occurs. On top of that, top always shows a constant 100MB of swap in use.
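One way to catch the spike in the act, even when SSH is unusable, is to log memory and swap activity to a file ahead of time and read it back afterwards. A minimal sketch, assuming an arbitrary log path; the /proc/pressure line needs a kernel with PSI support (4.20 or newer) and is skipped silently otherwise:

# Sample memory/swap state once a minute; review the log after a spike.
while true; do
    {
        date
        grep -E 'MemAvailable|SwapFree' /proc/meminfo
        cat /proc/pressure/memory 2>/dev/null   # PSI memory pressure, if available
        vmstat 1 2 | tail -1                    # si/so columns = swap in/out
    } >> /var/log/memwatch.log
    sleep 60
done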
Sometimes killing either Caddy or the VS Code Remote Server stops the issue, but never consistently, so one of my theories has been that the culprit is some process they both use at some point. Most recently, the problem was triggered by extracting and subsequently running the Docker image for a new version of Ghost, although older versions have never caused issues. I have also checked disk I/O using iotop, as I noticed disk I/O spikes that correlated almost exactly with the CPU spikes, but nothing stood out (except that monitoring it makes the problem worse by using 10% CPU by itself).
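If iotop is too heavy to leave running, pidstat from the sysstat package can log per-process CPU and disk I/O continuously at much lower cost. A sketch, assuming Debian/Ubuntu package names and an arbitrary log path:

sudo apt install sysstat       # provides pidstat (package name assumed)

# -u = CPU, -d = per-process disk I/O, one sample every 60 seconds.
# Run as root so -d can read I/O counters for all processes.
sudo sh -c 'nohup pidstat -u -d 60 > /var/log/pidstat.log 2>&1 &'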
I have one script that runs at midnight that I am aware of: my backup script. I don't think it's the culprit, because I can run it manually without issue and the spikes do not happen every night.
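It may also be worth confirming that the backup script is the only thing scheduled around midnight. A quick sketch for enumerating cron jobs and systemd timers, assuming a systemd-based distro:

crontab -l                            # your user's crontab
sudo cat /etc/crontab                 # system crontab (daily jobs often run near midnight)
sudo ls /etc/cron.d /etc/cron.daily   # drop-in cron jobs
systemctl list-timers --all           # systemd timers (logrotate, apt, etc.)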
I am at my wit's end. Are there any other tools or methods I can use to diagnose the cause of this issue?
3 Replies
I have a couple of suggestions:
look at your auth.log to see if the Chinese and/or Russians are mounting an ssh brute force attack (they seem to have changed their MO from constant noise to brief, timed, targeted attacks lately);
look at your web server logs for the same thing (to mitigate this, you can install https://github.com/mitchellkrogza/apache-ultimate-bad-bot-blocker or https://github.com/mitchellkrogza/nginx-ultimate-bad-bot-blocker);
install fail2ban and set up bans for these kinds of things (there are lots of canned jails in the distro package; a minimal sketch follows this list)…you'll get useful logs too, so you can fine-tune your defenses;
close ANY ports you're not using…for both in- and outbound traffic.
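A minimal fail2ban setup for the sshd jail might look like this (Debian/Ubuntu package name assumed; maxretry, findtime, and bantime values are illustrative):

sudo apt install fail2ban

# Enable the stock sshd jail via a local override
sudo tee /etc/fail2ban/jail.local > /dev/null <<'EOF'
[sshd]
enabled  = true
maxretry = 5
findtime = 600
bantime  = 3600
EOF

sudo systemctl restart fail2ban
sudo fail2ban-client status sshd    # confirm the jail is running and see current bans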
-- sw
I looked in auth.log as per your suggestion; here's a sample of the output:
Jul 16 11:39:45 localhost sshd[488022]: Invalid user username from 112.78.188.194 port 41000
Jul 16 11:39:45 localhost sshd[488022]: Received disconnect from 112.78.188.194 port 41000:11: Bye Bye [preauth]
Jul 16 11:39:45 localhost sshd[488022]: Disconnected from invalid user username 112.78.188.194 port 41000 [preauth]
Jul 16 11:39:49 localhost sshd[488024]: Received disconnect from 61.177.172.124 port 32813:11: [preauth]
Jul 16 11:39:49 localhost sshd[488024]: Disconnected from authenticating user root 61.177.172.124 port 32813 [preauth]
Jul 16 11:41:00 localhost sshd[488050]: Invalid user client1 from 112.78.188.194 port 58640
Jul 16 11:41:00 localhost sshd[488050]: Received disconnect from 112.78.188.194 port 58640:11: Bye Bye [preauth]
Jul 16 11:41:00 localhost sshd[488050]: Disconnected from invalid user client1 112.78.188.194 port 58640 [preauth]
Jul 16 11:41:36 localhost sshd[488075]: Connection closed by authenticating user nobody 179.60.147.122 port 48334 [preauth]
Jul 16 11:42:15 localhost sshd[488099]: Invalid user oracle from 112.78.188.194 port 48050
Jul 16 11:42:15 localhost sshd[488099]: Received disconnect from 112.78.188.194 port 48050:11: Bye Bye [preauth]
Jul 16 11:42:15 localhost sshd[488099]: Disconnected from invalid user oracle 112.78.188.194 port 48050 [preauth]
Based on this, I assume I'm getting botted, albeit very lightly. While I don't think this is the cause of the issue, since the times do not correspond to periods of high CPU usage, I have disabled password authentication for ssh (in favour of keys) and tightened up my firewall just to be on the safe side, so thank you for bringing that to my attention.
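For reference, the key-only SSH setup essentially comes down to a couple of lines in /etc/ssh/sshd_config (the values shown are the common hardening choices, not necessarily a complete config):

PasswordAuthentication no
PermitRootLogin prohibit-password

# then reload the daemon (the service is "ssh" on Debian/Ubuntu, "sshd" elsewhere)
sudo systemctl reload ssh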
One thing I'm wondering about is whether I simply don't have enough swap, as I saw an article saying that servers with 1GB of RAM or less should have 2x the RAM as swap, and I only have 512MB. I haven't yet worked out how to increase it, though. I also found an article on Troubleshooting Memory and Network Issues, which I'm going to try following some of the advice from, but I'm not sure if it will solve the problem or merely dampen it.
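Before resizing anything, it's worth checking what swap is actually configured and how eagerly the kernel uses it. A short sketch; vm.swappiness=10 is a common tuning value, not a guaranteed fix for kswapd churn:

swapon --show                   # current swap devices/files and their sizes
free -m                         # RAM vs swap usage in MB
cat /proc/sys/vm/swappiness     # default is usually 60; lower values delay swapping

sudo sysctl vm.swappiness=10    # takes effect immediately; add to /etc/sysctl.conf to persist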
Based on this, I assume I'm getting botted
Yep. 112.78.188.194 belongs to an Indonesian ISP based in Jakarta. 61.177.172.124 belongs to China Telecom (Jiangsu province network). The first is probably a Chinese proxy as well.
albeit very lightly.
Don't worry… it'll get worse…especially when the Russians join the party.
One thing I'm wondering about is whether I don't have enough swap
That could definitely be an issue.
as I saw an article saying that servers with 1GB of RAM or less should have 2x the RAM as swap and I only have 512MB.
It's pretty easy to get more swap. 2x memory is definitely a good guideline.
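A minimal sketch for adding a 2GB swap file (paths and sizes are examples; use dd instead of fallocate on filesystems where fallocate-backed swap isn't supported):

sudo fallocate -l 2G /swapfile      # or: sudo dd if=/dev/zero of=/swapfile bs=1M count=2048
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# persist across reboots
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab

swapon --show                       # verify the new swap is active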
-- sw