What's the most reliable way of diagnosing high and seemingly random CPU spikes that cripple my machine?
I have a small Linode VPS with 1GB of RAM running various things. A few Ghost sites, an uptime monitor, a Caddy webserver, a Discord bot, and whatever I happen to be tinkering with at the moment.
Generally, the system idles at about 5-10% CPU usage. Recently, however, CPU usage will occasionally spike to 150% for no apparent reason. When this happens at night (always at midnight UTC, but not every night), it generally lasts 1-2 hours before returning to normal. When I am there to witness it, which is not always, as it sometimes prevents both me and Lish from connecting over SSH, top always shows kswapd0 as the main CPU user. I am aware that kswapd0 handles swap, and while I would have expected its high CPU usage to mean memory is at a premium, there are generally no stand-out culprits for high memory usage when the issue occurs. On top of that, top always shows a constant 100MB of swap in use.
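One way to catch the spike in the act, even when SSH is unusable, is to log memory and swap activity to a file ahead of time and read it back afterwards. A minimal sketch, assuming an arbitrary log path; the /proc/pressure line needs a kernel with PSI support (4.20 or newer) and is skipped silently otherwise:

# Sample memory/swap state once a minute; review the log after a spike.
while true; do
    {
        date
        grep -E 'MemAvailable|SwapFree' /proc/meminfo
        cat /proc/pressure/memory 2>/dev/null   # PSI memory pressure, if available
        vmstat 1 2 | tail -1                    # si/so columns = swap in/out
    } >> /var/log/memwatch.log
    sleep 60
done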
Sometimes killing either Caddy or the VS Code Remote Server stops the issue, but never consistently, so one of my theories has been that the culprit is some process they both use at some point. Most recently, the problem was triggered by extracting and subsequently running the Docker image for a new version of Ghost, although older versions have never caused issues. I have also checked disk I/O using iotop, as I noticed disk I/O spikes that correlated almost exactly with the CPU spikes, but nothing stood out (except that monitoring it makes the problem worse by using 10% CPU by itself).
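If iotop is too heavy to leave running, pidstat from the sysstat package can log per-process CPU and disk I/O continuously at much lower cost. A sketch, assuming Debian/Ubuntu package names and an arbitrary log path:

sudo apt install sysstat       # provides pidstat (package name assumed)

# -u = CPU, -d = per-process disk I/O, one sample every 60 seconds.
# Run as root so -d can read I/O counters for all processes.
sudo sh -c 'nohup pidstat -u -d 60 > /var/log/pidstat.log 2>&1 &'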
I have one script that runs at midnight that I am aware of: my backup script. I don't think it's the culprit, because I can run it manually without issue and the spikes do not happen every night.
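It may also be worth confirming that the backup script is the only thing scheduled around midnight. A quick sketch for enumerating cron jobs and systemd timers, assuming a systemd-based distro:

crontab -l                            # your user's crontab
sudo cat /etc/crontab                 # system crontab (daily jobs often run near midnight)
sudo ls /etc/cron.d /etc/cron.daily   # drop-in cron jobs
systemctl list-timers --all           # systemd timers (logrotate, apt, etc.)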
I am at my wit's end. Are there any other tools or methods I can use to diagnose the cause of this issue?
3 Replies
I have a couple of suggestions:
look at your auth.log to see if the Chinese and/or Russians are mounting an ssh brute force attack (they seem to have changed their MO from constant noise to brief, timed, targeted attacks lately);
look at your web server logs for the same thing (to mitigate this, you can install https://github.com/mitchellkrogza/apache-ultimate-bad-bot-blocker or https://github.com/mitchellkrogza/nginx-ultimate-bad-bot-blocker);
install fail2ban and set up bans for these kinds of things (there are lots of canned jails in the distro package; a minimal sketch follows this list)…you'll get useful logs too, so you can fine-tune your defenses;
close ANY ports you're not using…for both in- and outbound traffic.
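A minimal fail2ban setup for the sshd jail might look like this (Debian/Ubuntu package name assumed; maxretry, findtime, and bantime values are illustrative):

sudo apt install fail2ban

# Enable the stock sshd jail via a local override
sudo tee /etc/fail2ban/jail.local > /dev/null <<'EOF'
[sshd]
enabled  = true
maxretry = 5
findtime = 600
bantime  = 3600
EOF

sudo systemctl restart fail2ban
sudo fail2ban-client status sshd    # confirm the jail is running and see current bans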
-- sw
I looked in auth.log as per your suggestion; here's a sample of the output:
Jul 16 11:39:45 localhost sshd[488022]: Invalid user username from 112.78.188.194 port 41000
Jul 16 11:39:45 localhost sshd[488022]: Received disconnect from 112.78.188.194 port 41000:11: Bye Bye [preauth]
Jul 16 11:39:45 localhost sshd[488022]: Disconnected from invalid user username 112.78.188.194 port 41000 [preauth]
Jul 16 11:39:49 localhost sshd[488024]: Received disconnect from 61.177.172.124 port 32813:11: [preauth]
Jul 16 11:39:49 localhost sshd[488024]: Disconnected from authenticating user root 61.177.172.124 port 32813 [preauth]
Jul 16 11:41:00 localhost sshd[488050]: Invalid user client1 from 112.78.188.194 port 58640
Jul 16 11:41:00 localhost sshd[488050]: Received disconnect from 112.78.188.194 port 58640:11: Bye Bye [preauth]
Jul 16 11:41:00 localhost sshd[488050]: Disconnected from invalid user client1 112.78.188.194 port 58640 [preauth]
Jul 16 11:41:36 localhost sshd[488075]: Connection closed by authenticating user nobody 179.60.147.122 port 48334 [preauth]
Jul 16 11:42:15 localhost sshd[488099]: Invalid user oracle from 112.78.188.194 port 48050
Jul 16 11:42:15 localhost sshd[488099]: Received disconnect from 112.78.188.194 port 48050:11: Bye Bye [preauth]
Jul 16 11:42:15 localhost sshd[488099]: Disconnected from invalid user oracle 112.78.188.194 port 48050 [preauth]
Based on this, I assume I'm getting botted, albeit very lightly. While I don't think this is the cause of the issue, since the times do not correspond to periods of high CPU usage, I have disabled password authentication for ssh (in favour of keys) and tightened up my firewall just to be on the safe side, so thank you for bringing that to my attention.
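For reference, the key-only SSH setup essentially comes down to a couple of lines in /etc/ssh/sshd_config (the values shown are the common hardening choices, not necessarily a complete config):

PasswordAuthentication no
PermitRootLogin prohibit-password

# then reload the daemon (the service is "ssh" on Debian/Ubuntu, "sshd" elsewhere)
sudo systemctl reload ssh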
One thing I'm wondering about is whether I simply don't have enough swap, as I saw an article saying that servers with 1GB of RAM or less should have 2x the RAM as swap, and I only have 512MB. I haven't yet worked out how to increase it, though. I also found an article on Troubleshooting Memory and Network Issues, which I'm going to try following some of the advice from, but I'm not sure if it will solve the problem or merely dampen it.
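Before resizing anything, it's worth checking what swap is actually configured and how eagerly the kernel uses it. A short sketch; vm.swappiness=10 is a common tuning value, not a guaranteed fix for kswapd churn:

swapon --show                   # current swap devices/files and their sizes
free -m                         # RAM vs swap usage in MB
cat /proc/sys/vm/swappiness     # default is usually 60; lower values delay swapping

sudo sysctl vm.swappiness=10    # takes effect immediately; add to /etc/sysctl.conf to persist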
Based on this, I assume I'm getting botted
Yep. 112.78.188.194 belongs to an Indonesian ISP based in Jakarta. 61.177.172.124 belongs to China Telecom (Jiangsu province network). The first is probably a Chinese proxy as well.
albeit very lightly.
Don't worry… it'll get worse…especially when the Russians join the party.
One thing I'm wondering about is whether I don't have enough swap
That could definitely be an issue.
as I saw an article saying that servers with 1GB of RAM or less should have 2x the RAM as swap and I only have 512MB.
It's pretty easy to get more swap. 2x memory is definitely a good guideline.
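A minimal sketch for adding a 2GB swap file (paths and sizes are examples; use dd instead of fallocate on filesystems where fallocate-backed swap isn't supported):

sudo fallocate -l 2G /swapfile      # or: sudo dd if=/dev/zero of=/swapfile bs=1M count=2048
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# persist across reboots
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab

swapon --show                       # verify the new swap is active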
-- sw