Huge load spikes, 502 and connection timeouts
Hi,
For the past few weeks, my Linode has been acting very strangely and the website goes down multiple times every day. So I started investigating… it seems like I have errors everywhere! I haven't changed anything server-side or on the website, so I really don't understand what's happening.
I am running WordPress on Ubuntu 16.04 LTS, Linode 4096. Traffic is at its lowest point in the past 12 months, no more than 40-50 concurrent users. Running Nginx, MariaDB, and PHP 7.0-FPM.
Here is everything I could pull out from the logs and my research:
I see there is plenty of memory left, but the CPU is through the roof:
%Cpu(s): 22.1 us, 10.0 sy, 0.7 ni, 58.0 id, 7.6 wa, 1.5 hi, 0.1 si, 0.1 st
KiB Mem : 4024580 total, 1019928 free, 948504 used, 2056148 buff/cache
KiB Swap:   262140 total,   262140 free,        0 used.  2704140 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 1122 mysql     20   0 1088088 452896  18344 S 186.7 11.3 503:52.63 mysqld
    1 root      20   0  119576   5736   3984 S   0.0  0.1   0:03.55 systemd
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.01 kthreadd
    3 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 rcu_gp
    5 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 kworker/0:0H
    7 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 mm_percpu_wq
    8 root      20   0       0      0      0 S   0.0  0.0   0:02.47 ksoftirqd/0
    9 root      -2   0       0      0      0 I   0.0  0.0   0:07.43 rcu_preempt
All of my log files contain errors every second or every few seconds. Here are some samples.
Nginx log:
2018/08/14 06:28:23 [error] 933#933: *72352 upstream timed out (110: Connection timed out) while reading response header from upstream, client: x.x.x.x, server: exemple.com, request: "GET /?wordfencelh=1&hid=19C3C2785E77AF81FB8C0A21F3D8015A&r=0.5195645098705209 HT$
2018/08/14 06:28:27 [error] 933#933: *72357 upstream timed out (110: Connection timed out) while reading response header from upstream, client: x.x.x.x, server: exemple.com, request: "POST /wp-admin/admin-ajax.php?tdthemename=Newsmag&v=3.3.1 HTTP/1.1", upstre$
2018/08/14 06:28:29 [error] 933#933: *72370 upstream timed out (110: Connection timed out) while reading response header from upstream, client: x.x.x.x, server: exemple.com, request: "GET /?wordfencelh=1&hid=63F944E025B8089D8C37A09E4F98DE23&r=0.564857619316481$
2018/08/14 06:28:30 [error] 934#934: *72227 upstream timed out (110: Connection timed out) while reading response header from upstream, client: x.x.x.x, server: exemple.com, request: "GET /2018/03/12/kim-levesque-lizotte-montre-premiere-bedaine-de-grossesse/ HTT$
2018/08/14 06:44:51 [error] 933#933: *73952 connect() to unix:/run/php/php7.0-fpm.sock failed (11: Resource temporarily unavailable) while connecting to upstream, client: x.x.x.x, server: exemple.com, request: "GET /wp-content/plugins/onesignal-free-web-push-n$
2018/08/14 06:44:52 [error] 933#933: *74023 connect() to unix:/run/php/php7.0-fpm.sock failed (11: Resource temporarily unavailable) while connecting to upstream, client: x.x.x.x, server: exemple.com, request: "GET /wp-content/plugins/onesignal-free-web-push-not
PHP log:
[14-Aug-2018 08:40:15] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 8 children, there are 4 idle, and 24 total children
[14-Aug-2018 08:40:18] WARNING: [pool www] server reached pm.max_children setting (25), consider raising it
[14-Aug-2018 08:45:03] WARNING: [pool www] server reached pm.max_children setting (25), consider raising it
syslog:
Aug 14 06:25:24 andy kernel: [47503.899782] [UFW BLOCK] IN=eth0 OUT= MAC=ff:3c:91:1f:b1:25:84:78:ac:57:a8:41:08:00 SRC=84.176.x.x DST=104.237.x.x LEN=60 TOS=0x00 PREC=0x00 TTL=53 ID=21381 DF PROTO=TCP SPT=35366 DPT=23 WINDOW=29200 RES=0x00 SYN URGP=0
Aug 14 06:25:25 andy kernel: [47504.898302] [UFW BLOCK] IN=eth0 OUT= MAC=ff:3c:91:1f:b1:25:84:78:ac:57:a8:41:08:00 SRC=84.176.x.x DST=104.237.x.x LEN=60 TOS=0x00 PREC=0x00 TTL=53 ID=21382 DF PROTO=TCP SPT=35366 DPT=23 WINDOW=29200 RES=0x00 SYN URGP=0
Aug 14 06:25:27 andy kernel: [47506.900182] [UFW BLOCK] IN=eth0 OUT= MAC=ff:3c:91:1f:b1:25:84:78:ac:57:a8:41:08:00 SRC=84.176.x.x DST=104.237.x.x LEN=60 TOS=0x00 PREC=0x00 TTL=53 ID=21383 DF PROTO=TCP SPT=35366 DPT=23 WINDOW=29200 RES=0x00 SYN URGP=0
Aug 14 06:25:31 andy kernel: [47510.904657] [UFW BLOCK] IN=eth0 OUT= MAC=ff:3c:91:1f:b1:25:84:78:ac:57:a8:41:08:00 SRC=84.176.x.x DST=104.237.x.x LEN=60 TOS=0x00 PREC=0x00 TTL=53 ID=21384 DF PROTO=TCP SPT=35366 DPT=23 WINDOW=29200 RES=0x00 SYN URGP=0
Aug 14 06:25:39 andy kernel: [47518.924772] [UFW BLOCK] IN=eth0 OUT= MAC=ff:3c:91:1f:b1:25:84:78:ac:57:a8:41:08:00 SRC=84.176.x.x DST=104.237.x.x LEN=60 TOS=0x00 PREC=0x00 TTL=53 ID=21385 DF PROTO=TCP SPT=35366 DPT=23 WINDOW=29200 RES=0x00 SYN URGP=0
Aug 14 06:25:56 andy kernel: [47536.650971] [UFW BLOCK] IN=eth0 OUT= MAC=ff:3c:91:1f:b1:25:84:78:ac:57:aa:c1:08:00 SRC=146.185.x.x DST=104.237.x.x LEN=40 TOS=0x00 PREC=0x00 TTL=245 ID=27725 PROTO=TCP SPT=40570 DPT=30581 WINDOW=1024 RES=0x00 SYN URGP=0
Thanks for your help!
4 Replies
How did you determine your traffic is at its lowest historically? Can you show us a screenshot of your network graph, or of any other graphing tools you may have installed? Your PHP-FPM errors indicate you may be seeing more traffic than usual, or possibly abusive traffic of some kind.
Can you also attach screenshots of your RAM and CPU graphs? I'm not sure what command you ran to get that resource usage information, but I don't think we can easily determine system load from it.
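For what it's worth, a couple of standard commands usually give a clearer picture of sustained load than a single top snapshot; run them during a spike (output will obviously vary):

uptime      # prints the 1/5/15-minute load averages
vmstat 5 5  # samples CPU, run queue, and swap activity every 5 seconds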
If you do in fact have unused resources available to work with, you might just need to raise the number of PHP-FPM workers. The configuration for that lives in the PHP-FPM pool directory; here's an example path:
/etc/php/7.0/fpm/pool.d/www.conf
Search for this line:
pm.max_children = 25
Make a relatively small increase, perhaps to something like 40, so as not to overwhelm the system, then restart the PHP-FPM process (there's a rough sketch of both steps a few lines below). If after that you continue to see errors like these:
2018/08/14 06:44:51 [error] 933#933: *73952 connect() to unix:/run/php/php7.0-fpm.sock failed (11: Resource temporarily unavailable) while connecting to upstream
[14-Aug-2018 08:45:03] WARNING: [pool www] server reached pm.max_children setting (25), consider raising it
Then it's likely something wonky/different is going on with either 1) the traffic you're now receiving, or 2) how your application handles traffic, which needs to be investigated. By the way, in the context of this error message, "upstream" means the PHP-FPM process.
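For reference, here's a rough sketch of what that pool change and restart could look like on this Ubuntu 16.04 / PHP 7.0 setup; the values below are illustrative, not a drop-in config, so size them against your own RAM:

; /etc/php/7.0/fpm/pool.d/www.conf
pm = dynamic
pm.max_children = 40      ; raised from 25, keep within available memory
pm.start_servers = 8
pm.min_spare_servers = 4
pm.max_spare_servers = 12

# then restart the pool so the new limits take effect
sudo systemctl restart php7.0-fpm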
When our team (Linode Professional Services) handles problems like these for clients, our workflow generally looks like this until the issue is resolved:
- Evaluate webserver traffic and compare it against past traffic to look for differences, e.g. has some new type of bot started spamming traffic at you? Determine what that traffic looks like and add a WAF or firewall rules to drop it.
- Optimize resource usage for PHP-FPM and MySQL based on system resource availability so they can handle more traffic.
- De-couple MySQL onto its own Linode instance to better optimize resources, and remove MySQL from the original instance. PHP-FPM and MySQL compete pretty badly for system resources when traffic load gets heavier, and giving each its own system to run on helps greatly.
- Run a PHP profiler against the application to find any code bloat or heavy plugins/modules that may have recently been added, or really any that could be removed to improve the efficiency of the code base.
- Evaluate MySQL queries to determine if there are any excessively inefficient or redundant queries being run by someone or some code.
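On that last point, a low-effort way to see which queries are keeping mysqld that busy is to turn on MariaDB's slow query log. A minimal sketch, assuming the stock Ubuntu 16.04 MariaDB layout (the paths and the one-second threshold here are just examples):

# /etc/mysql/mariadb.conf.d/50-server.cnf, under the [mysqld] section
slow_query_log      = 1
slow_query_log_file = /var/log/mysql/mariadb-slow.log
long_query_time     = 1
log_queries_not_using_indexes = 1

# then restart MariaDB and watch the log during a spike
sudo systemctl restart mysql
sudo tail -f /var/log/mysql/mariadb-slow.log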
Lastly, those firewall messages are also most likely a red herring. Ubuntu 16.04 logs firewall denials by default, and on the public internet there is a ton of ridiculous bot traffic flying around that will be dropped by default.
Load average: https://ibb.co/eGq1Ap
Ram: https://ibb.co/hRQCGU
Connections: https://ibb.co/jdMxi9
Hey, glad to see you figured it out. It's generally best to try things one at a time if you're looking to understand your system better, but what works, works, right?
If you raised your maximum number of PHP-FPM child workers, that was probably it, since that was your immediate problem. Keep an eye out for that error in the future, as well as a close eye on your resource utilization charts.