Need help debugging a random connection timeout
Linux newbie here, I need help debugging a random connection timeout between my app server and my database server.
Servers:
Linode 768, Debian 6 (64bit) - App server (www1)
Ruby 2
Rails 3 (with Rainbows! as server)
Sidekiq (async background message processor)
pgbouncer
Linode 512, Debian 6 (64bit) - DB server (db1)
Postgres 9.2
Sphinx Search
Redis 2.6.11 (with AOF persistence)
The two servers talk to each other over their private IPs. Redis is used as my main Rails cache store.
Problem:
Sometimes my application server throws errors like these:
Redis::TimeoutError (Connection timed out)
ActionView::Template::Error (Connection timed out):
It happens randomly, whether there are <10 or >60 people active on the site.
The strange thing is, my Postgres connection has NEVER had this problem (timing out).
Other things to note:
When I was still using memcached instead of Redis, I got the same random connection timeouts to memcached.
Likewise, when I was still using MySQL, my database connection never timed out.
Things I've tried:
I've monitored my servers using New Relic. My CPU, memory, IO, and bandwidth all seem to be OK. Average response time is an acceptable 133ms.
I've upgraded to the latest gems, Ruby, Redis, etc.
I've set my Redis timeout = 0 and tcp-keepalive = 60. In the Redis "info" output, the rejected_connections stat is at 0.
I've opened a support ticket, and they suggested I run an mtr, which looks OK:
mtr --report db1
HOST: www1 Loss% Snt Last Avg Best Wrst StDev
1. db1 0.0% 10 0.5 0.5 0.4 0.8 0.1
mtr --report www1
HOST: db1 Loss% Snt Last Avg Best Wrst StDev
1. www1 0.0% 10 0.5 0.6 0.4 1.0 0.2
However, I can't run an mtr while a timeout is actually happening: it's so random that I tend to only see it afterwards in the Rails log.
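One way to catch a timeout while it is happening might be a small probe left running on www1 that pings Redis once a second and logs only slow or failed attempts, so its timestamps can be matched against the Rails log later. The following is only a rough sketch under a few assumptions: Redis listens on the default port 6379 on db1 without a password, Ruby is 2.1 or newer (for the monotonic clock), and the 100ms "slow" threshold is an arbitrary guess. The names probe_once, reportable?, format_result and watch are made up for this example.

```ruby
require "socket"
require "time"

# Hypothetical helper: open a TCP connection, send a Redis inline PING,
# and wait up to timeout_s seconds for the reply.
# Returns [status, elapsed_ms]; status is :ok, :timeout, :bad_reply or :error.
def probe_once(host, port, timeout_s)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  status =
    begin
      Socket.tcp(host, port, connect_timeout: timeout_s) do |sock|
        sock.write("PING\r\n")
        if IO.select([sock], nil, nil, timeout_s)
          sock.readpartial(64).start_with?("+PONG") ? :ok : :bad_reply
        else
          :timeout # connected, but no reply arrived in time
        end
      end
    rescue Errno::ETIMEDOUT
      :timeout # the connect itself timed out
    rescue StandardError
      :error # refused, reset, DNS failure, etc.
    end
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  [status, (elapsed * 1000).round(1)]
end

# Only failed or slow probes are worth logging.
def reportable?(status, elapsed_ms, slow_ms)
  status != :ok || elapsed_ms > slow_ms
end

# One log line per event, with a UTC timestamp for matching against other logs.
def format_result(time, status, elapsed_ms)
  "#{time.utc.iso8601} status=#{status} elapsed_ms=#{elapsed_ms}"
end

# Probe forever, printing a line whenever a probe fails or is slow.
# The default values here are assumptions, not recommendations.
def watch(host, port, timeout_s: 5, slow_ms: 100.0, interval_s: 1)
  loop do
    status, elapsed_ms = probe_once(host, port, timeout_s)
    if reportable?(status, elapsed_ms, slow_ms)
      puts format_result(Time.now, status, elapsed_ms)
      $stdout.flush
    end
    sleep interval_s
  end
end
```

Calling watch("db1", 6379) at the end of the script and leaving it running would produce a line only when something goes wrong. If a probe fails at the same moment Rails logs Redis::TimeoutError, that would point at the network or the Redis host rather than the app. If the probe stays clean while Rails still times out, the problem is more likely on the app side. Note that the timestamps are in UTC, so the Rails log times may need converting.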
I hope I haven't left out any details. Any ideas on where to start pinpointing the problem?
EDIT:
I've been monitoring for 2 days so far, and the problem seems to have magically gone away after restarting both servers (both are now Linode 1GB) for the free NextGen upgrade.
I also upgraded my Linux kernel to the latest (3.8.4 x64) and ran aptitude safe-upgrade on all of the installed packages.
So at this point I have no idea whether it's fixed because of the increased memory, the new machine/infrastructure, or something else.