Need help debugging a random connection timeout

Hi everyone,

Linux newbie here. I need help debugging a random connection timeout between my app server and my database server.

Servers:

Linode 768, Debian 6 (64-bit) - App server (www1):

- Ruby 2
- Rails 3 (with Rainbows! as server)
- Sidekiq (async background message processor)
- pgbouncer

Linode 512, Debian 6 (64-bit) - DB server (db1):

- Postgres 9.2
- Sphinx Search
- Redis 2.6.11 (with AOF persistence)

The two servers talk to each other over their private IPs. Redis is my main Rails cache store.

Problem:

Sometimes my application server throws errors like these:

Redis::TimeoutError (Connection timed out)

ActionView::Template::Error (Connection timed out):

It happens randomly: it can occur whether there are <10 or >60 people active on the site.

The strange thing is, my Postgres connection has NEVER had this problem (timing out).

Other things to note:

When I was still using memcached instead of Redis, I got the same random connection timeouts to memcached.

Likewise, when I was still using MySQL, my database connection never timed out.

Things I've tried:

I've monitored my servers using New Relic. CPU, memory, IO, and bandwidth all seem to be OK. The average response time is an acceptable 133 ms.

I've upgraded to the latest gems, Ruby, Redis, etc.

I've set my Redis timeout = 0 and tcp-keepalive = 60. In the Redis "info" output, the rejected_connections stat is at 0.

I've opened a support ticket, and they suggested I run an mtr, which seems to be OK:

    mtr --report db1
    HOST: www1                        Loss%   Snt   Last   Avg  Best  Wrst StDev
      1. db1                           0.0%    10    0.5   0.5   0.4   0.8   0.1

    mtr --report www1
    HOST: db1                         Loss%   Snt   Last   Avg  Best  Wrst StDev
      1. www1                          0.0%    10    0.5   0.6   0.4   1.0   0.2

However, I can't run an mtr while a timeout is actually happening, because it's so random that I tend to only see it in the Rails log.
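One way to catch a timeout as it happens might be to leave a small probe running on www1 that opens a TCP connection to db1 every second and logs the result with a timestamp, so the lines can later be matched against the Rails log. The following is only a rough sketch in Python, not something I have run in production. The port 6379 (the Redis default), the 1 second connect timeout, the 100 ms "slow" threshold, and the names `probe`, `format_result` and `run` are all assumptions made for illustration.

```python
import socket
import time
from datetime import datetime


def probe(host, port, timeout=1.0):
    """Return the TCP connect time in milliseconds, or None on failure."""
    start = time.monotonic()
    try:
        # Open a connection and close it again straight away.
        with socket.create_connection((host, port), timeout=timeout):
            return (time.monotonic() - start) * 1000.0
    except OSError:
        # Covers timeouts, refused connections and unreachable hosts.
        return None


def format_result(when, latency_ms, slow_ms=100.0):
    """Build one log line, flagging failures and slow connects."""
    stamp = when.strftime("%Y-%m-%d %H:%M:%S")
    if latency_ms is None:
        return f"{stamp} FAIL"
    tag = "SLOW" if latency_ms >= slow_ms else "ok"
    return f"{stamp} {tag} {latency_ms:.1f}ms"


def run(host="db1", port=6379, interval=1.0):
    """Probe forever, printing one line per attempt.

    host and port are assumptions: db1 is the DB server named above,
    and 6379 is the default Redis port.
    """
    while True:
        print(format_result(datetime.now(), probe(host, port)), flush=True)
        time.sleep(interval)
```

Calling `run()` at the bottom of the file and redirecting the output to a log file would give a per-second record. If FAIL or SLOW lines show up at the same times as the Redis::TimeoutError entries, that would point at the network path or the DB host rather than at Redis itself. If the probe stays clean while Rails still times out, the problem is more likely inside the app or the Redis process.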

I hope I haven't left out any details. Any ideas on where to start pinpointing the problem?

1 Reply

I'm moving Redis to the app server (localhost) to see whether that stops the problem.

EDIT:

I've been monitoring for 2 days so far, and the problem seems to have magically gone away after restarting both servers (both now Linode 1GB) for the free NextGen upgrade.

I also upgraded my Linux kernel to the latest (3.8.4 x64) and ran aptitude safe-upgrade on all of the installed packages.

So at this point I have no idea whether it's fixed because of the increased memory, the new machine/infrastructure, or something else.
