How to fix High Availability Failure

Hi folks,

I followed along with the Linode Library HA article to set up two Linodes running MySQL.

http://library.linode.com/linux-ha/highly-available-file-database-server-ubuntu-10.04

Now, after the latest Fremont DDoS attack last night (I rebooted ha2-db because I could not reach it over ssh), both of my Linodes think they are the Primary machine according to the "crm_mon" command, but ha1-db is the one actually doing the work. ha2-db was not able to mount the file system and consequently did not launch MySQL. Running "cat /proc/drbd" on each node also shows ha2-db as Secondary with Inconsistent data. I tried running "drbdadm invalidate all" on ha2-db, since that has fixed DRBD synchronization issues for me in the past, but not this time.

ha2-db looks like:

version: 8.3.7 (api:88/proto:86-91)
GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by root@ha2-db, 2010-11-11 04:04:44
 0: cs:WFConnection ro:Secondary/Unknown ds:Inconsistent/DUnknown C r----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:6291228

ha1-db looks like:

version: 8.3.7 (api:88/proto:86-91)
GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by root@ha1-db, 2010-11-11 04:03:49
 0: cs:StandAlone ro:Primary/Unknown ds:UpToDate/DUnknown   r----
    ns:0 nr:0 dw:3615468 dr:4021959 al:4472 bm:4304 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:603364
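
From what I've read, the standard way to recover from a DRBD split like this is to reconnect the nodes manually and tell the stale node to discard its data. Something like the following, assuming the DRBD resource is named "r0" as in my drbd.conf (I haven't actually run this yet, so treat it as a rough sketch):

    # on ha2-db, the node whose data should be discarded
    drbdadm secondary r0
    drbdadm -- --discard-my-data connect r0

    # on ha1-db, which is currently StandAlone with the up-to-date data
    drbdadm connect r0

If that works, /proc/drbd should show SyncSource on ha1-db and SyncTarget on ha2-db while the resync runs.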

The main problem is that ha1-db is running at over 100% CPU, because the heartbeat process is pegged at 100% all the time. Database queries still work, but I expect more traffic after the holiday and don't know what will happen then. I'm hesitant to just reboot ha1-db and cross my fingers.

top on ha1-db shows:

643 root      -2   0  105m 105m 5876 R  100 21.4 914:08.17 heartbeat
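
Before doing a full reboot I am considering a couple of less drastic steps, though I honestly don't know whether they are safe while heartbeat is in this state:

    # see what the spinning heartbeat process (PID 643 from the top output) is doing
    strace -p 643

    # or restart just the heartbeat daemon instead of rebooting the whole Linode
    /etc/init.d/heartbeat restart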

Any other suggestions to debug or fix this?

crm_mon from ha1-db shows:

============
Last updated: Thu Nov 25 19:02:20 2010
Stack: Heartbeat
Current DC: ha1-db (c8658f6d-b186-4143-9b16-5eacf721cb7b) - partition with quorum
Version: 1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd
2 Nodes configured, 1 expected votes
2 Resources configured.
============

Online: [ ha1-db ]
OFFLINE: [ ha2-db ]

 Resource Group: HAServices
     ip1        (ocf::heartbeat:IPaddr2):       Started ha1-db
     ip1arp     (ocf::heartbeat:SendArp):       Started ha1-db
     fs_mysql   (ocf::heartbeat:Filesystem):    Started ha1-db
     mysql      (ocf::heartbeat:mysql): Started ha1-db
 Master/Slave Set: ms_drbd_mysql
     Masters: [ ha1-db ]
     Stopped: [ drbd_mysql:0 ]

crm_mon from ha2-db shows:

============
Last updated: Thu Nov 25 19:03:18 2010
Stack: Heartbeat
Current DC: ha2-db (a46a8fc8-2c6a-4f81-93a3-2dab6f9439c2) - partition with quorum
Version: 1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd
2 Nodes configured, 1 expected votes
2 Resources configured.
============

Online: [ ha2-db ]
OFFLINE: [ ha1-db ]

 Resource Group: HAServices
     ip1        (ocf::heartbeat:IPaddr2):       Started ha2-db
     ip1arp     (ocf::heartbeat:SendArp):       Started ha2-db
     fs_mysql   (ocf::heartbeat:Filesystem):    Started ha2-db FAILED
     mysql      (ocf::heartbeat:mysql): Stopped
 Master/Slave Set: ms_drbd_mysql
     Slaves: [ ha2-db ]
     Stopped: [ drbd_mysql:1 ]

Failed actions:
    fs_mysql_start_0 (node=ha2-db, call=14, rc=1, status=complete): unknown error

Also, I can ping each node from the other.
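
Once the two nodes can see each other again, I am assuming I will also need to clear the failed fs_mysql start on ha2-db so Pacemaker will retry it, something like:

    # clear the failure count and failed action for fs_mysql (crm shell)
    crm resource cleanup fs_mysql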

More detail (tail of /var/log/syslog):

Nov 25 20:23:07 ha1-db heartbeat: [643]: ERROR: Message hist queue is filling up (500 messages in queue)
Nov 25 20:23:07 ha1-db heartbeat: [643]: WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 100 ms (> 10 ms) (GSource: 0x1011cd10)
Nov 25 20:23:07 ha1-db lrmd: [757]: info: RA output: (ip1:monitor:stderr) eth0:1: warning: name may be invalid
Nov 25 20:23:07 ha1-db heartbeat: [643]: WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 80 ms (> 10 ms) (GSource: 0x1011cd78)
Nov 25 20:23:07 ha1-db heartbeat: [643]: WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 80 ms (> 10 ms) (GSource: 0x1011cde0)
Nov 25 20:23:07 ha1-db heartbeat: [643]: WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 110 ms (> 10 ms) (GSource: 0x1011ce48)
Nov 25 20:23:08 ha1-db heartbeat: [643]: WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 100 ms (> 10 ms) (GSource: 0x1011ceb0)
Nov 25 20:23:08 ha1-db heartbeat: [643]: WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 120 ms (> 10 ms) (GSource: 0x1011cf18)

Thanks, Josh

1 Reply

Looks like heartbeat can get into an infinite loop in some situations.

http://www.gossamer-threads.com/lists/linuxha/users/67922

I planned for the worst, did a Linode backup and a database backup, then rebooted ha1-db.

Everything is working as expected, and ha2-db is now syncing properly with DRBD.
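
For anyone who runs into the same thing, these are the checks I used afterwards to confirm that both nodes were healthy again:

    # DRBD should show Connected with UpToDate/UpToDate once the resync finishes
    cat /proc/drbd

    # both nodes should be Online, with the HAServices group started on one of them
    crm_mon -1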

Thanks, Josh
