How to fix High Availability Failure
I followed along with the Linode Library HA article to set up two Linodes running MySQL.
Now, after the latest Fremont DDoS attack last night (I rebooted ha2-db because I could not reach it over ssh), both of my Linodes think they are the primary machine according to "crm_mon", but ha1-db is the one actually doing the work. ha2-db was not able to mount the filesystem and consequently never started MySQL. Running "cat /proc/drbd" on each node also shows ha2-db as Secondary with an Inconsistent disk state. I tried running "drbdadm invalidate all" on ha2-db, since that had fixed DRBD synchronization problems for me before, but not this time. ha2-db looks like:
version: 8.3.7 (api:88/proto:86-91)
GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by root@ha2-db, 2010-11-11 04:04:44
0: cs:WFConnection ro:Secondary/Unknown ds:Inconsistent/DUnknown C r----
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:6291228
ha1-db looks like:
version: 8.3.7 (api:88/proto:86-91)
GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by root@ha1-db, 2010-11-11 04:03:49
0: cs:StandAlone ro:Primary/Unknown ds:UpToDate/DUnknown r----
ns:0 nr:0 dw:3615468 dr:4021959 al:4472 bm:4304 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:603364
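Looking at the two states side by side, ha1-db sitting in StandAlone while ha2-db waits in WFConnection makes me think the nodes never reconnected after the reboot (possibly a split-brain that DRBD refused to resolve on its own), which would also explain why the invalidate on the disconnected secondary went nowhere. If I'm reading the DRBD 8.3 docs correctly, the manual recovery would be roughly the following; "r0" is just a placeholder for whatever the resource is actually called in /etc/drbd.conf:

On ha2-db (the side whose data gets discarded):
drbdadm secondary r0
drbdadm disconnect r0
drbdadm -- --discard-my-data connect r0

On ha1-db (currently StandAlone, so it has to be told to reconnect):
drbdadm connect r0

After that I would expect cat /proc/drbd to show SyncSource on ha1-db, SyncTarget on ha2-db, and the oos counter shrinking, but I'd appreciate a sanity check before I run any of it.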
The main problem is that ha1-db is running at over 100% CPU, because the heartbeat process is pegged at 100% all the time. Database queries are still working, but after the holiday I expect more traffic and don't know what will happen. I'm nervous about just rebooting ha1-db and crossing my fingers.
top on ha1-db shows:
643 root -2 0 105m 105m 5876 R 100 21.4 914:08.17 heartbeat
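Before resorting to a reboot, is it worth attaching strace to the heartbeat process (PID 643 in the top output above) to see what it is actually spinning on? Something like:

strace -c -p 643    # let it run for ten seconds or so, then Ctrl-C for a per-syscall summary

The -c flag only collects a summary rather than printing every call, so it should be much less invasive than a full trace, but I'd welcome a second opinion on whether it is safe to attach to heartbeat while the cluster is in this state.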
Any other suggestions to debug or fix this?
crm_mon from ha1-db shows:
============
Last updated: Thu Nov 25 19:02:20 2010
Stack: Heartbeat
Current DC: ha1-db (c8658f6d-b186-4143-9b16-5eacf721cb7b) - partition with quorum
Version: 1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd
2 Nodes configured, 1 expected votes
2 Resources configured.
============
Online: [ ha1-db ]
OFFLINE: [ ha2-db ]
Resource Group: HAServices
ip1 (ocf::heartbeat:IPaddr2): Started ha1-db
ip1arp (ocf::heartbeat:SendArp): Started ha1-db
fs_mysql (ocf::heartbeat:Filesystem): Started ha1-db
mysql (ocf::heartbeat:mysql): Started ha1-db
Master/Slave Set: ms_drbd_mysql
Masters: [ ha1-db ]
Stopped: [ drbd_mysql:0 ]
crm_mon from ha2-db shows:
============
Last updated: Thu Nov 25 19:03:18 2010
Stack: Heartbeat
Current DC: ha2-db (a46a8fc8-2c6a-4f81-93a3-2dab6f9439c2) - partition with quorum
Version: 1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd
2 Nodes configured, 1 expected votes
2 Resources configured.
============
Online: [ ha2-db ]
OFFLINE: [ ha1-db ]
Resource Group: HAServices
ip1 (ocf::heartbeat:IPaddr2): Started ha2-db
ip1arp (ocf::heartbeat:SendArp): Started ha2-db
fs_mysql (ocf::heartbeat:Filesystem): Started ha2-db FAILED
mysql (ocf::heartbeat:mysql): Stopped
Master/Slave Set: ms_drbd_mysql
Slaves: [ ha2-db ]
Stopped: [ drbd_mysql:1 ]
Failed actions:
fs_mysql_start_0 (node=ha2-db, call=14, rc=1, status=complete): unknown error
I can also ping each node from the other.
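Once the DRBD side is sorted out, I assume I will also need to clear that failed fs_mysql start on ha2-db so Pacemaker is willing to try the filesystem again. If I have the crm shell syntax right, that would just be:

crm resource cleanup fs_mysql

My understanding is that this only resets the resource's recorded failure so the cluster will retry the start; it doesn't touch the data itself. Please correct me if the DRBD problem needs to be fixed first.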
More detail (tail of /var/log/syslog):
Nov 25 20:23:07 ha1-db heartbeat: [643]: ERROR: Message hist queue is filling up (500 messages in queue)
Nov 25 20:23:07 ha1-db heartbeat: [643]: WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 100 ms (> 10 ms) (GSource: 0x1011cd10)
Nov 25 20:23:07 ha1-db lrmd: [757]: info: RA output: (ip1:monitor:stderr) eth0:1: warning: name may be invalid
Nov 25 20:23:07 ha1-db heartbeat: [643]: WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 80 ms (> 10 ms) (GSource: 0x1011cd78)
Nov 25 20:23:07 ha1-db heartbeat: [643]: WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 80 ms (> 10 ms) (GSource: 0x1011cde0)
Nov 25 20:23:07 ha1-db heartbeat: [643]: WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 110 ms (> 10 ms) (GSource: 0x1011ce48)
Nov 25 20:23:08 ha1-db heartbeat: [643]: WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 100 ms (> 10 ms) (GSource: 0x1011ceb0)
Nov 25 20:23:08 ha1-db heartbeat: [643]: WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 120 ms (> 10 ms) (GSource: 0x1011cf18)
Thanks, Josh
1 Reply
I planned for the worst, did a Linode backup and a database backup, then rebooted ha1-db.
Everything is working as expected, and ha2-db is now syncing properly with DRBD.
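In case it helps anyone else who lands on this thread, a simple way to keep an eye on the resync after a reboot is:

watch -n1 cat /proc/drbd

and wait for the connection state to return to Connected with ds:UpToDate/UpToDate on both nodes before trusting failover again.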
Thanks, Josh