High availability strange error

Hi there,

A strange issue is occurring, seemingly at random. After running fine for a while in the HA configuration (from here), the servers just stop talking to each other.

If ha1 is the master, ha2 starts trying to take over the configured services. It tries to bring up the "floating" IP which it manages (and thus causes all my sites to go down), but it can't mount the DRBD drive as it's already mounted on ha1.

On ha1, crm_mon shows that ha1 is online and ha2 is OFFLINE.

On ha2, crm_mon shows that ha2 is online and ha1 is OFFLINE.
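To make that symptom concrete: each node considers itself online and the other one gone, which is what a partitioned cluster looks like from the inside. Here is a minimal sketch of that comparison, assuming the two crm_mon views are copied by hand into dictionaries. The function name `find_partition` and the dictionary layout are made up for illustration and are not part of any cluster tool.

```python
def find_partition(views):
    """Return pairs of nodes that each see themselves online and the other OFFLINE.

    `views` maps an observing node to the status it reports for every node,
    e.g. what you read off crm_mon on that machine (hand-transcribed here).
    """
    pairs = []
    nodes = sorted(views)
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            a_alive = views[a].get(a) == "online"
            b_alive = views[b].get(b) == "online"
            a_blind = views[a].get(b) == "OFFLINE"
            b_blind = views[b].get(a) == "OFFLINE"
            # Both sides are up, yet neither can see the other: a partition.
            if a_alive and b_alive and a_blind and b_blind:
                pairs.append((a, b))
    return pairs


# The situation described above, as seen from each node.
views = {
    "ha1": {"ha1": "online", "ha2": "OFFLINE"},
    "ha2": {"ha2": "online", "ha1": "OFFLINE"},
}
print(find_partition(views))  # [('ha1', 'ha2')]
```

If only one node reported the other as OFFLINE, that would point at a genuinely dead node instead. Both nodes reporting it about each other suggests the machines are fine and the link between them is the thing to investigate.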

I'm not particularly sure which logs I should be looking at, so if anyone could help, that'd be appreciated.

Rebooting ha2 seems to work fine so I'm guessing it might be something to do with that server… I have not tried it the other way around yet.

3 Replies

This happened again this morning. I asked Linode support, and while it's not something they can really help with, they did point me to my LISH console, where there were errors about DRBD split brain. However, I think that may be a result of whatever is going wrong in the cluster management. I followed the instructions (here) to manually fix the DRBD split brain, but crm_mon on each node still showed the other node as offline.

After a while, ha2 somehow took over completely and started all its services, so I had to reboot it from the Linode console to fix it, and it's fine again. I could really do with some help on this, as my Google-fu is not bringing up many useful leads.

While I don't have much experience with the HA aspects of things, my first guess would be the communications between the two nodes (heartbeat). One node thinks the other is no longer available and is doing its configured job of taking over.
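One way to test that guess is to leave a small reachability logger running on each node and compare its timestamps with the moment crm_mon flips to OFFLINE. The following is only a rough sketch under stated assumptions: it probes TCP port 22 (SSH), which is assumed to be open between the nodes and is not the channel the cluster messaging layer itself uses, so it can show a general network drop but cannot prove the heartbeat traffic is healthy. The names `tcp_reachable` and `watch` are invented for this example.

```python
import socket
import time
from datetime import datetime


def tcp_reachable(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def watch(peer, probe=tcp_reachable, interval=10.0, samples=None, log=print):
    """Probe the peer repeatedly and log a line only when its state changes.

    samples=None means run until interrupted; a number limits the probe count.
    """
    last = None
    count = 0
    while samples is None or count < samples:
        up = probe(peer)
        if up != last:
            stamp = datetime.now().isoformat(timespec="seconds")
            log(f"{stamp} {peer} {'reachable' if up else 'UNREACHABLE'}")
            last = up
        count += 1
        if samples is None or count < samples:
            time.sleep(interval)


# Example (runs until interrupted): on ha1, call watch("ha2"); on ha2, call watch("ha1").
```

If the logger shows the peer as UNREACHABLE at the same time the cluster splits, the problem is likely in the network path between the two nodes. If it stays reachable throughout, the cluster messaging configuration is the more likely suspect.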

Hopefully someone else here will chime in.

Travis

Yeah, that sounds about right, otherbbs. That's about all I can gather too.

I opened a Linode ticket. They can't really do much, but they did point me in the direction of the LISH shell, which had an error about DRBD split brain, as I mentioned in my previous post. Unfortunately, as I suspected, that's not the cause of the problem, just a consequence of it. The real error has something to do with the cluster management stuff, which I'm clueless about.

At the moment I've set ha2 to standby (crm node standby ha2), which has stopped it from taking all my sites down, but the error still exists. It's worth noting that because ha2 hasn't tried taking over, the DRBD split brain situation hasn't arisen, hence my reasoning that it's not the root of the problem.

Even stranger is that yesterday ha2 was standby + OFFLINE, but today (without restarting ha2), it is just standby (therefore online).

I don't even know what to look for in the logs… I'm considering just dropping the second Linode completely and going back to a single Linode. This hassle just isn't worth the extra money I'm spending…
