Packet loss on private network leads to heartbeat death
Twice in the last couple of months, one of the nodes has had heartbeat jump to 100% CPU; this time it was the backup node that had the problem, yet it caused apache on the primary to go unresponsive as well. The first time, both nodes needed rebooting; so far this time, only the failed backup node has needed a reboot. Once that was done, apache on the primary resumed serving, but it still might be a little wacky - I haven't fully checked it out yet.
Searching around, it seems there are instances of heartbeat taking up 100% CPU time and rendering the box useless until it's rebooted. From what I can tell, it's generally caused by a snowballing backlog of lost-and-retransmitted packets.
This problem may have been triggered by today's DoS attack against Newark (which started ~3 hours before the status alert), but I've snipped a couple of bits from the log file below anyway. While I don't have any control over the network, does anyone have any suggestions on how to improve this situation on the server?
Thanks in advance.
Log from when the problem seemed to start (timezone is UTC):
Oct 8 17:03:27 ewrha01 heartbeat: [1027]: WARN: 3 lost packet(s) for [ewrha02] [256919:256923]
Oct 8 17:03:27 ewrha01 heartbeat: [1027]: WARN: Late heartbeat: Node ewrha02: interval 8010 ms
Oct 8 17:03:30 ewrha01 heartbeat: [1027]: info: No pkts missing from ewrha02!
Oct 8 17:03:31 ewrha01 heartbeat: [1027]: WARN: 1 lost packet(s) for [ewrha02] [256923:256925]
Oct 8 17:03:31 ewrha01 heartbeat: [1027]: info: No pkts missing from ewrha02!
Oct 8 17:03:35 ewrha01 heartbeat: [1027]: WARN: 1 lost packet(s) for [ewrha02] [256925:256927]
Oct 8 17:03:35 ewrha01 attrd: [1061]: info: attrd_trigger_update: Sending flush op to all hosts for: master-drbd_webfs:1 (1000)
Oct 8 17:03:35 ewrha01 attrd: [1061]: info: attrd_perform_update: Sent update 125: master-drbd_webfs:1=1000
Oct 8 17:03:36 ewrha01 heartbeat: [1027]: info: No pkts missing from ewrha02!
Oct 8 17:03:37 ewrha01 heartbeat: [1027]: WARN: 1 lost packet(s) for [ewrha02] [256927:256929]
Oct 8 17:03:37 ewrha01 heartbeat: [1027]: info: No pkts missing from ewrha02!
Oct 8 17:03:45 ewrha01 heartbeat: [1027]: WARN: 3 lost packet(s) for [ewrha02] [256929:256933]
Oct 8 17:03:45 ewrha01 heartbeat: [1027]: WARN: Late heartbeat: Node ewrha02: interval 8000 ms
Oct 8 17:03:46 ewrha01 heartbeat: [1027]: info: No pkts missing from ewrha02!
Oct 8 17:03:49 ewrha01 heartbeat: [1027]: WARN: 1 lost packet(s) for [ewrha02] [256933:256935]
Oct 8 17:03:53 ewrha01 attrd: [1061]: info: attrd_trigger_update: Sending flush op to all hosts for: master-drbd_userfs:1 (1000)
Oct 8 17:03:53 ewrha01 attrd: [1061]: info: attrd_perform_update: Sent update 127: master-drbd_userfs:1=1000
Oct 8 17:03:54 ewrha01 heartbeat: [1027]: info: No pkts missing from ewrha02!
Oct 8 17:03:55 ewrha01 heartbeat: [1027]: WARN: 2 lost packet(s) for [ewrha02] [256935:256938]
Oct 8 17:03:55 ewrha01 heartbeat: [1027]: WARN: Late heartbeat: Node ewrha02: interval 6010 ms
Oct 8 17:03:55 ewrha01 heartbeat: [1027]: info: No pkts missing from ewrha02!
Oct 8 17:03:57 ewrha01 heartbeat: [1027]: WARN: 1 lost packet(s) for [ewrha02] [256938:256940]
And I now have 80MB of this, with the dispatch delay growing as time went on.
Oct 8 17:39:26 ewrha01 heartbeat: [1027]: WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 20 ms (> 10 ms) (GSource: 0x8d0aa48)
Oct 8 17:39:26 ewrha01 heartbeat: [1027]: info: Link ewrha02:eth0 dead.
Oct 8 17:39:26 ewrha01 heartbeat: [1027]: WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 20 ms (> 10 ms) (GSource: 0x8d0bb58)
Oct 8 17:39:27 ewrha01 heartbeat: [1027]: WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 20 ms (> 10 ms) (GSource: 0x8d0c378)
Oct 8 17:39:27 ewrha01 heartbeat: [1027]: WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 20 ms (> 10 ms) (GSource: 0x8d0d910)
5 Replies
Since Linode doesn't add a second interface for the private network (which is strange, by the way), you can at least try using your private addresses in the heartbeat configuration.
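Something along these lines in ha.cf should do it - the address below is a placeholder for the peer's actual private IP, and since the private IP shares eth0 with the public one on a Linode, the interface stays the same:
# /etc/ha.d/ha.cf on ewrha01 (excerpt)
# send unicast heartbeats to ewrha02's private address (placeholder IP)
ucast eth0 192.168.133.2
with the mirror-image line (ewrha01's private IP) in ewrha02's ha.cf.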
What version of heartbeat are you using, by the way? 3.0.5 is the latest.
The tl;dr was that heartbeat went split-brain (lost inter-node connectivity) even though there appeared to be no network loss. This was after seeing (and ignoring) the late and missing heartbeat messages mentioned by the OP. I'd always wondered about them, but they seemed benign - until things went very south. I did a bit of searching and eventually turned realtime on in ha.cf. That (supposedly) gives heartbeat a much higher scheduling priority and keeps it pinned in RAM. The docs mention it as a debugging thing, but it seemed to resolve the issue. And it makes sense: even if the machine is under load (which I don't entirely understand, because the nodes aren't really …), heartbeat needs to be one of the highest priorities, since very bad things happen if it mistakes its own priority problems for the other node being dead.
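For reference, the relevant bit of my ha.cf now looks roughly like this - the timings are just what I happen to use and may not suit your setup:
# /etc/ha.d/ha.cf (excerpt)
# run heartbeat at elevated scheduling priority and keep it locked in RAM
realtime on
# heartbeat interval, warning threshold and dead time, in seconds
keepalive 1
warntime 10
deadtime 30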
This came up because I got into a fairly weird failure state where my heartbeat pair had decided who owned an IP resource but the upstream router had a different idea. Basically, the upstream router had a stale ARP entry.
It's unclear exactly how this happened, though it was after a "virtual" split-brain incident: the heartbeat pair "lost" connectivity between them, so both tried to master the resource concurrently. When connectivity came back, they sorted things out amongst themselves, but the upstream router was still confused. That shouldn't have happened; I still don't know why it did, and I'm on the watch for recurrences. Restarting heartbeat fixed things.
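If it happens again, I'd try forcing a gratuitous ARP for the floating IP from the node that should own it rather than restarting heartbeat - something like the line below, where the interface and address are placeholders. (I believe heartbeat's IPaddr resource does the equivalent via its send_arp helper when it takes over an address, so this is just a manual nudge for a confused router.)
# advertise the floating IP to refresh the upstream router's ARP cache (iputils arping)
arping -U -c 5 -I eth0 203.0.113.10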
I'll try shifting to private addresses to see if that helps - I can't see how it would hurt. Unfortunately the disruptions are so uncommon that I don't know how I can test for improvement.
I've added the realtime directive to see if it helps prevent the issue in the future, but as with smparkes' experience, both boxes have negligible load and effectively zero swapping, including prior to this event.
Thank you both.
@jrq:
I'll try shifting to private addresses to see if that helps
I don't think that matters as long as nothing malicious is going on. That said, I think it's a good idea. I also have iptables rules on my private interfaces that drop packets that aren't from my other nodes. Private addresses aren't internet-routable, but they are accessible from any Linode in the same datacenter (I'm pretty sure).
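The rules are roughly the following - the 192.168.x.x addresses are placeholders for the peer's private IP and my own, and since the private IP sits on the same interface as the public one here, I match on the destination address rather than on a separate device:
# accept traffic from the peer's private address to ours, drop everything else aimed at our private address
iptables -A INPUT -s 192.168.133.2 -d 192.168.133.1 -j ACCEPT
iptables -A INPUT -d 192.168.133.1 -j DROP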
@smparkes:
I don't think that matters as long as nothing malicious is going on. That said, I think it's a good idea. I also have iptables rules on my private interfaces that drop packets that aren't from my other nodes. Private addresses aren't internet-routable, but they are accessible from any Linode in the same datacenter (I'm pretty sure).
I have similar rules on both the public and private side that only allow traffic from my other nodes, apart from ports 80, 443, and a remapped ssh port. I guess I'm hoping the switches might treat traffic on the private network better in a DoS scenario than traffic using public IPs. Even if that's not the case, I figure it shouldn't be treated any worse.