Cross-datacenter failover questions

I would like to set up a cross-datacenter HA setup for a Joomla website. I have read enough here to see that it is possible, but being a relative amateur when it comes to Linux management, I have some questions before I get started. I've read through the library tutorial at http://library.linode.com/linux-ha/ip-failover-heartbeat-pacemaker-drbd-mysql-ubuntu-10.04 and it makes sense as far as setting up failover within the same datacenter, but it doesn't go so far as cross-datacenter setups.

For clarification's sake, I'm really only interested in redundancy in the (hopefully rare) case that the primary datacenter is unavailable. I don't care about load balancing, etc., just a backup to reduce the chances of site downtime. The sites I host don't get a huge amount of traffic, but we're moving from Dreamhost, where frequent random outages have my clients all in a huff.

Basically, I'd simply like to know what additional/different steps are required for a cross-datacenter setup as compared to the tutorial. Is it as simple as using static IPs in place of private IPs in the "Configure Private Networking" section? My experience is that nothing is ever that simple, and I expect there's more to it than that.

Thanks

5 Replies

Cross-datacenter failover is problematic because the IP address for a given Linode is fixed to the datacenter where that Linode resides. You have to use DNS to do the switchover, which takes time.
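If you do go the DNS route, the usual mitigation is a low TTL on the record you plan to repoint, so resolvers don't hang on to the old address for too long. A rough illustration only (the name and addresses below are placeholders, not anything specific to your setup):

```
; illustrative zone-file lines; 203.0.113.10 and 198.51.100.20 are placeholder addresses
www   300   IN   A   203.0.113.10    ; primary Linode, 5-minute TTL
; on failover, repoint the record at the secondary in the other datacenter:
; www 300   IN   A   198.51.100.20   ; secondary Linode
```

Even with a low TTL, some resolvers and clients cache past it, so the cutover is never instantaneous.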

Also, synchronization of data between two locations isn't quite as easy, due to the distance involved. Some methods work OK, other methods don't work out well at all. Transporting data over the Internet isn't free, either.

I'm not sure I'd use DRBD. I'd probably use whatever replication capabilities are available in the database engine, perhaps with rsync (or maybe DRBD, if it will do it) for relatively static files.
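For MySQL, for example, the built-in replication mostly comes down to a couple of my.cnf settings on each side (plus a replication user and an initial copy of the data), with rsync handling the mostly-static files. A very rough sketch, with placeholder server IDs, database name, hostname, and paths, so don't paste it in as-is:

```
# primary's my.cnf (placeholder database name)
[mysqld]
server-id     = 1
log_bin       = /var/log/mysql/mysql-bin.log
binlog_do_db  = joomla

# secondary's my.cnf
[mysqld]
server-id     = 2
relay-log     = /var/log/mysql/mysql-relay-bin.log

# plus a cron entry on the primary for the mostly-static files,
# e.g. Joomla's images directory (path and host are guesses):
# 0 3 * * * rsync -az --delete /var/www/joomla/images/ backup.example.com:/var/www/joomla/images/
```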

There's also the whole question of recovery: once the primary fails and you cut over to the secondary, you need all the steps in place for re-synchronizing the primary and then switching back the other way. You also have to hope the disruption wasn't partial (or just a network partition), where part of the system thought it was still good when it wasn't, leaving you with split state. It's a lot of work, especially around state management.

For my own purposes, I've more or less concluded that I'm better off with a basic synchronization process to a secondary (rsync/unison for static filesystem content, the database's own replication support for the data), while leaving the cut-over process under manual control.

hoopycat's bandwidth comment is well taken too. For example, in one of my node pairs, my main standby node is in the same DC as the primary, with a maximum sync latency of 60s between the two over the private network. Doing the same between DCs could eat a full node's bandwidth allotment over a month, so it might require allowing for a somewhat larger latency (say 5 min), or just allocating the bandwidth to that task.
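Back-of-the-envelope, with completely made-up numbers, just to show the scale:

```
# Hypothetical figures for a 60s cross-DC sync:
#   one month at 60s intervals = 60 * 24 * 30 = 43,200 transfers
#   at an average of 10 MB shipped per transfer:
echo $(( 43200 * 10 ))   # 432,000 MB, i.e. roughly 430 GB of transfer per month
```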

Of course, this does impose a minimum latency on any eventual cut-over (mostly my deciding to take the step, and then DNS propagation), but to be honest, I'm more concerned with protecting against an unexpected multi-hour (or longer) outage due to a serious failure than with a few minutes here or there.

The incremental cost (time, configuration, expense) to achieve zero-latency HA is pretty extreme for the benefits, at least in my own scenarios, and without a mechanism to work around DNS propagation, there's always going to be some appreciable latency to bringing up a standby anyway.

– David

Thanks for all your responses and for setting me straight about the reality of accomplishing this.

Here's what I'm thinking of doing, as a sort-of compromise solution. First off, the site doesn't change all that much on a regular basis aside from new user registrations (which are strictly database changes). I do all of the new article posting myself anyway, so I can do any image/file syncing manually when it's required. For the database, I think I can do pretty much the same: manually sync when I make changes to articles, etc. I'll also turn off the registered area on the backup site so no database changes can happen there and no other synchronization is necessary, and the public portion of the site can still be available on the secondary node in case of primary failure.

Honestly I think any more than that is overkill for this particular client, and you all make very valid points about the difficulty involved in doing any more than that.

I was looking at DNS Made Easy to handle DNS failover (I've seen it mentioned on other posts here, and it looks like a good/easy solution). Any other good options for handling that part?

That sounds like a lot of work, doing it all manually. You're probably better off coming up with something simple but automated, for example nightly syncing via rsync: run a cron job on your primary that initiates a blocking (consistent, if your database access is transactional) database dump, rsyncs it and any website changes to the secondary box, and then, over SSH, kicks off a database import on the secondary box.

In terms of more timely updates for when you post an article, if you've got some sort of article posting script, you can just have it execute the script that the nightly cron job executes (or if you have no article posting script, initiate it yourself). The sync should be pretty fast since little will have changed, and while the import on the secondary box might take a while, that shouldn't matter since your primary box doesn't need to wait on that.
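Roughly what I mean, as a sketch only (the hostname, paths, and database name are placeholders, and it assumes MySQL credentials are already set up in ~/.my.cnf on both boxes):

```
#!/bin/bash
# One-way sync from primary to secondary; run it from cron nightly
# and/or call it from the article-posting workflow.
DB_NAME=joomla                   # hypothetical database name
WEB_ROOT=/var/www/joomla         # hypothetical web root
SECONDARY=backup.example.com     # hypothetical secondary host

# Consistent dump: --single-transaction avoids blocking for InnoDB tables;
# for MyISAM you'd use --lock-tables instead (blocking, but consistent).
mysqldump --single-transaction "$DB_NAME" | gzip > /tmp/"$DB_NAME".sql.gz

# Ship the dump and any changed site files to the secondary.
rsync -az /tmp/"$DB_NAME".sql.gz "$SECONDARY":/tmp/
rsync -az --delete "$WEB_ROOT"/ "$SECONDARY":"$WEB_ROOT"/

# Import on the secondary; the primary's database isn't involved at this point,
# so it doesn't matter much how long this part takes.
ssh "$SECONDARY" "gunzip -c /tmp/$DB_NAME.sql.gz | mysql $DB_NAME"
```

Hook it into cron on the primary (something like 0 2 * * * for a 2 AM run), and the nightly run and the post-an-article run are then the same script.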
