How to protect against datacenter failure
Does anyone know what I can do to protect myself against future failures? I simply can't afford for this to happen.
6 Replies
Maybe just move datacenter? I'm planning to move my servers out of Fremont this weekend. Of course you will still get the occasional datacenter failure, but it's going to be much rarer.
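Probably the cheapest option is to use a simple round-robin DNS setup to serve your site from multiple data centers. The drawback here is that if one of your nodes goes down, you have to update DNS manually and this update may take a while to propagate. Some of your users will see downtime as a result.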
One step up from basic round-robin DNS is to use a third-party DNS service that does load balancing and health monitoring. UltraDNS
From there things get really expensive. Hardware products like BIG-IP
That covers the external stuff. For the internal side, you need to think about how to replicate configuration and data between nodes. These options are covered pretty well in the Linode Library:
For your setup, you'll need to consider Apache and MySQL. You can ignore the parts about IP failover if you're sticking with one node per data center.
To tie it all together for the highest availability, I would recommend combining at least two nodes behind a NodeBalancer in each of at least two data centers. Then you stick UltraDNS in front of your NodeBalancers. This covers individual node downtime, host downtime, and a full data center outage.
Hope this helps!
@dwc:
Probably the cheapest option is to use a simple round-robin DNS setup to serve your site from multiple data centers. The drawback here is that if one of your nodes goes down, you have to update DNS manually and this update may take a while to propagate. Some of your users will see downtime as a result.
It's important to note that the propagation delay applies to any DNS-based solution (and this includes UltraDNS), and that setting a low TTL can minimize the impact here (at the expense of higher load on the DNS server, and slightly longer page load times for users, since their resolvers have to repeat the lookup more often). It's also important to note that manually activating failover is probably the only reliable solution (unless you get really fancy and complicated), because otherwise whatever is monitoring your uptime becomes a single point of failure. UltraDNS' solution isn't really any better than round-robin, except that it automates the failover (and potentially introduces another single point of failure). That's not to say that the convenience might not be worth it.
The good thing about the round-robin approach is that, from the time one server goes down until you can take it out of the rotation, you're not losing all your traffic, only a portion of it. If you have a Linode in each of six datacenters, and it takes 15 minutes between the server going down and your DNS changes propagating, you're only losing about 17% (one-sixth) of your traffic for 15 minutes.
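To put numbers on that, here is a rough sketch of the arithmetic. It assumes round-robin spreads requests evenly across all addresses, which real resolvers and clients only approximate; the function names are made up for illustration, and the six-server count is simply what the 17% figure implies.

```python
def lost_traffic_fraction(total_servers: int, down_servers: int = 1) -> float:
    """Fraction of requests sent to a dead address, assuming an even round-robin split."""
    if total_servers <= 0:
        raise ValueError("total_servers must be positive")
    if not 0 <= down_servers <= total_servers:
        raise ValueError("down_servers must be between 0 and total_servers")
    return down_servers / total_servers


def lost_requests(requests_per_minute: float, outage_minutes: float,
                  total_servers: int, down_servers: int = 1) -> float:
    """Approximate number of requests lost before the dead address leaves the rotation."""
    fraction = lost_traffic_fraction(total_servers, down_servers)
    return requests_per_minute * outage_minutes * fraction


if __name__ == "__main__":
    # One server out of six down: roughly 17% of traffic affected.
    print(f"{lost_traffic_fraction(6):.0%}")
    # Hypothetical load of 600 requests/minute over a 15-minute window.
    print(lost_requests(600, 15, 6))
```

With only two servers the same outage costs half your traffic, so the benefit grows with the number of locations in the rotation.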
From there, you would need to keep an eye on your server and update the DNS as soon as you can when it goes down. You can also use a DNS service that supports failover, which will ping your main server and switch over to the backup when it doesn't respond, though as Guspaz pointed out, that would also add an extra point of failure. Nothing says you need to fill in all the name server slots at your domain registrar with the same DNS provider; you can use two or three name server entries from another DNS provider, and fill in the others with Linode's name servers as a backup.
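As a very rough sketch of what such a failover check does, the snippet below probes a TCP port and only switches to the backup after several consecutive failures, so a single dropped probe doesn't flip the record. Everything here is an assumption for illustration: the class name, the threshold, and the documentation-range IP addresses. The actual DNS update is provider-specific and is left out.

```python
import socket


def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


class FailoverMonitor:
    """Tracks consecutive failed checks and decides which address should be live."""

    def __init__(self, primary: str, backup: str, failures_needed: int = 3):
        self.primary = primary
        self.backup = backup
        self.failures_needed = failures_needed
        self.consecutive_failures = 0

    def record(self, primary_ok: bool) -> str:
        """Record one check result and return the address that should be served."""
        if primary_ok:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.active()

    def active(self) -> str:
        if self.consecutive_failures >= self.failures_needed:
            return self.backup
        return self.primary


if __name__ == "__main__":
    # Hypothetical addresses from the reserved documentation ranges.
    monitor = FailoverMonitor("192.0.2.10", "198.51.100.20")
    target = monitor.record(is_reachable(monitor.primary, 80))
    print("serve from:", target)  # a real setup would push this to the DNS provider
```

Wherever this runs becomes the extra point of failure mentioned above, so it would need to live somewhere other than the data centers it is watching.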
It's not completely invisible: generally there will be a delay while clients try to use the bad address, but many, if not most, clients will fail over to the next address.
That requires, of course, that the downed address be completely unreachable. If it's still answering with an error status, e.g., HTTP 500 or a socket-level connection refusal, that's what the client will see.
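For illustration, a client-side fallback loop might look roughly like the sketch below. The function name and timeout are made up, and real clients differ in the details: this version also moves on when a connection is refused, which not every client does. It cannot help with an HTTP 500, because by then the connection has already succeeded.

```python
import socket


def connect_first_available(addresses, timeout: float = 2.0) -> socket.socket:
    """Try each (host, port) pair in order and return the first connected socket.

    The timeout is the delay a user sits through for every dead address
    tried before a working one is reached.
    """
    last_error = None
    for host, port in addresses:
        try:
            return socket.create_connection((host, port), timeout=timeout)
        except OSError as exc:  # covers timeouts, refusals, unreachable hosts
            last_error = exc
    raise ConnectionError(f"no address accepted a connection: {last_error}")
```

A caller would pass in the list of addresses returned by the DNS lookup, in the order the resolver handed them back.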
As to DBs, I have streaming replication set up between Postgres instances. It does take manual effort to fail over. It's definitely the weak link in the chain.
@jords:
Maybe just move datacenter?
Data centers are very, very heavy so that is probably not an option.
James