Best-practice backups for the worst case?
Last week's events have encouraged me to develop a fairly comprehensive disaster plan, and I'm trying to work out what the current best practice is. Here's what I've got so far:
Backups:
Linode backups enabled
Cronjob that does mysqldump && s3sync to push MySQL databases to S3
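(The cron job itself is nothing fancy; here's a rough sketch, with the bucket name and paths made up:)

    #!/bin/sh
    # hourly: dump every database, compress, push the dump directory to S3
    # (MySQL credentials assumed to live in ~/.my.cnf)
    mysqldump --all-databases --single-transaction | gzip > /var/backups/mysql/all.sql.gz
    s3sync.rb -r /var/backups/mysql/ mybucket:db-backups/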
SCENARIO 1: Single disk failure on the Linode host.
Damage: RAID protects the data, Linode staff swap the disk.
Response: Nothing for me to do. Not down at all.
SCENARIO 2: Total hardware failure on the Linode host.
Damage: Linode staff move my account to a new physical box, but all disk images are gone.
Response: I log in, restore the latest daily Linode backup to get the machine image back. Then I pull the latest hourly database dump from S3 (a cronjob pushes it there) to get mostly back up to date. Down < 1 hour.
SCENARIO 3: Virtual machine gets trashed (software error, hacked, etc.).
Response: Same recovery as for hardware failure.
SCENARIO 4: Volcano (physical, or software-based as in EC2) hits Linode's London datacenter.
Damage: All disk images and all Linode backups are permanently gone. Can't spin up a new Linode instance.
Response: Now what? Looks like I'd have to start a new VM somewhere else, build it from scratch (user accounts, apache, phusion, mysql), redeploy code from github, and reload DB from S3 backups. Down ~ 1 day.
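(Concretely, for scenarios 2 and 3 the database reload after restoring the Linode backup is just something like the following; the bucket name is made up, and I'm using s3cmd for the fetch, though s3sync's bundled s3cmd.rb would do the same job:)

    # fetch the latest hourly dump from S3 and load it
    s3cmd get s3://mybucket/db-backups/all.sql.gz /tmp/all.sql.gz
    gunzip < /tmp/all.sql.gz | mysql -u root -p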
Scenario 4's the interesting one here. Ideally, what I'd like to do is download a machine image to somewhere else (S3?), knowing that I could spin that image up with minimal changes on another provider. That way, even in a total datacenter-loss situation, downtime would be < 1 hour, yet it'd be quite cheap (there'd be no hot spare) and simple (no need to make every change to the image in two places). Is this possible?
Gwyn.
@gwynm:
Hi guys! I'm an EC2 refugee, loving Linode so far.
The question is how to recover from a complete datacenter failure. That requires a complete offsite backup.
Is your S3 backup just MySQL, or the entire filesystem?
If it's the entire system (other than /dev, /sys, and /proc, I think), then you can redeploy everything from your offsite backup.
Linode maintains five distinct datacenters: California, Texas, Georgia, New Jersey, and the UK. With your account, you can launch an instance in any of them for the same price, with instant provisioning, and pull in your S3 backup. (Unlike Amazon, there isn't any free transfer between them… which also means they're unlikely to suffer the same cascading failure as the networks get saturated.)
You will have a new IP, so if you set your DNS records with a low TTL (Linode's free DNS service is, I'd imagine, multi-datacenter), you should be able to recover pretty darn fast, provided you don't have IPs hardcoded anywhere.
@AviMarcus:
If it's the entire system (other than /dev, /sys, and /proc, I think), then you can redeploy everything from your offsite backup.
Linode maintains five distinct datacenters: California, Texas, Georgia, New Jersey, and the UK. With your account, you can launch an instance in any of them for the same price, with instant provisioning, and pull in your S3 backup.
Sounds good. So, how would this work?
Do I set up a cronjob to stuff the entire system (minus /dev, /sys, and /proc) into a giant tarball, then push that tarball to S3? Then, does the recovery look like this:
* Create a new empty disk image in a new Linode datacenter.
* Boot the recovery image on a new node, mount the empty disk image, pull the tarball from S3, and extract it onto the empty disk.
* Reboot the node onto the now-full disk image.
Would this, at least in theory, work on any VPS host?
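In other words, something like this on each side (bucket name and paths are made up, and I'm using s3cmd for the transfer, though s3sync would do):

    # backup side: tarball of the whole system, minus the virtual filesystems
    tar --exclude=/proc --exclude=/sys --exclude=/dev --exclude=/var/backups \
        -czpf /var/backups/system.tar.gz /
    s3cmd put /var/backups/system.tar.gz s3://mybucket/system.tar.gz

    # restore side: from the recovery image, with the new empty disk mounted at /mnt
    s3cmd get s3://mybucket/system.tar.gz /tmp/system.tar.gz
    tar -xzpf /tmp/system.tar.gz -C /mnt
    mkdir -p /mnt/proc /mnt/sys /mnt/dev    # recreate the excluded mount points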
@gwynm:
Sounds good. So, how would this work?
Do I set up a cronjob to stuff the entire system (minus /dev, /sys, and /proc) into a giant tarball, then push that tarball to S3?
I would use a tool like duplicity, which supports S3 and can do incremental backups. That saves quite a bit on bandwidth, and you can restore to a point before stuff went fubar.
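Roughly like this (the bucket name, keys, and passphrase are placeholders):

    # incremental, encrypted backup straight to S3; no server needed on the far end
    export AWS_ACCESS_KEY_ID=yourkey
    export AWS_SECRET_ACCESS_KEY=yoursecret
    export PASSPHRASE=yourpassphrase
    duplicity --exclude /proc --exclude /sys --exclude /dev / s3+http://mybucket/system

    # and to restore the whole tree somewhere:
    duplicity s3+http://mybucket/system /mnt/restore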
I'm currently using rdiff-backup to hard drive space elsewhere; it uses the rsync algorithm and diffs to maintain incrementals. It requires rdiff-backup on the remote machine (like rsync), so it won't work with S3 without also having an EC2 instance there to run it.
@gwynm:
Then, does the recovery look like this:
* Create a new empty disk image in a new Linode datacenter.
* Boot the recovery image on a new node, mount the empty disk image, pull the tarball from S3, and extract it onto the empty disk.
* Reboot the node onto the now-full disk image.
Would this, at least in theory, work on any VPS host?
I would actually deploy the same OS and then pull in the entire backup, overwriting what's there, to let the OS rebuild those excluded folders. Then reboot. But I'm not 100% sure, and haven't gotten around to trying it, unfortunately.
And… yeah, I think it would work on most VPS hosts. Some have funny internal system settings for networking and the like, so you wouldn't want to overwrite those. If it's Linode -> Linode with the same OS, there's no worry about messing up any config files.
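Something like this, I'd guess (completely untested, and the excludes are just the obvious host-specific candidates):

    # extract the backup over a freshly deployed copy of the same distro,
    # keeping the new host's network config and hostname
    tar -xzpf system.tar.gz -C / \
        --exclude='etc/network/*' --exclude='etc/hostname' --exclude='etc/hosts'
    reboot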
I try to avoid copying disk images around… it's ugly, and doesn't work as well as you'd hope. You're much better off being able to easily deploy a new host from scratch. Using Puppet or Chef or Fabric, it's very easy to write scripts to deploy your configuration and copy your data into place. Once you've got that, it doesn't matter if you're deploying to your Linode, a local VM, or some other VPS provider; you're just running scripts, so it's completely portable.
Write a bash script that does the following:
* Update existing packages, and install all the new packages you need. (If you're using Debian or Ubuntu, use dpkg -l to get the list of currently installed packages; clean it up a bit and paste it into your bash script.)
* For any program/version that can't be found in your distribution's official repositories, fetch the latest stable version from the program's website, or clone its GitHub repository. Compile it and install it. (This applies to newfangled things like node.js and redis.)
* Download your backups from S3. Unzip them where they belong: website files in /home, /var/www, or /srv/www, and configuration files in /etc.
* Load the MySQL dump.
All of the above can be done with minimal user interaction, and it works in all of your scenarios from #2 to #4. No matter what happens to your server or the datacenter, you should be able to get from a newly provisioned Linode to a fully functioning web server in 1 hour max. (Maybe a little more if your website files weigh more than a few GB, but those large files should be on S3 to begin with.)
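A skeleton of what I mean; the package list, bucket name, repository URL, and paths are all placeholders:

    #!/bin/bash
    # rebuild.sh: take a freshly provisioned node to a working web server
    set -e
    apt-get update && apt-get -y upgrade
    apt-get -y install apache2 mysql-server git-core    # plus whatever dpkg -l told you
    # pull the backups down from S3
    s3cmd get s3://mybucket/backups/etc.tar.gz /tmp/etc.tar.gz
    s3cmd get s3://mybucket/backups/www.tar.gz /tmp/www.tar.gz
    s3cmd get s3://mybucket/backups/db.sql.gz /tmp/db.sql.gz
    # unzip everything where it belongs
    tar -xzpf /tmp/etc.tar.gz -C /
    tar -xzpf /tmp/www.tar.gz -C /
    # redeploy the code from GitHub
    git clone git://github.com/you/yourapp.git /srv/www/yourapp
    # load the MySQL dump (fresh install: root has no password yet)
    gunzip < /tmp/db.sql.gz | mysql -u root
    /etc/init.d/apache2 restart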
Of course, it's always a good idea to test it.
@hybinet:
There is no need to back up the entire machine. Just back up your website files, the database, and /etc (where all your configuration files should be).
I also back up some folders in /var (for example, the Munin graphs) and /usr/local for server-specific software.
What to back up:
* website files (might be under /home or /var/www)
* database (/var/lib/mysql)
* /etc
* odds 'n ends of other files (perhaps a listing from dpkg -l)
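(For the simple-tarball approach, that list translates to something like this, with the paths illustrative:)

    # the whole list in one go; MySQL is dumped rather than copied live
    dpkg -l > /var/backups/packages.txt
    mysqldump --all-databases | gzip > /var/backups/db.sql.gz
    tar -czf /var/backups/site-$(date +%F).tar.gz \
        /etc /var/www /usr/local /var/lib/munin \
        /var/backups/packages.txt /var/backups/db.sql.gz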
Where to create the backup:
* S3 has been suggested as the place to put the backup offsite. I don't have an S3 account but on this page
How to back up:
* This has been touched on briefly.
* Duplicity
* Anyone have examples of what they use or what we could use?
Anything else that needs to be addressed in our backup situation?
@rsk:
Why git and not rdiff-backup?
You can't use rdiff-backup directly with S3; it requires rdiff-backup on the remote machine… so you'd need some sort of EC2 instance running during the backup.
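The remote form looks like this (host and paths made up):

    # back up / to a machine that has rdiff-backup installed, over SSH
    rdiff-backup --exclude /proc --exclude /sys --exclude /dev \
        / user@backuphost::/backups/mylinode
    # restore a directory as it looked 10 days ago
    rdiff-backup -r 10D user@backuphost::/backups/mylinode/etc /tmp/etc-10-days-ago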