Best-practice backups for worst-case?

Hi guys! I'm an EC2 refugee, loving Linode so far.

Last week's events have encouraged me to develop a fairly comprehensive disaster plan, and I'm trying to work out what the current best practice is. Here's what I've got so far:

Backups:

  • Linode backups enabled

  • Cronjob that does mysqldump && s3sync to push mysql databases to s3
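
Roughly, that cron job looks like this (sketch only; s3cmd is shown standing in for s3sync, and the bucket, user, and password are placeholders):

```bash
#!/bin/bash
# Hourly MySQL dump pushed offsite (sketch: s3cmd standing in for s3sync;
# bucket, user, and password are placeholders).
set -e

mkdir -p /var/backups/mysql
STAMP=$(date +%Y%m%d-%H%M)
DUMP="/var/backups/mysql/all-databases-$STAMP.sql.gz"

# Dump every database in one consistent pass and compress it.
mysqldump --all-databases --single-transaction -u backup -p"$MYSQL_PASS" \
    | gzip > "$DUMP"

# Push the compressed dump to S3.
s3cmd put "$DUMP" s3://my-backup-bucket/mysql/

# Keep only the last day of local dumps.
find /var/backups/mysql -name '*.sql.gz' -mtime +1 -delete
```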

SCENARIO 1: Single disk failure on the Linode host.

Damage: RAID protects the data, Linode staff swap the disk.

Response: Nothing for me to do. Not down at all.

SCENARIO 2: Total hardware failure on the Linode host.

Damage: Linode staff move my account to a new physical box, but all disk images are gone.

Response: I log in, restore the latest daily Linode backup to get the machine image back. Then I pull the latest hourly database dump from s3 (a cronjob pushes it there) to get mostly back up to date. Down < 1 hour.

SCENARIO 3: Virtual machine gets trashed (software error, hacked, etc).

Response: Same recovery as for hardware failure.

SCENARIO 4: Volcano (physical, or software-based as in EC2) hits Linode's London datacenter.

Damage: All disk images and all Linode backups are permanently gone. Can't spin up a new Linode instance.

Response: Now what? Looks like I'd have to start a new VM somewhere else, build it from scratch (user accounts, apache, phusion, mysql), redeploy code from github, and reload DB from S3 backups. Down ~ 1 day.

Scenario 4's the interesting one here. Ideally what I'd like to do is download a machine image to somewhere else (s3?), knowing that I could spin that image up with minimal changes on another provider. That way, even in a total datacenter-loss situation, downtime would be < 1 hour, yet it'd be quite cheap since there'd be no hot spare, and simple since there'd be no need to make every change to the image in two places. Is this possible?

Gwyn.

14 Replies

@gwynm:

Hi guys! I'm an EC2 refugee, loving Linode so far.

Last week's events have encouraged me to develop a fairly comprehensive disaster plan, and I'm trying to work out what the current best practice is. Here's what I've got so far:

Backups:

  • Linode backups enabled

  • Cronjob that does mysqldump && s3sync to push mysql databases to s3

SCENARIO 4: Volcano (physical, or software-based as in EC2) hits Linode's London datacenter.

Damage: All disk images and all Linode backups are permanently gone. Can't spin up a new Linode instance.

Response: Now what? Looks like I'd have to start a new VM somewhere else, build it from scratch (user accounts, apache, phusion, mysql), redeploy code from github, and reload DB from S3 backups. Down ~ 1 day.

Scenario 4's the interesting one here. Ideally what I'd like to do is download a machine image to somewhere else (s3?), knowing that I could spin that image up with minimal changes on another provider. That way, even in a total datacenter-loss situation, downtime would be < 1 hour, yet it'd be quite cheap since there'd be no hot spare, and simple since there'd be no need to make every change to the image in two places. Is this possible?

Gwyn.

The question is how to recover from a complete datacenter failure.

That requires complete offsite backup.

Is your S3 backup just MySQL, or the entire system?

If it's the entire system (other than /dev, /sys, /proc I think) then you can redeploy everything from your offsite backup.

Linode maintains 5 distinct datacenters: California, Texas, Georgia, New Jersey, and the UK. With your account, you can launch an instance in any of them for the same price, with instant provisioning, and pull your S3 backup. (Unlike Amazon, there isn't any free transfer between them… which also means they're unlikely to suffer the same kind of cascading failure as the networks get saturated.)

You will have a new IP, so if you set up your DNS with a low TTL (Linode's free DNS service is, I'd imagine, multi-datacenter), you should be able to recover pretty darn fast, provided you don't have IPs hardcoded anywhere.

@AviMarcus:

If it's the entire system (other than /dev, /sys, /proc I think) then you can redeploy everything from your offsite backup.

Linode maintains 5 distinct datacenters: California, Texas, Georgia, New Jersey, and the UK. With your account, you can launch an instance in any of them for the same price, with instant provisioning, and pull your S3 backup.

Sounds good. So, how would this work?

Do I set up a cron job to stuff the entire system (minus /dev, /sys, /proc) into a giant tarball, then push that tarball to S3? Then, does the recovery look like:

  • Create a new empty disk image in a new linode datacenter

  • Boot the recovery image on a new node, mount the empty disk image, pull the tarball from s3, and extract it onto the empty disk

  • Reboot the node onto the now-full disk image?

Would this, at least in theory, work on any vps host?
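
In other words, something roughly like this? (Untested sketch; s3cmd, the bucket name, and the /dev/xvdb device name are just placeholders.)

```bash
# Backup side (cron): tar up the whole system minus the virtual filesystems,
# then push the tarball to S3 (bucket name is a placeholder).
tar czpf /tmp/full-backup.tar.gz \
    --exclude=/dev --exclude=/proc --exclude=/sys \
    --exclude=/tmp/full-backup.tar.gz /
s3cmd put /tmp/full-backup.tar.gz s3://my-backup-bucket/images/

# Recovery side: boot a rescue image with the new, empty disk image attached
# (shown here as /dev/xvdb, which is just an example device name), then:
mkfs.ext3 /dev/xvdb
mount /dev/xvdb /mnt
s3cmd get s3://my-backup-bucket/images/full-backup.tar.gz /tmp/
tar xzpf /tmp/full-backup.tar.gz -C /mnt
mkdir -p /mnt/dev /mnt/proc /mnt/sys    # recreate the excluded mount points
umount /mnt                             # then reboot the node off this disk
```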

@gwynm:

Sounds good. So, how would this work?

Do I set up a cron job to stuff the entire system (minus /dev, /sys, /proc) into a giant tarball, then push that tarball to S3?

I would use a tool like duplicity, which supports S3 and can do incremental backups. That would save quite a bit on bandwidth, and you can restore to a point before stuff went fubar.
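
For example, something along these lines (untested sketch; the bucket name and exclude list are placeholders, and duplicity reads the AWS keys and a GPG passphrase from the environment):

```bash
#!/bin/bash
# Incremental offsite backup with duplicity (sketch; the bucket name and
# excludes are placeholders). Duplicity GPG-encrypts what it uploads.
export AWS_ACCESS_KEY_ID="..."
export AWS_SECRET_ACCESS_KEY="..."
export PASSPHRASE="..."

# Fresh full backup once a month, incrementals in between.
duplicity --full-if-older-than 1M \
    --exclude /dev --exclude /proc --exclude /sys --exclude /tmp \
    / s3+http://my-backup-bucket/linode

unset PASSPHRASE AWS_SECRET_ACCESS_KEY AWS_ACCESS_KEY_ID
```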

I'm currently using rdiff-backup to push to hard drive space elsewhere; it uses rsync and diffs to maintain incrementals. It requires rdiff-backup on the remote machine (like rsync), so it won't work to S3 without also having an EC2 instance there doing the work.
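
The rdiff-backup side looks roughly like this (host and paths are hypothetical):

```bash
# Nightly push to a box that also has rdiff-backup installed
# (host and paths are hypothetical).
rdiff-backup --exclude /dev --exclude /proc --exclude /sys \
    / backupuser@backuphost::/backups/mylinode

# Trim increments older than a month (--force allows removing several at once).
rdiff-backup --remove-older-than 1M --force backupuser@backuphost::/backups/mylinode
```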

@gwynm:

Then, does the recovery look like:

  • Create a new empty disk image in a new linode datacenter

  • Boot the recovery image on a new node, mount the empty disk image, pull the tarball from s3, and extract it onto the empty disk

  • Reboot the node onto the now-full disk image?

Would this, at least in theory, work on any vps host?

I would actually deploy the same OS and then pull in the entire backup, overwriting what's there, to let the OS rebuild those excluded folders. Then reboot. But I'm not 100% sure, and haven't gotten around to trying it, unfortunately.

And… yeah, I think it would work on most VPS hosts. Some have funny internal system settings for networking and the like, so you wouldn't want to overwrite those. If it's Linode -> Linode, same OS, there's no worry about messing up any config files.

I prefer to split out the configuration of the machine from the data when I'm doing backups. The data is easy to back up (as you've seen); it's the configuration that's a bit trickier.

I try to avoid copying disk images around… it's ugly, and doesn't work as well as you'd hope. You're much better off being able to easily deploy a new host from scratch. Using Puppet or Chef or Fabric, it's very easy to write scripts to deploy your configuration and copy your data into place. Once you've got that, it doesn't matter whether you're deploying to your Linode, a local VM, or some other VPS provider; you're just running scripts, so it's completely portable.

I use git to create backups and then push them to S3; works wonders for incremental MySQL backups, etc.
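
Roughly, the idea is to keep plain-text dumps in a local git repo so each run only adds a diff, then ship the history offsite as a single bundle (sketch only; the repo path and bucket are placeholders, and this is just one way to do it):

```bash
#!/bin/bash
# Sketch: keep MySQL dumps in a local git repo, then ship the whole history
# as one bundle file to S3 (repo path and bucket are placeholders;
# assumes MySQL credentials in ~/.my.cnf).
cd /var/backups/mysql-git || exit 1

# --skip-extended-insert puts one row per line, which diffs (and so git)
# handle much better.
mysqldump --all-databases --skip-extended-insert > all-databases.sql
git add all-databases.sql
git commit -m "backup $(date +%F-%H%M)" || true   # no-op if nothing changed

git bundle create /tmp/mysql-backup.bundle --all
s3cmd put /tmp/mysql-backup.bundle s3://my-backup-bucket/mysql/
```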

I've been using VirtualBox to run a copy of Ubuntu Server that has anacron set up to run a small script. I use the task manager on Windows to launch it every day. The script uses rsync to back up my server's files. It does an incremental backup for each day of the week. Seems to work pretty well.
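
The script is roughly this (host and paths are hypothetical):

```bash
#!/bin/bash
# Run by anacron inside the local VM (sketch; host and paths are hypothetical).
# One directory per day of the week gives a rolling week of history.
DAY=$(date +%a)                                   # Mon, Tue, ...
rsync -az --delete myserver:/etc/     "/backups/$DAY/etc/"
rsync -az --delete myserver:/var/www/ "/backups/$DAY/www/"
```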

For what it's worth, I went the Duplicity route, and I just finished my first fire drill. Just restoring duplicity files to a blank image and booting off it doesn't work (perhaps due to the missing /dev?), but copying-with-overwrite onto a bare Ubuntu image does work. This is nifty.
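
For anyone else trying this, the restore half of the drill looked roughly like this (sketch; the bucket and paths are placeholders, credentials go in the environment as before):

```bash
# On a freshly deployed bare Ubuntu image (sketch; bucket and paths are
# placeholders).
export AWS_ACCESS_KEY_ID="..."
export AWS_SECRET_ACCESS_KEY="..."
export PASSPHRASE="..."

apt-get update && apt-get install -y duplicity rsync

# Restore into a staging directory first...
duplicity restore s3+http://my-backup-bucket/linode /root/restore

# ...then copy it over the running system, leaving the virtual filesystems
# alone, and reboot.
rsync -a --exclude /dev --exclude /proc --exclude /sys /root/restore/ /
reboot
```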

There is no need to back up the entire machine. Just back up your website files, the database, and /etc (where all your configuration files should be). Anything more is a waste of disk space and bandwidth. In particular, there is no need to back up binaries that can be easily reinstalled from your distribution's official repositories. Anything that isn't specific to your server isn't really worth backing up. ftp.debian.org ain't going anywhere.

Write a bash script that does the following (a rough skeleton follows below):

  • Update existing packages, and install all the new packages you need. (If you're using Debian or Ubuntu, use dpkg -l to get the list of currently installed packages. Clean it up a bit, and paste it into your bash script.)

  • For any program/version that can't be found in your distribution's official repositories, fetch the latest stable version from the program's website, or clone its GitHub repository. Compile it and install it. (This applies to newfangled things like node.js and redis.)

  • Download your backups from S3. Unzip them where they belong: website files in /home, /var/www, or /srv/www, and configuration files in /etc.

  • Load the MySQL dump.

All of the above can be done with minimal user interaction, and it works in all of your scenarios from #2 to #4. No matter what happens to your server or the datacenter, you should be able to get from a newly provisioned Linode to a fully functioning web server in 1 hour max. (Maybe a little more if your website files weigh more than a few GB, but those large files should be on S3 to begin with.)

Of course, it's always a good idea to test it.
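
A rough skeleton of such a script (the package list, bucket name, paths, and passwords are placeholders you'd fill in from your own dpkg -l output and backup layout):

```bash
#!/bin/bash
# Skeleton rebuild script for a fresh Debian/Ubuntu Linode (the package list,
# bucket name, paths, and passwords below are placeholders).
set -e

# 1. Update, then install everything you need from the official repositories
#    (paste your own list, cleaned up from `dpkg -l`).
apt-get update && apt-get -y upgrade
apt-get -y install apache2 mysql-server php5 git-core s3cmd

# 2. Anything not packaged: fetch the source or clone its repository and build.
# git clone git://github.com/example/some-tool.git /usr/local/src/some-tool
# (cd /usr/local/src/some-tool && make && make install)

# 3. Pull the latest backups from S3 and unpack them where they belong.
s3cmd get s3://my-backup-bucket/etc.tar.gz /tmp/
s3cmd get s3://my-backup-bucket/www.tar.gz /tmp/
s3cmd get s3://my-backup-bucket/mysql/all-databases.sql.gz /tmp/
tar xzpf /tmp/etc.tar.gz -C /
tar xzpf /tmp/www.tar.gz -C /

# 4. Load the database dump and restart the services.
gunzip -c /tmp/all-databases.sql.gz | mysql -u root -p"$MYSQL_ROOT_PASS"
/etc/init.d/apache2 restart
/etc/init.d/mysql restart
```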

@hybinet:

There is no need to back up the entire machine. Just back up your website files, the database, and /etc (where all your configuration files should be)

I also back up some folders in /var (for example, the munin graphs) and /usr/local, for server-specific software.

This is a great post! Let me see if I can condense things so we can hammer it out.

What to back up:

  • website files (might be under /etc)

  • database (/var/lib/mysql)

  • /etc

  • odds 'n ends of other files (perhaps a listing from dpkg -l)

Where to create the backup:

  • S3 has been suggested as the place to put the backup offsite. I don't have an S3 account, but on this page there is mention of a free tier. I don't know the ins and outs, but free sounds good ;-)

How to back up:

  • This has been touched on briefly. Duplicity to S3 was mentioned.

  • Anyone have examples of what they use or what we could use?

Anything else that needs to be addressed in our back-up situation?

I use git to create incremental backups and push them to S3; it takes up less space, so it saves $.

Why git and not rdiff-backup?

@rsk:

Why git and not rdiff-backup?

You can't use rdiff-backup directly with S3; it requires rdiff-backup on the remote machine… so you'd need some sort of EC2 instance running during the backup.
