Trying to figure out a backup solution...
I've never had to resort to restoring a backup so I'd just like to ask a few questions if you don't mind.
Some points:
My main server is a PHP/Apache/MySQL vBulletin server. The MySQL database is 800 MB.
Questions:
Backups are categorized as daily, weekly, and monthly when done correctly, right? So if I backed up to S3, I'd be storing at least 800 MB a day, for a total of 800 MB x 30? That seems like a lot of space that would add up in $$ really fast.
Is Amazon S3 the ideal way to back it up?
How do I back up MySQL properly without crashing the server during a mysqldump (the server gets slow when I do this)? Will the SQL dump be fine? What happens if it runs in the middle of someone trying to post something?
What are the alternatives to Amazon S3? Something cheaper? Easier to restore?
As for MySQL dumping, I don't know; I don't use MySQL. PostgreSQL, which I do use, does it transactionally (with pg_dump), i.e. the dump is a snapshot of the database at the time it is taken.
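For what it's worth, a minimal pg_dump run looks something like this (hypothetical database name; piping through gzip to save space):
pg_dump -U postgres mydb | gzip > mydb-$(date +%F).sql.gz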
I'm really confused about how much space I'll be needing.
The problem with S3 is that both the source and the destination (S3) look like local filesystems to the Linode, so rsync can't tell what parts of a file have changed without actually reading the whole file (unchanged files can be detected by filesize/modification time, but telling WHAT changed requires reading them).
Whereas when you're doing a local/remote scenario, you can run rsync on the remote end (in this case, the home server), where rsync can scan its copies of the files without transferring data over the net.
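A minimal sketch of that local/remote setup, pulling a backup directory from the Linode down to a home server over SSH (hypothetical host and paths):
rsync -az --delete -e "ssh -p 22" user@yourlinode.com:/var/backups/ /home/user/linode-backups/
# -a preserves permissions/times, -z compresses over the wire, --delete mirrors removals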
@arachn1d:
Ah, hmm, I think that's more expensive than Amazon S3, isn't it?
I'm really confused about how much space I'll be needing.
Well, don't you know your current usage? Besides, I'm not affiliated with Gigapros, but the S3 online calculator shows S3 is twice as expensive. Granted, it also says that as of June inbound traffic will be free, which is what you'd mostly have, I guess.
As for the backup plan, it depends on what kind of backups you need/want. I currently keep daily backups only, meaning not 30 monthly archives but just the last 24 hours. That's all I need, since the backup is just to prevent data loss if my node decides to push up daisies for whatever reason.
So unless you need historical snapshots throughout the month, the last 24 hours plus maybe a couple of extra copies a week should be more than enough. Or just keep the last 5 days' worth of backups or something.
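A simple way to enforce that kind of retention is a daily cron job that prunes old archives, for example (hypothetical path and naming scheme):
find /var/backups -name 'mysql-*.tar.gz' -mtime +5 -delete   # drop anything older than 5 days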
So if I kept a week's worth of backups, that would be 800 MB x 7 then, correct?
Anyone have answers on the MySQL concern I had?
What scripts do you guys use?
So Amazon isn't good because you can't rsync? If I did a daily backup with rsync, wouldn't it be 800 MB+ per transfer no matter what? Doesn't it just version against the previous file…?
Have you considered backing up the binary logs (binlogs) instead of shipping an entire database dump every time? If only a small part of the database changes during the day, this might be more space-efficient. Binary logs are also more rsync-friendly than raw database dumps, because they're append-only. The downside to binary logs is that they're tricky to verify.
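A rough sketch of that approach, assuming binary logging is already enabled in my.cnf (log_bin under [mysqld]) and using hypothetical paths:
mysql -u root -p'XXXXXXXX' -e 'FLUSH LOGS'                        # rotate to a fresh binlog file
rsync -az /var/log/mysql/mysql-bin.* user@backuphost:/backups/binlogs/
# to recover, restore the last full dump, then replay the logs:
# mysqlbinlog /backups/binlogs/mysql-bin.000123 | mysql -u root -p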
Just a couple of reminders:
1) If you use the MyISAM storage engine, mysqldump will lock all your tables while the dump is in progress. If your application tries to insert or update some rows during this time, the page is likely to hang until the dump is complete.
2) If you use InnoDB tables, inserts and updates can happen even while the dump is in progress, but you should specify the --single-transaction --quick options to mysqldump to avoid getting an inconsistent snapshot.
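For instance, a dump command along those lines might look like this (hypothetical credentials and database name):
mysqldump --single-transaction --quick -u root -p'XXXXXXXX' vbulletin | gzip > /var/dumps/vbulletin-$(date +%F).sql.gz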
Nightly backups with rsync.
Before the backup I dump the MySQL tables and .tgz them.
I use duplicity with Amazon S3.
Dump of the MySQL databases, then a duplicity backup.
I use a script that makes an incremental backup every day and a full backup every month.
With 4 GB of (uncompressed) data to back up and one year of backups (12 full and more than 300 incrementals), my Amazon bill is 6-7 dollars a month.
Every month I also make a full backup from my home box with rdiff-backup.
And every time I shut down, I also take a few minutes to make a clone of the disk image, shrink it to the minimum size, and keep it as an additional backup: now that I have two Linodes (one for production and one for testing) I want to keep the latest image on the testing Linode.
A bit paranoid, I know ;-)
@nexnova:
I use duplicity with Amazon S3. […]
I like your style. Could you share your script perhaps?
@arachn1d:
I like your style. Could you share your script perhaps?
Sure!
This is the script for the DB backup (one file per database; I like it this way rather than all the databases in one file):
#!/bin/bash
# backup mysql - every DB in its own file
MUSER="root"
MPASS="XXXXXXXX"
MDBAK="/var/dumps"
MYSQLDUMP="$(which mysqldump)"
MYSQL="$(which mysql)"
# clean old backups
rm "$MDBAK"/*.bak > /dev/null 2>&1
# get the list of databases
DBS="$($MYSQL -u $MUSER -p$MPASS -Bse 'show databases')"
# dump every database into its own file
for db in $DBS; do
    MFILE="$MDBAK/$db.bak"
    $MYSQLDUMP -u $MUSER -p$MPASS $db > $MFILE
    #echo "$db -> $MFILE"
done
exit 0
And this is the script for duplicity backup:
#!/bin/bash
## NEX: full bkup on Amazon S3
# NOTE: in a shared environment it is not safe to export env vars
# Export variables
export AWS_ACCESS_KEY_ID='your AWS Access Key ID'
export AWS_SECRET_ACCESS_KEY='your AWS Secret Key'
export PASSPHRASE='your passphrase'
GPG_KEY='your GPG key'
# day of the month
DDATE=`date +%d`
# full backup only on 1st of the month, otherwise incremental
if [ $DDATE = 01 ]
then
DO_FULL=full
else
DO_FULL=
fi
# Backup source
SOURCE=/
# Bucket backup destination
DEST=s3+http://your.real.unique.bucket.amazon
#NEX - uncomment to prune backups older than one year (needs the target URL and --force to actually delete)
#duplicity remove-older-than 1Y --force ${DEST}
duplicity ${DO_FULL} \
--encrypt-key=${GPG_KEY} \
--sign-key=${GPG_KEY} \
--exclude=/root/download/** \
--exclude=/var/www/www.toexclude.com/** \
--exclude=/var/www/web23/user/** \
--include=/etc \
--include=/home \
--include=/root \
--include=/usr/local \
--include=/var/www \
--include=/var/dumps \
--include=/var/mail \
--exclude=/** \
${SOURCE} ${DEST}
# Reset env variables
unset AWS_ACCESS_KEY_ID
unset AWS_SECRET_ACCESS_KEY
unset PASSPHRASE
exit 0
Useful links:
https://help.ubuntu.com/community/DuplicityBackupHowto
The difference from duplicity is that duplicity can encrypt, so it is safe to use with untrusted backup destinations (like Amazon S3), while rdiff-backup stores files without encryption, so it is better suited to a trusted environment (your home PC, I hope ;-) ).
#!/bin/bash
## NEX - full rdiff backup
## must be run manually with root privileges (sudo)
## better to create an ssh account used only for backups, so it can be launched unattended
# backup source (your linode)
SOURCE=root@yourlinode.com::/
# Local destination (on your home pc)
DEST=/var/mirrors/my-linode-full
# Replace 12345 with your ssh port number, or remove "-p 12345" if it is on standard 22
rdiff-backup -v5 --print-statistics \
--remote-schema 'ssh -p 12345 -C %s rdiff-backup --server' \
--exclude=/root/download/** \
--exclude=/var/www/www.toexclude.com/user/** \
--exclude=/var/www/web23/** \
--exclude /lost+found \
--exclude /media \
--exclude /mnt \
--exclude /proc \
--exclude /sys \
--exclude /tmp \
${SOURCE} ${DEST}
exit 0
So your backup procedure for file "foo" could be:
1) Backup time! Copy /mystuff/foo to /backups/2010-01-28/foo
2) rdiff /backups/2010-01-28/foo against /backups/2010-01-27/foo
3) Compress the diff
4) Send the diff to S3 for storage
The next day, repeat the process, diffing against the previous day. You would want to periodically do a full backup; perhaps every week you would compress the whole shebang and send it, and then send diffs for each day of the week.
In this scenario, you also only ever need to keep the most recent backup snapshot locally; each day, you just need to diff against the previous day, unless it's full-backup day.
Restoring a backup just involves taking the latest full backup and then applying the diffs sequentially until you reach the desired date. That can also be automated with scripts.
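A minimal sketch of steps 2-4, assuming the librsync "rdiff" tool and s3cmd are installed (hypothetical paths and bucket name):
rdiff signature /backups/2010-01-27/foo foo.sig             # signature of yesterday's copy
rdiff delta foo.sig /backups/2010-01-28/foo foo.delta       # delta = what changed today
gzip foo.delta                                              # compress the diff
s3cmd put foo.delta.gz s3://my-backup-bucket/2010-01-28/    # ship it to S3
# restore = take the last full copy and apply each day's delta in order:
# rdiff patch foo-previous foo.delta foo-reconstructed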
Of course, this is all far more complex than the old "rsync home each night and then let your home backup solution take care of things like incremental stuff for history".
Day 1: Full backup.
Day 2: Incremental against Day 1
Day 3: Incremental against Day 2
Day 4: Incremental against Day 1
Day 5: Incremental against Day 4
Day 6: Incremental against Day 1
(and so forth – I actually do it every 0.7 days, but you get the idea)
Basically, ensure that the distance from the full backup to your most recent incremental is reasonably short. This will make your incremental backups larger, more often than not, but it will reduce the amount of work needed to restore. Also, if one component of the backup gets corrupted or deleted, you've got a better shot of not losing everything.
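If you happen to use duplicity, that pattern can be enforced automatically; for instance (reusing the SOURCE/DEST variables from the duplicity script posted earlier in the thread):
duplicity --full-if-older-than 7D ${SOURCE} ${DEST}   # forces a fresh full backup once the last one is a week old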
Bandwidth is cheap, storage is cheap, but neither your data nor your time is. Have a backup strategy that works, is automatic, assures you that everything is up to date, and has a restore method you know how to use. And practice a restore… grab a Linode 360 for the day and restore to it. It'll cost you a buck or two, but you'll sleep better.
And if you're me, you'll find out why backing up to home really sucks for full restores
EDIT: And I might as well plug my personal backup methods:
0. Linode's backup service (ideal for full restores, not to be relied upon yet)
1. BackupPC on my home server (ideal for full LAN restores and single-file restores; stores ~3 months of data with pooling across machines)
2. Keyfobs with tarballs generated by BackupPC and moved off-site monthly (ideal for full restores and sphincter-clenching disasters)
3. Experimental backups from BackupPC to S3 (ideal for full restores, somewhat more automated than #2 but slow due to upstream bandwidth constraints)
Also, most of my works-in-progress are stored on Dropbox, which is synced across all of my computers and backed up by BackupPC. I use git for revision control and a script I wrote to back up my remote IMAP accounts (gmail, live@edu, etc).
I… think of too many worst-case scenarios.
I have three backups solutions currently:
1. Linode Backup Beta - I've never had to restore from it; I really don't even look at the tab anymore to make sure it's backing up. A quick check (just now) shows that the backups are being made successfully, but they're not to be relied upon.
2. rdiff-backup daily to a virtual machine running Ubuntu 9.04. I'll elaborate below.
3. Custom made GMail backup script. I'll elaborate below.
Daily rdiff-backup
So this is my main fallback. If I lose data I just restore from my virtual machine. I run an Ubuntu 9.04 server install inside VMware Workstation 7. Every morning before class I get up, boot the VM, log in, type "./backup" at the prompt, and go to class. I come back, and it tells me how much has changed. If it was successful (occasionally it will fail), I shut down the OS, power off the VM, and go about my daily work.
I have the VM using a 100 GB disk image. The disk was not preallocated at creation, so the image grows as the disk fills up. This means it's slightly slower, but it isn't always taking 100 GB of disk space when there's only 5 GB of data in the VM. Every month I copy the VM to one of my external hard drives, where it sits until the next month. I have a two-month rotation for VM backups. This means at any point I have:
1. 2 month old Backup VM image
2. 1 month old Backup VM image
3. Currently used Backup VM image
I would normally have an old computer set up for this, but as a college student, I don't have the space to keep an extra desktop around. A VM works just fine for me.
Custom GMail Backups
Once I got my first web hosting customer I started looking for more redundancy in my backups. The VM was for my own use: if I experimented and messed something up, I could simply restore from it. After looking around at different backup solutions I decided not to spend any money and to work on a customized GMail backup solution.
I have three different backup categories:
1. MySQL Database dump backups - Daily
2. Web (htdocs) folder backups - Weekly
3. Nothing here yet - Monthly
Let me lay out how the GMail backup script works. It is sloppy and needs to be rewritten, but here's what it does (a condensed sketch follows the list):
1. Create a temporary folder in /tmp/
2. Compress the command line parameter (a filename or directory) using tar.gz and save it in the folder created in 1. Redirect the output of the tar command (the file list) to a temporary text file.
3. Encrypt the folder using GPG with a special passcode.
4. Split the encrypted file into 24MB chunks.
5. Delete the old encrypted file.
6. For each of the chunks created in 4:
6a. Generate a subject line (m/d/Y - Filename - X of Y).
6b. Generate the body. The body has instructions for recombining the files, decrypting, and extracting them.
6c. Send an email using mutt to a predetermined email address attaching the file list from 2 and the current chunk from 4/6.
7. Cleanup (remove temp folder from 1, and a sent folder that mutt makes)
8. Append a line to backup log.
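Here is roughly what those steps boil down to; this is only a condensed sketch (not the original script), with hypothetical paths, a placeholder passphrase and address, and it assumes gpg, split, and a mutt build that accepts "-a file ... --":
#!/bin/bash
# condensed sketch of the gmailbackup.sh approach described above
TARGET="$1"                                    # file or directory to back up
WORK=$(mktemp -d /tmp/gmailbak.XXXXXX)         # step 1: temp folder
NAME=$(basename "$TARGET")
tar czvf "$WORK/$NAME.tar.gz" "$TARGET" > "$WORK/$NAME.filelist.txt"   # step 2: archive + file list
gpg --batch --passphrase 'XXXXXXXX' -c "$WORK/$NAME.tar.gz"            # step 3: symmetric encryption (newer gpg2 may also need --pinentry-mode loopback)
split -b 24M "$WORK/$NAME.tar.gz.gpg" "$WORK/$NAME.part."              # step 4: 24 MB chunks
rm "$WORK/$NAME.tar.gz" "$WORK/$NAME.tar.gz.gpg"                       # step 5
TOTAL=$(ls "$WORK/$NAME.part."* | wc -l); I=1
for CHUNK in "$WORK/$NAME.part."*; do                                  # step 6
    echo "cat the parts together, gpg -d, then tar xzf to restore" | \
        mutt -s "$(date +%m/%d/%Y) - $NAME - ($I of $TOTAL)" \
             -a "$WORK/$NAME.filelist.txt" "$CHUNK" -- backups@example.com
    I=$((I+1))
done
rm -rf "$WORK"                                                         # step 7: cleanup
echo "$(date) backed up $NAME in $TOTAL chunks" >> "$HOME/backup.log"  # step 8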
So that's the gmailbackup.sh script itself. I have a backup.daily.sh that contains a bunch of calls to gmailbackup.sh with various filenames as parameters. For instance, backup.daily.sh dumps all my MySQL tables to a temp file, then calls gmailbackup.sh with that temp file as the parameter. When it's done it deletes the temp file and is finished for the day. I also have a backup.weekly.sh script that calls gmailbackup.sh on all of my htdocs directories (with some exceptions), along with system logs, repositories, etc.
All system-related backups go to a predetermined email address. Once a Gmail account fills up (I've been using this system account since August and it's only 43% full) I just register a new one, change the address in the script, and let it continue running. The web-related backup account was created at the same time and is 75% full.
It's not even close to an industry-standard backup plan, but it's worked for me in the past. If you're interested in the script I can go ahead and post it.
Couple questions…
So if I back up my database today, for example:
backup-jan28-2010.sql (800 MB)
and then the backup script runs again tomorrow, it'll back up to backup-jan29-2010.sql but only transfer whatever has changed beyond the original 800 MB?
Another question: if I backed up to S3, how would I restore it? What about the other solutions?
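For what it's worth, restoring with duplicity from an S3 bucket like the one in the script above is roughly (hypothetical bucket and paths; the same AWS_* and PASSPHRASE environment variables must be set as for the backup):
duplicity restore s3+http://your.real.unique.bucket.amazon /tmp/full-restore
# or pull back a single file as it existed on a given date:
# duplicity restore --file-to-restore var/dumps/mydb.bak --time 2010-01-28 s3+http://your.real.unique.bucket.amazon /tmp/mydb.bak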
Say I want to back up my Linode, which has 20GB of data.
Assume I do a full backup weekly and my incrementals are, on average, only 5% the size of a full backup. Assume that I only keep the previous and current week of data. That's two full backups and 12 incrementals, or 52GB of data.
A full month schedule would involve transferring roughly four fulls and 24 incrementals; let's say 104GB to transfer.
Total costs:
S3 storage, 52GB @ $0.15/GB: $7.80/mth
S3 transfer, 104GB @ $0.10/GB: $10.40/mth
Linode transfer, 104GB @ $0.10/GB: $10.40/mth
Total: $28.60/mth
That's not cheap to back up a $30 linode!
The same approach when backing up to home would only cost the last bit, $10.40 total, assuming you already have the storage to spare. If you want to account for the cost of home storage, my costs for building my file server are actually somewhat similar to Amazon's… $0.10/GB (including drives/hardware) without redundancy, $0.12 per gig with…
With rdiff-backup, you can pull any file from any backup date, assuming you made backups between changes.
With rdiff-backup, if you name each file uniquely (i.e. backup-jan28-2010.sql) you will have to re-download all 800 MB (minus compression?) once a day. If you simply overwrite backup.sql every time you dump your database, rdiff-backup will only transfer the changed data.
Basically rdiff-backup works like this: the first time it sees a file it HAS to download the whole thing; after that it will only download the changes. If you move a file, it will have to re-download the entire thing at its new location. If the file hasn't changed, it'll simply note in its internal tracking that the file is unchanged and download nothing.
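For example, pulling back a single file as it existed five days ago is one command (hypothetical paths):
rdiff-backup -r 5D linode_current/home/smark/somefile /tmp/somefile.5daysago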
You can run all kinds of commands against your rdiff-backup repository, such as the following, which checks my home directory for all the files that have changed in the last day. Note that these are all run on the virtual machine where the backups are kept and require no communication with my actual Linode.
root@linode-backup:/backups# rdiff-backup --list-changed-since 1D linode_current/home/smark/
changed home/smark/.bash_history
changed home/smark/.lastip
changed home/smark/codinguniverse.com/logs/access_log
changed home/smark/files.spectralcoding.com/logs/access_log
changed home/smark/psybnc/psybnc-oftc/log/psybnc.log
changed home/smark/spectralcoding.com/logs/access_log
changed home/smark/wackyfeedback.com/logs/access_log
changed home/smark/wiki.spectralcoding.com/logs/access_log
Or the following, which lists all my backups since I recreated the VM:
root@linode-backup:/backups# rdiff-backup --list-increments --list-increment-sizes linode_current/
Time Size Cumulative size
-----------------------------------------------------------------------------
Thu Jan 28 01:04:22 2010 11.8 GB 11.8 GB (current mirror)
Wed Jan 27 12:54:44 2010 2.26 MB 11.8 GB
Mon Jan 25 14:07:15 2010 35.3 MB 11.9 GB
Wed Jan 20 02:22:05 2010 39.1 MB 11.9 GB
Thu Jan 14 13:14:55 2010 42.6 MB 11.9 GB
Tue Jan 12 21:30:25 2010 37.1 MB 12.0 GB
Tue Jan 12 00:57:17 2010 36.7 MB 12.0 GB
Sat Jan 9 10:26:20 2010 35.3 MB 12.0 GB
Mon Jan 4 01:25:00 2010 41.1 MB 12.1 GB
Mon Jan 4 01:12:35 2010 1.18 MB 12.1 GB
Wed Dec 30 12:26:25 2009 49.8 MB 12.1 GB
Tue Dec 29 17:46:06 2009 34.3 MB 12.2 GB
Mon Dec 28 21:21:21 2009 32.7 MB 12.2 GB
Thu Dec 10 11:07:38 2009 305 MB 12.5 GB
Wed Dec 9 14:11:12 2009 16.9 MB 12.5 GB
Tue Dec 8 00:01:27 2009 41.2 MB 12.6 GB
Sun Dec 6 14:59:15 2009 17.0 MB 12.6 GB
Fri Dec 4 11:59:59 2009 16.3 MB 12.6 GB
Thu Dec 3 12:04:18 2009 15.5 MB 12.6 GB
Tue Dec 1 11:54:55 2009 20.0 MB 12.6 GB
Mon Nov 30 11:30:45 2009 22.0 MB 12.6 GB
Sun Nov 29 14:04:16 2009 13.5 MB 12.7 GB
Wed Nov 25 11:32:36 2009 25.4 MB 12.7 GB
Tue Nov 24 12:20:49 2009 20.3 MB 12.7 GB
Mon Nov 23 12:52:05 2009 21.1 MB 12.7 GB
Sun Nov 22 18:59:23 2009 21.2 MB 12.7 GB
Sun Nov 22 15:28:17 2009 1.47 MB 12.7 GB
Sat Nov 21 11:44:32 2009 21.9 MB 12.8 GB
Thu Nov 19 15:57:10 2009 18.0 MB 12.8 GB
Mon Nov 16 11:25:33 2009 16.5 MB 12.8 GB
Fri Nov 13 15:18:25 2009 17.8 MB 12.8 GB
Thu Nov 12 01:33:37 2009 29.1 MB 12.8 GB
Mon Nov 9 13:33:38 2009 29.7 MB 12.9 GB
Fri Nov 6 11:49:26 2009 42.5 MB 12.9 GB
Wed Nov 4 15:02:38 2009 37.6 MB 13.0 GB
Mon Nov 2 12:57:26 2009 35.1 MB 13.0 GB
Wed Oct 28 12:54:22 2009 34.6 MB 13.0 GB
Tue Oct 27 16:13:07 2009 17.9 MB 13.0 GB
Mon Oct 26 12:11:58 2009 15.9 MB 13.1 GB
Mon Oct 26 00:41:27 2009 15.7 MB 13.1 GB
Fri Oct 23 12:04:53 2009 7.58 MB 13.1 GB
Thu Oct 22 12:19:17 2009 6.16 MB 13.1 GB
Wed Oct 21 11:46:55 2009 7.27 MB 13.1 GB
Mon Oct 19 11:33:05 2009 7.30 MB 13.1 GB
Sat Oct 17 16:29:55 2009 8.01 MB 13.1 GB
Fri Oct 16 17:48:02 2009 7.23 MB 13.1 GB
Wed Oct 14 11:14:16 2009 7.47 MB 13.1 GB
Tue Oct 13 11:57:40 2009 7.50 MB 13.1 GB
Mon Oct 12 12:06:24 2009 7.73 MB 13.1 GB
Sun Oct 11 14:10:16 2009 6.34 MB 13.1 GB
Sat Oct 10 17:53:44 2009 6.86 MB 13.1 GB
Fri Oct 9 11:49:11 2009 6.84 MB 13.2 GB
Thu Oct 8 11:47:08 2009 7.78 MB 13.2 GB
Wed Oct 7 13:41:54 2009 8.81 MB 13.2 GB
Tue Oct 6 15:49:47 2009 38.3 MB 13.2 GB
Mon Oct 5 11:22:37 2009 8.22 MB 13.2 GB
Sun Oct 4 17:08:00 2009 8.07 MB 13.2 GB
Sat Oct 3 13:44:13 2009 9.80 MB 13.2 GB
Fri Oct 2 11:39:14 2009 8.07 MB 13.2 GB
Thu Oct 1 17:13:01 2009 7.10 MB 13.2 GB
Wed Sep 30 13:09:17 2009 23.9 MB 13.3 GB
Tue Sep 29 11:59:40 2009 5.95 MB 13.3 GB
Mon Sep 28 15:03:24 2009 5.25 MB 13.3 GB
Sat Sep 26 15:14:45 2009 4.83 MB 13.3 GB
Fri Sep 25 14:05:57 2009 2.23 MB 13.3 GB
Thu Sep 24 11:56:54 2009 3.12 MB 13.3 GB
Wed Sep 23 11:40:27 2009 6.71 MB 13.3 GB
Tue Sep 22 15:55:41 2009 1.24 MB 13.3 GB
Mon Sep 21 21:10:56 2009 1.50 MB 13.3 GB
Sun Sep 20 11:29:00 2009 103 MB 13.4 GB
Sat Sep 19 16:34:42 2009 1.39 MB 13.4 GB
Fri Sep 18 10:52:50 2009 1.17 MB 13.4 GB
Thu Sep 17 10:19:15 2009 1.28 MB 13.4 GB
Tue Sep 15 23:01:54 2009 1.65 MB 13.4 GB
Tue Sep 15 18:32:13 2009 930 KB 13.4 GB
Size = Changed size of the mirror files (?)
Cumulative Size = Total size of the backup
So I saw someone mention that to do proper MySQL dumps I have to lock the DB so nothing odd will happen when it's restored, correct?
Alright, he said that if it was using a certain storage engine (MyISAM, I think) then I wouldn't need to run a specific command to back it up, otherwise I would?
Firstly, how do I figure out which storage engine my MySQL tables are using?
Second, if they're not using that specific engine, then I have to run mysqldump with an option that locks the tables, correct?
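For reference, one way to check is to ask MySQL directly (hypothetical credentials; replace "vbulletin" with your database name):
mysql -u root -p -e "SELECT TABLE_NAME, ENGINE FROM information_schema.TABLES WHERE TABLE_SCHEMA = 'vbulletin';"
# or, from the mysql prompt: SHOW TABLE STATUS FROM vbulletin;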
Bash Script: Incremental Encrypted Backups with Duplicity (Amazon S3)