Backups: S3 vs. FTP/SSH
S3
I've actually set this up using s3sync, this tutorial, and these scripts.
Pros
I'd say it's the price. By my calculations you can get 10GB backed up including monthly full re-uploads (rsyncing meanwhile) and up to 100K requests, for about $3 a month. And if you go beyond that, there are no barriers whatsoever.
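That "$3 a month" figure can be sanity-checked with a quick calculation. The per-unit rates below are my assumption of the 2008-era S3 pricing ($0.15/GB-month storage, $0.10/GB upload, $0.01 per 1,000 PUTs), so swap in whatever Amazon currently charges:

```shell
# Rough monthly cost sketch; rates are assumed 2008-era S3 pricing.
total=$(awk 'BEGIN {
  storage = 10 * 0.15    # 10 GB stored for a month at $0.15/GB-month
  upload  = 10 * 0.10    # one full 10 GB re-upload at $0.10/GB
  puts    = 100 * 0.01   # 100K PUT requests at $0.01 per 1,000
  printf "%.2f", storage + upload + puts
}')
echo "estimated monthly S3 bill: \$$total"
```

Under those assumed rates it comes out to about $3.50, which lines up with the estimate above.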
Cons
1. For some reason it created a subdirectory inside etc on my S3 bucket (so it's etc/etc) even though my prefix was kept at simply "etc". I don't get that; clearly, I'm not yet familiar enough with this.
2. Even in my brief trial I already saw an "internal server error" a few times, as well as "With result 505 HTTP Version Not Supported", which doesn't really inspire confidence in its reliability, to be honest… especially after those February and June (2008) "massive outages" I've read about, which may have resulted in loss of data. Cloud computing at this scale still seems to have growing pains, I suppose.
3. It's not quite a standard technology, as it needs special tools (like s3sync) for access, and it's hard to find easy-to-use, efficient interfaces for adding or removing files and folders in my buckets. Most of the ones that exist are either half-baked (s3fox can't delete), proprietary, or don't work on Linux (which is my primary desktop system).
4. No real customer support, AFAIK.
FTP/SSH
By this I mean services like BQ Backup
Pros
1. More methods of backup, from rdiff to rsync.
2. A lot of available software for viewing and manipulating remote files.
3. Arguably simpler, due to familiarity.
Cons
1. Price. It's a little more expensive than S3. For example, I probably wouldn't even use enough on S3 to be charged $3 a month, let alone $5. With BQ Backup, though, I would probably pay at least $2 or $3 a month for space I'm not actually using, although this might be justified by the relative peace of mind and simplicity/convenience.
---
So, what do you think? Would you add any other pros or cons to the above options? I'm still trying to decide.
Thanks
29 Replies
I've been trying to set up a backup solution for my linode and ultimately got down to two basic choices.
If you are a Linux user, as you say, why don't you rsync to your local Linux machine? You likely already pay for enough bandwidth at home to cover this amount of traffic, and you are more than likely to have room for your <=10 GB on a local partition.
@malex:
If you are a Linux user, as you say, why don't you rsync to your local Linux machine? You likely already pay for enough bandwidth at home to cover this amount of traffic, and you are more than likely to have room for your <=10 GB on a local partition.
The biggest reason is my upload connection. It is only 256 kbit/s, so if I ever actually need to restore the full backup (which would likely involve gigabytes of stuff), I would have to wait far too long just to upload everything back to my server, and that makes for more downtime.
Other reasons are that my ISP, t-com, reconnects me daily (changing my IP), so I'd need to set up a dyndns domain and run an IP updater. Also, whenever I turn my computer off it'd be inaccessible for backup crons, though that's not as big an issue as the one above.
Currently, my install and all data come to about 1.5 GB in total, so for about $2.40 a month I can have my entire system backed up, or for about $5 a month geo-redundantly.
Honestly, though, I could get by with less than a GB of actual data and config files. My current backup scheme is to tar up my webroot directory and my MySQL databases weekly, and my home server reaches out and grabs them.
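That scheme is simple enough to sketch. All the paths below are made up, and the mysqldump line is left commented since it needs real credentials; the demo webroot is created only so the script runs anywhere:

```shell
#!/bin/sh
# Weekly dump-and-tar sketch. On a real server, point WEBROOT at the
# live site and uncomment the mysqldump line; here a tiny demo webroot
# is created so the sketch is self-contained.
STAMP=$(date +%Y-%m-%d)
DEST=/tmp/backups
WEBROOT=/tmp/demo-webroot

mkdir -p "$DEST" "$WEBROOT"
echo "hello" > "$WEBROOT/index.html"   # demo content only

# Dump all MySQL databases (credentials via ~/.my.cnf on a real box):
# mysqldump --all-databases | gzip > "$DEST/mysql-$STAMP.sql.gz"

# Tar up the webroot; the home machine can then pull $DEST via rsync/scp.
tar czf "$DEST/webroot-$STAMP.tar.gz" \
    -C "$(dirname "$WEBROOT")" "$(basename "$WEBROOT")"
```

The date stamp in the filename keeps each week's archive distinct, so the pulling side never overwrites an older copy.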
And if that were ever to grow to 10 GB, the price would be a little too high. I've had a $5-a-month backup on Slicehost for a full 10 GB slice; they charge $10 for a 20 GB slice. I'd like to stay within those price limits.
I'll bookmark them though, just in case.
Rsync Palace, WebbyCart
Anyone had any experience with these?
I'll try to get some sleep and then hopefully make a decision. Meanwhile, any suggestions are welcome.
In the end the only thing that keeps S3 under consideration for me is the price flexibility and the fact that I might use it to host some images, CSS files etc. (to speed loading times) so having backups tied into that might be a more efficient way to spend bucks.
But other than that, I'd say convenience, simplicity and familiarity are on the side of FTP/SSH/rsync-based solutions.
Thanks
@memenode:
Other reasons are that my ISP, t-com, reconnects me daily (changing my IP), so I'd need to set up a dyndns domain and run an IP updater. Also, whenever I turn my computer off it'd be inaccessible for backup crons, though that's not as big an issue as the one above.
I use rsnapshot from a remote host to back up important directories on my linode. The backups are initiated from the remote host so the ISP changing the IP doesn't impact the backups.
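For anyone wanting the same pull-style setup, a minimal rsnapshot.conf on the backup host might look roughly like this. The hostname and retention counts are placeholders, rsnapshot requires tabs (not spaces) between fields, and depending on your version the keyword may be `interval` instead of `retain`; your install's sample config also sets the required cmd_* paths:

```
snapshot_root	/var/backups/rsnapshot/
retain	daily	7
retain	weekly	4
# Pull /etc and /home from the linode over SSH:
backup	root@my-linode.example.com:/etc/	linode/
backup	root@my-linode.example.com:/home/	linode/
```

A cron entry on the backup host then runs `rsnapshot daily`; since the linode's address is stable, the home ISP's changing IP never matters.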
–John
@freedomischaos:
You could do an el chepo web hosting and toss everything up there :D
??? El Chepo is an auto parts company, no web hosting.
James
True… though I hear some hosts prohibit using accounts only for backups, and on shared hosting you don't usually get SSH.
And it would not bug you again. My scripts were designed to set up the bucket so it always places any uploaded data in a new subfolder.
Removing the prefix option entirely will solve your problem.
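For reference, the invocation would be something like the following (the bucket name is hypothetical, and the keys obviously need filling in; check `s3sync.rb --help` for your version's exact options). If I understand s3sync's rsync-like trailing-slash rules correctly, syncing /etc without a trailing slash into an "etc" prefix is exactly what produces the etc/etc nesting:

```shell
# Mirror the contents of /etc into the bucket with no prefix, so files
# land at the top level instead of under an extra etc/ folder.
# AWS credentials come from the environment, as s3sync expects.
export AWS_ACCESS_KEY_ID=...      # your key here
export AWS_SECRET_ACCESS_KEY=...  # your secret here
ruby s3sync.rb -r --ssl /etc/ mybucket:
```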
With respect to pricing: S3 is as open-ended as you want it to be. You have to decide what your safety zone is. I store lots of GB worth of data on S3. It saves me $500 every month (compared to the storage option I get from my colo provider).
Pricing is definitely a plus, especially if you have extremely large amounts of data to back up, I suppose. Otherwise the FTP/SSH option does seem fairly attractive, especially these $5-a-month-for-15-GB services. That actually comes kinda close to S3 pricing.
Maybe I should just go with a classical backup solution for now until I am in need of enough backup space that S3 savings would really mean a lot (though I'd have to evaluate that when I get there). Meanwhile I might gain some more experience with S3 if I decide to host some static content to offload my server a bit.
Thanks
1. Offload any static elements of your site/app to S3.
2. Call the static elements via CloudFront.
3. Dump any excess data from your linode onto S3 so it's safe. In the case of system backups, keep the ACL private and you will have 100% security.
4. Linode offers RAID1, which is much better than what's offered for EC2 standalone. It would be wise to dump whatever extra data you have onto S3 and call it in whenever you need it.
Realistically speaking, you would have to work really hard to run up a very high invoice from S3.
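As a concrete sketch of the first two steps, one of the independent CLI tools such as s3cmd can push the static files with a public ACL, after which a CloudFront distribution pointing at the bucket serves them (bucket name and paths here are made up):

```shell
# Push static assets to a bucket, world-readable, so CloudFront
# (or the plain S3 URL) can serve them directly.
s3cmd sync --acl-public --guess-mime-type \
    /var/www/static/ s3://my-static-bucket/static/
```

The `--guess-mime-type` flag matters here: without a sensible Content-Type, browsers may refuse to render CSS served from the bucket.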
Best Regards
Hareem Haque
http://www.tarsnap.com/
Thanks Hareem for your suggestions. I still wonder, though, about all these "internal server error" messages (a few of them already as I'm uploading a first backup). When one happens it waits 30 seconds and then retries, and I assume it usually gets it the second time, but this sort of thing makes me wonder whether I can be 100% sure that my data is up without any corruption. It's just odd that these errors are so common on S3.
Atourino, thanks for suggesting tarsnap. It looks interesting, but it's still in beta and the site is super-minimal, so I'm not sure… Its pricing is in line with S3's and it also uses some nonstandard tech, so in that sense it's similar to S3… though with more security.
I'll bookmark it and then if I decide to use a non-S3 backup at some point I might consider it too.
Thanks
However, at 99 attempts the sync process makes damn sure to get every last bit of the specified data up to S3. I back up lots of DVD source backups every day and run into the same issue. I've been using it for more than a year now and never had a data corruption problem (at least with S3; EC2 is completely useless).
Best Regards
Hareem Haque.
With result 500 Internal Server Error
72 retries left, sleeping for 30 seconds
Since retries are spaced 30 seconds apart and it has apparently tried 28 times by now, this effectively translates to 14 minutes of S3 downtime/failure.
I suppose in one of the remaining 72 tries it will succeed…
EDIT: Actually, I just realized that wasn't on a single file, but in general. It now has 63 retries left, so overall I had 37 errors during this upload. Hopefully it won't reach 100 before it's done (it's got gigabytes of stuff to do).
Are you chunking your data, or have you selected a whole drive with tons of data in it? If the latter, it will start to bug you. It creates a list of files to upload, and the sync size has to be less than 5 GB.
I have not tested this on a Linode yet. It works fine from EC2.
@memenode:
Atourino, thanks for suggesting tarsnap. It looks interesting, but it's still in beta and the site is super-minimal, so I'm not sure… Its pricing is in line with S3's and it also uses some nonstandard tech, so in that sense it's similar to S3… though with more security.
Tarsnap has some features in common with S3, such as the linear pricing model and the very very low probability that your data will ever be lost (data stored via tarsnap currently gets stored on S3 behind the scenes, in fact); but tarsnap is designed fundamentally as a backup system rather than a general-purpose storage system like S3.
In addition to the improved security, tarsnap works with a snapshot model of backups: Instead of just synchronizing the latest version of your data to S3 like s3sync does (which can cause problems if you realize that you mangled a file after the next sync happens), tarsnap allows you to store as many archives as you want (either snapshots of the same files/directories or completely different data, it doesn't matter) and uses some magic behind the scenes to remove any duplicate bits. Each archive stored on tarsnap can be deleted independently of all the others; so you can do things like creating backups every hour but deleting most of them later so that at any point you have (for example) hourly backups for the past week, daily backups for the past month, and weekly backups for the past year.
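In command terms (archive names here are just examples, and this assumes you've already registered a machine and have a keyfile set up), that model looks like:

```shell
# Create an hourly snapshot; dedup means each one costs only the delta.
tarsnap -c -f "home-$(date +%Y-%m-%d_%H)" /home

# List everything stored, then thin out old snapshots independently:
tarsnap --list-archives | sort
tarsnap -d -f "home-2008-11-01_00"
```

Because each archive stands alone, a cron job can create snapshots aggressively and a second job can prune them into the hourly/daily/weekly ladder described above.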
I'll stop my evangelizing there for now
I reside in Canada. Is there any way that I could test the service?
@hareem:
I reside in Canada. Is there any way that I could test the service?
Send me an email and I'll try to work something out.
@LatecomerX:
Have you considered Joyent's BingoDisk before?
http://www.bingodisk.com/
Yep, and I bookmarked them too. That also seems to be a "cloud", though, since it doesn't mention rsync or SSH but rather WebDAV. Pricing is good, though.
@hareem:
Are you chunking your data or have you selected a whole drive with tons of data in it. As it will start to bug you. It creates a list of uploading files. Sync size has to be less then 5GB.
It's in chunks and I'm only backing up select directories so no single s3sync command actually gets 5GB at once (/home comes closest at 3.5GB).
It seems to be complete and is showing 5.1 GB of bandwidth spent on S3, with 68,637 PUT/COPY/etc. requests, costing $1.23. The only thing confusing me is that for storage used it says "0.002 GB-Mo" while there should be about 5 GB on it.
@cperciva:
Hi all, I'm the author of tarsnap (and saw this forum appearing in my web server logs).
Hi, thanks for the info. Sounds good overall (better than S3, except that I like S3's charging model better, since I don't risk losing access by failing to fund an account at any point, because it funds itself).
I'm not sure, though, whether anything cloud-style is actually better than classic methods (getting a storage box with SSH and rsync support, with a fixed fee and limit, so you know you won't go over it, and when you do, it's upgrade time).
Thanks.
Your storage charge is based on a 30-day model, so if it's stored there for 30 days then you get billed for the full 5 GB. You would see a slight increase in storage cost each day, like $0.0015 or something.
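Assuming the $0.15-per-GB-month rate, one day of 5 GB works out roughly like this (the exact daily figure Amazon shows also depends on when within the billing period the data landed, since usage is metered in fractions of the month):

```shell
# Pro-rated storage sketch: GB-months accrue as a fraction of a
# 30-day period. Rate assumed to be the 2008-era $0.15/GB-month.
gb_mo=$(awk 'BEGIN { printf "%.3f", 5 * 1 / 30 }')         # 5 GB held for 1 day
cost=$(awk 'BEGIN { printf "%.4f", (5 * 1 / 30) * 0.15 }')
echo "one day of 5 GB = $gb_mo GB-Mo, costing \$$cost"
```

So the "0.002 GB-Mo" shown right after an upload reflects only the few hours the data has existed, and the number keeps creeping up daily.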
Regards
Hareem Haque
@memenode:
I like S3's charging model better since I don't risk losing access by failing to fund an account at any point, cause it funds itself
I might add an automatic funding mechanism in the future – my paypal-fu is rather limited, but I understand that PayPal does have some sort of mechanism for recurring payments. In the meantime, I send out emails warning people when their tarsnap account balances get low, so as long as you're signed up for tarsnap with a working email address it would be very hard to accidentally fail to fund your account when needed.
> I'm not sure, though, whether anything cloud-style is actually better than classic methods (getting a storage box with SSH and rsync support, with a fixed fee and limit, so you know you won't go over it, and when you do, it's upgrade time).
If you have lots of data to back up, running your own backup server might be the most cost-efficient approach – but it has the downside that if you rent a server with a 1 TB disk, you're paying for the entire TB even if it's only half full. Tarsnap (and S3 and other "cloud" storage) is more expensive per GB, but at least you're not paying for unused disk space.
@hareem:
Your storage charge is based on a 30-day model, so if it's stored there for 30 days then you get billed for the full 5 GB. You would see a slight increase in storage cost each day, like $0.0015 or something.
Oh I see. Makes sense.
@cperciva:
I might add an automatic funding mechanism in the future – my paypal-fu is rather limited, but I understand that PayPal does have some sort of mechanism for recurring payments. In the meantime, I send out emails warning people when their tarsnap account balances get low, so as long as you're signed up for tarsnap with a working email address it would be very hard to accidentally fail to fund your account when needed.
Ah that takes care of it then.
@cperciva:
If you have lots of data to back up, running your own backup server might be the most cost-efficient approach – but it has the downside that if you rent a server with a 1 TB disk, you're paying for the entire TB even if it's only half full. Tarsnap (and S3 and other "cloud" storage) is more expensive per GB, but at least you're not paying for unused disk space.
:)
I'm not anywhere near needing a TB.
For example, there are multiple tools for S3, but sometimes changes made by one are not recognized by another, and the more convenient ones are either not fully functional or proprietary (with things like 30-day trials and such).
As for less control: for example, an "internal server error" might sometimes be something I could fix myself if I had SSH access, or there might be less chance of encountering one in the first place, because I'd always be entering the same virtual space on the same box, which, if solid, is solid. With clouds you don't quite know where you are, so to speak; it's kinda random. Any time you connect, you're relying on a set of unknown parameters.
That's at least according to my limited understanding of it. I'm basically comparing a VPS server to an account that uses an unknown number of servers in a cluster, with space (and other resources) allocated from any one of them at any point in time.
Thanks
I use BackupPC to back up to a machine I control. BackupPC does rsync, file pooling, and compression, so you can save hundreds of points in time in little more disk space than a single backup. It's a nice tool; take a look at
If you're worried about your data, you can always encrypt it prior to the sync operation. Remember, s3sync will upload the latest content that you specify, so it works best with incremental backups etc.
You can always customize the bash scripts to do whatever you want.
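A minimal encrypt-before-upload sketch with GnuPG in symmetric mode might look like this; the paths and the passphrase file are placeholders, and you'd point s3sync (or whatever tool) at the resulting .gpg file afterwards:

```shell
# Tar and symmetrically encrypt a directory before it ever leaves the
# box; only the opaque .gpg file gets synced to S3.
tar czf - /var/www | gpg --symmetric --cipher-algo AES256 \
    --batch --passphrase-file /root/.backup-pass \
    -o /tmp/webroot.tar.gz.gpg

# To restore:
# gpg --batch --passphrase-file /root/.backup-pass \
#     -d /tmp/webroot.tar.gz.gpg | tar xzf -
```

The trade-off is that encrypted blobs defeat any incremental/delta logic, so this works best for archives you re-upload whole anyway.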
S3 has gone up to almost $3.50 pretty quickly and at this rate may reach $5 a month, nullifying the price advantage. I suppose it's just down to the specifics of my use case: I have multiple sites and am dumping their databases daily, meaning they get re-uploaded to S3 every day, pumping up my bandwidth. The requests pile up easily too, I suppose.
For $5 a month I get unlimited bandwidth, enough space for my current needs, and no worries about the price going beyond $5, so at this point that seems the better way for my specific case.
Thanks everyone for the suggestions. I'll take another quick sweep of the available solutions before making a decision. No big hurry; at least I have some backup for the time being, and I've now set syncs to once a week.
Jungledisk
The cool thing about JungleDisk (and S3 in general) is that you can easily access the data from any machine - which I've used more than once. I also have my S3 backup drive mounted on my node so I can get a file I've deleted or changed without going through the whole restore thing.
Regarding the errors you see with S3: these are normal. I'm using a bunch of Amazon Web Services, and when you look at their developer documentation, they tell you to expect errors and be prepared to retry. They have a massively parallel system that does fail on single transactions occasionally, but that overall works very well and is very reliable.