Best Distro, DB & Language for Large Data Volume?
I have a project coming up that will require me to batch process a large amount of data once a week (on the weekend). Here's the pertinent info:
while processing the files I will have exclusive access to the VPS & DB.
processing 20 files (containing csv or xml data) with a total of 300k to 350k records.
adding ~450k records to the db: 10 fields (4 indexes), each record ~96 bytes (incl. indexes).
Once the new data has been added I'll need to update related records:
updating ~100k records: 20 fields (4 indexes), each record ~160 bytes (incl. indexes).
each update will require up to 30 simple math operations (add, sub, mult or div using two arguments).
The db is going to be kind of big:
15 to 18 months of history (65 - 78 weeks' worth of records).
30M - 35M total records in DB.
Once each week's data is processed:
25 csv files will be produced (up to 100k records each).
during the week up to 20 concurrent users could be accessing the system, performing simple queries via a web app.
I'm thinking of setting up three Linodes for this customer's services:
an appropriately sized VPS to handle the heavy lifting and for user queries.
a smaller VPS (separate data center) to function as a backup web server (will have a copy of the latest DB).
a smaller VPS to function as dev/test machine as well as to store most recent DB backups.
Performance while processing the batch files and updating the DB is important, but so is being a good 'VPS neighbor'.
Though I'm familiar with CentOS & Ubuntu, and with MySQL & PHP from web-based projects, I don't want to limit myself to these if there are better solutions for the batch processing.
I'm not sure if any particular Linux distro is better suited to this type of batch processing. I'm guessing that VPS resources & tuning would have more of an impact than the distro, but I'm sure someone has some insight into this.
Is MySQL the best fit for this type of project? Does it scale well to this large of a DB? Would PostgreSQL be a better fit?
I'm aware that Python and Perl are faster than PHP at many things, but would either one of them be a better fit for this type of batch processing? Is there another language that would be a better fit? I've programmed in many different languages over the years, so I'm certainly open to suggestions.
Any insight would be greatly appreciated.
Thanks,
James
4 Replies
Both MySQL and Postgres will work fine. I'd use Postgres personally, but again, you're better off using the one you know best. Whatever you do, you'll need to index the right things and tune well.
PHP makes it easy to write code but it's a terrible language. It's good for user interface stuff though. The speed difference between PHP, Python, Perl, C, or whatever else doesn't matter as much as writing good code and using good algorithms. I'd write the user interface web stuff in PHP and the heavy lifting in Python.
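For the heavy lifting, this is roughly the shape I'd aim for. A rough sketch assuming PostgreSQL with psycopg2; the table and column names are just placeholders:

```python
# Rough sketch, not production code. Assumes PostgreSQL + psycopg2;
# table and column names are made up for illustration.
import glob
import psycopg2

def load_week(conn, csv_dir):
    """Bulk-load the week's CSV files with COPY (much faster than row-by-row INSERTs)."""
    with conn.cursor() as cur:
        for path in sorted(glob.glob(csv_dir + "/*.csv")):
            with open(path) as f:
                cur.copy_expert(
                    "COPY weekly_records (account_id, amount, category) "
                    "FROM STDIN WITH (FORMAT csv, HEADER true)",
                    f,
                )
    conn.commit()

def update_related(conn):
    """Apply the weekly arithmetic to related records in one set-based UPDATE."""
    with conn.cursor() as cur:
        # In practice you'd also restrict this to the newly loaded week's rows.
        cur.execute("""
            UPDATE summary s
               SET total = s.total + w.amount,
                   weeks = s.weeks + 1
              FROM weekly_records w
             WHERE w.account_id = s.account_id
        """)
    conn.commit()

if __name__ == "__main__":
    conn = psycopg2.connect("dbname=app user=batch")
    load_week(conn, "/data/incoming")
    update_related(conn)
    conn.close()
```

At your volumes, using COPY for the load and one set-based UPDATE (instead of 100k individual UPDATEs) will make a much bigger difference than the choice of language.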
I didn't think PHP was the way to go for the heavy lifting, but I do intend to use it for the UI since the web app should have less than 10k page views per month.
Python also has Celery.
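For instance, the weekly run could be split into one task per input file. An untested sketch, assuming a Redis broker; the module, task, and path names are placeholders:

```python
# Untested sketch of splitting the weekly batch into Celery tasks.
# Assumes a Redis broker/result backend; names are placeholders.
from celery import Celery, chord

app = Celery("weekly_batch",
             broker="redis://localhost:6379/0",
             backend="redis://localhost:6379/0")

@app.task
def process_file(path):
    # Parse one CSV/XML input file and load its records into the DB.
    # (DB work omitted here.)
    return path

@app.task
def finish_week(loaded_paths):
    # Runs once, after every input file has been processed:
    # apply the related-record updates, then write the 25 output CSVs.
    return len(loaded_paths)

def run_week(paths):
    # A chord runs all process_file tasks (possibly on several workers),
    # then calls finish_week with the list of their results.
    return chord(process_file.s(p) for p in paths)(finish_week.s())
```

Whether that's worth the extra moving parts for a once-a-week job is another question.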
Otherwise, I'll second what sednet said. PHP for webapp, Python for heavy lifting, PostgreSQL over MySQL.
I have an app that does something similar, with a similar amount of data, except it's XML rather than CSV. Because of the memory overhead of the various lists and dictionaries, combined with Python's reluctance to release memory back to the OS, I had to rewrite it to create the output files "manually" on disk (using tempfile.SpooledTemporaryFile, which automatically spills to disk once it grows past XX MB) instead of building the output with the xml.etree library; I now use xml.etree only to construct simple branches, never the whole XML tree.
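The pattern I ended up with looks roughly like this (simplified; the element names, record fields, and the 50 MB threshold are made up):

```python
# Simplified version of the pattern; names and the spool threshold are illustrative.
import shutil
import tempfile
import xml.etree.ElementTree as ET

def write_report(records, out_path, spool_mb=50):
    # The buffer lives in memory and transparently spills to disk
    # once it grows past spool_mb, so memory use stays flat.
    with tempfile.SpooledTemporaryFile(max_size=spool_mb * 1024 * 1024) as buf:
        buf.write(b"<?xml version='1.0' encoding='utf-8'?>\n<report>\n")
        for rec in records:
            # Build only one small branch at a time, never the whole tree.
            item = ET.Element("item", id=str(rec["id"]))
            ET.SubElement(item, "value").text = str(rec["value"])
            buf.write(ET.tostring(item) + b"\n")
        buf.write(b"</report>\n")
        buf.seek(0)
        with open(out_path, "wb") as out:
            shutil.copyfileobj(buf, out)
```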
I hadn't thought of processing the batches across multiple nodes. I'll look into Celery.