Best Distro, DB & Language for Large Data Volume?

I realize asking these types of questions can result in some unpleasantness, so please be civil towards each other (and me).

I have a project coming up that will require me to batch process a large amount of data once a week (on the weekend). Here's the pertinent info:

  • while processing the files I will have exclusive access to the VPS & DB.

  • processing 20 files (containing csv or xml data) with a total of 300k to 350k records.

  • adding ~450k records to the db: 10 fields (4 indexes), each record ~96 bytes (incl. indexes).

Once the new data has been added I'll need to update related records:

  • updating ~100k records: 20 fields (4 indexes), each record ~160 bytes (incl. indexes).

  • each update will require up to 30 simple math operations (add, sub, mult or div using two arguments).

The db is going to be kind of big:

  • 15 to 18 months of history (65 - 78 weeks worth of records).

  • 30M - 35M total records in DB.

Once each week's data is processed:

  • 25 csv files will be produced (up to 100k records each).

  • during the week, up to 20 concurrent users could be accessing the system, performing simple queries via a web app.

I'm thinking of setting up three Linodes for this customer's services:

  • an appropriately sized VPS to handle the heavy lifting and for user queries.

  • a smaller VPS (separate data center) to function as a backup web server (will have copy of latest DB).

  • a smaller VPS to function as dev/test machine as well as to store most recent DB backups.

Performance while processing the batch files and updating the DB is important, but so is being a good 'VPS neighbor'.

Though I'm familiar with CentOS & Ubuntu, and with MySQL & PHP from web-based projects, I don't want to limit myself to these if there are better solutions for the batch processing.

I'm not sure if any particular Linux distro is better suited to this type of batch processing. I'm guessing that VPS resources & tuning would have more of an impact than the distro, but I'm sure someone has some insight into this.

Is MySQL the best fit for this type of project? Does it scale well to this large of a DB? Would PostgreSQL be a better fit?

I'm aware that Python and Perl are faster than PHP at many things, but would either of them be a better fit for this type of batch processing? Is there another language that would be a better fit? I've programmed in many different languages over the years, so I'm certainly open to suggestions.

Any insight would be greatly appreciated.

Thanks,

James

4 Replies

The choice of distro is religious, not practical. Use whichever one you are most familiar with; it doesn't make much difference as long as it's fairly recent and still gets security updates.

Both MySQL and Postgres will work fine. I'd use Postgres personally, but again, you are better off using the one you know best. Whatever you do, you will need to index the right things and tune well.

PHP makes it easy to write code but it's a terrible language. It's good for user interface stuff though. The speed difference between PHP, Python, Perl, C, or whatever else doesn't matter as much as writing good code and using good algorithms. I'd write the user interface web stuff in PHP and the heavy lifting in Python.

Thanks for your input, sednet. The logic for the batch processing is straightforward and should be relatively easy to optimize, though tightening it up will be important when executing it over 300k times in a row.

I didn't think PHP was the way to go for the heavy lifting, but I do intend to use it for the UI, since the web app should see fewer than 10k page views per month.

Just be careful if you use Python, especially with SQLAlchemy: work with the data in chunks, as Python* is notorious for not releasing memory back to the OS. And if you can parallelize the work, use the 'multiprocessing' lib, not threading, because of the GIL. Then again, if your threads are mostly doing I/O, the GIL might not matter.
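
Something along these lines is what I mean by chunks plus multiprocessing. This is only a rough sketch: the file names, chunk size, worker count, and the insert step are placeholders, not recommendations.

```python
import csv
from multiprocessing import Pool

CHUNK_SIZE = 5000  # placeholder: tune so each batch stays comfortably in RAM

def insert_chunk(rows):
    # placeholder for a bulk INSERT through your DB driver
    print("inserting %d rows" % len(rows))

def process_file(path):
    batch = []
    with open(path) as fh:
        for row in csv.reader(fh):
            batch.append(row)
            if len(batch) >= CHUNK_SIZE:
                insert_chunk(batch)
                batch = []  # drop references so the memory can be reused
    if batch:
        insert_chunk(batch)

if __name__ == "__main__":
    files = ["week_01.csv", "week_02.csv"]  # ~20 files in the real run
    pool = Pool(processes=4)  # separate processes sidestep the GIL
    pool.map(process_file, files)
    pool.close()
    pool.join()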

Python also has Celery, which is nice if you can parallelize across many nodes.
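
The Celery version is just a decorated task that any worker node can pick up. Again, only a sketch; the broker URL and the task body are assumptions.

```python
from celery import Celery

# the Redis broker here is an assumption; RabbitMQ works just as well
app = Celery("weekly_batch", broker="redis://localhost:6379/0")

@app.task
def process_file(path):
    # parse one weekly CSV/XML file and insert its records here
    return path

# on the machine that kicks off the run:
# for path in files:
#     process_file.delay(path)
```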

Otherwise, I'll second what sednet said. PHP for webapp, Python for heavy lifting, PostgreSQL over MySQL.

I have an app that does something similar, with a similar amount of data, except it's XML rather than CSV. Because of the memory overhead of the various lists and dictionaries, combined with Python's aforementioned problem releasing memory back to the OS, I had to rewrite it to create the output files on disk "manually" (using tempfile.SpooledTemporaryFile, which automatically spills to disk if it grows over XX MB) instead of building everything with the xml.etree library, which I now use only to construct small branches, not the whole XML tree.


*) At least 2.6 on CentOS. I hear 3.x and 2.7 have patches that make it less of a problem.
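
A rough outline of that spill-to-disk approach, in case it helps. The element names and the size threshold below are made up for illustration, not what my app actually uses.

```python
import tempfile
import xml.etree.ElementTree as ET

# spill to disk automatically once the buffer passes ~32 MB (threshold is illustrative)
out = tempfile.SpooledTemporaryFile(max_size=32 * 1024 * 1024, mode="w+b")
out.write(b"<?xml version='1.0' encoding='utf-8'?>\n<records>\n")

for i in range(100000):
    # build only one small branch at a time instead of the whole tree
    rec = ET.Element("record", id=str(i))
    ET.SubElement(rec, "value").text = str(i * 2)
    out.write(ET.tostring(rec) + b"\n")

out.write(b"</records>\n")
out.seek(0)  # rewind before copying the spooled file to its final destination
```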

Thanks, Azathoth. I'll make sure I'm running 2.7.

I hadn't thought of processing the batches across multiple nodes. I'll look into Celery.
