Reading GZ file from ObjectStorage

Once a month, I need to sync data from 3 sources. The data is downloaded from OpenLibrary.org and comes as *.txt.gz files.

I have a job that downloads the file, extracts it, parses it, and uploads it to the database. The job is multi-threaded: the input file is opened by 4 threads at a time, each reading a different portion (the first 10000 by one thread, the next 10000 by another, and so on).

Now, after the upload to the DB is complete, I have no use for the GZ or TXT files. The GZ files total 8 GB, and I assume the extracted data would be about 80 GB.

Question: At first, I thought I'd add an extra volume to handle this data. But the volume would only be used for a couple of days (until the import to the DB is finished). So I was wondering whether there is a way to store this in Object Storage instead. (This would save cost, since Linode charges a flat rate of $5 for up to 250 GB, whereas a volume would cost me $10 per month for 100 GB!)

Is there a way I can open a file in Object Storage using native PHP functions, like gzread? Also, do you think it is feasible?

1 Reply

I can't directly answer your question…but I can give you something to try…

I suspect that gzopen is a wrapper around fopen…which accepts a URL as a filename.

  • Try using gzread to read from the resource returned by gzopen. Read the first 1 MB or so. You can put as much integrity checking into your program as you want to make sure you get the right stuff back.

  • Close the resource returned by gzopen with gzclose.
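The steps above can be sketched roughly as follows. This is only a minimal sketch: `read_gz_prefix` is a hypothetical helper name, and it assumes the object is reachable through a plain or pre-signed URL and that `allow_url_fopen` is enabled (a local path works too). As far as I know, PHP may spool a remote file to a temporary file before decompressing it, so it is worth watching disk usage on a small object first.

```php
<?php
declare(strict_types=1);

/**
 * Read up to $maxBytes of decompressed data from a gzip source.
 * read_gz_prefix is a hypothetical helper name, not a PHP built-in.
 * $source may be a local path or (assumption) a URL if allow_url_fopen is on.
 */
function read_gz_prefix(string $source, int $maxBytes = 1048576): string
{
    $gz = @gzopen($source, 'rb');
    if ($gz === false) {
        throw new RuntimeException("Could not open $source");
    }

    try {
        $data = '';
        while (strlen($data) < $maxBytes && !gzeof($gz)) {
            // Never ask for more than what is still missing.
            $chunk = gzread($gz, min(8192, $maxBytes - strlen($data)));
            if ($chunk === false) {
                throw new RuntimeException("Read error on $source");
            }
            if ($chunk === '') {
                break; // nothing more to read
            }
            $data .= $chunk;
        }
        return $data;
    } finally {
        // Always release the handle, even when a read fails.
        gzclose($gz);
    }
}
```

Calling `read_gz_prefix($url)` with the URL held in a constant returns the first 1 MB of decompressed text, which can then be compared against the same bytes from a locally extracted copy as an integrity check.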

Make sure you do lots of error-checking. All of this shouldn't be very hard to do. You can hardcode the URL into a constant. Once you get everything working, you can add bells/whistles to your heart's content.

Once you figure out that this will work, you're probably going to have to figure out how to read/process your file in chunks so that you save VM and CPU cycles (check out nice(1)).
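For the chunked processing, one possible approach is to read the stream line by line with gzgets and hand off fixed-size batches, so that only one batch sits in memory at a time. This sketch assumes each record is one line of the .txt.gz file; `process_gz_in_batches` and the callback are hypothetical names, and the batch size of 10000 in the usage note is borrowed from the question.

```php
<?php
declare(strict_types=1);

/**
 * Read a gzip source line by line and pass batches of lines to $handleBatch.
 * process_gz_in_batches is a hypothetical helper, not a PHP built-in.
 * Returns the total number of lines handed to the callback.
 */
function process_gz_in_batches(string $source, int $batchSize, callable $handleBatch): int
{
    if ($batchSize < 1) {
        throw new InvalidArgumentException('Batch size must be at least 1');
    }

    $gz = @gzopen($source, 'rb');
    if ($gz === false) {
        throw new RuntimeException("Could not open $source");
    }

    $total = 0;
    $batch = [];
    try {
        // gzgets returns false at end of file.
        while (($line = gzgets($gz)) !== false) {
            $batch[] = rtrim($line, "\r\n");
            if (count($batch) === $batchSize) {
                $handleBatch($batch);
                $total += count($batch);
                $batch = []; // drop the processed lines
            }
        }
        // Flush the final, possibly shorter, batch.
        if ($batch !== []) {
            $handleBatch($batch);
            $total += count($batch);
        }
    } finally {
        gzclose($gz);
    }

    return $total;
}
```

A call such as `process_gz_in_batches($url, 10000, $insertRows)` would pass 10000 lines at a time to `$insertRows`, a hypothetical callable that parses the lines and writes them to the database.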

-- sw
