pipe to gzip - avoiding corruption

So this is an extension of the question I asked in this thread:

http://www.linode.com/forums/viewtopic.php?t=5054

The use case here is consuming the Twitter streaming API. You open an HTTP connection and it sends chunked responses indefinitely. Based on the keywords being tracked this can get very noisy and produce a lot of data.

I'm piping the stream through my grep "tree" using tee, which lets me run regexes on the incoming stream to filter out noisy results. Because this could generate 300-400 MB of unused data daily, I want to pipe it to gzip (I don't want to discard anything, in case of false negatives).

The stream to the gzip process may be terminated at any point on either end of the connection. I'm worried that this puts me at risk for a corrupt gzip file.

I've been able to simulate a corrupt file by forcefully splitting the .gz. I can send the pieces to a Windows box, unzip them, and open them as plain text, although with an error.

I've been unable to get gunzip to process the file and spit out the raw ASCII.
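For what it's worth, a gzip stream that was merely cut short can usually still be decoded up to the cut point by feeding it to a streaming decompressor and keeping whatever comes out. A minimal sketch in Python, assuming the file holds a single gzip member and was truncated rather than scrambled; the function name `recover_truncated_gzip` is made up for illustration:

```python
import gzip
import zlib


def recover_truncated_gzip(data: bytes, chunk_size: int = 4096) -> bytes:
    """Return whatever decompresses cleanly from a possibly truncated gzip stream."""
    # wbits=31 tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(wbits=31)
    recovered = bytearray()
    for start in range(0, len(data), chunk_size):
        try:
            recovered += decomp.decompress(data[start:start + chunk_size])
        except zlib.error:
            # Genuinely corrupt input (not just truncated): keep what we have.
            break
    return bytes(recovered)


def demo() -> None:
    original = b"".join(b"tweet %d\n" % i for i in range(10000))
    full = gzip.compress(original)
    truncated = full[: len(full) // 2]  # simulate a gzip process killed mid-write
    partial = recover_truncated_gzip(truncated)
    print(len(partial), "of", len(original), "bytes recovered")
```

The recovered text will most likely end in the middle of a line, so the last partial line should be treated as suspect.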

Questions: is there a way to open or recover the contents of an ASCII gzip file whose transfer terminated early? And are there any options within gzip, or another zip utility, that offer "transactions", so to speak? By that I mean the utility would only write in full chunks, and would "roll back" if it received a SIGTERM in the middle of writing a line of text.
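I'm not aware of a gzip option that behaves like this, but the format does allow several complete gzip members to be concatenated in one file, which gets close. A rough sketch in Python, assuming lines are collected into batches before being written; `append_batch` and `read_complete_members` are hypothetical helper names, and a single append is not guaranteed to be atomic, so the reader simply discards a trailing member that is incomplete:

```python
import gzip
import os
import zlib


def append_batch(path, lines):
    """Append one batch of lines to path as a complete gzip member."""
    member = gzip.compress(b"".join(lines))
    with open(path, "ab") as out:
        out.write(member)
        out.flush()
        os.fsync(out.fileno())  # ask the OS to push the batch to disk


def read_complete_members(data: bytes) -> bytes:
    """Decode every complete gzip member; drop a truncated final one."""
    recovered = bytearray()
    while data:
        decomp = zlib.decompressobj(wbits=31)  # 31 = gzip header and trailer
        try:
            chunk = decomp.decompress(data)
        except zlib.error:
            break  # corrupt member: stop at the last good batch
        if not decomp.eof:
            break  # member was cut short: the "rollback" case
        recovered += chunk
        data = decomp.unused_data  # bytes belonging to the next member
    return bytes(recovered)
```

With this layout, a kill in the middle of a write costs at most the batch being written; everything before it stays readable.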

Thanks in advance for any tips.

2 Replies

Sounds like you could pipe your output to split which would break it into files of a specified size. Then you could set up a cron job to periodically compress the resulting files.
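To make the idea concrete, here is a rough Python sketch of the same two steps: rotate the incoming stream into plain-text files, then compress the files that are finished. It is not what `split` itself does internally; the function names, the `chunk-` prefix, and the choice to rotate only on line boundaries are all assumptions made for illustration.

```python
import gzip
import os
import shutil


def split_stream(lines, out_dir, max_bytes, prefix="chunk-"):
    """Write lines to numbered plain-text files, starting a new file once
    the current one has reached max_bytes. Returns the list of paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    out = None
    written = 0
    for line in lines:
        if out is None or written >= max_bytes:
            if out is not None:
                out.close()
            path = os.path.join(out_dir, f"{prefix}{len(paths):05d}.txt")
            out = open(path, "wb")
            paths.append(path)
            written = 0
        out.write(line)
        out.flush()  # hand data to the OS as it arrives, not at rotation time
        written += len(line)
    if out is not None:
        out.close()
    return paths


def compress_finished(paths):
    """Gzip every chunk except the last one, which may still be in use."""
    compressed = []
    for path in paths[:-1]:
        gz_path = path + ".gz"
        tmp_path = gz_path + ".tmp"
        with open(path, "rb") as src, gzip.open(tmp_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        os.replace(tmp_path, gz_path)  # rename only once the .gz is complete
        os.remove(path)
        compressed.append(gz_path)
    return compressed
```

In practice the first function would be fed something like `sys.stdin.buffer` and the second would run from the cron job. Because the live file is always plain text, a killed process can only leave a short final chunk, never an unreadable archive.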

Hmm, that's a good idea. So if I pipe stdout into split at, say, 500 MB, it will just buffer the output until it reaches 500 MB and then spit it out to disk, and then I just compress those files?

If so, it seems like a great way to keep things in ASCII until I'm out of the zone for potential failure. I'm going to give it a shot. Thanks!
