pipe to gzip - avoiding corruption

forum:ross 14 years, 11 months ago

So this is an extension of the question I asked in this thread:

http://www.linode.com/forums/viewtopic.php?t=5054

The use case here is consuming the Twitter streaming API. You open an HTTP connection and it sends chunked responses indefinitely. Based on the keywords being tracked this can get very noisy and produce a lot of data.

What I'm doing is piping the stream through my grep "tree" using tee. This allows me to run regexes on the incoming stream to filter out noisy results. Because the filtering could leave 300-400MB of unused data daily, I want to pipe that data to gzip rather than discard it (I don't want to lose anything to false negatives).

The stream to the gzip process may be terminated at any point on either end of the connection. I'm worried that this puts me at risk for a corrupt gzip file.

I've been able to simulate a corrupt file by forcefully splitting the .gz file. I can send the pieces to a Windows box, unzip them, and open them as plain text, although with an error.

I've been unable to get gunzip to process the file and spit out the raw ASCII.

Questions: is there a way to open or recover the contents of a gzipped ASCII file whose transfer terminated early? Are there any options within gzip or another compression utility that provide "transactions", so to speak? That is, the utility would only write full chunks and "roll back" if it receives a SIGTERM in the middle of writing a line of text.

Thanks in advance for any tips.
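On the recovery part of the question, here is a minimal Python sketch (not something from the original thread) that assumes the file is an ordinary gzip stream that was simply cut off. It uses the standard library's streaming decompressor, which hands back whatever it managed to decode instead of failing on the missing trailer. The function name `recover_truncated_gzip` is made up for illustration.

```python
import gzip
import zlib


def recover_truncated_gzip(data: bytes) -> bytes:
    """Return every byte that can be decoded from a possibly truncated gzip stream."""
    # MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
    # decompress() returns what it could decode so far; a missing
    # trailer (CRC and length) is not treated as an error here.
    return decomp.decompress(data)


if __name__ == "__main__":
    # Build some line-oriented test data and cut the compressed form in half
    # to simulate a stream that was killed mid-write.
    original = b"".join(b"tweet number %d\n" % i for i in range(10000))
    full = gzip.compress(original)
    truncated = full[: len(full) // 2]

    recovered = recover_truncated_gzip(truncated)
    print(len(recovered), "of", len(original), "bytes recovered")
```

The recovered data is a prefix of the original and will usually end partway through a line, so the last partial line should be dropped before further processing. This only helps with truncation; it does nothing for a file damaged in the middle.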

2 Replies

forum:Vance · Answer 1 · Jan. 15, 2010, 12:38 p.m.

Sounds like you could pipe your output to split, which would break it into files of a specified size. Then you could set up a cron job to periodically compress the resulting files.
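The same idea can be sketched in Python, as an illustration only and not as a replacement for split plus cron. The function names, the `chunk-` file prefix, and the `.gz.tmp` suffix are all assumptions made for this example. The first function plays the role of split, rotating files at line boundaries; the second is what a periodic job might run on each finished chunk.

```python
import gzip
import os
import shutil


def write_chunks(stream, directory, max_bytes, prefix="chunk-"):
    """Copy lines from a binary stream into numbered plain-text files of roughly max_bytes each."""
    paths = []
    out = None
    written = 0
    for line in stream:
        # Rotate only between lines, so no line is ever split across files.
        if out is None or written >= max_bytes:
            if out is not None:
                out.close()
            path = os.path.join(directory, "%s%05d.txt" % (prefix, len(paths)))
            out = open(path, "wb")
            paths.append(path)
            written = 0
        out.write(line)
        written += len(line)
    if out is not None:
        out.close()
    return paths


def compress_finished(path):
    """Gzip one completed chunk; delete the plain-text copy only after the .gz is complete."""
    tmp = path + ".gz.tmp"
    with open(path, "rb") as src, gzip.open(tmp, "wb") as dst:
        shutil.copyfileobj(src, dst)
    # Rename at the end so a half-written archive never appears under the final name.
    os.replace(tmp, path + ".gz")
    os.remove(path)
    return path + ".gz"
```

In a long-running setup, `write_chunks` would read from something like `sys.stdin.buffer`, and the periodic job would call `compress_finished` on every chunk except the newest one, which may still be open for writing. If the compression step is interrupted, the plain-text chunk is still on disk and the job can simply be run again.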

forum:ross · Answer 2 · Jan. 15, 2010, 2:45 p.m.

Hmm, that's a good idea. So if I pipe stdout into split at, say, 500MB, it will just buffer the output until it reaches 500MB and then spit it out to disk… then I just compress those files?

If so, it seems like a great way to keep things in ASCII until I'm out of the zone of potential failure. Going to give it a shot. Thanks!
