pipe to gzip - avoiding corruption
The use case here is consuming the Twitter streaming API. You open an HTTP connection and it sends chunked responses indefinitely. Depending on the keywords being tracked, this can get very noisy and produce a lot of data.
What I'm doing is piping the stream through my grep "tree" using tee, which lets me run regexes on the incoming stream to filter out noisy results. Because this could generate 300-400 MB of unused data daily, I want to pipe it to gzip rather than discard it (I don't want to lose anything to false negatives).
The stream to the gzip process may be terminated at any point, on either end of the connection. I'm worried that this puts me at risk of ending up with a corrupt gzip file.
I've been able to simulate a corrupt file by forcefully splitting the .gz. I can send the pieces to a Windows box, unzip them, and open them as plain text, although with an error.
I've been unable to get gunzip to process the file and spit out the raw ASCII.
Questions: is there a way to open or recover the contents of an ASCII gzip file whose transfer terminated early? And are there any options within gzip, or another zip utility, that provide "transactions", so to speak? Meaning the zip utility would only write in full chunks and "roll back" if it receives a SIGTERM in the middle of writing a line of text.
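To make both questions concrete, here is a minimal sketch in Python's standard zlib module, not a gzip command-line option. It assumes the writer can be replaced by a small script; the function names compress_lines and recover_complete_lines are hypothetical. The writer flushes the compressor after every line (Z_SYNC_FLUSH), so everything up to the last flush is decodable, and the reader decompresses as far as the truncated file allows and then drops any trailing partial line. Flushing per line costs some compression ratio.

```python
import zlib

GZIP_WBITS = 31  # zlib "wbits" value that selects the gzip container format


def compress_lines(lines):
    """Yield gzip-format byte chunks, flushing the compressor after each line.

    Z_SYNC_FLUSH pushes all pending output and aligns it on a byte boundary,
    so a file cut off after any yielded chunk decodes up to that line.
    """
    comp = zlib.compressobj(6, zlib.DEFLATED, GZIP_WBITS)
    for line in lines:
        yield comp.compress(line) + comp.flush(zlib.Z_SYNC_FLUSH)
    yield comp.flush()  # final flush writes the gzip trailer (CRC and length)


def recover_complete_lines(data, block_size=4096):
    """Decompress as much of a truncated gzip stream as possible.

    Returns only whole lines: anything after the last newline is dropped,
    which approximates the "rollback" of a half-written line.
    """
    decomp = zlib.decompressobj(GZIP_WBITS)
    pieces = []
    for start in range(0, len(data), block_size):
        try:
            pieces.append(decomp.decompress(data[start:start + block_size]))
        except zlib.error:
            break  # damaged bytes: keep what was decoded before this block
    text = b"".join(pieces)
    return text[: text.rfind(b"\n") + 1]


# Simulate a stream that dies partway through the fourth line.
sample = [b"tweet number %d\n" % i for i in range(5)]
parts = list(compress_lines(sample))
cut_off = b"".join(parts[:3]) + parts[3][:1]
print(recover_complete_lines(cut_off).decode("ascii"), end="")
```

A file written this way is still an ordinary gzip file when the stream ends cleanly. Only the truncated case needs the streaming reader, because the missing trailer is what makes a whole-file decompress report an error.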
Thanks in advance for any tips.
2 Replies
split
If so, it seems like a great way to keep things in ASCII until I'm out of the zone of potential failure. Going to give it a shot. Thanks!
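The idea of keeping the data in plain text until a chunk is complete, and only then compressing it, could be sketched roughly as follows. This is an illustration in Python rather than the split utility itself; the function names, the chunk-NNNN.txt naming scheme, and the lines_per_chunk parameter are all assumptions made for the example.

```python
import gzip
import os
import shutil


def compress_finished_chunk(path):
    """Gzip a closed plain-text chunk, then remove the original file."""
    tmp = path + ".gz.tmp"
    with open(path, "rb") as src, gzip.open(tmp, "wb") as dst:
        shutil.copyfileobj(src, dst)
    # Rename only after the .gz is fully written, so a crash during
    # compression never leaves a half-written file under the final name.
    os.replace(tmp, path + ".gz")
    os.remove(path)
    return path + ".gz"


def write_chunks(lines, directory, lines_per_chunk=1000):
    """Write lines to numbered plain-text chunks, gzipping each full chunk."""
    index, count, out, current = 0, 0, None, None
    for line in lines:
        if out is None:
            current = os.path.join(directory, "chunk-%04d.txt" % index)
            out = open(current, "wb")
        out.write(line)
        count += 1
        if count == lines_per_chunk:
            out.close()
            compress_finished_chunk(current)
            out, count, index = None, 0, index + 1
    if out is not None:
        out.close()  # the partial last chunk stays as readable plain text
```

If the stream dies mid-chunk, the worst case is a plain-text file with an incomplete last line. The already compressed chunks are untouched.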