looking for a grep trick

Can anyone think of a way to split a file by using grep to match lines and pipe them to a file, while somehow tee-ing the non-matches to another file?

I know I can use the -v option, but my regular expression is going to be a very large OR against multi-GB files, and I'd like to avoid scanning each file twice (once with -v and once without).

Any thoughts or tips would be greatly appreciated!

5 Replies

grep is designed to show matches or non-matches, but either way it produces only one output stream, whether that goes to stdout, a pipe, or a file.

The only way to bend it that far would be to modify grep's source so it creates two output streams, one for matches and one for non-matches.

There might be an easier way, but if you don't find one, it wouldn't be hard to write a little Perl or Python script for this…

I was thinking there had to be a way with tee, just couldn't nail it down until I found this example:

ls -A | tee >(grep '^[.]' > hidden-files) >(grep -v '^[.]' > normal-files) | less

Wasn't familiar with process substitution but I'm loving it.

from:

http://linux.byexamples.com/archives/144/redirect-output-to-multiple-processes/

:D

I spent way too much time trying out different approaches, so may as well post the results here. I tested with infile, a list of ten million pseudo-random numbers from 0 to 32767. Each command separates this into two output files - one with all five-digit numbers (about 70%), and one with all the others. Here are the results I got in terms of wall-clock times, with the fastest shown first and defined as 1x. Of course you'll get different results with your own input file, regex, and version of the various tools.
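For anyone who wants to reproduce this, here's one hypothetical way to build a similar input file (the post doesn't say how infile was actually generated, and the seed and scaling are my assumptions); awk's rand() returns a value in [0,1), scaled here to 0..32767:

```shell
# Generate ten million pseudo-random integers from 0 to 32767, one per line.
# srand(42) fixes the seed so repeated runs produce the same file.
awk 'BEGIN { srand(42); for (i = 0; i < 10000000; i++) print int(rand() * 32768) }' > infile
```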

1.00x
grep '[0-9][0-9][0-9][0-9][0-9]' infile > outfile1 ; grep -v '[0-9][0-9][0-9][0-9][0-9]' infile > outfile2

The naive grep approach turned out to be the fastest, but it reads the input file twice. If the file is larger than memory, the second pass can't be served from the page cache, which will likely slow things considerably.

1.03x
tee >( grep '[0-9][0-9][0-9][0-9][0-9]' > outfile1 ) < infile | grep -v '[0-9][0-9][0-9][0-9][0-9]' > outfile2

1.03x
tee >( grep '[0-9][0-9][0-9][0-9][0-9]' > outfile1 ) >( grep -v '[0-9][0-9][0-9][0-9][0-9]' > outfile2 ) > /dev/null < infile

These use the tee trick so the input file only needs to be read once. The second form is easier to understand but may take a hair longer since it throws away output to /dev/null. Note that the >() process substitution syntax is a bash/ksh/zsh feature and isn't available in plain POSIX sh.
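For shells without process substitution, the same one-pass split can be sketched with a named pipe; this is my own variant, not from the thread, and should work in plain POSIX sh:

```shell
# Create a named pipe; a backgrounded grep reads matches from it
# while tee feeds the pipe and passes the stream on to grep -v.
mkfifo matched
grep '[0-9][0-9][0-9][0-9][0-9]' < matched > outfile1 &
tee matched < infile | grep -v '[0-9][0-9][0-9][0-9][0-9]' > outfile2
wait          # let the backgrounded grep finish writing outfile1
rm matched
```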

1.62x
awk '{ if (/[0-9][0-9][0-9][0-9][0-9]/) print > "outfile1" ; else print > "outfile2" }' infile

Hey, don't forget the old-school text processing languages.

3.44x
perl -ne 'BEGIN { open(ONE, ">outfile1"); open(TWO, ">outfile2") } if (/[0-9][0-9][0-9][0-9][0-9]/) { print ONE } else { print TWO }' infile

Perl can do more than awk, but in this case it takes about twice as long.

14.2x
sed -n -e 's/[0-9][0-9][0-9][0-9][0-9]/&/w outfile1
t
w outfile2' infile

15.1x
sed -n -e '/[0-9][0-9][0-9][0-9][0-9]/ w outfile1' -e '/[0-9][0-9][0-9][0-9][0-9]/! w outfile2' infile

I was a bit surprised to see sed come out so poorly; I'm not really sure why. As expected, testing against the regex once (the first form) is faster than doing it twice (the second), though its syntax is harder to follow.

Wow - thanks Vance, this is great stuff. I had thought about AWK and SED but didn't get around to testing. Glad to see that you confirmed grep is the fastest option.
