looking for a grep trick
I know I can use the -v option, but my regular expression is going to be a very large OR statement run against multi-GB files. I'd like to avoid scanning the file twice (once with -v and once without).
Any thoughts or tips would be greatly appreciated!
5 Replies
The only way to bend it that way would be to modify grep from source, letting it create 2 output streams, one for match, one for non-match.
ls -A | tee >(grep '^[.]' > hidden-files) >(grep -v '^[.]' > normal-files) | less
Wasn't familiar with process substitution, but I'm loving it.
1.00x
grep '[0-9][0-9][0-9][0-9][0-9]' infile > outfile1 ; grep -v '[0-9][0-9][0-9][0-9][0-9]' infile > outfile2
The naive grep approach turned out to be the fastest, but it reads the input file twice. If the file is larger than memory, that will likely slow things considerably.
1.03x
tee >( grep '[0-9][0-9][0-9][0-9][0-9]' > outfile1 ) < infile | grep -v '[0-9][0-9][0-9][0-9][0-9]' > outfile2
1.03x
tee >( grep '[0-9][0-9][0-9][0-9][0-9]' > outfile1 ) >( grep -v '[0-9][0-9][0-9][0-9][0-9]' > outfile2 ) > /dev/null < infile
These use the tee trick so the input file only needs to be read once. The second form is easier to understand but may take a hair longer since it throws away output to /dev/null. Note that the >() syntax (process substitution) is not part of POSIX sh: bash, zsh, and ksh93 support it, but a plain sh such as dash does not.
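For shells without >(), the same single-pass idea can be approximated with a named pipe. This is only a sketch, not something that was timed above: split_once and its argument order are made-up names for illustration, and mktemp is assumed to be installed (it is common but not required by POSIX).

```shell
#!/bin/sh
# split_once PATTERN INFILE MATCHFILE NOMATCHFILE
# Hypothetical helper: reads INFILE once and splits its lines by PATTERN.
split_once() {
    pattern=$1 infile=$2 matchfile=$3 nomatchfile=$4

    # Bail out early; otherwise the background grep would wait on the
    # FIFO forever if the input redirection below failed.
    [ -r "$infile" ] || return 1

    # Private temp directory so the FIFO name cannot clash with anything.
    # (Assumption: mktemp is available on this system.)
    tmpdir=$(mktemp -d) || return 1
    fifo=$tmpdir/match.fifo
    mkfifo "$fifo" || { rm -rf "$tmpdir"; return 1; }

    # Background reader: lines that match go to MATCHFILE.
    grep -e "$pattern" < "$fifo" > "$matchfile" &

    # tee copies every input line to the FIFO and to stdout;
    # the foreground grep keeps the lines that do not match.
    tee "$fifo" < "$infile" | grep -v -e "$pattern" > "$nomatchfile"

    wait    # let the background grep finish writing
    rm -rf "$tmpdir"
}
```

With the file names used in this thread, the call would be: split_once '[0-9][0-9][0-9][0-9][0-9]' infile outfile1 outfile2. The FIFO plays the role of the first >( ) in the tee forms above, so the input is still read only once.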
1.62x
awk '{ if (/[0-9][0-9][0-9][0-9][0-9]/) print > "outfile1" ; else print > "outfile2" }' infile
Hey, don't forget the old-school text processing languages.
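If the pattern is a long OR list, it may be easier to pass it into awk as a variable than to paste it into the program text. A rough sketch along those lines, where split_awk and its argument order are hypothetical names and not part of the timings in this post:

```shell
#!/bin/sh
# split_awk PATTERN INFILE MATCHFILE NOMATCHFILE
# Hypothetical helper: one pass over INFILE, lines routed by PATTERN.
split_awk() {
    # awk only creates an output file on the first print to it, so start
    # with two empty files in case one side gets no lines at all.
    : > "$3" && : > "$4" || return 1

    # Caveats: awk treats re as an extended regex (grep defaults to basic),
    # and -v processes backslash escapes in the value it is given.
    awk -v re="$1" -v hit="$3" -v miss="$4" '
        { if ($0 ~ re) print > hit; else print > miss }
    ' "$2"
}
```

Called as split_awk '[0-9][0-9][0-9][0-9][0-9]' infile outfile1 outfile2, this should produce the same two files as the one-liner above, since that pattern means the same thing in both regex flavors.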
3.44x
perl -ne 'BEGIN { open(ONE, ">outfile1"); open(TWO, ">outfile2") } if (/[0-9][0-9][0-9][0-9][0-9]/) { print ONE } else { print TWO }' infile
Perl can do more than awk, but in this case takes about twice the time.
14.2x
sed -n -e 's/[0-9][0-9][0-9][0-9][0-9]/&/w outfile1
t
w outfile2' infile
15.1x
sed -n -e '/[0-9][0-9][0-9][0-9][0-9]/ w outfile1' -e '/[0-9][0-9][0-9][0-9][0-9]/! w outfile2' infile
I was a bit surprised to see sed come out so poorly. Not really sure why this is. As expected, testing against the regex once (the first form) is faster than doing it twice (the second), though the syntax is more difficult to follow.