There are many ways of doing this. Here are a few, but use the Perl one; it is orders of magnitude faster. I include the others for the sake of completeness:
Perl and hashes, ridiculously fast
perl -e 'open(A,"fileB"); while(<A>){$k{$_}++} while(<>){@a=split(/,/); print if defined $k{$a[0]}}' fileA
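For readability, here is the same logic expanded into a commented script. This is a sketch of the one-liner above, not a separate method; note that the lines keep their trailing newlines on both sides of the hash lookup, so they compare equal:

#!/usr/bin/perl
# Build a hash of every line of fileB, then print each line of
# fileA (read via <>) whose first comma-separated field is a key.
open(my $B, '<', 'fileB') or die "fileB: $!";
my %k;
while (<$B>) {
    $k{$_}++;            # hash insert: O(1) average per line
}
close($B);
while (<>) {             # reads fileA, passed on the command line
    my @a = split(/,/);  # $a[0] is the whole line if there is no comma
    print if defined $k{$a[0]};
}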
gawk and associative arrays, much slower
gawk '{if(NR==FNR){k[$0]++} else {if($0 in k)print}}' fileA fileB
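The same idea in the more common NR==FNR/next idiom, as a commented sketch (equivalent in effect to the if/else form above):

gawk '
    NR == FNR {   # still reading the first file (fileA)
        k[$0]     # store the whole line as an array key
        next
    }
    $0 in k       # second file (fileB): print lines also seen in fileA
' fileA fileB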
grep, ridiculously slow
You will need to modify your fileB slightly to make the patterns match only at the beginning of each line:

sed 's/\(.\)/^\1/' fileB > fileC
grep -f fileC fileA
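To see what that sed step does, here it is applied to one of the sample lines from the test files below; it prepends a ^ so each pattern is anchored to the start of a line:

$ echo 'GO:0032513_GO:0050129' | sed 's/\(.\)/^\1/'
^GO:0032513_GO:0050129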
I created a couple of test files and it turns out that the Perl solution is much faster than the grep one:
$ head -2 fileA
GO:0032513_GO:0050129
GO:0050129_GO:0051712
$ head -2 fileB
GO:0032513_GO:0050129
GO:0050129_GO:0051712
$ wc -l fileA fileB
 1500000 fileA
20000000 fileB
$ time perl -e 'open(A,"fileB"); while(<A>){$k{$_}++} while(<>){@a=split(/,/); print if defined $k{$a[0]}}' fileA > /dev/null

real    0m41.354s
user    0m37.370s
sys     0m3.960s

$ time gawk '{if(NR==FNR){k[$0]++} else {if($0 in k)print}}' fileA fileB

real    2m30.963s
user    1m23.857s
sys     0m9.385s

$ time (join -t, <(sort -n fileA) <(sort -n fileB) >/dev/null)

real    8m29.532s
user    13m52.576s
sys     1m22.029s
So, the Perl scriptlet can go through a 20 million line file looking for 1.5 million patterns and finish in ~40 seconds. Not bad. The other two are much slower: gawk took 2.5 minutes and the grep one has been running for more than 15. Perl wins hands down.