Use grep to find differences between two large glossaries
I have a 78k-line .txt file containing UK words and a 5k-line .txt file containing the most common UK words. I want to filter the most common words out of the large list so that I end up with a new list of the less common words.
I managed to solve my problem another way, but I would really like to know what I am doing wrong, because this doesn't work.
I tried the following:
//To make sure they are trimmed
cut -d" " -f1 78kfile.txt | tac | tac > 78kfile.txt
cut -d" " -f1 5kfile.txt | tac | tac > 5kfile.txt
grep -xivf 5kfile.txt 78kfile.txt > cleansed
But this procedure apparently gives me two empty files.
If I just run grep without cutting first, the output still contains words that I know are in both files.
I’ve tried this too:
sort 78kfile.txt > 78kfile-sorted.txt
sort 5kfile.txt > 5kfile-sorted.txt
comm -3 78kfile-sorted.txt 5kfile-sorted.txt
No luck with that either.
Two text files, in case anyone wants to try it themselves:
https://www.dropbox.com/s/dw3k8ragnvjcfgc/5k-most-common-sorted.txt
https://www.dropbox.com/s/1cvut5z2zp9qnmk/brit-a-z-sorted.txt
Solution
After downloading your files, I noticed that (a) brit-a-z-sorted.txt has Microsoft (CRLF) line endings, while 5k-most-common-sorted.txt has Unix line endings, and (b) you are trying to do whole-line comparisons (grep -x). With a trailing carriage return on every line of one file, no whole line in it can ever equal a line from the other file. So, first we need to convert to Unix line endings:
dos2unix <brit-a-z-sorted.txt >brit-a-z-sorted-fixed.txt
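If you want to confirm the line-ending difference yourself, or dos2unix is not installed, something along these lines should work. This is only a sketch on throwaway sample data: sample-crlf.txt and sample-fixed.txt are made-up names, not the files from the question, and tr -d '\r' is swapped in as a stand-in for dos2unix.

```shell
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample file with Microsoft (CRLF) line endings.
printf 'apple\r\nbanana\r\n' > sample-crlf.txt

# Count the lines that contain a carriage return; 0 means Unix line endings.
cr=$(printf '\r')
before=$(grep -c "$cr" sample-crlf.txt)

# Strip every carriage return, writing the result to a NEW file.
tr -d '\r' < sample-crlf.txt > sample-fixed.txt

# grep -c exits non-zero when the count is 0, hence the "|| true".
after=$(grep -c "$cr" sample-fixed.txt || true)

echo "lines with CR before: $before, after: $after"
```

Running the same grep -c check against each real file should show which of the two needs converting.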
Now we can use grep to remove the common words:
grep -xivFf 5k-most-common-sorted.txt brit-a-z-sorted-fixed.txt >less-common.txt
I also added the -F flag to ensure that the words are interpreted as fixed strings rather than regular expressions. This also speeds things up.
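To see what that command does on a small scale, here is a sketch with two made-up word lists. The names common.txt and all.txt are hypothetical stand-ins for the real files, and the sample words are invented for illustration.

```shell
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical stand-in for the 5k list of common words.
printf 'the\nBritish\ncat\n' > common.txt

# Hypothetical stand-in for the 78k list (already with Unix line endings).
printf 'cat\ncatalogue\nthe\ntheatre\nbritish\nzebra\n' > all.txt

# Keep only the lines of all.txt that are not equal, as a whole line and
# ignoring case, to any line of common.txt.
grep -xivFf common.txt all.txt > less-common.txt

cat less-common.txt
# prints:
# catalogue
# theatre
# zebra
```

Without -x, catalogue and theatre would be dropped as well, because cat and the occur inside them. Without -i, british would survive, because the common list only has British.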
I noticed that there are a few words in the 5k-most-common-sorted.txt file that are not in brit-a-z-sorted.txt. For example, “British” is in the common file but not in the larger file. The common file also has “aluminum”, while the larger file only has “aluminium”.
What do the grep options mean? For those who are curious:
-f means to read the patterns from a file.
-F means to treat the patterns as fixed strings, not regular expressions.
-i means to ignore case.
-x means to match only whole lines.
-v means to invert the match. In other words, print the lines that do not match any of the patterns.
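As for the two empty files in the question: this is most likely caused by redirecting a pipeline's output back into the file it reads from. The shell opens and truncates the output file while it sets up the pipeline, usually before cut has read anything, and piping through tac twice does not reliably prevent that. A minimal sketch of a safer pattern, using a made-up file name (words.txt) and a temporary file:

```shell
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical input: a word followed by an extra column.
printf 'apple 12\nbanana 7\n' > words.txt

# Write to a temporary file first, then replace the original only if cut
# succeeded. The input is never truncated while it is still being read.
cut -d" " -f1 words.txt > words.txt.tmp && mv words.txt.tmp words.txt

cat words.txt
```

The same idea applies to the sort step: always write to a different file name than the one being read.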