Linux – Use grep to find differences between two large glossaries

I have a 78k-line .txt file containing UK words and a 5k-line .txt file containing the most common UK words. I want to remove the most common words from the large list so that I am left with a new list of uncommon words.

I managed to solve my problem another way, but I really want to know what I am doing wrong, because this doesn’t work.

I tried the following:

//To make sure they are trimmed
cut -d" " -f1 78kfile.txt | tac | tac > 78kfile.txt
cut -d" " -f1 5kfile.txt | tac | tac > 5kfile.txt
grep -xivf 5kfile.txt 78kfile.txt > cleansed

But this procedure apparently gives me two empty files.

If I just run grep without cutting first, I get words that are present in both files.

I’ve tried this too:

sort 78kfile.txt > 78kfile-sorted.txt
sort 5kfile.txt > 5kfile-sorted.txt
comm -3 78kfile-sorted.txt 5kfile-sorted.txt

No luck either.

Two text files, in case anyone wants to try it themselves:
https://www.dropbox.com/s/dw3k8ragnvjcfgc/5k-most-common-sorted.txt
https://www.dropbox.com/s/1cvut5z2zp9qnmk/brit-a-z-sorted.txt

Solution

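After downloading your files, I noticed that (a) brit-a-z-sorted.txt has DOS/Windows (CRLF) line endings, while 5k-most-common-sorted.txt has Unix (LF) line endings, and (b) you are doing a whole-line comparison (grep -x). Because every line of the larger file ends in an extra carriage return, no word from the common list can match it as a whole line. So, first we need to convert to Unix line endings: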

dos2unix <brit-a-z-sorted.txt >brit-a-z-sorted-fixed.txt
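
If dos2unix is not installed, the same check and conversion can be sketched with standard tools. The file names below (sample-crlf.txt, sample-fixed.txt) are invented stand-ins for the real word lists, and the example builds its own tiny input so it can be run anywhere:

```shell
#!/bin/sh
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical stand-in for brit-a-z-sorted.txt: three words with CRLF endings.
printf 'apple\r\nbanana\r\ncherry\r\n' > sample-crlf.txt

# Count the lines that contain a carriage return; a non-zero count suggests DOS endings.
cr=$(printf '\r')
crlf_lines=$(grep -c "$cr" sample-crlf.txt)
echo "lines with CR before: $crlf_lines"

# Delete every carriage return. For a plain CRLF file this gives the same result as dos2unix.
tr -d '\r' < sample-crlf.txt > sample-fixed.txt

# After the conversion, no line should contain a carriage return.
remaining=$(grep -c "$cr" sample-fixed.txt)
echo "lines with CR after: $remaining"
```

Running the same count against the real brit-a-z-sorted.txt before and after conversion is a quick way to confirm that the line endings were the problem.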

Now we can use grep to remove common words:

grep -xivFf 5k-most-common-sorted.txt brit-a-z-sorted-fixed.txt >less-common.txt

I also added the -F flag to ensure that words will be interpreted as fixed strings instead of regular expressions. This also speeds things up.
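
To see the effect on a small scale, here is a sketch with two invented word lists (common.txt and all.txt are hypothetical names, not the files from the question). It also shows why -x matters: without it, the pattern "art" would remove "party" as well.

```shell
#!/bin/sh
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical small "common words" list, one word per line.
printf 'art\nThe\n' > common.txt

# Hypothetical larger list; both files use Unix line endings.
printf 'art\nparty\nthe\nzygote\n' > all.txt

# Whole-line, case-insensitive, fixed-string, inverted match:
# keep only the lines of all.txt that equal no line of common.txt.
with_x=$(grep -xivFf common.txt all.txt)

# Same command without -x: "art" now also matches inside "party".
without_x=$(grep -ivFf common.txt all.txt)

printf 'with -x:\n%s\n' "$with_x"
printf 'without -x:\n%s\n' "$without_x"
```

With -x the result keeps "party" and "zygote"; without it only "zygote" survives, because substring matches are filtered out too.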

I also noticed that there are a few words in 5k-most-common-sorted.txt that are not in brit-a-z-sorted.txt. For example, “British” is in the common file but not in the larger file. The common file also has “aluminum”, while the larger file has only “aluminium”.
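
One way to list such words is to swap the two files in the same grep command, so that the larger list supplies the patterns. The sketch below uses invented sample files (common.txt, all.txt) rather than the real lists:

```shell
#!/bin/sh
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical "common words" list containing two entries the larger list lacks.
printf 'aluminum\nBritish\nhouse\n' > common.txt

# Hypothetical larger list with the British spelling only.
printf 'aluminium\nhouse\nzebra\n' > all.txt

# Patterns come from the larger list, input is the common list:
# print the common words that appear nowhere in the larger list.
only_in_common=$(grep -xivFf all.txt common.txt)

printf '%s\n' "$only_in_common"
```

Both input files need Unix line endings here as well, otherwise every common word would be reported as missing.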

What do the grep options mean? For those who are curious:

-f means to read the patterns from a file.

-F means to treat the patterns as fixed strings, not regular expressions.

-i means to ignore case.

-x means to match whole lines only.

-v means to invert the match. In other words, print the lines that do not match any pattern.
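
As for the two empty files in the first attempt from the question: the most likely cause is that the shell opens and truncates the target of `> 78kfile.txt` before cut has read it, so reading and writing the same file in one pipeline tends to lose the data. A common workaround, sketched here with an invented file name (words.txt), is to write to a temporary file and rename it afterwards:

```shell
#!/bin/sh
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical input: a word followed by an extra column.
printf 'apple 12\nbanana 7\n' > words.txt

# Write the first column to a temporary file, then replace the
# original only if cut succeeded.
cut -d' ' -f1 words.txt > words.txt.tmp && mv words.txt.tmp words.txt

result=$(cat words.txt)
printf '%s\n' "$result"
```

The original file is never opened for writing while cut is still reading it, so its contents cannot be truncated mid-pipeline.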
