Linux – Merge two files without pseudo-duplicates

I have two text files, file1.txt and file2.txt, each containing one word per line, like this:

fare
word
phrasing
world
world

and


fa-re
text
uncial
Tree-grown
world
phrasing

Or something like that. By a word I mean a string of letters a-z, possibly with accent marks, plus the symbol -. My question is: how can I create, from the Linux command line (using awk, sed, etc.), a third file output.txt from the two files that meets the following conditions:

  1. If the same word appears in both files, output.txt contains it only once.
  2. If a hyphenated version of a word in one file (e.g. fa-re in file2.txt) appears in the other file, only the hyphenated version is kept in output.txt (e.g. only fa-re is kept in our case).

Therefore, output.txt should contain the following text:

fa-re
word
phrasing
world
world
text
uncial

======================================================================================================================================

I have modified the sample files and added the expected output file.
I will manually make sure that no word appears with two different hyphenations (e.g. wod-ed and wo-ded).
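
That manual check could also be automated. As a rough sketch (the function name hyphen_conflicts is made up for illustration, and it assumes one word per line as in the samples above), the following lists pairs of spellings that differ only in where the hyphens fall:

```shell
# Hypothetical helper: report words that occur with two different hyphenations.
hyphen_conflicts() {
    awk '
        $1 ~ /-/ {                      # only hyphenated spellings can conflict
            key = $1
            gsub(/-/, "", key)          # key: the word with its hyphens removed
            if ((key in seen) && seen[key] != $1)
                print seen[key], $1     # same letters, different hyphenation
            else
                seen[key] = $1          # remember the first hyphenated spelling
        }
    ' "$@"
}
```

Calling it as hyphen_conflicts file1.txt file2.txt prints nothing when every word has at most one hyphenated spelling, so any output points at lines to fix by hand before merging.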

Solution

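Another awk solution (saved as npr.awk):

# Keep a word if it is new, or if it is hyphenated (hyphenated form wins).
!($1 in a) || $1 ~ /-/ {
    key = value = $1
    gsub(/-/, "", key)   # key: the word with its hyphens removed
    a[key] = value
}
# One spelling per word; the order of "for (i in a)" is unspecified.
END { for (i in a) print a[i] }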

$ awk -f npr.awk file1.txt file2.txt
text
word-ed
uncial
wor
wo-ded
word
fa-re
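
The order of the lines above comes from awk's for (i in a) loop, which does not guarantee any particular order. If the words should come out in order of first appearance, one possible variation is sketched below. The function name merge_words and the helper array order are illustrative assumptions, not part of the original answer, and blank lines are skipped on the assumption that they carry no words.

```shell
# Hypothetical variant of the script above that preserves first-seen order.
merge_words() {
    awk '
        NF == 0 { next }                          # ignore blank lines
        {
            key = $1
            gsub(/-/, "", key)                    # key: word without hyphens
            if (!(key in a)) order[++n] = key     # record first appearance
            if (!(key in a) || $1 ~ /-/) a[key] = $1   # hyphenated form wins
        }
        END { for (i = 1; i <= n; i++) print a[order[i]] }
    ' "$@"
}
```

A call such as merge_words file1.txt file2.txt > output.txt would then write the merged list to output.txt. The in test does not create array elements, which is why the same condition can be checked twice before a[key] is assigned.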
