Merge two files without pseudo-duplicates
I have two text files, file1.txt and file2.txt, both of which contain lines of words like this:
fare
word
word-ed
wo-ded
wor
and
fa-re
text
uncial
wo-ded
wor
word-ed
Or something like that. By a word I mean a string of letters a-z, possibly with accent marks, and the symbol -. My question is: how do I create, from the Linux command line (using awk, sed, etc.), a third file output.txt from these two files that meets the following conditions:
- If the same word appears in both files, output.txt contains it only once.
- If a hyphenated version of a word in one file (e.g. fa-re in file2.txt) appears in the other file, only the hyphenated version is kept in output.txt (e.g., only fa-re is kept in our case).
Therefore, output.txt should contain the following text:
fa-re
word
word-ed
wo-ded
wor
text
uncial
========================================
I have modified the sample files and added the expected output file.
I will try to make sure manually that no word appears with two different hyphenations (e.g. wod-ed and wo-ded).
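The check for differently hyphenated words could also be scripted rather than done by hand. A minimal sketch, assuming one word per line as in the samples: it groups hyphenated spellings by their de-hyphenated form and reports any word that has more than one. The function name hyphen_conflicts and the files check1.txt and check2.txt are hypothetical, made up for this illustration.

```shell
# Hypothetical helper: list words that occur with more than one hyphenation.
hyphen_conflicts() {
    awk '
    $1 ~ /-/ {
        key = $1
        gsub(/-/, "", key)              # de-hyphenated form identifies the word
        if (!seen[$1]++) {              # handle each distinct spelling once
            sep = (n[key]++ ? " " : "")
            forms[key] = forms[key] sep $1
        }
    }
    END {
        for (key in n)
            if (n[key] > 1) print forms[key]
    }' "$@"
}

# Made-up sample data: wod-ed and wo-ded are the same word, hyphenated differently.
printf '%s\n' fare word wod-ed > check1.txt
printf '%s\n' fa-re wo-ded text > check2.txt

hyphen_conflicts check1.txt check2.txt    # prints: wod-ed wo-ded
```

An empty result means every hyphenated word has a single spelling across the files, so a merge that keeps one spelling per word does not have to pick between competing hyphenations.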
Solution
Another awk:
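# npr.awk: keep one spelling per word, preferring a hyphenated one
!($1 in a) || $1 ~ /-/ {
    key = value = $1
    gsub(/-/, "", key)    # key is the word with its hyphens removed
    a[key] = value        # a hyphenated spelling overwrites a plain one
}
END { for (i in a) print a[i] }    # output order is unspecified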
$ awk -f npr.awk file1.txt file2.txt
text
word-ed
uncial
wor
wo-ded
word
fa-re
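
The for (i in a) loop visits the array in an unspecified order, which is why the words above come out shuffled. If the order matters, a possible variant (a sketch, not part of the script above) records each word the first time it is seen and prints in that order. The file names demo1.txt, demo2.txt and demo_output.txt are hypothetical, and the way the words are split between the two files is an assumption chosen to reproduce the words in the run above.

```shell
# Assumed sample data: the same seven words as in the run above.
printf '%s\n' fare word word-ed wo-ded wor > demo1.txt
printf '%s\n' fa-re text uncial wo-ded wor word-ed > demo2.txt

awk '
NF {                                     # skip blank lines
    key = $1
    gsub(/-/, "", key)                   # compare words with hyphens removed
    if (!(key in a)) order[++n] = key    # remember first-seen order
    if (!(key in a) || $1 ~ /-/)         # same rule as the script above
        a[key] = $1
}
END { for (i = 1; i <= n; i++) print a[order[i]] }
' demo1.txt demo2.txt > demo_output.txt

cat demo_output.txt
```

With this sample data the result lists fa-re first, because fare is the first word encountered and its hyphenated spelling later replaces it in place.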