Common Words
April 26, 2019
We write our solution in awk.
We must first decide on the definition of a word. Awk’s normal field splitting assumes a word is a maximal sequence of non-space characters, but that includes punctuation. We could consider a word as a maximal sequence of letters, but then contractions like didn’t would count as two words. For the sake of simplicity, and because it solves the problem for the given text, we assume awk’s standard field splitting.
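To make the trade-off concrete, here is a small Python sketch comparing the two definitions of a word. The sample sentence is our own illustration, and the regular expression is an assumption about what "a maximal sequence of letters" means (ASCII letters only).

```python
import re

text = "I didn't stop."

# Whitespace splitting, like awk's default field splitting,
# keeps punctuation attached to the word.
fields = text.split()

# Maximal runs of ASCII letters drop the punctuation,
# but break the contraction into two words.
letters = re.findall(r"[A-Za-z]+", text)

print(fields)   # ['I', "didn't", 'stop.']
print(letters)  # ['I', 'didn', 't', 'stop']
```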
We must also decide what to do with duplicates. If the word WORD appears once on one line and twice on the next, does that count as 1 match or 2? The only sensible solution is to count 1, even though that is harder to arrange than merely comparing words.
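The difference between the two counting rules can be sketched in Python. The two sample lines are made up for illustration; the naive count looks at every occurrence on the current line, while the distinct count compares sets of words.

```python
prev = "word4 word5".split()
cur = "word4 word4 word6".split()

# Naive rule: every occurrence on the current line that also
# appears on the previous line counts as a match.
naive = sum(1 for w in cur if w in prev)

# Distinct rule: each shared word counts once, however often it repeats.
distinct = len(set(prev) & set(cur))

print(naive, distinct)  # 2 1
```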
Here is our solution:
$ echo '
> word1 word2 word3 word4
> word4 word5 word6 word7
> word6 word7 word8 word9
> word9 word6 word8 word3
> word1 word4 word5 word4
> ' | awk -v n=3 '
> NR == 1 { for (i = 1; i <= NF; i++) word[$i] = 1 }
> NR > 1 { counter = 0
>     for (i = 1; i <= NF; i++) {
>         if (word[$i]-- > 0) counter++ }
>     if (counter >= n) print $0
>     delete word
>     for (i = 1; i <= NF; i++) word[$i] = 1 }
> '
word9 word6 word8 word3
If you want to change the definition of a word, set FS on the awk command line. You can run the program at https://ideone.com/e7zW3o.
Here’s a solution in Python. This solution essentially only considers the first occurrence of a word on each line. That is, a word appearing twice on line X is not counted as two matches if the word appears on line X – 1.
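A minimal sketch of that first-occurrence approach, not necessarily the commenter's own code, might look like the following. The function name `common_lines` and its parameters are our own choices, and words are taken to be whitespace-separated fields as in the awk solution.

```python
def common_lines(lines, n):
    """Yield each line that shares at least n distinct words
    with the line immediately before it."""
    prev = set()
    for line in lines:
        # A set keeps only one copy of each word on the line,
        # so repeated words cannot be counted twice.
        words = set(line.split())
        if len(words & prev) >= n:
            yield line
        prev = words


sample = [
    "word1 word2 word3 word4",
    "word4 word5 word6 word7",
    "word6 word7 word8 word9",
    "word9 word6 word8 word3",
    "word1 word4 word5 word4",
]

result = list(common_lines(sample, 3))
print(result)  # ['word9 word6 word8 word3']
```

On the sample text this prints the same line as the awk program, because only the fourth line shares three distinct words with its predecessor.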
A Haskell version.