Common Words

April 26, 2019

We write our solution in awk.

We must first decide on the definition of a word. Awk’s normal field splitting assumes a word is a maximal sequence of non-space characters, but that includes punctuation. We could consider a word as a maximal sequence of letters, but then contractions like didn’t would count as two words. For the sake of simplicity, and because it solves the problem for the given text, we assume awk’s standard field splitting.

We must also decide what to do with duplicates. If the word WORD appears once on one line and twice on the next, does that count 1 or 2? The only sensible solution is to count 1, even though that is harder to arrange than merely comparing words.

Here is our solution:

$ echo '
> word1 word2 word3 word4
> word4 word5 word6 word7
> word6 word7 word8 word9
> word9 word6 word8 word3
> word1 word4 word5 word4
> ' | awk -v n=3 '
> NR == 1 { for (i = 1; i  NR >  1 { counter = 0
>           for (i = 1; i                if (word[$i]-- > 0) counter++ }
>           if (counter >= n) print $0
>           delete word
>           for (i = 1; i  '
word9 word6 word8 word3

If you want to change the definition of a word, set FS on the awk command line. You can run the program at https://ideone.com/e7zW3o.

Pages: 1 2

2 Responses to “Common Words”

  1. Daniel said

    Here’s a solution in Python. This solution essentially only considers the first occurrence of a word on each line. That is, a word appearing twice on line X is not counted as two matches if the word appears on line X – 1.

    import sys
    
    n = int(sys.argv[1])
    last_words = set()
    for line in open(sys.argv[2]):
        line = line.strip()
        words = set(line.split())
        if len(words.intersection(last_words)) == n:
            print(line)
        last_words = words
    

    Example Usage:

    $ python common.py 3 words.txt
    word9 word6 word8 word3
    
  2. Globules said

    A Haskell version.

    import Control.Arrow ((>>>), (&&&))
    import Data.Function ((&), on)
    import Data.List (intersect, nub)
    import System.Environment (getArgs)
    import Text.Read (readMaybe)
    
    inCommon :: Eq a => Int -> [a] -> [a] -> Bool
    inCommon n xs ys = length (xs `intersect` ys) == n
    
    wordsInCommon :: Int -> [String] -> [String]
    wordsInCommon n ls = let lws = map (id &&& (nub . words)) ls
                         in zip lws (drop 1 lws) &
                            filter (uncurry (inCommon n `on` snd)) &
                            map (fst . snd)
    
    main :: IO ()
    main = do
      args <- getArgs
      case map readMaybe args of
        [Just n] -> interact $ lines >>> wordsInCommon n >>> unlines
        _        -> error "The number of words in common is required."
    
    $ ./common 0 < common.txt 
    word1 word4 word5 word4
    $ ./common 3 < common.txt 
    word9 word6 word8 word3
    $ ./common 5 < common.txt 
    $ 
    

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: