Common Words
April 26, 2019
Today’s exercise comes from Stack Overflow:
Given a text file like:
word1 word2 word3 word4
word4 word5 word6 word7
word6 word7 word8 word9
word9 word6 word8 word3
word1 word4 word5 word4

Write a program that returns those lines that have n words in common with the previous line. For instance, given the input above and n = 3, the only output line would be:
word9 word6 word8 word3
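To see why only that line qualifies, the overlap between each line and its predecessor can be counted directly. What follows is a minimal sketch, assuming the sample input consists of five lines of four words each and that repeated words on a line count once; the names `SAMPLE` and `overlap_counts` are hypothetical and not part of the original exercise.

```python
# Sample input, assumed to be five lines of four words each.
SAMPLE = """\
word1 word2 word3 word4
word4 word5 word6 word7
word6 word7 word8 word9
word9 word6 word8 word3
word1 word4 word5 word4
"""


def overlap_counts(text):
    """Return (line, count) pairs, where count is the number of distinct
    words the line shares with the line immediately before it."""
    lines = text.splitlines()
    result = []
    # Pair every line with its predecessor; the first line has none.
    for prev, cur in zip(lines, lines[1:]):
        shared = set(prev.split()) & set(cur.split())
        result.append((cur, len(shared)))
    return result


for line, count in overlap_counts(SAMPLE):
    print(count, line)
```

Under these assumptions the counts for lines two through five are 1, 2, 3 and 0, so a request for 3 common words selects only the fourth line.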
The original question requested a solution in sed or awk, but you are free to use any language.
Your task is to write a program to extract lines from a text file that have n words in common with the previous line. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.
Here’s a solution in Python. This solution essentially only considers the first occurrence of a word on each line. That is, a word appearing twice on line X is not counted as two matches if the word appears on line X – 1.
import sys

n = int(sys.argv[1])
last_words = set()
for line in open(sys.argv[2]):
    line = line.strip()
    words = set(line.split())
    if len(words.intersection(last_words)) == n:
        print(line)
    last_words = words

Example Usage:
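As a runnable illustration, here is a sketch that wraps the same logic in a function so it can be called on a list of lines instead of a file named on the command line. The function name `common_lines` and the variable `sample` are hypothetical, and the sample is assumed to be five lines of four words each. The second call illustrates the point about duplicates: a word repeated on a line is counted once.

```python
def common_lines(lines, n):
    """Return the lines that share exactly n distinct words with the
    line before them (the first line is compared with an empty set)."""
    matches = []
    last_words = set()
    for line in lines:
        line = line.strip()
        words = set(line.split())
        if len(words & last_words) == n:
            matches.append(line)
        last_words = words
    return matches


sample = [
    "word1 word2 word3 word4",
    "word4 word5 word6 word7",
    "word6 word7 word8 word9",
    "word9 word6 word8 word3",
    "word1 word4 word5 word4",
]
print(common_lines(sample, 3))  # ['word9 word6 word8 word3']

# "b b c" contains "b" twice, but it still counts as one shared word with "a b".
print(common_lines(["a b", "b b c"], 1))  # ['b b c']
```

Because the first line is compared with an empty set, this approach reports the first line when n is 0.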
A Haskell version.
import Control.Arrow ((>>>), (&&&))
import Data.Function ((&), on)
import Data.List (intersect, nub)
import System.Environment (getArgs)
import Text.Read (readMaybe)

inCommon :: Eq a => Int -> [a] -> [a] -> Bool
inCommon n xs ys = length (xs `intersect` ys) == n

wordsInCommon :: Int -> [String] -> [String]
wordsInCommon n ls =
  let lws = map (id &&& (nub . words)) ls
  in zip lws (drop 1 lws)
       & filter (uncurry (inCommon n `on` snd))
       & map (fst . snd)

main :: IO ()
main = do
  args <- getArgs
  case map readMaybe args of
    [Just n] -> interact $ lines >>> wordsInCommon n >>> unlines
    _        -> error "The number of words in common is required."