Comm

May 10, 2011

The main function runs through the two files in tandem. If the current line from the first file is less than the current line from the second file, it goes to the first column, since it appears in the first file but not the second file, and a new line is read from the first file, whence the process continues; likewise, if the current line from the second file is less than the current line from the first file, it goes to the second column, since it appears in the second file but not the first file, and a new line is read from the second file, whence the process continues. When the two files have the same current line, it goes to the third column, since it appears in both files, and new lines are read from both files. End-of-file on either file means that all the remaining lines from the other file go in one of the first two columns, depending on which file is exhausted, and end-of-file on both files stops the iteration. Here’s the code:

(define (comm opts file1 file2)
  (let ((p1 (if (string=? file1 "-") (current-input-port) (open-input-file file1)))
        (p2 (if (string=? file2 "-") (current-input-port) (open-input-file file2))))
    (let loop ((f1 (read-line p1)) (f2 (read-line p2)))
      (cond ((and (eof-object? f1) (eof-object? f2))
              (close-input-port p1) (close-input-port p2))
            ((eof-object? f1)
              (putcol opts #\2 f2) (loop f1 (read-line p2)))
            ((eof-object? f2)
              (putcol opts #\1 f1) (loop (read-line p1) f2))
            ((string<? f1 f2)
              (putcol opts #\1 f1) (loop (read-line p1) f2))
            ((string<? f2 f1)
              (putcol opts #\2 f2) (loop f1 (read-line p2)))
            (else (putcol opts #\3 f1) (loop (read-line p1) (read-line p2)))))))

A line of output is potentially written after each pair of words is matched. Putcol gets both the line and column as parameters and writes the line in the proper column, unless that column is suppressed in the command-line arguments:

(define (putcol opts c str)
  (when (not (member (list c) opts))
    (when (char=? c #\2) (display #\tab))
    (when (char=? c #\3) (display #\tab) (display #\tab))
    (display str) (newline)))

Those two functions are assembled into a complete program callable from the command line at http://programmingpraxis.codepad.org/lE5eVcxf. We used read-line from the Standard Prelude and getopt from a previous exercise.

One use of comm is in this classic Unix spell-checking pipeline, here applied to the text of comm.ss:

$ cat comm.ss |
> tr 'A-Z' 'a-z' |
> tr -cs 'a-z' '\n' |
> sort |
> uniq |
> comm -23 -- - /usr/share/dict/words
arg
args
ascii
cadr
cddr
cdr
cmp
comm
cond
defn
df
dict
diff
eof
filename
getopt
len
lones
msg
newline
op
putcol
ss
str
substring
tr
uniq
usr
xs

Note that we needed a double-dash to separate the flags from the filenames, since the first filename, which is a single dash, is interpreted as an ill-formed argument; this differs from the standard comm command, which apparently processes its arguments itself instead of calling getopt. If you run that pipeline, be sure that the dictionary and the word list from the pipeline are both sorted in the same order.

Pages: 1 2

3 Responses to “Comm”

  1. My Haskell solution (see http://bonsaicode.wordpress.com/2011/05/10/programming-praxis-comm/ for a version with comments):

    import Control.Monad
    import System.Environment
    import Text.Printf
    import System.IO
    import qualified System.IO.Strict as SIO
    import GHC.IO.Handle
    
    comm :: (Num b, Ord a) => [b] -> [a] -> [a] -> [(a, b)]
    comm flags zs = filter ((`notElem` flags) . snd) . f zs where
        f xs     []     = map (flip (,) 1) xs
        f []     ys     = map (flip (,) 2) ys
        f (x:xs) (y:ys) = case compare x y of 
            LT -> (x,1) : f xs     (y:ys)
            GT -> (y,2) : f (x:xs) ys
            EQ -> (x,3) : f xs     ys
    
    columns :: [(String, Int)] -> IO ()
    columns xs = let width = maximum (map (length . fst) xs) + 2 in
        mapM_ (\(s,c) -> printf "%*s%-*s\n" ((c - 1) * width) "" width s) xs
    
    main :: IO ()
    main = do args <- getArgs
              columns =<< case args of
                  (('-':p:ps):fs) -> go (map (read . return) (p:ps)) fs
                  fs              -> go [] fs
        where go args ~[f1, f2] = liftM2 (comm args) (file f1) (file f2)
              file src = fmap lines $ if src == "-" then newStdIn
                                                    else readFile src
              newStdIn = catch (SIO.hGetContents =<< hDuplicate stdin)
                               (\_ -> return [])
    
  2. Mike said
    # -*- coding: cp1252 -*-
    import argparse
    
    def comparelines(f1,f2):
        line1 = f1.readline()
        line2 = f2.readline()
    
        while line1 or line2:
            if line1 and (line1 < line2 or line2==''):
                line, col = line1, 1
    
            elif line2 and (line2 < line1 or line1==''):
                line, col = line2, 2
    
            else:
                line, col = line2, 3
    
            yield line, col
    
            if col != 1: line2 = f2.readline()
            if col != 2: line1 = f1.readline()
    
    
    parser = argparse.ArgumentParser(
        description="select or reject lines common to two sorted files.",
        epilog="""\
    Reads file1 and file2, which should be ordered in ASCII collating sequence,
    and produces a three column output: lines only in file1; lines only in file2;
    and lines in both files. The filename ‘-’ means the standard input.
    Flags 1, 2, or 3 suppress printing of the corresponding column.  Thus,
        %(prog)s -12 prints only the lines common to the two files;
        %(prog)s -23 prints only lines in the first file but not in the second;
        %(prog)s -123 is a no-op."""
        )
    
    parser.add_argument('-1', dest='col1', action='store_false')
    parser.add_argument('-2', dest='col2', action='store_false')
    parser.add_argument('-3', dest='col3', action='store_false')
    parser.add_argument('file1', type=argparse.FileType('r'))
    parser.add_argument('file2', type=argparse.FileType('r'))
    
    
    args = parser.parse_args()
    column_flag = [None, args.col1, args.col2, args.col3]
    
    for line, col in comparelines(args.file1, args.file2):
        if column_flag[col]:
                print '{}{}'.format('\t'*(col-1), line.rstrip())
    
    
  3. arturasl said

    Solution in pascal: github

Leave a comment