Comm
May 10, 2011
The main function runs through the two files in tandem, comparing the current line of each; because both files are sorted, a line that compares less than the other file's current line cannot appear later in that file. If the current line from the first file is less than the current line from the second file, it appears in the first file but not the second, so it goes to the first column and a new line is read from the first file. Likewise, if the current line from the second file is less than the current line from the first file, it goes to the second column and a new line is read from the second file. When the two current lines are equal, the line appears in both files, so it goes to the third column and new lines are read from both files. End-of-file on either file means that all the remaining lines from the other file go in one of the first two columns, depending on which file is exhausted, and end-of-file on both files stops the iteration. Here’s the code:
(define (comm opts file1 file2)
  (let ((p1 (if (string=? file1 "-") (current-input-port) (open-input-file file1)))
        (p2 (if (string=? file2 "-") (current-input-port) (open-input-file file2))))
    (let loop ((f1 (read-line p1)) (f2 (read-line p2)))
      (cond ((and (eof-object? f1) (eof-object? f2)) ; both files exhausted
              (close-input-port p1) (close-input-port p2))
            ((eof-object? f1) ; only the second file has lines left
              (putcol opts #\2 f2) (loop f1 (read-line p2)))
            ((eof-object? f2) ; only the first file has lines left
              (putcol opts #\1 f1) (loop (read-line p1) f2))
            ((string<? f1 f2) ; line appears only in the first file
              (putcol opts #\1 f1) (loop (read-line p1) f2))
            ((string<? f2 f1) ; line appears only in the second file
              (putcol opts #\2 f2) (loop f1 (read-line p2)))
            (else (putcol opts #\3 f1) (loop (read-line p1) (read-line p2))))))) ; line appears in both files
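The same merge can be tried without files, read-line, or getopt by running it over two sorted lists of strings. This is only a sketch of the idea, not part of the program: comm-lists is a hypothetical helper that returns a list of (column . line) pairs instead of printing anything.

```scheme
;; Hypothetical helper, not part of the original program: classify the
;; elements of two sorted lists of strings the way comm classifies lines,
;; returning (column . line) pairs in output order.
(define (comm-lists xs ys)
  (let loop ((xs xs) (ys ys) (acc '()))
    (cond ((and (null? xs) (null? ys)) (reverse acc))   ; both lists exhausted
          ((null? xs)                                   ; rest of ys is column 2
            (loop xs (cdr ys) (cons (cons 2 (car ys)) acc)))
          ((null? ys)                                   ; rest of xs is column 1
            (loop (cdr xs) ys (cons (cons 1 (car xs)) acc)))
          ((string<? (car xs) (car ys))                 ; only in xs
            (loop (cdr xs) ys (cons (cons 1 (car xs)) acc)))
          ((string<? (car ys) (car xs))                 ; only in ys
            (loop xs (cdr ys) (cons (cons 2 (car ys)) acc)))
          (else                                         ; in both lists
            (loop (cdr xs) (cdr ys) (cons (cons 3 (car xs)) acc))))))

(write (comm-lists '("apple" "banana" "cherry") '("banana" "cherry" "date")))
(newline)
;; writes ((1 . "apple") (3 . "banana") (3 . "cherry") (2 . "date"))
```

Accumulating pairs rather than writing them keeps the comparison logic separate from the column formatting, which makes it easy to check by hand.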
A line of output is potentially written after each comparison. The putcol function receives both the line and its column as parameters and writes the line in the proper column, unless that column is suppressed by the command-line arguments:
(define (putcol opts c str)
  (when (not (member (list c) opts)) ; skip columns suppressed by a flag
    (when (char=? c #\2) (display #\tab)) ; column 2: one leading tab
    (when (char=? c #\3) (display #\tab) (display #\tab)) ; column 3: two leading tabs
    (display str) (newline)))
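A few direct calls show how the options affect the output. The shape of opts here is an assumption inferred from the member test in putcol: each flag is taken to be a one-element list holding the flag character, so -23 would arrive as ((#\2) (#\3)). The getopt from the earlier exercise may return additional information.

```scheme
;; putcol repeated from the text so this example runs on its own.
(define (putcol opts c str)
  (when (not (member (list c) opts))
    (when (char=? c #\2) (display #\tab))
    (when (char=? c #\3) (display #\tab) (display #\tab))
    (display str) (newline)))

;; Assumed shape of the parsed flags for "-23"; not taken from getopt itself.
(define opts-23 '((#\2) (#\3)))

(putcol opts-23 #\1 "only-in-first")   ; written with no leading tab
(putcol opts-23 #\2 "only-in-second")  ; not written: column 2 is suppressed
(putcol opts-23 #\3 "in-both")         ; not written: column 3 is suppressed
(putcol '() #\3 "in-both")             ; no flags: two tabs, then the line
```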
Those two functions are assembled into a complete program callable from the command line at http://programmingpraxis.codepad.org/lE5eVcxf. We used read-line from the Standard Prelude and getopt from a previous exercise.
One use of comm is in this classic Unix spell-checking pipeline, here applied to the text of comm.ss:
$ cat comm.ss |
> tr 'A-Z' 'a-z' |
> tr -cs 'a-z' '\n' |
> sort |
> uniq |
> comm -23 -- - /usr/share/dict/words
arg
args
ascii
cadr
cddr
cdr
cmp
comm
cond
defn
df
dict
diff
eof
filename
getopt
len
lones
msg
newline
op
putcol
ss
str
substring
tr
uniq
usr
xs
Note that we needed a double dash to separate the flags from the filenames; without it, the first filename, which is a single dash, is interpreted as an ill-formed argument. This differs from the standard comm command, which apparently processes its arguments itself instead of calling getopt. If you run that pipeline, be sure that the dictionary and the word list from the pipeline are both sorted in the same order.
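The ordering matters because comm compares lines with string<?, so both inputs have to be sorted in the order string<? uses. The sketch below assumes an implementation that orders characters by code point, as R6RS and R7RS specify, where every uppercase ASCII letter sorts before every lowercase one. The sorted? procedure is a hypothetical helper, not part of the program. On the shell side, running sort with LC_ALL=C is one common way to get a matching byte-order sort.

```scheme
;; Hypothetical helper: true when no element compares less than the one
;; before it under string<?, the comparison comm uses.
(define (sorted? xs)
  (or (null? xs)
      (null? (cdr xs))
      (and (not (string<? (cadr xs) (car xs)))
           (sorted? (cdr xs)))))

;; Under code-point ordering, "Zebra" sorts before "apple".
(display (string<? "Zebra" "apple")) (newline)

;; A case-insensitive dictionary order puts "Zebra" last, which
;; string<? does not accept as sorted.
(display (sorted? '("Zebra" "apple" "banana"))) (newline)
(display (sorted? '("apple" "banana" "Zebra"))) (newline)
```

If a dictionary sorted in a case-insensitive order is fed to comm as is, lines can land in the wrong column even though both files look sorted.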
My Haskell solution: see http://bonsaicode.wordpress.com/2011/05/10/programming-praxis-comm/ for a version with comments.
Solution in Pascal: GitHub