J K Rowling
July 19, 2013
The task decomposes into two parts: form a ranked list of the hundred (plus ties) most common words in a document, then compare the lists for two documents to assign a similarity score. The first part is straight forward; we use our new hash tables, the for-each-port
operator from our text-processing framework, and an auxiliary function read-word
that gets the next word from the input:
(define (read-word p) ; next maximal sequence of letters from current input
(let loop ((c (read-char p)) (cs (list)))
(cond ((eof-object? c) (if (null? cs) c (list->string (reverse cs))))
((char-alphabetic? c) (loop (read-char) (cons (char-downcase c) cs)))
((pair? cs) (list->string (reverse cs)))
(else (loop (read-char) cs)))))
(define (prep file-name) ; list of 100 most-frequent words in file-name
(define (lt? a b)
(if (= (cdr a) (cdr b))
(string<? (car a) (car b))
(< (cdr b) (cdr a))))
(let ((t (make-hash)))
(with-input-from-file file-name
(lambda ()
(for-each-port read-word
(lambda (word)
(set! t (t 'update word (lambda (k v) (+ v 1)) 1))))))
(let loop ((n 100) (ws (sort lt? (t 'enlist))) (prev 0) (zs (list)))
(cond ((null? ws) (reverse zs))
((= (cdar ws) prev)
(loop (- n 1) (cdr ws) prev
(cons (append (car zs) (list (caar ws))) (cdr zs))))
((<= n 0) (reverse zs))
(else (loop (- n 1) (cdr ws) (cdar ws) (cons (list (caar ws)) zs)))))))
We download five texts from Project Gutenberg: Robinson Crusoe by Daniel Defoe, A Tale of Two Cities by Charles Dickens, Treasure Island by Robert Louis Stevenson, and Adventures of Huckleberry Finn by Mark Twain, plus a mystery text whose authorship we seek to determine. Here is the ranked list for Defoe; note that it has 101 words, since there is a tie at the 100th position:
> (prep "defoe.txt")
(("the") ("i") ("and") ("to") ("of") ("a") ("my") ("was")
("in") ("that") ("it") ("had") ("as") ("for") ("me") ("but")
("with") ("not") ("which") ("he") ("them") ("so") ("this")
("they") ("all") ("or") ("at") ("him") ("be") ("on")
("were") ("by") ("we") ("could") ("upon") ("have") ("would")
("his") ("very") ("no") ("from") ("when") ("one") ("out")
("if") ("some") ("up") ("what") ("two") ("made") ("more")
("there") ("great") ("into") ("their") ("been") ("any")
("might") ("now") ("being" "myself") ("about" "found")
("came") ("should") ("time") ("little") ("much" "shore")
("did") ("first") ("than") ("before") ("boat") ("ship")
("other") ("after" "our") ("go") ("where") ("then") ("make")
("such") ("how") ("these") ("again") ("went") ("an")
("well") ("three") ("us" "way") ("you") ("place") ("down")
("come") ("though") ("is") ("friday") ("began" "island"))
The second part that compares two texts is tougher, and we consider two different algorithms; there are undoubtedly others. One algorithm compares each word in the two texts and sums the distances between them; for instance, if “the” is the first word in one text and the fourth word in the other text, the distance between them is 3. For words that appear in one ranking but not the other, the distance is assumed to be the number of words in the lists, which in our case is 100. Smaller rankings suggest similar texts, larger rankings suggest dissimilar texts:
(define (comp1 w1 w2) ; compare two word lists on number of swaps to make equal
(define (make-assoc ws)
(let loop ((k 1) (ws ws) (zs (list)))
(if (null? ws) zs
(loop (+ k (length (car ws))) (cdr ws)
(append (map (lambda (w) (cons w k)) (car ws)) zs)))))
(define (lookup w ws) (cond ((assoc w ws) => cdr) (else 1000)))
(let ((w1 (make-assoc w1)) (w2 (make-assoc w2)))
(let loop ((w w1) (s 0))
(if (pair? w)
(loop (cdr w) (+ s (min (abs (- (cdar w) (lookup (caar w) w2))) 100)))
(let loop ((w w2) (s s))
(if (null? w) s
(loop (cdr w) (+ s (if (= (lookup (caar w) w1) 1000) 100 0)))))))))
And here are the comparisons of the four known texts to the mystery text:
> (comp1 defoe mystery)
7338
> (comp1 dickens mystery)
6514
> (comp1 stevenson mystery)
5972
> (comp1 twain mystery)
5430
We got the right answer; the mystery text is The Adventures of Tom Sawyer by Mark Twain. But the comparison is hardly persuasive, with only a ten percent difference compared to Robert Louis Stevenson.
Let’s try something else. Instead of counting the distances between like words, we compute the longest common subsequence of the two ranked lists, using the algorithm of a previous exercise and matrices from the Standard Prelude; we handle ties by computing the number of items in two word lists that are equal:
(define (equal xs ys) ; assume xs and ys are sorted
(let loop ((xs xs) (ys ys) (z 0))
(cond ((or (null? xs) (null? ys)) z)
((string<? (car xs) (car ys)) (loop (cdr xs) ys z))
((string<? (car ys) (car xs)) (loop xs (cdr ys) z))
(else (loop (cdr xs) (cdr ys) (+ z 1))))))
(define (comp2 w1 w2) ; compare two word lists on longest common subsequence
(let* ((x-len (length w1)) (y-len (length w2))
(x1 (+ x-len 1)) (y1 (+ y-len 1))
(xv (list->vector w1)) (yv (list->vector w2))
(m (make-matrix x1 y1)))
(for (x 0 x1)
(for (y 0 y1)
(if (or (zero? x) (zero? y))
(matrix-set! m x y 0)
(let ((e (equal (vector-ref xv (- x 1))
(vector-ref yv (- y 1)))))
(if (positive? e)
(matrix-set! m x y (+ e (matrix-ref m (- x 1) (- y 1))))
(matrix-set! m x y (max (matrix-ref m (- x 1) y)
(matrix-ref m x (- y 1)))))))))
(matrix-ref m x-len y-len)))
And here are the comparisons of the four known texts to the mystery text:
> (comp2 defoe mystery)
21
> (comp2 dickens mystery)
26
> (comp2 stevenson mystery)
27
> (comp2 twain mystery)
26
That’s worse than the first method; there is little difference between any of the texts, and the mystery text is wrongly selected. Harrumph!
You can see all the code assembled at http://programmingpraxis.codepad.org/rz44vsq3. Perhaps one of you readers can find a better way to make the comparison.
Patrick Juola has a guest post on Language Log describing the approach he took.
[…] today’s Programming Praxis exercise, our goal is to write a program to analyse whether two books were […]
My Haskell solution (see http://bonsaicode.wordpress.com/2013/07/19/programming-praxis-j-k-rowling/ for a version with tests and comments):
It’s interesting; I actually worked out a few techniques similar to this back when I did my undergraduate thesis. I haven’t really worked in that area since then, but I went ahead and coded up some ideas in Racket:
– Authorship attribution: Part 1 (top n word ordering)
– Authorship attribution: Part 2 (stop word frequency, 4-grams)
So far the best has been identifying JK Rowling as #2 among my collection of science fiction and fantasy. Not too bad, but I have a few more ideas for a Part 3 (which I’ll probably post Tuesday-ish?).
If you want the code directly, everything I’ve got thus far is on GitHub: authorship attribution
And here’s the third and final part:
– Authorship attribution: Part 3 (word length distribution)
Hey! Would you mind if I share your blog with my myspace group?
There’s a lot of people that I think would really appreciate your content. Please let me know. Cheers