J K Rowling
July 19, 2013
It has been widely reported recently, including in articles in the New York Times and the London Sunday Times, that J K Rowling wrote a book under a pseudonym and that her authorship was discovered by a forensic linguist. Time magazine explains how the discovery was made:
As one part of his work, Juola uses a program to pull out the hundred most frequent words across an author’s vocabulary. This step eliminates rare words, character names and plot points, leaving him with words like of and but, ranked by usage. Those words might seem inconsequential, but they leave an authorial fingerprint on any work.
“Prepositions and articles and similar little function words are actually very individual,” Juola says. “It’s actually very, very hard to change them because they’re so subconscious.”
The Time article gives a link to the program Juola used, but that site gives very little information about how the program works.
Your task is to write a program that compares the similarity of two texts to determine authorship; this task is purposely vague so you can make your own decisions about how to proceed. When you are finished, you are welcome to read or run a suggested solution or to post your own solution or discuss the exercise in the comments below.
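As a starting point, here is a minimal Python sketch of one possible approach, loosely following the Time description: it ranks the most frequent words and compares their relative frequencies. It is not Juola's program. The function names, the pooled vocabulary, and the choice of cosine similarity are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Count lowercase words; an inner apostrophe keeps "don't" as one word."""
    return Counter(re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower()))


def profile(counts, vocabulary):
    """Relative frequency of each vocabulary word within one text."""
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in vocabulary]


def similarity(text_a, text_b, n=100):
    """Cosine similarity over the n most frequent words of both texts combined.

    Assumption: pooling the two texts to pick the vocabulary, and using
    cosine similarity as the measure, are illustrative choices only.
    """
    counts_a, counts_b = word_counts(text_a), word_counts(text_b)
    # Pool the counts so neither text alone decides which words are compared.
    vocabulary = [w for w, _ in (counts_a + counts_b).most_common(n)]
    vec_a = profile(counts_a, vocabulary)
    vec_b = profile(counts_b, vocabulary)
    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    norm = math.sqrt(sum(x * x for x in vec_a)) * math.sqrt(sum(y * y for y in vec_b))
    return dot / norm if norm else 0.0
```

A score near 1 means the two texts use their common words in similar proportions. A single score says little on its own; it is more informative to compare the disputed text against samples from several candidate authors and see which one ranks highest.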
Patrick Juola has a guest post on Language Log describing the approach he took.
My Haskell solution (see http://bonsaicode.wordpress.com/2013/07/19/programming-praxis-j-k-rowling/ for a version with tests and comments):
It’s interesting; I actually worked out a few techniques similar to this back when I did my undergraduate thesis. I haven’t really worked in that area since then, but I went ahead and coded up some ideas in Racket:
– Authorship attribution: Part 1 (top n word ordering)
– Authorship attribution: Part 2 (stop word frequency, 4-grams)
So far the best has been identifying JK Rowling as #2 among my collection of science fiction and fantasy. Not too bad, but I have a few more ideas for a Part 3 (which I’ll probably post Tuesday-ish?).
If you want the code directly, everything I’ve got thus far is on GitHub: authorship attribution
And here’s the third and final part:
– Authorship attribution: Part 3 (word length distribution)