July 19, 2013
It has been widely reported recently, including articles in the New York Times and the London Sunday Times, that J K Rowling wrote a book under a pseudonym that was discovered by a forensic linguist. Time magazine explains how the discovery was made:
As one part of his work, Juola uses a program to pull out the hundred most frequent words across an author’s vocabulary. This step eliminates rare words, character names and plot points, leaving him with words like of and but, ranked by usage. Those words might seem inconsequential, but they leave an authorial fingerprint on any work.
“Propositions and articles and similar little function words are actually very individual,” Juola says. “It’s actually very, very hard to change them because they’re so subconscious.”
The Time article gives a link to the program Juola used, but that site gives very little information about how the program works.
Your task is to write a program that compares the similarity of two texts to determine authorship; this task is purposely vague so you can make your own decisions about how to proceed. When you are finished, you are welcome to read or run a suggested solution or to post your own solution or discuss the exercise in the comments below.