J K Rowling

July 19, 2013

It has been widely reported recently, including in articles in the New York Times and the London Sunday Times, that J K Rowling wrote a book under a pseudonym and that her authorship was discovered by a forensic linguist. Time magazine explains how the discovery was made:

As one part of his work, Juola uses a program to pull out the hundred most frequent words across an author’s vocabulary. This step eliminates rare words, character names and plot points, leaving him with words like of and but, ranked by usage. Those words might seem inconsequential, but they leave an authorial fingerprint on any work.

“Prepositions and articles and similar little function words are actually very individual,” Juola says. “It’s actually very, very hard to change them because they’re so subconscious.”

The Time article gives a link to the program Juola used, but that site gives very little information about how the program works.
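Since the details of Juola's program are not given, the following is only a minimal sketch of the idea in the quotation, not his method. It assumes a text can be summarized by the relative frequencies of its most frequent words, and it compares two such profiles with a Manhattan distance. The names tokenize, profile and distance, and the choice of distance measure, are illustrative assumptions.

```haskell
import Data.Char (isAlpha, toLower)
import Data.List (group, nub, sort, sortBy)
import Data.Maybe (fromMaybe)
import Data.Ord (Down (..), comparing)

-- Split a text into lowercase words, treating anything that is not a
-- letter or an ASCII apostrophe as a separator.
tokenize :: String -> [String]
tokenize = words . map normalize
  where
    normalize c
      | isAlpha c || c == '\'' = toLower c
      | otherwise = ' '

-- The n most frequent words of a text with their relative frequencies,
-- most frequent first; ties are broken alphabetically.
profile :: Int -> String -> [(String, Double)]
profile n text = take n (sortBy (comparing (Down . snd) <> comparing fst) freqs)
  where
    ws = tokenize text
    total = fromIntegral (length ws)
    freqs = [(w, fromIntegral (length g) / total) | g@(w : _) <- group (sort ws)]

-- Manhattan distance between two profiles: 0 for identical profiles,
-- larger for less similar ones. Simplifying assumption: a word that is
-- missing from one profile counts as frequency 0 there.
distance :: [(String, Double)] -> [(String, Double)] -> Double
distance a b = sum [abs (freq w a - freq w b) | w <- vocab]
  where
    vocab = nub (map fst a ++ map fst b)
    freq w = fromMaybe 0 . lookup w
```

With the full text of two books in textA and textB, distance (profile 100 textA) (profile 100 textB) gives a single number to compare across candidate authors; the smallest distance marks the most similar profile under these assumptions.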

Your task is to write a program that compares the similarity of two texts to determine authorship; this task is purposely vague so you can make your own decisions about how to proceed. When you are finished, you are welcome to read or run a suggested solution or to post your own solution or discuss the exercise in the comments below.

Posted by programmingpraxis

Filed in Exercises


6 Responses to “J K Rowling”

Globules said
July 19, 2013 at 12:29 PM
Patrick Juola has a guest post on Language Log describing the approach he took.

Programming Praxis – J K Rowling | Bonsai Code said
July 19, 2013 at 1:29 PM
[…] today’s Programming Praxis exercise, our goal is to write a program to analyse whether two books were […]

Remco Niemeijer said
July 19, 2013 at 1:29 PM

My Haskell solution (see http://bonsaicode.wordpress.com/2013/07/19/programming-praxis-j-k-rowling/ for a version with tests and comments):

import Data.Char
import Data.List
import Data.List.Split              -- from the "split" package
import qualified Data.List.Key as K -- from the "utility-ht" package
import qualified Data.Map as M

data Info = Info { _words :: [String], _sentenceLength :: Float,
                   _paraLength :: Float, _puncPct :: Float }

avg :: (Fractional a, Integral a1) => [a1] -> a
avg xs = fromIntegral (sum xs) / fromIntegral (length xs)

sentenceLength :: String -> Float
sentenceLength = avg . map length . splitOneOf ".!?"

paragraphLength :: String -> Float
paragraphLength = avg . map (length . words . unlines) . splitOn [""] . lines

punctuationPct :: String -> Float
punctuationPct text = fromIntegral (length $ filter isPunctuation text) /
                      fromIntegral (length text) * 100

process :: String -> Info
process text = Info (words . filter (not . isPunctuation) $ map toLower text)
                    (sentenceLength text)
                    (paragraphLength text)
                    (punctuationPct text)

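-- The 100 most frequent n-grams (runs of n consecutive words),
-- most frequent first.
topNgrams :: Int -> [String] -> [[String]]
topNgrams n ws = take 100 . map fst . K.sort (negate . snd) . M.assocs $
                 M.fromListWith (+) . map (flip (,) 1 . take n) $
                 -- drop the last n tails, which hold fewer than n words
                 foldr ($) (tails ws) $ replicate n init

-- Higher means more similar: weighted overlap of the top 3-, 4- and
-- 5-grams, minus penalties for differences in sentence length,
-- paragraph length and punctuation percentage.
similarity :: Info -> Info -> Float
similarity (Info wsA slA plA puA) (Info wsB slB plB puB) =
  1 * fromIntegral (length $ intersect (topNgrams 3 wsA) (topNgrams 3 wsB)) +
  2 * fromIntegral (length $ intersect (topNgrams 4 wsA) (topNgrams 4 wsB)) +
  4 * fromIntegral (length $ intersect (topNgrams 5 wsA) (topNgrams 5 wsB)) -
  abs (slA - slB) - abs (plA - plB) - 10 * abs (puA - puB)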

jpverkamp said
August 5, 2013 at 3:08 AM
It’s interesting; I actually worked out a few techniques similar to this back when I did my undergraduate thesis. I haven’t really worked in that area since then, but I went ahead and coded up some ideas in Racket:
– Authorship attribution: Part 1 (top n word ordering)
– Authorship attribution: Part 2 (stop word frequency, 4-grams)

So far the best has been identifying JK Rowling as #2 among my collection of science fiction and fantasy. Not too bad, but I have a few more ideas for a Part 3 (which I’ll probably post Tuesday-ish?).

If you want the code directly, everything I’ve got thus far is on GitHub: authorship attribution

jpverkamp said
August 6, 2013 at 3:49 PM
And here’s the third and final part:
– Authorship attribution: Part 3 (word length distribution)
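
For readers who want to try the word length idea named above, here is a minimal Haskell sketch. It is not jpverkamp's Racket code; the names wordLengthDist and distDistance, and the use of a Manhattan distance, are assumptions made for illustration.

```haskell
import Data.Char (isAlpha)
import qualified Data.Map.Strict as M

-- Fraction of the words in a text that have each length, counting only
-- letters (every other character is treated as a separator).
wordLengthDist :: String -> M.Map Int Double
wordLengthDist text = M.map (/ total) counts
  where
    keep c = if isAlpha c then c else ' '
    lens = map length (words (map keep text))
    total = fromIntegral (length lens)
    counts = M.fromListWith (+) [(l, 1) | l <- lens]

-- Manhattan distance between two distributions: 0 for identical ones,
-- up to 2 when they share no word lengths at all. A length present in
-- only one map keeps its own (non-negative) fraction, which equals
-- the absolute difference from 0.
distDistance :: M.Map Int Double -> M.Map Int Double -> Double
distDistance a b = sum (M.elems (M.unionWith (\x y -> abs (x - y)) a b))
```

As with any single measure, this one is probably too coarse on its own; it is more useful combined with other features such as word or n-gram frequencies.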
