J K Rowling

July 19, 2013

It has been widely reported recently, including in articles in the New York Times and the London Sunday Times, that J K Rowling wrote a book under a pseudonym and that her authorship was uncovered with the help of a forensic linguist. Time magazine explains how the discovery was made:

As one part of his work, Juola uses a program to pull out the hundred most frequent words across an author’s vocabulary. This step eliminates rare words, character names and plot points, leaving him with words like of and but, ranked by usage. Those words might seem inconsequential, but they leave an authorial fingerprint on any work.

“Prepositions and articles and similar little function words are actually very individual,” Juola says. “It’s actually very, very hard to change them because they’re so subconscious.”

The Time article gives a link to the program Juola used, but that site gives very little information about how the program works.

Your task is to write a program that compares the similarity of two texts to determine authorship; this task is purposely vague so you can make your own decisions about how to proceed. When you are finished, you are welcome to read or run a suggested solution or to post your own solution or discuss the exercise in the comments below.
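
As one possible starting point, here is a minimal sketch of the function-word idea from the Time description: count word frequencies, keep the most frequent words of each text, and compare the two frequency profiles. The names `tokenize`, `frequencies`, `topWords` and `similarity` are hypothetical, and cosine similarity is an assumption chosen for illustration; nothing here is taken from Juola's program, whose details the linked site does not give.

```haskell
import Data.Char (isAlpha, toLower)
import Data.List (sortOn)
import Data.Ord (Down (..))
import qualified Data.Map.Strict as M
import qualified Data.Set as S

-- Lower-case the text and split it into words. Apostrophes are kept
-- so that "don't" stays one word; everything else non-alphabetic
-- is treated as a separator.
tokenize :: String -> [String]
tokenize = words . map normalize
  where
    normalize c
      | isAlpha c || c == '\'' = toLower c
      | otherwise              = ' '

-- Relative frequency of every word in a text.
frequencies :: String -> M.Map String Double
frequencies text = M.map (/ total) counts
  where
    ws     = tokenize text
    counts = M.fromListWith (+) [(w, 1) | w <- ws]
    total  = fromIntegral (length ws)

-- The n most frequent words, most frequent first. M.toList is sorted
-- by word and sortOn is stable, so ties come out alphabetically.
topWords :: Int -> M.Map String Double -> [String]
topWords n = take n . map fst . sortOn (Down . snd) . M.toList

-- Cosine similarity of two texts, measured over the union of their
-- n most frequent words (an assumed metric, not Juola's).
-- Close to 1 means very similar usage; 0 means nothing in common.
similarity :: Int -> String -> String -> Double
similarity n a b
  | normA == 0 || normB == 0 = 0
  | otherwise                = dot / (normA * normB)
  where
    fa    = frequencies a
    fb    = frequencies b
    vocab = S.toList (S.fromList (topWords n fa ++ topWords n fb))
    va    = [M.findWithDefault 0 w fa | w <- vocab]
    vb    = [M.findWithDefault 0 w fb | w <- vocab]
    dot   = sum (zipWith (*) va vb)
    normA = sqrt (sum (map (\x -> x * x) va))
    normB = sqrt (sum (map (\x -> x * x) vb))
```

With two books read in via `readFile`, `similarity 100 bookA bookB` gives a single score per pair. A score only means something relative to the scores of other candidate authors, so a real comparison would rank several authors' texts against the disputed one.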



6 Responses to “J K Rowling”

  1. Globules said

    Patrick Juola has a guest post on Language Log describing the approach he took.

  2. […] today’s Programming Praxis exercise, our goal is to write a program to analyse whether two books were […]

  3. My Haskell solution (see http://bonsaicode.wordpress.com/2013/07/19/programming-praxis-j-k-rowling/ for a version with tests and comments):

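    import Data.Char
    import Data.List
    import Data.List.Split
    import qualified Data.List.Key as K
    import qualified Data.Map as M
    -- Stylistic features extracted from one text.
    data Info = Info { _words :: [String], _sentenceLength :: Float,
                       _paraLength :: Float, _puncPct :: Float }
    -- Arithmetic mean of a list of integral values.
    avg :: (Fractional a, Integral a1) => [a1] -> a
    avg xs = fromIntegral (sum xs) / fromIntegral (length xs)
    -- Average sentence length in characters; sentences end at '.', '!' or '?'.
    sentenceLength :: String -> Float
    sentenceLength = avg . map length . splitOneOf ".!?"
    -- Average paragraph length in words; paragraphs are separated by blank lines.
    paragraphLength :: String -> Float
    paragraphLength = avg . map (length . words . unlines) . splitOn [""] . lines
    -- Punctuation characters as a percentage of all characters.
    punctuationPct :: String -> Float
    punctuationPct text = fromIntegral (length $ filter isPunctuation text) /
                          fromIntegral (length text) * 100
    -- Collect the lower-cased, punctuation-free words and the three measures above.
    process :: String -> Info
    process text = Info (words . filter (not . isPunctuation) $ map toLower text)
                        (sentenceLength text)
                        (paragraphLength text)
                        (punctuationPct text)
    -- The 100 most frequent n-grams (runs of n consecutive words), most
    -- frequent first. Applying init n times drops the last n suffixes,
    -- which hold fewer than n words.
    topNgrams :: Int -> [String] -> [[String]]
    topNgrams n ws = take 100 . map fst . K.sort (negate . snd) . M.assocs $
                     M.fromListWith (+) . map (flip (,) 1 . take n) $
                     foldr ($) (tails ws) $ replicate n init
    -- Higher is more similar: shared top 3-, 4- and 5-grams add to the score
    -- (weights 1, 2 and 4); differences in sentence length, paragraph length
    -- and punctuation percentage (weight 10) subtract from it.
    similarity :: Info -> Info -> Float
    similarity (Info wsA slA plA puA) (Info wsB slB plB puB) =
      1 * fromIntegral (length $ intersect (topNgrams 3 wsA) (topNgrams 3 wsB)) +
      2 * fromIntegral (length $ intersect (topNgrams 4 wsA) (topNgrams 4 wsB)) +
      4 * fromIntegral (length $ intersect (topNgrams 5 wsA) (topNgrams 5 wsB)) -
      abs (slA - slB) - abs (plA - plB) - 10 * abs (puA - puB)

  4. jpverkamp said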

    It’s interesting; I actually worked out a few techniques similar to this back when I did my undergraduate thesis. I haven’t really worked in that area since then, but I went ahead and coded up some ideas in Racket:
    Authorship attribution: Part 1 (top n word ordering)
    Authorship attribution: Part 2 (stop word frequency, 4-grams)

    So far the best has been identifying JK Rowling as #2 among my collection of science fiction and fantasy. Not too bad, but I have a few more ideas for a Part 3 (which I’ll probably post Tuesday-ish?).

    If you want the code directly, everything I’ve got thus far is on GitHub: authorship attribution

  5. jpverkamp said

    And here’s the third and final part:
    Authorship attribution: Part 3 (word length distribution)

  6. Hey! Would you mind if I share your blog with my myspace group?

    There’s a lot of people that I think would really appreciate your content. Please let me know. Cheers
