Word Frequencies
March 10, 2009
A classic Unix pipeline, due to Doug McIlroy, provides one solution; saved as a shell script, it takes the number of words to report as its first argument and reads the text on standard input:
tr -cs A-Za-z '\n' |
tr A-Z a-z |
sort |
uniq -c |
sort -rn |
sed ${1}q
The first tr squeezes every maximal run of non-alphabetic characters to a single newline, leaving one word per line; the second tr folds upper case to lower case; sort brings identical words together; uniq -c counts each run; sort -rn orders the lines by count, largest first; and sed ${1}q quits after printing the first n lines.
For a Scheme solution, our strategy will be to read each word in the file, increment the count of that word in a hash table, then extract all the word/count pairs from the hash table, sort them in descending order by frequency, and print the first n of them. Hash tables are provided by the make-hash and string-hash procedures of the Standard Prelude. Since the problem did not specify the definition of a word, we make our own: a word is a maximal sequence of alphabetic characters, with upper-case and lower-case folded together. The read-word function reads the next word from the current input:
(define (read-word)
  (let loop ((c (read-char)) (cs '()))
    (cond ((eof-object? c)
           (if (null? cs) c
               (list->string (reverse cs))))
          ((char-alphabetic? c)
           (loop (read-char) (cons (char-downcase c) cs)))
          ((pair? cs) (list->string (reverse cs)))
          (else (loop (read-char) cs)))))
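As a quick sanity check, here is a hypothetical transcript; it assumes your Scheme provides with-input-from-string (many do, though it is not in R5RS), and the end-of-file object prints differently from one implementation to another:
> (with-input-from-string "Hello, world!"
    (lambda () (list (read-word) (read-word) (read-word))))
("hello" "world" #<eof>)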
Word-freq reads each word and updates the hash table containing word/frequency pairs; the hash object returned by make-hash is a closure that responds to messages, so (freqs 'update word proc default) applies proc to the word's current count or inserts the default for a new word, and (freqs 'enlist) returns the table's contents as an association list. At the end of the input, word-freq extracts the word/frequency pairs, sorts them in descending order of frequency, and returns the first n pairs; take comes from the Standard Prelude (a rough stand-in for it appears after the code):
(define (word-freq n file-name)
  (define (freq-gt? a b) (> (cdr a) (cdr b)))
  (with-input-from-file file-name
    (lambda ()
      (let ((freqs (make-hash string-hash string=? 0 49999)))
        (do ((word (read-word) (read-word)))
            ((eof-object? word) (take n (sort freq-gt? (freqs 'enlist))))
          (freqs 'update word (lambda (k v) (+ v 1)) 1))))))
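Note that sort here takes the comparison predicate before the list, following the Standard Prelude. If you do not have the Standard Prelude loaded, take simply returns the first n elements of a list; a minimal stand-in (the Prelude's own definition may differ) is:
(define (take n xs)               ; first n elements of xs
  (if (or (<= n 0) (null? xs))
      '()
      (cons (car xs) (take (- n 1) (cdr xs)))))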
Applied to the text of the Holy Bible, which is available from Project Gutenberg, the program produces:
> (word-freq 25 "bible.txt")
(("the" . 62588) ("and" . 30875) ("of" . 30183)
("to" . 23023) ("you" . 14887) ("in" . 13357) ("he" . 10495)
("a" . 10150) ("i" . 9078) ("for" . 8983) ("his" . 8424)
("lord" . 8129) ("your" . 7398) ("with" . 7259)
("that" . 7187) ("is" . 7143) ("they" . 7005) ("not" . 6484)
("him" . 6140) ("will" . 6093) ("them" . 5831) ("be" . 5668)
("who" . 5611) ("from" . 5476) ("it" . 5395))
The program is available at http://programmingpraxis.codepad.org/CJcy6eGw. You can see Donald Knuth’s version of this program, and Doug McIlroy’s commentary on it, in Jon Bentley’s “Programming Pearls” column in the June 1986 edition of Communications of the ACM, which is reprinted in Knuth’s book Literate Programming.
perl -ne 'while (/([a-z]+)/gi) {$words{$1}++} END{ map { print "$_ $words{$_}\n" } sort {$words{$b} <=> $words{$a}} keys %words}'
A straightforward (though probably not very efficient) Haskell solution:
import System.Environment
import Control.Arrow
import Data.List
import Data.Ord
main = do [n, fileName] <- getArgs
          content <- readFile fileName
          mapM_ print $ findMostCommonWords (read n) content

findMostCommonWords :: Int -> String -> [(String, Int)]
findMostCommonWords n = take n . reverse . sortBy (comparing snd) . map (head &&& length) . group . sort . words
Interesting challenge; here's my solution in Python.
http://pastebin.com/f1ac76000
Action<string, int> CountWords = (filename, top) =>
{
    foreach (var kv in
        File.ReadAllText(filename)
            .Split()
            .GroupBy(w => w,
                     (w, c) => new { Word = w, Count = c.Count() })
            .OrderByDescending(a => a.Count)
            .Take(top))
        Console.WriteLine(kv.Word + " – " + kv.Count);
};
http://pastebin.com/fbf06089
Here it is in ruby (commented so I don’t have to do it elsewhere) …
Clojure's standard library has a handy frequencies function.
Hello guys,
Check out my solution in Python. I went a bit further and stripped punctuation characters from the source, which gives more relevant output.
https://github.com/ftt/programming-praxis/blob/master/20090310-word-frequencies/word-frequencies.py
# In Ruby
text = File.read(ARGV[0])
n = ARGV[1].to_i
puts text
  .scan(/\w+/)
  .group_by(&:itself)
  .map { |k, v| [k, v.count] }
  .sort_by { |_, v| -v }
  .take(n)
  .map { |k, v| "#{k}: #{v}" }