Word Frequencies
March 10, 2009
A classic Unix pipeline, due to Doug McIlroy, provides one solution; saved as a shell script, it takes the number of words to report as its first argument and reads the text on standard input:
tr -cs A-Za-z '\n' |
tr A-Z a-z |
sort |
uniq -c |
sort -rn |
sed ${1}q
The first tr squeezes every maximal run of non-alphabetic characters to a single newline, leaving one word per line; the second tr folds upper case to lower case; sort brings identical words together; uniq -c counts each run; sort -rn orders the lines by count, largest first; and sed ${1}q quits after printing the first n lines.
For a Scheme solution, our strategy will be to read each word in the file, increment the count of that word in a hash table, then extract all the word/count pairs from the hash table, sort them in descending order by frequency, and print the first n of them. Hash tables are provided by the make-hash and string-hash procedures of the Standard Prelude. Since the problem did not specify the definition of a word, we make our own: a word is a maximal sequence of alphabetic characters, with upper-case and lower-case folded together. The read-word function reads the next word from the current input:
(define (read-word)
  (let loop ((c (read-char)) (cs '()))
    (cond ((eof-object? c)
           (if (null? cs) c
               (list->string (reverse cs))))
          ((char-alphabetic? c)
           (loop (read-char) (cons (char-downcase c) cs)))
          ((pair? cs) (list->string (reverse cs)))
          (else (loop (read-char) cs)))))
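As a quick sanity check, here is a hypothetical transcript; it assumes your Scheme provides with-input-from-string (many do, though it is not in R5RS), and the end-of-file object prints differently from one implementation to another:
> (with-input-from-string "Hello, world!"
    (lambda () (list (read-word) (read-word) (read-word))))
("hello" "world" #<eof>)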
Word-freq reads each word and updates the hash table containing word/frequency pairs; the hash object returned by make-hash is a closure that responds to messages, so (freqs 'update word proc default) applies proc to the word's current count or inserts the default for a new word, and (freqs 'enlist) returns the table's contents as an association list. At the end of the input, word-freq extracts the word/frequency pairs, sorts them in descending order of frequency, and returns the first n pairs; take comes from the Standard Prelude (a rough stand-in for it appears after the code):
(define (word-freq n file-name)
  (define (freq-gt? a b) (> (cdr a) (cdr b)))
  (with-input-from-file file-name
    (lambda ()
      (let ((freqs (make-hash string-hash string=? 0 49999)))
        (do ((word (read-word) (read-word)))
            ((eof-object? word) (take n (sort freq-gt? (freqs 'enlist))))
          (freqs 'update word (lambda (k v) (+ v 1)) 1))))))
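Note that sort here takes the comparison predicate before the list, following the Standard Prelude. If you do not have the Standard Prelude loaded, take simply returns the first n elements of a list; a minimal stand-in (the Prelude's own definition may differ) is:
(define (take n xs)               ; first n elements of xs
  (if (or (<= n 0) (null? xs))
      '()
      (cons (car xs) (take (- n 1) (cdr xs)))))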
Applied to the text of the Holy Bible, which is available from Project Gutenberg, the program produces:
> (word-freq 25 "bible.txt")
(("the" . 62588) ("and" . 30875) ("of" . 30183)
("to" . 23023) ("you" . 14887) ("in" . 13357) ("he" . 10495)
("a" . 10150) ("i" . 9078) ("for" . 8983) ("his" . 8424)
("lord" . 8129) ("your" . 7398) ("with" . 7259)
("that" . 7187) ("is" . 7143) ("they" . 7005) ("not" . 6484)
("him" . 6140) ("will" . 6093) ("them" . 5831) ("be" . 5668)
("who" . 5611) ("from" . 5476) ("it" . 5395))
The program is available at http://programmingpraxis.codepad.org/CJcy6eGw. You can see Donald Knuth’s version of this program, and Doug McIlroy’s commentary on it, in Jon Bentley’s “Programming Pearls” column in the June 1986 edition of Communications of the ACM, which is reprinted in Knuth’s book Literate Programming.
perl -ne 'while (/([a-z]+)/gi) {$words{$1}++} END{ map { print "$_ $words{$_}\n" } sort {$words{$b} <=> $words{$a}} keys %words}'
A straightforward (though probably not very efficient) Haskell solution:
import System.Environment
import Control.Arrow
import Data.List
import Data.Ord
main = do [n, fileName] <- getArgs
          content <- readFile fileName
          mapM_ print $ findMostCommonWords (read n) content

findMostCommonWords :: Int -> String -> [(String, Int)]
findMostCommonWords n = take n . reverse . sortBy (comparing snd) . map (head &&& length) . group . sort . words
Interesting challenge; here's my solution in Python.
http://pastebin.com/f1ac76000
Action<string, int> CountWords = (filename, top) =>
{
    foreach (var kv in
        File.ReadAllText(filename)
            .Split()
            .GroupBy(w => w,
                     (w, c) => new { Word = w, Count = c.Count() })
            .OrderByDescending(a => a.Count)
            .Take(top))
        Console.WriteLine(kv.Word + " – " + kv.Count);
};
http://pastebin.com/fbf06089
Here it is in ruby (commented so I don’t have to do it elsewhere) …
Clojure's standard library has a handy frequencies function.
Hello guys,
Check out my solution in Python. I went a bit further and stripped punctuation characters from the source, which gives more relevant output.
https://github.com/ftt/programming-praxis/blob/master/20090310-word-frequencies/word-frequencies.py
# In Ruby
text = File.read(ARGV[0])
n = ARGV[1].to_i
puts text
  .scan(/\w+/)
  .group_by(&:itself)
  .map { |k, v| [k, v.count] }
  .sort_by { |_, v| -v }
  .take(n)
  .map { |k, v| "#{k}: #{v}" }