Heavy Hitters (The Britney Spears Algorithm)
June 2, 2015
We begin by reprising the algorithm that counts the words in a file from a previous exercise; functions hashtable-pairs and read-word are given with the solution at http://ideone.com/BPHZqT:
(define (word-freq n file-name)
  (define (freq-gt? a b) (> (cdr a) (cdr b)))
  (with-input-from-file file-name (lambda ()
    (let ((freqs (make-hashtable string-hash string=?)))
      (do ((word (read-word) (read-word)))
          ((eof-object? word) (map car (take n
            (sort freq-gt? (hashtable-pairs freqs)))))
        (hashtable-update! freqs word add1 0))))))
> (word-freq 25 "bible.txt")
("the" "and" "of" "to" "that" "in" "he" "shall" "unto" "for"
"i" "his" "a" "lord" "they" "be" "is" "him" "not" "them"
"it" "with" "all" "thou" "thy")
We combine the first two clauses of the Misra-Gries algorithm, using the hashtable-update! function with a default, to increase the count of an item already in the table or to insert a new item in a partially empty table:
(define (misra-gries n file-name)
  (define (freq-gt? a b) (> (cdr a) (cdr b)))
  (let ((keys (make-hashtable string-hash string=?)))
    (with-input-from-file file-name (lambda ()
      (do ((word (read-word) (read-word)))
          ((eof-object? word) (map car (take n
            (sort freq-gt? (hashtable-pairs keys)))))
        (if (or (hashtable-contains? keys word)
                (< (hashtable-size keys) n))
            (hashtable-update! keys word add1 0)
            (vector-for-each
              (lambda (word)
                (hashtable-update! keys word sub1 1)
                (when (zero? (hashtable-ref keys word 0))
                  (hashtable-delete! keys word)))
              (hashtable-keys keys))))))))
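To see what the decrement step does to the counts, here is a small illustration in Python; the toy stream and the function name are mine, not part of the original solution. The counts that survive are underestimates of the true frequencies.

from collections import Counter

def misra_gries_counts(stream, n):
    counts = Counter()
    for e in stream:
        if e in counts or len(counts) < n:
            counts[e] += 1          # known item, or room left in the table
        else:
            for k in list(counts):  # table full: decrement every counter
                counts[k] -= 1
                if counts[k] == 0:
                    del counts[k]   # drop counters that reach zero
    return counts

print(misra_gries_counts("aabcaba", 2))
# Counter({'a': 3, 'b': 1}) -- the true count of 'a' is 4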
Running the Misra-Gries algorithm with only n items in the array doesn’t work, since there is constant churn of new items at the back end of the array; with n counters, Misra-Gries only guarantees to capture items that occur more than N/(n+1) times in a stream of N words. With 100 items in the array the result is much better, and with 200 items in the array the result exactly matches the actual frequencies:
> (misra-gries 25 "bible.txt")
("the" "and" "of" "to" "how" "new" "ebooks" "gutenberg" "our"
"project" "subscribe" "literary" "donations" "web" "site"
"archive" "including" "make" "newsletter" "hear" "email"
"about" "foundation" "produce" "help")
> (take 25 (misra-gries 100 "bible.txt"))
("the" "and" "of" "to" "that" "in" "he" "shall" "for" "unto"
"i" "his" "a" "lord" "is" "they" "be" "not" "him" "them"
"god" "but" "which" "you" "ye")
> (take 25 (misra-gries 200 "bible.txt"))
("the" "and" "of" "to" "that" "in" "he" "shall" "unto" "for"
"i" "his" "a" "lord" "they" "be" "is" "him" "not" "them"
"it" "with" "all" "thou" "thy")
The space-saving algorithm is slower than the other two because of the cost of recomputing the minimum at each step; a better data structure would be useful:
(define (space-saving n file-name)
  (define (freq-gt? a b) (> (cdr a) (cdr b)))
  (let ((keys (make-hashtable string-hash string=?)))
    (with-input-from-file file-name (lambda ()
      (do ((word (read-word) (read-word)))
          ((eof-object? word) (map car (take n
            (sort freq-gt? (hashtable-pairs keys)))))
        (if (or (hashtable-contains? keys word)
                (< (hashtable-size keys) n))
            (hashtable-update! keys word add1 0)
            (let ((kv-pairs (hashtable-pairs keys)))
              (let loop ((min-k (caar kv-pairs))
                         (min-v (cdar kv-pairs))
                         (kv-pairs (cdr kv-pairs)))
                (if (pair? kv-pairs)
                    (if (< (cdar kv-pairs) min-v)
                        (loop (caar kv-pairs)
                              (cdar kv-pairs)
                              (cdr kv-pairs))
                        (loop min-k min-v (cdr kv-pairs)))
                    (begin
                      (hashtable-delete! keys min-k)
                      (hashtable-update! keys word
                        add1 min-v)))))))))))
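One candidate for that better data structure is a lazy min-heap kept alongside the table. The sketch below is an assumption of mine, not the original solution: every increment pushes a fresh (count, item) entry, entries that no longer match the live counts are discarded when they surface at the top, and an occasional compaction keeps the heap from growing without bound.

import heapq
from collections import Counter

def space_saving_heap(stream, n):
    counts = {}   # item -> current count
    heap = []     # (count, item) entries; superseded entries go stale
    for e in stream:
        if e in counts:
            counts[e] += 1
            heapq.heappush(heap, (counts[e], e))
        elif len(counts) < n:
            counts[e] = 1
            heapq.heappush(heap, (1, e))
        else:
            # skip stale entries until the top is a live (count, item) pair
            while heap[0][1] not in counts or heap[0][0] != counts[heap[0][1]]:
                heapq.heappop(heap)
            min_v, min_k = heapq.heappop(heap)
            del counts[min_k]                # evict a least common item
            counts[e] = min_v + 1            # new item inherits count plus one
            heapq.heappush(heap, (min_v + 1, e))
        if len(heap) > 4 * n:                # compact to bound the heap's size
            heap = [(v, k) for k, v in counts.items()]
            heapq.heapify(heap)
    return Counter(counts).most_common(n)

print(space_saving_heap("abracadabra", 3))
# [('a', 5), ('b', 3), ('r', 3)]

This finds the minimum in amortized logarithmic rather than linear time; the stream-summary structure of the original space-saving paper (a linked list of count buckets) would go further and make the eviction constant-time.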
The space-saving algorithm suffers the same problem as Misra-Gries with a small array. With 25 items in the array, only the first four items are correct. With 100 items in the array, it gives the same result as Misra-Gries, except that “not” and “him” are reversed. With 200 items in the array, it is perfect:
> (space-saving 25 "bible.txt")
("the" "and" "of" "to" "new" "ebooks" "how" "gutenberg"
"project" "subscribe" "newsletter" "hear" "email" "about"
"our" "site" "archive" "including" "make" "this"
"information" "foundation" "produce" "help" "tm")
> (take 25 (space-saving 100 "bible.txt"))
("the" "and" "of" "to" "that" "in" "he" "shall" "for" "unto"
"i" "his" "a" "lord" "is" "they" "be" "him" "not" "them"
"god" "but" "which" "you" "ye")
> (take 25 (space-saving 200 "bible.txt"))
("the" "and" "of" "to" "that" "in" "he" "shall" "unto" "for"
"i" "his" "a" "lord" "they" "be" "is" "him" "not" "them"
"it" "with" "all" "thou" "thy")
In practice, of course, both algorithms can be run against a continuous stream of data, and queried at any time. The take function appears in the Standard Prelude. You can see all three functions assembled at http://ideone.com/BPHZqT.
In Python 3:
from collections import Counter
from string import ascii_lowercase, punctuation
from pprint import pprint

TABLE = str.maketrans(ascii_lowercase, ascii_lowercase, ".,!?#)(*&^%$<>~")

def misra_gries(stream, n):
    counts = Counter()
    for e in stream:
        if e in counts or len(counts) < n:
            counts[e] += 1
        else:
            for c in list(counts.keys()):
                counts[c] -= 1
                if counts[c] == 0:
                    del counts[c]
    return counts.most_common(n)

def metwally_agrawal_abaddi(stream, n):
    counts = Counter()
    for e in stream:
        if e in counts or len(counts) < n:
            counts[e] += 1
        else:
            k, v = counts.most_common(n)[-1]
            del counts[k]
            counts[e] = v + 1
    return counts.most_common(n)

def wordgen(fname):
    with open(fname) as f:
        for line in f:
            words = line.lower().split()
            for w in words:
                yield w.translate(TABLE)

def wordcounter(stream, n):
    return Counter(stream).most_common(n)

N = 100
fname = "huckleberry_finn.txt"
pprint(misra_gries(wordgen(fname), N))
pprint(metwally_agrawal_abaddi(wordgen(fname), N))
pprint(wordcounter(wordgen(fname), N))

Also Python Counter. I grabbed some text from a web page I happened to have open and tested with random character frequencies: choose a character from the text 1000 times, track 6 counts.
from collections import Counter
from random import choice

text = ' '.join('''
    if the new item is already in the array increment the associated
    count else if there are less than n items in the array add the
    new item to the array with a count of 1 else for each item in the
    array decrement the associated count and remove any items whose
    count becomes zero'''.split())

def first(n, stream):
    def dec():
        for item in bag:
            bag[item] -= 1
        return +bag
    bag = Counter()
    for item in stream:
        if item in bag:
            bag[item] += 1
        elif len(bag) < n:
            bag[item] = 1
        else:
            bag = dec()
    return set(bag)

def other(n, stream):
    def rep(new):
        old, _ = bag.most_common()[-1]
        bag[new] = bag[old] + 1
        del bag[old]
    bag = Counter()
    for item in stream:
        if item in bag:
            bag[item] += 1
        elif len(bag) < n:
            bag[item] = 1
        else:
            rep(item)
    return set(bag)

for k in '---':
    print(other(6, (choice(text) for r in range(1000))),
          first(6, (choice(text) for r in range(1000))))
print()
print(Counter(text).most_common(10))

The frequency distribution goes rather flat after the two greatest hits, so there is variation, but those two, space and ‘e’, are in every sample. (The first algorithm’s results are on the right; you can tell because their size varies while the other’s does not.)
{'c', 's', ' ', 'a', 'e', 'r'} {'e', 'o', ' ', 'n', 'i'}
{' ', 'd', 'o', 'i', 'e', 'r'} {'s', ' ', 'u', 'y', 'e', 't'}
{'c', 's', ' ', 'i', 'e', 't'} {'e', 't', 'r', ' '}

[(' ', 54), ('e', 38), ('t', 25), ('a', 21), ('r', 16), ('n', 16),
 ('i', 15), ('s', 13), ('h', 13), ('o', 13)]

Here’s a stab at it using Julia (just because).
function bscounter(iterable, size::Int, adjust::Function)
    counter = Dict{eltype(iterable), Int}()
    for ch in iterable
        if haskey(counter, ch) || length(counter) < size
            counter[ch] = get(counter, ch, 0) + 1
        else
            adjust(counter, ch)
        end
    end
    return counter
end

function mg_rule(counter, elem)
    for k in keys(counter)
        counter[k] -= 1
        if counter[k] == 0
            delete!(counter, k)
        end
    end
end

function maa_rule(counter, elem)
    mink, minv = first(counter)
    for kv in counter
        k, v = kv
        # NB: this compares keys, not counts (v < minv was probably
        # intended), which helps explain the odd MAA result below
        if k < mink
            mink = k
            minv = v
        end
    end
    delete!(counter, mink)
    counter[elem] = minv + 1
end

### test
text = "7675767476757636564525434132"
# There are seven 7's, six 6's, five 5's, ...
sz = 5
println("Using MG-rule:  ", bscounter(text, sz, mg_rule))
println("Using MAA-rule: ", bscounter(text, sz, maa_rule))

# prints:
#   Using MG-rule:  ['6'=>3,'7'=>4,'5'=>2,'4'=>1]
#   Using MAA-rule: ['6'=>6,'7'=>7,'2'=>6,'5'=>5,'4'=>4]

It looks like neither the MG-rule nor the MAA-rule is certain to provide the correct results.
It’s easy to see that the associative array in question is functioning as a bounded multiset or bag, one that can contain only a given number of unique items. We can abstract the algorithms in these terms by saying that both algorithms start with:
If the item is not in the bag and the bag is not full, put the item in the bag.
Then the MG algorithm says:
Otherwise, remove one instance of every unique item in the bag.
The MAA algorithm, on the other hand, says:
Otherwise, remove all instances of the least common item in the bag and add the item to the bag.
The MG algorithm seems peculiar in that the new item is not in fact added to the bag, and there is no guarantee that the next new item will be added either, since the bag may still be full; on the other hand, given an input where all items are distinct, a bag with maximum size n will be emptied completely every n items. The MAA algorithm always adds the new item to the bag, but is underspecified: it doesn’t say what to do if there is more than one “least common item”.
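Here is that bounded-bag framing as a rough Python sketch; the names and the tie-breaking choice are mine, not dictated by either algorithm. Both rules share the insertion clause, and min resolves the “least common item” ambiguity by arbitrarily taking the first minimum encountered.

from collections import Counter

def bounded_bag(stream, n, when_full):
    bag = Counter()
    for item in stream:
        if item in bag or len(bag) < n:
            bag[item] += 1           # the shared first rule
        else:
            when_full(bag, item)     # the rule that distinguishes MG from MAA
    return bag

def mg_rule(bag, item):
    # MG: remove one instance of every unique item; the new item is dropped
    for k in list(bag):
        bag[k] -= 1
        if bag[k] == 0:
            del bag[k]

def maa_rule(bag, item):
    # MAA: evict one least common item (ties broken arbitrarily by min)
    # and the new item inherits its count plus one
    least = min(bag, key=bag.get)
    bag[item] = bag.pop(least) + 1

print(bounded_bag("abracadabra", 3, mg_rule))
# Counter({'a': 3}) -- an underestimate; 'a' actually occurs 5 times
print(bounded_bag("abracadabra", 3, maa_rule))
# Counter({'a': 5, 'b': 3, 'r': 3}) -- overestimates; 'b' and 'r' occur twice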
A solution in (Guile) Scheme.