Heavy Hitters (The Britney Spears Algorithm)

June 2, 2015

We’ve looked at algorithms to process a stream of data in previous exercises (streaming median, streaming knapsack, reservoir sampling), and we also looked at an algorithm that finds the most frequent items (the “heavy hitters”) in a file. Today’s exercise combines the two to find the most frequent items in a stream; it’s called the “Britney Spears” algorithm because it is the algorithm used to cull the most frequently-seen items in Twitter feeds and other social media, and the pop starlet always seems to be on that list.

There are two similar algorithms used to find the n most frequently-appearing items in a stream. The first, due to Jayadev Misra and David Gries, uses an initially-empty associative array of items and their associated counts and processes each item in the input stream according to the following algorithm:

    if the new item is already in the array
        increment the associated count
    else if there are less than n items in the array
        add the new item to the array with a count of 1
    else for each item in the array
             decrement the associated count and
             remove any items whose count becomes zero
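
For concreteness, the Misra-Gries loop might be sketched in Python with a Counter as the associative array (a minimal sketch under those assumptions, not the suggested solution):

    from collections import Counter

    def misra_gries(stream, n):
        counts = Counter()
        for item in stream:
            if item in counts:
                counts[item] += 1            # already tracked: bump its count
            elif len(counts) < n:
                counts[item] = 1             # room available: start tracking it
            else:
                for tracked in list(counts): # full: decrement every tracked item
                    counts[tracked] -= 1
                    if counts[tracked] == 0:
                        del counts[tracked]
        return counts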
    

The other algorithm, called the “space-saving” algorithm, is due to Ahmed Metwally, Divyakant Agrawal, and Amr El Abbadi. The first two parts of the algorithm are the same as in the Misra-Gries algorithm. But when a new item arrives and there is no space to store it, the item already in the array with the smallest associated count is replaced by the new item, which takes over that count plus one; thus, counts are never reset, only items are replaced.
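
Under the same sketch, only the full-array branch changes; a linear scan finds the minimum here, though a cleverer structure could avoid the scan:

    from collections import Counter

    def space_saving(stream, n):
        counts = Counter()
        for item in stream:
            if item in counts:
                counts[item] += 1
            elif len(counts) < n:
                counts[item] = 1
            else:
                # evict the item with the smallest count; the newcomer takes
                # over that count plus one, so counts are never reset
                smallest = min(counts, key=counts.get)
                counts[item] = counts.pop(smallest) + 1
        return counts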

With either algorithm, the array can be queried at any time for the current list of most frequent items.

Your task is to implement the two heavy hitter algorithms. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.


5 Responses to “Heavy Hitters (The Britney Spears Algorithm)”

  1. Paul said

    In Python3.

    from collections import Counter
    from string import punctuation
    from pprint import pprint

    # delete all punctuation characters from each word
    TABLE = str.maketrans("", "", punctuation)
    
    def misra_gries(stream, n):
        counts = Counter()
        for e in stream:
            if e in counts or len(counts) < n:
                counts[e] += 1
            else:
                for c in list(counts.keys()):
                    counts[c] -= 1
                    if counts[c] == 0:
                        del counts[c]
        return counts.most_common(n)
    
    def metwally_agrawal_abbadi(stream, n):
        counts = Counter()
        for e in stream:
            if e in counts or len(counts) < n:
                counts[e] += 1
            else:
                k, v = counts.most_common(n)[-1]
                del counts[k]
                counts[e] = v + 1
        return counts.most_common(n)
    
    def wordgen(fname):
        with open(fname) as f:
            for line in f:
                words = line.lower().split()
                for w in words:
                    yield w.translate(TABLE)
    
    def wordcounter(stream, n):
        return Counter(stream).most_common(n)
    
    N = 100
    fname = "huckleberry_finn.txt"
    pprint(misra_gries(wordgen(fname), N))
    pprint(metwally_agrawal_abbadi(wordgen(fname), N))
    pprint(wordcounter(wordgen(fname), N))
    
  2. Jussi Piitulainen said

    Also Python Counter. I grabbed some text from a web page I happened to have open and tested with random character frequencies: choose a character from the text 1000 times, track 6 counts.

    from collections import Counter
    from random import choice
    
    text = ' '.join('''
        if the new item is already in the array
            increment the associated count
        else if there are less than n items in the array
            add the new item to the array with a count of 1
        else for each item in the array
                 decrement the associated count and
                 remove any items whose count becomes zero'''.split())
    
    def first(n, stream):
        def dec():
            for item in bag: bag[item] -= 1
            return + bag    # unary plus drops items whose count reached zero

        bag = Counter()
        for item in stream:
            if item in bag:     bag[item] += 1
            elif len(bag) < n:  bag[item] = 1
            else:               bag = dec()

        return set(bag)
    
    def other(n, stream):
        def rep(new):
            old, _ = bag.most_common()[-1]
            bag[new] = bag[old] + 1
            del bag[old]
    
        bag = Counter()
        for item in stream:
            if item in bag:     bag[item] += 1
            elif len(bag) < n:  bag[item] = 1
            else:               rep(item)
    
        return set(bag)
    
    for k in '---':
        print(other(6, (choice(text) for r in range(1000))),
              first(6, (choice(text) for r in range(1000))))
    print()
    print(Counter(text).most_common(10))
    

    The frequency distribution goes rather flat after the two greatest hits, so there is variation, but those two, space and ‘e’, are in every sample. (Results from the first algorithm are on the right, because its result size varies while the other’s does not.)

    {'c', 's', ' ', 'a', 'e', 'r'} {'e', 'o', ' ', 'n', 'i'}
    {' ', 'd', 'o', 'i', 'e', 'r'} {'s', ' ', 'u', 'y', 'e', 't'}
    {'c', 's', ' ', 'i', 'e', 't'} {'e', 't', 'r', ' '}
    
    [(' ', 54), ('e', 38), ('t', 25), ('a', 21), ('r', 16), ('n', 16), ('i', 15), ('s', 13), ('h', 13), ('o', 13)]
    
  3. Mike said

    Here’s a stab at it using Julia (just because).

    function bscounter(iterable, size::Int, adjust::Function)
        counter = Dict{eltype(iterable), Int}()
        for ch in iterable
            if haskey(counter, ch) || length(counter) < size
                counter[ch] = get(counter, ch, 0) + 1
            else
                adjust(counter, ch)
            end
        end
        return counter
    end
    
    function mg_rule(counter, elem)
        for k in collect(keys(counter))   # snapshot keys before mutating
            counter[k] -= 1
            if counter[k] == 0
                delete!(counter, k)
            end
        end
    end
    
    function maa_rule(counter, elem)
        # find the tracked item with the smallest count (not the smallest key)
        mink, minv = first(counter)
        for (k, v) in counter
            if v < minv
                mink = k
                minv = v
            end
        end
        delete!(counter, mink)
        counter[elem] = minv + 1
    end
    
    ### test
    text = "7675767476757636564525434132"   # There are seven 7's, six 6's, five 5's, ...
    sz = 5
    
    println("Using MG-rule: ", bscounter(text, sz, mg_rule))
    println("Using MAA-rule: ", bscounter(text, sz, maa_rule))
    
    #expected output with the fixes above (Dict order may vary)
    Using MG-rule: ['6'=>3,'7'=>4,'5'=>2,'4'=>1]
    Using MAA-rule: ['7'=>7,'6'=>6,'5'=>5,'3'=>5,'2'=>5]
    

    It looks like neither the MG rule nor the MAA rule is certain to provide exact counts: MG undercounts, since every decrement pass subtracts from all tracked items, while MAA can overcount items that inherit an evicted item’s tally.

  4. John Cowan said

    It’s easy to see that the associative array in question is functioning as a bounded multiset or bag, one that can contain only a given number of unique items. We can abstract the algorithms in these terms by saying that both algorithms start with:

    If the item is not in the bag and the bag is not full, put the item in the bag.

    Then the MG algorithm says:

    Otherwise, remove one instance of every unique item in the bag.

    The MAA algorithm, on the other hand, says:

    Otherwise, remove all instances of the least common item in the bag and add the item to the bag.

    The MG algorithm seems peculiar in that the new item is not in fact added to the bag, and indeed there is no guarantee that the next new item will be added either, since the bag may still be full; on the other hand, given an input where all items are distinct, a bag with maximum size n will be emptied completely every n items. The MAA algorithm always does add the new item to the bag, but is underspecified: it doesn’t say what to do if there is more than one “least common item”.
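
    In Python, this framing might look like the following sketch (the driver and rule names here are illustrative, not taken from either paper):

    from collections import Counter

    def bounded_bag(stream, n, when_full):
        # both algorithms share this loop; they differ only in the rule
        # applied when the bag is full
        bag = Counter()
        for item in stream:
            if item in bag:
                bag[item] += 1
            elif len(bag) < n:
                bag[item] = 1
            else:
                when_full(bag, item)
        return bag

    def mg_rule(bag, item):
        # remove one instance of every unique item; the newcomer is dropped
        for tracked in list(bag):
            bag[tracked] -= 1
            if bag[tracked] == 0:
                del bag[tracked]

    def maa_rule(bag, item):
        # evict all instances of an arbitrary least common item, then add
        # the newcomer with the evicted count plus one
        least = min(bag, key=bag.get)
        bag[item] = bag.pop(least) + 1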

  5. r. clayton said

    A solution in (Guile) Scheme.
