We’ve looked at algorithms to process a stream of data in previous exercises (streaming median, streaming knapsack, reservoir sampling), and we also looked at an algorithm that finds the most frequent items (the “heavy hitters”) in a file. Today’s exercise combines the two to find the most frequent items in a stream; it’s called the “Britney Spears” algorithm because it is the algorithm used to cull the most frequently-seen items in Twitter feeds and other social media, and the pop starlet always seems to be on that list.

There are two similar algorithms used to find the n most frequently-appearing items in a stream. The first, due to Jayadev Misra and David Gries, uses an initially-empty associative array of items and their associated counts and processes each item in the input stream according to the following algorithm:

    if the new item is already in the array
        increment the associated count
    else if there are less than n items in the array
        add the new item to the array with a count of 1
    else for each item in the array
             decrement the associated count and
             remove any items whose count becomes zero

The other algorithm, called the “space-saving” algorithm, is due to Ahmed Metwally, Divyakant Agrawal, and Amr El Abbadi. The first to parts of the algorithm are the same as the Misra-Gries algorithm. But, when a new item is found and there is no space to store it, the item already in the array that has the smallest associated count is replaced by the new item, and its count is incremented; thus, the counts are never reset, only the item is replaced.

With either algorithm, the array can be queried at any time to see the current list of most-frequent items in the array.

Your task is to implement the two heavy hitter algorithms. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.


Pages: 1 2