Approximate Median

August 21, 2020

[ I offer my apologies to my readers for my recent absence. My employer, a local community college, is struggling with this virus business, revising nearly all of its business practices, and my programmer colleagues and I have been very busy. The Fall semester starts next week (mostly on-line classes, some on-campus classes for science labs and the nursing students), so hopefully things on the virus front will get better soon. But we are also in the middle of changing our main computing system from running on HP-UX on fifteen-year old hardware to Linux on new hardware, and having all kinds of setup problems (all of the people who set up the current system twenty years ago are gone, and no one seems to know how to set up the new system), so maybe not too soon. I hope all is well with all of you. — Phil ]

We have previously studied algorithms for the streaming median and sliding median that calculate the median of a stream of numbers; the streaming median requires storage of all the numbers previously seen, and the sliding median requires storage of the last k numbers in the stream, for some k.

Today’s exercise estimates the median of a stream of numbers while storing only two numbers:

The idea is at each iteration the median inches toward the input signal at a constant rate. The rate depends on what magnitude you estimate the median to be. I use the average as an estimate of the magnitude of the median, to determines the size of each increment of the median. If you need your median accurate to about 1%, use a step-size of 0.01 * the average.

Your task is to write a program that estimates the streaming median according to the given algorithm. When you are finished, you are welcome to read or run the suggested solution, or to post your own solution or discuss the exercise in the comments below.

Pages: 1 2

Loglog

August 4, 2020

There are 19,055 distinct words in the Bible:

$ cat bible.txt | tr -cs A-Za-z ‘
‘ | sort -u | wc -w
19055

It’s easy enough to count the number of distinct items in a set (its “cardinality”) when the set is small, but when the set is large, the intermediate storage required for the distinct items can be overwhelming.

Phillipe Flajolet and various co-authors wrote a series of papers in which they developed methods of estimating the cardinality of a set with only a small amount of auxiliary storage, using randomization; Flajolet’s algorithms can be seen as an improvement on Robert Morris’ counting algorithm that we studied in a previous exercise. We will study Flajolet’s loglog algorithm in today’s exercise and perhaps have a look at his other algorithms in future exercises.

The basic idea is to apply a hash function to each element of the set. The first bit of the hash value will be zero about half the time, the first two bits of the hash value will be zero about a quarter of the time, the first three bits of the hash value will be zero about an eighth of the time, and so on; by looking at the maximum number of leading zero-bits, we can estimate the cardinality of the set. Flajolet extends this algorithm by splitting the counts among 2k buckets and averaging the estimated cardinalities; the bucket is selected randomly by looking at the last k bits of the hash value.

Your task is to implement Flajolet’s loglog algorithm. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

Pages: 1 2