## Loglog

### August 4, 2020

The basic algorithm is in the `count` function:

```
(define (count k)
  (let* ((2^k (expt 2 k)) (counts (make-vector 2^k -1)))
    (do ((word (read-word) (read-word))) ; read words until end of input
        ((eof-object? word) (vector->list counts))
      (let* ((hash (pearson word))
             (bucket (modulo hash 2^k))
             (last-1bit (zerobits (quotient hash 2^k))))
        (vector-set! counts bucket
          (max (vector-ref counts bucket)
               (min last-1bit (- 32 k))))))))
```

The function hashes each word, uses the trailing k bits of the hash to select the bucket, and counts the trailing 0-bits in the remaining 32−k bits, capped at 32−k; we use trailing 0-bits rather than leading 0-bits because we already have a function that counts them, and the choice makes no difference to the result. The `count` function returns the list of per-bucket maximums, which the `loglog` function uses to compute the estimated cardinality:
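The `pearson` hash and `zerobits` come from the auxiliary functions mentioned at the end of this post. As an illustrative stand-in (not necessarily the original definition), `zerobits` can be sketched as counting the trailing 0-bits of its argument:

```scheme
; Sketch of the zerobits helper assumed by count: it returns the
; number of trailing 0-bits, i.e. the position of the lowest set
; bit; by convention a zero argument yields 32 (all bits zero).
(define (zerobits n)
  (if (zero? n)
      32
      (let loop ((n n) (z 0))
        (if (odd? n)
            z
            (loop (quotient n 2) (+ z 1))))))

(zerobits 8)  ; => 3, since 8 = #b1000
(zerobits 7)  ; => 0, since 7 = #b111
```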

```
(define (loglog k)
  (let ((x (exact->inexact (average (count k)))))
    (* (expt 2 x) (expt 2 k) 0.79402)))
```

Here, 0.79402 is the mysterious “alpha” in Flajolet’s algorithm. The results are pretty good:
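To see how the formula behaves in isolation, here is a worked example with the estimator separated from the input-reading machinery; the average bucket maximum of 3.54 is a made-up illustration, not a value observed in any run:

```scheme
; The loglog estimate as a standalone function of k and the
; average bucket maximum x: alpha * 2^k * 2^x.
(define (estimate k x)
  (* (expt 2 x) (expt 2 k) 0.79402))

; With k = 11 (2048 buckets) and a hypothetical average of 3.54:
(estimate 11 3.54) ; about 18915
```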

```
> (for-each (lambda (k)
              (display k) (display #\tab)
              (display (bible loglog k)) (newline))
            (range 4 13))
4       23859.021427027903
5       24915.350550619143
6       20725.542600847963
7       18699.14586307692
8       18150.428679395118
9       18775.245177885612
10      19173.38662399013
11      18992.54592680173
12      19264.451665847893
```

More buckets provide better discrimination than fewer buckets, as you would expect. With 2048 buckets (k = 11), the estimate rounds to 18993 against the actual 19055 distinct words, an error of 19055 − 18993 = 62, which is less than a third of a percent. And any of the estimates from 128 buckets (k = 7) up is good enough for most purposes.
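For reference, Durand and Flajolet's analysis gives loglog a standard error of about 1.30/√m for m buckets, so the run above happened to land well inside the expected accuracy:

```scheme
; Expected relative standard error of loglog with m buckets,
; per Durand and Flajolet's published analysis: 1.30 / sqrt(m).
(define (std-error m)
  (/ 1.30 (sqrt m)))

(std-error 2048) ; about 0.029, i.e. roughly 3%
```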

You can run the program at https://ideone.com/5r6lwH, where you will also see the various auxiliary functions and contributions from the Standard Prelude. The example used there is Abraham Lincoln's Gettysburg Address, which uses 141 distinct words; the loglog algorithm with k = 7 estimates 149 distinct words.