## Loglog

### August 4, 2020

The basic algorithm is in the `count` function:

```
(define (count k)
  (let* ((2^k (expt 2 k)) (counts (make-vector 2^k -1)))
    (do ((word (read-word) (read-word))) ; read words until end of input
        ((eof-object? word) (vector->list counts))
      (let* ((hash (pearson word))
             (bucket (modulo hash 2^k))
             (last-1bit (zerobits (quotient hash 2^k))))
        (vector-set! counts bucket
          (max (vector-ref counts bucket)
               (min last-1bit (- 32 k))))))))
```

The function hashes each word, uses the trailing k bits of the hash to select the bucket, and counts the trailing 0-bits in the remaining 32−k bits, capped at 32−k; we use trailing 0-bits rather than leading 0-bits because we already have a function that counts them, and the choice makes no difference to the result. The `count` function returns the list of per-bucket maximums, which the `loglog` function uses to compute the estimated cardinality:
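The `pearson` hash and `zerobits` come from the auxiliary functions mentioned at the end of this post. As an illustrative stand-in (not necessarily the original definition), `zerobits` can be sketched as counting the trailing 0-bits of its argument:

```scheme
; Sketch of the zerobits helper assumed by count: it returns the
; number of trailing 0-bits, i.e. the position of the lowest set
; bit; by convention a zero argument yields 32 (all bits zero).
(define (zerobits n)
  (if (zero? n)
      32
      (let loop ((n n) (z 0))
        (if (odd? n)
            z
            (loop (quotient n 2) (+ z 1))))))

(zerobits 8)  ; => 3, since 8 = #b1000
(zerobits 7)  ; => 0, since 7 = #b111
```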

```
(define (loglog k)
  (let ((x (exact->inexact (average (count k)))))
    (* (expt 2 x) (expt 2 k) 0.79402)))
```

Here, 0.79402 is the mysterious “alpha” in Flajolet’s algorithm. The results are pretty good:
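To see how the formula behaves in isolation, here is a worked example with the estimator separated from the input-reading machinery; the average bucket maximum of 3.54 is a made-up illustration, not a value observed in any run:

```scheme
; The loglog estimate as a standalone function of k and the
; average bucket maximum x: alpha * 2^k * 2^x.
(define (estimate k x)
  (* (expt 2 x) (expt 2 k) 0.79402))

; With k = 11 (2048 buckets) and a hypothetical average of 3.54:
(estimate 11 3.54) ; about 18915
```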

```
> (for-each (lambda (k)
              (display k) (display #\tab)
              (display (bible loglog k)) (newline))
            (range 4 13))
4       23859.021427027903
5       24915.350550619143
6       20725.542600847963
7       18699.14586307692
8       18150.428679395118
9       18775.245177885612
10      19173.38662399013
11      18992.54592680173
12      19264.451665847893
```

More buckets provide better discrimination than fewer buckets, as you would expect. With 2048 buckets (k = 11), the estimate rounds to 18993 against the actual 19055 distinct words, an error of 19055 − 18993 = 62, which is less than a third of a percent. And any of the estimates from 128 buckets (k = 7) up is good enough for most purposes.
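For reference, Durand and Flajolet's analysis gives loglog a standard error of about 1.30/√m for m buckets, so the run above happened to land well inside the expected accuracy:

```scheme
; Expected relative standard error of loglog with m buckets,
; per Durand and Flajolet's published analysis: 1.30 / sqrt(m).
(define (std-error m)
  (/ 1.30 (sqrt m)))

(std-error 2048) ; about 0.029, i.e. roughly 3%
```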

You can run the program at https://ideone.com/5r6lwH, where you will also see the various auxiliary functions and contributions from the Standard Prelude. The example used there is Abraham Lincoln's Gettysburg Address, which uses 141 distinct words; the loglog algorithm with k = 7 estimates 149 distinct words.