## Loglog

### August 4, 2020

The basic algorithm is in the `count` function:

```
(define (count k)
  (let* ((2^k (expt 2 k)) (counts (make-vector 2^k -1)))
    (do ((word (read-word) (read-word))) ; read words until end of input
        ((eof-object? word) (vector->list counts))
      (let* ((hash (pearson word))
             (bucket (modulo hash 2^k))
             (last-1bit (zerobits (quotient hash 2^k))))
        (vector-set! counts bucket
          (max (vector-ref counts bucket)
               (min last-1bit (- 32 k))))))))
```

The function hashes each word, uses the trailing k bits of the hash to select the bucket, and counts the trailing 0-bits in the remaining 32−k bits, capped at 32−k; we use trailing 0-bits rather than leading 0-bits because we already have a function that counts them, and the choice makes no difference to the result. The `count` function returns the list of per-bucket maximums, which the `loglog` function uses to compute the estimated cardinality:
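The `pearson` hash and `zerobits` come from the auxiliary functions mentioned at the end of this post. As an illustrative stand-in (not necessarily the original definition), `zerobits` can be sketched as counting the trailing 0-bits of its argument:

```scheme
; Sketch of the zerobits helper assumed by count: it returns the
; number of trailing 0-bits, i.e. the position of the lowest set
; bit; by convention a zero argument yields 32 (all bits zero).
(define (zerobits n)
  (if (zero? n)
      32
      (let loop ((n n) (z 0))
        (if (odd? n)
            z
            (loop (quotient n 2) (+ z 1))))))

(zerobits 8)  ; => 3, since 8 = #b1000
(zerobits 7)  ; => 0, since 7 = #b111
```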

```
(define (loglog k)
  (let ((x (exact->inexact (average (count k)))))
    (* (expt 2 x) (expt 2 k) 0.79402)))
```

Here, 0.79402 is the mysterious “alpha” in Flajolet’s algorithm. The results are pretty good:
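To see how the formula behaves in isolation, here is a worked example with the estimator separated from the input-reading machinery; the average bucket maximum of 3.54 is a made-up illustration, not a value observed in any run:

```scheme
; The loglog estimate as a standalone function of k and the
; average bucket maximum x: alpha * 2^k * 2^x.
(define (estimate k x)
  (* (expt 2 x) (expt 2 k) 0.79402))

; With k = 11 (2048 buckets) and a hypothetical average of 3.54:
(estimate 11 3.54) ; about 18915
```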

```
> (for-each (lambda (k)
              (display k) (display #\tab)
              (display (bible loglog k)) (newline))
            (range 4 13))
4       23859.021427027903
5       24915.350550619143
6       20725.542600847963
7       18699.14586307692
8       18150.428679395118
9       18775.245177885612
10      19173.38662399013
11      18992.54592680173
12      19264.451665847893
```

More buckets provide better discrimination than fewer buckets, as you would expect. With 2048 buckets (k = 11), the estimate rounds to 18993 against the actual 19055 distinct words, an error of 19055 − 18993 = 62, which is less than a third of a percent. And any of the estimates from 128 buckets (k = 7) up is good enough for most purposes.
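For reference, Durand and Flajolet's analysis gives loglog a standard error of about 1.30/√m for m buckets, so the run above happened to land well inside the expected accuracy:

```scheme
; Expected relative standard error of loglog with m buckets,
; per Durand and Flajolet's published analysis: 1.30 / sqrt(m).
(define (std-error m)
  (/ 1.30 (sqrt m)))

(std-error 2048) ; about 0.029, i.e. roughly 3%
```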

You can run the program at https://ideone.com/5r6lwH, where you will also see the various auxiliary functions and contributions from the Standard Prelude. The example used there is Abraham Lincoln's Gettysburg Address, which uses 141 distinct words; the loglog algorithm with k = 7 estimates 149 distinct words.