July 12, 2013
Our Standard Prelude has included a simple implementation of hash tables since the very beginning. That implementation is probably twenty years old, or more; it’s one of the first pieces of code that I wrote when I was first learning Scheme, and even though it has had a few improvements since then, it’s time for a better implementation of hash tables. We’ll do that today.
Let’s review. A hash table provides storage for key/value pairs, where the only operation permitted is to compare two keys for equality; unlike a binary search tree, there is no ordering relationship between two keys. The hash table stores key/value pairs in an array of lists; a hash function converts a key to an address in the array, then a linear search through the list at that array location identifies the key/value pair. The size of the array is fixed in advance; if it is too small, then the chains at each array location grow too long, which slows down the operation of the hash table, but if it is too large, space is wasted. Assuming that the array isn’t too very small, so that the chains don’t grow too long, and assuming a fair hash function, hash tables provide O(1) access to any key/value pair.
A dynamic hash table adjusts automatically to the number of key/value pairs that it stores. That implies some kind of compromise in the O(1) behavior of the hash table. One possibility is to store the chains in some kind of tree structure, but that means each access will cost O(log n). Another possibility is to double the size of the array whenever the load factor becomes too high, but that means pauses whenever the array is resized, which can be annoying (or worse) for some applications, and still leaves the asymptotic behavior at something greater than O(1).
So we cheat. We will store the chains in a two-stage data structure that consists of arrays of width w stored in the growable arrays of a previous exercise, and systematically increase the size of the array at each insertion instead of stopping from time to time for a big resizing. The arrays of width w mean that the base of the logarithm is w instead of 2, so for instance to store a million key/value pairs with w = 256 and a load factor of 5 we need 200,000 chains which will be stored in 800 elements of the growable array at a maximum depth from the root of 9, and we’ll say that 9 is close enough to 1 that O(log n) ≈ O(1). It’s fun to cheat!
To control the doubling of the array we maintain three variables: u is the current number of chains, m is the current number of available chains, and p is the index number of the next chain to be split. Variables u and m are initialized to w; p runs from 0 to m, which doubles when p reaches it, resetting p to 0 for the next doubling. When the average load factor is exceeded, p increases, the keys at bucket p are rehashed and split into two buckets at p and p + m, and p and u are increased by one. The table shrinks in a similar way during deletions, except that m and u never fall below w.
Your task is to write a program that maintains dynamic hash tables. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.