Minimum Hamming Distance
November 12, 2013
Every autumn, a month or two after the start of a new academic year, I receive a handful of emails from students who expect me to help them with their homework. I never respond. Some of the requests are polite — “Dear Kind Sir;” most of them are demanding — “It has to be in C# and cannot exceed 200 lines;” all of them have spelling and grammar mistakes. Today’s exercise is from an email that demanded a response by midnight (the writer didn’t say what time zone), but that was a few weeks ago, and the homework deadline is well past, and at least the question is entertaining.
The Hamming distance between two numbers is the number of bits that must be changed to make the bit representations of the two numbers equal. For instance, the number 79 is represented as 1001111 in binary, the number 83 is represented as 1010011 in binary, and the Hamming distance between them is 3, as the second, third and fourth bits (counting from the right, starting with zero) differ. The assignment was to read a list of numbers and find the two with the minimum Hamming distance between them.
Your task is to write somebody’s homework assignment, as described above — you must finish it before midnight! When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.
Technically speaking, isn’t it always *after* midnight (and thus, never a suitable time to feed your Gremlin?)
In Python. The function min_hamming returns the pair with the smallest Hamming distance, as requested, and not the minimum Hamming distance.
Haskell solution
Past 6 pm, my zone. Well before a midnight, my zone. Python 3. The bit counter ignores not only zeroes but also a prefix “0b” in the binary string representation of the xor. (I also considered negative numbers but decided to ignore them. The sign is ignored, but I suspect the result might not be a Hamming distance if negative numbers were involved.)
from operator import xor
from itertools import combinations
def ham(p): return sum(1 for d in bin(xor(*p)) if d == '1')
def minham(ms): return min(combinations(ms, 2), key=ham)
The test program is longer than the program. I added the model solution example numbers to my own test numbers afterwards.
if __name__ == '__main__':
for numbers in (31, 4, 15, 926), (13500, 1935, 9645, 5790):
pairs = tuple(combinations(numbers, 2))
print('The Hammings of some pairs:')
print(*zip(map(ham, pairs), pairs), sep = '\n')
print('With "the" pair of fewest different bits:', minham(numbers))
else:
print('And the Hamming of 79 and 83:', ham((79, 83)))
Originally I had an assertion that the minimun distance pair was unique. I removed the assertion. My test shows that the model pairs are much tied.
The Hammings of some pairs:
(4, (31, 4))
(1, (31, 15))
(4, (31, 926))
(3, (4, 15))
(6, (4, 926))
(5, (15, 926))
With "the" pair of fewest different bits: (31, 15)
The Hammings of some pairs:
(8, (13500, 1935))
(4, (13500, 9645))
(4, (13500, 5790))
(4, (1935, 9645))
(4, (1935, 5790))
(8, (9645, 5790))
With "the" pair of fewest different bits: (13500, 9645)
And the Hamming of 79 and 83: 3
None of us seems interested in the part of the assignment that says to “read” the list of numbers.
Scheme needs the arg min and arg max that some machine learning communities or some such use so much. Something like the following.
(define (argmin proc args)
(let loop ((least (proc (car args)))
(arg (car args)) (args (cdr args)))
(if (null? args) arg
(let ((weight (proc (car args))))
(if (< weight least)
(loop weight (car args) (cdr args))
(loop least arg (cdr args)))))))
To be used where Python would have the key for min, as I did above.
(let ((numbers '(3 -1 4 1 -5 9)))
(display "Numbers: ") (write numbers) (newline)
(display "Absolutely least: ") (write (argmin abs numbers)) (newline))
(let ((strings '("three" "one" "four" "one" "five" "nine")))
(display "Strings: ") (write strings) (newline)
(display "Shortest: ") (write (argmin string-length strings)) (newline))
That writes the following.
Numbers: (3 -1 4 1 -5 9)
Absolutely least: -1
Strings: ("three" "one" "four" "one" "five" "nine")
Shortest: "one"
The present problem model solution would be modified like so:
(define (min-hamm xs) (argmin hamming (pairings xs)))
The example would then show the minimum distance pair instead of the minimum distance.
> (min-hamm '(13500 1935 9645 5790))
(13500 9645)
FORTH version:
In Smalltalk:
Each number is transformed into its equivalent bit string and paired with each other without repetition:
The resulting collection of pairs is minimized with respect to the hamming function:
Here’s an implementation in Python. The problem statement seems a
bit ambiguous to me; it doesn’t specify that the pairs should be of
two distinct numbers. Hence the snarky solution is to return zero,
the distance from the first number in your list to itself.
Re snarky, I think it’s reasonably clear that the numbers need to be distinct, because they need to be two: “The assignment was to read a list of numbers and find the two with the minimum Hamming distance between them.”
(And, AFAICS, there’s still no candidate solution that reads the numbers from somewhere. We are all just putting test numbers in the source code. We are all failing that part of the assignment. And I agree with us :)
The problem is very interesting and remembers me of a similar one that asks to find the closest pair of 2D points, that can be solved in O(n*log(n)). Despite the fact that in both cases we have metrics, Hamming and euclidean, in this current case we have a possibly 32-dimensional space with a different structure, if we assume that all the numbers are 32-bit integers. Probably there are more subtle differences, but I’m still wondering if this problem can be solved better than Theta(n^2). An interesting idea that crossed my mind was to use BK-trees, which is a very nice data structure which I think fits perfectly here, but still it is hard to evaluate the complexity of the queries.
@cosmin: While 32-bit, 2’s-complement integers are a possible space, there’s nothing in the problem statement to indicate that it was intended. Taking that to be implicit is a bit like assuming that all text is in ASCII unless otherwise stated. In both cases, an explicit assumption is fine (and in this case needed if you want to include negative numbers).
The complexity question should be interesting.
I’m thinking there could be a radixsort-inspired approach to speeding things up. Suppose n numbers are represented with k bits each. Assume n>=2 throughout.
Base step: If k=1, then the minimum Hamming distance (mindist) and a minimal pair (minpair) can be found in a trivial amount of time (If n>=3, then mindist is 0, and minpair can be found in at most two comparisons. If n=2, then minpair takes no more work and mindist takes one comparison).
Recursive step: Spray the numbers into 2k bins, with all numbers whose i’th bit is 0 going into bin (i,0) and those whose i’th bit is 1 going into bin (i,1), the i’th bit stripped out with a couple bitwise operations. This spraying operation is clearly O(nk). Check if all bins contain at most one element (O(k)). If so, choose the contents of any two as a minimal pair and report their Hamming distance. If not, discard all bins with less than two elements. Use the algorithm recursively to find the minimal pair and distance in each bin. Then choose the minimal one from among these, which is O(k).
The most expensive operation in the recursive step is the spraying itself. This takes O(nk) at the top level. At the jth level we have (at most) something vaguely like 2^j k (k-1) … (k-i) bins, into which we’re sorting something vaguely like k(k-1)…(k-i)n elements. The total should be something vaguely like O(2^k (k!)^2 n), I believe, but I haven’t done a careful calculation, so I could be *completely* wrong. Also, if my estimate is an upper bound, there may be a modification that will give a slightly better one: when n gets sufficiently large compared to k, it eventually gets *easier* to find the minimum Hamming distance, and may or may not get easier to find a minimum pair. Note that (to take an extreme example) if n>2^k then the Hamming distance *must* be 0.
Forget what I wrote above—it’s awful.
A recent paper on this very topic: http://www.singacom.uva.es/~edgar/cactc2012/trabajos/CACT2012_Villa.pdf
[…] Problem statement from Programming Praxis: […]