## Correct Horse Battery Staple, Improved

### April 26, 2013

My solution to the previous exercise has a horrible performance bug. Because of its use of `shuffle`

, the function takes time proportional to the length of the dictionary instead of the length of the passphrase. For a four-word passphrase and the 3097-word dictionary used in the solution, that’s a three-order-of-magnitude performance bug. Ouch!

Your task is to fix your solution if you got it wrong like I did. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

Advertisements

Some comments: the resulting algorithm is O(k^2) instead of the claimed O(k) (with k being the length of the passphrase) due to the membership testing. This could be improved to O(k log k) by using a set instead of a list for the resulting passphrase or O(k) with something like a bloom filter. Technically though, the algorithm is still O(n), with n being the length of the dictionary, since you still need to evaluate the entire dictionary.

As for real-world timings, even using an O(n log n) shuffling algorithm like I did, generating a 4-word password takes about 8 ms for a 3097-word dictionary and 70 ms for a 48246-word dictionary. I’m sure the new version is faster, but I don’t generally find myself registering accounts on new sites 14 to 125 times a second :)

Rather then using “member” you can use the same method as in “shuffle”, but only do n steps of the loop. For small values of n and a large dictionary this is not going to make much of a difference though. You can also solve this problem using a streaming algorithm that runs through the list at most once. For each word in the dictionary you do a Bernoulli trial to see whether to accept it or not. The probability for each trial depends on the number of words left in the dictionary and the number of words still to be selected. The streaming approach requires a lot of random numbers so is only a good choice if random access is a problem. Both approaches are explained in Knuth’s TAOCP.

Implementation of the first approach in scheme. Something similar is probably implemented by the “random.sample” function in python, used by one of the solutions to the original exercise.

Note that this versions mutates the words vector.

The python random.sample function returns the samples in the order they were selected, and does not alter the input population. It uses one of two approaches depending on which takes less memory: a set of n elements or a copy of the population.

When a set uses less memory, a set is used to ensure no duplicate items are selected:

When a copy of the population would use less memory an approach like Jan Van lent described is used:

I would be interested to see an exercise to determine how much entropy corresponds to a passphrase. Ideally we could reproduce Randall’s xkcd numbers but I remember seeing some debate about his numbers.