Correct Horse Battery Staple, Improved

April 26, 2013

We use the same dictionary as the previous exercise, but store the words in a vector rather than a list:

(define words '#(abandon ... youth))

(define len (vector-length words))

Then we loop, adding a new word to the outgoing list, skipping any already in the output, until we have the desired number of words:

(define (xkcd n)
  (let loop ((n n) (ws (list)))
    (if (zero? n) ws
      (let ((w (vector-ref words (randint len))))
        (if (member w ws) (loop n ws)
          (loop (- n 1) (cons w ws)))))))

Here are some examples:

> (do ((i 0 (+ i 1))) ((= 20 i))
    (display (xkcd 4)) (newline))
(thrive sudden sharply driving)
(critical behalf swell flying)
(strategic propose saving bitter)
(history disturbing skull sheet)
(drama truth saving spray)
(decide skill typical producer)
(safety longtime proof border)
(spend northern contractor retired)
(along window involve feeling)
(coming notebook silly compel)
(essence seven creative researcher)
(disaster mention magnitude diagnose)
(vessel confront cultural stable)
(diagnose supporter province realm)
(again plain talent blanket)
(marker football involve fleet)
(shower television outline chill)
(student feeling package sharp)
(dancer architect accept deadline)
(remarkable bother racial psychology)

We used the random number generator from the Standard Prelude. You can run the program at


Pages: 1 2

5 Responses to “Correct Horse Battery Staple, Improved”

  1. Some comments: the resulting algorithm is O(k^2) instead of the claimed O(k) (with k being the length of the passphrase) due to the membership testing. This could be improved to O(k log k) by using a set instead of a list for the resulting passphrase or O(k) with something like a bloom filter. Technically though, the algorithm is still O(n), with n being the length of the dictionary, since you still need to evaluate the entire dictionary.
    As for real-world timings, even using an O(n log n) shuffling algorithm like I did, generating a 4-word password takes about 8 ms for a 3097-word dictionary and 70 ms for a 48246-word dictionary. I’m sure the new version is faster, but I don’t generally find myself registering accounts on new sites 14 to 125 times a second :)

  2. Jan Van lent said

    Rather then using “member” you can use the same method as in “shuffle”, but only do n steps of the loop. For small values of n and a large dictionary this is not going to make much of a difference though. You can also solve this problem using a streaming algorithm that runs through the list at most once. For each word in the dictionary you do a Bernoulli trial to see whether to accept it or not. The probability for each trial depends on the number of words left in the dictionary and the number of words still to be selected. The streaming approach requires a lot of random numbers so is only a good choice if random access is a problem. Both approaches are explained in Knuth’s TAOCP.

  3. Jan Van lent said

    Implementation of the first approach in scheme. Something similar is probably implemented by the “random.sample” function in python, used by one of the solutions to the original exercise.

    (define (xkcd-2 n)
      (let loop ((i 0) (ws (list)))
        (if (> i n) ws
          (let* ((r (randint (- len i)))
                 (w (vector-ref words r)))
            (vector-set! words r (vector-ref words n)) ; mutates words!
            (vector-set! words n w)                    ; not necessary if copy of words is used
            (loop (+ i 1) (cons w ws))))))
    (do ((i 0 (+ i 1))) ((= 20 i))
        (display (xkcd-2 4)) (newline))

    Note that this versions mutates the words vector.

  4. Mike said

    The python random.sample function returns the samples in the order they were selected, and does not alter the input population. It uses one of two approaches depending on which takes less memory: a set of n elements or a copy of the population.

    When a set uses less memory, a set is used to ensure no duplicate items are selected:

    n = len(population)
    result = []
    used = set()
    for i in range(k):
       r = random.randint(0, n)
       while r in used:
          r = random.randint(0, n)

    When a copy of the population would use less memory an approach like Jan Van lent described is used:

    n = len(population)
    result = []
    source = population[:]
    for i in range(k):
       r = random.randint(0, n-i)
       source[r] = source[-1]
  5. Mitchell Perilstein said

    I would be interested to see an exercise to determine how much entropy corresponds to a passphrase. Ideally we could reproduce Randall’s xkcd numbers but I remember seeing some debate about his numbers.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: