## Subset Sum, Meet In The Middle

### March 30, 2012

We looked at the brute-force algorithm for the subset sum problem in the previous exercise. Today we look at a different algorithm that solves the same problem; the new algorithm is more efficient, but still exponential, with a time complexity of *0*(*n*2^{n/2}).

The new algorithm, called the *meet in the middle* algorithm, splits the input list into equal-size halves. The first half is subject to the same algorithm as the previous exercise, in which all subsets are generated, their sums computed, and the sums compared to the target. It is possible but unlikely that the target will be found in the first half. If not, the algorithm generates all subsets of the second half and checks each sum to see if the difference between target and sum was a sum in the first half, in which case the required subset has been found.

Your task is to implement the meet in the middle algorithm for solving the subset sum problem. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

## Subset Sum

### March 27, 2012

The subset-sum problem asks you to find the subset of a set of integers that sums to a given target; for instance, given the set {267 439 869 961 1000 1153 1246 1598 1766 1922} and target 5842, the subset {869 961 1000 1246 1766} sums to 5842. This is a well-known NP problem, and the standard solution via dynamic programming takes time *O*(*n*2^{n}). The basic idea is straight forward: generate a list of subsets, compute the sum of each, and return the one that sums to the target.

The dynamic programming solution has the same *O*(*n*2^{n}) time complexity, but builds up the solution in stages. A matrix has *n* rows and 2^{n} columns; the rows are marked with the *n* input values, the columns are marked with the various subset-sums, and each cell contains a list of items from the *n* values up to the current row that sums to the column total. We’ll look at an example: find the subset of items from {1 -3 4 2} that sums to 0. After the first element of the list is considered, the first row has two columns, the null column that we ignore and the column 1:

1 | |

1 | 1 |

With the second row, there are four columns: the null column that we ignore, and columns -3, 1, and 2:

-3 | 1 | 2 | |

1 | 1 | ||

-3 | -3 | 1 | -3 1 |

When we add the third row, there are eight columns: the null column that we ignore, the six columns -3, -2, 1, 2, 4, and 5, and a hidden column for 1, which can be formed in two different ways, as 1 by itself and as the sum of -3 and 4:

-3 | -2 | 1 | 2 | 4 | 5 | |

1 | 1 | |||||

-3 | -3 | -3 1 | 1 | |||

4 | -3 | -3 1 | 1 | -3 1 4 | 4 | 1 4 |

When we add the fourth row, there are sixteen columns, but only eleven appear in the table: -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, and 7. We ignore the null column, and there are four hidden columns corresponding to the following sum-pairs, only one of which appears in the table (it doesn’t matter which): 1 = -3 + 4, 2 = -3 + 1 + 4, 1 + 2 = -3 + 2 + 4, and 4 = -3 + 1 + 2 + 4:

-3 | -2 | -1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | |

1 | 1 | ||||||||||

-3 | -3 | -3 1 | 1 | ||||||||

4 | -3 | -3 1 | 1 | -3 1 4 | 4 | 1 4 | |||||

2 | -3 | -3 1 | -3 2 | -3 1 2 | 1 | -3 1 4 | 1 2 | 4 | 1 4 | 2 4 | 1 2 4 |

The solution is the subset {-3 1 2}. I apologize for my lack of table-fu.

Your task is to write a function that solves the subset-sum problem. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

## Base-26 Arithmetic

### March 23, 2012

A reader named Prashant recently wrote to suggest an exercise in base-26 arithmetic:

Write a function that takes two base-26 numbers in which digits are represented by letters with A=0, B=1, … Z=25 and returns their product using the same notation. As an example, CSGHJ × CBA = FNEUZJA.

Prashant was worried that the problem was specific to C/C++, but that’s not an issue.

Your task is to write the base-26 multiplication function. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

## Factoring Multiple RSA Keys

### March 20, 2012

On February 15th, the *New York Times* published an article that described an attack on the RSA key-generation algorithm by Arjen Lenstra and others; Nadia Heninger and her friends did the same study, but have not yet published their work. The *Times* missed the essence of the Lenstra paper and reported that two out of every thousand internet passwords are insecure, which is incorrect; Dan Kaminsky gives a better description of the vulnerability, and explains why there is little to worry about.

The problem was not in the RSA algorithm itself, but in generating the keys that are used by the RSA algorithm. We studied the key-generation algorithm in a previous exercise. Briefly, the key-generation algorithm selects two large primes, usually called *p* and *q*, and uses them to create a public key and a private key; the public key is the product of *p* and *q*. The security of the RSA algorithm relies on the fact that it is difficult to factor the public key into its two components *p* and *q*. However, if you do, it is easy to determine the corresponding private key, and anyone who does so can decrypt any message thus encoded.

What Lenstra/Heninger found was that, due to a flaw in some implementations of the key-generation algorithm, multiple private keys used the same p. Here’s what Lenstra/Heninger and their colleagues did: First, they collected a large corpus of RSA public keys by scanning every IPv4 address on the internet using `nmap`

; both teams collected somewhere around twelve million keys. Then, they computed the gcd of each key paired with every other key; where the gcd was not 1, it was one of the factors of the key, and the other factor could easily be found by division, and from the two factors the private key could be determined.

Today’s exercise challenges you to recreate the computation that Lenstra/Heninger used to find the factors. We’ll use this set of 21 public keys; these keys are small enough to be easily factored, but real keys are much bigger:

`708894553 704488409 705674273`

707478271 710167019 708093251

702013379 704030867 708691187

708374743 712748719 713581951

704387447 707015741 704308279

710872423 707947453 704996429

706323757 705031051 714623803

We’ll use two different methods to find the factorable keys. The first method, mentioned above, is to pair each key with every other key and compute the gcd. The time complexity of that algorithm is *O*(*n* ^{2}), which is fine if *n*=21 but not so fine if *n*=12,000,000; 144trillion of anything will take a while. A second algorithm first multiplies all the keys, then for each key *k* computes gcd(*k*, *k _{prod}* (mod

*k*

^{2}) /

*k*). The second algorithm takes time

*O*(

*n*log log

*n*); the product takes linear time, the gcd with each key takes linear time, and there is an additional factor of log log

*n*because of the big-integer arithmetic on the product of the keys. The algorithm that Lenstra/Heninger actually used was neither of these; a paper by Daniel Bernstein gives the algorithm, but it’s rather more complicated than we care to deal with today, and the linear algorithm described above isn’t too bad, as long as

*n*isn’t too large and as long as you have a good big-integer library.

Your task is to write functions that implement the two algorithms given above and use them to factor as many of the keys as you can. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

## Sum of squares of two largest of three values

### March 16, 2012

Today’s exercise comes to us from the book *Structure and Interpretation of Computer Programs* by Abelson and Sussman (exercise 1.3):

Define a procedure that takes three numbers as arguments and returns the sum of the squares of the two larger numbers.

Your task is to write the indicated function. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

## Perfect Power Predicate

### March 13, 2012

A number *n* is a perfect power if there exists a *b* and *e* for which *b ^{e}* =

*n*. For instance 216 = 6

^{3}= 2

^{3}· 3

^{3}is a perfect power, but 72 = 2

^{3}· 3

^{2}is not. Testing for perfect powers is similar to other powering predicates we have seen, and is useful in some factoring algorithms.

The trick to determining if a number is a perfect power is to know that, if the number is a perfect power, then the exponent *e* must be less than log_{2} *n*, because if *e* is greater then 2^{e} will be greater than *n*. Further, it is only necessary to test prime *e*s, because if a number is a perfect power to a composite exponent it will also be a perfect power to the prime factors of the composite component; for instance, 2^{15} = 32768 = 32^{3} = 8^{5} is a perfect cube root and also a perfect fifth root.

Your task is to write a function to determine if a number is a perfect power. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

## Sparse Sets

### March 9, 2012

We look today at a clever data structure for storing sparse sets of integers on the range 0 .. *u*−1 and performing initialization, lookup, and insertion is time *O*(1) and iteration in *O*(*n*), where *n* is the number of elements in the set. The data structure was studied in a 1993 article “An Efficient Representation for Sparse Sets” by Preston Briggs and Linda Torczon, in Exercise 1.9 (1.8 in the first edition) of Jon Bentley’s book *Programming Pearls*, and in exercise 2.12 of the 1974 book *The Design and Analysis of Computer Algorithms* by Al Aho, John Hopcroft and Jeffrey Ullman; the data structure itself dates to the folklore of computing.

The data structure considers a universe of integers from 0 to *u*−1; depending on the circumstances, the integers probably map to something else, but we don’t care about that. Any given set consists of *n* items chose from the universe; there are no duplicates. Note that *n* ≤ *u*, certainly, and likely *n* is much less than *u* — otherwise, you would probably use a bit vector to represent the set. Note also that we are optimizing for speed at the expense of space, as a bit vector takes *u* bits but our data structure takes 2*u* integers.

Think about a bit vector. Setting a bit is a constant-time operation, as is checking if a bit is set or unset. But initializing the bit vector and iterating over the set elements of the bit vector each take time proportional to the size of the bit vector. Our sparse sets reduce the iteration to time proportional to the size of the set (rather than the size of the universe) and reduce the initialization time to a constant.

The sparse set is represented by two vectors that we will call dense (abbreviated *D*) and sparse (abbreviated *S*). Initially *n*, the number of elements in the set, is zero; the two vectors are uninitialized and may contain anything. To add an element 0 ≤ *k* < *u* to a set that does not already contain *k*, we set *D*[*n*] to *k*, *S*[*k*] to *n*, and increase *n* by 1, an operation that takes constant time. After this, the two vectors point to each other, which gives a test of set membership that also works in constant time: an element *k* is in the set if and only if *S*[*k*] < *n* and *D*[*S*[*k*]] == *k>*. Note that if *k* is not a member of the set, the value of *S*[*k*] doesn’t matter; either it *S*[*k*] will be greater than *n* or it will point to an element of *D* that doesn’t point back to it. The diagram above right shows a set with the elements 5, 1 and 4; the blue boxes may contain any value. To iterate over the elements of the set, read *D*[0 .. *n*−1], which takes time *O*(*n*), and to clear the set make *n* = 0, which takes time *O*(1); note in particular that clearing the set doesn’t require reinitialization. Other operations, including size-of, delete, union, intersection, difference, and set-equality are possible, and equally time-efficient compared to bit vectors, but we won’t discuss them here, since they are seldom used with this representation of sets. A common use of these sparse sets is with register allocation algorithms in compilers, which have a fixed universe (the number of registers in the machine) and are updated and cleared frequently during a single processing run.

Your task is to implement the insert, lookup, iterate and clear operations for sparse sets as described above. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

## Union Route Cipher

### March 6, 2012

One of the most successful military ciphers in history was fielded by the Union armies during the American Civil War. It is known that the Confederacy never cracked the cipher; in fact, they even took to publishing intercepted messages in their newspapers, in the hope that someone could read them. The cipher was devised by Anton Stager, the general superintendent of the Western Union Telegraph Company, at the request of the government of Ohio; later, General George McClellan, who was in charge of the armies of Ohio, introduced the cipher throughout the Union armies. You can read more about the cipher here and here.

The cipher works in two phases. First, some of the words with military significance are replaced by code words; for instance, attack could be replaced by tulip, and the phrase at dawn could be replaced by stripe, so the code tulip stripe means attack at dawn. The lexicon included names of people (Lincoln, various generals), places (Richmond, hilltop), times (Tuesday, 4:30pm, dawn), and actions (attack, reconnoiter); a single cleartext word could admit multiple code words, multiple cleartext words could be encoded as a single word, and even digits and punctuation had code words. The final version of the lexicon included 1608 codewords.

The second phase was a route transposition, and there were many variants. For instance, a route designated willow might call for six columns with words chosen in the order down column three, up column four, down column two, down column six, up column one, and down column five; nulls were added to pad out the last row, and an additional null was added in a seventh column.

Here’s an example: On June 1, 1863, President Lincoln sent the following telegram:

`FOR COLONEL LUDLOW:`

RICHARDSON AND BROWN, CORRESPONDENTS OF THE TRIBUNE,

CAPTURED AT VICKSBURG, ARE DETAINED AT RICHMOND.

PLEASE ASCERTAIN WHY THEY ARE DETAINED

AND GET THEM OFF IF YOU CAN.

LINCOLN.

The lexicon included certain codewords: VENUS for COLONEL, WAYLAND for CAPTURED, ODOR for VICKBURG, NEPTUNE for RICHMOND, and ADAM for LINCOLN. An additional codeword NELLY gives the time of dispatch as 4:30PM. Thus the message becomes:

`FOR VENUS LUDLOW:`

RICHARDSON AND BROWN, CORRESPONDENTS OF THE TRIBUNE,

WAYLAND AT ODOR, ARE DETAINED AT NEPTUNE.

PLEASE ASCERTAIN WHY THEY ARE DETAINED

AND GET THEM OFF IF YOU CAN.

ADAM NELLY

Now the cipher clerk picks a route GUARD that calls for five columns in the order up column one, down column two, up column five, down column four, and up column three. He writes the message in five columns, the last row padded with nulls:

`FOR VENUS LUDLOW RICHARDSON AND`

BROWN CORRESPONDENTS OF THE TRIBUNE

WAYLAND AT ODOR ARE DETAINED

AT NEPTUNE PLEASE ASCERTAIN WHY

THEY ARE DETAINED AND GET

THEM OFF IF YOU CAN

ADAM NELLY THIS FILLS UP

Now the message is pulled off in column route order and nulls are added after each column:

`up column one ADAM THEM THEY AT WAYLAND BROWN FOR KISSING`

down column two VENUS CORRESPONDENTS AT NEPTUNE ARE OFF NELLY TURNING

up column five UP CAN GET WHY DETAINED TRIBUNE AND TIMES

down column four RICHARDSON THE ARE ASCERTAIN AND YOU FILLS BELLY

up column three THIS IF DETAINED PLEASE ODOR OF LUDLOW COMMISSIONER

Then the final message is read off in order following the route indicator:

`GUARD ADAM THEM THEY AT WAYLAND BROWN FOR KISSING VENUS CORRESPONDENTS AT NEPTUNE ARE OFF NELLY TURNING UP CAN GET WHY DETAINED TRIBUNE AND TIMES RICHARDSON THE ARE ASCERTAIN AND YOU FILLS BELLY THIS IF DETAINED PLEASE ODOR OF LUDLOW COMMISSIONER`

Deciphering simply reverses the process. The recipient counts the words in the message, draws a grid of the proper size, fills in words in route order, converts codewords to their plaintext equivalents, and reads the message.

The Union route cipher had the strong advantage that it worked with words rather than individual letters, making errors in transcription and telegraphy much less common. The disadvantage, of course, was that codebooks had to be painstakingly controlled; loss of a single codebook would give the enemy access to all your communications. But that didn’t happen, and the cipher remained secure throughout the war.

Your task is to write a program that encrypts and decrypts messages according to the Union route cipher; make your own conventions for the lexicon and routes. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

## Balanced Delimiters

### March 2, 2012

These interview questions are addicting; I have to stop with them and get on something else. But one more won’t hurt:

Write a function that takes a string and determines if the delimiters in the string are balanced. The pairs of delimiters are (), [], {}, and <>, and delimiters may be nested. In addition, determine that string delimiters ‘ and ” are properly matched; other delimiters lose their magical delimiter-ness property within quoted strings. Any delimiter is escaped if it follows a backslash.

Your task is to write the function to determine if a string has balanced delimiters. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.