String Search: Boyer-Moore

August 28, 2009

The two previous exercises discussed the brute-force and Knuth-Morris-Pratt algorithms for searching strings. Today we discuss the Boyer-Moore string search algorithm, invented by Bob Boyer and J Strother Moore in 1977, in a variant devised by Nigel Horspool.

The Boyer-Moore algorithm is a “backwards” version of the Knuth-Morris-Pratt algorithm. It compares the last character of the pattern first, working its way right-to-left until it finds a mis-match, then slides the pattern right along the search string by a skip distance based on the text character currently aligned with the end of the pattern.

Consider the pattern ABABAC, the same pattern used in the prior exercise. In the Horspool variant, the skip array is indexed by character rather than by pattern position: a character’s entry is the distance from its last occurrence in all but the last position of the pattern to the end of the pattern, with absent characters assigned the full pattern length. Thus A skips 1, B skips 2, and every other character, including C, skips 6.
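Here is one way to sketch the Horspool variant in Python; the helper names are mine, not part of the exercise:

```python
def horspool_skip(pattern):
    """Bad-character table: a character's entry is the distance from its
    last occurrence in pattern[:-1] to the end of the pattern.  For
    ABABAC this gives A -> 1 and B -> 2; absent characters (including C,
    which occurs only in the last position) default to the full length."""
    m = len(pattern)
    return {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}

def horspool_search(text, pattern):
    """Return the index of the first match, or -1 if there is none."""
    m, skip = len(pattern), horspool_skip(pattern)
    i = 0
    while i + m <= len(text):
        j = m - 1                          # compare right to left
        while j >= 0 and text[i + j] == pattern[j]:
            j -= 1
        if j < 0:
            return i                       # every character matched
        i += skip.get(text[i + m - 1], m)  # slide by the skip distance
    return -1
```

For example, horspool_search("ABABABAC", "ABABAC") returns 2.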

We examined the brute-force method of searching for a pattern in a string in the previous exercise. Today, we look at a better string-searching method developed by Donald Knuth and Vaughan Pratt, and independently by James Morris, which they published jointly in 1977.

In the brute-force method, if the beginning of a pattern matches at the current position of the text, but leads to a mis-match before the end of the pattern, the search restarts at the beginning of the pattern, moved one character along in the string. The Knuth-Morris-Pratt method takes advantage of the partial match; since the matched characters are already known, and in fact are part of the pattern, it is possible to pre-compute from the pattern, even before the search begins, the next position at which the search could possibly succeed. Consider for instance the pattern ABABAC being matched against the string ABABABAC. The first comparison looks like this:

str: A B A B A B A C
pat: A B A B A C

The first five characters match, the sixth fails. But instead of sliding the pattern one character to the right, as in the brute-force search,

str: A B A B A B A C
pat:   A B A B A C

we can slide the pattern two characters to the right, since we already know the five characters of the pattern that matched:

str: A B A B A B A C
pat:     A B A B A C

And we’re done. The skip size can be computed before the search ever starts, just by comparing the pattern to itself; the skip array, which is the same length as the pattern, tells how many characters to skip if the mis-match is found at a given pattern character. The first element of the skip array is always 0; if the first character of the pattern doesn’t match the first character of the string, slide the pattern one character to the right (that is, skip 0 characters). The second element of the skip array is also 0, because the A that is the first character of the pattern doesn’t match the B that is the second character of the pattern. The third element of the skip array is 1, because the A that is the third character of the pattern matches the A that is the first character of the pattern. The fourth element of the skip array is 2, because the AB that is the third and fourth characters of the pattern matches the AB that is the first and second characters of the pattern. The fifth element of the skip array is 3, because the ABA that is the third through fifth characters of the pattern matches the ABA that is the first through third characters of the pattern. And the sixth element of the skip array is 0, because the C that is the sixth character of the pattern doesn’t match any previous characters in the pattern.

Thus, the Knuth-Morris-Pratt algorithm works by pre-computing the skip array, then comparing the pattern to the string, character by character, occasionally skipping characters according to the skip array.
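One possible rendering of the algorithm in Python (a sketch, not the suggested solution); the table-builder reproduces the skip array derived above, expressed as the length of the longest proper prefix of the pattern that is also a suffix ending at each position:

```python
def kmp_table(pattern):
    """table[i] is the length of the longest proper prefix of
    pattern[:i+1] that is also a suffix of it."""
    table = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]       # fall back to a shorter prefix
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table

def kmp_search(text, pattern):
    """Return the index of the first occurrence of pattern, or -1."""
    table = kmp_table(pattern)
    k = 0                          # characters of pattern matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = table[k - 1]       # skip ahead instead of restarting
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1
```

Here kmp_table("ABABAC") yields [0, 0, 1, 2, 3, 0], matching the derivation above, and kmp_search("ABABABAC", "ABABAC") returns 2.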

Your task is to write a function that searches for a pattern in a string using the Knuth-Morris-Pratt algorithm. When you are finished, you are welcome to read or run a suggested solution, or to post your solution or discuss the exercise in the comments below.

String Search: Brute Force

August 21, 2009

In this exercise, and several that will follow, we will examine algorithms for searching a string. Our goal is to write a function that takes a pattern and a string and either returns the index of the first location of the pattern or an indication that the pattern is not present in the string; an optional third argument allows the search to start at an arbitrary point in the string. Our pattern is simply a fixed string; unlike regular expressions, there are no meta-characters.

Our first string-search algorithm uses brute force: place the pattern at each possible location in the search string and compare character-by-character to determine if there is a match.
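As a sketch, the brute-force search might look like this in Python (the name and exact interface are mine; the suggested solution may differ):

```python
def brute_search(text, pattern, start=0):
    """Try each alignment in turn; return the index of the first match,
    or -1 if the pattern is not present.  The optional start argument
    begins the search at an arbitrary point in the string, as the
    exercise specifies."""
    for i in range(start, len(text) - len(pattern) + 1):
        if text[i:i + len(pattern)] == pattern:
            return i
    return -1
```

The comparison at each alignment is character-by-character in spirit; the slice comparison above is just the idiomatic Python way to write it.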

Your task is to write a function that performs brute-force string searching as described above. Since we will be writing several other string searching functions, you should also write a test suite that will find any errors in your function. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

Blum Blum Shub

August 18, 2009

A stream cipher takes its input one plaintext character at a time, producing the corresponding ciphertext character before it moves on to the next plaintext character. An easy way to do this is to seed a pseudo-random number generator and xor its output with the stream of plaintext characters; because of the xor, decryption is the exact same operation as encryption. The pseudo-random number generator must be cryptographically secure; such common pseudo-random number generators as the linear congruential, lagged Fibonacci, and Mersenne Twister generators all fail due to serial correlation between successive values.

A simple pseudo-random number generator that is cryptographically secure is known as Blum Blum Shub, after its three inventors. Successive values are computed as the least-significant bit (or bits) of the number x_k = x_{k-1}^2 (mod n). The parameter n is the product of two large primes p and q, each congruent to 3 modulo 4. The initial value x_0 is calculated as s^2 (mod n), where s is the seed, which must be coprime to n.
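A sketch of the scheme in Python, using toy primes far too small for real security; the function names and parameters are illustrative only:

```python
from math import gcd

def bbs_bits(p, q, seed):
    """Blum Blum Shub: p and q are primes congruent to 3 (mod 4), and
    the seed must be coprime to n = p*q.  Yields the least-significant
    bit of each successive square."""
    n = p * q
    assert p % 4 == 3 and q % 4 == 3 and gcd(seed, n) == 1
    x = seed * seed % n            # x_0 = s^2 (mod n)
    while True:
        x = x * x % n              # x_k = x_{k-1}^2 (mod n)
        yield x & 1

def bbs_byte_stream(p, q, seed):
    """Pack bits into bytes for xor-ing against plaintext characters."""
    bits = bbs_bits(p, q, seed)
    while True:
        byte = 0
        for _ in range(8):
            byte = byte << 1 | next(bits)
        yield byte

def crypt(message, p, q, seed):
    """Encrypt or decrypt: the xor makes the operation its own inverse."""
    return bytes(b ^ k for b, k in zip(message, bbs_byte_stream(p, q, seed)))
```

Because encryption is xor with the keystream, applying crypt twice with the same parameters returns the original message.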

Your task is to write a function that enciphers and deciphers a stream of characters as described above. When you are finished, you are welcome to read or run a suggested solution, or to post your solution or discuss the exercise in the comments below.

Pairing Heaps

August 14, 2009

We previously implemented priority queues using leftist heaps. In this exercise we will implement the same library using a different data structure known as a pairing heap. Pairing heaps are an unusual data structure; they are simple to implement, and performance is good, but their time complexity is not proved, only conjectured, and the analysis has defied many good algorithmists. Our presentation is based on Chris Okasaki’s book Purely Functional Data Structures.

Pairing heaps are heap-ordered multi-way trees, with each node of each tree less than its children. A pairing heap is either empty or is a node consisting of an item and a list of pairing heaps, none of which may be empty. The find-first function simply returns the item at the root of the tree of nodes. Merge takes two trees and makes the tree with the larger root the leftmost child of the tree with the smaller root. Insert builds a singleton node, then calls merge.

The fundamental operation on pairing heaps is the delete-first operation, which discards the root of the tree and merges its children in two passes. The first pass runs from left to right, merging child nodes pair-wise, the first child with the second, the third child with the fourth, and so on. The second pass merges the resulting trees, again pair-wise, from right to left.
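The operations described above are short enough to sketch directly; here is one possible Python rendering, representing the empty heap as None and a non-empty heap as a pair of an item and a list of child heaps (the representation and names are mine, not Okasaki's):

```python
def find_first(heap):
    """The smallest item sits at the root."""
    return heap[0]

def merge(h1, h2):
    """Make the heap with the larger root the leftmost child of the
    heap with the smaller root."""
    if h1 is None: return h2
    if h2 is None: return h1
    if h1[0] <= h2[0]:
        return (h1[0], [h2] + h1[1])
    return (h2[0], [h1] + h2[1])

def insert(item, heap):
    """Build a singleton node, then call merge."""
    return merge((item, []), heap)

def delete_first(heap):
    """Discard the root, merge its children pair-wise left to right,
    then merge the results into one heap right to left."""
    kids = heap[1]
    pairs = [merge(kids[i], kids[i + 1]) if i + 1 < len(kids) else kids[i]
             for i in range(0, len(kids), 2)]
    result = None
    for h in reversed(pairs):
        result = merge(h, result)
    return result
```

Repeatedly calling find_first and delete_first drains the heap in sorted order, which makes a convenient smoke test.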

Your task is to implement priority queues using pairing heaps, using the same interface as the prior exercise. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

Update: The Daily WTF

August 14, 2009

Alex Papadimoulis has agreed to change the name of his series of programming exercises, thus ending the dispute between us. I thank him for agreeing to the change, and wish him well.

To all of my regular readers: I apologize again for involving you in this dispute. The next exercise will be published in a few hours, and I hope you are as anxious to get back to programming as I am.

To all those people who found my blog for the first time: Welcome! Please stay. Enjoy the exercises. And please post your solutions, so we may all learn from your work.

The Daily WTF, a web site that chronicles “curious perversions in information technology,” recently introduced a new feature called Programming Praxis in which simple programming exercises are assigned to readers who post their solutions and discuss the exercise in the comments. Alex Papadimoulis runs The Daily WTF.

On June 23rd, Programming Praxis published an exercise based on one of the stories at The Daily WTF. That was done only after consulting with The Daily WTF to ensure there was no copyright violation, and credited The Daily WTF as the source of the exercise, even providing a link back to the original The Daily WTF article.

Papadimoulis liked what he saw at Programming Praxis, and began discussing with me some kind of collaboration between the two web sites. After some discussion, on July 22nd The Daily WTF published a programming exercise of its own, based on the Russian peasant multiplication algorithm. That article used the phrase “Programming Praxis” in its title, and credited me with the idea, but did not refer to the Programming Praxis web site. That article was a success, generating over seven hundred comments with a high signal-to-noise ratio, and Papadimoulis and I began seriously discussing a collaboration.

While we were discussing how a collaboration would work, on July 29th The Daily WTF published a second article that also used the phrase “Programming Praxis” in its title, and borrowed the exercise from one previously discussed at Programming Praxis, but did not credit me or refer to the Programming Praxis web site.

At that point discussions about collaboration broke down. The problem was that the two sites had different goals: The Daily WTF is primarily entertainment, and Programming Praxis is primarily educational. The difference was highlighted by the decision to use the Josephus problem; Papadimoulis selected that problem because he thought of a neat way to use an animated gif to show how the soldiers die. I notified Papadimoulis that no collaboration was possible, and asked him not to use the name “Programming Praxis” in any future exercises he might publish.

Papadimoulis never responded to my email, but did respond on The Daily WTF by publishing on August 5th another exercise based on a common mathematical problem. That article used the phrase “Programming Praxis” in its title, and Papadimoulis wrote, in the first comment, that “Programming Praxis” now had its own category on The Daily WTF; he also asked readers to submit tips for future “Programming Praxis” articles.

The name “Programming Praxis” belongs to me, not Papadimoulis. I have been publishing under that name twice a week for six months, and own the programmingpraxis.com domain. Papadimoulis is using the name without my permission, and against my expressed wishes.

Papadimoulis’ improper use of the name has already caused confusion in the marketplace of ideas. At proggit (http://www.reddit.com/r/programming/comments/95o33/programming_praxis_josephus_circle/), a reader named “bhrgunatha” says “I think the real WTF here is they are taking these exercises from the actual programming praxis site apparently against the license.”

After publication of the third exercise, I sent an email demanding that Papadimoulis cease and desist from using the phrase “Programming Praxis” to describe his weekly programming exercises. After four days, Papadimoulis responded that it was now too late to change the name, and that we would have to be happy to share it. I am not happy to share my name, and on Monday I sent a registered letter for next-day delivery demanding that Papadimoulis cease and desist from using the phrase “Programming Praxis.” However, The Daily WTF continues to use the phrase “Programming Praxis,” publishing yet another exercise under that name today.

A log of emails between Papadimoulis and me, complete except for a few emails arranging the time of our telephone conversation, appears on the next page. The emails show the history of the situation as it transpired.

I must defend my name. Thus, I am publishing this account of what happened. I will also explore trademark protection for my name, and such other legal action as may be required.

Thank you to all my regular readers for listening to my story. If you wish to help, you may act to provide wide attention to this situation in the blogosphere; feel free to post a link to this blog entry to your favorite forum, and add your comments. Your code, your comments and your private emails inspire me to continue publishing Programming Praxis. I apologize that I must engage you in this ugliness.

/s/ Philip L. Bewig

Someone mentioned Uncle Bob’s Bowling Game Kata to me a few days ago; it is apparently famous as a tutorial on object-oriented programming and test-driven development, though I had never heard of either Uncle Bob or his Bowling Game Kata.

A game of tenpins bowling lasts ten frames, in each of which the bowler makes one or two attempts to knock down ten pins arranged in a triangle. If the bowler knocks down all ten pins on the first attempt (that’s called a “strike”), he scores ten pins plus the number of pins knocked down on his next two rolls. If the bowler knocks down all ten pins after two attempts (that’s called a “spare”), he scores ten pins plus the number of pins knocked down on his next roll. If the bowler fails to knock down all ten pins (that’s called an “open frame”), he scores the number of pins he knocked down. The scores accumulate through all ten frames. At the last frame, if necessary, the pins are reset for one or two additional rolls to count the final bonus. The sample scoresheet below shows how the calculations work:

Drawing by my daughter Kate

For instance, the score in the second frame is 9, the sum of the two balls in the open frame. The score in the third frame is 15, which is 10 for the spare plus 5 on the next ball. The score in the ninth frame is 20, which is 10 for the strike plus 10 for the spare on the first two balls of the tenth frame. The score in the tenth frame is 16, which is 10 for the spare plus 6 for the extra ball, which is a bonus ball not really part of any frame (the two balls of the tenth frame have already been rolled).
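The scoring rules can be sketched in Python by treating the game as a flat list of rolls, including any bonus rolls after the tenth frame (the function name is illustrative, not the suggested solution):

```python
def bowling_score(rolls):
    """Score a complete game of tenpins given the pinfall of each roll
    in order."""
    score, i = 0, 0
    for frame in range(10):
        if rolls[i] == 10:                   # strike: next two rolls are bonus
            score += 10 + rolls[i + 1] + rolls[i + 2]
            i += 1
        elif rolls[i] + rolls[i + 1] == 10:  # spare: next roll is bonus
            score += 10 + rolls[i + 2]
            i += 2
        else:                                # open frame: just the pinfall
            score += rolls[i] + rolls[i + 1]
            i += 2
    return score
```

A perfect game of twelve strikes scores 300, and a game of all fives (ten spares plus a bonus ball) scores 150.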

Your task is to write a function that calculates the score of a tenpins bowling game. When you are finished, you are welcome to read or run a suggested solution, or to post your solution or discuss the exercise in the comments below.

ADFGX

August 7, 2009

The ADFGX cipher, and its later sibling the ADFGVX cipher, were field ciphers used by the German Army during the First World War. It is called ADFGX because those are the only letters that appear in the ciphertext, chosen to reduce operator error during Morse code transmission because they sound so different.

ADFGX is a fractionating transposition cipher invented by German Army Colonel Fritz Nebel in March 1918, and famously cryptanalyzed by French Army Lieutenant Georges Painvin in April 1918. The French version of the story is that Painvin broke the cipher in a single, sustained forty-eight hour session, reading a message that suggested where Ludendorff intended to attack, allowing the French to foil the attack and win the war; the German version differs, but the winners write the history.

ADFGX works in two phases. The first uses a 5×5 Polybius square to represent the alphabet, with I and J combined:

  A D F G X
A b t a l p
D d h o z k
F q f v s n
G g j c u x
X m r e w y

The plaintext message is converted to an intermediate text by noting the row and column headers where each character is located; for instance:

P  R  O  G  R  A  M  M  I  N  G  P  R  A  X  I  S
AX XD DF GA XD AF XA XA GD FX GA AX XD AF GX GD FG

Then, the partially-enciphered message is subjected to columnar transposition; the original German version of the cipher added nulls at the end of short columns, but we will eschew that method in favor of the normal columnar transposition. Cryptographically, this is a rather strong method, especially in the days when cryptanalysis was done without the aid of computers, because the two ciphertext characters that represent a single plaintext character are split, or fractionated, from one another. Using the transposition keyword TULIP, the transposition works like this:

T U L I P
3 4 1 0 2
A X X D D
F G A X D
A F X A X
A G D F X
G A A X X
D A F G X
G D F G

Then the columns are read off in order: DXAFXGGXAXDAFFDDXXXXAFAAGDGXGFGAAD.
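Here is a sketch of the two phases in Python, using the square and keyword from the example above; it assumes the keyword has no repeated letters, and the helper names are mine:

```python
# The Polybius square and labels from the example above; 'i' is folded
# into 'j', which is the letter that appears in the square.
SQUARE = ["btalp", "dhozk", "qfvsn", "gjcux", "mrewy"]
LABELS = "ADFGX"

def substitute(plaintext):
    """Phase 1: replace each letter by its row and column labels."""
    out = []
    for ch in plaintext.lower().replace("i", "j"):
        for r, row in enumerate(SQUARE):
            c = row.find(ch)
            if c >= 0:
                out.append(LABELS[r] + LABELS[c])
                break
    return "".join(out)

def transpose(text, key):
    """Phase 2: write the text in rows under the key, then read the
    columns in alphabetical order of the key letters."""
    columns = {k: text[i::len(key)] for i, k in enumerate(key)}
    return "".join(columns[k] for k in sorted(key))

def adfgx_encrypt(plaintext, key):
    return transpose(substitute(plaintext), key)
```

With these definitions, adfgx_encrypt("PROGRAMMINGPRAXIS", "TULIP") produces the ciphertext shown above; decryption inverts the two phases in the opposite order.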

Decipherment is the inverse operation.

There are two keys, the polybius square and the transposition. German doctrine called for the square to be changed weekly and the transposition key to be changed daily, and use of the ADFGX cipher was restricted to high-level communications only. Painvin attacked the cipher by comparing multiple messages with identical beginnings, from which he was able to work out the transposition matrix (he kept swapping columns until a frequency count of digrams had the same shape as German-army plain-text), then he solved the remaining monoalphabetic substitution cipher on digrams by frequency analysis. Though the cipher is no longer secure, the combination of fractionated substitution and transposition is still used as the basis of some modern ciphers, including DES.

Your task is to implement encryption and decryption of the ADFGX cipher. When you are finished, you are welcome to read or run a suggested solution, or to post your solution or discuss the exercise in the comments below.

With this exercise, we return to the original arrangement in which the solution appears on the same day as the exercise. Delaying the solution makes the blog harder to maintain, and confuses readers who don't know where to write their comments. Solutions that appeared in the last few weeks have been merged with their corresponding exercises; in particular, the solution to Lenstra's Algorithm that was promised for today has been posted along with the original exercise.

Lenstra’s Algorithm

August 4, 2009

Hendrik Lenstra devised the elliptic curve factorization algorithm in 1987, an algorithm that is simultaneously elegant and of immense practical importance. This exercise describes his original algorithm. With some algorithmic tweaks that we won’t discuss here, Lenstra’s algorithm is generally quick to find factors in the 20- to 25-digit range, slower to find factors in the 40-digit range, and ineffective if the factors are much larger than about 50 digits; at this writing, the largest factor found by the elliptic curve algorithm has 67 digits.

Lenstra’s elliptic curve factorization algorithm works by searching for a point on the elliptic pseudo-curve Ea,b (mod n), where n = pq, that exists modulo p but not modulo q, or vice versa. In the previous exercise, we searched for that point by picking an initial point and repeatedly adding it to multiples of itself until the elliptic algebra failed.

Instead of repeated addition, Lenstra speeds the search by borrowing from Pollard’s p-1 method the idea of searching over a large product of small integers. Pollard’s method uses the least common multiple of numerous small integers, but Lenstra instead uses a large factorial. For instance, if you multiply a point on an elliptic pseudo-curve by 10000!, you are likely to find a factor of a 50-digit number somewhere in the middle of the computation. But Lenstra’s method admits a continuation that Pollard’s lacks; if one elliptic curve fails, you can choose another and try again. Or you can increase the multiplier; if you don’t find a factor with a limit of 10000! and 1000 curves, you can try again with a limit of 1000000! and 1000000 curves.

Of course, there are some details to fill in, and instead of computing 10000! and then performing the elliptic multiplication, the work is done in small steps, checking the elliptic algebra after each. Here is Lenstra’s original algorithm, searching a single curve to a given limit:

To get started, choose a parameter for the limit, choose a random pseudo-elliptic curve Ea,b (mod n) and choose a random point (x,y) on the curve. Calculate the discriminant d = 4a^3 + 27b^2 of the curve; if d = 0, the curve has a cusp or is self-intersecting, so choose a different curve. Then you can check if you were lucky; if gcd(d,n) > 1, that gcd is a factor, so report it and quit. But most often d and n are co-prime, and you must continue.

The main body of Lenstra’s algorithm is a double-nested loop starting with random point (x,y):

for each prime p less than the limit
    for as many times as the base-p integer logarithm of the limit
        calculate a new (x,y) on Ea,b (mod n) as p times the old (x,y)
        if the calculation fails, report the finding of a factor

If you find a factor, the loop exits early; if the loop finishes, you can either choose a different curve, or increase the limit, or quit and report failure.

The elliptic addition, doubling and multiplication functions from the prior exercise need to be modified to recognize an error of the elliptic algebra before it occurs. For instance, when adding two points, first compute the denominator of the slope, x2 - x1, and check that it has an inverse by calculating gcd(x2 - x1, n) before computing the slope and determining the sum of the two points. If the gcd is 1, you can proceed to calculate the inverse; if it is n, you have been horribly unlucky (the point you are calculating isn’t on either of the two true sub-curves of the pseudo-curve) and you must exit the loop with failure; but if the gcd is strictly between 1 and n, it is a factor of n, so you can report it and quit. You may wish to calculate the integer logarithm in the inner loop by multiplying an accumulating product by the base p during the loop, stopping the loop when the accumulator exceeds the limit. The calculation of the large factorial is implicit in the two loops.
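The modified addition might be sketched in Python as follows; this is illustrative only (it ignores the point at infinity and assumes the two points are not inverses of one another), and the exception-based error reporting is my own choice:

```python
from math import gcd

class FactorFound(Exception):
    """Raised when the elliptic algebra fails in the useful way."""
    def __init__(self, d):
        super().__init__(d)
        self.d = d

def ec_add(p1, p2, a, n):
    """Add two points on the pseudo-curve Ea,b (mod n), checking the
    gcd of the slope's denominator before inverting it."""
    (x1, y1), (x2, y2) = p1, p2
    if (x1, y1) == (x2, y2):
        num, den = 3 * x1 * x1 + a, 2 * y1   # doubling
    else:
        num, den = y2 - y1, x2 - x1
    g = gcd(den % n, n)
    if g == n:
        raise ValueError("point vanishes on both sub-curves")
    if g > 1:
        raise FactorFound(g)                 # a nontrivial factor of n
    m = num * pow(den, -1, n) % n            # slope; pow(-1) needs Python 3.8+
    x3 = (m * m - x1 - x2) % n
    return (x3, (m * (x1 - x3) - y1) % n)
```

The factorization loop catches FactorFound and reports its payload; ordinary additions proceed exactly as on a true curve modulo a prime.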

It is hard to choose the limit and the number of curves. In his original paper, Lenstra gives a formula for calculating exactly the limit and the number of curves, but unfortunately, the formula relies on the factorization, which of course is not yet known; his formula does give him a good way to prove the time-complexity bound, however. There are several papers that examine the question, but specific advice must be based on the specific implementation. The only general advice we can give is that the limit and number of curves must increase as the number to be factored increases, using too big or too small a limit and number of curves will cost you time rather than save it, and you will have to experiment with your implementation to know what works for you. Be aware that there can be considerable variance in the time it takes to perform a factorization, depending on how the random number generator chooses the parameters of the search.

Lenstra’s algorithm, as given above, searches a single curve to a given limit. To make a complete factorization, you must arrange to call it repeatedly in the event of failure, and then check the result, which may itself be composite, calling Lenstra’s algorithm recursively until the factorization is complete. And any real implementation of Lenstra’s algorithm first uses some light-weight method, such as trial division by small primes or Pollard’s rho algorithm, to quickly reduce the scope of a factorization.

Your task is to write a function that factors integers using Lenstra’s algorithm. Use your function to find the factors of (10^81-1)/9; you may wish to do a timing comparison with Pollard’s rho factorization function. When you are finished, you are welcome to read or run a suggested solution, or to post your solution or discuss the exercise in the comments below.
