Steganography

June 10, 2011

In his book Dead or Alive, Tom Clancy describes a cryptographic system used by terrorists. His description is incomplete, but it seems to be a two-stage system, with a hand-operable cipher hidden by steganography inside images on a web site. Clancy talks about a one-time pad that doesn’t really seem to be a one-time pad and creates a stream of two-digit numbers using the middle-square method; it may sound good to his readers, but even my limited knowledge of cryptography suggests it’s bad crypto. Or, on one crypto forum where I asked about it, “really really awful” crypto.

Let’s see if we can do better than Clancy. We have four objectives: The system must be hand-operable by terrorists in similar situations to Clancy’s. The system must use both cryptography and steganography, as Clancy’s did. The system must be easily explainable in the context of a novel such as Clancy’s. And the system must be reasonably secure, certainly better than Clancy’s “really, really, really awful” system.

We’ll use Playfair for the cipher and hide our message inside the text of a typical spam email — everybody ignores spam, anyway, so what better place to hide a message? For Playfair we’ll use an 8×8 grid with 26 upper-case letters, 26 lower-case letters, 10 digits, a space, and a period as the only punctuation character. The daily passphrase is the first sentence of the lead editorial in the Wall Street Journal; as I write this on June 6th, the passphrase is “President Obama’s visit to a Chrysler plant in Toledo, Ohio, on Friday was the culmination of a campaign to portray the auto bailouts as a brilliant success with no unpleasant side effects.” We’ll refer to the previous exercise for details. If you don’t like Playfair, bifid makes a reasonable alternative.

The primary point of today’s exercise is to discuss steganography, a word which derives from the Greek for “hidden writing;” cryptography, by contrast is “secret writing.” In ancient times, steganography was performed by shaving a slave’s head, tattooing the message on his scalp, waiting for his hair to grow back, then having him travel to the intended recipient; nowadays, there are numerous programs that hide a message inside a JPEG image. We’ll hide our message in a plain text spam message like this one:

Get V I A G R A today!!!! Call (638)555-1212!!!!!

subduct mythos qua backrest chanter Kioto cronyism Lettish Badajoz Saida moody megavolt gondola coward Tibetan stoss andiron magenta Biisk Henry tumbler coquet SHAPE affable flattery blear Bahaism lance meteor limbate hit anyway yoni Hengist phaeton Papua snick whiffle ankh Firdausi Chaplin triolein ampliate hum putsch desire buttocks Golconda groat fickle mensural utopia oecology scapula bruit Stuart foamy Jane futures Vedic Halifax misquote agitate whereon resonate melodic aground smoky muezzin riddance Aarau dB elm robin bugloss duckbill pe last pow chanter winglet temporal yeanling Sidon Auckland regimen Cheviot skatole gobo splenic neolith amid braiding lowlife riant Sunnite styrene ywis teacart flyspeck deplore chyack Titan Percy hidalgo sniffle unbridle zig kinsfolk immense opaline bebeeru heeled topsail yurt lobby trucking stridor Selden mullet

We’ve all seen spam like that; the extra words are intended to get the message past the spam trap. We can hide a message in the spam in this manner: Each word after the empty line represents a binary 1-bit if it has an odd number of characters and a binary 0-bit if it has an even number of characters. A word is a maximal sequence of non-white characters.

Your task is to write functions that perform encryption and decryption using the system described above. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

Pages: 1 2

Big Numbers: Division

June 7, 2011

We continue our examination of big-integer arithmetic today with a look at division. Long division was hard in grade school, and it’s hard for computers, too, with tricky algorithms and lots of special cases. Fortunately, Donald Knuth has made things easier for us, and we will be following his Algorithm 4.3.1 D.

Division takes as input two numbers, n (the dividend, also called the numerator) and d (the divisor, also called the denominator) and returns two numbers q (the quotient) and r (the remainder) such that q · d + r = n, with 0 ≤ r < d. The basic idea of Algorithm D is to take successive partial divisions, where in each case the partial dividend has one more digit than the divisor, each successive partial division revealing one more digit of the quotient. As with the other arithmetic operators, there is a notion of carry from one partial division to the next.

We’re not going to give a detailed explanation here, because Knuth says it far better; you can run to your nearest copy of Knuth, or peek at the solution. Beware that the code is lengthy, and there are lots of tricky bits, and plan from the start to do lots of debugging.

Your task is to write a function that performs division on big integers. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

Pages: 1 2 3

Mersenne Primes

June 3, 2011

Numbers of the form Mn = 2n−1 are known as Mersenne numbers, named for the French monk Marin Mersenne who studied them early in the seventeenth century. Of the infinite set of Mersenne numbers, 47 are currently known to be prime; see Sloane’s A000043 for a list of their indices. Mersenne primes can be identified by the Lucas-Lehmer test, devised by Édouard Lucas in the 1870s and cast into its modern form by Derrick Lehmer in 1930:

For p an odd prime, the Mersenne number 2p−1 is prime if and only if 2p−1 divides Sp−1 where Sn+1 = Sn2−2 and S1 = 4.

The special form of Mersenne primes makes them easy to identify, and for many years the largest known prime has been a Mersenne prime. A cooperative project on the internet, the Great Internet Mersenne Prime Search (GIMPS), has found all of the recent Mersenne primes, because the numbers have grown so large that a single computer can’t handle the workload.

Your task is to use the Lucas-Lehmer test to find the Mersenne primes through M256. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

Pages: 1 2

In the previous exercise, we decided on a representation for big numbers and wrote some simple functions concerning them. In today’s exercise we look at addition, subtraction, and multiplication. We will use the standard grade-school algorithms; there are better ways to do multiplication, but it’s best to get something working first, and maybe later we can improve it.

The trick with all of these functions is to handle the sign separately from the arithmetic. Consider addition. If you want to calculate a + b = c, you must first figure out which of |a| + |b|, |a| − |b|, or |b| − |a| is equal to |c|, perform the appropriate arithmetic, then attach the appropriate sign. Likewise subtraction, except that the result may have fewer big-digits than either input, so you have to be careful to remove redundant leading zeros from the result. Multiplication is simpler; the result is positive if the signs are the same, otherwise negative.

For addition, the arithmetic is done by iterating over the big-digits from least to most significant, adding the two big-digits in each position and carrying to the next position any excess over the digit base. Since the carry is always either 0 or 1, it is not possible for a single carry to propagate beyond the next big-digit (though several big-digit sums may incur carries in succession). In those cases where the addition is done by subtraction, the carry (in grade-school they called it borrow) is either 0 or -1, and it may indeed propagate to multiple big-digits (think of decimal 1000 − 1, where the borrow propagates three positions).

Subtraction is done by changing the sign and calling the function for addition, in order to eliminate the need for duplicating code.

Multiplication is only a little bit harder. The grade-school method forms partial products for each big-digit, and adds all the partial products at the end, but it is easier to add the partial products at each big-digit. The carry may be greater than 1, but it is still true that a single carry cannot propagate beyond the next big-digit.

Your task is to write functions that perform big number addition, subtraction and multiplication. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

Pages: 1 2

Upside Up

May 27, 2011

An “upside up” number is a number that reads the same when it is rotated 180°. For instance, 689 and 1961 are upside up numbers.

Your task is to find the next upside up number greater than 1961, and to count the number of upside up numbers less than ten thousand. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

Pages: 1 2

Over the next few exercises we’ll be developing a library of functions that provide arithmetic on big numbers: integers that are too large to fit into a machine word. Most languages, including Scheme, provide big numbers, either natively or through a standard library, so you are unlikely to need a library for big numbers. On the other hand, coding big numbers is fun, and makes a good exercise. We’ll get started today by defining a representation for big numbers and writing a few small functions.

The first decision to make when writing a big number library is to choose the radix in which numbers are represented. With binary computers, it is most economical to make the radix the largest convenient power of two; for a 32-bit CPU, the preferred radix is 216, which means that single-place multiplication plus a carry never overflows a machine word. During development, we choose a radix of 1000 (“millenial digits”), which are convenient for debugging since the numbers are easily read by human eyes, but we’ll be careful not to rely on the radix, so that it is easy to switch to a larger radix when development is finished.

The second decision is how to represent numbers. A number is, of course, represented as a polynomial in the base of the radix, with the digits stored separately. We use a list, rather than an array, since lists are more natural in Scheme, and since lists don’t impose any additional limits on the size of the numbers. The exact representation we choose is called the signed magnitude representation, where the head of the list gives the number of digits in the list, as well as the sign of the number, and the tail of the list gives the digits themselves, least-significant digit first, always stored as a positive number. Thus, the number 12345678 is stored in our representation as the list (3 678 345 12) and the number -87654321 is stored as (-3 321 654 87), 1 is (1 1), -1 is (-1 1), and zero is (0). Among other things, this representation makes it easy to compare numbers: first compare the signed magnitude, and only compare the numbers digit-by-digit if the signed magnitudes are the same.

In today’s exercise we will implement the following procedures: functions to convert between big numbers and the native numbers of the underlying language, functions to take the absolute value of a big number and to negate a big number, predicates to identify positive, negative or zero big numbers and even or odd big numbers, and functions to compare two big numbers.

Your task is to implement those big number functions described using a signed-magnitude representation; we’ll write functions to do arithmetic on big numbers in future exercises. When you are finished, you are welcome to read or run a suggested solution, or to discuss the exercise or post your own solution in the comments below.

Pages: 1 2

ISBN Validation

May 20, 2011

In a previous exercise we studied Luhn’s algorithm for validating credit card numbers. In today’s exercise we will study the algorithms for validating 10-digit and 13-digit ISBN book numbers, and fetch the author and title of the book from an internet server.

The term ISBN refers to the 10-digit book number that libraries and bookstores use to catalog their collections. The ISBN number consists of four variable-length parts, separated by dashes or spaces, that always have 10 digits in total: the first part is a region, the second part is a publisher, the third part is a title, and the fourth part is a modulo-11 check digit, which may be a digit 0 through 9, or the letter X, which represents 10. To validate the number, ignore the dashes and spaces, multiply the ten digits by 10, 9, 8, …, 1 and sum the digital products; if the sum is 0 modulo 11, the ISBN number is valid.

The term EAN refers to the new 13-digit book numbers. The EAN number consists of five variable-length parts, separated by dashes or spaces, that always have 13 digits in total: the first part is the three digits 9 7 8, the second, third and fourth parts are the region, publisher and title of the ISBN, and the fifth part is a check digit. To validate the number, ignore the dashes and spaces, multiply the ten digits by 1, 3, 1, 3, …, 1 (alternating 1 and 3) and sum the digital products; if the sum is 0 modulo 10, the EAN number is valid.

In practice, the dashes or spaces are frequently omitted. It is easy to convert between ISBN and EAN by adding or deleting the 978 prefix and recomputing the check digit. The standard definitions of ISBN and EAN are available at http://www.isbn.org/standards/home/isbn/international/html/usm7.htm. There are several internet servers that provide free ISBN lookup services; we will use the one at http://isbndb.com.

Your task is to write functions that validate ISBN and EAN numbers, convert between them, and fetch the title and author associated with an ISBN from an internet server. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

Pages: 1 2

Two Bad Sorts

May 17, 2011

We have some fun today implementing two really bad sorting algorithms.

Stooge sort is a recursively-defined algorithm that sorts n items in O(nlog 3 / log 1.5) comparisons: If the value at the end of the list is less than the value at the start of the list swap them. Then, if there are three are more items in the list, recursively sort the first two-thirds of the items in the list, then the last two-thirds of the items in the list, and finally (again) the first two-thirds of the items in the list. This is so bad that it takes a few seconds just to convince yourself that it actually works.

Bogosort is particularly bad, with factorial time complexity, on average, and no guarantee that it ever terminates: collect the items in the list in random order. If they are sorted, return them, otherwise try again with a new random ordering of the items in the list.

Your task is to implement stooge sort and bogosort. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

Pages: 1 2

In 1981, John D. Dixon, a mathematician at Carleton University, developed the integer factorization algorithm that bears his name. Dixon’s algorithm is not used in practice, because it is quite slow, but it is important in the realm of number theory because it is the only sub-exponential factoring algorithm with a deterministic (not conjectured) run time, and it is the precursor to the quadratic sieve factorization algorithm, which is eminently practical.

Dixon’s algorithm is based on the well-known fact of number theory that, given x2y2 (mod n), with x ≠ ±y (mod n), it is likely that gcd(xy, n) will be a factor of n. The algorithm is similar to the CFRAC algorithm of Morrison and Brillhart: the first phase finds smooth relations, and the second phase uses linear algebra to combine them and form a factorization. The difference is the first phase, where CFRAC looks in the convergents of the continued fraction of the square root of the number being factored, but Dixon’s algorithm simply chooses numbers at random.

Thus, the first stage of Dixon’s method is this: choose a bound B and identify the factor base of all primes less than B that are quadratic residues of n (their jacobi symbol is 1). Then choose a positive integer r < n, calculate its square modulo n, and if the square is smooth over the factor base, add r and its square to the linear algebra; repeat this step until you have found a sufficient number of smooth squares. The second stage is identical to CFRAC.

Your task is to write a function that factors integers by Dixon’s algorithm. When you are finished you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

Pages: 1 2

Comm

May 10, 2011

We have today another exercise in our continuing series of Unix V7 commands:

NAME

comm — select or reject lines common to two sorted files

SYNOPSIS

comm [ -[123] ] file1 file2

DESCRIPTION

Comm reads file1 and file2, which should be ordered in ASCII collating sequence, and produces a three column output: lines only in file1; lines only in file2; and lines in both files. The filename ‘-‘ means the standard input. Flags 1, 2, or 3 suppress printing of the corresponding column. Thus comm -12 prints only the lines common to the two files; comm -23 prints only lines in the first file but not in the second; comm -123 is a no-op.

Your task is to implement the comm command. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

Pages: 1 2