Run Length Encoding

February 26, 2010

A book by Brian Kernighan and P. J. Plauger, Software Tools in Pascal, describes a pair of programs, called compress and expand, that provide run-length encoding of text files. The compress program takes a text file as input and writes a (hopefully smaller) version of the text file as output; the expand program inverts that operation. Compression is achieved by replacing runs of four or more of the same character with a three-character code consisting of a tilde, a letter A through Z indicating 1 through 26 repetitions, and the character to be repeated. Runs longer than 26 characters are replaced by multiple encodings, andany literal appearance of the tilde in the input is encoded as a run of length 1. For instance, the string ABBB~CDDDDDEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE is encoded as ABBB~A~C~ED~ZE~DE.

Your task is to write programs that compress and expand a text file. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

Advertisement

Pages: 1 2

Engineering A Sort Function

February 23, 2010

We examined quicksort in a previous exercise. Today we examine a version of quicksort that Jon Bentley spoke about at Google, reprising the quicksort function that he and Doug McIlroy explained in their article “Engineering a Sort Function” in Software—Practice and Experience (Volume 23, Number 11, Pages 1249–1265, November 1993).

The Bentley/McIlroy quicksort is characterized by an adaptive pivot selection, a fast approaching-pointers partition, use of ternary comparison operator (less-than, equal-to, greater-than) to enable a “fat pivot,” and insertion sort for small sub-arrays. The quicksort also uses a fast swap that adapts to the size of the data elements. The adaptive pivot selection uses the middle element for small arrays, the median of the first, middle and last elements for medium-sized arrays, and the median of nine equally-spaced elements for larger arrays. This description doesn’t really do justice to the code; you should read the article for details.

Your task is to write a quicksort function using the same algorithm as Bentley and McIlroy. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

Pages: 1 2

Sieve Of Atkin, Improved

February 19, 2010

We examined the Sieve of Atkin for enumerating prime numbers in a recent exercise. Though the Sieve of Atkin has the potential to be faster than the Sieve of Eratosthenes, our implementation was not; the Sieve of Eratosthenes takes 1.219 seconds to enumerate the first million primes, our implementation of the Sieve of Atkins takes 22.954 seconds, nineteen times as long, to perform the same task.

The problem is that the naive implementation we used performs useless work. For instance, consider that you are sieving the numbers to a million, with the x and y variables of the naive implementation ranging from one to a thousand. In the first case, with 4x² + y² = k, 4·1000² + 1000² = 5000000, which is far beyond the end of the range being sieved. Reducing the limits means a lot less work and a much faster program.

Your task is to rewrite the naive Sieve of Atkins to eliminate useless work. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

Pages: 1 2

Soundex

February 16, 2010

Soundex is an algorithm originally developed by Margaret Odell and Robert Russell in the early 1900s to convert people’s last names into a standard encoding in a way that brings together similar names, thus correcting misspellings. Knuth (AoCP3, introduction to Chapter 6) describes the algorithm as follows:

  1. Retain the first letter of the name, and drop all occurrences of a, e, h, i, o, u, w, y in other positions.
  2. Assign the following numbers to the remaining letters after the first:
    • b, f, p, v → 1
    • c, g, j, k, q, s, x, z → 2
    • d, t → 3
    • l → 4
    • m, n → 5
    • r → 6
  3. If two or more letters with the same code were adjacent in the original name (before step 1), omit all but the first.
  4. Convert to the form “letter, digit, digit, digit” by adding trailing zeros (if there are less than three digits), or by dropping rightmost digits (if there are more than three).

The soundex method isn’t perfect. As Knuth points out, related names like Rogers and Rodgers, or Sinclair and St. Clair, or Tchebysheff and Chebyshev, are not matched. On the other hand, soundex works well enough to be generally useful, and if your first attempt at a match fails, it isn’t too hard to try again with an alternate spelling.

Your task is to write a function that takes a name and returns its associated soundex code. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

Pages: 1 2

Sieve Of Atkin

February 12, 2010

We examined the use of the Sieve of Eratosthenes for enumerating the prime numbers, a method that dates to the ancient Greeks, in two previous exercises. Just a few years ago, in 2004, A. O. L. Atkin and D. J. Bernstein developed a new method, called the Sieve of Atkin, that performs the same task more quickly. The new method first marks squarefree potential primes, then eliminates those potential primes that are not squarefree.

Atkin’s sieve begins with a boolean array (bits are fine) of length n equal to the number of items to be sieved; each element of the array is initially false.

Squarefree potential primes are marked like this: For each element k of the array for which k ≡ 1 (mod 12) or k ≡ 5 (mod 12) and there exists some 4x² + y² = k for positive integers x and y, flip the sieve element (set true elements to false, and false elements to true). For each element k of the array for which k ≡ 7 (mod 12) and there exists some 3x² + y² = k for positive integers x and y, flip the sieve element. For each element k of the array for which k ≡ 11 (mod 12) and there exists some 3x² – y² = k for positive integers x and y with x > y, flip the sieve element.

Once this preprocessing is complete, the actual sieving is performed by running through the sieve starting at 7. For each true element of the array, mark all multiples of the square of the element as false (regardless of their current setting). The remaining true elements are prime.

Your task is to write a function to sieve primes using the Sieve of Atkin. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

Pages: 1 2

Numerical Integration

February 9, 2010

Computing the definite integral of a function is fundamental to many computations. In many cases, the integral of a function can be determined symbolically and the definite integral computed directly. Another approach, which we shall examine in today’s exercise, is to compute the definite integral numerically, an operation that numerical analysts call quadrature. We will consider a function f for which we are to find the definite integral over the range a to b, with a < b.

Recall from your study of calculus that the definite integral is the area under a curve. If we slice the curve into hundreds or thousands or millions of vertical strips, we can approximate the area under the curve by summing the areas of the individual strips, assuming that each is a rectangle with height equal to the value of the curve at the midpoint of the strip; this is, obviously, the rectangular method of quadrature. The trapezoidal method is similar, but uses the average of the value of the curve at the two ends of the strip; the strip then forms a trapezoid with parallel sides, a flat bottom on the horizontal axis, and a slanted top that follows the curve. A final method, called Simpson’s method, fits a parabola to the top of the strip by taking the height of the curve as its value at the left side of the strip plus its value at the right side of the strip plus four times its value at the midpoint of the strip. Accuracy improves as we move from the rectangular method (one point) to the trapezoidal method (two points) to Simpson’s method (three points), because the approximation more closely fits the curve, and also as the number of slices increases. The trapezoid rule is illustrated above right.

Many curves are characterized by large portions that are nearly flat, where the approximations work well, with a few small portions that change rapidly, where the approximations fail. If we increase the number of slices, there is little effect on the flat parts of the curve, so much of the work is wasted. Adaptive quadrature works by measuring the approximations at various slices and narrowing the slices where the curve is changing rapidly, allowing us to control the approximation precisely. Adaptive quadrature is a recursive process; if approximations with n and n/2 slices are close enough, recursion stops, otherwise the range is split in two and adaptive quadrature is applied to each of the two halves separately.

Your task is to write functions that perform numerical integration by the rectangular, trapezoidal and Simpson’s methods, as well as a function that performs adaptive quadrature. Use your functions to calculate the logarithmic integral \mathrm{li}(x) = \int_2^x \frac{dt}{\log{t}} for x = 1021, where the logarithmic integral is an approximation of the number of primes less than its argument. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

Pages: 1 2

We examined the Sieve of Eratosthenes for computing prime numbers less than n in our second exercise nearly a year ago. In today’s exercise, we update that algorithm to compute the prime numbers in a range L to R, where the range doesn’t start at zero and is too large to fit in memory all at once. Specifically, we discuss a function that takes parameters L, for the left end of the range, and R, for the right end of the range, and a parameter B that divides RL; we also assume that the primes less than the square root of R are known. In practice, L and R will be large, maybe 1016 or 1018, and B will be somewhat smaller, maybe 108 or 1010.

We will explain the algorithm by computing the primes in the range 100 to 200, with B = 10; since we don’t look at even numbers, the algorithm will make five passes, each time sieving twenty numbers.

There are six primes less than the square root of 200: 2, 3, 5, 7, 11, and 13; we will ignore 2, since it is even. Associated with each prime Pk is a number we will call Qk that is the offset within B of the first prime in the current sieving block that is divisible by Pk. For instance, when sieving from 100 to 120, the primes are 3, 5, 7, 11, and 13 and the associated Qs are 2, 2, 2, 10, and 8. Each Qk can be initialized by computing Qk = (-1/2 × (L + 1 + Pk)) mod Pk and re-initialized for the next 2B block by computing Qk = (QkB) mod Pk; for instance, the Qs at the second sieving block from 120 to 140 are 1, 2, 6, 0, and 11.

Within each sieving block, we create an array of booleans (in practice, it is common to use bits) that is initialized to true. Then each prime is sieved in the normal way starting at the Qk offset. In our example, prime 3 with associated Q = 2 causes us to reset bits 2, 5, and 8 to false. Likewise, prime 5 causes us to reset bits 2 and 7, prime 7 causes us to reset bits 2 and 9, prime 11 has no odd multiples in the current sieving block, and prime 13 causes us to reset bit 8. After sieving, the bits that are still true are 0, 1, 3, 4, and 6, which correspond to primes 101, 103, 107, 109, and 113.

It is instructive to take five minutes to work out the example sieve by hand.

Your task is to write a function that performs a segmented sieve of Eratosthenes as described above. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

Pages: 1 2

Proving Primality

February 2, 2010

We examined algorithms for checking the primality of a number in two previous exercises. Both algorithms are probabilistic; the Rabin-Miller algorithm has known false positives, and Carl Pomerance proved that the Baillie-Wagstaff algorithm has an infinite number of false positives, even though none are known.

Today’s exercise describes an algorithm for proving that a number is prime, not probably but certainly. The algorithm works by a theorem of Edouard Lucas, which is based on Fermat’s Little Theorem (if n is prime then bn-1 ≡ 1 (mod n) for every integer b coprime to n):

If, for some integer b, bn-1 ≡ 1 (mod n), and if b(n-1)/q ≢ 1 (mod n) for any prime divisor q of n-1, then n is a prime number.

Thus, if the factors of n-1 are known, it is easy to determine the primality of n.

Your task is to write a function that determines the primality of an integer using Lucas’ primality test; use it to prove the primality of 289-1. When you are finished, you are welcome to read or run a suggested solution , or to post your own solution or discuss the exercise in the comments below.

Pages: 1 2