Logarithm Tables
September 30, 2011
While cleaning-up my study, I found my old tables of logarithms and anti-logarithms that must date to my high-school days.
For my readers who were too young to have had the privilege of using logarithm tables, Tony Audsley gives a good explanation, and the tables themselves are available on the next page. Used properly, the tables give logarithms and anti-logarithms (mantissa) to four places after the decimal point, with an error no greater than 1 in the last digit; you have to figure out the exponents (characteristic) yourself, by inspection.
Your task is to write programs that create four-place tables of logarithms and anti-logarithms; please let us know by leaving a comment if you remember using such tables in the past. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.
Statistics
September 27, 2011
In today’s exercise we calculate some of the basic measures in statistics: mean, standard deviation, linear regression, and correlation. The only hard part is that different sources use different standard names to refer to the different statistics. The formulas are shown below; all the summations are over $i$ from 1 to the number of items $n$:
mean:
standard deviation:
linear regression:
slope:
intercept:
correlation:
Your task is to write functions to compute these basic statistics. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.
Array Duplicates
September 23, 2011
We have today another exercise from our large file of interview questions:
You are given an array with integers between 1 and 1,000,000. One integer is in the array twice. How can you determine which one?
Your task is to write code to solve the array duplicates problem. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.
Project Euler Problem 3
September 20, 2011
Problem 3 at Project Euler asks the reader to solve this problem:
The prime factors of 13195 are 5, 7, 13 and 29. What is the largest prime factor of the number 600851475143?
Given the amount of factoring code on this website, I find that problem quite simple. But judging from Reddit and Stack Overflow, other programmers find the problem anything but simple. If you look at the dates, you will notice that the problem seems to come up every few months. And if you read the comments on Reddit and Stack Overflow, the saddest part is that a lot of the answers are wrong; in fact, some of them are mind-numbingly wrong. So I decided to write a canonical solution to the problem:
Project Euler Problem 3 asks for the largest prime factor of the number 600851475143. We consider first a function to find all the prime factors of any number. A simple way, not necessarily the best way but sufficient for this task and for many others, works in two steps: first remove factors of 2 while the number is even, then perform trial division by the odd numbers starting from 3 until the factorization is complete. Here’s a formal statement of the algorithm to find the factors of a given input number n:
1. Create a list of factors, initially empty.
2. If n is even, add 2 to the end of the list of factors, divide n by 2, and go to Step 2.
3. Set f ← 3.
4. Calculate the quotient q and remainder r when dividing n by f, so that n = q f + r with 0 ≤ r < f.
5. If r > 0, set f ← f + 2 and go to Step 4.
6. Otherwise, r = 0. Add f to the end of the list of factors. Set n ← q.
7. If n < f2, add n to the end of the list of factors, output the list of factors, and stop.
8. Otherwise, go to Step 4.
Consider as an example the factorization of 13195; it’s odd, so we pass Step 2 and go to Step 3 with f = 3. The quotient and remainder of 13195 ÷ 3 are 4398 and 1, so in Step 5 we set f = 3 + 2 = 5 and go to Step 4. The quotient and remainder of 13195 ÷ 5 are 2639 and 0, so in Step 6 we add 5 to the end of the list of factors, making the list {5}, and we set n = 2639. Then, in Step 7, we note that 2639 < 52 is false, so we go to Step 8 and then to Step 4. In Step 4 we calculate the quotient and remainder of 2639 ÷ 5 as 527 and 4, and since the remainder is greater than 0 we set f = 5 + 2 = 7 and go to Step 4. In Step 4 we calculate the quotient and remainder of 2639 ÷ 7 as 377 and 0, so in Step 6 we add 7 to the end of the list of factors, making the list {5, 7}, and we set n = 377. Then, in Step 7, we note that 377 < 72 is false, so we go to Step 8 and then to Step 4. In Step 4 we calcuate the quotient and remainder of 377 ÷ 7 as 53 and 6, and since the remainder is greater than 0 we set f = 7 + 2 = 9 and go to Step 4. In Step 4 we calculate the quotient and remainder of 377 ÷ 9 as 41 and 8, and since the remainder is greater than 0 we set f = 9 + 2 = 11 and go to Step 4. In Step 4 we calculate the quotient and remainder of 377 ÷ 11 as 34 and 3, and since the remainder is greater than 0 we set f = 11 + 2 = 13 and go to Step 4. In Step 4 we calculate the quotient and remainder of 377 ÷ 13 as 29 and 0, so in Step 6 we add 13 to the end of the list of factors, making the list {5, 7, 13}, and we set n = 29. Then, in Step 7, we note 29 < 132, so we add 29 to the end of the list of factors, making the list {5, 7, 13, 29}, output the list of factors, and stop. Thus, the complete factorization is 13195 = 5 × 7 × 13 × 29.
Given a function to find the prime factors of a number, the solution to Project Euler Problem 3 is easy: it’s just the last prime factor in the list, which is necessarily the largest prime factor in the list since we worked with increasing f and always added prime factors to the end of the list.
You may notice that the challenge number in the problem, 600851475143, is too large to represent using 32-bit arithmetic. That’s part of the Project Euler task — you need to find a library that allows you to do arithmetic on numbers longer than 32 bits. In fact, Project Euler uses this problem as kind of a gateway exercise, because many of the later problems also require arithmetic on integers longer than 32 bits. With C, the long long datatype is big enough, but you might need the gmp library for larger integers, with Java you probably want the BigInteger library, and with other languages you will need to find the corresponding library yourself, and let the compiler know where it is so it can be linked to your factoring code. Or you may prefer to use a language, such as Python or Lisp, that provides built-in support for integers longer than 32 bits.
If you want to know more about integer factorization, I humbly recommend the prime-number exercises at my blog.
Your task is to implement the function described above. If you solve the Project Euler problem, please publish only your code but not the solution, as we respect Project Euler’s request not to publish answers. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.
Pollard’s P-1 Factorization Algorithm, Revisited
September 16, 2011
We have studied John Pollard’s p−1 algorithm for integer factorization on two previous occasions, giving first the basic single-stage algorithm and later adding a second stage. In today’s exercise we look at a somewhat different version of the second stage, known as the improved standard continuation, that greatly improves the speed of the algorithm. The trick is to remove the modular exponentiations in the second stage, replacing them with modular multiplications, which are obviously much faster. Let’s begin with an example, computing the factorization of 16309 with B1=10 and B2=50:
For the first stage we will need the primes less than B1, which are 2, 3, 5 and 7. We will also need to know the integer logarithms of B1 with respect to each of those primes, which are 3, 2, 1, 1; that is, 23=8 is the largest power of 2 less than 10, 32=9 is the largest power of 3 less than 10, 51=5 is the largest power of 5 less than 10, and 71=7 is the largest power of 7 less than 10. Q is initially 2. For each prime we calculate q = qpa where p is the prime and a is the integer logarithm. Thus, q = 28 mod 16309 = 256 after prime 2, q = 2569 mod 16309 = 7011 after prime 3, q = 70115 mod 16309 = 4239 after prime 5, and q = 42397 mod 16309 = 9884 after prime 7. We calculated the greatest common divisor of q−1 and n at each step, and in each case the gcd was 1 (for instance, at the last step gcd(9883, 16309) = 1) and the first stage failed to split a factor.
For the second stage we will need the next prime larger than B1, which is 11, and the differences between the primes from there to B2, which are 2, 4, 2, 4, 6, 2, 6, 4, 2 and 4, representing the primes 13, 17, 19, 23, 29, 31, 37, 41, 43 and 47. We also need the ending q from the first stage and the modular exponentiation qd mod n for each of the differences 2, 4 and 6, which are 98842 mod 16309 = 2546, 98844 mod 16309 = 7443 and 98846 mod 16309 = 15129, respectively.
With this setup, the second stage is very fast. Q is initialized with the next prime, 11, so q = 988411 mod 16309 = 13427, and p = 1. Now we iterate over the differences. The first difference is 2, representing the prime 13, and the modular multiplication is q = 13427 × 2546 mod 16309 = 1478 with p = 1 × 147 mod 16309 = 1477. The second difference is 4, representing the prime 17, and q = 1478 × 7443 mod 16309 = 8488 with p = 1477 × 8487 mod 16309 = 9987. The third difference is 2, representing the prime 19, and q = 8488 × 2546 mod 16309 = 1023 with p = 9987 × 1022 mod 16309 = 13589. The fourth difference is 4, representing the prime 23, and q = 1023 × 7443 mod 16309 = 14195 with p = 13589 × 14194 mod 16309 = 12032. We’ve been taking the gcd(p, n) at each step, and they have all been 1, but here gcd(12032, 16309) = 47, which is a factor of 16309. The complete factorization is 16309 = 47 × 347. We can summarize the calculations like this:
B1=10, B2=50
ps = 2, 3, 5, 7, 11
as = 3, 2, 1, 1
ds = 2, 4, 2, 4, 6, 2, 6, 4, 2, 4
First Stage
init: q=2
2^3: 2^8 mod 16309 = 256
3^2: 256^9 mod 16309 = 7011
5^1; 7011^5 mod 16309 = 4239
7^1; 4239^7 mod 16309 = 9884
gcd(9883, 16309) = 1
Second Stage
9884^2 mod 16309 = 2546
9884^4 mod 16309 = 7443
9884^6 mod 16309 = 15129
init: q = 9884^11 mod 16309 = 13427, p = 1
2(13): q = 13427 * 2546 mod 16309 = 1478, p = 1 * 1477 mod 16309 = 1477
4(17): q = 1478 * 7443 mod 16309 = 8488, p = 1477 * 8487 mod 16309 = 9987
2(19): q = 8488 * 2546 mod 16309 = 1023, p = 9987 * 1022 mod 16309 = 13589
4(23): q = 1023 * 7443 mod 16309 = 14195, p = 13589 * 14194 mod 16309 = 12032
gcd(12032, 16309) = 47
You may recall from the earlier exercises that Pollard’s p−1 method is based on Fermat’s little theorem ap ≡ a (mod p), which can equivalently be stated ap−1 ≡ 1 (mod p). As a consequence, if p−1 divides M, then p divides gcd(2M−1, n). Pollard’s p−1 method computes M as the least common multiple of a bound B; thus, if all the factors of p−1 are less than B, then gcd(2lcm[1..B] − 1, n) is a factor of n.
In the two-stage version of Pollard’s p−1 method, also called the large-prime version, there are two bounds, B1 and B2. The first stage calculates the gcd as described above. The second stage then computes gcd(2k · lcm[1..B1] − 1, n) for each prime k such that B1 < k < B2. This is useful because most numbers factor as the product of several small factors and a large factor; numbers that factor as the product of two large primes are relatively rare.
We saw in the example how the least common multiple of the integers less than a bound B can be computed economically using powers of the prime numbers less than the bound. The large-prime variant can be computed economically by a trick. At the end of the first stage we have computed q = 2lcm[1..B1] mod n; that was 9884 in the example above. Then we store a table of the differences of the primes from B1 to B2, and for each difference d compute qd; since these differences are small and few in number, the table is small and quick to compute. Then we can compute each 2k · lcm[1..B1] by successively multiplying each q by the successive differences in the primes, thus converting a costly modular exponentiation to a much simpler modular multiplication. Finally, at each step we subtract 1 and compute the gcd, stopping when we find a factor.
In addition to the modular expontiations, the other large time cost of the algorithm is the computation of the greatest common divisor, and there is another trick that reduces that cost. Instead of taking the greatest common divisor after each step in either stage, take it only periodically, say after every hundred steps. If the gcd at that point is between 1 and n, then you have your factor and you are finished. If the gcd at that point is 1, then all of the intermediate gcds would also have been 1, but none of them needed to be calculated. Finally, if the gcd at that point is n, there are two factors since the last gcd calculation, so backtrack to the prior gcd and recompute the steps, taking the gcd at each step.
Your task is to write a function that finds factors using the p−1 algorithm, using both tricks described above. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.
Tetrahedral Numbers
September 13, 2011
Before I started writing my own programming exercises, I enjoyed solving other programming exercises available on the internet, many of them mathematical in nature. Today’s exercise comes from http://open-cs.net/problems.php?id=18 and concerns triangular and tetrahedral numbers.
A triangular number tells the number of ways that balls can be stacked in a triangle. The first triangular number is 1, the second is 3 (a row of 1 plus a row of 2), the third triangular number is 6 (the first two rows plus a row of 3), the fourth triangular number is 10 (adding a row of 4), the fifth triangular number is 15 (think of the 15 balls on a pool table), and so on.
A tetrahedral number is the three-dimensional equivalent of a triangular number; think of cannonballs stacked in a three-sided pyramid. The top layer has one ball, the second layer has 3 balls (the second triangular number) so the second tetrahedral number is 1 + 3 = 4, the third layer has 6 balls giving a total of 10 balls in the tetrahedron, and so on.
Your task is to find the base of the tetrahedron that contains 169179692512835000 balls. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.
Mersenne Twister
September 9, 2011
We have written random number generators in several previous exercises. Today we look at the Mersenne Twister of Makoto Matsumoto and Takuji Nishimura. Here is their description of their algorithm:
We propose a new random number generator Mersenne Twister. An implemented C-code MT19937 has the period 219937−1 and 523-dimensional equidistribution property, which seems to be the best among all generators ever implemented. There are two new ideas added to the previous twisted GFSR to attain these records. One is an incomplete array to realize a Mersenne-prime period. The other is a fast algorithm to test the primitivity of the characteristic polynomial of a linear recurrence, named inversive-decimation method. This algorithm does not require even the explicit form of the characteristic polynomial. It needs only (1) the defining recurrence, and (2) some fast algorithm that obtains the present state vector from its 1-bit output stream. The computational complexity of the inversive-decimation method is the order of the algorithm in (2) multiplied by the degree of the characteristic polynomial. To attain higher order equidistribution properties, we used the resolution-wise lattice method, with Lenstra’s algorithm for successive minima.
If that’s not quite clear, their reference implementation is given on the next page. Note that this implementation is the original Mersenne Twister MT19937; there are several variants.
Though George Marsaglia has strongly criticized it as needlessly complex, the Mersenne Twister is quite a popular random number generator. It takes a 32-bit integer other than 0 as a seed, returns a 32-bit integer each time it is called, and has a period of 219937−1 ≅ 4·106001. The Mersenne twister is suitable for simulation but not for cryptography.
Your task is to implement the Mersenne twister in your favorite language. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.
Deques
September 6, 2011
A normal linked list can be accessed only at its head. A double-ended queue, or deque (pronounced “deck”), can be accessed at either end. Like a normal list, a deque can be null. New elements can be added at either end, the element at either end of a non-null deque can be fetched, and the element at either end of a non-null deque can be deleted. Deques are a combination of stacks and queues.
Your task is to write a function library that implements deques; you should be sure that all operations are performed in constant time. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.
Two String Exercises
September 2, 2011
These two problems seem to be on every list of programming interview questions:
1) Remove all duplicate characters from a string. Thus, “aaabbb” becomes “ab” and “abcbd” becomes “abcd”.
2) Replace all runs of consecutive spaces with a single space. Thus, “a.b” is unchanged and “a..b” becomes “a.b”, using a dot to make the space visible.
Your task is to write the two requested functions. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.