## Sieve Of Atkin, Improved

### February 19, 2010

We examined the Sieve of Atkin for enumerating prime numbers in a recent exercise. Though the Sieve of Atkin has the potential to be faster than the Sieve of Eratosthenes, our implementation was not: the Sieve of Eratosthenes takes 1.219 seconds to enumerate the first million primes, while our implementation of the Sieve of Atkin takes 22.954 seconds, nineteen times as long, to perform the same task.

The problem is that the naive implementation we used performs useless work. For instance, consider that you are sieving the numbers to a million, with the *x* and *y* variables of the naive implementation ranging from one to a thousand. In the first quadratic, 4*x*² + *y*² = *k*, the largest candidate is 4·1000² + 1000² = 5,000,000, which is far beyond the end of the range being sieved. Reducing the limits means a lot less work and a much faster program.
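To see how much tighter the bounds can be, here is a minimal Python sketch (the variable names are my own, not part of the exercise) computing the largest useful *x* and *y* for the 4*x*² + *y*² quadratic:

```python
import math

limit = 1_000_000

# Largest x with 4*x*x + 1 <= limit, and largest y with 4 + y*y <= limit;
# compare these with the naive bound of 1000 for both x and y.
x_max = math.isqrt((limit - 1) // 4)
y_max = math.isqrt(limit - 4)
print(x_max, y_max)  # 499 999
```

So for this quadratic, *x* never needs to exceed 499 — half the naive bound — and similar reductions apply to the other two quadratics.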

Your task is to rewrite the naive Sieve of Atkin to eliminate useless work. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

Here is my Python version of an improved Sieve of Atkin. On my laptop it calculates the primes up to 1 million in 0.16 seconds, and the primes to 10 million in about 1.5 seconds. That is 3.5 times faster than a basic Sieve of Eratosthenes and about 5.6 times faster than a naive Sieve of Atkin.

To improve the naive sieve, I first calculated appropriate loop limits. Second, I replaced the squaring of x and y with additions, using the well-known formula (x+1)^2 = x^2 + 2x + 1, and changed the loop counters to return the difference between successive squares (i.e., the 2x+1). Obviously, the factor on the x^2 term can also be folded into the loop counter. For example, the 4x^2 term goes 4, 16, 36, …, so the loop was set up to return the differences 4, 12, 20, ….
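The incremental-squares trick can be sketched in a few lines of Python (my own illustration, not the poster's code):

```python
# Generate 4*x*x for x = 1, 2, 3, ... using only additions:
# 4*(x+1)^2 - 4*x^2 = 8*x + 4, so the deltas go 12, 20, 28, ...
limit = 1000
values = []
n, delta = 4, 12      # 4*1^2, then the first delta
while n <= limit:
    values.append(n)
    n += delta
    delta += 8
assert values == [4 * x * x for x in range(1, len(values) + 1)]
print(values[:4])  # [4, 16, 36, 64]
```

Each loop iteration costs two additions instead of a multiplication, which is where much of the speedup comes from.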

Because of the way the loop counters incremented, I suspected there would be a pattern to the results of the n%12 calculations in the y^2 loops. After confirming the pattern, I changed the loop variables and split some loops in two, so that the loops only iterate over values of x and y for which the n%12 test would pass. This greatly reduced the number of loop iterations and eliminated the n%12 test. The start, stop, and step values for the loops were tricky to figure out.
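The pattern is easy to confirm empirically. Here is a small Python check (my own sketch, not the poster's code) for the 4x² + y² quadratic: whether n % 12 lands in {1, 5} depends only on x mod 6 and y mod 6, and in every passing pair y is odd.

```python
# Which residue pairs (x % 6, y % 6) make n = 4x^2 + y^2 pass the
# n % 12 in {1, 5} test?  Enumerate several full periods to find out.
passing = set()
for x in range(1, 61):
    for y in range(1, 61):
        if (4 * x * x + y * y) % 12 in (1, 5):
            passing.add((x % 6, y % 6))
print(len(passing))                          # 16 of the 36 possible pairs
assert all(y % 2 == 1 for _, y in passing)   # y is always odd
```

Once the passing residues are known, the loops can be rewritten to step directly between them and the modulo test disappears.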

Although I plan further optimization of the 3x^2 – y^2 section when I have time, I thought I’d post what I have now. Also, I didn’t optimize memory usage at all, so I’m sure that can be reduced.

I made the above code MUCH faster by simply replacing the line:

primes = [0] * limit

with:

primes = numpy.zeros(limit, dtype='bool')

Obviously this requires the numpy module. I would think this also helps considerably with the memory usage he was talking about!

I’m still trying to wrap my head around the rest of the code (really trying to wrap my head around the sieve in general). Will post when I find out more!

It’s simple. We can cut the loop short by using this condition:

```javascript
if (xx + yy >= limit) {
    break;
}
```

Using JavaScript running on Chrome’s V8, it took about 2.8 seconds to generate the primes with a limit of 10 million. Here’s the code:

```javascript
function sieveOfAtkin(limit) {
    var limitSqrt = Math.sqrt(limit);
    var sieve = [];

    // primes start from 2 and 3
    sieve[2] = true;
    sieve[3] = true;

    for (var x = 1; x <= limitSqrt; x++) {
        var xx = x * x;
        for (var y = 1; y <= limitSqrt; y++) {
            var yy = y * y;
            if (xx + yy >= limit) {
                break;
            }
            // first quadratic using m = 12 and r in R1 = {r : 1, 5}
            var n = (4 * xx) + yy;
            if (n <= limit && (n % 12 == 1 || n % 12 == 5)) {
                sieve[n] = !sieve[n];
            }
            // second quadratic using m = 12 and r in R2 = {r : 7}
            n = (3 * xx) + yy;
            if (n <= limit && n % 12 == 7) {
                sieve[n] = !sieve[n];
            }
            // third quadratic using m = 12 and r in R3 = {r : 11}
            n = (3 * xx) - yy;
            if (x > y && n <= limit && n % 12 == 11) {
                sieve[n] = !sieve[n];
            }
        }
    }

    // eliminate composites by sieving out multiples of squares of primes
    for (var n = 5; n <= limitSqrt; n++) {
        if (sieve[n]) {
            var x = n * n;
            for (var i = x; i <= limit; i += x) {
                sieve[i] = false;
            }
        }
    }

    // primes are the n for which sieve[n] is true
    return sieve;
}
```

Could you please explain how you got the limits?

This is written in F# under the CLR virtual machine, as I prefer the functional programming style; I have translated the code to Scala under the JVM and gotten about the same results, which makes sense, as both run on virtual machines using JIT compilation and both have built-in array bounds checks. These versions have very fast bit-counting functions to obtain the number of primes rather than using enumeration, as the enumeration can take much longer than the actual sieving.

On Gist (https://gist.github.com/anonymous/d17da4f1a8db2fadf10c) I have posted a reference version of a maximally factorized Sieve of Eratosthenes, just as Atkin and Bernstein did (though my version is not “crippled”), along with this version of the SoA for comparison.

On the same Gist link is a version of the SoA from the pseudocode of the updated Wikipedia article, with an added improvement of an inner loop to almost eliminate the wasted time of loops that won’t pass the modulo checks, and with the added advantage that the tight innermost culling loops are almost as simple as those used for the SoE.

The test program, which runs each sieve over a billion and requires only a change to the “open ” statement to run either one, is as follows:

With the results as follows, first SoEMWF:

and for the SoA:

You can see that the SoEMWF is still slightly faster than the SoA at one billion, but testing shows that the SoA gradually passes it at about two billion and stays ahead until we run out of memory at about 60 billion. However, this is for the huge-array model: it’s easy to provide a page-segmented version of the SoE but very difficult for the SoA, and then the performance picture completely changes, with the SoA losing terribly…

Module codes are here:

This is an impressive optimization; it’s too bad we won’t know how you calculated your limits.

@Andres, I’m not sure what you mean by “how I calculated my limits”. If you mean the limit of the largest number that can be sieved in these naive “one huge sieving array” sieves, that’s based on the largest DotNet array being limited to two gigabytes unless one turns on some extensions (and then not much larger). The sieving is done with one bit representing each possible prime, excluding composites eliminated by wheel factorization: there are 16 possibilities for each 16-bit word representing a 60-number wheel for Atkin, and 48 possibilities for each three 16-bit words representing a wheel range of 210 numbers for Eratosthenes, with a few numbers eliminated at the beginning and some extras to allow for wheel overflow at the end. Thus, the maximum range is about 60 billion for Atkin and about 75 billion for Eratosthenes. The reason these sieves are “naive” is that they are very slow in memory access times; the only reason they are moderately useful is that the number of operations has been greatly reduced by the algorithms, with both having about the same (slow) memory access speed, as they use similar bit culling/toggling techniques.
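The arithmetic behind those range figures checks out quickly (a sketch assuming the two-gigabyte, i.e. 2³¹-byte, array limit mentioned above):

```python
ARRAY_BYTES = 2**31                      # largest DotNet array, roughly 2 GB

# Atkin: one 16-bit word (2 bytes) covers a 60-number wheel span.
atkin_range = (ARRAY_BYTES // 2) * 60
# Eratosthenes: three 16-bit words (6 bytes) cover a 210-number span.
erat_range = (ARRAY_BYTES // 6) * 210
print(atkin_range, erat_range)  # roughly 64 billion and 75 billion
```

The small beginning/overflow adjustments the comment mentions shave these down slightly, giving the quoted “about 60 billion” and “about 75 billion”.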

The reason I wrote these is to show that the fully optimized SoE is at least as fast as the fully optimized SoA when the other environment limits are the same (the memory access speed).

To be non-naive, segmented sieves should be used, both to improve memory access time by a factor of 10 or more and to increase the potential size of the sieving ranges. But then the problems with the Atkin sieve become apparent: it is quite easy to do the conversion for the SoE but very difficult to do it for the SoA (see Atkin and Bernstein’s reference implementation). Also, no matter what tricks are used, the SoA gets slower much faster with increasing range than the SoE, due to more complexity and a greatly increasing span of culling/toggling jumps compared to the span jumps required for the SoE.

I wrote the programs in compiled F# for conciseness, but for these naive implementations the language efficiency doesn’t much matter, as the main bottleneck is the memory access speed of many tens of CPU clock cycles (over a hundred on some systems); thus, fast native-code compilers such as C won’t be much faster, and even interpreted languages such as Python won’t be that much slower.