## Pritchard’s Wheel Sieve

### January 6, 2012

We begin at the end with the main function that drives the others; setup is done in the `let*`

, then the loop processing updates the set *S* until *k* is greater than *m*:

`(define (pritchard n)`

(let* ((ps (eratosthenes (isqrt n)))

(m (length ps))

(k (compute-k n ps)))

(let loop ((k (+ k 1)) (s (initial-s n k ps)))

(if (< m k) (append ps (cdr s))

(loop (+ k 1) (next n s))))))

The `next`

function derives a new version of *S* from its predecessor:

`(define (next n s)`

(let* ((p (cadr s)) (pg (make-pg p)))

(let loop ((ss s) (ps (list p)))

(let ((p (+ (car ps)

(vector-ref pg (/ (- (cadr ss) (car ss) 2) 2)))))

(if (< n p) (list-minus s (reverse ps))

(loop (cdr ss) (cons p ps)))))))

`Next`

calls `make-pg`

to build a lookup-table of extended gaps, converting a multiplication to a lookup, and `list-minus`

subtracts one list from another:

`(define (make-pg p)`

(let ((pg (make-vector (- p 1) 0)))

(do ((i 0 (+ i 1))) ((= i (- p 1)) pg)

(vector-set! pg i (* 2 p (+ i 1))))))

`(define (list-minus xs ys)`

(let loop ((xs xs) (ys ys) (zs (list)))

(cond ((null? xs) (reverse zs))

((null? ys) (append (reverse zs) xs))

((= (car xs) (car ys))

(loop (cdr xs) (cdr ys) zs))

(else (loop (cdr xs) ys (cons (car xs) zs))))))

`Compute-k`

and `initial-s`

are used in the setup phase:

`(define (compute-k n ps)`

(let loop ((k 0) (m 2) (ps (cdr ps)))

(if (< (/ n (log n)) m) k

(loop (+ k 1) (* m (car ps)) (cdr ps)))))

`(define (initial-s n k ps)`

(let ((s (make-vector (+ n 1) #t)))

(vector-set! s 0 #f)

(let loop ((k k) (ps ps))

(if (zero? k) (enlist s)

(do ((i (car ps) (+ i (car ps))))

((< n i) (loop (- k 1) (cdr ps)))

(vector-set! s i #f))))))

The set *S* is initially represented as a bit vector locally in `initial-s`

then translated to a list by the `enlist`

function:

`(define (enlist s)`

(let loop ((i (- (vector-length s) 1)) (ss (list)))

(cond ((negative? i) ss)

((vector-ref s i)

(loop (- i 1) (cons i ss)))

(else (loop (- i 1) ss)))))

We also use the sieve of Eratosthenes from a previous exercise to initialize the small list of primes, and `isqrt`

from the Standard Prelude. Here’s the program in action:

`> (pritchard 100)`

(2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83 89 97)

> (length (pritchard 1000))

168

You can run the program at http://programmingpraxis.codepad.org/22VkAlSX. You will find that Pritchard’s sieve is quite slow compared to Eratosthenes’ sieve. Though that is true in general, our implementation is also rather inefficient due to its use of the set data structure. A better implementation uses arrays, but if you decide to do that, be careful; a naive translation from sets to array indexed front-to-back will fail because some of the values of the *S* set are changed (stricken from the set) before the sweep through the array catches them, causing errors. There are various ways to solve that problem (Pritchard swept the array back-to-front, other programmers came up with other methods), but all add to the problem that Pritchard’s sieve is slower in practice than Eratosthenes’, even with a better asymptotic complexity. Other versions of Pritchard’s sieve improve on this version, but are still slower in practice than Eratosthenes’. In fact, all of the advanced sieves that we have studied are slower than the original algorithm from the ancient Greek mathematician.

If I understand correctly, the wheel for 2, 3, 5, 7 is [10 2 4 2 4 6 2 6 4 2 4 6 6 2 6 4 2 6 4 6 8 4 2 4 2 4 8 6 4 6 2 4 6 2 6 6 4 2 4 6 2 6 4 2 4 2 10 2]. However, in this paper http://www.cs.hmc.edu/~oneill/papers/Sieve-JFP.pdf, the wheel starts from 2: [2 4 2 4 6 2 6 4 2 4 6 6 2 6 4 2 6 4 6 8 4 2 4 2 4 8 6 4 6 2 4 6 2 6 6 4 2 4 6 2 6 4 2 4 2 10 2 10].

Why the difference?

There is an error here The second time through the loop, p = 7, the gaps are 7−1=6, 11−7=4, 13−11=6, 17−13=4, and 19−17=2, and the numbers that are stricken are 7, 7+6×7=49, 49+4×7=77, and 77+2×7=91, leaving S4 = {1 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83 89 97}.

i.e 13-11 is 2

Thanks for the algorithm!

@Giorgio Valoti, the difference is that one of the wheel sets has been rolled one position from the other as usually the wheels are used starting from their next wheel position, as is W0 starts at 2, W1 starts at 3, W2 starts at 5, etc.- the wheel excluding 7 (W4 by my definition) starts at 11.

The algorithm is interesting, but really it is just a wheel factorized Sieve of Eratosthenes with less wheels used for smaller sieve ranges and larger ones used for bigger sieve ranges in order to keep the O(n) performance: one may as well use a constant maximum practical sieve size such as 2/3/5/7/11/13/17/19 (W8 by my definition) and get even better performance for small ranges, the same performance when this algorithm would call for this size of wheel anyway (at about 200 million to quite a few billion) and accept that performance above that point will be a little slower given that one doesn’t really want to deal with a huge gap table.

As to “worse performance than the Sieve of Eratosthenes”, that isn’t necessarily true, or at least not by much: I suspect that your implementation is slow because you persist in using lists to store the gaps/wheel primes and the set of candidates to be culled of composites: lists are very handy but the bane of performance (and consume memory since as well as the actual data per element, they also have to pointer link to the next element of 4 bytes/8 bytes for 32/64 bit systems, respectively). Of the three operations you mention, 1) you don’t have to calculate gaps per culling operations if you store the wheel as gaps (preferably in an array for faster access rather than a list), 2) integer multiplies aren’t expensive for a modern CPU and can be streamed to not take much more time than the other operations performed in parallel, and 3) integer additions are done in any case and take almost zero average time for a modern CPU.

The most practical sieves such a (C) primesieve fill the array with a precull using a pattern of the large wheel as above and then do further culling using only the W4 wheel of 48 “hit” positions out of the wheel range of 210; this seems to maximize efficiency and trade-offs between use of memory including optimizing the use of CPU caches vs. execution speed. In order to keep the inner loop simple and avoid having to jump by gaps, the ‘hit” modulos are often separated out so that the innermost loops can do simple addition offsets for their culling (jumping by even numbers of wheels), with separate innermost loops running for each modulo for each base prime. This can be done in any language, including Scheme.

for instance, see Python 2, in this case using only a 2/3/5 wheel and no preculling, but that would be fairly easy to add as would increasing the size of the wheel, here the wheel is given as constant literals but computing the wheel isn’t hard in any language:

The whole process runs better if you use the sequence of products of the first n-primes as your bases. There really is no point in venturing outside this since the totient is minimal at those values and all primes less than any of them will lie in the reduced residue system which can be constructed easily by enumerating over the r.r.s. of its predecessor from zero to p-1.

[…] Pritchard’s Wheel Sieve […]

I have taken your PY code and have tried to make it work for values other the 8 primes and Wheel Size 30, by changing the value 8 to NP and the value 30 to NX. Also some additional changes such as changing the I <>3 shifts to division by 8 or NP in t=my modified code) . The I & 7, I change to I % NP. I aslo made a few simple changes to print some intermediate information. I have included my PY code in this email. It works OK when I set the values of NP = 8 and NX = 30. However, when I try to extend the code to consider additional primes beyond 31, to 37 and 41 – changing NP = 10 and NX= 30 and also changing “gaps” and “ndxs” to account for the additional primes, it fails miserably—see the code below. I get many additional values that are not prime and miss a few that are. I believe the issue is in the code fragment:

buf[c::p8] = [0] * ((lmtbf – c) // p8 + 1)

In the case where NP = 10 the p8 value would be p*NP. I can’t figure it out. If you find a bug or two I wouldn’t be surprised. You seem to be an expert on the sieving processes, maybe you can help?

My PY code—

#!/usr/bin/python

def primes235(limit) :

print “\nlimit %20d\n” % limit

#yield 2; yield 3; yield 5

if limit limit+1 : break

#for i in xrange(lmtbf – 6 + (ndxs[(limit – 7) % 30])): # adjust for extras

# if buf[i]: yield (30 * (i >> 3) + modPrms[i & 7])

#print “buf[i] %20d\n” % buf[i]

return [2,3,5] + [(NX * (i//NP) + modPrms[i % NP]) | 1 for i in xrange(lmtbf – 6 + (ndxs[(limit – 7) % NX])) if buf[i]]

“””***************************************************************************************

A test driver.

***************************************************************************************”””

if __name__ == ‘__main__’ :

#for p in primes235(1000000000) :

# if p > 999999900 and p < 1000000000: print(p)

p = primes235(2310)

print "\n"

print p

print "\n\n Prime count: %d\n" % len(p)

k = 1;

for i in range (len(p)) :

print k, p[i]

k = k + 1

@Peter: You don’t seem to understand how a prime wheel works. A 2,3,5-wheel uses the totatives of 2*3*5=30 to determine the stops of the wheel. The next larger wheel is not 37 or 41, it’s a 2,3,5,7-wheel with a circumference of 2*3*5*7=210. To learn more about prime wheels, look at my Wheel Factorization exercise.

@Peter, as programmingpraxis says, you don’t seem to understand factorization wheels. For the 2,3,5,7 wheel, “modPrms” will be the totients from 11 up to 220 excluding those numbers evenly divisible by 2,3,5, or 7 for a total of 48 entries, “gaps” will be two times 48 elements long, and “ndxs” will be 210 elements long; these can be auto generated by little programs. The resulting program needs adjustment for almost every line as to the factors of 48 and 210 rather than 8 and 30, and the resulting program won’t gain as much performance as expected because there will need to be divisions by 48 rather than just right shifts by 3 to effect the division by 8. For large ranges, this wouldn’t be the way to write this anyway, as using the “one large buffer array” won’t be very efficient and a page segmented approach would be much better, also would not require a fixed upper limit bound, although the resulting program would be a little more complex.

For my own interest I generalized Willy’s code to take the number of initial primes as an argument:

[python]

def primesk(k, limit):

”’Takes the number of inital primes k, and generates primes up to limit”’

inital = [2, 3, 5, 7, 11, 13]

if k <= 0 or k >= len(inital):

print(f”k must be in range 0 to {len(inital)}”)

sys.exit(1)

# return the initial primes

s = 0

prms = []

while s < k:

p = inital[s]

prms.append(p)

if limit < p:

return

yield p

s = s+1

# setup

m = reduce(lambda x, y: x

y, prms) # the product of the initial primesgaps[i] for i in range(lenModPrms)]p = inital[s]

modPrms = [x for x in range(p, p+m) if all([x % i != 0 for i in prms])]

m0 = modPrms[0]

lenModPrms = len(modPrms)

gaps = [modPrms[i+1]-modPrms[i] for i in range(lenModPrms-1)]

gaps.append(m-modPrms[-1]+modPrms[0])

gaps = gaps * 2 # 2 loops for overflow

ndxs = [item for sublist in [[i]

for item in sublist]

if debug:

print(f”prms={prms}, m={m}, modPrms={modPrms}, gaps={gaps}, ndxs={ndxs}”)

# integral number of wheels rounded up

lmtbf = (limit + m – m0) // m * lenModPrms – 1

lmtsqrt = (int(limit ** 0.5) – m0)

lmtsqrt = lmtsqrt // m * lenModPrms + \

ndxs[lmtsqrt % m] # round down on the wheel

if debug:

print(f”lmtbf={lmtbf}, lmtsqrt={lmtsqrt}”)

buf = [True] * (lmtbf + 1)

for i in range(lmtsqrt + 1):

if buf[i]:

ci = i % lenModPrms

p = m * (i // lenModPrms) + modPrms[ci]

s = p * p – m0

p2 = p * lenModPrms

if debug:

print(f”ci={ci}, p={p}, p2={p2}”)

for j in range(lenModPrms):

c = s // m * lenModPrms + ndxs[s % m]

buf[c::p2] = [False] * ((lmtbf – c) // p2 + 1)

if debug:

print(f”s={s}, c={c}, buf[{c}::{p2}]={buf[c::p2]}”)

s += p * gaps[ci]

ci += 1

# adjust for extras

end = lmtbf – (lenModPrms – 1) + (ndxs[(limit – m0) % m])

if debug:

print(f”end={end}”)

for i in range(end+1):

if buf[i]:

p = m * (i // lenModPrms) + modPrms[i % lenModPrms]

if debug:

print(f”i={i}, p={p}”)

yield p

[/python]

Reposting with (hopefully) proper formatting:

[…] I have been looking for fast and scalable implementations. The fastest Python version I found is here in a comment by Willy […]