Sieve Of Atkin, Improved

February 19, 2010

Our version, translated from a Python version by Steve Krenzel, is shown below:

(define (atkin limit) (define (exact x) (inexact->exact (floor x))) (let ((sieve (make-vector (+ (quotient limit 2) (modulo limit 2)) #f)) (primes (list 3 2))) (define (flip! m) (vector-set! sieve m (not (vector-ref sieve m))))

(let ((x-max (exact (sqrt (/ (- limit 1) 4)))) (x2 0)) (do ((xd 4 (+ xd 8))) ((<= (+ (* x-max 8) 2) xd)) (set! x2 (+ x2 xd)) (let* ((y-max (exact (sqrt (- limit x2)))) (n (+ x2 (* y-max y-max))) (n-diff (+ y-max y-max -1))) (when (even? n) (set! n (- n n-diff)) (set! n-diff (- n-diff 2))) (do ((d (* (- n-diff 1) 2) (- d 8))) ((<= d -1)) (when (member (modulo n 12) (list 1 5)) (flip! (quotient n 2))) (set! n (- n d))))))

(let ((x-max (exact (sqrt (/ (- limit 1) 3)))) (x2 0)) (do ((xd 3 (+ xd 6))) ((<= (+ (* x-max 6) 2) xd)) (set! x2 (+ x2 xd)) (let* ((y-max (exact (sqrt (- limit x2)))) (n (+ x2 (* y-max y-max))) (n-diff (+ y-max y-max -1))) (when (even? n) (set! n (- n n-diff)) (set! n-diff (- n-diff 2))) (do ((d (* (- n-diff 1) 2) (- d 8))) ((<= d -1)) (when (= (modulo n 12) 7) (flip! (quotient n 2))) (set! n (- n d))))))

(let ((x-max (exact (/ (+ (sqrt (- 4 (* (- 1 limit) 8))) 2) 4))) (y-min -1) (x2 0) (xd 3)) (do ((x 1 (+ x 1))) ((<= (+ x-max 1) x)) (set! x2 (+ x2 xd)) (set! xd (+ xd 6)) (when (<= limit x2) (set! y-min (* (- (* (- (inexact->exact (ceiling (sqrt (- x2 limit)))) 1) 2) 2) 2))) (let ((n (- (* (+ (* x x) x) 2) 1)) (n-diff (* (- (* (- x 1) 2) 2) 2))) (do ((d n-diff (- d 8))) ((<= d y-min)) (when (= (modulo n 12) 11) (flip! (quotient n 2))) (set! n (+ n d))))))

(do ((n 2 (+ n 1))) ((<= (quotient (+ (exact (sqrt limit)) 1) 2) n)) (when (vector-ref sieve n) (let* ((p (+ n n 1)) (p2 (* p p))) (set! primes (cons p primes)) (do ((k p2 (+ k (+ p2 p2)))) ((<= limit k)) (vector-set! sieve (quotient k 2) #f)))))

(do ((p (+ (exact (sqrt limit)) 1) (+ p 2))) ((<= limit p)) (when (vector-ref sieve (quotient p 2)) (set! primes (cons p primes))))

(reverse primes)))

The program consists of a header, three blocks that each consider one of the three cases of the basic algorithm, a block that performs the actual sieving of squares (the p2 variable), a block that collects the remaining primes beyond the square root, and a final (reverse primes) that returns the result.

This version is much faster than the naive version, taking 2.563 seconds to enumerate the first million primes, about double the Sieve of Eratosthenes. Daniel Bernstein, co-author with Arthur Atkin of the original paper describing the Sieve of Atkin, has an insanely fast version, faster than the Sieve of Eratosthenes; it is instructive to read the code.

You can run this program at http://programmingpraxis.codepad.org/xyQbTn0p.

Posted by programmingpraxis

Filed in Exercises

8 Comments »

8 Responses to “Sieve Of Atkin, Improved”

Mike said

April 28, 2010 at 6:46 PM

Here is my python version of an improved Sieve of Atkins. On my laptop it calculates the primes up to 1 million in .16 seconds, and primes to 10 million in about 1.5 seconds. That is 3.5 times faster than a basic Sieve of Eratosthenes and about 5.6 times faster than a naive Sieve of Atkins.

To imptove the naive sieve, I first calculated appropriate loop limits. Second I replaced squaring of x and y with additions using the well known formula:(x+1)^2 = x^2 + (2x + 1), and changed the loop counters to return the difference between the squares (i.e., the 2x+1). Obviously, the factor on the x^2 term could also be included in the loop counter. For example, for the 4x^2 term goes 4 16 36 … , so the loop was set up to return the differences 4, 12, 20, ….

Because of the way the loops counters incremented, I suspected there would be a pattern to the results of the n%12 calculations in the y^2 loops. After confirming the pattern, I changed the loop variables and split some loops in two, so that the loops only iterate over values of x and y for which the n%12 test would pass. This greatly reduced the number of loop iterations and eliminated the n%12 test. The start, stop, and step values for the loops were tricky to figure out.

Although I plan further optimization of the 3x^2 – y^2 section when I have time, I thought I’d post what I have now. Also, I didn’t optimize memory usage at all, so I’m sure that can be reduced.

def atkins13(limit=1000000):
    '''use sieve of atkins to find primes <= limit.'''
    primes = [0] * limit

    # n = 3x^2 + y^2 section
    xx3 = 3
    for dxx in xrange( 0, 12*int( sqrt( ( limit - 1 ) / 3 ) ), 24 ):
        xx3 += dxx
        y_limit = int(12*sqrt(limit - xx3) - 36)
        n = xx3 + 16
        for dn in xrange( -12, y_limit + 1, 72 ):
            n += dn
            primes[n] = not primes[n]

        n = xx3 + 4
        for dn in xrange( 12, y_limit + 1, 72 ):
            n += dn
            primes[n] = not primes[n]

    # n = 4x^2 + y^2 section
    xx4 = 0
    for dxx4 in xrange( 4, 8*int(sqrt((limit - 1 )/4)) + 4, 8 ):
        xx4 += dxx4
        n = xx4+1

        if xx4%3:
            for dn in xrange( 0, 4*int( sqrt( limit - xx4 ) ) - 3, 8 ):
                n += dn
                primes[n] = not primes[n]

        else:
            y_limit = 12 * int( sqrt( limit - xx4 ) ) - 36

            n = xx4 + 25
            for dn in xrange( -24, y_limit + 1, 72 ):
                n += dn
                primes[n] = not primes[n]

            n = xx4+1
            for dn in xrange( 24, y_limit + 1, 72 ):
                n += dn
                primes[n] = not primes[n]

    # n = 3x^2 - y^2 section
    xx = 1
    for x in xrange( 3, int( sqrt( limit / 2 ) ) + 1, 2 ):
        xx += 4*x - 4
        n = 3*xx

        if n > limit:
            min_y = ((int(sqrt(n - limit))>>2)<<2)
            yy = min_y*min_y
            n -= yy
            s = 4*min_y + 4

        else:
            s = 4

        for dn in xrange( s, 4*x, 8 ):
            n -= dn
            if n <= limit and n%12 == 11:
                    primes[n] = not primes[n]

    xx = 0
    for x in xrange( 2, int( sqrt( limit / 2 ) ) + 1, 2 ):
        xx += 4*x - 4
        n = 3*xx

        if n > limit:
            min_y = ((int(sqrt(n - limit))>>2)<<2)-1
            yy = min_y*min_y
            n -= yy
            s = 4*min_y + 4

        else:
            n -= 1
            s = 0

        for dn in xrange( s, 4*x, 8 ):
            n -= dn
            if n <= limit and n%12 == 11:
                    primes[n] = not primes[n]

    # eliminate squares        
    for n in xrange(5,int(sqrt(limit))+1,2):
        if primes[n]:
            for k in range(n*n, limit, n*n):
                primes[k] = False

    return [2,3] + filter(primes.__getitem__, xrange(5,limit,2))

Garrett said
August 1, 2011 at 8:02 PM
I made the above code MUCH faster by simply replacing the line:

primes = [0] * limit

with:

primes = numpy.zeros(limit, dtype = ‘bool’)

Obviously this requires the module numpy. This also helps considerably with the “memory management” he was talking about I would think!

I’m still trying to wrap my head around the rest of of the code (really trying to wrap my head around the sieve in genera). Will post when I find out more!
geeksumm said
April 26, 2013 at 1:35 AM
It’s simple. We can cut the loop by using these condition:
if (xx + yy >= limit) {
break;
}

Using JavaScript running on Chrome’s V8, it took about 2.8 seconds to generate primes with limit of 10 million. Here’s the code:

function sieveOfAtkin(limit){
limitSqrt = Math.sqrt(limit);
sieve = [];

//prime start from 2, and 3
sieve[2] = true;
sieve[3] = true;

for (x = 1; x <= limitSqrt; x++) {
xx = x*x;
for (y = 1; y = limit) {
break;
}
// first quadratic using m = 12 and r in R1 = {r : 1, 5}
n = (4 * xx) + (yy);
if (n <= limit && (n % 12 == 1 || n % 12 == 5)) {
sieve[n] = !sieve[n];
}
// second quadratic using m = 12 and r in R2 = {r : 7}
n = (3 * xx) + (yy);
if (n y && n <= limit && (n % 12 == 11)) {
sieve[n] = !sieve[n];
}
}
}

for (n = 5; n <= limitSqrt; n++) {
if (sieve[n]) {
x = n * n;
for (i = x; i <= limit; i += x) {
sieve[i] = false;
}
}
}

// primes are the one which sieve[n] returns true
return sieve;
}
Harsh said
April 16, 2014 at 1:36 PM
Could you please explain how the you got the limits?
Willy Good said
January 16, 2015 at 11:24 AM
This is written in F# under the CLR virtual machine as I prefer the functional programming style; I have translated the codes to Scala under JVM and have gotten about the same results, which makes sense as both are under virtual machines using JIT compilation and both have built in array bounds checks. These versions have very fast bit counting functions to obtain the number of primes rather than using enumeration as the enumeration can take much longer than the actual sieving.

On Gist (https://gist.github.com/anonymous/d17da4f1a8db2fadf10c) I have posted both a reference version of a maximally factorized Sieve of Eratosthenes just as Atkin and Berstein did, but my version is not “crippled” and this version of the SoA for comparison.

On the same Gist link is a version of the SoA from the pseudo code of the updated Wkikpedia article with an added improvement of an inner loop to almost eliminate wasted time of loops that won’t pass the modulo checks, and with the added advantage that the tight innermost culling loops are almost as simple as those used for SoE.

The test program to run each over a billion, only requiring the change of the “open ” state to run either is as follows:
```
open SoA

[<EntryPoint>]
let main argv =
  printfn "%s" getName
  if argv = null || argv.Length = 0 then
    failwith "\r\nno command line argument for limit!!!"

  let test = primes 100UL // |> Seq.iter (printf "%A ")

  let topNum = System.UInt64.Parse argv.[0]

  let strt = System.DateTime.Now.Ticks

//  let count = { 0 .. 3999 } |> Seq.map (fun i -> SoA.primes topNum) |> Seq.max // |> Seq.length
  let count = primes topNum // |> Seq.length

  let elpsd = System.DateTime.Now.Ticks - strt

  printfn "Found %A primes up to %A in %A milliseconds." count topNum (elpsd / 10000L)

  printf "\r\nPress any key to exit:"
  System.Console.ReadKey(true) |> ignore
  printfn ""
  0 // return an integer exit code
```
With the results as follows, first SoEMWF:
```
Uses the Sieve of Eratosthenes to calculate the primes to argument.

Found 50847534UL primes up to 1000000000UL in 3714L milliseconds.
```
and for the SoA:
```
Uses the Sieve of Atkin to calculate the primes to argument.

Found 50847534UL primes up to 1000000000UL in 3892L milliseconds.
```
You can see that the SoEMWF is still slightly faster than the SoA at one billion, but testing shows that the SoA gradually passes it by about two billion on up until we run out of memory at about 60 billion; however, this is for the huge array model, it’s easy to provide a page segmented version for the SoE but very difficult for the SoA and the performance picture completely changes with the SoA losing terribly…

Willy Good said

January 16, 2015 at 11:27 AM

Module codes are here:

module SoEMWF =
  let getName = "\r\nUses the Sieve of Eratosthenes to calculate the primes to argument.\r\n"
  let private WRDBTS, FRSTBP = 16, 23UL
  let private BPRMS = [| 2UL; 3UL; 5UL; 7UL; 11UL; 13UL; 17UL; 19UL |]
  let private BWPS = [| //base wheel positions + 23
    23uy;29uy;31uy;37uy;41uy;43uy;47uy;53uy; 59uy;61uy;67uy;71uy;73uy;79uy;83uy;89uy;
    97uy;101uy;103uy;107uy;109uy;113uy;121uy;127uy; 131uy;137uy;139uy;143uy;149uy;151uy;157uy;163uy;
    167uy;169uy;173uy;179uy;181uy;187uy;191uy;193uy; 197uy;199uy;209uy;211uy;221uy;223uy;227uy;229uy |]
  let private GAPS = [| //wheel position deltas for each computation of next incremental cull position
    6uy; 2uy; 6uy; 4uy; 2uy; 4uy; 6uy; 6uy; 2uy; 6uy; 4uy; 2uy; 6uy; 4uy; 6uy; 8uy;
    4uy; 2uy; 4uy; 2uy; 4uy; 8uy; 6uy; 4uy; 6uy; 2uy; 4uy; 6uy; 2uy; 6uy; 6uy; 4uy;
    2uy; 4uy; 6uy; 2uy; 6uy; 4uy; 2uy; 4uy; 2uy; 10uy; 2uy; 10uy; 2uy; 4uy; 2uy; 4uy;
    6uy; 2uy; 6uy; 4uy; 2uy; 4uy; 6uy; 6uy; 2uy; 6uy; 4uy; 2uy; 6uy; 4uy; 6uy; 8uy; //2 rounds for overflow
    4uy; 2uy; 4uy; 2uy; 4uy; 8uy; 6uy; 4uy; 6uy; 2uy; 4uy; 6uy; 2uy; 6uy; 6uy; 4uy;
    2uy; 4uy; 6uy; 2uy; 6uy; 4uy; 2uy; 4uy; 2uy; 10uy; 2uy; 10uy; 2uy; 4uy; 2uy; 4uy |]
  let private WHTS, WCRC = BWPS.Length, (GAPS |> Seq.map uint64 |> Seq.sum) / 2UL //check: 48 and 210UL
  let private WHLNDXS = //look up table for (position - 23) mod WCRC - 0 to WCRC - 1, to wheel index - 0 to WHTS - 1
    (GAPS |> Seq.take WHTS |> Seq.mapi (fun i x -> i, x)
    |> Seq.collect (fun (i, n) -> seq { for n = 1uy to n do yield byte i })) |> Seq.toArray
 
  let primes topNum =
    if topNum >= ((1UL <<< 30) / 3UL - 1UL) * uint64 WCRC + FRSTBP then failwith "Argument too large for algorithm!!!"
    let BFSZ = if topNum < FRSTBP then 6 else ((int ((topNum - FRSTBP) / uint64 WCRC) + 2) * 3) //includes topNum
 
    let cull (cmpsts: _ array) (bps: _ seq) =
      let topNdx = uint64 (cmpsts.Length * WRDBTS - 1)
      let topRng = topNdx / uint64 WHTS * WCRC + uint64 BWPS.[WHTS - 1] //from size of current array
      let kSqrtLim = int (sqrt (float (topRng - WCRC))) / int WCRC //base prime test limit value comp for extra
      bps |> Seq.takeWhile (((>=) kSqrtLim) << fst) |> Seq.iter (fun (k, i) -> //for each base prime less than limit
        let x = uint64 BWPS.[int i] in let p = (uint64 k * WCRC) + x
        let s = p * p - FRSTBP - x * p in let pm = 3 * int p
        { int i .. int i + WHTS - 1 } |> Seq.map (fun i ->
            let adj, ni = if i >= WHTS then WCRC, i - WHTS else 0UL, i
            let c0 = s + p * (uint64 BWPS.[ni] + adj)
            let c0d = c0 / WCRC in let c0n = WHLNDXS.[int (c0 - c0d * WCRC)]
            let c0dwtry = 3 * int c0d
            if c0n < 16uy then c0dwtry, c0n else if c0n < 32uy then c0dwtry + 1, c0n - 16uy
                                                                else c0dwtry + 2, c0n - 32uy )
        |> Seq.takeWhile (((>) cmpsts.Length) << fst) //loop maximum of wheel hits times; run tight cull loop for each
        |> Seq.iter (fun (ccd, cm) -> //tight culling loops almost all the time, following faster when 32-bit mode
          let cmsk = 1us <<< int cm
          let inline setCmpst i = cmpsts.[i] <- cmpsts.[i] ||| cmsk
          let rec cullp c = 
            if c < cmpsts.Length then setCmpst c; cullp (c + pm) in cullp ccd) ) //fast for 64 bit mode
 
    let cmpsts = //first generate the small master pre-culled pattern array
      let mstr = Array.zeroCreate (11 * 13 * 17 * 19 * WHTS / WRDBTS)
      { WHTS - 4 .. WHTS - 1 } |> Seq.map (fun m -> (-1, byte m)) |> cull mstr //cull 11/13/17/19
      let arr = Array.zeroCreate BFSZ
      { 0 .. mstr.Length .. (arr.Length - 1) } //copy repetitions of pre-cull pattern array until full
      |> Seq.iter (fun i -> let len = if i + mstr.Length <= arr.Length
                                      then mstr.Length else arr.Length - i
                            System.Array.Copy(mstr, 0, arr, i, len))
      arr //has an slight excess of up to two wheels to come out even in int's/wheels
 
    let inline testCmptst ndx = cmpsts.[ndx >>> 4] &&& (1us <<< (ndx &&& 15)) <> 0us
 
    Seq.initInfinite id |> Seq.collect (fun k -> { 0 .. WHTS - 1 } |> Seq.map (fun i -> k, byte i))
    |> Seq.filter (fun (k, i) -> not (testCmptst (WHTS * k + int i))) |> cull cmpsts //cull all composites
 
    //an extremely fast bit counting routine by 16-bit words...
    //mask off all buffer composite bits above the top number:
    let topd = (topNum - FRSTBP) / WCRC in let topn = int WHLNDXS.[int (topNum - FRSTBP - topd * WCRC)]
    let trywrd = int topd * 3
    let wrd, wrdn = if topn < 16 then trywrd, topn else if topn < 32 then trywrd + 1, topn - 16
                                                                     else trywrd + 2, topn - 32
    let awrd = if topNum < FRSTBP then -1 else cmpsts.[wrd] <- cmpsts.[wrd] ||| (uint16 -2 <<< wrdn); wrd    
    { awrd + 1 .. cmpsts.Length - 1 } |> Seq.iter (fun w -> cmpsts.[w] <- uint16 -1)
    let rec count w acc = //tight count loop
      if w < cmpsts.Length then //magic int16 bit count algorithm - slightly slower than Look Up Table for int16's
        let a = cmpsts.[w] in let b = a - ((a >>> 1) &&& 0x5555us)
        let c = (b &&& 0x3333us) + ((b >>> 2) &&& 0x3333us)
        count (w + 1) (acc - uint64 ((((c + (c >>> 4)) &&& 0x0F0Fus) * 0x0101us) >>> 8))
      else acc in count 0 (uint64 (BPRMS |> Seq.takeWhile ((>=) topNum) |> Seq.length)) + //include base primes
                  uint64 cmpsts.Length * uint64 WRDBTS //counts bits not set
 
//    seq { yield! BPRMS; |> Seq.takeWhile ((>=) topNum) //very slow but literal way of enumerating resulting primes:
//          yield! Seq.initInfinite id
//                 |> Seq.collect (fun k -> BWPS |> Seq.mapi (fun i x -> WCRC * uint64 k + uint64 x, WHTS * k + i))
//                 |> Seq.takeWhile (((>=) topNum) << fst)
//                 |> Seq.filter (not << testCmptst << snd) |> Seq.map (fst) }
 
module SoA =
  let getName = "\r\nUses the Sieve of Atkin to calculate the primes to argument.\r\n"
  let private WRDBTS, FRSTBP = 16, 7UL
  let private BPRMS = [| 2UL; 3UL; 5UL |]
  let private BWPS = [| 7uy;11uy;13uy;17uy;19uy;23uy;29uy;31uy; //base wheel positions + 7
                        37uy;41uy;43uy;47uy;49uy;53uy;59uy;61uy |]
  let private GAPS = [| //wheel position deltas for each computation of next incremental cull position
    4uy; 2uy; 4uy; 2uy; 4uy; 6uy; 2uy; 6uy; 4uy; 2uy; 4uy; 2uy; 4uy; 6uy; 2uy; 6uy; //2 times for overflow
    4uy; 2uy; 4uy; 2uy; 4uy; 6uy; 2uy; 6uy; 4uy; 2uy; 4uy; 2uy; 4uy; 6uy; 2uy; 6uy |]
  let private WHTS, WCRC = BWPS.Length, (GAPS |> Seq.map uint64 |> Seq.sum) / 2UL //check: 16 and 60UL
  let private WHLNDXS = //look up table for (position - 23) mod WCRC - 0 to WCRC - 1, to wheel index - 0 to WHTS - 1
    (GAPS |> Seq.take WHTS |> Seq.mapi (fun i x -> i, x)
    |> Seq.collect (fun (i, n) -> seq { for n = 1uy to n do yield byte i })) |> Seq.toArray
 
  let primes topNum =
    if topNum >= ((1UL <<< 30) - 1UL) * uint64 WCRC + FRSTBP then failwith "Argument too large for algorithm!!!"
    let BFSZ = if topNum < FRSTBP then 2 else int ((topNum - FRSTBP) / WCRC) + 2 //includes topNum with spare
 
    let is_prms = //initialize the prime sieving array
      Array.zeroCreate BFSZ //has an slight excess of the rest of the last wheel to come out even in int's/wheels
 
    let setprms (prms: _ array) (bps: _ seq) =
      let topNdx = uint64 (prms.Length * WRDBTS - 1)
      let topRng = topNdx / uint64 WHTS * WCRC + uint64 BWPS.[WHTS - 1] //maximum value of current array
      let kSqrtLim = int (sqrt (float (topRng - WCRC))) / int WCRC //compensate for extra wheel
 
      //run the 4 * x * x + y * y quadratic prime generating sequences = all x's and odd y's
      Seq.initInfinite (uint64 << ((+) 1)) |> Seq.map (fun x -> 4UL * x * x + 1UL)
      |> Seq.takeWhile ((>=) topRng)
      |> Seq.collect (fun st -> { 1UL .. 2UL .. 29UL } |> Seq.map (fun midy -> st + midy * midy - 1UL, int midy + 15)
                                |> Seq.takeWhile (((>=) topRng) << fst) )
      |> Seq.iter (fun (n, myp15) ->
        if n < FRSTBP then () else //no action for n = 5 - not one of the modulos for this quadratic
          let cbd = (n - FRSTBP) / WCRC in let cr = n - cbd * WCRC
          let nm = int (if cr < WCRC then cr else cr - WCRC)
          if 1UL <<< nm &&& 0x22022020022002UL = 0UL then () else //if nm in {1,13,17,29,37,41,49,53}
            let cm = 1us <<< (int WHLNDXS.[int (cr - FRSTBP)]) //following is the tight loop that does it
            let rec tgl c inc = if c < prms.Length then prms.[c] <- prms.[c] ^^^ cm; tgl (c + inc) (inc + 30)
            tgl (int cbd) myp15)
 
      //run the 3 * x * x + y * y quadratic prime generating sequences - odd x's and even y's
      Seq.initInfinite (uint64 << ((+) 1) << ((*) 2)) |> Seq.map (fun x -> 3UL * x * x + 4UL)
      |> Seq.takeWhile ((>=) topRng)
      |> Seq.collect (fun st -> { 2UL .. 2UL .. 30UL } |> Seq.map (fun midy -> st + midy * midy - 4UL, int midy + 15)
                                |> Seq.takeWhile (((>=) topRng) << fst) )
      |> Seq.iter (fun (n, myp15) ->
        let cbd = (n - FRSTBP) / WCRC in let cr = n - cbd * WCRC
        let nm = int (if cr < WCRC then cr else cr - WCRC)
        if 1UL <<< nm &&& 0x80080080080UL = 0UL then () else //if cm in {7,19,31,43}
          let cm = 1us <<< (int WHLNDXS.[int (cr - FRSTBP)]) //following is the tight loop that does it
          let rec tgl c inc = if c < prms.Length then prms.[c] <- prms.[c] ^^^ cm; tgl (c + inc) (inc + 30)
          tgl (int cbd) myp15)
 
      //run the 3 * x * x - y * y quadratic prime generating sequences - both even/odd and odd/even combos in x's/y's
      Seq.initInfinite (uint64 << ((+) 2)) |> Seq.map (fun x -> 2UL * x * x + 2UL * x - 1UL, x - 1UL)
      |> Seq.takeWhile (((>=) topRng) << fst) //reverse order ok in the following
      |> Seq.collect (fun (st, by) -> Seq.unfold (fun x -> if x + 28UL < by || x < 1UL then None
                                                           else if x < 2UL then Some(x, 0UL) else Some(x, x - 2UL)) by
                                      |> Seq.map (fun midy -> st - midy * midy + by * by, int midy - 15)
                                      |> Seq.takeWhile (((>=) topRng) << fst))
      |> Seq.iter (fun (n, mym15) ->
        let cbd = (n - FRSTBP) / WCRC in let cr = n - cbd * WCRC
        let nm = int (if cr < WCRC then cr else cr - WCRC)
        if 1UL <<< nm &&& 0x800800000800800UL = 0UL then () else //if cm in {11,23,47,59}
          let cm = 1us <<< (int WHLNDXS.[int (cr - FRSTBP)]) //following is the tight loop that does it
          let rec tgl c inc = if inc > -15 && c < prms.Length then
                                prms.[c] <- prms.[c] ^^^ cm; tgl (c + inc) (inc - 30) in tgl (int cbd) mym15)
 
      //run the prime square-free culling operations:
      bps |> Seq.takeWhile (((>=) kSqrtLim) << fst) |> Seq.iter (fun (k, i) -> //for each base prime less than limit
        let x = uint64 BWPS.[int i] in let p = (uint64 k * WCRC) + x in let sndx = BWPS.Length - 1
        let sqr = p * p in let s = sqr - FRSTBP - sqr * uint64 BWPS.[sndx]
        { sndx .. sndx + WHTS - 1 } |> Seq.map (fun i ->
            let adj, ni = if i >= WHTS then WCRC, i - WHTS else 0UL, i
            let c0 = s + sqr * (uint64 BWPS.[ni] + adj)
            let c0d = c0 / WCRC in let c0n = WHLNDXS.[int (c0 - c0d * WCRC)]
            (c0d, c0n) )
        |> Seq.takeWhile (((>) (prms.Length - 1)) << int << fst) //loop maximum of wheel hits times; run tight cull loop for each
        |> Seq.iter (fun (ccd, cm) -> //tight culling loops almost all the time, following faster when 32-bit mode
          let cmsk = ~~~(1us <<< int cm)
          let rec cullp c = //tight prime square culling loop
            if c < uint64 prms.Length then prms.[int c] <- prms.[int c] &&& cmsk; cullp (c + sqr) in cullp ccd) )
 
    let inline testPrm ndx = is_prms.[int (ndx >>> 4)] &&& (1us <<< (int ndx &&& 15)) <> 0us
 
    Seq.initInfinite id |> Seq.collect (fun k -> { 0 .. WHTS - 1 } |> Seq.map (fun i -> k, byte i))
    |> Seq.filter (fun (k, i) -> testPrm (WHTS * int k + int i)) |> setprms is_prms //find all primes
 
    //an extremely fast bit counting routine by 16-bit words => number of primes...
    //mask off all buffer prime bits above the top number:
    let wrd = (topNum - FRSTBP) / WCRC in let wrdn = int WHLNDXS.[int (topNum - FRSTBP - wrd * WCRC)]
    let awrd = if topNum < FRSTBP then -1 else
               is_prms.[int wrd] <- is_prms.[int wrd] &&& ~~~(uint16 -2 <<< wrdn); int wrd
    { awrd + 1 .. is_prms.Length - 1 } |> Seq.iter (fun w -> is_prms.[w] <- 0us)
    let rec count w acc = //tight count loop
      if w < is_prms.Length then //magic int16 bit count algorithm - slightly slower than Look Up Table for int16's
        let a = is_prms.[w] in let b = a - ((a >>> 1) &&& 0x5555us)
        let c = (b &&& 0x3333us) + ((b >>> 2) &&& 0x3333us)
        count (w + 1) (acc + uint64 ((((c + (c >>> 4)) &&& 0x0F0Fus) * 0x0101us) >>> 8))
      else acc in count 0 (uint64 (BPRMS |> Seq.takeWhile ((>=) topNum) |> Seq.length)) //counts set bits incl. base
 
//    seq { yield! BPRMS |> Seq.takeWhile ((>=) topNum); //very slow but literal way of enumerating resulting primes:
//          yield! Seq.initInfinite id
//                 |> Seq.collect (fun k -> BWPS |> Seq.mapi (fun i x -> WCRC * uint64 k + uint64 x, WHTS * k + i))
//                 |> Seq.takeWhile (((>=) topNum) << fst)
//                 |> Seq.filter (testPrm << snd) |> Seq.map (fst) }
open SoA
 
[<EntryPoint>]
let main argv =
  printfn "%s" getName
  if argv = null || argv.Length = 0 then
    failwith "\r\nno command line argument for limit!!!"
 
  let test = primes 100UL // |> Seq.iter (printf "%A ")
 
  let topNum = System.UInt64.Parse argv.[0]
 
  let strt = System.DateTime.Now.Ticks
 
//  let count = { 0 .. 3999 } |> Seq.map (fun i -> SoA.primes topNum) |> Seq.max // |> Seq.length
  let count = primes topNum // |> Seq.length
 
  let elpsd = System.DateTime.Now.Ticks - strt
 
  printfn "Found %A primes up to %A in %A milliseconds." count topNum (elpsd / 10000L)
 
  printf "\r\nPress any key to exit:"
  System.Console.ReadKey(true) |> ignore
  printfn ""
  0 // return an integer exit code

Andres said
March 6, 2017 at 10:37 PM
This is an impressive optimization- it’s too bad we won’t know how you calculated your limits.
Willy Good said
June 17, 2018 at 12:37 AM
@Andres, I’m not sure what you mean by “how I calculated my limits”. If you mean the limit of the largest number that can be sieved in these naive “one huge sieving array” sieves, that’s based on the largest DotNet array being limited to two Gigabytes unless one turns on some extensions (and then not much larger), that the sieving is done with one bit representing a possible prime excluding composites eliminated by wheel factorization, thus there are 16 possibilites for each 16 bit word representing a 60 number wheel for Atkin, and 48 possibilies for each three 16 bit words representing a wheel range of 210 numbers for Eratosthenes, with a few numbers eliminated for the begining and some extras to allow for wheel overflow at the end; thus, the maximum range is about 60 billion for Atkin and about 75 billion for Eratostheses. The reason these sieves are “naive” is that they are very slow in memory access times, and the only reason they are moderately useful is that the number of operations has been greatly reduced by the algorithms, with both having about the same (slow) memory access speed as they use similar bit culling/toggling techniques.

The reason I wrore these is to show that the fully optimized SoE is at least as fast as the fully optimized SoA when other environment limits are the same (the memory acess speed).

To be non naive, segmented sieves should be used to both improve memory access time by a factor of 10 or more and to increase the potential size of the sieving ranges, but then the problems with the Atkin sieve become apparent: it is quite easy to do the conversion for the SoE but very difficukt to do it for the SoA (see Atkin and Bernstein’s reference implementation, and also, no matter what tricks are used, the SoA gets slower much faster with increasing range than the SoE due to more complexity and a greatly increasing span of culling/toggling jumps as compared to the span jumps required for SoE.

I wrote the programs in compiled F# for conciseness, but for these naive implementations, the language efficency doesn’t much matter as the main bottleneck is the memory access speed of many 10’s of CPU clock cycles (over a hundred for some systems); thus, fast native code compilers such as C won’t be much faster and even interpreted languages such as Python won’ be that much slower.

Programming Praxis