The Sum Of The First Billion Primes

September 11, 2012

We start by noting that there is no formula for computing the sum of a set of prime numbers; you have to enumerate each prime and add them up. We also note that the billionth prime is 22801763489, which we can calculate using the algorithm of a previous exercise, so whatever algorithm we use for generating the primes, we should be prepared for a long-running program. We will develop several different functions for generating primes and use this generic calling function for testing:

(define (sum-primes gen n) (let loop ((p (gen)) (n n) (s 0)) (if (zero? n) s (loop (gen) (- n 1) (+ s p)))))

Our first generator uses brute force; generate all the integers starting from 2, and test each for primality using trial division. For small generators, say several thousand primes, this is hard to beat:

(define (gen-brute) (define (prime? n) (if (even? n) (= n 2) (let loop ((d 3)) (cond ((< n (* d d)) #t) ((zero? (modulo n d)) #f) (else (loop (+ d 2))))))) (let ((next 2)) (lambda () (let loop ((n (+ next 1))) (if (prime? n) (let ((p next)) (set! next n) p) (loop (+ n 1)))))))

We time the function like this:

> (time (sum-primes (gen-brute) 1000000)) (time (sum-primes (gen-brute) ...)) 157 collections 132289 ms elapsed cpu time, including 15 ms collecting 132341 ms elapsed real time, including 10 ms collecting 1323026560 bytes allocated, including 1322982416 bytes reclaimed 7472966967499

Our second generator also loops over numbers, testing each for primality, but both the list of numbers and the primality test are improved from the brute-force generator. We generate candidate primes using a 2,3,5,7-wheel, which greatly reduces the number of candidates. We test primality using strong pseudoprime tests to bases 2, 7 and 61, which limits the generator to 32-bit primes; that means we won’t be able to sum the first billion primes, but it is suitable for smaller generators, and because this generator is faster than the previous one (except over small ranges), has compact code, and uses only a small, constant amount of space, it is the generator of choice for many applications:

(define (gen-wheel) (define (prime? n) ; n < 2^32 (define (expm b e m) (define (times x y) (modulo (* x y) m)) (let loop ((b b) (e e) (r 1)) (if (zero? e) r (loop (times b b) (quotient e 2) (if (odd? e) (times b r) r))))) (define (witness? a n) ; composite if #t (do ((d (- n 1) (/ d 2)) (s 0 (+ s 1))) ((odd? d) (let ((t (expm a d n))) (if (or (= t 1) (= t (- n 1))) #f (do ((s (- s 1) (- s 1)) (t (expm t 2 n) (expm t 2 n))) ((or (zero? s) (= t (- n 1))) (zero? s)))))))) (cond ((zero? (modulo n 2)) (= n 2)) ((zero? (modulo n 7)) (= n 7)) ((zero? (modulo n 61)) (= n 61)) (else (not (or (witness? 2 n) (witness? 7 n) (witness? 61 n)))))) (let ((wheel (cons 1 (cons 2 (cons 2 (cons 4 (let ((xs (list 2 4 2 4 6 2 6 4 2 4 6 6 2 6 4 2 6 4 6 8 4 2 4 2 4 8 6 4 6 2 4 6 2 6 6 4 2 4 6 2 6 4 2 4 2 10 2 10))) (set-cdr! (last-pair xs) xs) xs)))))) (next 2)) (lambda () (let loop ((n (+ next (car wheel)))) (set! wheel (cdr wheel)) (if (not (prime? n)) (loop (+ n (car wheel))) (let ((result next)) (set! next n) result))))))

Here’s our timing test, which shows that the wheel is between four and five times faster than brute force:

> (time (sum-primes (gen-wheel) 1000000)) (time (sum-primes (gen-wheel) ...)) 464 collections 28891 ms elapsed cpu time, including 16 ms collecting 28882 ms elapsed real time, including 30 ms collecting 3912633392 bytes allocated, including 3907682192 bytes reclaimed 7472966967499

Our third generator is based on a variant of the Sieve of Eratosthenes due to Melissa O’Neill; we studied it in a previous exercise. The idea is to keep a priority queue of the next multiple of each prime seen so far; each odd number in order becomes a candidate prime, the current candidate is declared prime if it is not at the head of the priority queue and a new sieving prime is added to the priority queue, but the current candidate is declared composite if it is found at the head of the priority queue and it replaced by its next multiple:

(define (gen-oneill) (define (pq-rank pq) (vector-ref pq 0)) (define (pq-item pq) (vector-ref pq 1)) (define (pq-lkid pq) (vector-ref pq 2)) (define (pq-rkid pq) (vector-ref pq 3)) (define pq-empty (vector 0 'pq-empty 'pq-empty 'pq-empty)) (define (pq-empty? pq) (eqv? pq pq-empty)) (define (pq-merge lt? p1 p2) (define (pq-swap item lkid rkid) (if (< (pq-rank rkid) (pq-rank lkid)) (vector (+ (pq-rank rkid) 1) item lkid rkid) (vector (+ (pq-rank lkid) 1) item rkid lkid))) (cond ((pq-empty? p1) p2) ((pq-empty? p2) p1) ((lt? (pq-item p2) (pq-item p1)) (pq-swap (pq-item p2) (pq-lkid p2) (pq-merge lt? p1 (pq-rkid p2)))) (else (pq-swap (pq-item p1) (pq-lkid p1) (pq-merge lt? (pq-rkid p1) p2))))) (define (pq-insert lt? x pq) (pq-merge lt? (vector 1 x pq-empty pq-empty) pq)) (define (pq-first pq) (if (pq-empty? pq) (error 'pq-first "empty priority queue") (pq-item pq))) (define (pq-rest lt? pq) (if (pq-empty? pq) (error 'pq-rest "empty priority queue") (pq-merge lt? (pq-lkid pq) (pq-rkid pq)))) (define (lt? a b) (or (< (car a) (car b)) (and (= (car a) (car b)) (< (cdr a) (cdr b))))) (let ((first? 2) (second? 3) (prev 3) (pq (pq-insert lt? (cons 9 6) pq-empty))) (lambda () (if first? (begin (set! first? #f) 2) (if second? (begin (set! second? #f) 3) (let loop1 ((n (+ prev 2))) (if (< n (car (pq-first pq))) (let ((c (* n n)) (s (+ n n))) (set! prev n) (set! pq (pq-insert lt? (cons c s) pq)) n) (let loop2 () (if (< n (car (pq-first pq))) (loop1 (+ n 2)) (let* ((c (car (pq-first pq))) (s (cdr (pq-first pq))) (cs (cons (+ c s) s))) (set! pq (pq-insert lt? cs (pq-rest lt? pq))) (loop2)))))))))))

Here’s the timing test; the Sieve of Eratosthenes is a good algorithm, but the implementation using priority queues is very slow:

> (time (sum-primes (gen-oneill) 1000000)) (time (sum-primes (gen-oneill) ...)) 3645 collections 91042 ms elapsed cpu time, including 2164 ms collecting 91184 ms elapsed real time, including 2245 ms collecting 30695660288 bytes allocated, including 30633126688 bytes reclaimed 7472966967499

Our fourth, and final, generator is based on a segmented variant of the Sieve of Eratosthenes. The code is moderately complex, and it requires space that grows as the square root of the size of the primes being generated to store the sieving primes plus a constant amount of space to store the sieve, but against these disadvantages it is very fast; if you can afford the space, this is the method of choice:

(define (gen-sieve) (define (prime? n) (if (even? n) (= n 2) (let loop ((d 3)) (cond ((< n (* d d)) #t) ((zero? (modulo n d)) #f) (else (loop (+ d 2))))))) (define (init) (let ((ps (list)) (qs (list)) (sv (make-vector 50000 #t))) (do ((p 3 (+ p 2))) ((< 100000 (* p p))) (when (prime? p) (set! ps (cons p ps)))) (set! qs (map (lambda (p) (+ (* 1/2 (- p 1)) p)) ps)) (do ((ps ps (cdr ps)) (qs qs (cdr qs))) ((null? ps)) (let ((p (car ps)) (q (car qs))) (do ((i q (+ i p))) ((<= 50000 i)) (vector-set! sv i #f)))) (values ps qs sv))) (define (advance ps qs sv bot) (do ((i 0 (+ i 1))) ((= i 50000)) (vector-set! sv i #t)) (set! qs (map (lambda (p q) (modulo (- q 50000) p)) ps qs)) (do ((p (+ (car ps) 2) (+ p 2))) ((< (+ bot 100000) (* p p))) (when (prime? p) (set! ps (cons p ps)) (set! qs (cons (/ (- (* p p) bot 1) 2) qs)))) (do ((ps ps (cdr ps)) (qs qs (cdr qs))) ((null? ps)) (let ((p (car ps)) (q (car qs))) (do ((i q (+ i p))) ((<= 50000 i)) (vector-set! sv i #f)))) (values ps qs sv)) (let-values (((ps qs sv) (init)) ((bot next-i next-p) (values 0 1 2))) (lambda () (let ((result next-p)) (let loop ((i next-i)) (cond ((= i 50000) (set! bot (+ bot 100000)) (let-values (((p q s) (advance ps qs sv bot))) (set! ps p) (set! qs q) (set! sv s) (set! next-i 0)) (loop 0)) ((vector-ref sv i) (set! next-i (+ i 1)) (set! next-p (+ bot i i 1)) result) (else (loop (+ i 1)))))))))

We used the same primality test by trial division as the brute force generator, because for small primes, such as the sieving primes, it’s faster than the alternatives. Variable sv is the bitarray of the sieve, which contains only odd numbers, and variable bot is the even number just below the smallest number in the current segment, which are hard-coded at intervals of 100000. Variable ps is the list of sieving primes, in descending order. Variable qs corresponds to ps, and holds the index in the sieve bitarray of the smallest multiple of the corresponding sieving prime in the current segment. Init collects the sieving primes less than 100000 in the first do, computes the corresponding qs in the map, and sieves each of the sieving primes in the second do, using the third (nested) do to mark off multiples. Advance resets for the next segment, clearing the bitarray in the first do, calculating the new qs in the map, extending the ps and qs lists for the new sieving primes in the second do, and sieving in the third and fourth (nested) do. The loop after the lambda finds the next set bit in the sieve, resetting for the next segment when the current segment is exhausted. Variable next-i is the next sieve location to be checked for primality, and variable next-p is the next unreported prime, which is always calculated in advance of its need. Here’s the timing test:

> (time (sum-primes (gen-sieve) 1000000)) (time (sum-primes (gen-sieve) ...)) 7 collections 1832 ms elapsed cpu time, including 4 ms collecting 1905 ms elapsed real time, including 6 ms collecting 32659704 bytes allocated, including 29472880 bytes reclaimed 7472966967499

Finally, we answer the original question; here’s the sum of the first billion primes:

> (time (sum-primes (gen-sieve) 1000000000)) (time (sum-primes (gen-sieve) ...)) 23357 collections 2034715 ms elapsed cpu time, including 14597 ms collecting 2173303 ms elapsed real time, including 15477 ms collecting 98382285776 bytes allocated, including 98384545808 bytes reclaimed 11138479445180240497

That’s 33:54.7 minutes, or about half a million primes per second. Which is really rockin’ big-time. You can see more of these prime sums at http://oeis.org/A007504/a007504.txt.

You can run the program at http://programmingpraxis.codepad.org/hbNkJxt5, but only for the first ten thousand primes due to time constraints.

Posted by programmingpraxis

Filed in Exercises

14 Comments »

14 Responses to “The Sum Of The First Billion Primes”

Kim Walisch said

September 14, 2012 at 10:35 PM

//////////////////////////////////////////////////////////////////////
// sum_primes.cpp

#include 
#include 
#include 

uint64_t sum = 0;

void callback(uint64_t prime)
{
  sum += prime;
}

int main()
{
  PrimeSieve ps;
  ps.generatePrimes(2, (uint64_t) 22801763489.0, callback);
  std::cout << "Sum = " << sum << std::endl;
  return 0;
}

//////////////////////////////////////////////////////////////////////

The above C++ code sums primes using my primesieve library (https://primesieve.googlecode.com). To compile it on a Linux/Unix system use:

$ svn checkout http://primesieve.googlecode.com/svn/trunk/ primesieve
$ cd primesieve
$ make lib
$ sudo make install

$ g++ -O2 sum_primes.cpp -o sum_primes -lprimesieve
$ ./sum_primes
Sum = 11138479445180240497

This takes about 10 seconds on my Intel Core-i5 CPU.

Catalin Cristu (@catalin_c) said

September 17, 2012 at 8:27 AM

A (inefficient) Python solution:

import time

def primes():
    yield 2
    yield 3
    stack = [3]
    while True:
        n = stack[-1] + 2
        is_prime = False
        while not is_prime:
            is_prime = True
            for p in stack:
                if p * p > n:
                    break
                if n % p == 0:
                    is_prime = False
                    break
            if is_prime:
                stack.append(n)
            else:
                n += 2
        yield stack[-1]

def sum_primes(n):
    result = 0
    count = 0
    for p in primes():
        result += p
        count += 1
        if count == n:
            break
    return result

if __name__ == '__main__':
    start = time.clock()
    print(sum_primes(1000000))
    print("elapsed %.2fs" % (time.clock() - start))

On my machine (Intel Core i3) this takes around 87 seconds to complete.

Jeff said

September 23, 2012 at 11:55 PM

Java noob’s super inefficient, but extensively documented, Java

public class AddPrimes {

    public static void main(String[] args) {
        
        //Initialize number as 2, the first possible prime number larger than one
        number = 2;
        //Main Loop; loop until 1 billion prime numbers have been discovered
        while(primeCounter < 1000000){
            //Initialize fields for Calculation Loop.
            m = Math.sqrt(number);
            divisor = 2;
            //Calculation Loop
            while(divisor <= m){
                //Initialize quotient
                quotient = number /divisor;
                //Initialize isInteger
                isInteger = checkIsInteger(quotient);
                //If the quotient is an integer, add it to the list of factors.  Increase divisor by 1.
                if(isInteger == true ){
                    factors.add(quotient);
                    divisor++;
                //If the quotient is not an integer, increase the divisor by 1.
                }else{
                    divisor++;
                }
            }
            //Once all quotients are analyzed, if the list of factors is empty, then the number is prime.
            //Add the number to the sum of primes.  Increase primeCounter by 1.  Increase number by 1.
            if(factors.isEmpty() == true){
                sum += number;
                primeCounter ++;
                number++;
            //If the list of factors is not empty, then the number is not prime.  Clear the list of factors.
            //Increase number by 1.
            }else{
                factors.clear();
                number++;
            }
        }
        //Display the sum of the total number of primes discovered.
        System.out.println(sum);
    }
    /**
     * Determines whether or not the given double is an integer.
     * @param d - double
     * @return true if d is an integer or false if d is not an integer
     */
    public static boolean checkIsInteger(double d){
        boolean intCheck;
        String doubleText = Double.toString(d);
        if(doubleText.endsWith("0")){
            intCheck = true;
        }else{
            intCheck = false;
        }
        return intCheck;
    }
    
    //Necessary fields
        /**
         * The current number being tested for primality.
         */
        static double number;
        /**
         * The sum of all numbers determined to be prime.
         */ 
        static int sum;
        /**
         * The square root of the current number.
         */ 
        static double m;
        /**
         * The current divisor.
         */
        static int divisor;
        /**
         * The quotient of number / divisor.
         */
        static double quotient;
        /**
         * A counter that records the total number of discovered prime numbers.
         */
        static int primeCounter;
        /**
         * Boolean placeholder for the result of the checkInteger method.
         */
        static boolean isInteger;
        /**
         * List of any factors discovered for number.
         */
        static ArrayList factors = new ArrayList();
}

David said
September 29, 2012 at 10:26 PM
Done in SwiftForth with a bitset library I wrote a long time ago. Took 40 minutes or 410K primes per second. (While also watching web video, etc. on the computer.) It uses the compressed 33K file from previous exercise to store primes < 1 million, which is used as the sieving primes, and the segmented sieve from another previous exercise to sieve blocks after 1 million, in increments of 100,000.

requires bitset

\ Table lookup for "small" primes, (33K table for primes < 1e6)

1,000,003 drop Constant FILE_PRIME_MAX

: load-primes ( str — addr )
r/o bin open-file throw locals| fd |
fd file-size throw drop ( — size )
s" CREATE prime-table" evaluate
here over allot
swap fd read-file throw drop
fd close-file throw ;

s" ../prime1e6.dat" load-primes

hex
create masks[]
FF c, \ 0
FE c, FE c, FE c, FE c, FE c, FE c, \ 1 – 6
FC c, FC c, FC c, FC c, \ 7 – 10
F8 c, F8 c, \ 11 – 12
F0 c, F0 c, F0 c, F0 c, \ 13 – 16
E0 c, E0 c, \ 17 – 18
C0 c, C0 c, C0 c, C0 c, \ 19 – 22
80 c, 80 c, 80 c, 80 c, 80 c, 80 c, \ 23 – 28
00 c, \ 29

decimal

create offset_mod_11
-1 , 1 , 7 , -1 , 11 , 17 , -1 , 29 , 13 , 23 , 19 ,

: lowbit>offset
dup negate and 11 mod cells offset_mod_11 + @ ;

: next-small-prime ( n — n )
dup 2 < IF drop 2 EXIT THEN
dup 3 < IF drop 3 EXIT THEN
dup 5 < IF drop 5 EXIT THEN
dup FILE_PRIME_MAX >= abort" : precomputed prime table exceeded"

30 /mod prime-table +
swap masks[] + c@ \ addr mask
over c@ and lowbit>offset
BEGIN dup 0< WHILE
drop
1+ dup c@ lowbit>offset
REPEAT
swap prime-table – 30 * + ;

\ segmented sieve

100,000 drop Constant SIEVE_BLOCK_SIZE
SIEVE_BLOCK_SIZE 2/ bset sieve

2variable sieve_start \ used by the generator
: sieve_end ( — d )
sieve_start 2@ [ SIEVE_BLOCK_SIZE 1- ] Literal M+ ;

: d/mod ( d n — d%n d/n )
locals| n |
\ FORTH does not have remainder function for double division
\ so we multiply the quotient by the divisor and subtract…
\
2dup 1 n M*/ 2dup 2>r \ get quotient & save a copy
n 1 M*/ d- drop \ compute remainder
2r> ; \ restore quotient

: dmod ( d m — n )
locals| m |
m d/mod 2drop \ note: FM/MOD is faster but overflows…
dup 0< IF m + THEN ;

: offset ( n — n ) \ first offset into bit array of prime #
sieve_start 2@ third 1+ M+ D2/ dnegate rot dmod ;

: seg-sieve ( d — )
sieve_start 2!
sieve bits erase
sieve bset-invert! \ set everything to TRUE

3
BEGIN
dup offset
BEGIN dup sieve nip ( length ) < WHILE
dup sieve -bit
over +
REPEAT drop
next-small-prime
sieve_end third dup M* D< UNTIL
drop ;

\ gen-primes — generate primes and call callback for each one.
\ callback returns 0 when no more primes desired
\ After ~1e12 primes generated will throw exception…

variable 'with_prime
variable complete?

: sieve_cb ( n — n )
complete? @ IF
drop
ELSE
sieve_start 2@ rot 2* 1+ M+ 'with_prime @execute not complete? !
THEN ;

: gen-primes ( xt — )
'with_prime ! false complete? !

2
BEGIN
dup s>d 'with_prime @execute not IF drop EXIT THEN
next-small-prime
dup 1000000 > UNTIL drop

1,000,000
BEGIN
2dup seg-sieve
sieve ['] sieve_cb bset-foreach
complete? @ IF 2drop EXIT THEN
SIEVE_BLOCK_SIZE M+
AGAIN ;

\ sum first N primes

variable #primes
2variable sum

: sum_cb ( d — ? )
sum 2@ D+ sum 2!
-1 #primes +!
#primes @ 0> ;

: sum_primes ( n — d )
0, sum 2! #primes !
['] sum_cb gen-primes
sum 2@ ;

counter 1,000,000,000 drop sum_primes du. timer 11138479445180240497 2401035 ok
David said
September 29, 2012 at 10:31 PM
Trying this again, hope it formats better…

Done in SwiftForth with a bitset library I wrote a long time ago. Took 40 minutes or 410K primes per second. (While also watching web video, etc. on the computer.) It uses the compressed 33K file from previous exercise to store primes < 1 million, which is used as the sieving primes, and the segmented sieve from another previous exercise to sieve blocks after 1 million, in increments of 100,000.

requires bitset

\ Table lookup for "small" primes, (33K table for primes < 1e6)

1,000,003 drop Constant FILE_PRIME_MAX

: load-primes ( str — addr )
r/o bin open-file throw locals| fd |
fd file-size throw drop ( — size )
s" CREATE prime-table" evaluate
here over allot
swap fd read-file throw drop
fd close-file throw ;

s" ../prime1e6.dat" load-primes

hex
create masks[]
FF c, \ 0
FE c, FE c, FE c, FE c, FE c, FE c, \ 1 – 6
FC c, FC c, FC c, FC c, \ 7 – 10
F8 c, F8 c, \ 11 – 12
F0 c, F0 c, F0 c, F0 c, \ 13 – 16
E0 c, E0 c, \ 17 – 18
C0 c, C0 c, C0 c, C0 c, \ 19 – 22
80 c, 80 c, 80 c, 80 c, 80 c, 80 c, \ 23 – 28
00 c, \ 29

decimal

create offset_mod_11
-1 , 1 , 7 , -1 , 11 , 17 , -1 , 29 , 13 , 23 , 19 ,

: lowbit>offset
dup negate and 11 mod cells offset_mod_11 + @ ;

: next-small-prime ( n — n )
dup 2 < IF drop 2 EXIT THEN
dup 3 < IF drop 3 EXIT THEN
dup 5 < IF drop 5 EXIT THEN
dup FILE_PRIME_MAX >= abort" : precomputed prime table exceeded"

30 /mod prime-table +
swap masks[] + c@ \ addr mask
over c@ and lowbit>offset
BEGIN dup 0< WHILE
drop
1+ dup c@ lowbit>offset
REPEAT
swap prime-table – 30 * + ;

\ segmented sieve

100,000 drop Constant SIEVE_BLOCK_SIZE
SIEVE_BLOCK_SIZE 2/ bset sieve

2variable sieve_start \ used by the generator
: sieve_end ( — d )
sieve_start 2@ [ SIEVE_BLOCK_SIZE 1- ] Literal M+ ;

: d/mod ( d n — d%n d/n )
locals| n |
\ FORTH does not have remainder function for double division
\ so we multiply the quotient by the divisor and subtract…
\
2dup 1 n M*/ 2dup 2>r \ get quotient & save a copy
n 1 M*/ d- drop \ compute remainder
2r> ; \ restore quotient

: dmod ( d m — n )
locals| m |
m d/mod 2drop \ note: FM/MOD is faster but overflows…
dup 0< IF m + THEN ;

: offset ( n — n ) \ first offset into bit array of prime #
sieve_start 2@ third 1+ M+ D2/ dnegate rot dmod ;

: seg-sieve ( d — )
sieve_start 2!
sieve bits erase
sieve bset-invert! \ set everything to TRUE

3
BEGIN
dup offset
BEGIN dup sieve nip ( length ) < WHILE
dup sieve -bit
over +
REPEAT drop
next-small-prime
sieve_end third dup M* D< UNTIL
drop ;

\ gen-primes — generate primes and call callback for each one.
\ callback returns 0 when no more primes desired
\ After ~1e12 primes generated will throw exception…

variable 'with_prime
variable complete?

: sieve_cb ( n — n )
complete? @ IF
drop
ELSE
sieve_start 2@ rot 2* 1+ M+ 'with_prime @execute not complete? !
THEN ;

: gen-primes ( xt — )
'with_prime ! false complete? !

2
BEGIN
dup s>d 'with_prime @execute not IF drop EXIT THEN
next-small-prime
dup 1000000 > UNTIL drop

1,000,000
BEGIN
2dup seg-sieve
sieve ['] sieve_cb bset-foreach
complete? @ IF 2drop EXIT THEN
SIEVE_BLOCK_SIZE M+
AGAIN ;

\ sum first N primes

variable #primes
2variable sum

: sum_cb ( d — ? )
sum 2@ D+ sum 2!
-1 #primes +!
#primes @ 0> ;

: sum_primes ( n — d )
0, sum 2! #primes !
['] sum_cb gen-primes
sum 2@ ;

counter 1,000,000,000 drop sum_primes du. timer 11138479445180240497 2401035 ok
The Sum Of The First Billion Primes | jverkamp.com said
November 1, 2012 at 1:01 PM
[…] This problem from Programming Praxis came about in the comments to my last post and intrigued me. So today, we are trying to sum the first one billion primes. Summing the first hundred, thousand, even million primes isn’t actually that bad. But it takes a bit more effort when you scale it up to a billion. And why’s that? […]
JP said
November 1, 2012 at 1:26 PM
I know it’s an older, but after your comments on my solution to the Pythagorean Triples hinting at this one, I had to try it. I ended up working on it way more than I probably should have with quals coming up, but what I ended up with is 5 different ways (including all three sieves mentioned on Wikipedia) of summing the first n primes with different data structure variants on the two fastest ones.

It still ends up taking an hour and a half on the quickest run, but it was really interesting to work out all of the ways to do it. Perhaps I’ll see if I can optimize even more at some point. Although 10 seconds? Likely not. :)

Linky: The Sum of the First Billion Primes

danaj said

November 11, 2012 at 11:32 AM

Five methods in Perl.

The first is calling Math::Prime::Util::next_prime in a dumb loop. Not very fast since it’s doing trial division and/or MR tests. 400k-500k / second.

We can speed that up by having MPU precalculate all the primes, obviously at the expense of memory (about 760MB for sieving 22.8B numbers). Using nth_prime_upper() to get an upper bound (0.01% larger than the real value, but obtained in microseconds) we can do this pretty fast. A reasonably fast solution at over 3M primes/s.

Next is using Math::Prime::Util to do it more efficiently — get primes in segments until we’ve reached our limit. Not bad — 66s on my machine to sum the first 1 billion. Over 15M/s.

Now we move to pure Perl. First let’s use an unsegmented sieve. This one is a modification of one from c2.com’s wiki, which is a very simple SoE, and much faster than most Perl sieves found on the web (I have a collection of 17 of them — some shockingly bad, some quite decent). It runs at about 260k/s. But the biggest problem is that it isn’t segmented, nor is it storing the numbers efficiently. My 32GB machine started swapping when calculating 10^8.

Lastly is a segmented sieve in pure Perl using strings. A bit wacky and too verbose, but it can be faster than the obvious solutions depending on memory speed. As a basic unsegmented sieve, it’s about 2x faster than any of the other Perl solutions on my workstation. On my old Atom netbook it’s a bit slower than the above sieve. I wrote it earlier this year so Math::Prime::Util could have something reasonable as a fallback if the XS failed to compile on a platform. Since it’s segmented it will work for our task without going crazy with memory. Over 500k/s.

None of them compare to primesieve of course (the serial version of Kim’s program takes all of 7 seconds on my machine, the parallel version runs in 1.1s). The callback is spiffy also. I thought about adding something like it to Math::Prime::Util (e.g. prime_foreach 2 100 { …sub… }) and still might, but calling Perl subroutines is far slower than C — it’s over a minute just to make 1B empty subroutine calls so it seems of limited use for this example.

Timings:

       MPU next     next+precalc   MPU sieve      Perl big    Perl segment
      ------------  ------------  ------------  ------------  ------------
10^6      1.6s          0.3s          0.1s          2.9s          1.4s
10^7     20.4s          2.3s          0.6s         37.7s         16.0s
10^8    252.1s         26.0s          6.1s          n/a         197.6s
10^9                  304.9s         65.8s          n/a        2548.0s

Code:

#!/usr/bin/perl
use warnings;
use strict;
use Math::Prime::Util qw/:all -nobigint/;
use List::Util qw/sum/;

my $n = $ARGV[0] || 1_000_000;
#print "sum of the first $n primes: ", prime_sum_next($n, 1), "\n";
#print "sum of the first $n primes: ", prime_sum_sieve($n), "\n";
#print "sum of the first $n primes: ", prime_sum_ns($n), "\n";
print "sum of the first $n primes: ", prime_sum($n), "\n";

sub prime_sum_next {
  my ($np, $do_precalc) = @_;
  prime_precalc(nth_prime_upper($np)) if $do_precalc;
  my $sum = 0;
  my $p = 0;
  while ($np-- > 0) {
    $p = next_prime($p);
    $sum += $p;
  }
  $sum;
}

sub prime_sum_sieve {
  my $np = shift;
  my $sum = 0;
  my $low = 1;
  my $segsize = 5_000_000;
  while ($np > 0) {
    my $chunkref = primes($low, $low+$segsize);
    $low += $segsize + 2;
    $#$chunkref = $np-1 if scalar @$chunkref > $np;
    $sum += int(sum @$chunkref);
    $np -= scalar @$chunkref;
  }
  $sum;
}

sub prime_sum_ns {   # Basic unsegmented sieve
  my $np = shift;
  return if !defined $np || $np < 0;
  return (0,2,2+3,2+3+5)[$np] if $np <= 3;
  my $max = nth_prime($np);
  my @c;
  for(my $t = 3; $t*$t <= $max; $t += 2) {
     if (!$c[$t]) {
         for(my $s = $t*$t; $s <= $max; $s += $t*2) { $c[$s]++ }
     }
  }
  my $sum = 2;
  for (my $t = 3; $t <= $max; $t += 2) {
    $c[$t] || do {$sum += $t;}
  }
  $sum;
}

sub prime_sum {   # Using string segmented sieve
  my $np = shift;
  return if !defined $np || $np < 0;
  return (0,2,2+3,2+3+5)[$np] if $np <= 3;
  $np -= 3;

  my $sum = 2+3+5;
  my $segsize = 1_000_000;
  my $low = 7;
  while ($np > 0) {
    my $high = $low + $segsize;
    my $sref = _pp_segment_sieve($low, $high);
    my $p = $low - 2;
    foreach my $s (split("0", $$sref, -1)) {
      $p += 2 + 2 * length($s);
      last if $p > $high;
      $sum += $p;
      last if --$np <= 0;
    }
    $low = $high + 2;
  }
  $sum;
}

sub _pp_segment_sieve {
  my($beg,$end) = @_;
  my $range = int( ($end - $beg) / 2 ) + 1;
  # Prefill with 3 and 5 already marked, and offset to the segment start.
  my $whole = int( ($range+14) / 15);
  my $startp = ($beg % 30) >> 1;
  my $sieve = substr("011010010010110", $startp) . "011010010010110" x $whole;
  # Set 3 and 5 to prime if we're sieving them.
  substr($sieve,0,2) = "00" if $beg == 3;
  substr($sieve,0,1) = "0"  if $beg == 5;
  # Get rid of any extra we added.
  substr($sieve, $range) = '';

  # If the end value is below 7^2, then the pre-sieve is all we needed.
  return \$sieve if $end < 49;

  my $limit = int(sqrt($end)) + 1;
  # For large value of end, it's a huge win to just walk primes.
  my $primesieveref = _pp_string_sieve($limit);
  my $p = 7-2;
  foreach my $s (split("0", substr($$primesieveref, 3), -1)) {
    $p += 2 + 2 * length($s);
    my $p2 = $p*$p;
    last if $p2 > $end;
    if ($p2 < $beg) {
      $p2 = int($beg / $p) * $p;
      $p2 += $p if $p2 < $beg;
      $p2 += $p if ($p2 % 2) == 0;   # Make sure p2 is odd
    }
    # With large bases and small segments, it's common to find we don't hit
    # the segment at all.  Skip all the setup if we find this now.
    if ($p2 <= $end) {
      # Inner loop marking multiples of p
      # (everything is divided by 2 to keep inner loop simpler)
      my $filter_end = ($end - $beg) >> 1;
      my $filter_p2  = ($p2  - $beg) >> 1;
      while ($filter_p2 <= $filter_end) {
        substr($sieve, $filter_p2, 1) = "1";
        $filter_p2 += $p;
      }
    }
  }
  \$sieve;
}

sub _pp_string_sieve {
  my($end) = @_;
  return 0 if $end < 2;
  return 1 if $end < 3;
  $end-- if ($end & 1) == 0;

  # string with prefill
  my $whole = int( ($end>>1) / 15);
  die "Sieve too large" if $whole > 1_145_324_612;  # ~32 GB string
  my $sieve = '100010010010110' . '011010010010110' x $whole;
  substr($sieve, ($end>>1)+1) = '';
  my $n = 7;
  while ( ($n*$n) <= $end ) {
    my $s = $n*$n;
    my $filter_s   = $s >> 1;
    my $filter_end = $end >> 1;
    while ($filter_s <= $filter_end) {
      substr($sieve, $filter_s, 1) = '1';
      $filter_s += $n;
    }
    do { $n += 2 } while substr($sieve, $n>>1, 1);
  }
  return \$sieve;
}

JP said
November 29, 2012 at 10:19 PM
Dang. After some comments over on my original post, I decided to go back and add a the segmented version of the Sieve of Eratosthenes into the mix. I’m glad I did.

While my previous times were 2 hr 42 min for Eratosthenes and 1 hr 28 min for Atkin, the new segmented sieve blows both out of the water at only 25 min 12 sec. It’s still nowhere near as quick as some people’s, but it’s at least a good example of what being conscious of what memory/caching behavior can do to runtimes.

Linky: One Billion Primes – Segmented Sieve

Nathan McKenzie said

September 25, 2014 at 4:57 AM

Alright, here’s my approach (quite literally).

On my Intel Core i7 2.3 ghz machine, it completes in 495 milliseconds and uses about 800,000 bytes of RAM (but if I used a larger wheel that would go up rapidly).

One interesting side note: the actual algorithm I’m using doesn’t use any memory during run time aside from the pre-allocated memory of the wheels and the stack space of the function calls (which will never be deeper than log n/log 2), so it would be relatively trivially to make this algorithm multi-threaded or even distributed. My 495 ms time is single-threaded. With that said, this algorithm runs in something worse than O(n^4/5) time, so there are faster solutions.

My C# implementation follows. Just call Program.Main(). This code handles overflow very poorly – if you try values of n larger than 22801763489, or increasing the wheel size, there’s a good chance you’ll start seeing garbage values.

using System;
using System.Diagnostics;
// A description of this algorithm can be found as equation 5 in http://icecreambreakfast.com/primecount/PrimeSumming_NathanMcKenzie.pdf
namespace ConsoleApplication1{
	struct Polynomial{
		// For this problem, I didn't want to take the speed hit of a full-blown conversion to decimal / BigInteger math.  So instead, as a kludge, output numbers are collated as
		// polynomials - as x2 * theWheel.wheelRunningTotals[1, 0] + x1 * theWheel.wheelCycle + x0, which, with the default wheel that excludes all primes < 17,
		// amounts to x2 * 86486400 + x1 * 30030 + x0.  These two values were chosen because they feature prominently in many of the output functions.
		// The entire point of this unfortuanate complexity is to deal with the fact that internally, to sum the primes up to <= 22801763489, we need to deal with numbers
		// larger than 2^64 with the identities here.  This is a stop-gap solution.  Overflow will still silently happen if great care isn't used.
		public static Polynomial Zero = new Polynomial(0, 0, 0);
		public static long base1 = 0, base2 = 0;
		private long x2, x1, x0;
		public Polynomial(long val2, long val1, long val0){
			x2 = val2;
			x1 = val1;
			x0 = val0;}
		public decimal Value{ get{ return (decimal)x2 * base2 + (decimal)x1 * base1 + x0; } }
		public Polynomial Normalize(){
			x1 += x0 / base1;
			x0 %= base1;
			return this;
		}
		public static Polynomial operator +(Polynomial c1, Polynomial c2){
			return new Polynomial(c1.x2 + c2.x2, c1.x1 + c2.x1, c1.x0 + c2.x0 );
		}
		public static Polynomial operator -(Polynomial c1, Polynomial c2){
			return new Polynomial(c1.x2 - c2.x2, c1.x1 - c2.x1, c1.x0 - c2.x0);
		}
		public static Polynomial operator *(Polynomial m1, long c){
			return new Polynomial( m1.x2 * c, m1.x1 * c, m1.x0 * c);
		}
		public static Polynomial operator *(long c, Polynomial m1){
			return new Polynomial(m1.x2 * c, m1.x1 * c, m1.x0 * c);
		}
	}
	class Wheel{
		// The wheel has to address 2 problems.
		// One is it needs to to let us iterate through a set of numbers, efficiently skipping numbers that are excluded from the wheel.
		// Two is it needs to provide a quick way to find the sum of all numbers in a range that are not excluded by the wheel, with those numbers raised
		// to integer powers.  This second task is by far the more complicated of the two.
		// This is all made worse, of course, by the confusion introduced by the Polynomial representation here.
		// I might write up what's going on here better someday.
		static uint[] primeCache = new uint[] { 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71 };
		static uint[] usedEntries = new uint[] { 1, 2, 8, 48, 480, 5760, 92160, 1658880 };
		public uint wheelCycle;
		public long[,] wheelRunningTotals;
		public uint[] offsets;
		int numExcludedPrimes;
		int numPowers = 3;
		public uint firstPrime;
		public Wheel(int excludedPrimes){
			numExcludedPrimes = excludedPrimes;
			wheelCycle = 1;
			for (int i = 0; i < excludedPrimes; i++) wheelCycle *= primeCache[i];
			firstPrime = primeCache[excludedPrimes];
			wheelRunningTotals = new long[numPowers, wheelCycle];
			var temporarySums = new long[numPowers];
			offsets = new uint[usedEntries[excludedPrimes - 1]];
			for (var k = 0; k < numPowers; k++) temporarySums[k] = 0;
			var clocks = new int[excludedPrimes];
			for (int i = 0; i < excludedPrimes; i++)clocks[i] = 0;
			uint offset = 0;
			var curOffset = 0;
			for (uint i = 0; i < wheelCycle; i++){
				var reset = false;
				for (uint j = 0; j < excludedPrimes; j++){
					if (clocks[j] == 0)reset = true;
					clocks[j]++;
					if (clocks[j] == primeCache[j])clocks[j] = 0;
				}
				if (reset == false){
					uint mul = 1;
					for (var k = 0; k < numPowers; k++){
						temporarySums[k] += mul;
						mul *= i;
					}
					offsets[curOffset] = offset + 1;
					curOffset++;
					offset = 0;
				}
				else offset++;
				for (var k = 0; k < numPowers; k++) wheelRunningTotals[k, i] = temporarySums[k];
			}
			for (var k = 0; k < numPowers; k++) wheelRunningTotals[k, 0] = temporarySums[k];
		}
		public double SumOfExcludedPrimes(double power){
			double total = 0;
			for (var k = 0; k < numExcludedPrimes; k++) total += Math.Pow(primeCache[k], power);
			return total;
		}
		public static long PowerSum0(long n){
			return n + 1;
		}
		public static long PowerSum1(long n){
			return (n * (n + 1) / 2);
		}
		public static long PowerSum2(long n){
			return n * (n + 1) * (2 * n + 1) / 6;
		}
		//_________________________________________________________
		// SumRangeZeroPower(long start, long end) counts the number of values that remain unsieved in the range going from start to end inclusive.
		// In a non-sieved version, this would just be end-start+1.
		private Polynomial OffsetZeroPower(long val){
			long entryInCycle = val % wheelCycle;
			return new Polynomial( 0, 0, wheelRunningTotals[0, entryInCycle] );
		}
		private Polynomial ValueZeroPower(long val){
			if (val == 0) return Polynomial.Zero;
			long a = (val - 1) / wheelCycle - 1;
			return new Polynomial(0, 0, PowerSum0(a) * wheelRunningTotals[0, 0]) + OffsetZeroPower(val);
		}
		public Polynomial SumRangeZeroPower(long start, long end){
			return ValueZeroPower(end) - ValueZeroPower(start - 1);
		}
		//_________________________________________________________
		// SumRangeFirstPower(long start, long end) adds up the numbers that remain unsieved in the range going from start to end inclusive.
		// In a non-sieved version, this would be the very well known (end)(end+1)/2 - (start-1)(start)/2.
		private Polynomial OffsetFirstPower(long val){
			long entryInCycle = val % wheelCycle;
			return new Polynomial(0, wheelRunningTotals[0, entryInCycle] * ((val - 1) / wheelCycle), 0) + new Polynomial(0, wheelRunningTotals[1, entryInCycle] / wheelCycle, wheelRunningTotals[1, entryInCycle]%wheelCycle);
		}
		private Polynomial ValueFirstPower(long val){
			if (val == 0) return Polynomial.Zero;
			long fullCycles = (val - 1) / wheelCycle - 1;
			return new Polynomial(0, PowerSum1(fullCycles) * wheelRunningTotals[0, 0], 0) + new Polynomial( PowerSum0(fullCycles), 0, 0) + OffsetFirstPower(val);
		}
		public Polynomial SumRangeFirstPower(long start, long end){
			return (ValueFirstPower(end) - ValueFirstPower(start - 1)).Normalize();
		}
		//_________________________________________________________
		// SumRangeFirstPower(long start, long end) adds up the numbers that remain unsieved in the range going from start to end inclusive.
		// In a non-sieved version, this would be the very well known (end)(end+1)(2 *end +1 )/6 - (start-1)(start)(2*start-1)/6.
		private Polynomial OffsetSecondPower(long val){
			long fullCycles = ((val - 1) / wheelCycle);
			long c = wheelCycle * fullCycles;
			long entryInCycle = val % wheelCycle;
			return new Polynomial( 0, wheelRunningTotals[0, entryInCycle] * c * fullCycles, 0 ) + new Polynomial( 0, 2 * wheelRunningTotals[1, entryInCycle] * fullCycles, 0 )
				+ new Polynomial( 0, wheelRunningTotals[2, entryInCycle]/wheelCycle, wheelRunningTotals[2, entryInCycle]%wheelCycle );
		}
		private Polynomial ValueSecondPower(long val){
			if (val == 0) return Polynomial.Zero;
			long fullCycles = (val - 1) / wheelCycle - 1;
			var temp = PowerSum0(fullCycles) * wheelRunningTotals[2, 0];
			return new Polynomial( 0, PowerSum2(fullCycles) * wheelRunningTotals[0, 0] * wheelCycle, 0 ) + new Polynomial( 2 * PowerSum1(fullCycles) * wheelCycle, 0, 0 )
				+ new Polynomial( 0, temp/wheelCycle, temp%wheelCycle ) + OffsetSecondPower(val);
		}
		public Polynomial SumRangeSecondPower(long start, long end){
			return ValueSecondPower(end) - ValueSecondPower(start - 1);
		}
	}
	class Program{
		static int[] moebiusMu = new int[]{ 0, 1, -1, -1, 0, -1, 1, -1, 0, 0, 1, -1, 0, -1, 1, 1, 0, -1, 0, -1, 0, 1, 1, -1, 0, 0, 1, 0, 0, -1, -1, -1, 0, 1, 1, 1, 0, -1, 1, 1 };
		static long[,] binomialCache;
		static double epsilon = .00000001;
		static Wheel theWheel;
		static long InversePow(long n, int k){
			return (long)(Math.Pow(n, 1.0 / k) + epsilon);
		}
		static long Binomial(double n, int k){
			double total = 1;
			for (int i = 1; i <= k; i++) total *= (n - (k - i)) / i;
			return (long)(total + epsilon);
		}
		struct WheelEntry{
			public int wheelID;
			public long num;
			public void Init(){
				wheelID = 2;
				num = theWheel.firstPrime;
				if (wheelID >= theWheel.offsets.Length) wheelID = 0;
			}
			public static WheelEntry Create(){
				var theWheel = new WheelEntry();
				theWheel.Init();
				return theWheel;
			}
			public static WheelEntry CopyAndIncrement(WheelEntry source){
				var theWheel = new WheelEntry();
				theWheel.num = source.num;
				theWheel.wheelID = source.wheelID;
				theWheel.Increment();
				return theWheel;
			}
			public void Increment(){
				num += theWheel.offsets[wheelID];
				wheelID++;
				if (wheelID >= theWheel.offsets.Length) wheelID = 0;
			}
		}
		static Polynomial Divisors(long n, int p, int k, WheelEntry wheel){
			if (k == 0) return new Polynomial(0, 0, 1);
			long a = wheel.num;
			if (k == 1){
				switch (p){
					case 0: return theWheel.SumRangeZeroPower(a, n);
					case 1: return theWheel.SumRangeFirstPower(a, n);
					case 2: return theWheel.SumRangeSecondPower(a, n);
				}
			}
			Polynomial total = new Polynomial(0, 0, 0);
			long maxDivisor = InversePow(n, k);
			var nextWheel = WheelEntry.CopyAndIncrement(wheel);
			for (; a <= maxDivisor; nextWheel.Increment()){
				long n2 = n;
				for (int j = 1; j <= k; j++){
					n2 /= a;
					total += (long)(Math.Pow((double)a, p * j)+epsilon) * Divisors(n2, p, k - j, nextWheel) * binomialCache[k, j];
				}
				a = nextWheel.num;
			}
			return total;
		}

		// Remember, a description of this algorithm can be found as equation 5 in http://icecreambreakfast.com/primecount/PrimeSumming_NathanMcKenzie.pdf
		static decimal PrimeSum(long n, int p){
			decimal total = 0.0M;
			for (int j = 1; j < Math.Log(n, theWheel.firstPrime)+epsilon; j++){
				if (moebiusMu[j] == 0) continue;
				decimal sign = 1;
				for (int k = 1; k < Math.Log(Math.Pow(n, 1.0 / j)+epsilon, theWheel.firstPrime) + epsilon; k++){
					var entry = WheelEntry.Create();
					total += sign * Divisors(InversePow(n, j), j * p, k, entry).Value / k / j * moebiusMu[j];
					sign *= -1;
				}
			}
			return decimal.Truncate((total + .00000001M) + (uint)theWheel.SumOfExcludedPrimes(p));
		}
		static void BuilldBinomialCache(){
			binomialCache = new long[30, 30];
			for (var j = 0; j < 30; j++)
				for (var k = 0; k < 30; k++)
					binomialCache[j, k] = Binomial(j, k);
		}
		static void Main(string[] args){
			var startingMemory = GC.GetTotalMemory(false);
			theWheel = new Wheel(6);
			Polynomial.base1 = theWheel.wheelCycle;
			Polynomial.base2 = theWheel.wheelRunningTotals[1, 0];
			BuilldBinomialCache();
			var stopWatch = new Stopwatch();
			stopWatch.Start();
			var d = PrimeSum(22801763489, 1);
			stopWatch.Stop();
			Console.WriteLine( "Computed " + d +", expected 11138479445180240497.");
			Console.WriteLine("Ellapsed Time: " + stopWatch.Elapsed.TotalMilliseconds+ " milliseconds.");
			Console.WriteLine("Caches allocataed " +(GC.GetTotalMemory( false )-startingMemory)+" bytes of heap memory.");
			Console.ReadKey();
		}
	}
}

danaj said
June 24, 2015 at 8:27 AM
Nathan, that is very impressive. Thanks for sharing as well as the writeup you did for stackoverflow and your web page. I saw another approach here: http://mathematica.stackexchange.com/questions/80291/efficient-way-to-sum-all-the-primes-below-n-million-in-mathematica that looks more analogous to Meissel’s formula (with the Legendre phi replaced by the S(v,p) function).

Pari/GP. 2m20s
```
s=0;  forprime(i=0,22801763489,s+=i);  s
```
Perl 47s
```
use ntheory ":all";
my $s = 0;
forprimes { $s += $_ } nth_prime(1e9);
say $s;
```
Perl 12s
```
use ntheory ":all";
say sum_primes( nth_prime(1e9) );
```
That’s not *too* far off primesieve’s 7 seconds. Both work by generating primes with a segmented sieve and doing something with the results. Someday I may add one of the fast sums, but memory and code size (standard C) are important.
Marc Deleglise said
April 2, 2016 at 5:41 PM
Using The Lagarias-Miller-Odlyzko method for counting primes improved by Deleglise-Rivat it easy to write a ithprime(n) function which returns the n-th prime in time (n^(2/3)/log(n)^2).

This algorithm return ithprime(10^9) = 22801763489 in 0.03s.

The number of primes up to x may be written pi(x) = sum(1, for p in primes(x)) = sum(f(p) for p in primes(x)) where f(p) = 1.

The Lagarias-Miller-Odlyzko method may be used with very small changes to compute Sf(x) = sum(f(p) for p in primes(x)), still with O(x^(1/3)/log(x)^2) operations
for every function f such that f(ab) = f(a) f(b). This is explained in the paper
Maximal product of primes whose sum is bounded, by M. Deléglise and J.-L. Nicolas, arXiv:1207.0603v1,p 25-35.

If we take for f the identity function f(x) = x we get Sf(x) = sum(p for p in primes(x).

This algorithm computes
Sf(22801763489) = 11138479445180240497 en 0.9s.

If we replace one billion, 10^9 by 10^12 wet get respectiviely
ithprime(10^12) = 29996224275833 in 1.5s
and then Sf( 29996224275833) = 14738971133550183905879827 in 5.6s.
GordonBGood said
May 9, 2016 at 12:49 AM
@Marc Deleglise, very interesting, can you show us the code used or post a link to it (since this is a programming website not a math one)?
Curtis Seizert said
November 8, 2016 at 9:12 PM
This is a quick and dirty way of explicitly computing the sum using CUDA:
```
#include <stdint.h>
#include <iostream>
#include <math.h>
#include <thrust/reduce.h>
#include <thrust/execution_policy.h>

#include "CUDASieve/cudasieve.hpp"

int main()
{
  uint64_t sum = 0, bottom = 0, intervalSize = pow(2,32);
  int64_t numRemaining = pow(10,9);
  size_t numInCurrentInterval;

  while(numRemaining > 0){
    uint64_t * d_primes = CudaSieve::getDevicePrimes(bottom, bottom + intervalSize, numInCurrentInterval);

    sum += thrust::reduce(thrust::device, d_primes, d_primes + std::min((int64_t)numInCurrentInterval, numRemaining));
    
    numRemaining -= numInCurrentInterval;
    bottom += intervalSize;
    cudaFree(d_primes);
  }

  std::cout << "Sum = " << sum << std::endl;

  return 0;
}
```
This code uses the thrust library for a parallel reduction as well as my CUDASieve library (https://github.com/curtisseizert/CUDASieve) for parallel prime generation. The fastest block size of 2^32 runs in about 630 ms on my GTX 1080 but takes about 1.6 GB of device memory during this time. One could significantly improve the speed of this by changing the source code of CUDASieve in order to obviate generating lists of primes. This would theoretically take only a few kb over the memory required to store the list of sieving primes for the Eratosthenes implementation (~60 kb) and would have a likely run time essentially equal to that of simply counting the primes.

Programming Praxis