## Counting Primes Using Legendre’s Formula

### July 22, 2011

We will need functions to compute the *n*th prime, for the calculation of φ, and to compute π(*x*) for small values of *x*, to terminate the recursion in the formula for π. We do both by using the sieve of Eratosthenes to create a vector of the primes up to some convenient limit (the vector has an additional nonsense value in the 0th position so we don’t have to keep subtracting 1 whenever we index the vector), then just index the vector to compute the *n*th prime and use binary search to compute π:

`(define ps (list->vector (cons -1 (primes #e1e5))))`

`(define (p n) (vector-ref ps n))`

`(define (pi n)`

(let loop ((lo 1) (hi (- (vector-length ps) 1)))

(let ((mid (quotient (+ lo hi) 2)))

(cond ((<= hi lo) mid)

((< n (p mid)) (loop lo (- mid 1)))

((< (p mid) n) (loop (+ mid 1) hi))

(else mid)))))

The φ function is computed according to the recurrence equation. We memoize the function because small results appear again and again in the recursive calculations:

`(define phi`

(let ((memo (make-hash (lambda (x) (+ (* 1000 (car x)) (cadr x))) equal? #f 9973)))

(lambda args

(let ((x (car args)) (a (cadr args)) (t (memo 'lookup args)))

(cond (t t) ; return memoized value

((= a 1) (let ((t (quotient (+ x 1) 2))) (memo 'insert args t) t))

(else (let ((t (- (phi x (- a 1)) (phi (quotient x (p a)) (- a 1)))))

(memo 'insert args t) t)))))))

Now the calculation of π is simple:

`(define (prime-pi n)`

(let ((a (pi (isqrt n))))

(+ (phi n a) a -1)))

Here is Legendre’s calculation:

`> (prime-pi 1000000)`

78498

We used hash tables and the `isqrt`

function from the Standard Prelude, and `primes`

from a previous exercise. You can run the program at http://programmingpraxis.codepad.org/sNCV5Y0d.

Even with memoization, my Python solution didn’t finish quickly enough for me. Interestingly, (or not, as the case may be)

the following naive Haskell implementation finishes much more quickly (especially when compiled via GHC).

I guess the next step is to get good enough at Haskell to figure out a way to cut the number of

`phi`

calls.The reference solution finds pi(10^6) in 72ms on my machine, and pi(10^7) in about ten times that, which is still less than a second. What kind of times are you seeing?

Sieving by trial division to get a list of the primes is ugly. The fastest, prettiest prime generator that I know in Haskell is by Melissa O’Neill: http://www.cs.hmc.edu/~oneill/papers/Sieve-JFP.pdf.

The reference solution finds pi(10^6) in about 0.3 seconds and pi(10^7) in about 1.6 (under Petite Chez Scheme). Running the naive Haskell above with GHC’s

`runhaskell`

command (so, intepreted) takes almost 9.6 seconds for pi(10^6), while compiled with`gch --make -O2`

take about 0.5 seconds. My Haskell interpreted chokes when going for pi(10^7), but manages to finish in about 11 seconds when compiled. Thanks for the O’Neill paper; I think I’d come across it before, but hadn’t checked it out in depth.Python 3.2

@Mike, your solution seems to run into the same problem my Python work did, even with Python 3.2: trying pi(10^7) causes the interpreter to crash after exceeding the recursion limit. Any thoughts?

@Graham,

I had trouble with the stack limit as well.

Evaluating the code for pi() with n = 1e7, a = 966. So, pi(1e7) = phi(1e7, 966) + 966 – 1. phi(1e7, 966) depends on phi(1e7,965), which depends on phi(1e7,964), … down to phi(1e7, 1). That’s 966 stack frames, which is close to the stack overflow limit.

I suspect the python shell uses a few stack frames, but probably not enough to push pi(1e7) over the limit.

I believe the culprit is the lru cache decorator for the phi() function. Looking at the source code in functools.py, when there is a cache miss, the wrapped function calls the original code for phi(). So I think that on a cache miss, two frames get pushed on the stack (one for the call to the wrapped function and one for the call to the original function). The original function may the recurse and call the wrapped function, etc.

I have not tested my hypothesis. However, if I pre-populate the cache by calling pi() in a loop with gradually increasing arguments, I can calculate pi() for larger arguments without hitting the stack limit.

Scheme has no trouble with stack overflow because stack frames are stored on the heap, not on the stack.

I don’t know enough about Python to help. Here are some random thoughts on the matter. I don’t know if any of them are useful.

Setting a higher stack overflow limit only helps for a little while. I don’t know if that is possible in Python, but in any event it is not a serious solution.

Does Python have some kind of lazy evaluation method? If it does, and if it uses the heap instead of the stack, maybe you could arrange to use that instead of the normal recursion stack.

You could build your own stack out of cons pairs stored on the heap, and make it however big you want, instead of using Python’s built-in recursion stack.

You don’t necessarily need to use the recurrence equation to calculate Legendre’s sum. See the MathWorld article for Legendre’s Formula. Specifically, equation (1) may be useful.

Here is a Python 3 version of phi(x,a) using a separate stack.

I optimized the code a bit.

–If x < prime[a], the second term in the recurrence equation is zero, and the first term in the recurrence equation is basically doing a linear search through the primes to find the largest prime that divides x. The bisect() call replaces the linear search with a binary search to find the largest prime less than x.

–As stated in the problem, the recurrence terminates when a == 1. The recurrence can also be terminated when x == 1, where phi(1,a) is +1 or -1 depending on the sign the phi term would have (the variable s in the code).

from bisect import bisect

from math import sqrt

from utils import primesTo

# pad prime[] so that prime[n] is the nth prime number

prime = [0] + primesTo(100000)

def phi(x, a):

if a == 0:

return x

stack = [(1,x,a)]

value = 0

while stack:

s, x, a = stack.pop()

if x == 1:

value += s

else:

if x < prime[a]:

a = bisect(prime, x, 0, a) – 1

if a == 1:

value += s * ((x + 1)//2)

else:

stack.extend([(s, x, a-1), (-s, x//prime[a], a-1)])

return value

def pi(x):

if x < 2:

return 0

else:

a = pi(int(sqrt(x)))

return phi(x, a) + a – 1

There is no caching of intermediate results, so it is not very fast:

_n_______time (s)_____pi____

10^0 0.0000083810 0

10^1 0.0000455365 4

10^2 0.0001030857 25

10^3 0.0007869715 168

10^4 0.0064016516 1229

10^5 0.0681326817 9592

10^6 0.2781906639 78498

10^7 2.4777696607 664579

10^8 25.5805003150 5761455

10^9 262.7440423294 50847534

With formatting:

Clojure also runs out of stack space, and just using a stack and a loop is not obvious how to memoize the intermediate results. So I took advantage of Mike’s optimization above (I admit I never would have figured out what the recurrence actually does on my own.) Uses O’Neil sieve algorithm to create enough primes to calculate pi to 1e9.

My experience is different that everyone else. I have no trouble with recursion limits, and my timings are quite fast. I just computed pi(10^9) from an initial load with no previously-cached results:

`Petite Chez Scheme Version 8.4`

Copyright (c) 1985-2011 Cadence Research Systems

`> (load "e:/prime.ss")`

> (time (prime-pi #e1e9))

(time (prime-pi 1000000000))

35 collections

1014 ms elapsed cpu time, including 62 ms collecting

1010 ms elapsed real time, including 72 ms collecting

298285040 bytes allocated, including 277790944 bytes reclaimed

50847534

One second is a whole lot better than the four minutes David reported. What’s going on?

There is probably only a marginal effect if you run the scheme code with the cache empty or with previously cached results. The reason why it is so fast is that there are so many redundant calculations that don’t need to be performed when the results are cached. Without a cache, calculating phi for 1e9 will take (I don’t know, days?) Calculating phi for pi(1e6) naively without caching takes hours (and I’m not sure how long because I gave up after it was chugging for a few hours.)

Mike’s optimization removes some computational dead ends, and he also figured out what the recurrence actually does, so replacing a big chunk of the recurrence with the call to binary search the prime array saves a lot of time, but probably still leaves a lot of redundant computation in place. It’s time is still much improved over trying to compute phi naively without a cache.

Clojure is based on the JVM, and Java was not designed to be a functional programming language, so there’s not a whole lot of stack. I’m not sure why Python has so much problems. I worked on the Python code in the early 90’s and at that time stack frames were allocated on the heap. Because Python now has generators, there is even less reason to use the system stack for procedure calls, so it doesn’t make sense that there would be a stack limit…

This was fun, especially as my copy of Riesel arrived tonight. Two versions in Perl, not tested extensively. An obvious one:

#!/usr/bin/env perl

use strict;

use warnings; no warnings 'recursion';

use Memoize;

use Math::Prime::Util qw/primes/;

my @primea = @{primes(1_000_000)};

memoize('phi');

sub phi {

my ($x, $a) = @_;

return ($a ? int(($x+1)/2) : $x) if $a <= 1;

return 1 if $x < $primea[$a];

return phi($x, $a-1) - phi( int($x / $primea[$a-1]), $a-1);

}

sub pi {

my $n = shift;

return 0 if $n < 2;

`my $a = pi(int(sqrt($n)));`

return phi($n, $a) + $a - 1;

}

0.104s for 10**6 (78498), 41.1s for 10**9 (50847534).

I then made a non-recursive phi:

#!/usr/bin/env perl

use strict;

use warnings;

use Math::Prime::Util qw/primes/;

my @primea = @{primes(1_000_000)};

sub phi {

my ($x, $a) = @_;

return $x if $a == 0;

my @add = ($x);

my @sub;

my $sum = 0;

while ($a-- > 1) {

my $prime = $primea[$a];

my @newadd = map { int($_ / $prime) } @sub;

my @newsub = map { int($_ / $prime) } @add;

push @add, @newadd;

push @sub, @newsub;

$prime = $primea[$a-1];

my $count_ones = (scalar @add) - (scalar @sub);

@add = grep { $_ > $prime } @add;

@sub = grep { $_ > $prime } @sub;

$count_ones = $count_ones - (scalar @add) + (scalar @sub);

$sum += $count_ones;

}

$sum += int(($_+1)/2) for @add;

$sum -= int(($_+1)/2) for @sub;

$sum;

}

sub pi {

my $n = shift;

return 0 if $n < 2;

`my $a = pi(int(sqrt($n)));`

return phi($n, $a) + $a - 1;

}

0.016s for 10**6 (78498), 11.5s for 10**9 (50847534).

Math::Prime::Util::prime_count(10**9) takes under 0.5s but that’s cheating (it’s also written in C — its pure perl prime_count code is much slower).

Hopefully formatting will work this time (I wish there was a preview). A new version, with better phi(x,a), and also Lehmer essentially straight from from Riesel (page 22):

Timing on my machine for 10**9 is 6.2s for Legendre, 0.12s for Lehmer. I tried Lehmer counts 10**12 in 24.9s, 10**13 in 186.2s. Pretty fantastic for Perl. I am using a Perl/XS library to get the prime list and do the small prime_counts in the Lehmer loop. In theory those could all be precalculated (MPU’s prime count is doing bit counts on the pre-sieved area to get the result).

More timing:

primesieve is helped a lot by using 12 threads but it’s seriously fast even with only one thread (4-10x slower than multi-threaded). Both it and Math::Prime::Util are sieving the entire range (in segments) and doing fast bit counts, and both use very low memory since they run segmented. “Python-Legendre” is Mike’s code on my machine. The Lehmer results are great (IMO) for a first pass. Lots of optimization could be done on the “small” prime_count calls rather than its current method of making a big sieve and then counting all the bits up to the target, for each call.

Memoizing the phi function is the way to go. Ruby (which is slower language in execution than Clojure, generally) can compute pi(1e9) in 70 seconds, vs. 235 in Clojure. This is because there are no stack space issues here…

David, I don’t have your LazySieve, but just threw in a standard sieve to the nth_upper_limit of rt_n (and verified the sieving is a trivial amount of time and the prime counts are identical). I get “stack level too deep” errors in phi when trying to run 1e8 using Ruby 1.8.7. I had to set ‘ulimit -s 32768’ to get it to run 1e9 successfully (hmm, looks like ruby for Windows presets the stack to a giant size so that’s why it runs fine for you). Sadly I’m getting times that are *much* slower than yours — perhaps a 1.8.7 vs. 1.9.x issue. It’s 3.1s for 1e7, 58s for 1e8, and over 10 minutes for 1e9.

I believe ‘Memoizing the phi function is the way to go’ only applies to very small n values. My Legendre version in Perl above is over 10x faster than your Ruby numbers, and Perl isn’t _that_ much faster (I have a fast computer, but my memoized versions in Perl are slower than your Ruby numbers). When I did memoization it was not only slow for large n, but used gawdawful amounts of memory. The ruby code is using over 700MB of memory computing pi(10**9).

One of the main issues I ran into when doing Lehmer with large values (see https://programmingpraxis.com/2011/07/26/more-prime-counting-functions/) was space more so than just speed, making simple memoization a very bad solution.