Counting Primes Using Legendre’s Formula
July 22, 2011
We will need functions to compute the nth prime, for the calculation of φ, and to compute π(x) for small values of x, to terminate the recursion in the formula for π. We do both by using the sieve of Eratosthenes to create a vector of the primes up to some convenient limit (the vector has an additional nonsense value in the 0th position so we don’t have to keep subtracting 1 whenever we index the vector), then just index the vector to compute the nth prime and use binary search to compute π:
(define ps (list->vector (cons -1 (primes #e1e5))))
(define (p n) (vector-ref ps n))
(define (pi n)
(let loop ((lo 1) (hi (- (vector-length ps) 1)))
(let ((mid (quotient (+ lo hi) 2)))
(cond ((<= hi lo) mid)
((< n (p mid)) (loop lo (- mid 1)))
((< (p mid) n) (loop (+ mid 1) hi))
(else mid)))))
The φ function is computed according to the recurrence equation. We memoize the function because small results appear again and again in the recursive calculations:
(define phi
(let ((memo (make-hash (lambda (x) (+ (* 1000 (car x)) (cadr x))) equal? #f 9973)))
(lambda args
(let ((x (car args)) (a (cadr args)) (t (memo 'lookup args)))
(cond (t t) ; return memoized value
((= a 1) (let ((t (quotient (+ x 1) 2))) (memo 'insert args t) t))
(else (let ((t (- (phi x (- a 1)) (phi (quotient x (p a)) (- a 1)))))
(memo 'insert args t) t)))))))
Now the calculation of π is simple:
(define (prime-pi n)
(let ((a (pi (isqrt n))))
(+ (phi n a) a -1)))
Here is Legendre’s calculation:
> (prime-pi 1000000)
78498
We used hash tables and the isqrt function from the Standard Prelude, and primes from a previous exercise. You can run the program at http://programmingpraxis.codepad.org/sNCV5Y0d.
Even with memoization, my Python solution didn’t finish quickly enough for me. Interestingly, (or not, as the case may be)
the following naive Haskell implementation finishes much more quickly (especially when compiled via GHC).
I guess the next step is to get good enough at Haskell to figure out a way to cut the number of
phicalls.The reference solution finds pi(10^6) in 72ms on my machine, and pi(10^7) in about ten times that, which is still less than a second. What kind of times are you seeing?
Sieving by trial division to get a list of the primes is ugly. The fastest, prettiest prime generator that I know in Haskell is by Melissa O’Neill: http://www.cs.hmc.edu/~oneill/papers/Sieve-JFP.pdf.
The reference solution finds pi(10^6) in about 0.3 seconds and pi(10^7) in about 1.6 (under Petite Chez Scheme). Running the naive Haskell above with GHC’s
runhaskellcommand (so, intepreted) takes almost 9.6 seconds for pi(10^6), while compiled withgch --make -O2take about 0.5 seconds. My Haskell interpreted chokes when going for pi(10^7), but manages to finish in about 11 seconds when compiled. Thanks for the O’Neill paper; I think I’d come across it before, but hadn’t checked it out in depth.Python 3.2
from bisect import bisect from functools import lru_cache from math import sqrt from utils import primesTo prime = list(primesTo(10000)) @lru_cache(maxsize=None) def phi(x, a): if a == 1: return (x+1)//2 else: return phi(x, a-1) - phi(x//prime[a - 1], a - 1) def pi(n): ''' return number of primes <= n >>> pi(10) 4 >>> pi(1000) 168 >>> pi(1000000) 78498 ''' if n < prime[-1]: return bisect(prime, n) else: a = pi(int(sqrt(n))) return phi(n, a) + a - 1 ##if __name__ == "__main__": ## import doctest ## doctest.testmod()@Mike, your solution seems to run into the same problem my Python work did, even with Python 3.2: trying pi(10^7) causes the interpreter to crash after exceeding the recursion limit. Any thoughts?
@Graham,
I had trouble with the stack limit as well.
Evaluating the code for pi() with n = 1e7, a = 966. So, pi(1e7) = phi(1e7, 966) + 966 – 1. phi(1e7, 966) depends on phi(1e7,965), which depends on phi(1e7,964), … down to phi(1e7, 1). That’s 966 stack frames, which is close to the stack overflow limit.
I suspect the python shell uses a few stack frames, but probably not enough to push pi(1e7) over the limit.
I believe the culprit is the lru cache decorator for the phi() function. Looking at the source code in functools.py, when there is a cache miss, the wrapped function calls the original code for phi(). So I think that on a cache miss, two frames get pushed on the stack (one for the call to the wrapped function and one for the call to the original function). The original function may the recurse and call the wrapped function, etc.
I have not tested my hypothesis. However, if I pre-populate the cache by calling pi() in a loop with gradually increasing arguments, I can calculate pi() for larger arguments without hitting the stack limit.
Scheme has no trouble with stack overflow because stack frames are stored on the heap, not on the stack.
I don’t know enough about Python to help. Here are some random thoughts on the matter. I don’t know if any of them are useful.
Setting a higher stack overflow limit only helps for a little while. I don’t know if that is possible in Python, but in any event it is not a serious solution.
Does Python have some kind of lazy evaluation method? If it does, and if it uses the heap instead of the stack, maybe you could arrange to use that instead of the normal recursion stack.
You could build your own stack out of cons pairs stored on the heap, and make it however big you want, instead of using Python’s built-in recursion stack.
You don’t necessarily need to use the recurrence equation to calculate Legendre’s sum. See the MathWorld article for Legendre’s Formula. Specifically, equation (1) may be useful.
Here is a Python 3 version of phi(x,a) using a separate stack.
I optimized the code a bit.
–If x < prime[a], the second term in the recurrence equation is zero, and the first term in the recurrence equation is basically doing a linear search through the primes to find the largest prime that divides x. The bisect() call replaces the linear search with a binary search to find the largest prime less than x.
–As stated in the problem, the recurrence terminates when a == 1. The recurrence can also be terminated when x == 1, where phi(1,a) is +1 or -1 depending on the sign the phi term would have (the variable s in the code).
from bisect import bisect
from math import sqrt
from utils import primesTo
# pad prime[] so that prime[n] is the nth prime number
prime = [0] + primesTo(100000)
def phi(x, a):
if a == 0:
return x
stack = [(1,x,a)]
value = 0
while stack:
s, x, a = stack.pop()
if x == 1:
value += s
else:
if x < prime[a]:
a = bisect(prime, x, 0, a) – 1
if a == 1:
value += s * ((x + 1)//2)
else:
stack.extend([(s, x, a-1), (-s, x//prime[a], a-1)])
return value
def pi(x):
if x < 2:
return 0
else:
a = pi(int(sqrt(x)))
return phi(x, a) + a – 1
There is no caching of intermediate results, so it is not very fast:
_n_______time (s)_____pi____
10^0 0.0000083810 0
10^1 0.0000455365 4
10^2 0.0001030857 25
10^3 0.0007869715 168
10^4 0.0064016516 1229
10^5 0.0681326817 9592
10^6 0.2781906639 78498
10^7 2.4777696607 664579
10^8 25.5805003150 5761455
10^9 262.7440423294 50847534
With formatting:
from bisect import bisect from math import sqrt from utils import primesTo # pad prime[] so that prime[n] is the nth prime number prime = [0] + primesTo(100000) def phi(x, a): if a == 0: return x stack = [(1,x,a)] value = 0 while stack: s, x, a = stack.pop() if x == 1: value += s else: if x < prime[a]: a = bisect(prime, x, 0, a) – 1 if a == 1: value += s * ((x + 1)//2) else: stack.extend([(s, x, a-1), (-s, x//prime[a], a-1)]) return value def pi(x): if x < 2: return 0 else: a = pi(int(sqrt(x))) return phi(x, a) + a – 1 _n_______time (s)_____pi____ 10^0 0.0000083810 0 10^1 0.0000455365 4 10^2 0.0001030857 25 10^3 0.0007869715 168 10^4 0.0064016516 1229 10^5 0.0681326817 9592 10^6 0.2781906639 78498 10^7 2.4777696607 664579 10^8 25.5805003150 5761455 10^9 262.7440423294 50847534Clojure also runs out of stack space, and just using a stack and a loop is not obvious how to memoize the intermediate results. So I took advantage of Mike’s optimization above (I admit I never would have figured out what the recurrence actually does on my own.) Uses O’Neil sieve algorithm to create enough primes to calculate pi to 1e9.
(load "lazy-sieve") (def max-prime (int 1e9)) (def prime (into (vector-of :int 0) (take-while #(< %1 (int (Math/sqrt max-prime))) lazy-primes))) (defn small-pi "pi(n) for small n {n <= (last prime)}" [n] (let [pos (java.util.Collections/binarySearch prime (int n))] (if (>= pos 0) pos (- (+ pos 2))))) (defn phi [x a] (loop [stack (list [1, x, a]), n 0] (if (not (empty? stack)) (let [[sgn, x, a] (first stack), l (rest stack)] (if (= x 1) (recur l (+ n sgn)) (let [a' (if (< x (prime a)) (small-pi x) a)] (if (= a' 1) (recur l (+ n (* sgn (int (/ (inc x) 2))))) (recur (conj (conj l [sgn, x, (dec a')]) [(- sgn), (int (/ x (prime a'))), (dec a')]) n))))) n))) (defn prime-pi "Compute pi(n), upto 1 billion" [n] (if (or (< n 0) (> n max-prime)) (throw (IllegalArgumentException. "argument to prime-pi out of range."))) (if (<= n (last prime)) (small-pi n) (let [a (prime-pi (int (Math/sqrt n)))] (+ (phi n a) a -1)))) (defmacro timex "Evaluates expr and returns expr and the time it took in milliseconds." [expr] `(let [start# (. System (nanoTime)), ret# ~expr] [(/ (double (- (. System (nanoTime)) start#)) 1e6), ret#])) (defn pi-table "Count primes by powers of 10" [limit] (println "n time (s) pi(n) % prime") (loop [n 1, c 0] (let [[t, pi] (timex (prime-pi n))] (printf "1e%d %11.6f %11d %5.2f%%\n" c (/ t 1000) pi (/ (* 100.0 pi) n)) (if (< c limit) (recur (* 10 n) (inc c)))))) (comment user=> (pi-table 9) n time (s) pi(n) % prime 1e0 0.000949 0 0.00% 1e1 0.000764 4 40.00% 1e2 0.000762 25 25.00% 1e3 0.000822 168 16.80% 1e4 0.000759 1229 12.29% 1e5 0.046668 9592 9.59% 1e6 0.218717 78498 7.85% 1e7 2.206287 664579 6.65% 1e8 22.933560 5761455 5.76% 1e9 234.952010 50847534 5.08% )My experience is different that everyone else. I have no trouble with recursion limits, and my timings are quite fast. I just computed pi(10^9) from an initial load with no previously-cached results:
Petite Chez Scheme Version 8.4Copyright (c) 1985-2011 Cadence Research Systems
> (load "e:/prime.ss")
> (time (prime-pi #e1e9))
(time (prime-pi 1000000000))
35 collections
1014 ms elapsed cpu time, including 62 ms collecting
1010 ms elapsed real time, including 72 ms collecting
298285040 bytes allocated, including 277790944 bytes reclaimed
50847534
One second is a whole lot better than the four minutes David reported. What’s going on?
There is probably only a marginal effect if you run the scheme code with the cache empty or with previously cached results. The reason why it is so fast is that there are so many redundant calculations that don’t need to be performed when the results are cached. Without a cache, calculating phi for 1e9 will take (I don’t know, days?) Calculating phi for pi(1e6) naively without caching takes hours (and I’m not sure how long because I gave up after it was chugging for a few hours.)
Mike’s optimization removes some computational dead ends, and he also figured out what the recurrence actually does, so replacing a big chunk of the recurrence with the call to binary search the prime array saves a lot of time, but probably still leaves a lot of redundant computation in place. It’s time is still much improved over trying to compute phi naively without a cache.
Clojure is based on the JVM, and Java was not designed to be a functional programming language, so there’s not a whole lot of stack. I’m not sure why Python has so much problems. I worked on the Python code in the early 90’s and at that time stack frames were allocated on the heap. Because Python now has generators, there is even less reason to use the system stack for procedure calls, so it doesn’t make sense that there would be a stack limit…
This was fun, especially as my copy of Riesel arrived tonight. Two versions in Perl, not tested extensively. An obvious one:
#!/usr/bin/env perl
use strict;
use warnings; no warnings 'recursion';
use Memoize;
use Math::Prime::Util qw/primes/;
my @primea = @{primes(1_000_000)};
memoize('phi');
sub phi {
my ($x, $a) = @_;
return ($a ? int(($x+1)/2) : $x) if $a <= 1;
return 1 if $x < $primea[$a];
return phi($x, $a-1) - phi( int($x / $primea[$a-1]), $a-1);
}
sub pi {
my $n = shift;
return 0 if $n < 2;
my $a = pi(int(sqrt($n)));
return phi($n, $a) + $a - 1;
}
0.104s for 10**6 (78498), 41.1s for 10**9 (50847534).
I then made a non-recursive phi:
#!/usr/bin/env perl
use strict;
use warnings;
use Math::Prime::Util qw/primes/;
my @primea = @{primes(1_000_000)};
sub phi {
my ($x, $a) = @_;
return $x if $a == 0;
my @add = ($x);
my @sub;
my $sum = 0;
while ($a-- > 1) {
my $prime = $primea[$a];
my @newadd = map { int($_ / $prime) } @sub;
my @newsub = map { int($_ / $prime) } @add;
push @add, @newadd;
push @sub, @newsub;
$prime = $primea[$a-1];
my $count_ones = (scalar @add) - (scalar @sub);
@add = grep { $_ > $prime } @add;
@sub = grep { $_ > $prime } @sub;
$count_ones = $count_ones - (scalar @add) + (scalar @sub);
$sum += $count_ones;
}
$sum += int(($_+1)/2) for @add;
$sum -= int(($_+1)/2) for @sub;
$sum;
}
sub pi {
my $n = shift;
return 0 if $n < 2;
my $a = pi(int(sqrt($n)));
return phi($n, $a) + $a - 1;
}
0.016s for 10**6 (78498), 11.5s for 10**9 (50847534).
Math::Prime::Util::prime_count(10**9) takes under 0.5s but that’s cheating (it’s also written in C — its pure perl prime_count code is much slower).
Hopefully formatting will work this time (I wish there was a preview). A new version, with better phi(x,a), and also Lehmer essentially straight from from Riesel (page 22):
#!/usr/bin/env perl use strict; use warnings; use Math::Prime::Util qw/-nobigint primes prime_count prime_precalc/; use Time::HiRes qw(gettimeofday tv_interval); # Get 10M primes -- enough to test to at least 10^13 Math::Prime::Util::prime_precalc(10_000_000); my @primea = @{primes(10_000_000)}; sub phi { my ($x, $a) = @_; # Some shortcuts for small x, a = 0, a = 1 return $x if $a == 0; return int(($x+1)/2) if $a == 1; return ($x > 0 ? 1 : 0) if $x < $primea[$a]; my %vals = ( $x => 1 ); my $sum = 0; while ($a-- > 2) { my $p1 = $primea[$a]; my $p2 = $primea[$a-1]; my %newvals; while (my($v,$c) = each %vals) { next if $c == 0; # Prune out zero counts if ($v > $p2) { $newvals{$v} += $c; my $sval = int($v / $p1); if ($sval > $p2) { $newvals{$sval} -= $c; } else { $sum -= $c; } } } %vals = %newvals; } die unless $a == 1; while (my($v,$c) = each %vals) { $sum = $sum + $c * int(($v+1)/2) - $c * int((int($v/3)+1)/2); } $sum; } sub legendre_pi { my $n = shift; return 0 if $n < 2; my $a = legendre_pi(int(sqrt($n))); return phi($n, $a) + $a - 1; } sub lehmer { my $x = shift; return prime_count($x) if $x < 1_000; my $z = int(sqrt($x+0.5)); my $a = lehmer(int(sqrt($z)+0.5)); my $b = lehmer($z); my $c = lehmer(int($x**(1/3)+0.5)); my $sum = phi($x, $a) + int(($b + $a - 2) * ($b - $a + 1) / 2); # Make sure these calls are fast. # 1M for 10**8, 32M for 10**10, 5600M for 10**12 prime_precalc( int($x / $primea[$a]) ); foreach my $i ($a+1 .. $b) { my $w = int($x / $primea[$i-1]); $sum = $sum - prime_count($w); if ($i <= $c) { my $bi = prime_count(int(sqrt($w)+0.5)); foreach my $j ($i .. $bi) { $sum = $sum - prime_count(int($w / $primea[$j-1])) + $j - 1; } } } $sum; } for my $e (0 .. 13) { my $start = [gettimeofday]; my $pi = lehmer( int(10 ** $e) ); my $seconds = tv_interval($start); printf "10^$e %10.6fs $pi\n", $seconds; }Timing on my machine for 10**9 is 6.2s for Legendre, 0.12s for Lehmer. I tried Lehmer counts 10**12 in 24.9s, 10**13 in 186.2s. Pretty fantastic for Perl. I am using a Perl/XS library to get the prime list and do the small prime_counts in the Lehmer loop. In theory those could all be precalculated (MPU’s prime count is doing bit counts on the pre-sieved area to get the result).
More timing:
primesieve MPU Legendre Lehmer Python-Legendre 10** 8 0.01s 0.08s 0.82s 0.02s 12.4s 10** 9 0.04s 0.5s 6.2s 0.09s 123.3s 10**10 0.4s 5.3s 47.7s 0.49s 1272.6s 10**11 4.3s 67.5s 408.0s 3.1s 10**12 52.6s 1029.9s 22.8s 10**13 642.0s 177.7s 10**14 1472.5sprimesieve is helped a lot by using 12 threads but it’s seriously fast even with only one thread (4-10x slower than multi-threaded). Both it and Math::Prime::Util are sieving the entire range (in segments) and doing fast bit counts, and both use very low memory since they run segmented. “Python-Legendre” is Mike’s code on my machine. The Lehmer results are great (IMO) for a first pass. Lots of optimization could be done on the “small” prime_count calls rather than its current method of making a big sieve and then counting all the bits up to the target, for each call.
Memoizing the phi function is the way to go. Ruby (which is slower language in execution than Clojure, generally) can compute pi(1e9) in 70 seconds, vs. 235 in Clojure. This is because there are no stack space issues here…
require 'benchmark' require_relative 'lazy-sieve' $primegen = LazySieve.new $prime = [0] def required_primes(min) if $prime[-1] < min then ($primegen.upto {|n| n <= min}).each {|p| $prime << p} end end def small_pi(x) lo = 0 hi = $prime.length while lo < hi do m = (lo + hi) / 2 case x <=> $prime[m] when 0 then return m when -1 then hi = m when 1 then lo = m + 1 end end return lo - 1 end $phi_cache = {} def phi(x, a) y = $phi_cache[[x,a]] return y unless y.nil? $phi_cache[[x,a]] = if a == 1 then (x + 1) / 2 else phi(x, a-1) - phi(x / $prime[a], a - 1) end end def legendre_pi(n) rt_n = Math.sqrt(n).to_i required_primes [rt_n, 100].max if n <= $prime[-1] then small_pi(n) else a = small_pi(rt_n) phi(n, a) + a - 1 end end def pi_table(limit) n = 1 pi = 0 puts "n time (s) pi(n) % prime" (0..limit).each do |c| seconds = Benchmark.realtime {pi = legendre_pi(n)} printf "1e%d %11.6f %11d %5.2f%%\n", c, seconds, pi, pi.to_f/n * 100 n *= 10 end end pi_table 9 PS C:\Users\dave\Documents\dev\ruby> ruby .\prime-pi.rb n time (s) pi(n) % prime 1e0 0.003001 0 0.00% 1e1 0.000000 4 40.00% 1e2 0.000000 25 25.00% 1e3 0.001000 168 16.80% 1e4 0.002000 1229 12.29% 1e5 0.020001 9592 9.59% 1e6 0.120007 78498 7.85% 1e7 0.769044 664579 6.65% 1e8 6.259358 5761455 5.76% 1e9 70.721045 50847534 5.08%David, I don’t have your LazySieve, but just threw in a standard sieve to the nth_upper_limit of rt_n (and verified the sieving is a trivial amount of time and the prime counts are identical). I get “stack level too deep” errors in phi when trying to run 1e8 using Ruby 1.8.7. I had to set ‘ulimit -s 32768’ to get it to run 1e9 successfully (hmm, looks like ruby for Windows presets the stack to a giant size so that’s why it runs fine for you). Sadly I’m getting times that are *much* slower than yours — perhaps a 1.8.7 vs. 1.9.x issue. It’s 3.1s for 1e7, 58s for 1e8, and over 10 minutes for 1e9.
I believe ‘Memoizing the phi function is the way to go’ only applies to very small n values. My Legendre version in Perl above is over 10x faster than your Ruby numbers, and Perl isn’t _that_ much faster (I have a fast computer, but my memoized versions in Perl are slower than your Ruby numbers). When I did memoization it was not only slow for large n, but used gawdawful amounts of memory. The ruby code is using over 700MB of memory computing pi(10**9).
One of the main issues I ran into when doing Lehmer with large values (see https://programmingpraxis.com/2011/07/26/more-prime-counting-functions/) was space more so than just speed, making simple memoization a very bad solution.
I am afraid that I have been out of the loop as I lost email notifications for quite some time, but eventually noticed this task. The problems being experienced with recursion depth and the need for memoization are due to the naive formulation of the task.
RosettaCode reasonably recently added this same task (https://rosettacode.org/wiki/Legendre_prime_counting_function) to which I added “Comments to the Task”. It turns out that there is a very simple optimizaton of stopping ” “splitting” the φ(x, a) “tree” when ‘x’ is zero” that stops the exponential growth of the recursion tree and will give much better performance even without memoization. For instance, the non-memoizing Haskell version I contributed there takes only about 273 milliseconds to count the primes to 1e9: https://rosettacode.org/wiki/Legendre_prime_counting_function#Non_Memoized_Haskell_Versions.
There are other optimizations covered there, such as reducing recursion by using a “bottom-up” rather than “top-down” recursion and a look-up table for small values of “phi”, but the major advance is to use a scheme that doesn’t use recursion at all but partial sieving with linear arrays as implemented in several languages there, including Haskell and Nim. Using this technique, the primes to 1e9 can be calculated in just a millisecond or two, to 1e11 in about 10 milliseconds, and to the quite large range of 1e16 in minutes. Note that not using recursion is a form of “LMO” as applied to the Meissel (rather than the Legendre) technique but converted back to apply to the Legendre mathematics, and that this technique is no longer linear time complexity with range (or exponential without memoization or the short-circuiting optimization). However, using the Legendre partial-sieving technique still has its limits as it uses considerably more memory (proportional to the square root of the counting range rather than the cube root) and therefore is limited to the range of about 1e16 on many modern desktop computers.
It is also more difficult to adapt even the Legendre partial sieving technique to multi-threading than the “LMO” and later techniques as applied to the Meissel mathematics, which have been used in the Kim Walich “primecount” (https://github.com/kimwalisch/primecount) implementations to count the number of primes to very large ranges of 1e29 and likely advancing…
It is not too hard to adapt the non-threading partial-sieving “LMO” technique to even use with JavaScript (https://stackoverflow.com/questions/43432760/looking-for-a-fast-prime-counting-function/70049343#70049343) which allows one to count the number of primes up to almost 1e16 in a minute or two…