Counting Primes Using Legendre’s Formula

July 22, 2011

We will need functions to compute the nth prime, for the calculation of φ, and to compute π(x) for small values of x, to terminate the recursion in the formula for π. We do both by using the sieve of Eratosthenes to create a vector of the primes up to some convenient limit (the vector has an additional nonsense value in the 0th position so we don’t have to keep subtracting 1 whenever we index the vector), then just index the vector to compute the nth prime and use binary search to compute π:

(define ps (list->vector (cons -1 (primes #e1e5))))

(define (p n) (vector-ref ps n))

(define (pi n) (let loop ((lo 1) (hi (- (vector-length ps) 1))) (let ((mid (quotient (+ lo hi) 2))) (cond ((<= hi lo) mid) ((< n (p mid)) (loop lo (- mid 1))) ((< (p mid) n) (loop (+ mid 1) hi)) (else mid)))))

The φ function is computed according to the recurrence equation. We memoize the function because small results appear again and again in the recursive calculations:

(define phi (let ((memo (make-hash (lambda (x) (+ (* 1000 (car x)) (cadr x))) equal? #f 9973))) (lambda args (let ((x (car args)) (a (cadr args)) (t (memo 'lookup args))) (cond (t t) ; return memoized value ((= a 1) (let ((t (quotient (+ x 1) 2))) (memo 'insert args t) t)) (else (let ((t (- (phi x (- a 1)) (phi (quotient x (p a)) (- a 1))))) (memo 'insert args t) t)))))))

Now the calculation of π is simple:

(define (prime-pi n) (let ((a (pi (isqrt n)))) (+ (phi n a) a -1)))

Here is Legendre’s calculation:

> (prime-pi 1000000) 78498

We used hash tables and the isqrt function from the Standard Prelude, and primes from a previous exercise. You can run the program at http://programmingpraxis.codepad.org/sNCV5Y0d.

Posted by programmingpraxis

Filed in Exercises

17 Comments »

17 Responses to “Counting Primes Using Legendre’s Formula”

Graham said

July 25, 2011 at 12:23 AM

Even with memoization, my Python solution didn’t finish quickly enough for me. Interestingly, (or not, as the case may be)
the following naive Haskell implementation finishes much more quickly (especially when compiled via GHC).

import Data.List (foldl')
import Data.Numbers.Primes (primes)

sieve :: Int -> [Int] -> [Int]
sieve p = filter (\x -> x `mod` p /= 0)

phi :: Int -> Int -> Int
phi x a = length $ foldl' (flip sieve) [1..x] (take a primes)

isqrt :: Int -> Int
isqrt = floor . sqrt . fromIntegral

primePi :: Int -> Int
primePi 1 = 0
primePi x = phi x a + a - 1 where a = primePi $ isqrt x

main :: IO ()
main = print $ primePi 1000000

I guess the next step is to get good enough at Haskell to figure out a way to cut the number of phi calls.

programmingpraxis said
July 25, 2011 at 12:37 AM
The reference solution finds pi(10^6) in 72ms on my machine, and pi(10^7) in about ten times that, which is still less than a second. What kind of times are you seeing?

Sieving by trial division to get a list of the primes is ugly. The fastest, prettiest prime generator that I know in Haskell is by Melissa O’Neill: http://www.cs.hmc.edu/~oneill/papers/Sieve-JFP.pdf.
Graham said
July 25, 2011 at 1:26 PM
The reference solution finds pi(10^6) in about 0.3 seconds and pi(10^7) in about 1.6 (under Petite Chez Scheme). Running the naive Haskell above with GHC’s runhaskell command (so, intepreted) takes almost 9.6 seconds for pi(10^6), while compiled with gch --make -O2 take about 0.5 seconds. My Haskell interpreted chokes when going for pi(10^7), but manages to finish in about 11 seconds when compiled. Thanks for the O’Neill paper; I think I’d come across it before, but hadn’t checked it out in depth.

Mike said

July 26, 2011 at 10:46 PM

Python 3.2

from bisect import bisect
from functools import lru_cache
from math import sqrt
from utils import primesTo

prime = list(primesTo(10000))

@lru_cache(maxsize=None)
def phi(x, a):
    if a == 1:
        return (x+1)//2
    else:
        return phi(x, a-1) - phi(x//prime[a - 1], a - 1)


def pi(n):
    ''' return number of primes <= n

    >>> pi(10)
    4
    >>> pi(1000)
    168
    >>> pi(1000000)
    78498
    '''
        
    if n < prime[-1]:
        return bisect(prime, n)
    
    else:
        a = pi(int(sqrt(n)))
        return phi(n, a) + a - 1

##if __name__ == "__main__":
##    import doctest
##    doctest.testmod()

Graham said
July 27, 2011 at 1:14 PM
@Mike, your solution seems to run into the same problem my Python work did, even with Python 3.2: trying pi(10^7) causes the interpreter to crash after exceeding the recursion limit. Any thoughts?
Mike said
August 3, 2011 at 12:06 AM
@Graham,

I had trouble with the stack limit as well.

Evaluating the code for pi() with n = 1e7, a = 966. So, pi(1e7) = phi(1e7, 966) + 966 – 1. phi(1e7, 966) depends on phi(1e7,965), which depends on phi(1e7,964), … down to phi(1e7, 1). That’s 966 stack frames, which is close to the stack overflow limit.

I suspect the python shell uses a few stack frames, but probably not enough to push pi(1e7) over the limit.

I believe the culprit is the lru cache decorator for the phi() function. Looking at the source code in functools.py, when there is a cache miss, the wrapped function calls the original code for phi(). So I think that on a cache miss, two frames get pushed on the stack (one for the call to the wrapped function and one for the call to the original function). The original function may the recurse and call the wrapped function, etc.

I have not tested my hypothesis. However, if I pre-populate the cache by calling pi() in a loop with gradually increasing arguments, I can calculate pi() for larger arguments without hitting the stack limit.
programmingpraxis said
August 3, 2011 at 3:00 AM
Scheme has no trouble with stack overflow because stack frames are stored on the heap, not on the stack.

I don’t know enough about Python to help. Here are some random thoughts on the matter. I don’t know if any of them are useful.

Setting a higher stack overflow limit only helps for a little while. I don’t know if that is possible in Python, but in any event it is not a serious solution.

Does Python have some kind of lazy evaluation method? If it does, and if it uses the heap instead of the stack, maybe you could arrange to use that instead of the normal recursion stack.

You could build your own stack out of cons pairs stored on the heap, and make it however big you want, instead of using Python’s built-in recursion stack.

You don’t necessarily need to use the recurrence equation to calculate Legendre’s sum. See the MathWorld article for Legendre’s Formula. Specifically, equation (1) may be useful.
Mike said
August 3, 2011 at 11:31 PM
Here is a Python 3 version of phi(x,a) using a separate stack.

I optimized the code a bit.
–If x < prime[a], the second term in the recurrence equation is zero, and the first term in the recurrence equation is basically doing a linear search through the primes to find the largest prime that divides x. The bisect() call replaces the linear search with a binary search to find the largest prime less than x.
–As stated in the problem, the recurrence terminates when a == 1. The recurrence can also be terminated when x == 1, where phi(1,a) is +1 or -1 depending on the sign the phi term would have (the variable s in the code).

from bisect import bisect
from math import sqrt
from utils import primesTo

# pad prime[] so that prime[n] is the nth prime number
prime = [0] + primesTo(100000)

def phi(x, a):
if a == 0:
return x

stack = [(1,x,a)]
value = 0
while stack:
s, x, a = stack.pop()

if x == 1:
value += s

else:
if x < prime[a]:
a = bisect(prime, x, 0, a) – 1

if a == 1:
value += s * ((x + 1)//2)

else:
stack.extend([(s, x, a-1), (-s, x//prime[a], a-1)])

return value

def pi(x):
if x < 2:
return 0

else:
a = pi(int(sqrt(x)))
return phi(x, a) + a – 1

There is no caching of intermediate results, so it is not very fast:

_n_______time (s)_____pi____
10^0 0.0000083810 0
10^1 0.0000455365 4
10^2 0.0001030857 25
10^3 0.0007869715 168
10^4 0.0064016516 1229
10^5 0.0681326817 9592
10^6 0.2781906639 78498
10^7 2.4777696607 664579
10^8 25.5805003150 5761455
10^9 262.7440423294 50847534

Mike said

August 3, 2011 at 11:37 PM

With formatting:

from bisect import bisect
from math import sqrt
from utils import primesTo

# pad prime[] so that prime[n] is the nth prime number
prime = [0] + primesTo(100000)

def phi(x, a):
    if a == 0:
        return x

    stack = [(1,x,a)]
    value = 0
    while stack:
        s, x, a = stack.pop()

        if x == 1:
            value += s

        else:
            if x < prime[a]:
                a = bisect(prime, x, 0, a) – 1

            if a == 1:
                value += s * ((x + 1)//2)

            else:
                stack.extend([(s, x, a-1), (-s, x//prime[a], a-1)])

    return value

def pi(x):
    if x < 2:
        return 0

    else:
        a = pi(int(sqrt(x)))
        return phi(x, a) + a – 1

_n_______time (s)_____pi____
10^0   0.0000083810   0
10^1   0.0000455365   4
10^2   0.0001030857   25
10^3   0.0007869715   168
10^4   0.0064016516   1229
10^5   0.0681326817   9592
10^6   0.2781906639   78498
10^7   2.4777696607   664579
10^8  25.5805003150   5761455
10^9 262.7440423294   50847534

David said

August 24, 2012 at 2:49 PM

Clojure also runs out of stack space, and just using a stack and a loop is not obvious how to memoize the intermediate results. So I took advantage of Mike’s optimization above (I admit I never would have figured out what the recurrence actually does on my own.) Uses O’Neil sieve algorithm to create enough primes to calculate pi to 1e9.

(load "lazy-sieve")

(def max-prime (int 1e9)) 
(def prime  (into (vector-of :int 0) (take-while #(< %1 (int (Math/sqrt max-prime))) lazy-primes)))

(defn small-pi
   "pi(n) for small n {n <= (last prime)}"
   [n]
   (let [pos (java.util.Collections/binarySearch prime (int n))]
      (if (>= pos 0)
         pos
         (- (+ pos 2)))))

(defn phi [x a]
   (loop [stack (list [1, x, a]), n 0]
      (if (not (empty? stack))
         (let [[sgn, x, a] (first stack), l (rest stack)]
            (if (= x 1)
               (recur l (+ n sgn))
               (let [a' (if (< x (prime a))  (small-pi x)  a)]
                  (if (= a' 1)
                     (recur l (+ n (* sgn (int (/ (inc x) 2)))))
                     (recur (conj
                               (conj l [sgn, x, (dec a')])
                               [(- sgn), (int (/ x (prime a'))), (dec a')])
                            n)))))
         n)))

(defn prime-pi
   "Compute pi(n), upto 1 billion"
   [n]
   (if (or (< n 0)  (> n max-prime))
      (throw (IllegalArgumentException. "argument to prime-pi out of range.")))

   (if (<= n (last prime))
      (small-pi n)
      (let [a (prime-pi (int (Math/sqrt n)))]
         (+ (phi n a) a -1))))

(defmacro timex
   "Evaluates expr and returns expr and the time it took in milliseconds."
   [expr]
   `(let [start# (. System (nanoTime)), ret# ~expr]
       [(/ (double (- (. System (nanoTime)) start#)) 1e6), ret#]))

(defn pi-table
   "Count primes by powers of 10"
   [limit]
   (println "n      time (s)       pi(n)    % prime")
   (loop [n 1, c 0]
      (let [[t, pi] (timex (prime-pi n))]
         (printf "1e%d %11.6f %11d    %5.2f%%\n" c (/ t 1000) pi (/ (* 100.0 pi) n))
         (if (< c limit)
            (recur (* 10 n) (inc c))))))

(comment
user=> (pi-table 9)
n      time (s)       pi(n)    % prime
1e0    0.000949           0     0.00%
1e1    0.000764           4    40.00%
1e2    0.000762          25    25.00%
1e3    0.000822         168    16.80%
1e4    0.000759        1229    12.29%
1e5    0.046668        9592     9.59%
1e6    0.218717       78498     7.85%
1e7    2.206287      664579     6.65%
1e8   22.933560     5761455     5.76%
1e9  234.952010    50847534     5.08%
)

programmingpraxis said
August 24, 2012 at 4:55 PM
My experience is different that everyone else. I have no trouble with recursion limits, and my timings are quite fast. I just computed pi(10^9) from an initial load with no previously-cached results:

Petite Chez Scheme Version 8.4 Copyright (c) 1985-2011 Cadence Research Systems
> (load "e:/prime.ss") > (time (prime-pi #e1e9)) (time (prime-pi 1000000000)) 35 collections 1014 ms elapsed cpu time, including 62 ms collecting 1010 ms elapsed real time, including 72 ms collecting 298285040 bytes allocated, including 277790944 bytes reclaimed 50847534

One second is a whole lot better than the four minutes David reported. What’s going on?
David said
September 9, 2012 at 10:53 PM
There is probably only a marginal effect if you run the scheme code with the cache empty or with previously cached results. The reason why it is so fast is that there are so many redundant calculations that don’t need to be performed when the results are cached. Without a cache, calculating phi for 1e9 will take (I don’t know, days?) Calculating phi for pi(1e6) naively without caching takes hours (and I’m not sure how long because I gave up after it was chugging for a few hours.)

Mike’s optimization removes some computational dead ends, and he also figured out what the recurrence actually does, so replacing a big chunk of the recurrence with the call to binary search the prime array saves a lot of time, but probably still leaves a lot of redundant computation in place. It’s time is still much improved over trying to compute phi naively without a cache.

Clojure is based on the JVM, and Java was not designed to be a functional programming language, so there’s not a whole lot of stack. I’m not sure why Python has so much problems. I worked on the Python code in the early 90’s and at that time stack frames were allocated on the heap. Because Python now has generators, there is even less reason to use the system stack for procedure calls, so it doesn’t make sense that there would be a stack limit…
Dana said
November 10, 2012 at 9:17 AM
This was fun, especially as my copy of Riesel arrived tonight. Two versions in Perl, not tested extensively. An obvious one:

#!/usr/bin/env perl use strict; use warnings; no warnings 'recursion'; use Memoize; use Math::Prime::Util qw/primes/;
my @primea = @{primes(1_000_000)}; memoize('phi'); sub phi { my ($x, $a) = @_; return ($a ? int(($x+1)/2) : $x) if $a <= 1; return 1 if $x < $primea[$a]; return phi($x, $a-1) - phi( int($x / $primea[$a-1]), $a-1); } sub pi { my $n = shift; return 0 if $n < 2;
my $a = pi(int(sqrt($n))); return phi($n, $a) + $a - 1; }
0.104s for 10**6 (78498), 41.1s for 10**9 (50847534).

I then made a non-recursive phi:
#!/usr/bin/env perl use strict; use warnings; use Math::Prime::Util qw/primes/;
my @primea = @{primes(1_000_000)}; sub phi { my ($x, $a) = @_; return $x if $a == 0; my @add = ($x); my @sub; my $sum = 0; while ($a-- > 1) { my $prime = $primea[$a]; my @newadd = map { int($_ / $prime) } @sub; my @newsub = map { int($_ / $prime) } @add; push @add, @newadd; push @sub, @newsub; $prime = $primea[$a-1]; my $count_ones = (scalar @add) - (scalar @sub); @add = grep { $_ > $prime } @add; @sub = grep { $_ > $prime } @sub; $count_ones = $count_ones - (scalar @add) + (scalar @sub); $sum += $count_ones; } $sum += int(($_+1)/2) for @add; $sum -= int(($_+1)/2) for @sub; $sum; } sub pi { my $n = shift; return 0 if $n < 2;
my $a = pi(int(sqrt($n))); return phi($n, $a) + $a - 1; }
0.016s for 10**6 (78498), 11.5s for 10**9 (50847534).

Math::Prime::Util::prime_count(10**9) takes under 0.5s but that’s cheating (it’s also written in C — its pure perl prime_count code is much slower).

danaj said

November 11, 2012 at 6:11 AM

Hopefully formatting will work this time (I wish there was a preview). A new version, with better phi(x,a), and also Lehmer essentially straight from from Riesel (page 22):

#!/usr/bin/env perl
use strict;
use warnings;
use Math::Prime::Util qw/-nobigint primes prime_count prime_precalc/;
use Time::HiRes qw(gettimeofday tv_interval);

# Get 10M primes -- enough to test to at least 10^13
Math::Prime::Util::prime_precalc(10_000_000);
my @primea = @{primes(10_000_000)};

sub phi {
  my ($x, $a) = @_;
  # Some shortcuts for small x, a = 0, a = 1
  return $x if $a == 0;
  return int(($x+1)/2) if $a == 1;
  return ($x > 0 ? 1 : 0) if $x < $primea[$a];

  my %vals = ( $x => 1 );
  my $sum = 0;
  while ($a-- > 2) {
    my $p1 = $primea[$a];
    my $p2 = $primea[$a-1];
    my %newvals;
    while (my($v,$c) = each %vals) {
      next if $c == 0;   # Prune out zero counts
      if ($v > $p2) {
        $newvals{$v} += $c;
        my $sval = int($v / $p1);
        if ($sval > $p2) {
          $newvals{$sval} -= $c;
        } else {
          $sum -= $c;
        }
      }
    }
    %vals = %newvals;
  }
  die unless $a == 1;
  while (my($v,$c) = each %vals) {
    $sum = $sum + $c * int(($v+1)/2) - $c * int((int($v/3)+1)/2);
  }
  $sum;
}

sub legendre_pi {
  my $n = shift;
  return 0 if $n < 2;

  my $a = legendre_pi(int(sqrt($n)));
  return phi($n, $a) + $a - 1;
}


sub lehmer {
  my $x = shift;
  return prime_count($x) if $x < 1_000;
  my $z = int(sqrt($x+0.5));
  my $a = lehmer(int(sqrt($z)+0.5));
  my $b = lehmer($z);
  my $c = lehmer(int($x**(1/3)+0.5));
  my $sum = phi($x, $a) + int(($b + $a - 2) * ($b - $a + 1) / 2);

  # Make sure these calls are fast.
  # 1M for 10**8, 32M for 10**10, 5600M for 10**12
  prime_precalc( int($x / $primea[$a]) );
  
  foreach my $i ($a+1 .. $b) {
    my $w = int($x / $primea[$i-1]);
    $sum = $sum - prime_count($w);
    if ($i <= $c) {
      my $bi = prime_count(int(sqrt($w)+0.5));
      foreach my $j ($i .. $bi) {
        $sum = $sum - prime_count(int($w / $primea[$j-1])) + $j - 1;
      }
    }
  }
  $sum;
}

for my $e (0 .. 13) {
  my $start = [gettimeofday];
  my $pi = lehmer( int(10 ** $e) );
  my $seconds = tv_interval($start);
  printf "10^$e %10.6fs $pi\n", $seconds;
}

Timing on my machine for 10**9 is 6.2s for Legendre, 0.12s for Lehmer. I tried Lehmer counts 10**12 in 24.9s, 10**13 in 186.2s. Pretty fantastic for Perl. I am using a Perl/XS library to get the prime list and do the small prime_counts in the Lehmer loop. In theory those could all be precalculated (MPU’s prime count is doing bit counts on the pre-sieved area to get the result).

More timing:

        primesieve   MPU     Legendre   Lehmer   Python-Legendre
10** 8     0.01s      0.08s    0.82s     0.02s      12.4s
10** 9     0.04s      0.5s     6.2s      0.09s     123.3s
10**10     0.4s       5.3s    47.7s      0.49s    1272.6s
10**11     4.3s      67.5s   408.0s      3.1s
10**12    52.6s    1029.9s              22.8s
10**13   642.0s                        177.7s
10**14                                1472.5s

primesieve is helped a lot by using 12 threads but it’s seriously fast even with only one thread (4-10x slower than multi-threaded). Both it and Math::Prime::Util are sieving the entire range (in segments) and doing fast bit counts, and both use very low memory since they run segmented. “Python-Legendre” is Mike’s code on my machine. The Lehmer results are great (IMO) for a first pass. Lots of optimization could be done on the “small” prime_count calls rather than its current method of making a big sieve and then counting all the bits up to the target, for each call.

David said

December 27, 2012 at 2:43 AM

Memoizing the phi function is the way to go. Ruby (which is slower language in execution than Clojure, generally) can compute pi(1e9) in 70 seconds, vs. 235 in Clojure. This is because there are no stack space issues here…

require 'benchmark'
require_relative 'lazy-sieve'

$primegen = LazySieve.new
$prime = [0]

def required_primes(min)
    if $prime[-1] < min then
        ($primegen.upto {|n| n <= min}).each {|p| $prime << p}
    end
end


def small_pi(x)
    lo = 0
    hi = $prime.length
    while lo < hi do
        m = (lo + hi) / 2
        case x <=> $prime[m]
            when  0 then return m
            when -1 then hi = m
            when  1 then lo = m + 1
        end
    end
    return lo - 1
end


$phi_cache = {}
def phi(x, a)
    y = $phi_cache[[x,a]]
    return y unless y.nil?

    $phi_cache[[x,a]] =
        if a == 1 then
            (x + 1) / 2
        else
            phi(x, a-1) - phi(x / $prime[a], a - 1)
        end
end


def legendre_pi(n)
    rt_n = Math.sqrt(n).to_i
    required_primes [rt_n, 100].max

    if n <= $prime[-1] then
        small_pi(n)
    else
        a = small_pi(rt_n)
        phi(n, a) + a - 1
    end
end


def pi_table(limit)
    n = 1
    pi = 0

    puts "n      time (s)       pi(n)    % prime"
    (0..limit).each do |c|
        seconds = Benchmark.realtime {pi = legendre_pi(n)}
        printf "1e%d %11.6f %11d    %5.2f%%\n", c, seconds, pi, pi.to_f/n * 100
        n *= 10
    end
end

pi_table 9


PS C:\Users\dave\Documents\dev\ruby> ruby .\prime-pi.rb
n      time (s)       pi(n)    % prime
1e0    0.003001           0     0.00%
1e1    0.000000           4    40.00%
1e2    0.000000          25    25.00%
1e3    0.001000         168    16.80%
1e4    0.002000        1229    12.29%
1e5    0.020001        9592     9.59%
1e6    0.120007       78498     7.85%
1e7    0.769044      664579     6.65%
1e8    6.259358     5761455     5.76%
1e9   70.721045    50847534     5.08%

danaj said
December 27, 2012 at 5:35 AM
David, I don’t have your LazySieve, but just threw in a standard sieve to the nth_upper_limit of rt_n (and verified the sieving is a trivial amount of time and the prime counts are identical). I get “stack level too deep” errors in phi when trying to run 1e8 using Ruby 1.8.7. I had to set ‘ulimit -s 32768’ to get it to run 1e9 successfully (hmm, looks like ruby for Windows presets the stack to a giant size so that’s why it runs fine for you). Sadly I’m getting times that are *much* slower than yours — perhaps a 1.8.7 vs. 1.9.x issue. It’s 3.1s for 1e7, 58s for 1e8, and over 10 minutes for 1e9.

I believe ‘Memoizing the phi function is the way to go’ only applies to very small n values. My Legendre version in Perl above is over 10x faster than your Ruby numbers, and Perl isn’t _that_ much faster (I have a fast computer, but my memoized versions in Perl are slower than your Ruby numbers). When I did memoization it was not only slow for large n, but used gawdawful amounts of memory. The ruby code is using over 700MB of memory computing pi(10**9).

One of the main issues I ran into when doing Lehmer with large values (see https://programmingpraxis.com/2011/07/26/more-prime-counting-functions/) was space more so than just speed, making simple memoization a very bad solution.
GordonBGood said
August 3, 2024 at 9:20 PM
I am afraid that I have been out of the loop as I lost email notifications for quite some time, but eventually noticed this task. The problems being experienced with recursion depth and the need for memoization are due to the naive formulation of the task.

RosettaCode reasonably recently added this same task (https://rosettacode.org/wiki/Legendre_prime_counting_function) to which I added “Comments to the Task”. It turns out that there is a very simple optimizaton of stopping ” “splitting” the φ(x, a) “tree” when ‘x’ is zero” that stops the exponential growth of the recursion tree and will give much better performance even without memoization. For instance, the non-memoizing Haskell version I contributed there takes only about 273 milliseconds to count the primes to 1e9: https://rosettacode.org/wiki/Legendre_prime_counting_function#Non_Memoized_Haskell_Versions.

There are other optimizations covered there, such as reducing recursion by using a “bottom-up” rather than “top-down” recursion and a look-up table for small values of “phi”, but the major advance is to use a scheme that doesn’t use recursion at all but partial sieving with linear arrays as implemented in several languages there, including Haskell and Nim. Using this technique, the primes to 1e9 can be calculated in just a millisecond or two, to 1e11 in about 10 milliseconds, and to the quite large range of 1e16 in minutes. Note that not using recursion is a form of “LMO” as applied to the Meissel (rather than the Legendre) technique but converted back to apply to the Legendre mathematics, and that this technique is no longer linear time complexity with range (or exponential without memoization or the short-circuiting optimization). However, using the Legendre partial-sieving technique still has its limits as it uses considerably more memory (proportional to the square root of the counting range rather than the cube root) and therefore is limited to the range of about 1e16 on many modern desktop computers.

It is also more difficult to adapt even the Legendre partial sieving technique to multi-threading than the “LMO” and later techniques as applied to the Meissel mathematics, which have been used in the Kim Walich “primecount” (https://github.com/kimwalisch/primecount) implementations to count the number of primes to very large ranges of 1e29 and likely advancing…

It is not too hard to adapt the non-threading partial-sieving “LMO” technique to even use with JavaScript (https://stackoverflow.com/questions/43432760/looking-for-a-fast-prime-counting-function/70049343#70049343) which allows one to count the number of primes up to almost 1e16 in a minute or two…

Programming Praxis