## More Prime-Counting Functions

### July 26, 2011

In the previous exercise we studied Legendre’s prime-counting function. In today’s exercise we will look at two more prime-counting functions, one from Ernst Meissel in the late 1800s and the other from Derrick Lehmer in the mid 1990s.

Both formulas are based on Legendre’s original formula; in both cases, they simply rearrange the terms of Legendre’s formula to eliminate some work. We won’t try to give derivations, as they are complex and chock-full of Greek letters; if you are interested, Hans Riesel’s book Prime Numbers and Computer Methods for Factorization was the source for all three formulas.

Meissel: $\pi(x) = \phi(x, c) + \frac{(b+c-2)(b-c+1)}{2} - \sum_{i=c+1}^b \pi\left(\left\lfloor\frac{x}{p_i}\right\rfloor\right)$, where $b = \pi(\lfloor \sqrt{x} \rfloor)$ and $c = \pi(\lfloor \sqrt[3]{x} \rfloor)$

Lehmer: $\pi(x) = \phi(x, a) + \frac{(b+a-2)(b-a+1)}{2} - \sum_{i=a+1}^b \pi \left( \left\lfloor \frac{x}{p_i} \right\rfloor \right) - \sum_{i=a+1}^c \sum_{j=i}^{b_i} \left( \pi \left( \left\lfloor \frac{x}{p_i p_j} \right\rfloor \right) - (j-1) \right)$, where $a = \pi(\lfloor \sqrt[4]{x} \rfloor)$, $b = \pi(\lfloor \sqrt{x} \rfloor)$, $c = \pi(\lfloor \sqrt[3]{x} \rfloor)$, and $b_i = \pi(\lfloor \sqrt{x / p_i} \rfloor)$ for $a < i <= c$.

Your task is to implement the prime-counting functions of Meissel and Lehmer, then compare timings with Legendre's prime-counting function of the previous exercise. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

Pages: 1 2

### 6 Responses to “More Prime-Counting Functions”

1. Graham said

I think your write-up of Lehmer’s formula on the first page doesn’t match your code on the second. Specifically, what is b_i supposed to be? And is the fraction out front (b + a - 2) * (b - a + 1) / 2 or (b + c - 2) * (b - c + 1) / 2?

2. programmingpraxis said

Oops. I didn’t do that very well, did I? It’s now fixed.

b_i = pi(isqrt(x/p_i) for a<i<=c

The fraction is (b+a-2)(b-a+1)/2

3. Graham said

I’m completely off my game; Python is either giving me nonsense or complaining of exceeding the recursion depth. I’m going to start over, so I won’t have a solution for a bit.

4. danaj said

In Lehmer’s formula, it should be \phi(x, a). It looks like your program does the right thing. I just set a = c for Meissel, then it becomes a small change from Lehmer — just remove the last double summation (and obviously a and c are different).

5. programmingpraxis said

Fixed. Thank you.

6. danaj said

I don’t know if the formatting will work — crossing fingers.

Code: lehmer.c on github

alternate: link to exponent vs. log time graph

I wrote this over the last couple weeks as part of the Perl module Math::Prime::Util module, where it’s given a very nice speedup for prime_count and nth_prime. I decided to go ahead and make it work as a standalone program using primesieve, and add the Meissel formula. Thanks to this blog for giving me the added impetus to do it.

I use primesieve with its parallel siever, which means given 12 cores (6 + hyperthreading) it’s very fast. I don’t use parallelism anywhere else in the program, which is one reason why Legendre comes out so bad compared to straight sieving. We’re comparing my serial phi(x,a) code to primesieve’s optimized sieving. Also, I use primesieve for some of the large segment sieves at the end of the Lehmer and Meissel calculations. In theory I could recurse to get these values, but given, for example: Pi(5620819515485) when we just calculated the result of Pi(5615769079576), it’s a lot faster to just ask for the primes in that range.

The phi(x,a) calculation is an interesting exercise. Given that I want a program to work for n = 10^18 or higher, using simple recursion is not going to cut it. At some point space becomes as important as speed, and simple memoization won’t cut it either with hundreds of millions of unique x,a pairs. The solution I came up with was to walk down ‘a’ values holding a list of the current ‘x’ values for this ‘a’. It became clear that many of these values are identical, so we can coalesce them into a single unique ‘x’ and a signed count. At any point we’re only storing the unique ‘x’ values and a count for a single ‘a’ value. In Perl I used two hashes, while my solution in C is two heaps. augmented with small arrays to help performance at the bottom end where it becomes dense (that is, we will end up with a non-zero count for basically every ‘x’ value below some threshold). I still feel like this could be improved — two obvious non-optimal things are (1) it coalesces similar ‘x’ values on removal rather than insertion, and (2) two heaps are used while in theory we only need as much total memory as the largest one. #1 could be solved using an ordered list, though noting that anything using pointers per element is probably a non-starter just from memory use, but buckets might work. A good way to parallize the phi calculation would be nice also.

With Math::Prime::Util’s non-parallel segmented siever, the phi(x,a) calculation takes about 15% of the total time, which is a heck of a lot less than it was when I started. Using primesieve’s parallel and optimized sieving, it comes out nearer to 50% (in other words, primesieve in parallel makes the final stage (the summations) run ~10x faster than MPU in serial). There are different tradeoffs that could be made here — more memory spent on “small” primes would mean less time doing segment sieves. Parallelizing the loops would help MPU. Running the phi(x,a) calculation in parallel with the last stage is possible and could give a nice spsedup, but it would nearly double memory use, which is definitely a bad thing.

Table of raw times in seconds:

            Sieve      Legendre     Meissel      Lehmer
10^5        0.00141      0.00174      0.00285      0.00388
10^6        0.00161      0.00344      0.00306      0.0012
10^7        0.00378      0.01198      0.00358      0.00156
10^8        0.00775      0.04193      0.00555      0.00329
10^9        0.03724      0.32352      0.01652      0.01853
10^10       0.34622      2.72898      0.07415      0.03533
10^11       4.15675     22.83046      0.34653      0.12613
10^12      50.52079    195.34312      1.82301      0.57864
10^13     610.47052   1707.51764      9.36845      2.82593
10^14                                49.90988     13.97931
10^15                               289.39294     70.68814
10^16                              1733.79125    354.63109
10^17                                           1767.91606


Again, ‘Sieve’ is primesieve in parallel on 12 cores. Time growth is about a factor of 12x for each 10x larger range. For Legendre (completely serial — all time is spent in phi), growth is 8-9x. Meissel is 5-6x, Lehmer is 5x. When I did a fit on an earlier version with Lehmer using MPU last week it came out to c * x ^ 0.72 which is pretty close to where it should be (10^0.7 = 5.01 so it looks like Lehmer with primesieve is right on the money). One interesting point is how in your Scheme timings, Legendre was not much more than an order of magnitude slower than Meissel. While the absolute results for my Meissel vs. Lehmer are faster, the ratios aren’t that far off. Legendre is far, far off however. My guess is that Legendre is stuck doing a giant phi calculation in both cases, while in your Scheme code you have each prime count in the summations do recursive calls (assuming they’re larger than max-pi). There are two optimizations I made there: (1) if we generate just a few more primes than we needed for the prime[i] values in the summation loops, we can use a binary search to cut out a lot of the smaller calls, and (2) arrange the summation loop so it decrements i means the w values end up in increasing order, so we can use a segment sieve to just count the values between the previous call and the current one.

I haven’t tried, but I suspect memory constraints in the phi calculation are going to limit my program to not much more than 10^18, unless I buy more memory. I’m a little intrigued by how well running this offline using GMP and distributing the computations would work, but I suspect it still would be just too slow to be of any interest other than personal.