## Montgomery Multiplication

### July 29, 2014

We will work with 64-bit unsigned integers in C rather than unlimited-precision integers in Scheme, because the use of unlimited-precision integers makes the exercise trivial. We are following the text and code of Henry Warren’s description of Montgomery multiplication. We assume that long long integers are 64 bits, and select *r* = 2^{64}, which isn’t representable in a 64-bit integer, though that doesn’t matter:

`typedef unsigned long long ull;`

typedef signed long long sll;

We begin with the gcd calculation. We studied the binary gcd algorithm in a previous exercise, but we can simplify and speed that algorithm because we know that *r* is a power of 2 and *m* is odd, so there are no powers of 2 to eliminate. Note that *r* is half what it “should” be, so this function computes `rinv`

and `mprime`

such that (2*r*) * `rinv`

− *m* * `mprime`

= 1 rather than the expected *r* * `rinv`

+ *m* * `mprime`

= 1. Here’s the code:

`void gcd(ull r, ull m, ull *r_inv, ull *m_prime) {`

ull r_saved, m_saved, u, v;

r_saved = r; m_saved = m; u = 1; v = 0;

while (r > 0) {

r = r >> 1;

if ((u & 1) == 0) {

u = u >> 1; v = v >> 1; }

else {

u = ((u ^ m_saved) >> 1) + (u & m_saved);

v = (v >> 1) + r_saved; } }

*r_inv = u; *m_prime = v;

return; }

Montgomery multiplication needs the 128-bit product of two 64-bit numbers. We keep the high bits (most significant bits) and the low bits (least significant bits) of the product separate, using the grade-school multiplication algorithm on 32-bit “digits” in an algorithm given at Knuth 4.3.1 M:

`void mul_ull(ull x, ull y, ull *xy_hi, ull *xy_lo) {`

ull x0, x1, y0, y1, t, p0, p1, p2;

x1 = x >> 32; x0 = x & 0xFFFFFFFF;

y1 = y >> 32; y0 = y & 0xFFFFFFFF;

```
``` t = x0 * y0;

p0 = t & 0xFFFFFFFF;

t = x1 * y0 + (t >> 32);

p1 = t & 0xFFFFFFFF;

p2 = t >> 32;

t = x0 * y1 + p1;

*xy_hi = x1 * y1 + p2 + (t >> 32);

` return; }`

The modulus operator divides (*x* || *y*) by the modulus *m*, returning the remainder. This function assumes *x* < *m*, which is enforced by the `mulmod_ull`

function given below. When the loop finished, *y* is the quotient and *x* is the remainder:

`ull mod_ull(ull x, ull y, ull m) {`

sll i, t;

for (i = 1; i <= 64; i++) {

t = (sll) x >> 63;

x = (x << 1) | (y >> 63);

y = y << 1;

if ((x | t) >= m) {

x = x - m; y = y + 1; } }

return x; }

We are ready now for the Montgomery step, which takes ā, b̄, *m* and *m*′ and returns *a b r* mod *m*. The code follows the three-step process described above:

`ull mont_mul(ull a_bar, ull b_bar, ull m, ull m_prime) {`

ull t_hi, t_lo, tm, tmm_hi, tmm_lo, u_hi, u_lo, ov;

mul_ull(a_bar, b_bar, &t_hi, &t_lo);

tm = t_lo * m_prime;

mul_ull(tm, m, &tmm_hi, &tmm_lo);

u_lo = t_lo + tmm_lo; u_hi = t_hi + tmm_hi;

if (u_lo < t_lo) u_hi = u_hi + 1;

ov = (u_hi < t_hi) | ((u_hi == t_hi) & (u_lo < t_lo));

u_lo = u_hi; u_hi = 0;

u_lo = u_lo - (m & -(ov | (u_lo >= m)));

return u_lo; }

Finally we are ready to specify the modular multiplication, following the four-step algorithm given above; we return 1 if the operation succeeded or 0 if its arguments are out of bounds, and return the product in the fourth argument:

`int mulmod_ull(ull a, ull b, ull m, ull *p) {`

ull half_r = 0x8000000000000000LL;

ull r_inv, m_prime, a_bar, b_bar, p_hi, p_lo;

```
``` if ((m & 1) == 0) { return 0; } /* must be odd */

if (m <= a) { return 0; } /* must be less than m */

if (m <= b) { return 0; } /* must be less than m */

` gcd(half_r, m, &r_inv, &m_prime);`

a_bar = mod_ull(a, 0, m); b_bar = mod_ull(b, 0, m);

*p = mont_mul(a_bar, b_bar, m, m_prime);

mul_ull(*p, r_inv, &p_hi, &p_lo);

*p = mod_ull(p_hi, p_lo, m);

return 1; }

Now the payoff for all our hard work. Given modular multiplication, it is easy to perform modular exponentiation using the square-and-multiply algorithm. The expensive calculations of r_inv and m_prime and the translations to and from Montgomery space are performed only once, and only Montgomery steps are taken inside the loop:

`int powmod_ull(ull x, ull y, ull m, ull *x_to_y) {`

ull half_r = 0x8000000000000000LL;

ull r_inv, m_prime, x_bar, hi, lo;

```
``` if ((m & 1) == 0) { return 0; } /* must be odd */

if (m <= x) { return 0; } /* must be less than m */

` gcd(half_r, m, &r_inv, &m_prime);`

x_bar = mod_ull(x, 0, m);

*x_to_y = mod_ull(1, 0, m);

while (y > 0) {

if ((y & 1) == 1) {

*x_to_y = mont_mul(*x_to_y, x_bar, m, m_prime);

y = y - 1; }

else {

x_bar = mont_mul(x_bar, x_bar, m, m_prime);

y = y / 2; } }

mul_ull(*x_to_y, r_inv, &hi, &lo);

*x_to_y = mod_ull(hi, lo, m);

return 1; }

Here is an example:

`#include <stdio.h/gt;`

```
```int main(void) {

ull z; int t;

ull x = 34721908534901LL;

ull y = 72193687003295LL;

ull m = 9412345678901731LL;

t = mulmod_ull(x, y, m, &z);

if (t) { printf("%lld\n", z); }

else { printf("ERROR"); }

` t = powmod_ull(x, y, m, &z);`

if (t) { printf("%lld\n", z); }

else { printf("ERROR"); }

return 0; }

If the program is stored in file `monty.c`

, compiling and running the program looks like this; both results shown below are confirmed by Wolfram|Alpha:

`> cc -o monty monty.c`

> chmod +x monty

> ./monty

3751384291706939

7001634529421238

You can run the program at http://programmingpraxis.codepad.org/A5ePEDGe.

Re bar: One theory seems to be to put a code called a combining

diacritic in Unicode after the character. A bar above would be

̄. The following characters looked somewhat usable

locally in w3m and firefox: ā, b̄, c̄,

d̄, ē.

(The above text is copied and pasted from my local test document source code. In the comment editor, at the moment of this writing, I see the hex codes as HTML character entities or whatever they are called.)

Hm. The code entity above should have been ampersand hash x 0304 semicolon. The software interpreted it as the actual code. The sample characters look to me as they should.