## Modular Multiplication Without Overflow

### May 28, 2013

Let’s begin with an example, calculating 56 * 37 modulo 100 using 8-bit arithmetic, so no intermediate total may be 256 or greater. We start by representing a = 56 = 3 * 16 + 8 and b = 37 = 2 * 16 + 5, so:

`a1 = 8`

a2 = 3

b1 = 5

b2 = 2

Then the four intermediate products with their shifts are:

`p11 = 8 * 5 = 40`

p12 = 8 * 2 = 16 > 32 > 64 > 128 (28) > 56

p21 = 3 * 5 = 15 > 30 > 60 > 120 (20) > 40

p22 = 3 * 2 = 6 > 12 > 24 > 48 > 96 > 192 (92) > 184 (84) > 168 (68) > 136 (36)

We’re using binary arithmetic, so each number doubles as it is shifted, taking it modulo 100 as we go. The product of two low-half numbers is not shifted, the product of a low-half and high-half number is shifted 4 times (since log_{2} 16 = 4), and the product of two high-half numbers is shifted 8 times. Then the intermediate products are summed, again removing *m* each time an intermediate sum exceeds *m*:

`s = 40 + 56 = 96`

s = 96 + 40 = 136 (36)

s = 36 + 36 = 72

And that’s the final answer: 56 * 37 = 2072, which is 72 (mod 100).

Here’s the code:

`(define (mod-mul a b m) ; assumes a,b < 2^64 and m < 2^63`

(define (shift x k)

(let loop ((k k) (x x))

(if (zero? k) x

(let* ((x (* x 2))

(x (if (< x m) x (- x m))))

(loop (- k 1) x)))))

(let* ((two32 (expt 2 32))

(a1 (remainder a two32))

(a2 (quotient a two32))

(b1 (remainder b two32))

(b2 (quotient b two32))

(p11 (modulo (* a1 b1) m))

(p12 (shift (modulo (* a1 b2) m) 32))

(p21 (shift (modulo (* a2 b1) m) 32))

(p22 (shift (modulo (* a2 b2) m) 64))

(s (+ p11 p12))

(s (if (< s m) s (- s m)))

(s (+ s p21))

(s (if (< s m) s (- s m)))

(s (+ s p22))

(s (if (< s m) s (- s m))))

s))

And here’s a somewhat larger example:

`> (define a 17259738289493410580109721600)`

> (define b 10327523882682224844906430464)

> (define m 8631375519702822467321987072)

> (mod-mul a b m)

7869126927168251407166865408

You can run the program at http://programmingpraxis.codepad.org/psbpa4MC.

This seems to work for arbitrary 64-bit values, and for me it runs a little faster than splitting into two 32-bit parts. If you have a recent gcc, you can cheat with:

On x86_64 there is also 4 lines of asm that will do it quickly. There are faster methods if one has lots of operations with the same modulus.

I see that Mathew’s question was about addition. With unsigned types in C I use:

to return (a + b) mod n. This requires that a and b are already modulo n, hence n > a,b (if you don’t know, just mod them. but often you know they’re in range). If n-a > b, then there won’t be overflow so we just add; otherwise we do a+b-n. Since this is an unsigned type, it wraps ok around the max integer size, and because a and b were modulo n when we started, one subtract will be enough.

On my previous post, the “UV” in the mulmod macro should be “unsigned long”. The comment about removing the mods is indicating performance — they don’t hurt anything other than wasting time if you already know the values are mod m.

FORTH is tricky since the support for 64 bit math in general is very limited. This is specific to x86 with 80 bit IEEE floating point math. OTOH, 32 bit modular multiplication can be done easily without overflow due to words that generate 64 bit intermediate result.