Sunday, 7 June 2009

MPIR - version 1.2

Finally, version 1.2 of MPIR (Multiple Precision Integers and Rationals), of which I am a lead developer, has been released: http://www.mpir.org/

MPIR is an open source project based on the GNU Multiple Precision (GMP, see http://www.gmplib.org/) library, but still licensed under version 2 of the LGPL.

About a month ago GMP version 4.3.0 was released, which had been in preparation for a LONG time. We expected some nice features, found some, and have subsequently been catching up with them.

In particular we needed:

* Faster assembly code for multiplication basecase 

* Faster unbalanced integer multiplication (where you are multiplying integers of different sizes)

* Improvements to the speed of multiplying medium sized integers (50-2000 words, where a word is 64 bits on a 64 bit machine)

* Asymptotically fast extended GCD

* Faster integer multiplication for large integers (> 2000 limbs)

* Faster integer squaring

* Other assembly improvements

Multiplication Basecase

Jason Moxham, a brilliant MPIR developer from the UK, decided to take on the mul_basecase challenge. He has been writing an assembly optimiser for quite a few months. It takes hand written assembly code and reorganises the instructions over and over, within permitted bounds, to try to find an optimal sequence.

The results over the past year are pretty impressive to see:

Multiplications per second

             128 x 128 bits   512 x 512 bits   8192 x 8192 bits
MPIR 1.2.0         53794646         12488043             117404
GMP 4.3.0          52766506         10879150             114927
MPIR 1.1.2         51802252         11802334             111772
MPIR 1.0.0         35856598         10928085             111641
MPIR 0.9.0         37299412          8122452              86301
GMP 4.2.1          25896136          6383542              60819

Note that the number of multiplications that can be done per second has roughly doubled relative to GMP 4.2.1, and all this just from assembly language improvements.

The timings for this table were made on an Opteron (AMD K8) server, running at 2.8 GHz. 

Toom Multiplication

Toom multiplication can be described in terms of a polynomial multiplication problem. First the large integers to be multiplied are split into pieces, which form the coefficients of polynomials. The original integer multiplication then becomes a polynomial multiplication, followed by a reconstruction phase at the end, in which the coefficients of the product polynomial are stitched back together to form the product integer.

This can be thought of in terms of writing the original integers in some large base 2^M, where M is usually a multiple of the machine word size (e.g. a multiple of 64 on a 64 bit machine).
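To make the digit decomposition concrete, here is a tiny Python sketch (the function name split is mine, purely for illustration):

def split(A, M, k):
    # Split A into k "digits" in base 2^M, least significant first,
    # so that A = sum of digits[i] * 2^(M*i).
    mask = (1 << M) - 1
    return [(A >> (M * i)) & mask for i in range(k)]

# Recombining the digits recovers A.
A = 7**100
assert sum(d << (64 * i) for i, d in enumerate(split(A, 64, 5))) == A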

For example, Toom-3 breaks the two integers into 3 pieces each, corresponding to the digits in base 2^M:

A = a0 + a1 * 2^M + a2 * 2^2M
B = b0 + b1 * 2^M + b2 * 2^2M

So we construct two polynomials:

f(x) = a0 + a1 * x + a2 * x^2
g(x) = b0 + b1 * x + b2 * x^2

Then A * B = f(2^M) * g(2^M). So we first compute h(x) = f(x) * g(x), and then A * B = h(2^M).

To multiply the polynomials f(x) and g(x), we note that their product has degree 4, and so we can determine it fully if we know its value at 5 distinct points. We choose, for convenience, the points 0, 1, -1, 2, infinity. Finally we note that h(0) = f(0) * g(0), h(1) = f(1) * g(1), etc.

Thus to compute the product we only need to find the values of f(x) and g(x) at the chosen points, do five small multiplications to get the value of h(x) at those points, then interpolate to get the coefficients of h(x).

So we have replaced the original large multiplication with five small ones. Note that each of the small multiplications involves integers at most one third of the size. If we just used schoolboy multiplication to multiply the "digits" of our large integers, we'd do 3 x 3 = 9 small "digit multiplications".  Instead, through the magic of evaluation/interpolation we only have 5 small "digit multiplications" to do (and a few additions and subtractions, etc., for the evaluation and interpolation phases).
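To make the whole evaluate-multiply-interpolate cycle concrete, here is a minimal, unoptimised Toom-3 sketch in Python (the code and names are mine, purely for illustration; MPIR's real implementation works in-place on arrays of limbs in C and assembly, and recurses in the five small multiplications):

def toom3(A, B, M):
    # Split A and B into 3 "digits" each in base 2^M.
    mask = (1 << M) - 1
    a0, a1, a2 = A & mask, (A >> M) & mask, A >> (2 * M)
    b0, b1, b2 = B & mask, (B >> M) & mask, B >> (2 * M)

    # Evaluate h = f * g at the points 0, 1, -1, 2 and infinity:
    # five multiplications of roughly third-size integers.
    h0   = a0 * b0
    h1   = (a0 + a1 + a2) * (b0 + b1 + b2)
    hm1  = (a0 - a1 + a2) * (b0 - b1 + b2)
    h2   = (a0 + 2*a1 + 4*a2) * (b0 + 2*b1 + 4*b2)
    hinf = a2 * b2

    # Interpolate h(x) = c0 + c1*x + c2*x^2 + c3*x^3 + c4*x^4.
    c0, c4 = h0, hinf
    c2 = (h1 + hm1) // 2 - c0 - c4        # exact division
    s  = (h1 - hm1) // 2                  # s = c1 + c3
    t  = (h2 - c0 - 4*c2 - 16*c4) // 2    # t = c1 + 4*c3
    c3 = (t - s) // 3                     # exact division
    c1 = s - c3

    # Reconstruct: A * B = h(2^M).
    return c0 + (c1 << M) + (c2 << (2*M)) + (c3 << (3*M)) + (c4 << (4*M))

import random
M = 64
A, B = random.getrandbits(3 * M), random.getrandbits(3 * M)
assert toom3(A, B, M) == A * B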

The basecase multiplication code already handles unbalanced multiplication, so I decided to focus on unbalanced variants of Toom multiplication algorithms.

In MPIR we use Toom-2 (also known as Karatsuba multiplication - though we use a variant called Knuth multiplication where evaluation happens at 0, -1 and infinity), Toom-3, Toom-4 and Toom-7. 

I spent a fair bit of time optimising Toom-3. As we have a lot of new assembly instructions available in MPIR, I was able to reduce the interpolation sequence from about 11 steps to 8 in MPIR 1.2. I also discovered that it was possible to use the output integer for temporary space. If one is careful, one can even arrange for the temporaries stored in the output space not to have to be moved at the end, i.e. they end up in precisely the right place as part of the output integer.

I made similar optimisations for Toom-4 and Toom-7, and I also switched the interpolation phase of the algorithms over to two's complement.

Here are the results of this work in the Toom range:

Toom Multiplication

             Kara (3200 bits)   Toom3 (7680 bits)   Toom4 (25600 bits)   Toom7 (131072 bits)
MPIR 1.2.0             300414               74337                11428                  1153
GMP 4.3.1              274738               68498                11263                  1042
MPIR 1.1.2             260501               60039                 9890                   993
MPIR 1.0.0             261599               62275                 9900                   826
GMP 4.2.1              163273               33460                 4980                   408

Timings are on a 2.4GHz Core 2 (eno).

The differences in timing in the Karatsuba region for MPIR give an indication of how much of the speedup is due to better assembly code. Anything beyond that indicates a speedup in the Toom algorithms themselves.

Note that GMP 4.2.1 had only Karatsuba and Toom-3, and GMP 4.3.1 does not have Toom-7.

Unbalanced Multiplication

Now unbalanced multiplication proceeds in the same way, except that we have integers, and thus polynomials, of different lengths:

f(x) = a0 + a1 * x + a2 * x^2 + a3 * x^3
g(x) = b0 + b1 * x

Again the product has degree four, and so we need its value at five points, which we can choose to be precisely the same points that we used for Toom-3, even reusing the interpolation code if we wish.

We call this Toom-42, denoting that we split our integers into 4 and 2 "digits" respectively, in our base 2^M.

The result is 5 small multiplications instead of 4 x 2 = 8; still a substantial saving.
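Here is the same kind of illustrative Python sketch for Toom-42 (again my own code, not MPIR's); note that the interpolation is literally the one from the toom3 sketch above:

def toom42(A, B, M):
    # Split A into 4 digits and B into 2 digits in base 2^M.
    mask = (1 << M) - 1
    a0, a1 = A & mask, (A >> M) & mask
    a2, a3 = (A >> (2*M)) & mask, A >> (3*M)
    b0, b1 = B & mask, B >> M

    # Evaluate the cubic f and the linear g at the same five points.
    h0   = a0 * b0
    h1   = (a0 + a1 + a2 + a3) * (b0 + b1)
    hm1  = (a0 - a1 + a2 - a3) * (b0 - b1)
    h2   = (a0 + 2*a1 + 4*a2 + 8*a3) * (b0 + 2*b1)
    hinf = a3 * b1

    # The product again has degree 4, so the interpolation is identical
    # to Toom-3 and the code can be shared.
    c0, c4 = h0, hinf
    c2 = (h1 + hm1) // 2 - c0 - c4
    s  = (h1 - hm1) // 2
    t  = (h2 - c0 - 4*c2 - 16*c4) // 2
    c3 = (t - s) // 3
    c1 = s - c3
    return c0 + (c1 << M) + (c2 << (2*M)) + (c3 << (3*M)) + (c4 << (4*M))

import random
M = 64
A, B = random.getrandbits(4 * M), random.getrandbits(2 * M)
assert toom42(A, B, M) == A * B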

Finally I implemented some of the unbalanced variants. In particular we now have Toom-42, an unbalanced version of Toom-33 (where the top coefficients are not necessarily exactly the same size) and Toom-32.

The results for unbalanced multiplications in the Toom range are now quite good:

Unbalanced Multiplication

             15000 x 10000 bits   20000 x 10000 bits   30000 x 10000 bits
MPIR 1.2.0                32975                25239                14995
GMP 4.3.1                 31201                24099                14104
MPIR 1.1.2                23190                19970                13289
GMP 4.2.1                 13235                10855                 7235

Timings are on the 2.4GHz Core 2 (eno).

Toom Squaring

Squaring using the Toom algorithms is much the same, except that there is only one polynomial to evaluate, not two. We also save because the pointwise multiplications are now squarings, and we can recurse right down to a basecase squaring.
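A sketch of the squaring variant (again my own Python, following the toom3 sketch above) makes the saving visible: one evaluation pass instead of two, and the five pointwise products become squarings:

def toom3_sqr(A, M):
    # Split and evaluate only once, since f = g.
    mask = (1 << M) - 1
    a0, a1, a2 = A & mask, (A >> M) & mask, A >> (2 * M)
    f = (a0, a0 + a1 + a2, a0 - a1 + a2, a0 + 2*a1 + 4*a2, a2)

    # The pointwise multiplications are squarings; a real implementation
    # recurses into squaring here, not general multiplication.
    h0, h1, hm1, h2, hinf = (v * v for v in f)

    # Interpolation and reconstruction exactly as in toom3.
    c0, c4 = h0, hinf
    c2 = (h1 + hm1) // 2 - c0 - c4
    s  = (h1 - hm1) // 2
    t  = (h2 - c0 - 4*c2 - 16*c4) // 2
    c3 = (t - s) // 3
    c1 = s - c3
    return c0 + (c1 << M) + (c2 << (2*M)) + (c3 << (3*M)) + (c4 << (4*M))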

All of the same optimisations can be applied for Toom squaring as for ordinary Toom multiplication. Here are the comparisons:

Toom Squaring

             Kara (5120 bits)   Toom3 (12800 bits)   Toom4 (51200 bits)   Toom7 (131072 bits)
MPIR 1.2.0             358656               82274                10514                  2762
GMP 4.3.0              357031               83676                10430                  2564
MPIR 1.1.2             349686               78510                 9917                  2347
GMP 4.2.1              185201               42256                 5112                  1237

The performance is now comparable to GMP until the Toom7 region, where we pull ahead considerably. These timings were done on the 2.8GHz Opteron (K8) server.

FFT Multiplication

For multiplication of large integers, MPIR uses a Fast Fourier Transform (FFT) method. Instead of working over the complex numbers, one can work over a ring Z/pZ where p is a special number of the form p = 2^M + 1 with M some power of 2 (p need not actually be prime). This trick is due to Schönhage and Strassen.
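The reason p of this form is attractive: modulo p = 2^M + 1, the number 2 is a 2M-th root of unity, so the root-of-unity multiplications inside the FFT become cheap bit shifts rather than full multiplications. A quick Python sanity check (M = 16 here is just for illustration):

M = 16                        # a power of 2
p = 2**M + 1
assert pow(2, M, p) == p - 1  # 2^M = -1 (mod p) ...
assert pow(2, 2*M, p) == 1    # ... so 2 has order exactly 2M modulo p

Multiplying by a power of 2 modulo 2^M + 1 costs only a shift and a subtraction, which is what makes transforms over this ring so fast.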

For MPIR 1.2 we decided to switch over to the new FFT of Pierrick Gaudry, Alexander Kruppa and Paul Zimmermann. An implementation of the ideas they presented at ISSAC 2007 was available to download from Paul's and Alex's websites.

Most of the work in merging this FFT was fixing bugs, making the code work efficiently on Windows, and writing and running tuning code for all the platforms we support.

I wrote the tuning code, Brian Gladman found primitives to use on Windows to replace the inline assembler available on Linux, and Jason Moxham, Brian Gladman, Jeff Gilchrist and I tuned the FFT on a range of systems.

Here are some examples of the speedup:

FFT Multiplication

             2000000 bits   6000000 bits   20000000 bits   64000000 bits
MPIR 1.2.0           74.9           12.8            5.20            1.47
GMP 4.3.0            52.4           13.4            3.66            0.813
MPIR 1.1.2           47.2           12.5            3.09            0.742
GMP 4.2.1            32.8           8.66            2.11            0.528

Timings are again on the 2.8GHz Opteron server.

Extended GCD

A longstanding issue with MPIR has been the lack of a fast extended GCD. We had merged Niels Möller's asymptotically fast GCD code, but unfortunately no suitable extended GCD implementation was available to us.

I began by reading Niels' paper to understand the half-GCD algorithm (ngcd) that he invented. Essentially half-GCD algorithms get their asymptotic improvement over the ordinary Euclidean algorithm as follows:

Begin by noting that the first few steps depend only on the uppermost bits of the original integers. Thus instead of working to full precision, one can split off the topmost bits and compute the first few steps of the Euclidean GCD on those bits, keeping track of the steps taken in matrix form. This gives an exact representation of the steps taken so far in the algorithm.

Next one applies the matrix to the original integers. The resulting integers are smaller than the originals, and one finishes off the algorithm by applying the usual GCD steps to these smaller integers.

Of course, to get an asymptotic improvement one needs to take care in how the matrix is applied, and to recurse the whole algorithm down to some fast basecase.

Once I understood that Niels' implementation explicitly computed the matrix at each step, it occurred to me that to turn it into an extended GCD algorithm, all I had to do was apply the matrix to the cofactors as well, and keep track of any other changes throughout the algorithm that would affect the cofactors.
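To illustrate the relationship between the matrix and the cofactors, here is a toy full-precision version in Python (my own sketch, not MPIR's code; the half-GCD algorithm computes the same matrix from just the top bits, and recurses):

def gcd_matrix(A, B):
    # Run Euclid on (A, B), accumulating the 2x2 step matrices.
    # Invariant: (A, B) = M * (a, b), where M = [[m00, m01], [m10, m11]].
    m00, m01, m10, m11 = 1, 0, 0, 1
    a, b = A, B
    while b:
        q = a // b
        a, b = b, a - q * b
        # Each quotient q contributes the step matrix [[q, 1], [1, 0]].
        m00, m01 = m00 * q + m01, m00
        m10, m11 = m10 * q + m11, m10
    return a, (m00, m01, m10, m11)

def xgcd(A, B):
    # The cofactors fall out of the matrix: det(M) is +1 or -1, so
    # inverting M expresses (g, 0), and hence g, in terms of (A, B).
    g, (m00, m01, m10, m11) = gcd_matrix(A, B)
    sign = m00 * m11 - m01 * m10
    u, v = sign * m11, -sign * m01
    assert u * A + v * B == g
    return g, u, v

assert xgcd(123456789, 987654321)[0] == 9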

As an example of the speedup, for 1048576 bit integers extended GCD benches as follows (as in the other tables, larger is better):

MPIR 1.1.2:   0.453
MPIR 1.2.0:   4.06

That is roughly a 9x speedup.

Again the benchmarks are for the 2.8GHz Opteron.

Assembly Improvements

Finally, this mammoth MPIR release also contains numerous sped-up assembly routines, due to Jason Moxham, for AMD Opterons (which usually also translate into improvements on other 64 bit x86 platforms).

Below I include timings (smaller is better) at each point where improvements were made to the assembly routines in MPIR; a dash means no new improvement in that release.

Assembly Improvements

                   MPIR 0.9   MPIR 1.0   MPIR 1.1.2   MPIR 1.2
mpn_add_n              1598       1524            -          -
mpn_sub_n              1598       1524            -          -
mpn_xor_n              3015       2271            -          -
mpn_and_n              3014          -         1772       1523
mpn_ior_n              3014          -         1771       1525
mpn_nand_n             3013          -         2035       1775
mpn_nior_n             3013       1773            -          -
mpn_andn_n             3013       2273            -          -
mpn_iorn_n             3015       2271            -          -
mpn_addadd_n              -       2525            -       2190
mpn_addsub_n              -       2526            -       2189
mpn_subadd_n              -          -         2526       2190
mpn_addmul_1           3094       2524            -          -
mpn_submul_1           3092       2524            -          -
mpn_mul_1              3024       2522            -          -
mpn_sublsh1_n             -       2527         2400       2190
mpn_com_n              3014       1271            -          -
mpn_divexact_by3      12016       2278            -          -
mpn_lshift             2531       1701            -          -
mpn_rshift             2532       1617            -          -
mpn_hamdist            8281       1791            -          -
mpn_popcount           7281       1527            -          -
MPN_COPY               3012          -         2017       1021
mpn_divrem_1          23649          -        15119          -
MPN_ZERO               2013          -            -        783

This time the timings were taken on a 2.6GHz AMD K10 box.

Numerous new assembly functions have also been added for 64 bit x86 machines since MPIR began, including: mpn_addmul_2, mpn_addadd_n, mpn_sublsh1_n, mpn_divexact_byff, mpn_rsh1add_n, mpn_lshift1, mpn_rshift1, mpn_rsh1sub_n, mpn_mul_2, mpn_lshift2, mpn_rshift2.

2 comments:

  1. Fantastic! I am amazed by your achievements,
    e.g., the toom3, toom42 improvements you came up with.
    It is a mere coincidence that the GMP repo had these
    a week before you invented them, surely. And then
    you invented how to do unbalanced toom. We're
    all very impressed. You even handle the case where
    the top coefficient is smaller! Wow!

    Truly lots of original research going on here!

  2. Actually unbalanced Toom methods were proposed by Marco Bodrato and Alberto Zanoni in 2007. See

    Integer and polynomial multiplication: towards optimal Toom-Cook matrices. ISSAC 2007

    and other papers from the same authors following it.
