Sunday, 7 June 2009

MPIR - version 1.2

Finally, version 1.2 of MPIR (Multiple Precision Integers and Rationals), of which I am a lead developer, has been released: http://www.mpir.org/

MPIR is an open source project based on the GNU Multiple Precision (GMP, see http://www.gmplib.org/) library, but still licensed under version 2 of the LGPL.

About a month ago GMP version 4.3.0 was released, which had been in preparation for a LONG time. We expected some nice features, found some, and have subsequently been catching up with them.

In particular we needed:

* Faster assembly code for multiplication basecase 

* Faster unbalanced integer multiplication (where you are multiplying integers of different sizes)

* Improvements to the speed of multiplying medium sized integers (50-2000 words, where a word is 64 bits on a 64 bit machine)

* Asymptotically fast extended GCD

* Faster integer multiplication for large integers (> 2000 limbs)

* Faster integer squaring

* Other assembly improvements

Multiplication Basecase

Jason Moxham, a brilliant MPIR developer from the UK, decided to take on the mul_basecase challenge. He has been writing an assembly optimiser for quite a few months. It takes hand written assembly code and reorganises the instructions over and over, within permitted bounds, to try to find an optimal sequence.

The results over the past year are pretty impressive to see:

Multiplications per second

             128 x 128 bits   512 x 512 bits   8192 x 8192 bits
MPIR 1.2.0         53794646         12488043             117404
GMP 4.3.0          52766506         10879150             114927
MPIR 1.1.2         51802252         11802334             111772
MPIR 1.0.0         35856598         10928085             111641
MPIR 0.9.0         37299412          8122452              86301
GMP 4.2.1          25896136          6383542              60819

Note that the number of multiplications that can be done per second has roughly doubled relative to GMP 4.2.1, and all this just from assembly language improvements.

The timings for this table were made on an Opteron (AMD K8) server, running at 2.8 GHz. 

Toom Multiplication

Toom multiplication can be described in terms of a polynomial multiplication problem. First the large integers to be multiplied are split into pieces, which form the coefficients of polynomials. The original integer multiplication then becomes a polynomial multiplication, followed by a reconstruction phase at the end, in which the coefficients of the product polynomial are stitched back together to form the product integer.

This can be thought of in terms of writing the original integers in some large base 2^M, where M is usually a multiple of the machine word size (e.g. a multiple of 64 on a 64 bit machine).
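To make the digit decomposition concrete, here is a tiny Python sketch (the function name split is mine, purely for illustration):

def split(A, M, k):
    # Split A into k "digits" in base 2^M, least significant first,
    # so that A = sum of digits[i] * 2^(M*i).
    mask = (1 << M) - 1
    return [(A >> (M * i)) & mask for i in range(k)]

# Recombining the digits recovers A.
A = 7**100
assert sum(d << (64 * i) for i, d in enumerate(split(A, 64, 5))) == A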

For example, Toom-3 breaks the two integers into 3 pieces each, corresponding to the digits in base 2^M:

A = a0 + a1 * 2^M + a2 * 2^2M
B = b0 + b1 * 2^M + b2 * 2^2M

So we construct two polynomials:

f(x) = a0 + a1 * x + a2 * x^2
g(x) = b0 + b1 * x + b2 * x^2

Then A * B = f(2^M) * g(2^M). So we first compute h(x) = f(x) * g(x), and then A * B = h(2^M).

To multiply the polynomials f(x) and g(x), we note that their product has degree 4, and so we can determine it fully if we know its value at 5 distinct points. We choose, for convenience, the points 0, 1, -1, 2, infinity. Finally we note that h(0) = f(0) * g(0), h(1) = f(1) * g(1), etc.

Thus to compute the product we only need to find the values of f(x) and g(x) at the chosen points, do five small multiplications to get the value of h(x) at those points, then interpolate to get the coefficients of h(x).

So we have replaced the original large multiplication with five small ones. Note that each of the small multiplications involves integers at most one third of the size. If we just used schoolboy multiplication to multiply the "digits" of our large integers, we'd do 3 x 3 = 9 small "digit multiplications".  Instead, through the magic of evaluation/interpolation we only have 5 small "digit multiplications" to do (and a few additions and subtractions, etc., for the evaluation and interpolation phases).
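To make the whole evaluate-multiply-interpolate cycle concrete, here is a minimal, unoptimised Toom-3 sketch in Python (the code and names are mine, purely for illustration; MPIR's real implementation works in-place on arrays of limbs in C and assembly, and recurses in the five small multiplications):

def toom3(A, B, M):
    # Split A and B into 3 "digits" each in base 2^M.
    mask = (1 << M) - 1
    a0, a1, a2 = A & mask, (A >> M) & mask, A >> (2 * M)
    b0, b1, b2 = B & mask, (B >> M) & mask, B >> (2 * M)

    # Evaluate h = f * g at the points 0, 1, -1, 2 and infinity:
    # five multiplications of roughly third-size integers.
    h0   = a0 * b0
    h1   = (a0 + a1 + a2) * (b0 + b1 + b2)
    hm1  = (a0 - a1 + a2) * (b0 - b1 + b2)
    h2   = (a0 + 2*a1 + 4*a2) * (b0 + 2*b1 + 4*b2)
    hinf = a2 * b2

    # Interpolate h(x) = c0 + c1*x + c2*x^2 + c3*x^3 + c4*x^4.
    c0, c4 = h0, hinf
    c2 = (h1 + hm1) // 2 - c0 - c4        # exact division
    s  = (h1 - hm1) // 2                  # s = c1 + c3
    t  = (h2 - c0 - 4*c2 - 16*c4) // 2    # t = c1 + 4*c3
    c3 = (t - s) // 3                     # exact division
    c1 = s - c3

    # Reconstruct: A * B = h(2^M).
    return c0 + (c1 << M) + (c2 << (2*M)) + (c3 << (3*M)) + (c4 << (4*M))

import random
M = 64
A, B = random.getrandbits(3 * M), random.getrandbits(3 * M)
assert toom3(A, B, M) == A * B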

The basecase multiplication code already handles unbalanced multiplication, so I decided to focus on unbalanced variants of Toom multiplication algorithms.

In MPIR we use Toom-2 (also known as Karatsuba multiplication - though we use a variant called Knuth multiplication where evaluation happens at 0, -1 and infinity), Toom-3, Toom-4 and Toom-7. 

I spent a fair bit of time optimising Toom-3. As we have a lot of new assembly instructions available in MPIR, I was able to reduce the interpolation sequence from about 11 steps to 8 in MPIR 1.2. I also discovered that it was possible to use the output integer for temporary space. If one is careful, one can even arrange for the temporaries stored in the output space not to have to be moved at the end, i.e. they end up in precisely the right place as part of the output integer.

I made similar optimisations for Toom-4 and Toom-7, and I also switched the interpolation phase of the algorithms over to two's complement.

Here are the results of this work in the Toom range:

Toom Multiplication

             Kara (3200 bits)   Toom3 (7680 bits)   Toom4 (25600 bits)   Toom7 (131072 bits)
MPIR 1.2.0             300414               74337                11428                  1153
GMP 4.3.1              274738               68498                11263                  1042
MPIR 1.1.2             260501               60039                 9890                   993
MPIR 1.0.0             261599               62275                 9900                   826
GMP 4.2.1              163273               33460                 4980                   408

Timings are on a 2.4GHz Core 2 (eno).

The differences in timing in the Karatsuba region for MPIR give an indication of how much of the speedup is due to better assembly code. Anything beyond that indicates a speedup in the Toom algorithms themselves.

Note that GMP 4.2.1 had only Karatsuba and Toom-3, and GMP 4.3.1 does not have Toom-7.

Unbalanced Multiplication

Now unbalanced multiplication proceeds in the same way, except that we have integers, and thus polynomials, of different lengths:

f(x) = a0 + a1 * x + a2 * x^2 + a3 * x^3
g(x) = b0 + b1 * x

Again the product has degree four, and so we need its value at five points, which we can choose to be precisely the same points that we used for Toom-3, even reusing the interpolation code if we wish.

We call this Toom-42, denoting that we split our integers into 4 and 2 "digits" respectively, in our base 2^M.

The result is 5 small multiplications instead of 4 x 2 = 8; still a substantial saving.
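Here is the same kind of illustrative Python sketch for Toom-42 (again my own code, not MPIR's); note that the interpolation is literally the one from the toom3 sketch above:

def toom42(A, B, M):
    # Split A into 4 digits and B into 2 digits in base 2^M.
    mask = (1 << M) - 1
    a0, a1 = A & mask, (A >> M) & mask
    a2, a3 = (A >> (2*M)) & mask, A >> (3*M)
    b0, b1 = B & mask, B >> M

    # Evaluate the cubic f and the linear g at the same five points.
    h0   = a0 * b0
    h1   = (a0 + a1 + a2 + a3) * (b0 + b1)
    hm1  = (a0 - a1 + a2 - a3) * (b0 - b1)
    h2   = (a0 + 2*a1 + 4*a2 + 8*a3) * (b0 + 2*b1)
    hinf = a3 * b1

    # The product again has degree 4, so the interpolation is identical
    # to Toom-3 and the code can be shared.
    c0, c4 = h0, hinf
    c2 = (h1 + hm1) // 2 - c0 - c4
    s  = (h1 - hm1) // 2
    t  = (h2 - c0 - 4*c2 - 16*c4) // 2
    c3 = (t - s) // 3
    c1 = s - c3
    return c0 + (c1 << M) + (c2 << (2*M)) + (c3 << (3*M)) + (c4 << (4*M))

import random
M = 64
A, B = random.getrandbits(4 * M), random.getrandbits(2 * M)
assert toom42(A, B, M) == A * B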

Finally I implemented some of the unbalanced variants. In particular we now have Toom-42, an unbalanced version of Toom-33 (where the top coefficients are not necessarily exactly the same size) and Toom-32.

The results for unbalanced multiplications in the Toom range are now quite good:

Unbalanced Multiplication

             15000 x 10000 bits   20000 x 10000 bits   30000 x 10000 bits
MPIR 1.2.0                32975                25239                14995
GMP 4.3.1                 31201                24099                14104
MPIR 1.1.2                23190                19970                13289
GMP 4.2.1                 13235                10855                 7235

Timings are on the 2.4GHz Core 2 (eno).

Toom Squaring

Squaring using the Toom algorithms is much the same, except that there is only one polynomial to evaluate, not two. We also save because the pointwise multiplications are now squarings, and we can recurse right down to a basecase squaring.
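A sketch of the squaring variant (again my own Python, following the toom3 sketch above) makes the saving visible: one evaluation pass instead of two, and the five pointwise products become squarings:

def toom3_sqr(A, M):
    # Split and evaluate only once, since f = g.
    mask = (1 << M) - 1
    a0, a1, a2 = A & mask, (A >> M) & mask, A >> (2 * M)
    f = (a0, a0 + a1 + a2, a0 - a1 + a2, a0 + 2*a1 + 4*a2, a2)

    # The pointwise multiplications are squarings; a real implementation
    # recurses into squaring here, not general multiplication.
    h0, h1, hm1, h2, hinf = (v * v for v in f)

    # Interpolation and reconstruction exactly as in toom3.
    c0, c4 = h0, hinf
    c2 = (h1 + hm1) // 2 - c0 - c4
    s  = (h1 - hm1) // 2
    t  = (h2 - c0 - 4*c2 - 16*c4) // 2
    c3 = (t - s) // 3
    c1 = s - c3
    return c0 + (c1 << M) + (c2 << (2*M)) + (c3 << (3*M)) + (c4 << (4*M))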

All of the same optimisations can be applied for Toom squaring as for ordinary Toom multiplication. Here are the comparisons:

Toom Squaring

             Kara (5120 bits)   Toom3 (12800 bits)   Toom4 (51200 bits)   Toom7 (131072 bits)
MPIR 1.2.0             358656               82274                10514                  2762
GMP 4.3.0              357031               83676                10430                  2564
MPIR 1.1.2             349686               78510                 9917                  2347
GMP 4.2.1              185201               42256                 5112                  1237

The performance is now comparable to GMP until the Toom7 region, where we pull ahead considerably. These timings were done on the 2.8GHz Opteron (K8) server.

FFT Multiplication

For multiplication of large integers, MPIR uses a Fast Fourier Transform (FFT) method. Instead of working over the complex numbers, one can work over a ring Z/pZ where p is a special number of the form p = 2^M + 1 with M some power of 2 (p need not actually be prime). This trick is due to Schönhage and Strassen.
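The reason p of this form is attractive: modulo p = 2^M + 1, the number 2 is a 2M-th root of unity, so the root-of-unity multiplications inside the FFT become cheap bit shifts rather than full multiplications. A quick Python sanity check (M = 16 here is just for illustration):

M = 16                        # a power of 2
p = 2**M + 1
assert pow(2, M, p) == p - 1  # 2^M = -1 (mod p) ...
assert pow(2, 2*M, p) == 1    # ... so 2 has order exactly 2M modulo p

Multiplying by a power of 2 modulo 2^M + 1 costs only a shift and a subtraction, which is what makes transforms over this ring so fast.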

For MPIR 1.2 we decided to switch over to the new FFT of Pierrick Gaudry, Alexander Kruppa and Paul Zimmermann. An implementation of the ideas they presented at ISSAC 2007 was available to download from Paul's and Alex's websites.

Most of the work in merging this FFT was fixing bugs, making the code work efficiently on Windows, and writing and running tuning code for all the platforms we support.

I wrote the tuning code, Brian Gladman found primitives to use on Windows to replace the inline assembler available on Linux, and Jason Moxham, Brian Gladman, Jeff Gilchrist and I tuned the FFT on a range of systems.

Here are some examples of the speedup:

FFT Multiplication

             2000000 bits   6000000 bits   20000000 bits   64000000 bits
MPIR 1.2.0           74.9           12.8            5.20            1.47
GMP 4.3.0            52.4           13.4            3.66            0.813
MPIR 1.1.2           47.2           12.5            3.09            0.742
GMP 4.2.1            32.8           8.66            2.11            0.528

Timings are again on the 2.8GHz Opteron server.

Extended GCD

A longstanding issue with MPIR has been the lack of a fast extended GCD. We had merged Niels Möller's asymptotically fast GCD code, but unfortunately no suitable extended GCD implementation was available to us.

I began by reading Niels' paper to understand the half-GCD algorithm (ngcd) that he invented. Essentially half-GCD algorithms get their asymptotic improvement over the ordinary Euclidean algorithm as follows:

Begin by noting that the first few steps depend only on the uppermost bits of the original integers. Thus instead of working to full precision, one can split off the topmost bits and compute the first few steps of the Euclidean GCD on those bits, keeping track of the steps taken in matrix form. This gives an exact representation of the steps taken so far in the algorithm.

Next one applies the matrix to the original integers. The resulting integers are smaller than the originals, and one finishes off the algorithm by applying the usual GCD steps to these smaller integers.

Of course, to get an asymptotic improvement one needs to take care in how the matrix is applied, and to recurse the whole algorithm down to some fast basecase.

Once I understood that Niels' implementation explicitly computed the matrix at each step, it occurred to me that to turn it into an extended GCD algorithm, all I had to do was apply the matrix to the cofactors as well, and keep track of any other changes throughout the algorithm that would affect the cofactors.
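To illustrate the relationship between the matrix and the cofactors, here is a toy full-precision version in Python (my own sketch, not MPIR's code; the half-GCD algorithm computes the same matrix from just the top bits, and recurses):

def gcd_matrix(A, B):
    # Run Euclid on (A, B), accumulating the 2x2 step matrices.
    # Invariant: (A, B) = M * (a, b), where M = [[m00, m01], [m10, m11]].
    m00, m01, m10, m11 = 1, 0, 0, 1
    a, b = A, B
    while b:
        q = a // b
        a, b = b, a - q * b
        # Each quotient q contributes the step matrix [[q, 1], [1, 0]].
        m00, m01 = m00 * q + m01, m00
        m10, m11 = m10 * q + m11, m10
    return a, (m00, m01, m10, m11)

def xgcd(A, B):
    # The cofactors fall out of the matrix: det(M) is +1 or -1, so
    # inverting M expresses (g, 0), and hence g, in terms of (A, B).
    g, (m00, m01, m10, m11) = gcd_matrix(A, B)
    sign = m00 * m11 - m01 * m10
    u, v = sign * m11, -sign * m01
    assert u * A + v * B == g
    return g, u, v

assert xgcd(123456789, 987654321)[0] == 9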

As an example of the speedup, for 1048576 bit integers extended GCD benches as follows (as in the other tables, larger is better):

MPIR 1.1.2:   0.453
MPIR 1.2.0:   4.06

That is roughly a 9x speedup.

Again the benchmarks are for the 2.8GHz Opteron.

Assembly Improvements

Finally, this mammoth MPIR release also contains numerous sped-up assembly routines, due to Jason Moxham, for AMD Opterons (which usually also translate into improvements on other 64 bit x86 platforms).

Below I include timings (smaller is better) at each point where improvements were made to the assembly routines in MPIR; a dash means no new improvement in that release.

Assembly Improvements

                   MPIR 0.9   MPIR 1.0   MPIR 1.1.2   MPIR 1.2
mpn_add_n              1598       1524            -          -
mpn_sub_n              1598       1524            -          -
mpn_xor_n              3015       2271            -          -
mpn_and_n              3014          -         1772       1523
mpn_ior_n              3014          -         1771       1525
mpn_nand_n             3013          -         2035       1775
mpn_nior_n             3013       1773            -          -
mpn_andn_n             3013       2273            -          -
mpn_iorn_n             3015       2271            -          -
mpn_addadd_n              -       2525            -       2190
mpn_addsub_n              -       2526            -       2189
mpn_subadd_n              -          -         2526       2190
mpn_addmul_1           3094       2524            -          -
mpn_submul_1           3092       2524            -          -
mpn_mul_1              3024       2522            -          -
mpn_sublsh1_n             -       2527         2400       2190
mpn_com_n              3014       1271            -          -
mpn_divexact_by3      12016       2278            -          -
mpn_lshift             2531       1701            -          -
mpn_rshift             2532       1617            -          -
mpn_hamdist            8281       1791            -          -
mpn_popcount           7281       1527            -          -
MPN_COPY               3012          -         2017       1021
mpn_divrem_1          23649          -        15119          -
MPN_ZERO               2013          -            -        783

This time the timings were taken on a 2.6GHz AMD K10 box.

Numerous new assembly functions have also been added for 64 bit x86 machines since MPIR began, including: mpn_addmul_2, mpn_addadd_n, mpn_sublsh1_n, mpn_divexact_byff, mpn_rsh1add_n, mpn_lshift1, mpn_rshift1, mpn_rsh1sub_n, mpn_mul_2, mpn_lshift2, mpn_rshift2.

2 comments:

  1. Fantastic! I am amazed by your achievements,
    e.g., the toom3, toom42 improvements you came up with.
    It is a mere coincidence that the GMP repo had these
    a week before you invented them, surely. And then
    you invented how to do unbalanced toom. We're
    all very impressed. You even handle the case where
    the top coefficient is smaller! Wow!

    Truly lots of original research going on here!

  2. Actually unbalanced Toom methods were proposed by Marco Bodrato and Alberto Zanoni in 2007. See

    Integer and polynomial multiplication: towards optimal Toom-Cook matrices. ISSAC 2007

    and other papers from the same authors following it.
