As I explained in my last two blogs, over the last eight months I have been working on the OpenDreamKit project, funded by the EU in the Horizon2020 initiative. In this blog post I want to discuss the work I've been doing to speed up integer and polynomial multiplication as part of the OpenDreamKit high performance computing and optimisation component.

Flint makes use of a variety of algorithms for integer and polynomials. For integers, GMP/MPIR is used, and algorithms such as the schoolboy algorithm, Karatsuba and Toom-Cook algorithms are used. For polynomials the schoolboy algorithm is used, along with the Karatsuba algorithm and various variants of Kronecker segmentation.

For large integer and polynomial multiplications, however, Flint uses an FFT convolution.

Actually, integer multiplication is reduced to polynomial multiplication. The large integers are broken, down to the bit, into equal sized chunks, which are interpreted as coefficients of a polynomial. The polynomials are multiplied, and the result is again interpreted as a large integer.

Polynomial multiplications of large degree polynomials with small coefficients are also reduced to integer multiplication, by a method known as Kronecker Segmentation, where the polynomials are evaluated at a sufficiently large power of two, the resulting large integers are multiplied, and then the coefficients of the product polynomial are identified in the zero padded output.

#### The FFT convolution

At its core, the FFT convolution efficiently multiplies two large polynomials modulo x^(2^n) - 1. This is like polynomial multiplication with wraparound. Every 2^n coefficients, the result is wrapped around to the start. But we can use it to multiply polynomials of degree d if 2d < 2^n so that wraparound of the result doesn't interfere.

To multiply modulo x^(2^n) - 1 the FFT convolution uses an evaluation, pointwise multiplication, interpolation strategy. The polynomials to be multiplied are first evaluated at 2^n-th roots of unity. These 2^n values for each polynomial are multiplied with each of their counterparts from the other polynomial, to yield 2^n "pointwise products" and then the result polynomial is interpolated from these 2^n values.

Notice that the number of roots of unity is always a power of two in the FFT algorithm, so you are always doing multiplication modulo x^(2^n) - 1 for some n. In other words, convolutions always have length a power of two.

The evaluation phase of the FFT convolution is called the forward FFT. In the middle are the pointwise multiplications, i.e. a total of 2^n coefficient multiplications, and then the interpolation phase is called the inverse FFT or IFFT, for short.

Both the FFT and IFFT take O(n log n) operations, all of which are additions, subtractions and multiplications by roots of unity. The pointwise multiplications are simply coefficient multiplications. For example, if the polynomials being multiplied were over the complex numbers, the pointwise multiplications would be products of complex numbers.

#### The Schoenhage-Strassen algorithm

Performing FFTs and IFFTs over the complex numbers is problematic. When implemented on a computer, we have to use floating point arithmetic for this, which can result in rounding errors which can't be effectively controlled.

Instead of working over the complex numbers, the Schoenhage-Strassen algorithm works over another ring with roots of unity, namely Z/pZ where p = 2^(2^k) + 1. Note that in this ring, 2 is a 2^(k+1)-th root of unity. This allows for convolutions up to that length. But the advantage is that elements of this ring can be represented exactly, using (multiprecision) integer arithmetic,

Pointwise multiplications become multiplications modulo p = 2^(2^k) + 1, which can also be performed exactly. Thus, the result of such an algorithm is always exactly correct if the inputs are polynomials over the integers, so long as the output coefficients are smaller than p.

The Flint FFT convolution code uses the Schoenhage-Strassen algorithm to multiply large polynomials and integers. However, it incorporates many improvements, which I will now describe.

#### Efficient FFT butterflies

The basic low level component of the FFT and IFFT algorithms is called a butterfly. The FFT butterfly transforms pairs of inputs as follows

[a{i}, b{i}] => [a{i}+b{i}, z^i*(a{i}-b{i})]

The IFFT butterflies are similar. They take the form

[a{i}, b{i}] => [a{i}+z^(-i)*b{i}, a{i}-z^(-i)*b{i}]

Since the FFT coefficients are multiprecision integers (modulo p), the butterfly operations require applying the given operations on multiprecision integers of a fixed precision.

To speed these up, we make use of MPIR's low level assembly optimised functions. The sumdiff function performs an addition and a subtraction at the same time, and of course we make use of code for shifting multiprecision integers. In addition, the subtraction modulo p requires complementing the integers in twos complement format.

One of the OpenDreamKit projects we worked on was to speed up these basic assembly operations on modern CPUs, especially those that support AVX instructions.

You can read about the work that was done on this by Alex Kruppa [1] and the various improvements he got over the old MPIR assembly code.

#### Extended Fermat numbers

Numbers of the form p = 2^(2^k) + 1 are called Fermat numbers. But they limit us to working with coefficients in both our input polynomials and the output polynomial, which don't exceed p. To give us more flexibility, we can work with numbers of the form p = 2^(a*2^k) + 1 for some number a. Now 2^a is a 2^(k+1)-th root of unity, but we get to work with much larger coefficients at very little extra cost. Of course the pointwise multiplications are now more expensive, but sometimes this extra cost is worth paying for the extra flexibility we get.

#### Truncated Fourier Transform

As we noticed above, the length of a convolution is always a power of 2. We are always working modulo x^(2^n) - 1.

This means that if we are multiplying polynomials of degree d, everything is fine until 2d = 2^n. At this point, the output polynomial will become too large and we have to increase n. This leads to ugly steps in performance at power of two lengths.

To get around this, we make use of an algorithm known as the Truncated Fourier Transform (TFT). Truncating the FFT to non-power-of-2 lengths is relatively straightforward. We simply omit the operations which deal with coefficients beyond the desired length.

The pointwise multiplications are also easy. Simply do fewer pointwise multiplications.

But now we have less information going into the inverse FFT than we normally would. We use some clever linear algebra to reconstruct the missing information as the algorithm proceeds.

By truncating the entire convolution in this way, it is possible to save almost all the work that would be required to do a full, length 2^n transform.

#### Negacyclic transform

When the pointwise multiplications get very large, there is a trick we can use to perform them using an FFT of half the usual length.

Notice that the pointwise multiplications are done modulo p = 2^(a*2^k) + 1. If we split such a number up, this almost looks like doing a polynomial multiplication modulo x^(2^n) - 1 for some value of x, except for that pesky -1 instead of +1.

To get around this, we multiply x by a suitable root of unity and rescale so that the FFT convolution effectively becomes multiplication modulo x^(2^n) + 1. In other words, we use the FFT wraparound feature to our advantage.

This trick is due to Nussbaumer and saves time whenever the FFT coefficients are themselves large enough to be multiplied using an FFT, or perhaps slightly smaller than that.

The main problem with this approach is the size of the FFT coefficients of that small FFT convolution. It usually isn't quite the right size.

To get around this, it is possible to do a naive convolution where the coefficients are just one or two words in size and use Chinese remaindering to deal with coefficients that are just slightly larger than this Nussbaumer convolution can handle.

#### Algebraic identities

Another trick that can be used is when the number a is a multiple of 3 in p = 2^(2^a) + 1, specifically when the pointwise multiplications are small.

In such a case, we can use the identity x^{3a} + 1 = (x^a + 1)(x^{2a} - x^a + 1) to break the pointwise multiplications into smaller multiplication modulo x^a + 1 and x^{2a} - x^a + 1 respectively. If those multiplications are small enough to be quadratic time, splitting them up and using Chinese remaindering is actually faster than doing the one large multiplication.

In such a case, we can use the identity x^{3a} + 1 = (x^a + 1)(x^{2a} - x^a + 1) to break the pointwise multiplications into smaller multiplication modulo x^a + 1 and x^{2a} - x^a + 1 respectively. If those multiplications are small enough to be quadratic time, splitting them up and using Chinese remaindering is actually faster than doing the one large multiplication.

Multiplication modulo x^a + 1 takes time some constant multiple of a^2 and multiplication modulo x^{2a} - x^a + 1 takes four times as long again. The total is some constant multiple of 5a^2.

Doing a single multiplication modulo x^{3a} + 1, on the other hand, would take time equal to some multiple of 9a^2. So splitting things up using identities can be a big win.

Doing a single multiplication modulo x^{3a} + 1, on the other hand, would take time equal to some multiple of 9a^2. So splitting things up using identities can be a big win.

#### The sqrt2 trick

In the ring Z/pZ where p = 2^S + 1 the value w = 2^(2S/4) - 2^(S/4) is a square root of 2. And recall that 2 was a root of unity in the Schoenhage-Strassen algorithm. Thus w is also a root of unity.

Making use of the root of unity w allows us to perform convolutions of twice the length using the Schoenhage-Strassen algorithm. And multiplication by powers of w is not so difficult. Any even power of w is just a power of 2, and any odd power is just the difference of two such powers.

Ordinarily, making the length of an FFT convolution twice as big, multiplies the time taken by a factor of four, since the coefficients required to support the larger convolution also have to be twice as large. But the sqrt2 trick allows us to do it for just over twice the cost.

#### The matrix Fourier algorithm

One of the big problems with the Schoenhage-Strassen algorithm as described above is that it works with a lot of data. There are 2^n multiprecision FFT coefficients being dealt with at any one time. The pattern of access to these coefficients is also not very regular throughout the algorithm, and so once the entire convolution falls out of cache, performance can be very bad.

The performance degradation can be reduced to some degree by switching to a recursive implementation of the FFT, which quickly gets down to blocks of coefficients that fit in cache. But it doesn't mitigate it entirely.

One way of resolving this is the matrix Fourier algorithm. Instead of viewing the input as a linear array of coefficients, it breaks the single large FFT down into many smaller FFTs, as though the input coefficients were the rows and columns of a matrix.

Two passes over the data are required, in order to handle lots of row FFTs and lots of column FFTs. But the overall saving is dramatic.

Of course, combining this trick with all the other tricks above is really a lot of work. But we do this in Flint.

In particular, we break the coefficients of the truncated Fourier transform up in such a way that each of the row and column FFTs has no wasted space.

#### Parallelisation

Now we have an obvious target for parallelisation. As we have broken the single large FFT into many smaller independent ones, these can be sent to different threads to perform.

Of course, nothing is ever that simple. We must ensure that the temporary space allocated throughout the algorithm is not overwritten by different threads. The algorithm is also more efficient if some of the inward column FFT transforms for example, are combined with the relevant pointwise multiplications and outward IFFT column transforms.

All of this also requires all the coefficients to be in memory in the right order, so that memory access is not complicated. Fortunately there are two kinds of FFT and IFFT, namely decimation in time and decimation in frequency. Picking the correct kinds at the right time allows things to be managed more or less efficiently.

The matrix Fourier algorithm also requires multiplying by some additional roots of unity at the right points, and these need to be merged into one layer of the FFTs and IFFTs so that they don't become an additional cost.

But all complications aside, the matrix Fourier algorithm is both cache efficient and a good target for parallelisation.

As part of my OpenDreamKit work, I parallelised the Flint Schoenhage-Strassen code in precisely this way, maintaining all of the above optimisations. I made use of OpenMP for this.

The results were quite good, with up to a 5.5x speedup on 8 cores for large integer multiplications (I also parallelised splitting the large integer up into FFT coefficients).

You can see timings and a high level summary of the work that I did in the writeup I did for the final deliverable for this task of the OpenDreamKit project [1].

[1] https://github.com/OpenDreamKit/OpenDreamKit/issues/120