We have one more linear function to implement before we move on to some

assembly language optimisation and then quadratic functions, namely mod1.

It returns the remainder after division by a single word.

One might think that this cannot be computed any faster than divrem1,

however the following algorithm allows us to perform the computation

slightly faster.

The algorithm is credited to Peter Montgomery. It is based on the following

observation. Suppose we have computed, in advance, the values

b2 = B^2 mod d and b3 = B^3 mod d.

Then a number of the form a0 + a1 * B + a2 * B^2 + a3 * B^3 can be

reduced mod d by computing s = a0 + a1 * B + a2 * b2 + a3 * b3, then

later reducing s mod d.

Note that as d can be at most B - 1, the values bi are at most B - 2.

The values ai are at most B - 1. Thus ai * bi is at most B^2 - 3*B + 2.

This means that when summing up s, we are adding at most B^2 - 3*B + 2

to B^2 - 1 at each step. The total is at most 2*B^2 - 3*B + 1. In

other words, there might be a carry of *1* into a third limb. But we

already know that b2 = B^2 mod d. Thus we can get rid of this 1 from

the third limb by adding b2 to our total in its place. The total will

then be at most B^2 - 2*B - 1, which fits easily into two limbs.

We may need to do this adjustment twice when summing up. It is an

unpredicted branch to do this correction, but with a deep enough pipeline

it is possible the processor will compute both paths, minimising the cost

of a misprediction.

Thus, with some precomputation, we can reduce four limbs of our dividend

to two limbs with 2 multiplications. Another advantage is that these

multiplications are independent. This gives the processor lots of

opportunity to pipeline them and compute them quickly.

One tricky implementation detail is in testing when s overflows two limbs.

We can accumulate into three limbs, but this is not quite efficient.

One way of testing for overflow of a double word addition is to check if

the result is smaller than one of the summands. If so, an overflow

occurred and we can make our adjustment.

In the case where d is at most B/3, we can make another optimisation. Also

compute b1 = B mod d and perform a third multiplication a1 * b1. We have

that bi is less than B/3. Thus the three products ai*bi are at most (B/3 - 1)*(B - 1)

= B^2/3 - 4*B/3 + 1. Clearly s is now at most B^2 - 3*B + 2 and no overflow

occurs. Therefore no tests or adjustments are required to compute s.

I will leave this optimised case as an exercise and just implement the

generic case for now.

At the end of the algorithm we must reduce a double word mod d. This could

be optimised with a precomputed inverse a la divrem1, but again I skip this

optimisation and leave it as an exercise.

At the start of the algorithm we have to set up differently depending on

whether there are an even or odd number of words (including the carry-in).

If there is an odd number in total, we need to do a single reduction of

the extra word so that an even number remain.

We add a new mod_preinv1_t type and a function precompute_mod_inverse1

to compute it. This computes B, B^2 and B^3 mod d. For now this precomputation

is not optimised to use a precomputed inverse. This would actually save

significant time, but again I leave it as an exercise to optimise this.

The end result looks a little messy compared to the algorithms implemented

so far. I wonder if anyone can find any simplifications of the code.

Here v0.10 is the github branch for this post.

Previous article: v0.9 - divrem1_hensel

Next article: v0.11 - generic test code

## No comments:

## Post a Comment