We have one more linear function to implement before we move on to some
assembly language optimisation and then quadratic functions, namely mod1.
It returns the remainder after division by a single word.
One might think that this cannot be computed any faster than divrem1,
however the following algorithm allows us to perform the computation
slightly faster.
The algorithm is credited to Peter Montgomery. It is based on the following
observation. Suppose we have computed, in advance, the values
b2 = B^2 mod d and b3 = B^3 mod d.
Then a number of the form a0 + a1 * B + a2 * B^2 + a3 * B^3 can be
reduced mod d by computing s = a0 + a1 * B + a2 * b2 + a3 * b3, then
later reducing s mod d.
Note that as d can be at most B - 1, the values bi are at most B - 2.
The values ai are at most B - 1. Thus ai * bi is at most B^2 - 3*B + 2.
This means that when summing up s, we are adding at most B^2 - 3*B + 2
to B^2 - 1 at each step. The total is at most 2*B^2 - 3*B + 1. In
other words, there might be a carry of *1* into a third limb. But we
already know that b2 = B^2 mod d. Thus we can get rid of this 1 from
the third limb by adding b2 to our total in its place. The total will
then be at most B^2 - 2*B - 1, which fits easily into two limbs.
We may need to do this adjustment twice when summing up. It is an
unpredicted branch to do this correction, but with a deep enough pipeline
it is possible the processor will compute both paths, minimising the cost
of a misprediction.
Thus, with some precomputation, we can reduce four limbs of our dividend
to two limbs with 2 multiplications. Another advantage is that these
multiplications are independent. This gives the processor lots of
opportunity to pipeline them and compute them quickly.
One tricky implementation detail is in testing when s overflows two limbs.
We can accumulate into three limbs, but this is not quite efficient.
One way of testing for overflow of a double word addition is to check if
the result is smaller than one of the summands. If so, an overflow
occurred and we can make our adjustment.
In the case where d is at most B/3, we can make another optimisation. Also
compute b1 = B mod d and perform a third multiplication a1 * b1. We have
that bi is less than B/3. Thus the three products ai*bi are at most (B/3 - 1)*(B - 1)
= B^2/3 - 4*B/3 + 1. Clearly s is now at most B^2 - 3*B + 2 and no overflow
occurs. Therefore no tests or adjustments are required to compute s.
I will leave this optimised case as an exercise and just implement the
generic case for now.
At the end of the algorithm we must reduce a double word mod d. This could
be optimised with a precomputed inverse a la divrem1, but again I skip this
optimisation and leave it as an exercise.
At the start of the algorithm we have to set up differently depending on
whether there are an even or odd number of words (including the carry-in).
If there is an odd number in total, we need to do a single reduction of
the extra word so that an even number remain.
We add a new mod_preinv1_t type and a function precompute_mod_inverse1
to compute it. This computes B, B^2 and B^3 mod d. For now this precomputation
is not optimised to use a precomputed inverse. This would actually save
significant time, but again I leave it as an exercise to optimise this.
The end result looks a little messy compared to the algorithms implemented
so far. I wonder if anyone can find any simplifications of the code.
Here v0.10 is the github branch for this post.
Previous article: v0.9 - divrem1_hensel
Next article: v0.11 - generic test code
No comments:
Post a Comment