Today we'll implement a not-stupid division function. Since the earliest
days of computers, people realised that where possible, one should
replace division with multiplication. This is done by computing a
precomputed inverse v of the divisor d and multiplying by v instead of
dividing by d.
One particularly ingenious example of this, which was developed recently,
is the algorithm of Möller and Granlund. It can be found in this paper:
If one has a precomputed inverse v as defined at the start of section 3
of that paper, then one can compute a quotient and remainder using
algorithm 4 of the paper using one multiplication and one mullow.
There is one caveat however. The divisor d must be "normalised". This
means that the divisor is a word whose most significant bit is a 1.
This is no problem for us as we can simply shift n to the left so that it is
normalised, and also shift the two input limbs to the left by the same
amount. The quotient will then be the one we want and the remainder
will be shifted to the left by the same amount as n. We can shift it back
to get the real remainder we are after.
We can use this algorithm in our divrem1 function. However, rather than
shift the remainder back each iteration, as it is the high limb of the
next input to the same algorithm we can simply pass it unmodified to the
next iteration of the divrem1 function.
The name of our function is nn_divrem1_preinv. We also introduce a type,
preinv1_t, to nn.h which contains the precomputed inverse (dinv) and the
number of bits n has been shifted by (norm). We add a function,
precompute_inverse1, which computes this.
In order to compute the number of leading zeroes of n, we simply use a gcc
intrinsic __builtin_clzl which does precisely what we are after. This gives
the number of bits we have to shift by.
We implement the Möller-Granlund algorithm as a macro to save as many cycles
as we can. This time we really have to be careful to escape the variables we
introduce inside the macro, to prevent clashes with parameters that might be
passed in. The standard way of doing this is to prepend some underscores to
the variables inside the macro. We call the macro divrem21_preinv1 to signify
that it is doing a division with dividend of 2 words and divisor of 1 word.
Once this is done, the nn_divrem1_preinv algorithm is very simple to implement.
It looks even simpler than our divrem1_simple function. The test code is very
similar as well. The new function runs much faster than the old one, as can
easily be seen by comparing the pauses for the different functions when
running the test code.
Later we'll have to introduce a timing and profiling function so that we can
be more explicit about how much of a speedup we are getting.
I haven't experimented with the divrem21_preinv macro to see if I can knock
any cycles off the computation. For example, careful use of const on one or
more of the variables inside the macro, or perhaps fewer word_t/dword_t
casts might speed things up by a cycle or two. Let me know if you find a
better combination.
Of course this is really just doing optimisations that the C compiler should
already know about. The best way to get real performance out of it is to add
some assembly language optimisations, which we'll eventually do.
The github branch for today's post is here: v0.8
Previous article: bsdnt assembly language
Next article: v0.9 - divrem1_hensel
No comments:
Post a Comment