Today we'll implement a not-stupid division function. Since the earliest

days of computers, people realised that where possible, one should

replace division with multiplication. This is done by computing a

precomputed inverse v of the divisor d and multiplying by v instead of

dividing by d.

One particularly ingenious example of this, which was developed recently,

is the algorithm of Möller and Granlund. It can be found in this paper:

If one has a precomputed inverse v as defined at the start of section 3

of that paper, then one can compute a quotient and remainder using

algorithm 4 of the paper using one multiplication and one mullow.

There is one caveat however. The divisor d must be "normalised". This

means that the divisor is a word whose most significant bit is a 1.

This is no problem for us as we can simply shift n to the left so that it is

normalised, and also shift the two input limbs to the left by the same

amount. The quotient will then be the one we want and the remainder

will be shifted to the left by the same amount as n. We can shift it back

to get the real remainder we are after.

We can use this algorithm in our divrem1 function. However, rather than

shift the remainder back each iteration, as it is the high limb of the

next input to the same algorithm we can simply pass it unmodified to the

next iteration of the divrem1 function.

The name of our function is nn_divrem1_preinv. We also introduce a type,

preinv1_t, to nn.h which contains the precomputed inverse (dinv) and the

number of bits n has been shifted by (norm). We add a function,

precompute_inverse1, which computes this.

In order to compute the number of leading zeroes of n, we simply use a gcc

intrinsic __builtin_clzl which does precisely what we are after. This gives

the number of bits we have to shift by.

We implement the Möller-Granlund algorithm as a macro to save as many cycles

as we can. This time we really have to be careful to escape the variables we

introduce inside the macro, to prevent clashes with parameters that might be

passed in. The standard way of doing this is to prepend some underscores to

the variables inside the macro. We call the macro divrem21_preinv1 to signify

that it is doing a division with dividend of 2 words and divisor of 1 word.

Once this is done, the nn_divrem1_preinv algorithm is very simple to implement.

It looks even simpler than our divrem1_simple function. The test code is very

similar as well. The new function runs much faster than the old one, as can

easily be seen by comparing the pauses for the different functions when

running the test code.

Later we'll have to introduce a timing and profiling function so that we can

be more explicit about how much of a speedup we are getting.

I haven't experimented with the divrem21_preinv macro to see if I can knock

any cycles off the computation. For example, careful use of const on one or

more of the variables inside the macro, or perhaps fewer word_t/dword_t

casts might speed things up by a cycle or two. Let me know if you find a

better combination.

Of course this is really just doing optimisations that the C compiler should

already know about. The best way to get real performance out of it is to add

some assembly language optimisations, which we'll eventually do.

The github branch for today's post is here: v0.8

Previous article: bsdnt assembly language

Next article: v0.9 - divrem1_hensel

## No comments:

## Post a Comment