Saturday 25 September 2010

BSDNT - configure and assembly improvements

It's time we added architecture dependent assembly support to bsdnt.

Here is how we are going to do it. For each of the implementation files
we have (nn_linear.c, nn_quadratic.c), we are going to add a _arch file,
e.g. nn_linear_arch.h, nn_quadratic_arch.h, etc.

This file will be #included in the relevant implementation file, e.g.
nn_linear_arch.h will be #included in nn_linear.c.

These _arch.h files will be generated by a configure script, based on
the CPU architecture and operating system kernel. They will merely
include a list of architecture specific .h files in an arch directory.

For example we might have nn_linear_x86_64_core2.h in the arch directory,
which provides routines specific to core2 processors running in 64 bit
mode.

In these architecture specific files, we'll have inline assembly routines
designed to replace various routines in the pure C implementation files
that we have already written. They'll do this by defining flags, e.g.
HAVE_ARCH_nn_mul1_c, which will specify that an architecture specific
version of nn_mul1_c is available. We'll then wrap the implementation of
nn_mul1_c in nn_linear.c with a test for this flag. If the flag is defined,
the C version will not be compiled.

In order to make this work, the configure script has to work out whether
the machine is 32 or 64 bit and what the CPU type is. It will then link
in the correct architecture specific files.

At the present moment, we are only interested in x86 machines running on
*nix (or Windows, but the architecture will be determined in a different
way on Windows).

A standard way of determining whether the kernel is 64 bit or not is to
search for the string x86_64 in the output of uname -m. If something else
pops out then it is probably a 32 bit machine.

Once we know whether we have a 32 or 64 bit machine, we can determine the
exact processor by using the cpuid instruction. This is an assembly
instruction supported by x86 cpus which tells you the manufacturer, family
and model of the CPU.

We include a small C program cpuid.c with some inline assembly to call the
cpuid instruction. As this program will only ever be run on *nix, we can
make use of gcc's inline assembly feature.

When the parameter to the cpuid instruction is 0 we get the vendor ID,
which is a 12 character string. We are only interested in "AuthenticAMD"
and "GenuineIntel" at this point.

When we pass the parameter 1 to the cpuid instruction, we get the processor
model and family.

For Intel processors, table 2-3 in the following document gives information
about what the processor is:

http://www.intel.com/Assets/PDF/appnote/241618.pdf

However the information is out of date. Simply googling for Intel Family 6
Model XX reveals other models that are not listed in the Intel documentation.

The information for AMD processors is a little harder to come by. However,
one can essentially extract the information from the revision guides, though
it isn't spelled out clearly:

http://developer.amd.com/documentation/guides/Pages/default.aspx#revision_Guides

It seems AMD only list recent processors here, and they are all 64 bit.
Information on 32 bit processors can be found here:

http://www.sandpile.org/ia32/cpuid.htm

At this point we'd like to identify numerous different architectures. We
aren't interested in 32 bit architectures, such as the now ancient Pentium
4 or AMD's K7. Instead, we are interested in the case where 32 bit operating
system kernels are running on 64 bit machines. Thus all 32 bit CPUs simply
identify as x86.

There are numerous 64 bit processors we are interested in:

64 bit Pentium 4 CPUs were released until August 2008. We identify them as p4.
All the 64 bit ones support SSE2 and SSE3 and are based on the Netburst
microarchitecture.

The Intel Core Solo and Core Duo processors were 32 bit and do not interest us.
They were an enhanced version of the p6 architecture. They get branded as x86
for which only generic 32 bit assembly code will be available, if any.

Core 2's are very common. They will identify as core2. They all support SSE2,
SSE3 and SSSE3 (the Penryn and following 45nm processors support SSE4.1 - we
don't distinguish these at this stage).

Atoms are low voltage processors from Intel which support SSE2 and SSE3 and
are mostly 64 bit. We identify them as atom.

More recently Intel has released Core i3, i5, i7 processors, based on the
Nehalem architecture. We identify these as nehalem. They support SSE2, SSE3,
SSSE3, SSE4.1 and SSE4.2.

AMD K8's are still available today. They support SSE2 and SSE3. We identify
them as k8.

AMD K10's are a more recent core from AMD. They support SSE2, SSE3 and SSE4a.
We identify these as k10. There are three streams of AMD K10 processors,
Phenom, Phenom-II and Athlon-II. We don't distinguish these at this point.

So in summary, our configure script first identifies whether we have a 32 or
64 bit *nix kernel. Then the CPU is identified as either x86, p4, core2,
nehalem, k8 or k10, where x86 simply means that it is some kind of 32 bit CPU.

Our configure script then links in architecture specific files as appropriate.

The only assembly code we include so far consists of the nn_add_mc functions
we wrote for 64 bit core2 and k10. As these are better than nothing on other 64 bit
processors from the same manufacturers, we include this code in the k8
specific files until we write versions for each processor. We also add an
nn_sub_mc assembly file for Intel and AMD 64 bit processors.

The configure script includes the architecture specific .h files starting from
the most recent processors so that code for earlier processors is not used
when something more recent is available.

The github branch for this revision is here: asm2

Previous article: v0.12 mul_classical

Friday 24 September 2010

BSDNT - v0.12 mul_classical

I've been on holidays for a couple of days, hence the break in transmission.
(For academics: the definition of a holiday is a day where you actually
stop working and do nothing work related for that whole day. I realise the
definition is not widely known, nor is it well-defined.)

Note, due to accidentally including some files in today's release that I
intended to hold over till next time, you have to do ./configure before
make check. This detects your CPU and links in any assembly code
that is relevant for it. More on this when I do the actual blog post about it.

It's time to start implementing quadratic algorithms (i.e. those that
take time O(n^2) to run, such as classical multiplication and division).

Before we do this, we are going to reorganise slightly. The file which
is currently called nn.c will be renamed nn_linear.c to indicate that it
contains our linear functions. We'll also rename t-nn.c to t-nn_linear.c.

The macros in that file will be moved into test.h.

We also modify the makefile to build all .c files in the current directory
and to run all tests when we do make check. A new directory "test" will
hold our test files.

Finally, we add an nn_quadratic.c and test/t-nn_quadratic.c to hold the
new quadratic routines and test code that we are about to write.

The first routine we want is nn_mul_classical. This is simply a mul1
followed by addmul1's. Of course this will be horrendously slow, but once
again we defer speeding it up until we start adding assembly routines,
which will start happening shortly.

In a departure from our linear functions, we do not allow zero lengths.
This dramatically simplifies the code and means we do not have to check
for special cases.

There does not appear to be any reason to allow a carry-in to our
multiplication routine. An argument can be made for it on the basis of
consistency with mul1. However, the main use for carry-ins and carry-outs
thus far has been for chaining.

For example, we chained mul1's s*c and t*c for s and t of length m and
c a single word in order to compute (s*B^m + t)*c.

But the analogue in the case of full multiplication would seem to be
chaining to compute (s*B^m + t)*c where s, t and c all have length m. But
it probably doesn't make sense to chain full multiplications in this way
as it would involve splitting the full product t*c say, into two separate
parts of length m, which amongst other things, would be inefficient.

It might actually make more sense to pass a whole array of carries to our
multiplication routine, one for every word of the multiplier. However it is
not clear what use this would be. So, for now at least, we pass no carry-ins
to our multiplication routine.

We settle on allowing a single word of carry out. This may be zero if the
leading words of the multiplicands are small enough. Rather than write this
extra word out, we simply return it as a carry so that the caller can decide
whether to write it.

The classical multiplication routine is quite straightforward, but that plus the
rearrangement we've done is more than enough for one day.

The github branch for this release is here: v0.12

Previous article: v0.11 - generic test code

Monday 20 September 2010

BSDNT - v0.11 generic test code

It's time we improved our test framework for all the functions we are
adding, before things get out-of-control.

The best strategy would actually be to write a small expression parser
and interpreter, which would allow us to test arbitrary expressions.
However, that would be a distraction at this point, so we opt to
simply add some convenience functions.

In particular, we'll add some new random functions which generate
lists of random words and random garbage collected nn's. Our only aim
here is to simplify our test code and cut down its length.

We don't want to make it *more* complicated for people to use, and
C89 places some hard limits on us in that regard.

The first step is to add two new files, test.c and test.h which
will contain our garbage collection and random functions.

We begin by adding an enum type_t, which parameterises all the different
types our garbage collection can clean up. At the moment we only need
one type, NN.

Next we add a variadic function (variable number of arguments) which
creates a whole bunch of random words. Unfortunately C89 doesn't support
variadic macros, so we need to pass pointers to the words, so that the
actual words can be modified.

This random function takes a flag which specifies what type of random
number we want. Specifically we'll have flags for ODD, NONZERO, ANY.

C is also so stupid that we actually need to tell it explicitly, or mark
somehow, the number of arguments we are giving to the variadic function.
In C99 one can work around this by wrapping a variadic function with a
variadic macro. But we don't have that facility here. We choose to simply
pass NULL as the final argument to variadic functions.

We also introduce a global garbage stack which will keep a pointer to all
the objects allocated so far, so that a garbage collector can clean them
up later on, when called. The fact that this garbage is global makes it not
threadsafe, but we are only using the test support for test code at this
point and so we don't care about thread safety. Later we could pass a
context around as a parameter if we wanted to make it threadsafe.

We also add a gc_cleanup function, which cleans away all the garbage, so
that memory leaks don't occur. Later we could expand this function to do
redzone checking and various other automatic tests on the garbage it is
disposing of.

Now we can add some convenience functions for generating random
multiprecision integers.

The randoms_of_len function again takes a NULL terminated list of
*pointers* to nn_t's and both initialises them to the given length and
sets them to random limbs. We can specify various flags here, such as
ANY, FULL, the latter returning a multiprecision integer with exactly
the given number of limbs, with the top limb being nonzero (unless
the number of limbs requested is zero).

Now that we have all this, in our test file test.h, we define a macro
TEST_START which takes a number of iterations and simply sets up a loop
with that many iterations. A TEST_END macro then calls garbage collection
to clean up at the end.

Finally, we add a test_generics function, which generates some random values
using our new generics and then cleans up.

By running a program called valgrind over our test code, we can see if all
the memory used by our generics actually got cleaned up after use.

We now roll out these changes to the rest of our test code. We note that the
entire test file is now about two-thirds of the length it was before we rewrote
it!

The new branch is on github: v0.11

Previous article: v0.10 - mod1_preinv
Next article: v0.12 - mul_classical

Sunday 19 September 2010

BSDNT - v0.10 mod1_preinv

We have one more linear function to implement before we move on to some
assembly language optimisation and then quadratic functions, namely mod1.
It returns the remainder after division by a single word.

One might think that this cannot be computed any faster than divrem1;
however, the following algorithm allows us to perform the computation
slightly faster.

The algorithm is credited to Peter Montgomery. It is based on the following
observation. Suppose we have computed, in advance, the values
b2 = B^2 mod d and b3 = B^3 mod d.

Then a number of the form a0 + a1 * B + a2 * B^2 + a3 * B^3 can be
reduced mod d by computing s = a0 + a1 * B + a2 * b2 + a3 * b3, then
later reducing s mod d.

Note that as d can be at most B - 1, the values bi are at most B - 2.
The values ai are at most B - 1. Thus ai * bi is at most B^2 - 3*B + 2.

This means that when summing up s, we are adding at most B^2 - 3*B + 2
to B^2 - 1 at each step. The total is at most 2*B^2 - 3*B + 1. In
other words, there might be a carry of *1* into a third limb. But we
already know that b2 = B^2 mod d. Thus we can get rid of this 1 from
the third limb by adding b2 to our total in its place. The total will
then be at most B^2 - 2*B - 1, which fits easily into two limbs.

We may need to do this adjustment twice when summing up. The correction
involves an unpredictable branch, but with a deep enough pipeline it is
possible the processor will compute both paths, minimising the cost of a
misprediction.

Thus, with some precomputation, we can reduce four limbs of our dividend
to two limbs with 2 multiplications. Another advantage is that these
multiplications are independent. This gives the processor lots of
opportunity to pipeline them and compute them quickly.

One tricky implementation detail is testing when s overflows two limbs.
We could accumulate into three limbs, but this is not as efficient.

One way of testing for overflow of a double word addition is to check if
the result is smaller than one of the summands. If so, an overflow
occurred and we can make our adjustment.

In the case where d is at most B/3, we can make another optimisation. Also
compute b1 = B mod d and perform a third multiplication a1 * b1. We have
that bi is less than B/3. Thus the three products ai*bi are at most (B/3 - 1)*(B - 1)
= B^2/3 - 4*B/3 + 1. Clearly s is now at most B^2 - 3*B + 2 and no overflow
occurs. Therefore no tests or adjustments are required to compute s.

I will leave this optimised case as an exercise and just implement the
generic case for now.

At the end of the algorithm we must reduce a double word mod d. This could
be optimised with a precomputed inverse a la divrem1, but again I skip this
optimisation and leave it as an exercise.

At the start of the algorithm we have to set up differently depending on
whether there are an even or odd number of words (including the carry-in).
If there is an odd number in total, we need to do a single reduction of
the extra word so that an even number remain.

We add a new mod_preinv1_t type and a function precompute_mod_inverse1
to compute it. This computes B, B^2 and B^3 mod d. For now this precomputation
is not optimised to use a precomputed inverse. This would actually save
significant time, but again I leave it as an exercise to optimise this.

The end result looks a little messy compared to the algorithms implemented
so far. I wonder if anyone can find any simplifications of the code.

The github branch for this post is here: v0.10

Previous article: v0.9 - divrem1_hensel

Saturday 18 September 2010

BSDNT - v0.9 divrem1_hensel

The next division function we'll introduce is so-called Hensel division, or
right-to-left division. This starts at the least significant word and applies
the usual division algorithm back towards the most significant word.

Of course, now any "remainder" will be at the most significant word end and
have a completely different meaning.

Hensel division is actually just division mod B^m. Thus if q is the Hensel
quotient, you have a = q * d mod B^m. In fact, if r is the Hensel remainder,
you have a = q*d + r*B^m.

If a fits in m words and the division is exact (i.e. a = q*d), the Hensel
division and ordinary (euclidean) division return the same quotient. If the
division is not exact, this is no longer the case.

However, it may be that chaining Hensel divisions together does produce an
exact division, even when division over the bottom part of the chain
wouldn't produce an exact division. Thus, the ability to chain Hensel
divisions, and thus the ability to return the Hensel remainder, is
important.

Well, now we come to the first issue if we want to implement this. What is
the C operator for Hensel division? Actually, there doesn't seem to be one.
We have to implement it ourselves.

If a0 is a word from our dividend, and d is our single word divisor, then
we require q0 such that d * q0 = a0 mod B. In other words, we want q0 and
r0 such that d * q0 = a0 + r0*B.

Suppose we can find a value dinv such that dinv * d = 1 mod B. Then
we have that dinv * d * q0 = dinv * a0 mod B, i.e. q0 = dinv * a0 mod B,
which is a single word-by-word mullow.

Once we have q0 we can compute d * q0, the high word of which is the Hensel
"remainder" r0. We need to subtract this from the next limb of our divisor
before continuing.

We'll make the convention that carry-ins and carry-outs from our Hensel
division are positive rather than negative. We'll just remember to subtract
them. This way, if q is our Hensel quotient and r our Hensel remainder,
then d * q = a + r * B^n.

So how do we compute dinv? Firstly, we better require that d is odd, else
dinv * d = 1 mod B is an impossibility. With this restriction, we have
gcd(d, B) = 1 and therefore the extended euclidean algorithm lets us find
s, t such that s*d + t*B = 1. Then modulo B we have s*d = 1 mod B, i.e.
dinv = s is the precomputed inverse that we are after.

Another way to solve dinv * d = 1 mod B is to use Hensel lifting. This
works by first solving the equation mod 2, then use that solution to
solve it mod 4 and so on. This can all be done with nothing more than
multiplications.

Firstly, we know a solution mod 2, namely 1 (as d is odd), i.e. if v1 = 1
then we have v1 * d = 1 mod 2. Now suppose we have a solution vk mod 2^k,
i.e. vk * d = 1 mod 2^k. We'd like to find a value wk such that
(vk + 2^k * wk) * d = 1 mod 2^2k. This is always possible -- I omit the
easy proof -- so we only need to solve this equation for wk, i.e.
(d * vk - 1) = -2^k * wk * d. We can get rid of the d on the right hand
side by multiplying by vk, i.e. 2^k * wk = ((1 - d * vk) * vk) (mod 2^2k).

In other words, two multiplications will suffice to compute
2^k * wk (mod 2^2k) from vk. Then v{k+1} = vk + 2^k * wk is the inverse
of d mod 2^2k.

As a further optimisation of all this, one can just work mod 2^64 all
the way through the computation. In other words, start with v1 = 1
and compute v{k+1} = vk + ((1 - d * vk) * vk) (mod 2^64), six times.

To get from a solution mod 2 to a solution mod 2^64 clearly only takes
six Hensel lifting steps. Of course, with a lookup table mod 2^8 to start
from, instead of starting from a solution mod 2, one can do it in three
steps, requiring just six word-by-word multiplications (mullow only).

However, today I am feeling very lazy and I leave it as an exercise for
someone to implement the Hensel lifting lookup table method.

We add a simple type to hold our precomputed Hensel inverse, and a function
for computing it.

The actual nn_divrem_hensel_1_preinv function is again a straightforward
loop. We ensure our remainder is positive at each step and ensure that
we propagate any borrows from subtractions at each step.

The test code for the Hensel division is very similar to that for the
ordinary euclidean division except that d must be odd and the chaining
test will proceed right-to-left.

OK, that is enough brain strain for one day!

The github branch for today's post is here: v0.9

Previous post: v0.8 - divrem1_preinv

Thursday 16 September 2010

BSDNT - v0.8 divrem1_preinv

Today we'll implement a not-stupid division function. Since the earliest
days of computers, people realised that where possible, one should
replace division with multiplication. This is done by computing a
precomputed inverse v of the divisor d and multiplying by v instead of
dividing by d.

One particularly ingenious example of this, which was developed recently,
is the algorithm of Möller and Granlund. It can be found in this paper:

If one has a precomputed inverse v as defined at the start of section 3
of that paper, then one can compute a quotient and remainder using
algorithm 4 of the paper using one multiplication and one mullow.

There is one caveat however. The divisor d must be "normalised". This
means that the divisor is a word whose most significant bit is 1.
This is no problem for us, as we can simply shift d to the left so that it is
normalised, and also shift the two input limbs to the left by the same
amount. The quotient will then be the one we want and the remainder
will be shifted to the left by the same amount as d. We can shift it back
to get the real remainder we are after.

We can use this algorithm in our divrem1 function. However, rather than
shift the remainder back each iteration, as it is the high limb of the
next input to the same algorithm we can simply pass it unmodified to the
next iteration of the divrem1 function.

The name of our function is nn_divrem1_preinv. We also introduce a type,
preinv1_t, to nn.h which contains the precomputed inverse (dinv) and the
number of bits d has been shifted by (norm). We add a function,
precompute_inverse1, which computes this.

In order to compute the number of leading zeroes of d, we simply use a gcc
intrinsic __builtin_clzl which does precisely what we are after. This gives
the number of bits we have to shift by.

We implement the Möller-Granlund algorithm as a macro to save as many cycles
as we can. This time we really have to be careful to escape the variables we
introduce inside the macro, to prevent clashes with parameters that might be
passed in. The standard way of doing this is to prepend some underscores to
the variables inside the macro. We call the macro divrem21_preinv1 to signify
that it is doing a division with dividend of 2 words and divisor of 1 word.

Once this is done, the nn_divrem1_preinv algorithm is very simple to implement.
It looks even simpler than our divrem1_simple function. The test code is very
similar as well. The new function runs much faster than the old one, as can
easily be seen by comparing the pauses for the different functions when
running the test code.

Later we'll have to introduce a timing and profiling function so that we can
be more explicit about how much of a speedup we are getting.

I haven't experimented with the divrem21_preinv macro to see if I can knock
any cycles off the computation. For example, careful use of const on one or
more of the variables inside the macro, or perhaps fewer word_t/dword_t
casts might speed things up by a cycle or two. Let me know if you find a
better combination.

Of course this is really just doing optimisations that the C compiler should
already know about. The best way to get real performance out of it is to add
some assembly language optimisations, which we'll eventually do.

The github branch for today's post is here: v0.8

Previous article: bsdnt assembly language
Next article: v0.9 - divrem1_hensel

Tuesday 14 September 2010

BSDNT - assembly language

Today a few of us (Antony Vennard, Brian Gladman, Gonzalo Tornaria and myself) had a go at writing some assembly language.

The results of our labour, for AMD64 can be found on the asm branch on github:

http://github.com/wbhart/bsdnt/tree/asm

This is not a proper revision of bsdnt, but a heavily hacked version, just to demonstrate the principles.

In the nn.c file is some inline assembly code which implements the nn_add_mc function in x64 assembler. The format used is Intel format assembly code, instead of the AT&T syntax gcc normally uses for inline assembly. We made a change to the makefile so that gcc uses the Intel format pervasively. Note that many revisions of gcc miscompile bsdnt in this mode, due to bugs in gcc; gcc versions 4.4.1 and 4.4.4 are known to work, however.

At this point we haven't unrolled the loop (repeated the contents of the loop over and over to save some time on loop arithmetic). But the results are still pretty nice.

The straight C version we had before took about 12 cycles per word (with gcc 4.4.1 on an AMD Opteron K10 machine). With this new assembly code it takes about 3 cycles per word.

With loop unrolling we might hope to halve that again, but this messy optimisation will have to wait, otherwise work on the functionality of the library will slow right down.

We also wrote some assembly code for Core 2 machines. This runs in about 4.5 cycles per word. At the moment this is not committed anywhere, but you can find a copy here:

http://groups.google.co.uk/group/bsdnt-devel/msg/15e2883d94014196?hl=en

The important thing to realise about assembly language is it is highly architecture dependent. Our AMD64 code is actually slower than C on an Intel Core 2 machine! The assembly code needs to be prepared for the machine in question. Moreover, this code won't work at all on Windows 64 or on any 32 bit machine!

At some point we'll introduce a configuration script which will select the correct assembly code for the architecture being used.

To actually time this code, we simply hacked t-nn.c to run only one of the nn_add_m tests, with the nn_add_m executed 1000 times and the actual check that the code works turned off.

We set the number of words being added in each nn_add_m to 24, as the test machine we used was 2.4GHz. This means the number of seconds the test takes to run is approximately the number of cycles per word. This is of course a temporary hack, just to illustrate working assembly code (the test will pass if you remove the loop that calls nn_add_m a thousand times instead of once).

Previous article: v0.7 - divrem1simple
Next article: v0.8 - divrem1_preinv

Monday 13 September 2010

BSDNT - v0.7 divrem1_simple

It's time for our first division function. This will again be a linear function
in that it divides a multiprecision integer by a single word and returns the
(single word) remainder.

This is the first time we can do substantially better than a simple for loop
(other than when we coded nn_cmp and other functions, which allowed early exit).

To get us going, we'll write a really dumb division function first. It'll just
use C's "/" and "%" operators in a for loop.

C compilers are supposed to optimise simultaneous occurrences of / and % in
the same piece of code to use a single processor instruction where available.
However, even with that optimisation, our function will be exceedingly slow.
We'll see why that is when we implement a not-stupid divrem1, next time around.

To emphasise its stupidity, we'll call our exceedingly dumb division function
nn_divrem1_simple. Perhaps we can use it in our test code to compare against
when we write the more sophisticated version.

Hopefully our naming conventions won't be defeated by this function, which
proceeds left to right (or most significant word first).

One doesn't usually need to write the remainder down immediately following
the quotient (though one could imagine a function operating that way).
However, it is convenient to be able to accept a "remainder-in", essentially
thought of as an extra limb of the dividend, so that divrem can be chained.

In this way, divrem1 is most similar to right shift in semantics (the latter
is just division by a power of two, after all).

Now, if we blindly try to implement this function, we quickly discover a
problem. Even if we work with double words, a division of a two limb quantity
by a single limb can be problematic.

We can illustrate this using decimal digits, instead of machine words. Suppose
we divided 94 by 7. The resulting quotient is 13 with remainder 3 (I hope!).
The point is, the quotient takes up two "words" 1 and 3, not just a single
word.

We can get around it by reducing the top word first. We have 9 divide 7 is 1
remainder 2. Thus the first word of our quotient is 1. Now we are left with
24 divide 7. This time the quotient is 3 with remainder 3 and everything is
fine. Note the remainder, 3, is less than 7.

The critical thing to note is that once we get things started, everything
works fine from then on. The top word of each double word is always reduced
modulo the divisor after the first iteration.

So the almost trivial algorithm almost works, except for the lead-in, where
we have to get things right. After that, we can divide our dividend word on
word until we reach the bottom, where we'll have a single limb remainder.

Now as mentioned, it is also convenient to supply a "carry-in". This acts like
an additional limb of the dividend. Assuming this carry-in was reduced mod
the divisor, we find that we end up with as many words in our quotient as we
had in our original dividend.

Clearly however, if the carry-in is not reduced, we will end up with yet
one more quotient limb! However, there is an easy way to avoid this problem.
We simply decide to restrict the carry-in to be reduced modulo the divisor!

This is not a problematic restriction because we will still be
able to chain our division functions together with this restriction (so long
as we use the same divisor throughout).

Anyhow, with this restriction, the quotient will have precisely the same number
of limbs as the dividend, and since we'll start with the carry-in, not the top
limb of the dividend, and that carry-in will already be reduced, there is no
bootstrapping iteration required, i.e. the problem of a non-reduced first limb
as explained above, simply doesn't exist.

In this way we avoid any complex feed-in and wind-down code, and as for all
the functions so far, the length m can be zero. The result is satisfyingly
symmetric, even if it is terribly slow.

In the test code, if q = a / c with remainder r, we can check that a = q*c + r.
We can use the functions we have already implemented, including our mul1_c code
to verify this.

We also check that passing in a reduced carry-in results in a correct result,
by chaining two divisions together.

This will do for today, as a more serious implementation of divrem1 will
require much more serious code. We'll take a look at that next time.

The github branch for this post is here: v0.7


Sunday 12 September 2010

BSDNT - v0.6 addmul1, submul1

At last we come to addmul1 and submul1. After implementing these, we would
already be able to implement a full classical multiplication routine, using the
naive O(n^2) multiplication algorithm.

It turns out that addmul1 and submul1 are quite similar to mul1. Instead of
just writing out the result, we have to add it to, or subtract it from the
first operand.

These are really combined operations, i.e. combined mul1 and add_m or sub_m
respectively.

The reason we introduce these is that the processor has multiple pipelines
and merging two operations like this gives us the chance to push more
through those pipelines. We'll add more combined operations later on.

The fact that we can add b[i]*c, a[i] and ci and not overflow a double word
needs some justification.

Clearly b[i] and c are at most B-1. Thus b[i]*c is at most B^2-2B+1. And
clearly a[i] and ci are at most B-1. Thus the total is at most
B^2-2B+1 + (B-1) + (B-1) = B^2-1, which just fits into a double word.

The dword_t type allows us to get away with adding all these in C as though
we don't care about any overflow (in fact we know there is none). Of course
that doesn't make it efficient. This function is still a candidate for
assembly optimisation and loop unrolling later on.
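
A minimal C sketch of such an addmul1 (hypothetical name and signature,
32-bit words for illustration) might look like this; by the bound just
computed, the double-word sum cannot overflow:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

typedef uint32_t word_t;
typedef uint64_t dword_t;
#define WORD_BITS 32

/* a += b * c, m limbs, with carry-in ci; returns the carry out.
   b[i]*c + a[i] + ci <= B^2 - 1, so the dword_t sum never overflows. */
word_t nn_addmul1_c(word_t *a, const word_t *b, size_t m,
                    word_t c, word_t ci)
{
    for (size_t i = 0; i < m; i++) {
        dword_t t = (dword_t) b[i] * c + a[i] + ci;
        a[i] = (word_t) t;
        ci   = (word_t) (t >> WORD_BITS);
    }
    return ci;
}
```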

The submul1 function is essentially the same, so long as we take care to
perform the operations in the right order.

The naming of our functions causes us slight difficulties here. We'd ideally
like versions of addmul1 and submul1 which write the carry out to the high
limb and versions which add/sub it from the high limb. We opt for the latter
for now. After all, if one were accumulating addmuls, this is what one would
require most often.

Maybe someone reading this has a better idea how to handle these.

We add a test which checks that doing two addmuls in a row is the same as
doing a single addmul with the multiply constant equal to the sum of the
original two. We also add a test to check that chaining addmuls works. We
add similar tests for submul.

We delay coding up a quadratic multiplication basecase as there are a few more
linear functions to work on, most notably the various kinds of division and
remainder functions. These are all interesting functions to write.

The github repo for this version is here: v0.6

Previous article: v0.5 add, sub, cmp
Next article: v0.7 divrem1_simple

Saturday 11 September 2010

BSDNT - v0.5 add, sub, cmp

It's time we added a full add and sub routine. These perform addition and
subtraction with possibly different length operands. For simplicity, we
assume the first operand is at least as long as the second. This saves us
a comparison and a swap or function call.

Add is simply an add_m followed by an add1 to propagate any carries right
through to the end of the longest operand. A similar thing holds for sub.

We'd like to code these as macros; however, we'd also like to get the return
value. Thus, we use a static inline function, which the compiler has the
option of inlining if it wants, saving the function call overhead.
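
As a sketch (hypothetical names, 32-bit words), the static inline function
might be structured like this, with the add_m loop followed by the
add1-style carry propagation:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

typedef uint32_t word_t;
typedef uint64_t dword_t;
#define WORD_BITS 32

/* a = b + c + ci where b has bm limbs, c has cm limbs, bm >= cm;
   returns the carry out of the top limb */
static inline word_t nn_add_c(word_t *a, const word_t *b, size_t bm,
                              const word_t *c, size_t cm, word_t ci)
{
    size_t i;
    for (i = 0; i < cm; i++) {            /* the add_m part */
        dword_t t = (dword_t) b[i] + c[i] + ci;
        a[i] = (word_t) t;
        ci   = (word_t) (t >> WORD_BITS);
    }
    for (; i < bm; i++) {                 /* the add1 part: carry only */
        dword_t t = (dword_t) b[i] + ci;
        a[i] = (word_t) t;
        ci   = (word_t) (t >> WORD_BITS);
    }
    return ci;
}
```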

We add tests for (a + b) + c = (a + c) + b where b and c are at most the same
length as a. We also check that chaining an add with equal length operands,
followed by one with non-equal operands, works as expected. We add similar
tests for subtraction too.

The next function we add is comparison. It should return a positive value
if its first operand is greater than its second, a negative value if it is
the other way around, and zero if the operands are equal.

As for equal_m and equal, we introduce different versions of the function
for equal length operands and possibly distinct operands.

The laziest way to do comparison is to subtract one value from the other
and see what sign the result is. However, this is extremely inefficient
in the case that the two operands are different. It's very likely that
they already differ in the most significant limb, so we start by checking
limb by limb, from the top, until we find a difference.
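
For equal length operands, the limb-by-limb scan from the top might look
like this (a sketch with hypothetical names):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

typedef uint32_t word_t;

/* returns > 0 if {a, m} > {b, m}, < 0 if less, 0 if equal */
int nn_cmp_m(const word_t *a, const word_t *b, size_t m)
{
    while (m--) {                 /* start from the most significant limb */
        if (a[m] != b[m])
            return a[m] > b[m] ? 1 : -1;
    }
    return 0;
}
```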

Various tests are added. We test that equal things are equal, that operands
with different *lengths* compare in the right order, and operands with
different *values* but the same length compare in the right order.

We also take the opportunity to do a copy-and-paste job from our cmp test
code and quickly generate some test code for the nn_equal function, which
didn't have test code up to this point.

From the next section onwards, we will only deal with one or two functions
at a time. All the really simple functions are now done, and we can move on
to more interesting (and useful) things.

The github branch for this post is here: v0.5

Previous article: 0.4 - add1, sub1, neg, not
Next article: v0.6 addmul1, submul1

Friday 10 September 2010

BSDNT - v0.4 add1, sub1, neg, not

Today we introduce some more convenience functions, neg and not, and we
also add the functions add1 and sub1.

All of these functions operate slightly differently to the ones we have
introduced so far.

The functions add1 and sub1 simply add a single word to a multiprecision
integer, propagating any carries/borrows all the way along.

The main loops of add1 and sub1 need to stop if the carry becomes zero. This
is for efficiency reasons. In most cases when adding a constant limb to a
multiprecision integer, only the first limb or two are affected. One doesn't
want to loop over the whole input and output if that is the case.

However, we must be careful, as in the case where the input and output are
not aliased (at the same location), we still need to copy the remaining
limbs of the input to the output location.
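
Putting both points together, a sketch of add1 (hypothetical names, 32-bit
words) stops once the carry dies, then copies any remaining limbs if the
operands are not aliased:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

typedef uint32_t word_t;
typedef uint64_t dword_t;
#define WORD_BITS 32

/* a = b + c where c is a single word; returns the carry out */
word_t nn_add1(word_t *a, const word_t *b, size_t m, word_t c)
{
    size_t i;
    for (i = 0; i < m && c != 0; i++) {   /* stop when the carry dies */
        dword_t t = (dword_t) b[i] + c;
        a[i] = (word_t) t;
        c    = (word_t) (t >> WORD_BITS);
    }
    if (a != b)                           /* not aliased: copy the rest */
        for (; i < m; i++)
            a[i] = b[i];
    return c;
}
```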

We add tests that (a + c1) + c2 = (a + c2) + c1 and do the same thing for
subtraction.

We also check that chaining of add1's and chaining of sub1's works. Until
we can generate more interesting random test integers this test doesn't
give our functions much of a workout. We eventually want to be able to
generate "sparse" integers, i.e. integers with only a few binary 1's or a
few binary 0's in their binary representation. The latter case would be
interesting here as it would test the propagation of carries in our add1
and sub1 functions. We'd also eventually like to explicitly test corner
cases such as multiprecision 0, ~0, 1, etc.

A final test of add1/sub1 that we add is a + c1 - c1 = a.

The not function is logical not. It complements each limb of the input. It
is a simple for loop.

The neg function is twos complement negation, i.e. negation modulo B^m. In
fact, twos complement negation is the same as taking the logical not of the
integer, then adding 1 to the whole thing. The implementation is similar to
add1, except that we complement each limb after reading it, but before adding
the carry.

One difference is that we still need to complement the remaining limbs after
the carry becomes zero, regardless of whether the input and output are
aliased.

The carry out from (neg a) is notionally what you would get if you were
computing 0 - a. In other words, the carry is always 1 unless a is 0. In
order to allow chaining, neg must notionally subtract the carry-in from the
total.
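
Ignoring chaining for simplicity, a sketch of neg (hypothetical names,
32-bit words) complements each limb as it goes, keeps adding while the
carry survives, and still complements the remaining limbs afterwards:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

typedef uint32_t word_t;
typedef uint64_t dword_t;
#define WORD_BITS 32

/* a = -b mod B^m; returns 1 unless b is zero (the borrow from 0 - b) */
word_t nn_neg(word_t *a, const word_t *b, size_t m)
{
    word_t c = 1;                          /* the +1 of twos complement */
    size_t i;
    for (i = 0; i < m && c != 0; i++) {
        dword_t t = (dword_t) (word_t) ~b[i] + c;
        a[i] = (word_t) t;
        c    = (word_t) (t >> WORD_BITS);
    }
    for (; i < m; i++)                     /* must still complement these */
        a[i] = ~b[i];
    return 1 - c;                          /* c survives only when b = 0 */
}
```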

We test that (not (not a)) = a and that neg is the same as a combination of
not and add1 with constant 1. We can also test that adding -b to a is the
same as computing a - b. And finally we can test chaining of neg, as always.

I wonder what the most interesting program is that we could implement on top
of what we have so far. Tomorrow we add a few more convenience functions
before we start heading into the more interesting stuff.

I think I may have solved the test framework problem. More on that when we
get to v0.11 and v0.12.

The github branch for this article is here: v0.4

Previous article: v0.3 - copy, zero, normalise, mul1
Next article: v0.5 - add, sub, cmp

Thursday 9 September 2010

BSDNT - v0.3 copy, zero, normalise, mul1

In this section we add a few more trivial convenience functions, nn_copy,
nn_zero and nn_normalise.

The first two are simple for loops which we implement in nn.h.

A pair {c, m} will be considered normalised if either m is zero
(representing the bignum 0) or the limb c[m-1] is nonzero. In other words,
the most significant word of {c, m} will be non-zero.

The only tricky thing to do with nn_normalise is to make sure nothing goes
wrong if all the limbs are zero.
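
A sketch (hypothetical name): strip zero limbs from the top, making sure m
reaching zero terminates the loop before c[m-1] is ever read:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

typedef uint32_t word_t;

/* returns the normalised length of {c, m} */
size_t nn_normalise(const word_t *c, size_t m)
{
    while (m != 0 && c[m - 1] == 0)  /* m == 0 checked first: all-zero safe */
        m--;
    return m;
}
```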

We add numerous tests of these new functions. We zero an integer and check
that it normalises to zero limbs. We copy an integer and make sure it
copies correctly. We also copy an integer, then modify a random limb and
then check that it is no longer equal. This provides a further check that
the nn_equal function is working correctly.

Finally, we do a more thorough test of nn_normalise by zeroing the top few
limbs of an integer then normalising it. We then copy just this many limbs
into a location which has been zeroed and check that the new integer is
still equal to the original *unnormalised* integer. This checks that
nn_normalise hasn't thrown away any nonzero limbs.

Next we add a mul1 function. As for all of the linear functions we have
been adding, this will be very slow in C. This time we need to retrieve
the top limb of a word-by-word multiply. Again, ANSI C doesn't provide this
functionality, so we use the gcc extensions that allow us to upcast to a
double word before doing the multiply. Again the compiler is supposed to
"do the right thing" with this combination.
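
The double-word upcast looks like this in a sketch (hypothetical names;
here word_t is 32 bits so the standard uint64_t serves as the double word,
whereas a 64-bit build would need gcc's __uint128_t):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

typedef uint32_t word_t;
typedef uint64_t dword_t;
#define WORD_BITS 32

/* a = b * c + ci, m limbs; returns the carry (top) word */
word_t nn_mul1_c(word_t *a, const word_t *b, size_t m, word_t c, word_t ci)
{
    for (size_t i = 0; i < m; i++) {
        dword_t t = (dword_t) b[i] * c + ci;  /* upcast before multiplying */
        a[i] = (word_t) t;                    /* low word */
        ci   = (word_t) (t >> WORD_BITS);     /* high word */
    }
    return ci;
}
```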

We add a test to check that a * (c1 + c2) = a * c1 + a * c2. We also add
a test for chaining of mul1's together, passing the carry-out of one to
the carry-in of the next.

The github branch for this post is here: v0.3

Wednesday 8 September 2010

BSDNT - v0.2 subtraction, shifting

In today's revision we add a subtraction function and left and right
shift functions.

Subtraction is almost trivial after addition. We copy what we did for
addition. The main change is that carry-ins become borrows. We have to
negate our borrows at each step, so that we are always dealing with
positive quantities. The returned borrow is zero or positive.

In the test code for subtraction, we can add a test which checks that
a + b - b = a. This is a further test of the addition function as well.

Left shift is straightforward. Again we make use of the dword_t type
which is twice as wide as a word_t. This allows us to do a single shift
(which the compiler has the option of converting to two shifts if required).

We shift each limb by the specified bits, then hang onto the bits that were
shifted out the top, to be added back in at the next iteration.
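
A sketch of that shift-left loop (hypothetical names, 32-bit words, and
0 < bits < WORD_BITS assumed):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

typedef uint32_t word_t;
typedef uint64_t dword_t;
#define WORD_BITS 32

/* a = b << bits, m limbs, with ci the bits shifted in at the bottom;
   returns the bits shifted out the top */
word_t nn_lsh_c(word_t *a, const word_t *b, size_t m,
                unsigned bits, word_t ci)
{
    for (size_t i = 0; i < m; i++) {
        dword_t t = (dword_t) b[i] << bits;   /* one double-word shift */
        a[i] = (word_t) t | ci;
        ci   = (word_t) (t >> WORD_BITS);     /* hang onto the top bits */
    }
    return ci;
}
```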

For right shift we only have three functions instead of four. It doesn't make
sense to write out the "carry-out" at the bottom of a right shift. Instead
we do a "read-in" nn_r_rsh version which is the exact opposite of nn_s_lsh. It
reads the "carry-in" from an (m + 1)-th limb (which it shifts appropriately).

The nn_r_rsh function also returns the bits that are shifted out the bottom end.
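
A sketch of the matching right shift (hypothetical names; 0 < bits <
WORD_BITS assumed, and ci holds the incoming top bits already in position):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

typedef uint32_t word_t;
typedef uint64_t dword_t;
#define WORD_BITS 32

/* a = b >> bits, m limbs, working from the top limb down;
   returns the bits shifted out the bottom, left-justified in a word */
word_t nn_rsh_c(word_t *a, const word_t *b, size_t m,
                unsigned bits, word_t ci)
{
    for (size_t i = m; i-- > 0; ) {
        /* bits falling out the bottom, shifted into the top of a word */
        word_t lo = (word_t) ((dword_t) b[i] << (WORD_BITS - bits));
        a[i] = (b[i] >> bits) | ci;
        ci   = lo;
    }
    return ci;
}
```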

We can provide an extra test for addition now, by checking that a + a = a << 1.
We also check that (a << sh1) >> sh1 gets us back to a.

Finally, we test that we can chain our carry-in carry-out functions together. For
example, I should be able to shift the first few limbs of an integer, feed the
carry-out as a carry-in into the next shift, which will then shift the rest of
the integer. This should give me the same result as if I had shifted the whole
integer in one go.

We also do similar carry-in/carry-out chaining with the addition and subtraction
functions.

We are going to need to set up a proper test framework at some point, but for
now, whilst we are still working on implementing "linear" functions, I'm simply
going to keep duplicating the test functions over and over and making the
minor modifications required to test the new functions we are adding.

Here is v0.2 on github for this post.

Tuesday 7 September 2010

BSDNT v0.1 - basic types and addition

We start with v0.1 of bsdnt. See the github branch:


Firstly, we set up a bit of infrastructure before we add our first function,
the addition function.

We begin that with some basic types in nn.h:

word_t - represents a machine word (either 32 or 64 bit)
dword_t - twice the size of a machine word (to handle carries from addition)

nn_t - our basic multiprecision integer type, consisting of an array of words
nn_src_t - same as an nn_t, but declared const, so it can't be modified, used
for input/source parameters
len_t - the length in words of a multiprecision integer. We don't use a struct
as we'd have to pass it by reference, which would be less efficient due
to having to dereference the pointer all the time

rand_t - a random state, unused for now. This will ensure thread safety of our
random functions later on.

We also define WORD_BITS, the number of bits in a machine word. To simplify
things, we'll informally let B refer to the number 2^WORD_BITS in what follows.

Our basic bignum type is a pair {c, len} consisting of an nn_t c and a len_t len
counting the number of limbs of our bignum. At this stage we restrict len to
being non-negative, and we allow leading zero limbs in our representation of
our bignums. This allows for twos complement arithmetic with fixed length
bignums (arithmetic modulo B^len for some fixed len).

The first function we write is a simple addition function. This needs to be
pretty sophisticated. In particular, we want to be able to pass in a carry, so
that addition functions can be chained together, or on the end of other functions.

The first addition function will add two operands of the same length (number of
machine words), which we signify with an m at the end of the function name.

We signify that the function takes a carry-in by appending c to the function name.

We will also want the function to pass a carry-out. It'll return this carry out as a
word_t.

The function nn_add_mc is defined in nn.c. Notice how we cast each word to a
dword first, then do the addition, then cast back to a word for the low word of
the sum, and shift by WORD_BITS to get the high word of the result.
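
For illustration, a sketch of that pattern with 32-bit words (so plain
uint64_t serves as the double word; the real code uses word-size-dependent
types):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

typedef uint32_t word_t;
typedef uint64_t dword_t;
#define WORD_BITS 32

/* a = b + c + ci, all of m limbs; returns the carry out */
word_t nn_add_mc(word_t *a, const word_t *b, const word_t *c,
                 size_t m, word_t ci)
{
    for (size_t i = 0; i < m; i++) {
        dword_t t = (dword_t) b[i] + c[i] + ci;  /* cast up, then add */
        a[i] = (word_t) t;                       /* low word of the sum */
        ci   = (word_t) (t >> WORD_BITS);        /* high word is the carry */
    }
    return ci;
}
```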

The compiler is *supposed* to optimise this combination. Actually, it makes poor
use of the processor carry flag, and screws up the loop, which is why C is not
the best language for a bignum library. But for now it is the best we can do.
Later on we'll add assembly optimisations to our library.

We could unroll the loop (write the contents four times in a row, say, to amortise
the cost of the loop counter arithmetic over four iterations). However, we are
after a cleanly coded library, so we resist this temptation until we write some
assembly code.

Now we introduce some extra macros in nn.h for our addition function. These
represent various permutations of allowing a carry in or not, and one more
interesting macro, nn_s_add_m. This macro checks whether the result
and first input are aliased. If so (in other words, we have a = a + c), then the
carry-out gets *added* to the correct limb of a.

If a and b are not aliased (we have a = b + c), then the carry is *written*
and not added, to the correct limb of a. This macro will make coding cleaner
later on.

Notionally, the extra 's' in the function name stands for 'store'.
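
A sketch of how such a macro might be written (hypothetical; nn_add_m here
is shown as a plain function so the macro can reuse it):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

typedef uint32_t word_t;
typedef uint64_t dword_t;
#define WORD_BITS 32

/* a = b + c, all of m limbs; returns the carry out */
word_t nn_add_m(word_t *a, const word_t *b, const word_t *c, size_t m)
{
    word_t ci = 0;
    for (size_t i = 0; i < m; i++) {
        dword_t t = (dword_t) b[i] + c[i] + ci;
        a[i] = (word_t) t;
        ci   = (word_t) (t >> WORD_BITS);
    }
    return ci;
}

/* if a aliases b the carry is *added* to a[m], otherwise *written* */
#define nn_s_add_m(a, b, c, m)                 \
    do {                                       \
        if ((a) == (b))                        \
            (a)[m] += nn_add_m(a, b, c, m);    \
        else                                   \
            (a)[m] = nn_add_m(a, b, c, m);     \
    } while (0)
```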

We use macros for these functions to stop code duplication, and because macros
prevent extra function call overhead, and sometimes offer an opportunity for the
compiler to optimise further.

Note that C macros are not hygienic. In fact they are just text replacement macros.
Thus we need to be careful about naming the macro variables so they are unlikely
to conflict with symbols passed in by the caller. In this case we don't introduce
any variables inside our macros, so this precaution is not necessary.

We also need to take care when using the macro parameters inside the macro. In
some cases it is necessary to put parentheses around the parameters in case
the caller passes in a complex expression which combines badly with expressions
inside our macro (e.g. due to operator precedence). Where there can be no
confusion when substituting a macro, we do not need to do this.
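
A classic example of the operator-precedence hazard (hypothetical macros):

```c
#include <assert.h>

#define BAD_DOUBLE(x)  x * 2           /* unparenthesised parameter */
#define GOOD_DOUBLE(x) ((x) * 2)       /* parameter safely wrapped */

/* BAD_DOUBLE(a + 1) expands to a + 1 * 2, i.e. a + 2, not (a + 1) * 2 */
```

For instance, BAD_DOUBLE(3 + 1) evaluates to 5, while GOOD_DOUBLE(3 + 1)
gives the intended 8.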

Next we will need to add some random functions, to generate random integers to add
in our test code. In nn.c, we add functions to generate a random word of data, a
random integer up to a limit and a random multiprecision integer.

The random functions take a random state as a parameter. This is essential to
ensure that we can make our random functions threadsafe at a later date. For now
the random state and associated init and clear functions do nothing.

The random word function generates two random integers slightly bigger than half
a word and stitches them together. This is done using an el cheapo pseudorandom
function which works modulo a prime slightly bigger than half a word. We take some
care not to always end up with even output, or output divisible by a given small
prime.

The test code is in t-nn.c. We add a comparison function nn_equal_m to nn.h, to
test equality of two mp integers, to be used in our test code, and some other
convenience functions. We generate random length mp integers, and check an
identity, in this case the associative law for addition. This is not a very
sophisticated test, and we'll have to improve our test code at a later date.

For now, it passes, and we move on to grander things.

Previous article: BSDNT-introduction

BSDNT - Introduction

Over the next few weeks to months, I'll be blogging about a new project of mine, called bsdnt.

The aim of this project is to develop a cleanly coded bignum library with (eventually) reasonable performance and a BSD license. I'll be developing this code for a while in a blog/tutorial style.

First some organisational matters.

For those who wish to follow along with the code as it is written, it can be found at:


If you have a github account, you can clone my project on github.

To make a clone of my repository on your local machine, you can do:

git clone git://github.com/wbhart/bsdnt.git bsdnt

The various branches, as I add them, will be called v0.1, v0.2, etc.

To switch branches, simply do, for example:

git checkout origin/v0.1

To make your own branch of this, to mess around in, do:

git checkout -b mybranch

Antony Vennard set up a Google Groups discussion list. You can access that
here:


A word on licensing: if you substantially copy my code, then I appreciate you
retaining the copyright, however if you write substantially your own code,
merely being "inspired" by my code, I am fine with you not adding my copyright to your project. But still let me know about your code, because I'd be interested in seeing it.

To compile the project, you will need a recent version of gcc. I'm using
gcc 4.4.1, and some of the features may not work with versions earlier than gcc 4.4.

You can proceed to the first blog about the code here: