Sunday, 20 November 2016

New Computer Algebra System : OSCAR

On Friday we found out the exciting news that the German funding agency (DFG) has approved a massive transregional grant that the computer algebra group here at the University of Kaiserslautern has been heavily involved with. There was much celebration in the Department of Mathematics where I work when we received the news!

The official announcement was made on the DFG website on Monday. If you can read German, it is here.

The grant is called "Symbolic Tools in Mathematics and their Application" and is a collaboration between multiple universities here in Germany: Aachen, Kaiserslautern and Saarbrücken, with additional external people at Siegen, Berlin, Stuttgart and Tübingen, with a total of about twenty extremely experienced PIs. The grant is initially for four years, but can be renewed up to twice, for a total of 12 years (assuming we get renewed).

The exciting news from my point of view is project B1 of the grant, which we have internally been calling the Central Software Project. This is a collaboration to produce a visionary new Open Source computer algebra system. It will be called OSCAR : Open Source Computer Algebra Resource and is to be built on four "cornerstones", namely the GAP system, Singular, Polymake and ANTIC (the internal name for a new project I've been involved with, under Claus Fieker; see below).

The Principal Investigators


Before I get to any details, let me introduce, in no particular order, the PIs for the Central Software Project.


Prof. Mohamed Barakat is behind the HomAlg project. HomAlg is a GAP package that allows one to do homological algebra from a categorical point of view. (You may also know Mohamed Barakat's students Sebastian Gutsche and Sebastian Posur, who worked on HomAlg/CAP.)


Dr. Frank Lübeck is one of the main authors of the Chevie project, which is a computer algebra project for computing with generic character tables of finite groups of Lie type, Coxeter groups, Iwahori-Hecke algebras and others. It makes use of the GAP system. Over the years, Frank has been a major contributor to GAP.


Prof. Michael Joswig is one of the main authors of the Polymake system. This is an Open Source project for research into polyhedral geometry. Michael Joswig is the Einstein Professor for Discrete Mathematics/Geometry at TU Berlin.

Prof. Claus Fieker is one of the original authors of the KASH/KANT system, for algebraic number theory, which became part of Magma. He is also one of the main authors of Hecke, a new Open Source algebraic number theory project based on Nemo. (Some people may also know his postdoc, Tommy Hofmann, who is a coauthor with him of Hecke and of Nemo.)


Prof. Wolfram Decker is the coordinator of a prior DFG Priority Programme on computer algebra, which is just concluding. Wolfram Decker is one of the people directly responsible for directing the Singular project and is a key member of the Algebra, Geometry and Computer Algebra group in Kaiserslautern.

In this blog, I'm only going to speak about the Central Software Project. But it is important to realise that the grant is much, much larger than this one project (it is just one of about 23 funded, interrelated projects in the grant). The Central Software Project should be thought of as providing the technical infrastructure for a very large mathematics grant in the area of computer algebra, although there are other projects within the grant that will contribute in a significant way to OSCAR. I'll maybe talk about some of them in later blogs.

The entire grant application had more than 100 pages in the preproposal and 400 pages in the full proposal, and took years to put together! At the final referee stage, the grant had 10 eminent referees, most from overseas.

What is OSCAR?


So what is OSCAR all about? Again, I can only speak about it in terms of what I understand from the PIs, and my knowledge is greater the closer one gets to the part I will be directly involved with.

The basic idea is to develop a new Computer Algebra System, covering group theory, algebraic geometry/commutative algebra, polyhedral geometry and number theory. Our aim is to directly compete with the closed source system, Magma, hence the focus on many areas where Magma has been particularly influential.

All of the cornerstone systems (except ANTIC, which is quite new) are significant projects with communities in their own right, developed over decades, and it is our hope to combine the strengths of all these systems into a single system which is highly integrated and easy to use. And we hope to do all this without significant disruption to those existing ecosystems. OSCAR will be a single system, built on all four of those existing systems, which it tightly integrates.

One major motivation for OSCAR is mathematical applications where it is necessary to do cross-disciplinary research, for example, where one might need a multigraded, equivariant Cox ring of a toric variety over a number field, or graphs of groups in division algebras, or matrix groups over a polynomial ring over a number field. To perform such exotic calculations, it is quite often not enough to simply interface systems such as GAP, Singular and ANTIC, which separately provide either matrix groups, polynomial rings or number fields, say. One needs a very tight integration, especially if one wants such computations to be practical.

Some of us believe that it is the monolithic integration of Magma that has allowed it to be successful in the areas we hope to impact. Therefore, we in the Open Source community need to get really smart about integration in order to compete.

I personally will be working on both low and mid-level integration of the four cornerstone systems, using the programming language Julia. But there will be other efforts too, for example, there will be a coordinated serialisation effort, a parallelism project, new packages, interfaces and technologies in the cornerstone systems, and so on.

There will be a number of people other than myself hired to work on OSCAR. Let me mention two that will be associated with the Central Software Project specifically.

Dr. Reimer Behrends, who shares an office with me, and who is one of the main architects and authors of HPC-GAP and HPC-Singular, will be working on aspects of parallelisation in OSCAR.

As mentioned already, Sebastian Gutsche will be working on both GAP and Polymake integration with the other cornerstones. He will be doing this using Julia, and in other ways.

How am I involved in this?


So, the rest of what I'm going to write is just about the part I have been involved with. The OSCAR website, when it goes live, will be the best source of information on the direction the overall project is going.

Some of you may have heard of my work on the Nemo project (joint work with Claus Fieker, Tommy Hofmann, Fredrik Johansson and others). This is a computer algebra package written in the Julia programming language. The main idea behind this, from my perspective, is to demonstrate two things:

1) It is possible to integrate multiple packages, such as Flint, Arb, Singular, Hecke, Antic (the C library -- not the ANTIC cornerstone mentioned above) and so on, in an efficient way, using the Julia programming language. And we've had some early success; already, we have demonstrated that it is possible to build Singular polynomial rings over coefficient rings provided by Nemo itself, or by Antic or Hecke, etc.

2) It is possible to develop extremely efficient implementations of generic algorithms over generic rings (rather than the specific rings that we have highly optimised implementations for, e.g. in the Flint system). This is possible in the Julia language, due to its innovative type system and JIT (Just-In-Time) compiler.

(It should be noted that Polymake has a very sophisticated form of runtime specialisation, and what may be thought of as an early kind of JIT compiler.)

I personally believe that high level algorithms implemented in an old-style interpreter are sometimes not enough to get a practical implementation for some research questions. Sometimes the difference between a practical implementation and failure is not a bottleneck in low level C code, but a generic algorithm implemented in a high level, interpreted language. But unfortunately, we don't have enough skilled, low-level developers to reimplement such things in a low-level language every time they appear!

So one of the things I am particularly interested in, is applications where a "mid-level" language like Julia can be used to speed up generic algorithms. This has been one of the ideas behind my work on Nemo, so far, and the reason why I chose to develop it in the Julia programming language. Julia facilitates high-level constructs, with low-level performance.

Already, on top of Nemo, Prof. Claus Fieker and Dr. Tommy Hofmann have built the Hecke system, also written in the Julia programming language, for doing computations in orders of number fields and class group computations. This is a culmination of the efforts of Claus Fieker's project under the Priority Programme, which I mentioned above.

What is the ANTIC cornerstone?


So with all of that background in mind, I can now describe the ANTIC cornerstone of the Central Software Project. ANTIC is the internal name we have had for the combination of GMP/MPIR, MPFR, Flint, Antic, Arb, Nemo and Hecke, which we have been busy integrating and, in some cases, developing under the Priority Programme. It will now become one of the four cornerstones of the Central Software Project, and will essentially be responsible for providing algebraic number fields and related things to the new Computer Algebra System, OSCAR.

We now begin the task of doing our part to help with the integration of ANTIC with the other cornerstones of the OSCAR system, namely Singular, Polymake and GAP.

There are already independent projects underway to integrate Flint and Antic (the C library) into GAP as packages, which we aim to support in whatever way we can.

And within our collaboration, Michael Joswig has already suggested projects to me that will require integration of Antic with Polymake, and work is already quite advanced in the direction of integrating Singular and Nemo/Hecke. And there have already been extensive discussions about future integration of ANTIC with GAP, probably via Nemo and/or Hecke.

The ANTIC project achieves its integration through the Julia programming language, and some of its generic algorithms are implemented directly in Julia. We aim to contribute the expertise we have been developing here in Kaiserslautern in the Julia language to the wider integration of the four cornerstones in OSCAR.

Will you join us?


Obviously OSCAR is a massive undertaking. We hope that the mathematics community and Computer Algebra communities at large will support our efforts.

OSCAR is an Open Source project, and we hope that it will attract many contributors over the years and that it will support many Masters and PhD projects!

The really important thing is that OSCAR is motivated by a large number of serious mathematical projects across computer algebra. These applications will drive development and, I believe, lead to a high quality, well-tested system, with many substantial contributions and publications for a long time to come.

Join us if you are interested. In coming days we will very quickly be setting up websites and infrastructure. In the meantime, feel free to contact me or any of the other people you know who are involved with this project.

And join us in celebrating what is a fantastic investment in Open Source computer algebra, that we hope will be of great benefit for the mathematical community at large!

Further Information


I hope to keep people up-to-date with what I'm personally working on, through this blog. I'll also link to other press releases, blogs and articles about OSCAR and our grant, as I become aware of them. So follow this blog if you are interested in hearing more as things develop.

Here is the University of Kaiserslautern press release (German).

Tuesday, 22 May 2012

Reduced Binary Quadratic Forms

Over the past few weeks I've been writing some code for computing reduced binary quadratic forms:

(a, b, c) := ax^2 + bxy + cy^2

The discriminant of a form is d = b^2 - 4ac. The code I wrote works for d < 0.

Such a form is reduced if |b| <= a <= c, and b >= 0 whenever a = c or a = |b|.

You can see the code here:


There are two versions of the function qfb_reduced_forms.

Version 1


The first version works by iterating over all possible values "a" (it's a theorem that |b| <= a <= sqrt(|d|/3) and c <= (1 - d)/4) and searching for valid triples (a, b, c).

To find valid triples, we note that if d = b^2 - 4ac then d = b^2 mod 4a. So we simply solve this quadratic congruence and search for all roots "b" in the correct range.

To speed things up, we factorise all the different possible values "a" by sieving. This makes the square root code much faster, as it doesn't have to factorise 4a before computing square roots.

This approach requires softly O(|d|^{1/4}) primes. There are about O(|d|^{1/2}) forms on average, and indeed the run time is softly O(|d|^{1/2}) (subject to the Generalised Riemann Hypothesis -- needed to guarantee we can find quadratic nonresidues quickly enough for the Tonelli-Shanks modular square root algorithm).
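
To make the search concrete, here is a naive C sketch of the same enumeration (a hypothetical count_reduced_forms using plain trial division, not the actual qfb_reduced_forms code, which uses the sieving and modular square roots described above):

    #include <stdlib.h>

    /* Count reduced forms (a, b, c) of discriminant d < 0 by brute force. */
    long count_reduced_forms(long d)
    {
       long a, b, count = 0;

       for (a = 1; 3*a*a <= -d; a++)            /* a <= sqrt(|d|/3)         */
          for (b = -a; b <= a; b++)             /* |b| <= a                 */
          {
             long num = b*b - d;                /* must equal 4ac           */
             long c;

             if (num % (4*a) != 0)
                continue;
             c = num / (4*a);
             if (c < a)                         /* need a <= c              */
                continue;
             if ((a == c || a == labs(b)) && b < 0)
                continue;                       /* b >= 0 in the edge cases */
             count++;
          }

       return count;
    }

For d = -23, for example, this finds the three reduced forms (1, 1, 6), (2, -1, 3) and (2, 1, 3).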


Version 2


The second approach iterates over all possible values |b| (the same bound applies as for "a").

This time we factorise (b^2 - d)/4 for each "b" into products "ac". Again, we do this by sieving (this time quadratic sieving, which first finds square roots of "d" mod various primes "p"). 

However, we must sieve with primes "p" dividing values (b^2 - d)/4. In other words we need softly O(|d|^{1/2}) primes. On a 64 bit machine, this exhausts the precomputed primes at about |d| = 100000000, so this technique is a bit limited.

The advantage of this technique is that it is about 50% faster as implemented. It still takes time softly O(|d|^{1/2}) though (again subject to GRH).


Timings


Magma computes reduced binary quadratic forms, so I did two comparisons:

Comparison A) Compute all reduced forms for discriminants 0 > d > -1000000.

Comparison B) Compute all reduced forms for discriminants -1000000000 > d > -1000000100.

Here are the timings:

Comparison    flint/qfb    Magma
A             118s         947s
B             1s           120s

Sunday, 23 October 2011

BSDNT - interlude

You will notice that I have not worked on BSDNT for about a year now. Well, I'm thinking of restarting the project soon. I did complete two new revisions, v0.25 and v0.26, since I stopped blogging. The first of these added random functions for a single word. They generate single words which have an increased probability of triggering corner cases, e.g. by having a sparse binary representation. The second of these updates added a bsdnt_printf function. This is like printf but adds a %w format specifier for printing single words. There is also a %m for printing a len_t and %b for printing a bits_t.

I likely won't get much more done on BSDNT until early next year, but here is what I am planning:

1) I am tremendously grateful to Dr. Brian Gladman for his work on an MSVC version of the library. However, keeping up with this side of things turned out to be more of a struggle than I expected. Microsoft's MSVC doesn't support inline assembly for 64 bit x86. This means the entire plan of the MSVC version has to be different. It seems like far too much effort to combine both sets of code into a single library. I've therefore decided (sorry Brian) to ditch the MSVC code from my copy of BSDNT.

2) I wasn't happy with the interface of the division code. There are a few issues to consider.

The first issue is chaining. Obviously carry-in occurs at the left rather than the right. But for general division functions, should the carry-in be a single limb or multiple limbs? It seems like the remainder after division is going to be m limbs and so the carry-in should be also. It is not clear what is better here. Internally, the algorithms deal with just a single carry-in limb because they use 2x1 divisions to compute the quotient digits. Perhaps chaining just means that we consider the first digit of the remainder to be carry-in for the next division.

Another problem associated with this is that when reading the carry-in from the array, if the carry-in happens to be zero then the array entry may not exist in memory. This means the code has to always check if the carry-in should be zero or not before proceeding.

The other issue to consider is shifting for normalisation. One assumes that the precomputed inverse is computed from a normalised value (the first couple of limbs of the divisor). Now, it is not necessary to shift either dividend or divisor ahead of time. One can still perform the subtractions that occur in division, on unshifted values. One does need to shift limbs of the dividend in turn however, as the algorithm proceeds, in order to compute the limbs of the quotient. But this shifting can occur in registers and need not be written out anywhere. This is implemented, but currently every limb gets shifted twice. This can be cut down to a single shift.

3) The PRNGs are currently quite hard to read. They have numerous macros to access their context objects. They are extremely flexible, but possibly overengineered. I'd like to simplify their implementations somewhat.

4) The configure script is a little overengineered. The idea of supporting lots of compilers is nice. But in reality GCC should exist almost everywhere. The original concept of BSDNT was to use inline assembly for architecture support. This gets around issues with global symbol prefixes and wotnot. It also makes the library really simple to read. Even on Windows 64 there is MinGW64 and this is the only setup I aim to target in that direction.

I hope to deal with all of these issues before proceeding with development of BSDNT. Give me some time as I am busy until about the end of the year. However, I do plan to continue development of BSDNT after sorting out these issues, because I think that fundamentally what we have is very solid.

Saturday, 20 November 2010

BSDNT - v0.24 nn_bitset/clear/test and nn_test_random

In today's update we make a long overdue change to bsdnt, again to improve our testing
of the library.

We are going to add a function for generating random bignums with sparse binary
representation. We'll also add some other random functions based on this primitive.

Using test integers with sparse binary representation in our test code will push our
functions harder because it will test lots of corner cases such as words that are all
zero, in the middle of routines, and so on. As it is currently, we'd be extremely
lucky for the random word generator we've been using to generate an all zero word, or
a word with all bits set to one for that matter.

The first step to generating such test randoms is to write a function for setting a
given bit in an integer. This will be an nn_linear function despite it not actually
taking linear time. In fact, it will take essentially constant time. However, it is an
nn type function, so it belongs in an nn module.

The routine is very straightforward. Given a bit index b, starting from 0 for the least
significant bit of a bignum, it will simply use a logical OR to set bit b of the bignum.

Rather than construct a bignum 2^b and OR that with our number, we simply determine
which word of the bignum needs altering and create an OR-mask for that word.

Computing which word to adjust is trivial, but depends on the number of bits in a word.
On a 64 bit machine we shift b to the right by 6 bits (as 2^6 = 64), and on a 32 bit
machine we shift b to the right by 5 bits (2^5 = 32). This has the effect of dividing
b by 64 or 32 respectively (discarding the remainder). This gives us the index of the
word we need to adjust.

Now we need to determine which bit of the word needs setting. This is given by the
remainder after dividing b by 64 or 32 respectively, and this can be determined by
logical AND'ing b with 2^6-1 or 2^5-1 respectively. This yields a value c between 0 and
63 (or 31) inclusive, which is a bit index. To turn that into our OR-mask, we simply
compute 2^c (by shifting 1 to the left by c bits).

Now that we have our OR-mask and the index of the word to OR it with, we can update the
required bit. We call this function nn_bit_set.

While we are at it we create two other functions, nn_bit_clear and nn_bit_test.
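
In code the idea looks something like this (a minimal sketch only; word_t
and WORD_BITS come from helper.h, and the real bsdnt signatures, which use
nn_t, may differ slightly):

    #if WORD_BITS == 64
    #define BIT_SHIFT 6                /* 2^6 = 64 bits per word */
    #else
    #define BIT_SHIFT 5                /* 2^5 = 32 bits per word */
    #endif

    void nn_bit_set(word_t * a, long b)
    {
       a[b >> BIT_SHIFT] |= (word_t) 1 << (b & (WORD_BITS - 1));
    }

    void nn_bit_clear(word_t * a, long b)
    {
       a[b >> BIT_SHIFT] &= ~((word_t) 1 << (b & (WORD_BITS - 1)));
    }

    int nn_bit_test(word_t * a, long b)
    {
       return (a[b >> BIT_SHIFT] >> (b & (WORD_BITS - 1))) & 1;
    }

The shift right by BIT_SHIFT picks out the word index, and AND'ing with
WORD_BITS - 1 gives the bit index within that word, exactly as described
above.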

It's now straightforward to write test functions which randomly set, clear and test
bits in a random bignum.

Next we create a random bignum generator which sets random bits of a bignum. In order
to do this, we simply choose a random number of bits to set, from 0 to the number of words
in the bignum, then we set that many bits at random.

We call this function nn_test_random1.
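
In outline, nn_test_random1 looks something like the following sketch (here
randword(state) is assumed to return a uniformly random word, as in the rand
module, and the real code may differ in detail):

    void nn_test_random1(word_t * a, rand_t state, long m)
    {
       long i, bits = randword(state) % (m + 1);  /* 0 to m bits will be set */

       for (i = 0; i < m; i++)                    /* start from zero         */
          a[i] = 0;

       for (i = 0; i < bits; i++)                 /* set random bits         */
          nn_bit_set(a, randword(state) % (m * WORD_BITS));
    }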

We now add a second random bignum generator. It works by creating two bignums using the
function nn_test_random1 and subtracting one from the other. This results in test randoms
with long strings of 1's and 0's in their representations.

We call this function nn_test_random2.

Finally, we create a function nn_test_random which randomly switches between these two
algorithms and our original nn_random algorithm to generate random bignums. We switch all
our test code to use nn_test_random by changing the function randoms_of_len to use it.

At this point we can have greater confidence that our functions are all working as they
are supposed to be, as our test code has been suitably fortified at last! (Well, they are
working now, after I spent a day hunting down the bugs that these new randoms found - no,
I am not kidding. That's how good at finding bugs this trick is!)

Today's code is found here: v0.24

Previous article: v0.23 - sha1 and PRNG tests
Next article: BSDNT - interlude

Friday, 12 November 2010

BSDNT - v0.23 sha1 and PRNG tests

In a recent update we added three PRNGs (pseudo random number
generators). What we are going to do today is add test code for
these.

But how do you test a pseudo random generator? It's producing
basically random values after all. So what does it matter if the
output is screwed up!?

Well, it does matter, as shown by the problems on 32 bit machines
which I wrote about in the PRNG blog post. It would also matter if
the PRNGs were broken on some platform such that they always output
0 every time!

There are a few ways of testing PRNGs. One is simply to run them for a
given number of iterations, write down the last value produced and
check that the same value always comes out.

The method we are going to use is slightly more sophisticated. We are
going to hash a long series of outputs from the PRNGs, using a hash
function, and check that the hash of the output is always the same.

Basically, our test code will take a long string of words from the
PRNGs, write them to an array of bytes, then compute the sha1 hash of
that array of bytes. It'll then compare the computed hash with a hash
we've computed previously to ensure it has the same value as always.

Moreover, we'll set it up so that the hash is the same regardless of
whether the machine is big or little endian.

The hash function we are going to use is called sha1. Specifically,
we'll be using an implementation of the same written by Brian Gladman
(he also supplied the new test code for the PRNGs for today's update).

The first step is to identify whether the machine is big or little
endian. This refers to the order of bytes within a word in physical
memory. On little endian machines (such as x86 machines) the least
significant byte of a word comes first. On big endian machines the
order is the other way around. Thus the number 0x0A0B0C0D would have
the byte 0x0D stored first on a little endian machine, but 0x0A stored
first on a big endian machine.

Endianness can be identified by architecture, or it can be identified
with a short program. We choose to use the latter method as it should
be hard to fool. At configure time a short C program will run that will
place bytes into a four byte array, then read that array as a single
32 bit number. We then compare the value to a 32 bit value that would
be stored in the given way on a little endian machine. If it compares
equal, then the machine must be little endian. If not we compare with
a number that would be stored in the given way on a big endian machine.

If the machine doesn't match either order, then it must be a very rare
machine with mixed endianness, which we don't support in bsdnt.

The configure script will write some defines out to config.h which
then allow bsdnt modules to identify whether the machine is little or
big endian at compile time, i.e. at zero runtime cost.
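
The probe itself can be as simple as the following sketch (the define names
here are made up for illustration; the real configure script and config.h
may use different ones, and we assume a 32 bit unsigned int):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
       unsigned char bytes[4] = { 0x0D, 0x0C, 0x0B, 0x0A };
       unsigned int n;

       memcpy(&n, bytes, 4);             /* read the 4 bytes as one word */

       if (n == 0x0A0B0C0DU)
          printf("#define BSDNT_LITTLE_ENDIAN\n");
       else if (n == 0x0D0C0B0AU)
          printf("#define BSDNT_BIG_ENDIAN\n");
       else
          printf("#error mixed endian machines are not supported\n");

       return 0;
    }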

Now to discuss the sha1 hashing scheme.

A hashing scheme simply takes a piece of data and computes from it a
series of bits which serve to "identify" that piece of data. If
someone else has access to the same hashing algorithm and a piece of
data which purports to be an exact copy of the original, then they
can verify its identity by computing its hash and comparing.

Of course a hash is only valuable in this sense if it is much shorter
than the piece of data itself (otherwise you'd just compare the
actual data itself).

A very simple hashing scheme might simply add all the words in the
input to compute a hash consisting of a single word.

A secure hashing scheme has at least two other properties.

i) It shouldn't be possible to determine the original data from its
hash. (The data may be secret and one may wish to provide for the
independent verification of its authenticity by having the recipient
compare the hash of the secret data with a publicly published value.
Or, as is sometimes the case, the hash of secret data, such as a
password, might be transmitted publicly, to compare it with a
pre-recorded hash of the data.)

ii) It must be computationally infeasible to substitute fake data
for the original such that the computed hash of the fake data is the
same as that of the original data.

Of course, if the hash is short compared to the data being hashed,
then by definition many other pieces of data will have the same hash.
The only requirement is that it should be computationally infeasible
to find or construct such a piece of data.

The first step in the SHA1 algorithm is some bookkeeping. The
algorithm, as originally described, works with an input message which
is a multiple of 16 words in length. Moreover, the last 64 bits are
reserved for a value which gives the length of the original message in
bits.

In order to facilitate this, the original message is padded to a
multiple of 16 words in length, with at least enough padding to allow
the final 64 bits to be part of the padding, and to not overlap part
of the message.

The padding is done by first appending a single binary 1 bit, then
binary zeroes are appended until the message is the right length.
Then of course the length in bits of the original message is placed
in the final 64 bits of the message.
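
In C the padding step might look something like this (a hypothetical
sha1_pad for illustration, not the code in sha1.c; the caller must supply a
large enough output buffer):

    #include <stdint.h>
    #include <string.h>

    /* Pad msg (len bytes) into buf, returning the padded length in bytes,
       a multiple of 64 bytes, i.e. of 16 words of 32 bits. */
    size_t sha1_pad(unsigned char * buf, const unsigned char * msg, size_t len)
    {
       uint64_t bits = (uint64_t) len * 8;         /* message length in bits */
       size_t padded = ((len + 8) / 64 + 1) * 64;  /* room for 0x80 + length */
       size_t i;

       memcpy(buf, msg, len);
       buf[len] = 0x80;                            /* the single 1 bit       */
       memset(buf + len + 1, 0, padded - len - 9); /* then zero bits         */

       for (i = 0; i < 8; i++)                     /* length in last 64 bits */
          buf[padded - 1 - i] = (unsigned char) (bits >> (8 * i));

       return padded;
    }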

The hashing algorithm itself performs a whole load of prescribed
twists and massages of the padded message.

For this purpose some functions and constants are used.

Given 32 bit words B, C and D there are four functions:

1) f0(B, C, D) = (B AND C) OR ((NOT B) AND D)

2) f1(B, C, D) = B XOR C XOR D

3) f2(B, C, D) = (B AND C) OR (B AND D) OR (C AND D)

4) f3(B, C, D) = B XOR C XOR D

and four corresponding 32 bit constants:

1) C0 = 0x5A827999

2) C1 = 0x6ED9EBA1

3) C2 = 0x8F1BBCDC

4) C3 = 0xCA62C1D6

To begin the algorithm, we break the padded message up into 16 word
blocks M1, M2, M3, etc., i.e. each Mi is 16 words of the padded message.

Each block is processed via a set of steps, and an "accumulated" hash
of 160 bits, consisting of five 32 bit words (the final hash we are
after) is computed: H0, H1, H2, H3, H4.

The algorithm starts by initialising the five "hash words" to the
following values:

H0 = 0x67452301, H1 = 0xEFCDAB89, H2 = 0x98BADCFE, H3 = 0x10325476
and H4 = 0xC3D2E1F0.

Each block of 16 words, Mi, of the padded message is then used in
turn to successively transform these five words, according to the
following steps:

a) Break Mi into 16 words W0, W1, ..., W15.

b) For j = 16 to 79, let Wj be the word given by

Wj = S^1(W{j-3} XOR W{j-8} XOR W{j-14} XOR W{j-16}),

where S^n(X) means to rotate the word X to the left through n bits
(a negative n means right rotation).

c) Make a copy of the hashing words:

A = H0, B = H1, C = H2, D = H3, E = H4

d) For j = 0 to 79 repeat the following set of transformations in
the order given:

TEMP = S^5(A) + f{j/20}(B, C, D) + E + Wj + C{j/20},

E = D, D = C, C = S^30(B), B = A, A = TEMP,

where j/20 signifies "floor division" by 20, and where f and C are
the above-defined functions and constants.

e) Update the hashing words according to the following:

H0 = H0 + A, H1 = H1 + B, H2 = H2 + C, H3 = H3 + D, H4 = H4 + E.

Note that steps a-e are repeated for each block of 16 words, Mi in
the padded message, further manipulating the five words with each run.
The resulting five words H0, H1, H2, H3, H4 after all the words of the
padded message have been consumed, constitute the sha1 hash of the
original message.
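
Putting steps a-e together for a single 16 word block gives something like
the following sketch (for illustration only -- not the code from sha1.c):

    #include <stdint.h>

    static uint32_t rotl32(uint32_t x, int n)   /* S^n: rotate left by n */
    {
       return (x << n) | (x >> (32 - n));
    }

    void sha1_block(uint32_t H[5], const uint32_t M[16])
    {
       uint32_t W[80], A, B, C, D, E, TEMP;
       int j;

       for (j = 0; j < 16; j++)                              /* step a */
          W[j] = M[j];
       for (j = 16; j < 80; j++)                             /* step b */
          W[j] = rotl32(W[j-3] ^ W[j-8] ^ W[j-14] ^ W[j-16], 1);

       A = H[0]; B = H[1]; C = H[2]; D = H[3]; E = H[4];     /* step c */

       for (j = 0; j < 80; j++)                              /* step d */
       {
          uint32_t f, K;

          if (j < 20)      { f = (B & C) | (~B & D);          K = 0x5A827999; }
          else if (j < 40) { f = B ^ C ^ D;                   K = 0x6ED9EBA1; }
          else if (j < 60) { f = (B & C) | (B & D) | (C & D); K = 0x8F1BBCDC; }
          else             { f = B ^ C ^ D;                   K = 0xCA62C1D6; }

          TEMP = rotl32(A, 5) + f + E + W[j] + K;
          E = D; D = C; C = rotl32(B, 30); B = A; A = TEMP;
       }

       H[0] += A; H[1] += B; H[2] += C; H[3] += D; H[4] += E; /* step e */
    }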

The function to compute the sha1 hash of a message is given in the
files sha1.c and sha1.h in the top level directory of the source
tree.

A new test file t-rand.c is added in the test directory. It contains
the sha1 hash of a large number of words as output by our three
PRNGs. If a user of bsdnt has the same hash for the PRNGs when run
on their machine, then they can have a very high level of confidence
that they are working as expected.

Note that the sha1 algorithm is known as a secure hashing algorithm,
which means that in theory it can be used to hash very important
data so that the recipient can independently confirm the data hasn't
been tampered with (by computing the hash of the value and making
sure it matches some published value).

We don't explain why sha1 is designed the way it is. But the mysterious
constants are not so mysterious: C0, C1, C2 and C3 are the square roots
of 2, 3, 5 and 10 respectively, multiplied by 2^30 and written in
hexadecimal.

I don't know the meaning of the functions f0-f3.

What is worth noting is that in recent years, people have figured out
how to produce sha1 hash collisions (two messages with the same hash).
I don't pretend to be an expert in such things.

All we care about here is that a broken PRNG really can't pretend to
be working, and for that, sha1 works a treat.

Disclaimer: use the information in this post at your OWN RISK!! We
make no representations as to its correctness. The same goes for
bsdnt itself. Read the license agreement for details.

Warning: cryptography is restricted by law in many countries including
many of those where the citizens believe it couldn't possibly be so.
Please check your local laws before making assumptions about what you
may do with crypto.

The code for today's article is here: v0.23

Previous article: v0.22 - Windows support

Thursday, 11 November 2010

BSDNT - v0.22 Windows support

Today's update is a massive one, and comes courtesy of Brian Gladman. At last we add
support for MSVC 2010 on Windows.

In order to support different architectures we add architecture specific files in the arch
directory. There are three different ways that architectures might be supported:

* Inline assembly code

* Standalone assembly code (using an external assembler, e.g. nasm)

* Architecture specific C code

Windows requires both of the last two of these. On Windows 64 bit, MSVC does not support
inline assembly code, and thus it is necessary to supply standalone assembly code for this
architecture. This new assembly code now lives in the arch/asm directory.

On both Windows 32 and 64 bit there is also a need to override some of the C code from the base
bsdnt library with Windows specific code. This lives in the arch directory.

Finally, the inline assembly used by the base bsdnt library on most *nix platforms is now in the
arch/inline directory.

In each case, the os/abi combination is specified in the filenames of the relevant files. For
example on Windows 32, the files overriding code in nn_linear.c/h are in arch/nn_linear_win32.c/h.
(Note win32 and x64 are standard Windows names for 32 and 64 bit x86 architectures, respectively.)

If the code contains architecture specific code (e.g. assembly code) then the name of the file
contains an architecture specifier too, e.g. arch/inline/nn_linear_x86_64_k8.h for code specific
to the AMD k8 and above.

It's incomprehensible that Microsoft doesn't support inline assembly in their 64 bit compiler,
making standalone assembly code necessary. It would be possible to use the Intel C compiler on
Windows 64, as this does support inline assembly. But this is very expensive for our volunteer
developers, so we are not currently supporting this. Thus, on Windows 64, the standalone
assembly is provided in the arch/asm directory as just mentioned.

Brian has also provided MSVC build solution files for Windows. These are in the top level source
directory as one might expect.

There are lots of differences on Windows that require functions in our standard nn_linear.c,
nn_quadratic.c and helper.c files to be overridden.

The first difference is that on 64 bit Windows, the 64 bit type is a long long, not a long. This
is handled by #including a types_arch.h file in helper.h. On most platforms this file is empty.
However, on Windows it links to an architecture specific types.h file which contains the
requisite type definitions. So a word_t is a long long on Windows.

Also, when dealing with integer constants, we'd use constants like 123456789L when the word type
is a long, but it has to become 123456789LL when it is a long long, as on Windows 64. To get
around this, an architecture specific version of the macro WORD(x) can be defined. Thus, when
using a constant in the code, one merely writes WORD(123456789) and the macro adds the correct
ending to the number depending on what a word_t actually is.
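
For illustration, the macro might look something like this sketch (the real
definitions live in the architecture specific helper.h files and may test a
different symbol than _WIN64):

    /* Append the right suffix to an integer literal for the word type. */
    #if defined(_WIN64)
    #define WORD(x) (x##LL)      /* word_t is a long long on Windows 64 */
    #else
    #define WORD(x) (x##L)       /* word_t is a long elsewhere          */
    #endif

    /* usage: word_t billion = WORD(1000000000); */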

Some other things are different on Windows too. The intrinsic for counting leading zeroes is
different to that used by gcc on other platforms. The same goes for the function for counting
trailing zeroes. We've made these into macros and given them the names high_zero_bits and
low_zero_bits respectively. The default definitions are overridden on Windows in the architecture
specific versions of helper.h in the arch directory.

Finally, on Windows 64, there is no suitable native type for a dword_t. The maximum sized
native type is 64 bits. Much of the nn_linear, and some of the nn_quadratic C code needs to
be overridden to get around this on Windows. We'll only be using dword_t in basecase algorithms
in bsdnt, so this won't propagate throughout the entire library. But it is necessary to
override functions which use dword_t on Windows.

Actually, if C++ is used, one can define a class called dword_t and much of the code can
remain unchanged. Brian has a C++ branch of bsdnt which does this. But for now we have C code
only on Windows (otherwise handling of name mangling in interfacing C++ and assembly code
becomes complex to deal with).

Brian has worked around this problem by defining special mul_64_by_64 and div_128_by_64
functions on 64 bit Windows. These are again defined in the architecture specific version of
helper.h for Windows 64.

Obviously some of the precomputed inverse macros need to be overridden to accommodate these
changes, and so these too have architecture specific versions in the Windows 64 specific version
of the helper.h file.

In today's release we also have a brand new configure file for *nix. This is modified to handle
all the changes we've made to make Windows support easy. But Antony Vennard has also done
some really extensive work on this in preparation for handling standalone assembly on arches
that won't handle our inline assembly (and for people who prefer to write standalone assembly
instead of inline assembly).

The new configure file has an interactive mode which searches for available C compilers (e.g.
gcc, clang, icc, nvcc) and assemblers (nasm, yasm) and allows the user to specify which to use.
This interactive feature is off by default and is only a skeleton at present (it doesn't actually
do anything). It will be the subject of a blog post later on when the configure file is finished.

The code for today is found at: v0.22

Previous article: v0.21 - PRNGs

Sunday, 31 October 2010

BSDNT - v0.21 PRNGs

In this update we are going to replace the el cheapo random generator in
bsdnt with something of higher quality.

Some time ago, Brian Gladman brought to my attention the fact that on 32
bit machines, the test code for bsdnt actually caused Windows to hang!

The issue required some sleuthing work on Brian's part to track down.
He eventually discovered the cause of the problem, and it was, oddly
enough, the pseudo-random number generator (PRNG) I had used.

Brian suspected the PRNG immediately because of his past experience as a
professional cryptographer. In fact, it turns out that PRNG's of the
kind that I had used aren't particularly good even if they don't have
bugs!

The particular kind of PRNG I had used is called a linear congruential
PRNG. They start with the PRNG initialised to some random seed value,
n = seed. Then each time they are called, they apply the transformation
n = (n*c1 + c2) % p for some explicit constants c1, c2 and a large
enough "prime" p.

One can take c2 = 0 in the transformation and it is also common to see
p = 2^b for some b (e.g. b = WORD_BITS, and yes, I know p = 2^b is not
usually prime). When p = 2^b it is usually the case that the top half of
the bits output have reasonable random properties, but the bottom half
do not. In this case it is acceptable to stitch two LC PRNG's together to
give the full number of bits.

When p is an actual prime, these PRNG's are called prime modulus linear
congruential PRNG's and they aren't too bad when implemented properly.
They still fail to meet some standards of random quality.

To return a whole word of random bits, one either needs to use a prime
p that is larger than a word, which is usually impractical, or again
stitch the output of two prime modulus LC PRNG's together.

However, one needs to be careful, as the period of the generator is p-1,
so if one is on a 32 bit machine, it doesn't do to use a prime p just
over half the size of a word (the first mistake I made), otherwise the
period is just over 65536. That isn't too good for generating random
values for test code!

But how was my LC PRNG causing Windows to crash!? The reason was that
in some of the test functions we required bignums that didn't overflow
when summed together. This of course depends almost entirely on the top
few bits of the bignums being added together.

The problem was that in the expression n = (n*c1 + c2) % p, I was using
values of c1 and c2 which were not reduced mod p. It turns out that this
is crucial to correct operation. It might seem that the result ends up
being reduced mod p anyway, and indeed it would be if n*c1 fit in a word.
However, because it doesn't you actually get ((n*c1 + c2) % 2^32) % p
which causes the binary representation of the value generated to be quite
regular.

Anyhow, on Windows (and probably on other 32 bit machines) the test code
generates length 90 bignums over and over at some point, looking in vain
to find pairs of such bignums which when added do not overflow. As these
are garbage collected at the end of the test function, memory starts
filling up with the orphaned bignums that are discarded by the test code
as it looks for appropriate values. This eventually overwhelms the heap
allocator on Windows and causes the entire OS to crash!

The problem of writing decent PRNG's has been studied extensively, and one
expert in the subject is George Marsaglia. He famously turned up on a
usenet forum in January of 1999 and dumped not one, but piles of fast, high
quality PRNG's which do not suffer from the problems that other PRNG's do.

Amazingly, many of the PRNG's in common usage today are either written by
George, or based on ones he wrote. So he's some kind of legend!

Anyhow, we will make use of two of his PRNG's, Keep It Simple Stupid (KISS)
and Super KISS (SKISS) and a third PRNG called Mersenne Twister, due to
Makoto Matsumoto and Takuji Nishimura in 1997.

George's generators are in turn based on some simpler PRNG's. He begins by
defining a linear congruential generator, with c1 = 69069 and c2 = 1234567.
This is taken mod p = 2^32 (yes, it's not prime). This has good properties
on its top 16 bits, but not on its bottom 16 bits, and for this reason had
been widely used before George came along. This generator has period 2^32.

Next he defines a pair of multiply with carry (MWC) generators. These are
of the form n = c1*lo(n) + hi(n) where lo(n) is the low 16 bits of n, hi(n)
is the high 16 bits and c1 is an appropriately chosen constant.

He stitches together a pair of these MWC PRNG's mod 2^16 to give 32 random
bits. For simplicity we'll refer to this combined random generator as MWC.
This has a period of about 2^60.

Thirdly he defines a (3-)shift-register generator (SHR3). This views the
value n as a binary vector of 32 bits and applies linear transformations
generated from 32 x 32 matrices L and R = L^T according to
n = n(I + L^17)(I + R^13)(I + L^5) where I is the 32 x 32 identity matrix.

In order to speed things up, special transformations are chosen that can
be efficiently implemented in terms of XOR and shifts. This is called an
Xorshift PRNG. We'll just refer to it as SHR3.

Now given appropriate seed values for each of these PRNG's Marsaglia's
KISS PRNG is defined as MWC ^ CONG + SHR3. This generator passes a whole
slew of tests and has a period of 2^123. In this update we make it the
default random generator for bsdnt.
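
Concretely, one step of a 32 bit KISS generator looks something like this
sketch (using Marsaglia's published constants; the bsdnt version keeps its
state in a context object rather than in globals, and the seeds below are
arbitrary):

    #include <stdint.h>

    static uint32_t z = 362436069, w = 521288629;   /* MWC state  */
    static uint32_t jsr = 123456789;                /* SHR3 state */
    static uint32_t jcong = 380116160;              /* CONG state */

    uint32_t kiss32(void)
    {
       uint32_t mwc;

       z = 36969*(z & 65535) + (z >> 16);     /* two 16 bit MWC generators */
       w = 18000*(w & 65535) + (w >> 16);
       mwc = (z << 16) + w;                   /* stitched together         */

       jcong = 69069*jcong + 1234567;         /* linear congruential       */

       jsr ^= jsr << 17;                      /* 3-shift register          */
       jsr ^= jsr >> 13;
       jsr ^= jsr << 5;

       return (mwc ^ jcong) + jsr;            /* KISS = MWC ^ CONG + SHR3  */
    }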

Super KISS is a random generator defined by George later in 2009. It gives
immense periods by adding together the output of three PRNG's, one with a
massive order. It is basically defined by SKISS = SUPR + CONG + SHR3.

Here, the new generator SUPR is based on a prime p = a*b^r + 1 such that
the order of b mod p has magnitude quite near to p - 1.

It starts with a seeded vector z of length r, all of whose entries are
less than b and an additional value c which is less than a.

One then updates the pair (z, c) by shifting the vector z to the left by
one place and setting the right-most entry to (b - 1) - (t mod b), where
t = a*z1 + c and z1 is the entry shifted out at the left of z. Then c is
set to floor(t/b).

Naturally in practice one uses b = 2^32 so that all the intermediate
reductions mod b are trivial.

As with most generators which have massive periods the "state" held by this
generator is large. It requires data mod p for a multiprecision p.

Note the similarity with the MWC generator except for the "complement" mod
b that occurs. This is called a CMWC (Complemented-Multiply-With-Carry)
generator.

George proposed using the prime p = 640*b^41265+1, where the order of b is
5*2^1320481. The period of the CMWC generator is then greater than
2^1300000.

Of course, at each iteration of the algorithm, 41265 random words are
generated in the vector. Once these are exhausted, the next iteration of
the algorithm is made.

The algorithm SUPR in the definition of SKISS is thus just a simple
array lookup to return one of the words of the vector z. Each time SKISS
is run, the index into the array is increased until all words of the array
are exhausted, at which point the CMWC algorithm is iterated to refill the
array.

We now come to describing the Mersenne twister.

It is based on the concept of a feedback shift register (FSR). An FSR shifts
its value left by 1 bit, feeding at the right some linear combination of the
bits in its original value. The Mersenne twister is conceptually a
generalisation of this.

The difference with the Mersenne twister is that the "feedback" is effected
by a certain "twist". This is effected by applying a "linear transformation"
A of a certain specific form, with multiplication by A having addition
replaced by xor in the matrix multiplication. The twist can be described
more straightforwardly, and we give the more straightforward description
below.

One sets up a Mersenne twister by picking a recurrence degree n, a "middle
word" 1 <= m <= n and a number of bits for a bitmask, 0 <= r <= w, where w
is the number of bits in a word (32 here). One picks these values so that
p = 2^(n*w - r) - 1 is a Mersenne prime (hence
the name of this PRNG). Given a vector of bits a = [a0, a1, ..., a{w-1}] of length
w and a sequence x of words of w bits, the Mersenne twister is defined by a
recurrence relation x[k+n] = x[k+m] ^ ((upper(x[k]) | lower(x[k+1])) A)
where upper and lower return the upper w - r and lower r bits of their
operands, and where A is the "twist" spoken of and defined below, in terms of
a. Of course ^ here is the xor operator, not exponentiation. For a vector X of w
bits, XA is given by X>>1 if X[0] == 0 otherwise it is given by (X>>1) ^ a.
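
For the standard 32 bit parameters (n = 624, m = 397, r = 31 and twist
vector a = 0x9908B0DF) one step of the recurrence can be written as the
following sketch (tempering and initialisation omitted):

    #include <stdint.h>

    /* Apply the MT19937 recurrence to state word k; mt[] holds 624 words. */
    void mt_twist_step(uint32_t * mt, int k)
    {
       uint32_t y = (mt[k] & 0x80000000UL)              /* upper w - r = 1 bit */
                  | (mt[(k + 1) % 624] & 0x7FFFFFFFUL); /* lower r = 31 bits   */

       mt[k] = mt[(k + 397) % 624]                      /* x[k + m]            */
             ^ (y >> 1)                                 /* XA when X[0] == 0   */
             ^ ((y & 1) ? 0x9908B0DFUL : 0);            /* extra XOR a if not  */
    }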

Some theory is required to find an A such that the Mersenne twister will have
maximum theoretical period 2^(n*w - r) - 1.

To finish off, the Mersenne twister is usually "tempered". This tempering
simply mangles the bits in a well understood way to iron out some of the
known wrinkles in the MT algorithm.

Only a couple of sets of parameters are in common use for Mersenne twisters.
These are referred to as MT19937 for 32 bit words and MT19937-64 for 64 bit
words.

As with all PRNG's, there is a whole industry around "cracking" these things.
This involves starting with a short sequence of values from a PRNG and
attempting to find the starting constants and seed values.

Obviously, in cryptographic applications, there is not much point generating
"secure" keys with a PRNG with a single source of entropy. Even if your key
is generated by multiplying primes of many words in length, if those words
were generated from a PRNG seeded from the current time, it may only take
a few iterations and a good guess as to which PRNG you used, to determine
the constants used in the PRNG and thus your entire key. And that's
irrespective of which constants you chose in your PRNG!

So if you are doing crypto, you need to take additional precautions to
generate secure keys. Just seeding a PRNG from the time probably won't cut
it!

Some PRNG's are more "secure" than others, meaning that knowing a
few output values in a row doesn't give terribly much information about
which values may follow. But if you rely on a PRNG to be secure, you
are essentially betting that because you don't know how to get the
next few values and nor does anyone else that has written about the
subject, then no one at all knows. Of course one needs to ask oneself
if they would tell you if they did.

Another assumption one should never make is that no one has the computing
power to brute force your PRNG.

Some PRNG's are designed for cryptographic applications, and maybe one can
believe that these are "safe" to use, for some definition of safe.

Anyhow, we only care about random testing at this point. In today's update
32 and 64 bit KISS, SKISS and MT PRNG's are added in the directory rand.
Our randword, randinit, and randclear functions are all replaced with
appropriate calls to KISS functions.

There is also an option to change the default PRNG used by bsdnt. Is it my
imagination or does the test code now run faster, even on a 64 bit machine!

At some point we will add some tests of the new PRNG's. These will compare
the outputs with known or published values to check that they are working as
designed for a large number of iterations.

Brian Gladman contributed to this article and also did most of the work
in implementing Marsaglia's PRNG's in bsdnt. The Mersenne twisters were
originally written by Takuji Nishimura and Makoto Matsumoto and made available
under a BSD license. Brian did most of the work in adapting these for bsdnt.

The code for today's update is here: v0.21

Previous article: v0.20 - redzones