tag:blogger.com,1999:blog-76511154304161566362019-03-01T01:35:04.699-08:00Reading, Writing and ArithmeticWilliam Harthttp://www.blogger.com/profile/18416881057216462316noreply@blogger.comBlogger40125tag:blogger.com,1999:blog-7651115430416156636.post-17309443869690959462017-09-23T17:34:00.000-07:002017-09-24T14:24:38.994-07:00Parallel multivariate multiplication<br />Since my last blog on multivariate polynomial arithmetic, a lot has happened, so it's time to give an update, including our first parallel sparse multiplication benchmarks! In fact, there is so much to report that I'm restricting this blog to multiplication only!<br /><br />The most exciting news since last time is that Daniel Schultz has gotten parallel sparse multiplication working. Daniel is the person we hired on the OpenDreamKit grant to work on parallel multivariate arithmetic. He's also implemented numerous improvements to the code, which I'll mention below.<br /><br />After the last blog post, I discovered that a number of things affect the timings:<br /><br /><ul><li>Polynomial ordering (lex, deglex, degrevlex)</li><li>Ordering of variables (x, y, z, t, u)</li><li>Whether one computes f*g or g*f, or in the case of division, p/f or p/g</li></ul><br />Obviously in a perfect system, none of these would matter, but especially with our old code and with certain systems, they do. In particular, some of the division timings I gave previously for Flint, were invalid. I've now corrected them on the old blog.<br /><br />In this blog, I will give new timings which always force lex ordering, for x > y > z > t > u, and will always take f*g.<br /><br />Sdmp and Maple prefer deglex, so I will time these separately. We also don't have them available on our benchmark machine anyway.<br /><br />Magma, Sage and Singular support a virtual panoply of orderings, whilst some systems are restricted just to lex or just to deglex. Flint supports lex, deglex and degrevlex. (Since there is no agreement on what revlex should be, we decided not to support it.)<br /><br />We also discovered that we had implemented degrevlex incorrectly, and this has now been fixed. This didn't affect the timings in the last blog.<br /><br /><h3>Sage division improvements</h3><div><br /></div><div>In the last blog we noted that Sage performed much worse than Singular for division over Z. The Sage developers immediately took note of this and set about fixing it. The latest release is 8.0 and it doesn't seem to have the fix, so we have to wait for the new version to give the new Sage division timings. It's great to see how quickly the community responded to this.</div><div><br /></div><div>It was of course fixed by using Singular for that computation, rather than a very slow generic implementation of division.</div><div><br /></div><h3>Singular improvements</h3><div><br /></div><div>Hans Schoenemann has been working on improving the use of Factory from Singular. This means that Singular will call Factory for large arithmetic operations, since it is much faster. Singular itself is optimised for Groebner basis computations, not arithmetic.</div><div><br /></div><div>The new improvements will currently take effect for computations with over 1000 terms or so, which is heuristically chosen to avoid overhead in converting to and from Factory's lex ordering and recursive polynomial format. Changing from an arbitrary ordering to a lex ordering and back can be <i>extremely</i> expensive, so this crossover is essential to avoid harming performance for smaller operations.</div><div><br /></div><div>We've also decided that when we have a really solid, performant implementation of multivariate arithmetic, we are going to make use of it in Factory. The end result is that the code that we are writing now will be available to all Singular (and Factory) users.</div><br />The main operations that really benefit from such arithmetic are rational functions and multivariate factorisation. They aren't for speeding up Groebner Basis computations!<br /><br /><h3>Flint improvements</h3><div><br /></div><div>Most of the actual work done on Flint since my last blog, has been done by Daniel Schultz. Here are some of the improvements:</div><div><br /></div><h4>Improved exponent packing</h4><div><br /></div><div>When using deglex or degrevlex, our monomial format packs the total degree field into the same words as all the other exponent fields. The trick of packing exponent vectors into bit fields goes way back to the early days of Singular. We get the idea from a paper of Hans Schoenemann.</div><div><br /></div><div>Actually, Singular keeps the degree field separate from the rest of the vector (which in hindsight might have been a better thing for us to do as well), but we decided to try and pack it along with the other exponents.</div><div><br /></div><div>The problem with our implementation of deglex and degrevlex is precisely this extra field that is required for the degree. If we don't want to overflow a word for each exponent vector, we have fewer bits per exponent when using degree orderings.</div><div><br /></div><div>The maximum total degree is also often larger than the maximum of any individual degree, further exacerbating the problem. These issues mean that if we aren't careful, we can take more space for packed monomials using degree orderings than for lex orderings.</div><div><br /></div><div>In practice, this meant that for our largest benchmarks, two 64 bit words were being used for exponent vectors, instead of one. This is because we were packing each exponent into a 16 bit field. Obviously six times 16 bits requires more than 64 bits.</div><div><br /></div><div>We were using 16 bit fields, since we only supported fields of 8, 16, 32 or 64 bits (including an overflow bit), and 8 bit fields just weren't large enough; the maximum total degree in our largest benchmark is 160, (which requires 8 bits, plus an overflow bit).</div><div><br /></div><div>We now pack into an arbitrary number of bits, and for 5 variables plus a degree field, it obviously makes sense to pack into 10 bit fields, since you get 6 of those in 64 bits. This is plenty big enough for all our benchmarks, in fact. The result is a 50% saving in time when using degree orderings.</div><div><br /></div><h4>Delayed heap insertion</h4><div><br /></div><div>Recently at ISSAC 2017 here in Kaiserslautern, Roman Pearce, one of the authors of the Sdmp library, told me about an improvement for multiplication that delays inserting nodes into the heap. In some cases, we know that they will be inserted at a later time anyway, so there is no hurry to put them in at the earliest possible opportunity, as this just increases the size of the heap.</div><div><br /></div><div>Daniel implemented a version of this that he came up with, and it makes a big difference in timings, and shrinks the heap to a fraction of its original size.</div><div><br /></div><h4>Multi-core multiplication</h4><div><br /></div><div>Flint can now multiply sparse multivariate polynomials using threads. Daniel experimented with various strategies, including breaking one of the polynomials up. This strategy has the disadvantage of requiring a merge of polynomials from the various threads at the end, which is expensive.</div><div><br /></div><div>The best strategy we found was to take "diagonals". That is, we divide the output polynomial up into chunks instead of dividing up the input polynomials. This has to be done heuristically, of course, since you don't have the output polynomial until it is computed.<br /><br />This idea is contained in a paper of Jacques Laskar and Mickael Gastineau.<br /><br /></div><div>To ensure that all threads take about the same amount of time, we start with big blocks and take smaller and smaller blocks towards the end, so that no thread will be left running for a long time at the end while all the others are waiting. This gives fairly regular performance. The scheme we use is a variation on that found in the paper by Laskar and Gastineau.</div><div><br /></div><div>The advantage of the diagonal splitting method is that no merge is required at the end of the computation. One simply needs to concatenate the various outputs from the threads, and this can be done relatively efficiently.</div><div><br /></div><h4>Improved memory allocation</h4><div><br /></div><div>At scale, everything breaks, and malloc and the Flint memory manager are no exception. I wrote a new memory manager for Flint which uses thread local storage to manage allocations in a way that tries to prevent different threads writing to adjacent memory locations (which would incur a performance penalty).</div><div><br /></div><div>The design of this memory manager is beyond the scope of this blog, but it seems to work in practice. </div><div><br /></div><div>[A brief summary is that it allocates blocks of memory per thread, consisting of a few pages each. Each page has a header which points to the header of the first page in the block, where a count of the number of active allocations is stored. When this count reaches zero, the entire block is deleted. When threads terminate, blocks are marked as moribund so that further allocations are not made in them. The main thread can still clean thread allocated blocks up when their count reaches zero, but no further allocations are made in those blocks.]</div><div><br /></div><div>The main advantage of this scheme, apart from keeping memory separated between threads is that we can block-allocate the mpz_t's that are used for the polynomial coefficients. This is critical, since in our largest benchmarks, almost a third of the time is spent allocating space for the output!</div><div><br /></div><div>Unfortunately, each mpz_t allocates a small block of memory internally. That memory is handled by GMP/MPIR, not Flint, and so we still have 28 million small allocations for our largest benchmark!</div><div><br /></div><div>I was able to make a small improvement here by changing the default allocation to two words, since we don't really use one word GMP integers in Flint anyway. But fundamentally, we are still making many small allocations at this point, which is very expensive.</div><div><br /></div><div>This can be improved in a future version of Flint by setting a custom GMP memory allocator. But it will require implementation of a threaded, pooled, bump allocator. It can be done, but it will be a lot of work.</div><div><br /></div><div>If we want to scale up to more than 4 cores (and we do!), we have no choice but to fix this!</div><div><br /></div><h3>Trip homogeneous format</h3><div><br /></div><div>Mickael Gastineau, one of the authors of Trip, was kind enough to send me a special version of Trip which allows access to their homogeneous polynomial format. It isn't optimised for integer arithmetic, but their papers show it is very fast for dense multivariate arithmetic, assuming you are happy to wait for a long precomputation.</div><div><br /></div><div>The basic idea is to split polynomials into homogeneous components and precompute a very large table in memory for lookup. One only needs a table for the given degree bound, and Trip manages this automatically upon encountering this degree for the first time.</div><div><br /></div><div>Before we get to the main timings below, we present Trip timings with this homogeneous format (excluding the precomputation).</div><div><br /></div><div>We only time this for dense computations, since the precomputation is not practical for most of the sparse benchmarks. Moreover, since the code is not optimised for integer arithmetic, we use floating point to exactly represent integers, which means that the largest benchmarks can't be performed, since they would overflow the double precision floating point numbers.</div><div><br /></div><div>Since Flint doesn't yet have parallel dense multiplication, these timings are restricted to a single core.<br /><br /></div><div><br /><h4>Dense Fateman (1 core)</h4></div><div><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-blWeoWaJnR4/Wcb3fIfkQDI/AAAAAAAAB4E/FjNe0iglMtwQG88Y690ChuX7rhzQ-Zj3gCLcBGAs/s1600/trip1.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="281" data-original-width="392" height="143" src="https://3.bp.blogspot.com/-blWeoWaJnR4/Wcb3fIfkQDI/AAAAAAAAB4E/FjNe0iglMtwQG88Y690ChuX7rhzQ-Zj3gCLcBGAs/s200/trip1.PNG" width="200" /></a></div></div><div><br /></div><h3>New sparse multiplication benchmarks</h3><div><br /></div><div>In this section, we give timings for all systems (except Sdmp and Maple) on 1, 2 and 4 cores (where available). Most of the systems, including ours, perform badly on larger numbers of cores, unless the problems are really huge (> 30 minutes to run). When we have fixed our memory management issues, we'll do another blog post with timings on larger numbers of cores.</div><div><br /></div><div>All the systems below were forced to use lexicographical ordering, with x > y > z > t > u, and we always take f*g (the polynomials f and g are the same length, so it's an arbitrary choice in reality).</div><div><br /></div><div>The new delayed insertion technique Daniel implemented actually makes the timings for f*g and g*f the same for our new code. But other systems that support lex don't necessarily have this.</div><div><br /></div><div>All of the timings in this section are done on the same machine, an old 2.2GHz Opteron server.</div><div><br />First, we give all the single core timings. We've added Piranha and Trip timings this time around. Incidentally, my understanding is that Piranha uses a hash table and therefore doesn't have a specific polynomial ordering. We include timings for Piranha to quench your burning curiosity.<br /><br /><h4 style="text-align: left;">Dense Fateman (1 core)</h4></div><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-KyyxIw09sXI/Wcb5gWufhvI/AAAAAAAAB4Q/DRAKF3XnFS0_IYLReSRa9cSHxkfoNTMiQCLcBGAs/s1600/fateman1.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="275" data-original-width="891" height="184" src="https://3.bp.blogspot.com/-KyyxIw09sXI/Wcb5gWufhvI/AAAAAAAAB4Q/DRAKF3XnFS0_IYLReSRa9cSHxkfoNTMiQCLcBGAs/s400/fateman1.PNG" width="597" /></a></div><div><br /></div><div><br /></div><div><div><h4 style="text-align: left;">Sparse Pearce (1 core)</h4></div><div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-uz71sopdUYc/Wcb5gZJMFfI/AAAAAAAAB4Y/_nb_xg9fLe8wNQwVvCBHFlX3BaZo6HklwCEwYBhgL/s1600/pearce1.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="311" data-original-width="851" height="208" src="https://2.bp.blogspot.com/-uz71sopdUYc/Wcb5gZJMFfI/AAAAAAAAB4Y/_nb_xg9fLe8wNQwVvCBHFlX3BaZo6HklwCEwYBhgL/s400/pearce1.PNG" width="584" /></a></div><div><br /></div><div><br /></div><div>Of course, not all the systems can do parallel multiplication. So a smaller set of benchmarks for a smaller set of systems follows.</div><div><br /></div><div></div><div><h4 style="text-align: left;">Sparse Pearce (2 core)</h4></div><div></div><div><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-WeER9CPhYc4/Wcb5gdZd3UI/AAAAAAAAB4s/VlXjahqCcTspO-X8RKAv3SbBP3KoZjSgQCEwYBhgL/s1600/pearce2.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="311" data-original-width="500" height="247" src="https://4.bp.blogspot.com/-WeER9CPhYc4/Wcb5gdZd3UI/AAAAAAAAB4s/VlXjahqCcTspO-X8RKAv3SbBP3KoZjSgQCEwYBhgL/s400/pearce2.PNG" width="400" /></a></div><h4></h4><h4 style="text-align: left;">Sparse Pearce (4 core)</h4></div></div><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-4sc1sxn5T4o/Wcb5g8K_D7I/AAAAAAAAB4s/Jmj-M2REROQy7HJ9fItSNHOFvRM9l7tUACEwYBhgL/s1600/pearce4.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="310" data-original-width="498" height="248" src="https://4.bp.blogspot.com/-4sc1sxn5T4o/Wcb5g8K_D7I/AAAAAAAAB4s/Jmj-M2REROQy7HJ9fItSNHOFvRM9l7tUACEwYBhgL/s400/pearce4.PNG" width="400" /></a></div><h3></h3><h3>Sdmp and Maple 2017 timings</h3><div><br /></div><div>Michael Monagan very kindly granted me guest access to a machine that has Maple 2017 installed, which is currently also the only way to time Sdmp. Roman Pearce has also kindly given lots of information on how to use Maple and Sdmp, so that we can get fair timings.</div><div><br /></div><div>The first thing to note is that the timings here are on a completely different machine to the above. So they are not in any way comparable.</div><div><br /></div><div>I believe Sdmp internally only supports deglex. Therefore, we decided it best to use deglex for the Maple and Sdmp timings, otherwise we may be comparing apples and oranges. This is another reason one can't make a direct comparison with the timings for the other systems above. It also means there are no dense benchmarks, since our dense code doesn't support deglex efficiently yet.</div><div><br /></div><div>It's also very important to point out that we have no way to run Sdmp except via Maple. We don't always get the timings for Sdmp below in practice, they are simply the self-reported times for Sdmp as a component of the full Maple run times.</div><div><br /></div><div>It's not entirely clear why Maple is so slow on large inputs. Maple makes use of a DAG structure for its symbolic engine, rather than directly manipulating Sdmp polynomials. Something is apparently causing this to be incredibly slow when the polynomials are large. According to the Maple CodeTools it seems to be doing some long garbage collection runs after each multiplication. One wonders if there isn't something better that Maple can do so that their users can take practical advantage of the full performance of the Sdmp library in all cases!</div><div><br /></div><div>An important benchmarking consideration is that for Maple, all Sdmp computations can be considered static. This means that Sdmp can block allocate their outputs, since the coefficients will never be mutated and therefore will never be reallocated. To simulate this, when comparing Sdmp and Flint, we write Flint polynomials into an already allocated polynomial. This probably slightly advantages Flint in some cases, but since we don't actually get the stated Sdmp timings in practice, this seems reasonable.<br /><br />Of course, this isn't fair to Maple, but we are talking about a few percent, and the comparison with Sdmp was more interesting to us, since it is a library like Flint.</div><div><br /></div><div>When comparing against Maple, we do the Flint timings using Nemo, which is a Julia package that wraps Flint. The code isn't on by default in Nemo, since we don't yet have GCD code, for example, but we switch the Flint code on in Nemo for the basic arithmetic timings in this blog. </div><div><br /></div><div>We take considerable care to ensure that both Maple and Sdmp use the same effective variable order that we do.</div><div><br /></div><div><h4><div style="font-weight: normal;"></div></h4><div><h4 style="text-align: left;">Sparse Pearce (1 core)</h4></div><div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-GXfvD7kmiQI/Wcb5hbie4aI/AAAAAAAAB4s/jYCtVe_wezg1lrfEhOAsf0XZ0FjVXx9DwCEwYBhgL/s1600/sdmp1.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="310" data-original-width="383" height="259" src="https://2.bp.blogspot.com/-GXfvD7kmiQI/Wcb5hbie4aI/AAAAAAAAB4s/jYCtVe_wezg1lrfEhOAsf0XZ0FjVXx9DwCEwYBhgL/s320/sdmp1.PNG" width="320" /></a></div><div><div><br /></div><h4 style="text-align: left;">Sparse Pearce (2 core)</h4></div><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-uevMJ2OWpng/Wcb5hWV_S4I/AAAAAAAAB4s/598XhHvl5nggmcU5AlIOOsMo9GZR9IDCwCEwYBhgL/s1600/sdmp2.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="310" data-original-width="398" height="249" src="https://4.bp.blogspot.com/-uevMJ2OWpng/Wcb5hWV_S4I/AAAAAAAAB4s/598XhHvl5nggmcU5AlIOOsMo9GZR9IDCwCEwYBhgL/s320/sdmp2.PNG" width="320" /></a></div><div><br /></div><h4 style="text-align: left;">Sparse Pearce (4 core)</h4></div><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-kbqyU4-tdd4/Wcb5hZdcgVI/AAAAAAAAB4s/Pw5YnGRdH8cIaqTOQXKMgdcuT2c7FM63wCEwYBhgL/s1600/sdmp4.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="312" data-original-width="399" height="250" src="https://4.bp.blogspot.com/-kbqyU4-tdd4/Wcb5hZdcgVI/AAAAAAAAB4s/Pw5YnGRdH8cIaqTOQXKMgdcuT2c7FM63wCEwYBhgL/s320/sdmp4.PNG" width="320" /></a></div><h4>Conclusion</h4><div><br /></div><div>The project is coming along nicely. </div><div><br /></div><div>Daniel has started working on arithmetic over Q, and we'll update everyone when that is working. Hopefully we'll also find a good strategy for parallelisation of dense multiplication.</div><div><br /></div><div>I should point out that the code described in this blog is very new, and after the last blog, we found many things that changed the results. We've taken much more care this time, but even an infinite amount of effort won't avoid all possible sources of error. The code itself is merged into trunk, since we mostly trust it, but it should still be considered experimental until it settles a bit.</div><div><br /></div><div>I wish to thank Roman Pearce, Michael Monagan and Mickael Gastineau who have very generously provided details of how to use their systems, and gave us access so that we could do timings. There is an academic version of Trip available, which has everything but the homogeneous algorithm. For that, Mickael prepared a special version we could make use of.</div><div><br /></div><div>We also want to thank the EU for their Horizon 2020 funding of the OpenDreamKit project.</div>William Harthttp://www.blogger.com/profile/18416881057216462316noreply@blogger.com0tag:blogger.com,1999:blog-7651115430416156636.post-78792128410796158132017-08-05T05:12:00.000-07:002017-08-06T10:27:29.442-07:00<h2></h2><h2>Oscar:Antic Update</h2><div><br /></div><div>This week we had the first OSCAR:Antic workshop and coding sprint [1]. We had about 10 people turn up, though unfortunately we clashed with big meetings in Atlanta and Leiden. Still, we got a lot done, so we'll probably have another one of these soon!</div><div><br /></div><div>For those who don't know, OSCAR is the computer algebra system being developed as part of a large DFG (German Research Council) transregional grant. The Antic subsystem of OSCAR is the Number Theory subsystem, consisting of Julia packages such as Hecke.jl, Nemo.jl and making use of Flint, Arb, the Antic C library, etc.<br /><br />The workshop and expenses for invited speakers were funded by DFG TRR 195, but I should also note that a number of the participants are currently employed by OpenDreamKit, an EU Horizon 2020 grant.<br /><br /></div><h3>Fp bar</h3><div><br /></div><div>The focus of the workshop more or less ended up being finite fields and their embeddings. Luca De Feo gave a really fantastic set of talks on algorithms for embedding finite fields with the aim of eventually reaching an implementation of Fp bar (the algebraic closure of the finite field of p elements).</div><div><br /></div><div>We learned about all the recent breakthroughs in using elliptic curves over a finite field to efficiently do such constructions, including the papers of Luca and his coauthors [3].</div><div><br /></div><div>Luca put all this work in the context of the older work of Bosma-Cannon-Steel and Lenstra-deSmit and the algorithms of Allombert and Plesken.</div><div><br /></div><div>One of the main aims of the coding sprints was to implement a naive embedding method in Nemo, which just relies on finding roots of one finite field in another. This is still work in progress, but I think it is fair to say we made some good progress towards this. <br /><br />The eventual aim is to have a modern sophisticated and highly performant implementation in Nemo, based on the code of Luca and his student Jean Kieffer. Some of the basic building blocks for this will be in Flint, of course.</div><div><br /></div><div>The basic steps were:</div><div><ul><li>Add some additional helper functions to Flint</li><li>Add root finding, based on equal degree factorisation, to Flint</li><li>Port the Hecke code for maps into Nemo</li><li>Add the code for handling embeddings, going up and down along embedding maps</li><li>Thinking about how to handle compatibility</li><li>Designing an interface for Nemo</li></ul><h3>Elliptic curve cryptography in Nemo</h3></div><div><br /></div><div>Luca and Jean presented a poster at ISSAC, which was held the previous week in Kaiserslautern. I think they even won an award for their poster.</div><div><br /></div><div>Following on from this, Jean gave a talk at our Antic workshop on the work he and Luca have been doing on a Nemo module for elliptic curves over a finite field [4]. You can find the slides on the workshop homepage [1].</div><div><br /></div><div>One remark that I found interesting was a note Jean made on performance. One of his graphs showed that Sage was beating Nemo for a generic implementation of Weierstrass elliptic curves over a finite field of cryptographic sized characteristic (e.g. p = 2^502 + 49), especially as the degree of the field grows.</div><div><br /></div><div>We eventually traced this down to two specific things:</div><div><br /></div><div><ul><li>Sage was using one division instead of two in its implementation of the addition law on an elliptic curve (this was easily fixed in Jean's code)</li><li>Sage was using Pari for finite fields, and Pari is <u>really</u> fast at finite field arithmetic</li></ul><div>My contribution was to speed up arithmetic for finite fields of smallish degree for large characteristic. Addition is still faster in Pari, due to their stack based memory management, but we now have multiplication that is faster than Pari, inversion about the same speed (sometimes faster, sometimes slower, to within about 1-2%), division and powering faster than Pari.</div><div><br /></div><div>The tricks involved in speeding up inversion (and hence division) were as follows:</div><div><br /></div><div><ul><li>Improvements to the memory manager in Flint to block allocate mpz_structs</li><li>implement a gcdinv function for the Flint fmpz_mod_poly module (polynomials over Z/pZ for large p), instead of using the xgcd implementation, which does additional work</li><li>use temporary allocations (on the stack where possible) throughout the gcdinv and divrem code</li><li>delay reduction mod p in the divrem_basecase code in fmpz_mod_poly</li><li>inplace divrem_basecase</li><li>change crossover to divrem_divconquer to a higher value, in light of new basecase improvements</li></ul><div>These improvements dropped the time for inversion in a finite field of degree 5 in large characteristic from about 12.7s for 100,000 inversions to about 8.7s.</div></div></div><div><br /></div><div>There's still one potential improvement we could make, which would speed things up quite a bit:</div><div><br /></div><div><ul><li>implement a dedicated gcdinv function that doesn't call out to divrem, but does division itself, delaying reduction all the way through the entire Euclidean algorithm, except when actually needed</li></ul><div>This would be quite a lot of work and so we didn't implement this yet. Contributions welcome.</div></div><div><br /></div><h4>An interesting observation</h4><div><br /></div><div>Initially we believed that Pari has special code for quadratic extensions of Fp. This is because we were constructing Fp^2 and then inverting f = x^1234 as a benchmark for inversion.</div><div><br /></div><div>After investigating this, I realised that Pari was using cyclotomic polynomials for at least degree 2 and 4 extensions, where possible. This meant that in the degree 2 case, x^3 = 1 and in the degree 4 case x^5 = 1.</div><div><br /></div><div>Of course x^1234 is not a really random looking element in such extensions, and it just so happens that the extended gcd computation that is necessary for inversion really goes fast for such computations.<br /><br />I don't see how to make use of this in general, but it was interesting that it made such a huge difference (a factor of 10 or so in Pari).</div><div><br /></div><div>Actually, we don't yet get such an improvement in Flint (it's more like 5x), so I will look into that today. It's almost certainly due to not terminating the xgcd early enough, or not handling a leading coefficient of 1 optimally, or something like that. It'll be something trivial, but worth optimising for, I think.<br /><br />Update 06 Aug: this is now fixed in flint/trunk. It was indeed due to not dealing with length 1 remainders in a special case in the gcdinv code.<br /><br /></div><h3>Other topics</h3><div><br /></div><div>Many other topics were covered at the workshop.</div><div><br /></div><h4>Ball arithmetic</h4><div><br /></div><div>Fredrik Johansson gave a very nice presentation on Arb for ball arithmetic over R and C. The slides are on the workshop website [1].</div><div><br /></div><h4>Generic polynomial testing</h4><div><br /></div><div>We decided to write test code for generic polynomials in Nemo. Three of us (me, Luca and Remi) wrote test functions, with particular emphasis on:</div><div><br /></div><div><ul><li>towers of univariate polynomial rings</li><li>generic polynomials where the base ring is built over a characteristic p field</li><li>randomised tests</li></ul></div><div>As expected, this turned up numerous corner cases and failures, most of which are fixed (the rest will be fixed before the next Nemo release).</div><div><br /></div><h4>Multivariate polynomial arithmetic over Z</h4><div><br /></div><div>We finally merged all the new multivariate code in Flint/trunk that I have been working on (recently with the help of Daniel Schultz). I've spoken about this code in a recent blog post, and there will be another blog post on this topic fairly soon. Lots of things happened since the last update, including some speedups.</div><div><br /></div><h4>Memory leak in Flint matrices</h4><div><br /></div><div>Fredrik fixed a serious memory leak in Flint matrices, which was slowing things down.</div><div><br /></div><h4>A 2500 times speedup</h4><div><br /></div><div>Fredrik sped up some Flint code by a factor of 2500. The slow Flint code was pointed out by the Sage developers, who were using it for modular symbols. The code is still in a branch, and could be improved even further. But since the Flint code is also used in Nemo, this will speed up some things in Nemo too.</div><div><br /></div><h4>Flint on FreeBSD</h4><div><br /></div><div>Dima Pasechnik helped with getting Flint working on FreeBSD.</div><div><br /></div><h4>Gap-Julia interface</h4><div><br /></div><div>Thomas Breuer and Sebastian Gutsche made progress in the OSCAR Gap-Julia interface. They worked on:<br /><ul><li>Extended the GAP-Package JuliaInterface (Type conversions, e.g., of GAP cyclotomics into Nemo cyclotomics, access to more functionality, improved error handling)</li><li>Translating shortest vector GAP function to Julia to look to test how speed improves.</li><li>Tried to access compiled Julia methods directly (unsucessfully so far)</li></ul><br />There is an upcoming Gap days in Siegen early next month [2]. </div><div><br /></div><div>Register for the Gap conference, or the subsequent coding sprint if you are interested in general Gap development, or a Gap-Julia interface.</div><div><br /><h4>Moving finite fields out of Flint</h4></div><div><br /></div><div>We discussed the future possibility of moving finite fields out of Flint, into a separate library, similar to the way Antic (for number field arithmetic) is separate from Flint.</div><div><br /></div><div>The reason for this is the templated code (using C #defines) that is used in the finite fields module. This suggests we should be using C++ for finite fields (with a C interface to maintain compatibility with existing code).</div><div><br /></div><div>Some of the work Luca wants to do on finite field extensions, which requires bivariate polynomial arithmetic and so on, would be fiddly to develop using the Flint templating system. C++ would make this easier, and we'd still get good performance if we restricted to a C-style interface and only used templates to make the code easier to write.</div><div><br /></div><div>The other possibility is to reimplement finite fields directly in Julia. But we decided to delay a project like that until we see how the Julia JIT compiler will be improved. Putting too many layers of code in Julia is asking for trouble right now. When JIT compilation is improved after the language matures, this would be a sensible choice. Each time Julia improves, the sweet spot enlarges, and we can do more in Julia. For example, traits are a highly anticipated feature of Julia, which should materialise over the next couple of years.</div><div><br /></div><div>Rewriting the finite fields code in C++ (with a C-style interface, for performance) is certainly a worthwhile project. Volunteers welcome.<br /><br /></div><h4>References and links</h4><div><br /></div><div>[1] <a href="https://github.com/Nemocas/Nemo.jl/wiki/OSCAR-:-Antic-Workshop--and--coding-sprint">https://github.com/Nemocas/Nemo.jl/wiki/OSCAR-:-Antic-Workshop--and--coding-sprint</a></div><div>[2] <a href="http://gapdays.de/gapdays2017-fall/">http://gapdays.de/gapdays2017-fall/</a></div><div>[3] <a href="http://defeo.lu/">http://defeo.lu/</a><br />[4] <a href="https://github.com/defeo/EllipticCurves.jl">https://github.com/defeo/EllipticCurves.jl</a></div>William Harthttp://www.blogger.com/profile/18416881057216462316noreply@blogger.com0tag:blogger.com,1999:blog-7651115430416156636.post-86748713525532180872017-07-09T08:07:00.000-07:002017-09-23T17:06:46.649-07:00Update on fast multivariate polynomial arithmetic in Flint<h2>Fast multivariate polynomial arithmetic in Flint</h2><div><br /></div><div>It's time for me to update everyone on the implementation of fast multivariates in Flint. We are working on this as part of the OpenDreamKit project, funded by the EU. The eventual aim is to provide a robust, extremely fast implementation of multivariate polynomial arithmetic and GCD that:</div><div><br /></div><div><ul><li>is parallelised (hopefully scaling linearly with number of cores)</li><li>supports basic monomial orderings</li><li>supports individual exponents up to 64 bits and exponent vectors of arbitrary size</li><li>provides multiplication, exact division, divisibility testing, division with remainder and GCD</li><li>is open source</li><li>is highly optimised for Z, Q and Z/pZ</li></ul><h3>What else already exists?</h3></div><div><br /></div><div>But this is a solved problem, right? There's heaps of libraries that already do this, aren't there?</div><div><br /></div><div>Indeed there are lots of cool libraries and proprietary software. Let's summarise each of them:</div><div><br /></div><h4>Singular</h4><div>Singular is a Computer Algebra System for polynomial systems, developed here in Kaiserslautern. It has sparse multivariate polynomial arithmetic over various fields, supporting exponents from 2 to 64 bits (no smooth promotion for the user doing polynomial arithmetic, but internally in Groebner bases promotion happens optimally and automatically) and in practice, any number of variables (up to about 2^15).</div><div><br /></div><div>But Singular's polynomial representation is optimised for Groebner bases (GBs). In particular, GBs require that one be able to get the leading monomial of a polynomial very fast and to be able to multiply polynomials by monomials extremely fast. Whilst a basic Buchberger GB implementation can benefit from fast basic arithmetic, most of the really efficient GB algorithms cannot. And it is not enough to be able to return the value of the leading monomial. One really needs to be able to return it without copying any data. For this purpose, Singular uses a linked list implementation of (sparse distributed) multivariate polynomials. Of course this makes Singular very slow for ordinary multivariate arithmetic.</div><div><br /></div><div>Naturally, Singular's multivariate arithmetic is not parallelised.</div><div><br /></div><div>Another issue with Singular is that its multivariate arithmetic is only "fast" if you don't supply a module ordering. But if you don't supply one, you can't use the polynomials you generate in most of Singular's higher level functions. In other words, the implementation is very highly tied to Singular's domain of specialty.</div><div><br /></div><div>Some of Singular's multivariate functionality is supplemented by Factory, which we'll describe next.</div><div><br /></div><h4>Factory</h4><div>Factory is a library distributed with Singular, for multivariate polynomial arithmetic, GCD and factorisation. Martin Lee put a lot of time into Factory over the past few years, to speed it up, in particular switching to Flint for univariate polynomial arithmetic and speeding up factorisation routines.<br /><br />Factory still uses a sparse representation for polynomials, but optimised more for arithmetic. I believe it uses a sparse recursive representation, which is a logical choice for such a library.</div><div><br /></div><div>Some operations in Factory are parallelised, basically because those operations are parallelised in Flint. But basic multivariate arithmetic is not parallel.</div><div><br /></div><div>In the long run, the work that we are currently doing for ODK will possibly replace the core of Factory. I'm sorry that I don't have explicit timings for Factory below. I may include them in future.</div><div><br /></div><h4>Magma</h4><div>This is a closed source Computer Algebra System. Based on my timings, I now believe that Magma does certain critical operations in a heuristic way, so I no longer trust it. But still, this is the system to beat if possible. It has for a long time provided exceptionally fast basic arithmetic. Allan Steel in particular has done an amazing job with core arithmetic in Magma. Unfortunately his ideas are often not published, and he doesn't seem to find time to reply to emails asking about his algorithms. An open code base (even if not Open Source) would be highly preferable!</div><div><br /></div><div>Magma supports exponents up to about 32 bits, but its basic multivariate arithmetic is not parallelised. Some operations seem to be missing (or I couldn't find them documented). But it generally has very good coverage. Magma is good at both GBs and basic arithmetic.</div><div><br /></div><h4>Giac</h4><div>Giac is part of the Xcas computer algebra system, which is widely used in schools in France. Moreover, it is Open Source!</div><div><br /></div><div>Bernard Parisse has done a magnificent job speeding up multivariate arithmetic in Giac. It has also been really fantastic to have email exchanges with Bernard and exchange ideas.</div><div><br /></div><div>Giac supports exponents up to 16 bits, but as far as I'm aware, only supports lexical orderings. The front end is symbolic and a conversion is done to a raw, internal polynomial format for doing polynomial arithmetic. One can change the priority of variables, but degree orderings and reverse orderings can only be supported artificially.</div><div><br /></div><div>Since Flint currently only allows the order of variables to be reversed and not the order of comparisons (as required for true reverse lexical orderings), I decided to blog about our timings now, since comparison with Giac is actually fair. Properly supporting reverse lexical orderings (work in progress right now), will very slightly slow down our benchmarks.</div><div><br /></div><div>Giac is parallelised and gives a nice speedup with multiple threads. It also has really fast multivariate GCD and factorisation. </div><div><br /></div><h4>Piranha</h4><div>Piranha is another Open Source library, written in C++. It makes heavy use of templating and is insanely fast. It is also parallelised and scales nearly linearly with the number of cores. Its maintainer, Francesco Biscani is also really talkative by email and has lots of great ideas.</div><div><br /></div><div>Piranha has a reputation for being fast, and it is well-deserved. However, I don't include it in timings below since Piranha uses a hashed representation for polynomials, which means that it doesn't support things like division of multivariate polynomials (it's really hard to get the leading term of a polynomial from a hash table!)</div><div><br /></div><div>For those interested to know, I think we are beating Piranha on multiplication of dense polynomials, but it beats us by a factor of 3 or so on the sparse benchmarks. This is to be expected given their representation. We are thinking of also adding a "hash" ordering for cases where division is not required, so that we can compete with Piranha on that benchmark.</div><div><br /></div><div>Piranha is an amazing project. It supports arithmetic and symbolic operations. It even supports Poisson and divisor series, which are presumably important in applications in Celestial mechanics. </div><div><br /></div><div>Piranha also has very limited range for its packed monomial representation, which is fine for some applications, but not useful for our purposes. It is still under development though, and I expect some of its limitations will eventually be lifted.</div><div><br /></div><h4>Sdmp</h4><div>Sdmp is the library developed by Michael Monagan and Roman Pearce for fast polynomial arithmetic in Maple. It's closed source, but its authors regularly publish papers describing their (revolutionary) ideas. Unfortunately, I only have access to the versions of these papers on their website and have had trouble implementing some of their algorithms as given in their papers. The fixes seem to be nontrivial and have consumed a lot of my time. An open code base (even if not Open Source) would be highly preferable!</div><div><br /></div><div>I do not provide timings of Sdmp below, since I do not have access to it. I believe we are probably beating Sdmp on the dense benchmarks, but maybe 10-15% slower on the sparse benchmarks, but this is purely conjecture based on various bits of information I have.</div><div><br /></div><div>Of course, Sdmp is parallelised, and scales well with the number of cores up to about 16 cores.</div><div><br /></div><h4>Trip</h4><div>Trip has a reputation for being extremely fast, represents decades of work, and should be hard to compete with especially on multiple cores, as it scales extremely well. It is closed source, though occasionally papers are published describing the algorithms in very high level detail, and giving timings.</div><div><br /></div><div>I don't have a full copy of Trip, but the authors confirm that the academic version is full speed and can be used for timings for academic purposes. I should be able to include timings for functionality available in the academic version of Trip at some point in the future. I give just a couple of data points in the comments below, but want to get familiar with the system before putting up official timings.</div><div><br /></div><h4>Pari/GP</h4><div>Pari/GP is Open Source and implements something like a recursive dense or even recursive sparse format, depending on how you define it. Unfortunately I can't time it, since it isn't capable of the benchmarks I'm doing, due to the memory usage of their representation.<br /><br />What Pari/GP is doing is usually fine for Number Theory, where things tend to be pretty dense. But it isn't trying to solve the same problem we are trying to solve, which is a really fast, flexible system for multivariate arithmetic in both the sparse and dense cases.</div><div><br /></div><h4>SageMath</h4><div>Initially I thought it was pointless including SageMath since I assumed it just used Singular. However, on my machine, Sage is actually slower than Singular, and so I decided to include timings for that, at least for the cases that finished in under two hours.</div><div><br /></div><div>Presumably the Sage developers didn't know about certain operations in Factory (or they weren't available at the time), since some Sage timings are really bad. I don't know what they do instead. Sage is Open Source, but there'd be more motivation to look up the source code if it was faster than everyone else, instead of slower.</div><div><br /></div><div>As the work we are doing is part of OpenDreamKit, it is anticipated that Sage will eventually benefit greatly from the work we are doing.</div><div><br /></div><h3>Update on Flint multivariate polynomial arithmetic</h3><div><br /></div><div>Over the past year, I've spent a lot of time focusing on an implementation of multivariate arithmetic over the integers. Of course, one needs to be a little careful to describe what one actually means by division with remainder of multivariate polynomials over Z. But the idea has been to get a decent implementation over Z that can be easily modified for multivariate polynomials over Z/pZ and Q.<br /><br />We currently have single core code only, for multiplication, divisibility testing, quotient only, quotient with remainder and reduction by an ideal (specified as an array of generators), plus a bunch of simpler operations, of course.<br /><br />The code supports exponents up to 64 bits (actually 63 bits, since we use the top bit to detect overflows), with automatic packing of monomials where possible. The number of words in an exponent vector is essentially unlimited.</div><div><br /></div><div>We support lexical and degree lexical orderings, with degree reverse lexical ordering not far away (we currently support reversing the order of variables, just not the order of comparisons).<br /><br />The big news for this blog post is that it all now passes very strong test code, and I've documented the interface. It should be merged into flint/trunk soon.</div><div><br /></div><div>The other exciting news is we've hired Daniel Schultz to work on the project for OpenDreamKit, for two and a half years. He will primarily be focusing on parallelising the code, though he'll no doubt work on many other things as well, such as a hash ordering, reverse lexical orderings, GCD and so on.<br /><br />One major thing we are still missing is quasidivision. There are two kinds required: one that allows division over Q using only integer arithmetic, and the other that allows division over the fraction field of a multivariate polynomial ring using only polynomial arithmetic over the integers. Neither of these should be difficult to implement now that we have the basics working.</div><div><br /></div><h3>A puzzle</h3><div><br /></div><div>So far we can't get down to the same time as Magma for divisibility testing of multivariate polynomials. This is a critical operation for various fast multivariate GCD algorithms. Magma seems to be able to do this 200 times faster than it can do exact quotient!<br /><br />I've asked a lot of smart people about this, and no one has an idea how it is possible to do this so fast. Moreover, Magma curiously takes about the same time as a heuristic test. There's really not enough time to do anything other than a few evaluations at some small values, and even then you have to do some tricks to pull that off!<br /><br />I've written to Allan Steel to ask how this is done, but he hasn't replied. At this stage I strongly suspect Magma is doing a heuristic test only. Since the source code is not open for me to examine, and no one published this amazing algorithm, I am free to assume what I want about it.<br /><br />Because I don't believe Magma is doing the right thing here, I've asked it to provide me with the exact quotient when doing divisibility testing (which just happens to be what our code does anyway). This makes the Magma times more realistic.<br /><br /><h3>Benchmarks</h3></div><div><br /></div><div>And now for the benchmarks of all the systems I have on one machine (it's nearly 7 years old), with the exceptions noted above.<br /><br />In each case there are "dense" and "sparse" benchmarks. Note that we don't include really dense or really sparse benchmarks. In the really dense case, you'd probably want to use an FFT based algorithm. Our code is probably more suited to the really sparse case. But it does well in the sparse case anyway (if we don't compare against a hash based approach), so that's what we've chosen to use for the benchmarks.<br /><br />Timings that took longer than 2 hours are marked with a dash.</div><div><br /></div><div>In some cases, systems don't provide the exact operation required, but I've allowed division with remainder to be substituted when that is the case.</div><div><br /></div><div>The coefficient ring is Z for all operations, though Giac uses Q and internally optimises when working over Z. The benchmarks are picked to not introduce denominators.</div><div><br /><h4>Multiplication (dense Fateman):</h4></div><div><ul><li>f = (1 + x + y + z + t)^n</li><li>time p = f*(f + 1)</li></ul><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/--uwP96kaV-U/WWJCOaHAtqI/AAAAAAAAB2U/CGCOxveByHgAyREwkZZRmHqanOmxtSC2gCLcBGAs/s1600/DenseMult.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="359" data-original-width="755" height="190" src="https://4.bp.blogspot.com/--uwP96kaV-U/WWJCOaHAtqI/AAAAAAAAB2U/CGCOxveByHgAyREwkZZRmHqanOmxtSC2gCLcBGAs/s400/DenseMult.PNG" width="400" /></a></div><h4>Multiplication (sparse Pearce):</h4></div><div><ul><li>f = (1 + x + y + 2z^2 + 3t^3 + 5u^5)^n</li><li>g = (1 + u + t + 2z^2 + 3y^3 + 5x^5)^n</li><li>time p = f*g</li></ul><div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-3XZxk-Gg7Go/WWJDZcBET4I/AAAAAAAAB2Y/yvTRVmmcUxwkSAzTlrOeBUGMCM0x-hhRwCLcBGAs/s1600/SparseMult.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="386" data-original-width="698" height="220" src="https://2.bp.blogspot.com/-3XZxk-Gg7Go/WWJDZcBET4I/AAAAAAAAB2Y/yvTRVmmcUxwkSAzTlrOeBUGMCM0x-hhRwCLcBGAs/s400/SparseMult.PNG" width="400" /></a></div><div><br /></div><h4>Quotient only division (dense):</h4></div><div><ul><li>f = (1 + x + y + z + t)^n</li><li>p = f*(f + 1) + x^(n - 3)</li><li>time q = p/f discard remainder</li></ul><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-8bv-5FZGkvU/WWJDfQoo2mI/AAAAAAAAB2c/i_Xd2Z_hzjoMfLva1ebhcbkvXnABG4yJACLcBGAs/s1600/DenseQuot.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="350" data-original-width="741" height="188" src="https://3.bp.blogspot.com/-8bv-5FZGkvU/WWJDfQoo2mI/AAAAAAAAB2c/i_Xd2Z_hzjoMfLva1ebhcbkvXnABG4yJACLcBGAs/s400/DenseQuot.PNG" width="400" /></a></div><div><br /></div><div><div><h4>Quotient only division (sparse):</h4></div><div><ul><li>f = (1 + x + y + 2z^2 + 3t^3 + 5u^5)^n</li><li>g = (1 + u + t + 2z^2 + 3y^3 + 5x^5)^n</li><li>p = f*g + x^(n - 3)</li><li>time q = p/f discard remainder</li></ul><div class="separator" style="clear: both; text-align: center;"></div><div><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-DN_siE7Pcv4/WWYb5gspUUI/AAAAAAAAB28/LjDkOvfIbOABqborY-L41y5RIBHsjAgbgCLcBGAs/s1600/SparseQuot.PNG" imageanchor="1"><img border="0" data-original-height="299" data-original-width="540" height="221" src="https://3.bp.blogspot.com/-DN_siE7Pcv4/WWYb5gspUUI/AAAAAAAAB28/LjDkOvfIbOABqborY-L41y5RIBHsjAgbgCLcBGAs/s320/SparseQuot.PNG" width="400" /></a></div><br /></div><div>Yes, exact quotient can be much faster than multiplication!!<br /><br />N.B: due to an accident of interfacing, previously published Flint times for this benchmark were using reflected lexicograhical ordering, instead of lexicographical ordering. This has now been corrected.</div></div></div></div><div><div><h4>Divisibility test with quotient (dense):</h4><h4><div></div><div><ul style="font-weight: normal;"><li>f = (1 + x + y + z + t)^n</li><li>p = f*(f + 1)</li><li>time flag, q = divides(p, f)</li><li>flag indicates whether division was exact, in which case q is the quotient</li><li>(division with remainder can be substituted for this operation)</li></ul><div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-TYO1YKGVzKU/WWJDs1hf0vI/AAAAAAAAB2k/MTJwSSyFxHc_x9yAeCXL5bSGekjHunbRACLcBGAs/s1600/DenseDivides.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="353" data-original-width="729" height="192" src="https://2.bp.blogspot.com/-TYO1YKGVzKU/WWJDs1hf0vI/AAAAAAAAB2k/MTJwSSyFxHc_x9yAeCXL5bSGekjHunbRACLcBGAs/s400/DenseDivides.PNG" width="400" /></a></div><div><span style="font-weight: normal;"><br /></span></div></div></h4><h4>Divisibility test with quotient (sparse):</h4><h4><div></div><div><ul style="font-weight: normal;"><li>f = (1 + x + y + 2z^2 + 3t^3 + 5u^5)^n</li><li>g = (1 + u + t + 2z^2 + 3y^3 + 5x^5)^n</li><li>p = f*g</li><li>time flag, q = divides(p, f)</li><li>flag indicates whether division was exact, in which case q is the quotient</li><li>(division with remainder can be substituted for this operation)</li></ul><div class="separator" style="clear: both; text-align: center;"></div><div><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-sdRzEf-wo0Y/Wcb3Bvubb1I/AAAAAAAAB4A/ZB3OEOdzdMkBxNKcf5tg3m4cAhBrGKG7QCLcBGAs/s1600/div1.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="361" data-original-width="657" height="218" src="https://4.bp.blogspot.com/-sdRzEf-wo0Y/Wcb3Bvubb1I/AAAAAAAAB4A/ZB3OEOdzdMkBxNKcf5tg3m4cAhBrGKG7QCLcBGAs/s400/div1.PNG" width="400" /></a></div><span style="font-weight: normal;"><br /></span></div></div></h4></div><div><div><div></div></div></div></div>William Harthttp://www.blogger.com/profile/18416881057216462316noreply@blogger.com12tag:blogger.com,1999:blog-7651115430416156636.post-30718689954055982692017-02-28T06:01:00.002-08:002017-02-28T08:12:24.326-08:00Parallelising integer and polynomial multiplication in Flint<br />As I explained in my last two blogs, over the last eight months I have been working on the OpenDreamKit project, funded by the EU in the Horizon2020 initiative. In this blog post I want to discuss the work I've been doing to speed up integer and polynomial multiplication as part of the OpenDreamKit high performance computing and optimisation component.<br /><br />Flint makes use of a variety of algorithms for integer and polynomials. For integers, GMP/MPIR is used, and algorithms such as the schoolboy algorithm, Karatsuba and Toom-Cook algorithms are used. For polynomials the schoolboy algorithm is used, along with the Karatsuba algorithm and various variants of Kronecker segmentation.<br /><br />For large integer and polynomial multiplications, however, Flint uses an FFT convolution.<br /><br />Actually, integer multiplication is reduced to polynomial multiplication. The large integers are broken, down to the bit, into equal sized chunks, which are interpreted as coefficients of a polynomial. The polynomials are multiplied, and the result is again interpreted as a large integer.<br /><br />Polynomial multiplications of large degree polynomials with small coefficients are also reduced to integer multiplication, by a method known as Kronecker Segmentation, where the polynomials are evaluated at a sufficiently large power of two, the resulting large integers are multiplied, and then the coefficients of the product polynomial are identified in the zero padded output.<br /><br /><h4>The FFT convolution</h4><div><br /></div>The FFT convolution is a method that has much better complexity than all of these methods. It can multiply polynomials in O(d log d) steps, where d is the degree of the polynomial. Compare this with the classical schoolboy method, which takes O(d^2) steps.<br /><br />At its core, the FFT convolution efficiently multiplies two large polynomials modulo x^(2^n) - 1. This is like polynomial multiplication with wraparound. Every 2^n coefficients, the result is wrapped around to the start. But we can use it to multiply polynomials of degree d if 2d < 2^n so that wraparound of the result doesn't interfere.<br /><br />To multiply modulo x^(2^n) - 1 the FFT convolution uses an evaluation, pointwise multiplication, interpolation strategy. The polynomials to be multiplied are first evaluated at 2^n-th roots of unity. These 2^n values for each polynomial are multiplied with each of their counterparts from the other polynomial, to yield 2^n "pointwise products" and then the result polynomial is interpolated from these 2^n values.<br /><br />Notice that the number of roots of unity is always a power of two in the FFT algorithm, so you are always doing multiplication modulo x^(2^n) - 1 for some n. In other words, convolutions always have length a power of two.<br /><br />The evaluation phase of the FFT convolution is called the forward FFT. In the middle are the pointwise multiplications, i.e. a total of 2^n coefficient multiplications, and then the interpolation phase is called the inverse FFT or IFFT, for short.<br /><br />Both the FFT and IFFT take O(n log n) operations, all of which are additions, subtractions and multiplications by roots of unity. The pointwise multiplications are simply coefficient multiplications. For example, if the polynomials being multiplied were over the complex numbers, the pointwise multiplications would be products of complex numbers.<br /><br /><h4>The Schoenhage-Strassen algorithm</h4><div><br /></div><div>Performing FFTs and IFFTs over the complex numbers is problematic. When implemented on a computer, we have to use floating point arithmetic for this, which can result in rounding errors which can't be effectively controlled.</div><div><br /></div><div>Instead of working over the complex numbers, the Schoenhage-Strassen algorithm works over another ring with roots of unity, namely Z/pZ where p = 2^(2^k) + 1. Note that in this ring, 2 is a 2^(k+1)-th root of unity. This allows for convolutions up to that length. But the advantage is that elements of this ring can be represented exactly, using (multiprecision) integer arithmetic,</div><div><br /></div><div>Pointwise multiplications become multiplications modulo p = 2^(2^k) + 1, which can also be performed exactly. Thus, the result of such an algorithm is always exactly correct if the inputs are polynomials over the integers, so long as the output coefficients are smaller than p.</div><div><br /></div><div>The Flint FFT convolution code uses the Schoenhage-Strassen algorithm to multiply large polynomials and integers. However, it incorporates many improvements, which I will now describe.</div><div><br /></div><h4>Efficient FFT butterflies</h4><div><br /></div><div>The basic low level component of the FFT and IFFT algorithms is called a butterfly. The FFT butterfly transforms pairs of inputs as follows</div><div><br /></div><div><pre style="box-sizing: border-box; color: #333333; font-family: SFMono-Regular, Consolas, "Liberation Mono", Menlo, Courier, monospace; font-size: 15px; font-stretch: normal; line-height: normal; white-space: pre-wrap;">[a{i}, b{i}] => [a{i}+b{i}, z^i*(a{i}-b{i})]</pre><pre style="box-sizing: border-box; color: #333333; font-family: SFMono-Regular, Consolas, "Liberation Mono", Menlo, Courier, monospace; font-size: 15px; font-stretch: normal; line-height: normal; white-space: pre-wrap;"></pre></div>where z is some power of the 2^(k+1)-th root of unity discussed above. As this root of unity is a power of 2 in the Schoenhage-Strassen algorithm, multiplication by z is just a binary shift to the left or right.<br /><br />The IFFT butterflies are similar. They take the form<br /><br /><pre style="box-sizing: border-box; color: #333333; font-family: SFMono-Regular, Consolas, "Liberation Mono", Menlo, Courier, monospace; font-size: 15px; font-stretch: normal; line-height: normal; white-space: pre-wrap;">[a{i}, b{i}] => [a{i}+z^(-i)*b{i}, a{i}-z^(-i)*b{i}]</pre><br />Since the FFT coefficients are multiprecision integers (modulo p), the butterfly operations require applying the given operations on multiprecision integers of a fixed precision.<br /><br />To speed these up, we make use of MPIR's low level assembly optimised functions. The sumdiff function performs an addition and a subtraction at the same time, and of course we make use of code for shifting multiprecision integers. In addition, the subtraction modulo p requires complementing the integers in twos complement format.<br /><br />One of the OpenDreamKit projects we worked on was to speed up these basic assembly operations on modern CPUs, especially those that support AVX instructions.<br /><br />You can read about the work that was done on this by Alex Kruppa [1] and the various improvements he got over the old MPIR assembly code.<br /><br /><h4>Extended Fermat numbers</h4><div><br /></div><div>Numbers of the form p = 2^(2^k) + 1 are called Fermat numbers. But they limit us to working with coefficients in both our input polynomials and the output polynomial, which don't exceed p. To give us more flexibility, we can work with numbers of the form p = 2^(a*2^k) + 1 for some number a. Now 2^a is a 2^(k+1)-th root of unity, but we get to work with much larger coefficients at very little extra cost. Of course the pointwise multiplications are now more expensive, but sometimes this extra cost is worth paying for the extra flexibility we get.</div><div><br /></div><h4>Truncated Fourier Transform</h4><div><br /></div><div>As we noticed above, the length of a convolution is always a power of 2. We are always working modulo x^(2^n) - 1.</div><div><br /></div><div>This means that if we are multiplying polynomials of degree d, everything is fine until 2d = 2^n. At this point, the output polynomial will become too large and we have to increase n. This leads to ugly steps in performance at power of two lengths.</div><div><br /></div><div>To get around this, we make use of an algorithm known as the Truncated Fourier Transform (TFT). Truncating the FFT to non-power-of-2 lengths is relatively straightforward. We simply omit the operations which deal with coefficients beyond the desired length.</div><div><br /></div><div>The pointwise multiplications are also easy. Simply do fewer pointwise multiplications.</div><div><br /></div><div>But now we have less information going into the inverse FFT than we normally would. We use some clever linear algebra to reconstruct the missing information as the algorithm proceeds. </div><div><br /></div><div>By truncating the entire convolution in this way, it is possible to save almost all the work that would be required to do a full, length 2^n transform.<br /><br /></div><h4>Negacyclic transform</h4><div><br /></div><div>When the pointwise multiplications get very large, there is a trick we can use to perform them using an FFT of half the usual length.</div><div><br /></div><div>Notice that the pointwise multiplications are done modulo p = 2^(a*2^k) + 1. If we split such a number up, this almost looks like doing a polynomial multiplication modulo x^(2^n) - 1 for some value of x, except for that pesky -1 instead of +1.</div><div><br /></div><div>To get around this, we multiply x by a suitable root of unity and rescale so that the FFT convolution effectively becomes multiplication modulo x^(2^n) + 1. In other words, we use the FFT wraparound feature to our advantage.</div><div><br /></div><div>This trick is due to Nussbaumer and saves time whenever the FFT coefficients are themselves large enough to be multiplied using an FFT, or perhaps slightly smaller than that.</div><div><br /></div><div>The main problem with this approach is the size of the FFT coefficients of that small FFT convolution. It usually isn't quite the right size.</div><div><br /></div><div>To get around this, it is possible to do a naive convolution where the coefficients are just one or two words in size and use Chinese remaindering to deal with coefficients that are just slightly larger than this Nussbaumer convolution can handle.</div><div><br /></div><h4>Algebraic identities</h4><div><br /></div><div>Another trick that can be used is when the number a is a multiple of 3 in p = 2^(2^a) + 1, specifically when the pointwise multiplications are small.<br /><br />In such a case, we can use the identity x^{3a} + 1 = (x^a + 1)(x^{2a} - x^a + 1) to break the pointwise multiplications into smaller multiplication modulo x^a + 1 and x^{2a} - x^a + 1 respectively. If those multiplications are small enough to be quadratic time, splitting them up and using Chinese remaindering is actually faster than doing the one large multiplication.</div><div><br /></div><div>Multiplication modulo x^a + 1 takes time some constant multiple of a^2 and multiplication modulo x^{2a} - x^a + 1 takes four times as long again. The total is some constant multiple of 5a^2.<br /><br />Doing a single multiplication modulo x^{3a} + 1, on the other hand, would take time equal to some multiple of 9a^2. So splitting things up using identities can be a big win.</div><div><br /></div><h4>The sqrt2 trick</h4><div><br /></div><div>In the ring Z/pZ where p = 2^S + 1 the value w = 2^(2S/4) - 2^(S/4) is a square root of 2. And recall that 2 was a root of unity in the Schoenhage-Strassen algorithm. Thus w is also a root of unity.</div><div><br /></div><div>Making use of the root of unity w allows us to perform convolutions of twice the length using the Schoenhage-Strassen algorithm. And multiplication by powers of w is not so difficult. Any even power of w is just a power of 2, and any odd power is just the difference of two such powers.</div><div><br /></div><div>Ordinarily, making the length of an FFT convolution twice as big, multiplies the time taken by a factor of four, since the coefficients required to support the larger convolution also have to be twice as large. But the sqrt2 trick allows us to do it for just over twice the cost.</div><div><br /></div><h4>The matrix Fourier algorithm</h4><div><br /></div><div>One of the big problems with the Schoenhage-Strassen algorithm as described above is that it works with a lot of data. There are 2^n multiprecision FFT coefficients being dealt with at any one time. The pattern of access to these coefficients is also not very regular throughout the algorithm, and so once the entire convolution falls out of cache, performance can be very bad.</div><div><br /></div><div>The performance degradation can be reduced to some degree by switching to a recursive implementation of the FFT, which quickly gets down to blocks of coefficients that fit in cache. But it doesn't mitigate it entirely.</div><div><br /></div><div>One way of resolving this is the matrix Fourier algorithm. Instead of viewing the input as a linear array of coefficients, it breaks the single large FFT down into many smaller FFTs, as though the input coefficients were the rows and columns of a matrix.</div><div><br /></div><div>Two passes over the data are required, in order to handle lots of row FFTs and lots of column FFTs. But the overall saving is dramatic.</div><div><br /></div><div>Of course, combining this trick with all the other tricks above is really a lot of work. But we do this in Flint.</div><div><br /></div><div>In particular, we break the coefficients of the truncated Fourier transform up in such a way that each of the row and column FFTs has no wasted space.</div><div><br /><h4>Parallelisation</h4></div><div><br /></div><div>Now we have an obvious target for parallelisation. As we have broken the single large FFT into many smaller independent ones, these can be sent to different threads to perform.</div><div><br /></div><div>Of course, nothing is ever that simple. We must ensure that the temporary space allocated throughout the algorithm is not overwritten by different threads. The algorithm is also more efficient if some of the inward column FFT transforms for example, are combined with the relevant pointwise multiplications and outward IFFT column transforms.</div><div><br /></div><div>All of this also requires all the coefficients to be in memory in the right order, so that memory access is not complicated. Fortunately there are two kinds of FFT and IFFT, namely decimation in time and decimation in frequency. Picking the correct kinds at the right time allows things to be managed more or less efficiently.</div><div><br /></div><div>The matrix Fourier algorithm also requires multiplying by some additional roots of unity at the right points, and these need to be merged into one layer of the FFTs and IFFTs so that they don't become an additional cost.</div><div><br /></div><div>But all complications aside, the matrix Fourier algorithm is both cache efficient and a good target for parallelisation.</div><div><br /></div><div>As part of my OpenDreamKit work, I parallelised the Flint Schoenhage-Strassen code in precisely this way, maintaining all of the above optimisations. I made use of OpenMP for this.</div><div><br /></div><div>The results were quite good, with up to a 5.5x speedup on 8 cores for large integer multiplications (I also parallelised splitting the large integer up into FFT coefficients).</div><div><br /></div><div>You can see timings and a high level summary of the work that I did in the writeup I did for the final deliverable for this task of the OpenDreamKit project [1].</div><div><br /></div><br />[1] <a href="https://github.com/OpenDreamKit/OpenDreamKit/issues/120">https://github.com/OpenDreamKit/OpenDreamKit/issues/120</a>William Harthttp://www.blogger.com/profile/18416881057216462316noreply@blogger.com0tag:blogger.com,1999:blog-7651115430416156636.post-20462714940406152652017-02-24T08:18:00.005-08:002017-02-27T06:35:23.542-08:00Assembly superoptimisation in MPIR<br />As I mentioned in my previous blog, I have been involved in the OpenDreamKit project, funded by the EU through their Horizon2020 programme.<br /><br />Today I want to blog about a project I was involved with here at the University of Kaiserslautern, but which was carried out by two people we hired, namely Alex Best and Alex Kruppa (we made the joke that we are only hiring Alex's on H2020 and referred to them as Alex v1.0 and Alex v2.0).<br /><br /><h2>About MPIR</h2><br />MPIR stands for Multiple Precision Integers and Rationals. It is a fork of the GMP library [1] for multiple precision arithmetic for integers, rationals and floating point numbers.<br /><br />GMP and MPIR consist of three main components: (i) assembly optimised bignum arithmetic, (ii) highly optimised bignum algorithms implemented in C, (iii) high level interfaces.<br /><br />The way MPIR works is to provide assembly implementations of low level arithmetic primitives, such as addition, subtraction, shifting, multiplication and many other similar things, for each new microprocessor architecture that comes out.<br /><br />You may have heard of Intel's tick-tock cycle. They bring out a completely new microarchitecture in their tock cycle, and they shrink it and optimise it in their tick cycle. Every year or so, there is a new tick or tock.<br /><br />Starting in 2014, they introduced a new phase called their refresh cycle. So it's now tick-tock-refresh, since it is getting to be too complicated to do new microarchitectures every two or three years.<br /><br />What this means for MPIR is that every year or so there is a new tick, tock or refresh for Intel, and similar for AMD, that needs support at the assembly level.<br /><br />Over the years there have been many new instruction set technologies that have had to be supported, such as X86_64, MMX, 3DNOW!, SSE, SSE2, SSE3, SSSE3, SSE4, AVX, AVX2, AVX512, BMI, BMI2.<br /><br />It's surprising to many people that the difference between bignum code output by an optimising compiler like gcc and handwritten assembly code can be a factor of 4-12 in performance. Of course, if you already have assembly code for a prior, related architecture, the improvement you can get with handwritten assembly code is much less than this. However, the difference accumulates with time, as you forgo more and more potential improvements and support for more and more new instruction sets and technologies.<br /><br />We made the case in the OpenDreamKit grant that this ongoing maintenance to support new instruction sets requires investment to keep up with the latest processor technology. Each new microarchitecture requires as much as 3-6 months full time work to support!<br /><br /><h2>Superoptimisation</h2><div><br /></div><div>In addition to writing new assembly code to support each new microprocessor iteration, one can use superoptimisation to get up to another factor of two difference in performance (though often much less).</div><div><br /></div><div>Superoptimisation takes already written assembly code and explores all valid reorderings of the assembly instructions, that don't change the behaviour of the code, and looks for the fastest reordering.</div><div><br /></div><div>As typical assembly sequences can be dozens of lines long, this cannot be done by hand. There can be billions of valid reorderings.</div><div><br /></div><div>The reason this kind of rearrangement can make a difference is because the CPU uses a very simple algorithm to determine which instructions to put in which pipeline. There are also limitations on how many of certain types of instructions can be executed at the same time, e.g. because of a limited number of integer ports, etc.</div><div><br /></div><div>By rearranging the assembly instructions, we can sometimes cause the CPU scheduler to put the instructions in the pipeline in just the right order that resources are used optimally.</div><div><br /></div><div>If an assembly function, like a multiplication routine, is going to be used quadrillions of times, it is certainly well worth trying to get an optimal ordering, since this will save a large amount of CPU time for a lot of people.</div><div><br /></div><h2>The AJS Superoptimiser</h2><div><br /></div><div>Alex Best was the first of the two Alex's to work on writing a superoptimiser for MPIR that supported the recent AVX and BMI instruction sets.</div><div><br /></div><div>He began with an assembly JIT (Just-In-Time) library [2] written by Petr Kobalicek and improved it for use with MPIR, e.g. by supporting yasm syntax and removing numerous limitations.</div><div><br /></div><div>On top of this, he wrote a superoptimiser called AJS [3, 6] which cleverly works out which reorderings of a segment of assembly code will be valid and times them all to find the optimal one.</div><div><br /></div><div>AJS is tailored to work with MPIR assembly functions, but it could be adapted by a sufficiently talented individual to work with other projects if desired.</div><div><br /></div><div>AJS takes a large number of command line options which control things such as which lines of assembly code should be reordered, how the output is to be written, how timing is to be done, etc.</div><div><br /></div><div>After six months of work, Alex Kruppa took over AJS development. The basic superoptimiser worked, modulo some bugs, but it still couldn't be used because of an unexpected problem we encountered.</div><div><br /></div><div>In the good ole days, getting cycle accurate timing was trivial. x86 architectures had an instruction for this. But over the time, CPUs have become more and more complex, and the demands on them have become greater. We don't know whether gamers are to blame, or patent trolls, or Intel engineers, but cycle accurate timings these days, can only be obtained with a great deal of trouble.</div><div><br /></div><div>It literally took 3 or so months to solve the problem of getting cycle accurate timings on Intel processors. Some combination of fiddling with hyperthreading, address space layout randomisation, kernel options, frequency scaling, performance counters, kernel modules, stack address manipulation, SSE to AVX switching and various other tricks later, we finally got more or less cycle accurate timing on some machines we were superoptimising for.</div><div><br /></div><div>After this, Alex Kruppa was able to superoptimise for two recent Intel microarchitectures, namely Haswell and Skylake.</div><div><br /></div><div>He also did some optimisation (though not superoptimisation) for an older AMD architecture called Bulldozer.</div><div><br /></div><div>As this was all more work than could be accomplished in the 12 months total that we had available, we also relied heavily on outside volunteer effort. We are very thankful to Jens Nurmann in particular who was able to supply handwritten Skylake assembly for many functions which were easy to convert to the MPIR interface.</div><div><br /></div><div>Brian Gladman also helped to modify these function so they would work on Windows (which uses a different ABI, i.e. functions store their arguments in different registers on Windows).</div><div><br /></div><div>In some cases, GMP already had very fast or optimal assembly code for these processors, and where our internal interface is the same as theirs, we were able to use some of their functions in MPIR.</div><div><br /></div><h2>Results</h2><div><br /></div><div>The result of all this effort is a new version of MPIR which will be released in a couple of days, with much faster assembly optimised functions for Haswell, Skylake and Bulldozer architectures.</div><div><br /></div><div>We are also in the process of doing some optimisation for Broadwell, a tick to Skylake's tock.</div><div><br /></div><div>You can see tables of all the performance improvements that were obtained for Haswell, Skylake and Bulldozer on the final writeup for the OpenDreamKit project here [4].</div><div><br /></div><div>As can be seen, even over the previous assembly code that was being used for these architectures (which had been written for earlier, but related microarchitectures), we obtain as much as 20 or 30 percent improvement. This represents a real-world speedup that one can expect to see in most calls to MPIR on those architectures.</div><div><br /></div><div><br /></div><h2>Future work</h2><div><br /></div><div>Of course, we'd like to do much more for the three architectures we optimised for. There wasn't time to speed up division functions, for example, or Montgomery REDC, which are all assembly optimised in MPIR.</div><div><br /></div><div>And there are the Excavator, Steamroller, Broadwell, Kaby Lake, Xen, Cannon Lake and other architectures still to go.</div><div><br /></div><div>Volunteers are of course welcome. We need assembly experts who are interested in writing new assembly code, superoptimising it and maintaining support in MPIR for new architectures as they arise. If you can help, please volunteer on our development list, which you can find on our website [5].<br /><br /></div><div></div>[1] <a href="https://gmplib.org/">https://gmplib.org/</a><br />[2] <a href="https://github.com/asmjit">https://github.com/asmjit</a><br />[3] <a href="https://github.com/alexjbest/ajs">https://github.com/alexjbest/ajs</a><br />[4] <a href="https://github.com/OpenDreamKit/OpenDreamKit/issues/118">https://github.com/OpenDreamKit/OpenDreamKit/issues/118</a><br />[5] <a href="http://mpir.org/">http://mpir.org/</a><br />[6] <a href="https://github.com/akruppa/ajs">https://github.com/akruppa/ajs</a>William Harthttp://www.blogger.com/profile/18416881057216462316noreply@blogger.com1tag:blogger.com,1999:blog-7651115430416156636.post-8705710912250364002017-02-23T07:36:00.003-08:002017-02-27T06:34:01.628-08:00Integer factorisation in Flint (OpenDreamKit)<div class="separator" style="clear: both; text-align: center;"><a href="http://www.publicdomainpictures.net/pictures/60000/velka/numbers-1376742539qAa.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="240" src="https://www.publicdomainpictures.net/pictures/60000/velka/numbers-1376742539qAa.jpg" width="320" /></a></div><br />For the past eight months or so, I've been working part time as a contributor to the OpenDreamKit project [1], funded by the EU through their Horizon2020 initiative.<br /><br />One of the deliverables we had for that project was a parallel version of the quadratic sieve for factoring large integers (typically in the 20-90 decimal digit range).<br /><br />The executive summary is that this work has now been committed to the Flint repository [2] and is usable (see the qsieve_factor function in the qsieve module).<br /><br />The starting point was an old implementation of the quadratic sieve in Flint which only worked for small integers, and which was very fragile, written mostly by me.<br /><br />A Google Summer of Code student Nitin Kumar did some preliminary work on it. I've finished off that implementation and parallelised it, tuned it and made it more robust, as part of the OpenDreamKit project.<br /><br /><h2>Factoring integers</h2><div><br /></div>There are many practical algorithms for factoring integers:<br /><ul><li>trial division</li><li>SQUFOF</li><li>Euler's method</li><li>Fermat's method</li><li>One line factor</li><li>Lehman's algorithm</li><li>p-1 method</li><li>p+1 method</li><li>Elliptic Curve Method (ECM)</li><li>Pollard's rho</li><li>Quadratic sieve</li><li>General number field sieve</li><li>Special number field sieve</li></ul><div>In the range 20-90 decimal digits, there are two methods that one wishes to use. If the number has small factors that were not found by trial division or Pollard's Rho, there is the ECM method. But if one is fairly sure that all that remains are two large factors prime factors, i.e. n = pq, then one usually makes use of the quadratic sieve method.</div><div><br /></div><h2>Difference of squares</h2><div><br /></div><div>As for many factoring methods, the quadratic sieve ultimately expresses the number n to be factored</div><div>as a difference of squares n = a^2 - b^2 = (a - b)(a + b).</div><div><br /></div><div>As can be seen from the difference of squares formula, this factors n if a - b is not 1.</div><div><br /></div><div>The quadratic sieve makes use of a variant of this idea, namely to find two integers a and b such that a^2 = b^2 mod n.</div><div><br /></div><div>If a and b are not the same modulo n (and differ by more than 1 modulo n) then we have (a - b)(a + b) = wn for some integer w.</div><div><br /></div><div>This doesn't guarantee we factor n, but about half the time it will do so.</div><div><br /></div><h2>Dixon's method</h2><div><br /></div>Of course, trying random values a and b is impractical, but there is a way to make it work.<br /><br />Instead of looking for values a such that a^2 is a square of another integer b modulo n, we look for many values of a such that a^2 has lots of small prime factors. We then multiply these together to build up a square.<br /><br />For example, if<br /><br />a1^2 = 2*3*7^2 mod n<br />a2^2 = 2*7 mod n<br />a3^2 = 3*7 mod n<br /><br />then (a1*a2*a3)^2 = (2*3*7^2)^2 mod n, and we can take a = a1*a2*a3 and b = 2*3*7^2. Then we will have a^2 = b^2 mod n as required.<br /><br />Expressions such as the ones above are called relations, and the small primes that appear on the right side of those relations make up what is called a factor base.<br /><br />The idea is to pick a range, say [2, 10000] which the factor base primes are allowed to be in, and then try random values ai^2 modulo n to see if they factorise fully over the factor base.<br /><br />Note that if all the values a1, a2, etc are taken in the range [sqrt(n), sqrt(2n)) then their square will be in the range [n, 2n), which means that their squares can be reduced modulo n simply by subtracting n.<br /><br />In other words, we consider the values a1^2 - n, a2^2 - n and so on.<br /><br /><h2></h2><h2>The quadratic sieve</h2><div><br /></div><div>The quadratic sieve is a practical improvement on Dixon's method.</div><div><br /></div><div>Firstly, we note that if a1^2 - n is divisible by the factor base prime p, then so is (a1 + p)^2 - n. Secondly, we note that there is only one other value i less than p for which (a1 + i)^2 - n is divisible by p.</div><div><br /></div><div>In short, if I want to find all the values i in some range [-M, M] say for which i^2 - n is divisible by a factor base prime p, I just need to find the two values i in the range [0, p) which work, and all the others will be a multiple of p away from those two.</div><div><br /></div><div>This leads to a useful algorithm that can be implemented on a computer. First find the two square roots of n modulo p, to get the initial two values of i, then set an array up in memory, with all the values in the range [-M, M]. From the two initial values of i, stride off by steps of p in each direction and mark off all the other values of i for which i^2 - n will be divisible by p.</div><div><br /></div><div>If we do this for some big array V = [-M, M] for each of the factor base primes p in the factor base range, F = [2, N] say, then we can just look at all the entries in the array A and see which ones correspond to values that are highly divisible by factor base primes.</div><div><br /></div><div>In other words, we can find lots of values a1, a2, a3, etc, in V = [-M, M] such that a1^2 - n, a2^2 - n, a3^2 - n are fully divisible by factor base primes.</div><div><br /></div><div>This process of striding off by steps of p in a big array V = [-M, M] to cheaply mark off all entries corresponding to values divisible by p is known as sieving.</div><div><br /></div><h2>Improvements</h2><div><br /></div><div>There are many, many improvements that can be made to the naive quadratic sieve above. We list some of them here:</div><div><br /></div><h4>Multiple polynomials</h4><div><br /></div><div>The naive sieve essentially looks for solutions to x^2 - n = 0 mod p, where p is a factor base prime. But note that if x is just a little larger than sqrt(n) the value x^2 - n will be very small. But as soon as x gets too big, x^2 - n will also be large. The probability that it completely factors over the factor base F = [2, N] diminishes, and it becomes harder and harder to find relations.</div><div><br /></div><div>To get around this, we use polynomials other than x^2.</div><div><br /></div><div>There's a few different methods of doing this. The first method is to choose only those values of x for which x^2 - n is divisible by a known prime q, say. These lie in an arithmetic progression.</div><div><br /></div><div>We choose q to be larger than the factor base primes. Although this makes the corresponding values of x^2 larger, we know in advance that q^2 will be a factor, which we can remove, bringing things down to about the same size as for the naive x^2 - n case.</div><div><br /></div><div>By switching the value of q, we get to look for relations amongst fairly small values, for each different value of q.</div><div><br /></div><div>A second technique is to use polynomials (Ax + B)^2 - n for different values of A and B.</div><div><br /></div><div>If we choose the values A and B such that n = B^2 - AC for some integer c, then the polynomials become A^2x^2 + 2ABx + B^2 - B^2 + AC = A(Ax^2 + 2Bx + C). In other words, we are guaranteed a factor of A that we can remove.</div><div><br /></div><div>The game then becomes finding relations amongst small values of Ax^2 + 2Bx + C.</div><div><br /></div><div>There are heuristics for exactly how large the coefficient A should be to give the optimal speed in the algorithm.<br /><br /></div><h4></h4><h4>Knuth-Schroeppel method</h4><div><br /></div><div>Sometimes it turns out to be more efficient to find a factor of kn for some small multiple k than for the original number n we are trying to factor. The Knuth-Schroeppel algorithm helps find such a multiplier k.</div><div><br /></div><div>To avoid just finding a bad factorisation, we always take the greatest common divisor with n instead of kn in the final step of the algorithm. For example, if we have kn = a^2 - b^2, we take the greatest common divisor of a - b with n to avoid getting an extraneous factor of k in our final factorisation.</div><div><br /></div><h4>Self initialisation (hypercube method)</h4><div><br /></div><div>In switching the polynomials Ax^2 + 2Bx + C mentioned above, we have to find solutions to Ax^2 + 2Bx + C modulo p for each factor base prime. This can be quite expensive every time we switch polynomial.</div><div><br /></div><div>The self initialisation method saves additional time by giving us many cheap switches of B for each value of A that is used.</div><div><br /></div><div>The way this is done is to make the value A a product of small factor base primes. If there are s such factor base primes, then it turns out that there are 2^s values of B that can be used and which are cheap to compute. Moreover, the solutions modulo p that we need to compute for each factor base prime p can be computed from one another very cheaply, using some precomputed information.</div><div><br /></div><h4>Large prime variant</h4><div><br /></div><div>We can allow partial relations which are a product of factor base primes and one larger prime. We compute many of these partial relations, and whenever two are found with the same large prime, we multiply the partial relations together, which gives us the equivalent of one full relation.</div><div><br /></div><div>With some graph theoretic techniques, one can also allow two large primes per partial relation. However, this technique is only useful for very large factorisations. It is even possible to use three large primes, but this is typically only useful on very large clusters for extremely large factorisations.</div><div><br /></div><h4>Small prime variant</h4><div><br /></div><div>It turns out that a nontrivial proportion of the time spent sieving is used to sieve with very small factor base primes p. These hit the sieve much more often than larger primes and so take a disproportionate amount of time for the value they contribute to the sieving process.</div><div><br /></div><div>The small prime variant skips sieving with these primes and only checks divisibility by these primes for factorisations that are very close to being relations.</div><div><br /></div><h4>Block Lanczos linear algebra</h4><div><br /></div><div>The problem of finding a combination of relations to multiply together to give a square can be achieved by representing the problem as a linear algebra problem over GF2. If a prime occurs to an odd power in a relation, the appropriate location in the GF2 matrix is set to 1, otherwise it is set to 0. The problem then becomes one of looking for kernel vectors of this matrix over GF2.</div><div><br /></div><div>The Block Lanczos algorithm can be used to find such kernel vectors quite efficiently.</div><div><br /></div><div><h4>Bradford-Monagan-Percival method</h4></div><div><br /></div><div>This method is similar to the self-initialisation method, except that the coefficient A of the polynomial is chosen to have one large factor q, which is increased every time the polynomial is switched. This helps avoid duplicate relations in an easy way and lets one use the same product of small primes over and over in the coefficient A.</div><div><br /></div><h4>Carrier-Wagstaff method</h4><div><br /></div><div>This is another technique designed to help with the selection fo the coefficients A of the polynomials. The possible factor base primes that could make up the factors of the coefficient A are divided into two sets, those with odd indices and those with even indices in the array of factor base primes.</div><div><br /></div><div>All but one of the factors is taken from the first set, then the final prime is chosen from the other set such that the product gives the most optimal value of A.<br /><br /></div><h4></h4><h4>Other tricks</h4><div><br /></div><div>There are many other tricks that can be used to cut the time used to test whether something that looks like a relation really is a relation. There are also lots of strategies for combining large prime partial relations into full relations.</div><div><br /></div><div>There are also numerous tricks associated with the processor cache. Rather a lot of precomputed information is stored for each factor base prime. The sieve interval itself may also be larger than the processor cache. Things can be dramatically sped up by breaking such large blocks of information into smaller blocks and processing small blocks at a time. Each small sieve block is processed against each small block of factor base primes, and so on.</div><div><br /></div><div>These days, there are also many opportunities to use SIMD intrinsics, e.g. to search the sieve interval for highly composite values which might be relations.</div><div><br /></div><div>We also make use of precomputed inverses to do divisions and reductions modulo small values, e.g. by factor base primes.<br /><br /></div><h2></h2><h2>Parallelising the relation sieving</h2><div><br /></div><div>The relation sieving in the quadratic sieve can be relatively easily parallelised. As the sieving per polynomial is largely independent of the sieving for any other polynomial (with the exception of the file handling for the large prime partial relations), it is quite easy to parallelise the sieving portion of the quadratic sieve.</div><div><br /></div><div>In Flint we use OpenMP to parallelise the relation sieving. We combine the Carrier-Wagstaff and Bradford-Monagan-Percival (and of course the self-initialisation) methods to create polynomials that can be switched easily without too many duplicate relations. We then sieve for different polynomials in different threads.</div><div><br /></div><h2>Robustness</h2><div><br /></div><div>There are many points at which the quadratic sieve can either abort early, or fail, especially if tuning values aren't set quite right. In our Flint implementation, if we fail to find a factor of a number n, we restart the sieve with a larger factor base. This shouldn't ever happen in practice if the tuning values are set just right, but it is possible in theory.</div><div><br /></div><div>We also increase the size of the factor base if something goes wrong, for example, we ran out of polynomials before generating enough relations, and so on. Again, these problems should not happen in practice, but they do happen if wildly incorrect tuning values are supplied.</div><div><br /></div><h2>Results</h2><div><br /></div><div>The code to implement all of the above is now merged into Flint and has been really thoroughly tested and tuned for number up to about 90 digits.</div><br />In summary, the implementation is relatively competitive from about 130 bits onward.<br /><br />Some tables giving timings for one and multiple threads are given in our final report for the EU [3] which is included at the start of the GitHub issue for that project.<br /><br /><h2>Future work</h2><div><br /></div><div>It takes about 10 years to really implement a world class quadratic sieve implementation. We would have a long way to go if our aim was to have world beating performance. But for a general purpose library, what we have right now is actually pretty reasonable. There are lots of things we could improve. These include:</div><div><br /></div><div><ul><li>Further cache efficient handling of large arrays in the sieve</li><li>Implementing double and triple large prime variants</li><li>Parallelising the file handling for large relations</li><li>Implementing a quadratic sieve which handles small factorisations (50-130 bits) efficiently</li></ul></div><br /><br />[1] <a href="http://opendreamkit.org/">http://opendreamkit.org/</a><br />[2] <a href="https://github.com/wbhart/flint2">https://github.com/wbhart/flint2</a><br />[3] <a href="https://github.com/OpenDreamKit/OpenDreamKit/issues/119">https://github.com/OpenDreamKit/OpenDreamKit/issues/119</a>William Harthttp://www.blogger.com/profile/18416881057216462316noreply@blogger.com0tag:blogger.com,1999:blog-7651115430416156636.post-26625297822531998892017-01-09T23:57:00.000-08:002017-01-09T23:57:19.721-08:00Singular and JuliaIn my previous blog, I mentioned the new computer algebra system, Oscar that is being developed [1].<br /><br />Later this week, a dozen or more Oscar people are converging in Kaiserslautern to discuss the next steps in the part of Oscar that is making use of the programming language Julia. Specifically, we will be discussing some of the technical issues in integrating Polymake, Singular and GAP with the Antic "cornerstone", which is written entirely in Julia and C.<br /><br />One of the agenda items is a demonstration of a new Julia package which Oleksandr Motsak and I have written, called Singular.jl [2]. It's part of a two pronged strategy with regard to the integration of Singular into Oscar.<br /><br />The first part of that strategy will be to write an interpreter for the Singular programming language in Julia, so as to take advantage of JIT compilation in Singular. The second part of the strategy is Singular.jl, which I'll describe in detail here.<br /><br />Singular.jl consists of two parts. The first is a low level interface to Singular [3]. It makes use of Julia's ability to interface to C++ (even templated code) in real time [4]. Julia has the ability to mix C++ code with Julia code and have it all compile down to LLVM instructions for execution by the JIT. It makes use of the Clang library to compile arbitrary C++ code.<br /><br />This low-level interface to Singular preserves Singular type and function names, and does not make use of garbage collection at the Julia level. Calling a Singular kernel function in this way should have no significant overhead as compared with calling it in a C++ program. You can see this interface layer in the libsingular directory of Singular.jl.<br /><br />On top of this low-level interface is a garbage collected high level interface, which is compatible with the Nemo generics system. In this way, we make Singular functionality, including the Groebner basis engine, directly accessible to Nemo [5] and the other parts of the Julia/Antic ecosystem.<br /><br />Nemo provides various generic mathematical structures and algorithms, e.g. dense linear algebra, univariate polynomials (and more recently multivariate), residue rings, fraction fields, power series and so on. By implementing the interface required by Nemo, other packages can make use of all this generic functionality over ring implementations provided by those packages (hardly a novel concept for computer algebra systems, but an important one nonetheless).<br /><br />Eventually we will also allow Singular to build polynomial rings over rings and fields provided by Nemo and other systems. Singular has this capacity, due to work of Hannes Schoenemann and others, and Oleksandr Motsak already prototyped making Nemo rings available as coefficient rings to Singular as part of the due diligence we did for the big grant we applied for.<br /><br />So far in Singular.jl we have high level Julia wrappers for Singular integers, rationals, prime fields, integer residue rings, finite fields, polynomials, ideals, modules, vectors, matrices and resolutions.<br /><br />It's time to see Singular.jl in action. Here is how it looks when Singular.jl is started up. It has a dependency on Nemo, so you see the Nemo.jl welcome message.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-dnFEFTExTrc/WHSEb0CiF5I/AAAAAAAABvk/Kp7WnYWaVs4RoXGDvoEyq0aDIFjGdthsQCLcB/s1600/singular.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="99" src="https://3.bp.blogspot.com/-dnFEFTExTrc/WHSEb0CiF5I/AAAAAAAABvk/Kp7WnYWaVs4RoXGDvoEyq0aDIFjGdthsQCLcB/s640/singular.PNG" width="640" /></a></div><br /><br />We start with creating the Singular integer ring with a Nemo residue ring over it. This is specialised by Singular.jl to actually make use of a Singular residue ring behind the scenes.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-BhV-b5boSNw/WHSFDrlcSRI/AAAAAAAABvo/CtJSN7Qq8eQNggpPc8-kePxqi7U8Xy9-QCLcB/s1600/Residue.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="208" src="https://2.bp.blogspot.com/-BhV-b5boSNw/WHSFDrlcSRI/AAAAAAAABvo/CtJSN7Qq8eQNggpPc8-kePxqi7U8Xy9-QCLcB/s640/Residue.PNG" width="640" /></a></div><br /><br />Now let's create a Singular polynomial ring, some polynomials and do a computation with them. Because I didn't specify an ordering, it defaults to the Singular dp ordering (degrevlex) and the ordering, C, for module generators.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-QfKSt-dXv2g/WHSGlBwV0_I/AAAAAAAABv0/9PvwuqkAyg0b7iw5_g4cVJyqllDxetwAQCLcB/s1600/poly.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="200" src="https://3.bp.blogspot.com/-QfKSt-dXv2g/WHSGlBwV0_I/AAAAAAAABv0/9PvwuqkAyg0b7iw5_g4cVJyqllDxetwAQCLcB/s640/poly.PNG" width="640" /></a></div><br />One can specify an optional parameter (ordering) to change the polynomial ordering. (Later on, we may separate the module ordering so that it is not needed unless creating module elements. At present, a suitable default is selected automatically.)<br /><br />Next, we create some ideals over this polynomial ring, count the number of generators and manipulate the generators, using Julia array syntax (which can be overloaded for any purpose).<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-jfNynPuNgFU/WHSIVwXmLaI/AAAAAAAABv8/e_LowND2EMMdIlq4KyeFlB7VX18d3-EIwCLcB/s1600/ideal.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="420" src="https://2.bp.blogspot.com/-jfNynPuNgFU/WHSIVwXmLaI/AAAAAAAABv8/e_LowND2EMMdIlq4KyeFlB7VX18d3-EIwCLcB/s640/ideal.PNG" width="640" /></a></div><br />Next, let's create a Singular ideal and take its standard basis (Groebner basis), using the Singular std command. Singular can compute Groebner bases over all the basic coefficient rings that it supports internally. (The functionality is there to compute them over externally supplied coefficient fields, though it is considered less well tested at this point.)<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-0wbD8q-C7co/WHSJFBQa5OI/AAAAAAAABwA/OTBVlJp4vtQePt2fuX6KQEN8rztK6S_ZACLcB/s1600/std.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="138" src="https://1.bp.blogspot.com/-0wbD8q-C7co/WHSJFBQa5OI/AAAAAAAABwA/OTBVlJp4vtQePt2fuX6KQEN8rztK6S_ZACLcB/s640/std.PNG" width="640" /></a></div><br />The next most logical thing to do is to compute the syzygy matrix of an ideal. This displays by default as a linear combination of module generators, but it is also possible to convert to a matrix representation.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-r5jlLGbPWRM/WHSKKB08HqI/AAAAAAAABwM/05d7Kb5v2LMAmCi8Cga7hTCc46BGIqeZACLcB/s1600/syz.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="204" src="https://1.bp.blogspot.com/-r5jlLGbPWRM/WHSKKB08HqI/AAAAAAAABwM/05d7Kb5v2LMAmCi8Cga7hTCc46BGIqeZACLcB/s640/syz.PNG" width="640" /></a></div><br />One of the things we need to improve in the near future is the formatting for the matrix printing. For now, we just did the easiest thing in Singular.jl.<br /><br />Of course, one can do basic ideal arithmetic.<br /><br /><div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-PIXyJs_bZZE/WHSMQ5TqzOI/AAAAAAAABwc/apMVRmQe5pc3l6v58Ap_yNEtNRFrXRfhwCLcB/s1600/idealarith.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="382" src="https://1.bp.blogspot.com/-PIXyJs_bZZE/WHSMQ5TqzOI/AAAAAAAABwc/apMVRmQe5pc3l6v58Ap_yNEtNRFrXRfhwCLcB/s640/idealarith.PNG" width="640" /></a></div><br />The next thing one might wish to do is compute resolutions. At present, only the Schreyer resolution is available in Singular.jl. We'll add La Scala and minimal resolutions soon.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-yLVdlwuJbE4/WHSOBp5wtuI/AAAAAAAABwk/VS3sqDg7htM_PloKWd7rZMnkR4ZsqWoRgCLcB/s1600/res.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="206" src="https://3.bp.blogspot.com/-yLVdlwuJbE4/WHSOBp5wtuI/AAAAAAAABwk/VS3sqDg7htM_PloKWd7rZMnkR4ZsqWoRgCLcB/s640/res.PNG" width="640" /></a></div><br />One can of course access all the modules in the resolution, precisely as one would expect.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-QfM1Wns9Q9A/WHSOvfoWedI/AAAAAAAABwo/8kf6PVXrCFMjdoC3rALHMI4gPLMgfK4uACLcB/s1600/resmod.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="360" src="https://2.bp.blogspot.com/-QfM1Wns9Q9A/WHSOvfoWedI/AAAAAAAABwo/8kf6PVXrCFMjdoC3rALHMI4gPLMgfK4uACLcB/s640/resmod.PNG" width="640" /></a></div><br />And of course, doing arithmetic on module elements works just as expected.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-jVx5KwWIuY4/WHSPad2_CZI/AAAAAAAABw0/zkly3CUjhTsvqdJFoo34G0zET1yeOlWzwCLcB/s1600/modarith.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="160" src="https://1.bp.blogspot.com/-jVx5KwWIuY4/WHSPad2_CZI/AAAAAAAABw0/zkly3CUjhTsvqdJFoo34G0zET1yeOlWzwCLcB/s640/modarith.PNG" width="640" /></a></div><br />And naturally, one can compute standard bases and resolutions of modules.<br /><br /><div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-rUM96h-ojLs/WHSQM_x7ppI/AAAAAAAABxE/Koq71j4kxP8CSOTA09qAwYQgqo0uVbtJgCLcB/s1600/modres.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="298" src="https://4.bp.blogspot.com/-rUM96h-ojLs/WHSQM_x7ppI/AAAAAAAABxE/Koq71j4kxP8CSOTA09qAwYQgqo0uVbtJgCLcB/s640/modres.PNG" width="640" /></a></div><br />Finally, let's have a demonstration of a computation using the new generic multivariate polynomial code in Nemo (only available in the mpoly development branch at present), over a Singular coefficient ring. Here Singular is not computing the GCD, but Nemo is. And of course it is faster than the native functionality in Singular itself.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-74_ZTu6q52M/WHSTEA7zqfI/AAAAAAAABxQ/XyATXv1-vEEQK_-Oy2HfwPr7XziDT24VgCLcB/s1600/gcd.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="418" src="https://1.bp.blogspot.com/-74_ZTu6q52M/WHSTEA7zqfI/AAAAAAAABxQ/XyATXv1-vEEQK_-Oy2HfwPr7XziDT24VgCLcB/s640/gcd.PNG" width="640" /></a></div><br />That's all for now. In coming blogs I'll talk more about the Singular interpreter and make updates about Oscar development as information comes to hand.<br /><br />[1] <a href="https://wbhart.blogspot.de/2016/11/new-computer-algebra-system-oscar_20.html">https://wbhart.blogspot.de/2016/11/new-computer-algebra-system-oscar_20.html</a><br />[2] <a href="https://github.com/wbhart/Singular.jl">https://github.com/wbhart/Singular.jl</a><br />[3] <a href="https://www.singular.uni-kl.de/">https://www.singular.uni-kl.de/</a><br />[4] <a href="https://github.com/Keno/Cxx.jl">https://github.com/Keno/Cxx.jl</a><br />[5] <a href="https://github.com/wbhart/Nemo.jl">https://github.com/wbhart/Nemo.jl</a>William Harthttp://www.blogger.com/profile/18416881057216462316noreply@blogger.com1tag:blogger.com,1999:blog-7651115430416156636.post-45198403491828632182016-11-20T12:49:00.002-08:002016-11-24T03:33:33.657-08:00New Computer Algebra System : OSCAROn Friday we found out the exciting news that the German funding agency (DFG) has approved a massive transregional grant here in Germany, that the computer algebra group here at the University of Kaiserslautern has been heavily involved with. There was much celebration in the Department of Mathematics where I work when we received the news!<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-U_Uau74L3hE/WDIDdRiHUPI/AAAAAAAABuo/ZlTuKjH0RBEsIR3kTrS1gHpAN87cXzmBACLcB/s1600/Celebration.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="240" src="https://1.bp.blogspot.com/-U_Uau74L3hE/WDIDdRiHUPI/AAAAAAAABuo/ZlTuKjH0RBEsIR3kTrS1gHpAN87cXzmBACLcB/s320/Celebration.JPG" width="320" /></a></div><br /><br />The official announcement was made on the DFG website on Monday. If you can read German, it is <a href="http://www.dfg.de/service/presse/pressemitteilungen/2016/pressemitteilung_nr_51/index.html">here</a>.<br /><br />The grant is called "Symbolic Tools in Mathematics and their Application" and is a collaboration between multiple universities here in Germany: Aachen, Kaiserslautern and SaarbrÃ¼cken, with additional external people at Siegen, Berlin, Stuttgart and TÃ¼bingen, with a total of about twenty extremely experienced PIs. The grant is initially for four years, but can be renewed up to twice, for a total of 12 years (assuming we get renewed).<br /><br />The exciting news from my point of view is project B1 of the grant, which we have internally been calling project B1 : Central Software Project. This is a collaboration to produce a visionary new Open Source computer algebra system. It will be called OSCAR : Open Source Computer Algebra Resource and is to be built on four "cornerstones", namely the <a href="http://gap-system.org/">Gap System</a>, <a href="https://www.singular.uni-kl.de/">Singular</a>, <a href="https://polymake.org/doku.php">Polymake</a> and ANTIC (the internal name for a new project I've been involved with, under Claus Fieker; see below).<br /><br /><h3>The Principal Investigators</h3><br />Before I get to any details, in no particular order, let me introduce the PI's for the Central Software Project.<br /><br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: left; margin-right: 1em; text-align: left;"><tbody><tr><td style="text-align: center;"><a href="https://2.bp.blogspot.com/-BhRSLYRp6KA/WDHmuUODpZI/AAAAAAAABt4/IHyAl9lUD0E4TJEFP4yYar2OctC1O50BQCLcB/s1600/barakat.jpg" imageanchor="1" style="clear: left; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" src="https://2.bp.blogspot.com/-BhRSLYRp6KA/WDHmuUODpZI/AAAAAAAABt4/IHyAl9lUD0E4TJEFP4yYar2OctC1O50BQCLcB/s1600/barakat.jpg" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><br /></td></tr></tbody></table>Prof. <a href="http://www.mathematik.uni-kl.de/~barakat/en/publications.html">Mohamed Barakat</a> is behind the <a href="http://homalg-project.github.io/">HomAlg</a> project. HomAlg is a GAP package that allows one to do homological algebra from a categorical point of view. (You may also know Mohamed Barakat's students Sebastian Gutsche and Sebastian Posur, who worked on <a href="https://github.com/homalg-project/CAP_project">HomAlg/CAP</a>.)<br /><br /><div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-agiSNeTOqVc/WDHqXesY8AI/AAAAAAAABuE/F8pYXYVI91Mqp9Fjjf0foWJ2zVEvHQ0TgCLcB/s1600/Frank.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="198" src="https://3.bp.blogspot.com/-agiSNeTOqVc/WDHqXesY8AI/AAAAAAAABuE/F8pYXYVI91Mqp9Fjjf0foWJ2zVEvHQ0TgCLcB/s200/Frank.jpg" width="200" /></a></div><div><br /></div><div>Dr. <a href="http://www.math.rwth-aachen.de/~Frank.Luebeck/preprints/index.html?LANG=en">Frank LÃ¼beck</a> is one of the main authors of the <a href="http://www.math.rwth-aachen.de/homes/CHEVIE/">Chevie</a> project, which is a computer algebra project for computing with generic character tables of Lie groups, Coxeter groups, Iwahori-Hecke algebras and others. It makes use of the GAP system. Over the years, Frank has been a major contributor to GAP.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-iYvOqNcBhDo/WDHsFFFHW-I/AAAAAAAABuM/os_0ngPbFggnHy_2ESJN_gvWPQekGRJtgCLcB/s1600/joswig.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://1.bp.blogspot.com/-iYvOqNcBhDo/WDHsFFFHW-I/AAAAAAAABuM/os_0ngPbFggnHy_2ESJN_gvWPQekGRJtgCLcB/s1600/joswig.jpg" /></a></div><div><br /></div><div>Prof. <a href="http://page.math.tu-berlin.de/~joswig/joswig-publications.pdf">Michael Joswig</a> is one of the main authors of the <a href="https://polymake.org/doku.php">Polymake</a> system. This is an Open Source project for research into polyhedral geometry. Michael Joswig is the Einstein Professor for Discrete Mathematics/Geometry at TU Berlin.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-xRDtC1CbTpU/WDHuJjiKGFI/AAAAAAAABuU/t5DX_75Gp6k_SOp-PZ8rXwXZgar20lJAQCLcB/s1600/claus.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="200" src="https://2.bp.blogspot.com/-xRDtC1CbTpU/WDHuJjiKGFI/AAAAAAAABuU/t5DX_75Gp6k_SOp-PZ8rXwXZgar20lJAQCLcB/s200/claus.jpg" width="152" /></a></div><div>Prof. <a href="http://www.mathematik.uni-kl.de/agag/mitglieder/professoren/prof-dr-claus-fieker/publications/">Claus Fieker</a> is one of the original authors of the <a href="http://page.math.tu-berlin.de/~kant/kash.html">KASH/KANT</a> system, for algebraic number theory, which became part of Magma. He is also one of the main authors of <a href="https://github.com/thofma/Hecke.jl">Hecke</a>, a new Open Source algebraic number theory project based on <a href="https://github.com/nemocas/Nemo.jl">Nemo</a>. (Some people may also know his postdoc, Tommy Hofmann, who is a coauthor with him of Hecke and of Nemo.)</div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-vqqqHpveGYs/WDHxB0HfFpI/AAAAAAAABuc/RiJC_Jc8M9ENlEgcRiN9OXkS1PpJ2CYewCEw/s1600/wolfram.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="200" src="https://3.bp.blogspot.com/-vqqqHpveGYs/WDHxB0HfFpI/AAAAAAAABuc/RiJC_Jc8M9ENlEgcRiN9OXkS1PpJ2CYewCEw/s200/wolfram.jpg" width="198" /></a></div><div><br />Prof. <a href="https://scholar.google.de/citations?user=2umti6YAAAAJ&hl=en"><span id="goog_1440951967"></span>Wolfram Decker<span id="goog_1440951968"></span></a> is the coordinator of a prior <a href="http://www.computeralgebra.de/">Priority Project</a> funded by the DFG on Computer Algebra, which is just concluding. Wolfram Decker is one of the people who is directly responsible for directing the <a href="https://www.singular.uni-kl.de/">Singular</a> project and is a key member of the Algebra, Geometry and Computer Algebra group in Kaiserslautern.<br /><br /><br /><br /><br /><br /><br />In this blog, I'm only going to speak about the Central Software Project. But it is important to realise that the grant is much, much larger than this one project (it is just one of about 23 funded, interrelated projects in the grant). The Central Software Project should be thought of as providing the technical infrastructure for a very large mathematics grant in the area of computer algebra, although there are other projects within the grant that will contribute in a significant way to OSCAR. I'll maybe talk about some of them in later blogs.<br /><br />The entire grant application had more than 100 pages in the preproposal and 400 pages in the full proposal, and took years to put together! At the final referee stage, the grant had 10 eminent referees, most from overseas.<br /><br /><h3>What is OSCAR?</h3><br />So what is OSCAR all about? Again, I can only speak about it in terms of what I understand from the PI's, and my knowledge gets greater the closer you get to the bit I will be directly involved with.<br /><br />The basic idea is to develop a new Computer Algebra System, covering group theory, algebraic geometry/commutative algebra, polyhedral geometry and number theory. Our aim is to directly compete with the closed source system, Magma, hence the focus on many areas where Magma has been particularly influential.<br /><br />All of the cornerstone systems (except ANTIC, which is quite new) have developed significant projects and communities in their own right, over decades, and it is our hope to combine the strengths of all these systems into a single system, which is highly integrated and easy to use. And we hope to do all this without significant disruption to those existing ecosystems. OSCAR will be a single system, which builds on all four of those existing systems and which tightly integrates those systems.<br /><br />One major motivation for OSCAR is mathematical applications where it is necessary to do cross-disciplinary research, for example, where one might need a multigraded, equivariant Cox ring of a toric variety over a number field, or graphs of groups in division algebras, or matrix groups over a polynomial ring over a number field. To perform such exotic calculations, it is quite often not enough to simply interface systems such as GAP, Singular and ANTIC which separately provide either matrix groups, polynomials rings or number fields, say. One needs a very tight integration, especially if one wants such computations to be practical.<br /><br />Some of us believe that it is the monolithic integration of Magma that has allowed it to be successful in the areas we hope to impact. Therefore, we in the Open Source community need to get really smart about integration in order to compete.<br /><br />I personally will be working on both low and mid-level integration of the four cornerstone systems, using the programming language <a href="http://julialang.org/">Julia</a>. But there will be other efforts too, for example, there will be a coordinated serialisation effort, a parallelism project, new packages, interfaces and technologies in the cornerstone systems, and so on.<br /><br />There will be a number of people other than myself hired to work on OSCAR. Let me mention two that will be associated with the Central Software Project specifically.<br /><br />Dr. Reimer Behrends, who shares an office with me, and who is one of the main architects and authors of HPC-GAP and HPC-Singular will be working on aspects of parallelisation in OSCAR.<br /><br />As mentioned already, Sebastian Gutsche will be working on both GAP and Polymake integration with the other cornerstones. He will be doing this using Julia, and in other ways.<br /><br /><h3>How am I involved in this?</h3><div><br /></div>So, the rest of what I'm going to write is just about the part I have been involved with. The OSCAR website when it goes live will be the best source of information on the direction the overall project is going.<br /><br />Some of you may have heard of my work on the <a href="https://github.com/nemocas/Nemo.jl">Nemo</a> project (joint work with Claus Fieker, Tommy Hofmann, Fredrik Johansson and others). This is a computer algebra package written in the <a href="http://julialang.org/">Julia</a> programming language. The main idea behind this, from my perspective, is to demonstrate two things:<br /><br />1) It is possible to integrate multiple packages, such as Flint, Arb, Singular, Hecke, Antic (the C library -- not the ANTIC cornerstone mentioned above) and so on, in an efficient way, using the Julia programming language. And we've had some early success; already, we have demonstrated that it is possible to build Singular polynomial rings over coefficient rings provided by Nemo itself, or by Antic or Hecke, etc.<br /><br />2) It is possible to develop extremely <a href="http://nemocas.org/benchmarks.html">efficient implementations</a> of generic algorithms over generic rings (rather than the specific rings that we have highly optimised implementations for, e.g. in the Flint system). This is possible in the Julia language, due to its innovative type system and JIT (Just-In-Time) compiler.<br /><br />(It should be noted that Polymake has a very sophisticated form of runtime specialisation, and what may be thought of as an early kind of JIT compiler.)<br /><br />I personally believe that high level algorithms implemented in an old-style interpreter are sometimes not enough to get a practical implementation for some research questions. Sometimes, the difference between a practical implementation and failure, is not bottlenecks in low level C code, but generic algorithms implemented in a high level, interpreted language. But unfortunately, we don't have enough skilled, low-level developers to reimplement such things in a low-level language every time they appear!<br /><br />So one of the things I am particularly interested in, is applications where a "mid-level" language like Julia can be used to speed up generic algorithms. This has been one of the ideas behind my work on Nemo, so far, and the reason why I chose to develop it in the Julia programming language. Julia facilitates high-level constructs, with low-level performance.<br /><br />Already, on top of Nemo, Prof. Claus Fieker and Dr. Tommy Hofmann, have built the <a href="https://github.com/thofma/Hecke.jl">Hecke</a> system, also written in the Julia programming language, for doing computations in orders of number fields and class group computations. This is a culmination of efforts of Claus Fieker's project under the Priority Project, which I mentioned above.<br /><br /><h3>What is the ANTIC cornerstone?</h3><br />So with all of that background in mind, I can now describe the ANTIC cornerstone of the Central Software Project. ANTIC is the internal name we have had for the combination of <a href="http://mpir.org/">GMP/MPIR</a>, <a href="http://www.mpfr.org/">MPFR</a>, <a href="http://flintlib.org/">Flint</a>, <a href="https://github.com/wbhart/antic">Antic</a>, <a href="http://arblib.org/">Arb</a>, <a href="http://nemocas.org/">Nemo</a> and <a href="http://thofma.github.io/Hecke.jl/latest/">Hecke</a>, that we have been busy with integrating and in some cases, developing under the Priority Programme. It will now become one of the four cornerstones of the Central Software Project, and will essentially be responsible for providing algebraic number fields and related things to the new Computer Algebra System, OSCAR. <br /><br />We now begin the task of doing our part to help with the integration of ANTIC with the other cornerstones of the OSCAR system, namely Singular, Polymake and GAP.<br /><br />There are already preexisting and independent projects underway to integrate Flint and Antic (the C library) into GAP as packages, which we aim to support in whatever way we can.<br /><br />And within our collaboration, Michael Joswig has already suggested projects to me that will require integration of Antic with Polymake, and work is already quite advanced in the direction of integrating Singular and Nemo/Hecke. And there have already been extensive discussions about future integration of ANTIC with GAP, probably via Nemo and/or Hecke.<br /><br />The ANTIC project achieves its integration, and some of its generic algorithms are directly implemented in, the Julia programming language. We aim to contribute the expertise we have been developing here in Kaiserslautern, in the Julia language, to the wider integration of the four cornerstones in OSCAR.<br /><br /><h3>Will you join us?</h3><br />Obviously OSCAR is a massive undertaking. We hope that the mathematics community and Computer Algebra communities at large will support our efforts.<br /><br />OSCAR is an Open Source project, and we hope that it will attract many contributors over the years and that it will support many Masters and PhD projects!<br /><br />The really important thing is that OSCAR is motivated by a large number of serious mathematical projects across computer algebra. These applications will drive development of the project and I believe, lead to a high quality, well-tested project, leading to many substantial contributions and publications for a long time to come.<br /><br />Join us if you are interested. In coming days we will very quickly be setting up websites and infrastructure. In the meantime, feel free to contact me or any of the other people you know who are involved with this project.<br /><br />And join us in celebrating what is a fantastic investment in Open Source computer algebra, that we hope will be of great benefit for the mathematical community at large!<br /><br /><h3>Further Information</h3><div><br />I hope to keep people up-to-date with what I'm personally working on, through this blog. I'll also link to other press releases, blogs and articles about OSCAR and our grant, as I become aware of them. So follow this blog if you are interested in hearing more as things develop.<br /><br />Here is the University of Kaiserslautern <a href="http://www.uni-kl.de/aktuelles/news/news/detail/News/grosser-erfolg-fuer-die-tu-kaiserslautern-neuer-dfg-sonderforschungsbereich-in-der-mathematik/">press release</a> (German).</div></div><br /><div class="separator" style="clear: both; text-align: center;"></div>William Harthttp://www.blogger.com/profile/18416881057216462316noreply@blogger.com1tag:blogger.com,1999:blog-7651115430416156636.post-84087010165655719922012-05-22T08:03:00.000-07:002012-05-22T16:17:12.254-07:00Reduced Binary Quadratic FormsOver the past few weeks I've been writing some code for computing reduced binary quadratic forms:<br /><div><br /></div><div>(a, b, c) : = ax^2 + bxy + cy^2</div><div><br /></div><div><div>The discriminant of a form is d = b^2 - 4ac. The code I wrote works for d < 0.</div><div><br class="Apple-interchange-newline" /></div>Such a form is reduced if |b| <= a <= c and |b| >= 0 if a = c or a = |b|.</div><div><br /></div><div>You can see the code here:</div><div><br /></div><div><a href="https://github.com/wbhart/flint2/tree/qfb/qfb">https://github.com/wbhart/flint2/tree/qfb/qfb</a></div><div><br /></div><div>There are two versions of the function qfb_reduced_forms.<br /><br /><h3> <b><u>Version 1</u></b></h3><br />The first version works by iterating over all possible values "a" (it's a theorem that |b| <= a <= sqrt(|d|/3)) and c <= (1 - d)/4) and searching for valid triples (a, b, c).</div><div><br /></div><div>To find valid triples, we note that if d = b^2 - 4ac then d = b^2 mod 4a. So we simply solve this quadratic congruence and search for all roots "b" in the correct range.</div><div><br /></div><div>To speed things up, we factorise all the different possible values "a" by sieving. This makes the square root code much faster, as it doesn't have to factorise 4a before computing square roots.</div><div><br /></div><div>This approach requires softly O(|d|^{1/4}) primes. There are about O(|d|^{1/2}) forms on average, and indeed the run time is softly O(|d|^{1/2}) (subject to the Generalised Riemann Hypothesis -- needed to guarantee we can find quadratic nonresidues quickly enough for the Tonelli-Shanks modular square root algorithm).</div><div><br /><br /><h3> <b><u>Version 2</u></b></h3><div><b><u><br /></u></b></div></div><div>The second approach iterates over all possible values |b| (the same bound applies as for "a").</div><div><br /></div><div>This time we factorise (d - b^2)/4 for each "b" into products "ac". Again, we do this by sieving (this time quadratic sieving, which first finds square roots of "d" mod various primes "p"). </div><div><br /></div><div>However, we must sieve with primes "p" dividing values (d - b^2)/4. In other words we need softly O(|d|^{1/2}) primes. On a 64 bit machine, this exhausts the precomputed primes at about |d| = 100000000, so this technique is a bit limited.</div><div><br /></div><div>The advantage of this technique is that it is about 50% faster as implemented. It still takes time softly O(|d|^{1/2}) though (again subject to GRH).</div><div><br /><br /><h3> <u>Timings</u></h3><div><u><br /></u></div></div><div>Magma computes reduced binary quadratic forms, so I did two comparisons:</div><div><br /></div><div>Comparison A) Compute all reduced forms for discriminants 0 > d > -1000000.</div><div><br /></div><div>Comparison B) Compute all reduced forms for disciminants -1000000000 > d > -1000000100.</div><div><br /></div><div>Here are the timings:<br /><br /></div><table border="1" bordercolor="#FFCC00" cellpadding="3" cellspacing="3" style="background-color: #ffffcc; width: 400px;"> <tbody><tr> <td>Comparison</td> <td>flint/qfb</td> <td>Magma</td> </tr><tr> <td>A</td> <td>118s</td> <td>947s</td> </tr><tr> <td>B</td> <td>1s</td> <td>120s</td> </tr></tbody></table><div><br /></div><div><br /></div><div><br /></div><div><br /></div><div><br /></div><div><br /></div>William Harthttp://www.blogger.com/profile/18416881057216462316noreply@blogger.com0tag:blogger.com,1999:blog-7651115430416156636.post-18573910133091948142011-10-23T08:23:00.000-07:002011-10-23T10:21:57.262-07:00BSDNT - interludeYou will notice that I have not worked on BSDNT for about a year now. Well, I'm thinking of restarting the project soon. I did complete two new revisions v0.25 and v0.26 since I stopped blogging. The first of these added random functions for a single word. They generate single words which have an increased probability of triggering corner cases, e.g. by a sparse binary representation. The second of these updates is a bsdnt_printf function. This is like printf but adds a %w format specifier for printing single words. There is also a %m for printing a len_t and %b for printing a bits_t. <div><br /></div><div>I likely won't get much more done on BSDNT until early next year, but here is what I am planning:<div><br /></div><div>1) I am tremendously grateful to Dr. Brian Gladman for his work on an MSVC version of the library. However, I started to struggle to keep up with this side of things more than I thought. Microsoft's MSVC doesn't support inline assembler in 64 bit x86. This means the entire plan of the MSVC version has to be different. It seems like far too much effort to combine both sets of code into a single library. I've therefore decided (sorry Brian) to ditch the MSVC code from my copy of BSDNT. </div><div><br /></div><div>2) I wasn't happy with the interface of the division code. There are a few issues to consider. </div><div><br /></div><div>The first issue is chaining. Obviously carry-in occurs at the left rather than the right. But for general division functions should the carry-in be a single limb or multiple limbs. It seems like the remainder after division is going to be m limbs and so the carry-in should be also. It is not clear what is better here. Internally, the algorithms deal with just a single carry-in limb because they use 2x1 divisions to compute the quotient digits. Perhaps chaining just means that we consider the first digit of the remainder to be carry-in for the next division.</div><div><br /></div><div>Another problem associated with this is that when reading the carry-in from the array, if the carry-in happens to be zero then the array entry may not exist in memory. This means the code has to always check if the carry-in should be zero or not before proceeding.</div><div><br /></div><div>The other issue to consider is shifting for normalisation. One assumes that the precomputed inverse is computed from a normalised value (the first couple of limbs of the divisor). Now, it is not necessary to shift either dividend or divisor ahead of time. One can still perform the subtractions that occur in division, on unshifted values. One does need to shift limbs of the dividend in turn however, as the algorithm proceeds, in order to compute the limbs of the quotient. But this shifting can occur in registers and need not be written out anywhere. This is implemented, but currently every limb gets shifted twice. This can be cut down to a single shift. </div><div><br /></div><div>3) The PRNGs are currently quite hard to read. They have numerous macros to access their context objects. They are extremely flexible, but possibly overengineered. I'd like to simplify their implementations somewhat.</div><div><br /></div><div>4) The configure script is a little overengineered. The idea of supporting lots of compilers is nice. But in reality GCC should exist almost everywhere. The original concept of BSDNT was to use inline assembly for architecture support. This gets around issues with global symbol prefixes and wotnot. It also makes the library really simple to read. Even on Windows 64 there is MinGW64 and this is the only setup I aim to target in that direction. </div><div><br /></div><div>I hope to deal with all of these issues before proceeding with development of BSDNT. Give me some time as I am busy until about the end of the year. However, I do plan to continue development of BSDNT after sorting out these issues, because I think that fundamentally what we have is very solid.</div><div><br /></div><div>Previous article: <a href="http://wbhart.blogspot.com/2010/11/bsdnt-v024-nnbitsetcleartest-and.html">v0.24 = nn_bitset/clear/test and nn_test_random</a></div></div>William Harthttp://www.blogger.com/profile/18416881057216462316noreply@blogger.com0tag:blogger.com,1999:blog-7651115430416156636.post-80486372547386599722010-11-20T15:01:00.000-08:002011-10-23T10:17:29.143-07:00BSDNT - v0.24 nn_bitset/clear/test and nn_test_random<div>In today's update we make a long overdue change to bsdnt, again to improve our testing</div><div>of the library.</div><div><br /></div><div>We are going to add a function for generating random bignums with sparse binary </div><div>representation. We'll also add some other random functions based on this primitive.</div><div><br /></div><div>Using test integers with sparse binary representation in our test code will push our</div><div>functions harder because it will test lots of corner cases such as words that are all</div><div>zero, in the middle of routines, and so on. As it is currently, we'd be extremely</div><div>lucky for the random word generator we've been using to generate an all zero word, or</div><div>a word with all bits set to one for that matter.</div><div><br /></div><div>The first step to generating such test randoms is to write a function for setting a </div><div>given bit in an integer. This will be an nn_linear function despite it not actually</div><div>taking linear time. In fact, it will take essentially constant time. However, it is an</div><div>nn type function, so it belongs in an nn module.</div><div><br /></div><div>The routine is very straightforward. Given a bit index b, starting from 0 for the least</div><div>significant bit of a bignum, it will simply use a logical OR to set bit b of the bignum.</div><div><br /></div><div>Rather than construct a bignum 2^b and OR that with our number, we simply determine</div><div>which word of the bignum needs altering and create an OR-mask for that word.</div><div><br /></div><div>Computing which word to adjust is trivial, but depends on the number of bits in a word.</div><div>On a 64 bit machine we shift b to the right by 6 bits (as 2^6 = 64), and on a 32 bit</div><div>machine we shift b to the right by 5 bits (2^5 = 32). This has the effect of dividing</div><div>b by 64 or 32 respectively (discarding the remainder). This gives us the index of the</div><div>word we need to adjust. </div><div><br /></div><div>Now we need to determine which bit of the word needs setting. This is given by the </div><div>remainder after dividing b by 64 or 32 respectively, and this can be determined by</div><div>logical AND'ing b with 2^6-1 or 2^5-1 respectively. This yields a value c between 0 and</div><div>63 (or 31) inclusive, which is a bit index. To turn that into our OR-mask, we simply </div><div>compute 2^c (by shifting 1 to the left by c bits).</div><div><br /></div><div>Now that we have our OR-mask and the index of the word to OR it with, we can update the</div><div>required bit. We call this function nn_bit_set.</div><div><br /></div><div>While we are at it we create two other functions, nn_bit_clear and nn_bit_test.</div><div><br /></div><div>It's now straightforward to write test functions which randomly set, clear and test</div><div>bits in a random bignum.</div><div><br /></div><div>Next we create a random bignum generator which sets random bits of a bignum. In order</div><div>to do this, we simply choose a random number of bits to set, from 0 to the number of words</div><div>in the bignum, then we set that many bits at random. </div><div><br /></div><div>We call this function nn_test_random1.</div><div><br /></div><div>We now add a second random bignum generator. It works by creating two bignums using the </div><div>function nn_test_random1 and subtracting one from the other. This results in test randoms </div><div>with long strings of 1's and 0's in its representation. </div><div><br /></div><div>We call this function nn_test_random2.</div><div><br /></div><div>Finally, we create a function nn_test_random which randomly switches between these two </div><div>algorithms and our original nn_random algorithm to generate random bignums. We switch all</div><div>our test code to use nn_test_random by changing the function randoms_of_len to use it.</div><div><br /></div><div>At this point we can have greater confidence that our functions are all working as they</div><div>are supposed to be, as our test code has been suitably fortified at last! (Well, they are</div><div>working now, after I spent a day hunting down the bugs that these new randoms found - no,</div><div>I am not kidding. That's how good at finding bugs this trick is!)</div><div><br /></div><div>Today's code is found here: <a href="https://github.com/wbhart/bsdnt/tree/v0.24">v0.24</a></div><div><br /></div><div>Previous article: <a href="http://wbhart.blogspot.com/2010/11/bsdnt-v023-sha1-and-prng-tests.html">v0.23 - sha1 and PRNG tests</a></div><div>Next article: <a href="http://wbhart.blogspot.com/2011/10/bsdnt-interlude.html">BSDNT - interlude</a></div><div><br /></div>William Harthttp://www.blogger.com/profile/18416881057216462316noreply@blogger.com1tag:blogger.com,1999:blog-7651115430416156636.post-44009540586925636562010-11-12T07:30:00.000-08:002010-11-20T15:07:09.220-08:00BSDNT - v0.23 sha1 and PRNG tests<div>In a recent update we added three PRNGs (pseudo random number </div><div>generators). What we are going to do today is add test code for </div><div>these.</div><div><br /></div><div>But how do you test a pseudo random generator? It's producing </div><div>basically random values after all. So what does it matter if the </div><div>output is screwed up!?</div><div><br /></div><div>Well, it does matter, as shown by the problems on 32 bit machines </div><div>which I wrote about in the PRNG blog post. It would also matter if </div><div>the PRNGs were broken on some platform such that they always output </div><div>0 every time!</div><div><br /></div><div>There's a few ways of testing PRNGs. One is simply to run them for a </div><div>given number of iterations, write down the last value it produces and </div><div>check that it always does this.</div><div><br /></div><div>The method we are going to use is slightly more sophisticated. We are </div><div>going to hash a long series of outputs from the PRNGs, using a hash </div><div>function, and check that the hash of the output is always the same. </div><div><br /></div><div>Basically, our test code will take a long string of words from the </div><div>PRNGs, write them to an array of bytes, then compute the sha1 hash of</div><div>that array of bytes. It'll then compare the computed hash with a hash</div><div>we've computed previously to ensure it has the same value as always. </div><div><br /></div><div>Moreover, we'll set it up so that the hash is the same regardless of </div><div>whether the machine is big or little endian. </div><div><br /></div><div>The hash function we are going to use is called sha1. Specifically, </div><div>we'll be using an implementation of the same written by Brian Gladman </div><div>(he also supplied the new test code for the PRNGs for today's update).</div><div><br /></div><div>The first step is to identify whether the machine is big or little </div><div>endian. This refers to the order of bytes within a word in physical </div><div>memory. On little endian machines (such as x86 machines) the least </div><div>significant byte of a word comes first. On big endian machines the </div><div>order is the other way around. Thus the number 0x0A0B0C0D would have </div><div>the byte 0x0D stored first on a little endian machine, but 0x0A stored </div><div>first on a big endian machine.</div><div><br /></div><div>Endianness can be identified by architecture, or it can be identified</div><div>with a short program. We choose to use the latter method as it should </div><div>be hard to fool. At configure time a short C program will run that will </div><div>place bytes into a four byte array, then read that array as a single</div><div>32 bit number. We then compare the value to a 32 bit value that would</div><div>be stored in the given way on a little endian machine. If it compares</div><div>equal, then the machine must be little endian. If not we compare with</div><div>a number that would be stored in the given way on a big endian machine.</div><div><br /></div><div>If the machine doesn't match either order, then it must be a very rare</div><div>machine with mixed endianness, which we don't support in bsdnt.</div><div><br /></div><div>The configure script will write some defines out to config.h which </div><div>then allow bsdnt modules to identify whether the machine is little or </div><div>big endian at compile time, i.e. at zero runtime cost.</div><div><br /></div><div>Now to discuss the sha1 hashing scheme. </div><div><br /></div><div>A hashing scheme simply takes a piece of data and computes from it a</div><div>series of bits which serve to "identify" that piece of data. If </div><div>someone else has access to the same hashing algorithm and a piece of</div><div>data which purports to be an exact copy of the original, then they </div><div>can verify its identity by computing its hash and comparing.</div><div><br /></div><div>Of course a hash is only valuable in this sense if it is much shorter</div><div>than the piece of data itself (otherwise you'd just compare the </div><div>actual data itself). </div><div><br /></div><div>A very simple hashing scheme might simply add all the words in the </div><div>input to compute a hash consisting of a single word. </div><div><br /></div><div>A secure hashing scheme has at least two other properties. </div><div><br /></div><div>i) It shouldn't be possible to determine the original data from its </div><div>hash. (The data may be secret and one may wish to provide for the</div><div>independent verification of its authenticity by having the recipient</div><div>compare the hash of the secret data with a publicly published value.</div><div>Or, as is sometimes the case, the hash of secret data, such as a</div><div>password, might be transmitted publicly, to compare it with a </div><div>pre-recorded hash of the data.)</div><div><br /></div><div>ii) It must be computationally infeasible to substitute fake data</div><div>for the original such that the computed hash of the fake data is the </div><div>same as that of the original data.</div><div><br /></div><div>Of course, if the hash is short compared to the data being hashed, </div><div>then by definition many other pieces of data will have the same hash.</div><div>The only requirement is that it should be computationally infeasible</div><div>to find or construct such a piece of data.</div><div><br /></div><div>The first step in the SHA1 algorithm is some bookkeeping. The </div><div>algorithm, as originally described, works with an input message which</div><div>is a multiple of 16 words in length. Moreover, the last 64 bits are </div><div>reserved for a value which gives the length of the original message in </div><div>bits.</div><div><br /></div><div>In order to facilitate this, the original message is padded to a </div><div>multiple of 16 words in length, with at least enough padding to allow</div><div>the final 64 bits to be part of the padding, and to not overlap part </div><div>of the message.</div><div><br /></div><div>The padding is done by first appending a single binary 1 bit, then</div><div>binary zeroes are appended until the message is the right length.</div><div>Then of course the length in bits of the original message is placed</div><div>in the final 64 bits of the message.</div><div><br /></div><div>The hashing algorithm itself performs a whole load of prescribed</div><div>twists and massages of the padded message.</div><div><br /></div><div>For this purpose some functions and constants are used. </div><div><br /></div><div>Given 32 bit words B, C and D there are four functions:</div><div><br /></div><div>1) f0(B, C, D) = (B AND C) OR ((NOT B) AND D)</div><div><br /></div><div>2) f1(B, C, D) = B XOR C XOR D</div><div><br /></div><div>3) f2(B, C, D) = (B AND C) OR (B AND D) OR (C AND D)</div><div><br /></div><div>4) f3(B, C, D) = B XOR C XOR D </div><div><br /></div><div>and four corresponding 32 bit constants:</div><div><br /></div><div>1) C0 = 0x5A827999</div><div><br /></div><div>2) C1 = 0x6ED9EBA1</div><div><br /></div><div>3) C2 = 0x8F1BBCDC</div><div><br /></div><div>4) C3 = 0xCA62C1D6</div><div><br /></div><div>To begin the algorithm, we break the padded message up into 16 word </div><div>blocks M1, M2, M3, i.e. each Mi is 16 words of the padded message. </div><div><br /></div><div>Each block is processed via a set of steps, and an "accumulated" hash </div><div>of 160 bits, consisting of five 32 bit words (the final hash we are </div><div>after) is computed: H0, H1, H2, H3, H4.</div><div><br /></div><div>The algorithm starts by initialising the five "hash words" to the </div><div>following values:</div><div><br /></div><div>H0 = 0x67452301, H1 = 0xEFCDAB89, H2 = 0x98BADCFE, H3 = 0x10325476 </div><div>and H4 = 0xC3D2E1F0.</div><div><br /></div><div>Each block of 16 words, Mi, of the padded message is then used in </div><div>turn to successively transform these five words, according to the</div><div>following steps:</div><div><br /></div><div>a) Break Mi into 16 words W0, W1, ..., W15.</div><div><br /></div><div>b) For j = 16 to 79, let Wj be the word given by</div><div><br /></div><div>Wj = S^(-1)(W{j-3}) XOR W{j-8} XOR W{j-14} XOR W{j-16}),</div><div><br /></div><div>where S^n(X) means to rotate the word X to the left through n bits </div><div>(a negative n means right rotation).</div><div><br /></div><div>c) Make a copy of the hashing words:</div><div><br /></div><div>A = H0, B = H1, C = H2, D = H3, E = H4</div><div><br /></div><div>d) For j = 0 to 79 repeat the following set of transformations in </div><div>the order given:</div><div><br /></div><div>TEMP = S^5(A) + f{j/20}(B, C, D) + E + Wj + C{j/20},</div><div><br /></div><div>E = D, D = C, C = S^30(B), B = A, A = TEMP,</div><div><br /></div><div>where j/20 signifies "floor division" by 20, and where f and C are </div><div>the above-defined functions and constants.</div><div><br /></div><div>e) Update the hashing words according to the following:</div><div><br /></div><div>H0 = H0 + A, H1 = H1 + B, H2 = H2 + C, H3 = H3 + D, H4 = H4 + E.</div><div><br /></div><div>Note that steps a-e are repeated for each block of 16 words, Mi in </div><div>the padded message, further manipulating the five words with each run. </div><div>The resulting five words H0, H1, H2, H3, H4 after all the words of the</div><div>padded message have been consumed, constitutes the sha1 hash of the </div><div>original message.</div><div><br /></div><div>The function to compute the sha1 hash of a message is given in the</div><div>files sha1.c and sha1.h in the top level directory of the source</div><div>tree. </div><div><br /></div><div>A new test file t-rand.c is added in the test directory. It contains</div><div>the sha1 hash of a large number of words as output by our three</div><div>PRNGs. If a user of bsdnt has the same hash for the PRNGs when run</div><div>on their machine, then they can have a very high level of confidence</div><div>that they are working as expected.</div><div><br /></div><div>Note that the sha1 algorithm is known as a secure hashing algorithm,</div><div>which means that in theory it can be used to hash very important</div><div>data so that the recipient can independently confirm the data hasn't</div><div>been tampered with (by computing the hash of the value and making</div><div>sure it matches some published value). </div><div><br /></div><div>We don't explain how sha1 actually works. The mysterious constants</div><div>are not so mysterious. C0 is the square root of 2 in hexadecimal, C1 is</div><div>the square root of 3, C2 the square root of 5, C3 the square root of 10.</div><div><br /></div><div>I don't know the meaning of the functions f0-f3. </div><div><br /></div><div>What is worth noting is that in recent years, people have figured out </div><div>how to produce sha1 hash collisions (two messages with the same hash). </div><div>I don't pretend to be an expert in such things.</div><div><br /></div><div>All we care about here is that a broken PRNG really can't pretend to</div><div>be working, and for that, sha1 works a treat.</div><div><br /></div><div>Disclaimer: use the information in this post at your OWN RISK!! We </div><div>make no representations as to its correctness. The same goes for </div><div>bsdnt itself. Read the license agreement for details.</div><div><br /></div><div>Warning: cryptography is restricted by law in many countries including</div><div>many of those where the citizens believe it couldn't possibly be so. </div><div>Please check your local laws before making assumptions about what you </div><div>may do with crypto.</div><div><br /></div><div>The code for today's article is here: <a href="https://github.com/wbhart/bsdnt/tree/v0.23">v0.23</a></div><div><br /></div><div>Previous article: <a href="http://wbhart.blogspot.com/2010/11/bsdnt-v022-windows-support.html">v0.22 - Windows support</a></div><div>Next article: <a href="http://wbhart.blogspot.com/2010/11/bsdnt-v024-nnbitsetcleartest-and.html">v0.24 - nn_bitset/clear/test and nn_test_random</a></div><div><br /></div>William Harthttp://www.blogger.com/profile/18416881057216462316noreply@blogger.com0tag:blogger.com,1999:blog-7651115430416156636.post-61909347035112691602010-11-11T05:58:00.000-08:002010-11-12T07:38:14.002-08:00BSDNT - v0.22 Windows support<div>Today's update is a massive one, and comes courtesy of Brian Gladman. At last we add </div><div>support for MSVC 2010 on Windows. </div><div><br /></div><div>In order to support different architectures we add architecture specific files in the arch </div><div>directory. There are three different ways that architectures might be supported:</div><div><br /></div><div>* Inline assembly code</div><div><br /></div><div>* Standalone assembly code (using an external assembler, e.g. nasm)</div><div><br /></div><div>* Architecture specific C code</div><div><br /></div><div>Windows requires both of the last two of these. On Windows 64 bit, MSVC does not support </div><div>inline assembly code, and thus it is necessary to supply standalone assembly code for this</div><div>architecture. This new assembly code now lives in the arch/asm directory.</div><div><br /></div><div>On both Windows 32 and 64 bit there is also a need to override some of the C code from the base</div><div>bsdnt library with Windows specific code. This lives in the arch directory.</div><div><br /></div><div>Finally, the inline assembly used by the base bsdnt library on most *nix platforms is now in the</div><div>arch/inline directory.</div><div><br /></div><div>In each case, the os/abi combination is specified in the filenames of the relevant files. For</div><div>example on Windows 32, the files overriding code in nn_linear.c/h are in arch/nn_linear_win32.c/h.</div><div>(Note win32 and x64 are standard Windows names for 32 and 64 bit x86 architectures, respectively.)</div><div><br /></div><div>If the code contains architecture specific code (e.g. assembly code) then the name of the file</div><div>contains an architecture specifier too, e.g. arch/inline/nn_linear_x86_64_k8.h for code specific</div><div>to the AMD k8 and above.</div><div><br /></div><div>It's incomprehensible that Microsoft doesn't support inline assembly in their 64 bit compiler</div><div>making standalone assembly code necessary. It would be possible to use the Intel C compiler on </div><div>Windows 64, as this does support inline assembly. But this is very expensive for our volunteer </div><div>developers, so we are not currently supporting this. Thus, on Windows 64, the standalone </div><div>assembly is provided in the arch/asm directory as just mentioned.</div><div><br /></div><div>Brian has also provided MSVC build solution files for Windows. These are in the top level source</div><div>directory as one might expect.</div><div><br /></div><div>There are lots of differences on Windows that requires functions in our standard nn_linear.c, </div><div>nn_quadratic.c and helper.c files to be overridden on Windows.</div><div><br /></div><div>The first difference is that on 64 bit Windows, the 64 bit type is a long long, not a long. This</div><div>is handled by #including a types_arch.h file in helper.h. On most platforms this file is empty.</div><div>However, on Windows it links to an architecture specific types.h file which contains the</div><div>requisite type definitions. So a word_t is a long long on Windows. </div><div><br /></div><div>Also, when dealing with integer constants, we'd use constants like 123456789L when the word type</div><div>is a long, but it has to become 123456789LL when it is a long long, as on Windows 64. To get </div><div>around this, an architecture specific version of the macro WORD(x) can be defined. Thus, when</div><div>using a constant in the code, one merely writes WORD(123456789) and the macro adds the correct</div><div>ending to the number depending on what a word_t actually is. </div><div><br /></div><div>Some other things are different on Windows too. The intrinsic for counting leading zeroes is </div><div>different to that used by gcc on other platforms. The same goes for the function for counting</div><div>trailing zeroes. We've made these into macros and given them the names high_zero_bits and</div><div>low_zero_bits respectively. The default definitions are overridden on Windows in the architecture</div><div>specific versions of helper.h in the arch directory.</div><div><br /></div><div>Finally, on Windows 64, there is no suitable native type for a dword_t. The maximum sized</div><div>native type is 64 bits. Much of the nn_linear, and some of the nn_quadratic C code needs to </div><div>be overridden to get around this on Windows. We'll only be using dword_t in basecase algorithms</div><div>in bsdnt, so this won't propagate throughout the entire library. But it is necessary to </div><div>override functions which use dword_t on Windows.</div><div><br /></div><div>Actually, if C++ is used, one can define a class called dword_t and much of the code can</div><div>remain unchanged. Brian has a C++ branch of bsdnt which does this. But for now we have C code</div><div>only on Windows (otherwise handling of name mangling in interfacing C++ and assembly code </div><div>becomes complex to deal with). </div><div><br /></div><div>Brian has worked around this problem by defining special mul_64_by_64 and div_128_by_64 </div><div>functions on 64 bit Windows. These are again defined in the architecture specific version of</div><div>helper.h for Windows 64.</div><div><br /></div><div>Obviously some of the precomputed inverse macros need to be overridden to accomodate these</div><div>changes, and so these too have architecture specific versions in the Windows 64 specific version </div><div>of the helper.h file.</div><div><br /></div><div>In today's release we also have a brand new configure file for *nix. This is modified to handle</div><div>all the changes we've made to make Windows support easy. But Antony Vennard has also done </div><div>some really extensive work on this in preparation for handling standalone assembly on arches </div><div>that won't handle our inline assembly (and for people who prefer to write standalone assembly </div><div>instead of inline assembly).</div><div><br /></div><div>The new configure file has an interactive mode which searches for available C compilers (e.g.</div><div>gcc, clang, icc, nvcc) and assemblers (nasm, yasm) and allows the user to specify which to use.</div><div>This interactive feature is off by default and is only a skeleton at present (it doesn't actually</div><div>do anything). It will be the subject of a blog post later on when the configure file is finished.</div><div><br /></div><div>The code for today is found at: <a href="https://github.com/wbhart/bsdnt/tree/v0.22">v0.22</a></div><div><br /></div><div>Previous article: <a href="http://wbhart.blogspot.com/2010/10/bsdnt-v021-prngs.html">v0.21 - PRNGs</a></div><div>Next article: <a href="http://wbhart.blogspot.com/2010/11/bsdnt-v023-sha1-and-prng-tests.html">v0.23 - sha1 and prng tests</a></div><div><br /></div>William Harthttp://www.blogger.com/profile/18416881057216462316noreply@blogger.com0tag:blogger.com,1999:blog-7651115430416156636.post-32838561991805783882010-10-31T16:02:00.000-07:002010-11-11T06:02:34.141-08:00BSDNT - v0.21 PRNGsIn this update we are going to replace the el cheapo random generator in<br />bsdnt with something of higher quality.<br /><br />Some time ago, Brian Gladman brought to my attention the fact that on 32<br />bit machines, the test code for bsdnt actually caused Windows to hang!<br /><br />The issue required some sleuthing work on Brian's part to track down.<br />He eventually discovered the cause of the problem, and it was, oddly<br />enough, the pseudo-random number generator (PRNG) I had used.<br /><br />Brian suspected the PRNG immediately because of his past experience as a<br />professional cryptographer. In fact, it turns out that PRNG's of the<br />kind that I had used, aren't particularly good even if they don't have<br />bugs!<br /><br />The particular kind of PRNG I had used is called a linear congruential<br />PRNG. They start with the PRNG initialised to some random seed value,<br />n = seed. Then each time they are called, they apply the transformation<br />n = (n*c1 + c2) % p for some explicit constants c1, c2 and a large<br />enough "prime" p.<br /><br />One can take c2 = 0 in the transformation and it is also common to see<br />p = 2^b for some b (e.g. b = WORD_BITS, and yes, I know p = 2^b is not<br />usually prime). When p = 2^b it is usually the case that the top half of<br />the bits output have reasonable random properties, but the bottom half<br />do not. In this case it is acceptable to stitch two LC PRNG's together to<br />give the full number of bits.<br /><br />When p is an actual prime, these PRNG's are called prime modulus linear<br />congruential PRNG's and they aren't too bad when implemented properly.<br />They still fail to meet some standards of random quality.<br /><br />To return a whole word of random bits, one either needs to use a prime<br />p that is larger than a word, which is usually impractical, or again<br />stitch the output of two prime modulus LC PRNG's together.<br /><br />However, one needs to be careful, as the period of the generator is p-1,<br />so if one is on a 32 bit machine, it doesn't do to use a prime p just<br />over half the size of a word (the first mistake I made), otherwise the<br />period is just over 65536. That isn't too good for generating random<br />values for test code!<br /><br />But how was my LC PRNG causing Windows to crash!? The reason was that<br />in some of the test functions we required bignums that didn't overflow<br />when summed together. This of course depends almost entirely on the top<br />few bits of the bignums being added together.<br /><br />The problem was that in the expression n = (n*c1 + c2) % p, I was using<br />values of c1 and c2 which were not reduced mod p. It turns out that this<br />is crucial to correct operation. It might seem that the result ends up<br />being reduced mod p anyway, and indeed it would be if n*c1 fit in a word.<br />However, because it doesn't you actually get ((n*c1 + c2) % 2^32) % p<br />which causes the binary representation of the value generated to be quite<br />regular.<br /><br />Anyhow, on Windows (and probably on other 32 bit machines) the test code<br />generates length 90 bignums over and over at some point, looking in vain<br />to find pairs of such bignums which when added do not overflow. As these<br />are garbage collected at the end of the test function, memory starts<br />filling up with the orphaned bignums that are discarded by the test code<br />as it looks for appropriate values. This eventually overwhelms the heap<br />allocator on Windows and causes the entire OS to crash!<br /><br />The problem of writing decent PRNG's has been studied extensively, and one<br />expert in the subject is George Marsaglia. He famously turned up on a<br />usenet forum in January of 1999 and dumped not one, but piles of fast, high<br />quality PRNG's which do not suffer from the problems that other PRNG's do.<br /><br />Amazingly, many of the PRNG's in common usage today are either written by<br />George, or based on ones he wrote. So he's some kind of legend!<br /><br />Anyhow, we will make use of two of his PRNG's, Keep It Simple Stupid (KISS)<br />and Super KISS (SKISS) and a third PRNG called Mersenne Twister, due to<br />Makoto Matsumoto and Takuji Nishimura in 1997.<br /><br />George's generators are in turn based on some simpler PRNG's. He begins by<br />defining a linear congruential generator, with c1 = 69069 and c2 = 1234567.<br />This is taken p = mod 2^32 (yes, it's not prime). This has good properties<br />on its top 16 bits, but not on its bottom 16 bits, and for this reason had<br />been widely used before George came along. This generator has period 2^32.<br /><br />Next he defines a pair of multiply with carry (MWC) generators. These are<br />of the form n = c1*lo(n) + hi(n) where lo(n) is the low 16 bits of n, hi(n)<br />is the high 16 bits and c1 is an appropriately chosen constant.<br /><br />He stitches together a pair of these MWC PRNG's mod 2^16 to give 32 random<br />bits. For simplicity we'll refer to this combined random generator as MWC.<br />This has a period of about 2^60.<br /><br />Thirdly he defines a (3-)shift-register generator (SHR3). This views the<br />value n as a binary vector of 32 bits and applies linear transformations<br />generated from 32 x 32 matrices L and R = L^T according to<br />n = n(I + L^17)(I + R^13)(I + L^5) where I is the 32 x 32 identity matrix.<br /><br />In order to speed things up, special transformations are chosen that can<br />be efficiently implemented in terms of XOR and shifts. This is called an<br />Xorshift PRNG. We'll just refer to it as SHR3.<br /><br />Now given appropriate seed values for each of these PRNG's Marsaglia's<br />KISS PRNG is defined as MWC ^ CONG + SHR3. This generator passes a whole<br />slew of tests and has a period of 2^123. In this update we make it the<br />default random generator for bsdnt.<br /><br />Super KISS is a random generator defined by George later in 2009. It gives<br />immense periods by adding together the output of three PRNG's, one with a<br />massive order. It is basically defined by SKISS = SUPR + CONG + SHR3.<br /><br />Here, the new generator SUPR is based on a prime p = a*b^r + 1 such that<br />the order of b mod p has magnitude quite near to p - 1.<br /><br />It starts with a seeded vector z of length r, all of whose entries are<br />less than b and an additional value c which is less than a.<br /><br />One then updates the pair (z, c) by shifting the vector z to the left by<br />one place and setting the right-most entry to (b - 1) - ((a*z1 + c) mod b)<br />where z1 is the entry shifted out at the left of z. Then c is set to t/b.<br /><br />Naturally in practice one uses b = 2^32 so that all the intermediate<br />reductions mod b are trivial.<br /><br />As with most generators which have massive periods the "state" held by this<br />generator is large. It requires data mod p for a multiprecision p.<br /><br />Note the similarity with the MWC generator except for the "complement" mod<br />b that occurs. This is called a CMWC (Complemented-Multiply-With-Carry)<br />generator.<br /><br />George proposed using the prime p = 640*b^41265+1, where the order of b is<br />5*2^1320481. The period of the CMWC generator is then greater than<br />2^1300000.<br /><br />Of course, at each iteration of the algorithm, 41265 random words are<br />generated in the vector. Once these are exhausted, the next iteration of<br />the algorithm is made.<br /><br />The algorithm SUPR in the definition of SKISS is thus just a simple<br />array lookup to return one of the words of the vector z. Each time SKISS<br />is run, the index into the array is increased until all words of the array<br />are exhausted, at which point the CMWC algorithm is iterated to refill the<br />array.<br /><br />We now come to describing the Mersenne twister.<br /><br />It is based on the concept of a feedback shift register (FSR). An FSR shifts<br />its value left by 1 bit, feeding at the right some linear combination of the<br />bits in its original value. The Mersenne twister is conceptually a<br />generalisation of this.<br /><br />The difference with the Mersenne twister is that the "feedback" is effected<br />by a certain "twist". This is effected by applying a "linear transformation"<br />A of a certain specific form, with multiplication by A having addition<br />replaced by xor in the matrix multiplication. The twist can be described<br />more straightforwardly, and we give the more straightforward description<br />below.<br /><br />One sets up a Mersenne twister by picking a recurrence degree n, a "middle<br />word" 1 <= m <= n and a number of bits for a bitmask, 0 <= r <= 32. One <div>picks these values so that p = 2^(n*w - r) - 1 is a Mersenne prime (hence </div><div>the name of this PRNG). Given a vector of bits a = [a0, a1, ..., a{w-1}] of length </div><div>w and a sequence x of words of w bits, the Mersenne twister is defined by a </div><div>recurrence relation x[k+n] = x[k+m] ^ ((upper(x[k]) | lower(x[k+1])) A) </div><div>where upper and lower return the upper w - r and lower r bits of their </div><div>operands, and where A is the "twist" spoken of and defined below, in terms of </div><div>a. Of course ^ here is the xor operator, not exponentiation. For a vector X of w </div><div>bits, XA is given by X>>1 if X[0] == 0 otherwise it is given by (X>>1) ^ a.<br /><br />Some theory is required to find an A such that the Mersenne twister will have<br />maximum theoretical period 2^(n*w - r) - 1.<br /><br />To finish off, the Mersenne twister is usually "tempered". This tempering<br />simply mangles the bits in a well understood way to iron out some of the<br />known wrinkles in the MT algorithm.<br /><br />Only a couple of sets of parameters are in common use for Mersenne twisters.<br />These are referred to as MT19937 for 32 bit words and MT19937-64 for 64 bit<br />words.<br /><br />As with all PRNG's, there is a whole industry around "cracking" these things.<br />This involves starting with a short sequence of values from a PRNG and<br />attempting to find the starting constants and seed values.<br /><br />Obviously, in crytographic applications, there is not much point generating<br />"secure" keys with a PRNG with a single source of entropy. Even if your key<br />is generated by multiplying primes of many words in length, if those words<br />were generated from a PRNG seeded from the current time, it may only take<br />a few iterations and a good guess as to which PRNG you used, to determine<br />the constants used in the PRNG and thus your entire key. And that's<br />irrespective of which constants you chose in your PRNG!<br /><br />So if you are doing crypto, you need to take additional precautions to<br />generate secure keys. Just seeding a PRNG from the time probably won't cut<br />it!<br /><br />Some PRNG's are more "secure" than others, meaning that knowing a<br />few output values in a row doesn't give terribly much information about<br />which values may follow. But if you rely on a PRNG to be secure, you<br />are essentially betting that because you don't know how to get the<br />next few values and nor does anyone else that has written about the<br />subject, then no one at all knows. Of course one needs to ask oneself<br />if they would tell you if they did.<br /><br />Another assumption one should never make is that no one has the computing<br />power to brute force your PRNG.<br /><br />Some PRNG's are designed for cryptographic applications, and maybe one can<br />believe that these are "safe" to use, for some definition of safe.<br /><br />Anyhow, we only care about random testing at this point. In today's update<br />32 and 64 bit KISS, SKISS and MT PRNG's are added in the directory rand.<br />Our randword, randinit, and randclear functions are all replaced with<br />appropriate calls to KISS functions.<br /><br />There is also an option to change the default PRNG used by bsdnt. Is it my<br />imagination or does the test code now run faster, even on a 64 bit machine!<br /><br />At some point we will add some tests of the new PRNG's. These will compare<br />the outputs with known or published values to check that they are working as<br />designed for a large number of iterations.<br /><br />Brian Gladman contributed to this article and also did most of the work<br />in implementing Marsaglia's PRNG's in bsdnt. The Mersenne twisters were<br />originally written by Takuji Nishimura and Makoto Matsumoto and made available<br />under a BSD license. Brian did most of the work in adapting these for bsdnt.<br /><div><br /></div><div>The code for today's update is here: <a href="http://github.com/wbhart/bsdnt/tree/v0.21">v0.21</a></div><div><br /></div><div>Previous article: <a href="http://wbhart.blogspot.com/2010/10/bsdnt-v020-redzones.html">v0.20 - redzones</a></div><div>Next article: <a href="http://wbhart.blogspot.com/2010/11/bsdnt-v022-windows-support.html">v0.22 - Windows support</a></div></div><div><br /></div>William Harthttp://www.blogger.com/profile/18416881057216462316noreply@blogger.com0tag:blogger.com,1999:blog-7651115430416156636.post-69217631593983886482010-10-30T12:12:00.000-07:002010-10-31T16:11:18.134-07:00BSDNT - v0.20 redzones<div>In this update we implement another improvement to the test code in bsdnt. I don't know</div><div>what the correct name is, but I call them redzones.</div><div><br /></div><div>The basic idea is this: suppose you have a function nn_blah say, and it writes to an nn_t</div><div>b say. If it writes well beyond the allocated space for b, then almost certainly a </div><div>segfault will occur. But what if it only writes a word or two before the beginning or</div><div>after the end of the allocated space? Very likely this will cause a segfault only on</div><div>some systems, depending on the granularity of the heap allocator and depending</div><div>on what other bsdnt data might be in the overwritten space!</div><div><br /></div><div>So what if we could detect this kind of error? Well, that is what redzones hope to do. </div><div>Essentially if an nn_t b is allocated with m words of space, when redzones are turned on</div><div>it allocates m + 2C words of space for some small constant C. It then fills the first</div><div>and last C words of b with known words of data (usually some recognisable pattern of bits).</div><div><br /></div><div>When the garbage collector cleans up, it examines the redzones to ensure that they have</div><div>not been altered. If they have, they raise an error.</div><div><br /></div><div>The nn_t b is set to point just after the first C words, which contain the redzone, and in </div><div>every other respect act like a normal nn_t. The user needn't know that an extra C words</div><div>of data were allocated immediately before and after the length m nn_t they requested.</div><div>Nor do they need to be aware of the checking that goes on when the nn_t is finally cleaned</div><div>up, that the redzones haven't been touched.</div><div><br /></div><div>Of course it's nice to be able to turn redzones off sometimes, when testing the library. </div><div>Therefore I've added a configure option -noredzones which turns off redzones if they are </div><div>not required. This works by setting a #define WANT_REDZONES 0 in config.h. The </div><div>memory allocator for nn_t's and the garbage collector both operate differently if redzones </div><div>are turned on.</div><div><br /></div><div>At present, the only way to allocate memory for nn_t's in test code is to use </div><div>randoms_of_len, so it is convenient to rewrite this to call a function alloc_redzoned_nn </div><div>instead of malloc, and for the garbage collector to call free_redzoned_nn. These new </div><div>functions are defined in test.c. </div><div><br /></div><div>The only difference when WANT_REDZONES is set in config.h is that REDZONE_WORDS, which is defined in test.h is changed from 0 to 4 words (meaning 4 redzone words are to be</div><div>allocated at each end of a redzoned nn_t). Having redzones of length 0 is the same as not </div><div>having them at all. So this makes the functions easy to write.</div><div><br /></div><div>Also in test.h REDZONE_BYTE is defined to the hexaecimal byte 0xA which has binary bit</div><div>pattern 1010, i.e. alternating one's and zeroes. This is the value that is placed into the </div><div>redzones byte-by-byte before the nn_t is used. At the end, when they are cleaned up, the</div><div>garbage collector examines the redzones to ensure they are still filled with these bytes.</div><div><br /></div><div>Fortunately checking redzones does not dramatically slow down our test code, and no new</div><div>test failures result. This means it is highly likely that our nn_t functions do not overwrite</div><div>their bounds. To check that the new redzones code works, it is a simple matter of mocking</div><div>up a broken function which overwrites its bounds. The new code complains loudly as it </div><div>should, unless redzones are switched off at configure time.</div><div><br /></div><div>The code for today's update is here: <a href="http://github.com/wbhart/bsdnt/tree/v0.20">v0.20</a></div><div><br /></div><div>Previous article: <a href="http://wbhart.blogspot.com/2010/10/bsdnt-v019-asserts.html">v0.19 - asserts</a></div><div>Next article: <a href="http://wbhart.blogspot.com/2010/10/bsdnt-v021-prngs.html">v0.21 - prngs</a></div><div><br /></div>William Harthttp://www.blogger.com/profile/18416881057216462316noreply@blogger.com0tag:blogger.com,1999:blog-7651115430416156636.post-62129960439407005082010-10-25T19:38:00.000-07:002010-10-30T12:25:36.275-07:00BSDNT - v0.19 asserts<div>About a week ago I got enthused to work on another coding project I've been </div><div>wanting to experiment with for a long time. I discovered that it was highly</div><div>addictive and I just couldn't put it down. It's also given me some interesting</div><div>ideas for a higher level interface to bsdnt. But more on that later when we </div><div>start working on it.</div><div><br /></div><div>Unfortunately in that week of time there have been no bsdnt updates.</div><div><br /></div><div>Moreover, on the weekend my main computer died (physical hard drive failure).</div><div>I pulled out my backup machine and Windows wanted to install 3 months of </div><div>"important updates". Naturally this caused the machine to crash, I was unable</div><div>to recover to the restore point it set, the startup repair didn't work and</div><div>the only solution was a format and reload. </div><div><br /></div><div>*Fifteen and a half hours* later I had reinstalled Windows and it had finally </div><div>finished putting the approximately 165 "important updates" back on!!</div><div><br /></div><div>Unfortunately all the bsdnt blog articles I had meticulously prepared in </div><div>advance were lost. Thus I am regenerating what I can from the diff between</div><div>revisions of bsdnt. Sorry if they end up being shorter than past updates.</div><div><br /></div><div>Fortunately I did not lose any of the code I wrote, as that was backed up in</div><div>a git repo on an external server!</div><div><br /></div><div>Anyhow, in this update we make a very simple change to bsdnt, again in an</div><div>attempt to improve the test quality of the library. We add asserts to the </div><div>code. </div><div><br /></div><div>An assert is a check that is made at runtime in live code to test if some</div><div>predefined condition holds. If the assert fails, an error message is printed</div><div>specifying the line of code where the assert is located and what the</div><div>condition was that failed.</div><div><br /></div><div>Now, I am not personally a great believer in asserts. As they are runtime</div><div>checks, they require computing cycles, which is just a no-no for a bignum</div><div>library. The other option is to turn them off when not testing code. However,</div><div>this simply leads to the asserts rarely being run when they are needed.</div><div><br /></div><div>The other problem with asserts is that they pollute the code, making the</div><div>source files longer and appear more complex.</div><div><br /></div><div>However, there is one situation where I believe they can be very helpful, </div><div>and that is in checking the interface of functions within a library and that</div><div>it is being respected both in intra-library calls and by the test code for</div><div>the library.</div><div> </div><div>Specifically, assert are useful for checking that valid inputs have been </div><div>passed to the functions, e.g. you might have a restriction that a Hensel</div><div>modulus be odd. Adding an assert allows you to test that all the moduli</div><div>you pass to the function in your test runs are in fact odd.</div><div><br /></div><div>The main advantage in putting asserts into the code is that it forces you</div><div>to think through what all the conditions should be that you assert. In </div><div>adding asserts to the code in bsdnt I discovered one function in which the</div><div>test code was pushing the code to do things I didn't write it to cover.</div><div>This forced me to either rewrite the test, or drop that as a condition (I</div><div>think I chose the former for consistency with other related functions in</div><div>bsdnt).</div><div><br /></div><div>Of course we do not want to consume cycles when the library is run by the</div><div>end user, and so we make asserts optional. This is done using a configure</div><div>switch. By default the macro WANT_ASSERT is set to 0 in a file config.h by</div><div>configure. However, if the user passes the option -assert to configure, it</div><div>sets the value of this define to 1. </div><div><br /></div><div>A macro ASSERT is then defined in helper.h which is either an empty macro</div><div>in the default case or is an alias for the C assert function if WANT_ASSERT</div><div>is set to 1 in config.h.</div><div><br /></div><div>Of course we have to remember to turn asserts on to run the test code, and</div><div>this really highlights their main weakness. As I mentioned, the asserts I </div><div>added did clarify the interface, but I don't believe they showed up any </div><div>bugs in bsdnt. With this expectation, asserts can be a useful tool.</div><div><br /></div><div>The code for today's update is here: <a href="http://github.com/wbhart/bsdnt/tree/v0.19">v0.19</a></div><div><br /></div><div>Previous article: <a href="http://wbhart.blogspot.com/2010/10/bsdnt-v018-printxword-nnprintx.html">v0.18 - printx_word, nn_printx</a></div><div>Next article: <a href="http://wbhart.blogspot.com/2010/10/bsdnt-v020-redzones.html">v0.20 - redzones</a></div><div><br /></div>William Harthttp://www.blogger.com/profile/18416881057216462316noreply@blogger.com0tag:blogger.com,1999:blog-7651115430416156636.post-22170803259651997852010-10-16T14:51:00.000-07:002010-10-25T19:52:41.907-07:00BSDNT - v0.18 printx_word, nn_printx<div>It is time we improved our test code again. We'll spend a few days updating</div><div>things to make improvements in the way we test.</div><div><br /></div><div>Today's update is quite straightforward. We currently have no way of printing</div><div>nn_t's. This is quite inconvenient when it comes to the test code, where</div><div>little to no diagnostic information is printed at all. In particular, we </div><div>aren't printing out any of the multiple precision integers for examination</div><div>when a test fails.</div><div><br /></div><div>Now, it is actually quite a difficult job to print bignum integers in decimal. </div><div>In fact, as far as I can see, one requires a function which allocates temporary</div><div>space to efficiently print integers. This is an interesting challenge:</div><div>is there an algorithm to convert from binary to decimal and print the result,</div><div>with just O(1) temporary space, with any complexity.</div><div><br /></div><div>I think it may be possible if one allows the input to be destroyed. If so, a</div><div>subsidiary question would be to do the same thing without destroying the </div><div>input. I doubt that is possible, but I do not have a proof. Of course, to be</div><div>practical, we'd require an algorithm which doesn't destroy the input.</div><div><br /></div><div>To get around this issue, we'll start with a simple nn_printx algorithm, </div><div>which will print a bignum in hexadecimal. We also add an nn_printx_short </div><div>function which prints the first couple of words of a bignum, an ellipsis and </div><div>then the final couple of words. This is useful for large bignums that would </div><div>print for screens and screens due to their size. We'll use this in our test </div><div>code to prevent printing too much output upon test failure.</div><div><br /></div><div>Another function we add is an nn_printx_diff function. It accepts two nn_t's</div><div>and prints information about the range of words where they differ and prints </div><div>the first and last differing word in each case.</div><div><br /></div><div>There is one tricky aspect to our printing functions however. A word is often</div><div>an unsigned long, but on some platforms it will be an unsigned long long. For </div><div>this reason, when printing a word, we need to use %lx as the format specifier </div><div>on some platforms and %llx on others. </div><div><br /></div><div>So we need to add a routine which will print a word and abstract away the </div><div>format specifier so the caller doesn't have to think about it. The function</div><div>we include to do this is caled printx_word. It prints a word without needing</div><div>to specify a format specifier.</div><div><br /></div><div>We add files helper.c/h to bsdnt which will contain routines like this one</div><div>which aren't specific to our nn module. A few existing functions and macros</div><div>also get moved there. The configure system will automatically look for </div><div>architecture specific versions of helper.c, allowing us to override the</div><div>definition of the functions in that file on a per architecture basis.</div><div><br /></div><div>We add the printx_word function to helper.c which can be overridden with </div><div>an architecture specific version. On a platform where %llx is required, an </div><div>architecture specific version will simply replace the generic version which </div><div>uses %lx.</div><div><br /></div><div>In test.h we add some macros, print_debug and print_debug_diff which use the</div><div>stringizing operator to print the names of the variables and then print their</div><div>values. The stringizing operator (#) is a preprocessor macro which turns a </div><div>macro parameter into a string. In our case, we pass the variable name to the</div><div>macro and turn it into a string so that we can print the variable name. </div><div><br /></div><div>A few modifications to the TEST_START and TEST_END macros in test.h also </div><div>allow us to give a unique name to each test which is then printed along with </div><div>the iteration at which the test failed. This also uses the stringizing </div><div>operator so that the caller of TEST_START can specify the unique name for the</div><div>test. It seems difficult to come up with an automatic way of generating </div><div>unique test names, so this will have to do.</div><div><br /></div><div>It would also be a useful thing to have it print the value of the random seed </div><div>at the start of a failing iteration too. After we have improved the random</div><div>generation code in bsdnt v0.21, perhaps someone would like to try adding this </div><div>feature. </div><div><br /></div><div>We roll out our new diagnostic printing routines to our test code. Of course,</div><div>to see any of this new code in action, one has to introduce a bug in one of </div><div>the tests so that the new diagnostic code is actually run. I leave it to you</div><div>to fiddle around introducing bugs to see that the new test code does actually</div><div>print useful diagnostic information.</div><div><br /></div><div>Later on we'll add a bsdnt_printf function which will be variadic and accept</div><div>a format specifier like the C printf function and which will have a</div><div>consistent %w for printing a word. This will also make things easier on </div><div>Windows, where currently the format specifier will be wrong in many places.</div><div>We'll fix this problem in a later update.</div><div><br /></div><div>The code for today's post is here: <a href="http://github.com/wbhart/bsdnt/tree/v0.18">v0.18</a></div><div><br /></div><div>Previous article: <a href="http://wbhart.blogspot.com/2010/10/bsdnt-v017-divhensel.html">v0.17 - div_hensel</a></div><div>Next article: <a href="http://wbhart.blogspot.com/2010/10/bsdnt-v019-asserts.html">v0.19 - asserts</a></div>William Harthttp://www.blogger.com/profile/18416881057216462316noreply@blogger.com0tag:blogger.com,1999:blog-7651115430416156636.post-83648677531726359632010-10-15T17:43:00.000-07:002010-10-16T14:55:35.476-07:00BSDNT - v0.17 div_hensel<div>Now that we have nn_mullow_classical, we can add nn_div_hensel. As explained,</div><div>this will take an integer a of length n and divide by an integer d of length </div><div>m modulo B^n, returning a quotient q and an "overflow".</div><div><br /></div><div>The overflow will be two words which agree with the overflow from mullow(q*d).</div><div><br /></div><div>As per the euclidean division, the dividend a will be destroyed. </div><div><br /></div><div>The algorithm is somewhat simpler than the euclidean algorithm. If d1 is the</div><div>least significant word of d then we use an inverse mod B of d1 (dinv say) and</div><div>multiply it by the current word of the dividend being considered (working from</div><div>right to left) to get a quotient word q1. </div><div><br /></div><div>We then subtract q1*d (appropriately shifted) from the dividend. There is no</div><div>adjustment to do as the inverse mod B is unique (so long as d is odd). </div><div><br /></div><div>Any borrows and overflows from the subtractions are accumulated in the two </div><div>overflow and returned.</div><div><br /></div><div>In our test code, we check a few things. Firstly, for an exact division, we </div><div>want that the quotient is really the exact quotient of a by d. As the </div><div>quotient returned is m words, which may be larger than the actual quotient, </div><div>we check that any additional words of q are zero. We do this by normalising q.</div><div><br /></div><div>The second test we do is for an inexact division. We check that the the </div><div>overflow words turn out to be the same as the overflow from mullow(q*d).</div><div><br /></div><div>Note that if we wish to make Hensel division efficient for doing an exact</div><div>division, say of a 2m - 1 by m division, we merely pass m words of the </div><div>dividend in instead of all 2m - 1 words, so that the quotient is also m words. </div><div><br /></div><div>Once again we don't allow an overflow-in to Hensel division. This wouldn't </div><div>give us any kind of chaining property anyway. </div><div><br /></div><div>Instead, we'd have to do a mulhigh(q, d) and subtract that from the high </div><div>part of the chain before continuing, and the mulhigh will accept our overflow</div><div>words from the low Hensel division.</div><div><br /></div><div>In fact, we add a chaining test that does precisely this. We do a Hensel</div><div>division on the low n words of a chain, subtract a mulhigh from the high</div><div>m words of the chain, then compute the high m words of the quotient using</div><div>a second Hensel division.</div><div><br /></div><div>To make our test code work, we add an ODD flag to randoms_of_len so that only</div><div>odd divisors are used with Hensel division.</div><div><br /></div><div>It isn't clear if it makes sense to allow a carry-in at the top of div_hensel</div><div>or not. On the one hand, it might seem logical to allow a carry-in on account</div><div>of the way mul_classical works. On the other hand, divrem_hensel1 took a </div><div>carry-in, but at the bottom. This was for chaining rather than a read-in of</div><div>an extra word. We choose the latter convention, as it seems to make more</div><div>sense here and stops the code from becoming horribly complex.</div><div><br /></div><div>The code for today's post is here: <a href="http://github.com/wbhart/bsdnt/tree/v0.17">v0.17</a></div><div><br /></div><div>Previous article: <a href="http://wbhart.blogspot.com/2010/10/bsdnt-v016-mullowclassical.html">v0.16 - mullow_classical</a></div><div>Next article: <a href="http://wbhart.blogspot.com/2010/10/bsdnt-v018-printxword-nnprintx.html">v0.18 - printx_word, nn_printx</a></div><div><br /></div>William Harthttp://www.blogger.com/profile/18416881057216462316noreply@blogger.com0tag:blogger.com,1999:blog-7651115430416156636.post-32864944675470002272010-10-14T12:17:00.000-07:002010-10-15T17:47:36.608-07:00BSDNT - v0.16 mullow_classical, mulhigh_classicalThe most logical routine to implement next would be Hensel division. Its<br />main application is in doing exact division.<br /><br />For that reason, we might want to focus on Hensel division without<br />remainder first.<br /><br />This would take an m word integer and divide it by an n word integer,<br />division proceeding from right to left. Essentially it gives us division<br />modulo B^m.<br /><br />However, before we implement this, it makes sense to think about what its<br />"inverse" operation might be.<br /><br />If q is the Hensel quotient of a by d mod B^m, then the inverse operation<br />can be thought of as multiplication of q by d mod B^m.<br /><br />Hensel division gives an m word quotient q, so this would imply that its<br />inverse should be a multiplication of {d, n} by {q, m} mod B^m. We might call<br />this inverse operation mullow, as it returns the low m words of the product<br />d*q.<br /><br />However, we need to be careful with this kind of multiplication. We'd also<br />like to have a mulhigh which returns the high part of the multiplication,<br />and we'd like the sum of mullow and mulhigh to be the same as a full mul.<br /><br />However, there is a problem if mullow merely returns the product mod B^m. Any<br />carries out of mullow will have been lost. Also, all the word by word<br /><br />multiplications that contribute to the high word of the product mod B^m<br />will be thrown away.<br /><br />To rectify the problem we accumulate an "overflow" out of the mullow<br />corresponding to the sum of all these high words and carries. As this<br />overflow is essentially the sum of arbitrary words it may take up two words.<br /><br />Thus, instead of mullow yielding m words it will yield m + 2 words. We'd<br />like to pass the extra two words as an "overflow-in" to mulhigh, thus the<br />logical thing is to return these two words from mullow separately from the<br />product mod B^m itself.<br /><br />Hensel division will also return two overflow words. After all, what it<br />essentially does to the dividend is subtract a mullow of the quotient by the<br />divisor. So, the overflow from Hensel division will be defined as precisely<br />the overflow from mullow(q*d).<br /><br />We manage the overflow by accumulating it in an dword_t. However, as we don't<br />wish the user to have to deal with dword_t's (these are used in our internal<br />implementations only), we split this dword_t into two separate words at the<br />end and return them as an array of two words representing the "overflow".<br /><br />Today we shall only implement mullow and mulhigh. The first of these is a lot<br />like a full multiplication except that the addmul1's become shorter as the<br />algorithm proceeds and the carry-out's have to be accumulated in two words, as<br />explained.<br /><br />At the same time we implement mulhigh. This takes two "overflow-in" words and<br />computes the rest of the product, again in a fashion similar to a full<br />multiplication.<br /><br />Our test code simply stitches a mullow and mulhigh together to see that the<br />chain is the same as a full multiplication.<br /><br />we have to be careful in that if one does an n by n mullow, the mulhigh that<br />we wish to chain with this must start with an n-1 by 1 multiplication,<br />not an n by 1, otherwise the sum of the mullow and mulhigh would contain the<br />cross product of certain terms twice.<br /><br />We also have to be careful in the case where the full multiplication is only<br />a single word by a single word. Here the overflow out of the mullow part is<br />only a single word and there is no mulhigh to speak of. It merely passes<br />the overflow from the mullow straight through.<br /><br />Both mullow and mulhigh must accept all nonzero lengths, as per full<br />multiplication. This causes a few cases to deal with in mulhigh. This doesn't<br />seem particularly efficient or elegant, but there seems to be little we<br />can do about that.<br /><br />An interesting question is what the inverse of mulhigh is. Essentially,<br />this is our euclidean divapprox.<br /><br />There's something slightly unsatisfying here though. Recall that the divapprox<br />algorithm proceeds by subtracting values q1*d' where q1 is the current quotient<br />word and d' is what is currently left of the divisor. We throw away one word of<br />the divisor at each iteration until finally we are left with just two words.<br /><br />It would be wholly more satisfying if we didn't require this extra word of the<br />divisor throughout. We'd then be working right down to a single word in the<br />divisor so that we will really have subtracted a mulhigh by the time the<br />algorithm completes.<br /><br />Any modification I can think of making to the euclidean division to make this<br />seem more natural also makes it much less efficient.<br /><br />Perhaps some further thought will lead to a more satisfying way of thinking about<br />these things, which isn't also less efficient in practice.<div><br /></div><div>The code for today is here: <a href="http://github.com/wbhart/bsdnt/tree/v0.16">v0.16</a><br /><br />Previous article: <a href="http://wbhart.blogspot.com/2010/10/bsdnt-v015-divapproxclassical.html">v0.15 - divapprox_classical</a><div>Next article: <a href="http://wbhart.blogspot.com/2010/10/bsdnt-v017-divhensel.html">v0.17 - div_hensel</a></div></div><div><br /></div>William Harthttp://www.blogger.com/profile/18416881057216462316noreply@blogger.com0tag:blogger.com,1999:blog-7651115430416156636.post-81832672556576875822010-10-13T15:00:00.000-07:002011-10-23T10:12:53.909-07:00BSDNT - v0.15 divapprox_classical<div>During the past few weeks, Brian Gladman has been doing </div><div>some tremendous updates, including some random number generators and</div><div>making bsdnt work on Windows (MSVC). </div><div><br /></div><div>We discuss all matters related to bsdnt on our development list:</div><div><br /></div><div><a href="http://groups.google.co.uk/group/bsdnt-devel?hl=en">http://groups.google.co.uk/group/bsdnt-devel?hl=en</a></div><div><br /></div><div>Anyhow, now for today's update. </div><div><br /></div><div>We'd now like to implement a variant of our divrem_classical algorithm. This<br />time we'd like to just return a quotient, with no remainder. The question is,<br />can this be done in less time than a full divrem?<br /><br />At first sight, the answer seems to be no. As we saw in the post on the divrem<br />algorithm, every word of both the dividend and divisor counts, and we need<br />to keep the dividend completely updated as the algorithm proceeds, otherwise<br />we will get the wrong quotient.<br /><br />So, with the exception of perhaps the final update (which is only needed to<br />determine the remainder), there doesn't seem to be much we can save.<br /><br />But what if we allowed the quotient to be approximate, say within 1 of the actual<br /><br />quotient? In fact, let's demand that if q' is our approximate quotient, that |a -<br /><br />d*q'| < d. In other words, we allow the quotient to be too large by 1, but not<br /><br />too small by 1.<br /><br />Ideally, what we would like is to be doing about half the updating work.<br />Specifically, we'd like to be truncating both the dividend and divisor as we<br />go.<br /><br />Ordinarily we start with something like<br /><br />AAAAAAAAAAAAAAA /<br /> DDDDDDDDD<br /><br />To get the first quotient word we shift the divisor and pad with zeroes, thus<br /><br />AAAAAAAAAAAAAAA /<br />DDDDDDDD000000<br /><br />After one iteration, a word of the dividend has been removed, and we then<br />shift the divisor again.<br /><br />AAAAAAAAAAAAAA /<br />DDDDDDDD00000<br /><br />We continue until we have<br /><br /> AAAAAAAAA /<br /> DDDDDDDD<br /><br />Now there is only one quotient word left. But we notice that we don't use<br />most of the remaining dividend or the divisor to determine this quotient<br />word. In fact, we could almost truncate to<br /><br /> AA /<br /> D<br /><br />What if we had truncated at this same point all along?<br /><br />In fact, if we truncate so that the final division is a two word by one<br />word division (here we have to be careful, in that we are talking about the<br />number of words *after* normalisation), then clearly our quotient could be<br />out by as much as two on that final division, by what we have said in an<br />earlier post. That is of course ignoring any accumulated error along the way.<br /><br />As we don't wish to multiply the entire thing out to see what we have to do<br />to correct it, it is clear that this amount of truncation is too great.<br /><br />So let's truncate one further word to the right in both the dividend and<br />divisor, so that the final division (to get the final quotient word) is a<br />three word by two word division.<br /><br />In fact, in the example above, as there will be five quotient words, there<br />will be five iterations of the algorithm, after which we want two words of<br />the divisor remaining. So, we will start with<br /><br />AAAAAAA.A /<br />DDDDDD.D<br /><br />(The decimal points I have inserted are arbitrary, and only for notational<br />purposes in what follows.)<br /><br />After five iterations, throwing away one more word of the divisor each time,<br />we'll be left with<br /><br /> AA.A /<br /> D.D<br /><br />The first thing to notice is that our previous divrem algorithms, with the<br />adjustments they made as they went, gave the precise quotient given the<br />data they started with.<br /><br />The second thing to notice is that truncating both the dividend and the<br />divisor at the same point, as above, will not yield a quotient that is too<br />small. In fact, the quotient we end up with will be the same as what we would<br />have obtained if we had not truncated the dividend at all, and only truncated<br />the divisor. Additional places in the dividend can't affect the algorithm.<br /><br />Truncating the divisor, on the other hand, may result in a different quotient<br />than we would have obtained without truncation. In fact, as we end up<br />subtracting less at each update than we would if all those words were still<br />there, we may end up with a quotient which is too large. The divisor may also<br />divide more times, because of the truncation, than it would have if it had not<br />been truncated.<br /><br />However, it is not enough to merely consider how the quotient changes with<br />truncation in order to see how far we can be out. We'll likely end up with a<br />very pessimistic estimate if we do this, because we may suppose that the<br />quotient can be one too large at each iteration, which is not true.<br /><br />Instead, the quantity to keep track of is the original dividend minus the<br />product of the full divisor d and the computed quotient q. At the end of the<br />algorithm, this is the actual remainder we'll end up with, and we'd like to<br />keep track of how much our *computed* remainder (what's left of the dividend<br />after the algorithm completes) differs from this actual remainder.<br /><br />Essentially, we accumulate an error in the computed remainder due to the<br />truncation.<br /><br />Clearly, at each iteration, the word after the decimal point in what remains<br />of the dividend may be completely incorrect. And we may miss a borrow out of<br />this place into the place before the decimal point. So after n iterations of<br />the algorithm, the dividend may become too large by n. Of course n.0 is much<br />smaller than our original (normalised) divisor d (also considered as a decimal<br />D.DD...).<br /><br />At the final step of the algorithm, we will have a dividend which is too large<br />by at most this amount, and we'll be using a divisor which is truncated to<br />just two words. However, the latter affects the computed remainder by an<br />amount much less than the original d (if Q is the final quotient word, it is<br />as though we added q*0.0DDDDDD to our divisor, so that the full divisor would<br />go Q times where it otherwise would only go Q-1 times).<br /><br />So these two sources of error only increase the computed value q'*d + r (where<br />q' is the computed quotient and r is the computed remainder) by an amount less<br />than d. Thus, the computed quotient q' can be at most one larger than the<br />actual quotient q.<br /><br />This is equivalent to the required |a - d*q'| < d.<br /><br />So it seems that if we truncate out dividend at the start of the algorithm,<br />and our divisor after each iteration, we can get an approximate quotient q'<br />within the required bounds.<br /><br />We'll leave it to another time to describe the usefulness of an algorithm<br />which computes a quotient which may be out by one. What we will note is that<br />we've done computations with much smaller integers. It therefore costs us<br />significantly less time than a full divrem.<br /><br />In this week's branch we implement this algorithm. In the test code, we check<br />the required quotient is the same as the one returned by divrem, or at most<br />one too large.<br /><br />The trickiest part is ensuring we truncate at the right point. We want to<br />finish on the last iteration with two words *after* normalisation of the<br />divisor.<br /><br />Actually, if we are really smart, we realise that if d does not need<br />to be shifted by much to normalise it, we can get away with finishing with<br />just two *unnormalised* words in the divisor. The error will still be much<br />less than d.<br /><br />To be safe, if the number of quotient words is to be n, I check if the<br />leading word of the unnormalised divisor is more than 2*n. If not, too much<br />normalisation may be required, and I set up the algorithm to finish with<br />three unnormalised words instead of two. Otherwise it is safe to finish<br />with two words in the unnormalised divisor.<br /><br />The algorithm begins by computing the number of words s the divisor needs to<br />be to start. This is two more than the number of iterations required to<br />get all but the final quotient word, since we should have two words in the<br />divisor at this point. If normalisation is going to be a problem, we add one<br />to this so that we compute the final quotient word with three unnormalised<br />words in the divisor.<br /><br />Now the number of words required to start depends only on the size of the<br />quotient, and thus it may be more than the number of words d actually has.<br />Thus we begin with the ordinary divrem algorithm until the number of words<br />required is less than the number of words d actually has.<br /><br />Now we truncate d to the required number of words and the dividend to one<br />more than that. The remaining part of the algorithm proceeds in the same<br />manner, throwing away a divisor word every iteration.</div><div><br /></div><div>Only one thing can go wrong, and that is the following: because we are </div><div>truncating the divisor at each point, we may end up subtracting too little from</div><div>the dividend. In fact, what can happen is that the top word of the dividend </div><div>does not become zero after we subtract q*d' (where d' is the truncated divisor).</div><div><br /></div><div>When this happens, the top word of the dividend may be 1 after the subtraction.</div><div><br /></div><div>We know that the least significant word of the dividend could be completely</div><div>wrong, and the next word may be too large by about as many iterations as </div><div>we've completed so far.</div><div><br /></div><div>Thus, in order to fix our overflow, we subtract from the dividend as much as we </div><div>need to in order for the overflow to disappear. We don't mind our dividend </div><div>being too large, as we adjust for that in the algorithm. But we cannot allow it to</div><div>become too small. Thus we must only subtract from the dividend precisely the</div><div>amount required to make the overflow vanish.</div><div><br /></div><div>We can safely subtract away whatever is in the bottom two words of the</div><div>dividend, as this is not even enough to remove the overflow. And then we can</div><div>subtract 1 from the whole dividend. This must remove the overflow and is</div><div>clearly the least we can get away with subtracting to do so.</div><div><div><br /></div><div>The code for today's post is here: <a href="http://github.com/wbhart/bsdnt/tree/v0.15">v0.15</a><br /><br />Previous article: <a href="http://wbhart.blogspot.com/2010/10/bsdnt-v014-divremclassical.html">v0.14 - divrem_classical</a><div>Next article: <a href="http://wbhart.blogspot.com/2010/10/bsdnt-v016-mullowclassical.html">v0.16 - mullow_classical</a></div></div></div><div><br /></div>William Harthttp://www.blogger.com/profile/18416881057216462316noreply@blogger.com0tag:blogger.com,1999:blog-7651115430416156636.post-51350970822394444432010-10-06T03:49:00.000-07:002010-10-13T15:28:00.401-07:00BSDNT - v0.14 divrem_classical<div>It's time to implement our schoolboy division routine. I prefer the name</div><div>divrem_classical, in line with the multiplication routines.</div><div><br /></div><div>This function will take a dividend a, divisor d, carry-in ci and returns</div><div>a quotient q and remainder r. We'll also need to pass in a precomputed</div><div>inverse to speed up the dword_t by word_t divisions that need to occur.</div><div><br /></div><div>We are going to implement the first algorithm spoken of in the previous</div><div>post, namely the one which uses a single "normalised" word of the divisor.</div><div><br /></div><div>We need to be careful not to just shift the top word of the divisor so </div><div>that it is normalised, but if it has more than one word, shift any high</div><div>bits of the second word as well. We want the top WORD_BITS bits of the</div><div>divisor, starting with the first nonzero bit.</div><div><br /></div><div>We can use our macro for doing a dword_t by word_t division to get each</div><div>new word of the quotient. We start with the carry-in and the most </div><div>significant word of a. The macro will automatically shift these by the</div><div>same amount as we shifted the leading words of the divisor. </div><div><br /></div><div>As per the divrem1 function, we require that the carry-in be such that</div><div>the quotient won't overflow. In other words, we assume that if the</div><div>divisor d is m words, then the top m words of the dividend including the </div><div>carry-in, are reduced mod d.</div><div><br /></div><div>First we need to check that we are not in the special case where truncating</div><div>the dividend and divisor to two and one words respectively would cause</div><div>an overflow of the quotient word to be computed. This only happens </div><div>when the top word of the dividend equals the top word of the divisor, as </div><div>explained in the previous post. </div><div><br /></div><div>If the truncation would cause an overflow in the quotient, we collect a </div><div>quotient word of ~0, as discussed in the previous post. If not, we compute </div><div>the quotient using our macro.</div><div><br /></div><div>After this point, the remainder is computed. We allow the function to </div><div>destroy the input a for this purpose. We leave it up to the caller to make</div><div>a copy of a and pass it to the function, if this is not desired</div><div>behaviour.</div><div><br /></div><div>We do our adjustment so that the quotient is correct. We need a while loop</div><div>for this, as mentioned in the previous article. </div><div><br /></div><div>Finally we write out the quotient word and read in the next word of the</div><div>dividend that remains.</div><div><br /></div><div>In the test code we use our mul_classical and muladd_classical functions to</div><div>check that divrem_classical is indeed the inverse of these functions, with</div><div>zero remainder and nonzero remainder respectively.</div><div><br /></div><div>The code for today's post is here: <a href="http://github.com/wbhart/bsdnt/tree/v0.14">v0.14</a></div><div><br /></div><div>Previous article: <a href="http://wbhart.blogspot.com/2010/10/bsdnt-divrem-discussion.html">divrem discussion</a></div><div>Next article: <a href="http://wbhart.blogspot.com/2010/10/bsdnt-v015-divapproxclassical.html">v0.15 - divapprox_classical</a></div><div><br /></div>William Harthttp://www.blogger.com/profile/18416881057216462316noreply@blogger.com0tag:blogger.com,1999:blog-7651115430416156636.post-22847256906115173012010-10-02T09:45:00.000-07:002010-10-06T03:59:05.588-07:00BSDNT - divrem discussion<div>Today's post is a discussion only, without accompanying code. This is because</div><div>the topic is divrem using the "school boy" algorithm, and this is not as </div><div>straightforward as one might imagine. The discussion below is informal, and</div><div>may contain errors. Please let me know if so and I can make some adjustments</div><div>before releasing the code for the next article, where we implement this.</div><div><br /></div><div>As a consolation, I have released a failed attempt at using C89 macros to make</div><div>even more simplifications to our test code. In particular, I tried to make the</div><div>chain tests more "automatic". Unfortunately, this didn't work out. It ended up</div><div>making the test code too complicated to use and it was going to be way too much</div><div>work to convert all of it over to the new test system. </div><div><br /></div><div>This test code "simplification" was completely abandoned and will not be merged.</div><div>But if you are curious you can see it here: <a href="http://github.com/wbhart/bsdnt/tree/gentest">gentest</a></div><div><br /></div><div>In particular, take a look at generic.c/h and the test code in t-nn.c.</div><div><br /></div><div>Anyhow, back to today's discussion:</div><div><br /></div><div>Firstly, what exactly do we mean by the school boy division algorithm? </div><div><br /></div><div>Certainly I leaned how to do 39 divide 7 at school, and even 339 divide 7. </div><div>But how about 10615 divide 1769? </div><div><br /></div><div>The first step is to divide 1769 into 1061. It doesn't go, so we grab another </div><div>digit. So it becomes 1769 into 10615. I have no idea how many times it goes!</div><div><br /></div><div>Either I missed something at school when I studied "long division", or I only </div><div>think I learned an algorithm suitable for problems like this. </div><div><br /></div><div>In fact, the only way I know to do this division is trial and error, i.e. </div><div>to go through all the likely quotient candidates from one to nine. </div><div><br /></div><div>Using a calculator I find it goes 6 times. </div><div><br /></div><div>I think I intuitively know how to proceed from here. We place the digit 6 </div><div>into our quotient, multiply 1769 by 6 and subtract to form a remainder.</div><div><br /></div><div>But now we have a problem. What if we are dividing 17699949 into 106150000.</div><div>Does it still go 6 times? In fact it does not. But how would I know this</div><div>without being able to divide 17699949 into 106150000 in the first place!</div><div><br /></div><div>So I have two problems. Firstly, if I simply truncate the numerator and</div><div>denominator to a certain number of digits, even the first digit of my </div><div>quotient may be wrong, and secondly, if I use more digits in the first place</div><div>my intiuitive "algorithm" doesn't help. It basically says, guess and refine.</div><div>That's all very well, but when my digits are 64 bit words, I may not be so</div><div>lucky with the guessing thing.</div><div><br /></div><div>Now an interesting thing to note in the example above is that no matter how </div><div>many digits we add to the dividend, 1769 will always go into the first 5 </div><div>digits 10615 of the dividend, 6 times (with some remainder), no more, no </div><div>less. </div><div><br /></div><div>To see this, we note that 6*1769 is less than 10615 and 7*1765 is greater. </div><div>So the only way we could get 1765 to go in 7 times would be to increase </div><div>those first five digits of the dividend. Any following digits are irrelevant </div><div>to that first digit, 6, of the quotient. </div><div><br /></div><div>This is a general feature of integer division, and also applies for the</div><div>word based (base B) integer division problem.</div><div><br /></div><div>However, adding digits to the divisor is not the same, as the example above </div><div>shows. In fact 17699949 only goes 5 times into 106150000.</div><div><br /></div><div>Now the question is, how far off can it be if I truncate the dividend and</div><div>divisor? How bad can it get? </div><div><br /></div><div>We can answer this as follows. Note that 17699949 < 17700000. So 10615xxxx <div>divided by 1769yyyy is greater than or equal to 10615 divided by 1770, which </div><div>happens to be 5. So the answer has to be either 5 or 6 times. </div><div><br /></div><div>What I have done here is simply add 1 to the first four digits of the </div><div>divisor, to get a lower bound on the first digit of the quotient. In this</div><div>way, I can get a small range of possible values for the first digit of the</div><div>quotient. Then I can simply search through the possibilities to find the </div><div>correct quotient digit. This matches my intuition of the long division </div><div>algorithm at least.</div><div><br /></div><div>It seems that *perhaps* a quotient digit obtained in this way will be at </div><div>most out by 1. In other words, roughly speaking, if X is the first few digits </div><div>of the dividend and Y the appropriate number of digits of the divisor (in the</div><div>example above, X = 10615 and Y = 1769), then perhaps X / Y >= X / (Y + 1)</div><div>>= (X / Y) - 1, where by division here, I mean integer division. Let's call </div><div>this condition R.</div><div><br /></div><div>Let's suppose now that we are working on integers written as binary words</div><div>instead of decimal digits. Let's also suppose that X / Y < B. Let's call<div>this condition S.</div><div><br /></div><div>Well, it is easy to generate a counterexample to condition R. Let Y = 2, </div><div>X = B. Clearly R is not satisfied, even though S is. </div><div><br /></div><div>But this seems to only be a problem because Y is so small. So, what if we </div><div>constrain Y to be at least B/2?</div><div><br /></div><div>It turns out that condition R still doesn't hold. A counterexample is</div><div>Y=B/2, X = B*Y - 2.</div><div><br /></div><div>However, I claim that X / (Y + 1) >= (X / Y) - 2 does hold, so long as </div><div>condition S holds. Let us call this inequality T.</div><div> </div><div>Clearly there does not exist a counterexample to T with X <= Y.</div><div>Let us suppose that for a certain Y >= B/2, there is a counterexample to T. </div><div>Clearly if X0 is the first such counterexample, then X0 is a multiple of Y, </div><div>but not of Y + 1. Nor is X0 - 1 a multiple of Y + 1 (else X0 - 2 would have </div><div>been a counterexample).</div><div><br /></div><div>It is also clear that every multiple of Y from X0 on is a counterexample, as</div><div>the left hand side can never "catch up" again. </div><div><br /></div><div>Thus we have the following result: if there exists a counterexample to T for </div><div>some Y >= B/2 and X < BY, then X = (B - 1)*Y is a counterexample. </div><div><br /></div><div> </div><div>Substituting this value of X into T yields that T holds if and only if</div><div>(B - 1)*Y / (Y + 1) >= B - 3 for all Y >= B/2. Thus, if we can show that this</div><div>last inequality holds for all Y >= B/2 then we have proven T.</div><div><br /></div><div>Note that this inequality is equivalent to (B - 1)*Y >= (Y + 1)*(B - 3), i.e.</div><div>2*Y >= B - 3. However, as Y >= B/2, this holds under our hypotheses. </div><div><br /></div><div>So putting all of the above together, suppose we are dividing A by D and that</div><div>Y >= B/2 is the leading word of the divisor D and B*Y > X >= Y is the first </div><div>word or two words of the numerator A (whichever is appropriate). Then if Q0 </div><div>is the leading word of the quotient Q = A / D, then we have shown that</div><div>X / Y >= Q0 >= (X / Y) - 2. </div><div><br /></div><div>In other words, X / Y can be no more than 2 away from the first word of the </div><div>quotient Q = A / D that we are after.</div><div><br /></div><div>This leads us immediately to an algorithm for computing a quotient of two </div><div>multiprecision integers. </div><div><br /></div><div>(i) Let Y be the first WORD_BITS bits of the divisor D, so that </div><div>B > Y >= B/2. </div><div><br /></div><div>(ii) Let X be the first word or two words (appropriately shifted as per Y) of</div><div>A such that B*Y > X >= Y.</div><div><br /></div><div>(iii) Let Q0 = X/Y. </div><div><br /></div><div>(iv) Set R = A - D*Q0 * B^m (for the appropriate m), to get a "remainder". </div><div>While R < 0, subtract 1 from Q0 and add D * B^m to R.<div><br /></div><div>(v) Write down Q0 as the first word of the quotient A / D and continue on by </div><div>replacing A by R and returning to step (i) to compute the next word of the</div><div>quotient, etc.</div><div><br /></div><div>Another algorithm can be derived as follows. Instead of using Y >= B/2 in the</div><div>above algorithm, let's choose Y >= B^2/2. A similar argument to the above</div><div>shows that X / (Y + 1) >= (X / Y) - 1 for Y >= B^2/2 and B*Y > X >= Y. It boils</div><div>down to showing that (B - 1)*Y / (Y + 1) >= B - 2 for Y >= B^2/2, i.e. that</div><div>Y >= B - 2, which is clearly satisfied under our hypotheses.</div><div><br /></div><div>The algorithm is precisely the same, but at step (iv) we can replace the while</div><div>statement with an if statement and perform at most one adjustment to our </div><div>quotient Q0 and to R. </div><div><br /></div><div>Now we return to step (v), in which we said that we could just continue</div><div>on from step (i) to compute the next word of the quotient Q. If we do this</div><div>and set A = R then compute the new X, what we would like is something like </div><div>the divrem1 algorithm we implemented, where, (after possibly some kind of </div><div>initial iteration that we handled specially), it is always true that the new </div><div>X is two words and has its first word reduced mod Y. </div><div><br /></div><div>However, this does not follow from 0 <= R < D*B^m, and it may be that the <div>first word of the remainder is equal to Y! This is due to the truncation of </div><div>R to get Y. </div><div><br /></div><div>It is clear from 0 <= R < D*B^m that if X is the first two words of the <div>remainder that X <= B*Y. So to make the algorithm continue in the way we'd </div><div>like, we only have to deal with the special case where X = B*Y.</div><div><br /></div><div>We know that in the first algorithm above, the next word of the quotient may</div><div>be B - 1 or B - 2, since we know already that it is not B. We must multiply </div><div>out and check which it is.</div><div><br /></div><div>In the second algorithm above, where Y >= B^2/2, the only possibility for </div><div>the next quotient word is B - 1, as we know it is not B. </div><div><br /></div><div>In the next article we will look at implementing the first of the two</div><div>algorithms above, leveraging the code we already have for computing and using </div><div>a precomputed inverse. </div></div></div></div></div></div><div><br /></div><div>Previous article: <a href="http://wbhart.blogspot.com/2010/10/bsdnt-v013-muladd1-muladd.html">v0.13 - muladd1, muladd</a></div><div>Next article: <a href="http://wbhart.blogspot.com/2010/10/bsdnt-v014-divremclassical.html">v0.14 - divrem_classical</a></div><div><br /></div>William Harthttp://www.blogger.com/profile/18416881057216462316noreply@blogger.com0tag:blogger.com,1999:blog-7651115430416156636.post-38895338036269412282010-10-01T05:13:00.000-07:002010-10-02T10:02:52.117-07:00BSDNT - v0.13 muladd1, muladd<div>We noted in the last article that multiplication need not take a carry-in</div><div>and that it doesn't seem related to the linear functions we have been </div><div>implementing.</div><div><br /></div><div>Instead, we think of multiplication in a different way, as an inverse of </div><div>division. We'll soon implement division with remainder, i.e. given a and</div><div>d, find q and r such that a = d*q + r, where 0 <= r < d<br /></div><div>If d and q are of lengths m and n respectively, then r is of length at most</div><div>m and a is either of length m + n or m + n - 1.</div><div><br /></div><div>Multiplication corresponds to the case where r = 0. </div><div><br /></div><div>As we will certainly not wish to have to zero pad our division out to </div><div>m + n limbs if a is in fact only m + n - 1 limbs, it makes sense that our </div><div>division will take a carry-in. For this reason, it made sense for our</div><div>multiplication to yield a carry-out, i.e. it will write m + n - 1 limbs</div><div>and return an m + n - th limb, which may or may not be 0.</div><div><br /></div><div>When r is not zero in our division, the inverse operation would take a </div><div>value r of n limbs and add d*q to it, returning the result as a value a of </div><div>m + n - 1 limbs (and a carry which may or may not be zero). We call this </div><div>routine muladd. In the case where r and a happen to be aliased, the result </div><div>will be written out, overwriting a.</div><div><br /></div><div>We first implement nn_muladd1 which is identical to addmul1 except that </div><div>muladd1 writes its result out to a location not necessarily coincident with</div><div>any of the inputs. In other words, nn_addmul1 is an in-place operation </div><div>whereas nn_muladd1 is not.</div><div><br /></div><div>Next we implement nn_muladd_classical. It takes an argument a of length m</div><div>to which it adds the product of b of length m and c of length n. The result</div><div>may or may not alias with a.</div><div><br /></div><div>We also implement a version of the linear and quadratic muladds which writes </div><div>out the carry, naming the functions with the nn_s_ prefix, as per the </div><div>convention initiated in our nn_linear module.</div><div><br /></div><div>As for multiplication, we don't allow zero lengths. This dramatically </div><div>simplifies the code.</div><div><br /></div><div>An important application of the muladd function will be in chaining full </div><div>multiplication. We'll discuss this when we get to dealing with multiplication</div><div>routines with better than quadratic complexity.</div><div><br /></div><div>Our test code simply checks that muladd is the same as a multiplication and</div><div>and an addition.</div><div><br /></div><div>The github repo for this post is here: <a href="http://github.com/wbhart/bsdnt/tree/v0.13">v0.13</a></div><div><br /></div><div>Previous article: <a href="http://wbhart.blogspot.com/2010/09/bsdnt-configure-and-assembly.html">configure and assembly</a></div><div>Next article: <a href="http://wbhart.blogspot.com/2010/10/bsdnt-divrem-discussion.html">divrem discussion</a></div><div><br /></div>William Harthttp://www.blogger.com/profile/18416881057216462316noreply@blogger.com0tag:blogger.com,1999:blog-7651115430416156636.post-70330890403517261852010-09-25T12:59:00.000-07:002010-10-01T05:43:49.545-07:00BSDNT - configure and assembly improvements<div>It's time we added architecture dependent assembly support to bsdnt.</div><div><br /></div><div>Here is how we are going to do it. For each of the implementation files</div><div>we have (nn_linear.c, nn_quadratic.c), we are going to add a _arch file,</div><div>e.g. nn_linear_arch.h, nn_quadratic_arch.h, etc. </div><div><br /></div><div>This file will be #included in the relevant implementation file, e.g.</div><div>nn_linear_arch.h will be #included in nn_linear.c.</div><div><br /></div><div>These _arch.h files will be generated by a configure script, based on </div><div>the CPU architecture and operating system kernel. They will merely</div><div>include a list of architecture specific .h files in an arch directory.</div><div><br /></div><div>For example we might have nn_linear_x86_64_core2.h in the arch directory, </div><div>which provides routines specific to core2 processors running in 64 bit </div><div>mode.</div><div><br /></div><div>In these architecture specific files, we'll have inline assembly routines</div><div>designed to replace various routines in the pure C implementation files</div><div>that we have already written. They'll do this by defining flags, e.g.</div><div>HAVE_ARCH_nn_mul1_c, which will specify that an architecture specific</div><div>version of nn_mul1_c is available. We'll then wrap the implementation of</div><div>nn_mul1_c in nn_linear.c with a test for this flag. If the flag is defined,</div><div>the C version will not be compiled.</div><div><br /></div><div>In order to make this work, the configure script has to work out whether</div><div>the machine is 32 or 64 bit and what the CPU type is. It will then link </div><div>in the correct architecture specific files.</div><div><br /></div><div>At the present moment, we are only interested in x86 machines running on</div><div>*nix (or Windows, but the architecture will be determined in a different</div><div>way on Windows).</div><div><br /></div><div>A standard way of determining whether the kernel is 64 bit or not is to</div><div>search for the string x86_64 in the output of uname -m. If something else</div><div>pops out then it is probably a 32 bit machine.</div><div><br /></div><div>Once we know whether we have a 32 or 64 bit machine, we can determine the</div><div>exact processor by using the cpuid instruction. This is an assembly </div><div>instruction supported by x86 cpus which tells you the manufacturer, family</div><div>and model of the CPU. </div><div><br /></div><div>We include a small C program cpuid.c with some inline assembly to call the </div><div>cpuid instruction. As this program will only ever be run on *nix, we can</div><div>make use of gcc's inline assembly feature.</div><div><br /></div><div>When the parameter to the cpuid instruction is 0 we get the vendor ID,</div><div>which is a 12 character string. We are only interested in "AuthenticAMD"</div><div>and "GenuineIntel" at this point.</div><div><br /></div><div>When we pass the parameter 1 to the cpuid instruction, we get the processor</div><div>model and family. </div><div><br /></div><div>For Intel processors, table 2-3 in the following document gives information</div><div>about what the processor is:</div><div><br /></div><div>http://www.intel.com/Assets/PDF/appnote/241618.pdf</div><div><br /></div><div>However the information is out of date. Simply googling for Intel Family 6</div><div>Model XX reveals other models that are not listed in the Intel documentation.</div><div><br /></div><div>The information for AMD processors is a little harder to come by. However,</div><div>one can essentially extract the information from the revision guides, though</div><div>it isn't spelled out clearly:</div><div><br /></div><div>http://developer.amd.com/documentation/guides/Pages/default.aspx#revision_Guides</div><div><br /></div><div>It seems AMD only list recent processors here, and they are all 64 bit. </div><div>Information on 32 bit processors can be found here:</div><div><br /></div><div>http://www.sandpile.org/ia32/cpuid.htm</div><div><br /></div><div>At this point we'd like to identify numerous different architectures. We </div><div>aren't interested in 32 bit architectures, such as the now ancient Pentium </div><div>4 or AMD's K7. Instead, we are interested when 32 bit operating system </div><div>kernels are running on 64 bit machines. Thus all 32 bit CPUs simply identify</div><div>as x86. </div><div><br /></div><div>There are numerous 64 bit processors we are interested in: </div><div><br /></div><div>64 bit Pentium 4 CPUs were released until August 2008. We identify them as p4. </div><div>All the 64 bit ones support SSE2 and SSE3 and are based on the netburst </div><div>technology. </div><div><br /></div><div>The Intel Core Solo and Core Duo processors were 32 bit and do not interest us.</div><div>They were an enhanced version of the p6 architecture. They get branded as x86</div><div>for which only generic 32 bit assembly code will be available, if any.</div><div><br /></div><div>Core 2's are very common. They will identify as core2. They all support SSE2,</div><div>SSE3 and SSSE3 (the Penryn and following 45nm processors support SSE4.1 - we</div><div>don't distinguish these at this stage).</div><div><br /></div><div>Atoms are a low voltage processor from Intel which support SSE2, SSE3 and are</div><div>mostly 64 bit. We identify them as atom.</div><div><br /></div><div>More recently Intel has released Core i3, i5, i7 processors, based on the</div><div>Nehalem architecture. We identify these as nehalem. They support SSE2, SSE3,</div><div>SSSE3, SSE4.1 and SSE4.2.</div><div><br /></div><div>AMD K8's are still available today. They support SSE2 and SSE3. We identify</div><div>them as k8.</div><div><br /></div><div>AMD K10's are a more recent core from AMD. They support SSE2, SSE3 and SSE4a. </div><div>We identify these as k10. There are three streams of AMD K10 processors, </div><div>Phenom, Phenom-II and Athlon-II. We don't distinguish these at this point.</div><div><br /></div><div>So in summary, our configure script first identifies whether we have a 32 or</div><div>64 bit *nix kernel. Then the CPU is identified as either x86, p4, core2, </div><div>nehalem, k8 or k10, where x86 simply means that it is some kind of 32 bit CPU.</div><div><br /></div><div>Our configure script then links in architecture specific files as appropriate. </div><div><br /></div><div>The only assembly code we include so far are the nn_add_mc functions we wrote</div><div>for 64 bit core2 and k10. As these are better than nothing on other 64 bit</div><div>processors from the same manufacturers, we include this code in the k8</div><div>specific files until we write versions for each processor. We also add an</div><div>nn_sub_mc assembly file for Intel and AMD 64 bit processors.</div><div><br /></div><div>The configure script includes the architecture specific .h files starting from</div><div>the most recent processors so that code for earlier processors is not used</div><div>when something more recent is available.</div><div><br /></div><div>The github branch for this revision is here: <a href="http://github.com/wbhart/bsdnt/tree/asm2">asm2</a></div><div><br /></div><div>Previous article: <a href="http://wbhart.blogspot.com/2010/09/bsdnt-v012-mulclassical.html">v0.12 mul_classical</a></div><div>Next article: <a href="http://wbhart.blogspot.com/2010/10/bsdnt-v013-muladd1-muladd.html">v0.13 - muladd1, muladd</a></div><div><br /></div>William Harthttp://www.blogger.com/profile/18416881057216462316noreply@blogger.com1tag:blogger.com,1999:blog-7651115430416156636.post-75864410994637412872010-09-24T12:25:00.000-07:002010-09-25T13:04:15.755-07:00BSDNT - v0.12 mul_classical<div>I've been on holidays for a couple of days, hence the break in transmission.</div><div>(For academics: the definition of a holiday is a day where you actually</div><div>stop working and do nothing work related for that whole day. I realise the</div><div>definition is not widely known, nor is it well-defined.)</div><div><br /></div><div>Note, due to accidentally including some files in today's release that I</div><div>intended to hold over till next time, you have to do ./configure before</div><div>make check. This detects your CPU and links in any assembly code</div><div>that is relevant for it. More on this when I do the actual blog post about it.</div><div><br /></div><div>It's time to start implementing quadratic algorithms (i.e. those that</div><div>take time O(n^2) to run, such as classical multiplication and division).</div><div><br /></div><div>Before we do this, we are going to reorganise slightly. The file which</div><div>is currently called nn.c we will rename nn_linear.c to indicate that it</div><div>contains our linear functions. We'll also rename t-nn.c to t-nn_linear.c.</div><div><br /></div><div>The macros in that file will be moved into test.h.</div><div><br /></div><div>We also modify the makefile to build all .c files in the current directory</div><div>and to run all tests when we do make check. A new directory "test" will</div><div>hold our test files.</div><div><br /></div><div>Finally, we add an nn_quadratic.c and test/t-nn_quadratic.c to hold the</div><div>new quadratic routines and test code that we are about to write.</div><div><br /></div><div>The first routine we want is nn_mul_classical. This is simply a mul1</div><div>followed by addmul1's. Of course this will be horrendously slow, but once</div><div>again we defer speeding it up until we start adding assembly routines,</div><div>which will start happening shortly.</div><div><br /></div><div>In a departure from our linear functions, we do not allow zero lengths. </div><div>This dramatically simplifies the code and means we do not have to check </div><div>for special cases. </div><div><br /></div><div>There does not appear to be any reason to allow a carry-in to our </div><div>multiplication routine. An argument can be made for it on the basis of</div><div>consistency with mul1. However, the main use for carry-in's and carry-out's</div><div>thus far has been for chaining. </div><div><br /></div><div>For example, we chained mul1's s*c and t*c for s and t of length m and </div><div>c a single word in order to compute (s*B^m + t)*c.</div><div><br /></div><div>But the analogue in the case of full multiplication would seem to be </div><div>chaining to compute (s*B^m + t)*c where s, t and c all have length m. But</div><div>it probably doesn't make sense to chain full multiplications in this way</div><div>as it would involve splitting the full product t*c say, into two separate </div><div>parts of length m, which amongst other things, would be inefficient.</div><div><br /></div><div>It might actually make more sense to pass a whole array of carries to our</div><div>multiplication routine, one for every word of the multiplier. However it is </div><div>not clear what use this would be. So, for now at least, we pass no carry-ins</div><div>to our multiplication routine.</div><div><br /></div><div>We settle on allowing a single word of carry out. This may be zero if the </div><div>leading words of the multiplicands are small enough. Rather than write this </div><div>extra word out, we simply return it as a carry so that the caller can decide </div><div>whether to write it.</div><div><br /></div><div>The classical multiplication routine is quite straightforward, but that plus the</div><div>rearrangement we've done is more than enough for one day.</div><div><br /></div><div>The github branch for this release is here: <a href="http://github.com/wbhart/bsdnt/tree/v0.12">v0.12</a></div><div><br /></div><div>Previous article: <a href="http://wbhart.blogspot.com/2010/09/bsdnt-v011-generic-test-code.html">v0.11 - generic test code</a></div><div>Next article: <a href="http://wbhart.blogspot.com/2010/09/bsdnt-configure-and-assembly.html">configure and assembly</a></div><div><br /></div>William Harthttp://www.blogger.com/profile/18416881057216462316noreply@blogger.com0