<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-7651115430416156636</id><updated>2011-10-23T10:21:57.230-07:00</updated><category term='sage bsdnt flint bignum'/><title type='text'>Reading, Writing and Arithmetic</title><subtitle type='html'></subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://wbhart.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7651115430416156636/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://wbhart.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>William Hart</name><uri>http://www.blogger.com/profile/18416881057216462316</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>31</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-7651115430416156636.post-1857391013309194814</id><published>2011-10-23T08:23:00.000-07:00</published><updated>2011-10-23T10:21:57.262-07:00</updated><title type='text'>BSDNT - interlude</title><content type='html'>You will notice that I have not worked on BSDNT for about a year now. Well, I'm thinking of restarting the project soon. I did complete two new revisions v0.25 and v0.26 since I stopped blogging. The first of these added random functions for a single word. 
They generate single words which have an increased probability of triggering corner cases, e.g. by having a sparse binary representation. The second of these updates is a bsdnt_printf function. This is like printf but adds a %w format specifier for printing single words. There is also a %m for printing a len_t and %b for printing a bits_t. &lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I likely won't get much more done on BSDNT until early next year, but here is what I am planning:&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;1) I am tremendously grateful to Dr. Brian Gladman for his work on an MSVC version of the library. However, keeping up with this side of things turned out to be more of a struggle than I expected. Microsoft's MSVC doesn't support inline assembler in 64 bit x86. This means the entire plan of the MSVC version has to be different. It seems like far too much effort to combine both sets of code into a single library. I've therefore decided (sorry Brian) to ditch the MSVC code from my copy of BSDNT. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;2) I wasn't happy with the interface of the division code. There are a few issues to consider. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The first issue is chaining. Obviously carry-in occurs at the left rather than the right. But for general division functions, should the carry-in be a single limb or multiple limbs? It seems like the remainder after division is going to be m limbs and so the carry-in should be also. It is not clear what is better here. Internally, the algorithms deal with just a single carry-in limb because they use 2x1 divisions to compute the quotient digits. 
Perhaps chaining just means that we consider the first digit of the remainder to be carry-in for the next division.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Another problem associated with this is that when reading the carry-in from the array, if the carry-in happens to be zero then the array entry may not exist in memory. This means the code has to always check if the carry-in should be zero or not before proceeding.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The other issue to consider is shifting for normalisation. One assumes that the precomputed inverse is computed from a normalised value (the first couple of limbs of the divisor). Now, it is not necessary to shift either dividend or divisor ahead of time. One can still perform the subtractions that occur in division, on unshifted values. One does need to shift limbs of the dividend in turn however, as the algorithm proceeds, in order to compute the limbs of the quotient. But this shifting can occur in registers and need not be written out anywhere. This is implemented, but currently every limb gets shifted twice. This can be cut down to a single shift. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;3) The PRNGs are currently quite hard to read. They have numerous macros to access their context objects. They are extremely flexible, but possibly overengineered. I'd like to simplify their implementations somewhat.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;4) The configure script is a little overengineered. The idea of supporting lots of compilers is nice. But in reality GCC should exist almost everywhere. The original concept of BSDNT was to use inline assembly for architecture support. This gets around issues with global symbol prefixes and wotnot. It also makes the library really simple to read. Even on Windows 64 there is MinGW64 and this is the only setup I aim to target in that direction. 
&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I hope to deal with all of these issues before proceeding with development of BSDNT. Give me some time as I am busy until about the end of the year. However, I do plan to continue development of BSDNT after sorting out these issues, because I think that fundamentally what we have is very solid.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Previous article: &lt;a href="http://wbhart.blogspot.com/2010/11/bsdnt-v024-nnbitsetcleartest-and.html"&gt;v0.24 = nn_bitset/clear/test and nn_test_random&lt;/a&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7651115430416156636-1857391013309194814?l=wbhart.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://wbhart.blogspot.com/feeds/1857391013309194814/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://wbhart.blogspot.com/2011/10/bsdnt-interlude.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7651115430416156636/posts/default/1857391013309194814'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7651115430416156636/posts/default/1857391013309194814'/><link rel='alternate' type='text/html' href='http://wbhart.blogspot.com/2011/10/bsdnt-interlude.html' title='BSDNT - interlude'/><author><name>William Hart</name><uri>http://www.blogger.com/profile/18416881057216462316</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7651115430416156636.post-8048637254738659972</id><published>2010-11-20T15:01:00.000-08:00</published><updated>2011-10-23T10:17:29.143-07:00</updated><title 
type='text'>BSDNT - v0.24 nn_bitset/clear/test and nn_test_random</title><content type='html'>&lt;div&gt;In today's update we make a long overdue change to bsdnt, again to improve our testing&lt;/div&gt;&lt;div&gt;of the library.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We are going to add a function for generating random bignums with sparse binary &lt;/div&gt;&lt;div&gt;representation. We'll also add some other random functions based on this primitive.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Using test integers with sparse binary representation in our test code will push our&lt;/div&gt;&lt;div&gt;functions harder because it will test lots of corner cases such as words that are all&lt;/div&gt;&lt;div&gt;zero, in the middle of routines, and so on. As it is currently, we'd be extremely&lt;/div&gt;&lt;div&gt;lucky for the random word generator we've been using to generate an all zero word, or&lt;/div&gt;&lt;div&gt;a word with all bits set to one for that matter.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The first step to generating such test randoms is to write a function for setting a &lt;/div&gt;&lt;div&gt;given bit in an integer. This will be an nn_linear function despite it not actually&lt;/div&gt;&lt;div&gt;taking linear time. In fact, it will take essentially constant time. However, it is an&lt;/div&gt;&lt;div&gt;nn type function, so it belongs in an nn module.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The routine is very straightforward. 
Given a bit index b, starting from 0 for the least&lt;/div&gt;&lt;div&gt;significant bit of a bignum, it will simply use a logical OR to set bit b of the bignum.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Rather than construct a bignum 2^b and OR that with our number, we simply determine&lt;/div&gt;&lt;div&gt;which word of the bignum needs altering and create an OR-mask for that word.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Computing which word to adjust is trivial, but depends on the number of bits in a word.&lt;/div&gt;&lt;div&gt;On a 64 bit machine we shift b to the right by 6 bits (as 2^6 = 64), and on a 32 bit&lt;/div&gt;&lt;div&gt;machine we shift b to the right by 5 bits (2^5 = 32). This has the effect of dividing&lt;/div&gt;&lt;div&gt;b by 64 or 32 respectively (discarding the remainder). This gives us the index of the&lt;/div&gt;&lt;div&gt;word we need to adjust. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Now we need to determine which bit of the word needs setting. This is given by the &lt;/div&gt;&lt;div&gt;remainder after dividing b by 64 or 32 respectively, and this can be determined by&lt;/div&gt;&lt;div&gt;logical AND'ing b with 2^6-1 or 2^5-1 respectively. This yields a value c between 0 and&lt;/div&gt;&lt;div&gt;63 (or 31) inclusive, which is a bit index. To turn that into our OR-mask, we simply &lt;/div&gt;&lt;div&gt;compute 2^c (by shifting 1 to the left by c bits).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Now that we have our OR-mask and the index of the word to OR it with, we can update the&lt;/div&gt;&lt;div&gt;required bit. 
We call this function nn_bit_set.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;While we are at it we create two other functions, nn_bit_clear and nn_bit_test.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;It's now straightforward to write test functions which randomly set, clear and test&lt;/div&gt;&lt;div&gt;bits in a random bignum.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Next we create a random bignum generator which sets random bits of a bignum. In order&lt;/div&gt;&lt;div&gt;to do this, we simply choose a random number of bits to set, from 0 to the number of words&lt;/div&gt;&lt;div&gt;in the bignum, then we set that many bits at random. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We call this function nn_test_random1.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We now add a second random bignum generator. It works by creating two bignums using the &lt;/div&gt;&lt;div&gt;function nn_test_random1 and subtracting one from the other. This results in test randoms &lt;/div&gt;&lt;div&gt;with long strings of 1's and 0's in its representation. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We call this function nn_test_random2.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Finally, we create a function nn_test_random which randomly switches between these two &lt;/div&gt;&lt;div&gt;algorithms and our original nn_random algorithm to generate random bignums. We switch all&lt;/div&gt;&lt;div&gt;our test code to use nn_test_random by changing the function randoms_of_len to use it.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;At this point we can have greater confidence that our functions are all working as they&lt;/div&gt;&lt;div&gt;are supposed to be, as our test code has been suitably fortified at last! (Well, they are&lt;/div&gt;&lt;div&gt;working now, after I spent a day hunting down the bugs that these new randoms found - no,&lt;/div&gt;&lt;div&gt;I am not kidding. 
That's how good at finding bugs this trick is!)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Today's code is found here: &lt;a href="https://github.com/wbhart/bsdnt/tree/v0.24"&gt;v0.24&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Previous article: &lt;a href="http://wbhart.blogspot.com/2010/11/bsdnt-v023-sha1-and-prng-tests.html"&gt;v0.23 - sha1 and PRNG tests&lt;/a&gt;&lt;/div&gt;&lt;div&gt;Next article: &lt;a href="http://wbhart.blogspot.com/2011/10/bsdnt-interlude.html"&gt;BSDNT - interlude&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7651115430416156636-8048637254738659972?l=wbhart.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://wbhart.blogspot.com/feeds/8048637254738659972/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://wbhart.blogspot.com/2010/11/bsdnt-v024-nnbitsetcleartest-and.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7651115430416156636/posts/default/8048637254738659972'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7651115430416156636/posts/default/8048637254738659972'/><link rel='alternate' type='text/html' href='http://wbhart.blogspot.com/2010/11/bsdnt-v024-nnbitsetcleartest-and.html' title='BSDNT - v0.24 nn_bitset/clear/test and nn_test_random'/><author><name>William Hart</name><uri>http://www.blogger.com/profile/18416881057216462316</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7651115430416156636.post-4400954058692563656</id><published>2010-11-12T07:30:00.000-08:00</published><updated>2010-11-20T15:07:09.220-08:00</updated><title type='text'>BSDNT - v0.23 sha1 and PRNG tests</title><content type='html'>&lt;div&gt;In a recent update we added three PRNGs (pseudo random number &lt;/div&gt;&lt;div&gt;generators). What we are going to do today is add test code for &lt;/div&gt;&lt;div&gt;these.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;But how do you test a pseudo random generator? It's producing &lt;/div&gt;&lt;div&gt;basically random values after all. So what does it matter if the &lt;/div&gt;&lt;div&gt;output is screwed up!?&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Well, it does matter, as shown by the problems on 32 bit machines &lt;/div&gt;&lt;div&gt;which I wrote about in the PRNG blog post. It would also matter if &lt;/div&gt;&lt;div&gt;the PRNGs were broken on some platform such that they always output &lt;/div&gt;&lt;div&gt;0 every time!&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;There's a few ways of testing PRNGs. One is simply to run them for a &lt;/div&gt;&lt;div&gt;given number of iterations, write down the last value it produces and &lt;/div&gt;&lt;div&gt;check that it always does this.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The method we are going to use is slightly more sophisticated. We are &lt;/div&gt;&lt;div&gt;going to hash a long series of outputs from the PRNGs, using a hash &lt;/div&gt;&lt;div&gt;function, and check that the hash of the output is always the same. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Basically, our test code will take a long string of words from the &lt;/div&gt;&lt;div&gt;PRNGs, write them to an array of bytes, then compute the sha1 hash of&lt;/div&gt;&lt;div&gt;that array of bytes. 
It'll then compare the computed hash with a hash&lt;/div&gt;&lt;div&gt;we've computed previously to ensure it has the same value as always. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Moreover, we'll set it up so that the hash is the same regardless of &lt;/div&gt;&lt;div&gt;whether the machine is big or little endian. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The hash function we are going to use is called sha1. Specifically, &lt;/div&gt;&lt;div&gt;we'll be using an implementation of the same written by Brian Gladman &lt;/div&gt;&lt;div&gt;(he also supplied the new test code for the PRNGs for today's update).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The first step is to identify whether the machine is big or little &lt;/div&gt;&lt;div&gt;endian. This refers to the order of bytes within a word in physical &lt;/div&gt;&lt;div&gt;memory. On little endian machines (such as x86 machines) the least &lt;/div&gt;&lt;div&gt;significant byte of a word comes first. On big endian machines the &lt;/div&gt;&lt;div&gt;order is the other way around. Thus the number 0x0A0B0C0D would have &lt;/div&gt;&lt;div&gt;the byte 0x0D stored first on a little endian machine, but 0x0A stored &lt;/div&gt;&lt;div&gt;first on a big endian machine.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Endianness can be identified by architecture, or it can be identified&lt;/div&gt;&lt;div&gt;with a short program. We choose to use the latter method as it should &lt;/div&gt;&lt;div&gt;be hard to fool. At configure time a short C program will run that will &lt;/div&gt;&lt;div&gt;place bytes into a four byte array, then read that array as a single&lt;/div&gt;&lt;div&gt;32 bit number. We then compare the value to a 32 bit value that would&lt;/div&gt;&lt;div&gt;be stored in the given way on a little endian machine. If it compares&lt;/div&gt;&lt;div&gt;equal, then the machine must be little endian. 
If not we compare with&lt;/div&gt;&lt;div&gt;a number that would be stored in the given way on a big endian machine.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;If the machine doesn't match either order, then it must be a very rare&lt;/div&gt;&lt;div&gt;machine with mixed endianness, which we don't support in bsdnt.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The configure script will write some defines out to config.h which &lt;/div&gt;&lt;div&gt;then allow bsdnt modules to identify whether the machine is little or &lt;/div&gt;&lt;div&gt;big endian at compile time, i.e. at zero runtime cost.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Now to discuss the sha1 hashing scheme. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;A hashing scheme simply takes a piece of data and computes from it a&lt;/div&gt;&lt;div&gt;series of bits which serve to "identify" that piece of data. If &lt;/div&gt;&lt;div&gt;someone else has access to the same hashing algorithm and a piece of&lt;/div&gt;&lt;div&gt;data which purports to be an exact copy of the original, then they &lt;/div&gt;&lt;div&gt;can verify its identity by computing its hash and comparing.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Of course a hash is only valuable in this sense if it is much shorter&lt;/div&gt;&lt;div&gt;than the piece of data itself (otherwise you'd just compare the &lt;/div&gt;&lt;div&gt;actual data itself). &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;A very simple hashing scheme might simply add all the words in the &lt;/div&gt;&lt;div&gt;input to compute a hash consisting of a single word. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;A secure hashing scheme has at least two other properties. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;i) It shouldn't be possible to determine the original data from its &lt;/div&gt;&lt;div&gt;hash. 
(The data may be secret and one may wish to provide for the&lt;/div&gt;&lt;div&gt;independent verification of its authenticity by having the recipient&lt;/div&gt;&lt;div&gt;compare the hash of the secret data with a publicly published value.&lt;/div&gt;&lt;div&gt;Or, as is sometimes the case, the hash of secret data, such as a&lt;/div&gt;&lt;div&gt;password, might be transmitted publicly, to compare it with a &lt;/div&gt;&lt;div&gt;pre-recorded hash of the data.)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;ii) It must be computationally infeasible to substitute fake data&lt;/div&gt;&lt;div&gt;for the original such that the computed hash of the fake data is the &lt;/div&gt;&lt;div&gt;same as that of the original data.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Of course, if the hash is short compared to the data being hashed, &lt;/div&gt;&lt;div&gt;then by definition many other pieces of data will have the same hash.&lt;/div&gt;&lt;div&gt;The only requirement is that it should be computationally infeasible&lt;/div&gt;&lt;div&gt;to find or construct such a piece of data.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The first step in the SHA1 algorithm is some bookkeeping. The &lt;/div&gt;&lt;div&gt;algorithm, as originally described, works with an input message which&lt;/div&gt;&lt;div&gt;is a multiple of 16 words in length. 
Moreover, the last 64 bits are &lt;/div&gt;&lt;div&gt;reserved for a value which gives the length of the original message in &lt;/div&gt;&lt;div&gt;bits.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In order to facilitate this, the original message is padded to a &lt;/div&gt;&lt;div&gt;multiple of 16 words in length, with at least enough padding to allow&lt;/div&gt;&lt;div&gt;the final 64 bits to be part of the padding, and to not overlap part &lt;/div&gt;&lt;div&gt;of the message.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The padding is done by first appending a single binary 1 bit, then&lt;/div&gt;&lt;div&gt;binary zeroes are appended until the message is the right length.&lt;/div&gt;&lt;div&gt;Then of course the length in bits of the original message is placed&lt;/div&gt;&lt;div&gt;in the final 64 bits of the message.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The hashing algorithm itself performs a whole load of prescribed&lt;/div&gt;&lt;div&gt;twists and massages of the padded message.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;For this purpose some functions and constants are used. 
&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Given 32 bit words B, C and D there are four functions:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;1) f0(B, C, D) = (B AND C) OR ((NOT B) AND D)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;2) f1(B, C, D) = B XOR C XOR D&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;3) f2(B, C, D) = (B AND C) OR (B AND D) OR (C AND D)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;4) f3(B, C, D) = B XOR C XOR D &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;and four corresponding 32 bit constants:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;1) C0 = 0x5A827999&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;2) C1 = 0x6ED9EBA1&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;3) C2 = 0x8F1BBCDC&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;4) C3 = 0xCA62C1D6&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;To begin the algorithm, we break the padded message up into 16 word &lt;/div&gt;&lt;div&gt;blocks M1, M2, M3, i.e. each Mi is 16 words of the padded message. 
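&lt;br /&gt;&lt;br /&gt;The padding and block splitting described so far can be sketched in a few lines. This is only an illustration, written in Python rather than C for brevity. It is not the code in sha1.c, the names sha1_pad and split_blocks are made up for this example, and it assumes the message is a whole number of bytes.&lt;pre&gt;
```python
def sha1_pad(message):
    """Pad a byte string: a single 1 bit, zeroes, then the length in bits."""
    bit_length = 8 * len(message)
    padded = message + b"\x80"                     # a 1 bit followed by seven 0 bits
    padded += b"\x00" * ((56 - len(padded)) % 64)  # zeroes up to 8 bytes short of a full block
    padded += bit_length.to_bytes(8, "big")        # original length in the final 64 bits
    return padded


def split_blocks(padded):
    """Split the padded message into blocks Mi of 16 words (64 bytes) each."""
    return [padded[i:i + 64] for i in range(0, len(padded), 64)]


blocks = split_blocks(sha1_pad(b"abc"))
print(len(blocks), blocks[0].hex())
```
&lt;/pre&gt;For the three byte message "abc" this gives a single 64 byte block: the message, the byte 0x80, zeroes, and the length 24 in the last eight bytes.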
&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Each block is processed via a set of steps, and an "accumulated" hash &lt;/div&gt;&lt;div&gt;of 160 bits, consisting of five 32 bit words (the final hash we are &lt;/div&gt;&lt;div&gt;after) is computed: H0, H1, H2, H3, H4.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The algorithm starts by initialising the five "hash words" to the &lt;/div&gt;&lt;div&gt;following values:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;H0 = 0x67452301, H1 = 0xEFCDAB89, H2 = 0x98BADCFE, H3 = 0x10325476 &lt;/div&gt;&lt;div&gt;and H4 = 0xC3D2E1F0.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Each block of 16 words, Mi, of the padded message is then used in &lt;/div&gt;&lt;div&gt;turn to successively transform these five words, according to the&lt;/div&gt;&lt;div&gt;following steps:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;a) Break Mi into 16 words W0, W1, ..., W15.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;b) For j = 16 to 79, let Wj be the word given by&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Wj = S^1(W{j-3} XOR W{j-8} XOR W{j-14} XOR W{j-16}),&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;where S^n(X) means to rotate the word X to the left through n bits &lt;/div&gt;&lt;div&gt;(a negative n means right rotation).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;c) Make a copy of the hashing words:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;A = H0, B = H1, C = H2, D = H3, E = H4&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;d) For j = 0 to 79 repeat the following set of transformations in &lt;/div&gt;&lt;div&gt;the order given:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;TEMP = S^5(A) + f{j/20}(B, C, D) + E + Wj + C{j/20},&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;E = D, D = C, C = S^30(B), B = A, A = TEMP,&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;where 
j/20 signifies "floor division" by 20, where f and C are &lt;/div&gt;&lt;div&gt;the above-defined functions and constants, and where all additions,&lt;/div&gt;&lt;div&gt;here and in step e, are taken modulo 2^32.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;e) Update the hashing words according to the following:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;H0 = H0 + A, H1 = H1 + B, H2 = H2 + C, H3 = H3 + D, H4 = H4 + E.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Note that steps a-e are repeated for each block of 16 words, Mi, in &lt;/div&gt;&lt;div&gt;the padded message, further manipulating the five words with each run. &lt;/div&gt;&lt;div&gt;The resulting five words H0, H1, H2, H3, H4, after all the words of the&lt;/div&gt;&lt;div&gt;padded message have been consumed, constitute the sha1 hash of the &lt;/div&gt;&lt;div&gt;original message.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The function to compute the sha1 hash of a message is given in the&lt;/div&gt;&lt;div&gt;files sha1.c and sha1.h in the top level directory of the source&lt;/div&gt;&lt;div&gt;tree. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;A new test file t-rand.c is added in the test directory. It contains&lt;/div&gt;&lt;div&gt;the sha1 hash of a large number of words as output by our three&lt;/div&gt;&lt;div&gt;PRNGs. If a user of bsdnt has the same hash for the PRNGs when run&lt;/div&gt;&lt;div&gt;on their machine, then they can have a very high level of confidence&lt;/div&gt;&lt;div&gt;that they are working as expected.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Note that the sha1 algorithm is known as a secure hashing algorithm,&lt;/div&gt;&lt;div&gt;which means that in theory it can be used to hash very important&lt;/div&gt;&lt;div&gt;data so that the recipient can independently confirm the data hasn't&lt;/div&gt;&lt;div&gt;been tampered with (by computing the hash of the data and making&lt;/div&gt;&lt;div&gt;sure it matches some published value). 
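&lt;br /&gt;&lt;br /&gt;For readers who want to experiment, steps a-e can be sketched as follows. This is a hedged illustration in Python rather than C. It is not the implementation in sha1.c, the helper names are made up for this example, and it assumes a message that is a whole number of bytes. The names AND, OR and XOR are taken from Python's operator module so that the code reads like the formulas, rotation is done with plain arithmetic, and the rotation in step b is a left rotation by one bit applied to the whole XOR, as in the published SHA-1 specification. The result is compared against Python's own hashlib.&lt;pre&gt;
```python
import hashlib
from operator import and_ as AND, or_ as OR, xor as XOR

MASK = 0xFFFFFFFF  # all words are 32 bits wide


def NOT(x):
    """Bitwise complement of a 32 bit word."""
    return MASK - x


def S(n, x):
    """Rotate the 32 bit word x to the left through n bits."""
    return (x * 2**n) % 2**32 + x // 2**(32 - n)


# The four functions f0..f3 and constants C0..C3 from the article.
F = [
    lambda b, c, d: OR(AND(b, c), AND(NOT(b), d)),
    lambda b, c, d: XOR(XOR(b, c), d),
    lambda b, c, d: OR(OR(AND(b, c), AND(b, d)), AND(c, d)),
    lambda b, c, d: XOR(XOR(b, c), d),
]
C = [0x5A827999, 0x6ED9EBA1, 0x8F1BBCDC, 0xCA62C1D6]


def sha1(message):
    # Padding: a 1 bit, zeroes, then the original length in bits in the final 64 bits.
    padded = message + b"\x80"
    padded += b"\x00" * ((56 - len(padded)) % 64)
    padded += (8 * len(message)).to_bytes(8, "big")

    h = [0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0]
    for i in range(0, len(padded), 64):
        block = padded[i:i + 64]
        # a) break the block into 16 big endian 32 bit words
        w = [int.from_bytes(block[4 * j:4 * j + 4], "big") for j in range(16)]
        # b) extend to 80 words
        for j in range(16, 80):
            w.append(S(1, XOR(XOR(w[j - 3], w[j - 8]), XOR(w[j - 14], w[j - 16]))))
        # c) copy the hashing words
        a, b, c, d, e = h
        # d) 80 rounds, additions modulo 2**32
        for j in range(80):
            temp = (S(5, a) + F[j // 20](b, c, d) + e + w[j] + C[j // 20]) % 2**32
            e, d, c, b, a = d, c, S(30, b), a, temp
        # e) fold the result back into the hashing words
        h = [(x + y) % 2**32 for x, y in zip(h, (a, b, c, d, e))]
    return b"".join(x.to_bytes(4, "big") for x in h)


print(sha1(b"abc").hex())
print(sha1(b"abc") == hashlib.sha1(b"abc").digest())
```
&lt;/pre&gt;Checking against hashlib is the same idea as the PRNG test itself: a known good hash is compared with a freshly computed one.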
&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We won't try to explain why sha1 works. The mysterious constants&lt;/div&gt;&lt;div&gt;are not so mysterious. C0 is the integer part of 2^30 times the square root&lt;/div&gt;&lt;div&gt;of 2, written in hexadecimal. C1, C2 and C3 are the same thing for the&lt;/div&gt;&lt;div&gt;square roots of 3, 5 and 10 respectively.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I don't know the meaning of the functions f0-f3. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;What is worth noting is that in recent years, people have figured out &lt;/div&gt;&lt;div&gt;how to produce sha1 hash collisions (two messages with the same hash). &lt;/div&gt;&lt;div&gt;I don't pretend to be an expert in such things.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;All we care about here is that a broken PRNG really can't pretend to&lt;/div&gt;&lt;div&gt;be working, and for that, sha1 works a treat.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Disclaimer: use the information in this post at your OWN RISK!! We &lt;/div&gt;&lt;div&gt;make no representations as to its correctness. The same goes for &lt;/div&gt;&lt;div&gt;bsdnt itself. Read the license agreement for details.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Warning: cryptography is restricted by law in many countries including&lt;/div&gt;&lt;div&gt;many of those where the citizens believe it couldn't possibly be so. 
&lt;/div&gt;&lt;div&gt;Please check your local laws before making assumptions about what you &lt;/div&gt;&lt;div&gt;may do with crypto.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The code for today's article is here: &lt;a href="https://github.com/wbhart/bsdnt/tree/v0.23"&gt;v0.23&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Previous article: &lt;a href="http://wbhart.blogspot.com/2010/11/bsdnt-v022-windows-support.html"&gt;v0.22 - Windows support&lt;/a&gt;&lt;/div&gt;&lt;div&gt;Next article: &lt;a href="http://wbhart.blogspot.com/2010/11/bsdnt-v024-nnbitsetcleartest-and.html"&gt;v0.24 - nn_bitset/clear/test and nn_test_random&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7651115430416156636-4400954058692563656?l=wbhart.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://wbhart.blogspot.com/feeds/4400954058692563656/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://wbhart.blogspot.com/2010/11/bsdnt-v023-sha1-and-prng-tests.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7651115430416156636/posts/default/4400954058692563656'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7651115430416156636/posts/default/4400954058692563656'/><link rel='alternate' type='text/html' href='http://wbhart.blogspot.com/2010/11/bsdnt-v023-sha1-and-prng-tests.html' title='BSDNT - v0.23 sha1 and PRNG tests'/><author><name>William Hart</name><uri>http://www.blogger.com/profile/18416881057216462316</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7651115430416156636.post-6190934703511269160</id><published>2010-11-11T05:58:00.000-08:00</published><updated>2010-11-12T07:38:14.002-08:00</updated><title type='text'>BSDNT - v0.22 Windows support</title><content type='html'>&lt;div&gt;Today's update is a massive one, and comes courtesy of Brian Gladman. At last we add &lt;/div&gt;&lt;div&gt;support for MSVC 2010 on Windows. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In order to support different architectures we add architecture specific files in the arch &lt;/div&gt;&lt;div&gt;directory. There are three different ways that architectures might be supported:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;* Inline assembly code&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;* Standalone assembly code (using an external assembler, e.g. nasm)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;* Architecture specific C code&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Windows requires both of the last two of these. On Windows 64 bit, MSVC does not support &lt;/div&gt;&lt;div&gt;inline assembly code, and thus it is necessary to supply standalone assembly code for this&lt;/div&gt;&lt;div&gt;architecture. This new assembly code now lives in the arch/asm directory.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;On both Windows 32 and 64 bit there is also a need to override some of the C code from the base&lt;/div&gt;&lt;div&gt;bsdnt library with Windows specific code. 
This lives in the arch directory.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Finally, the inline assembly used by the base bsdnt library on most *nix platforms is now in the&lt;/div&gt;&lt;div&gt;arch/inline directory.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In each case, the os/abi combination is specified in the filenames of the relevant files. For&lt;/div&gt;&lt;div&gt;example on Windows 32, the files overriding code in nn_linear.c/h are in arch/nn_linear_win32.c/h.&lt;/div&gt;&lt;div&gt;(Note win32 and x64 are standard Windows names for 32 and 64 bit x86 architectures, respectively.)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;If the code contains architecture specific code (e.g. assembly code) then the name of the file&lt;/div&gt;&lt;div&gt;contains an architecture specifier too, e.g. arch/inline/nn_linear_x86_64_k8.h for code specific&lt;/div&gt;&lt;div&gt;to the AMD k8 and above.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;It's incomprehensible that Microsoft doesn't support inline assembly in their 64 bit compiler&lt;/div&gt;&lt;div&gt;making standalone assembly code necessary. It would be possible to use the Intel C compiler on &lt;/div&gt;&lt;div&gt;Windows 64, as this does support inline assembly. But this is very expensive for our volunteer &lt;/div&gt;&lt;div&gt;developers, so we are not currently supporting this. Thus, on Windows 64, the standalone &lt;/div&gt;&lt;div&gt;assembly is provided in the arch/asm directory as just mentioned.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Brian has also provided MSVC build solution files for Windows. 
These are in the top level source&lt;/div&gt;&lt;div&gt;directory as one might expect.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;There are lots of differences on Windows that require functions in our standard nn_linear.c, &lt;/div&gt;&lt;div&gt;nn_quadratic.c and helper.c files to be overridden.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The first difference is that on 64 bit Windows, the 64 bit type is a long long, not a long. This&lt;/div&gt;&lt;div&gt;is handled by #including a types_arch.h file in helper.h. On most platforms this file is empty.&lt;/div&gt;&lt;div&gt;However, on Windows it links to an architecture specific types.h file which contains the&lt;/div&gt;&lt;div&gt;requisite type definitions. So a word_t is a long long on Windows. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Also, when dealing with integer constants, we'd use constants like 123456789L when the word type&lt;/div&gt;&lt;div&gt;is a long, but it has to become 123456789LL when it is a long long, as on Windows 64. To get &lt;/div&gt;&lt;div&gt;around this, an architecture specific version of the macro WORD(x) can be defined. Thus, when&lt;/div&gt;&lt;div&gt;using a constant in the code, one merely writes WORD(123456789) and the macro adds the correct&lt;/div&gt;&lt;div&gt;ending to the number depending on what a word_t actually is. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Some other things are different on Windows too. The intrinsic for counting leading zeroes is &lt;/div&gt;&lt;div&gt;different to that used by gcc on other platforms. The same goes for the function for counting&lt;/div&gt;&lt;div&gt;trailing zeroes. We've made these into macros and given them the names high_zero_bits and&lt;/div&gt;&lt;div&gt;low_zero_bits respectively. 
The default definitions are overridden on Windows in the architecture&lt;/div&gt;&lt;div&gt;specific versions of helper.h in the arch directory.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Finally, on Windows 64, there is no suitable native type for a dword_t. The maximum sized&lt;/div&gt;&lt;div&gt;native type is 64 bits. Much of the nn_linear, and some of the nn_quadratic C code needs to &lt;/div&gt;&lt;div&gt;be overridden to get around this on Windows. We'll only be using dword_t in basecase algorithms&lt;/div&gt;&lt;div&gt;in bsdnt, so this won't propagate throughout the entire library. But it is necessary to &lt;/div&gt;&lt;div&gt;override functions which use dword_t on Windows.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Actually, if C++ is used, one can define a class called dword_t and much of the code can&lt;/div&gt;&lt;div&gt;remain unchanged. Brian has a C++ branch of bsdnt which does this. But for now we have C code&lt;/div&gt;&lt;div&gt;only on Windows (otherwise handling of name mangling in interfacing C++ and assembly code &lt;/div&gt;&lt;div&gt;becomes complex to deal with). &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Brian has worked around this problem by defining special mul_64_by_64 and div_128_by_64 &lt;/div&gt;&lt;div&gt;functions on 64 bit Windows. These are again defined in the architecture specific version of&lt;/div&gt;&lt;div&gt;helper.h for Windows 64.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Obviously some of the precomputed inverse macros need to be overridden to accommodate these&lt;/div&gt;&lt;div&gt;changes, and so these too have architecture specific versions in the Windows 64 specific version &lt;/div&gt;&lt;div&gt;of the helper.h file.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In today's release we also have a brand new configure file for *nix. This is modified to handle&lt;/div&gt;&lt;div&gt;all the changes we've made to make Windows support easy. 
But Antony Vennard has also done &lt;/div&gt;&lt;div&gt;some really extensive work on this in preparation for handling standalone assembly on arches &lt;/div&gt;&lt;div&gt;that won't handle our inline assembly (and for people who prefer to write standalone assembly &lt;/div&gt;&lt;div&gt;instead of inline assembly).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The new configure file has an interactive mode which searches for available C compilers (e.g.&lt;/div&gt;&lt;div&gt;gcc, clang, icc, nvcc) and assemblers (nasm, yasm) and allows the user to specify which to use.&lt;/div&gt;&lt;div&gt;This interactive feature is off by default and is only a skeleton at present (it doesn't actually&lt;/div&gt;&lt;div&gt;do anything). It will be the subject of a blog post later on when the configure file is finished.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The code for today is found at: &lt;a href="https://github.com/wbhart/bsdnt/tree/v0.22"&gt;v0.22&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Previous article: &lt;a href="http://wbhart.blogspot.com/2010/10/bsdnt-v021-prngs.html"&gt;v0.21 - PRNGs&lt;/a&gt;&lt;/div&gt;&lt;div&gt;Next article: &lt;a href="http://wbhart.blogspot.com/2010/11/bsdnt-v023-sha1-and-prng-tests.html"&gt;v0.23 - sha1 and prng tests&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7651115430416156636-6190934703511269160?l=wbhart.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://wbhart.blogspot.com/feeds/6190934703511269160/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://wbhart.blogspot.com/2010/11/bsdnt-v022-windows-support.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' 
href='http://www.blogger.com/feeds/7651115430416156636/posts/default/6190934703511269160'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7651115430416156636/posts/default/6190934703511269160'/><link rel='alternate' type='text/html' href='http://wbhart.blogspot.com/2010/11/bsdnt-v022-windows-support.html' title='BSDNT - v0.22 Windows support'/><author><name>William Hart</name><uri>http://www.blogger.com/profile/18416881057216462316</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7651115430416156636.post-3283856199180578388</id><published>2010-10-31T16:02:00.000-07:00</published><updated>2010-11-11T06:02:34.141-08:00</updated><title type='text'>BSDNT - v0.21 PRNGs</title><content type='html'>In this update we are going to replace the el cheapo random generator in&lt;br /&gt;bsdnt with something of higher quality.&lt;br /&gt;&lt;br /&gt;Some time ago, Brian Gladman brought to my attention the fact that on 32&lt;br /&gt;bit machines, the test code for bsdnt actually caused Windows to hang!&lt;br /&gt;&lt;br /&gt;The issue required some sleuthing work on Brian's part to track down.&lt;br /&gt;He eventually discovered the cause of the problem, and it was, oddly&lt;br /&gt;enough, the pseudo-random number generator (PRNG) I had used.&lt;br /&gt;&lt;br /&gt;Brian suspected the PRNG immediately because of his past experience as a&lt;br /&gt;professional cryptographer. In fact, it turns out that PRNG's of the&lt;br /&gt;kind that I had used, aren't particularly good even if they don't have&lt;br /&gt;bugs!&lt;br /&gt;&lt;br /&gt;The particular kind of PRNG I had used is called a linear congruential&lt;br /&gt;PRNG. They start with the PRNG initialised to some random seed value,&lt;br /&gt;n = seed. 
Then each time they are called, they apply the transformation&lt;br /&gt;n = (n*c1 + c2) % p for some explicit constants c1, c2 and a large&lt;br /&gt;enough "prime" p.&lt;br /&gt;&lt;br /&gt;One can take c2 = 0 in the transformation and it is also common to see&lt;br /&gt;p = 2^b for some b (e.g. b = WORD_BITS, and yes, I know p = 2^b is not&lt;br /&gt;usually prime). When p = 2^b it is usually the case that the top half of&lt;br /&gt;the bits output have reasonable random properties, but the bottom half&lt;br /&gt;do not. In this case it is acceptable to stitch two LC PRNG's together to&lt;br /&gt;give the full number of bits.&lt;br /&gt;&lt;br /&gt;When p is an actual prime, these PRNG's are called prime modulus linear&lt;br /&gt;congruential PRNG's and they aren't too bad when implemented properly.&lt;br /&gt;They still fail to meet some standards of random quality.&lt;br /&gt;&lt;br /&gt;To return a whole word of random bits, one either needs to use a prime&lt;br /&gt;p that is larger than a word, which is usually impractical, or again&lt;br /&gt;stitch the output of two prime modulus LC PRNG's together.&lt;br /&gt;&lt;br /&gt;However, one needs to be careful, as the period of the generator is at most p-1,&lt;br /&gt;so if one is on a 32 bit machine, it doesn't do to use a prime p just&lt;br /&gt;over half the size of a word (the first mistake I made), otherwise the&lt;br /&gt;period is just over 65536. That isn't too good for generating random&lt;br /&gt;values for test code!&lt;br /&gt;&lt;br /&gt;But how was my LC PRNG causing Windows to crash!? The reason was that&lt;br /&gt;in some of the test functions we required bignums that didn't overflow&lt;br /&gt;when summed together. This of course depends almost entirely on the top&lt;br /&gt;few bits of the bignums being added together.&lt;br /&gt;&lt;br /&gt;The problem was that in the expression n = (n*c1 + c2) % p, I was using&lt;br /&gt;values of c1 and c2 which were not reduced mod p. 
It turns out that this&lt;br /&gt;is crucial to correct operation. It might seem that the result ends up&lt;br /&gt;being reduced mod p anyway, and indeed it would be if n*c1 fit in a word.&lt;br /&gt;However, because it doesn't, you actually get ((n*c1 + c2) % 2^32) % p&lt;br /&gt;which causes the binary representation of the value generated to be quite&lt;br /&gt;regular.&lt;br /&gt;&lt;br /&gt;Anyhow, on Windows (and probably on other 32 bit machines) the test code&lt;br /&gt;generates length 90 bignums over and over at some point, looking in vain&lt;br /&gt;to find pairs of such bignums which when added do not overflow. As these&lt;br /&gt;are only garbage collected at the end of the test function, memory starts&lt;br /&gt;filling up with the orphaned bignums that are discarded by the test code&lt;br /&gt;as it looks for appropriate values. This eventually overwhelms the heap&lt;br /&gt;allocator on Windows and causes the entire OS to crash!&lt;br /&gt;&lt;br /&gt;The problem of writing decent PRNG's has been studied extensively, and one&lt;br /&gt;expert in the subject is George Marsaglia. He famously turned up on a&lt;br /&gt;usenet forum in January of 1999 and dumped not one, but piles of fast, high&lt;br /&gt;quality PRNG's which do not suffer from the problems that other PRNG's do.&lt;br /&gt;&lt;br /&gt;Amazingly, many of the PRNG's in common usage today are either written by&lt;br /&gt;George, or based on ones he wrote. So he's some kind of legend!&lt;br /&gt;&lt;br /&gt;Anyhow, we will make use of two of his PRNG's, Keep It Simple Stupid (KISS)&lt;br /&gt;and Super KISS (SKISS) and a third PRNG called Mersenne Twister, due to&lt;br /&gt;Makoto Matsumoto and Takuji Nishimura in 1997.&lt;br /&gt;&lt;br /&gt;George's generators are in turn based on some simpler PRNG's. He begins by&lt;br /&gt;defining a linear congruential generator, with c1 = 69069 and c2 = 1234567.&lt;br /&gt;This is taken mod p = 2^32 (yes, it's not prime). 
This has good properties&lt;br /&gt;on its top 16 bits, but not on its bottom 16 bits, and for this reason had&lt;br /&gt;been widely used before George came along. This generator has period 2^32.&lt;br /&gt;&lt;br /&gt;Next he defines a pair of multiply with carry (MWC) generators. These are&lt;br /&gt;of the form n = c1*lo(n) + hi(n) where lo(n) is the low 16 bits of n, hi(n)&lt;br /&gt;is the high 16 bits and c1 is an appropriately chosen constant.&lt;br /&gt;&lt;br /&gt;He stitches together a pair of these MWC PRNG's mod 2^16 to give 32 random&lt;br /&gt;bits. For simplicity we'll refer to this combined random generator as MWC.&lt;br /&gt;This has a period of about 2^60.&lt;br /&gt;&lt;br /&gt;Thirdly he defines a (3-)shift-register generator (SHR3). This views the&lt;br /&gt;value n as a binary vector of 32 bits and applies linear transformations&lt;br /&gt;generated from 32 x 32 matrices L and R = L^T according to&lt;br /&gt;n = n(I + L^17)(I + R^13)(I + L^5) where I is the 32 x 32 identity matrix.&lt;br /&gt;&lt;br /&gt;In order to speed things up, special transformations are chosen that can&lt;br /&gt;be efficiently implemented in terms of XOR and shifts. This is called an&lt;br /&gt;Xorshift PRNG. We'll just refer to it as SHR3.&lt;br /&gt;&lt;br /&gt;Now, given appropriate seed values for each of these PRNG's, Marsaglia's&lt;br /&gt;KISS PRNG is defined as (MWC ^ CONG) + SHR3, where CONG is the linear&lt;br /&gt;congruential generator described above. This generator passes a whole&lt;br /&gt;slew of tests and has a period of about 2^123. In this update we make it the&lt;br /&gt;default random generator for bsdnt.&lt;br /&gt;&lt;br /&gt;Super KISS is a random generator defined by George later, in 2009. It gives&lt;br /&gt;immense periods by adding together the output of three PRNG's, one with a&lt;br /&gt;massive order. 
It is basically defined by SKISS = SUPR + CONG + SHR3.&lt;br /&gt;&lt;br /&gt;Here, the new generator SUPR is based on a prime p = a*b^r + 1 such that&lt;br /&gt;the order of b mod p has magnitude quite near to p - 1.&lt;br /&gt;&lt;br /&gt;It starts with a seeded vector z of length r, all of whose entries are&lt;br /&gt;less than b, and an additional value c which is less than a.&lt;br /&gt;&lt;br /&gt;One then updates the pair (z, c) by shifting the vector z to the left by&lt;br /&gt;one place and setting the right-most entry to (b - 1) - (t mod b),&lt;br /&gt;where t = a*z1 + c and z1 is the entry shifted out at the left of z. Then&lt;br /&gt;c is set to the integer part of t/b.&lt;br /&gt;&lt;br /&gt;Naturally in practice one uses b = 2^32 so that all the intermediate&lt;br /&gt;reductions mod b are trivial.&lt;br /&gt;&lt;br /&gt;As with most generators which have massive periods the "state" held by this&lt;br /&gt;generator is large. It requires data mod p for a multiprecision p.&lt;br /&gt;&lt;br /&gt;Note the similarity with the MWC generator except for the "complement" mod&lt;br /&gt;b that occurs. This is called a CMWC (Complemented-Multiply-With-Carry)&lt;br /&gt;generator.&lt;br /&gt;&lt;br /&gt;George proposed using the prime p = 640*b^41265+1, where the order of b mod p is&lt;br /&gt;5*2^1320481. The period of the CMWC generator is then greater than&lt;br /&gt;2^1300000.&lt;br /&gt;&lt;br /&gt;Of course, at each iteration of the algorithm, 41265 random words are&lt;br /&gt;generated in the vector. Once these are exhausted, the next iteration of&lt;br /&gt;the algorithm is made.&lt;br /&gt;&lt;br /&gt;The algorithm SUPR in the definition of SKISS is thus just a simple&lt;br /&gt;array lookup to return one of the words of the vector z. 
Each time SKISS&lt;br /&gt;is run, the index into the array is increased until all words of the array&lt;br /&gt;are exhausted, at which point the CMWC algorithm is iterated to refill the&lt;br /&gt;array.&lt;br /&gt;&lt;br /&gt;We now come to describing the Mersenne twister.&lt;br /&gt;&lt;br /&gt;It is based on the concept of a feedback shift register (FSR). An FSR shifts&lt;br /&gt;its value left by 1 bit, feeding at the right some linear combination of the&lt;br /&gt;bits in its original value. The Mersenne twister is conceptually a&lt;br /&gt;generalisation of this.&lt;br /&gt;&lt;br /&gt;The difference with the Mersenne twister is that the "feedback" is effected&lt;br /&gt;by a certain "twist". The twist is effected by applying a "linear transformation"&lt;br /&gt;A of a certain specific form, with multiplication by A having addition&lt;br /&gt;replaced by xor in the matrix multiplication. The twist can be described&lt;br /&gt;more straightforwardly, and we give the more straightforward description&lt;br /&gt;below.&lt;br /&gt;&lt;br /&gt;One sets up a Mersenne twister by picking a word size of w bits (w = 32 here),&lt;br /&gt;a recurrence degree n, a "middle&lt;br /&gt;word" 1 &lt;= m &lt;= n and a number of bits for a bitmask, 0 &lt;= r &lt;= 32. One &lt;div&gt;picks these values so that p = 2^(n*w - r) - 1 is a Mersenne prime (hence &lt;/div&gt;&lt;div&gt;the name of this PRNG).  Given a vector of bits a = [a0, a1, ..., a{w-1}] of length &lt;/div&gt;&lt;div&gt;w and a sequence x of words of w bits, the Mersenne twister is defined by a &lt;/div&gt;&lt;div&gt;recurrence relation x[k+n] = x[k+m] ^ ((upper(x[k]) | lower(x[k+1])) A) &lt;/div&gt;&lt;div&gt;where upper and lower return the upper w - r and lower r bits of their &lt;/div&gt;&lt;div&gt;operands, and where A is the "twist" spoken of and defined below, in terms of &lt;/div&gt;&lt;div&gt;a. Of course ^ here is the xor operator, not exponentiation.  
For a vector X of w &lt;/div&gt;&lt;div&gt;bits, XA is given by X&gt;&gt;1 if X[0] == 0, otherwise it is given by (X&gt;&gt;1) ^ a. Here X[0] is the least significant bit of X.&lt;br /&gt;&lt;br /&gt;Some theory is required to find an A such that the Mersenne twister will have&lt;br /&gt;maximum theoretical period 2^(n*w - r) - 1.&lt;br /&gt;&lt;br /&gt;To finish off, the Mersenne twister is usually "tempered". This tempering&lt;br /&gt;simply mangles the bits in a well understood way to iron out some of the&lt;br /&gt;known wrinkles in the MT algorithm.&lt;br /&gt;&lt;br /&gt;Only a couple of sets of parameters are in common use for Mersenne twisters.&lt;br /&gt;These are referred to as MT19937 for 32 bit words and MT19937-64 for 64 bit&lt;br /&gt;words.&lt;br /&gt;&lt;br /&gt;As with all PRNG's, there is a whole industry around "cracking" these things.&lt;br /&gt;This involves starting with a short sequence of values from a PRNG and&lt;br /&gt;attempting to find the starting constants and seed values.&lt;br /&gt;&lt;br /&gt;Obviously, in cryptographic applications, there is not much point generating&lt;br /&gt;"secure" keys with a PRNG with a single source of entropy. Even if your key&lt;br /&gt;is generated by multiplying primes of many words in length, if those words&lt;br /&gt;were generated from a PRNG seeded from the current time, it may only take&lt;br /&gt;a few iterations and a good guess as to which PRNG you used, to determine&lt;br /&gt;the constants used in the PRNG and thus your entire key. And that's&lt;br /&gt;irrespective of which constants you chose in your PRNG!&lt;br /&gt;&lt;br /&gt;So if you are doing crypto, you need to take additional precautions to&lt;br /&gt;generate secure keys. Just seeding a PRNG from the time probably won't cut&lt;br /&gt;it!&lt;br /&gt;&lt;br /&gt;Some PRNG's are more "secure" than others, meaning that knowing a&lt;br /&gt;few output values in a row doesn't give terribly much information about&lt;br /&gt;which values may follow. 
But if you rely on a PRNG to be secure, you&lt;br /&gt;are essentially betting that because you don't know how to get the&lt;br /&gt;next few values and nor does anyone else that has written about the&lt;br /&gt;subject, then no one at all knows. Of course one needs to ask oneself&lt;br /&gt;if they would tell you if they did.&lt;br /&gt;&lt;br /&gt;Another assumption one should never make is that no one has the computing&lt;br /&gt;power to brute force your PRNG.&lt;br /&gt;&lt;br /&gt;Some PRNG's are designed for cryptographic applications, and maybe one can&lt;br /&gt;believe that these are "safe" to use, for some definition of safe.&lt;br /&gt;&lt;br /&gt;Anyhow, we only care about random testing at this point. In today's update&lt;br /&gt;32 and 64 bit KISS, SKISS and MT PRNG's are added in the directory rand.&lt;br /&gt;Our randword, randinit, and randclear functions are all replaced with&lt;br /&gt;appropriate calls to KISS functions.&lt;br /&gt;&lt;br /&gt;There is also an option to change the default PRNG used by bsdnt. Is it my&lt;br /&gt;imagination or does the test code now run faster, even on a 64 bit machine!&lt;br /&gt;&lt;br /&gt;At some point we will add some tests of the new PRNG's. These will compare&lt;br /&gt;the outputs with known or published values to check that they are working as&lt;br /&gt;designed for a large number of iterations.&lt;br /&gt;&lt;br /&gt;Brian Gladman contributed to this article and also did most of the work&lt;br /&gt;in implementing Marsaglia's PRNG's in bsdnt. The Mersenne twisters were&lt;br /&gt;originally written by Takuji Nishimura and Makoto Matsumoto and made available&lt;br /&gt;under a BSD license. 
Brian did most of the work in adapting these for bsdnt.&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The code for today's update is here: &lt;a href="http://github.com/wbhart/bsdnt/tree/v0.21"&gt;v0.21&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Previous article: &lt;a href="http://wbhart.blogspot.com/2010/10/bsdnt-v020-redzones.html"&gt;v0.20 - redzones&lt;/a&gt;&lt;/div&gt;&lt;div&gt;Next article: &lt;a href="http://wbhart.blogspot.com/2010/11/bsdnt-v022-windows-support.html"&gt;v0.22 - Windows support&lt;/a&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7651115430416156636-3283856199180578388?l=wbhart.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://wbhart.blogspot.com/feeds/3283856199180578388/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://wbhart.blogspot.com/2010/10/bsdnt-v021-prngs.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7651115430416156636/posts/default/3283856199180578388'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7651115430416156636/posts/default/3283856199180578388'/><link rel='alternate' type='text/html' href='http://wbhart.blogspot.com/2010/10/bsdnt-v021-prngs.html' title='BSDNT - v0.21 PRNGs'/><author><name>William Hart</name><uri>http://www.blogger.com/profile/18416881057216462316</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7651115430416156636.post-6921763159398388648</id><published>2010-10-30T12:12:00.000-07:00</published><updated>2010-10-31T16:11:18.134-07:00</updated><title type='text'>BSDNT - v0.20 redzones</title><content type='html'>&lt;div&gt;In this update we implement another improvement to the test code in bsdnt. I don't know&lt;/div&gt;&lt;div&gt;what the correct name is, but I call them redzones.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The basic idea is this: suppose you have a function nn_blah say, and it writes to an nn_t&lt;/div&gt;&lt;div&gt;b say. If it writes well beyond the allocated space for b, then almost certainly a &lt;/div&gt;&lt;div&gt;segfault will occur. But what if it only writes a word or two before the beginning or&lt;/div&gt;&lt;div&gt;after the end of the allocated space? Very likely this will cause a segfault only on&lt;/div&gt;&lt;div&gt;some systems, depending on the granularity of the heap allocator and depending&lt;/div&gt;&lt;div&gt;on what other bsdnt data might be in the overwritten space!&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;So what if we could detect this kind of error? Well, that is what redzones hope to do. &lt;/div&gt;&lt;div&gt;Essentially if an nn_t b is allocated with m words of space, when redzones are turned on&lt;/div&gt;&lt;div&gt;it allocates m + 2C words of space for some small constant C. It then fills the first&lt;/div&gt;&lt;div&gt;and last C words of b with known words of data (usually some recognisable pattern of bits).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;When the garbage collector cleans up, it examines the redzones to ensure that they have&lt;/div&gt;&lt;div&gt;not been altered. 
If they have, it raises an error.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The nn_t b is set to point just after the first C words, which contain the redzone, and in &lt;/div&gt;&lt;div&gt;every other respect acts like a normal nn_t. The user needn't know that an extra C words&lt;/div&gt;&lt;div&gt;of data were allocated immediately before and after the length m nn_t they requested.&lt;/div&gt;&lt;div&gt;Nor do they need to be aware of the checking that goes on when the nn_t is finally cleaned&lt;/div&gt;&lt;div&gt;up, that the redzones haven't been touched.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Of course it's nice to be able to turn redzones off sometimes, when testing the library. &lt;/div&gt;&lt;div&gt;Therefore I've added a configure option -noredzones which turns off redzones if they are &lt;/div&gt;&lt;div&gt;not required. This works by setting a #define WANT_REDZONES 0 in config.h. The &lt;/div&gt;&lt;div&gt;memory allocator for nn_t's and the garbage collector both operate differently if redzones &lt;/div&gt;&lt;div&gt;are turned on.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;At present, the only way to allocate memory for nn_t's in test code is to use &lt;/div&gt;&lt;div&gt;randoms_of_len, so it is convenient to rewrite this to call a function alloc_redzoned_nn &lt;/div&gt;&lt;div&gt;instead of malloc, and for the garbage collector to call free_redzoned_nn. These new &lt;/div&gt;&lt;div&gt;functions are defined in test.c. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The only difference when WANT_REDZONES is set in config.h is that REDZONE_WORDS, which is defined in test.h, is changed from 0 to 4 words (meaning 4 redzone words are to be&lt;/div&gt;&lt;div&gt;allocated at each end of a redzoned nn_t). Having redzones of length 0 is the same as not &lt;/div&gt;&lt;div&gt;having them at all. 
So this makes the functions easy to write.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Also in test.h REDZONE_BYTE is defined to be the hexadecimal byte 0xA, which has binary bit&lt;/div&gt;&lt;div&gt;pattern 1010, i.e. alternating ones and zeroes. This is the value that is placed into the &lt;/div&gt;&lt;div&gt;redzones byte-by-byte before the nn_t is used. At the end, when they are cleaned up, the&lt;/div&gt;&lt;div&gt;garbage collector examines the redzones to ensure they are still filled with these bytes.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Fortunately checking redzones does not dramatically slow down our test code, and no new&lt;/div&gt;&lt;div&gt;test failures result. This means it is highly likely that our nn_t functions do not overwrite&lt;/div&gt;&lt;div&gt;their bounds. To check that the new redzones code works, it is a simple matter of mocking&lt;/div&gt;&lt;div&gt;up a broken function which overwrites its bounds. The new code complains loudly as it &lt;/div&gt;&lt;div&gt;should, unless redzones are switched off at configure time.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The code for today's update is here: &lt;a href="http://github.com/wbhart/bsdnt/tree/v0.20"&gt;v0.20&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Previous article: &lt;a href="http://wbhart.blogspot.com/2010/10/bsdnt-v019-asserts.html"&gt;v0.19 - asserts&lt;/a&gt;&lt;/div&gt;&lt;div&gt;Next article: &lt;a href="http://wbhart.blogspot.com/2010/10/bsdnt-v021-prngs.html"&gt;v0.21 - prngs&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7651115430416156636-6921763159398388648?l=wbhart.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://wbhart.blogspot.com/feeds/6921763159398388648/comments/default' title='Post Comments'/><link rel='replies' 
type='text/html' href='http://wbhart.blogspot.com/2010/10/bsdnt-v020-redzones.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7651115430416156636/posts/default/6921763159398388648'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7651115430416156636/posts/default/6921763159398388648'/><link rel='alternate' type='text/html' href='http://wbhart.blogspot.com/2010/10/bsdnt-v020-redzones.html' title='BSDNT - v0.20 redzones'/><author><name>William Hart</name><uri>http://www.blogger.com/profile/18416881057216462316</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7651115430416156636.post-6212996043940700508</id><published>2010-10-25T19:38:00.000-07:00</published><updated>2010-10-30T12:25:36.275-07:00</updated><title type='text'>BSDNT - v0.19 asserts</title><content type='html'>&lt;div&gt;About a week ago I got enthused to work on another coding project I've been &lt;/div&gt;&lt;div&gt;wanting to experiment with for a long time. I discovered that it was highly&lt;/div&gt;&lt;div&gt;addictive and I just couldn't put it down. It's also given me some interesting&lt;/div&gt;&lt;div&gt;ideas for a higher level interface to bsdnt. But more on that later when we &lt;/div&gt;&lt;div&gt;start working on it.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Unfortunately in that week of time there have been no bsdnt updates.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Moreover, on the weekend my main computer died (physical hard drive failure).&lt;/div&gt;&lt;div&gt;I pulled out my backup machine and Windows wanted to install 3 months of &lt;/div&gt;&lt;div&gt;"important updates". 
Naturally this caused the machine to crash, I was unable&lt;/div&gt;&lt;div&gt;to recover to the restore point it set, the startup repair didn't work and&lt;/div&gt;&lt;div&gt;the only solution was a format and reload. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;*Fifteen and a half hours* later I had reinstalled Windows and it had finally &lt;/div&gt;&lt;div&gt;finished putting the approximately 165 "important updates" back on!!&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Unfortunately all the bsdnt blog articles I had meticulously prepared in &lt;/div&gt;&lt;div&gt;advance were lost. Thus I am regenerating what I can from the diff between&lt;/div&gt;&lt;div&gt;revisions of bsdnt. Sorry if they end up being shorter than past updates.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Fortunately I did not lose any of the code I wrote, as that was backed up in&lt;/div&gt;&lt;div&gt;a git repo on an external server!&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Anyhow, in this update we make a very simple change to bsdnt, again in an&lt;/div&gt;&lt;div&gt;attempt to improve the test quality of the library. We add asserts to the &lt;/div&gt;&lt;div&gt;code. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;An assert is a check that is made at runtime in live code to test if some&lt;/div&gt;&lt;div&gt;predefined condition holds. If the assert fails, an error message is printed&lt;/div&gt;&lt;div&gt;specifying the line of code where the assert is located and what the&lt;/div&gt;&lt;div&gt;condition was that failed.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Now, I am not personally a great believer in asserts. As they are runtime&lt;/div&gt;&lt;div&gt;checks, they require computing cycles, which is just a no-no for a bignum&lt;/div&gt;&lt;div&gt;library. The other option is to turn them off when not testing code. 
However,&lt;/div&gt;&lt;div&gt;this simply leads to the asserts rarely being run when they are needed.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The other problem with asserts is that they pollute the code, making the&lt;/div&gt;&lt;div&gt;source files longer and seemingly more complex.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;However, there is one situation where I believe they can be very helpful, &lt;/div&gt;&lt;div&gt;and that is in checking the interface of functions within a library and that&lt;/div&gt;&lt;div&gt;it is being respected both in intra-library calls and by the test code for&lt;/div&gt;&lt;div&gt;the library.&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;Specifically, asserts are useful for checking that valid inputs have been &lt;/div&gt;&lt;div&gt;passed to the functions, e.g. you might have a restriction that a Hensel&lt;/div&gt;&lt;div&gt;modulus be odd. Adding an assert allows you to test that all the moduli&lt;/div&gt;&lt;div&gt;you pass to the function in your test runs are in fact odd.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The main advantage in putting asserts into the code is that it forces you&lt;/div&gt;&lt;div&gt;to think through what all the conditions should be that you assert. In &lt;/div&gt;&lt;div&gt;adding asserts to the code in bsdnt I discovered one function in which the&lt;/div&gt;&lt;div&gt;test code was pushing the code to do things I didn't write it to cover.&lt;/div&gt;&lt;div&gt;This forced me to either rewrite the test, or drop that as a condition (I&lt;/div&gt;&lt;div&gt;think I chose the former for consistency with other related functions in&lt;/div&gt;&lt;div&gt;bsdnt).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Of course we do not want to consume cycles when the library is run by the&lt;/div&gt;&lt;div&gt;end user, and so we make asserts optional. This is done using a configure&lt;/div&gt;&lt;div&gt;switch. 
By default the macro WANT_ASSERT is set to 0 in a file config.h by&lt;/div&gt;&lt;div&gt;configure. However, if the user passes the option -assert to configure, it&lt;/div&gt;&lt;div&gt;sets the value of this define to 1. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;A macro ASSERT is then defined in helper.h which is either an empty macro&lt;/div&gt;&lt;div&gt;in the default case or is an alias for the C assert function if WANT_ASSERT&lt;/div&gt;&lt;div&gt;is set to 1 in config.h.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Of course we have to remember to turn asserts on to run the test code, and&lt;/div&gt;&lt;div&gt;this really highlights their main weakness. As I mentioned, the asserts I &lt;/div&gt;&lt;div&gt;added did clarify the interface, but I don't believe they showed up any &lt;/div&gt;&lt;div&gt;bugs in bsdnt. With this expectation, asserts can be a useful tool.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The code for today's update is here: &lt;a href="http://github.com/wbhart/bsdnt/tree/v0.19"&gt;v0.19&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Previous article: &lt;a href="http://wbhart.blogspot.com/2010/10/bsdnt-v018-printxword-nnprintx.html"&gt;v0.18 - printx_word, nn_printx&lt;/a&gt;&lt;/div&gt;&lt;div&gt;Next article: &lt;a href="http://wbhart.blogspot.com/2010/10/bsdnt-v020-redzones.html"&gt;v0.20 - redzones&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7651115430416156636-6212996043940700508?l=wbhart.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://wbhart.blogspot.com/feeds/6212996043940700508/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://wbhart.blogspot.com/2010/10/bsdnt-v019-asserts.html#comment-form' title='0 Comments'/><link rel='edit' 
type='application/atom+xml' href='http://www.blogger.com/feeds/7651115430416156636/posts/default/6212996043940700508'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7651115430416156636/posts/default/6212996043940700508'/><link rel='alternate' type='text/html' href='http://wbhart.blogspot.com/2010/10/bsdnt-v019-asserts.html' title='BSDNT - v0.19 asserts'/><author><name>William Hart</name><uri>http://www.blogger.com/profile/18416881057216462316</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7651115430416156636.post-2217080325965199785</id><published>2010-10-16T14:51:00.000-07:00</published><updated>2010-10-25T19:52:41.907-07:00</updated><title type='text'>BSDNT - v0.18 printx_word, nn_printx</title><content type='html'>&lt;div&gt;It is time we improved our test code again. We'll spend a few days updating&lt;/div&gt;&lt;div&gt;things to make improvements in the way we test.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Today's update is quite straightforward. We currently have no way of printing&lt;/div&gt;&lt;div&gt;nn_t's. This is quite inconvenient when it comes to the test code, where&lt;/div&gt;&lt;div&gt;little to no diagnostic information is printed at all. In particular, we &lt;/div&gt;&lt;div&gt;aren't printing out any of the multiple precision integers for examination&lt;/div&gt;&lt;div&gt;when a test fails.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Now, it is actually quite a difficult job to print bignum integers in decimal. &lt;/div&gt;&lt;div&gt;In fact, as far as I can see, one requires a function which allocates temporary&lt;/div&gt;&lt;div&gt;space to efficiently print integers. 
This is an interesting challenge:&lt;/div&gt;&lt;div&gt;is there an algorithm to convert from binary to decimal and print the result,&lt;/div&gt;&lt;div&gt;using just O(1) temporary space, whatever its time complexity?&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I think it may be possible if one allows the input to be destroyed. If so, a&lt;/div&gt;&lt;div&gt;subsidiary question would be to do the same thing without destroying the &lt;/div&gt;&lt;div&gt;input. I doubt that is possible, but I do not have a proof. Of course, to be&lt;/div&gt;&lt;div&gt;practical, we'd require an algorithm which doesn't destroy the input.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;To get around this issue, we'll start with a simple nn_printx algorithm, &lt;/div&gt;&lt;div&gt;which will print a bignum in hexadecimal. We also add an nn_printx_short &lt;/div&gt;&lt;div&gt;function which prints the first couple of words of a bignum, an ellipsis and &lt;/div&gt;&lt;div&gt;then the final couple of words. This is useful for large bignums that would &lt;/div&gt;&lt;div&gt;print for screens and screens due to their size. We'll use this in our test &lt;/div&gt;&lt;div&gt;code to prevent printing too much output upon test failure.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Another function we add is an nn_printx_diff function. It accepts two nn_t's&lt;/div&gt;&lt;div&gt;and prints information about the range of words where they differ and prints &lt;/div&gt;&lt;div&gt;the first and last differing word in each case.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;There is one tricky aspect to our printing functions, however. A word is often&lt;/div&gt;&lt;div&gt;an unsigned long, but on some platforms it will be an unsigned long long. For &lt;/div&gt;&lt;div&gt;this reason, when printing a word, we need to use %lx as the format specifier &lt;/div&gt;&lt;div&gt;on some platforms and %llx on others. 
&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;So we need to add a routine which will print a word and abstract away the &lt;/div&gt;&lt;div&gt;format specifier so the caller doesn't have to think about it. The function&lt;/div&gt;&lt;div&gt;we include to do this is called printx_word. It prints a word without the caller&lt;/div&gt;&lt;div&gt;needing to supply a format specifier.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We add files helper.c/h to bsdnt which will contain routines like this one&lt;/div&gt;&lt;div&gt;which aren't specific to our nn module. A few existing functions and macros&lt;/div&gt;&lt;div&gt;also get moved there. The configure system will automatically look for &lt;/div&gt;&lt;div&gt;architecture specific versions of helper.c, allowing us to override the&lt;/div&gt;&lt;div&gt;definition of the functions in that file on a per architecture basis.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We add the printx_word function to helper.c which can be overridden with &lt;/div&gt;&lt;div&gt;an architecture specific version. On a platform where %llx is required, an &lt;/div&gt;&lt;div&gt;architecture specific version will simply replace the generic version which &lt;/div&gt;&lt;div&gt;uses %lx.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In test.h we add some macros, print_debug and print_debug_diff, which use the&lt;/div&gt;&lt;div&gt;stringizing operator to print the names of the variables and then print their&lt;/div&gt;&lt;div&gt;values. The stringizing operator (#) is a preprocessor operator which turns a &lt;/div&gt;&lt;div&gt;macro argument into a string literal. In our case, we pass the variable name to the&lt;/div&gt;&lt;div&gt;macro and turn it into a string so that we can print the variable name. 
&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;A few modifications to the TEST_START and TEST_END macros in test.h also &lt;/div&gt;&lt;div&gt;allow us to give a unique name to each test which is then printed along with &lt;/div&gt;&lt;div&gt;the iteration at which the test failed. This also uses the stringizing &lt;/div&gt;&lt;div&gt;operator so that the caller of TEST_START can specify the unique name for the&lt;/div&gt;&lt;div&gt;test. It seems difficult to come up with an automatic way of generating &lt;/div&gt;&lt;div&gt;unique test names, so this will have to do.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;It would also be a useful thing to have it print the value of the random seed &lt;/div&gt;&lt;div&gt;at the start of a failing iteration too. After we have improved the random&lt;/div&gt;&lt;div&gt;generation code in bsdnt v0.21, perhaps someone would like to try adding this &lt;/div&gt;&lt;div&gt;feature.  &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We roll out our new diagnostic printing routines to our test code. Of course,&lt;/div&gt;&lt;div&gt;to see any of this new code in action, one has to introduce a bug in one of &lt;/div&gt;&lt;div&gt;the tests so that the new diagnostic code is actually run. I leave it to you&lt;/div&gt;&lt;div&gt;to fiddle around introducing bugs to see that the new test code does actually&lt;/div&gt;&lt;div&gt;print useful diagnostic information.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Later on we'll add a bsdnt_printf function which will be variadic and accept&lt;/div&gt;&lt;div&gt;a format specifier like the C printf function and which will have a&lt;/div&gt;&lt;div&gt;consistent %w for printing a word. 
This will also make things easier on &lt;/div&gt;&lt;div&gt;Windows, where currently the format specifier will be wrong in many places.&lt;/div&gt;&lt;div&gt;We'll fix this problem in a later update.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The code for today's post is here: &lt;a href="http://github.com/wbhart/bsdnt/tree/v0.18"&gt;v0.18&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Previous article: &lt;a href="http://wbhart.blogspot.com/2010/10/bsdnt-v017-divhensel.html"&gt;v0.17 - div_hensel&lt;/a&gt;&lt;/div&gt;&lt;div&gt;Next article: &lt;a href="http://wbhart.blogspot.com/2010/10/bsdnt-v019-asserts.html"&gt;v0.19 - asserts&lt;/a&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7651115430416156636-2217080325965199785?l=wbhart.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://wbhart.blogspot.com/feeds/2217080325965199785/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://wbhart.blogspot.com/2010/10/bsdnt-v018-printxword-nnprintx.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7651115430416156636/posts/default/2217080325965199785'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7651115430416156636/posts/default/2217080325965199785'/><link rel='alternate' type='text/html' href='http://wbhart.blogspot.com/2010/10/bsdnt-v018-printxword-nnprintx.html' title='BSDNT - v0.18 printx_word, nn_printx'/><author><name>William Hart</name><uri>http://www.blogger.com/profile/18416881057216462316</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7651115430416156636.post-8364867753172635963</id><published>2010-10-15T17:43:00.000-07:00</published><updated>2010-10-16T14:55:35.476-07:00</updated><title type='text'>BSDNT - v0.17 div_hensel</title><content type='html'>&lt;div&gt;Now that we have nn_mullow_classical, we can add nn_div_hensel. As explained,&lt;/div&gt;&lt;div&gt;this will take an integer a of length n and divide by an integer d of length &lt;/div&gt;&lt;div&gt;m modulo B^n, returning a quotient q and an "overflow".&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The overflow will be two words which agree with the overflow from mullow(q*d).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;As per the euclidean division, the dividend a will be destroyed. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The algorithm is somewhat simpler than the euclidean algorithm. If d1 is the&lt;/div&gt;&lt;div&gt;least significant word of d then we use an inverse mod B of d1 (dinv say) and&lt;/div&gt;&lt;div&gt;multiply it by the current word of the dividend being considered (working from&lt;/div&gt;&lt;div&gt;right to left) to get a quotient word q1. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We then subtract q1*d (appropriately shifted) from the dividend. There is no&lt;/div&gt;&lt;div&gt;adjustment to do as the inverse mod B is unique (so long as d is odd). &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Any borrows and overflows from the subtractions are accumulated in the two &lt;/div&gt;&lt;div&gt;overflow words and returned.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In our test code, we check a few things. Firstly, for an exact division, we &lt;/div&gt;&lt;div&gt;want the quotient to really be the exact quotient of a by d. 
As the &lt;/div&gt;&lt;div&gt;quotient returned is m words, which may be larger than the actual quotient, &lt;/div&gt;&lt;div&gt;we check that any additional words of q are zero. We do this by normalising q.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The second test we do is for an inexact division. We check that the &lt;/div&gt;&lt;div&gt;overflow words turn out to be the same as the overflow from mullow(q*d).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Note that if we wish to make Hensel division efficient for doing an exact&lt;/div&gt;&lt;div&gt;division, say a 2m - 1 by m division, we merely pass m words of the &lt;/div&gt;&lt;div&gt;dividend in instead of all 2m - 1 words, so that the quotient is also m words. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Once again we don't allow an overflow-in to Hensel division. This wouldn't &lt;/div&gt;&lt;div&gt;give us any kind of chaining property anyway. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Instead, we'd have to do a mulhigh(q, d) and subtract that from the high &lt;/div&gt;&lt;div&gt;part of the chain before continuing, and the mulhigh will accept our overflow&lt;/div&gt;&lt;div&gt;words from the low Hensel division.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In fact, we add a chaining test that does precisely this. We do a Hensel&lt;/div&gt;&lt;div&gt;division on the low n words of a chain, subtract a mulhigh from the high&lt;/div&gt;&lt;div&gt;m words of the chain, then compute the high m words of the quotient using&lt;/div&gt;&lt;div&gt;a second Hensel division.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;To make our test code work, we add an ODD flag to randoms_of_len so that only&lt;/div&gt;&lt;div&gt;odd divisors are used with Hensel division.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;It isn't clear if it makes sense to allow a carry-in at the top of div_hensel&lt;/div&gt;&lt;div&gt;or not. 
On the one hand, it might seem logical to allow a carry-in on account&lt;/div&gt;&lt;div&gt;of the way mul_classical works. On the other hand, divrem_hensel1 took a &lt;/div&gt;&lt;div&gt;carry-in, but at the bottom. This was for chaining rather than a read-in of&lt;/div&gt;&lt;div&gt;an extra word. We choose the latter convention, as it seems to make more&lt;/div&gt;&lt;div&gt;sense here and stops the code from becoming horribly complex.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The code for today's post is here: &lt;a href="http://github.com/wbhart/bsdnt/tree/v0.17"&gt;v0.17&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Previous article: &lt;a href="http://wbhart.blogspot.com/2010/10/bsdnt-v016-mullowclassical.html"&gt;v0.16 - mullow_classical&lt;/a&gt;&lt;/div&gt;&lt;div&gt;Next article: &lt;a href="http://wbhart.blogspot.com/2010/10/bsdnt-v018-printxword-nnprintx.html"&gt;v0.18 - printx_word, nn_printx&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7651115430416156636-8364867753172635963?l=wbhart.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://wbhart.blogspot.com/feeds/8364867753172635963/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://wbhart.blogspot.com/2010/10/bsdnt-v017-divhensel.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7651115430416156636/posts/default/8364867753172635963'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7651115430416156636/posts/default/8364867753172635963'/><link rel='alternate' type='text/html' href='http://wbhart.blogspot.com/2010/10/bsdnt-v017-divhensel.html' title='BSDNT - v0.17 div_hensel'/><author><name>William 
Hart</name><uri>http://www.blogger.com/profile/18416881057216462316</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7651115430416156636.post-3286494467547000227</id><published>2010-10-14T12:17:00.000-07:00</published><updated>2010-10-15T17:47:36.608-07:00</updated><title type='text'>BSDNT - v0.16 mullow_classical, mulhigh_classical</title><content type='html'>The most logical routine to implement next would be Hensel division. Its&lt;br /&gt;main application is in doing exact division.&lt;br /&gt;&lt;br /&gt;For that reason, we might want to focus on Hensel division without&lt;br /&gt;remainder first.&lt;br /&gt;&lt;br /&gt;This would take an m word integer and divide it by an n word integer,&lt;br /&gt;division proceeding from right to left. Essentially it gives us division&lt;br /&gt;modulo B^m.&lt;br /&gt;&lt;br /&gt;However, before we implement this, it makes sense to think about what its&lt;br /&gt;"inverse" operation might be.&lt;br /&gt;&lt;br /&gt;If q is the Hensel quotient of a by d mod B^m, then the inverse operation&lt;br /&gt;can be thought of as multiplication of q by d mod B^m.&lt;br /&gt;&lt;br /&gt;Hensel division gives an m word quotient q, so this would imply that its&lt;br /&gt;inverse should be a multiplication of {d, n} by {q, m} mod B^m. We might call&lt;br /&gt;this inverse operation mullow, as it returns the low m words of the product&lt;br /&gt;d*q.&lt;br /&gt;&lt;br /&gt;However, we need to be careful with this kind of multiplication. We'd also&lt;br /&gt;like to have a mulhigh which returns the high part of the multiplication,&lt;br /&gt;and we'd like the sum of mullow and mulhigh to be the same as a full mul.&lt;br /&gt;&lt;br /&gt;However, there is a problem if mullow merely returns the product mod B^m. 
Any&lt;br /&gt;carries out of mullow will have been lost. Also, all the word by word&lt;br /&gt;multiplications that contribute to the high word of the product mod B^m&lt;br /&gt;will be thrown away.&lt;br /&gt;&lt;br /&gt;To rectify the problem we accumulate an "overflow" out of the mullow&lt;br /&gt;corresponding to the sum of all these high words and carries. As this&lt;br /&gt;overflow is essentially the sum of arbitrary words it may take up two words.&lt;br /&gt;&lt;br /&gt;Thus, instead of mullow yielding m words it will yield m + 2 words. We'd&lt;br /&gt;like to pass the extra two words as an "overflow-in" to mulhigh, thus the&lt;br /&gt;logical thing is to return these two words from mullow separately from the&lt;br /&gt;product mod B^m itself.&lt;br /&gt;&lt;br /&gt;Hensel division will also return two overflow words. After all, what it&lt;br /&gt;essentially does to the dividend is subtract a mullow of the quotient by the&lt;br /&gt;divisor. So, the overflow from Hensel division will be defined as precisely&lt;br /&gt;the overflow from mullow(q*d).&lt;br /&gt;&lt;br /&gt;We manage the overflow by accumulating it in a dword_t. However, as we don't&lt;br /&gt;wish the user to have to deal with dword_t's (these are used in our internal&lt;br /&gt;implementations only), we split this dword_t into two separate words at the&lt;br /&gt;end and return them as an array of two words representing the "overflow".&lt;br /&gt;&lt;br /&gt;Today we shall only implement mullow and mulhigh. The first of these is a lot&lt;br /&gt;like a full multiplication except that the addmul1's become shorter as the&lt;br /&gt;algorithm proceeds and the carry-outs have to be accumulated in two words, as&lt;br /&gt;explained.&lt;br /&gt;&lt;br /&gt;At the same time we implement mulhigh. 
This takes two "overflow-in" words and&lt;br /&gt;computes the rest of the product, again in a fashion similar to a full&lt;br /&gt;multiplication.&lt;br /&gt;&lt;br /&gt;Our test code simply stitches a mullow and mulhigh together to see that the&lt;br /&gt;chain is the same as a full multiplication.&lt;br /&gt;&lt;br /&gt;We have to be careful in that if one does an n by n mullow, the mulhigh that&lt;br /&gt;we wish to chain with this must start with an n-1 by 1 multiplication,&lt;br /&gt;not an n by 1, otherwise the sum of the mullow and mulhigh would contain the&lt;br /&gt;cross product of certain terms twice.&lt;br /&gt;&lt;br /&gt;We also have to be careful in the case where the full multiplication is only&lt;br /&gt;a single word by a single word. Here the overflow out of the mullow part is&lt;br /&gt;only a single word and there is no mulhigh to speak of. It merely passes&lt;br /&gt;the overflow from the mullow straight through.&lt;br /&gt;&lt;br /&gt;Both mullow and mulhigh must accept all nonzero lengths, as per full&lt;br /&gt;multiplication. This leaves a few special cases to deal with in mulhigh. This doesn't&lt;br /&gt;seem particularly efficient or elegant, but there seems to be little we&lt;br /&gt;can do about that.&lt;br /&gt;&lt;br /&gt;An interesting question is what the inverse of mulhigh is. Essentially,&lt;br /&gt;this is our euclidean divapprox.&lt;br /&gt;&lt;br /&gt;There's something slightly unsatisfying here though. Recall that the divapprox&lt;br /&gt;algorithm proceeds by subtracting values q1*d' where q1 is the current quotient&lt;br /&gt;word and d' is what is currently left of the divisor. We throw away one word of&lt;br /&gt;the divisor at each iteration until finally we are left with just two words.&lt;br /&gt;&lt;br /&gt;It would be wholly more satisfying if we didn't require this extra word of the&lt;br /&gt;divisor throughout. 
We'd then be working right down to a single word in the&lt;br /&gt;divisor so that we will really have subtracted a mulhigh by the time the&lt;br /&gt;algorithm completes.&lt;br /&gt;&lt;br /&gt;Any modification I can think of making to the euclidean division to make this&lt;br /&gt;seem more natural also makes it much less efficient.&lt;br /&gt;&lt;br /&gt;Perhaps some further thought will lead to a more satisfying way of thinking about&lt;br /&gt;these things, which isn't also less efficient in practice.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The code for today is here: &lt;a href="http://github.com/wbhart/bsdnt/tree/v0.16"&gt;v0.16&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Previous article: &lt;a href="http://wbhart.blogspot.com/2010/10/bsdnt-v015-divapproxclassical.html"&gt;v0.15 - divapprox_classical&lt;/a&gt;&lt;div&gt;Next article: &lt;a href="http://wbhart.blogspot.com/2010/10/bsdnt-v017-divhensel.html"&gt;v0.17 - div_hensel&lt;/a&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7651115430416156636-3286494467547000227?l=wbhart.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://wbhart.blogspot.com/feeds/3286494467547000227/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://wbhart.blogspot.com/2010/10/bsdnt-v016-mullowclassical.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7651115430416156636/posts/default/3286494467547000227'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7651115430416156636/posts/default/3286494467547000227'/><link rel='alternate' type='text/html' href='http://wbhart.blogspot.com/2010/10/bsdnt-v016-mullowclassical.html' title='BSDNT - v0.16 mullow_classical, mulhigh_classical'/><author><name>William 
Hart</name><uri>http://www.blogger.com/profile/18416881057216462316</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7651115430416156636.post-8183267255657687582</id><published>2010-10-13T15:00:00.000-07:00</published><updated>2011-10-23T10:12:53.909-07:00</updated><title type='text'>BSDNT - v0.15 divapprox_classical</title><content type='html'>&lt;div&gt;During the past few weeks, Brian Gladman has been doing &lt;/div&gt;&lt;div&gt;some tremendous updates, including some random number generators and&lt;/div&gt;&lt;div&gt;making bsdnt work on Windows (MSVC). &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We discuss all matters related to bsdnt on our development list:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;a href="http://groups.google.co.uk/group/bsdnt-devel?hl=en"&gt;http://groups.google.co.uk/group/bsdnt-devel?hl=en&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Anyhow, now for today's update. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We'd now like to implement a variant of our divrem_classical algorithm. This&lt;br /&gt;time we'd like to just return a quotient, with no remainder. The question is,&lt;br /&gt;can this be done in less time than a full divrem?&lt;br /&gt;&lt;br /&gt;At first sight, the answer seems to be no. 
As we saw in the post on the divrem&lt;br /&gt;algorithm, every word of both the dividend and divisor counts, and we need&lt;br /&gt;to keep the dividend completely updated as the algorithm proceeds, otherwise&lt;br /&gt;we will get the wrong quotient.&lt;br /&gt;&lt;br /&gt;So, with the exception of perhaps the final update (which is only needed to&lt;br /&gt;determine the remainder), there doesn't seem to be much we can save.&lt;br /&gt;&lt;br /&gt;But what if we allowed the quotient to be approximate, say within 1 of the actual&lt;br /&gt;quotient? In fact, let's demand that if q' is our approximate quotient, then |a -&lt;br /&gt;d*q'| &amp;lt; d. In other words, we allow the quotient to be too large by 1, but not&lt;br /&gt;too small by 1.&lt;br /&gt;&lt;br /&gt;Ideally, what we would like is to be doing about half the updating work.&lt;br /&gt;Specifically, we'd like to be truncating both the dividend and divisor as we&lt;br /&gt;go.&lt;br /&gt;&lt;br /&gt;Ordinarily we start with something like&lt;br /&gt;&lt;br /&gt;AAAAAAAAAAAAAAA /&lt;br /&gt;        DDDDDDDDD&lt;br /&gt;&lt;br /&gt;To get the first quotient word we shift the divisor and pad with zeroes, thus&lt;br /&gt;&lt;br /&gt;AAAAAAAAAAAAAAA /&lt;br /&gt;DDDDDDDD000000&lt;br /&gt;&lt;br /&gt;After one iteration, a word of the dividend has been removed, and we then&lt;br /&gt;shift the divisor again.&lt;br /&gt;&lt;br /&gt;AAAAAAAAAAAAAA /&lt;br /&gt;DDDDDDDD00000&lt;br /&gt;&lt;br /&gt;We continue until we have&lt;br /&gt;&lt;br /&gt;     AAAAAAAAA /&lt;br /&gt;        DDDDDDDD&lt;br /&gt;&lt;br /&gt;Now there is only one quotient word left. But we notice that we don't use&lt;br /&gt;most of the remaining dividend or the divisor to determine this quotient&lt;br /&gt;word. 
In fact, we could almost truncate to&lt;br /&gt;&lt;br /&gt;      AA            /&lt;br /&gt;        D&lt;br /&gt;&lt;br /&gt;What if we had truncated at this same point all along?&lt;br /&gt;&lt;br /&gt;In fact, if we truncate so that the final division is a two word by one&lt;br /&gt;word division (here we have to be careful, in that we are talking about the&lt;br /&gt;number of words *after* normalisation), then clearly our quotient could be&lt;br /&gt;out by as much as two on that final division, by what we have said in an&lt;br /&gt;earlier post. That is of course ignoring any accumulated error along the way.&lt;br /&gt;&lt;br /&gt;As we don't wish to multiply the entire thing out to see what we have to do&lt;br /&gt;to correct it, it is clear that this amount of truncation is too great.&lt;br /&gt;&lt;br /&gt;So let's truncate one further word to the right in both the dividend and&lt;br /&gt;divisor, so that the final division (to get the final quotient word) is a&lt;br /&gt;three word by two word division.&lt;br /&gt;&lt;br /&gt;In fact, in the example above, as there will be five quotient words, there&lt;br /&gt;will be five iterations of the algorithm, after which we want two words of&lt;br /&gt;the divisor remaining. 
So, we will start with&lt;br /&gt;&lt;br /&gt;AAAAAAA.A /&lt;br /&gt;DDDDDD.D&lt;br /&gt;&lt;br /&gt;(The decimal points I have inserted are arbitrary, and only for notational&lt;br /&gt;purposes in what follows.)&lt;br /&gt;&lt;br /&gt;After five iterations, throwing away one more word of the divisor each time,&lt;br /&gt;we'll be left with&lt;br /&gt;&lt;br /&gt;      AA.A /&lt;br /&gt;        D.D&lt;br /&gt;&lt;br /&gt;The first thing to notice is that our previous divrem algorithms, with the&lt;br /&gt;adjustments they made as they went, gave the precise quotient given the&lt;br /&gt;data they started with.&lt;br /&gt;&lt;br /&gt;The second thing to notice is that truncating both the dividend and the&lt;br /&gt;divisor at the same point, as above, will not yield a quotient that is too&lt;br /&gt;small. In fact, the quotient we end up with will be the same as what we would&lt;br /&gt;have obtained if we had not truncated the dividend at all, and only truncated&lt;br /&gt;the divisor. Additional places in the dividend can't affect the algorithm.&lt;br /&gt;&lt;br /&gt;Truncating the divisor, on the other hand, may result in a different quotient&lt;br /&gt;than we would have obtained without truncation. In fact, as we end up&lt;br /&gt;subtracting less at each update than we would if all those words were still&lt;br /&gt;there, we may end up with a quotient which is too large. The divisor may also&lt;br /&gt;divide more times, because of the truncation, than it would have if it had not&lt;br /&gt;been truncated.&lt;br /&gt;&lt;br /&gt;However, it is not enough to merely consider how the quotient changes with&lt;br /&gt;truncation in order to see how far we can be out. 
We'll likely end up with a&lt;br /&gt;very pessimistic estimate if we do this, because we may suppose that the&lt;br /&gt;quotient can be one too large at each iteration, which is not true.&lt;br /&gt;&lt;br /&gt;Instead, the quantity to keep track of is the original dividend minus the&lt;br /&gt;product of the full divisor d and the computed quotient q. At the end of the&lt;br /&gt;algorithm, this is the actual remainder we'll end up with, and we'd like to&lt;br /&gt;keep track of how much our *computed* remainder (what's left of the dividend&lt;br /&gt;after the algorithm completes) differs from this actual remainder.&lt;br /&gt;&lt;br /&gt;Essentially, we accumulate an error in the computed remainder due to the&lt;br /&gt;truncation.&lt;br /&gt;&lt;br /&gt;Clearly, at each iteration, the word after the decimal point in what remains&lt;br /&gt;of the dividend may be completely incorrect. And we may miss a borrow out of&lt;br /&gt;this place into the place before the decimal point. So after n iterations of&lt;br /&gt;the algorithm, the dividend may become too large by n. Of course n.0 is much&lt;br /&gt;smaller than our original (normalised) divisor d (also considered as a decimal&lt;br /&gt;D.DD...).&lt;br /&gt;&lt;br /&gt;At the final step of the algorithm, we will have a dividend which is too large&lt;br /&gt;by at most this amount, and we'll be using a divisor which is truncated to&lt;br /&gt;just two words. However, the latter affects the computed remainder by an&lt;br /&gt;amount much less than the original d (if Q is the final quotient word, it is&lt;br /&gt;as though we added q*0.0DDDDDD to our divisor, so that the full divisor would&lt;br /&gt;go Q times where it otherwise would only go Q-1 times).&lt;br /&gt;&lt;br /&gt;So these two sources of error only increase the computed value q'*d + r (where&lt;br /&gt;q' is the computed quotient and r is the computed remainder) by an amount less&lt;br /&gt;than d. 
Thus, the computed quotient q' can be at most one larger than the&lt;br /&gt;actual quotient q.&lt;br /&gt;&lt;br /&gt;This is equivalent to the required |a - d*q'| &amp;lt; d.&lt;br /&gt;&lt;br /&gt;So it seems that if we truncate our dividend at the start of the algorithm,&lt;br /&gt;and our divisor after each iteration, we can get an approximate quotient q'&lt;br /&gt;within the required bounds.&lt;br /&gt;&lt;br /&gt;We'll leave it to another time to describe the usefulness of an algorithm&lt;br /&gt;which computes a quotient which may be out by one. What we will note is that&lt;br /&gt;we've done computations with much smaller integers. It therefore costs us&lt;br /&gt;significantly less time than a full divrem.&lt;br /&gt;&lt;br /&gt;In this week's branch we implement this algorithm. In the test code, we check&lt;br /&gt;the required quotient is the same as the one returned by divrem, or at most&lt;br /&gt;one too large.&lt;br /&gt;&lt;br /&gt;The trickiest part is ensuring we truncate at the right point. We want to&lt;br /&gt;finish on the last iteration with two words *after* normalisation of the&lt;br /&gt;divisor.&lt;br /&gt;&lt;br /&gt;Actually, if we are really smart, we realise that if d does not need&lt;br /&gt;to be shifted by much to normalise it, we can get away with finishing with&lt;br /&gt;just two *unnormalised* words in the divisor. The error will still be much&lt;br /&gt;less than d.&lt;br /&gt;&lt;br /&gt;To be safe, if the number of quotient words is to be n, I check if the&lt;br /&gt;leading word of the unnormalised divisor is more than 2*n. If not, too much&lt;br /&gt;normalisation may be required, and I set up the algorithm to finish with&lt;br /&gt;three unnormalised words instead of two. Otherwise it is safe to finish&lt;br /&gt;with two words in the unnormalised divisor.&lt;br /&gt;&lt;br /&gt;The algorithm begins by computing the number of words s the divisor needs to&lt;br /&gt;be to start. 
This is two more than the number of iterations required to&lt;br /&gt;get all but the final quotient word, since we should have two words in the&lt;br /&gt;divisor at this point. If normalisation is going to be a problem, we add one&lt;br /&gt;to this so that we compute the final quotient word with three unnormalised&lt;br /&gt;words in the divisor.&lt;br /&gt;&lt;br /&gt;Now the number of words required to start depends only on the size of the&lt;br /&gt;quotient, and thus it may be more than the number of words d actually has.&lt;br /&gt;Thus we begin with the ordinary divrem algorithm until the number of words&lt;br /&gt;required is less than the number of words d actually has.&lt;br /&gt;&lt;br /&gt;Now we truncate d to the required number of words and the dividend to one&lt;br /&gt;more than that. The remaining part of the algorithm proceeds in the same&lt;br /&gt;manner, throwing away a divisor word every iteration.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Only one thing can go wrong, and that is the following: because we are &lt;/div&gt;&lt;div&gt;truncating the divisor at each point, we may end up subtracting too little from&lt;/div&gt;&lt;div&gt;the dividend. In fact, what can happen is that the top word of the dividend &lt;/div&gt;&lt;div&gt;does not become zero after we subtract q*d' (where d' is the truncated divisor).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;When this happens, the top word of the dividend may be 1 after the subtraction.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We know that the least significant word of the dividend could be completely&lt;/div&gt;&lt;div&gt;wrong, and the next word may be too large by about as many iterations as &lt;/div&gt;&lt;div&gt;we've completed so far.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Thus, in order to fix our overflow, we subtract from the dividend as much as we &lt;/div&gt;&lt;div&gt;need to in order for the overflow to disappear. 
We don't mind our dividend &lt;/div&gt;&lt;div&gt;being too large, as we adjust for that in the algorithm. But we cannot allow it to&lt;/div&gt;&lt;div&gt;become too small. Thus we must only subtract from the dividend precisely the&lt;/div&gt;&lt;div&gt;amount required to make the overflow vanish.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We can safely subtract away whatever is in the bottom two words of the&lt;/div&gt;&lt;div&gt;dividend, as this is not even enough to remove the overflow. And then we can&lt;/div&gt;&lt;div&gt;subtract 1 from the whole dividend. This must remove the overflow and is&lt;/div&gt;&lt;div&gt;clearly the least we can get away with subtracting to do so.&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The code for today's post is here: &lt;a href="http://github.com/wbhart/bsdnt/tree/v0.15"&gt;v0.15&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Previous article: &lt;a href="http://wbhart.blogspot.com/2010/10/bsdnt-v014-divremclassical.html"&gt;v0.14 - divrem_classical&lt;/a&gt;&lt;div&gt;Next article: &lt;a href="http://wbhart.blogspot.com/2010/10/bsdnt-v016-mullowclassical.html"&gt;v0.16 - mullow_classical&lt;/a&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7651115430416156636-8183267255657687582?l=wbhart.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://wbhart.blogspot.com/feeds/8183267255657687582/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://wbhart.blogspot.com/2010/10/bsdnt-v015-divapproxclassical.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7651115430416156636/posts/default/8183267255657687582'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/7651115430416156636/posts/default/8183267255657687582'/><link rel='alternate' type='text/html' href='http://wbhart.blogspot.com/2010/10/bsdnt-v015-divapproxclassical.html' title='BSDNT - v0.15 divapprox_classical'/><author><name>William Hart</name><uri>http://www.blogger.com/profile/18416881057216462316</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7651115430416156636.post-5135097082239444443</id><published>2010-10-06T03:49:00.000-07:00</published><updated>2010-10-13T15:28:00.401-07:00</updated><title type='text'>BSDNT - v0.14 divrem_classical</title><content type='html'>&lt;div&gt;It's time to implement our schoolboy division routine. I prefer the name&lt;/div&gt;&lt;div&gt;divrem_classical, in line with the multiplication routines.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;This function will take a dividend a, divisor d and carry-in ci, and return&lt;/div&gt;&lt;div&gt;a quotient q and remainder r. We'll also need to pass in a precomputed&lt;/div&gt;&lt;div&gt;inverse to speed up the dword_t by word_t divisions that need to occur.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We are going to implement the first algorithm spoken of in the previous&lt;/div&gt;&lt;div&gt;post, namely the one which uses a single "normalised" word of the divisor.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We need to be careful not to just shift the top word of the divisor so &lt;/div&gt;&lt;div&gt;that it is normalised, but if it has more than one word, shift any high&lt;/div&gt;&lt;div&gt;bits of the second word as well. 
We want the top WORD_BITS bits of the&lt;/div&gt;&lt;div&gt;divisor, starting with the first nonzero bit.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We can use our macro for doing a dword_t by word_t division to get each&lt;/div&gt;&lt;div&gt;new word of the quotient. We start with the carry-in and the most &lt;/div&gt;&lt;div&gt;significant word of a. The macro will automatically shift these by the&lt;/div&gt;&lt;div&gt;same amount as we shifted the leading words of the divisor. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;As per the divrem1 function, we require that the carry-in be such that&lt;/div&gt;&lt;div&gt;the quotient won't overflow. In other words, we assume that if the&lt;/div&gt;&lt;div&gt;divisor d is m words, then the top m words of the dividend including the &lt;/div&gt;&lt;div&gt;carry-in, are reduced mod d.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;First we need to check that we are not in the special case where truncating&lt;/div&gt;&lt;div&gt;the dividend and divisor to two and one words respectively would cause&lt;/div&gt;&lt;div&gt;an overflow of the quotient word to be computed. This only happens &lt;/div&gt;&lt;div&gt;when the top word of the dividend equals the top word of the divisor, as &lt;/div&gt;&lt;div&gt;explained in the previous post. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;If the truncation would cause an overflow in the quotient, we collect a &lt;/div&gt;&lt;div&gt;quotient word of ~0, as discussed in the previous post. If not, we compute &lt;/div&gt;&lt;div&gt;the quotient using our macro.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;After this point, the remainder is computed. We allow the function to &lt;/div&gt;&lt;div&gt;destroy the input a for this purpose. 
We leave it up to the caller to make&lt;/div&gt;&lt;div&gt;a copy of a and pass it to the function, if this is not desired&lt;/div&gt;&lt;div&gt;behaviour.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We do our adjustment so that the quotient is correct. We need a while loop&lt;/div&gt;&lt;div&gt;for this, as mentioned in the previous article. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Finally we write out the quotient word and read in the next word of the&lt;/div&gt;&lt;div&gt;dividend that remains.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In the test code we use our mul_classical and muladd_classical functions to&lt;/div&gt;&lt;div&gt;check that divrem_classical is indeed the inverse of these functions, with&lt;/div&gt;&lt;div&gt;zero remainder and nonzero remainder respectively.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The code for today's post is here: &lt;a href="http://github.com/wbhart/bsdnt/tree/v0.14"&gt;v0.14&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Previous article: &lt;a href="http://wbhart.blogspot.com/2010/10/bsdnt-divrem-discussion.html"&gt;divrem discussion&lt;/a&gt;&lt;/div&gt;&lt;div&gt;Next article: &lt;a href="http://wbhart.blogspot.com/2010/10/bsdnt-v015-divapproxclassical.html"&gt;v0.15 - divapprox_classical&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7651115430416156636-5135097082239444443?l=wbhart.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://wbhart.blogspot.com/feeds/5135097082239444443/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://wbhart.blogspot.com/2010/10/bsdnt-v014-divremclassical.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' 
href='http://www.blogger.com/feeds/7651115430416156636/posts/default/5135097082239444443'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7651115430416156636/posts/default/5135097082239444443'/><link rel='alternate' type='text/html' href='http://wbhart.blogspot.com/2010/10/bsdnt-v014-divremclassical.html' title='BSDNT - v0.14 divrem_classical'/><author><name>William Hart</name><uri>http://www.blogger.com/profile/18416881057216462316</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7651115430416156636.post-2284725690611517301</id><published>2010-10-02T09:45:00.000-07:00</published><updated>2010-10-06T03:59:05.588-07:00</updated><title type='text'>BSDNT - divrem discussion</title><content type='html'>&lt;div&gt;Today's post is a discussion only, without accompanying code. This is because&lt;/div&gt;&lt;div&gt;the topic is divrem using the "school boy" algorithm, and this is not as &lt;/div&gt;&lt;div&gt;straightforward as one might imagine. The discussion below is informal, and&lt;/div&gt;&lt;div&gt;may contain errors. Please let me know if so and I can make some adjustments&lt;/div&gt;&lt;div&gt;before releasing the code for the next article, where we implement this.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;As a consolation, I have released a failed attempt at using C89 macros to make&lt;/div&gt;&lt;div&gt;even more simplifications to our test code.  In particular, I tried to make the&lt;/div&gt;&lt;div&gt;chain tests more "automatic". Unfortunately, this didn't work out. It ended up&lt;/div&gt;&lt;div&gt;making the test code too complicated to use and it was going to be way too much&lt;/div&gt;&lt;div&gt;work to convert all of it over to the new test system. 
&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;This test code "simplification" was completely abandoned and will not be merged.&lt;/div&gt;&lt;div&gt;But if you are curious you can see it here: &lt;a href="http://github.com/wbhart/bsdnt/tree/gentest"&gt;gentest&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In particular, take a look at generic.c/h and the test code in t-nn.c.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Anyhow, back to today's discussion:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Firstly, what exactly do we mean by the school boy division algorithm? &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Certainly I learned how to do 39 divide 7 at school, and even 339 divide 7. &lt;/div&gt;&lt;div&gt;But how about 10615 divide 1769? &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The first step is to divide 1769 into 1061. It doesn't go, so we grab another &lt;/div&gt;&lt;div&gt;digit. So it becomes 1769 into 10615. I have no idea how many times it goes!&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Either I missed something at school when I studied "long division", or I only &lt;/div&gt;&lt;div&gt;think I learned an algorithm suitable for problems like this. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In fact, the only way I know to do this division is trial and error, i.e. &lt;/div&gt;&lt;div&gt;to go through all the likely quotient candidates from one to nine. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Using a calculator I find it goes 6 times. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I think I intuitively know how to proceed from here. We place the digit 6 &lt;/div&gt;&lt;div&gt;into our quotient, multiply 1769 by 6 and subtract to form a remainder.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;But now we have a problem. 
What if we are dividing 17699949 into 106150000?&lt;/div&gt;&lt;div&gt;Does it still go 6 times? In fact it does not. But how would I know this&lt;/div&gt;&lt;div&gt;without being able to divide 17699949 into 106150000 in the first place!&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;So I have two problems. Firstly, if I simply truncate the numerator and&lt;/div&gt;&lt;div&gt;denominator to a certain number of digits, even the first digit of my &lt;/div&gt;&lt;div&gt;quotient may be wrong, and secondly, if I use more digits in the first place&lt;/div&gt;&lt;div&gt;my intuitive "algorithm" doesn't help. It basically says, guess and refine.&lt;/div&gt;&lt;div&gt;That's all very well, but when my digits are 64 bit words, I may not be so&lt;/div&gt;&lt;div&gt;lucky with the guessing thing.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Now an interesting thing to note in the example above is that no matter how &lt;/div&gt;&lt;div&gt;many digits we add to the dividend, 1769 will always go into the first 5 &lt;/div&gt;&lt;div&gt;digits 10615 of the dividend, 6 times (with some remainder), no more, no &lt;/div&gt;&lt;div&gt;less. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;To see this, we note that 6*1769 is less than 10615 and 7*1769 is greater. &lt;/div&gt;&lt;div&gt;So the only way we could get 1769 to go in 7 times would be to increase &lt;/div&gt;&lt;div&gt;those first five digits of the dividend. Any following digits are irrelevant &lt;/div&gt;&lt;div&gt;to that first digit, 6, of the quotient. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;This is a general feature of integer division, and also applies for the&lt;/div&gt;&lt;div&gt;word based (base B) integer division problem.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;However, adding digits to the divisor is not the same, as the example above &lt;/div&gt;&lt;div&gt;shows. 
In fact 17699949 only goes 5 times into 106150000.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Now the question is, how far off can it be if I truncate the dividend and&lt;/div&gt;&lt;div&gt;divisor? How bad can it get? &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We can answer this as follows. Note that 17699949 &amp;lt; 17700000. So 10615xxxx &lt;div&gt;divided by 1769yyyy is greater than or equal to 10615 divided by 1770, which &lt;/div&gt;&lt;div&gt;happens to be 5. So the answer has to be either 5 or 6 times. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;What I have done here is simply add 1 to the first four digits of the &lt;/div&gt;&lt;div&gt;divisor, to get a lower bound on the first digit of the quotient. In this&lt;/div&gt;&lt;div&gt;way, I can get a small range of possible values for the first digit of the&lt;/div&gt;&lt;div&gt;quotient. Then I can simply search through the possibilities to find the &lt;/div&gt;&lt;div&gt;correct quotient digit. This matches my intuition of the long division &lt;/div&gt;&lt;div&gt;algorithm at least.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;It seems that *perhaps* a quotient digit obtained in this way will be at &lt;/div&gt;&lt;div&gt;most out by 1. In other words, roughly speaking, if X is the first few digits &lt;/div&gt;&lt;div&gt;of the dividend and Y the appropriate number of digits of the divisor (in the&lt;/div&gt;&lt;div&gt;example above,  X = 10615 and Y = 1769), then perhaps  X / Y &gt;= X / (Y + 1)&lt;/div&gt;&lt;div&gt;&gt;= (X / Y) - 1, where by division here, I mean integer division. Let's call &lt;/div&gt;&lt;div&gt;this condition R.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Let's suppose now that we are working on integers written as binary words&lt;/div&gt;&lt;div&gt;instead of decimal digits. Let's also suppose that X / Y &amp;lt; B. 
Let's call&lt;div&gt;this condition S.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Well, it is easy to generate a counterexample to condition R. Let Y = 2, &lt;/div&gt;&lt;div&gt;X = B. Clearly R is not satisfied, even though S is. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;But this seems to only be a problem because Y is so small. So, what if we &lt;/div&gt;&lt;div&gt;constrain Y to be at least B/2?&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;It turns out that condition R still doesn't hold. A counterexample is&lt;/div&gt;&lt;div&gt;Y = B/2, X = B*Y - 3 (with B = 10 this is Y = 5, X = 47: X / Y = 9 but X / (Y + 1) = 7).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;However, I claim that X / (Y + 1) &gt;= (X / Y) - 2 does hold, so long as &lt;/div&gt;&lt;div&gt;condition S holds. Let us call this inequality T.&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;Clearly there does not exist a counterexample to T with X &lt;= Y.&lt;/div&gt;&lt;div&gt;Let us suppose that for a certain Y &gt;= B/2, there is a counterexample to T. &lt;/div&gt;&lt;div&gt;Clearly if X0 is the first such counterexample, then X0 is a multiple of Y, &lt;/div&gt;&lt;div&gt;but not of Y + 1. Nor is X0 - 1 a multiple of Y + 1 (else X0 - 2 would have &lt;/div&gt;&lt;div&gt;been a counterexample).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;It is also clear that every multiple of Y from X0 on is a counterexample, as&lt;/div&gt;&lt;div&gt;the left hand side can never "catch up" again. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Thus we have the following result: if there exists a counterexample to T for &lt;/div&gt;&lt;div&gt;some Y &gt;= B/2 and X &amp;lt; BY, then X = (B - 1)*Y is a counterexample. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;Substituting this value of X into T yields that T holds if and only if&lt;/div&gt;&lt;div&gt;(B - 1)*Y / (Y + 1) &gt;= B - 3 for all Y &gt;= B/2. 
Thus, if we can show that this&lt;/div&gt;&lt;div&gt;last inequality holds for all Y &gt;= B/2 then we have proven T.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Note that this inequality is equivalent to (B - 1)*Y &gt;= (Y + 1)*(B - 3), i.e.&lt;/div&gt;&lt;div&gt;2*Y &gt;= B - 3. However, as Y &gt;= B/2, this holds under our hypotheses. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;So putting all of the above together, suppose we are dividing A by D and that&lt;/div&gt;&lt;div&gt;Y &gt;= B/2 is the leading word of the divisor D and B*Y &gt; X &gt;= Y is the first &lt;/div&gt;&lt;div&gt;word or two words of the numerator A (whichever is appropriate). Then if Q0 &lt;/div&gt;&lt;div&gt;is the leading word of the quotient Q = A / D, then we have shown that&lt;/div&gt;&lt;div&gt;X / Y &gt;= Q0 &gt;= (X / Y) - 2. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In other words, X / Y can be no more than 2 away from the first word of the &lt;/div&gt;&lt;div&gt;quotient Q = A / D that we are after.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;This leads us immediately to an algorithm for computing a quotient of two &lt;/div&gt;&lt;div&gt;multiprecision integers. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;(i) Let Y be the first WORD_BITS bits of the divisor D, so that &lt;/div&gt;&lt;div&gt;B &gt; Y &gt;= B/2. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;(ii) Let X be the first word or two words (appropriately shifted as per Y) of&lt;/div&gt;&lt;div&gt;A such that B*Y &gt; X &gt;= Y.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;(iii) Let Q0 = X/Y. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;(iv) Set R = A - D*Q0 * B^m (for the appropriate m), to get a "remainder". 
&lt;/div&gt;&lt;div&gt;While R &amp;lt; 0, subtract 1 from Q0 and add D * B^m to R.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;(v) Write down Q0 as the first word of the quotient A / D and continue on by &lt;/div&gt;&lt;div&gt;replacing A by R and returning to step (i) to compute the next word of the&lt;/div&gt;&lt;div&gt;quotient, etc.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Another algorithm can be derived as follows. Instead of using Y &gt;= B/2 in the&lt;/div&gt;&lt;div&gt;above algorithm, let's choose Y &gt;= B^2/2. A similar argument to the above&lt;/div&gt;&lt;div&gt;shows that X / (Y + 1) &gt;= (X / Y) - 1 for Y &gt;= B^2/2 and B*Y &gt; X &gt;= Y. It boils&lt;/div&gt;&lt;div&gt;down to showing that (B - 1)*Y / (Y + 1) &gt;= B - 2 for Y &gt;= B^2/2, i.e. that&lt;/div&gt;&lt;div&gt;Y &gt;= B - 2, which is clearly satisfied under our hypotheses.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The algorithm is precisely the same, but at step (iv) we can replace the while&lt;/div&gt;&lt;div&gt;statement with an if statement and perform at most one adjustment to our &lt;/div&gt;&lt;div&gt;quotient Q0 and to R.  &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Now we return to step (v), in which we said that we could just continue&lt;/div&gt;&lt;div&gt;on from step (i) to compute the next word of the quotient Q. If we do this&lt;/div&gt;&lt;div&gt;and set A = R then compute the new X, what we would like is something like &lt;/div&gt;&lt;div&gt;the divrem1 algorithm we implemented, where, (after possibly some kind of &lt;/div&gt;&lt;div&gt;initial iteration that we handled specially), it is always true that the new &lt;/div&gt;&lt;div&gt;X is two words and has its first word reduced mod Y. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;However, this does not follow from 0 &lt;= R &amp;lt; D*B^m, and it may be that the &lt;div&gt;first word of the remainder is equal to Y! 
This is due to the truncation of &lt;/div&gt;&lt;div&gt;D to get Y. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;It is clear from 0 &lt;= R &amp;lt; D*B^m that if X is the first two words of the &lt;div&gt;remainder then the first word of X is at most Y. So to make the algorithm continue in the way we'd &lt;/div&gt;&lt;div&gt;like, we only have to deal with the special case where the first word of X equals Y.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We know that in the first algorithm above, the next word of the quotient may&lt;/div&gt;&lt;div&gt;be B - 1 or B - 2, since we know already that it is not B. We must multiply &lt;/div&gt;&lt;div&gt;out and check which it is.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In the second algorithm above, where Y &gt;= B^2/2, the only possibility for &lt;/div&gt;&lt;div&gt;the next quotient word is B - 1, as we know it is not B. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In the next article we will look at implementing the first of the two&lt;/div&gt;&lt;div&gt;algorithms above, leveraging the code we already have for computing and using &lt;/div&gt;&lt;div&gt;a precomputed inverse. 
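&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;As a rough illustration of steps (i) to (v) of the first algorithm, here is a&lt;/div&gt;&lt;div&gt;small example worked by hand. The values B = 10 (so that the "words" are decimal&lt;/div&gt;&lt;div&gt;digits), A = 4129 and D = 59 are chosen purely for illustration.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;(i) Y = 5, the leading digit of D, so that B &gt; Y &gt;= B/2.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;(ii) X = 41, the first two digits of A, so that B*Y = 50 &gt; X &gt;= Y.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;(iii) Q0 = 41 / 5 = 8.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;(iv) With m = 1, R = 4129 - 59*8*10 = -591. As R &amp;lt; 0, we set Q0 = 7 and&lt;/div&gt;&lt;div&gt;R = -591 + 590 = -1. Again R &amp;lt; 0, so we set Q0 = 6 and R = -1 + 590 = 589.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;(v) We write down 6 as the first digit of the quotient and replace A by 589.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Here the estimate 8 was out by exactly 2, the most that inequality T allows:&lt;/div&gt;&lt;div&gt;X / (Y + 1) = 41 / 6 = 6 = (X / Y) - 2.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;For the next digit, the first digit of 589 is equal to Y = 5, so we are in the&lt;/div&gt;&lt;div&gt;special case. The digit cannot be 10, so we try 9 and find 589 - 59*9 = 58 &gt;= 0.&lt;/div&gt;&lt;div&gt;Thus the quotient is 69 and the remainder is 58, and indeed 59*69 + 58 = 4129.&lt;/div&gt;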
&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Previous article: &lt;a href="http://wbhart.blogspot.com/2010/10/bsdnt-v013-muladd1-muladd.html"&gt;v0.13 - muladd1, muladd&lt;/a&gt;&lt;/div&gt;&lt;div&gt;Next article: &lt;a href="http://wbhart.blogspot.com/2010/10/bsdnt-v014-divremclassical.html"&gt;v0.14 - divrem_classical&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7651115430416156636-2284725690611517301?l=wbhart.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://wbhart.blogspot.com/feeds/2284725690611517301/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://wbhart.blogspot.com/2010/10/bsdnt-divrem-discussion.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7651115430416156636/posts/default/2284725690611517301'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7651115430416156636/posts/default/2284725690611517301'/><link rel='alternate' type='text/html' href='http://wbhart.blogspot.com/2010/10/bsdnt-divrem-discussion.html' title='BSDNT - divrem discussion'/><author><name>William Hart</name><uri>http://www.blogger.com/profile/18416881057216462316</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7651115430416156636.post-3889533803626941228</id><published>2010-10-01T05:13:00.000-07:00</published><updated>2010-10-02T10:02:52.117-07:00</updated><title type='text'>BSDNT - v0.13 muladd1, muladd</title><content type='html'>&lt;div&gt;We noted in the last article that 
multiplication need not take a carry-in&lt;/div&gt;&lt;div&gt;and that it doesn't seem related to the linear functions we have been &lt;/div&gt;&lt;div&gt;implementing.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Instead, we think of multiplication in a different way, as an inverse of &lt;/div&gt;&lt;div&gt;division. We'll soon implement division with remainder, i.e. given a and&lt;/div&gt;&lt;div&gt;d, find q and r such that a = d*q + r, where 0 &amp;lt;= r &amp;lt; d&lt;br /&gt;&lt;/div&gt;&lt;div&gt;If d and q are of lengths m and n respectively, then r is of length at most&lt;/div&gt;&lt;div&gt;m and a is either of length m + n or m + n - 1.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Multiplication corresponds to the case where r = 0. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;As we will certainly not wish to have to zero pad our division out to &lt;/div&gt;&lt;div&gt;m + n limbs if a is in fact only m + n - 1 limbs, it makes sense that our &lt;/div&gt;&lt;div&gt;division will take a carry-in. For this reason, it made sense for our&lt;/div&gt;&lt;div&gt;multiplication to yield a carry-out, i.e. it will write m + n - 1 limbs&lt;/div&gt;&lt;div&gt;and return an m + n - th limb, which may or may not be 0.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;When r is not zero in our division, the inverse operation would take a &lt;/div&gt;&lt;div&gt;value r of n limbs and add d*q to it, returning the result as a value a of &lt;/div&gt;&lt;div&gt;m + n - 1 limbs (and a carry which may or may not be zero). We call this &lt;/div&gt;&lt;div&gt;routine muladd. 
In the case where r and a happen to be aliased, the result &lt;/div&gt;&lt;div&gt;will be written out, overwriting a.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We first implement nn_muladd1 which is identical to addmul1 except that &lt;/div&gt;&lt;div&gt;muladd1 writes its result out to a location not necessarily coincident with&lt;/div&gt;&lt;div&gt;any of the inputs. In other words, nn_addmul1 is an in-place operation &lt;/div&gt;&lt;div&gt;whereas nn_muladd1 is not.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Next we implement nn_muladd_classical. It takes an argument a of length m&lt;/div&gt;&lt;div&gt;to which it adds the product of b of length m and c of length n. The result&lt;/div&gt;&lt;div&gt;may or may not alias with a.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We also implement a version of the linear and quadratic muladds which writes &lt;/div&gt;&lt;div&gt;out the carry, naming the functions with the nn_s_ prefix, as per the &lt;/div&gt;&lt;div&gt;convention initiated in our nn_linear module.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;As for multiplication, we don't allow zero lengths. This dramatically &lt;/div&gt;&lt;div&gt;simplifies the code.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;An important application of the muladd function will be in chaining full &lt;/div&gt;&lt;div&gt;multiplication. 
We'll discuss this when we get to dealing with multiplication&lt;/div&gt;&lt;div&gt;routines with better than quadratic complexity.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Our test code simply checks that muladd gives the same result as a multiplication&lt;/div&gt;&lt;div&gt;followed by an addition.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The github repo for this post is here: &lt;a href="http://github.com/wbhart/bsdnt/tree/v0.13"&gt;v0.13&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Previous article: &lt;a href="http://wbhart.blogspot.com/2010/09/bsdnt-configure-and-assembly.html"&gt;configure and assembly&lt;/a&gt;&lt;/div&gt;&lt;div&gt;Next article: &lt;a href="http://wbhart.blogspot.com/2010/10/bsdnt-divrem-discussion.html"&gt;divrem discussion&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7651115430416156636-3889533803626941228?l=wbhart.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://wbhart.blogspot.com/feeds/3889533803626941228/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://wbhart.blogspot.com/2010/10/bsdnt-v013-muladd1-muladd.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7651115430416156636/posts/default/3889533803626941228'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7651115430416156636/posts/default/3889533803626941228'/><link rel='alternate' type='text/html' href='http://wbhart.blogspot.com/2010/10/bsdnt-v013-muladd1-muladd.html' title='BSDNT - v0.13 muladd1, muladd'/><author><name>William Hart</name><uri>http://www.blogger.com/profile/18416881057216462316</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7651115430416156636.post-7033089040351726185</id><published>2010-09-25T12:59:00.000-07:00</published><updated>2010-10-01T05:43:49.545-07:00</updated><title type='text'>BSDNT - configure and assembly improvements</title><content type='html'>&lt;div&gt;It's time we added architecture dependent assembly support to bsdnt.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Here is how we are going to do it. For each of the implementation files&lt;/div&gt;&lt;div&gt;we have (nn_linear.c, nn_quadratic.c), we are going to add a _arch file,&lt;/div&gt;&lt;div&gt;e.g. nn_linear_arch.h, nn_quadratic_arch.h, etc. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;This file will be #included in the relevant implementation file, e.g.&lt;/div&gt;&lt;div&gt;nn_linear_arch.h will be #included in nn_linear.c.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;These _arch.h files will be generated by a configure script, based on &lt;/div&gt;&lt;div&gt;the CPU architecture and operating system kernel. They will merely&lt;/div&gt;&lt;div&gt;include a list of architecture specific .h files in an arch directory.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;For example we might have nn_linear_x86_64_core2.h in the arch directory, &lt;/div&gt;&lt;div&gt;which provides routines specific to core2 processors running in 64 bit &lt;/div&gt;&lt;div&gt;mode.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In these architecture specific files, we'll have inline assembly routines&lt;/div&gt;&lt;div&gt;designed to replace various routines in the pure C implementation files&lt;/div&gt;&lt;div&gt;that we have already written. They'll do this by defining flags, e.g.&lt;/div&gt;&lt;div&gt;HAVE_ARCH_nn_mul1_c, which will specify that an architecture specific&lt;/div&gt;&lt;div&gt;version of nn_mul1_c is available. 
We'll then wrap the implementation of&lt;/div&gt;&lt;div&gt;nn_mul1_c in nn_linear.c with a test for this flag. If the flag is defined,&lt;/div&gt;&lt;div&gt;the C version will not be compiled.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In order to make this work, the configure script has to work out whether&lt;/div&gt;&lt;div&gt;the machine is 32 or 64 bit and what the CPU type is. It will then link &lt;/div&gt;&lt;div&gt;in the correct architecture specific files.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;At the present moment, we are only interested in x86 machines running on&lt;/div&gt;&lt;div&gt;*nix (or Windows, but the architecture will be determined in a different&lt;/div&gt;&lt;div&gt;way on Windows).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;A standard way of determining whether the kernel is 64 bit or not is to&lt;/div&gt;&lt;div&gt;search for the string x86_64 in the output of uname -m. If something else&lt;/div&gt;&lt;div&gt;pops out then it is probably a 32 bit machine.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Once we know whether we have a 32 or 64 bit machine, we can determine the&lt;/div&gt;&lt;div&gt;exact processor by using the cpuid instruction. This is an assembly &lt;/div&gt;&lt;div&gt;instruction supported by x86 cpus which tells you the manufacturer, family&lt;/div&gt;&lt;div&gt;and model of the CPU. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We include a small C program cpuid.c with some inline assembly to call the &lt;/div&gt;&lt;div&gt;cpuid instruction. As this program will only ever be run on *nix, we can&lt;/div&gt;&lt;div&gt;make use of gcc's inline assembly feature.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;When the parameter to the cpuid instruction is 0 we get the vendor ID,&lt;/div&gt;&lt;div&gt;which is a 12 character string. 
We are only interested in "AuthenticAMD"&lt;/div&gt;&lt;div&gt;and "GenuineIntel" at this point.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;When we pass the parameter 1 to the cpuid instruction, we get the processor&lt;/div&gt;&lt;div&gt;model and family. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;For Intel processors, table 2-3 in the following document gives information&lt;/div&gt;&lt;div&gt;about what the processor is:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;http://www.intel.com/Assets/PDF/appnote/241618.pdf&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;However the information is out of date. Simply googling for Intel Family 6&lt;/div&gt;&lt;div&gt;Model XX reveals other models that are not listed in the Intel documentation.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The information for AMD processors is a little harder to come by. However,&lt;/div&gt;&lt;div&gt;one can essentially extract the information from the revision guides, though&lt;/div&gt;&lt;div&gt;it isn't spelled out clearly:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;http://developer.amd.com/documentation/guides/Pages/default.aspx#revision_Guides&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;It seems AMD only list recent processors here, and they are all 64 bit. &lt;/div&gt;&lt;div&gt;Information on 32 bit processors can be found here:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;http://www.sandpile.org/ia32/cpuid.htm&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;At this point we'd like to identify numerous different architectures. We &lt;/div&gt;&lt;div&gt;aren't interested in 32 bit architectures, such as the now ancient Pentium &lt;/div&gt;&lt;div&gt;4 or AMD's K7. Instead, we are interested when 32 bit operating system &lt;/div&gt;&lt;div&gt;kernels are running on 64 bit machines. Thus all 32 bit CPUs simply identify&lt;/div&gt;&lt;div&gt;as x86. 
&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;There are numerous 64 bit processors we are interested in: &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;64 bit Pentium 4 CPUs were released until August 2008. We identify them as p4. &lt;/div&gt;&lt;div&gt;All the 64 bit ones support SSE2 and SSE3 and are based on the netburst &lt;/div&gt;&lt;div&gt;technology. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The Intel Core Solo and Core Duo processors were 32 bit and do not interest us.&lt;/div&gt;&lt;div&gt;They were an enhanced version of the p6 architecture. They get branded as x86&lt;/div&gt;&lt;div&gt;for which only generic 32 bit assembly code will be available, if any.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Core 2's are very common. They will identify as core2. They all support SSE2,&lt;/div&gt;&lt;div&gt;SSE3 and SSSE3 (the Penryn and following 45nm processors support SSE4.1 - we&lt;/div&gt;&lt;div&gt;don't distinguish these at this stage).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Atoms are a low voltage processor from Intel which support SSE2, SSE3 and are&lt;/div&gt;&lt;div&gt;mostly 64 bit. We identify them as atom.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;More recently Intel has released Core i3, i5, i7 processors, based on the&lt;/div&gt;&lt;div&gt;Nehalem architecture. We identify these as nehalem. They support SSE2, SSE3,&lt;/div&gt;&lt;div&gt;SSSE3, SSE4.1 and SSE4.2.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;AMD K8's are still available today. They support SSE2 and SSE3. We identify&lt;/div&gt;&lt;div&gt;them as k8.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;AMD K10's are a more recent core from AMD. They support SSE2, SSE3 and SSE4a. &lt;/div&gt;&lt;div&gt;We identify these as k10. There are three streams of AMD K10 processors, &lt;/div&gt;&lt;div&gt;Phenom, Phenom-II and Athlon-II. 
We don't distinguish these at this point.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;So in summary, our configure script first identifies whether we have a 32 or&lt;/div&gt;&lt;div&gt;64 bit *nix kernel. Then the CPU is identified as one of x86, p4, core2, atom, &lt;/div&gt;&lt;div&gt;nehalem, k8 or k10, where x86 simply means that it is some kind of 32 bit CPU.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Our configure script then links in architecture specific files as appropriate. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The only assembly code we include so far is the nn_add_mc functions we wrote&lt;/div&gt;&lt;div&gt;for 64 bit core2 and k10. As these are better than nothing on other 64 bit&lt;/div&gt;&lt;div&gt;processors from the same manufacturers, we include this code in the k8&lt;/div&gt;&lt;div&gt;specific files until we write versions for each processor. We also add an&lt;/div&gt;&lt;div&gt;nn_sub_mc assembly file for Intel and AMD 64 bit processors.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The configure script includes the architecture specific .h files starting from&lt;/div&gt;&lt;div&gt;the most recent processors so that code for earlier processors is not used&lt;/div&gt;&lt;div&gt;when something more recent is available.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The github branch for this revision is here: &lt;a href="http://github.com/wbhart/bsdnt/tree/asm2"&gt;asm2&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Previous article: &lt;a href="http://wbhart.blogspot.com/2010/09/bsdnt-v012-mulclassical.html"&gt;v0.12 mul_classical&lt;/a&gt;&lt;/div&gt;&lt;div&gt;Next article: &lt;a href="http://wbhart.blogspot.com/2010/10/bsdnt-v013-muladd1-muladd.html"&gt;v0.13 - muladd1, muladd&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' 
src='https://blogger.googleusercontent.com/tracker/7651115430416156636-7033089040351726185?l=wbhart.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://wbhart.blogspot.com/feeds/7033089040351726185/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://wbhart.blogspot.com/2010/09/bsdnt-configure-and-assembly.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7651115430416156636/posts/default/7033089040351726185'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7651115430416156636/posts/default/7033089040351726185'/><link rel='alternate' type='text/html' href='http://wbhart.blogspot.com/2010/09/bsdnt-configure-and-assembly.html' title='BSDNT - configure and assembly improvements'/><author><name>William Hart</name><uri>http://www.blogger.com/profile/18416881057216462316</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7651115430416156636.post-7586441099463741287</id><published>2010-09-24T12:25:00.000-07:00</published><updated>2010-09-25T13:04:15.755-07:00</updated><title type='text'>BSDNT - v0.12 mul_classical</title><content type='html'>&lt;div&gt;I've been on holidays for a couple of days, hence the break in transmission.&lt;/div&gt;&lt;div&gt;(For academics: the definition of a holiday is a day where you actually&lt;/div&gt;&lt;div&gt;stop working and do nothing work related for that whole day. 
I realise the&lt;/div&gt;&lt;div&gt;definition is not widely known, nor is it well-defined.)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Note, due to accidentally including some files in today's release that I&lt;/div&gt;&lt;div&gt;intended to hold over till next time, you have to do ./configure before&lt;/div&gt;&lt;div&gt;make check. This detects your CPU and links in any assembly code&lt;/div&gt;&lt;div&gt;that is relevant for it. More on this when I do the actual blog post about it.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;It's time to start implementing quadratic algorithms (i.e. those that&lt;/div&gt;&lt;div&gt;take time O(n^2) to run, such as classical multiplication and division).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Before we do this, we are going to reorganise slightly. The file which&lt;/div&gt;&lt;div&gt;is currently called nn.c we will rename nn_linear.c to indicate that it&lt;/div&gt;&lt;div&gt;contains our linear functions. We'll also rename t-nn.c to t-nn_linear.c.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The macros in that file will be moved into test.h.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We also modify the makefile to build all .c files in the current directory&lt;/div&gt;&lt;div&gt;and to run all tests when we do make check. A new directory "test" will&lt;/div&gt;&lt;div&gt;hold our test files.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Finally, we add an nn_quadratic.c and test/t-nn_quadratic.c to hold the&lt;/div&gt;&lt;div&gt;new quadratic routines and test code that we are about to write.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The first routine we want is nn_mul_classical. This is simply a mul1&lt;/div&gt;&lt;div&gt;followed by addmul1's. 
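&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;As a rough illustration of that structure, here is a Python sketch of a classical multiplication built from one mul1 and a chain of addmul1's. It is not the bsdnt C source: it assumes 64 bit limbs stored least significant first in Python lists, and the function signatures are chosen for the example only.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;pre&gt;
```python
B = 2**64  # assumed word base: each limb holds a value in [0, B)

def mul1(a, c):
    """Multiply the limbs a by the single word c.
    Returns (product limbs, carry-out word)."""
    out, carry = [], 0
    for limb in a:
        t = limb * c + carry
        out.append(t % B)
        carry = t // B
    return out, carry

def addmul1(acc, a, c):
    """Add a * c into acc in place (len(acc) == len(a)).
    Returns the carry-out word."""
    carry = 0
    for i, limb in enumerate(a):
        t = acc[i] + limb * c + carry
        acc[i] = t % B
        carry = t // B
    return carry

def mul_classical(a, b):
    """Schoolbook product of a (m limbs) and b (n limbs), both non-empty.
    Returns (low m + n - 1 limbs, top limb returned as a carry-out)."""
    m, n = len(a), len(b)
    r, carry = mul1(a, b[0])
    r = r + [0] * (n - 1)        # room for m + n - 1 limbs in total
    for j in range(1, n):
        r[m + j - 1] = carry     # the previous carry becomes the next limb
        window = r[j:j + m]      # the m limbs this row is added into
        carry = addmul1(window, a, b[j])
        r[j:j + m] = window
    return r, carry
```
&lt;/pre&gt;&lt;div&gt;Zero lengths are excluded here for the same reason as in the text: the first row always exists, so no special cases are needed.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;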
Of course this will be horrendously slow, but once&lt;/div&gt;&lt;div&gt;again we defer speeding it up until we start adding assembly routines,&lt;/div&gt;&lt;div&gt;which will start happening shortly.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In a departure from our linear functions, we do not allow zero lengths. &lt;/div&gt;&lt;div&gt;This dramatically simplifies the code and means we do not have to check &lt;/div&gt;&lt;div&gt;for special cases. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;There does not appear to be any reason to allow a carry-in to our &lt;/div&gt;&lt;div&gt;multiplication routine. An argument can be made for it on the basis of&lt;/div&gt;&lt;div&gt;consistency with mul1. However, the main use for carry-in's and carry-out's&lt;/div&gt;&lt;div&gt;thus far has been for chaining. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;For example, we chained mul1's s*c and t*c for s and t of length m and &lt;/div&gt;&lt;div&gt;c a single word in order to compute (s*B^m + t)*c.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;But the analogue in the case of full multiplication would seem to be &lt;/div&gt;&lt;div&gt;chaining to compute (s*B^m + t)*c where s, t and c all have length m. But&lt;/div&gt;&lt;div&gt;it probably doesn't make sense to chain full multiplications in this way&lt;/div&gt;&lt;div&gt;as it would involve splitting the full product t*c say, into two separate &lt;/div&gt;&lt;div&gt;parts of length m, which amongst other things, would be inefficient.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;It might actually make more sense to pass a whole array of carries to our&lt;/div&gt;&lt;div&gt;multiplication routine, one for every word of the multiplier. However it is &lt;/div&gt;&lt;div&gt;not clear what use this would be. 
So, for now at least, we pass no carry-ins&lt;/div&gt;&lt;div&gt;to our multiplication routine.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We settle on allowing a single word of carry out. This may be zero if the &lt;/div&gt;&lt;div&gt;leading words of the multiplicands are small enough. Rather than write this &lt;/div&gt;&lt;div&gt;extra word out, we simply return it as a carry so that the caller can decide &lt;/div&gt;&lt;div&gt;whether to write it.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The classical multiplication routine is quite straightforward, but that plus the&lt;/div&gt;&lt;div&gt;rearrangement we've done is more than enough for one day.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The github branch for this release is here: &lt;a href="http://github.com/wbhart/bsdnt/tree/v0.12"&gt;v0.12&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Previous article: &lt;a href="http://wbhart.blogspot.com/2010/09/bsdnt-v011-generic-test-code.html"&gt;v0.11 - generic test code&lt;/a&gt;&lt;/div&gt;&lt;div&gt;Next article: &lt;a href="http://wbhart.blogspot.com/2010/09/bsdnt-configure-and-assembly.html"&gt;configure and assembly&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7651115430416156636-7586441099463741287?l=wbhart.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://wbhart.blogspot.com/feeds/7586441099463741287/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://wbhart.blogspot.com/2010/09/bsdnt-v012-mulclassical.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7651115430416156636/posts/default/7586441099463741287'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/7651115430416156636/posts/default/7586441099463741287'/><link rel='alternate' type='text/html' href='http://wbhart.blogspot.com/2010/09/bsdnt-v012-mulclassical.html' title='BSDNT - v0.12 mul_classical'/><author><name>William Hart</name><uri>http://www.blogger.com/profile/18416881057216462316</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7651115430416156636.post-4572677179173436955</id><published>2010-09-20T18:37:00.000-07:00</published><updated>2010-09-24T12:39:15.770-07:00</updated><title type='text'>BSDNT - v0.11 generic test code</title><content type='html'>It's time we improved our test framework for all the functions we are&lt;br /&gt;adding, before things get out-of-control.&lt;br /&gt;&lt;br /&gt;The best strategy is to actually write an interpreter for a small&lt;br /&gt;expression parser which allows us to test arbitrary expressions.&lt;br /&gt;However, that would be a distraction at this point, so we opt to&lt;br /&gt;simply add some convenience functions.&lt;br /&gt;&lt;br /&gt;In particular, we'll add some new random functions which generate&lt;br /&gt;lists of random words and random garbage collected nn's. Our only aim&lt;br /&gt;here is to simplify our test code and cut down the number of lines of&lt;br /&gt;it.&lt;br /&gt;&lt;br /&gt;We don't want to make it *more* complicated for people to use, and&lt;br /&gt;C89 places some hard limits on us in that regard.&lt;br /&gt;&lt;br /&gt;The first step is to add two new files, test.c and test.h which&lt;br /&gt;will contain our garbage collection and random functions.&lt;br /&gt;&lt;br /&gt;We begin by adding an enum type_t, which parameterises all the different&lt;br /&gt;types our garbage collection can clean up. 
At the moment we only need&lt;br /&gt;one type, NN.&lt;br /&gt;&lt;br /&gt;Next we add a variadic function (variable number of arguments) which&lt;br /&gt;creates a whole bunch of random words. Unfortunately C89 doesn't support&lt;br /&gt;variadic macros, so we need to pass pointers to the words, so that the&lt;br /&gt;actual words can be modified.&lt;br /&gt;&lt;br /&gt;This random function takes a flag which specifies what type of random&lt;br /&gt;number we want. Specifically we'll have flags for ODD, NONZERO and ANY.&lt;br /&gt;&lt;br /&gt;C is also so stupid that we actually need to tell it explicitly, or mark&lt;br /&gt;somehow, the number of arguments we are giving to the variadic function.&lt;br /&gt;In C99 one can work around this by wrapping a variadic function with a&lt;br /&gt;variadic macro. But we don't have that facility here. We choose to simply&lt;br /&gt;pass NULL as the final argument to variadic functions.&lt;br /&gt;&lt;br /&gt;We also introduce a global garbage stack which will keep a pointer to all&lt;br /&gt;the objects allocated so far, so that a garbage collector can clean them&lt;br /&gt;up later on, when called. The fact that this garbage stack is global makes it not&lt;br /&gt;threadsafe, but we are only using the test support for test code at this&lt;br /&gt;point and so we don't care about thread safety. Later we could pass a&lt;br /&gt;context around as a parameter if we wanted to make it threadsafe.&lt;br /&gt;&lt;br /&gt;We also add a gc_cleanup function, which cleans away all the garbage, so&lt;br /&gt;that memory leaks don't occur. 
Later we could expand this function to do&lt;br /&gt;redzone checking and various other automatic tests on the garbage it is&lt;br /&gt;disposing of.&lt;br /&gt;&lt;br /&gt;Now we can add some convenience functions for generating random&lt;br /&gt;multiprecision integers.&lt;br /&gt;&lt;br /&gt;The randoms_of_len function again takes a NULL terminated list of&lt;br /&gt;*pointers* to nn_t's and both initialises them to the given length and&lt;br /&gt;sets them to random limbs. We can specify various flags here, such as&lt;br /&gt;ANY, FULL, the latter returning a multiprecision integer with exactly&lt;br /&gt;the given number of limbs, with the top limb being nonzero (unless&lt;br /&gt;the number of limbs requested is zero).&lt;br /&gt;&lt;br /&gt;Now that we have all this, in our test file test.h, we define a macro&lt;br /&gt;TEST_START which takes a number of iterations and simply sets up a loop&lt;br /&gt;with that many iterations. A TEST_END function then calls garbage collection&lt;br /&gt;to clean up at the end.&lt;br /&gt;&lt;br /&gt;Finally, we add a test_generics function, which generates some random values&lt;br /&gt;using our new generics and then cleans up.&lt;br /&gt;&lt;br /&gt;By running a program called valgrind over our test code, we can see if all&lt;br /&gt;the memory used by our generics actually got cleaned up after use.&lt;br /&gt;&lt;br /&gt;We now roll out these changes to the rest of our test code. 
We note that the&lt;div&gt;entire test file is now about two-thirds of the length it was before we rewrote&lt;/div&gt;&lt;div&gt;it!&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The new branch is on github: &lt;a href="http://github.com/wbhart/bsdnt/tree/v0.11"&gt;v0.11&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Previous article: &lt;a href="http://wbhart.blogspot.com/2010/09/bsdnt-v010-mod1preinv_19.html"&gt;v0.10 - mod1_preinv&lt;/a&gt;&lt;/div&gt;&lt;div&gt;Next article: &lt;a href="http://wbhart.blogspot.com/2010/09/bsdnt-v012-mulclassical.html"&gt;v0.12 - mul_classical&lt;/a&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7651115430416156636-4572677179173436955?l=wbhart.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://wbhart.blogspot.com/feeds/4572677179173436955/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://wbhart.blogspot.com/2010/09/bsdnt-v011-generic-test-code.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7651115430416156636/posts/default/4572677179173436955'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7651115430416156636/posts/default/4572677179173436955'/><link rel='alternate' type='text/html' href='http://wbhart.blogspot.com/2010/09/bsdnt-v011-generic-test-code.html' title='BSDNT - v0.11 generic test code'/><author><name>William Hart</name><uri>http://www.blogger.com/profile/18416881057216462316</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7651115430416156636.post-1350517570259490282</id><published>2010-09-19T07:49:00.000-07:00</published><updated>2010-09-20T18:58:15.555-07:00</updated><title type='text'>BSDNT - v0.10 mod1_preinv</title><content type='html'>&lt;div&gt;We have one more linear function to implement before we move on to some&lt;/div&gt;&lt;div&gt;assembly language optimisation and then quadratic functions, namely mod1. &lt;/div&gt;&lt;div&gt;It returns the remainder after division by a single word. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;One might think that this cannot be computed any faster than divrem1,&lt;/div&gt;&lt;div&gt;however the following algorithm allows us to perform the computation &lt;/div&gt;&lt;div&gt;slightly faster.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The algorithm is credited to Peter Montgomery. It is based on the following&lt;/div&gt;&lt;div&gt;observation. Suppose we have computed,  in advance, the values&lt;/div&gt;&lt;div&gt;b2 = B^2 mod d and b3 = B^3 mod d.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Then a number of the form a0 + a1 * B + a2 * B^2 + a3 * B^3 can be&lt;/div&gt;&lt;div&gt;reduced mod d by computing s = a0 + a1 * B + a2 * b2 + a3 * b3, then&lt;/div&gt;&lt;div&gt;later reducing s mod d. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Note that as d can be at most B - 1, the values bi are at most B - 2. &lt;/div&gt;&lt;div&gt;The values ai are at most B - 1. Thus ai * bi is at most B^2 - 3*B + 2.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;This means that when summing up s, we are adding at most B^2 - 3*B + 2&lt;/div&gt;&lt;div&gt;to B^2 - 1 at each step. The total is at most 2*B^2 - 3*B + 1. In&lt;/div&gt;&lt;div&gt;other words, there might be a carry of *1* into a third limb. But we&lt;/div&gt;&lt;div&gt;already know that b2 = B^2 mod d. 
Thus we can get rid of this 1 from&lt;/div&gt;&lt;div&gt;the third limb by adding b2 to our total in its place. The total will&lt;/div&gt;&lt;div&gt;then be at most B^2 - 2*B - 1, which fits easily into two limbs.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We may need to do this adjustment twice when summing up. It is an &lt;/div&gt;&lt;div&gt;unpredicted branch to do this correction, but with a deep enough pipeline&lt;/div&gt;&lt;div&gt;it is possible the processor will compute both paths, minimising the cost&lt;/div&gt;&lt;div&gt;of a misprediction. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Thus, with some precomputation, we can reduce four limbs of our dividend&lt;/div&gt;&lt;div&gt;to two limbs with 2 multiplications. Another advantage is that these &lt;/div&gt;&lt;div&gt;multiplications are independent. This gives the processor lots of &lt;/div&gt;&lt;div&gt;opportunity to pipeline them and compute them quickly.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;One tricky implementation detail is in testing when s overflows two limbs.&lt;/div&gt;&lt;div&gt;We can accumulate into three limbs, but this is not quite efficient.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;One way of testing for overflow of a double word addition is to check if &lt;/div&gt;&lt;div&gt;the result is smaller than one of the summands. If so, an overflow &lt;/div&gt;&lt;div&gt;occurred and we can make our adjustment. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In the case where d is at most B/3, we can make another optimisation. Also &lt;/div&gt;&lt;div&gt;compute b1 = B mod d and perform a third multiplication a1 * b1. We have &lt;/div&gt;&lt;div&gt;that bi is less than B/3. Thus the three products ai*bi are at most (B/3 - 1)*(B - 1)&lt;div&gt;= B^2/3 - 4*B/3 + 1. Clearly s is now at most B^2 - 3*B + 2 and no overflow &lt;/div&gt;&lt;div&gt;occurs. 
Therefore no tests or adjustments are required to compute s.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I will leave this optimised case as an exercise and just implement the&lt;/div&gt;&lt;div&gt;generic case for now.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;At the end of the algorithm we must reduce a double word mod d. This could&lt;/div&gt;&lt;div&gt;be optimised with a precomputed inverse a la divrem1, but again I skip this&lt;/div&gt;&lt;div&gt;optimisation and leave it as an exercise.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;At the start of the algorithm we have to set up differently depending on &lt;/div&gt;&lt;div&gt;whether there are an even or odd number of words (including the carry-in).&lt;/div&gt;&lt;div&gt;If there is an odd number in total, we need to do a single reduction of&lt;/div&gt;&lt;div&gt;the extra word so that an even number remain.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We add a new mod_preinv1_t type and a function precompute_mod_inverse1&lt;/div&gt;&lt;div&gt;to compute it. This computes B, B^2 and B^3 mod d. For now this precomputation&lt;/div&gt;&lt;div&gt;is not optimised to use a precomputed inverse. This would actually save &lt;/div&gt;&lt;div&gt;significant time, but again I leave it as an exercise to optimise this.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The end result looks a little messy compared to the algorithms implemented&lt;/div&gt;&lt;div&gt;so far. 
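&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;For readers who want to experiment, the generic case described above can be sketched in a few lines of Python. This is an illustration only, not the bsdnt implementation: it assumes 64 bit limbs stored least significant first, a nonzero single word d, and at least one limb of input. It ignores the carry-in, and it handles an odd number of limbs by padding with a zero limb rather than by a separate single-word reduction. The names add_fold and mod1 are made up for the example.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;pre&gt;
```python
B = 2**64  # assumed word base: each limb holds a value in [0, B)

def add_fold(s, t, b2):
    """Add t to the two-limb value s. If the sum carries into a third
    limb, replace the carried B^2 by b2, which has the same residue mod d."""
    s += t
    if s // (B * B):           # carry of 1 into a third limb
        s = s - B * B + b2     # the result fits in two limbs again
    return s

def mod1(a, d):
    """Remainder of the limbs a modulo the nonzero word d, reducing
    four limbs to two per step with b2 = B^2 mod d and b3 = B^3 mod d."""
    b2 = B * B % d             # precomputation, not optimised
    b3 = b2 * B % d
    limbs = list(a)
    if len(limbs) % 2:
        limbs.append(0)        # pad so that an even number of limbs remain
    s = limbs[-2] + limbs[-1] * B      # running two-limb value
    for i in range(len(limbs) - 4, -1, -2):
        s0, s1 = s % B, s // B
        # a0 + a1*B + s0*B^2 + s1*B^3 == a0 + a1*B + s0*b2 + s1*b3 (mod d)
        s = limbs[i] + limbs[i + 1] * B
        s = add_fold(s, s0 * b2, b2)   # the two products are independent
        s = add_fold(s, s1 * b3, b2)
    return s % d               # final double word reduction
```
&lt;/pre&gt;&lt;div&gt;The two calls to add_fold correspond to the two possible adjustments mentioned earlier. Python's big integers hide the overflow test, which in C would be the comparison of the sum against one of the summands.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;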
I wonder if anyone can find any simplifications of the code.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Here &lt;a href="http://github.com/wbhart/bsdnt/tree/v0.10"&gt;v0.10&lt;/a&gt; is the github branch for this post.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Previous article: &lt;a href="http://wbhart.blogspot.com/2010/09/bsdnt-v09-divrem1hensel_18.html"&gt;v0.9 - divrem1_hensel&lt;/a&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;Next article: &lt;a href="http://wbhart.blogspot.com/2010/09/bsdnt-v011-generic-test-code.html"&gt;v0.11 - generic test code&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7651115430416156636-1350517570259490282?l=wbhart.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://wbhart.blogspot.com/feeds/1350517570259490282/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://wbhart.blogspot.com/2010/09/bsdnt-v010-mod1preinv_19.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7651115430416156636/posts/default/1350517570259490282'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7651115430416156636/posts/default/1350517570259490282'/><link rel='alternate' type='text/html' href='http://wbhart.blogspot.com/2010/09/bsdnt-v010-mod1preinv_19.html' title='BSDNT - v0.10 mod1_preinv'/><author><name>William Hart</name><uri>http://www.blogger.com/profile/18416881057216462316</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7651115430416156636.post-6812214516594038848</id><published>2010-09-18T05:11:00.000-07:00</published><updated>2010-09-19T07:53:17.018-07:00</updated><title type='text'>BSDNT - v0.9 divrem1_hensel</title><content type='html'>&lt;div&gt;The next division function we'll introduce is so-called Hensel division, or&lt;/div&gt;&lt;div&gt;right-to-left division. This starts at the least significant word and applies&lt;/div&gt;&lt;div&gt;the usual division algorithm back towards the most significant word.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Of course, now any "remainder" will be at the most significant word end and &lt;/div&gt;&lt;div&gt;have a completely different meaning.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Hensel division is actually just division mod B^m. Thus if q is the Hensel&lt;/div&gt;&lt;div&gt;quotient, you have a = q * d mod B^m. In fact, if r is the Hensel remainder, &lt;/div&gt;&lt;div&gt;you have a = q*d + r*B^m.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;If a fits in m words and the division is exact (i.e. a = q*d), the Hensel&lt;/div&gt;&lt;div&gt;division and ordinary (euclidean) division return the same quotient. If the&lt;/div&gt;&lt;div&gt;division is not exact, this is no longer the case.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;However, it may be that chaining Hensel divisions together does produce an &lt;/div&gt;&lt;div&gt;exact division, even when division over the bottom part of the chain &lt;/div&gt;&lt;div&gt;wouldn't produce an exact division. Thus, the ability to chain Hensel &lt;/div&gt;&lt;div&gt;divisions, and thus the ability to return the Hensel remainder, is &lt;/div&gt;&lt;div&gt;important.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Well, now we come to the first issue if we want to implement this. 
What is &lt;/div&gt;&lt;div&gt;the C operator for Hensel division? Actually, there doesn't seem to be one.&lt;/div&gt;&lt;div&gt;We have to implement it ourselves.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;If a0 is a word from our dividend, and d is our single word divisor, then &lt;/div&gt;&lt;div&gt;we require q0 such that d * q0 = a0 mod B. In other words, we want q0 and&lt;/div&gt;&lt;div&gt;r0 such that d * q0 = a0 + r0*B.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Suppose we can find a value dinv such that dinv * d = 1 mod B. Then &lt;/div&gt;&lt;div&gt;we have that dinv * d * q0 = dinv * a0 mod B, i.e. q0 = dinv * a0 mod B,&lt;/div&gt;&lt;div&gt;which is a single word-by-word mullow. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Once we have q0 we can compute d * q0, the high word of which is the Hensel&lt;/div&gt;&lt;div&gt;"remainder" r0. We need to subtract this from the next limb of our dividend&lt;/div&gt;&lt;div&gt;before continuing. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We'll make the convention that carry-ins and carry-outs from our Hensel&lt;/div&gt;&lt;div&gt;division are positive rather than negative. We'll just remember to subtract &lt;/div&gt;&lt;div&gt;them. This way, if q is our Hensel quotient and r our Hensel remainder, &lt;/div&gt;&lt;div&gt;then d * q = a + r * B^m.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;So how do we compute dinv? Firstly, we had better require that d is odd, else&lt;/div&gt;&lt;div&gt;dinv * d = 1 mod B is an impossibility. With this restriction, we have &lt;/div&gt;&lt;div&gt;gcd(d, B) = 1 and therefore the extended euclidean algorithm lets us find &lt;/div&gt;&lt;div&gt;s, t such that s*d + t*B = 1. Then modulo B we have s*d = 1 mod B, i.e. &lt;/div&gt;&lt;div&gt;dinv = s is the precomputed inverse that we are after.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Another way to solve dinv * d = 1 mod B is to use Hensel lifting. 
This &lt;/div&gt;&lt;div&gt;works by first solving the equation mod 2, then using that solution to &lt;/div&gt;&lt;div&gt;solve it mod 4, then mod 16, and so on. This can all be done with nothing more than &lt;/div&gt;&lt;div&gt;multiplications. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Firstly, we know a solution mod 2, namely 1 (as d is odd), i.e. if v1 = 1&lt;/div&gt;&lt;div&gt;then we have v1 * d = 1 mod 2. Now suppose we have a solution vk mod 2^k, &lt;/div&gt;&lt;div&gt;i.e. vk * d = 1 mod 2^k. We'd like to find a value wk such that &lt;/div&gt;&lt;div&gt;(vk + 2^k * wk) * d = 1 mod 2^(2k). This is always possible -- I omit the&lt;/div&gt;&lt;div&gt;easy proof -- so we only need to solve this equation for wk, i.e. &lt;/div&gt;&lt;div&gt;(d * vk - 1) = -2^k * wk * d (mod 2^(2k)). We can get rid of the d on the right hand&lt;/div&gt;&lt;div&gt;side by multiplying by vk: since d * vk = 1 mod 2^k, we have 2^k * wk * d * vk = 2^k * wk (mod 2^(2k)), i.e. 2^k * wk = ((1 - d * vk) * vk) (mod 2^(2k)). &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In other words, two multiplications will suffice to compute &lt;/div&gt;&lt;div&gt;2^k * wk (mod 2^(2k)) from vk. Then v{2k} = vk + 2^k * wk is the inverse &lt;/div&gt;&lt;div&gt;of d mod 2^(2k).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;As a further optimisation of all this, one can just work mod 2^64 all&lt;/div&gt;&lt;div&gt;the way through the computation. In other words, start with v = 1 &lt;/div&gt;&lt;div&gt;and replace v by v + ((1 - d * v) * v) (mod 2^64), six times.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;To get from a solution mod 2 to a solution mod 2^64 clearly only takes&lt;/div&gt;&lt;div&gt;six Hensel lifting steps. 
Of course, with a lookup table mod 2^8 to start&lt;/div&gt;&lt;div&gt;from, instead of starting from a solution mod 2, one can do it in three&lt;/div&gt;&lt;div&gt;steps, requiring just six word-by-word multiplications (mullow only).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;However, today I am feeling very lazy and I leave it as an exercise for &lt;/div&gt;&lt;div&gt;someone to implement the Hensel lifting lookup table method.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We add a simple type to hold our precomputed Hensel inverse, and a function&lt;/div&gt;&lt;div&gt;for computing it.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The actual nn_divrem_hensel_1_preinv function is again a straightforward&lt;/div&gt;&lt;div&gt;loop. We ensure our remainder is positive at each step and ensure that&lt;/div&gt;&lt;div&gt;we propagate any borrows from subtractions at each step.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The test code for the Hensel division is very similar to that for the&lt;/div&gt;&lt;div&gt;ordinary euclidean division except that d must be odd and the chaining&lt;/div&gt;&lt;div&gt;test will proceed right-to-left.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;OK, that is enough brain strain for one day!&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Here &lt;a href="http://github.com/wbhart/bsdnt/tree/v0.9"&gt;v0.9&lt;/a&gt; is the github repository for today's post.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Previous post: &lt;a href="http://wbhart.blogspot.com/2010/09/bsdnt-v08-divrem1preinv.html"&gt;v0.8 - divrem1_preinv&lt;/a&gt;&lt;/div&gt;&lt;div&gt;Next post: &lt;a href="http://wbhart.blogspot.com/2010/09/bsdnt-v010-mod1preinv_19.html"&gt;v0.10 - mod1_preinv&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' 
src='https://blogger.googleusercontent.com/tracker/7651115430416156636-6812214516594038848?l=wbhart.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://wbhart.blogspot.com/feeds/6812214516594038848/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://wbhart.blogspot.com/2010/09/bsdnt-v09-divrem1hensel_18.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7651115430416156636/posts/default/6812214516594038848'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7651115430416156636/posts/default/6812214516594038848'/><link rel='alternate' type='text/html' href='http://wbhart.blogspot.com/2010/09/bsdnt-v09-divrem1hensel_18.html' title='BSDNT - v0.9 divrem1_hensel'/><author><name>William Hart</name><uri>http://www.blogger.com/profile/18416881057216462316</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7651115430416156636.post-1463886545115236342</id><published>2010-09-16T18:18:00.000-07:00</published><updated>2010-09-18T15:28:58.702-07:00</updated><title type='text'>BSDNT - v0.8 divrem1_preinv</title><content type='html'>&lt;div&gt;Today we'll implement a not-stupid division function. Since the earliest&lt;/div&gt;&lt;div&gt;days of computers, people realised that where possible, one should &lt;/div&gt;&lt;div&gt;replace division with multiplication. 
This is done by precomputing an &lt;/div&gt;&lt;div&gt;inverse v of the divisor d and multiplying by v instead of&lt;/div&gt;&lt;div&gt;dividing by d.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;One particularly ingenious example of this, which was developed recently,&lt;/div&gt;&lt;div&gt;is the algorithm of Möller and Granlund. It can be found in this paper:&lt;/div&gt;&lt;div&gt;&lt;a href="http://www.lysator.liu.se/~nisse/archive/draft-division-paper.pdf"&gt;http://www.lysator.liu.se/~nisse/archive/draft-division-paper.pdf&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;If one has a precomputed inverse v as defined at the start of section 3&lt;/div&gt;&lt;div&gt;of that paper, then one can compute a quotient and remainder with &lt;/div&gt;&lt;div&gt;algorithm 4 of the paper, at a cost of one multiplication and one mullow.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;There is one caveat however. The divisor d must be "normalised". This&lt;/div&gt;&lt;div&gt;means that the divisor is a word whose most significant bit is a 1.&lt;/div&gt;&lt;div&gt;This is no problem for us as we can simply shift d to the left so that it is &lt;/div&gt;&lt;div&gt;normalised, and also shift the two input limbs to the left by the same &lt;/div&gt;&lt;div&gt;amount. The quotient will then be the one we want and the remainder &lt;/div&gt;&lt;div&gt;will be shifted to the left by the same amount as d. We can shift it back &lt;/div&gt;&lt;div&gt;to get the real remainder we are after.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We can use this algorithm in our divrem1 function. 
However, rather than &lt;/div&gt;&lt;div&gt;shift the remainder back each iteration, as it is the high limb of the &lt;/div&gt;&lt;div&gt;next input to the same algorithm, we can simply pass it unmodified to the&lt;/div&gt;&lt;div&gt;next iteration of the divrem1 function.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The name of our function is nn_divrem1_preinv. We also introduce a type,&lt;/div&gt;&lt;div&gt;preinv1_t, to nn.h which contains the precomputed inverse (dinv) and the&lt;/div&gt;&lt;div&gt;number of bits the divisor d has been shifted by (norm). We add a function, &lt;/div&gt;&lt;div&gt;precompute_inverse1, which computes this.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In order to compute the number of leading zeroes of the divisor d, we simply use a gcc &lt;/div&gt;&lt;div&gt;intrinsic __builtin_clzl which does precisely what we are after. This gives&lt;/div&gt;&lt;div&gt;the number of bits we have to shift by.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We implement the Möller-Granlund algorithm as a macro to save as many cycles&lt;/div&gt;&lt;div&gt;as we can. This time we really have to be careful to escape the variables we&lt;/div&gt;&lt;div&gt;introduce inside the macro, to prevent clashes with parameters that might be&lt;/div&gt;&lt;div&gt;passed in. The standard way of doing this is to prepend some underscores to&lt;/div&gt;&lt;div&gt;the variables inside the macro. We call the macro divrem21_preinv1 to signify&lt;/div&gt;&lt;div&gt;that it is doing a division with dividend of 2 words and divisor of 1 word.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Once this is done, the nn_divrem1_preinv algorithm is very simple to implement. &lt;/div&gt;&lt;div&gt;It looks even simpler than our divrem1_simple function. The test code is very&lt;/div&gt;&lt;div&gt;similar as well. 
The new function runs much faster than the old one, as can&lt;/div&gt;&lt;div&gt;easily be seen by comparing the pauses for the different functions when &lt;/div&gt;&lt;div&gt;running the test code. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Later we'll have to introduce a timing and profiling function so that we can&lt;/div&gt;&lt;div&gt;be more explicit about how much of a speedup we are getting. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I haven't experimented with the divrem21_preinv1 macro to see if I can knock &lt;/div&gt;&lt;div&gt;any cycles off the computation. For example, careful use of const on one or &lt;/div&gt;&lt;div&gt;more of the variables inside the macro, or perhaps fewer word_t/dword_t &lt;/div&gt;&lt;div&gt;casts might speed things up by a cycle or two. Let me know if you find a &lt;/div&gt;&lt;div&gt;better combination. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Of course this is really just doing optimisations that the C compiler should &lt;/div&gt;&lt;div&gt;already know about. 
The best way to get real performance out of it is to add&lt;/div&gt;&lt;div&gt;some assembly language optimisations, which we'll eventually do.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The github branch for today's post is here: &lt;a href="http://github.com/wbhart/bsdnt/tree/v0.8"&gt;v0.8&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Previous article: &lt;a href="http://wbhart.blogspot.com/2010/09/bsdnt-assembly-language.html"&gt;bsdnt assembly language&lt;/a&gt;&lt;/div&gt;&lt;div&gt;Next article: &lt;a href="http://wbhart.blogspot.com/2010/09/bsdnt-v09-divrem1hensel_18.html"&gt;v0.9 - divrem1_hensel&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7651115430416156636-1463886545115236342?l=wbhart.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://wbhart.blogspot.com/feeds/1463886545115236342/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://wbhart.blogspot.com/2010/09/bsdnt-v08-divrem1preinv.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7651115430416156636/posts/default/1463886545115236342'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7651115430416156636/posts/default/1463886545115236342'/><link rel='alternate' type='text/html' href='http://wbhart.blogspot.com/2010/09/bsdnt-v08-divrem1preinv.html' title='BSDNT - v0.8 divrem1_preinv'/><author><name>William Hart</name><uri>http://www.blogger.com/profile/18416881057216462316</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7651115430416156636.post-8203338876417514401</id><published>2010-09-14T18:20:00.000-07:00</published><updated>2010-09-16T18:24:05.277-07:00</updated><title type='text'>BSDNT - assembly language</title><content type='html'>Today a few of us (Antony Vennard, Brian Gladman, Gonzalo Tornaria and myself) had a go at writing some assembly language.&lt;br /&gt;&lt;br /&gt;The results of our labour, for AMD64 can be found on the asm branch on github:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://github.com/wbhart/bsdnt/tree/asm"&gt;http://github.com/wbhart/bsdnt/tree/asm&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;This is not a proper revision of bsdnt, but a heavily hacked version, just to demonstrate the principles.&lt;br /&gt;&lt;br /&gt;In the nn.c file is some inline assembly code which implements the nn_add_mc function in x64 assembler. The format used is Intel format assembly code, instead of the usual AT&amp;amp;T syntax used by inline gcc normally. We made a change to the makefile so that gcc would use the Intel format assembly code pervasively. Note, this will cause gcc to miscompile bsdnt for many revisions of gcc. This is due to bugs in gcc. Specifically gcc versions 4.4.1 and 4.4.4 work, however.&lt;br /&gt;&lt;br /&gt;At this point we haven't unrolled the loop (repeated the contents of the loop over and over to save some time on loop arithmetic). But the results are still pretty nice.&lt;br /&gt;&lt;br /&gt;The straight C version we had before took about 12 cycles per word (with gcc 4.4.1 on an AMD Opteron K10 machine). 
With this new assembly code it takes about 3 cycles per word.&lt;br /&gt;&lt;br /&gt;With loop unrolling we might hope to halve that again, but this messy optimisation will have to wait, otherwise work on the functionality of the library will slow right down.&lt;br /&gt;&lt;br /&gt;We also wrote some assembly code for Core 2 machines. This runs in about 4.5 cycles per word. At the moment this is not committed anywhere, but you can find a copy here:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://groups.google.co.uk/group/bsdnt-devel/msg/15e2883d94014196?hl=en"&gt;http://groups.google.co.uk/group/bsdnt-devel/msg/15e2883d94014196?hl=en&lt;/a&gt;&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The important thing to realise about assembly language is that it is highly architecture dependent. Our AMD64 code is actually slower than C on an Intel Core 2 machine! The assembly code has to be written for the machine in question. Moreover, this code won't work at all on Windows 64 or on any 32 bit machine! &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;At some point we'll introduce a configuration script which will select the correct assembly code for the architecture being used.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;To actually time this code, we simply hacked t-nn.c to only run one of the nn_add_m tests, but with the nn_add_m executed 1000 times and with the actual check that the code works turned off.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We set the number of words being added in each nn_add_m to 24, as the test machine we used was 2.4GHz. This means the number of seconds the test takes to run is approximately the number of cycles per word. 
This is of course a temporary hack, just to illustrate working assembly code (the test will pass if you remove the loop that calls nn_add_m a thousand times instead of once).&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Previous article: &lt;a href="http://wbhart.blogspot.com/2010/09/bsdnt-v07-divrem1simple.html"&gt;v0.7 - divrem1simple&lt;/a&gt;&lt;/div&gt;&lt;div&gt;Next article: &lt;a href="http://wbhart.blogspot.com/2010/09/bsdnt-v08-divrem1preinv.html"&gt;v0.8 - divrem1_preinv&lt;/a&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7651115430416156636-8203338876417514401?l=wbhart.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://wbhart.blogspot.com/feeds/8203338876417514401/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://wbhart.blogspot.com/2010/09/bsdnt-assembly-language.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7651115430416156636/posts/default/8203338876417514401'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7651115430416156636/posts/default/8203338876417514401'/><link rel='alternate' type='text/html' href='http://wbhart.blogspot.com/2010/09/bsdnt-assembly-language.html' title='BSDNT - assembly language'/><author><name>William Hart</name><uri>http://www.blogger.com/profile/18416881057216462316</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7651115430416156636.post-429777096411446507</id><published>2010-09-13T10:24:00.000-07:00</published><updated>2010-09-14T18:51:54.159-07:00</updated><title 
type='text'>BSDNT - v0.7 divrem1_simple</title><content type='html'>&lt;div&gt;It's time for our first division function. This will again be a linear function&lt;/div&gt;&lt;div&gt;in that it divides a multiprecision integer by a single word and returns the&lt;/div&gt;&lt;div&gt;(single word) remainder.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;This is the first time we can do substantially better than a simple for loop&lt;/div&gt;&lt;div&gt;(other than when we coded nn_cmp and other functions, which allowed early exit).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;To get us going, we'll write a really dumb division function first. It'll just&lt;/div&gt;&lt;div&gt;use C's "/" and "%" operators in a for loop. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;C is supposed to optimise simultaneous occurrences of / and % together in the&lt;/div&gt;&lt;div&gt;same piece of code to use a single processor instruction where available. &lt;/div&gt;&lt;div&gt;However, even with that optimisation, our function will be exceedingly slow.&lt;/div&gt;&lt;div&gt;We'll see why that is when we implement a not-stupid divrem1, next time around.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;To emphasise its stupidity, we'll call our exceedingly dumb division function&lt;/div&gt;&lt;div&gt;nn_divrem1_simple. Perhaps we can use it in our test code to compare against &lt;/div&gt;&lt;div&gt;when we write the more sophisticated version.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Hopefully our naming conventions won't be defeated by this function, which &lt;/div&gt;&lt;div&gt;proceeds left to right (or most significant word first). &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;One doesn't usually need to write the remainder down immediately following &lt;/div&gt;&lt;div&gt;the quotient (though one could imagine a function operating that way). 
&lt;/div&gt;&lt;div&gt;However, it is convenient to be able to accept a "remainder-in", essentially &lt;/div&gt;&lt;div&gt;thought of as an extra limb of the dividend, so that divrem can be chained. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In this way, divrem1 is most similar to right shift in semantics (the latter &lt;/div&gt;&lt;div&gt;is just division by a power of two, after all).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Now, if we blindly try to implement this function, we quickly discover a &lt;/div&gt;&lt;div&gt;problem. Even if we work with double words, a division of a two limb quantity &lt;/div&gt;&lt;div&gt;by a single limb can be problematic. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We can illustrate this using decimal digits, instead of machine words. Suppose&lt;/div&gt;&lt;div&gt;we divided 94 by 7. The resulting quotient is 13 with remainder 3 (I hope!). &lt;/div&gt;&lt;div&gt;The point is, the quotient takes up two "words" 1 and 3, not just a single &lt;/div&gt;&lt;div&gt;word.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We can get around it by reducing the top word first. We have 9 divided by 7 is 1&lt;/div&gt;&lt;div&gt;with remainder 2. Thus the first word of our quotient is 1. Now we are left with&lt;/div&gt;&lt;div&gt;24 divided by 7. This time the quotient is 3 with remainder 3 and everything is &lt;/div&gt;&lt;div&gt;fine. Note the remainder, 3, is less than 7.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The critical thing to note is that once we get things started, everything &lt;/div&gt;&lt;div&gt;works fine from then on. After the first iteration, the top word of each double word is the&lt;/div&gt;&lt;div&gt;remainder from the previous step, so it is always reduced modulo the divisor.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;So the almost trivial algorithm almost works, except for the lead-in, where&lt;/div&gt;&lt;div&gt;we have to get things right. 
After that, we can divide our dividend word on&lt;/div&gt;&lt;div&gt;word until we reach the bottom, where we'll have a single limb remainder.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Now as mentioned, it is also convenient to supply a "carry-in". This acts like &lt;/div&gt;&lt;div&gt;an additional limb of the dividend. Assuming this carry-in was reduced mod &lt;/div&gt;&lt;div&gt;the divisor, we find that we end up with as many words in our quotient as we&lt;/div&gt;&lt;div&gt;had in our original dividend.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Clearly however, if the carry-in is not reduced, we will end up with even one&lt;/div&gt;&lt;div&gt;more quotient limb! However, there is an easy way to avoid this problem. &lt;/div&gt;&lt;div&gt;We simply decide to restrict the carry-in to be reduced modulo the divisor! &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;This restriction is not a problematic restriction because we will still be &lt;/div&gt;&lt;div&gt;able to chain our division functions together with this restriction (so long &lt;/div&gt;&lt;div&gt;as we use the same divisor throughout).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Anyhow, with this restriction, the quotient will have precisely the same number&lt;/div&gt;&lt;div&gt;of limbs as the dividend, and since we'll start with the carry-in, not the top&lt;/div&gt;&lt;div&gt;limb of the dividend, and that carry-in will already be reduced, there is no&lt;/div&gt;&lt;div&gt;bootstrapping iteration required, i.e. the problem of a non-reduced first limb&lt;/div&gt;&lt;div&gt;as explained above, simply doesn't exist.  &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In this way we avoid any complex feed-in and wind-down code, and as for all &lt;/div&gt;&lt;div&gt;the functions so far, the length m can be zero. 
The result is satisfyingly&lt;/div&gt;&lt;div&gt;symmetric, even if it is terribly slow.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In the test code, if q = a / c with remainder r, we can check that a = q*c + r.&lt;/div&gt;&lt;div&gt;We can use the functions we have already implemented, including our mul1_c code &lt;/div&gt;&lt;div&gt;to verify this. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We also check that passing in a reduced carry-in results in a correct result, &lt;/div&gt;&lt;div&gt;by chaining two divisions together.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;This will do for today, as a more serious implementation of divrem1 will &lt;/div&gt;&lt;div&gt;require much more serious code. We'll take a look at that next time.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;The github branch here &lt;a href="http://github.com/wbhart/bsdnt/tree/v0.7"&gt;v0.7&lt;/a&gt; is for this post.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Previous article: &lt;a href="http://wbhart.blogspot.com/2010/09/bsdnt-v06-addmul1-submul1.html"&gt;bsdnt v0.6 - addmul1, submul1&lt;/a&gt;&lt;/div&gt;&lt;div&gt;Next article: &lt;a href="http://wbhart.blogspot.com/2010/09/bsdnt-assembly-language.html"&gt;bsdnt assembly language&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7651115430416156636-429777096411446507?l=wbhart.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://wbhart.blogspot.com/feeds/429777096411446507/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://wbhart.blogspot.com/2010/09/bsdnt-v07-divrem1simple.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7651115430416156636/posts/default/429777096411446507'/><link rel='self' 
type='application/atom+xml' href='http://www.blogger.com/feeds/7651115430416156636/posts/default/429777096411446507'/><link rel='alternate' type='text/html' href='http://wbhart.blogspot.com/2010/09/bsdnt-v07-divrem1simple.html' title='BSDNT - v0.7 divrem1_simple'/><author><name>William Hart</name><uri>http://www.blogger.com/profile/18416881057216462316</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7651115430416156636.post-2528182992658643898</id><published>2010-09-12T17:04:00.000-07:00</published><updated>2010-09-13T10:53:10.528-07:00</updated><title type='text'>BSDNT - v0.6 addmul1, submul1</title><content type='html'>&lt;div&gt;At last we come to addmul1 and submul1. After implementing these, we would&lt;/div&gt;&lt;div&gt;already be able to implement a full classical multiplication routine, using the &lt;/div&gt;&lt;div&gt;naive O(n^2) multiplication algorithm.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;It turns out that addmul1 and submul1 are quite similar to mul1. Instead of &lt;/div&gt;&lt;div&gt;just writing out the result, we have to add it to, or subtract it from the &lt;/div&gt;&lt;div&gt;first operand. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;These are really combined operations, i.e. combined mul1 and add_m or sub_m&lt;/div&gt;&lt;div&gt;respectively.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The reason we introduce these is that the processor has multiple pipelines&lt;/div&gt;&lt;div&gt;and merging two operations like this gives us the chance to push more &lt;/div&gt;&lt;div&gt;through those pipelines. 
We'll add more combined operations later on.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The fact that we can add b[i]*c, a[i] and ci and not overflow a double word&lt;/div&gt;&lt;div&gt;needs some justification.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Clearly b[i] and c are at most B-1. Thus b[i]*c is at most B^2-2B+1. And&lt;/div&gt;&lt;div&gt;clearly a[i] and ci are at most B-1. Thus the total is at most&lt;/div&gt;&lt;div&gt;B^2-2B+1 + (B-1) + (B-1) = B^2-1, which just fits into a double word.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The dword_t type allows us to get away with adding all these in C as though&lt;/div&gt;&lt;div&gt;we don't care about any overflow (in fact we know there is none). Of course&lt;/div&gt;&lt;div&gt;that doesn't make it efficient. This function is still a candidate for&lt;/div&gt;&lt;div&gt;assembly optimisation and loop unrolling later on.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The submul function is essentially the same, so long as we have taken care to&lt;/div&gt;&lt;div&gt;perform operations in the right order.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The naming of our functions causes us slight difficulties here. We'd ideally &lt;/div&gt;&lt;div&gt;like versions of addmul1 and submul1 which write the carry out to the high &lt;/div&gt;&lt;div&gt;limb and versions which add/sub it from the high limb. We opt for the latter&lt;/div&gt;&lt;div&gt;for now. After all, if one were accumulating addmuls, this is what one would &lt;/div&gt;&lt;div&gt;require most often.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Maybe someone reading this has a better idea how to handle these.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We add a test which checks that doing two addmuls in a row is the same as &lt;/div&gt;&lt;div&gt;doing a single addmul with the multiply constant equal to the sum of the &lt;/div&gt;&lt;div&gt;original two. 
We also add a test to check that chaining addmuls works. We &lt;/div&gt;&lt;div&gt;add similar tests for submul.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We delay coding up a quadratic multiplication basecase as there are a few more&lt;/div&gt;&lt;div&gt;linear functions to work on, most notably the various kinds of division and&lt;/div&gt;&lt;div&gt;remainder functions. These are all interesting functions to write.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The github repo for this version is here: &lt;a href="http://github.com/wbhart/bsdnt/tree/v0.6"&gt;v0.6&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Previous article: &lt;a href="http://wbhart.blogspot.com/2010/09/bsdnt-v05-add-sub-cmp.html"&gt;v0.5 add, sub, cmp&lt;/a&gt;&lt;/div&gt;&lt;div&gt;Next article: &lt;a href="http://wbhart.blogspot.com/2010/09/bsdnt-v07-divrem1simple.html"&gt;v0.7 divrem1_simple&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7651115430416156636-2528182992658643898?l=wbhart.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://wbhart.blogspot.com/feeds/2528182992658643898/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://wbhart.blogspot.com/2010/09/bsdnt-v06-addmul1-submul1.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7651115430416156636/posts/default/2528182992658643898'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7651115430416156636/posts/default/2528182992658643898'/><link rel='alternate' type='text/html' href='http://wbhart.blogspot.com/2010/09/bsdnt-v06-addmul1-submul1.html' title='BSDNT - v0.6 addmul1, submul1'/><author><name>William 
Hart</name><uri>http://www.blogger.com/profile/18416881057216462316</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7651115430416156636.post-3454434631773236808</id><published>2010-09-11T12:44:00.000-07:00</published><updated>2010-09-12T17:11:06.281-07:00</updated><title type='text'>BSDNT - v0.5 add, sub, cmp</title><content type='html'>It's time we added a full add and sub routine. These perform addition and&lt;br /&gt;subtraction with possibly different length operands. For simplicity, we&lt;br /&gt;assume the first operand is at least as long as the second. This saves us&lt;br /&gt;a comparison and a swap or function call.&lt;br /&gt;&lt;br /&gt;Add is simply an add_m followed by an add1 to propagate any carries right&lt;br /&gt;through to the end of the longest operand. A similar thing holds for sub.&lt;br /&gt;&lt;br /&gt;We'd like to code these as macros, however we'd also like to get the return&lt;br /&gt;value. Thus, we use a static inline function, which the compiler has the&lt;br /&gt;option of inlining if it wants, saving the function call overhead.&lt;br /&gt;&lt;br /&gt;We add tests for (a + b) + c = (a + c) + b where b and c are at most the same&lt;br /&gt;length as a. We also check that chaining an add with equal length operands,&lt;br /&gt;followed by one with non-equal operands, works as expected. We add similar&lt;br /&gt;tests for subtraction too.&lt;br /&gt;&lt;br /&gt;The next function we add is comparison. 
It should return a positive value&lt;br /&gt;if its first operand is greater than its second, a negative value if it is&lt;br /&gt;the other way, and zero if the operands are equal.&lt;br /&gt;&lt;br /&gt;As for equal_m and equal, we introduce different versions of the function&lt;br /&gt;for equal length operands and possibly distinct operands.&lt;br /&gt;&lt;br /&gt;The laziest way to do comparison is to subtract one value from the other&lt;br /&gt;and see what sign the result is. However, this is extremely inefficient&lt;br /&gt;in the case that the two operands are different. It's very likely that&lt;br /&gt;they already differ in the most significant limb, so we start by checking&lt;br /&gt;limb by limb, from the top, until we find a difference.&lt;br /&gt;&lt;br /&gt;Various tests are added. We test that equal things are equal, that operands&lt;br /&gt;with different *lengths* compare in the right order, and operands with&lt;br /&gt;different *values* but the same length compare in the right order.&lt;br /&gt;&lt;br /&gt;We also take the opportunity to do a copy-and-paste job from our cmp test&lt;br /&gt;code and quickly generate some test code for the nn_equal function, which&lt;br /&gt;didn't have test code up to this point.&lt;br /&gt;&lt;br /&gt;From the next section onwards, we will only deal with one or two functions&lt;br /&gt;at a time. 
All the really simple functions are now done, and we can move on&lt;br /&gt;to more interesting (and useful) things.&lt;br /&gt;&lt;br /&gt;The github branch for this post is here: &lt;a href="http://github.com/wbhart/bsdnt/tree/v0.5"&gt;v0.5&lt;/a&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Previous article: &lt;a href="http://wbhart.blogspot.com/2010/09/bsdnt-v04-add1-sub1-neg-not.html"&gt;0.4 - add1, sub1, neg, not&lt;/a&gt;&lt;/div&gt;&lt;div&gt;Next article: &lt;a href="http://wbhart.blogspot.com/2010/09/bsdnt-v06-addmul1-submul1.html"&gt;v0.6 addmul1, submul1&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7651115430416156636-3454434631773236808?l=wbhart.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://wbhart.blogspot.com/feeds/3454434631773236808/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://wbhart.blogspot.com/2010/09/bsdnt-v05-add-sub-cmp.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7651115430416156636/posts/default/3454434631773236808'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7651115430416156636/posts/default/3454434631773236808'/><link rel='alternate' type='text/html' href='http://wbhart.blogspot.com/2010/09/bsdnt-v05-add-sub-cmp.html' title='BSDNT - v0.5 add, sub, cmp'/><author><name>William Hart</name><uri>http://www.blogger.com/profile/18416881057216462316</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7651115430416156636.post-1176061506739072303</id><published>2010-09-10T07:30:00.000-07:00</published><updated>2010-09-11T12:55:55.383-07:00</updated><title type='text'>BSDNT - v0.4 add1, sub1, neg, not</title><content type='html'>Today we introduce some more convenience functions, neg and not, and we&lt;br /&gt;also add the functions add1 and sub1.&lt;br /&gt;&lt;br /&gt;All of these functions operate slightly differently to the ones we have&lt;br /&gt;introduced so far.&lt;br /&gt;&lt;br /&gt;The functions add1 and sub1 simply add a single word to a multiprecision&lt;br /&gt;integer, propagating any carries/borrows all the way along.&lt;br /&gt;&lt;br /&gt;The main loops of add1 and sub1 need to stop if the carry becomes zero. This&lt;br /&gt;is for efficiency reasons. In most cases when adding a constant limb to a&lt;br /&gt;multiprecision integer, only the first limb or two are affected. One doesn't&lt;br /&gt;want to loop over the whole input and output if that is the case.&lt;br /&gt;&lt;br /&gt;However, we must be careful, as in the case where the input and output are&lt;br /&gt;not aliased (at the same location), we still need to copy the remaining&lt;br /&gt;limbs of the input to the output location.&lt;br /&gt;&lt;br /&gt;We add tests that (a + c1) + c2 = (a + c2) + c1 and do the same thing for&lt;br /&gt;subtraction.&lt;br /&gt;&lt;br /&gt;We also check that chaining of add1's and chaining of sub1's works. Until&lt;br /&gt;we can generate more interesting random test integers this test doesn't&lt;br /&gt;give our functions much of a workout. We eventually want to be able to&lt;br /&gt;generate "sparse" integers, i.e. integers with only a few binary 1's or a&lt;br /&gt;few binary 0's in their binary representation. 
The latter case would be&lt;br /&gt;interesting here as it would test the propagation of carries in our add1&lt;br /&gt;and sub1 functions. We'd also eventually like to explicitly test corner&lt;br /&gt;cases such as multiprecision 0, ~0, 1, etc.&lt;br /&gt;&lt;br /&gt;A final test of add1/sub1 that we add is a + c1 - c1 = a.&lt;br /&gt;&lt;br /&gt;The not function is logical not. It complements each limb of the input. It&lt;br /&gt;is a simple for loop.&lt;br /&gt;&lt;br /&gt;The neg function is twos complement negation, i.e. negation modulo B^m. In&lt;br /&gt;fact, twos complement negation is the same as taking the logical not of the&lt;br /&gt;integer, then adding 1 to the whole thing. The implementation is similar to&lt;br /&gt;add1, except that we complement each limb after reading it, but before adding&lt;br /&gt;the carry.&lt;br /&gt;&lt;br /&gt;One difference is that we still need to complement the remaining limbs after&lt;br /&gt;the carry becomes zero, regardless of whether the input and output are&lt;br /&gt;aliased.&lt;br /&gt;&lt;br /&gt;The carry out from (neg a) is notionally what you would get if you were&lt;br /&gt;computing 0 - a. In other words, the carry is always 1 unless a is 0. In&lt;br /&gt;order to allow chaining, neg must notionally subtract the carry-in from the&lt;br /&gt;total.&lt;br /&gt;&lt;br /&gt;We test that (not (not a)) = a and that neg is the same as a combination of&lt;br /&gt;not and add1 with constant 1. We can also test that adding -b to a is the&lt;br /&gt;same as computing a - b. And finally we can test chaining of neg, as always.&lt;br /&gt;&lt;br /&gt;I wonder what the most interesting program is that we could implement on top&lt;br /&gt;of what we have so far. Tomorrow we add a few more convenience functions&lt;br /&gt;before we start heading into the more interesting stuff.&lt;br /&gt;&lt;br /&gt;I think I may have solved the test framework problem. 
More on that when we&lt;br /&gt;get to v0.11 and v0.12.&lt;br /&gt;&lt;br /&gt;There is a github branch here &lt;a href="http://github.com/wbhart/bsdnt/tree/v0.4"&gt;v0.4&lt;/a&gt;  for this article.&lt;br /&gt;&lt;br /&gt;Previous article: &lt;a href="http://wbhart.blogspot.com/2010/09/bsdnt-v03-copy-zero-normalise-mul1_09.html"&gt;v0.3 - copy, zero, normalise, mul1&lt;/a&gt;&lt;div&gt;Next article: &lt;a href="http://wbhart.blogspot.com/2010/09/bsdnt-v05-add-sub-cmp.html"&gt;v0.5 - add, sub, cmp&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7651115430416156636-1176061506739072303?l=wbhart.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://wbhart.blogspot.com/feeds/1176061506739072303/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://wbhart.blogspot.com/2010/09/bsdnt-v04-add1-sub1-neg-not.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7651115430416156636/posts/default/1176061506739072303'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7651115430416156636/posts/default/1176061506739072303'/><link rel='alternate' type='text/html' href='http://wbhart.blogspot.com/2010/09/bsdnt-v04-add1-sub1-neg-not.html' title='BSDNT - v0.4 add1, sub1, neg, not'/><author><name>William Hart</name><uri>http://www.blogger.com/profile/18416881057216462316</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7651115430416156636.post-1633271505637210668</id><published>2010-09-09T13:54:00.000-07:00</published><updated>2011-10-23T09:57:43.942-07:00</updated><title 
type='text'>BSDNT - v0.3 copy, zero, normalise, mul1</title><content type='html'>&lt;div&gt;In this section we add a few more trivial convenience functions, nn_copy,&lt;/div&gt;&lt;div&gt;nn_zero and nn_normalise.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The first two are simple for loops which we implement in nn.h. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;A pair {c, m} will be considered normalised if either m is zero &lt;/div&gt;&lt;div&gt;(representing the bignum 0) or the limb c[m-1] is nonzero. In other words, &lt;/div&gt;&lt;div&gt;the most significant word of {c, m} will be non-zero.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The only tricky thing to do with nn_normalise is to make sure nothing goes&lt;/div&gt;&lt;div&gt;wrong if all the limbs are zero.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We add numerous tests of these new functions. We zero an integer and check&lt;/div&gt;&lt;div&gt;that it normalises to zero limbs. We copy an integer and make sure it &lt;/div&gt;&lt;div&gt;copies correctly. We also copy an integer, then modify a random limb and &lt;/div&gt;&lt;div&gt;then check that it is no longer equal. This provides a further check that&lt;/div&gt;&lt;div&gt;the nn_equal function is working correctly. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Finally, we do a more thorough test of nn_normalise by zeroing the top few &lt;/div&gt;&lt;div&gt;limbs of an integer then normalising it. We then copy just this many limbs&lt;/div&gt;&lt;div&gt;into a location which has been zeroed and check that the new integer is&lt;/div&gt;&lt;div&gt;still equal to the original *unnormalised* integer. This checks that&lt;/div&gt;&lt;div&gt;nn_normalise hasn't thrown away any nonzero limbs.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Next we add a mul_1 function. As for all of the linear functions we have &lt;/div&gt;&lt;div&gt;been adding, this will be very slow in C. 
This time we need to retrieve&lt;/div&gt;&lt;div&gt;the top limb of a word-by-word multiply. Again, ANSI C doesn't provide this&lt;/div&gt;&lt;div&gt;functionality, so we use the gcc extensions that allow us to upcast to a &lt;/div&gt;&lt;div&gt;double word before doing the multiply. Again the compiler is supposed to&lt;/div&gt;&lt;div&gt;"do the right thing" with this combination.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We add a test to check that a * (c1 + c2) = a * c1 + a * c2. We also add &lt;/div&gt;&lt;div&gt;a test for chaining of mul1's together, passing the carry-out of one to &lt;/div&gt;&lt;div&gt;the carry-in of the next.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The github branch for this post is here: &lt;a href="http://github.com/wbhart/bsdnt/tree/v0.3"&gt;v0.3&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Previous article: &lt;a href="http://wbhart.blogspot.com/2010/09/bsdnt-v02.html"&gt;v0.2 - subtraction, shifting&lt;/a&gt;&lt;/div&gt;&lt;div&gt;Next article: &lt;a href="http://wbhart.blogspot.com/2010/09/bsdnt-v04-add1-sub1-neg-not.html"&gt;v0.4 - add1, sub1, neg, not&lt;/a&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7651115430416156636-1633271505637210668?l=wbhart.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://wbhart.blogspot.com/feeds/1633271505637210668/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://wbhart.blogspot.com/2010/09/bsdnt-v03-copy-zero-normalise-mul1_09.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7651115430416156636/posts/default/1633271505637210668'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7651115430416156636/posts/default/1633271505637210668'/><link rel='alternate' 
type='text/html' href='http://wbhart.blogspot.com/2010/09/bsdnt-v03-copy-zero-normalise-mul1_09.html' title='BSDNT - v0.3 copy, zero, normalise, mul1'/><author><name>William Hart</name><uri>http://www.blogger.com/profile/18416881057216462316</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7651115430416156636.post-5882219849551876008</id><published>2010-09-08T08:00:00.000-07:00</published><updated>2011-10-23T09:56:01.649-07:00</updated><title type='text'>BSDNT - v0.2 subtraction, shifting</title><content type='html'>&lt;div&gt;In today's revision we add a subtraction function and left and right &lt;/div&gt;&lt;div&gt;shift functions.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Subtraction is almost trivial after addition. We copy what we did for &lt;/div&gt;&lt;div&gt;addition. The main change is that carry-ins become borrows. We have to&lt;/div&gt;&lt;div&gt;negate our borrows at each step, so that we are always dealing with &lt;/div&gt;&lt;div&gt;positive quantities. The returned borrow is zero or positive.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In the test code for subtraction, we can add a test which checks that &lt;/div&gt;&lt;div&gt;a + b - b = a. This is a further test of the addition function as well.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Left shift is straightforward. Again we make use of the dword_t type&lt;/div&gt;&lt;div&gt;which is twice as wide as a word_t. This allows us to do a single shift&lt;/div&gt;&lt;div&gt;(which the compiler has the option of converting to two shifts if required).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We shift each limb by the specified bits, then hang onto the bits that were&lt;/div&gt;&lt;div&gt;shifted out the top, to be added back in at the next iteration. 
&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;For right shift we only have three functions instead of four. It doesn't make&lt;/div&gt;&lt;div&gt;sense to write out the "carry-out" at the bottom of a right shift. Instead&lt;/div&gt;&lt;div&gt;we do a "read-in" nn_r_rsh version which is the exact opposite of nn_s_lsh. It &lt;/div&gt;&lt;div&gt;reads the "carry-in" from an (m + 1)-th limb (which it shifts appropriately).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The nn_r_rsh function also returns the bits that are shifted out the bottom end.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We can provide an extra test for addition now, by checking that a + a = a &amp;lt;&amp;lt; 1.&lt;div&gt;We also check that (a &amp;lt;&amp;lt; sh1) &amp;gt;&amp;gt; sh1 gets us back to a.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Finally, we test that we can chain our carry-in/carry-out functions together. For&lt;/div&gt;&lt;div&gt;example, I should be able to shift the first few limbs of an integer, feed the&lt;/div&gt;&lt;div&gt;carry-out as a carry-in into the next shift, which will then shift the rest of&lt;/div&gt;&lt;div&gt;the integer. 
This should give me the same result as if I had shifted the whole &lt;/div&gt;&lt;div&gt;integer in one go.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We also do similar carry-in/carry-out chaining with the addition and subtraction&lt;/div&gt;&lt;div&gt;functions.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We are going to need to set up a proper test framework at some point, but for&lt;/div&gt;&lt;div&gt;now, whilst we are still working on implementing "linear" functions, I'm simply&lt;/div&gt;&lt;div&gt;going to keep duplicating the test functions over and over and making the&lt;/div&gt;&lt;div&gt;minor modifications required to test the new functions we are adding.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Here is &lt;a href="http://github.com/wbhart/bsdnt/tree/v0.2"&gt;v0.2&lt;/a&gt; on github for this post.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Previous article: &lt;a href="http://wbhart.blogspot.com/2010/09/bsdnt-v01-basic-types-and-addition.html"&gt;v0.1 - basic types and addition&lt;/a&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;Next article: &lt;a href="http://wbhart.blogspot.com/2010/09/bsdnt-v03-copy-zero-normalise-mul1_09.html"&gt;v0.3 - copy, zero, normalise, mul1&lt;/a&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7651115430416156636-5882219849551876008?l=wbhart.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://wbhart.blogspot.com/feeds/5882219849551876008/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://wbhart.blogspot.com/2010/09/bsdnt-v02.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7651115430416156636/posts/default/5882219849551876008'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/7651115430416156636/posts/default/5882219849551876008'/><link rel='alternate' type='text/html' href='http://wbhart.blogspot.com/2010/09/bsdnt-v02.html' title='BSDNT - v0.2 subtraction, shifting'/><author><name>William Hart</name><uri>http://www.blogger.com/profile/18416881057216462316</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7651115430416156636.post-6124817280871759947</id><published>2010-09-07T11:56:00.000-07:00</published><updated>2011-10-23T08:17:20.689-07:00</updated><title type='text'>BSDNT v0.1 - basic types and addition</title><content type='html'>&lt;div&gt;We start with v0.1 of bsdnt. See the github branch: &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;a href="http://github.com/wbhart/bsdnt/tree/v0.1"&gt;http://github.com/wbhart/bsdnt/tree/v0.1&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Firstly, we set up a bit of infrastructure before we add our first function, &lt;/div&gt;&lt;div&gt;the addition function.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We begin that with some basic types in nn.h:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;word_t - represents a machine word (either 32 or 64 bit)&lt;/div&gt;&lt;div&gt;dword_t - twice the size of a machine word (to handle carries from addition)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;nn_t - our basic multiprecision integer type, consisting of an array of words&lt;/div&gt;&lt;div&gt;nn_src_t - same as an nn_t, but declared const, so it can't be modified, used &lt;/div&gt;&lt;div&gt;           for input/source parameters&lt;/div&gt;&lt;div&gt;len_t - the length in words of a multiprecision integer. 
We don't use a struct&lt;/div&gt;&lt;div&gt;        as we'd have to pass it by reference, which would be less efficient due&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-tab-span" style="white-space:pre"&gt; &lt;/span&gt;to having to dereference the pointer all the time&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;rand_t - a random state, unused for now. This will ensure thread safety of our&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-tab-span" style="white-space:pre"&gt; &lt;/span&gt; random functions later on.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We also define WORD_BITS, the number of bits in a machine word. To simplify &lt;/div&gt;&lt;div&gt;things, we'll informally let B refer to the number 2^WORD_BITS in what follows.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Our basic bignum type is a pair {c, len} consisting of an nn_t c and a len_t len&lt;/div&gt;&lt;div&gt;counting the number of limbs of our bignum. At this stage we restrict len to &lt;/div&gt;&lt;div&gt;being non-negative, and we allow leading zero limbs in our representation of&lt;/div&gt;&lt;div&gt;our bignums. This allows for twos complement arithmetic with fixed length &lt;/div&gt;&lt;div&gt;bignums (arithmetic modulo B^len for some fixed len).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The first function we write is a simple addition function. This needs to be&lt;/div&gt;&lt;div&gt;pretty sophisticated. 
In particular, we want to be able to pass in a carry, so&lt;/div&gt;&lt;div&gt;that addition functions can be chained together, or on the end of other functions.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The first addition function will add two operands of the same length (number of &lt;/div&gt;&lt;div&gt;machine words), which we signify with an m at the end of the function name.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We signify that the function takes a carry-in by appending c to the function name.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We will also want the function to pass a carry-out. It'll return this carry out as a &lt;/div&gt;&lt;div&gt;word_t.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The function nn_add_mc is defined in nn.c. Notice how we cast each word to a &lt;/div&gt;&lt;div&gt;dword first, then do the addition, then cast back to a word for the low word of&lt;/div&gt;&lt;div&gt;the sum, and shift by WORD_BITS to get the high word of the result.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The compiler is *supposed* to optimise this combination. Actually, it makes poor &lt;/div&gt;&lt;div&gt;use of the processor carry flag, and screws up the loop, which is why C is not&lt;/div&gt;&lt;div&gt;the best language for a bignum library. But for now it is the best we can do.&lt;/div&gt;&lt;div&gt;Later on we'll add assembly optimisations to our library.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We could unroll the loop (write the contents four times in a row, say, to amortise&lt;/div&gt;&lt;div&gt;the cost of the loop counter arithmetic over four iterations). However, we are&lt;/div&gt;&lt;div&gt;after a cleanly coded library, so we resist this temptation until we write some&lt;/div&gt;&lt;div&gt;assembly code.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Now we introduce some extra macros in nn.h for our addition function. 
These &lt;/div&gt;&lt;div&gt;represent various permutations of allowing a carry in or not, and one more&lt;/div&gt;&lt;div&gt;interesting macro, nn_s_add_m. This function checks whether the result &lt;/div&gt;&lt;div&gt;and first input are aliased. If so (in other words, we have a = a + c), then the &lt;/div&gt;&lt;div&gt;carry-out gets *added* to the correct limb of a.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;If a and b are not aliased (we have a = b + c), then the carry is *written* &lt;/div&gt;&lt;div&gt;and not added, to the correct limb of a. This macro will make coding cleaner &lt;/div&gt;&lt;div&gt;later on.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Notionally, the extra 's' in the function name stands for 'store'.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We use macros for these functions to stop code duplication, and because macros&lt;/div&gt;&lt;div&gt;prevent extra function call overhead, and sometimes offer an opportunity for the&lt;/div&gt;&lt;div&gt;compiler to optimise further. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;Note that C macros are not hygienic. In fact they are just text replacement macros.&lt;/div&gt;&lt;div&gt;Thus we need to be careful about naming the macro variables so they are unlikely&lt;/div&gt;&lt;div&gt;to conflict with symbols passed in by the caller. In this case we don't introduce&lt;/div&gt;&lt;div&gt;any variables inside our macros, so this precaution is not necessary.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We also need to take care when using the macro parameters inside the macro. In &lt;/div&gt;&lt;div&gt;some cases it is necessary to put parentheses around the parameters in case &lt;/div&gt;&lt;div&gt;the caller passes in a complex expression which combines badly with expressions&lt;/div&gt;&lt;div&gt;inside our macro (e.g. due to operator precedence). 
Where there can be no &lt;/div&gt;&lt;div&gt;confusion when substituting a macro, we do not need to do this.&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Next we will need to add some random functions, to generate random integers to add&lt;/div&gt;&lt;div&gt;in our test code. In nn.c, we add functions to generate a random word of data, a&lt;/div&gt;&lt;div&gt;random integer up to a limit and a random multiprecision integer.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The random functions take a random state as a parameter. This is essential to &lt;/div&gt;&lt;div&gt;ensure that we can make our random functions threadsafe at a later date. For now&lt;/div&gt;&lt;div&gt;the random state and associated init and clear functions do nothing.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The random word function generates two random integers slightly bigger than half&lt;/div&gt;&lt;div&gt;a word and stitches them together. This is done using an el cheapo pseudorandom&lt;/div&gt;&lt;div&gt;function which works modulo a prime slightly bigger than half a word. We take some&lt;/div&gt;&lt;div&gt;care not to always end up with even output, or output divisible by a given small&lt;/div&gt;&lt;div&gt;prime.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The test code is in t-nn.c. We add a comparison function nn_equal_m to nn.h, to &lt;/div&gt;&lt;div&gt;test equality of two mp integers, to be used in our test code, and some other &lt;/div&gt;&lt;div&gt;convenience functions. We generate random length mp integers, and check an &lt;/div&gt;&lt;div&gt;identity, in this case the associative law for addition. 
This is not a very &lt;/div&gt;&lt;div&gt;sophisticated test, and we'll have to improve our test code at a later date.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;For now, it passes, and we move on to grander things.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Previous article: &lt;a href="http://wbhart.blogspot.com/2010/09/bsdnt-introduction.html"&gt;BSDNT-introduction&lt;/a&gt;&lt;/div&gt;&lt;div&gt;Next article: &lt;a href="http://wbhart.blogspot.com/2010/09/bsdnt-v02.html"&gt;v0.2 - subtraction, shifting&lt;/a&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7651115430416156636-6124817280871759947?l=wbhart.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://wbhart.blogspot.com/feeds/6124817280871759947/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://wbhart.blogspot.com/2010/09/bsdnt-v01-basic-types-and-addition.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7651115430416156636/posts/default/6124817280871759947'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7651115430416156636/posts/default/6124817280871759947'/><link rel='alternate' type='text/html' href='http://wbhart.blogspot.com/2010/09/bsdnt-v01-basic-types-and-addition.html' title='BSDNT v0.1 - basic types and addition'/><author><name>William Hart</name><uri>http://www.blogger.com/profile/18416881057216462316</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7651115430416156636.post-2834987883413406300</id><published>2010-09-07T11:38:00.001-07:00</published><updated>2011-10-23T08:21:59.680-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='sage bsdnt flint bignum'/><title type='text'>BSDNT - Introduction</title><content type='html'>&lt;div&gt;Over the next few weeks to months, I'll be blogging about a new project of mine, called bsdnt.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The aim of this project is to develop a cleanly coded bignum library with (eventually) reasonable performance and a BSD license. I'll be developing this code for a while in a blog/tutorial style.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;First some organisational matters.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;For those who wish to follow along with the code as it is written, it can be found at:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;a href="http://github.com/wbhart/bsdnt/tree/v0.1"&gt;http://github.com/wbhart/bsdnt/&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;If you have a github account, you can clone my project on github.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;To make a clone of my repository on your local machine, you can do:&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;git clone git://github.com/wbhart/bsdnt.git bsdnt&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The various branches, as I add them, will be called v0.1, v0.2, etc.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;To switch branches, simply do, for example:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;git checkout origin/v0.1&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;To make your own branch of this, to mess around in, 
do:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;git checkout -b mybranch&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;Antony Vennard has set up a Google Groups discussion list. You can access that&lt;/div&gt;&lt;div&gt;here:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;a href="http://groups.google.com/group/bsdnt-devel?hl=en"&gt;http://groups.google.com/group/bsdnt-devel?hl=en&lt;/a&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;A word on licensing: if you substantially copy my code, then I would appreciate you &lt;/div&gt;&lt;div&gt;retaining the copyright; however, if you write substantially your own code, &lt;/div&gt;&lt;div&gt;merely being "inspired" by my code, I am fine with you not adding my copyright to your project. But still let me know about your code, because I'd be interested in seeing it.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;To compile the project, you will need a recent version of gcc. I'm using&lt;/div&gt;&lt;div&gt;gcc 4.4.1, and some of the features may not work with versions earlier than gcc 4.4. 
&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;You can proceed to the first blog about the code here:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Next article: &lt;a href="http://wbhart.blogspot.com/2010/09/bsdnt-v01-basic-types-and-addition.html"&gt;BSDNT v0.1 - basic types and addition&lt;/a&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7651115430416156636-2834987883413406300?l=wbhart.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://wbhart.blogspot.com/feeds/2834987883413406300/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://wbhart.blogspot.com/2010/09/bsdnt-introduction.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7651115430416156636/posts/default/2834987883413406300'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7651115430416156636/posts/default/2834987883413406300'/><link rel='alternate' type='text/html' href='http://wbhart.blogspot.com/2010/09/bsdnt-introduction.html' title='BSDNT - Introduction'/><author><name>William Hart</name><uri>http://www.blogger.com/profile/18416881057216462316</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7651115430416156636.post-5456143144852131986</id><published>2009-06-21T10:18:00.000-07:00</published><updated>2009-06-21T13:31:37.293-07:00</updated><title type='text'>Stacks and Elliptic Cohomology</title><content type='html'>This week I became interested in two different topics due to conversations that I overheard. 
The first is the topic of stacks and the second is elliptic cohomology.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Stacks&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;Apparently there are numerous different kinds of stacks - Deligne-Mumford Stacks, Artin Stacks and for the die hard, apparently more general kinds of stacks.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The following is probably completely wrong, but is my understanding of what stacks are about.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Consider elliptic curves defined over the complex numbers K = C say. It is a classical result that up to isomorphism, these can be parameterised by points in the complex upper half plane, modulo the action of the modular group. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Now the upper half plane, modulo the modular group can be compactified by adding a point at the cusp, and made into a Riemann surface (of genus zero in this case). We can put coordinates on this Riemann surface (the complex j-function as it happens) and turn it into an algebraic curve of genus 0.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In fact, two elliptic curves are isomorphic iff their j-invariants are equal. In other words the Riemann sphere or j-line as it is often called, can be thought of as classifying all elliptic curves. In fact we call C the coarse moduli space for elliptic curves.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Now suppose we try to construct a universal space for families of elliptic curves over this base space. The problem is that an elliptic curve can have extra automorphisms and it is possible for a family of elliptic curves to contain isomorphic elliptic curves for this reason. That prevents us from having a universal space for families of elliptic curves. 
&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The way we get around this is using stacks. We define the stack of elliptic curves which is a category whose objects are families of elliptic curves over a base space (fixed for that family) and we define a morphism to be a map between families of elliptic curves along with a map between the corresponding base spaces such that the map between families is compatible with the map between base spaces.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Furthermore, for it to be a morphism (X'-&gt;B') -&gt; (X-&gt;B), we require that if we pull the family of elliptic curves X back along the map B'-&gt;B of base spaces, we get a family of elliptic curves isomorphic to the family X'.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We can restrict to the subcategory of families of elliptic curves over a fixed base space B if we want. We call this subcategory the fibre over B.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Now note that the fibre over a base space B is a groupoid (a category whose morphisms are all isomorphisms). We say that the stack of elliptic curves is a category fibred in groupoids.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Now it is clear that there is a universal family of elliptic curves with respect to this construction.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;There's more to stacks than this (some of the critical components of the definition are omitted above). 
&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;A stack is of Deligne-Mumford type (formerly an algebraic stack) if it satisfies some additional conditions, in particular that there is an etale surjective morphism (called an atlas) from a scheme U to the stack F, amongst other things.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;An Artin stack (nowadays what is referred to as an algebraic stack) simply replaces etale with smooth in the previous definition.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Anyhow, what was interesting to me is that nowadays stacks are replacing schemes as the ultimate objects of interest. A lot of work has been done to popularise them. Hey, if you want to know more there's only 1000 pages to read: &lt;a href="http://www.math.columbia.edu/~dejong/algebraic_geometry/stacks-git/"&gt;The Stacks Project&lt;/a&gt;.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Elliptic cohomology&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;The second topic which piqued my interest this week was elliptic cohomology. I thought maybe this might be related to parabolic cohomology, which is defined in terms of parabolic cusps (fixed points of parabolic elements of SL_2(R)). 
But I don't know that this is the origin of the term.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Instead I found this enormous survey on the web, which is written helpfully:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;a href="http://www-math.mit.edu/~lurie/papers/survey.pdf"&gt;http://www-math.mit.edu/~lurie/papers/survey.pdf&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;That's a long document to be called a summary, so I'll give a summary of the summary.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;To a topological space X we can associate the singular cohomology groups A^n(X) = H^n(X; Z) which can be characterised by the Eilenberg-Steenrod axioms. Any collection of functors and connecting maps satisfying these axioms necessarily gives you the usual integral cohomology functors (X \subseteq Y) -&gt; H^n(X, Y; Z). More generally we can replace Z with any abelian group M.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Now if we drop the last of the E-S axioms:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;* If X is a point then A^n(X) = { 0 if n \neq 0 and Z if n = 0 }&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;then we get something more general, called a cohomology theory.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Interestingly, complex K-theory is an example of such a cohomology theory!&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Complex K-theory is a so-called multiplicative cohomology theory, because A^n(*) is a graded commutative ring. 
Another nice feature is that it is periodic:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;* According to the Bott periodicity theorem for complex K-theory, there is an element \beta in K^2(*) such that multiplication by \beta induces an isomorphism: \beta : K^n(X) -&gt; K^{n+2}(X)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;and even:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;* K^i(*) = 0 if i is odd.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Ordinary cohomology H*(X; A) for a commutative ring A is even, but to make it periodic we need to take products over every second cohomology group A^n(X; A) = \prod_k H^{n+2k}(X; A)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Now when A is an even periodic cohomology, it turns out that A(CP^\infty) of the infinite dimensional complex projective space, is isomorphic to a formal power series ring A(*)[[t]] over the commutative ring A(*).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We can view the element t as the first Chern class of the universal line bundle O(1). The space CP^\infty is a classifying space for complex line bundles, i.e. for any complex line bundle L on a space X there is a (classifying) map \phi : X -&gt; CP^\infty and an isomorphism L &lt;-&gt; \phi*(O(1)).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We then define c1(L) = \phi*(t) \in A(X), the first Chern class of the cohomology A.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In ordinary cohomology the first Chern class of a tensor product of line bundles is simply the sum of the first Chern classes of the line bundles. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In the case of complex K-theory line bundles L can be thought of as representatives of elements of K(X) itself. We write such an element [L]. Then c1(L) = [L] - 1. 
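&lt;br /&gt;&lt;br /&gt;As a short worked step (my own filling-in, using only the standard fact that the product in K(X) is induced by the tensor product of bundles, so that [L1 \tensor L2] = [L1][L2]), this definition behaves as follows on a tensor product: c1(L1 \tensor L2) = [L1][L2] - 1 = (1 + c1(L1))(1 + c1(L2)) - 1.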
&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Now c1(L1 \tensor L2) = c1(L1) + c1(L2) + c1(L1)c1(L2).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In the general case we have:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;c1(L1 \tensor L2) = f(c1(L1), c1(L2)) for some f \in A(*)[[t1, t2]].&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;It turns out that the following properties hold:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;* f(0, t) = f(t, 0) = t&lt;/div&gt;&lt;div&gt;* f(u, v) = f(v, u)&lt;/div&gt;&lt;div&gt;* f(a, f(b, c)) = f(f(a, b), c)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;A power series with these properties is called a commutative 1-dimensional formal group law over the commutative ring A(*). It gives a group structure on the formal scheme Spf A(*)[[t]] = Spf A(CP^\infty).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We call the first of the group laws above f(a, b) = a + b the additive formal group law denoted \hat{G_a} and the other f(a, b) = a + b + a*b the multiplicative formal group law, denoted \hat{G_m}.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Now one wonders if there are other possible formal group laws. It turns out that the Lazard ring is a ring classifying formal group laws. Quillen proved that it comes from a cohomology called periodic complex cobordism denoted MP. There is a canonical isomorphism MP(CP^\infty) &lt;-&gt; MP(*)[[t]] and the coefficient ring MP(*) is the Lazard ring.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;One may construct the moduli stack of all formal group laws M_{FGL} so that for a commutative ring R, the set of homomorphisms Hom(Spec R, M_{FGL}) can be identified with the power series f(u, v) \in R[[u, v]] satisfying the three conditions above. 
Then M_{FGL} is an affine scheme, in fact M_{FGL} = Spec MP(*), as we'd expect for a moduli stack.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Well it gets a bit more complicated than that. One must mod out by the action on M_{FGL} by the group of automorphisms of the formal affine line Spf Z[[x]]. This yields the stack of formal groups M_{FG}.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Now what is an elliptic cohomology?&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Well we need to relax one thing slightly. Instead of demanding that we have a periodic cohomology, we'll just require our multiplicative cohomology A to be weakly-periodic:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;* The natural map A^2(*) \tensor_{A(*)} A^n(*) -&gt; A^{n+2}(*) is an isomorphism for all n \in Z.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Now we can define an elliptic cohomology A:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;* R is a commutative ring&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;* E is an elliptic curve over R&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;* A is a multiplicative cohomology which is even and weakly-periodic&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;* There are isomorphisms A(*) &lt;-&gt; R and \hat{E} &lt;-&gt; Spf A(CP^\infty) of formal groups, over R which is isomorphic to A(*)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Here \hat{E} represents the formal completion of E along its identity section. It is a commutative 1-dimensional formal group over R. 
It is classified by a map \phi : Spec R -&gt; M_{FG}.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;For finite cell complexes X we may interpret the complex cobordism groups MP^n(X) as quasi-coherent sheaves on the moduli stack M_{FG} and in the case of a formal group over R we can define A^n(X) to be the pullback of these sheaves along \phi.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;When \phi is flat then we call the formal group Landweber-exact. Landweber gave a criterion for determining when a formal group is Landweber-exact.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Anyhow it turns out that in the elliptic cohomology case, when the formal group is Landweber-exact, the elliptic curve and the isomorphism of the definition of elliptic cohomology are uniquely given. In this case, giving the elliptic cohomology theory is exactly the same thing as giving an elliptic curve.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Now in the case of the formal multiplicative group, there is a universal formal group. But in the case of elliptic cohomology, there is no such thing as a universal elliptic curve over a commutative ring. The moduli stack of elliptic curves M_{1,1} is not an affine variety, and not even a scheme. However, it is a Deligne-Mumford stack.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;For each etale morphism \phi : R \to M_{1,1} there is an elliptic curve E_\phi and happily, when \phi is etale, the associated formal group \hat{E_\phi} is Landweber-exact.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Well, this allows us to create on M_{1,1} a presheaf taking on values in the category of cohomology theories. But this is a nasty construction and so we try to represent our cohomology theories by representing spaces.  &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Roughly speaking this is where the theory of E-\infty rings or E-\infty spectra comes in. 
Basically a cohomology theory A_\phi can be represented by a spectrum. That eventually allows one to develop a universal elliptic cohomology theory.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We define a presheaf O_{M}^{Der} of E_\infty rings representing elliptic cohomology theories on the category {\phi : Spec R -&gt; M_{1,1}} mentioned above. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Now we extract our universal cohomology theory by taking a "homotopy limit" of the functor  O_{M}^{Der}. This gives us the E-\infty ring of topological modular forms tmf[\Delta^{-1}].&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;(Note tmf is what you get when you replace M_{1,1} by its Deligne-Mumford compactification. After inverting 2 and 3 [to do with the extra automorphisms on elliptic curves] there is an isomorphism from tmf to the ring of integral holomorphic modular forms.)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;OK that summarises the first 9 pages of the long summary above. But probably it isn't much of an improvement on the original. 
But it helps the memory to write it all down somewhere.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7651115430416156636-5456143144852131986?l=wbhart.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://wbhart.blogspot.com/feeds/5456143144852131986/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://wbhart.blogspot.com/2009/06/stacks-and-elliptic-cohomology.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7651115430416156636/posts/default/5456143144852131986'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7651115430416156636/posts/default/5456143144852131986'/><link rel='alternate' type='text/html' href='http://wbhart.blogspot.com/2009/06/stacks-and-elliptic-cohomology.html' title='Stacks and Elliptic Cohomology'/><author><name>William Hart</name><uri>http://www.blogger.com/profile/18416881057216462316</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7651115430416156636.post-8565704276084071942</id><published>2009-06-07T11:25:00.000-07:00</published><updated>2009-06-07T18:13:24.782-07:00</updated><title type='text'>MPIR - version 1.2</title><content type='html'>Finally version 1.2 of MPIR (Multiple Precision Integers and Rationals), of which I am a lead developer, is released: &lt;a href="http://www.mpir.org/"&gt;http://www.mpir.org/&lt;/a&gt; &lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;MPIR is an open source project based on the GNU Multi Precision (GMP, see &lt;a href="http://www.gmplib.org/"&gt;http://www.gmplib.org/&lt;/a&gt;) library, but still licensed under version 2 of the 
LGPL.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;About a month ago GMP version 4.3.0 was released, which they had been preparing for a LONG time. We expected some nice features, and found some, which we have been subsequently catching up with.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In particular we needed:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;* Faster assembly code for multiplication basecase &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;* Faster unbalanced integer multiplication (where you are multiplying integers of different sizes)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;* Improvements to the speed of multiplying medium sized integers (50-2000 words where 1 word = 64 bits on a 64 bit machine)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;* Asymptotically fast extended GCD&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;* Faster integer multiplication for large integers (&gt; 2000 limbs)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;* Faster integer squaring&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;* Other assembly improvements&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-weight: bold;"&gt;Multiplication Basecase&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-weight: bold;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;Jason Moxham, a brilliant MPIR developer from the UK, decided to take on the mul_basecase challenge. He's been writing an assembly optimiser for quite a few months. 
It takes hand written assembly code and reorganises the instructions over and over, within permitted bounds, to try and find an optimal sequence.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The results over the past year are pretty impressive to see:&lt;br /&gt;&lt;br /&gt;&lt;table border="1"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td colspan="4" style="text-align: center;"&gt;&lt;span class="Apple-style-span" style="font-weight: bold;"&gt;Multiplications per second&lt;/span&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;128 x 128 bits&lt;/td&gt;&lt;td&gt;512 x 512 bits&lt;/td&gt;&lt;td&gt;8192 x 8192 bits&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;MPIR 1.2.0&lt;/td&gt;&lt;td style="text-align: right;"&gt;53794646&lt;/td&gt;&lt;td style="text-align: right;"&gt;12488043&lt;/td&gt;&lt;td style="text-align: right;"&gt;117404&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;GMP 4.3.0&lt;/td&gt;&lt;td style="text-align: right;"&gt;52766506&lt;/td&gt;&lt;td style="text-align: right;"&gt;10879150&lt;/td&gt;&lt;td style="text-align: right;"&gt;114927&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;MPIR 1.1.2&lt;/td&gt;&lt;td style="text-align: right;"&gt;51802252&lt;/td&gt;&lt;td style="text-align: right;"&gt;11802334&lt;/td&gt;&lt;td style="text-align: right;"&gt;111772&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;MPIR 1.0.0&lt;/td&gt;&lt;td style="text-align: right;"&gt;35856598&lt;/td&gt;&lt;td style="text-align: right;"&gt;10928085&lt;/td&gt;&lt;td style="text-align: right;"&gt;111641&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;MPIR 0.9.0&lt;/td&gt;&lt;td style="text-align: right;"&gt;37299412&lt;/td&gt;&lt;td style="text-align: right;"&gt;8122452&lt;/td&gt;&lt;td style="text-align: right;"&gt;86301&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;GMP 4.2.1&lt;/td&gt;&lt;td style="text-align: right;"&gt;25896136&lt;/td&gt;&lt;td style="text-align: right;"&gt;6383542&lt;/td&gt;&lt;td style="text-align: right;"&gt;60819&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br 
/&gt;&lt;/div&gt;&lt;div&gt;Note that the number of multiplications that can be done per second has more than doubled in most cases since last year, and all this just from assembly language improvements.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The timings for this table were made on an Opteron (AMD K8) server, running at 2.8 GHz. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-weight: bold;"&gt;Toom Multiplication&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Toom multiplication can be described in terms of a polynomial multiplication problem. Firstly the large integers to be multiplied are split apart into pieces, which form the coefficients of polynomials. Then the original integer multiplication problem becomes one of polynomial multiplication, followed by a reconstruction phase at the end, where the polynomial coefficients of the product are stitched together to make the product integer.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;This can be thought of in terms of writing the original integer in terms of some large base, 2^M, where M is usually a multiple of the machine word size (e.g. M = 64B for some integer B on a 64 bit machine).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;For example, Toom-3 breaks the two integers into 3 pieces each, corresponding to the digits in base 2^M:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;A = a0 + a1 * 2^M + a2 * 2^(2M)&lt;/div&gt;&lt;div&gt;B = b0 + b1 * 2^M + b2 * 2^(2M)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;So we construct two polynomials:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;f(x) = a0 + a1 * x + a2 * x^2&lt;/div&gt;&lt;div&gt;g(x) = b0 + b1 * x + b2 * x^2&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Then A * B = f(2^M) * g(2^M). 
So we first compute h(x) = f(x) * g(x), and then A * B = h(2^M).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;To multiply the polynomials f(x) and g(x), we note that their product will be degree 4, and so we can determine it fully if we know its value at 5 independent points. We choose, for convenience, the points 0, 1, -1, 2, infinity. Finally we note that h(0) = f(0) * g(0), h(1) = f(1) * g(1), etc.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Thus to compute the product we only need to find the values of f(x) and g(x) at the chosen points, do five small multiplications to get the value of h(x) at those points, then interpolate to get the coefficients of h(x).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;So we have replaced the original large multiplication with five small ones. Note that each of the small multiplications involves integers at most one third of the size. If we just used schoolboy multiplication to multiply the "digits" of our large integers, we'd do 3 x 3 = 9 small "digit multiplications".  Instead, through the magic of evaluation/interpolation we only have 5 small "digit multiplications" to do (and a few additions and subtractions, etc., for the evaluation and interpolation phases).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The basecase multiplication code already handles unbalanced multiplication, so I decided to focus on unbalanced variants of Toom multiplication algorithms.&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;In MPIR we use Toom-2 (also known as Karatsuba multiplication - though we use a variant called Knuth multiplication where evaluation happens at 0, -1 and infinity), Toom-3, Toom-4 and Toom-7. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I spent a fair bit of time optimising Toom-3. 
As we have a lot of new assembly instructions available in MPIR, I was actually able to get the interpolation sequence used down from about 11 to 8 steps in MPIR 1.2. I also discovered that it was possible to use the output integer for temporary space. If one is careful, one can even set it up so that the temporaries stored in the output space don't have to be moved at the end, i.e. they are in precisely the right place as part of the output integer at the end of the algorithm.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I made similar optimisations for Toom-4 and Toom-7 and I also switched the interpolation phase of the algorithms over to twos complement. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Here are the results of this work in the Toom range:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="-webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px;"&gt;&lt;span class="Apple-style-span" style="-webkit-border-horizontal-spacing: 0px; -webkit-border-vertical-spacing: 0px; "&gt;&lt;table border="1"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td colspan="5" style="text-align: center; "&gt;&lt;span class="Apple-style-span" style="font-weight: bold; "&gt;Toom Multiplication&lt;/span&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;Kara (3200 bits)&lt;/td&gt;&lt;td&gt;Toom3 (7680 bits)&lt;/td&gt;&lt;td&gt;Toom4 (25600 bits)&lt;/td&gt;&lt;td&gt;Toom7 (131072 bits)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;MPIR 1.2.0&lt;/td&gt;&lt;td style="text-align: right; "&gt;300414&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;74337&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;11428&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;1153&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;GMP 4.3.1&lt;/td&gt;&lt;td style="text-align: right; "&gt;274738&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;68498&lt;br /&gt;&lt;/td&gt;&lt;td 
style="text-align: right; "&gt;11263&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;1042&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;MPIR 1.1.2&lt;/td&gt;&lt;td style="text-align: right; "&gt;260501&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;60039&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;9890&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;993&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;MPIR 1.0.0&lt;/td&gt;&lt;td style="text-align: right; "&gt;261599&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;62275&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;9900&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;826&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;GMP 4.2.1&lt;/td&gt;&lt;td style="text-align: right; "&gt;163273&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;33460&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;4980&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;408&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;Timings are on a 2.4GHz Core 2 (eno).&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The differences in timing in the karatsuba region for MPIR give an indication of how much of a speedup is occurring on account of better assembly code. Anything beyond that indicates a speedup in the Toom algorithms themselves. 
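&lt;br /&gt;&lt;br /&gt;As a rough illustration of the Toom-3 evaluation/interpolation scheme described earlier in this section, here is a small Python sketch. It is not the MPIR code (which is C and assembly working on limbs); the name toom3 and the parameter m (playing the role of M) are made up for this sketch, and the interpolation sequence is just one straightforward choice, not the optimised 8 step sequence mentioned above.&lt;pre&gt;
```python
def toom3(a, b, m):
    """Multiply non-negative integers a and b via Toom-3 with base 2**m.

    Illustrative sketch only: the three pieces of each operand are balanced
    when a and b are both below 2**(3*m), but the result is correct anyway.
    """
    base = 2 ** m

    # Split each operand into digits: x = x0 + x1*base + x2*base^2.
    a2, rest = divmod(a, base * base)
    a1, a0 = divmod(rest, base)
    b2, rest = divmod(b, base * base)
    b1, b0 = divmod(rest, base)

    # Evaluate f and g at 0, 1, -1, 2 and infinity, multiplying pointwise.
    # These are the five small multiplications.
    h0 = a0 * b0
    h1 = (a0 + a1 + a2) * (b0 + b1 + b2)
    hm1 = (a0 - a1 + a2) * (b0 - b1 + b2)
    h2 = (a0 + 2 * a1 + 4 * a2) * (b0 + 2 * b1 + 4 * b2)
    hinf = a2 * b2

    # Interpolate the coefficients of h(x) = c0 + c1*x + ... + c4*x^4.
    # Every division below is exact.
    c0 = h0
    c4 = hinf
    c2 = (h1 + hm1) // 2 - c0 - c4
    s = (h1 - hm1) // 2                      # s = c1 + c3
    t = (h2 - c0 - 4 * c2 - 16 * c4) // 2    # t = c1 + 4*c3
    c3 = (t - s) // 3
    c1 = s - c3

    # Reconstruct the product as h(2^m).
    return c0 + c1 * base + c2 * base**2 + c3 * base**3 + c4 * base**4
```
&lt;/pre&gt;Here the five pointwise products simply use Python's built-in multiplication; a real implementation would instead call a smaller multiplication routine (basecase, Karatsuba or Toom-3 again) on the pieces, and would do the additions, subtractions and exact divisions in place on arrays of words.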
&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Note that GMP 4.2.1 had only Karatsuba and Toom-3, and GMP 4.3.1 does not have Toom-7.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-weight: bold;"&gt;Unbalanced Multiplication&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Now unbalanced multiplication proceeds in the same way, except that we have integers, and thus polynomials, of different lengths:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;f(x) = a0 + a1 * x + a2 * x^2 + a3 * x^3&lt;/div&gt;&lt;div&gt;g(x) = b0 + b1 * x&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Again the product has degree four, so we need to interpolate at five points, which we can choose to be precisely the same points that we used for Toom-3, even reusing the interpolation code if we wish.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We call this Toom-42, denoting that we split our integers into 4 and 2 "digits" respectively, in our base 2^M.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The result is 5 small multiplications instead of 4 x 2 = 8, still a substantial saving.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Finally I implemented some of the unbalanced variants. 
In particular we now have Toom-42, an unbalanced version of Toom-33 (where the top coefficients are not necessarily exactly the same size) and Toom-32.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The results for unbalanced multiplications in the Toom range are now quite good:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;table border="1"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td colspan="4" style="text-align: center; "&gt;&lt;span class="Apple-style-span" style="font-weight: bold; "&gt;Unbalanced Multiplication&lt;/span&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;15000 x 10000 bits&lt;/td&gt;&lt;td&gt;20000 x 10000 bits&lt;/td&gt;&lt;td&gt;30000 x 10000 bits&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;MPIR 1.2.0&lt;/td&gt;&lt;td style="text-align: right; "&gt;32975&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;25239&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;14995&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;GMP 4.3.1&lt;/td&gt;&lt;td style="text-align: right; "&gt;31201&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;24099&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;14104&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;MPIR 1.1.2&lt;/td&gt;&lt;td style="text-align: right; "&gt;23190&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;19970&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;13289&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;GMP 4.2.1&lt;/td&gt;&lt;td style="text-align: right; "&gt;13235&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;10855&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;7235&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;Timings are on the 2.4GHz Core 2 (eno).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-weight: bold;"&gt;Toom Squaring&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span 
class="Apple-style-span" style="font-weight: bold;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;Squaring using the Toom algorithms is much the same, except that there is no need to evaluate the same polynomial twice. We also save because the pointwise multiplications are now squarings too, so we can recurse right down to a basecase squaring routine.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;All of the same optimisations can be applied for Toom squaring as for ordinary Toom multiplication. Here are the comparisons:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;table border="1"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td colspan="5" style="text-align: center; "&gt;&lt;span class="Apple-style-span" style="font-weight: bold; "&gt;Toom Squaring&lt;/span&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;Kara (5120 bits)&lt;/td&gt;&lt;td&gt;Toom3 (12800 bits)&lt;/td&gt;&lt;td&gt;Toom4 (51200 bits)&lt;/td&gt;&lt;td&gt;Toom7 (131072 bits)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;MPIR 1.2.0&lt;/td&gt;&lt;td style="text-align: right; "&gt;358656&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;82274&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;10514&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;2762&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;GMP 4.3.0&lt;/td&gt;&lt;td style="text-align: right; "&gt;357031&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;83676&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;10430&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;2564&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;MPIR 1.1.2&lt;/td&gt;&lt;td style="text-align: right; "&gt;349686&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;78510&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;9917&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;2347&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;GMP 4.2.1&lt;/td&gt;&lt;td 
style="text-align: right; "&gt;185201&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;42256&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;5112&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;1237&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The performance is now comparable to GMP until the Toom7 region, where we pull ahead considerably. These timings were done on the 2.8GHz Opteron (K8) server.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-weight: bold;"&gt;FFT Multiplication&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-weight: bold;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;For multiplication of large integers, MPIR uses a Fast Fourier Transform (FFT) method. Instead of working over the complex numbers, one can work over a ring Z/pZ where p is a specially chosen modulus (not necessarily prime), e.g. p = 2^M + 1 where M is some power of 2. This trick is due to Schönhage and Strassen.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;For MPIR 1.2 we decided to switch over to the &lt;a href="http://www.loria.fr/~zimmerma/software/"&gt;new FFT&lt;/a&gt; of Pierrick Gaudry, Alexander Kruppa and Paul Zimmermann. 
An implementation of ideas they presented at ISSAC 2007 was available to download from Paul and Alex's websites.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Most of the work in merging this FFT was removing bugs, making the code work efficiently on Windows and writing and running tuning code for all the platforms we support.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I wrote the tuning code; Brian Gladman discovered some primitives to use on Windows which replace the inline assembler available on Linux; and Jason Moxham, Brian Gladman, Jeff Gilchrist and I tuned the FFT on a range of systems.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Here are some examples of the speedup:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;table border="1"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td colspan="5" style="text-align: center; "&gt;&lt;span class="Apple-style-span" style="font-weight: bold;"&gt;FFT Multiplication&lt;/span&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;2000000 bits&lt;br /&gt;&lt;/td&gt;&lt;td&gt;6000000 bits&lt;br /&gt;&lt;/td&gt;&lt;td&gt;20000000 bits&lt;/td&gt;&lt;td&gt;64000000 bits&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;MPIR 1.2.0&lt;/td&gt;&lt;td style="text-align: right; "&gt;74.9&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;12.8&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;5.20&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;1.47&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;GMP 4.3.0&lt;/td&gt;&lt;td style="text-align: right; "&gt;52.4&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;13.4&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;3.66&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;0.813&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;MPIR 1.1.2&lt;/td&gt;&lt;td style="text-align: right; "&gt;47.2&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;12.5&lt;br 
/&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;3.09&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;0.742&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;GMP 4.2.1&lt;/td&gt;&lt;td style="text-align: right; "&gt;32.8&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;8.66&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;2.11&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;0.528&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Timings are again on the 2.8GHz Opteron server.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-weight: bold;"&gt;Extended GCD&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-weight: bold;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;A longstanding issue with MPIR has been the lack of a fast extended GCD. We had merged Niels Möller's asymptotically fast GCD code, but unfortunately no suitable extended GCD implementation was available to us. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I began by reading &lt;a href="http://www.lysator.liu.se/~nisse/archive/S0025-5718-07-02017-0.pdf"&gt;Niels' paper&lt;/a&gt; to understand the half-GCD algorithm (ngcd) that he invented. Essentially, half-GCD algorithms get their asymptotic improvement over the ordinary Euclidean algorithm as follows:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Begin by noting that the first few steps only depend on the uppermost bits of the original integers. Thus instead of working to full precision, one can split off the topmost bits and compute the first few steps of the Euclidean GCD on those bits. One keeps track of the steps taken in matrix form. 
One then has an exact representation for the steps taken so far in the algorithm.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Next one applies the matrix to the &lt;span class="Apple-style-span" style="font-weight: bold;"&gt;original &lt;/span&gt;integers. The integers that result are smaller than the original integers. One then finishes off the algorithm by applying the usual steps of the GCD algorithm to the smaller integers.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Of course, to get an asymptotic improvement one needs to take care over how the matrix is applied and to recurse the whole algorithm down to some fast basecase.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Once I understood that Niels' implementation explicitly computed the matrix at each step, it occurred to me that in order to turn it into an extended GCD algorithm, all I had to do was apply the matrix to the cofactors and keep track of any other changes that would affect the cofactors, throughout the algorithm.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;As an example of the speedup, for 1048576 bit integers extended GCD benches as follows:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;MPIR 1.1.2:   0.453 &lt;/div&gt;&lt;div&gt;MPIR 1.2.0:     4.06&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Again the benchmarks are for the 2.8GHz Opteron.&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-weight: bold;"&gt;Assembly Improvements&lt;/span&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-weight: bold;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;Finally, this mammoth MPIR release also contained numerous sped-up assembly routines due to Jason Moxham for AMD Opterons (which usually also translate to improvements for other 64 bit x86 platforms).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Below I include &lt;span class="Apple-style-span" style="font-weight: bold;"&gt;timings 
&lt;/span&gt;(smaller is better) at each point where improvements have been made in the assembly routines in MPIR.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;table border="1"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td colspan="5" style="text-align: center; "&gt;&lt;span class="Apple-style-span" style="font-weight: bold;"&gt;Assembly Improvements&lt;/span&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;MPIR 0.9&lt;/td&gt;&lt;td&gt;MPIR 1.0&lt;/td&gt;&lt;td&gt;MPIR 1.1.2&lt;/td&gt;&lt;td&gt;MPIR 1.2&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;mpn_add_n&lt;/td&gt;&lt;td style="text-align: right; "&gt;1598&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;1524&lt;/td&gt;&lt;td style="text-align: right; "&gt;&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;mpn_sub_n&lt;/td&gt;&lt;td style="text-align: right; "&gt;1598&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;1524&lt;/td&gt;&lt;td style="text-align: right; "&gt;&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;mpn_xor_n&lt;/td&gt;&lt;td style="text-align: right; "&gt;3015&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;2271&lt;/td&gt;&lt;td style="text-align: right; "&gt;&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;mpn_and_n&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;3014&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;1772&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;1523&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;mpn_ior_n&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;3014&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;1771&lt;br /&gt;&lt;/td&gt;&lt;td 
style="text-align: right; "&gt;1525&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;mpn_nand_n&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;3013&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;2035&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;1775&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;mpn_nior_n&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;3013&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;1773&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;mpn_andn_n&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;3013&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;2273&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;mpn_iorn_n&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;3015&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;2271&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;mpn_addadd_n&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;2525&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;2190&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;mpn_addsub_n&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;2526&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;2189&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;mpn_subadd_n&lt;br 
/&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;2526&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;2190&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;mpn_addmul_1&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;3094&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;2524&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;mpn_submul_1&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;3092&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;2524&lt;/td&gt;&lt;td style="text-align: right; "&gt;&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;mpn_mul_1&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;3024&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;2522&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;mpn_sublsh1_n&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;2527&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;2400&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;2190&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;mpn_com_n&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;3014&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;1271&lt;/td&gt;&lt;td style="text-align: right; "&gt;&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;mpn_divexact_by3&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;12016&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: 
right; "&gt;2278&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;mpn_lshift&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;2531&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;1701&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;mpn_rshift&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;2532&lt;/td&gt;&lt;td style="text-align: right; "&gt;1617&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;mpn_hamdist&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;8281&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;1791&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;mpn_popcount&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;7281&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;1527&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;MPN_COPY&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;3012&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;2017&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;1021&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;mpn_divrem_1&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;23649&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;15119&lt;br /&gt;&lt;/td&gt;&lt;td 
style="text-align: right; "&gt;&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;MPN_ZERO&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;2013&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;&lt;br /&gt;&lt;/td&gt;&lt;td style="text-align: right; "&gt;783&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;This time timings are taken on a 2.6GHz AMD K10 box.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Numerous new assembly functions have also been added for 64 bit x86 machines since MPIR began, including: mpn_addmul_2, mpn_addadd_n, mpn_sublsh1_n, mpn_divexact_byff, mpn_rsh1add_n, mpn_lshift1, mpn_rshift1, mpn_rsh1sub_n, mpn_mul_2, mpn_lshift2, mpn_rshift2.&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://wbhart.blogspot.com/feeds/8565704276084071942/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://wbhart.blogspot.com/2009/06/mpir-version-12.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7651115430416156636/posts/default/8565704276084071942'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7651115430416156636/posts/default/8565704276084071942'/><link rel='alternate' type='text/html' href='http://wbhart.blogspot.com/2009/06/mpir-version-12.html' title='MPIR - version 1.2'/><author><name>William Hart</name><uri>http://www.blogger.com/profile/18416881057216462316</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry></feed>
