In a recent update we added three PRNGs (pseudo random number

generators). What we are going to do today is add test code for

these.

But how do you test a pseudo random generator? It's producing

basically random values after all. So what does it matter if the

output is screwed up!?

Well, it does matter, as shown by the problems on 32 bit machines

which I wrote about in the PRNG blog post. It would also matter if

the PRNGs were broken on some platform such that they always output

0 every time!

There's a few ways of testing PRNGs. One is simply to run them for a

given number of iterations, write down the last value it produces and

check that it always does this.

The method we are going to use is slightly more sophisticated. We are

going to hash a long series of outputs from the PRNGs, using a hash

function, and check that the hash of the output is always the same.

Basically, our test code will take a long string of words from the

PRNGs, write them to an array of bytes, then compute the sha1 hash of

that array of bytes. It'll then compare the computed hash with a hash

we've computed previously to ensure it has the same value as always.

Moreover, we'll set it up so that the hash is the same regardless of

whether the machine is big or little endian.

The hash function we are going to use is called sha1. Specifically,

we'll be using an implementation of the same written by Brian Gladman

(he also supplied the new test code for the PRNGs for today's update).

The first step is to identify whether the machine is big or little

endian. This refers to the order of bytes within a word in physical

memory. On little endian machines (such as x86 machines) the least

significant byte of a word comes first. On big endian machines the

order is the other way around. Thus the number 0x0A0B0C0D would have

the byte 0x0D stored first on a little endian machine, but 0x0A stored

first on a big endian machine.

Endianness can be identified by architecture, or it can be identified

with a short program. We choose to use the latter method as it should

be hard to fool. At configure time a short C program will run that will

place bytes into a four byte array, then read that array as a single

32 bit number. We then compare the value to a 32 bit value that would

be stored in the given way on a little endian machine. If it compares

equal, then the machine must be little endian. If not we compare with

a number that would be stored in the given way on a big endian machine.

If the machine doesn't match either order, then it must be a very rare

machine with mixed endianness, which we don't support in bsdnt.

The configure script will write some defines out to config.h which

then allow bsdnt modules to identify whether the machine is little or

big endian at compile time, i.e. at zero runtime cost.

Now to discuss the sha1 hashing scheme.

A hashing scheme simply takes a piece of data and computes from it a

series of bits which serve to "identify" that piece of data. If

someone else has access to the same hashing algorithm and a piece of

data which purports to be an exact copy of the original, then they

can verify its identity by computing its hash and comparing.

Of course a hash is only valuable in this sense if it is much shorter

than the piece of data itself (otherwise you'd just compare the

actual data itself).

A very simple hashing scheme might simply add all the words in the

input to compute a hash consisting of a single word.

A secure hashing scheme has at least two other properties.

i) It shouldn't be possible to determine the original data from its

hash. (The data may be secret and one may wish to provide for the

independent verification of its authenticity by having the recipient

compare the hash of the secret data with a publicly published value.

Or, as is sometimes the case, the hash of secret data, such as a

password, might be transmitted publicly, to compare it with a

pre-recorded hash of the data.)

ii) It must be computationally infeasible to substitute fake data

for the original such that the computed hash of the fake data is the

same as that of the original data.

Of course, if the hash is short compared to the data being hashed,

then by definition many other pieces of data will have the same hash.

The only requirement is that it should be computationally infeasible

to find or construct such a piece of data.

The first step in the SHA1 algorithm is some bookkeeping. The

algorithm, as originally described, works with an input message which

is a multiple of 16 words in length. Moreover, the last 64 bits are

reserved for a value which gives the length of the original message in

bits.

In order to facilitate this, the original message is padded to a

multiple of 16 words in length, with at least enough padding to allow

the final 64 bits to be part of the padding, and to not overlap part

of the message.

The padding is done by first appending a single binary 1 bit, then

binary zeroes are appended until the message is the right length.

Then of course the length in bits of the original message is placed

in the final 64 bits of the message.

The hashing algorithm itself performs a whole load of prescribed

twists and massages of the padded message.

For this purpose some functions and constants are used.

Given 32 bit words B, C and D there are four functions:

1) f0(B, C, D) = (B AND C) OR ((NOT B) AND D)

2) f1(B, C, D) = B XOR C XOR D

3) f2(B, C, D) = (B AND C) OR (B AND D) OR (C AND D)

4) f3(B, C, D) = B XOR C XOR D

and four corresponding 32 bit constants:

1) C0 = 0x5A827999

2) C1 = 0x6ED9EBA1

3) C2 = 0x8F1BBCDC

4) C3 = 0xCA62C1D6

To begin the algorithm, we break the padded message up into 16 word

blocks M1, M2, M3, i.e. each Mi is 16 words of the padded message.

Each block is processed via a set of steps, and an "accumulated" hash

of 160 bits, consisting of five 32 bit words (the final hash we are

after) is computed: H0, H1, H2, H3, H4.

The algorithm starts by initialising the five "hash words" to the

following values:

H0 = 0x67452301, H1 = 0xEFCDAB89, H2 = 0x98BADCFE, H3 = 0x10325476

and H4 = 0xC3D2E1F0.

Each block of 16 words, Mi, of the padded message is then used in

turn to successively transform these five words, according to the

following steps:

a) Break Mi into 16 words W0, W1, ..., W15.

b) For j = 16 to 79, let Wj be the word given by

Wj = S^(-1)(W{j-3}) XOR W{j-8} XOR W{j-14} XOR W{j-16}),

where S^n(X) means to rotate the word X to the left through n bits

(a negative n means right rotation).

c) Make a copy of the hashing words:

A = H0, B = H1, C = H2, D = H3, E = H4

d) For j = 0 to 79 repeat the following set of transformations in

the order given:

TEMP = S^5(A) + f{j/20}(B, C, D) + E + Wj + C{j/20},

E = D, D = C, C = S^30(B), B = A, A = TEMP,

where j/20 signifies "floor division" by 20, and where f and C are

the above-defined functions and constants.

e) Update the hashing words according to the following:

H0 = H0 + A, H1 = H1 + B, H2 = H2 + C, H3 = H3 + D, H4 = H4 + E.

Note that steps a-e are repeated for each block of 16 words, Mi in

the padded message, further manipulating the five words with each run.

The resulting five words H0, H1, H2, H3, H4 after all the words of the

padded message have been consumed, constitutes the sha1 hash of the

original message.

The function to compute the sha1 hash of a message is given in the

files sha1.c and sha1.h in the top level directory of the source

tree.

A new test file t-rand.c is added in the test directory. It contains

the sha1 hash of a large number of words as output by our three

PRNGs. If a user of bsdnt has the same hash for the PRNGs when run

on their machine, then they can have a very high level of confidence

that they are working as expected.

Note that the sha1 algorithm is known as a secure hashing algorithm,

which means that in theory it can be used to hash very important

data so that the recipient can independently confirm the data hasn't

been tampered with (by computing the hash of the value and making

sure it matches some published value).

We don't explain how sha1 actually works. The mysterious constants

are not so mysterious. C0 is the square root of 2 in hexadecimal, C1 is

the square root of 3, C2 the square root of 5, C3 the square root of 10.

I don't know the meaning of the functions f0-f3.

What is worth noting is that in recent years, people have figured out

how to produce sha1 hash collisions (two messages with the same hash).

I don't pretend to be an expert in such things.

All we care about here is that a broken PRNG really can't pretend to

be working, and for that, sha1 works a treat.

Disclaimer: use the information in this post at your OWN RISK!! We

make no representations as to its correctness. The same goes for

bsdnt itself. Read the license agreement for details.

Warning: cryptography is restricted by law in many countries including

many of those where the citizens believe it couldn't possibly be so.

Please check your local laws before making assumptions about what you

may do with crypto.

The code for today's article is here:

v0.23