It's time we added architecture dependent assembly support to bsdnt.
Here is how we are going to do it. For each of the implementation files
we have (nn_linear.c, nn_quadratic.c), we are going to add a _arch file,
e.g. nn_linear_arch.h, nn_quadratic_arch.h, etc.
This file will be #included in the relevant implementation file, e.g.
nn_linear_arch.h will be #included in nn_linear.c.
These _arch.h files will be generated by a configure script, based on
the CPU architecture and operating system kernel. They will contain nothing
but a list of #includes of architecture specific .h files from an arch directory.
For example we might have nn_linear_x86_64_core2.h in the arch directory,
which provides routines specific to core2 processors running in 64 bit
mode.
In these architecture specific files, we'll have inline assembly routines
designed to replace various routines in the pure C implementation files
that we have already written. They'll do this by defining flags, e.g.
HAVE_ARCH_nn_mul1_c, which will specify that an architecture specific
version of nn_mul1_c is available. We'll then wrap the implementation of
nn_mul1_c in nn_linear.c with a test for this flag. If the flag is defined,
the C version will not be compiled.
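The flag mechanism can be sketched in a single file. This is only an illustration: the signature of nn_mul1_c used here (32 bit words, a carry-in and a returned carry-out) is a simplifying assumption made so the sketch is portable, not necessarily the real bsdnt prototype.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t word_t; /* assumption: 32 bit words, for portability */

/* nn_linear_arch.h (generated by configure) would be #included here.
   An arch file it pulls in may define HAVE_ARCH_nn_mul1_c and supply
   an inline assembly nn_mul1_c. Nothing is defined in this sketch,
   so the C version below is the one that gets compiled. */

#ifndef HAVE_ARCH_nn_mul1_c
/* Pure C version: a = b * c + ci over m words, returns the carry-out. */
word_t nn_mul1_c(word_t *a, const word_t *b, size_t m, word_t c, word_t ci)
{
    for (size_t i = 0; i < m; i++) {
        uint64_t t = (uint64_t) b[i] * c + ci; /* cannot overflow 64 bits */
        a[i] = (word_t) t;
        ci = (word_t) (t >> 32);
    }
    return ci;
}
#endif
```

Compiling with -DHAVE_ARCH_nn_mul1_c and no replacement would leave nn_mul1_c undefined, which is why the flag and the assembly routine need to live in the same arch file.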
In order to make this work, the configure script has to work out whether
the machine is 32 or 64 bit and what the CPU type is. It will then link
in the correct architecture specific files.
At the present moment, we are only interested in x86 machines running on
*nix (or Windows, but the architecture will be determined in a different
way on Windows).
A standard way of determining whether the kernel is 64 bit or not is to
search for the string x86_64 in the output of uname -m. If something else
pops out then it is probably a 32 bit kernel.
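The configure script would do this test in shell, but the same check can be sketched in C using the POSIX uname() call, which returns the string that uname -m prints in its machine field. The function names here are made up for the illustration.

```c
#include <string.h>
#include <sys/utsname.h>

/* Returns 1 if a `uname -m` style string names a 64 bit x86 kernel. */
int machine_is_x86_64(const char *machine)
{
    return strstr(machine, "x86_64") != NULL;
}

/* Asks the running kernel. Returns 1 for a 64 bit x86 kernel and
   0 for anything else, including the case where uname() fails. */
int kernel_is_64bit(void)
{
    struct utsname u;

    if (uname(&u) < 0)
        return 0;
    return machine_is_x86_64(u.machine);
}
```

Splitting the string test from the system call keeps the first function easy to check against known strings such as "i686".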
Once we know whether we have a 32 or 64 bit machine, we can determine the
exact processor by using the cpuid instruction. This is an assembly
instruction supported by x86 cpus which tells you the manufacturer, family
and model of the CPU.
We include a small C program cpuid.c with some inline assembly to call the
cpuid instruction. As this program will only ever be run on *nix, we can
make use of gcc's inline assembly feature.
When the parameter to the cpuid instruction is 0 we get the vendor ID,
which is a 12 character string. We are only interested in "AuthenticAMD"
and "GenuineIntel" at this point.
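A minimal sketch of the vendor ID lookup follows. It is not the cpuid.c from the repository; the function names are invented here. It assumes gcc (or a compatible compiler) on x86, and reports all zeros elsewhere. With parameter 0, the 12 characters come back in the registers EBX, EDX and ECX, in that order.

```c
#include <stdint.h>

/* Executes cpuid with `leaf` in EAX (and ECX set to 0). */
void run_cpuid(uint32_t leaf, uint32_t *a, uint32_t *b,
               uint32_t *c, uint32_t *d)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __asm__ volatile ("cpuid"
                      : "=a" (*a), "=b" (*b), "=c" (*c), "=d" (*d)
                      : "a" (leaf), "c" (0));
#else
    (void) leaf;
    *a = *b = *c = *d = 0; /* not x86: nothing to report */
#endif
}

/* Copies the four bytes of r into out, lowest byte first. */
static void put4(char *out, uint32_t r)
{
    for (int i = 0; i < 4; i++)
        out[i] = (char) ((r >> (8 * i)) & 0xFF);
}

/* Builds the vendor string from the leaf 0 values of EBX, EDX, ECX. */
void vendor_from_regs(uint32_t b, uint32_t d, uint32_t c, char out[13])
{
    put4(out, b);
    put4(out + 4, d);
    put4(out + 8, c);
    out[12] = '\0';
}

/* Fills out with the vendor ID of the CPU we are running on. */
void get_vendor(char out[13])
{
    uint32_t a, b, c, d;

    run_cpuid(0, &a, &b, &c, &d);
    vendor_from_regs(b, d, c, out);
}
```

A caller would declare char v[13], call get_vendor(v) and compare v against "GenuineIntel" and "AuthenticAMD" with strcmp.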
When we pass the parameter 1 to the cpuid instruction, we get the processor
model and family.
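The family and model are packed into the EAX value returned for parameter 1. The sketch below decodes them using the bit layout from the vendors' CPUID documentation (it is not spelled out in this article), with made-up names for the struct and function. Extended fields are only folded in for the base families where they are defined.

```c
#include <stdint.h>

typedef struct {
    unsigned family;
    unsigned model;
} cpu_sig_t;

/* Decodes the EAX value returned by cpuid with parameter 1. */
cpu_sig_t decode_signature(uint32_t eax)
{
    unsigned base_model  = (eax >> 4) & 0xF;   /* bits 4-7   */
    unsigned base_family = (eax >> 8) & 0xF;   /* bits 8-11  */
    unsigned ext_model   = (eax >> 16) & 0xF;  /* bits 16-19 */
    unsigned ext_family  = (eax >> 20) & 0xFF; /* bits 20-27 */
    cpu_sig_t s;

    s.family = base_family;
    s.model = base_model;

    /* The extended family only counts when the base family is 15. */
    if (base_family == 0xF)
        s.family += ext_family;

    /* The extended model supplies the high four bits of the model
       for base families 6 and 15. */
    if (base_family == 0x6 || base_family == 0xF)
        s.model |= ext_model << 4;

    return s;
}
```

For example, the signature 0x000106A5 decodes to family 6, model 26, and 0x00100F42 decodes to family 16, model 4.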
For Intel processors, table 2-3 in the following document gives information
about what the processor is:
http://www.intel.com/Assets/PDF/appnote/241618.pdf
However the information is out of date. Simply googling for Intel Family 6
Model XX reveals other models that are not listed in the Intel documentation.
The information for AMD processors is a little harder to come by. However,
one can essentially extract the information from the revision guides, though
it isn't spelled out clearly:
http://developer.amd.com/documentation/guides/Pages/default.aspx#revision_Guides
It seems AMD only list recent processors here, and they are all 64 bit.
Information on 32 bit processors can be found here:
http://www.sandpile.org/ia32/cpuid.htm
At this point we'd like to identify numerous different architectures. We
aren't interested in 32 bit CPUs, such as the now ancient 32 bit Pentium
4 or AMD's K7. The only 32 bit case that interests us is a 32 bit operating
system kernel running on a 64 bit CPU. Thus all 32 bit CPUs simply identify
as x86.
There are numerous 64 bit processors we are interested in:
64 bit Pentium 4 CPUs were released until August 2008. We identify them as p4.
All the 64 bit ones support SSE2 and SSE3 and are based on the NetBurst
microarchitecture.
The Intel Core Solo and Core Duo processors were 32 bit and do not interest us.
They were an enhanced version of the P6 architecture. They are identified as x86,
for which only generic 32 bit assembly code will be available, if any.
Core 2's are very common. They will identify as core2. They all support SSE2,
SSE3 and SSSE3 (the Penryn and following 45nm processors support SSE4.1 - we
don't distinguish these at this stage).
Atoms are low voltage processors from Intel which support SSE2 and SSE3 and are
mostly 64 bit. We identify them as atom.
More recently Intel has released Core i3, i5, i7 processors, based on the
Nehalem architecture. We identify these as nehalem. They support SSE2, SSE3,
SSSE3, SSE4.1 and SSE4.2.
AMD K8's are still available today. They support SSE2 and SSE3. We identify
them as k8.
AMD K10's are a more recent core from AMD. They support SSE2, SSE3 and SSE4a.
We identify these as k10. There are three streams of AMD K10 processors,
Phenom, Phenom-II and Athlon-II. We don't distinguish these at this point.
So in summary, our configure script first identifies whether we have a 32 or
64 bit *nix kernel. Then the CPU is identified as either x86, p4, core2, atom,
nehalem, k8 or k10, where x86 simply means that it is some kind of 32 bit CPU.
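The classification step might look roughly like the sketch below. Everything in it beyond the architecture names is an assumption: the function name and parameters are invented, the model numbers are a partial list taken from commonly published CPUID tables rather than from this article, and the has_long_mode argument stands for the 64 bit capability flag (CPUID 0x80000001, EDX bit 29), which the caller is assumed to have read. Anything unrecognised falls back to x86 here.

```c
#include <string.h>

/* Maps vendor, family and model to one of the architecture names
   used by the configure script. Partial and illustrative only. */
const char *classify_cpu(const char *vendor, unsigned family,
                         unsigned model, int has_long_mode)
{
    if (!has_long_mode)
        return "x86"; /* some kind of 32 bit CPU */

    if (strcmp(vendor, "GenuineIntel") == 0) {
        if (family == 15)
            return "p4";
        if (family == 6) {
            switch (model) {
            case 15: case 23: /* assumed: 65nm and 45nm Core 2 */
                return "core2";
            case 28:          /* assumed: Atom */
                return "atom";
            case 26: case 30: /* assumed: Nehalem */
                return "nehalem";
            default:
                break;
            }
        }
    } else if (strcmp(vendor, "AuthenticAMD") == 0) {
        if (family == 15)
            return "k8";
        if (family == 16)
            return "k10";
    }
    return "x86"; /* unknown CPUs get the generic code in this sketch */
}
```

A real table would need more model numbers and regular updating, since, as noted above, the published lists go out of date.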
Our configure script then links in architecture specific files as appropriate.
The only assembly code we include so far consists of the nn_add_mc functions we wrote
for 64 bit core2 and k10. As these are better than nothing on other 64 bit
processors from the same manufacturers, we include this code in the k8
specific files until we write versions for each processor. We also add an
nn_sub_mc assembly file for Intel and AMD 64 bit processors.
The configure script includes the architecture specific .h files starting from
the most recent processors so that code for earlier processors is not used
when something more recent is available.
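The effect of the include order can be sketched in one file. This assumes each arch file guards its routine with the same HAVE_ARCH flag, which is one plausible way of getting the behaviour described. The k10 and k8 file names are modelled on the core2 example above, and nn_add_mc_origin is a hypothetical stand-in that simply reports which version was compiled.

```c
/* --- generated nn_linear_arch.h: newest processor first --- */

/* stands in for #include "arch/nn_linear_x86_64_k10.h" */
#ifndef HAVE_ARCH_nn_add_mc
#define HAVE_ARCH_nn_add_mc
const char *nn_add_mc_origin(void) { return "k10"; }
#endif

/* stands in for #include "arch/nn_linear_x86_64_k8.h" */
#ifndef HAVE_ARCH_nn_add_mc
#define HAVE_ARCH_nn_add_mc
const char *nn_add_mc_origin(void) { return "k8"; }
#endif

/* --- nn_linear.c: the pure C fallback --- */
#ifndef HAVE_ARCH_nn_add_mc
const char *nn_add_mc_origin(void) { return "generic C"; }
#endif
```

Because the k10 block comes first and sets the flag, the k8 and C versions are skipped. If the k10 file had no version of the routine, the k8 one would be picked up instead.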
The github branch for this revision is here: asm2