Tuesday 27 August 2019

Update on Trip timings for multivariate arithmetic

New Trip timings


In my last blog post I wrote about the timings we did for the OpenDreamKit project for parallel multivariate arithmetic.

After publishing those timings, Mickaël Gastineau of the Trip project let me know that there is an undocumented feature for modular arithmetic, using the Mod(., p) command. He also told me that they had not linked in a fast parallel memory allocator. He has released a patched version of the publicly available Trip with this fix.

You can see the new timings, including those for mod p arithmetic, below.



Cores  Trip n=16  ODK n=16  Trip n=20  ODK n=20
1      24.0s      10.5s     140s       120s
2      11.9s      5.55s     69.9s      60.0s
3      8.06s      3.80s     46.7s      41.0s
4      6.13s      2.95s     35.2s      31.4s
6      4.19s      2.09s     25.0s      21.1s
8      3.19s      1.60s     18.7s      16.1s
10     2.60s      1.30s     15.1s      13.0s
12     2.19s      1.09s     12.6s      10.9s
14     1.99s      0.96s     10.8s      9.4s
16     1.96s      0.86s     9.7s       8.3s
20     1.42s      0.72s     7.98s      6.7s
24     1.30s      0.64s     6.72s      5.6s
28     1.17s      0.55s     6.01s      4.9s
32     1.10s      0.50s     5.36s      4.9s
Sparse Multiplication over Z

Cores  Trip n=16  ODK n=16  Trip n=20  ODK n=20
1      27.9s      9.15s     175s       53.9s
2      14.0s      4.73s     97.5s      28.1s
3      9.38s      3.29s     59.1s      18.7s
4      7.18s      2.49s     48.5s      14.3s
6      5.14s      1.68s     31.8s      9.48s
8      3.80s      1.29s     23.3s      7.26s
10     3.02s      1.04s     19.0s      5.96s
12     2.60s      0.86s     15.6s      4.99s
14     2.25s      0.74s     13.5s      4.22s
16     1.97s      0.66s     11.7s      3.76s
20     1.70s      0.56s     9.72s      3.13s
24     1.44s      0.49s     8.50s      2.65s
28     1.29s      0.43s     7.35s      2.27s
32     1.21s      0.38s     6.79s      2.04s
Sparse Multiplication over Z/pZ

Cores  Trip Z  ODK Z
1      25.6s   5.08s
2      12.8s   2.56s
3      8.55s   1.71s
4      6.44s   1.29s
6      4.38s   0.86s
8      3.36s   0.67s
10     2.81s   0.52s
12     2.34s   0.46s
14     1.98s   0.40s
16     1.79s   0.34s
Dense Multiplication over Z

Cores  Trip Z/pZ  ODK Z/pZ
1      30.5s      3.64s
2      15.3s      1.83s
3      10.2s      1.22s
4      7.65s      0.92s
6      5.21s      0.62s
8      3.97s      0.46s
10     3.17s      0.37s
12     2.86s      0.31s
14     2.51s      0.28s
16     2.08s      0.24s
Dense Multiplication over Z/pZ

Conclusion


As you can see, there are some really impressive speedups for Trip from 16 to 32 cores. Below 16 cores the times have gone up a little, but overall the scaling with the number of cores is much better.

The Trip Z/pZ times are often slower than the Z times, the opposite of what we see for our code.

By the way, as the ODK project is coming to a close and my other work-related activities now have a blog elsewhere, this blog will probably revert to just being a private blog again. I no doubt won't be able to resist doing a post here when Oscar is finally ready for its first prerelease/release, and there may be some new timings for Giac, but otherwise it'll be mostly personal projects from now on.

I may post some articles on speeding up graphics on the CGA, EGA and VGA adapters. So stick around if that interests you.

Wednesday 7 August 2019

Fast Parallel Multivariate Arithmetic - Final Report!



We are hiring: see below!

As many readers of my blog know, we received funding here at TU Kaiserslautern in Germany from the European Union H2020 initiative to develop fast parallel multivariate polynomial arithmetic as part of the OpenDreamKit collaboration.

OpenDreamKit is a set of tools developed here in Europe for constructing Virtual Research Environments for scientific/mathematical research and other collaborations, and our site has been part of the High Performance team.

As the OpenDreamKit (ODK) project is drawing to a close, it is time to update everyone on how our project for parallel multivariate arithmetic has gone.

Here in Kaiserslautern, Daniel Schultz has been writing the code for this. There are now well over 100,000 lines of new, really high-quality C code and test code, plus associated documentation.

You will remember from previous blog posts of mine that we had the following aims:
  • Parallelise multivariate multiplication, division and GCD
  • Support lex, deglex and degrevlex orderings
  • Support 64 bit exponents with any number of variables [1]
  • Aim for linear speedup with the number of cores [2]
  • Provide divisibility testing, exact division, quotient only division, division with remainder
  • Make it available via the computer algebra system Singular [3]
  • Support Z, Q, Z/pZ as coefficient rings [4]
  • Competitive single core and multicore timings
  • Support dense and sparse arithmetic [5]
  • Open source
This might seem like an impossible list, but we will let you judge whether we have been successful.

In previous blog posts we've talked about our single core multiplication and division timings and our multicore multiplication timings versus many other systems.

In this blog we will summarise the multicore performance for multiplication, division and GCD.

Note that many other things were implemented in single core versions: various kinds of evaluation and composition, ideal reduction, powering, derivative and integral, input parsing and a panoply of support functions [12].
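To give a flavour, here is a minimal C sketch of two of these single core operations (powering and derivative) through the Flint fmpz_mpoly interface. The polynomial and parameters here are chosen purely for illustration.

    #include "flint/flint.h"
    #include "flint/fmpz_mpoly.h"

    /* Illustration only: powering and derivative of a polynomial in
       Z[x, y] using Flint's fmpz_mpoly interface. */
    int main(void)
    {
        const char * vars[] = {"x", "y"};
        fmpz_mpoly_ctx_t ctx;
        fmpz_mpoly_t f, g;

        fmpz_mpoly_ctx_init(ctx, 2, ORD_LEX);
        fmpz_mpoly_init(f, ctx);
        fmpz_mpoly_init(g, ctx);

        fmpz_mpoly_set_str_pretty(f, "x + 2*y + 1", vars, ctx);
        fmpz_mpoly_pow_ui(f, f, 3, ctx);     /* f = (x + 2y + 1)^3 */
        fmpz_mpoly_derivative(g, f, 0, ctx); /* g = df/dx */

        fmpz_mpoly_print_pretty(g, vars, ctx);
        flint_printf("\n");

        fmpz_mpoly_clear(f, ctx);
        fmpz_mpoly_clear(g, ctx);
        fmpz_mpoly_ctx_clear(ctx);
        return 0;
    }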

Benchmarking


We've decided to only benchmark systems that we know can do all of the following:
  • Parallel multivariate arithmetic
  • Support at least one of the main polynomial orderings: lex, deglex or degrevlex
  • Support at least one of Z, Q, Z/pZ
This means there are only three contenders:
  • Maple 2019 (deglex)
  • Giac 1.5.0 (lex)
  • Trip 14.62 (lex)
Maple and Giac currently optimise for small numbers of cores, e.g. two to four cores [6]. Trip is optimised for many cores.

Note that Maple is proprietary and closed source, Trip is closed source but available for academic use, and Giac is free and open source. All our software is free and open source.

As we were not able to install all software on the same machine, we benchmark against each system separately on different machines. Thus the timings between the different systems are not directly comparable.

To avoid conversion costs to and from Singular polynomials [7], we did all our timings with Oscar/Nemo, which interfaces directly to Flint [8].

Benchmarks


We decided to benchmark the following algorithms:
  • Sparse multiplication over Z and Z/pZ, (Pearce n = 16 and n = 20)
  • Dense multiplication over Z and Z/pZ, (Fateman n = 30)
  • Sparse divisibility testing over Z and Z/pZ, (Pearce n = 16) [9]
  • Sparse GCD over Z (Schultz n = (7,5),(4,8))
These are the algorithms that we have spent the most effort parallelising [10]. 

Code for the benchmarks can be found at [13] and [14].
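To give an idea of how such a benchmark is set up, here is a minimal C sketch of the Fateman dense benchmark, as it is usually stated: time p = f*(f + 1) with f = (1 + x + y + z + t)^n. The thread count and other details here are illustrative only; the actual harness is in the profiling code [14].

    #include "flint/flint.h"
    #include "flint/fmpz_mpoly.h"

    /* Sketch of the Fateman dense benchmark: time p = f*(f + 1) where
       f = (1 + x + y + z + t)^n. Thread count and n are illustrative. */
    int main(void)
    {
        const char * vars[] = {"x", "y", "z", "t"};
        fmpz_mpoly_ctx_t ctx;
        fmpz_mpoly_t f, g, p;

        flint_set_num_threads(8);           /* number of worker threads */

        fmpz_mpoly_ctx_init(ctx, 4, ORD_DEGREVLEX);
        fmpz_mpoly_init(f, ctx);
        fmpz_mpoly_init(g, ctx);
        fmpz_mpoly_init(p, ctx);

        fmpz_mpoly_set_str_pretty(f, "1 + x + y + z + t", vars, ctx);
        fmpz_mpoly_pow_ui(f, f, 30, ctx);   /* f = (1+x+y+z+t)^30 */
        fmpz_mpoly_add_ui(g, f, 1, ctx);    /* g = f + 1 */

        fmpz_mpoly_mul(p, f, g, ctx);       /* the timed operation */

        flint_printf("%wd terms\n", fmpz_mpoly_length(p, ctx));

        fmpz_mpoly_clear(f, ctx);
        fmpz_mpoly_clear(g, ctx);
        fmpz_mpoly_clear(p, ctx);
        fmpz_mpoly_ctx_clear(ctx);
        return 0;
    }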

Maple 2019


Maple is not installed on our development server and we thank Michael Monagan for providing access to a machine on which the latest Maple is installed.

We were not able to pin threads when benchmarking Maple [17], and the machine we had access to had only 16 cores, so we used only 12 of them out of consideration for other users.

Unfortunately, unpredictable and long-running garbage collection/conversion costs dominated many of the Maple timings, and it was not practical to run those benchmarks to completion. We present here only the timings that did not suffer from this problem.


Cores  Maple Z  ODK Z  Maple Z/pZ  ODK Z/pZ
1      30.87s   5.79s  30.97s      4.57s
2      15.12s   2.94s  15.00s      2.33s
3      9.60s    2.04s  9.42s       1.59s
4      6.93s    1.58s  6.97s       1.23s
6      4.95s    1.10s  4.62s       0.89s
8      3.58s    0.88s  3.55s       0.66s
10     2.97s    0.72s  2.93s       0.58s
12     2.52s    0.62s  2.47s       0.49s
Dense Multiplication

Trip 14.62


We were able to install Trip [18] on our Gentoo development server with 32 cores.

We were able to pin threads to cores when benchmarking Trip. For consistency we used the performance governor with turbo off. We did not use a specialised malloc (though Trip does), but Flint has its own multithreaded memory manager [15]. Edit: Mickaël Gastineau has informed us that, due to a linking issue in the version of Trip that we used, the system allocator was being used. He has issued a new version, which we will soon benchmark here on this blog.

Trip offers a variety of polynomial representations. We picked the one that seemed to give the best timings. Note that our code automatically converts on the fly so that no ahead-of-time selection is required.

Trip targets very different applications from our system, so not many of our benchmarks are supported [19]. Edit: Mickaël Gastineau has informed us that mod p arithmetic is now available through an undocumented feature. We will do timings for comparison in an upcoming blog post.


Cores  Trip n=16  ODK n=16  Trip n=20  ODK n=20
1      21.9s      10.5s     130s       120s
2      11.5s      5.55s     67.2s      60.0s
3      8.02s      3.80s     45.7s      41.0s
4      6.02s      2.95s     33.4s      31.4s
6      3.96s      2.09s     22.9s      21.1s
8      3.29s      1.60s     17.3s      16.1s
10     2.68s      1.30s     15.5s      13.0s
12     2.30s      1.09s     12.9s      10.9s
14     2.07s      0.96s     11.0s      9.4s
16     1.96s      0.86s     10.1s      8.3s
20     1.79s      0.72s     9.39s      6.7s
24     1.73s      0.64s     9.08s      5.6s
28     1.62s      0.55s     8.87s      4.9s
32     1.53s      0.50s     8.57s      4.9s
Sparse Multiplication over Z

Cores  Trip Z  ODK Z
1      24.4s   5.08s
2      12.3s   2.56s
3      8.30s   1.71s
4      6.21s   1.29s
6      4.24s   0.86s
8      3.17s   0.67s
10     2.87s   0.52s
12     2.47s   0.46s
14     2.01s   0.40s
16     1.78s   0.34s
Dense Multiplication

Giac 1.5.0


Giac did not install successfully on our Gentoo development server [20]. The only system I could install it on was my four-core laptop [21]. Therefore the Giac timings should not in any way be considered scientific!

The current version of giac is only optimised for two or four threads. Bernard Parisse reports that he is currently working on improving much of the multivariate arithmetic and threading. He also notes that giac has to internally convert from a new polynomial representation to a historic representation, which can significantly increase times. It may be better to compare with Singular rather than Oscar/Nemo.

We did not do any thread pinning for the giac timings.

Some operations did not complete in a reasonable time. We marked these with a dash.

Cores  Giac Z n=16  ODK Z n=16  Giac Z/pZ n=16  ODK Z/pZ n=16  Giac Z n=20  ODK Z n=20  Giac Z/pZ n=20  ODK Z/pZ n=20
1      18.3s        12.0s       153s            7.85s          162s         92.2s       -----           45.3s
2      16.2s        7.42s       153s            4.85s          217s         63.5s       -----           25.0s
3      18.6s        5.91s       158s            3.49s          191s         40.7s       -----           18.1s
4      13.4s        5.67s       153s            2.77s          196s         36.7s       -----           14.7s
Sparse Multiplication

Cores  Giac Z  ODK Z  Giac Z/pZ  ODK Z/pZ
1      5.65s   3.61s  4.12s      2.73s
2      3.64s   1.91s  2.61s      1.45s
3      2.64s   1.29s  1.83s      1.01s
4      2.40s   0.99s  1.61s      0.75s
Dense Multiplication

Cores  Giac Z  ODK Z
1      41.6s   14.8s
2      38.8s   8.03s
3      37.5s   5.31s
4      36.8s   4.22s
Sparse GCD

Other OpenDreamKit (ODK) Timings


Divisibility testing with quotient does not seem to be available in any of the systems we are comparing against [11]. It is available in Magma [16], but it is not parallelised there.

[We have since been told that Maple does have divisibility testing, but it does not return the quotient. However, it apparently does sufficient work internally to compute the quotient, so we could have safely benchmarked it. Even so, the initial multiplication does not terminate for us in a reasonable time, so we were unable to benchmark divisibility testing in Maple.]

Thus, we present this benchmark for our system only.

For consistent timings that scale well we used thread pinning, the performance governor and disabled turbo. We did not use a specialised malloc; however, Flint has a multithreaded memory manager [15].


Cores  ODK Z  ODK Z/pZ
1      9.96s  9.57s
2      5.52s  3.30s
3      3.64s  2.41s
4      2.68s  1.97s
6      1.92s  1.72s
8      1.55s  1.38s
10     1.26s  1.11s
12     1.14s  1.01s
14     0.92s  0.81s
16     0.88s  0.76s
Sparse Divisibility Testing

As scaling with many threads for our GCD code is not shown above, we give the timings here.

Cores  ODK Z
1      21.58s
2      11.10s
3      7.46s
4      5.74s
6      3.93s
8      3.02s
10     2.50s
12     2.13s
14     1.89s
16     1.70s
Sparse GCD

We spare the reader all the Z/pZ timings for multiplication, as they scale much as the timings over Z do.

[We have since discovered we can do this on a single core in about 3s, but are still trying to figure out how to parallelise that trick.]

Current and Future Work


We are currently working on integration into Singular. This is almost complete, with only rational functions remaining. Our final ODK report will contain timings and details.

We also plan on implementing multivariate factoring directly on top of our multivariate arithmetic. Some progress has already been made on this, but it goes well beyond the scope of the ODK project. It will likely be a side project of Daniel Schultz, probably over a period of months or years.

We are hiring!


Daniel and I will be here in Kaiserslautern long after ODK finishes, but we are also hiring new people. There are numerous open postdoc positions in Kaiserslautern, after Prof. Wolfram Decker managed to secure a lot of new funding.

We are looking for individuals who:

* Have good programming skills
* Have a PhD
* Have an interest in some area of computer algebra (broadly interpreted)
* Are interested in being based at TU Kaiserslautern

If that's you, or you know someone suitable, please don't hesitate to get in contact with us. You can find our email addresses on the university website.

Dr. William Hart.

Notes


[1] We also support multiprecision exponents for some operations.

[2] For sufficiently large problems.

[3] The code is actually in Flint, which is a component of Singular. It is also available through Oscar/Nemo.

[4] We also support Fq in the new Flint fq_nmod_mpoly interface.

[5] We also support what we call "semi-sparse" arithmetic, for which some operations are specially optimised.

[6] Bernard Parisse tells us he will be optimising for more cores in future.

[7] Groebner bases require fast (monomial) x (polynomial) operations, and Singular is optimised to do many fundamental operations without memory allocation. This is achieved using a custom memory manager (omalloc) and a linked list implementation.

[8] Note that a very small patch is currently required in Nemo to avoid use of Julia's counted_malloc, which isn't threadsafe for use with Flint threads. Essentially we comment out a few lines in Nemo.jl that redirect the memory manager functions in GMP and Flint to Julia's counted malloc.
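For the curious, the effect of the patch can be pictured in C roughly as follows. This is a sketch only, not the actual Nemo.jl change (which is in Julia and simply comments out the redirection): it shows the real GMP and Flint entry points for installing allocators being reset to the thread-safe system defaults.

    #include <stdlib.h>
    #include <gmp.h>
    #include "flint/flint.h"

    /* Sketch only: reinstate the default (thread-safe) allocators.
       Passing NULL to mp_set_memory_functions restores GMP's defaults;
       __flint_set_memory_functions is Flint's analogous entry point. */
    void restore_default_allocators(void)
    {
        mp_set_memory_functions(NULL, NULL, NULL);
        __flint_set_memory_functions(malloc, calloc, realloc, free);
    }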

[9] We use Pearce's multiplication benchmark, then check divisibility by the second factor and return the quotient if it is divisible.
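In terms of Flint's interface, the timed step looks roughly like the following sketch (the real harness is in the profiling code [14]):

    #include "flint/flint.h"
    #include "flint/fmpz_mpoly.h"

    /* Sketch of the divisibility benchmark step: p = f*g comes from
       Pearce's multiplication benchmark; the timed call tests whether g
       divides p and recovers the quotient q (which should equal f). */
    void divides_benchmark(fmpz_mpoly_t q, const fmpz_mpoly_t f,
                           const fmpz_mpoly_t g, const fmpz_mpoly_ctx_t ctx)
    {
        fmpz_mpoly_t p;
        fmpz_mpoly_init(p, ctx);

        fmpz_mpoly_mul(p, f, g, ctx);           /* construct the dividend */

        if (!fmpz_mpoly_divides(q, p, g, ctx))  /* the timed operation */
            flint_abort();                      /* cannot happen: g | p */

        fmpz_mpoly_clear(p, ctx);
    }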

[10] Some dense algorithms are also parallelised by earlier ODK work on parallelising the FFT, however we are not reporting on that work here.

[11] Divisibility testing usually requires a heuristic test to quickly exclude inexact division, and a divrem with early termination. Without these, the test will be really slow in the case of inexact division. We time only exact division, as this requires the expensive divrem step; early termination only matters when an inexact division slips past the heuristic test.



[14] Flint profiling code, see for example: https://github.com/wbhart/flint2/tree/trunk/fmpz_mpoly/profile

[14a] No one will read all these comments, which is why they are comments.

[15] Developed for the ODK project by me, see: https://github.com/wbhart/flint2/blob/trunk/fmpz/link/fmpz_single.c
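The general technique can be illustrated as follows. This is an idealised sketch of a per-thread free list, not the actual fmpz_single.c code, which is considerably more involved.

    #include <stdlib.h>

    /* Idealised sketch of a per-thread free list for fixed-size cells:
       each thread recycles its own freed cells without taking any locks,
       falling back to malloc only when its list is empty. */
    typedef struct cell { struct cell * next; } cell;

    #define CELL_SIZE 64  /* must be >= sizeof(cell) */

    static _Thread_local cell * free_list = NULL;

    void * cell_alloc(void)
    {
        cell * c = free_list;
        if (c != NULL)
        {
            free_list = c->next;   /* pop a recycled cell, lock free */
            return c;
        }
        return malloc(CELL_SIZE);  /* slow path: system allocator */
    }

    void cell_free(void * p)
    {
        cell * c = (cell *) p;
        c->next = free_list;       /* push onto this thread's list */
        free_list = c;
    }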

[16] We also have it on Good Authority TM that (some recent version of) Magma does not do a proper divisibility test, but only a heuristic test, which we do not accept to be sufficient for this benchmark.

[17] We also did not use thread pinning or adjust the CPU governor when timing Oscar/Nemo on this machine.

[18] It didn't install out of the box, but we were easily able to install older versions of certain prerequisites that then allowed Trip to run.

[19] Trip does not provide exact division, but it has division with remainder. However, this is not optimised for performance.

[20] I tried to install it from source, because I do not have sudo access, but Bernard reports that the timings we were obtaining were not reasonable, probably due to some misconfiguration that we were unable to identify.

[21] This was a Windows system, so I had to use the WSL. Unfortunately this is not a recent enough version of Ubuntu, so I attempted an upgrade which was partially successful. It was at least enough to get giac 1.5.0 stable working with timings that Bernard Parisse agrees are reasonable.