Sunday, 29 March 2009

Current shadow stack overheads in HLVM

Many people, including the developers of Haskell (see here) and OCaml (see here) compilers, have been quick to dismiss the shadow stack algorithm for performance reasons.

Despite these concerns, we chose to use a shadow stack in HLVM for several reasons:

  • GC performance is low on the list of priorities for HLVM.
  • Shadow stacks are easy to implement entirely from within HLVM, whereas conventional strategies require emitting stack maps that describe the frames on the normal stack, which in turn requires low-level hacking with LLVM's experimental GC API in C++.
  • Shadow stacks are easy to debug because they are just another data structure in the heap.
  • HLVM provides typed nulls and the ability to read array lengths or type constructor tags without an indirection. This requires the use of an unconventional struct-based reference type that is incompatible with LLVM's current GC API, which can only handle individual pointers.
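
The idea behind the shadow stack can be illustrated in a few lines of OCaml. This is a hypothetical sketch, not HLVM's actual implementation: the root representation, the use of the stdlib Stack module and the function names are all ours, and exception safety is omitted for brevity.

```ocaml
(* A shadow stack is a separate stack of GC roots maintained in
   parallel with the normal call stack. *)
type root = Obj.t

let shadow_stack : root Stack.t = Stack.create ()

(* Register a root on entry to a scope and pop it on exit, so the
   collector can enumerate live references by walking the shadow
   stack alone (exception safety omitted for brevity). *)
let with_root (r : root) (f : unit -> 'a) : 'a =
  Stack.push r shadow_stack;
  let result = f () in
  ignore (Stack.pop shadow_stack);
  result

(* Root enumeration for the collector is then a plain traversal,
   which also makes the shadow stack easy to inspect when debugging. *)
let iter_roots (visit : root -> unit) = Stack.iter visit shadow_stack
```

The cost, of course, is that every function handling reference types must push and pop its roots at run time, which is exactly the overhead measured below.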

Consequently, it is interesting to use HLVM to measure just how expensive its GC and shadow stack implementations are. This cannot be measured directly because the shadow stack is essential for garbage collections to work, so removing the shadow stack would also remove the whole GC. However, we can benchmark three configurations: with the full GC enabled, with shadow stack updates but no collections, and with the GC completely disabled:

The difference between the last two measures gives an estimate of the time spent updating the shadow stack and we can present that as a ratio of the original running time:

This is only an estimate because disabling the GC affects the performance of the rest of the system, most notably by leaving more allocated blocks for malloc to handle and by degrading cache locality, since every allocated value starts cold.
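
The estimate above is a simple arithmetic ratio of the three timings. A minimal sketch (the function and variable names are ours, and the numbers in the example are illustrative, not measured):

```ocaml
(* Estimate the fraction of total running time spent updating the
   shadow stack, from three timings of the same benchmark:
     t_full    : full GC enabled (the original running time)
     t_updates : shadow stack updated but collections disabled
     t_nogc    : GC, including the shadow stack, completely disabled *)
let shadow_stack_overhead ~t_full ~t_updates ~t_nogc =
  (t_updates -. t_nogc) /. t_full

let () =
  (* e.g. 1.0s total, 0.9s with updates only, 0.7s with no GC at all
     gives an estimated 20% overhead. *)
  Printf.printf "%.0f%%\n"
    (100. *. shadow_stack_overhead ~t_full:1.0 ~t_updates:0.9 ~t_nogc:0.7)
```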

These results show that the time spent manipulating the shadow stack is:

  • Insignificant for the fib, ffib, mandel and mandel2 benchmarks, which is expected because their inner loops act on value types that the GC is oblivious to.
  • Between 10 and 25% for the sieve, Array.fold, List.init and gc benchmarks, which is expected because they use reference types in their inner loops.
  • Around 70% in the case of the 10-queens benchmark, which was entirely unexpected!

We can also see that the List.init benchmark is spending most of its time performing collections.

These results are very encouraging for two reasons:

  • HLVM's comparatively poor performance on the List.init benchmark is not due to the shadow stack but, rather, is due to inefficiencies in our collection algorithm. Specifically, the mark phase uses a hash table with a fixed number of buckets that scales poorly when the heap contains many allocated blocks. Increasing the size of the hash table provides a 36% performance improvement.
  • HLVM's comparatively poor performance on the list-based 10-queens benchmark is due to the presence of reference types in the inner loop: the List.filter function. Specifically, the environment of the predicate closure and the list itself. Inlining and/or CSE will go a long way towards eliminating this overhead.
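
The first fix can be sketched in OCaml. The collector's actual data structures differ; this hypothetical mark phase simply uses a visited set that grows with the heap (OCaml's Hashtbl resizes itself automatically), rather than a fixed number of buckets, and the block type is ours:

```ocaml
(* Sketch of a mark phase whose visited set grows with the heap.
   The [block] type and its layout are hypothetical, not HLVM's
   actual heap representation. *)
type block = { id : int; mutable refs : block list }

let mark (roots : block list) : (int, unit) Hashtbl.t =
  (* Hashtbl resizes as entries are added, so membership tests stay
     cheap even when the heap contains many allocated blocks. *)
  let marked = Hashtbl.create 16 in
  let rec visit b =
    if not (Hashtbl.mem marked b.id) then begin
      Hashtbl.add marked b.id ();
      List.iter visit b.refs
    end
  in
  List.iter visit roots;
  marked
```

With a fixed bucket count, the chains in each bucket grow linearly with the number of live blocks and marking degrades to quadratic behaviour, which is consistent with List.init spending most of its time in collections.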

Suffice to say, we are very happy with these results and intend to continue using the shadow stack algorithm.

Saturday, 28 March 2009

Hackage download stats vs OCaml

Donald Bruce Stewart recently published an in-depth analysis of Hackage's download statistics that shows many interesting characteristics:

  • All of the 1,200 Haskell packages on Hackage combined have been downloaded as many times in total as a single OCaml program, MLDonkey, has from SourceForge (let alone OCaml's other popular software such as FFTW and Unison).
  • Over half of the packages on Hackage have been downloaded under 300 times.
  • Dependencies are counted multiple times, so these figures are a substantial overestimate.

This confirms our suspicions that the few lines of real-world Haskell software ever written have been tested by only a handful of people. Perhaps Haskell will become more mainstream in the future but it seems to be quite a long way off today.

HLVM's first front-end

Alp Mestan has begun work on the first front-end for HLVM. This project has many important goals:

  • Provide a high-level language in which HLVM developers can write test code, benchmarks and garbage collectors.
  • Serve as a tutorial for the people porting existing compilers such as Moscow ML, NekoML and PolyML to HLVM.
  • Track new features as they are added to HLVM, such as closures, parametric polymorphism, parallelism and so on.

Alp intends to apply to Jane Street Capital to fund a summer project that will port the entire OCaml compiler to HLVM.

Performance: OCaml vs HLVM beta 0.4

A quick update due to a surprise performance result for HLVM!

We knew that manipulating the shadow stack was a major slowdown in HLVM, from the change in our benchmark results when the GC was first introduced a few weeks ago, but we did not expect that a simple local optimization, unrolling, could go so far towards recovering that performance. Moreover, the optimization was implemented in only a few lines of OCaml code.

The following results show that HLVM now produces substantially faster programs than ocamlopt for numerical tasks on x86:

Interestingly, now that HLVM supports standalone compilation using LLVM's IR optimization passes, as well as unoptimized JIT compilation, we can see that LLVM's optimization passes give only small performance improvements in many cases and even substantially degrade performance on our 10-queens benchmark. This is a direct result of HLVM producing near-optimal code directly (which also greatly reduces compile times), and it has eliminated our desire to add support for LLVM's optimization passes. We shall focus on inlining and then on high-level optimization passes instead, most notably the elimination of closures.

Friday, 27 March 2009

Memory footprints: OCaml vs HLVM

Someone recently published a blog article about the sizes of values in the Ruby and OCaml programming languages.

This is also interesting in the context of HLVM because our high-performance goals also require an efficient memory representation, but our solution is quite different from OCaml's: HLVM is designed for numerical performance whereas OCaml was designed for symbolic performance.

The following table lists the sizes of values of several different types in OCaml and HLVM on 32-bit architectures:

Type          | OCaml       | HLVM
--------------+-------------+------------
unit          | 32 bits     | 32 bits
bool          | 32 bits     | 8 bits
int31         | 32 bits     | None
int32         | 96 bits     | 32 bits
int64         | 128 bits    | 64 bits
float32       | None        | 32 bits
float64       | 128 bits    | 64 bits
float * float | 320 bits    | 128 bits
Enumeration   | 32 bits     | 96 bits
float64 array | 64+64n bits | 96+64n bits

Note how HLVM uses the most memory-efficient representations possible for ints, floating-point numbers and tuples, but less efficient representations for reference types such as arrays and variant types. A side-effect is that heaps in HLVM contain far fewer pointers than heaps in OCaml, and this greatly reduces the stress on the garbage collector.
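
The array rows of the table can be written as formulas. The figures below are copied directly from the table; the function names are ours:

```ocaml
(* Sizes in bits of a float64 array of length n on 32-bit targets,
   per the table above. *)
let ocaml_float64_array_bits n = 64 + 64 * n
let hlvm_float64_array_bits n = 96 + 64 * n

let () =
  (* HLVM's struct-based reference costs a constant 32 extra bits
     per array, independent of its length. *)
  assert (hlvm_float64_array_bits 1000 - ocaml_float64_array_bits 1000 = 32)
```

So for arrays the per-element cost is identical and only the constant overhead differs, whereas for scalars and tuples HLVM's unboxed representations are up to several times smaller.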

Tuesday, 17 March 2009

Performance: OCaml vs HLVM beta 0.3

Our HLVM project is evolving rapidly and recently saw its first release with a working garbage collector. This has allowed us to make some performance measurements and, as we had hoped, the results are extremely compelling.

The following graph illustrates the time taken to perform each of the benchmarks in the HLVM test suite on one of the 2.1GHz Opteron 2352 cores in an 8-core Dell PowerEdge:

The individual benchmarks are:

  • fib: The usual Fibonacci function.
  • ffib: A floating-point version of the Fibonacci function.
  • sieve: Sieve of Eratosthenes to find all of the primes under 10^8 and print the last one using a byte array.
  • mandel: Mandelbrot rendering with float arithmetic.
  • mandel2: Mandelbrot rendering with abstracted complex arithmetic where complex numbers are pairs of floating point numbers.
  • Array.fold: Initialize a 10^8-element float array and fold a float * float accumulator over it.
  • List.init/fold: Initialize a 10^6-element int list and fold over it (allocation intensive).
  • 10-queens: Solve the 10-queens problem by filtering lists (allocation and GC intensive).
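
As a concrete example, the first two benchmarks look like this in OCaml (a sketch of what they compute; the test suite's actual arguments and harness may differ):

```ocaml
(* fib: the usual doubly recursive Fibonacci function on ints. *)
let rec fib n = if n < 2 then n else fib (n - 1) + fib (n - 2)

(* ffib: the same recursion on floats, exercising floating-point
   arithmetic and call/return rather than ints. *)
let rec ffib n = if n < 2.0 then n else ffib (n -. 1.0) +. ffib (n -. 2.0)
```

These two are deliberately allocation-free, which is why they isolate raw code generation quality from the GC.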

These results show that the latest HLVM is already faster than OCaml on three tests (ffib, mandel and mandel2), gives comparable performance on three more tests (fib, sieve and Array.fold) and is substantially slower on the allocation and GC-intensive list and 10-queens benchmarks.

The third set of results labeled "HLVM unsafe" refer to results obtained by manually avoiding manipulation of the shadow stack when it is not necessary (note that the GC still runs exactly as before). This is the next major optimization that HLVM is likely to automate. With this optimization, HLVM also outperforms OCaml on the sieve and Array.fold benchmarks.

Suffice to say, we are very happy with these results and are now encouraging others to invest in building upon HLVM now that the basic infrastructure is in place. We have already had interest from the developers of HaXe and PolyML in using HLVM as a new VM for their compilers, and another group are applying to the Jane Street Summer Project 2009 for funding to port INRIA's OCaml compiler to HLVM.

The next major step for HLVM is support for parallel threads. This will be implemented by cooperatively suspending all threads, performing a GC using the current implementation and then resuming the threads. This alone will make HLVM significantly more capable than OCaml when it comes to parallel computing and our next set of benchmark results will include parallel programs!

Saturday, 7 March 2009

HLVM has been released

We have made the first public release of our much anticipated High-Level Virtual Machine (HLVM) project. The complete source code is now available from the HLVM project page on the OCaml Forge via the subversion repository.

HLVM makes extensive use of some features only recently added to LLVM and, consequently, requires a patched version of LLVM 2.4 or later.

Even at this early stage, our performance results are extremely compelling. HLVM is already several times faster on a variety of benchmarks than some notably high-performance functional programming languages such as OCaml. HLVM is specifically designed for technical computing and we believe it will offer the best performance for numerical computing of any high-level language including easy-to-use support for parallelism. We also intend to provide extensive support for visualization using OpenGL and make HLVM a viable platform for both free and commercial software in the future.

This open source project has been solely funded by Flying Frog Consultancy. If you would like to contribute funding, resources or help to develop HLVM yourself please get in contact with us.