Sunday, 26 September 2010

Lessons learned from HLVM

Although our HLVM project is intended to bring modern VM techniques to modern programming languages without requiring any new research, we have ended up producing several enlightening new results. This article takes a look at some of the weird and wonderful techniques we used in HLVM.

GC as a metaprogram (good!)

Whereas most VMs use a garbage collector written in C or C++, the GC in HLVM is written in its own intermediate representation. This unusual design has a variety of benefits:

  • The GC acts as a test for the compiler.
  • The GC benefits from improvements we make to the compiler.
  • The GC is simplified by being able to use features like tail call elimination and higher-order functions.

Overall, implementing the GC in HLVM's own IR proved to be hugely beneficial: the GC was very easy to write and is easy to maintain.
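
To give a concrete flavor, the marking loop that the metaprogram generates might look roughly like the following C rendering. HLVM emits this logic in its own IR rather than C, and the names and types below are invented for illustration:

    #include <stddef.h>

    /* Hypothetical C rendering of compiler-generated marking code.
       Each type's visit function marks the references that values of
       that type contain, so the collector needs no interpretive
       traversal of heap layouts. */
    typedef struct block block_t;

    struct block {
        int    marked;               /* mark bit                  */
        void (*visit)(block_t *);    /* per-type visitor function */
    };

    void gc_mark(block_t *b)
    {
        if (b == NULL || b->marked)
            return;
        b->marked = 1;
        b->visit(b);   /* tail call: with tail-call elimination this
                          is a jump, so deep structures cannot
                          overflow the stack */
    }

Higher-order functions make the per-type visitors first-class values and tail calls keep the traversal iterative, which is why expressing the collector in the IR was so concise.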

Fat references (bad!)

Most VMs store header data in each heap-allocated block that describes the size and type of a value so that the garbage collector can traverse it. In HLVM, we moved this header data into the reference itself. This allows C data structures to be used directly, without copying them to add a header, which simplifies and speeds up the foreign function interface (FFI).

The cost is space: duplicating a reference now copies a 4-word reference rather than a single word. Fortunately, duplicated references are rare and no significant space overhead has been observed in practice.
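
Concretely, a fat reference along these lines can be pictured as a C struct. The exact field order and contents are illustrative rather than HLVM's actual definition:

    #include <stdint.h>

    /* Illustrative 4-word fat reference: the metadata the GC needs
       travels with the reference instead of sitting in a header in
       front of the heap block. */
    typedef struct {
        void     *data;    /* pointer to the payload (possibly a C struct) */
        void     *rtti;    /* run-time type info used for GC traversal     */
        uintptr_t length;  /* e.g. element count for arrays                */
        void     *mark;    /* GC metadata such as the mark state           */
    } fat_ref;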

However, there are two disadvantages. Firstly, we have observed significant performance degradation because pushing fat references onto the shadow stack requires 4× the bandwidth of 1-word references. Secondly, our fat references cannot be updated atomically, so our VM is not memory safe.

Lack of memory safety is a major concern and the best solution we have come up with to date is to revert to 1-word references that point to heap-allocated header information. Note that HLVM currently calls malloc twice for each allocation (once to allocate a mark bit and again to allocate the actual data), so moving the header information back into the heap need not require more allocations. The obvious solution of placing the header alongside the mark bit would incur false sharing when the GC thread writes to the mark bit and a mutator reads the header information. If that proves to be a problem then a better solution might be to store an index that can be used to look up the mark bit and header data from two separate arrays.
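
A sketch of that last idea, with hypothetical names: the reference shrinks to a single word that indexes two side tables, so the mark bits written by the GC thread never share cache lines with the headers read by mutators:

    #include <stdint.h>

    typedef uint32_t ref_t;   /* 1-word reference: an index */

    typedef struct {
        void     *data;    /* payload              */
        void     *rtti;    /* run-time type info   */
        uintptr_t length;  /* array length, if any */
    } header_t;

    extern uint8_t  mark_bits[];   /* written by the GC thread          */
    extern header_t headers[];     /* read by mutators and the GC alike */

    static inline header_t *ref_header(ref_t r) { return &headers[r]; }
    static inline void      ref_mark(ref_t r)   { mark_bits[r] = 1; }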

Shadow stacks (adequate!)

Xavier Leroy dismissed our proposed shadow stack, saying it is "probably the worst approach performance-wise; even a conservative GC should work better". Our motivation for using shadow stacks was always simplicity rather than performance: LLVM's GC support is highly experimental (to date, only two people claim to have managed to get it to work) and shadow stacks facilitate accurate GC in an uncooperative environment. In fact, our results have already shown that Xavier's statement does not hold in our case and we believe it may even be wrong in general.

Specifically, of the six benchmarks in our suite that handle references in the inner loop (and therefore manipulate the shadow stack), HLVM is significantly faster than OCaml on four. So our shadow stack cannot be degrading performance that much! Moreover, OCaml is 9× faster than HLVM on our n-queens benchmark only because its generational collector recycles short-lived values very efficiently, and that has nothing to do with the shadow stack.

So we believe shadow stacks are an adequate solution.
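
For readers unfamiliar with the technique, a minimal Henderson-style shadow stack looks something like the following sketch. HLVM generates its frames in IR; the names here are illustrative:

    /* A chain of frames, each listing the GC roots live in one
       function activation, so the collector can enumerate every
       root exactly. */
    struct shadow_frame {
        struct shadow_frame *prev;     /* caller's frame       */
        int                  nroots;   /* number of root slots */
        void                *roots[];  /* the roots themselves */
    };

    extern struct shadow_frame *shadow_top;   /* per-thread in practice */

    void example(void *ref)
    {
        /* push a frame with one root slot on entry */
        struct { struct shadow_frame *prev; int nroots; void *roots[1]; }
            f = { shadow_top, 1, { ref } };
        shadow_top = (struct shadow_frame *)&f;

        /* ... the body may allocate and hence trigger a collection ... */

        shadow_top = f.prev;   /* pop on exit */
    }

With fat references each roots slot holds four words rather than one, which is exactly the bandwidth cost described in the previous section.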

8 comments:

Michael Robin said...

> the best solution we have come up with to date is to resort to 1-word references as pointers to heap-allocated header information.

What about BBOPs (big bag of pages) a la MacLisp?
I.e., pick a page size (4k) and have one header for all objects allocated on that page.
Then do a (ptr & PAGE_MASK)->Header to get your object info.
Obviously this leads to some fragmentation, but memory isn't quite the issue it was when this scheme (pardon the pun) was first used.
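
In C, the lookup described above might look like this (header_t stands for whatever shared per-page metadata the collector needs):

    #include <stdint.h>

    #define PAGE_SIZE 4096u

    typedef struct {
        void     *rtti;      /* type shared by every object on the page */
        uintptr_t obj_size;  /* their common size                       */
    } header_t;

    /* Round an object pointer down to the start of its page, where
       the shared header lives. */
    static inline header_t *page_header(void *obj)
    {
        return (header_t *)((uintptr_t)obj & ~(uintptr_t)(PAGE_SIZE - 1));
    }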

Michael Robin said...

Obviously exceptions have to be made for very large objects - like setting the LSB to mark a reference as a ptr with its own direct header.
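
Continuing the sketch above (reusing header_t and page_header), that tag test might look like:

    /* Pointers are word-aligned, so the LSB is normally zero; a set
       LSB marks a large object whose header is pointed to directly. */
    static inline header_t *get_header(void *ref)
    {
        uintptr_t p = (uintptr_t)ref;
        if (p & 1)
            return (header_t *)(p - 1);   /* direct header pointer  */
        return page_header(ref);          /* shared per-page header */
    }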

Michael Robin said...

...or have a BBOP header for large objects that re-directs via a 2nd pointer.

I guess we're now putting quite a burden on reference following, including one conditional jmp, which is never good for the pipeline. The only ways I can think of to avoid the conditional have even worse implications.

OTOH, maybe the BBOPs jibe well with generational schemes.

Flying Frog Consultancy Ltd. said...

@Mike: Nice pun! :-)

Wouldn't BIBOP require us to use a custom memory allocator and not malloc?

Michael Robin said...

> ... need custom allocator vs. malloc...

I assume you could just use malloc to allocate large enough areas that you're guaranteed to get N page-size chunks back, then sub-allocate pages/headers within that, and sub-sub-allocate objects in those pages.
Unless you're saying that you want to use malloc so that other code in the same process space that may not be an HLVM program (FFI) can transparently allocate/free memory using old malloc() - in which case I guess you're stuck - they'd have to be changed to call bbopAlloc() or whatever, but that would bypass GC anyway and seems impolite.
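
That over-allocate-and-align trick might look like this (illustrative; the raw pointer returned by malloc must also be kept somewhere so the area can eventually be freed):

    #include <stdlib.h>
    #include <stdint.h>

    #define PAGE_SIZE 4096u

    /* Allocate with plain malloc, then round up to the next page
       boundary; the aligned pages are then sub-allocated into
       headers and objects. */
    void *alloc_pages(size_t npages)
    {
        void *raw = malloc(npages * PAGE_SIZE + PAGE_SIZE - 1);
        if (raw == NULL)
            return NULL;
        return (void *)(((uintptr_t)raw + PAGE_SIZE - 1)
                        & ~(uintptr_t)(PAGE_SIZE - 1));
    }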

Flying Frog Consultancy Ltd. said...

"Unless you're saying that you want to use malloc so that other code in the same process-space that may not be an HLVM program (FFI) can transparently allocate/free memory using old malloc() - in which case I guess you're stuck". Exactly, yes. Not an essential feature but noteworthy nonetheless.

Anil Madhavapeddy said...

You could use posix_memalign(3) to get 4K pages alloced. I've been thinking about a BBOP approach for OCaml recently too...
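
With posix_memalign the page allocation reduces to a single call, no rounding required, and the result can be handed straight back to free():

    #include <stdlib.h>

    void *alloc_pages_aligned(size_t npages)
    {
        void *p = NULL;
        /* 4096-byte alignment guarantees the page-mask lookup works */
        if (posix_memalign(&p, 4096, npages * 4096) != 0)
            return NULL;
        return p;
    }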

Flying Frog Consultancy Ltd. said...

@Anil: Good idea! Allocating non-arrays should be easy because they are all the same size. Potentially, a global dictionary could be used to map the address of an array to its type and size.

On a related note, I found that moving the mark bit from a global LUT into the value did not improve performance significantly. On the same basis, perhaps looking up type and size information via a global dictionary would not be significantly slower.

I am also thinking about moving vs non-moving collectors now. I think that adding support for moving collection adds considerable overhead even if you don't use it, because all references must be saved and reloaded across anything that might include a GC safe point, including every function call. Surprisingly, none of the GC researchers seem to have quantified this overhead!

I think that supporting moving collectors would significantly degrade the performance of the kind of code HLVM targets so I'm trying to think of ways to get the best of both worlds.

I thought about using separate stacks and heaps for movable vs non-moving values. That avoids rewriting the whole stack. However, you still need to be able to rewrite references from the non-moving heap into the movable heap. An array mapping indices to the current locations of movable values sounds good but you cannot have a global one because you don't want threads to have to contend for it and you cannot have thread-local ones because references must be able to migrate between them. Perhaps these features are fundamentally at odds?
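
The indirection under consideration might be sketched as follows, with hypothetical names; the unresolved question above is where this table should live:

    #include <stdint.h>

    typedef uint32_t handle_t;   /* reference to a movable value */

    extern void *locations[];    /* handle -> current address */

    static inline void *deref(handle_t h)
    {
        return locations[h];
    }

    /* A moving collector relocates an object by updating one table
       slot instead of rewriting every reference to it. */
    static inline void relocate(handle_t h, void *new_addr)
    {
        locations[h] = new_addr;
    }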