Sunday, 26 September 2010

Lessons learned from HLVM

Although our HLVM project is intended to bring modern VM techniques to modern programming languages without requiring any new research, we have ended up producing several enlightening new results. This article takes a look at some of the weird and wonderful techniques we used in HLVM.

GC as a metaprogram (good!)

Whereas most VMs use a garbage collector written in C or C++, the GC in HLVM is written in its own intermediate representation. This unusual design has a variety of benefits:

  • The GC acts as a test for the compiler.
  • The GC benefits from improvements we make to the compiler.
  • The GC is simplified by being able to use features like tail call elimination and higher-order functions.

Overall, implementing the GC in HLVM's own IR proved to be hugely beneficial: the GC was very easy to write and is easy to maintain.
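
As a flavour of what this buys, here is a hedged OCaml sketch of a mark phase written in the style the IR permits (the types and helper names are illustrative, not HLVM's actual IR): a higher-order iterator over a block's fields and a tail-recursive work-list loop that cannot overflow the stack.

    (* Illustrative heap block: a mark flag plus outgoing references. *)
    type block = { mutable marked : bool; fields : block list }

    (* Higher-order traversal: apply [f] to every reference in a block. *)
    let iter_fields f b = List.iter f b.fields

    (* Tail-recursive mark loop over an explicit work list. *)
    let mark roots =
      let stack = ref roots in
      let push b = if not b.marked then stack := b :: !stack in
      let rec loop () =
        match !stack with
        | [] -> ()
        | b :: rest ->
            stack := rest;
            if not b.marked then begin
              b.marked <- true;
              iter_fields push b
            end;
            loop ()
      in
      loop ()

    let () =
      (* Tiny cyclic heap: a <-> b. *)
      let rec a = { marked = false; fields = [ b ] }
      and b = { marked = false; fields = [ a ] } in
      mark [ a ];
      assert (a.marked && b.marked)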

Fat references (bad!)

Most VMs store header data in each heap-allocated block, describing the size and type of the value so that the garbage collector can traverse it. In HLVM, we moved this header data into the reference itself. This allows C data structures to be used directly, without copying them just to add a header, which simplifies and speeds up the foreign function interface (FFI).

This adds a 4-word space overhead for each duplicated reference. Fortunately, duplicated references are rare and no significant space overhead has been observed in practice.
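
To make the cost model concrete, here is a hedged OCaml sketch of what a fat reference carries (the field names and exact layout are illustrative, not lifted from HLVM's sources):

    type rtti               (* run-time type info, abstract here *)
    type mark_state         (* GC mark state, abstract here *)

    (* A fat reference bundles the header data that most VMs store
       in the heap block itself, so the referent can be a plain C
       data structure with no header at all. *)
    type 'a fat_ref = {
      rtti : rtti;          (* what the value is, for the GC *)
      data : 'a;            (* the unboxed, C-compatible data *)
      length : int;         (* array length of [data] *)
      mark : mark_state;    (* per-value mark state for collection *)
    }

Copying such a reference therefore moves four words where a conventional VM would move one.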

However, there are two disadvantages. Firstly, we have observed significant performance degradation as fat references are pushed onto the shadow stack, requiring 4× more bandwidth than 1-word references would. Secondly, our fat references cannot be updated atomically, so our VM is not memory safe.

Lack of memory safety is a major concern and the best solution we have come up with to date is to resort to 1-word references as pointers to heap-allocated header information. Note that HLVM currently calls malloc twice for each allocation (once to allocate a mark bit and again to allocate the actual data) so moving the header information back into the heap need not require more allocations. The obvious solution of placing the header alongside the mark bit would incur false sharing when the GC thread writes to the mark bit and a mutator reads the header information. If that proves to be a problem then a better solution might be to store an index that can be used to look up the mark bit and header data from two separate arrays.
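
Here is a hedged sketch of that proposed design (again illustrative OCaml, not HLVM code): a reference shrinks back to a single word, an index, and the header data and mark bits live in two separate arrays so that the GC thread writing marks does not share cache lines with mutators reading headers.

    (* Illustrative only: a 1-word reference is an index into two
       side tables, keeping GC-written mark bits away from the
       mutator-read header data to avoid false sharing. *)
    type header = { rtti : int; length : int }   (* placeholder fields *)

    let capacity = 1 lsl 20

    (* Mutators read headers from this array... *)
    let headers : header array =
      Array.make capacity { rtti = 0; length = 0 }

    (* ...while the GC thread writes mark bits into this one, in
       unrelated cache lines. *)
    let marks : bool array = Array.make capacity false

    (* A reference is again a single word: an index into both tables. *)
    type reference = int

    let header_of (r : reference) = headers.(r)
    let set_mark (r : reference) = marks.(r) <- true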

Shadow stacks (adequate!)

Xavier Leroy dismissed our proposed shadow stack, saying it was "probably the worst approach performance-wise; even a conservative GC should work better". Our motivation for using shadow stacks was always simplicity rather than performance: LLVM's GC support is highly experimental (to date, only two people claim to have managed to get it to work) and shadow stacks facilitate accurate GC in an uncooperative environment. In fact, our results have already shown that Xavier's statement is not true in general and, indeed, we suspect it is usually wrong.
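
For readers unfamiliar with the technique, here is a hedged OCaml sketch of the idea (HLVM emits the equivalent pushes and pops in its IR; with_root and the untyped Obj.t stack below are illustrative):

    (* The compiler brackets the lifetime of each live reference
       with a push and a pop, so the GC can enumerate the roots
       exactly without any cooperation from the native stack. *)
    let shadow_stack : Obj.t list ref = ref []

    let push r = shadow_stack := Obj.repr r :: !shadow_stack
    let pop () = shadow_stack := List.tl !shadow_stack

    (* Roughly what compiled code for "let x = ... in f x" does: *)
    let with_root r f =
      push r;                 (* root becomes visible to the GC *)
      let result = f r in
      pop ();                 (* scope over: unregister the root *)
      result

Note that with fat references every such push copies four words rather than one, which is exactly the bandwidth cost described above.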

Specifically, of the six benchmarks from our benchmark suite that handle references in the inner loop (thus incurring manipulation of the shadow stack), HLVM is significantly faster than OCaml on four. So our shadow stack cannot be degrading performance that much! Moreover, OCaml is 9× faster than HLVM on our n-queens benchmark only because its generational collector allows short-lived values to be recycled very efficiently, and that has nothing to do with the shadow stack.

So we believe shadow stacks are an adequate solution.

Saturday, 25 September 2010

The effectiveness of Boehm's GC

Many people still seem to be trying to use Boehm's garbage collector. This is surprising because that GC is conservative, meaning it is incapable of accurately distinguishing between integers and pointers. It is therefore prone to false positives, where an integer is assumed to be a pointer into an allocated heap block, preventing that block from being reclaimed. Consequently, Boehm's GC is a notorious source of memory leaks.

Moreover, 32-bit machines with 4Gb of RAM running programs that use a significant proportion of that RAM are still very common. On such machines the proportion of the address space in use is high, so the probability of false positives is very high.

Imagine a 32-bit program using 40Mb of heap that contains a random integer. The probability of that random integer coincidentally pointing into an allocated heap block is approximately 1% because 1% of the 4Gb address space is covered by heap blocks. Now imagine a 32-bit program containing n random ints and using a proportion p of the address space. The probability that one or more of those ints points into an allocated heap block is 1 - (1-p)^n. In our previous example, if there were 100 random ints then the probability of a false positive is a whopping 63%!
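
These figures are easy to check numerically; a quick OCaml calculation reproduces both the 1% and the 63%:

    (* Probability that at least one of n random words lands inside
       the fraction p of the address space covered by heap blocks. *)
    let p_false_positive ~p ~n = 1. -. (1. -. p) ** float_of_int n

    let () =
      (* A 40Mb heap in a 4Gb address space gives p = 0.01. *)
      Printf.printf "n = 1:   %.2f\n" (p_false_positive ~p:0.01 ~n:1);
      Printf.printf "n = 100: %.2f\n" (p_false_positive ~p:0.01 ~n:100)

This prints 0.01 and 0.63 respectively.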

This is why 32-bit programs that use Boehm's GC are so prone to memory leaks. Hash tables are particularly susceptible to this problem because hashes are effectively random ints and the spines of hash tables are large heap blocks. Indeed, this appears to explain the memory leak we observed when using hash tables on Mono back in July.

Sunday, 19 September 2010

Are multicore-capable garbage collectors hard to write?

In this era of multicore computing, garbage collectors need to allow user threads (aka mutators) to run in parallel on separate cores simultaneously in order to facilitate efficient shared memory parallel programming.

There are two relevant phrases from garbage collection terminology here:

  • Parallel GC means the garbage collector itself has been parallelised in order to speed up garbage collections. For example, a stop-the-world collector might traverse the heap in parallel when marking in an attempt to reduce pause times.

  • Concurrent GC means the garbage collector runs at the same time as the user threads (aka mutators). For example, Dijkstra's algorithm and derivatives like Doligez-Leroy use fine-grained synchronization to keep a concurrently-running collector apprised of the constantly-changing heap topology.

However, we are talking about neither parallel GC nor concurrent GC but, rather, the simpler challenge of just allowing mutators to run in parallel. In the absence of any established terminology, we call any GC that allows mutator threads to run in parallel with each other a multicore-friendly garbage collector.

Frustratingly, many people are perpetuating the myth that it is difficult to write parallel or concurrent or even just multicore-friendly garbage collectors. In particular, this is happening around the OCaml language as a result of Xavier Leroy's (in)famous "standard lecture on threads" from 2002, in which he explained that they were not creating a multicore-friendly garbage collector for OCaml because multicores would never become popular, and in which he described Intel's hyperthreading as "the last convulsive movement of SMP's corpse". Xavier is a great guy and has done a lot of great work but, of course, he was completely wrong about this. Not only are multicores ubiquitous and hyperthreading common, but shared memory parallelism is here to stay: even if distributed parallelism becomes essential when cache coherence eventually breaks down, we will still be doing distributed parallel programming between multicores, because shared memory parallelism is so much more efficient than distributed parallelism most of the time.

The JVM and .NET CLR obviously provide multicore-friendly garbage collectors, so people sometimes assert that creating such a garbage collector requires huge resources and is beyond a small group such as the OCaml team at INRIA. This is simply not true. The simplest multicore-friendly design is a stop-the-world collector that pauses all user threads while the entire heap is marked and swept, safe in the knowledge that the heap topology is static. Our own HLVM project currently uses exactly this design; it took just a few days to write and is under 100 lines of code! And we are not alone. Simon Marlow has written several far more sophisticated multicore-friendly garbage collectors for the Glasgow Haskell Compiler by himself. The PolyML project developed a multicore-friendly garbage collector without the benefit of funding from a multinational corporation. The same goes for Manticore.
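
To illustrate just how little code this takes, here is a hedged OCaml sketch of the stop-the-world mark-and-sweep design (not HLVM's actual collector; thread stopping is elided):

    type obj = { mutable marked : bool; refs : obj list }

    (* Every allocation is threaded onto a list so the sweep can find it. *)
    let allocated : obj list ref = ref []

    let alloc refs =
      let o = { marked = false; refs } in
      allocated := o :: !allocated;
      o

    (* Mark everything reachable from the roots. *)
    let rec mark o =
      if not o.marked then begin
        o.marked <- true;
        List.iter mark o.refs
      end

    let collect roots =
      (* stop_the_world ();  pause every mutator (elided) *)
      List.iter mark roots;
      allocated := List.filter (fun o -> o.marked) !allocated;
      List.iter (fun o -> o.marked <- false) !allocated
      (* restart_the_world ();  resume the mutators (elided) *)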

Even some of the more sophisticated mostly-concurrent garbage collector designs are remarkably simple. For example, the Very Concurrent Garbage Collector (VCGC) uses a breathtakingly elegant approach based around three epochs (instead of the usual tricolor marking scheme) that completely avoids the error-prone and inefficient fine-grained synchronization originally proposed by Dijkstra and followed by almost everyone else. The entire algorithm for this mostly-concurrent garbage collector can be expressed in a single page of code!
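
As a very rough illustration of the epoch idea (this is our schematic reading of VCGC, not the published algorithm; the concurrent marker thread and the synchronization at the flip are elided), consider:

    type obj = { mutable epoch : int; refs : obj list }

    let current_epoch = ref 2        (* start beyond the reclaim horizon *)
    let heap : obj list ref = ref []

    (* Mutators stamp new objects with the current epoch. *)
    let alloc refs =
      let o = { epoch = !current_epoch; refs } in
      heap := o :: !heap;
      o

    (* The marker restamps everything reachable while mutators run. *)
    let rec mark o =
      if o.epoch < !current_epoch then begin
        o.epoch <- !current_epoch;
        List.iter mark o.refs
      end

    (* The flip: advance the epoch and reclaim whatever fell two
       epochs behind, i.e. was never restamped in the last cycle. *)
    let flip roots =
      List.iter mark roots;
      heap := List.filter (fun o -> o.epoch >= !current_epoch - 1) !heap;
      incr current_epoch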

So please do not let these myths put you off trying to write your own multicore-friendly garbage collector: this is an interesting, rewarding and entirely tractable challenge!

Sunday, 12 September 2010

Burton Smith vs Erik Meijer

We were recently led to a fascinating Channel 9 interview where the insidious Erik Meijer walks in on parallelism expert Burton Smith, culminating in a fight the likes of which have not been seen since Chris Curry vs Sir Clive Sinclair in the Baron of Beef here in Cambridge.

A pragmatic Burton points out that lazy evaluation renders performance unpredictable which, in turn, makes it impossible to tune the granularity of parallel programs to ensure that more effort is spent doing actual work than is wasted administering parallel tasks. The puritanical Erik counters that strictness is essentially another form of side effect because it affects issues such as non-termination. The conclusion is irreconcilable differences between what parallel programming requires and what purely functional programming can provide.

Erik and Burton were joined by an elephant in the room, though: caches. Caches are the main difference between supercomputers and multicores and they are a game changer, yet people rarely discuss the effect caches have on programming, and that effect will only grow as the memory gap continues to widen. Today, effective utilization of caches offers an order of magnitude in performance, yet using caches effectively from purely functional languages like Haskell is essentially impossible precisely because memory has been successfully abstracted away. Indeed, as we have explained before, this not only remains a show-stopping unsolved problem for parallel Haskell but is not even being addressed by researchers. Today's state-of-the-art advice for multicore programmers is still to use mutable data structures, quantifying asymptotic cache complexity in the context of a simple theoretical model of CPU caches in order to ensure that caches are used effectively.