Sunday, 19 September 2010

Are multicore-capable garbage collectors hard to write?

In this era of multicore computing, garbage collectors need to allow user threads (aka mutators) to run in parallel on separate cores simultaneously in order to facilitate efficient shared memory parallel programming.

There are two relevant phrases from garbage collection terminology here:

  • Parallel GC means the garbage collector itself has been parallelised in order to speed up garbage collections. For example, a stop-the-world collector might traverse the heap in parallel when marking in an attempt to reduce pause times.

  • Concurrent GC means the garbage collector runs at the same time as the user threads (aka mutators). For example, Dijkstra's algorithm and derivatives like Doligez-Leroy use fine-grained synchronization to keep a concurrently-running collector apprised of the constantly-changing heap topology.

However, we are talking about neither parallel GC nor concurrent GC but, rather, the simpler challenge of just allowing mutators to run in parallel. In the absence of any established terminology, we call any GC that allows mutator threads to run in parallel with each other a multicore-friendly garbage collector.

Frustratingly, many people are perpetuating the myth that it is difficult to write parallel or concurrent or even just multicore-friendly garbage collectors. In particular, this is happening around the OCaml language as a result of Xavier Leroy's (in)famous "standard lecture on threads" from 2002 where he explained that they were not creating a multicore-friendly garbage collector for OCaml because multicores would never become popular and he described Intel's hyperthreading as "the last convulsive movement of SMP's corpse". Xavier is a great guy and had done a lot of great work but, of course, he was completely wrong about this. Not only are multicores ubiquitous and hyperthreading is very common but shared memory parallelism is here to stay: even if distributed parallelism becomes essential in the future when cache coherence breaks down we will be doing distributed parallel programming between multicores because shared memory parallelism is so much more efficient than distributed parallelism most of the time.

The JVM and .NET CLR obviously provide multicore-friendly garbage collectors so people sometimes assert that creating such a garbage collector requires huge resources and is beyond a small group such as the OCaml team at INRIA. This is simply not true. The simplest multicore-friendly garbage collector design is a stop-the-world collector that pauses all user threads while the entire heap is marked and swept safe in the knowledge that the heap topology is static. Our own HLVM project currently uses exactly this design and it took just a few days to write and is under 100 lines of code! And we are not alone. Simon Marlow has written several far more sophisticated multicore-friendly garbage collectors for the Glasgow Haskell Compiler by himself. The PolyML project developed a multicore-friendly garbage collector without benefit of funding from a multinational corporation. Same for Manticore.

Even some of the more sophisticated mostly-concurrent garbage collector designs are remarkably simple. For example, the Very Concurrent Garbage Collector (VCGC) uses a breathtakingly-elegant approach based around three epochs (instead of the usual tricolor marking scheme) that completely avoids the error-prone and inefficient fine-grained synchronization originally proposed by Dijkstra and followed by almost everyone else. The entire algorithm for this mostly concurrent garbage collector can be expressed in a single page of code!

So please do not let these myths put you off trying to write your own multicore-friendly garbage collector: this is an interesting, rewarding and entirely tractable challenge!


xscott said...

It's nonsense to define "stop the world" as "multicore friendly". The more cores we eventually get, the less friendly stopping all of them becomes.

The VCGC paper looks interesting, but it seems like they've got some global barriers and pauses in the "tens of milliseconds". A truly concurrent garbage collector for a language with mutable data and/or closures, that does not stop all running threads, seems pretty difficult.

Jules said...

VCGC may be elegant but according to the paper it cuts pause time by 1/2x to 1/3x at the cost of slowing the program down by a factor of 2x to 3x. This doesn't seem like a win to me.

Flying Frog Consultancy Ltd. said...

@xscott: I see no better place to draw the line than facilitating shared-memory parallelism. Especially given that HLVM's naive stop-the-world GC does scale well in practice across at least 8 cores.

Fully concurrent garbage collection is certainly much more difficult but it is also much less useful because it degrades absolute performance so severely (see "mutator utilization").

@Jules: Don't forget the mutator, marker and sweeper were all competing for a single core in that comparison and it was vs SMLNJ's multicore unfriendly GC. Running the marker and sweeper on their own cores would bring the performance closer but the ability to run mutators in parallel with each other dwarves all other effects. So I think VCGC deserves a lot more attention in this multicore era...

Makarius said...

The above-mentioned multicore-friendly GC of official Poly/ML releases (such as 5.3 or 5.4) already performs pretty well, scaling into the range of 8 cores, with wall-clock speedup up to 5.0 for memory-intensive applications. Even better, there is ongoing work towards parallel GC, e.g. see the Poly/ML SVN 1295.