Sunday, 9 January 2011

Sweeping 700× faster

Our initial experiments using the new garbage collector design indicate that it dramatically improves the performance of the slowest GC phase (sweeping) as expected, to the extent that it brings the overall performance of our C++ prototype within 10% of OCaml without introducing any of the overheads of generational garbage collection:

Moreover the algorithm is very simple and, in particular, parallel and concurrent variants should be much easier to design with this style of collector than with generational collectors because regions are a suitable granularity.

The sweep phase of the GC accounted for around a third of the total running time of the entire program using the traditional algorithm with allocated and free lists, with each sweep taking around 70µs. Using the new bitwise algorithm, the sweep phase takes just 100ns and accounts for just 0.7% of the total running time of the program.

Our prototype mark region collector using a fake marking phase is now between 2 and 10% slower than OCaml without having sacrificed either its better performance on other benchmarks or its multicore capability. However, our new design has increased the cost of both allocation and marking. Although they accounted for a tiny proportion of the total running time, it remains to be seen whether or not this new GC algorithm will be faster than a traditional generational collector in this best-case scenario for generational collection when it is implemented in a real VM. In particular, the overheads of HLVM's current shadow stack may well make it slower.

Our collection strategy also captures the benefits of Appel's semi-generational collection algorithm. Appel's algorithm compacted nursery survivors to one end of the nursery and used the remaining space as a smaller nursery repeatedly until the entire nursery was full of reachable values that were then promoted to the old generation. Our region-based collector can allow each thread to sweep its own region locally, independently of all other regions and cores. This also leads to the same non-linearity that makes Appel's algorithm so effective but without the overheads of copying. Using regions containing 62 blocks each 64 bytes long, local sweeping increased the number of allocations performed before a new region was required from 62 to a whopping 3,000 allocations. Obtaining a new region requires synchronization, so this simple improvement has reduced the rate of synchronizations by around a factor of 50. With 8-byte regions, the 11-queens problem can be solved with a single region and, therefore, without any contention at all between threads.

Therefore, we believe our new GC design will allow us to make HLVM competitively performant on functional benchmarks like this without losing its substantial advantages on imperative code, where it is often several times faster than OCaml.


Jules said...

So sweeping is now so fast that marking becomes the bottleneck? It seems like your algorithm always has to mark the entire heap. Isn't this too much overhead for large heaps? For example imagine a large imperative tree data structure, and the application deletes and inserts items.

Have you read the paper Accurate garbage collection in
uncooperative environments? It provides an interesting alternative to shadow stacks.

Instead of maintaining a shadow stack they put an exception handler around function calls:

foo = ...

try { f(...} } catch StackScan { roots[i++] = foo }

do something with foo

The exception gets thrown when a garbage collection occurs. The code in X then builds the shadow stack data structure by saving local variables somewhere. So you only build the shadow stack right before garbage collection. Since LLVMs exceptions are supposed to be zero-cost when no exception is thrown, theoretically you only pay for building the shadow stack when a GC happens.

Flying Frog Consultancy Ltd. said...

@Jules: "So sweeping is now so fast that marking becomes the bottleneck?" Hopefully but I'll need to benchmark it with a real mark phase to be sure.

"It seems like your algorithm always has to mark the entire heap. Isn't this too much overhead for large heaps?" Global collections will have to mark the whole heap, which most production GC algorithms do AFAIK. But local collections just trace local roots into the local region and sweep that.

That's a very interesting paper. I had not read it before. I'll check it out but it sounds substantially more complicated that just optimizing what I already have. My multi-word references turned out to be a design flaw so I'll be reducing the shadow stack traffic by a factor of four anyway. I'm not sure how much more optimizing it'll need after that.

Jules said...

Yeah you'd need to make substantial changes to do something like in the paper. In the end it would be a bit faster (at least that's what they found), but there are probably better places to optimize.

"But local collections just trace local roots into the local region and sweep that."

So then you're going to implement a generational GC? What do you do if a pointer in a global region points to something in a local region? Are you going to use a write barrier?

Flying Frog Consultancy Ltd. said...

@Jules: "So then you're going to implement a generational GC?" Technically, no, because my algorithm does not use copying to evacuate the survivors. To me, mark region GCs like this generalize the concept of generations so the old generation is not something completely different but, rather, just a collection of former nursery generations. More importantly, ageing is abstract rather than being implemented concretely by copying between physical memory locations. The concept of age will also come in (implicitly) when I recycle regions from the global queues: I use a queue because it means the region a thread dequeues will be the oldest and, therefore, most likely the emptiest.

"What do you do if a pointer in a global region points to something in a local region? Are you going to use a write barrier?" Good question. I have thought of several possible solutions. One is to use a traditional write barrier such as card marking or a remembered set. Another is to use a simplified write barrier, such as copying the allocated bitvector in order to sweep only unreachable values that were allocated since the last pointer write. Another is for the write barrier to set a flag indicating that a local GC cycle will not be possible. In all cases, the write barrier need only be incurred when writing a pointer that points into the local region. As you can see, there are many possible points along this trade-off. Imperative code will prefer the least invasive, of course, but code that mixes pure and impure styles will suffer from that choice...