Thursday, 23 December 2010

When generational GC goes bad

For many years, generational collection was the de facto standard GC architecture. Based upon the observation that the distribution of value lifetimes is heavily skewed towards short lifetimes (most values die young), generational garbage collectors allocate new values into a nursery generation and copy survivors out into an old generation.

Many practical language implementations use generational garbage collection, including OCaml, GHC and .NET. Generational collection works well when the generational hypothesis holds but struggles when values survive the nursery only to become unreachable soon afterwards. This corresponds to common allocation patterns such as cycling values through mutable queues or caches and filling hash tables.

Imagine repeatedly enqueuing and dequeuing values on a queue. The lifetimes of the values are proportional to the length of the queue, so this provides a simple way to quantify the performance overhead of generational garbage collection. If boxed values are enqueued and dequeued on OCaml's built-in mutable Queue data structure then the time taken per element jumps by around a factor of 2-3.5 when the values reachable from the queue exceed the size of the nursery and, consequently, most survive to the old generation rather than being collected cheaply in the young generation. Specifically, the time taken to enqueue and dequeue boxed 32-bit ints on this 2.1GHz Opteron 2352 jumps from 0.33μs to 0.68-1.13μs per element. Where is this time being wasted?
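For the curious, a benchmark along the following lines reproduces the effect. This is only a sketch, not the exact program timed above: the function name bench, the queue lengths and the iteration count are illustrative, and it must be linked against the unix library for Unix.gettimeofday.

(* Sketch: cycle boxed 32-bit ints through OCaml's built-in Queue while
   keeping n elements reachable at all times.  The per-element cost jumps
   once the live contents of the queue no longer fit in the minor heap. *)
let bench n iters =
  let q = Queue.create () in
  (* Pre-fill so that n boxed values stay reachable throughout the run. *)
  for i = 1 to n do
    Queue.push (Int32.of_int i) q
  done;
  let t = Unix.gettimeofday () in
  for i = 1 to iters do
    ignore (Queue.pop q);
    Queue.push (Int32.of_int i) q
  done;
  let dt = Unix.gettimeofday () -. t in
  Printf.printf "n = %7d: %.3f us per element\n%!" n (1e6 *. dt /. float iters)

let () =
  (* Sweep the queue length from well below to well above the nursery size. *)
  List.iter (fun n -> bench n 1_000_000) [1_000; 10_000; 100_000; 1_000_000]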

When a boxed value (such as a 32-bit integer) is allocated in OCaml, it is augmented with a one-word header plus a word used for the forwarding pointer, and that whole block is bump-allocated from the nursery. When the value is written into a Queue residing in the old generation, a write barrier is incurred that stores a copy of the reference in the remembered set. When the nursery fills, a minor collection traces from the global roots and the remembered set through the reachable values in the nursery. Those values are copied into the old generation, their forwarding pointers are set, and all remaining references to them are updated via the forwarding pointers to point at the copies in the old generation. The nursery is then emptied simply by resetting the bump allocator to its start.
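The promotions themselves are easy to observe because OCaml's Gc module reports how many words were allocated in the minor heap and how many of those were promoted to the major heap. A small sketch (the helper promotion_rate is ours, not part of the benchmark above):

(* Sketch: measure how much of the minor heap survives while running f.
   When the queue is longer than the nursery can hold, nearly every word
   allocated is promoted. *)
let promotion_rate f =
  let before = Gc.stat () in
  f ();
  let after = Gc.stat () in
  let minor = after.Gc.minor_words -. before.Gc.minor_words in
  let promoted = after.Gc.promoted_words -. before.Gc.promoted_words in
  Printf.printf "%.0f minor words allocated, %.0f promoted (%.1f%%)\n"
    minor promoted (100. *. promoted /. minor)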

Suffice it to say, this is a lot of overhead when the values allocated into the nursery do not die quickly enough. In that case all of this effort is wasted and we would have been better off allocating directly into the old generation in the first place. What can be done to address this problem?

Fortunately, McKinley et al. made a breakthrough in GC design in recent years with their invention of a new class of GC algorithms known as mark-region collectors. It began with their Beltway GC in 2002, a generalization of several existing GC designs, and culminated in their Immix GC in 2008. In effect, this design allows a nursery full of reachable values to be migrated to the old heap implicitly, without any copying, and a new nursery is allocated to replace it. The old generation is then effectively a collection of surviving nurseries. The precise placement policy is more complicated because old nurseries can be reused in order to avoid gross fragmentation, but the basic concept is simple enough.
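To make the contrast with copying collection concrete, here is a toy sketch of the mark-region idea in OCaml. It is purely illustrative: the names and sizes are invented, real collectors such as Immix work on raw memory blocks subdivided into lines rather than arrays of slots, and the mark phase is elided.

(* Toy mark-region heap: allocation bumps a cursor inside a fixed-size
   region; a full region is retired to the old set as-is, with no copying
   and no forwarding pointers; wholly dead regions are recycled later. *)
type region = {
  slots : int array;        (* stand-in for raw object storage *)
  mutable cursor : int;     (* bump pointer within the region *)
  mutable live : bool;      (* set by the (elided) mark phase *)
}

let region_size = 1024
let fresh () = { slots = Array.make region_size 0; cursor = 0; live = false }

type heap = {
  mutable nursery : region;        (* region currently being allocated into *)
  mutable old_set : region list;   (* retired regions, kept in place *)
  mutable free : region list;      (* recycled wholly-dead regions *)
}

let create () = { nursery = fresh (); old_set = []; free = [] }

(* Allocate one slot; retire the nursery wholesale when it is full. *)
let rec alloc heap value =
  let r = heap.nursery in
  if r.cursor < region_size then begin
    r.slots.(r.cursor) <- value;
    r.cursor <- r.cursor + 1;
    (r, r.cursor - 1)                  (* "address" of the new value *)
  end else begin
    heap.old_set <- r :: heap.old_set; (* no copying, no forwarding *)
    heap.nursery <-
      (match heap.free with
       | x :: rest -> heap.free <- rest; x.cursor <- 0; x
       | [] -> fresh ());
    alloc heap value
  end

(* After marking, recycle regions in which nothing was marked live. *)
let sweep heap =
  let dead, kept = List.partition (fun r -> not r.live) heap.old_set in
  heap.old_set <- kept;
  heap.free <- dead @ heap.free;
  List.iter (fun r -> r.live <- false) heap.old_set

The crucial point is in alloc: a full nursery is appended to the old set verbatim, so values that survive pay nothing for their promotion.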

A Google Summer of Code project implemented an Immix variant for the Glasgow Haskell Compiler. The results were found to be underwhelming, but that is not so surprising given that this GC design should be most effective when filling mutable data structures such as queues, caches, hash sets and hash tables, which idiomatic Haskell code rarely does. We believe that a simple mark-region variant should be able to dramatically improve HLVM's performance on parallel functional code without degrading the performance of imperative code the way generational garbage collectors like OCaml's do.


6 comments:

BlueDavy said...

:-) So Sun's JDK will get Garbage First, because it is based on regions.
But JRockit has also done a lot of work to improve the collection of semi-long-lived objects.

Veikko Eeva said...
This comment has been removed by the author.
Veikko Eeva said...

From what I gather from the InfoQ presentation by the LMAX guys Martin Thompson and Michael Barker, they circumvented a whole host of issues, including Java garbage collection and latency, by using ring buffers that fit into the L2 cache in strategic places. In spirit it's a bit like doing systems-level programming, where such buffers are ubiquitous.

In any event, here is a short blurb about the presentation:
Summary

Martin Thompson and Michael Barker talk about building a HPC financial system handling over 100K tps at less than 1ms latency by having a new approach to infrastructure and software. Some of the tips include: understand the platform, model the domain, create a clear separation of concerns, choose data structures wisely, and run business logic on a single thread.

Flying Frog Consultancy Ltd. said...

@Veikko: What exactly do they mean by "less than 1ms latency"? Is that really 100% of messages under 1ms or the conventional 95% under 1ms or maybe just 1ms mean/median/modal average latency? Is the system fault tolerant or are they just measuring the software layer without network hops?

Perhaps my interpretation of their results is wrong but you should be able to get well beyond that level of performance from F# without having to resort to any of the kinds of low-level optimizations they go into.

Veikko Eeva said...

@Flying Frog
I don't think they state the variation that precisely. My best guess is that they just read the logs, which say they have processed so-and-so many transactions per second. Around the three-minute mark one of the presenters mentions latencies of 10 ms, 100 ms and 600 ms between placing a bid and getting a response; after that the other one talks about the importance of predictable latency, on the order of microseconds up to milliseconds at most.

I'd imagine one of the challenges they had was to keep the latency variation within some bounds. That would require keeping the garbage collector from freezing or slowing the system at the wrong moment; they mention one shouldn't generate needless garbage. As for the ring buffers, they are using them at the "integration points" between subsystems, from what I understand. To me it looks like a very good way to build fast inter-subsystem communication channels that give them some predictability over cache behaviour. They also mention that ring buffers are good at providing back-pressure if a preceding part of the system is faster than a following one. Though it could be that in F#, for instance, some other construct yields the same benefits (I'm not that knowledgeable regarding F# or functional languages).

Regardless of the strategy, avoiding evicting data from the cache is the key to high-performance code, preferably without making the code contrived.

They don't mention how heavy their calculations are, so it's difficult to judge their effect on processor load and throughput. I'm not sure what you mean about not having to resort to those kinds of low-level tricks. As I see it, they need to take care of the network I/O and some business-logic calculations and then push the results back out to the network. A ring buffer looks like a simple, well-chosen data structure for the job. Around the 27-minute mark they give some figures, which say:
On a single thread you have ~3 billion instructions per second to play with:

10k+ TPS
- If you don't do anything too stupid

100k+ TPS
- With well-organised clean code and standard libraries

1m+ TPS
- With custom cache-friendly collections
- Good performance tests
- Controlled garbage creation
- Very well modelled domain

jay paul said...

Nice post! Can’t wait for the next one. Keep stuff like this coming.
