For many years, generational collection was the defacto-standard GC architecture. Based upon the observation that the distribution of value lifetimes is heavily skewed towards short lifetimes (most values die young), generational garbage collectors allocate into a nursery generation and survivors are copied out into an old generation.
Many practical language implementations use generational garbage collection including OCaml, GHC and .NET. Generational collection works well when the generational hypothesis holds but struggles when values survive the nursery only to become unreachable soon afterwards. This corresponds to common allocation patterns such as cycling values through mutable queues or caches and filling hash tables.
Imagine repeatedly enqueuing and dequeuing values on a queue. The lifetimes of the values are proportional to the length of the queue. Thus, this provides a simple way to quantify the performance overhead of generational garbage collection. If boxed values are enqueued and dequeued on OCaml's built-in mutable Queue data structure then the time taken per element jumps by around a factor of 2-3.5 when the elements reachable from the queue exceed the size of the nursery and, thus, most survive to the old generation rather than being collected efficiently in the young generation. Specifically, the time taken to enqueue and dequeue 32-bit ints on this 2.1GHz 2352 Opteron jumps from 0.33μs to 0.68-1.13μs. Where is this time being wasted?
When a boxed value (such as a 32-bit integer) is allocated in OCaml, it is augmented with a 1-word header and another for the forwarding pointer and that whole block is bump allocated from the nursery. When that value is written into the Queue in the old generation, a write barrier is incurred which stores a copy of the reference in the remembered set. When the nursery is filled, a minor collection is performed that traces from the global roots and remembered set throughout the reachable values in the nursery. These values are then copied into the old generation, their forwarding pointers are set and all locally-held references to them are updated via the forwarding pointers to point into their copies in the old generation. The nursery is then swept by resetting the bump allocator to the start of the nursery.
Suffice to say, this is a lot of overhead when the values allocated into the nursery do not die quickly enough. In that case, all of this effort is a complete waste of time and we would have been better off allocating directly into the old generation in the first place. What can be done to address this problem?
Fortunately, McKinley et al. made a breakthrough in GC design in recent years with their invention of a new class of GC algorithms known as mark-region GCs. It all began with their invention of the Beltway GC in 2002, a generalization of several existing GC designs, and culminated in their Immix GC in 2007. In effect, this GC design allows a nursery full of reachable values to be migrated to the old heap implicitly without any copying and a new nursery is allocated to replace it. The old generation is then effectively a collection of surviving nurseries. The precise placement policy is more complicated because it is possible to reuse old nurseries in order to avoid gross fragmentation but the basic concept is simple enough.
A Google Summer of Code project had an Immix variant implemented for the Glasgow Haskell Compiler. They found the results to be underwhelming but that is not so surprising given that this GC design should be most effective when filling mutable data structures such as queues, caches, hash sets and hash tables. We believe that a simple mark-region variant should be able to dramatically improve HLVM's performance on parallel functional code without degrading the performance of imperative code as generational garbage collectors like OCaml's do.