Friday, 11 November 2011

Classifying garbage collection algorithms

Richard Jones' excellent new book about garbage collection has rekindled my fascination with this important subject. The Wikipedia page about GC is of disappointingly poor quality, so it is worth reviewing the main points here.

GC algorithms are categorized into four main families, with production garbage collectors often combining algorithms from different families: copying, mark-sweep, mark-compact and reference counting. The first three are tracing collectors: they work by following pointers from the global roots (global variables and locals held on the stack) to find every reachable value.
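
To make the tracing idea concrete, here is a minimal sketch in C of the traversal at the heart of all three tracing families. The Obj layout and the function names are illustrative assumptions, not taken from any particular collector:

    #include <stddef.h>

    /* Hypothetical object layout: a mark bit plus an array of child
       pointers. A real collector derives the child pointers from type
       information or header tag bits instead. */
    typedef struct Obj {
        int marked;
        size_t nchildren;
        struct Obj *children[];    /* C99 flexible array member */
    } Obj;

    /* Depth-first traversal marking everything reachable from one value.
       Recursion is used for brevity; real collectors use an explicit
       work list or pointer reversal to bound the space used. */
    static void trace(Obj *o) {
        if (o == NULL || o->marked) return;
        o->marked = 1;
        for (size_t i = 0; i < o->nchildren; i++)
            trace(o->children[i]);
    }

    /* A collection begins by seeding the trace from every root. */
    void trace_roots(Obj **roots, size_t nroots) {
        for (size_t i = 0; i < nroots; i++)
            trace(roots[i]);
    }

What the three tracing families do with the reachable values once they have been found is what distinguishes them.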

Copying collectors allocate into one space and then "evacuate" the reachable heap blocks (called "survivors") into another space before clearing the space they came from. Cheney's semi-space algorithm is the simplest copying collector: it uses two spaces and copies from one to the other and back again. The advantage of copying collection is that many heap-allocated blocks are deallocated simultaneously simply by resetting a pointer, which is ideal when only a small proportion of the allocated values survive a collection. As most values die young (the generational hypothesis), copying collection is commonly used for the nursery generation of a generational collector.
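
A minimal sketch of Cheney's algorithm in C may clarify the mechanism; the object layout is hypothetical, and the allocator, the space swap and overflow checks are omitted:

    #include <stddef.h>
    #include <string.h>

    /* Hypothetical layout: a forwarding pointer (non-NULL once the
       object has been evacuated) plus child pointers. */
    typedef struct Obj {
        struct Obj *forward;
        size_t nchildren;
        struct Obj *children[];
    } Obj;

    static char *to_space, *scan, *free_ptr;

    static size_t obj_size(Obj *o) {
        return sizeof(Obj) + o->nchildren * sizeof(Obj *);
    }

    /* Evacuate one survivor into to-space, leaving a forwarding
       pointer behind so other references to it can be fixed up. */
    static Obj *copy(Obj *o) {
        if (o == NULL) return NULL;
        if (o->forward != NULL) return o->forward;   /* already moved */
        size_t n = obj_size(o);
        Obj *moved = (Obj *)free_ptr;
        memcpy(moved, o, n);
        free_ptr += n;
        o->forward = moved;
        return moved;
    }

    /* Cheney's trick: to-space itself is the breadth-first work queue,
       so no auxiliary stack is needed. */
    void collect(Obj **roots, size_t nroots) {
        scan = free_ptr = to_space;
        for (size_t i = 0; i < nroots; i++)
            roots[i] = copy(roots[i]);
        while (scan < free_ptr) {
            Obj *o = (Obj *)scan;
            for (size_t i = 0; i < o->nchildren; i++)
                o->children[i] = copy(o->children[i]);
            scan += obj_size(o);
        }
        /* From-space is now all garbage: swap the two spaces and
           resume allocating from free_ptr. */
    }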

Mark-sweep is the oldest garbage collection algorithm (McCarthy, 1959) and works in two phases: tracing all reachable values and then sweeping the heap to deallocate the unreachable remainder. This algorithm offers relatively high throughput and can be made incremental (low latency) using Dijkstra's tricolor marking scheme, but is (arguably) prone to fragmentation. Consequently, mark-sweep is commonly used for the old generation of generational collectors.
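
The sweep phase pairs with a marking trace like the one sketched earlier. A minimal sketch, assuming a hypothetical header that chains every allocation into a single list:

    #include <stddef.h>
    #include <stdlib.h>

    /* Hypothetical header: a mark bit (set by the trace) and a link
       chaining every allocated object into one list for the sweep. */
    typedef struct Obj {
        struct Obj *next;
        int marked;
    } Obj;

    static Obj *all_objects;   /* head of the allocation list */

    /* Sweep: visit every allocated object, free the unmarked ones and
       clear the mark bits of the survivors ready for the next cycle. */
    void sweep(void) {
        Obj **link = &all_objects;
        while (*link != NULL) {
            Obj *o = *link;
            if (o->marked) {
                o->marked = 0;
                link = &o->next;
            } else {
                *link = o->next;   /* unlink and reclaim */
                free(o);
            }
        }
    }

Note that the sweep visits dead objects as well as live ones, which is one reason mark-sweep implementations often pair a free-list allocator with lazy sweeping.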

Mark-compact traces the reachable values and then slides them together toward one end of the heap. This avoids the fragmentation that can (allegedly) afflict mark-sweep collectors, but the extra passes over the heap required to compute forwarding addresses, update references and move the objects make the throughput of mark-compact collectors apparently so poor that they are practically unheard of.
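
The classic sliding algorithm is the LISP2 compactor, and its three extra passes over the heap show where the throughput goes. A hedged sketch in C, again with a hypothetical object layout and the mark phase assumed to have already run:

    #include <stddef.h>
    #include <string.h>

    /* Hypothetical layout for a LISP2-style sliding compactor: a mark
       bit (set by a prior trace), a forwarding address and children. */
    typedef struct Obj {
        int marked;
        struct Obj *forward;
        size_t nchildren;
        struct Obj *children[];
    } Obj;

    static char *heap_start, *heap_end;   /* one contiguous heap */

    static size_t obj_size(Obj *o) {
        return sizeof(Obj) + o->nchildren * sizeof(Obj *);
    }

    /* Pass 1: assign every live object its post-slide address. */
    static void compute_addresses(void) {
        char *dst = heap_start;
        for (char *p = heap_start; p < heap_end; p += obj_size((Obj *)p)) {
            Obj *o = (Obj *)p;
            if (o->marked) { o->forward = (Obj *)dst; dst += obj_size(o); }
        }
    }

    /* Pass 2: rewrite roots and fields to point at the new addresses. */
    static void update_references(Obj **roots, size_t nroots) {
        for (size_t i = 0; i < nroots; i++)
            if (roots[i] != NULL) roots[i] = roots[i]->forward;
        for (char *p = heap_start; p < heap_end; p += obj_size((Obj *)p)) {
            Obj *o = (Obj *)p;
            if (!o->marked) continue;
            for (size_t i = 0; i < o->nchildren; i++)
                if (o->children[i] != NULL)
                    o->children[i] = o->children[i]->forward;
        }
    }

    /* Pass 3: slide the live objects down over the gaps. The size is
       read before the move because source and destination may overlap. */
    static void slide(void) {
        char *p = heap_start;
        while (p < heap_end) {
            Obj *o = (Obj *)p;
            size_t n = obj_size(o);
            if (o->marked) {
                Obj *dst = o->forward;
                memmove(dst, o, n);
                dst->marked = 0;
            }
            p += n;
        }
    }

    void compact(Obj **roots, size_t nroots) {
        compute_addresses();
        update_references(roots, nroots);
        slide();
    }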

Reference counting works by having each value count the number of references to it. The program must register locally held references by incrementing the reference count of the value referred to and decrementing it again when the reference is no longer held. This is typically done by decrementing the reference count when the local reference falls out of scope in the source code, although liveness analysis can be exploited to find a tighter bound on when a reference is no longer needed. When a reference count falls to zero there are no longer any references to the value, so it is unreachable and can be deallocated. Interestingly, reference counting is the dual of tracing: it works by pursuing dropped references rather than by following live references.

The advantages of reference counting are its simplicity, its determinism in single-threaded programs (multithreaded programs race to decrement and, therefore, deallocate non-deterministically) and the ease with which it can be made incremental by maintaining a queue of reference counts that have yet to be decremented. The disadvantages are that it leaks cycles and has very poor throughput. However, cycles can be collected either by a backup tracing collector or by a dedicated cycle detector such as the trial deletion algorithm, and throughput can be improved by deferring decrements (although this reintroduces unpredictability).
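
A minimal sketch of the mechanism in C, with hypothetical names; note how releasing the last reference cascades into the children, and how two objects that reference each other can never reach a count of zero, which is exactly the cycle leak:

    #include <stddef.h>
    #include <stdlib.h>

    /* Hypothetical refcounted object: a count plus child references. */
    typedef struct Obj {
        size_t refcount;
        size_t nchildren;
        struct Obj *children[];
    } Obj;

    static Obj *rc_retain(Obj *o) {
        if (o != NULL) o->refcount++;
        return o;
    }

    /* Dropping the last reference frees the object and recursively
       releases everything it points to; a long chain of such releases
       is the classic reference-counting pause. */
    static void rc_release(Obj *o) {
        if (o == NULL) return;
        if (--o->refcount == 0) {
            for (size_t i = 0; i < o->nchildren; i++)
                rc_release(o->children[i]);
            free(o);
        }
    }

    /* Overwriting a slot must retain the new value before releasing
       the old one, in case they are the same object. */
    void rc_assign(Obj **slot, Obj *value) {
        rc_retain(value);
        rc_release(*slot);
        *slot = value;
    }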

Latency is a big issue in the context of garbage collection, so algorithms are also classified according to their latency characteristics. Stop-the-world algorithms are relatively simple but high latency, incurring arbitrarily long pause times during which the mutator threads (those running the actual program) make no progress at all. The .NET 4 server GC and GHC's collector are examples of stop-the-world GCs in production. The advantages of stop-the-world garbage collection are simplicity and high throughput, but these GCs often incur pauses on the order of one second, which is totally unacceptable for many soft real-time applications such as visualization, games and (ironically!) servers.

Incremental GCs break the work of a full collection into small units and interleave them with the execution of the program. For example, OCaml's GC performs a "slice" of the work required to collect the old generation every time the young generation is collected. The advantages of incremental GCs are low latency compared to stop-the-world collectors (e.g. ~10ms pauses with OCaml) and simplicity compared to concurrent GCs.
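
A minimal sketch of how such a slice can be structured, using tricolor marking with an explicit gray work list and a work budget; the names and the fixed-size stack are illustrative assumptions:

    #include <stddef.h>

    /* Tricolor invariant: white = not yet seen, gray = seen but not
       yet scanned, black = fully scanned. */
    typedef enum { WHITE, GRAY, BLACK } Color;

    typedef struct Obj {
        Color color;
        size_t nchildren;
        struct Obj *children[];
    } Obj;

    #define GRAY_MAX 4096
    static Obj *gray_stack[GRAY_MAX];
    static size_t gray_top;

    /* Shade a white object gray, i.e. put it on the work list.
       (Assumes, for brevity, that the stack never overflows.) */
    static void shade(Obj *o) {
        if (o != NULL && o->color == WHITE) {
            o->color = GRAY;
            gray_stack[gray_top++] = o;
        }
    }

    /* One slice: blacken at most `budget` gray objects, then return
       control to the mutator. Returns nonzero when marking is done.
       A collection starts by shading the roots, after which the
       runtime calls mark_slice at regular intervals. */
    int mark_slice(size_t budget) {
        while (gray_top > 0 && budget-- > 0) {
            Obj *o = gray_stack[--gray_top];
            o->color = BLACK;
            for (size_t i = 0; i < o->nchildren; i++)
                shade(o->children[i]);
        }
        return gray_top == 0;
    }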

Today, the term "parallel GC" describes a GC that uses multiple threads to speed up the collection of garbage; the .NET 4 server GC is parallel because it uses one thread per core to mark the heap. "Concurrent GC" means the GC runs at the same time as the mutators; the .NET workstation GC is an example of a concurrent GC in production. Concurrent garbage collectors are further refined into on-the-fly and mostly-concurrent algorithms. An on-the-fly GC is defined by Jones et al. as one that never suspends more than one mutator at a time (i.e. there is no stop-the-world phase), although I suspect they meant that mutators are suspended independently of each other. Finally, a mostly-concurrent GC does suspend all mutator threads simultaneously, but only for a short period of time and not for an arbitrarily long GC phase. For example, the .NET workstation GC pauses all mutator threads while the global roots are examined but runs concurrently with them during the marking and sweeping of the whole heap, so it is mostly-concurrent.
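
Both incremental and concurrent marking must cope with the mutator rewriting pointers while marking is in progress, which is conventionally solved with a write barrier. A minimal sketch of a Dijkstra-style insertion barrier, reusing the shade operation and the tricolor layout from the slice sketch above:

    /* Every pointer store shades the stored target, so the marker
       never loses a reachable object behind the mutator's back:
       a black object can never come to point at a white one. */
    void write_barrier(Obj *holder, size_t slot, Obj *value) {
        shade(value);          /* maintain the tricolor invariant */
        holder->children[slot] = value;
    }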

These classifications by family of algorithm and by latency characteristics are the most important ones.
