Sunday, 9 January 2011

Sweeping 700× faster

Our initial experiments using the new garbage collector design indicate that it dramatically improves the performance of the slowest GC phase (sweeping) as expected, to the extent that it brings the overall performance of our C++ prototype within 10% of OCaml without introducing any of the overheads of generational garbage collection:

Moreover the algorithm is very simple and, in particular, parallel and concurrent variants should be much easier to design with this style of collector than with generational collectors because regions are a suitable granularity.

The sweep phase of the GC accounted for around a third of the total running time of the entire program using the traditional algorithm with allocated and free lists, with each sweep taking around 70┬Ás. Using the new bitwise algorithm, the sweep phase takes just 100ns and accounts for just 0.7% of the total running time of the program.

Our prototype mark region collector using a fake marking phase is now between 2 and 10% slower than OCaml without having sacrificed either its better performance on other benchmarks or its multicore capability. However, our new design has increased the cost of both allocation and marking. Although they accounted for a tiny proportion of the total running time, it remains to be seen whether or not this new GC algorithm will be faster than a traditional generational collector in this best-case scenario for generational collection when it is implemented in a real VM. In particular, the overheads of HLVM's current shadow stack may well make it slower.

Our collection strategy also captures the benefits of Appel's semi-generational collection algorithm. Appel's algorithm compacted nursery survivors to one end of the nursery and used the remaining space as a smaller nursery repeatedly until the entire nursery was full of reachable values that were then promoted to the old generation. Our region-based collector can allow each thread to sweep its own region locally, independently of all other regions and cores. This also leads to the same non-linearity that makes Appel's algorithm so effective but without the overheads of copying. Using regions containing 62 blocks each 64 bytes long, local sweeping increased the number of allocations performed before a new region was required from 62 to a whopping 3,000 allocations. Obtaining a new region requires synchronization, so this simple improvement has reduced the rate of synchronizations by around a factor of 50. With 8-byte regions, the 11-queens problem can be solved with a single region and, therefore, without any contention at all between threads.

Therefore, we believe our new GC design will allow us to make HLVM competitively performant on functional benchmarks like this without losing its substantial advantages on imperative code, where it is often several times faster than OCaml.

Saturday, 8 January 2011

The importance of locality and sparsity in memory management

Our previous articles describing the disadvantages of generational garbage collection and our prototype mark-region memory management system designed for HLVM originally showed that region-based allocation and deallocation has the potential to be only 4-20% slower than OCaml's generational collector. However, our more recent work that was designed to be more realistic by deferring deallocations to bulk GC cycles was significantly slower, around twice as slow as OCaml.

There are several differences between the stack-based deallocation scheme used in the first benchmark and the GC-like deallocation scheme used in the second benchmark that have the potential to account for this performance degradation:

  • The mock GC introduced mark bytes into the allocated values and marked them as unreachable when they fell out of scope in the mutator.
  • The mock GC allocated by popping a reference of the top of a region's free list.
  • The mock GC deallocated by pushing a reference onto the free list of the reference's region.
  • The mock GC added an "allocated list" that is used to record everything that has been allocated.
  • Upon deallocation, the mock GC removed a reference from the allocated list by overwriting it with the last element and reducing the length of the allocated list by one.

The overhead of the new "mark phase" that marks values as unreachable when they fall out of scope is minimal. The mark phase in HLVM's real GC accounts for less than 2% of the total running time of this benchmark and it does significantly more work (traversing the shadow stack) than the mark phase in this benchmark.

Using the free list as a stack would cause subsequent allocations to be contiguous in memory if and only if the free list happens to be ordered, i.e. allocations and deallocations are in LIFO order. This was the case in the first benchmark but not the second.

Upon collection, the allocated list was traversed sequentially. However, the way in which references were removed from the allocated list may well have been the critical design flaw. Specifically, moving an the element from the back of the allocated list into the middle, to overwrite a removed element changes the order of the list slightly. We suspected that the disorder would accumulate over time, destroying the locality of the references in the allocated list. Consequently, a sequential traversal of the allocated list is likely to have been of little benefit because the subsequent elements of the allocated list would have referenced random locations. Moreover, that disorder would have been passed on to the free lists, which would have seen values freed in random order rather than sequentially.

We had speculated that sparsity was largely responsible for the remaining performance gap between OCaml and all of the strategies based upon per-value deallocation because OCaml's generational GC is able to sweep runs of contiguously-allocated values from the nursery generation in constant time. On the basis of this, we predicted that a derivative of mark-region capable of deallocating contiguous runs of values from a region might get significantly closer to OCaml's performance on this benchmark. In particular, the performance profile of our prototype indicates that 18% of the total time is spent allocating and 17% is spent collecting. Furthermore, the L2 cache is around 50% slower than the L1 cache on this machine so the performance of the mutator might be 50% worse due to poor locality of reference in the second benchmark. These figures suggest that improving locality might double performance and, therefore, make our solution as fast as OCaml.

To answer some of our questions, we wrote a simulation of the prototype (!) in the F# programming language and used it to gather relevant statistics. This proved to be extremely successful and led to several major insights.

Firstly, values die over time so longer gaps between GC cycles means a higher proportion of unreachable values. The following graph illustrates the relationship between the number of allocations performed between collections and the proportion of values that remain reachable:

If GC cycles are separated by at least 300 allocations then more than half of the allocated values become unreachable and are swept. Therefore, we can optimize the allocated list operations under the assumption that most of the elements of the list will not survive a GC cycle.

Secondly, we found that the algorithm used to remove an element from the allocated list does indeed dominate the locality and, therefore, the performance of the entire program. The following graph illustrates the probability density of deallocations as a function of the length of the run of contiguously-allocated values that a value is in:

These results show that the original algorithm for removing references from the allocated list led to 45% of values being deallocated alone and significantly fewer being deallocated in contiguous runs. Therefore, it is clear that the original algorithm was indeed destroying locality when it reordered the references in the allocated lists.

In contrast, removing values from the allocated list using the order-preserving sliding compaction retained locality. In that case, only 0.02% of values were deallocated alone. In fact, the new algorithm is so good at preserving locality that values are more likely to be deallocated with a few neighbors than alone. Specifically, values are 4× more likely to be deallocated as part of a run of 23 contiguously-allocated values than they are to be deallocated alone.

These new results lend credence to our conjecture that exploiting sparsity by deallocating contiguous runs of values rather than individual values is the key to achieving performance comparable to that of a generational GC. However, we also now know that this will only be possible if the allocation-collection cycle preserves locality as much as possible.

The simplest way to preserve the order of allocations and exploit sparse deallocations is to side step the problem by changing the data structures involved:

  • Replace the free list with a bitvector.
  • Replace the mark bits in each value with a per-region mark bitvector.
  • Replace the allocated list with queues of full and non-full regions.

With 512 bits in a cache line, we can reserve the first cache line of each region to use as a bitvector for the entire region because we previously found that regions containing around this many values give near-optimal performance. Allocating from a region is then a matter of finding the first (un)set bit in the bitvector and the associated location, flipping the bit and returning the location. A contiguous sequence of values can be deallocated from a region by computing and applying a bitmask to the bitvector.

With regions conveying where allocated values are, there is no longer any need for an explicit allocated list. Therefore, the allocated list may be replaced with a global queue of regions. When a local region is filled it is enqueued on the global queue of full-regions and a region is dequeued from the global queue of non-full regions or freshly allocated if there are no non-full regions in the global queue. When a GC cycle occurs, the global regions are dequeued, swept and enqueued again on the appropriate queue, i.e. if a full region becomes non-full then it changes queues.

Incredibly, sweeping a region is now as simple as applying bitwise operations to the allocated- and marked-bitvectors in order to remove unmarked locations from the allocated bitvector.

With this new design, the operations that limited the performance of the old design will now be substantially faster and the locality of reference for the mutator threads will be greatly improved.

We are currently prototyping this new design. As we shall see in a future post, this new GC algorithm not only naturally lends itself to both parallelism and concurrency but is also almost a drop-in replacement for HLVM's current memory management subsystem.

Wednesday, 5 January 2011

Paul Graham's accumulator generator

Paul Graham once published an article about what he called "accumulator generators". This problem requires the existence of an unspecified numeric tower. Lisp happens to have one and it happens to be adequate for Paul Graham's examples.

You can implement a numeric tower in F# either using a union type (like type number = Int of int | Float of float) or by boxing everything. The following solution uses the latter approach:

let add (x: obj) (y: obj) =
match x, y with
| (:? int as m), (:? int as n) -> box(m+n)
| (:? int as n), (:? float as x)
| (:? float as x), (:? int as n) -> box(x + float n)
| (:? float as x), (:? float as y) -> box(x + y)
| _ -> failwith "Run-time type error"

let acc x =
let x = ref x
fun (y: obj) ->
:= add !x y

let x : obj -> _ = acc(box 1)
do x(box 5)
do acc(box 3)
do printfn "%A" (x(box 2.3))

However, numeric towers are of little use in general purpose programming and usually do more harm than good. Trying to learn from these kinds of challenges can do more harm than good. The real question is: why we do not want a numeric tower, do not want to box and do not want run-time type promotion?

In other words, why didn't we just write:

let x = 1
let x = x + 5
let x = float x + 2.3

We know the type of x at every step. Every number is stored unboxed. And we know that this code cannot produce a run-time type error.