Saturday, 8 January 2011

The importance of locality and sparsity in memory management

Our previous articles describing the disadvantages of generational garbage collection and our prototype mark-region memory management system designed for HLVM originally showed that region-based allocation and deallocation has the potential to be only 4-20% slower than OCaml's generational collector. However, our more recent work that was designed to be more realistic by deferring deallocations to bulk GC cycles was significantly slower, around twice as slow as OCaml.

There are several differences between the stack-based deallocation scheme used in the first benchmark and the GC-like deallocation scheme used in the second benchmark that have the potential to account for this performance degradation:

  • The mock GC introduced mark bytes into the allocated values and marked them as unreachable when they fell out of scope in the mutator.
  • The mock GC allocated by popping a reference off the top of a region's free list.
  • The mock GC deallocated by pushing a reference onto the free list of the reference's region.
  • The mock GC added an "allocated list" that is used to record everything that has been allocated.
  • Upon deallocation, the mock GC removed a reference from the allocated list by overwriting it with the last element and reducing the length of the allocated list by one.
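The swap-with-last removal described above can be sketched in a few lines of C. This is our own minimal reconstruction, not the mock GC's actual code; the type and function names are hypothetical and the fixed capacity is for illustration only:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of the mock GC's allocated-list removal:
   overwrite the removed slot with the last element and shrink the
   list by one.  O(1) per removal, but it does not preserve the
   allocation order of the surviving references. */
typedef struct {
    void  *refs[1024];
    size_t len;
} allocated_list;

static void allocated_list_remove(allocated_list *list, size_t i) {
    assert(i < list->len);
    list->refs[i] = list->refs[list->len - 1];  /* move last element into the hole */
    list->len--;                                /* drop the now-duplicated tail */
}
```

Note that after a removal the element that was at the back of the list sits in the middle, so a list that started in allocation order gradually loses that order.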

The overhead of the new "mark phase" that marks values as unreachable when they fall out of scope is minimal. The mark phase in HLVM's real GC accounts for less than 2% of the total running time of this benchmark and it does significantly more work (traversing the shadow stack) than the mark phase in this benchmark.

Using the free list as a stack would cause subsequent allocations to be contiguous in memory if and only if the free list happens to be ordered, i.e. allocations and deallocations are in LIFO order. This was the case in the first benchmark but not the second.
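A free list used as a stack can be sketched as follows. This is a hypothetical illustration (the field and function names are ours), with the free list threaded through the free slots themselves; it shows why LIFO frees keep the list address-ordered while out-of-order frees scramble it:

```c
#include <stddef.h>

/* Each free slot stores a pointer to the next free slot.  Allocation
   pops the top of the free list; deallocation pushes onto it.  If
   frees arrive in LIFO order the list stays in address order, so
   consecutive allocations return contiguous slots. */
typedef struct slot { struct slot *next_free; } slot;

static slot *alloc_slot(slot **free_list) {
    slot *s = *free_list;              /* pop the top of the free list */
    if (s) *free_list = s->next_free;
    return s;
}

static void free_slot(slot **free_list, slot *s) {
    s->next_free = *free_list;         /* push back on top */
    *free_list = s;
}
```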

Upon collection, the allocated list was traversed sequentially. However, the way in which references were removed from the allocated list may well have been the critical design flaw. Specifically, moving the element from the back of the allocated list into the middle to overwrite a removed element changes the order of the list slightly. We suspected that the disorder would accumulate over time, destroying the locality of the references in the allocated list. Consequently, a sequential traversal of the allocated list is likely to have been of little benefit because the subsequent elements of the allocated list would have referenced random locations. Moreover, that disorder would have been passed on to the free lists, which would have seen values freed in random order rather than sequentially.

We had speculated that sparsity was largely responsible for the remaining performance gap between OCaml and all of the strategies based upon per-value deallocation because OCaml's generational GC is able to sweep runs of contiguously-allocated values from the nursery generation in constant time. On the basis of this, we predicted that a derivative of mark-region capable of deallocating contiguous runs of values from a region might get significantly closer to OCaml's performance on this benchmark. In particular, the performance profile of our prototype indicates that 18% of the total time is spent allocating and 17% is spent collecting. Furthermore, the L2 cache is around 50% slower than the L1 cache on this machine so the performance of the mutator might be 50% worse due to poor locality of reference in the second benchmark. These figures suggest that improving locality might double performance and, therefore, make our solution as fast as OCaml.

To answer some of our questions, we wrote a simulation of the prototype (!) in the F# programming language and used it to gather relevant statistics. This proved to be extremely successful and led to several major insights.

Firstly, values die over time, so longer gaps between GC cycles mean a higher proportion of unreachable values. The following graph illustrates the relationship between the number of allocations performed between collections and the proportion of values that remain reachable:

If GC cycles are separated by at least 300 allocations then more than half of the allocated values become unreachable and are swept. Therefore, we can optimize the allocated list operations under the assumption that most of the elements of the list will not survive a GC cycle.

Secondly, we found that the algorithm used to remove an element from the allocated list does indeed dominate the locality and, therefore, the performance of the entire program. The following graph illustrates the probability density of deallocations as a function of the length of the run of contiguously-allocated values that a value is in:

These results show that the original algorithm for removing references from the allocated list led to 45% of values being deallocated alone and significantly fewer being deallocated in contiguous runs. Therefore, it is clear that the original algorithm was indeed destroying locality when it reordered the references in the allocated lists.

In contrast, removing values from the allocated list using the order-preserving sliding compaction retained locality. In that case, only 0.02% of values were deallocated alone. In fact, the new algorithm is so good at preserving locality that values are more likely to be deallocated with a few neighbors than alone. Specifically, values are 4× more likely to be deallocated as part of a run of 23 contiguously-allocated values than they are to be deallocated alone.

These new results lend credence to our conjecture that exploiting sparsity by deallocating contiguous runs of values rather than individual values is the key to achieving performance comparable to that of a generational GC. However, we also now know that this will only be possible if the allocation-collection cycle preserves locality as much as possible.

The simplest way to preserve the order of allocations and exploit sparse deallocations is to sidestep the problem by changing the data structures involved:

  • Replace the free list with a bitvector.
  • Replace the mark bits in each value with a per-region mark bitvector.
  • Replace the allocated list with queues of full and non-full regions.

With 512 bits in a cache line, we can reserve the first cache line of each region to use as a bitvector for the entire region because we previously found that regions containing around this many values give near-optimal performance. Allocating from a region is then a matter of finding the first (un)set bit in the bitvector and the associated location, flipping the bit and returning the location. A contiguous sequence of values can be deallocated from a region by computing and applying a bitmask to the bitvector.
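The bitvector allocator just described can be sketched in C. This is our own minimal sketch rather than HLVM's code: we use a single 64-bit word instead of a full 512-bit cache line, assume a set bit means the slot is free, and the function names are hypothetical. Finding the first set bit is a single instruction on modern hardware (exposed here via the GCC/Clang `__builtin_ctzll` builtin):

```c
#include <stdint.h>

/* Bit i of the bitvector is 1 when slot i of the region is free.
   Allocation finds the first set bit, clears it and returns the
   slot index. */
static int region_alloc(uint64_t *free_bits) {
    if (*free_bits == 0) return -1;        /* region is full */
    int i = __builtin_ctzll(*free_bits);   /* index of the first free slot */
    *free_bits &= *free_bits - 1;          /* clear that bit */
    return i;
}

/* Deallocate a contiguous run of `count` slots starting at `start`
   by applying a single bitmask (count < 64 assumed for simplicity). */
static void region_free_run(uint64_t *free_bits, int start, int count) {
    uint64_t mask = ((count >= 64) ? ~0ull : ((1ull << count) - 1)) << start;
    *free_bits |= mask;
}
```

The key point is that freeing a run of n contiguous values costs one OR, independent of n.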

With regions conveying where allocated values are, there is no longer any need for an explicit allocated list. Therefore, the allocated list may be replaced with global queues of regions. When a local region is filled, it is enqueued on the global queue of full regions and a region is dequeued from the global queue of non-full regions, or freshly allocated if there are no non-full regions in the global queue. When a GC cycle occurs, the global regions are dequeued, swept and enqueued again on the appropriate queue, i.e. if a full region becomes non-full then it changes queues.
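The region-queue discipline can be sketched as follows. Again this is a hypothetical illustration with names of our own choosing, not HLVM's implementation:

```c
#include <stdlib.h>

/* A region carries its bitvectors and slots (elided here); regions
   are chained into singly-linked queues. */
typedef struct region { struct region *next; /* bitvectors, slots... */ } region;

typedef struct { region *head, *tail; } queue;

static void enqueue(queue *q, region *r) {
    r->next = NULL;
    if (q->tail) q->tail->next = r; else q->head = r;
    q->tail = r;
}

static region *dequeue(queue *q) {
    region *r = q->head;
    if (r) { q->head = r->next; if (!q->head) q->tail = NULL; }
    return r;
}

/* When the mutator's local region fills up: retire it to the full
   queue and take a replacement from the non-full queue, or allocate
   a fresh region if that queue is empty. */
static region *next_local_region(queue *full, queue *nonfull, region *filled) {
    enqueue(full, filled);
    region *r = dequeue(nonfull);
    return r ? r : calloc(1, sizeof *r);
}
```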

Incredibly, sweeping a region is now as simple as applying bitwise operations to the allocated- and marked-bitvectors in order to remove unmarked locations from the allocated bitvector.
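In code, the sweep reduces to a couple of bitwise operations. A minimal sketch, again assuming one 64-bit word of bits per region rather than a full cache line, with hypothetical names:

```c
#include <stdint.h>

/* A slot survives the sweep only if it is both allocated and marked,
   so a single AND removes every unmarked slot from the allocated set.
   The mark bits are then cleared ready for the next GC cycle. */
static void region_sweep(uint64_t *allocated, uint64_t *marked) {
    *allocated &= *marked;   /* unmarked slots leave the allocated set */
    *marked = 0;             /* reset marks for the next cycle */
}
```

Every slot cleared by the AND becomes free in one step, regardless of how many values the region contains.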

With this new design, the operations that limited the performance of the old design will now be substantially faster and the locality of reference for the mutator threads will be greatly improved.

We are currently prototyping this new design. As we shall see in a future post, this new GC algorithm not only naturally lends itself to both parallelism and concurrency but is also almost a drop-in replacement for HLVM's current memory management subsystem.


Michael Robin said...

Makes sense...
Have you thought of using the GPU for this method of GC?

Flying Frog Consultancy Ltd. said...

@Michael: Interesting idea! Hadn't thought of that...

I just completed the first working prototype last night and it is indeed the fastest of the general-purpose collectors I have written in C++, running just 10% slower than OCaml on functional code (and it will be up to 2-3× faster than OCaml on imperative code!).

The sweep phase really is a thing of beauty: load two cache lines, bitwise AND them and write it back and you've swept an entire region containing hundreds of heap-allocated blocks. The time taken to perform a sweep has fallen from 68µs with the traditional algorithm to just 0.1µs with our new bitwise algorithm! That means we could sweep 1Gb of heap in just 26ms, which is way faster than any VM I know of. Moreover, it is easily parallelized...

Allocation and marking will be slightly slower but I really think this is a big win overall.

Michael Robin said...

Yeah, seems like the GPU could do these bitwise GC ops to large regions well. (Pat. pending me :))

> The sweep phase really is a thing of beauty

Sounds awesome -- I think people would look forward to a formal paper and/or reference code at some point. (But do the reference code in MMIX just to annoy everybody...)

>Allocation and marking will be slightly slower

I think that's okay, as long as the front-end tools and VM work together to avoid heap allocation when not necessary.

Mat said...

Have you tried putting the free list in separate memory altogether? It seems as though you could improve collection times even further since you don't even touch the memory regions you want to mark/free.