Tuesday, 28 December 2010

Towards a mark-region GC for HLVM

Our previous article highlighted the advantages of the recent mark-region GC design and hinted at HLVM adopting this design. We have just completed some preliminary tests, using a prototype written in C++, to measure the performance of different allocation strategies. Our results are as follows, with times normalized to the time an equivalent OCaml program takes (so 1.0 means as fast as OCaml):

The four columns in each section give the times relative to OCaml for solving the 8-, 9-, 10- and 11-queens problems.

The "Boehm" section refers to the conservative Boehm GC which is 40-70% slower than OCaml on this benchmark. The "malloc" section refers to allocating using the malloc function from glibc without ever freeing and is 2.2-3.1× slower than OCaml. The "free" section refers to allocating with malloc and freeing (manually) and is 1.9-2.3× slower than OCaml. The "bump" section refers to a naive bump allocator that never recycles memory and is 1.4-1.7× slower than OCaml. Finally, the "region" section refers to our prototype region-based algorithm, which is just 4-20% slower than OCaml on this benchmark!

This benchmark is a classic logic programming problem that allocates large numbers of short-lived values. This is a best-case benchmark for OCaml and a worst-case benchmark for the current HLVM. OCaml's generational garbage collector with its fast bump allocator and constant-time recycling of dead values from the nursery generation does extremely well on this benchmark: we have been unable to beat its performance from C/C++.
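To make the allocation pattern concrete, here is an illustrative C++ sketch (not the benchmark's actual code) of an n-queens solver in this style: the search threads a linked list of placed queens through the recursion, allocating a short-lived cons cell for every candidate placement, exactly the behavior that stresses an allocator.

```cpp
#include <cassert>

// A cons cell holding one placed queen; every placement allocates one.
struct Cons { int row, col; Cons* next; };

// A queen at (row, col) is safe if no earlier queen shares its column or
// either diagonal.
bool safe(int row, int col, const Cons* qs) {
    for (const Cons* q = qs; q; q = q->next)
        if (q->col == col || q->row - q->col == row - col ||
            q->row + q->col == row + col)
            return false;
    return true;
}

// Count solutions by placing one queen per row.
long solve(int n, int row, Cons* qs) {
    if (row == n) return 1;
    long count = 0;
    for (int col = 0; col < n; ++col)
        if (safe(row, col, qs)) {
            Cons* cell = new Cons{row, col, qs};  // short-lived heap value
            count += solve(n, row + 1, cell);
            delete cell;  // the "free" variant; the "malloc" variant leaks here
        }
    return count;
}
```

The variants in the results above differ only in how each cell's memory is obtained and released: `malloc` without `free`, `malloc` with `free`, bump allocation, or the region allocator.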

The Boehm garbage collector is another interesting point of comparison because it has been the subject of intense optimization for many years.

These new results are very enlightening. Recycling memory by calling free is significantly faster than leaking memory by only ever calling malloc. Specifically, leaking is around 3× slower than OCaml and proper manual memory management using malloc and free is around 2× slower than OCaml. Moreover, the Boehm GC actually fares slightly better than manual memory management on this benchmark, at 1.4-1.7× slower than OCaml versus roughly 2× for malloc and free.

Bump allocating from a huge preallocated pool without ever freeing is surprisingly slow: around 1.5× slower than OCaml. This early result was disappointing but it turned out that our new region allocator is very fast indeed. This is extremely encouraging because it means that a non-moving mark-region collector for HLVM might be able to offer the best of both worlds: the speed of C/C++/Fortran for imperative code using mutable data structures and the speed of OCaml/Haskell for functional code using immutable data structures.

Our prototype region allocator allocates aligned regions using the glibc memalign function. This allows a pointer to the start of the region to be obtained from any pointer inside the region using bitwise operations. Each region begins with a C++ vector that holds the free list, the list of pointers inside the region that are not currently allocated. The remainder of the region is a pool of fixed-size blocks that can be allocated and deallocated. To allocate, the last element is popped off the free list. To free, the free list associated with the pointer is obtained using bitwise operations and the pointer is pushed onto the back of the free list. In the prototype, if the allocator finds the current region to be full then it stores it in a global collection of regions and allocates a new local region. In a production version, the allocator would recycle one of the non-full regions from the global collection of regions rather than allocating a new region each time.
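The scheme just described can be sketched as follows. This is a minimal reconstruction, not the prototype itself: the block and region sizes are illustrative, and `posix_memalign` is used in place of the older `memalign`.

```cpp
#include <cstdlib>
#include <cstdint>
#include <new>
#include <vector>

constexpr std::size_t REGION_SIZE = 1 << 20;  // 1MB regions, aligned to their own size
constexpr std::size_t BLOCK_SIZE  = 32;       // fixed-size blocks (illustrative)

struct Region {
    // Pointers inside this region that are not currently allocated.
    // (The vector's own storage comes from the ordinary heap.)
    std::vector<void*> free_list;

    Region() {
        // The remainder of the region, after this header, is the block pool.
        char* base = reinterpret_cast<char*>(this);
        std::size_t first = (sizeof(Region) + BLOCK_SIZE - 1) / BLOCK_SIZE * BLOCK_SIZE;
        for (std::size_t off = first; off + BLOCK_SIZE <= REGION_SIZE; off += BLOCK_SIZE)
            free_list.push_back(base + off);
    }

    static Region* create() {
        // Aligning each region to its own size lets bitwise masking recover
        // the region header from any pointer into it.
        void* mem = nullptr;
        if (posix_memalign(&mem, REGION_SIZE, REGION_SIZE)) return nullptr;
        return new (mem) Region();
    }

    // Find the region owning any interior pointer using bitwise operations.
    static Region* of(void* p) {
        return reinterpret_cast<Region*>(
            reinterpret_cast<std::uintptr_t>(p) & ~std::uintptr_t(REGION_SIZE - 1));
    }

    // Allocate: pop the last element off the free list (nullptr when full).
    void* alloc() {
        if (free_list.empty()) return nullptr;
        void* p = free_list.back();
        free_list.pop_back();
        return p;
    }
};

// Free: mask the pointer to find its region and push it back on the list.
void region_free(void* p) {
    Region::of(p)->free_list.push_back(p);
}
```

Because the free list is used LIFO, a freed block is the next one handed out, which keeps the working set hot in cache.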

How big should a region be? The results shown above were obtained using 1MB regions, large enough that they were never filled and a new region was never needed. However, reducing the region size to 1kB causes the prototype to create 8,295 regions on the 11-queens problem but the program is only 5% slower and total memory consumption is around 99% lower than simply leaking, so memory is being recycled effectively.

Measuring the absolute performance of the 10-queens solver as a function of the region size gives the following results:

The smallest possible region size of 16 bytes allows only a single allocation per region and makes the whole program run 7.6× slower. Increasing the region size improves the efficiency of the region allocator (except for an anomaly between 128- and 256-byte regions that is probably due to benchmark-specific allocation patterns). With 1kB regions, performance is within a few percent of optimal for this benchmark. One might have expected significant performance gains from larger regions up to the size of this machine's 6MB L2 cache, but the tiny working set required by this benchmark eliminates any performance difference beyond 1kB regions.

The following graph shows the number of regions allocated for different region sizes on the 10-queens benchmark:

Smaller regions mean that a larger number of regions is required, up to around ten million for 16-byte regions. The relationship here mirrors the region size vs. performance relationship above because, when regions are small, most of the time is spent administering them. The initial sharp drop-off occurs because allowing regions to contain even a few more values significantly increases their ability to recycle space. With 1kB regions, only 874 regions are created to solve this problem.

The product of the region size and the number of regions used quantifies the total space allocated for regions via glibc. Doubling the region size from 64 bytes to 128 bytes reduces the total memory allocated by 33%, whereas doubling it from 2kB to 4kB reduces the total memory allocated by 99%. Perhaps this accelerating payoff reflects the generational hypothesis, which predicts that a value's probability of dying falls off extremely sharply as a function of its age.

In HLVM, a thread-safe allocator will try to use the thread-local region and resort to synchronization only when the current region is full, whereupon an existing non-full region will be reused or a new empty region will be created. The deallocator must potentially access any region but, in HLVM's current design, it is only invoked from a single thread during the stop-the-world phase, so it can be thread-unsafe. This has two benefits over the current technique:

  • Single-threaded allocation and deallocation should be almost twice as fast as they are today.
  • Multi-threaded allocation should scale linearly with the number of cores whereas HLVM currently sees performance degradation from concurrent allocations.
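That policy can be sketched as follows. The `Region` here is a deliberately simplified stand-in (a fixed number of malloc'd blocks) for the aligned fixed-block regions described earlier; only the fast-path/slow-path split is the point, and all names and sizes are ours, not HLVM's.

```cpp
#include <cstdlib>
#include <mutex>
#include <vector>

// Stub region: a pool of BLOCKS malloc'd blocks with a free list.
struct Region {
    std::vector<void*> free_list;
    explicit Region(std::size_t blocks, std::size_t size) {
        for (std::size_t i = 0; i < blocks; ++i)
            free_list.push_back(std::malloc(size));
    }
    void* alloc() {
        if (free_list.empty()) return nullptr;  // region full
        void* p = free_list.back();
        free_list.pop_back();
        return p;
    }
};

std::mutex global_lock;
std::vector<Region*> global_regions;         // retired (full) regions
thread_local Region* local_region = nullptr; // this thread's current region

void* gc_alloc() {
    // Fast path: no synchronization while the thread-local region has space.
    if (local_region)
        if (void* p = local_region->alloc()) return p;
    // Slow path: retire the full region and take a new one under the lock.
    std::lock_guard<std::mutex> guard(global_lock);
    if (local_region) global_regions.push_back(local_region);
    // A production version would prefer reusing a non-full region from
    // global_regions here rather than always creating a new one.
    local_region = new Region(64, 32);
    return local_region->alloc();
}
```

Since the common case never touches `global_lock`, allocation on different threads proceeds entirely independently, which is what should make throughput scale with the number of cores.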

However, our previous results indicated that HLVM's currently dismal performance on this benchmark is actually due to the shadow stack and not to allocation. We anticipate that efficient concurrent allocation will become the next bottleneck once the shadow stack is addressed, so this is still valuable work.

Two pieces of related work remain to be done:

  • Mimic the effects of HLVM's current GC more accurately by deallocating in chunks.
  • Extend the prototype to reuse existing non-full regions before allocating a new empty region.


Art said...

From the SCC Platform Overview, section 4, "Overview of the Tile and Cores": The SCC has 24 tiles and 48 cores, two cores to a tile. Each core has L1 and L2 caches; the L1 caches are on the core, and the L2 caches are on the tile next to the core. Each core has a 16KB L1 instruction cache and a 16KB L1 data cache, and each core's L2 cache is 256KB. Figure 2, Top-Level Tile Architecture, shows an overview of an individual tile.
