Saturday, 31 March 2012

GC-safe points, mutator suspension and barriers

Reading the section on "GC-safe points and mutator suspension" in Richard Jones' excellent GC Handbook made me realize that HLVM's approach is unusually simple yet still very effective and, therefore, deserves to be described.

GC safe points are apparently regarded as a tricky subject by many people. Jones says that approaches to safe points may be classified into (effectively) eager and lazy preparation. He goes on to say that thread suspension complicates matters by interrupting the thread at uncontrolled points and mentions probabilistic advancement to the next safe point. Attention is given to the problem of devising stack maps expressive enough to convey locally-held references. Even the idea of compressing stack maps to save space is covered. Self-modifying code or "patching" as an alternative to polling is mentioned. The section describing where GC check points can go is particularly interesting:

"Beyond the minimal points needed for correctness, a system may wish to allow garbage collection at more locations so as to guarantee that garbage collection can proceed without an unbounded wait for the thread to reach its next safe point. To make this stronger guarantee there needs to be a safe point in each loop; a simple rule is to place a safe point at each backwards branch in a function. In addition, there needs to be a safe point in each function entry or each return since otherwise functions, especially recursive ones, could perform many calls and returns before encountering a safe point."

What Jones et al. do not mention is that all of these problems can be solved far more simply without sacrificing much efficiency. HLVM did this.

Positioning GC-safe points

How does HLVM choose where to put GC-safe points? First, it is critical to note that HLVM provides functions with guaranteed elimination of all calls in tail position but does not provide any looping constructs, i.e. all loops are expressed as tail recursion. As a consequence of this unusually simple intermediate representation, HLVM gets away with placing a single GC-safe point in each function, at its entry point.

Why the entry point and not the return point? Because tail call elimination can require multiple return points but there is only ever one entry point, so instrumenting entry points rather than return points avoids unnecessary code duplication.
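
To make this concrete, here is a minimal C sketch of the shape of code HLVM generates for a loop. The names gc_check and sum_loop are illustrative, not HLVM's actual output, and the real check also receives a pointer to thread-local data, as described below:

    #include <stdio.h>

    /* Stub GC check point; the real check is amortized, as described
       in the next section. */
    static void gc_check(void) { /* poll the GC here */ }

    /* Every loop in HLVM is a tail-recursive function, so the single
       safe point at the entry point is reached on every iteration.
       A sketch of the code HLVM morally generates for summing 0..n-1: */
    static long sum_loop(long i, long n, long acc)
    {
        gc_check();                         /* the one GC-safe point */
        if (i == n) return acc;
        return sum_loop(i + 1, n, acc + i); /* tail call; HLVM guarantees
                                               this becomes a jump */
    }

    int main(void)
    {
        printf("%ld\n", sum_loop(0, 10, 0)); /* prints 45 */
        return 0;
    }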

So HLVM-generated code is guaranteed to perform GC checks periodically and, therefore, will always respond to requests from the GC in a timely fashion. Given the requirement that each function must contain at least one GC check point in order to bound the wait until the next safe point, HLVM is optimal in the sense that it uses exactly one GC-safe point per function. However, HLVM has admittedly shifted the goal posts by replacing loops with functions, i.e. HLVM code has more, and smaller, functions than other languages or VMs.

Blocking and external calls are instrumented in order to temporarily fake suspension, allowing the GC to continue without requiring cooperation from a thread that cannot cooperate. For example, this is essential when user code acquires a lock: the call blocks until the lock is free and would otherwise prevent the GC from running at all, even though it is perfectly safe for it to do so. Concretely, this boils down to writing the thread's state before the call and, after the call, spinwaiting until any in-progress GC cycle completes.
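
A minimal C sketch of that instrumentation, using hypothetical names (thread_data, gc_cycle_in_progress, locked_acquire) rather than HLVM's actual ones:

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdbool.h>

    /* Hypothetical names throughout; HLVM's actual identifiers differ. */
    typedef enum { THREAD_RUNNING, THREAD_BLOCKED } thread_state;
    typedef struct { _Atomic thread_state state; } thread_data;

    /* Set by the GC for the duration of a collection cycle. */
    static atomic_bool gc_cycle_in_progress = false;

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    /* Wrapping a blocking call: mark the thread "blocked" so the GC
       may treat it as suspended, make the call, then wait for any
       in-flight GC cycle to finish before touching the heap again. */
    static void locked_acquire(thread_data *t)
    {
        atomic_store(&t->state, THREAD_BLOCKED); /* GC need not wait for us */
        pthread_mutex_lock(&lock);               /* may block indefinitely */
        while (atomic_load(&gc_cycle_in_progress))
            ;                                    /* spinwait for cycle end */
        atomic_store(&t->state, THREAD_RUNNING);
    }

    int main(void)
    {
        thread_data t = { THREAD_RUNNING };
        locked_acquire(&t);
        pthread_mutex_unlock(&lock);
        return 0;
    }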

Novel optimizations

The first optimization HLVM performs in the context of GC-safe points is to amortize the costly GC check point at the entry point of each function. Each GC check point increments a thread-local counter, and the expensive check of the synchronized global is done only when the counter reaches a predetermined level, whereupon it is reset. Furthermore, accessing thread-local data via the OS is slow, so HLVM carries a pointer to the thread-local data (including that counter and the shadow stack) through every function call, i.e. a global change to the calling convention.
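
A C sketch of the amortized check, with an assumed threshold and illustrative names:

    #include <stdatomic.h>
    #include <stdbool.h>

    enum { GC_CHECK_PERIOD = 1024 };   /* assumed threshold */

    static atomic_bool gc_wants_suspend = false; /* the synchronized global */

    typedef struct {
        int gc_counter;   /* thread-local, so cheap to touch */
        /* ... the shadow stack also lives here ... */
    } thread_data;

    static void gc_cooperate(thread_data *t) { (void)t; /* suspend; see below */ }

    /* The check point executed at every function entry.  The fast path
       is an increment-and-test on thread-local data; only every
       GC_CHECK_PERIOD-th execution reads the shared global.  Because OS
       thread-local lookups are slow, the thread_data pointer is passed
       through every call instead. */
    static inline void gc_check(thread_data *t)
    {
        if (++t->gc_counter < GC_CHECK_PERIOD) return; /* fast path */
        t->gc_counter = 0;                             /* reset */
        if (atomic_load(&gc_wants_suspend))            /* occasional shared read */
            gc_cooperate(t);
    }

    int main(void)
    {
        thread_data t = { 0 };
        for (int i = 0; i < 10000; i++) gc_check(&t); /* ~10 slow-path checks */
        return 0;
    }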

The second optimization also appears to be a valuable new technique. Loop unrolling is a productive optimization seen in most language and VM implementations. As there are no loops in HLVM, it performs a similar transformation by inlining a function's body into its own (self-)recursive calls. This is actually a more powerful alternative to loop unrolling because it affects non-tail calls as well as the tail calls that are equivalent to loops. Moreover, it amortizes the cost of the GC-safe points to the point that they are almost free, recovering the performance of a single-threaded runtime. In practice, this can make tight non-tail recursive functions like Fibonacci and Ackermann several times faster.
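
A hand-written C sketch of the effect of this transformation on Fibonacci, unrolled one level; the real transformation is performed on HLVM's intermediate representation, not on source code:

    #include <stdio.h>

    static void gc_check(void) { /* amortized poll, as above */ }

    /* The plain version executes one safe point per call: */
    static long fib(long n)
    {
        gc_check();
        return n < 2 ? n : fib(n - 1) + fib(n - 2);
    }

    /* Inlining the body into its own recursive calls executes one safe
       point for roughly every three original calls, analogously to
       unrolling a loop once, and it works for these non-tail calls too: */
    static long fib_unrolled(long n)
    {
        gc_check();
        if (n < 2) return n;
        long a = n - 1 < 2 ? n - 1 : fib_unrolled(n - 2) + fib_unrolled(n - 3);
        long b = n - 2 < 2 ? n - 2 : fib_unrolled(n - 3) + fib_unrolled(n - 4);
        return a + b;
    }

    int main(void)
    {
        printf("%ld %ld\n", fib(10), fib_unrolled(10)); /* prints 55 55 */
        return 0;
    }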

Progressing a suspended thread to its next safe point

How does HLVM advance a suspended thread to its next safe point? HLVM threads cooperatively suspend themselves when necessary and can only do so at a GC check point. Concretely, the GC signals to all threads via a mutable global when it wants them to suspend themselves. The next time a thread passes a check point that causes this global variable to be checked, the thread suspends itself, signals to the GC that it is suspended and then sits in a loop waiting for the GC to signal back that the GC cycle has completed. This is just a few lines of code and, as our benchmarks have shown, is very efficient.
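
Those few lines look roughly like this C sketch, with illustrative names standing in for HLVM's actual state:

    #include <stdatomic.h>
    #include <stdbool.h>

    static atomic_bool gc_wants_suspend = false;  /* set by the GC */
    static atomic_int  threads_suspended = 0;     /* counted by mutators */

    /* Run by a mutator when its check point sees the flag: announce
       that we are parked at a safe point, then spin until the cycle
       ends. */
    static void gc_cooperate(void)
    {
        atomic_fetch_add(&threads_suspended, 1);
        while (atomic_load(&gc_wants_suspend))
            ;                                     /* wait for the GC */
        atomic_fetch_sub(&threads_suspended, 1);
    }

    /* Run by the GC thread: request suspension, wait until every
       mutator has parked, collect, then release them. */
    static void gc_cycle(int n_mutators)
    {
        atomic_store(&gc_wants_suspend, true);
        while (atomic_load(&threads_suspended) != n_mutators)
            ;                                     /* wait for cooperation */
        /* ... mark and sweep ... */
        atomic_store(&gc_wants_suspend, false);
    }

    int main(void) { gc_cycle(0); return 0; }     /* trivial: no mutators */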

Stack maps

Locating pointers is also notoriously difficult. Once a thread has been suspended pending a GC cycle, the GC must be able to locate all of the pointers on the thread's stack because they are global roots. According to Jones et al., there is a multitude of complicated techniques for doing this. They generally revolve around the concept of stack maps: data structures that locate the pointers in the stack frame of a given function suspended at a given point. Due to the large number of combinations of functions and suspension points there can be a huge number of stack maps, which is why compressing them becomes a topic in its own right.

How does HLVM find locally-held pointers? Again, HLVM adopted a much simpler solution that appears to be just as efficient. Specifically, HLVM maintains a separate "shadow stack" of pointers. When a reference is held locally it is pushed onto this shadow stack. At every return point (including tail calls) the shadow stack pointer is reset to the value it had upon entry to the function. So HLVM does not have to mess around with pointers into stack frames or with stack maps to describe their layout.
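
A C sketch of the scheme, with hypothetical names; in HLVM the shadow stack lives in the thread-local data that is threaded through every call:

    #include <assert.h>

    /* A hypothetical shadow stack: a per-thread array of GC roots. */
    typedef struct {
        void *roots[4096];
        int   top;
    } shadow_stack;

    /* Compiled code pushes each locally-held reference on acquisition... */
    static void shadow_push(shadow_stack *s, void *ref)
    {
        s->roots[s->top++] = ref;
    }

    /* ...and every return point (including tail calls) simply restores
       the entry value of the stack pointer, so no stack maps are needed. */
    static void *example_fn(shadow_stack *s, void *heap_ref)
    {
        int saved_top = s->top;     /* remembered on entry */
        shadow_push(s, heap_ref);   /* heap_ref is now visible to the GC */
        /* ... body: calls, allocations; a GC may scan roots[0..top) ... */
        s->top = saved_top;         /* reset at the return point */
        return heap_ref;
    }

    int main(void)
    {
        shadow_stack s = { {0}, 0 };
        int x = 42;
        example_fn(&s, &x);
        assert(s.top == 0);         /* balanced after the call */
        return 0;
    }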

HLVM currently pushes locally-held references onto the shadow stack eagerly. In the future, this could be optimized by pushing just before a function call, and pushing only those values that must remain live across the call. The performance benefit is likely to be small, but this would reduce the scope of references on the shadow stack and, therefore, the amount of floating garbage.

Perhaps the most beneficial optimization yet to be applied will be to avoid pushing and popping the reference to an array during its iteration. Currently, arrays are iterated by tail-recursive functions that pass the reference to the array back to themselves and, consequently, push the reference upon entry and pop it before tail recursing. An interesting solution would be to augment the core type system (currently value and reference types) with a third class of types representing references that a caller has already made reachable to the GC. In its simplest form, this means that only the function that initially acquires a reference (by allocating it or by reading a pointer from the heap) pushes it onto the shadow stack; it then passes its callees the "prepushed" form of the same reference so that they do not push the same value again. A more sophisticated solution could infer that references derived via immutable types from prepushed references are themselves prepushed. For example, there is no need to push every element of an immutable list once the head has been pushed because we know the GC will already traverse the entire list. As our benchmarks indicate that shadow stack manipulations can account for up to 70% of the total running time of a program, such approaches might dramatically improve performance.
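
A C sketch of the simplest form of this proposal, with entirely hypothetical names, since this is future work rather than an implemented feature:

    typedef struct { void *roots[4096]; int top; } shadow_stack;
    typedef struct { int len; void **elts; } array;

    /* The callee's "prepushed" parameter: the caller guarantees the
       array is already on the shadow stack, so the loop body performs
       no shadow-stack traffic at all. */
    static void iter_prepushed(shadow_stack *s, array *a /* prepushed */, int i)
    {
        if (i == a->len) return;
        /* ... visit a->elts[i] ... */
        iter_prepushed(s, a, i + 1);   /* tail call, no push/pop */
    }

    /* Only the function that first acquires the reference pushes it: */
    static void iter(shadow_stack *s, array *a)
    {
        int saved = s->top;
        s->roots[s->top++] = a;        /* pushed exactly once */
        iter_prepushed(s, a, 0);
        s->top = saved;
    }

    int main(void)
    {
        shadow_stack s = { {0}, 0 };
        void *elts[3] = {0, 0, 0};
        array a = { 3, elts };
        iter(&s, &a);
        return 0;
    }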

Heap allocation and barriers

HLVM also uses a very simple approach to heap allocation that allows it to operate without any read or write barriers. Unlike the novel techniques described above, this aspect of HLVM's design is known to degrade performance severely. The approach may still be of interest due to its simplicity, though.

HLVM allocates memory by calling malloc from the C standard library and its GC releases memory by calling free. As the GC is a simple non-moving mark-sweep collector, there is no need to track cross-partition pointer writes, so there is no write barrier. This actually gives HLVM a significant performance advantage when writing pointers, because write barriers in production VMs like the CLR and OCaml are many times slower than a direct pointer write.
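
A C sketch of the scheme, with illustrative names:

    #include <stdlib.h>
    #include <stdbool.h>

    /* Barrier-free allocation under a non-moving mark-sweep collector;
       field and function names are illustrative. */
    typedef struct {
        bool   marked;   /* the mark bit (HLVM also tried a side table) */
        size_t size;
    } header;

    static void *gc_alloc(size_t bytes)
    {
        header *h = malloc(sizeof(header) + bytes);
        if (!h) abort();
        h->marked = false;
        h->size = bytes;
        /* ... record h in the allocated-block set for the sweep phase ... */
        return h + 1;              /* user data follows the header */
    }

    static void gc_free(void *p)   /* called by the sweep phase */
    {
        free((header *)p - 1);
    }

    /* Blocks never move and there are no generations to track, so a
       pointer write is a plain store; no write barrier is required: */
    static void write_ref(void **slot, void *ref) { *slot = ref; }

    int main(void)
    {
        void **cell = gc_alloc(sizeof(void *));
        void *obj = gc_alloc(16);
        write_ref(cell, obj);      /* a direct, barrier-free write */
        gc_free(obj);
        gc_free(cell);
        return 0;
    }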

One surprising result observed during the creation of HLVM concerned the performance of maintaining the mark bits for each heap block in a hash table versus in the heap block itself: the hash table was only slightly slower than the more invasive solution.
