Sunday, 26 January 2014

On the performance of boxed tuples

Stephen Dolan, a PhD student at the University of Cambridge, published an interesting article A “sane” design for multicore OCaml recently and the following discussion on Reddit proved to be quite informative. In particular, more than one person asserted that ints, floats and tuples are fast in OCaml. In this blog post I’m going to take a look at tuple performance.


As Stephen points out, one might reasonably expect unboxed tuples to be faster for passing arguments to functions and returning multiple values from functions because the elements can stay in registers but slower for storing in the heap because they require multi-word reads and writes instead of a single word (a pointer to the existing tuple). However, HLVM has shown that unboxed tuples can be extremely fast so why the discrepancy?


The performance charactistics of different heap topologies are not quite so simple in a garbage collected environment. Two aspects of garbage collection affect the results: the write barrier and survivors. The write barrier is a relatively-slow piece of code injected whenever a program writes a reference into the heap in order to keep the garbage collector apprised of the constantly-changing heap topology. Therefore, writing an unboxed pair of ints into the heap requires two int writes whereas writing a boxed pair of ints into the heap requires one pointer write and a write barrier. Now, in order to record information for the garbage collector the write barrier always performs at least one write in addition to other housekeeping work. Therefore, writing a pair of ints will always be slower if the pair is boxed. In fact, the F# Journal article Pathological garbage collector behaviour found that a write that incurs the write barrier is 2.4x slower than a write that does not. Moreover, .NET is heavily optimized for mutable code so it has a very efficient write barrier whereas OCaml is heavily optimized for purely functional code and has a notoriously slow write barrier.


The next issue that complicates the performance of boxed vs unboxed tuples in the heap is survivors. Both OCaml and .NET use generational garbage collectors. New objects like boxed tuples are allocated in a nursery generation. When the nursery is full, surviving objects are identified and physically copied to the next generation. If a program violates the generational hypothesis (that most objects die young) by allocating many objects that survive then it incurs a performance overhead for marking the survivors, copying them into the next generation and fixing up all pointers to those objects to point at their new locations. If tuples are unboxed then none of these overheads exist.


So it is instructive to measure the performance of writing newly-allocated tuples into progressively longer slices of an array. We have done this using boxed tuples in OCaml and F# as well as unboxed structs in F#. The following graph visualizes the performance as a function of the size of the array slice:

For array slices containing up to 1,000 elements none of the tuples survive the first generation and the performance of the boxed tuples is only slightly worse than for the unboxed representation (probably due to the write barrier). For more than 1,000 elements the performance of boxed tuples in both F# and OCaml rapidly worsens until they are 10x slower than unboxed tuples for 1,000,000 elements.


So tuples can indeed be fast in OCaml but only if they are short-lived temporaries. If tuples survive the nursery generation (which is 256kB by default) then performance is very bad.


The poor performance of long-lived tuples in OCaml has actually been worked around on several occasions. The Map implementation is almost identical to the Set implementation except the key-value pairs have been manually unboxed into the variant type representing the AVL tree, resulting in a large amount of unnecessary code duplication. The Hashtbl implementation uses a custom bucket type that is a list where the key-value pairs in each cons cell have been manually unboxed. Variant types are most elegantly represented as a tag and argument. For multiple arguments, the argument can just be a tuple. This simple and efficient representation is used in HLVM and it works very well. In OCaml, the compiler unboxes multiple arguments as a special case in order to combat the performance of boxed tuples. This results in a language wart where brackets around the arguments to a variant type constructor alter the meaning from multiple arguments to a single argument that is a tuple.


Locality is another important aspect of the performance of boxed vs unboxed tuples. Consider sorting an array of pairs and then enumerating the sorted array. With an unboxed representation the elements of the array are physically moved into place within a contiguous block of memory and enumeration is cache friendly. With a boxed representation, sorting scrambles the pointers and enumeration then has worst-case cache behaviour.


Suffice to say, the performance of boxed tuples is not as clear-cut as one might imagine.