Sunday, 26 January 2014

On the performance of boxed tuples

Stephen Dolan, a PhD student at the University of Cambridge, published an interesting article A “sane” design for multicore OCaml recently and the following discussion on Reddit proved to be quite informative. In particular, more than one person asserted that ints, floats and tuples are fast in OCaml. In this blog post I’m going to take a look at tuple performance.

 

As Stephen points out, one might reasonably expect unboxed tuples to be faster for passing arguments to functions and returning multiple values from functions because the elements can stay in registers but slower for storing in the heap because they require multi-word reads and writes instead of a single word (a pointer to the existing tuple). However, HLVM has shown that unboxed tuples can be extremely fast so why the discrepancy?

 

The performance charactistics of different heap topologies are not quite so simple in a garbage collected environment. Two aspects of garbage collection affect the results: the write barrier and survivors. The write barrier is a relatively-slow piece of code injected whenever a program writes a reference into the heap in order to keep the garbage collector apprised of the constantly-changing heap topology. Therefore, writing an unboxed pair of ints into the heap requires two int writes whereas writing a boxed pair of ints into the heap requires one pointer write and a write barrier. Now, in order to record information for the garbage collector the write barrier always performs at least one write in addition to other housekeeping work. Therefore, writing a pair of ints will always be slower if the pair is boxed. In fact, the F# Journal article Pathological garbage collector behaviour found that a write that incurs the write barrier is 2.4x slower than a write that does not. Moreover, .NET is heavily optimized for mutable code so it has a very efficient write barrier whereas OCaml is heavily optimized for purely functional code and has a notoriously slow write barrier.

 

The next issue that complicates the performance of boxed vs unboxed tuples in the heap is survivors. Both OCaml and .NET use generational garbage collectors. New objects like boxed tuples are allocated in a nursery generation. When the nursery is full, surviving objects are identified and physically copied to the next generation. If a program violates the generational hypothesis (that most objects die young) by allocating many objects that survive then it incurs a performance overhead for marking the survivors, copying them into the next generation and fixing up all pointers to those objects to point at their new locations. If tuples are unboxed then none of these overheads exist.

 

So it is instructive to measure the performance of writing newly-allocated tuples into progressively longer slices of an array. We have done this using boxed tuples in OCaml and F# as well as unboxed structs in F#. The following graph visualizes the performance as a function of the size of the array slice:

For array slices containing up to 1,000 elements none of the tuples survive the first generation and the performance of the boxed tuples is only slightly worse than for the unboxed representation (probably due to the write barrier). For more than 1,000 elements the performance of boxed tuples in both F# and OCaml rapidly worsens until they are 10x slower than unboxed tuples for 1,000,000 elements.

 

So tuples can indeed be fast in OCaml but only if they are short-lived temporaries. If tuples survive the nursery generation (which is 256kB by default) then performance is very bad.

 

The poor performance of long-lived tuples in OCaml has actually been worked around on several occasions. The Map implementation is almost identical to the Set implementation except the key-value pairs have been manually unboxed into the variant type representing the AVL tree, resulting in a large amount of unnecessary code duplication. The Hashtbl implementation uses a custom bucket type that is a list where the key-value pairs in each cons cell have been manually unboxed. Variant types are most elegantly represented as a tag and argument. For multiple arguments, the argument can just be a tuple. This simple and efficient representation is used in HLVM and it works very well. In OCaml, the compiler unboxes multiple arguments as a special case in order to combat the performance of boxed tuples. This results in a language wart where brackets around the arguments to a variant type constructor alter the meaning from multiple arguments to a single argument that is a tuple.

 

Locality is another important aspect of the performance of boxed vs unboxed tuples. Consider sorting an array of pairs and then enumerating the sorted array. With an unboxed representation the elements of the array are physically moved into place within a contiguous block of memory and enumeration is cache friendly. With a boxed representation, sorting scrambles the pointers and enumeration then has worst-case cache behaviour.

 

Suffice to say, the performance of boxed tuples is not as clear-cut as one might imagine.

 

8 comments:

Unknown said...

Which version of OCaml did you run the tuple test with? It's worth noting that the write barrier has been getting progressively faster since 3.12.1. For example, in 4.01.0, the page table lookups were optimized, so mfp's original benchmarks don't quite hold.

Jon Harrop said...

These results were obtained using the latest (at the time of writing) 4.01.0 version of OCaml.

Jules said...

What if you have a tuple of pointers? If that tuple is unboxed, then that will require 2 write barriers, right? And if the tuple is boxed, only 1...so it can also go the other way sometimes.

Also with unboxed types it's not very clear to me what the write barrier has to record. It can't just mark a location in the new generation as a root, because in order to traverse that location the garbage collector needs to know what the shape of that data is. Or do you use tags for all data?

Jon Harrop said...

@Jules: Unboxed data can potentially increase the number of write barriers, yes.

The write barrier typically uses either a remembered set or card marking. I'm not sure that either necessarily need to know the shape of the data. For example, a remembered set is commonly used to track pointers from the old heap into the young heap that must be updated when survivors are evacuated from the young heap into the old heap. In that case, the remembered set can just be a set of pointers to pointers and does not require any knowledge of the shape of the data.

Jules said...

I don't really understand how it works. Suppose some old generation pointer gets set to an unboxed pair in the young generation. Now that pointer is added to the remembered set or some card is marked. Later when the GC wants to do a GC on the young generation it looks at the remembered set or cards and uses them as additional roots to traverse the young generation. Now when the GC tries to follow that pointer, it needs to know what kind of data is behind that pointer, right? So either you need to tag each value with its type, or at the point where the pointer to the pair got inserted into the remembered set, its shape should be registered as well. Otherwise how can the GC traverse that correctly?

Jon Harrop said...

@Jules: "Suppose some old generation pointer gets set to an unboxed pair in the young generation". Perhaps the confusion is here. If the pair is unboxed then there is no pointer. If the unboxed pair is two pointers then the write barrier is incurred twice, once for each pointer. The thing recorded by the write barrier is a single pointer to a single pointer in every case.

Jules said...
This comment has been removed by the author.
Jules said...

I see, the part I missed is that in C#/F# there is no such thing as a pointer to an unboxed pair. There is always some tag in between the pointer and the unboxed pair, which means that if you got a pointer, you can always determine the structure of the value behind that pointer without any extra information.