Wednesday, 29 December 2010

Extensibility in functional programming languages

Most software developers are now familiar with inheritance and virtual methods as common techniques for extensibility from the object-oriented paradigm. When faced with functional programming for the first time, these developers often ask how to write extensible code in this alien paradigm.

The functional paradigm itself provides only a single form of extensibility: higher-order functions. These allow you to factor out the varying "inner" code. For example, two functions that share the same first and last code blocks:

let f x =
    first x
    stuff1 x
    last x

let g x =
    first x
    stuff2 x
    last x

can be factored into a general higher-order function that is reused by the specific cases:

let hof stuff x =
    first x
    stuff x
    last x

let f = hof stuff1

let g = hof stuff2

Applying this aggressively leads to design patterns such as parser combinators and is a very powerful and lightweight technique for making code extensible. However, it does not make data types extensible.

Consequently, functional programming languages almost always include language features to help with extensibility:

  • Common Lisp has the Common Lisp Object System (CLOS) and a macro system.
  • Standard ML has parametric polymorphism and a higher-order module system.
  • OCaml added polymorphic variants, objects, optional arguments and the Camlp4 macro system.
  • Haskell has parametric polymorphism and type classes, and Template Haskell adds macros.
  • Scala has Java-style OOP with added features such as traits and pattern matching.

Read Chris Okasaki's excellent monograph Purely Functional Data Structures for some great examples using higher-order modules in Standard ML and type classes in Haskell. Read Code Reuse Through Polymorphic Variants by Jacques Garrigue for a description of how that language feature can be used to attack the expression problem. However, these solutions are quite rare in the wild and, in particular, you can get a long way without them (e.g. in F#).

Historically, this diversity appeared because most functional programming languages were research projects and, consequently, they existed to add novel features. Therefore, we now have a wide variety of disparate forms of extensibility in today's functional programming languages.

F# is a different beast compared to its predecessors like OCaml and Haskell because its design requirements were seamless interoperability with the rest of .NET (which imposes .NET-style OOP) and pragmatism. Consequently, F# keeps the ML core with parametric polymorphism and adds .NET's object system. So you can benefit from the easy extensibility offered by generic higher-order functions and conventional OOP but not from any of the more esoteric features like higher-order modules, type classes and macros.

The only form of extensibility F# has pioneered is active patterns. These allow you to separate code that destructures via pattern matching from the concrete data representation. This is an important way to decouple code from data and, therefore, make it more reusable.

Distinctive traits of functional programming languages

The landscape of functional programming languages is remarkably diverse, with most of the major families having quite distinctive traits and dialects that bring their own quirks. Here are some of the major categorizations:

  • Evaluation strategy: non-strict (Miranda, Haskell) vs strict (Standard ML, OCaml, F#, Scheme, Erlang) evaluation.

  • Type system: static (Standard ML, OCaml, F#, Haskell, Scala, C# 3) vs dynamic (Scheme, Lisp, Clojure, Erlang) typing and untyped (Mathematica).

  • Kind of static typing: structural (OCaml) vs nominal (F#, Haskell, Scala, C# 3) static typing.

  • Type inference: Damas-Milner (Standard ML, OCaml, F#, Haskell) vs "local" inference (Scala, C# 3).

  • Destructuring: pattern matching (Standard ML, OCaml, F#, Haskell, Erlang, Mathematica) vs manual deconstruction (Scheme, Lisp, C#).

  • Extensibility of algebraic types: always closed (Standard ML, Haskell) vs optionally closed (OCaml).

  • Pattern matching: linear (Standard ML, OCaml, Haskell) vs unbounded (F#, Mathematica).

  • Run-time code generation: meta-circular evaluator (Scheme, Lisp, Clojure) vs heterogeneous code generation (F# → CIL) vs nothing (Standard ML, OCaml, Haskell).

  • Macros: unhygienic macros (Common Lisp, OCaml, Template Haskell, Mathematica) vs hygienic macros (Scheme) vs no macros (Standard ML, F#).

  • Standardization: standardized (Standard ML, Haskell 98, Common Lisp, Scheme) vs proprietary (OCaml, F#, GHC Haskell, Erlang, Mathematica).

Why GC when you have reference counted smart pointers?

Reference counted smart pointers are a simple form of garbage collection usable from the C++ programming language. A recent question on Stack Exchange asks why anyone would want anything more when reference counted smart pointers are already available.

Other forms of garbage collection (most notably tracing GCs) have several advantages over reference counting:

  • Accuracy: Reference counting alone leaks cycles so reference counted smart pointers will leak memory in general unless other techniques are added to catch cycles. Once those techniques are added, reference counting's benefit of simplicity has vanished.

  • Throughput: Smart pointers are one of the least efficient forms of garbage collection, particularly in the context of multi-threaded applications when reference counts are bumped atomically. There are advanced reference counting techniques designed to alleviate this but tracing GCs are still the algorithm of choice in production environments.

  • Latency: Typical smart pointer implementations allow destructors to avalanche, resulting in unbounded pause times. Other forms of garbage collection are much more incremental and can even be real time, e.g. Baker's treadmill.

Many of the answers given perpetuate myths about garbage collection. There is a myth that scope-based reference counting guarantees that values are collected as soon as possible. In fact, tracing collectors can and do collect values before the end of their lexical scope if the value becomes unreachable sooner and a GC occurs. Another myth is that garbage collected languages cannot release resources deterministically. In fact, this is done in exactly the same way as in unmanaged languages. Finally, there is a myth that manual memory management minimizes latency. In fact, manual memory management often has poorer worst-case latency characteristics than garbage collection (this problem originally drove us from C++ to OCaml!) and optimizing latency in an unmanaged language is seriously hard work.

Tuesday, 28 December 2010

Towards a mark-region GC for HLVM

Our previous article highlighted the advantages of the recent mark-region GC design and hinted at HLVM adopting this design. We just completed some preliminary tests using a prototype written in C++ to measure the performance of different allocation strategies. Our results are as follows with times normalized by the time an equivalent OCaml program takes (so 1.0 means as fast as OCaml):

The four columns in each section give the times relative to OCaml for solving the 8-, 9-, 10- and 11-queens problems.

The "Boehm" section refers to the conservative Boehm GC which is 40-70% slower than OCaml on this benchmark. The "malloc" section refers to allocating using the malloc function from glibc without ever freeing and is 2.2-3.1× slower than OCaml. The "free" section refers to allocating with malloc and freeing (manually) and is 1.9-2.3× slower than OCaml. The "bump" section refers to a naive bump allocator that never recycles memory and is 1.4-1.7× slower than OCaml. Finally, the "region" section refers to our prototype region-based algorithm, which is just 4-20% slower than OCaml on this benchmark!

This benchmark is a classic logic programming problem that allocates large numbers of short-lived values. This is a best-case benchmark for OCaml and a worst-case benchmark for the current HLVM. OCaml's generational garbage collector with its fast bump allocator and constant-time recycling of dead values from the nursery generation does extremely well on this benchmark: we have been unable to beat its performance from C/C++.

The Boehm garbage collector is another interesting point of comparison because it has been the subject of intense optimization for many years.

These new results are very enlightening. Recycling memory by calling free is significantly faster than leaking memory by only ever calling malloc. Specifically, leaking is around 3× slower than OCaml and proper manual memory management using malloc and free is around 2× slower than OCaml. Moreover, the performance of the Boehm GC is very similar to manual memory management but still 2× slower than OCaml.

Bump allocating from a huge preallocated pool without ever freeing is surprisingly slow: around 1.5× slower than OCaml. This early result was disappointing but it turned out that our new region allocator is very fast indeed. This is extremely encouraging because it means that a non-moving mark-region collector for HLVM might be able to offer the best of both worlds: the speed of C/C++/Fortran for imperative code using mutable data structures and the speed of OCaml/Haskell for functional code using immutable data structures.

Our prototype region allocator allocates aligned regions using the glibc memalign function. This allows a pointer to the start of the region to be obtained from any pointer inside the region using bitwise operations. Each region begins with a C++ vector that holds the free list, the list of pointers inside the region that are not currently allocated. The remainder of the region is a pool of fixed-size blocks that can be allocated and deallocated. To allocate, the last element is popped off the free list. To free, the free list associated with the pointer is obtained using bitwise operations and the pointer is pushed onto the back of the free list. In the prototype, if the allocator finds the current region to be full then it stores it in a global collection of regions and allocates a new local region. In a production version, the allocator would recycle one of the non-full regions from the global collection of regions rather than allocating a new region each time.

How big should a region be? The results shown above were obtained using 1MB regions, large enough that they were never filled and a new region was never needed. However, reducing the region size to 1kB causes the prototype to create 8,295 regions on the 11-queens problem but the program is only 5% slower and total memory consumption is around 99% lower than simply leaking, so memory is being recycled effectively.

Measuring the absolute performance of the 10-queens solver as a function of the region size gives the following results:

The smallest possible region size of 16 bytes allows a single allocation per region and makes the whole program run 7.6× slower. Increasing the region size improves the efficiency of the region allocator (except for an anomaly between 128 and 256 byte regions that is probably due to benchmark-specific allocation patterns). With 1,024-byte regions, performance is within a few percent of optimal for this benchmark. One might have expected to see significant performance gains from larger regions up to the size of the 6MB L2 cache on this machine but the tiny working set required by this benchmark eliminated any performance difference beyond 1kB regions.

The following graph shows the number of regions allocated for different region sizes on the 10-queens benchmark:

Smaller regions mean that a larger number of regions is required, up to around ten million for 16-byte regions. The relationship here mirrors the previous region size vs performance relationship because most of the time is spent administering regions when they are small. The initial sharp drop-off occurs because allowing regions to contain just a few more values significantly increases their ability to recycle space. With 1kB regions, only 874 regions are created to solve this problem.

The product of the region size and the number of regions used quantifies the total space allocated for regions using glibc. Doubling the region size from 64 bytes to 128 bytes reduces the total memory allocated by 33% and doubling the region size from 2kB to 4kB reduces the total memory allocated by 99%. Perhaps this accelerating efficiency is due to the generational hypothesis, which predicts that the probability of death falls off sharply as a function of age.

In HLVM, a thread-safe allocator will try to use the thread-local region and resort to synchronization only when the current region is full, whereupon an existing non-full region will be reused or a new empty region will be created. The deallocator must potentially access any region but, with HLVM's current design, it is only invoked from a single thread during the stop-the-world phase so it can be thread-unsafe. This has two benefits over the current technique:

  • Single-threaded allocation and deallocation should be almost twice as fast as they are today.
  • Multi-threaded allocation should scale linearly with the number of cores whereas HLVM currently sees performance degradation from concurrent allocations.

However, our previous results indicated that HLVM's currently-dismal performance on this benchmark is actually due to the shadow stack and not to allocation. We anticipate that efficient concurrent allocation will be the next bottleneck after the performance of the shadow stack is addressed so this is still valuable work.

Two pieces of related work remain to be done:

  • Mimic the effects of HLVM's current GC more accurately by deallocating in chunks.
  • Extend the prototype to reuse existing non-full regions before allocating a new empty region.

Thursday, 23 December 2010

When generational GC goes bad

For many years, generational collection was the de facto standard GC architecture. Based upon the observation that the distribution of value lifetimes is heavily skewed towards short lifetimes (most values die young), generational garbage collectors allocate into a nursery generation and copy survivors out into an old generation.

Many practical language implementations use generational garbage collection including OCaml, GHC and .NET. Generational collection works well when the generational hypothesis holds but struggles when values survive the nursery only to become unreachable soon afterwards. This corresponds to common allocation patterns such as cycling values through mutable queues or caches and filling hash tables.

Imagine repeatedly enqueuing and dequeuing values on a queue. The lifetime of each value is proportional to the length of the queue, so this provides a simple way to quantify the performance overhead of generational garbage collection. If boxed values are enqueued and dequeued on OCaml's built-in mutable Queue data structure then the time taken per element jumps by around a factor of 2-3.5 when the elements reachable from the queue exceed the size of the nursery and, thus, most survive to the old generation rather than being collected efficiently in the young generation. Specifically, the time taken to enqueue and dequeue boxed 32-bit ints on this 2.1GHz Opteron 2352 jumps from 0.33μs to 0.68-1.13μs. Where is this time being wasted?

When a boxed value (such as a 32-bit integer) is allocated in OCaml, it is augmented with a one-word header and the whole block is bump allocated from the nursery. When that value is written into the Queue in the old generation, a write barrier is incurred which stores a copy of the reference in the remembered set. When the nursery is filled, a minor collection is performed that traces from the global roots and remembered set through the reachable values in the nursery. These values are copied into the old generation, forwarding pointers to the copies are left behind and all locally-held references are updated via the forwarding pointers to point at the copies in the old generation. The nursery is then swept simply by resetting the bump allocator to the start of the nursery.

Suffice to say, this is a lot of overhead when the values allocated into the nursery do not die quickly enough. In that case, all of this effort is a complete waste of time and we would have been better off allocating directly into the old generation in the first place. What can be done to address this problem?

Fortunately, McKinley et al. made a breakthrough in GC design in recent years with their invention of a new class of GC algorithms known as mark-region GCs. It all began with the Beltway GC in 2002, a generalization of several existing GC designs, and culminated in the Immix GC in 2008. In effect, this GC design allows a nursery full of reachable values to be migrated to the old heap implicitly, without any copying, and a new nursery is allocated to replace it. The old generation is then effectively a collection of surviving nurseries. The precise placement policy is more complicated because old nurseries can be reused in order to avoid gross fragmentation, but the basic concept is simple enough.

A Google Summer of Code project implemented an Immix variant for the Glasgow Haskell Compiler. The results were found to be underwhelming but that is not so surprising given that this GC design should be most effective when filling mutable data structures such as queues, caches, hash sets and hash tables. We believe that a simple mark-region variant should be able to dramatically improve HLVM's performance on parallel functional code without degrading the performance of imperative code as generational garbage collectors like OCaml's do.

Wednesday, 15 December 2010

Getting paid to remove features

Although the Industrial Haskell Group has yet to garner its first industrial member since its inception almost two years ago, they have managed the impressive feat of getting paid to remove a feature from Haskell: specifically, to make it easier to build programs written in Haskell that do not rely upon the GNU Multiprecision (GMP) library for arbitrary-precision arithmetic (bignums).

We made this interesting observation when considering adding bignums using GMP as a primitive type for HLVM. Apparently, having bignums in the language is not very useful beyond irrelevant microbenchmarks like computing the digits of π.

The slides here also criticize the Caml Consortium (which has garnered 11 members) for charging too little and state that the IHG aimed to garner five members each paying £12k per annum. Why has this target not yet been reached? Our guess is insufficient sales and marketing directed at decision makers in industry who could benefit from using Haskell. As an aside, we believe this same mistake is why the founders of Stack Overflow found it so difficult to monetize despite having millions of non-paying users. In contrast, Rich Hickey managed to garner funding from a whopping 427 people and several companies for his own language, Clojure.

Regardless, the fact that they are trying to build a business around the development of Haskell itself is admirable and should at least prompt more professionals to take a look at what is on offer.

Sunday, 5 December 2010

Texas Multicore Technologies on Haskell

US-based startup Texas Multicore Technologies have published some of the results they obtained using Haskell for parallel programming. Their results mirror our own:

Thanks to Manuel Chakravarty of the University of New South Wales for drawing this to our attention.