Wednesday, 29 December 2010

Extensibility in functional programming languages

Most software developers are now familiar with inheritance and virtual methods as common techniques for extensibility from the object oriented paradigm. When faced with functional programming for the first time, these developers often ask how to write extensible code in this alien paradigm.

The functional paradigm actually only provides a single form of extensibility: higher-order functions. These allow you to factor out the varying "inner" code and pass it as a function argument. For example, two functions that share the same first and last code blocks:

let f x =
    first x
    stuff1 x
    last x

let g x =
    first x
    stuff2 x
    last x

can be factored into a general higher-order function that is reused by the specific cases:

let hof stuff x =
    first x
    stuff x
    last x

let f = hof stuff1

let g = hof stuff2

Applying this aggressively leads to design patterns such as parser combinators and is a very powerful and lightweight technique for making code extensible. However, it does not make data types extensible.
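Applied to parsing, this is exactly how parser combinators arise. Here is a minimal sketch in OCaml (the representation and the names `satisfy`, `return` and `>>=` are ours for illustration, not from any particular combinator library):

```ocaml
(* A parser consumes a list of characters and returns every way it can
   succeed, each paired with the remaining input. *)
type 'a parser = char list -> ('a * char list) list

(* Succeed without consuming any input. *)
let return (x : 'a) : 'a parser = fun input -> [ (x, input) ]

(* Parse one character satisfying a predicate. *)
let satisfy p : char parser = function
  | c :: rest when p c -> [ (c, rest) ]
  | _ -> []

(* Sequencing: run p, then feed each result to f. This is the
   higher-order "glue" that replaces the fixed first/last blocks. *)
let ( >>= ) (p : 'a parser) (f : 'a -> 'b parser) : 'b parser =
  fun input -> List.concat_map (fun (x, rest) -> f x rest) (p input)

let digit = satisfy (fun c -> '0' <= c && c <= '9')

(* Small parsers compose into bigger ones. *)
let two_digits = digit >>= fun a -> digit >>= fun b -> return (a, b)

let explode s = List.init (String.length s) (String.get s)

let () =
  match two_digits (explode "42!") with
  | [ (('4', '2'), [ '!' ]) ] -> print_endline "parsed 42"
  | _ -> assert false
```

Every combinator here is just a higher-order function over the `parser` type, so the extensibility comes entirely from the one mechanism described above.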

Consequently, functional programming languages almost always include language features to help with extensibility:

  • Common Lisp has the Common Lisp Object System (CLOS) and a macro system.
  • Standard ML has parametric polymorphism and a higher-order module system.
  • OCaml added polymorphic variants, objects, optional arguments and the Camlp4 macro system.
  • Haskell has parametric polymorphism and type classes, and Template Haskell adds macros.
  • Scala has Java-style OOP extended with features such as traits and pattern matching.

Read Chris Okasaki's excellent monograph Purely Functional Data Structures for some great examples using higher-order modules in Standard ML and type classes in Haskell. Read Code reuse through polymorphic variants by Jacques Garrigue for a description of how that language feature can be used to attack the expression problem. However, these solutions are quite rare in the wild and, in particular, you can get a long way without them (e.g. in F#).
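As a rough sketch of the polymorphic-variant technique (a toy example in the spirit of Garrigue's paper; the names and cases here are ours), evaluators can be written in open-recursion style so that later code adds a new case without modifying or recompiling the original:

```ocaml
(* The base language, parameterized over the type of subexpressions so
   it stays open to extension. *)
type 'a basic = [ `Num of int | `Add of 'a * 'a ]

let eval_basic recur : 'a basic -> int = function
  | `Num n -> n
  | `Add (a, b) -> recur a + recur b

(* A later extension adds multiplication without touching the code above. *)
type 'a ext = [ 'a basic | `Mul of 'a * 'a ]

let eval_ext recur : 'a ext -> int = function
  | `Mul (a, b) -> recur a * recur b
  | #basic as e -> eval_basic recur e   (* defer old cases to the old code *)

(* Tie the recursive knot for the extended language. *)
let rec eval e = eval_ext eval e

let () = assert (eval (`Add (`Num 1, `Mul (`Num 2, `Num 3))) = 7)
```

The `#basic` pattern matches exactly the old cases, so the extension composes with the base evaluator rather than duplicating it — this is the essence of the attack on the expression problem.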

Historically, this diversity appeared because most functional programming languages were research projects and, consequently, they existed to add novel features. Therefore, we now have a wide variety of disparate forms of extensibility in today's functional programming languages.

F# is a different beast compared to its predecessors like OCaml and Haskell because its design requirements were seamless interoperability with the rest of .NET (which imposes .NET-style OOP) and pragmatism. Consequently, F# keeps the ML core with parametric polymorphism and adds .NET's object system. So you can benefit from the easy extensibility offered by generic higher-order functions and conventional OOP but not from any of the more esoteric features like higher-order modules, type classes and macros.

The only form of extensibility F# has pioneered is active patterns. These allow you to separate code that destructures via pattern matching from the concrete data representation. This is an important way to decouple code from data and, therefore, make it more reusable.

Distinctive traits of functional programming languages

The landscape of functional programming languages is remarkably diverse, with most of the major families having quite distinctive traits and dialects that bring their own quirks. Here are some of the major categorizations:

  • Evaluation strategy: non-strict (Miranda, Haskell) vs strict evaluation.

  • Type system: static (Standard ML, OCaml, F#, Haskell, Scala, C# 3) vs dynamic (Scheme, Lisp, Clojure, Erlang) typing and untyped (Mathematica).

  • Kind of static typing: structural (OCaml) vs nominal (F#, Haskell, Scala, C# 3) static typing.

  • Type inference: Damas-Milner (Standard ML, OCaml, F#, Haskell) vs "local" inference (Scala, C# 3).

  • Destructuring: pattern matching (Standard ML, OCaml, F#, Haskell, Erlang, Mathematica) vs manual deconstruction (Scheme, Lisp, C#).

  • Extensibility of algebraic types: always closed (Standard ML, Haskell) vs optionally closed (OCaml).

  • Pattern matching: linear (Standard ML, OCaml, Haskell) vs unbounded (F#, Mathematica).

  • Run-time code generation: meta-circular evaluator (Scheme, Lisp, Clojure) vs heterogeneous code generation (F# → CIL) vs nothing (Standard ML, OCaml, Haskell).

  • Macros: unhygienic macros (Common Lisp, OCaml, Template Haskell, Mathematica) vs hygienic macros (Scheme) vs no macros (Standard ML, F#).

  • Standardization: standardized (Standard ML, Haskell 98, Common Lisp, Scheme) vs proprietary (OCaml, F#, GHC Haskell, Erlang, Mathematica).
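To make the destructuring bullet concrete, here is a small OCaml example (our own, purely illustrative): pattern matching checks the shape of the data and binds its parts in a single step, where manual deconstruction would need explicit tag tests and accessors.

```ocaml
type shape =
  | Circle of float           (* radius *)
  | Rect of float * float     (* width, height *)

(* Each case names its components directly; no casts or accessors. *)
let area = function
  | Circle r -> 3.14159 *. r *. r
  | Rect (w, h) -> w *. h

let () = assert (area (Rect (2.0, 3.0)) = 6.0)
```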

Why GC when you have reference counted smart pointers?

Reference counted smart pointers are a simple form of garbage collection usable from the C++ programming language. A recent question on Stack Exchange asks why anyone would want anything more when reference counted smart pointers are already available.

Other forms of garbage collection (most notably tracing GCs) have several advantages over reference counting:

  • Accuracy: Reference counting alone leaks cycles so reference counted smart pointers will leak memory in general unless other techniques are added to catch cycles. Once those techniques are added, reference counting's benefit of simplicity has vanished.

  • Throughput: Smart pointers are one of the least efficient forms of garbage collection, particularly in the context of multi-threaded applications when reference counts are bumped atomically. There are advanced reference counting techniques designed to alleviate this but tracing GCs are still the algorithm of choice in production environments.

  • Latency: Typical smart pointer implementations allow destructors to avalanche, resulting in unbounded pause times. Other forms of garbage collection are much more incremental and can even be real time, e.g. Baker's treadmill.

Many of the answers given perpetuate myths about garbage collection. There is a myth that scope-based reference counting guarantees that values are collected as soon as possible. In fact, tracing collectors can and do collect values before the end of their lexical scope if the value becomes unreachable sooner and a GC occurs. Another myth is that garbage collected languages cannot release resources deterministically. In fact, this is done in exactly the same way as in unmanaged languages. Finally, there is a myth that manual memory management minimizes latency. In fact, manual memory management often has poorer worst-case latency characteristics than garbage collection (this problem originally drove us from C++ to OCaml!) and optimizing latency in an unmanaged language is seriously hard work.
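A one-line cycle makes the accuracy point concrete. In this hedged OCaml sketch, the cons cell refers to itself, so a pure reference count on it would never reach zero, yet OCaml's tracing collector reclaims each cycle as soon as it becomes unreachable:

```ocaml
(* xs is a cyclic list: its cons cell refers to itself, so reference
   counting alone could never free it. *)
let make_cycle () =
  let rec xs = 1 :: xs in
  xs

let () =
  (* Under a tracing GC, heap use stays flat across this loop; with
     reference counting alone, every iteration would leak a cell. *)
  for _i = 1 to 1_000_000 do
    ignore (Sys.opaque_identity (make_cycle ()))
  done;
  print_endline "no leak"
```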

Tuesday, 28 December 2010

Towards a mark-region GC for HLVM

Our previous article highlighted the advantages of the recent mark-region GC design and hinted at HLVM adopting this design. We just completed some preliminary tests using a prototype written in C++ to measure the performance of different allocation strategies. Our results are as follows with times normalized by the time an equivalent OCaml program takes (so 1.0 means as fast as OCaml):

The four columns in each section give the times relative to OCaml for solving the 8-, 9-, 10- and 11-queens problems.

The "Boehm" section refers to the conservative Boehm GC which is 40-70% slower than OCaml on this benchmark. The "malloc" section refers to allocating using the malloc function from glibc without ever freeing and is 2.2-3.1× slower than OCaml. The "free" section refers to allocating with malloc and freeing (manually) and is 1.9-2.3× slower than OCaml. The "bump" section refers to a naive bump allocator that never recycles memory and is 1.4-1.7× slower than OCaml. Finally, the "region" section refers to our prototype region-based algorithm, which is just 4-20% slower than OCaml on this benchmark!

This benchmark is a classic logic programming problem that allocates large numbers of short-lived values. This is a best-case benchmark for OCaml and a worst-case benchmark for the current HLVM. OCaml's generational garbage collector with its fast bump allocator and constant-time recycling of dead values from the nursery generation does extremely well on this benchmark: we have been unable to beat its performance from C/C++.

The Boehm garbage collector is another interesting point of comparison because it has been the subject of intense optimization for many years.

These new results are very enlightening. Recycling memory by calling free is significantly faster than leaking memory by only ever calling malloc. Specifically, leaking is around 3× slower than OCaml and proper manual memory management using malloc and free is around 2× slower than OCaml. Moreover, the performance of the Boehm GC is very similar to manual memory management but still 2× slower than OCaml.

Bump allocating from a huge preallocated pool without ever freeing is surprisingly slow: around 1.5× slower than OCaml. This early result was disappointing but it turned out that our new region allocator is very fast indeed. This is extremely encouraging because it means that a non-moving mark-region collector for HLVM might be able to offer the best of both worlds: the speed of C/C++/Fortran for imperative code using mutable data structures and the speed of OCaml/Haskell for functional code using immutable data structures.

Our prototype region allocator allocates aligned regions using the glibc memalign function. This allows a pointer to the start of the region to be obtained from any pointer inside the region using bitwise operations. Each region begins with a C++ vector that holds the free list, the list of pointers inside the region that are not currently allocated. The remainder of the region is a pool of fixed-size blocks that can be allocated and deallocated. To allocate, the last element is popped off the free list. To free, the free list associated with the pointer is obtained using bitwise operations and the pointer is pushed onto the back of the free list. In the prototype, if the allocator finds the current region to be full then it stores it in a global collection of regions and allocates a new local region. In a production version, the allocator would recycle one of the non-full regions from the global collection of regions rather than allocating a new region each time.
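The allocate/free bookkeeping just described can be sketched in a few lines. This is an illustration of the logic only (our own OCaml sketch using array indices in place of the prototype's aligned pointers and bitwise masking, not the C++ code itself):

```ocaml
(* A region holds a pool of fixed-size blocks plus a free list of the
   indices that are not currently allocated. *)
type region = {
  pool : int array;          (* stand-in for the pool of blocks *)
  mutable free : int list;   (* indices of unallocated blocks *)
}

let make_region blocks =
  { pool = Array.make blocks 0; free = List.init blocks (fun i -> i) }

(* Allocate: pop an element off the free list. None signals a full
   region, telling the caller to fetch or create another region. *)
let alloc r =
  match r.free with
  | [] -> None
  | i :: rest -> r.free <- rest; Some i

(* Free: push the block's index back onto its region's free list. The
   real prototype finds the owning region from the pointer's bits. *)
let free r i = r.free <- i :: r.free

let () =
  let r = make_region 2 in
  let a = alloc r in
  let _b = alloc r in
  assert (alloc r = None);                            (* region is full *)
  (match a with Some i -> free r i | None -> assert false);
  assert (alloc r <> None)                            (* freed block recycled *)
```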

How big should a region be? The results shown above were obtained using 1MB regions, large enough that they were never filled and a new region was never needed. However, reducing the region size to 1kB causes the prototype to create 8,295 regions on the 11-queens problem but the program is only 5% slower and total memory consumption is around 99% lower than simply leaking, so memory is being recycled effectively.

Measuring the absolute performance of the 10-queens solver as a function of the region size gives the following results:

The smallest possible region size of 16 bytes allows a single allocation per region and makes the whole program run 7.6× slower. Increasing the region size improves the efficiency of the region allocator (except for an anomaly between 128 and 256 byte regions that is probably due to benchmark-specific allocation patterns). With 1,024-byte regions, performance is within a few percent of optimal for this benchmark. One might have expected to see significant performance gains from larger regions up to the size of the 6MB L2 cache on this machine but the tiny working set required by this benchmark eliminated any performance difference beyond 1kB regions.

The following graph shows the number of regions allocated for different region sizes on the 10-queens benchmark:

Smaller regions mean that a larger number of regions are required, up to around ten million for 16-byte regions. The relationship here reflects the previous region size vs performance relationship because most of the time is spent administering regions when they are small. The initial sharp drop-off occurs because allowing regions to contain just a few more values significantly increases their ability to recycle space. With 1kB regions, only 874 regions are created to solve this problem.

The product of the region size and number of regions used quantifies the total space allocated for regions using glibc. Doubling the region size from 64 bytes to 128 bytes reduces the total memory allocated by 33% and doubling the region size from 2kB to 4kB reduces the total memory allocated by 99%. Perhaps the accelerated efficiency is due to the generational hypothesis that predicts inverse hyper-exponential decay of the probability of death as a function of age.

In HLVM, a thread-safe allocator will try to use the thread-local region and resort to synchronization only when the current region is full, whereupon an existing non-full region will be reused or a new empty region will be created. The deallocator must potentially access any region but, with HLVM's current design, it is only invoked from a single thread during the stop-the-world phase so it can be thread unsafe. This has two benefits over the current technique:

  • Single-threaded allocation and deallocation should be almost twice as fast as they are today.
  • Multi-threaded allocation should scale linearly with the number of cores whereas HLVM currently sees performance degradation from concurrent allocations.

However, our previous results indicated that HLVM's currently-dismal performance on this benchmark is actually due to the shadow stack and not to allocation. We anticipate that efficient concurrent allocation will be the next bottleneck after the performance of the shadow stack is addressed so this is still valuable work.

Two pieces of related work remain to be done:

  • Mimic the effects of HLVM's current GC more accurately by deallocating in chunks.
  • Extend the prototype to reuse existing non-full regions before allocating a new empty region.

Thursday, 23 December 2010

When generational GC goes bad

For many years, generational collection was the de facto standard GC architecture. Based upon the observation that the distribution of value lifetimes is heavily skewed towards short lifetimes (most values die young), generational garbage collectors allocate into a nursery generation and survivors are copied out into an old generation.

Many practical language implementations use generational garbage collection including OCaml, GHC and .NET. Generational collection works well when the generational hypothesis holds but struggles when values survive the nursery only to become unreachable soon afterwards. This corresponds to common allocation patterns such as cycling values through mutable queues or caches and filling hash tables.

Imagine repeatedly enqueuing and dequeuing values on a queue. The lifetimes of the values are proportional to the length of the queue. Thus, this provides a simple way to quantify the performance overhead of generational garbage collection. If boxed values are enqueued and dequeued on OCaml's built-in mutable Queue data structure then the time taken per element jumps by around a factor of 2-3.5 when the elements reachable from the queue exceed the size of the nursery and, thus, most survive to the old generation rather than being collected efficiently in the young generation. Specifically, the time taken to enqueue and dequeue 32-bit ints on a 2.1GHz Opteron 2352 jumps from 0.33μs to 0.68-1.13μs. Where is this time being wasted?
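The jump can be reproduced with a few lines of OCaml. This is a hedged sketch of the measurement, not the exact benchmark code (we use Sys.time rather than a high-resolution clock): the queue is held at a fixed length so element lifetime is proportional to that length, and each element is a freshly allocated boxed value.

```ocaml
(* Time one enqueue+dequeue pair on a queue held at a fixed length.
   Once the live elements exceed the nursery size, most survive to the
   old generation and the per-element cost jumps. *)
let time_per_element queue_length iterations =
  let q = Queue.create () in
  (* Pre-fill to the target length with boxed values (ref cells). *)
  for _i = 1 to queue_length do Queue.push (ref 0) q done;
  let t0 = Sys.time () in
  for _i = 1 to iterations do
    Queue.push (ref 0) q;
    ignore (Queue.pop q)
  done;
  (Sys.time () -. t0) /. float_of_int iterations

let () =
  Printf.printf "short-lived: %g s/element\n"
    (time_per_element 1_000 1_000_000);
  Printf.printf "long-lived:  %g s/element\n"
    (time_per_element 3_000_000 1_000_000)
```

With a short queue the elements die in the nursery; with a long one they survive it, and the reported per-element time should show the kind of jump described above.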

When a boxed value (such as a 32-bit integer) is allocated in OCaml, it is augmented with a 1-word header and another for the forwarding pointer and that whole block is bump allocated from the nursery. When that value is written into the Queue in the old generation, a write barrier is incurred which stores a copy of the reference in the remembered set. When the nursery is filled, a minor collection is performed that traces from the global roots and remembered set throughout the reachable values in the nursery. These values are then copied into the old generation, their forwarding pointers are set and all locally-held references to them are updated via the forwarding pointers to point into their copies in the old generation. The nursery is then swept by resetting the bump allocator to the start of the nursery.

Suffice to say, this is a lot of overhead when the values allocated into the nursery do not die quickly enough. In that case, all of this effort is a complete waste of time and we would have been better off allocating directly into the old generation in the first place. What can be done to address this problem?

Fortunately, McKinley et al. made a breakthrough in GC design in recent years with their invention of a new class of GC algorithms known as mark-region GCs. It all began with their invention of the Beltway GC in 2002, a generalization of several existing GC designs, and culminated in their Immix GC in 2008. In effect, this GC design allows a nursery full of reachable values to be migrated to the old heap implicitly without any copying and a new nursery is allocated to replace it. The old generation is then effectively a collection of surviving nurseries. The precise placement policy is more complicated because it is possible to reuse old nurseries in order to avoid gross fragmentation but the basic concept is simple enough.

A Google Summer of Code project implemented an Immix variant for the Glasgow Haskell Compiler. They found the results to be underwhelming but that is not so surprising given that this GC design should be most effective when filling mutable data structures such as queues, caches, hash sets and hash tables. We believe that a simple mark-region variant should be able to dramatically improve HLVM's performance on parallel functional code without degrading the performance of imperative code as generational garbage collectors like OCaml's do.

Wednesday, 15 December 2010

Getting paid to remove features

Although the Industrial Haskell Group has yet to garner its first industrial member since its inception almost two years ago, they have managed the impressive feat of getting paid to remove a feature from Haskell. Specifically, to make it easier to build programs written in Haskell that do not rely upon the GNU Multiprecision library for arbitrary-precision arithmetic (bignums).

We made this interesting observation when considering adding bignums using GMP as a primitive type for HLVM. Apparently, having bignums in the language is not very useful beyond irrelevant microbenchmarks like computing the digits of π.

The slides here also criticize the CAML Consortium (which has garnered 11 members) for charging too little and state that the IHG aimed to garner five members each paying £12k per annum. Why has this target not yet been reached? Our guess is insufficient sales and marketing directed at decision makers in industry who could benefit from using Haskell. As an aside, we believe this same mistake is why the founders of Stack Overflow found it so difficult to monetize despite having millions of non-paying users. In contrast, Rich Hickey managed to garner funding from a whopping 427 people and several companies for his own language, Clojure.

Regardless, the fact that they are trying to build a business around the development of Haskell itself is admirable and should at least prompt more professionals to take a look at what is on offer.

Sunday, 5 December 2010

Texas Multicore Technologies on Haskell

US-based startup Texas Multicore Technologies have published some of the results they obtained using Haskell for parallel programming. Their results mirror our own:

Thanks to Manuel Chakravarty of the University of New South Wales for drawing this to our attention.

Saturday, 6 November 2010

Mono 2.8: a step closer to a reliable foundation

We previously complained about the use of Boehm's conservative garbage collector in earlier versions of Mono because it is fundamentally flawed and prone to causing unpredictable memory leaks that result in applications dying with out-of-memory errors when there is plenty of garbage left to be reclaimed. Specifically, we gave a simple 9-line example program that fills and forgets ten hash tables that ran out of memory when run on Mono 2.4. What happens when this program is run on Mono 2.8 using the new SGen garbage collector?

Running the test with Mono 2.8 using the default Boehm GC often reproduces the same leak that we saw before, as expected. Repeating our previous test using the new SGen garbage collector we find that the program does not die after four iterations with an out-of-memory error but gets as far as eight of the intended ten iterations before dying with a segmentation fault:

$ mono-sgen TailCall.exe
m[42] = 42
Took 3.40511s

m[42] = 42
Took 3.41273s

m[42] = 42
Took 3.20464s

m[42] = 42
Took 3.96534s

m[42] = 42
Took 3.14944s

m[42] = 42
Took 3.10114s

m[42] = 42
Took 3.14187s

m[42] = 42
Took 3.27123s


  at (wrapper managed-to-native) object.__icall_wrapper_mono_gc_alloc_vector (intptr,intptr,intptr) <0x00003>
  at (wrapper managed-to-native) object.__icall_wrapper_mono_gc_alloc_vector (intptr,intptr,intptr) <0x00003>
  at (wrapper alloc) object.AllocVector (intptr,intptr) <0x000ac>
  at System.Collections.Generic.Dictionary`2<double, double>.Resize () <0x001bc>
  at System.Collections.Generic.Dictionary`2<double, double>.set_Item (double,double) <0x0014f>
  at <StartupCode$TailCall>.$Program.main@ () <0x0007c>
  at (wrapper runtime-invoke) object.runtime_invoke_void (object,intptr,intptr,intptr) <0x0007d>

Native stacktrace:

        mono-sgen [0x80dec34]
        mono-sgen [0x812b2cb]
        mono-sgen [0x8174e17]
        mono-sgen [0x8175428]
        mono-sgen [0x8065318]
        mono-sgen(mono_runtime_invoke+0x40) [0x81a9aa0]
        mono-sgen(mono_runtime_exec_main+0xd6) [0x81ad1f6]
        mono-sgen(mono_main+0x1a41) [0x80bb501]
        mono-sgen [0x805b388]
        /lib/tls/i686/cmov/ [0xb7451b56]
        mono-sgen [0x805b131]

Debug info from gdb:

[Thread debugging using libthread_db enabled]
[New Thread 0xb7103b70 (LWP 8401)]
0xb76f3430 in __kernel_vsyscall ()
  2 Thread 0xb7103b70 (LWP 8401)  0xb76f3430 in __kernel_vsyscall ()
* 1 Thread 0xb7439720 (LWP 8400)  0xb76f3430 in __kernel_vsyscall ()

Thread 2 (Thread 0xb7103b70 (LWP 8401)):
#0  0xb76f3430 in __kernel_vsyscall ()
#1  0xb75a9f75 in sem_wait@@GLIBC_2.1 ()
    at ../nptl/sysdeps/unix/sysv/linux/i386/i686/../i486/sem_wait.S:80
#2  0x0822c778 in mono_sem_wait (sem=0x89ce64c, alertable=0)
    at mono-semaphore.c:102
#3  0x081560c7 in finalizer_thread (unused=0x0) at gc.c:1048
#4  0x08183065 in start_wrapper (data=0xa37c760) at threads.c:747
#5  0x0821a7df in thread_start_routine (args=0xa36762c) at wthreads.c:285
#6  0x0816da8b in gc_start_thread (arg=0xa37c808) at sgen-gc.c:5350
#7  0xb75a380e in start_thread (arg=0xb7103b70) at pthread_create.c:300
#8  0xb75078de in clone () at ../sysdeps/unix/sysv/linux/i386/clone.S:130

Thread 1 (Thread 0xb7439720 (LWP 8400)):
#0  0xb76f3430 in __kernel_vsyscall ()
#1  0xb75aac8b in read () from /lib/tls/i686/cmov/
#2  0x080dedfc in read (signal=11, ctx=0xb72fcd0c)
    at /usr/include/bits/unistd.h:45
#3  mono_handle_native_sigsegv (signal=11, ctx=0xb72fcd0c)
    at mini-exceptions.c:1935
#4  0x0812b2cb in mono_arch_handle_altstack_exception (sigctx=0xb72fcd0c,
    fault_addr=0x8, stack_ovf=0) at exceptions-x86.c:1163
#5  <signal handler called>
#6  alloc_large_inner (vtable=<value optimised out>,
    size=<value optimised out>) at sgen-los.c:368
#7  0x08174e17 in mono_gc_alloc_obj_nolock (vtable=0xa3af948, size=0)
    at sgen-gc.c:3219
#8  0x08175428 in mono_gc_alloc_vector (vtable=0xa3af948, size=147681864,
    max_length=18460231) at sgen-gc.c:3437
#9  0xb72ecb0b in ?? ()
#10 0xb72e97d5 in ?? ()
#11 0xb72ec695 in ?? ()
#12 0xb72ec2a8 in ?? ()
#13 0xb72e8d9d in ?? ()
#14 0xb72e8fd6 in ?? ()
#15 0x08065318 in mono_jit_runtime_invoke (method=0xa330bdc, obj=0x0,
    params=0xbfd1aafc, exc=0x0) at mini.c:5392
#16 0x081a9aa0 in mono_runtime_invoke (method=0xa330bdc, obj=0x0,
    params=0xbfd1aafc, exc=0x0) at object.c:2709
#17 0x081ad1f6 in mono_runtime_exec_main (method=0xa330bdc, args=0xb6c00638,
    exc=0x0) at object.c:3838
#18 0x080bb501 in main_thread_handler (argc=2, argv=0xbfd1ace4) at driver.c:999
#19 mono_main (argc=2, argv=0xbfd1ace4) at driver.c:1836
#20 0x0805b388 in mono_main_with_options (argc=2, argv=0xbfd1ace4) at main.c:66
#21 main (argc=2, argv=0xbfd1ace4) at main.c:97

Got a SIGSEGV while executing native code. This usually indicates
a fatal error in the mono runtime or one of the native libraries
used by your application.


Seven years after the Mono team described their use of the Boehm garbage collector as "an interim measure", the SGen collector is still experimental. Hopefully these issues will be resolved and the Mono platform will benefit from a reliable garbage collector in the not-too-distant future. However, we cannot help but wonder why the Mono team have not chosen to release a simple but reliable garbage collector that people could use while they wait for SGen to be stabilized. After all, multicore-friendly garbage collection can be easy.

Wednesday, 27 October 2010

"The F# Asynchronous Programming Model" by Don Syme et al.

The creator of the F# programming language at Microsoft Research in Cambridge, Don Syme, recently had a paper accepted for the Practical Aspects of Declarative Languages conference. This paper provides a great introduction to asynchronous workflows and the MailboxProcessor in F#.

In fact, the use of monads to sequence asynchronous operations has a long history. For example, this approach has been used in the OCaml programming language, from which F# is descended, for at least 8 years. Specifically, Jérôme Vouillon's LWT library for OCaml made asynchronous programming easy. For example, the first F# sample given in this paper:

async { let! html = getWebPage ""
        return html.Length }

Could have been written in OCaml as follows in 2002:

getWebPage "" >>= fun html ->
return (String.length html)

In 2005, Jacques Carette et al.'s pa_monad syntax extension even added the same kind of syntactic sugar that F# provides for its asynchronous workflows, allowing the sample to be written in OCaml as:

perform
  html <-- getWebPage "";
  return (String.length html)

For more information on asynchronous programming in F#, read Visual F# 2010 for Technical Computing.

Wednesday, 20 October 2010

What is the difference between parallel and concurrent programming?

Concurrent programming concerns operations that appear to overlap in time and is primarily concerned with the complexity that arises due to non-deterministic control flow. The quantitative costs associated with concurrent programs are typically both throughput and latency. Concurrent programs are often IO bound but not always, e.g. concurrent garbage collectors are entirely on-CPU. The pedagogical example of a concurrent program is a web crawler. This program initiates requests for web pages and accepts the responses concurrently as the results of the downloads become available, accumulating a set of pages that have already been visited. Control flow is non-deterministic because the responses are not necessarily received in the same order each time the program is run. This characteristic can make it very hard to debug concurrent programs. Some applications are fundamentally concurrent, e.g. web servers must handle client connections concurrently. Erlang is a language designed specifically for distributed concurrent programming with fault tolerance but many other languages provide features for concurrent programming, such as asynchronous workflows in the F# programming language.

Parallel programming concerns operations that are overlapped for the specific goal of improving throughput. The difficulties of concurrent programming are evaded by making control flow deterministic. Typically, programs spawn sets of child tasks that run in parallel and the parent task only continues once every subtask has finished. This makes parallel programs much easier to debug. The hard part of parallel programming is performance optimization with respect to issues such as granularity and communication. The latter is still an issue in the context of multicores because there is a considerable cost associated with transferring data from one cache to another. Dense matrix-matrix multiply is a pedagogical example of parallel programming and it can be solved efficiently by using Strassen's divide-and-conquer algorithm and attacking the sub-problems in parallel. Cilk pioneered the most promising techniques for high-performance parallel programming on shared-memory computers (including multicores) and its technology is now offered by Intel in their Threading Building Blocks (TBB) and by Microsoft in .NET 4. So this is also easily accessible from the F# programming language.

Don Syme on "Functional approaches to parallelism and concurrency"

Don Syme, creator of the F# programming language, recently gave a superb lecture on parallel and concurrent programming using F# at QCon 2010. Video and slides hosted by InfoQ here.

Parallel programming continues to be a hot topic in the face of multicore computing but, as Don points out, the world is also moving steadily towards more concurrent programming.

Work continues on our forthcoming "Multicore .NET" book that studies parallel programming using C# and F# in detail...

Tuesday, 19 October 2010

Can you repro this 64-bit .NET GC bug?

Update: Maoni Stephens of Microsoft, the author of this new "Background" garbage collector in .NET 4, has now reproduced the bug and acknowledged that we really are seeing GC pauses lasting several minutes.

Whilst developing low latency software on .NET, we have discovered a serious bug in the .NET 4 concurrent workstation garbage collector that can cause applications to hang for up to several minutes at a time.

On three of our machines the following simple C# program causes the GC to leak memory until none remains and a single mammoth GC cycle kicks in, stalling the program for several minutes (!) while 11GB of heap is recycled:

static void Main(string[] args) {
  var q = new System.Collections.Generic.Queue<int>();
  while (true) {
    q.Enqueue(0);
    if (q.Count > 1000000)
      q.Dequeue();
  }
}
You need to compile for x64 on a 64-bit Windows OS with .NET 4 and run with the default (concurrent workstation) GC using the default (interactive) latency setting.

Here's what the Task Manager looks like when running this program on this machine:

Note that 11GB of heap have been leaked here when this program requires no more than 100MB of memory.

We have now accumulated around a dozen repros of this bug, written in F# as well as C#, and it appears to be related to a bug in the GC write barrier when most of gen0 survives. However, Microsoft have not yet been able to reproduce it. Can you?

Saturday, 16 October 2010

ARM-based iPads choke Intel-based netbook sales

The recent news that Apple are selling around 18 million of their ARM-based iPads per year reminded us of our article Will Intel lose the computer market to ARM in 2012? from January. Following their success, there are now a growing number of competitors itching to release ARM-based tablet PCs of their own, like Marvell's $99 Moby tablet.

Compare just those iPad sales to the 35 million netbooks of all brands sold in 2009 and the predicted 36 million netbooks to be sold in 2010 and it looks as though Intel may at least lose the mobile market to ARM in 2012.

Sunday, 10 October 2010

Towards concurrent garbage collection for GHC

Simon Marlow of Microsoft Research recently published a blog post entitled First results from GHC's new garbage collector. As his beautiful graphs show so clearly, this is a first step towards concurrent garbage collection. The blog post describes this advancement entirely from the perspective of throughput because the ability to collect per-thread nursery generations independently removes some of the blocking that was wasting mutator time in the previous version.

However, we believe that concurrent programming may become a killer application domain for Haskell and, in that context, latency can be critical. If GHC's garbage collector is made more concurrent, by allowing the old generation to be collected independently as well, then pause times could be greatly reduced and Haskell would have a considerable advantage over competing technologies like .NET.

We have found that even the best-behaved .NET programs that allocate still suffer GC pauses of around 20ms, over an order of magnitude longer than the 600µs pause times indicated on the graphs for this new version of GHC. Real .NET applications that were not designed from the ground up to attain low latency suffer stalls lasting several seconds!

Sunday, 26 September 2010

Lessons learned from HLVM

Although our HLVM project is intended to bring modern VM techniques to modern programming languages without requiring any new research, we have ended up producing several enlightening new results. This article takes a look at some of the weird and wonderful techniques we used in HLVM.

GC as a metaprogram (good!)

Whereas most VMs use a garbage collector written in C or C++, the GC in HLVM is written in its own intermediate representation. This unusual design has a variety of benefits:

  • The GC acts as a test for the compiler.
  • The GC benefits from improvements we make to the compiler.
  • The GC is simplified by being able to use features like tail call elimination and higher-order functions.

Overall, implementing the GC in HLVM's own IR proved to be hugely beneficial: the GC was very easy to write and is easy to maintain.

Fat references (bad!)

Most VMs store header data in each heap-allocated block, describing the size and type of a value, in order to allow the garbage collector to traverse it. In HLVM, we moved this header data into the reference itself. This allows C data structures to be used directly, without being copied in order to add the header, which simplifies and speeds up the foreign function interface (FFI).

This adds a 4-word space overhead for duplicated references. Fortunately, duplicate references are rare and no significant space overhead has ever been observed in practice.

However, there are two disadvantages. Firstly, we have observed significant performance degradation as fat references are pushed onto the shadow stack, requiring 4× more bandwidth than 1-word references would. Secondly, our fat references cannot be updated atomically, so our VM is not memory safe.

Lack of memory safety is a major concern and the best solution we have come up with to date is to resort to 1-word references as pointers to heap-allocated header information. Note that HLVM currently calls malloc twice for each allocation (once to allocate a mark bit and again to allocate the actual data) so moving the header information back into the heap need not require more allocations. The obvious solution of placing the header alongside the mark bit would incur false sharing when the GC thread writes to the mark bit and a mutator reads the header information. If that proves to be a problem then a better solution might be to store an index that can be used to look up the mark bit and header data from two separate arrays.
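To illustrate the trade-off, here is a sketch in Python (the field names are hypothetical, not HLVM's exact layout): a fat reference carries its header data alongside the pointer, whereas a thin reference leaves the header in the heap block.

```python
from collections import namedtuple

# Illustrative layouts only; HLVM's actual fat references differ in detail,
# but the 4-word vs 1-word trade-off is the point.
FatRef = namedtuple("FatRef", ["pointer", "type_info", "length", "metadata"])
ThinRef = namedtuple("ThinRef", ["pointer"])  # header lives in the heap block

# Copying a fat reference moves 4 words across the shadow stack, and a
# 4-word store cannot be performed atomically on current hardware, which
# is the source of the memory-safety problem described above.
print(len(FatRef._fields))   # 4 words per reference
print(len(ThinRef._fields))  # 1 word per reference
```

The FFI benefit comes from the thin pointer field alone being a valid C pointer; the cost is that every reference copy moves the whole tuple.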

Shadow stacks (adequate!)

Xavier Leroy dismissed our proposed shadow stack, saying it was "probably the worst approach performance-wise; even a conservative GC should work better". Our motivation for using shadow stacks was always simplicity rather than performance: LLVM's GC support is highly experimental (to date, only two people claim to have managed to get it to work) and shadow stacks facilitate accurate GC in an uncooperative environment. In fact, our results have already shown that Xavier's statement is not true in general and, we believe, may even be wrong in most cases.

Specifically, of the six benchmarks from our benchmark suite that handle references in the inner loop (thus incurring manipulation of the shadow stack), HLVM is significantly faster than OCaml on four. So our shadow stack cannot be degrading performance that much. Moreover, OCaml is 9× faster than HLVM on our n-queens benchmark only because its generational collector recycles short-lived values very efficiently, which has nothing to do with the shadow stack.

So we believe shadow stacks are an adequate solution.

Saturday, 25 September 2010

The effectiveness of Boehm's GC

Many people still seem to be trying to use Boehm's garbage collector. This is surprising because that GC is conservative, meaning it is incapable of accurately distinguishing between integers and pointers. It is therefore prone to false positives, where an integer is assumed to be a pointer into an allocated heap block, preventing the block from being reclaimed. Consequently, Boehm's GC is a notorious source of memory leaks.

Moreover, 32-bit machines with 4Gb of RAM running programs that use a significant proportion of that RAM are still very common. The proportion of the address space in use is therefore high, so the probability of false positives is very high.

Imagine a 32-bit program using 40Mb of heap that contains a random integer. The probability of that random integer coincidentally pointing into an allocated heap block is approximately 1% because 1% of the 4Gb address space is covered by heap blocks. Now imagine a 32-bit program containing n random ints and using a proportion p of the address space. The probability that one or more of those ints points into an allocated heap block is 1-(1-p)^n. In our previous example, if there were 100 random ints then the probability of a false positive is a whopping 63%!
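This arithmetic is easy to check. A quick Python sketch of the same calculation (the 1% and 100-int figures are the worked example above):

```python
# Probability that at least one of n random machine words is mistaken for a
# pointer into the heap, when a fraction p of the address space is heap.
def false_positive_probability(n, p):
    return 1 - (1 - p) ** n

# Worked example from the text: 40Mb of heap in a 4Gb address space (p = 1%).
p = 40e6 / 4e9
print(false_positive_probability(1, p))    # one random int: ~1%
print(false_positive_probability(100, p))  # 100 random ints: ~63%
```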

This is why 32-bit programs that use Boehm's GC are so prone to memory leaks. Hash tables are particularly susceptible to this problem because hashes are effectively random ints and the spines of hash tables are large heap blocks. Indeed, this appears to explain the memory leak we observed when using hash tables on Mono back in July.

Sunday, 19 September 2010

Are multicore-capable garbage collectors hard to write?

In this era of multicore computing, garbage collectors need to allow user threads (aka mutators) to run in parallel on separate cores simultaneously in order to facilitate efficient shared memory parallel programming.

There are two relevant phrases from garbage collection terminology here:

  • Parallel GC means the garbage collector itself has been parallelised in order to speed up garbage collections. For example, a stop-the-world collector might traverse the heap in parallel when marking in an attempt to reduce pause times.

  • Concurrent GC means the garbage collector runs at the same time as the user threads (aka mutators). For example, Dijkstra's algorithm and derivatives like Doligez-Leroy use fine-grained synchronization to keep a concurrently-running collector apprised of the constantly-changing heap topology.

However, we are talking about neither parallel GC nor concurrent GC but, rather, the simpler challenge of just allowing mutators to run in parallel. In the absence of any established terminology, we call any GC that allows mutator threads to run in parallel with each other a multicore-friendly garbage collector.

Frustratingly, many people are perpetuating the myth that it is difficult to write parallel or concurrent or even just multicore-friendly garbage collectors. In particular, this is happening around the OCaml language as a result of Xavier Leroy's (in)famous "standard lecture on threads" from 2002, in which he explained that they were not creating a multicore-friendly garbage collector for OCaml because multicores would never become popular, and described Intel's hyperthreading as "the last convulsive movement of SMP's corpse". Xavier is a great guy and has done a lot of great work but, of course, he was completely wrong about this. Not only are multicores ubiquitous and hyperthreading common, but shared memory parallelism is here to stay: even if distributed parallelism becomes essential once cache coherence breaks down, we will still be doing distributed parallel programming between multicores, because shared memory parallelism is so much more efficient than distributed parallelism most of the time.

The JVM and .NET CLR obviously provide multicore-friendly garbage collectors, and people sometimes assert that creating such a garbage collector requires huge resources and is beyond a small group such as the OCaml team at INRIA. This is simply not true. The simplest multicore-friendly design is a stop-the-world collector that pauses all user threads while the entire heap is marked and swept, safe in the knowledge that the heap topology is static. Our own HLVM project currently uses exactly this design: it took just a few days to write and is under 100 lines of code! And we are not alone. Simon Marlow has written several far more sophisticated multicore-friendly garbage collectors for the Glasgow Haskell Compiler by himself. The PolyML project developed a multicore-friendly garbage collector without the benefit of funding from a multinational corporation. Same for Manticore.
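To make this concrete, the entire essence of such a collector fits in a few lines. Here is a toy stop-the-world mark-sweep in Python (a sketch of the general technique, not HLVM's actual IR code): with every mutator paused, the heap topology is static, so marking is a plain graph traversal and sweeping is a filter.

```python
class Obj:
    """A heap object: just a list of references to other objects."""
    def __init__(self, children=()):
        self.children = list(children)
        self.marked = False

def collect(heap, roots):
    """Stop-the-world mark-sweep: assumes all mutator threads are paused."""
    # Mark: everything reachable from the roots survives.
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if not obj.marked:
            obj.marked = True
            stack.extend(obj.children)
    # Sweep: keep the marked objects, clearing marks for the next cycle.
    live = [o for o in heap if o.marked]
    for o in live:
        o.marked = False
    return live
```

In a real VM the roots would come from each thread's stack (or shadow stack) and the globals, and the sweep would return memory to an allocator, but the control flow is no more complicated than this.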

Even some of the more sophisticated mostly-concurrent garbage collector designs are remarkably simple. For example, the Very Concurrent Garbage Collector (VCGC) uses a breathtakingly-elegant approach based around three epochs (instead of the usual tricolor marking scheme) that completely avoids the error-prone and inefficient fine-grained synchronization originally proposed by Dijkstra and followed by almost everyone else. The entire algorithm for this mostly concurrent garbage collector can be expressed in a single page of code!
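To convey the idea, here is a deliberately simplified, single-threaded Python simulation of the epoch scheme (our sketch, not VCGC itself): objects are stamped with an epoch, marking restamps everything reachable with the current epoch, and anything still stamped two epochs behind has been unreachable for a whole cycle and can be reclaimed with no per-object synchronization during marking.

```python
class Obj:
    def __init__(self, epoch, children=()):
        self.epoch = epoch          # epoch when allocated or last found reachable
        self.children = list(children)

class EpochGC:
    def __init__(self):
        self.epoch = 0
        self.heap = []
        self.roots = []

    def alloc(self, children=()):
        o = Obj(self.epoch, children)
        self.heap.append(o)
        return o

    def collect(self):
        # Mark: restamp everything reachable with the current epoch.
        seen = set()
        stack = list(self.roots)
        while stack:
            o = stack.pop()
            if id(o) not in seen:
                seen.add(id(o))
                o.epoch = self.epoch
                stack.extend(o.children)
        # Advance the epoch; objects left two epochs behind are dead.
        self.epoch += 1
        dead = self.epoch - 2
        self.heap = [o for o in self.heap if o.epoch > dead]
```

The real algorithm interleaves the mutator and collector across the three epochs concurrently; this sketch only shows why the epoch stamps make fine-grained synchronization unnecessary.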

So please do not let these myths put you off trying to write your own multicore-friendly garbage collector: this is an interesting, rewarding and entirely tractable challenge!

Sunday, 12 September 2010

Burton Smith vs Erik Meijer

We were recently led to a fascinating Channel 9 interview where the insidious Erik Meijer walks in on parallelism-expert Burton Smith, culminating in a fight the likes of which have not been seen since Chris Curry vs Sir Clive Sinclair in the Baron of Beef here in Cambridge.

A pragmatic Burton points out that lazy evaluation renders performance unpredictable and that, in turn, makes it impossible to tune the granularity of parallel programs to ensure that more effort is spent doing actual work than is wasted administering parallel tasks. The puritanical Erik points out that strictness is essentially another form of side effect because it affects issues such as non-termination. The conclusion is irreconcilable differences between what parallel programming requires and what purely functional programming can provide.

Erik and Burton were joined by an elephant in the room though: caches. Caches are the main difference between supercomputers and multicores and are a game changer, yet people rarely discuss the effect caches have on programming, an effect that will only grow as the memory gap continues to widen. Today, effective utilization of caches offers an order of magnitude in performance, yet exploiting them from purely functional languages like Haskell is essentially impossible precisely because memory has been successfully abstracted away. Indeed, as we have explained before, this not only remains a show-stopping unsolved problem for parallel Haskell but is not even being addressed by researchers. Today's state-of-the-art advice for multicore programmers is still to use mutable data structures in order to ensure that caches are used effectively, guided by quantifying asymptotic cache complexity in a simple theoretical model of CPU caches.

Tuesday, 10 August 2010

New Scala consultancy company

We have uncovered an interesting development since publishing our previous article about Martin Odersky's claim that Scala is "foremost an industrial language". Apparently, Martin Odersky stated at Scala Days 2010 that he intends to create a startup offering commercial Scala support. He also mentioned it in a comment here. Needless to say, this would be a huge step forwards for Scala!

As an interesting aside, the Scala in the Enterprise web page lists some industrial users of Scala including Électricité de France Trading, Twitter, Xebia, Xerox, FourSquare, Sony, Siemens, GridGain, AppJet, Reaktor, Nature, Managed Gaming Solutions, Tiental, Sygneca, AD Holdings, SAIC, Mimesis Republic and WattzOn.

Monday, 9 August 2010

"I think F# is very cool" - Rich Hickey

Very interesting video interview here with Rich Hickey of Clojure and Joe Pamer of F# side-by-side. When the interviewer asks Rich Hickey what he thinks of F#, he says "I think F# is very cool" and then explains why.

Rich Hickey is, of course, a visionary among programming language developers, having single-handedly built a self-sustaining business around his own Clojure language. Remarkably, there are already more Clojure jobs than OCaml jobs according to IT Jobs Watch, and there are now twice as many Google searches for Clojure as there are for OCaml:

Saturday, 7 August 2010

"Scala is foremost an industrial language"

In a recent interview about Scala and Clojure, Martin Odersky of Scala gave some interesting answers including the following:

Q: Rich Hickey is as well-read in the academic papers as anyone, but it’s Scala that has gained the perception as an “academic language”. Why do you think that has happened?

A: I think it’s mostly people who want to put Scala down making that comment. They take my EPFL affiliation and the papers we publish as an excuse to attach that label to the language. What’s funny is that often senior industrial Scala programmers get also accused as being academic. All this is rather ironical because Scala is foremost an industrial language with many well known companies using it. By contrast it’s much less taught at universities than Haskell or Lisp, let alone Java!

This raises the obvious question: in what sense is Scala "foremost an industrial language"?

As we understand it, Scala is developed by an academic team led by Professor Odersky at an academic institution with academic funding for the sole purpose of academic research, and this has culminated in a new academic paper every 7½ months on average over the past decade. That is not true of any industrial programming language. Indeed, in our experience with Scala this shows up as a growth in esoteric language features while basic IDE support is neglected. Some people are trying to use Scala in industry, but that is true of many academic languages. Some funding has come from industry, including a surprising grant from Microsoft to update the .NET port of Scala, but that is presumably a tiny fraction of the total funding that has been spent on Scala. So this seems like rather an odd claim.

Friday, 6 August 2010

More OCaml trends

Paolo from Italy pointed out that the number of blog posts on OCaml has continued to increase in recent years (6,420, 10,500 and 12,100 in 2007/8/9 according to Google blog search) and referred to the success of this year's OCaml meeting with 80 delegates in France. These are certainly encouraging results but it may be worth bringing more data to the table.

Firstly, Google Trends can be used to graph the proportion of Google searches for different search terms over time. The following graph shows the trends for the keywords OCaml and F# since 2004:

As you can see, the proportion of searches for OCaml (blue) is in steady decline whereas the proportion of searches for F# (red) is on the increase and the two crossed over in 2007. In fact, we have found that Google Trends correlates very strongly with our revenue.

Secondly, we can examine statistics about the job market. The following bar chart illustrates the change in UK jobs from 2008 to 2010 for four functional programming languages (source IT Jobs Watch):
As you can see, every language saw tremendous growth in this boom period for functional programming except OCaml which actually saw a decline in the number of jobs on offer.

The deal breaker for us was, of course, revenue. The proportion of our revenue coming from OCaml has fallen steadily from 80% in 2007 to just 10% today.

Thursday, 5 August 2010

Pure F# now only 2× slower than OCaml

One of the concerns expressed by some of the remaining OCaml users, such as Yaron Minsky of Jane St Capital, is the relative performance of F# in the context of purely functional data structures. Specifically, language implementations like OCaml are heavily optimized for this use case due to their legacy and on-going use for applications such as theorem proving, which benefit greatly from the efficient handling of purely functional data structures. Historically, most users of imperative languages such as Java and C# have not required this kind of performance and, consequently, their implementations have not been optimized for this style of programming. However, the recently released .NET 4 includes a new garbage collector and we have found that it provides substantial performance improvements in the context of purely functional data structures which is of particular benefit to F# developers.

We recently examined the performance of four different purely functional heap implementations across five different benchmarks written in both OCaml and F# (see Data structures: heaps). Where our previous findings in similar benchmarks showed F# to be 5× slower than OCaml, our new results indicate that F# on .NET 4 is now only 2× slower than OCaml on average. In one case, a leftist min heap with elements inserted in descending order, F# even ran 20% faster than OCaml.

This is a surprising and encouraging result not only because it makes F# competitive for an even wider variety of tasks but because it also implies that Microsoft are taking F# so seriously that they are optimizing the .NET garbage collector for it!
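For readers unfamiliar with the data structures being benchmarked, here is a minimal persistent leftist min-heap sketched in Python (illustrative code of ours; the benchmarks themselves were written in F# and OCaml):

```python
# A persistent leftist min-heap: merge never mutates existing nodes, so old
# versions of the heap remain valid. Each node stores its "rank", the length
# of its right spine; keeping the higher rank on the left bounds merges at
# O(log n).

class Node:
    __slots__ = ("rank", "value", "left", "right")
    def __init__(self, rank, value, left, right):
        self.rank, self.value, self.left, self.right = rank, value, left, right

def rank(h):
    return h.rank if h is not None else 0

def merge(a, b):
    if a is None: return b
    if b is None: return a
    if b.value < a.value:
        a, b = b, a                  # keep the smaller root on top
    r = merge(a.right, b)            # merge down the right spine
    l = a.left
    if rank(l) >= rank(r):           # leftist property: left rank >= right rank
        return Node(rank(r) + 1, a.value, l, r)
    return Node(rank(l) + 1, a.value, r, l)

def insert(h, x):
    return merge(Node(1, x, None, None), h)

def delete_min(h):
    return h.value, merge(h.left, h.right)
```

Every insert and delete_min allocates fresh nodes, which is exactly the allocation-heavy style of programming that stresses a garbage collector and makes these benchmarks GC-sensitive.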

Sunday, 1 August 2010

Parallel generic quicksort in Haskell

Haskell has a history of making easy problems difficult. Perhaps the most infamous example was the Sieve of Eratosthenes, which is easily implemented in any imperative language but was so difficult to write in Haskell that almost all of the solutions that had been taught in universities and used in research for the preceding 18 years had been wrong until Melissa O'Neill published a seminal paper The Genuine Sieve of Eratosthenes that gave a beautiful description of what they had been doing wrong and how it should be corrected. Melissa's solution was to use a priority queue to implement a rolling wheel of numbers. The correct solution turned out to be 10× longer than a much simpler F# solution and a whopping 100× longer than the original bastardized algorithm in Haskell.

Today, quicksort is the new Sieve of Eratosthenes. Again, the academics have addressed Haskell's failure by bastardizing the algorithm, trading orders of magnitude in performance for something that Haskell can express easily:

qsort [] = []
qsort (x:xs) = qsort (filter (< x) xs) ++ [x] ++ qsort (filter (>= x) xs)

This completely fails to capture the essence of the real quicksort algorithm (see Tony Hoare's original 1962 quicksort paper) that makes it so efficient: the in-place partitioning using swaps.
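For comparison, here is what that essence looks like when it is kept: a Python sketch of an in-place quicksort using Lomuto-style partitioning (Hoare's original partitioning scheme differs in detail but shares the same swap-based, allocation-free character):

```python
def quicksort(a, l=0, r=None):
    """In-place quicksort: partitions by swapping elements within the array,
    so no intermediate lists are allocated (unlike the one-liner above)."""
    if r is None:
        r = len(a) - 1
    if l >= r:
        return
    pivot = a[r]
    i = l - 1
    for j in range(l, r):            # partition a[l..r-1] around the pivot
        if a[j] < pivot:
            i += 1
            a[i], a[j] = a[j], a[i]  # swap: the essence of the real algorithm
    a[i + 1], a[r] = a[r], a[i + 1]  # move the pivot into its final place
    quicksort(a, l, i)
    quicksort(a, i + 2, r)
```

The two recursive calls act on disjoint slices of the same array, which is what makes the real algorithm amenable to parallelism, and also exactly what makes it awkward to express purely functionally.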

Faced with the challenge of writing a parallel generic quicksort in Haskell, Jim Apple (who is doing a PhD on Haskell at UC Davis) kicked off proceedings with the following code:

import Data.HashTable as H
import Data.Array.IO
import Control.Parallel.Strategies
import Control.Monad
import System

exch a i r = do
  tmpi <- readArray a i
  tmpr <- readArray a r
  writeArray a i tmpr
  writeArray a i tmpi

bool a b c = if c then a else b

quicksort arr l r =
  if r <= l then return () else do
    i <- loop (l-1) r =<< readArray arr r
    exch arr i r
    withStrategy rpar $ quicksort arr l (i-1)
    quicksort arr (i+1) r
  where
    loop i j v = do
      (i', j') <- liftM2 (,) (find (>=v) (+1) (i+1))
                             (find (<=v) (subtract 1) (j-1))
      if (i' < j')
        then exch arr i' j' >> loop i' j' v
        else return i'
    find p f i =
      if i == l then return i
      else bool (return i) (find p f (f i)) . p =<< readArray arr i

main = do
  [testSize] <- fmap (fmap read) getArgs
  arr <- testPar testSize
  ans <- readArray arr (testSize `div` 2)
  print ans

testPar testSize = do
  x <- testArray testSize
  quicksort x 0 (testSize - 1)
  return x

testArray :: Int -> IO (IOArray Int Double)
testArray testSize = do
  ans <- newListArray (0, testSize-1)
           [fromIntegral $ H.hashString $ show i | i <- [1..testSize]]
  return ans

This solution uses Haskell's parallel "strategies". This concept was introduced to give Haskell programmers more control over parallelization but the only available implementation was found to leak memory and nobody was able to get it to work in this case: Jim's solution contains a concurrency bug that causes it to return incorrect results almost every time it is called.

Peaker then posted his own Haskell solution:

import Data.Array.IO
import Control.Monad
import Control.Concurrent

bool t _f True = t
bool _t f False = f

swap arr i j = do
  (iv, jv) <- liftM2 (,) (readArray arr i) (readArray arr j)
  writeArray arr i jv
  writeArray arr j iv

parallel fg bg = do
  m <- newEmptyMVar
  forkIO (bg >> putMVar m ())
  fg >> takeMVar m

sort arr left right = when (left < right) $ do
    pivot <- read right
    loop pivot left (right - 1) (left - 1) right
  where
    read = readArray arr
    sw = swap arr
    find n pred i = bool (find n pred (n i)) (return i) . pred i =<< read i
    move op d i pivot =
      bool (return op) (sw (d op) i >> return (d op)) =<< liftM (/=pivot) (read i)
    loop pivot oi oj op oq = do
      i <- find (+1) (const (<pivot)) oi
      j <- find (subtract 1) (\idx cell -> cell>pivot && idx/=left) oj
      if i < j
        then do
          sw i j
          p <- move op (+1) i pivot
          q <- move oq (subtract 1) j pivot
          loop pivot (i + 1) (j - 1) p q
        else do
          sw i right
          forM_ (zip [left..op-1] [i-1,i-2..]) $ uncurry sw
          forM_ (zip [right-1,right-2..oq+1] [i+1..]) $ uncurry sw
          let ni = if left >= op then i + 1 else right + i - oq
              nj = if right-1 <= oq then i - 1 else left + i - op
          let thresh = 1024
              strat = if nj - left < thresh || right - ni < thresh
                        then (>>)
                        else parallel
          sort arr left nj `strat` sort arr ni right

main = do
  arr <- newListArray (0, 5) [3,1,7,2,4,8]
  getElems arr >>= print
  sort (arr :: IOArray Int Int) 0 5
  getElems arr >>= print

This solution also turned out to be buggy. Firstly, it contains a more subtle concurrency bug that causes it to return incorrect results only occasionally. Peaker corrected this bug to give the following code:

import System.Time
import System.Random
import Data.Array.IO
import Control.Monad
import Control.Concurrent
import Control.Exception
import qualified Data.List as L

bool t _ True = t
bool _ f False = f

swap arr i j = do
  (iv, jv) <- liftM2 (,) (readArray arr i) (readArray arr j)
  writeArray arr i jv
  writeArray arr j iv

background task = do
  m <- newEmptyMVar
  forkIO (task >>= putMVar m)
  return $ takeMVar m

parallel fg bg = do
  wait <- background bg
  fg >> wait

sort arr left right = when (left < right) $ do
    pivot <- read right
    loop pivot left (right - 1) (left - 1) right
  where
    read = readArray arr
    sw = swap arr
    find n pred i = bool (find n pred (n i)) (return i) . pred i =<< read i
    move op d i pivot =
      bool (return op) (sw (d op) i >> return (d op)) =<< liftM (/=pivot) (read i)
    swapRange px x nx y ny =
      if px x then sw x y >> swapRange px (nx x) nx (ny y) ny else return y
    loop pivot oi oj op oq = do
      i <- find (+1) (const (<pivot)) oi
      j <- find (subtract 1) (\idx cell -> cell>pivot && idx/=left) oj
      if i < j
        then do
          sw i j
          p <- move op (+1) i pivot
          q <- move oq (subtract 1) j pivot
          loop pivot (i + 1) (j - 1) p q
        else do
          sw i right
          nj <- swapRange (<op) left (+1) (i-1) (subtract 1)
          ni <- swapRange (>oq) (right-1) (subtract 1) (i+1) (+1)
          let thresh = 1024000
              strat = if nj - left < thresh || right - ni < thresh
                        then (>>)
                        else parallel
          sort arr left nj `strat` sort arr ni right

timed act = do
  TOD beforeSec beforeUSec <- getClockTime
  x <- act
  TOD afterSec afterUSec <- getClockTime
  return (fromIntegral (afterSec - beforeSec) +
          fromIntegral (afterUSec - beforeUSec) / 1000000000000, x)

main = do
  let n = 1000000
  putStrLn "Making rands"
  arr <- newListArray (0, n-1) =<< replicateM n (randomRIO (0, 1000000) >>= evaluate)
  elems <- getElems arr
  putStrLn "Now starting sort"
  (timing, _) <- timed $ sort (arr :: IOArray Int Int) 0 (n-1)
  print . (L.sort elems ==) =<< getElems arr
  putStrLn $ "Sort took " ++ show timing ++ " seconds"

This solution does run correctly on small inputs but increasing the problem size to 1,000,000 elements results in a stack overflow. Two attempts were made to diagnose this bug, here and here, but both turned out to be wrong. The bug is actually in the getElems function of the Haskell standard library which stack overflows on long arrays.

Remarkably, further bug fixing culminated in what appears to be the world's first parallel generic quicksort written in Haskell. Moreover, the resulting Haskell solution is only around 55% slower than the equivalent F# solution. Note that it requires the latest GHC, which was only released in recent weeks.