Wednesday, 23 December 2009

Another serious perf bug uncovered in Haskell's garbage collector

The developers of Haskell at Microsoft Research may claim that "Haskell is the world’s finest imperative programming language" but the flagship Haskell implementation GHC still suffers from serious long-standing bugs in its garbage collector that make it impossible to solve many important problems efficiently in Haskell.

We previously rediscovered a serious bug in GHC's garbage collector that rendered every write into an array O(n) instead of O(1) because the GC marks the entire array as dirty and traverses every element at the next minor GC. In particular, this makes it impossible to implement hash tables and sorts efficiently in Haskell. Surprisingly, this bug turned out to be many years old and had been reported by many people before us. Failure to fix this bug in GHC has been forcing Haskell programmers to resort to elaborate workarounds such as manual memory management via the FFI as seen in the Haskell implementation of the knucleotide benchmark from the Computer Language Shootout.

Yesterday, Robert van Herk wrote on the GHC users mailing list that attempts to optimize his Haskell code were also culminating in unexpectedly poor performance even though his arrays only had eight elements each. Simon Marlow clarified that their current implementation of GHC's garbage collector does indeed contain another serious performance bug: every mutable array seen by a mutator is inserted into the remembered set and examined at the next minor collection. Thus, Haskell programs that handle large numbers of small mutable arrays will experience a different kind of asymptotic performance degradation in Haskell.

These bugs effectively render Haskell itself inefficient because imperative programming and mutable data structures are the only performant solution to many common problems.

Monday, 21 December 2009

C++ is dying

As the world moves on to higher-level programming languages like F#, demand for C++ programmers is at an all-time low. According to IT Jobs Watch, the number of UK job postings citing C++ is continuing to fall at a rate of 40% per year. Market share of C++ among programming languages has fallen from 30% to 15% over the past four years:

This has prompted employers to offer salaries up to $208k in order to secure members of the increasingly-rare breed that is C++ programmers.

Saturday, 14 November 2009

Google's Go programming language

Google recently released a "systems" programming language called Go (not to be confused with the existing programming language called Go!) that is designed to be simple, fast, safe and concurrent.

Google give an impressive feature list:

  • Accurate concurrent garbage collection.
  • Goroutines and channels for parallelism and concurrency.
  • Multiple return values.
  • Fast compilation times.
  • Structurally-typed object oriented programming.
  • First-class functions.

There are two different kinds of compiler available to Go programmers. The more common 8g and 6g compilers for the x86 and x86-64 architectures, respectively, are simple "splat" compilers that translate source directly into assembler. These compilers offer fast compilation but the generated code is not optimized and is generally quite slow, often several times slower than C. Moreover, in the absence of an accurate way to traverse the heap, these compilers use a GC that resorts to conservative collection (blindly traversing the entire heap looking for pointer-like ints) and the current allocator and collector implementations are extremely slow and never return memory to the OS. The gcc-go compiler is a Go front-end for the GCC compiler that offers optimization but currently has no garbage collection at all.

The use of goroutines and channels is really the defining characteristic of the Go language. This feature stems from previous work by Go's creator, Rob Pike, on the Limbo programming language that was released in 1995. Goroutines are similar to tasks in Cilk or the Microsoft TPL except that tasks return a value whereas goroutines do not. Channels are a mechanism for synchronized communication using message passing.

Go may become an interesting language in the future but, for now, the only available implementations are nowhere near competitive and the language itself is missing many common features (non-null references, generics, exceptions and union types). Our impression is that Google are developing the Go programming language largely for itself, first as a language for high-performance web programming but also potentially as a language for Google's proprietary NaCl Chrome plugin that allows x86 code to be downloaded over the web and executed by a browser.

Thursday, 12 November 2009

How does HLVM's GC work?

HLVM injects code into managed user-code functions that keep two data structures up to date:

  • The shadow stack lists all references held on the stack or in registers and acts as the set of global roots into the heap. If a function accepts an argument of a reference type or allocates then the value is pushed onto the shadow stack. On function exit, the shadow stack is reset back to the way it was when the function was entered.
  • The allocated list gives all of the references that have been allocated by the program. When a function allocates, the value is appended onto the allocated list.

When the allocation quota is exceeeded a garbage collection occurs. This is composed of two phases:

  1. Mark. Starting from the global roots on the shadow stack, all reachable data in the heap are marked as reachable (and the rest left marked as unreachable).
  2. Sweep. The allocated list is traversed, unreachable values are freed and all values have their mark state set back to unreachable.

This functionality requires only one non-trivial feature: given a value of a reference type it must be possible to find out what it references. In OCaml, the uniform representation makes it possible to traverse any value using the same code. HLVM uses a different approach: all values contain a pointer to a function that knows how to traverse values of that type. I called this the "visit function".

A block of OCaml code defines the def_visit, visit and def_visit_array functions. The def_visit function takes an HLVM description of a type and creates an HLVM function that finds all references contained within a value of that type. There are two kinds of reference type in HLVM: unions and arrays. The visit function does this for a non-array type and the def_visit_array function does this for an array type by creating an auxiliary function to traverse the elements of the array.

A block of HLVM code defines gc_mark_all, gc_mark, gc_sweep and gc functions. The gc_mark function traverses the shadow stack and then calls the gc_mark_all function to recursively traverse the entire heap marking values as reachable. The gc_sweep function traverses the allocated list freeing unreachable values and resetting all mark states back to unreachable. Finally, the gc function simply calls gc_mark and then gc_sweep to perform a complete garbage collection.

This simple algorithm, implemented in only a few dozen lines of OCaml code, is all that is required for HLVM to reliably recycle unreachable values. The next stage in the evolution of HLVM's garbage collector will be support for parallel threads by stopping all mutator threads when performing a collection. Once this is complete, HLVM will finally be a competitive platform for high performance parallel numerics. A further improvement will be the implementation of a variant of the Very Concurrent Garbage Collector (VCGC) that allows mutator threads and garbage collection threads to run concurrently, offering both better pause times and overall performance.

The HLVM project is described in detail in the OCaml Journal.

Sunday, 4 October 2009

F# for Technical Computing out now!

Our new book F# for Technical Computing has been published and is now available for order!

This is currently the only published book to cover the latest version of the F# programming language and also covers the following exciting topics:

  • Windows Presentation Foundation for interactive 2D and 3D graphics.
  • The Task Parallel Library for shared-memory parallel programming on multicores and MPI for distributed parallelism on clusters.
  • LINQ for the dissection of XML data.
  • Sequence expressions.
  • Asynchronous workflows for concurrent programming.
  • Functional design patterns (tail calls, untying the recursive knot and continuation passing style).
  • Purely functional data structures (balanced trees, tries, lazy streams and queues).
  • Named and optional arguments.
  • .NET interfaces including IEnumerable, IComparable and IDisposable.
  • Performance in the context of caches and multicores.
  • Reflection.

Every graph in the book was created with our own F# for Visualization library and the complete source code including code to interactively generate all of the graphs is provided as a Visual Studio 2008 project when you buy F# for Technical Computing!

Monday, 28 September 2009

F# vs OCaml: Image Processing

While asking on the OCaml mailing list about the relative merits of our HLVM project, David McClain described a computation that he had been required to perform in the context of scientific computing in the past:

"I remember dealing with stacks of images taken from an infrared imaging sensor -- perhaps 256 images, each of which is, say 512 x 512 pixels. I needed to obtain median pixel values from the stack and produce one median image..."

This problem is easily solved in both OCaml and F# using idiomatic functional programming. For example, the following is a one-line solution written in OCaml: ( (fun gs -> Array.sort compare gs; gs.(m/2))) images

This simply sorts each pixel's sequence into non-descending order and extracts the middle (median) pixel, mapped across the columns and then the rows of the image.

This tiny implementation solves David's example with 226 pixels in 32 seconds, which is fast enough for many applications. However, the performance of this OCaml implementation is heavily degraded by the extensive use of polymorphism at every stage because the current OCaml implementation uses static compilation that can handle polymorphism at run-time and, consequently, requires a uniform run-time representation of all types. This leads to the tagging of integers (which is why OCaml has a 31-bit int type) and the boxing of other types such as float into separate heap-allocated values, both of which are extremely inefficient. This general execution model is always used even if type information is statically available. The only workaround is to inline the Array.sort and functions by hand.

Fortunately, this inefficiency can be overcome by using Just-In-Time (JIT) compilation instead of static compilation and partially specializing polymorphism away before JIT compilation. This is the intended design for polymorphism in HLVM and the inspiration was drawn from Microsoft's excellent implementation of the CLR.

Consequently, the equivalent F# program is 100× faster than the OCaml:

images |> (fun xs -> Array.sortInPlaceWith compare xs; xs.[m/2])

Moreover, the Task Parallel Library makes it easy to parallelize this implemention to improve the performance even further:

Parallel.For(0, n, fun y ->
  for x=0 to n-1 do
    Array.sortInPlaceWith compare images.[y, x])
images |> (fun xs -> xs.[m/2])

This parallel implementation is a whopping 821× faster than the OCaml and, in fact, is a superlinear speedup with respect to the 8 cores because it is able to split the data across more caches.

Friday, 17 July 2009

OCaml vs F#: Burrows Wheeler Transform

The Burrows-Wheeler Transform (BWT) is a block-sorting data compression algorithm that acts as a preconditioner to dramatically improve the compression ratios of subsequent compression phases in many cases. This algorithm is the core of the bzip2 data compression utility that is ubiquitous on Unix and, in particular, is often much more effective than zip and gzip.

The core of the BWT algorithm is easily written in F#:

let cmp (str: _ array) i j =
let rec cmp i j =
if i=str.Length then 1 else
if j=str.Length then -1 else
let c = compare str.[i] str.[j] in
if c<>0 then c else
cmp (i+1) (j+1)
cmp i j

let bwt (str: byte array) =
let n = str.Length
let a = Array.init n (fun i -> i)
Array.sortInPlaceWith (cmp str) a
Array.init n (fun i -> str.[(a.[i] + n - 1) % n])

However, this implementation is very slow and, in particular, is many times slower than the extremely heavily optimized C implementation found in the bzip2 program. The main cause of this performance degradation is the use of a higher-order sortInPlaceWith function that compares array elements using a closure derived from the cmp function. Closure invocation is substantially more expensive than ordinary function invocation.

Fortunately, this is easily solved by writing a generic higher-order sort function in F# and marking it inline to ensure that it will be inlined, resulting in specialization over both the array element type and the comparison function. Incredibly, this simple change creates an F# program that runs at a similar speed to the industrial-strength bzip2 program:

open System.Threading

let inline sort cmp (a: _ array) =
  let inline swap i j =
    let t = a.[i]
    a.[i] <- a.[j]
    a.[j] <- t
  let rec qsort l u =
    if l < u then
      swap l ((l + u) / 2)
      let mutable m = l
      for i=l+1 to u do
        if cmp a.[i] a.[l] < 0 then
          m <- m + 1
          swap m i
      swap l m
      if u-l > 1000 then
        let m = m
        let f = Tasks.Future.Create(fun () -> qsort l (m-1))
        qsort (m+1) u
        qsort l (m-1)
        qsort (m+1) u
  qsort 0 (a.Length-1)

let inline cmp (str: _ array) i j =
  let rec cmp i j =
    if i=str.Length then 1 else
      if j=str.Length then -1 else
        let c = compare str.[i] str.[j] in
        if c<>0 then c else
          cmp (i+1) (j+1)
  cmp i j

let bwt (str: byte array) =
  let n = str.Length
  let a = Array.init n (fun i -> i)
  sort (fun i j -> cmp str i j) a
  Array.init n (fun i -> str.[(a.[i] + n - 1) % n])

The ability to inline higher-order functions in this way, to perform partial specialization, is one of the main advantages of the F# programming language over the previous generations of languages including OCaml. In fact, a direct port of this program to OCaml is 150× slower and, even after considerable effort, we were only able to make the OCaml 2.8× slower than F# and C. This is our best attempt so far:

open Bigarray

type int_array = (int, int_elt, c_layout) Array1.t

type byte_array = (int, int8_unsigned_elt, c_layout) Array1.t

exception Cmp of int

let cmp (str: byte_array) i j =
  let n = Array1.dim str in
  let i = ref i and j = ref j in
    while true do
      if !i = n then raise(Cmp 1) else
        if !j = n then raise(Cmp(-1)) else
          let si = str.{!i} and sj = str.{!j} in
          if si < sj then raise(Cmp(-1)) else
            if si > sj then raise(Cmp 1) else
                incr i;
                incr j
  with Cmp c -> c

let swap (a: int_array) i j =
  let t = a.{i} in
  a.{i} <- a.{j};
  a.{j} <- t

let invoke (f : 'a -> 'b) x : unit -> 'b =
  let input, output = Unix.pipe() in
  match Unix.fork() with
  | -1 -> (let v = f x in fun () -> v)
  | 0 ->
      Unix.close input;
      let output = Unix.out_channel_of_descr output in
      Marshal.to_channel output (try `Res(f x) with e -> `Exn e) [];
      close_out output;
      exit 0
  | pid ->
      Unix.close output;
      let input = Unix.in_channel_of_descr input in
      fun () ->
        let v = Marshal.from_channel input in
        ignore (Unix.waitpid [] pid);
        close_in input;
        match v with
        | `Res x -> x
        | `Exn e -> raise e;;

let sort str (a: int_array) =
  let rec qsort l u =
    if l < u then
        swap a l ((l + u) / 2);
        let m = ref l in
        for i=l+1 to u do
          if cmp str a.{i} a.{l} < 0 then
              incr m;
              swap a !m i
        swap a l !m;
        if u - l > 30000 then
            let f () =
              qsort l (!m - 1);
              Array1.sub a l (!m-l-1) in
            let a' = invoke f () in
            qsort (!m + 1) u;
            let a' = a'() in
            for i=l to !m-2 do
              a.{i} <- a'.{i - l}
            qsort l (!m - 1);
            qsort (!m + 1) u
      end in
  ignore(qsort 0 (Array1.dim a - 1))

let () =
  let file = try Sys.argv.(1) with _ -> "input.txt" in
  let desc = Unix.openfile file [] 777 in
  let str = Array1.map_file desc int8_unsigned c_layout false (-1) in
  let n = Array1.dim str in

  let a = Array1.create int c_layout n in
  for i=0 to n-1 do
    a.{i} <- i
  sort str a;
  for i=0 to n-1 do
    print_char(Char.chr str.{(a.{i} + n - 1) mod n})

  Unix.close desc

Sunday, 12 July 2009

OCaml vs F#: QR decomposition

Recent articles in the OCaml and F#.NET Journals derived high-level implementations of QR decomposition via Householder reductions. This numerical method has many applications, most notably in the computation of best fit parameters of linear sums.

Imperfections in OCaml

The OCaml programming language allows this algorithm to be expressed very elegantly in only 15 lines of code:

# let qr a =
    let m, n = Matrix.dim a in
    let rec qr_aux k q r qa =
      if k = n then q, Matrix.mul r a else
        let v =
          let u = Array.init m (fun i -> if i<k then 0.0 else Matrix.get qa i k) in
          u.(k) <- u.(k) -. norm u;
 (( *. ) (1.0 /. norm u)) u in
        let qr_aux2 m u v = Matrix.sub m (Matrix.outer u v) in
        qr_aux (k+1)
          (qr_aux2 q (Matrix.transform q (2. *| v)) v)
          (qr_aux2 r (2. *| v) (Matrix.transform (Matrix.transpose r) v))
          (qr_aux2 qa v (Matrix.transform (Matrix.transpose qa) (2. *| v))) in
    let i = Matrix.init m m (fun i j -> if i=j then 1.0 else 0.0) in
    qr_aux 0 i i a;;
val qr : Matrix.t -> Matrix.t * Matrix.t = <fun>

This is remarkably concise compared to the equivalent 2,077 lines of Fortran 77 code in LAPACK. However, the OCaml implementation suffers from two main drawbacks:

  1. Abstractions such as higher-order functions degrade performance. If core routines such as arithmetic operations are abstracted (which is necessary to generalize the algorithm over different numeric types such as reals, complexes and symbolic expressions) then the whole program can become over 10× slower.
  2. OCaml's serial run-time prohibits threads from running in parallel, leaving us with few options for parallel programming and none that are competitively performant. However, Unix fork can be used to coarsely parallelize the extraction of the Q and R matrices which can almost double performance for m×n input matrices when m≫n.

Consequently, the OCaml implementation is consistently slower than the specialized routines found in MATLAB.

Perfection in F#

The F# programming language includes a seemingly-innocuous feature called inline that allows functions marked inline to be inlined at compile time. However, inline in F# has some remarkable properties:

  • Functions marked inline are structurally typed so this feature can be used to provide abstraction over different numeric types without any performance degradation at all.
  • This feature even allows functional abstraction with optimal performance as well: by inlining higher-order functions and their function arguments in order to statically-expand completely specialized implementations.

Furthermore, F# inherits a concurrent garbage collector from .NET and Microsoft's Task Parallel Library provides easy-to-use parallel tasks based upon state-of-the-art wait-free work-stealing deques as seen in Cilk. This addresses the remaining imperfection in OCaml.

These features, that are unique to the F# programming language, allow the construction of parallel numerical code that combines the brevity of OCaml with the performance of Fortran. In fact, our generic implementation of QR decomposition in F# is up to 3× faster than the equivalent specialized function in MATLAB.

Tuesday, 7 July 2009

Problems with fslex and fsyacc

Back in 2004, we wrote a mini Mathematica implementation in OCaml in only four days. For fun, we decided to port this OCaml program to F#. Incredibly, porting just the lexer and parser from OCaml to F# has taken longer than it took us to write the entire original OCaml code! The reason is simply that OCaml has an incredibly powerful and mature suite of compiler tools for generating lexers and parsers whereas F# does not.
Given that our original OCaml lexer and parser for Mathematica source code were written using ocamllex and ocamlyacc, it seemed obvious to translate the code into fslex and fsyacc. However, there are some serious problems with the F# versions of these tools:
  1. Incomplete: the F# tools were only ever intended for the F# compiler itself and not for wider use. Consequently, features are missing, e.g. in the regular expression implementation in fslex.
  2. Unintegrated: there are Visual Studio support for these tools and, consequently, development is cumbersome compared to OCaml.
  3. Unreliable: the fslex and fsyacc tools are not officially part of the F# distribution and, although they are used in the F# compiler itself, they are buggy: reporting bogus errors where there are none.
  4. Slow: where OCaml takes 0.05s to generate and compile the lexer and parser, F# takes over 10s. That is 200× slower! Suffice to say, that seriously bogs down development.
One potential advantage of fslex is that simply adding --unicode on the command line generates a lexer for unicode rather than ASCII text. However, we have no use for unicode and, consequently, have not tested this. In contrast, the conventional OCaml solution is to drop the byte-only ocamllex tool in favour of a different lexer generator such as the ulex unicode lexer-generating syntax extension.
There is another choice for lexing and parsing in F#: the FParsec monadic parser combinator library by Stephan Tolksdorf. The current release of this library does not compile with the F# May 2009 CTP but the latest version in the development repository does. The project is split into C# and F# code in order to write some performance-critical sections in C# but this means that it is distributed as two separate DLLs. We shall investigate this option in the future...

Tuesday, 26 May 2009

F# vs Mathematica: Pythagoras tree

The F# News blog recently published an article about the Pythagoras tree that included the complete F# source code for a program that renders the tree. Interestingly, the Wolfram Demonstrations Project contains a similar implementation written in the Mathematica language, which is a domain-specific language designed for interactive technical computing.

Surprisingly, the implementation written in the general purpose F# programming language is not only much faster but is also substantially shorter and clearer. Here is the F# code:

And here is the substantially more complicated Mathematica code:

This example really highlights just how versatile the F# programming language has become without sacrificing the benefits of more conventional languages.

Friday, 8 May 2009

Mathematica bug afflicting our product

Our Mathematica library for wavelet-based time-frequency analysis does not work with the first release of Mathematica 7 due to a bug in Mathematica's FFT routines that silently corrupts data. Any customers using this product are advised to avoid Mathematica version 7.0.0.

Specifically, the Fourier function and all related functions drop the imaginary parts of complex numbers in the result. For example, Mathematica 7.0.0 gives the wrong result here:

In[1]:= Fourier[{0.007 + 0.01 I, -0.002 - 0.0024 I}]
Out[1]= {0.00353553, 0.00636396}

The correct answer in this case is:

In[1]:= Fourier[{0.007 + 0.01 I, -0.002 - 0.0024 I}]
Out[1]= {0.00353553 + 0.00537401 I, 0.00636396 + 0.00876812 I}

If anyone would be interested in us porting this software to other environments, such as Matlab or F#, please let us know.

Sunday, 26 April 2009

When celebrity programmers attack: Guido on tail calls

Guido van Rossum, the celebrity programmer responsible for the Python programming language, published a horribly misinformed blog post last week that was meant to explain why tail calls are "unpythonic" and do not deserve a place in the Python language.

Given the simplicity of the subject matter and the widespread availability of accurate information from experts like Will Clinger, it seems remarkable that anyone would still be confused about tail calls. Although the presence or absence of tail calls in Python is unimportant, this is still serious because Python is the new BASIC and, consequently, Guido is abusing his celebrity status by misguiding a huge number of young programmers. Indeed, this is very visible on the ensuing Reddit discussions where the majority of comments are factually incorrect.


A tail call is a call that appears as the last thing a function does before it returns. In functional languages, this means the caller is returning the result of a function call (the tail call).

Tail call elimination, aka Tail Call Optimization (TCO), optimizes the stack space consumed by tail calls in order to allow the callee to be thought of by the programmer as a continuation of the caller, as if the body of the callee had been placed at the end of caller.

Dispelling the myths

Many people believe that tail calls exist as a workaround for the absence of loops in functional languages, allowing loops to be rewritten as tail recursive functions. This is a myth. In reality, many functional languages including both OCaml and F# provide loops as core language constructs and both loops and tail calls are widely used in such languages.

Many old programmers believe that tail call elimination is an academic curiosity. This is no longer true. In reality, tail call elimination has been implemented in .NET for eight years and will become a fully supported mainstream feature when Visual Studio 2010 is released next year. LLVM (used by Apple) also has full support for tail call elimination.

Some people misinterpret the "optimization" in TCO to mean that tail calls are supposed to be fast. In fact, TCO is an optimization in space to reduce memory consumption and tail calls can even be slower than non-tail calls, e.g. ILX tail calls on .NET.

Many people assume that the function called by a tail call must be statically known and some people even believe that tail calls may only call the function they are in. These are myths. In reality, all calls in tail position are tail calls and tail call elimination should handle all tail calls whether they are to statically known functions or not, i.e. given a variable containing a function you can tail call that function safe in the knowledge that the space consumption will be eliminated. Indeed, this is the essence of the revelation.

The revelation

Almost all non-functional programmers are unaware that tail calls facilitate a programming paradigm that they have never seen.

The ability to tail call to functions that are not statically known is the foundation that makes many combinators useful. This is a style of programming where functions are composed in order to create a pipeline for values to flow through. Without tail call elimination, every stage in the pipeline would leak stack space and the whole approach becomes unusably unreliable. The only workaround is tantamount to replacing the calling convention and destroys many valuable features including interoperability and the ability to dynamically load libraries.

This is the real reason why tail calls are valuable.

Friday, 17 April 2009

What O'Reilly don't want authors to know

Mike Hendrickson has been busy updating O'Reilly's analysis of the state of the computer book market by programming language. That means it is time for us to reiterate how authors of decent books can earn far more for their work by cutting out the middlemen including trade publishers like O'Reilly.

Traditional book publishers are a dying breed. Aside from e-books, they have been driven out by an increasing number of so-called "self-published" books. In the context of software development, this is particularly common around non-mainstream subjects and includes titles such as OCaml for Scientists and Programming in Scala. O'Reilly's analysis excluded all such books even though they are far more profitable for authors.

In order to make a case for self-publishing it is necessary to present some information about a variety of existing books:

As the OCaml and F# markets are similar sizes and the contents of OCaml for Scientists and F# for Scientists are similar, it is interesting to note that OCaml for Scientists was 22× more profitable for its author in Q4 2008 even though F# for Scientists was a new book published through a major publisher and advertised and sold on Amazon. The reason is that John Wiley & Sons have done virtually nothing to market this F# book besides placing it on Amazon and, in fact, Wiley only managed to sell one copy direct outside the US! This really highlights just how little value a trade publisher and Amazon add.

Haskell is currently receiving far more press than any other functional programming language and, consequently, Haskell books are among the best sellers with 1,491 copies sold in 2008. However, with prices like that of Real World Haskell, they would need to be selling 20,000 copies a year to compete with OCaml for Scientists in terms of the profit made by the author.

Despite being self-published, Programming in Scala is outselling Real World Haskell according to its Amazon sales rank.

Finally, companies like IdTechEx prove that self-publishing can be the foundation of a serious business on the scale of a major trade publisher like O'Reilly. This is a consequence of the characteristic V-curve of profit found in many markets where cheap and expensive luxury products give locally maximal profits. Self-publishing naturally lends itself to higher quality and higher value books.

The moral of this story is that self-publishing is the most rewarding way for good authors to reach their readers. If you are the author of a decent book, please consider publishing it yourself.

Friday, 10 April 2009

Sadek Drobi interviews Don Syme about the history of F#

Sadek Drobi has interviewed many interesting people in the past and his interviews are of particular interest because Sadek is not afraid to ask difficult questions.

Sadek recently published an interview with Don Syme, a lead developer of .NET generics and F#. In particular, Sadek asks why the experiments at Microsoft Research in writing SML and Haskell implementations on top of .NET led Don to build an eager language rather than a non-strict language like Haskell. Don describes the "dissonance" between non-strict evaluation and .NET that culminates in excessive use of monads making it unnecessarily tedious to reuse the framework, as well as the efficiency and debugging problems imposed by non-strict evaluation.

Don then goes on to discuss some of the other features of Haskell that F# does draw upon, such as type classes. In particular, Don explains how he carefully separated the proven foundations of such features, such as the use of type classes for overloading as they were originally intended, from the experimental aspects that bring a "surprising number of technical problems".

Overall this was an extremely interesting interview that touched upon many fascinating aspects of state-of-the-art language design in the context of industrial-strength programming language implementations.

Wednesday, 8 April 2009

Haskell in industry!!!

A dutch startup called Tupil just announced that their first commercial application written in Haskell has gone live.

Tupil describe themselves as:

Tupil is Chris Eidhof and Eelco Lempsink. We code for a living, using functional programming languages like Haskell. Technology is our passion and we just love building great software tools. After working for years as web developers at very nice companies, we decided to go solo, together.

Credible success stories of bleeding-edge functional languages like Haskell being used in industry are always great to see. Keep up the good work!

Tuesday, 7 April 2009

More on Haskell's hash table problems

We recently proved that Haskell's defacto-standard hash table implementation is extremely slow. So slow that optimized native-code compiled Haskell is even slower than interpreted Python and 32× slower than a decent hash table implementation! If that result were not remarkable enough, we have now discovered that the history of this issue is even more surprising...

Haskell programmers have been complaining about the poor performance of its hash table implementation for at least 3 years. The cause of the problem, a bug in the garbage collector, was identified a long time ago but that bug was never fixed. Instead, it appears that the Haskell developers went on a crusade to persuade the world that decent hash table implementations are not advantageous. For example, the book "Real World Haskell" by Don Stewart et al. explicitly states:

"Compared to a hash table, a well-implemented purely functional tree data structure will perform competitively. You should not approach trees with the assumption that your code will pay a performance penalty."

Perhaps more surprising, given Haskell's uniquely bad support for hash tables, is the fact that "Real World Haskell" is one of only three books cited from the Wikipedia article on this subject (!).

Suffice to say, there is overwhelming evidence that trees are not competitively performant when compared to hash tables. Indeed, that is precisely why hash tables are such a ubiquitous data structure. In general, we have found that trees are typically 10× slower than hash tables in practice for construction and lookup with around a million elements. That is supported by our recent benchmark where a hash table in F# took 0.7 and the fastest tree-based implementation written in Haskell (which is extremely heavily optimized for trees) was still 11× slower, taking over 7s to complete the same task.

Also, the Haskell entry for the k-nucleotide benchmark on the shootout contains the comment:

-- Note: Hash tables are not generally used in functional languages, so there
-- are no available library implementations for Haskell. This benchmark
-- requires a hash table. This is why I have implemented the hash table here.

And, again, this is just completely factually incorrect. Hash tables are just as popular in functional languages (like OCaml, F#, Scala, Clojure, Mathematica) as they are in imperative and object oriented languages for the simple reason that they are a great data structure. For example, the hash table implementation bundled with OCaml is so efficient that even the uniprocessor OCaml implementation of knucleotide thrashes the parallel Haskell implementation.

Saturday, 4 April 2009

Another serious bug in GHC

Peter Verswyvelen of Nazooka, one of the only industrial users of Haskell, uncovered another serious bug in GHC, the defacto-standard Haskell compiler. The bug afflicts floating point arithmetic in external libraries such as Qt, GTK, OpenGL, BLAS, LAPACK and FFTW on x86.

Peter said "It's insane this bug did not cause any more problems. Every call into every C function that uses floating point could have been affected". Given that this bug had gone unnoticed for so long, the only logical conclusion is that the GHC FFI is immature because it has very few users. This concurs with our previous findings that no mature software has ever been written in Haskell.

Moreover, this bug joins the list of serious FFI-related bugs in GHC along with the subtle FFI bug in GHC that helped to destroy Darcs and a memory alignment bug in GHC's FFI that caused segmentation faults.

Suffice to say, the moral of this story is that even the most advanced high-level programming language implementations have a real achilles heel when it comes to interoperability with software written in existing unsafe languages like C and C++. This is one of the major issues we are addressing in HLVM through the use of LLVM's mature and well-tested C FFI.

Friday, 3 April 2009

F# vs OCaml vs Haskell: hash table performance

I just stumbled upon an interesting benchmark. Put mappings i -> i for i in 1..10M in a hash table, print the entry for 100. Don Stewart's optimized Haskell:

import Control.Monad
import qualified Data.HashTable as H

main = do
m <- (==) H.hashInt
forM_ [1..10000000] $ \n -> H.insert m n n
v <- H.lookup m 100
print v

My OCaml translation:

let n = 10000000

let () =
let m = Hashtbl.create n in
for n=1 to n do
Hashtbl.add m n n
Printf.printf "%d\n%!" (Hashtbl.find m 100)

My F# translation:

let m = System.Collections.Generic.Dictionary()
for i = 1 to 10000000 do
m.[i] <- i
printf "%d\n" m.[100]


Haskell: 22.8s
OCaml: 2.82s
F#: 0.706s

I found it remarkable that the performance difference is so large: F# is 32× faster than the optimized Haskell here!

Thanks to everyone who has replied. Apparently this is a fundamental design flaw in GHC's garbage collector which is incapable of handling the mutation of arrays of boxed values (e.g. hash table buckets) efficiently. Unfortunately, this means that Haskell's defacto-standard compiler is incapable of handling any data structure that maps values onto values efficiently. The nearest alternative is a purely functional tree-based map but that is not only still 11× slower than F# but it culminates in a data structure that is then asymptotically slower to search as well:

import Data.IntMap
import Data.List

main = print $ m ! 100
where m = fromDistinctAscList [(i,i) | i <- [1..10000000]]

This program still takes 7.519s.

Perhaps the most remarkable find is not that OCaml and F# thrash Haskell on this benchmark (which is to be expected: OCaml and F# are not toys) but that even a Python one-liner thrashes the most heavily optimized Haskell code:

$ time python -c 'import psyco;psyco.full();print dict((i,i) for i in xrange(10000000))[100]'

real 0m5.738s
user 0m5.428s
sys 0m0.308s

If only Haskell programmers could take some time away from writing Fibonacci functions perhaps they could also build some kind of adequate compiler.

Sunday, 29 March 2009

Current shadow stack overheads in HLVM

Many people, including the developers of Haskell (see here) and OCaml (see here) compilers, have been quick to dismiss the shadow stack algorithm for performance reasons.

Despite these concerns, we chose to use a shadow stack in HLVM for several reasons:

  • GC performance is low on the list of priorities for HLVM.
  • Shadow stacks are easy to implement entirely from within HLVM whereas conventional strategies entail injecting stack maps into stack frames on the normal stack and that requires low-level hacking with LLVM's experimental GC API in C++.
  • Shadow stacks are easy to debug because they are just another data structure in the heap.
  • HLVM provides typed nulls and the ability to read array lengths or type constructor tags without an indirection. This requires the use of an unconventional struct-based reference type that is incompatible with LLVM's current GC API that can only handle individual pointers.

Consequently, it is interesting to use HLVM to measure just how expensive its GC and shadow stack implementations are. This cannot be done directly because the shadow stack is essential for garbage collections to work so removing the shadow stack must also remove the whole GC. However, we can benchmark with full GC enabled, with shadow stack updates but no collections and with GC completely disabled:

The difference between the last two measures gives an estimate of the time spent updating the shadow stack and we can present that as a ratio of the original running time:

This is only an estimate because disabling the GC affects the performance of the rest of the system, most notably leaving more allocated blocks for malloc to handle and degrading cache coherence by cold starting all allocated values.

These results show that the time spent manipulating the shadow stack is:

  • Insignificant for the fib, ffib, mandel and mandel2 benchmarks, which is expected because their inner loops act on value types that the GC is oblivious to.
  • Between 10 and 25% for the sieve, Array.fold, List.init and gc benchmarks, which is expected because they use reference types in their inner loops.
  • Around 70% in the case of the 10-queens benchmark, which was entirely unexpected!

We can also see that the List.init benchmark is spending most of its time performing collections.

These results are very encouraging for two reasons:

  • HLVM's comparatively poor performance on the List.init benchmark is not due to the shadow stack but, rather, is due to inefficiencies in our collection algorithm. Specifically, the mark phase uses a hash table with a fixed number of buckets that scales poorly when the heap contains many allocated blocks. Increasing the size of the hash table provides a 36% performance improvement.
  • HLVM's comparatively poor performance on the list-based 10-queens benchmark is due to the presence of reference types in the inner loop: the List.filter function. Specifically, the environment of the predicate closure and the list itself. Inlining and/or CSE will go a long way to eliminating this overhead.

Suffice to say, we are very happy with these results and intend to continue using the shadow stack algorithm.

Saturday, 28 March 2009

Hackage download stats vs OCaml

Donald Bruce Stewart recently published an in-depth analysis of Hackage's download statistics that shows many interesting characteristics:

  • All of the 1,200 Haskell softwares on Hackage combined have been downloaded as many times in total as a single OCaml program, MLDonkey, from Sourceforge (let alone OCaml's other popular software like FFTW and Unison).
  • Over half of the packages on Hackage have been downloaded under 300 times.
  • Dependencies are multiply counted so these figures are a substantial over estimate.

This confirms our suspicions that the few lines of real-world Haskell software ever written have been tested by only a handful of people. Perhaps Haskell will become more mainstream in the future but it seems to be quite a long way off today.

HLVM's first front-end

Alp Mestan has begun work on the first front-end for HLVM. This project has many important goals:

  • Provide a high-level language that HLVM developers can write test code, benchmarks and garbage collectors in.
  • Serve as a tutorial for the people porting existing compilers such as Moscow ML, NekoML and PolyML to HLVM.
  • Track new features as they are added to HLVM, such as closures, parametric polymorphism, parallelism and so on.

Alp intends to apply to Jane St. Capital to fund a summer project that will port the entire OCaml compiler to HLVM.

Performance: OCaml vs HLVM beta 0.4

A quick update due to a surprise performance result for HLVM!

We knew that manipulating the shadow stack was a major slowdown in HLVM from the change in our benchmark results when the GC was first introduced a few weeks ago but we did not expect that a simple local optimization, unrolling, could go so far towards recovering performance. Moreover, this optimization was implemented in only a few lines of OCaml code.

The following results prove that HLVM now produces substantially faster programs than ocamlopt for numerical tasks on x86:

Interestingly, now that HLVM supports standalone compilation using LLVM's IR optimization passes as well as unoptimized JIT compilation, we see that LLVM's optimization passes only give small performance improvements in many cases and even substantially degrade performance on our 10-queens benchmark. This is a direct result of HLVM producing near optimal code directly (which greatly reduces compile times) and has eliminated our desire to add support for LLVM's optimization passes. We shall focus on inlining and then high-level optimization passes instead, most notably the elimination of closures.

Friday, 27 March 2009

Memory footprints: OCaml vs HLVM

Someone recently published a blog article about the sizes of values in the Ruby and OCaml programming languages.

This is also interesting in the context of HLVM because our high performance goals also require an efficient memory representation but our solution is quite different to OCaml's because HLVM is designed for numerical performance whereas OCaml was designed for symbolic performance.

The following table lists the sizes of values of several different types in OCaml and HLVM on 32-bit architectures:

unit32 bits32 bits
bool32 bits8 bits
int3132 bitsNone
int3296 bits32 bits
int64128 bits64 bits
float32None32 bits
float64128 bits64 bits
float * float320 bits128 bits
Enumeration32 bits96 bits
float64 array64+64n bits96+64n bits

Note how HLVM uses the most memory efficient representations possible for ints, floating point numbers and tuples but uses less efficient representations for reference types such as arrays and variant types. A side-effect is that heaps in HLVM contain far fewer pointers than heaps in OCaml and this greatly reduces the stress on the garbage collector.

Tuesday, 17 March 2009

Performance: OCaml vs HLVM beta 0.3

Our HLVM project is evolving rapidly and recently saw its first release with a working garbage collector. This has allowed us to make some performance measurements and, as we had hoped, the results are extremely compelling.

The following graph illustrates the time taken to perform each of the benchmarks in the HLVM test suite on one of the 2.1GHz Opteron 2352 cores in an 8-core Dell PowerEdge:

The individual benchmarks are:

  • fib: The usual Fibonacci function.
  • ffib: A floating-point version of the Fibonacci function.
  • sieve: Sieve of Eratosthenes to find all of the primes under 10^8 and print the last one using a byte array.
  • mandel: Mandelbrot rendering with float arithmetic.
  • mandel2: Mandelbrot rendering with abstracted complex arithmetic where complex numbers are pairs of floating point numbers.
  • Array.fold: Initialize a 10^8-element float array and fold of a float * float pair over it.
  • List.init/fold: Initialize a 10^6-element int list and fold over it (allocation intensive).
  • 10-queens: Solve the 10-queens problem by filtering lists (allocation and GC intensive).

These results show that the latest HLVM is already faster than OCaml on three tests (ffib, mandel and mandel2), gives comparable performance on three more tests (fib, sieve and Array.fold) and is substantially slower on the allocation and GC-intensive list and 10-queens benchmarks.

The third set of results labeled "HLVM unsafe" refer to results obtained by manually avoiding manipulation of the shadow stack when it is not necessary (note that the GC still runs exactly as before). This is the next major optimization that HLVM is likely to automate. With this optimization, HLVM also outperforms OCaml on the sieve and Array.fold benchmarks.

Suffice to say, we are very happy with these results and are now encouraging others to invest in building upon HLVM now that the basic infrastructure is in place. We have already had interest from the developers of HaXe and PolyML in using HLVM as a new VM for their compilers and another group are applying to the Jane St. Summer Project 2009 for funding to port INRIA's OCaml compiler to HLVM.

The next major step for HLVM is support for parallel threads. This will be implemented by cooperatively suspending all threads, performing a GC using the current implementation and then resuming the threads. This alone will make HLVM significantly more capable than OCaml when it comes to parallel computing and our next set of benchmark results will include parallel programs!

Saturday, 7 March 2009

HLVM has been released

We have made the first public release of our much anticipated High-Level Virtual Machine (HLVM) project. The complete source code is now available from the HLVM project page on the OCaml Forge via the subversion repository.

HLVM makes extensive use of some features only recently added to LLVM and, consequently, requires a patched version of LLVM 2.4 or later.

Even at this early stage, our performance results are extremely compelling. HLVM is already several times faster on a variety of benchmarks than some notably high-performance functional programming languages such as OCaml. HLVM is specifically designed for technical computing and we believe it will offer the best performance for numerical computing of any high-level language including easy-to-use support for parallelism. We also intend to provide extensive support for visualization using OpenGL and make HLVM a viable platform for both free and commercial software in the future.

This open source project has been solely funded by Flying Frog Consultancy. If you would like to contribute funding, resources or help to develop HLVM yourself please get in contact with us.

Saturday, 31 January 2009

Mono 2.2 still leaks memory

We have previously discussed the fact that Mono is still built upon a conservative garbage collector (Boehm's GC). This means that Mono is not capable of identifying exactly what data is reachable and, consequently, has to resort to conservative guesses that can fail to deallocate garbage, i.e. leaking memory.

Boehm's own literature describes situations where the GC might be expected to leak (lazy lists and queues) but claims that no case has even been found in practice and they could not even construct a contrived example where memory was actually leaked. Readers of our previous posts have stated that our claims of memory leaks are "bogus". So we decided to put this issue to rest.

The following trivial F# program creates a cyclic list representing a queue, adds one element and then repeatedly adds one element and removes it again:

type 'a cell = { content: 'a; mutable next: 'a cell option }

let mutable tail = None
if tail = None then
let cell = { content = [||]; next = None } <- Some cell
tail <- Some cell
while true do
let tail' = Option.get tail
let cell = Some { content = [|1.;2.;3.;4.|]; next = tail'.next }
tail'.next <- cell
tail <- cell
let tail' = Option.get tail
tail'.next <- (Option.get tail'.next).next

This obviously requires only enough memory for at most two queue items, so any memory leaks will be obvious. Running this program on .NET, its memory consumption is steady at 11Mb. Running this program on Mono 2.2, the entire memory of the computer is leaked away in 60 seconds, the OS goes to swap and everything grinds to a halt.

We have also described situations where Mono 2.2 leaks stack space until the stack overflows. These results may be of interest to anyone else trying to find a usable VM to build upon.

Friday, 30 January 2009

Mono does not support tail call elimination

Tail call elimination is an essential feature for functional programming languages. This feature simply refers to the ability for one function to end with a function call without leaking stack space. Tail call elimination can be essential in heavily functional code such as parser combinators and libraries written in continuation passing style.

Lack of tail calls on the defacto-standard JVM has raised concerns around languages like Scala and Clojure. Fortunately, work is underway to address this issue by implementing tail call elimination on MLVM/OpenJDK.

Microsoft's .NET has had tail call elimination for eight years and the Mono project is designed to replicate the Common Language Infrastructure (CLI) that is responsible for this feature. The Mono team have been advertising tail call elimination on their VM for many years.

However, when we tested the elimination of tail calls on several VMs we discovered that tail calls are seriously broken on Mono but work on other VMs including .NET, LLVM and, of course, OCaml. Specifically, we ran the following trivial F# program that relies upon a tail call to a dynamic location, a function odd that is passed as an argument to the higher-order even function:

let even odd n = odd(n+1)

let odd n =
printf "%d\n" n
even odd (n+1)

let (_: int) =
even odd 0

This program prints odd integers indefinitely in OCaml or F# on .NET but attempting to run the executable generated by F# on Mono 2.2 results in a stack overflow almost immediately, as Mono leaks stack space until the stack is exhausted. So Mono cannot be relied upon to execute any functional programs correctly including F# programs.

This is of particular interest to us because several people have asked us for customer support when trying to run our F# code, such as the code from our book F# for Scientists, under Mono. Our advice is to steer well clear of Mono until it becomes stable and feature complete.

Interestingly, we also tried the same test on LLVM 2.2, writing the program in LLVM IR, and the tail calls are handled flawlessly provided the fast calling convention is used. As we have shown before, LLVM also generates vastly faster code than Mono 2.2.

Mono 2.2 vs OCaml vs .NET vs LLVM vs JDK

Mono 2.2 was recently shipped and a new JIT code generator was the main selling point promoted by the Mono project because it provides the first significant performance improvement Mono has seen for many years. We used the F# version of the SciMark2 benchmark to benchmark this new version of Mono against Mono 2.0 and some competing VMs using their native SciMark2 implementations:

The composite results show that the new code generator in Mono is indeed substantially better at producing code that is 40% faster on average than the previous code generator from Mono 2.0 according to these results. However, Mono 2.2 is still a long way behind its competitors. Specifically, both LLVM and Sun's JVM are over 2.2× faster than the latest version of Mono.

SciMark2 is composed of five different tests: Fast Fourier transform, successive over-relaxation, Monte Carlo, sparse matrix-vector multiply and LU matrix decomposition. We can get more detailed comparison of the VMs by looking at the individual scores on each of these five tests:

We can see immediately that the Monte Carlo test from the SciMark2 benchmark had anomalous results. Specifically, Mono 2.0 was 12× slower than LLVM and Mono 2.2 is now only 3× slower than LLVM. The Monte Carlo test is an int intensive random number generator.

These performance improvements in Mono are compelling but we are still deterred by some fundamental design flaws that undermine Mono's reliability.