Monday, 28 September 2009

F# vs OCaml: Image Processing

While asking on the OCaml mailing list about the relative merits of our HLVM project, David McClain described a computation that he had been required to perform in the context of scientific computing in the past:

"I remember dealing with stacks of images taken from an infrared imaging sensor -- perhaps 256 images, each of which is, say 512 x 512 pixels. I needed to obtain median pixel values from the stack and produce one median image..."

This problem is easily solved in both OCaml and F# using idiomatic functional programming. For example, the following is a one-line solution written in OCaml:

Array.map (Array.map (fun gs -> Array.sort compare gs; gs.(m/2))) images

This simply sorts each pixel's sequence into non-descending order and extracts the middle (median) pixel, mapped across the columns and then the rows of the image.

This tiny implementation solves David's example with 226 pixels in 32 seconds, which is fast enough for many applications. However, the performance of this OCaml implementation is heavily degraded by the extensive use of polymorphism at every stage because the current OCaml implementation uses static compilation that can handle polymorphism at run-time and, consequently, requires a uniform run-time representation of all types. This leads to the tagging of integers (which is why OCaml has a 31-bit int type) and the boxing of other types such as float into separate heap-allocated values, both of which are extremely inefficient. This general execution model is always used even if type information is statically available. The only workaround is to inline the Array.sort and Array.map functions by hand.

Fortunately, this inefficiency can be overcome by using Just-In-Time (JIT) compilation instead of static compilation and partially specializing polymorphism away before JIT compilation. This is the intended design for polymorphism in HLVM and the inspiration was drawn from Microsoft's excellent implementation of the CLR.

Consequently, the equivalent F# program is 100× faster than the OCaml:

images |> Array2D.map (fun xs -> Array.sortInPlaceWith compare xs; xs.[m/2])

Moreover, the Task Parallel Library makes it easy to parallelize this implemention to improve the performance even further:

Parallel.For(0, n, fun y ->
  for x=0 to n-1 do
    Array.sortInPlaceWith compare images.[y, x])
images |> Array2D.map (fun xs -> xs.[m/2])

This parallel implementation is a whopping 821× faster than the OCaml and, in fact, is a superlinear speedup with respect to the 8 cores because it is able to split the data across more caches.

6 comments:

Richard Minerich said...

Very cool. I would love to see more detailed information on how this JIT optimization happens.

Art said...

... and beyound!!

Jules said...

Interesting speedup!

You store images indexed first by pixel coordinate then by image?

Jules said...

Is the OCaml benchmark running on HLVM? In the beginning of the post you make it sound like you're benchmarking HLVM, but then you say "the current OCaml implementation".

Flying Frog Consultancy Ltd. said...

@Richard

The JIT optimization is actually quite straightforward: whenever the JIT compiler sees a polymorphic function it determines the type variables of callees from those of the caller and calls into a version that was compiled specifically for those types. Reference types (classes) are treated equivalently in this context because they always have the same uniform representation: a pointer.


@Jules

The images are indexed first by pixel coordinate then by image, yes.

The OCaml benchmark was compiled with ocamlopt. Polymorphism and this optimization are not yet implemented in HLVM.

Laurent Le Brun said...

Hi,

Interesting results and blog post.

I tried your benchmark on Linux. For information, here are the results:
- F# 1.9.7.8 and Mono 2.4.2.3: 6,682 sec.
- Ocamlopt 3.11.0: 13,679 sec.

A good speedup, but not the factor 100 (mono is not as optimized as MS .Net).