Friday, 17 July 2009

OCaml vs F#: Burrows Wheeler Transform

The Burrows-Wheeler Transform (BWT) is a block-sorting data compression algorithm that acts as a preconditioner to dramatically improve the compression ratios of subsequent compression phases in many cases. This algorithm is the core of the bzip2 data compression utility that is ubiquitous on Unix and, in particular, is often much more effective than zip and gzip.

The core of the BWT algorithm is easily written in F#:

let cmp (str: _ array) i j =
let rec cmp i j =
if i=str.Length then 1 else
if j=str.Length then -1 else
let c = compare str.[i] str.[j] in
if c<>0 then c else
cmp (i+1) (j+1)
cmp i j

let bwt (str: byte array) =
let n = str.Length
let a = Array.init n (fun i -> i)
Array.sortInPlaceWith (cmp str) a
Array.init n (fun i -> str.[(a.[i] + n - 1) % n])

However, this implementation is very slow and, in particular, is many times slower than the extremely heavily optimized C implementation found in the bzip2 program. The main cause of this performance degradation is the use of a higher-order sortInPlaceWith function that compares array elements using a closure derived from the cmp function. Closure invocation is substantially more expensive than ordinary function invocation.

Fortunately, this is easily solved by writing a generic higher-order sort function in F# and marking it inline to ensure that it will be inlined, resulting in specialization over both the array element type and the comparison function. Incredibly, this simple change creates an F# program that runs at a similar speed to the industrial-strength bzip2 program:

open System.Threading

let inline sort cmp (a: _ array) =
  let inline swap i j =
    let t = a.[i]
    a.[i] <- a.[j]
    a.[j] <- t
  let rec qsort l u =
    if l < u then
      swap l ((l + u) / 2)
      let mutable m = l
      for i=l+1 to u do
        if cmp a.[i] a.[l] < 0 then
          m <- m + 1
          swap m i
      swap l m
      if u-l > 1000 then
        let m = m
        let f = Tasks.Future.Create(fun () -> qsort l (m-1))
        qsort (m+1) u
        qsort l (m-1)
        qsort (m+1) u
  qsort 0 (a.Length-1)

let inline cmp (str: _ array) i j =
  let rec cmp i j =
    if i=str.Length then 1 else
      if j=str.Length then -1 else
        let c = compare str.[i] str.[j] in
        if c<>0 then c else
          cmp (i+1) (j+1)
  cmp i j

let bwt (str: byte array) =
  let n = str.Length
  let a = Array.init n (fun i -> i)
  sort (fun i j -> cmp str i j) a
  Array.init n (fun i -> str.[(a.[i] + n - 1) % n])

The ability to inline higher-order functions in this way, to perform partial specialization, is one of the main advantages of the F# programming language over the previous generations of languages including OCaml. In fact, a direct port of this program to OCaml is 150× slower and, even after considerable effort, we were only able to make the OCaml 2.8× slower than F# and C. This is our best attempt so far:

open Bigarray

type int_array = (int, int_elt, c_layout) Array1.t

type byte_array = (int, int8_unsigned_elt, c_layout) Array1.t

exception Cmp of int

let cmp (str: byte_array) i j =
  let n = Array1.dim str in
  let i = ref i and j = ref j in
    while true do
      if !i = n then raise(Cmp 1) else
        if !j = n then raise(Cmp(-1)) else
          let si = str.{!i} and sj = str.{!j} in
          if si < sj then raise(Cmp(-1)) else
            if si > sj then raise(Cmp 1) else
                incr i;
                incr j
  with Cmp c -> c

let swap (a: int_array) i j =
  let t = a.{i} in
  a.{i} <- a.{j};
  a.{j} <- t

let invoke (f : 'a -> 'b) x : unit -> 'b =
  let input, output = Unix.pipe() in
  match Unix.fork() with
  | -1 -> (let v = f x in fun () -> v)
  | 0 ->
      Unix.close input;
      let output = Unix.out_channel_of_descr output in
      Marshal.to_channel output (try `Res(f x) with e -> `Exn e) [];
      close_out output;
      exit 0
  | pid ->
      Unix.close output;
      let input = Unix.in_channel_of_descr input in
      fun () ->
        let v = Marshal.from_channel input in
        ignore (Unix.waitpid [] pid);
        close_in input;
        match v with
        | `Res x -> x
        | `Exn e -> raise e;;

let sort str (a: int_array) =
  let rec qsort l u =
    if l < u then
        swap a l ((l + u) / 2);
        let m = ref l in
        for i=l+1 to u do
          if cmp str a.{i} a.{l} < 0 then
              incr m;
              swap a !m i
        swap a l !m;
        if u - l > 30000 then
            let f () =
              qsort l (!m - 1);
              Array1.sub a l (!m-l-1) in
            let a' = invoke f () in
            qsort (!m + 1) u;
            let a' = a'() in
            for i=l to !m-2 do
              a.{i} <- a'.{i - l}
            qsort l (!m - 1);
            qsort (!m + 1) u
      end in
  ignore(qsort 0 (Array1.dim a - 1))

let () =
  let file = try Sys.argv.(1) with _ -> "input.txt" in
  let desc = Unix.openfile file [] 777 in
  let str = Array1.map_file desc int8_unsigned c_layout false (-1) in
  let n = Array1.dim str in

  let a = Array1.create int c_layout n in
  for i=0 to n-1 do
    a.{i} <- i
  sort str a;
  for i=0 to n-1 do
    print_char(Char.chr str.{(a.{i} + n - 1) mod n})

  Unix.close desc

Sunday, 12 July 2009

OCaml vs F#: QR decomposition

Recent articles in the OCaml and F#.NET Journals derived high-level implementations of QR decomposition via Householder reductions. This numerical method has many applications, most notably in the computation of best fit parameters of linear sums.

Imperfections in OCaml

The OCaml programming language allows this algorithm to be expressed very elegantly in only 15 lines of code:

# let qr a =
    let m, n = Matrix.dim a in
    let rec qr_aux k q r qa =
      if k = n then q, Matrix.mul r a else
        let v =
          let u = Array.init m (fun i -> if i<k then 0.0 else Matrix.get qa i k) in
          u.(k) <- u.(k) -. norm u;
 (( *. ) (1.0 /. norm u)) u in
        let qr_aux2 m u v = Matrix.sub m (Matrix.outer u v) in
        qr_aux (k+1)
          (qr_aux2 q (Matrix.transform q (2. *| v)) v)
          (qr_aux2 r (2. *| v) (Matrix.transform (Matrix.transpose r) v))
          (qr_aux2 qa v (Matrix.transform (Matrix.transpose qa) (2. *| v))) in
    let i = Matrix.init m m (fun i j -> if i=j then 1.0 else 0.0) in
    qr_aux 0 i i a;;
val qr : Matrix.t -> Matrix.t * Matrix.t = <fun>

This is remarkably concise compared to the equivalent 2,077 lines of Fortran 77 code in LAPACK. However, the OCaml implementation suffers from two main drawbacks:

  1. Abstractions such as higher-order functions degrade performance. If core routines such as arithmetic operations are abstracted (which is necessary to generalize the algorithm over different numeric types such as reals, complexes and symbolic expressions) then the whole program can become over 10× slower.
  2. OCaml's serial run-time prohibits threads from running in parallel, leaving us with few options for parallel programming and none that are competitively performant. However, Unix fork can be used to coarsely parallelize the extraction of the Q and R matrices which can almost double performance for m×n input matrices when m≫n.

Consequently, the OCaml implementation is consistently slower than the specialized routines found in MATLAB.

Perfection in F#

The F# programming language includes a seemingly-innocuous feature called inline that allows functions marked inline to be inlined at compile time. However, inline in F# has some remarkable properties:

  • Functions marked inline are structurally typed so this feature can be used to provide abstraction over different numeric types without any performance degradation at all.
  • This feature even allows functional abstraction with optimal performance as well: by inlining higher-order functions and their function arguments in order to statically-expand completely specialized implementations.

Furthermore, F# inherits a concurrent garbage collector from .NET and Microsoft's Task Parallel Library provides easy-to-use parallel tasks based upon state-of-the-art wait-free work-stealing deques as seen in Cilk. This addresses the remaining imperfection in OCaml.

These features, that are unique to the F# programming language, allow the construction of parallel numerical code that combines the brevity of OCaml with the performance of Fortran. In fact, our generic implementation of QR decomposition in F# is up to 3× faster than the equivalent specialized function in MATLAB.

Tuesday, 7 July 2009

Problems with fslex and fsyacc

Back in 2004, we wrote a mini Mathematica implementation in OCaml in only four days. For fun, we decided to port this OCaml program to F#. Incredibly, porting just the lexer and parser from OCaml to F# has taken longer than it took us to write the entire original OCaml code! The reason is simply that OCaml has an incredibly powerful and mature suite of compiler tools for generating lexers and parsers whereas F# does not.
Given that our original OCaml lexer and parser for Mathematica source code were written using ocamllex and ocamlyacc, it seemed obvious to translate the code into fslex and fsyacc. However, there are some serious problems with the F# versions of these tools:
  1. Incomplete: the F# tools were only ever intended for the F# compiler itself and not for wider use. Consequently, features are missing, e.g. in the regular expression implementation in fslex.
  2. Unintegrated: there are Visual Studio support for these tools and, consequently, development is cumbersome compared to OCaml.
  3. Unreliable: the fslex and fsyacc tools are not officially part of the F# distribution and, although they are used in the F# compiler itself, they are buggy: reporting bogus errors where there are none.
  4. Slow: where OCaml takes 0.05s to generate and compile the lexer and parser, F# takes over 10s. That is 200× slower! Suffice to say, that seriously bogs down development.
One potential advantage of fslex is that simply adding --unicode on the command line generates a lexer for unicode rather than ASCII text. However, we have no use for unicode and, consequently, have not tested this. In contrast, the conventional OCaml solution is to drop the byte-only ocamllex tool in favour of a different lexer generator such as the ulex unicode lexer-generating syntax extension.
There is another choice for lexing and parsing in F#: the FParsec monadic parser combinator library by Stephan Tolksdorf. The current release of this library does not compile with the F# May 2009 CTP but the latest version in the development repository does. The project is split into C# and F# code in order to write some performance-critical sections in C# but this means that it is distributed as two separate DLLs. We shall investigate this option in the future...