Friday, 17 July 2009

OCaml vs F#: Burrows Wheeler Transform

The Burrows-Wheeler Transform (BWT) is a block-sorting data compression algorithm that acts as a preconditioner to dramatically improve the compression ratios of subsequent compression phases in many cases. This algorithm is the core of the bzip2 data compression utility that is ubiquitous on Unix and, in particular, is often much more effective than zip and gzip.

The core of the BWT algorithm is easily written in F#:

let cmp (str: _ array) i j =
let rec cmp i j =
if i=str.Length then 1 else
if j=str.Length then -1 else
let c = compare str.[i] str.[j] in
if c<>0 then c else
cmp (i+1) (j+1)
cmp i j

let bwt (str: byte array) =
let n = str.Length
let a = Array.init n (fun i -> i)
Array.sortInPlaceWith (cmp str) a
Array.init n (fun i -> str.[(a.[i] + n - 1) % n])

However, this implementation is very slow and, in particular, is many times slower than the extremely heavily optimized C implementation found in the bzip2 program. The main cause of this performance degradation is the use of a higher-order sortInPlaceWith function that compares array elements using a closure derived from the cmp function. Closure invocation is substantially more expensive than ordinary function invocation.

Fortunately, this is easily solved by writing a generic higher-order sort function in F# and marking it inline to ensure that it will be inlined, resulting in specialization over both the array element type and the comparison function. Incredibly, this simple change creates an F# program that runs at a similar speed to the industrial-strength bzip2 program:

open System.Threading

let inline sort cmp (a: _ array) =
  let inline swap i j =
    let t = a.[i]
    a.[i] <- a.[j]
    a.[j] <- t
  let rec qsort l u =
    if l < u then
      swap l ((l + u) / 2)
      let mutable m = l
      for i=l+1 to u do
        if cmp a.[i] a.[l] < 0 then
          m <- m + 1
          swap m i
      swap l m
      if u-l > 1000 then
        let m = m
        let f = Tasks.Future.Create(fun () -> qsort l (m-1))
        qsort (m+1) u
        f.Value
      else
        qsort l (m-1)
        qsort (m+1) u
  qsort 0 (a.Length-1)

let inline cmp (str: _ array) i j =
  let rec cmp i j =
    if i=str.Length then 1 else
      if j=str.Length then -1 else
        let c = compare str.[i] str.[j] in
        if c<>0 then c else
          cmp (i+1) (j+1)
  cmp i j

let bwt (str: byte array) =
  let n = str.Length
  let a = Array.init n (fun i -> i)
  sort (fun i j -> cmp str i j) a
  Array.init n (fun i -> str.[(a.[i] + n - 1) % n])

The ability to inline higher-order functions in this way, to perform partial specialization, is one of the main advantages of the F# programming language over the previous generations of languages including OCaml. In fact, a direct port of this program to OCaml is 150× slower and, even after considerable effort, we were only able to make the OCaml 2.8× slower than F# and C. This is our best attempt so far:

open Bigarray

type int_array = (int, int_elt, c_layout) Array1.t

type byte_array = (int, int8_unsigned_elt, c_layout) Array1.t

exception Cmp of int

let cmp (str: byte_array) i j =
  let n = Array1.dim str in
  let i = ref i and j = ref j in
  try
    while true do
      if !i = n then raise(Cmp 1) else
        if !j = n then raise(Cmp(-1)) else
          let si = str.{!i} and sj = str.{!j} in
          if si < sj then raise(Cmp(-1)) else
            if si > sj then raise(Cmp 1) else
              begin
                incr i;
                incr j
              end
    done;
    0
  with Cmp c -> c

let swap (a: int_array) i j =
  let t = a.{i} in
  a.{i} <- a.{j};
  a.{j} <- t

let invoke (f : 'a -> 'b) x : unit -> 'b =
  let input, output = Unix.pipe() in
  match Unix.fork() with
  | -1 -> (let v = f x in fun () -> v)
  | 0 ->
      Unix.close input;
      let output = Unix.out_channel_of_descr output in
      Marshal.to_channel output (try `Res(f x) with e -> `Exn e) [];
      close_out output;
      exit 0
  | pid ->
      Unix.close output;
      let input = Unix.in_channel_of_descr input in
      fun () ->
        let v = Marshal.from_channel input in
        ignore (Unix.waitpid [] pid);
        close_in input;
        match v with
        | `Res x -> x
        | `Exn e -> raise e;;

let sort str (a: int_array) =
  let rec qsort l u =
    if l < u then
      begin
        swap a l ((l + u) / 2);
        let m = ref l in
        for i=l+1 to u do
          if cmp str a.{i} a.{l} < 0 then
            begin
              incr m;
              swap a !m i
            end
        done;
        swap a l !m;
        if u - l > 30000 then
          begin
            let f () =
              qsort l (!m - 1);
              Array1.sub a l (!m-l-1) in
            let a' = invoke f () in
            qsort (!m + 1) u;
            let a' = a'() in
            for i=l to !m-2 do
              a.{i} <- a'.{i - l}
            done
          end
        else
          begin
            qsort l (!m - 1);
            qsort (!m + 1) u
          end
      end in
  ignore(qsort 0 (Array1.dim a - 1))

let () =
  let file = try Sys.argv.(1) with _ -> "input.txt" in
  let desc = Unix.openfile file [] 777 in
  let str = Array1.map_file desc int8_unsigned c_layout false (-1) in
  let n = Array1.dim str in

  let a = Array1.create int c_layout n in
  for i=0 to n-1 do
    a.{i} <- i
  done;
  sort str a;
  for i=0 to n-1 do
    print_char(Char.chr str.{(a.{i} + n - 1) mod n})
  done;

  Unix.close desc

12 comments:

Jules said...

Are you going to implement inline in HLVM, or is the compiler smart enough to inline automatically?

Flying Frog Consultancy Ltd. said...

@Jules

Good question. HLVM and LLVM already do ordinary inlining but they are not capable of inlining a function passed by pointer as an argument to another function, as F# does.

I'm not sure this could be automated but it is certainly a feature I'd love to see in HLVM.

Even if this could be automated I think it would make performance unpredictable. So it would probably be wise to stick with the approach F# uses.

combinator said...

Would not it be nice if you could use inlining in the first version, reusing the library source instead of rewriting quicksort?

Flying Frog Consultancy Ltd. said...

@combinator

Yes, that is exactly what we would like to do but it requires both the comparison and the sort to be marked inline. The built-in sort is not inline, so our only option is to implement our own sort and mark that "inline".

Pål-Kristian said...

I've always thought that placing the 'inline' suggestion at the declaration is wrong. It should be at the use-site! Given that compilers can't auto-inline for you, the decision should be for the library user, not the library designer.

(inline qsort) (inline compare) ...

Flying Frog Consultancy Ltd. said...

@Pål-Kristian:

You'd have to export all function bodies in case any user wanted anything inlined. Or you could require that both library and user mark a definition "inline" for it to actually be inlined.

I like the F# approach because it gives the user reliable and significant speedups without them having to do anything at all.

Pål-Kristian said...

Not at all. The compiler knows the 'code' for any given function, and it can easily inline anything for you - though perhaps it is as a link-time inlining step.

Think about it - sometimes it is faster to leave a piece of code in a function, sometimes it is faster to inline it. It depends on the use-case. Why duplicate code?

Flying Frog Consultancy Ltd. said...

@Pål-Kristian:

The compiler can reverse engineer the CIL for a function but that is not good enough. For example, the benefit here is in inlining a function argument applied to a higher-order function. The compiler needs the F# source code to do that and it cannot it for any function unless the source code is bundled into all DLLs for all functions. Suffice to say, many commercial users don't want that!

Pål-Kristian said...

There's nothing magical about higher-ordered functions. After all, jitters perform this optimization at run-time.

Flying Frog Consultancy Ltd. said...

@Pål-Kristian: If you were reinventing a VM (as I am with HLVM) then that would certainly be a good idea. However, quite a bit of magic does go into higher-order functions in F# and it is not at all easy to implement this on top of their existing infrastructure.

Abel Braaksma said...

The static function Tasks.Future.Create does not exist in System.Threading (anymore?). What is its equivalent?

jay paul said...

Really nice post, you got great blog and Thank you for sharing This excellently written content. Waiting for next one.

HP - PRO 15.6" Laptop - 4GB Memory - 500GB Hard Drive

HP - PRO 15.6" Laptop - 4GB Memory - 500GB Hard Drive - Tungsten