Sunday, 12 July 2009

OCaml vs F#: QR decomposition

Recent articles in the OCaml and F#.NET Journals derived high-level implementations of QR decomposition via Householder reductions. This numerical method has many applications, most notably in the computation of best fit parameters of linear sums.

Imperfections in OCaml

The OCaml programming language allows this algorithm to be expressed very elegantly in only 15 lines of code:

# let qr a =
    let m, n = Matrix.dim a in
    let rec qr_aux k q r qa =
      if k = n then q, Matrix.mul r a else
        let v =
          let u = Array.init m (fun i -> if i<k then 0.0 else Matrix.get qa i k) in
          u.(k) <- u.(k) -. norm u;
 (( *. ) (1.0 /. norm u)) u in
        let qr_aux2 m u v = Matrix.sub m (Matrix.outer u v) in
        qr_aux (k+1)
          (qr_aux2 q (Matrix.transform q (2. *| v)) v)
          (qr_aux2 r (2. *| v) (Matrix.transform (Matrix.transpose r) v))
          (qr_aux2 qa v (Matrix.transform (Matrix.transpose qa) (2. *| v))) in
    let i = Matrix.init m m (fun i j -> if i=j then 1.0 else 0.0) in
    qr_aux 0 i i a;;
val qr : Matrix.t -> Matrix.t * Matrix.t = <fun>

This is remarkably concise compared to the equivalent 2,077 lines of Fortran 77 code in LAPACK. However, the OCaml implementation suffers from two main drawbacks:

  1. Abstractions such as higher-order functions degrade performance. If core routines such as arithmetic operations are abstracted (which is necessary to generalize the algorithm over different numeric types such as reals, complexes and symbolic expressions) then the whole program can become over 10× slower.
  2. OCaml's serial run-time prohibits threads from running in parallel, leaving us with few options for parallel programming and none that are competitively performant. However, Unix fork can be used to coarsely parallelize the extraction of the Q and R matrices which can almost double performance for m×n input matrices when m≫n.

Consequently, the OCaml implementation is consistently slower than the specialized routines found in MATLAB.

Perfection in F#

The F# programming language includes a seemingly-innocuous feature called inline that allows functions marked inline to be inlined at compile time. However, inline in F# has some remarkable properties:

  • Functions marked inline are structurally typed so this feature can be used to provide abstraction over different numeric types without any performance degradation at all.
  • This feature even allows functional abstraction with optimal performance as well: by inlining higher-order functions and their function arguments in order to statically-expand completely specialized implementations.

Furthermore, F# inherits a concurrent garbage collector from .NET and Microsoft's Task Parallel Library provides easy-to-use parallel tasks based upon state-of-the-art wait-free work-stealing deques as seen in Cilk. This addresses the remaining imperfection in OCaml.

These features, that are unique to the F# programming language, allow the construction of parallel numerical code that combines the brevity of OCaml with the performance of Fortran. In fact, our generic implementation of QR decomposition in F# is up to 3× faster than the equivalent specialized function in MATLAB.


Art said...


LA said...

For numerical work, the real benchmark would be LAPACK (which for all I know is what Matlab is calling internally). Any comparison of the speed of the generic F# version vs an equivalent call to one of the high performance BLAS (eg MKL) libraries?

Flying Frog Consultancy Ltd. said...


On small matrices (e.g. 100x100), our most heavily optimized F# code is up to 3x faster than MATLAB R2008b (which is using Intel MKL) on a twin quadcore Xeon workstation.

However, our F# code is not (yet) cache friendly so it is much slower on larger matrices. Specifically, 10x slower than MATLAB on 1,000x1,000. We expect to get the same memory-bound performance from MATLAB by using cache oblivious algorithms in the future.

Suffice to say, these performance results from a really high-level language are extremely encouraging!

LA said...

That all makes sense and I agree that the performance numbers are encouraging. That said, for those situations where a BLAS/LAPACK function is available, I think a nice high level wrapper is a better effort/benefit trade. This will probably be more and more the case as multicore chips become increasingly complicated.

Of course every numerically intensive systems has lots of places where a BLAS/LAPACK routines doesn't fit, so nice to know that you can drop down and do some numerically intensive programming in F# without paying too high a price.

Thanks for the interesting F# details.

jay paul said...

Really nice post, you got great blog and Thank you for sharing This excellently written content. Waiting for next one.

HP - PRO 15.6" Laptop - 4GB Memory - 500GB Hard Drive

HP - PRO 15.6" Laptop - 4GB Memory - 500GB Hard Drive - Tungsten