Friday, 18 June 2010

Haskell's hash table woes (again)

Alessandro Stamatto recently asked a question on Stack Overflow about the current state of Haskell's hash tables, following our rediscovery last year that Haskell's hash tables were up to 32× slower than common alternatives such as C++ and .NET, and even slower than Python's.

Don Stewart edited Alessandro's original question, deleting the citation to the original study that provided a verifiable experiment with complete code, before writing an answer claiming that "the original citation doesn't present any code or benchmarks". Although Don provided a variety of Haskell-based performance measurements, he omitted any measurements of decent hash table implementations, such as those readily available in C++ or any .NET language, which are still 10× faster than anything that can be written in Haskell.

Norman Ramsey's answer took this a step further by drawing the conclusion that "functional data structures perform pretty well" despite the overwhelming evidence to the contrary. Norman also suggested that the beliefs of a clique of self-proclaimed experts could be relied upon instead of verifying our findings by running the code we had already provided.

And around we go.

In reality, GHC 6.12.2 included a workaround for the garbage-collector performance bug that was believed to be responsible for this problem, but it failed to fix the dire performance of Haskell's hash tables, which are still so slow that even purely functional trees can be faster. Moreover, that version still hasn't filtered through to the Haskell Platform. Here are our updated results, using the complete code we already provided, on a 2.0GHz E5405 Xeon running 32-bit Windows Vista, with Visual F# 2010 and the latest Haskell Platform (with GHC 6.12.1), using ghc -O2 --make to compile the Haskell:

Haskell IntMap: 32.4s
Haskell hash table: 19.2s
.NET Dictionary: 0.8s

So Haskell's hash table is still 24× slower than widely-used alternatives...
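
For anyone who wants to check these numbers, the Haskell side of this benchmark is essentially just a tight insertion loop followed by a lookup. The following is a minimal sketch of that kind of test, using the Data.HashTable module shipped with GHC 6.12's base library; the key count of 10,000,000 and the final lookup are illustrative choices here, not necessarily the exact parameters of the code we originally provided:

    import Control.Monad (forM_)
    import qualified Data.HashTable as H

    main :: IO ()
    main = do
      -- Mutable hash table keyed on Int, using the library's own hash function.
      t <- H.new (==) H.hashInt
      -- Fill the table with 10 million key/value pairs.
      forM_ [1 .. 10000000 :: Int] $ \i -> H.insert t i i
      -- Read one entry back so the program has something to print.
      H.lookup t 100 >>= print

The Haskell IntMap and .NET Dictionary rows above correspond to the same workload expressed with Data.IntMap and Dictionary, respectively.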

Wednesday, 2 June 2010

Regular, shape-polymorphic, parallel arrays in Haskell

This recent paper describes the state of the art in Data Parallel Haskell (DPH). Some of the code referred to in the paper is available here. This post is a response to a question from one of the coauthors, Roman Leshchinskiy, asking what is wrong with the paper.

The problem with this paper is, in a word, cache. Specifically, the widespread adoption of CPU caches two decades ago forced a change in the way programmers traverse data structures when performance is of interest. The old style was to iterate directly using a for loop. The new style is either a cache-aware set of loops that exploits knowledge of the sizes of the caches in the machine, or a cache-oblivious style that uses divide and conquer to subdivide the problem until it fits in cache regardless of the cache's size.
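
To make the distinction concrete, here is a sketch of the cache-oblivious style applied to matrix transposition, written in Haskell over mutable unboxed arrays. The choice of transposition as the example, the IOUArray representation and the block threshold of 64 elements are illustrative assumptions of mine, not anything taken from the paper; the point is only the recursive subdivision, which needs no knowledge of the cache sizes:

    import Data.Array.IO

    -- Cache-oblivious transpose: recursively split the longer dimension of the
    -- current block until it is small, then copy it element by element.
    transposeCO :: IOUArray (Int, Int) Double -> IOUArray (Int, Int) Double
                -> Int -> Int -> Int -> Int -> IO ()
    transposeCO src dst r0 r1 c0 c1
      | (r1 - r0) * (c1 - c0) <= 64 =          -- base case: a small block
          sequence_ [ readArray src (i, j) >>= writeArray dst (j, i)
                    | i <- [r0 .. r1 - 1], j <- [c0 .. c1 - 1] ]
      | r1 - r0 >= c1 - c0 =                   -- split the row range
          let rm = (r0 + r1) `div` 2
          in  transposeCO src dst r0 rm c0 c1 >> transposeCO src dst rm r1 c0 c1
      | otherwise =                            -- split the column range
          let cm = (c0 + c1) `div` 2
          in  transposeCO src dst r0 r1 c0 cm >> transposeCO src dst r0 r1 cm c1

    main :: IO ()
    main = do
      let n = 512
      src <- newListArray ((0, 0), (n - 1, n - 1))
                          [ fromIntegral (i * n + j) | i <- [0 .. n - 1], j <- [0 .. n - 1] ]
      dst <- newArray ((0, 0), (n - 1, n - 1)) 0
      transposeCO src dst 0 n 0 n
      readArray dst (1, 0) >>= print           -- element (0,1) of src, i.e. 1.0

A cache-aware version would instead loop over fixed-size tiles chosen to match the machine's caches; either way each cache line is touched a minimal number of times, which is precisely what a direct row-by-row loop over a large matrix does not do.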

This paper uses neither of the modern approaches, opting instead to use direct loops that have not been seen in real numerical code for several decades. The paper incorrectly asserts that these are "widely used array algorithms".

This is a particularly important mistake in this context because cache misses are not only extremely expensive on modern hardware (so absolute serial performance will be dire) but also force multiple cores to contend for access to main memory, so scalability when parallelized will also be dire, and this work is entirely about parallelism. In fact, the extreme cache unfriendliness (due to the complete lack of temporal locality) in their implementation of the Laplace solver results in a maximum speedup of only 2.6×, on 5 cores, with performance degrading on more cores. This is extremely poor scalability, yet the abstract of the paper still asserts that "our implementation scales well up to 8 processor cores". Common implementations often attain 5-7× speedup on 8 cores and keep getting faster well beyond 8 cores.

In the face of such results, one might expect a humble explanation of what went wrong. Instead, the paper blames the hardware ("insufficient bandwidth of the Xeon") and describes the one-line OpenMP pragma required to parallelize the matrix multiplication in C as "considerable additional effort". Finally, the paper states that they achieved absolute performance comparable to handwritten C, when a more accurate statement would be that they managed to write C code whose absolute performance is comparable to Haskell's, given that they had to shun 20 years' worth of efficient and elegant C implementations in order to justify the conclusion they set out to draw: that Haskell is good.