Wednesday, 2 June 2010

Regular, shape-polymorphic, parallel arrays in Haskell

This recent paper describes the state-of-the-art in Data Parallel Haskell (DPH). Some of the code referred to in the paper is available here. This post is in response to a question from one of the coauthors, Roman Leshchinskiy, asking what is wrong with the paper.

The problem with this paper is, in a word, cache. Specifically, the widespread adoption of CPU caches two decades ago forced a change in the way programmers traverse data structures when performance is of interest. The old style was to iterate directly using a for loop. The new style is either a cache aware set of loops that exploit knowledge of the sizes of caches in the machine, or a cache oblivious style that uses divide and conquer to subdivide the problem until it fits in-cache regardless of the cache's size.

This paper uses neither of the modern approaches, opting instead to use direct loops that have not been seen in real numerical code for several decades. The paper incorrectly asserts that these are "widely used array algorithms".

This is a particularly important mistake in this context because cache misses are not only extremely expensive on modern hardware (so the absolute serial performance will be dire) but cache misses from multiple cores must contend for access to main memory. So scalability when parallelized will also be dire and this work is entirely about parallelism. In fact, the extreme cache unfriendliness (due to the complete lack of temporal locality) in their implementation of the Laplace solver results in a maximum speedup of only 2.6×, on 5 cores, and performance degradation with more cores. This is extremely poor scalability but the abstract of the paper still asserts that "our implementation scales well up to 8 processor cores". Common implementations often attain 5-7× speedup on 8 cores and keep getting faster well beyond 8 cores.

In the face of such results, one might expect a humble explanation of what went wrong but, instead, the paper blames the hardware ("insufficient bandwidth of the Xeon") and describes the one-line OpenMP pragma required to parallelize the matrix multiplication in C as "considerable additional effort". Finally, the paper states that they achieved absolute performance comparable to handwritten C when a more accurate statement would be that they managed to write C code that achieved absolute performance comparable to Haskell given that they had to shun 20 years worth of efficient and elegant C implementations in order to justify the conclusion they set out to draw: that Haskell is good.


David Tweed said...

I think it's unfortunate that there's clearly bad feelings on both sides between you and some Haskell users (and I don't make judgements about who's more at fault), because as someone very interested in processing arrays the technical content of the post is very interesting to me and a big issue with the Haskell framework, but it's overwhelmed by the emotional content of your post.

Sticking only to the technical stuff.
Firstly, I think you are perhaps being overly generous to scientists who write C code: there are an awful lot who aren't aware of cache issues, and even if they are it's often very nontrivial to figure out a reordering for very good access. As such, it's not like the code C they're comparing against isn't being written, it's just that there's also better thought out C code being written, which ought to be pointed out but it's difficult to say the sentence as written is incorrect. Secondly, looking at it as a representational framework that should be improved (particularly if it's going to be the officially blessed library), part of the issue is the tension between expressing compositionality of operations and specifying more details of how to order things to improve cache usage. (I should say here that the kind of programming I do has much less regular access patterns than, eg, Laplace solvers, so that "textbook" blocking or cache-oblivious traversals don't directly apply and figuring it out automatically seems incredibly optimistic.) From my passing knowledge of SAC, I don't think they've got an answer to this problem either. Do any of the languages you prefer have any good answer to this problem, or do you have any personal ideas?

(FWIW, I'm not interested in this for Haskell but rather for the array processing aspects.)

Flying Frog Consultancy Ltd. said...

@David: Thanks for the excellent response!

My personal idea is that a thorough survey of the algorithms needs to be done first. I'd start by translating as much efficient code into a strict functional language as possible and then boiling it down and factoring it. Then look at the results and try to spot patterns emerging in the way the code has been structured.

One obvious pattern that emerges immediately from cache oblivious algorithms is that they always try to bisect something (more specifically, something that relates to indices). That lends itself naturally to factoring out higher-order functions to provide the same functionality as a pair of parallel for loops but using recursive subdivision. You also have optional data dependence occurring in there with some algorithms allowing every quad to be done simultaneously and others requiring the top left, followed by both the top right and bottom left simultaneously followed by the bottom right in series again.

We're already using some ideas along these lines in our commercial products but it is early days yet. In particular, we make heavy use of a parallel for loop that accepts an additional function argument that conveys the relative complexity of subdivisions, allowing it to subdivide non-uniform workloads efficiently. That was instrumental in beating the performance of the Intel MKL from F#.

I'm writing a book on it right now but every topic I look at seems to raise exponentially more stuff to learn! :-)

David Tweed said...

In case you're looking for examples of other array processing, consider the computing an "integral image"/"summed area table"

This has some subdivisions where some blocks can be computed in parallel given other blocks have been computed first, but it seems sufficiently different to other problems that it often doesn't fit in "frameworks".

Flying Frog Consultancy Ltd. said...

@David: No, that looks like the existing pattern found in many problems from dynamic programming such as longest common subsequence and Levenshtein edit distance. Check out the cache oblivious implementation that was presented by Frigo et al. in 2007 here. Basically, you divide into quads Q0, Q1, Q2 and Q3; do Q0 by itself, then Q1 and Q2; finally Q3 by itself. That gives you good cache complexity and you can often do Q1 and Q2 in parallel so its a great class of algorithms for multicores.

I managed to work that one out myself quite easily but I got stuck on iterative algorithms over structured grids like the Laplace solver until Matteo kindly pointed me at his solution for that: recursively subdivide spacetime into trapezoids and pyramids. Beautiful!

Itkovian said...

I think that today's code is optimised by the compiler. So, programmers (even numerical ones) are not really concerning themselves with the nitty-gritty details, but with the high-level algorithm. The compiler will take care of optimising away things, performing loop transformations, etc. Even when you write low-level C, you cannot be really certain that the compiler will not turn your code into something completely different.

So, stating that C is vastly superior to high-level FP code because people program cache-aware, is simply a fallacy, since even in your cache-aware C-code, the compiler will change things, more than you think.

Paolo G. Giarrusso said...

@Itkovian: I think you miss the fact that cache-oblivious and blocked algorithms are different algorithms from a high-level point of view.

The discussion compared dumb and optimized algorithms, which you can write in any language - I guess the blog owner implemented them successfully in OcaML or F#, since these are the languages the author uses/used. Finally, nobody really ever stated that C code is superior - you just misunderstood "20 years worth of efficient and elegant C implementations".

You are basically saying that C is not any more the high-level assembler many people expect, which is true, quite obvious here, and irrelevant.
It just means that everybody doing optimization for real should know it, benchmark code, check assembler (he should), and use a compiler which give them control over loop transformations - then of course there are also people just pretending to optimize code; but benchmarks separate winners from losers.