This recent paper describes the state-of-the-art in Data Parallel Haskell (DPH). Some of the code referred to in the paper is available here. This post is in response to a question from one of the coauthors, Roman Leshchinskiy, asking what is wrong with the paper.
The problem with this paper is, in a word, cache. Specifically, the widespread adoption of CPU caches two decades ago forced a change in the way programmers traverse data structures when performance is of interest. The old style was to iterate directly using a for loop. The new style is either a cache aware set of loops that exploit knowledge of the sizes of caches in the machine, or a cache oblivious style that uses divide and conquer to subdivide the problem until it fits in-cache regardless of the cache's size.
This paper uses neither of the modern approaches, opting instead to use direct loops that have not been seen in real numerical code for several decades. The paper incorrectly asserts that these are "widely used array algorithms".
This is a particularly important mistake in this context because cache misses are not only extremely expensive on modern hardware (so the absolute serial performance will be dire) but cache misses from multiple cores must contend for access to main memory. So scalability when parallelized will also be dire and this work is entirely about parallelism. In fact, the extreme cache unfriendliness (due to the complete lack of temporal locality) in their implementation of the Laplace solver results in a maximum speedup of only 2.6×, on 5 cores, and performance degradation with more cores. This is extremely poor scalability but the abstract of the paper still asserts that "our implementation scales well up to 8 processor cores". Common implementations often attain 5-7× speedup on 8 cores and keep getting faster well beyond 8 cores.
In the face of such results, one might expect a humble explanation of what went wrong but, instead, the paper blames the hardware ("insufficient bandwidth of the Xeon") and describes the one-line OpenMP pragma required to parallelize the matrix multiplication in C as "considerable additional effort". Finally, the paper states that they achieved absolute performance comparable to handwritten C when a more accurate statement would be that they managed to write C code that achieved absolute performance comparable to Haskell given that they had to shun 20 years worth of efficient and elegant C implementations in order to justify the conclusion they set out to draw: that Haskell is good.