Sunday, 12 September 2010

Burton Smith vs Erik Meijer

We were recently led to a fascinating Channel 9 interview where the insidious Erik Meijer walks in on parallelism expert Burton Smith, culminating in a fight the likes of which have not been seen since Chris Curry vs Sir Clive Sinclair in the Baron of Beef here in Cambridge.

A pragmatic Burton points out that lazy evaluation renders performance unpredictable and that unpredictability, in turn, makes it impossible to tune the granularity of parallel programs to ensure that more effort is spent doing actual work than is wasted administering parallel tasks. The puritanical Erik points out that strictness is essentially another form of side effect because it affects issues such as non-termination. The conclusion: irreconcilable differences between what parallel programming requires and what purely functional programming can provide.
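
To see Burton's point about granularity concretely, consider the following minimal Haskell sketch (ours, not from the interview; parFib and threshold are purely illustrative names, and it assumes the parallel package). Subproblems are only sparked in parallel above a hand-picked threshold because, below it, the cost of administering a spark outweighs the useful work. Choosing that threshold requires predicting how much work each subtree will actually do, which is precisely what lazy evaluation makes difficult:

import Control.Parallel (par, pseq)

-- Naive Fibonacci, used here only as a stand-in for "real work".
fib :: Int -> Integer
fib n | n < 2     = fromIntegral n
      | otherwise = fib (n - 1) + fib (n - 2)

-- Spark the two subproblems in parallel only above a hand-tuned threshold;
-- below it, the administrative cost of a spark outweighs the useful work.
parFib :: Int -> Int -> Integer
parFib threshold n
  | n <= threshold = fib n
  | otherwise      = x `par` (y `pseq` (x + y))
  where
    x = parFib threshold (n - 1)
    y = parFib threshold (n - 2)

main :: IO ()
main = print (parFib 20 35)

Compile with ghc -O2 -threaded and run with +RTS -N to use all cores; the right threshold then has to be found empirically for each machine and workload.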

Erik and Burton were joined by an elephant in the room, though: caches. Caches are the main difference between supercomputers and multicores and they are a game changer, yet the effect they have on programming is rarely discussed, and that effect will only grow as the memory gap continues to widen. Today, effective utilization of caches offers an order of magnitude in performance, yet using them effectively from purely functional languages like Haskell is essentially impossible precisely because memory has been so successfully abstracted away. Indeed, as we have explained before, this not only remains a show-stopping unsolved problem for parallel Haskell but is not even being addressed by researchers. Today's state-of-the-art advice for multicore programmers is still to use mutable data structures, guided by asymptotic cache complexity quantified in a simple theoretical model of CPU caches, in order to ensure the caches are used effectively.
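
As a rough illustration of that advice (our own sketch, assuming the widely-used vector package), the following performs the same reduction over a boxed linked list and over an unboxed, contiguous vector. The asymptotic work is identical but the memory traffic is not: the vector is swept through as one sequential block while the list is chased pointer by pointer. In a real measurement you would also want to force the list onto the heap first so that GHC's fusion does not optimize the cons cells away:

import qualified Data.Vector.Unboxed as U
import Data.List (foldl')

n :: Int
n = 10 * 1000 * 1000

-- Pointer-chasing over boxed cons cells scattered around the heap.
sumList :: Double
sumList = foldl' (+) 0 (map fromIntegral [1 .. n])

-- The same reduction over one contiguous, unboxed buffer.
sumVector :: Double
sumVector = U.sum (U.generate n fromIntegral)

main :: IO ()
main = print (sumList, sumVector)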


Art said...

Theory: Asleep at the Switch to Many-core (ppt).

Trumping the Multicore Memory Hierarchy with Hi-Spade (ppt).

Mat said...

I'm trying to understand your critique of Haskell with respect to poor cache usage. Are you saying the design prevents good cache usage or that the current implementation prevents it?

Flying Frog Consultancy Ltd. said...

@Mat: Good question!

Note that this has nothing specifically to do with Haskell and is equally applicable to the use of purely functional data structures from any other language.

Due to the architecture of modern hardware, cache complexity is a powerful predictor of real performance. On my 8-core Xeon, simultaneous L2 cache misses on all cores are about 80× slower than L1 cache hits.

So taking advantage of caches offers more potential performance improvement than obtaining linear scaling on today's widest multicore CPUs. Moreover, you cannot take advantage of multicores without first taking advantage of caches, because main memory is a shared global resource and, consequently, L2 cache misses destroy scalability.
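
As a rough illustration (a sketch of mine, again assuming the vector package, and not the benchmark behind the 80× figure above), summing the same unboxed array with stride 1 and with a large odd stride performs exactly the same arithmetic, yet the strided version defeats the caches:

import qualified Data.Vector.Unboxed as U
import Data.List (foldl')

size :: Int
size = 16 * 1024 * 1024   -- far larger than the L2 cache

vec :: U.Vector Double
vec = U.replicate size 1.0

-- Visit every element exactly once using a given stride (wrapping around),
-- so both calls below perform the same number of additions and index
-- computations; only the memory access pattern differs.
sumWithStride :: Int -> Double
sumWithStride stride =
  foldl' (\acc i -> acc + vec U.! ((i * stride) `mod` size)) 0 [0 .. size - 1]

main :: IO ()
main = do
  print (sumWithStride 1)      -- sequential: almost all cache hits
  print (sumWithStride 4097)   -- large odd stride: almost all cache misses

On typical hardware the strided sum is dramatically slower even though the operation counts are identical, which is the sense in which cache complexity, rather than instruction count, predicts real performance.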

Despite these facts, everyone continues to ignore the CPU caches when discussing the performance of purely functional code and, in particular, the performance of things like parallel Haskell, where caches are incredibly important.

As for whether the design or the implementation is to blame, it is impossible to say without further study. Perhaps someone can invent a framework that exploits knowledge of a particular garbage collector to derive cache complexities for purely functional data structures on a given language implementation, and use those results to optimize code and recover some of that 80× performance headroom; or perhaps the design makes that impossible. To date, all of the work on this seems to have been qualitative. Real designs do vary. For example, OCaml reordered blocks in the major heap via a breadth-first search that was designed, on the basis of ad-hoc measurements, to improve cache efficiency, whereas .NET preserves the order of heap blocks in gen2 by sliding blocks together, on the assumption that the programmer expects subsequently-allocated objects to be adjacent in memory.