Sunday, 5 December 2010

Texas Multicore Technologies on Haskell

US-based startup Texas Multicore Technologies have published some of the results they obtained using Haskell for parallel programming. Their results mirror our own.

Thanks to Manuel Chakravarty of the University of New South Wales for drawing this to our attention.

12 comments:

saynte said...

It would appear that they took their Haskell code from the nofib parallel benchmark suite, then ran it incorrectly (somehow). The parfib code does achieve some speedup; their graph showed none at all.
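
For context, the parfib program being discussed has roughly this shape. This is a sketch from memory rather than the exact nofib source; the function names and the sequential cutoff value are my own choices:

```haskell
-- Requires the "parallel" package (Control.Parallel), compiled with
-- -threaded and run with +RTS -N to use multiple cores.
import Control.Parallel (par, pseq)

-- Plain sequential fib, used below the cutoff.
nfib :: Int -> Integer
nfib n
  | n < 2     = 1
  | otherwise = nfib (n - 1) + nfib (n - 2)

-- Parallel fib: spark one branch with `par`, force the other with
-- `pseq`, and fall back to the sequential version for small inputs
-- so spark overhead doesn't swamp the useful work.
parfib :: Int -> Int -> Integer
parfib threshold n
  | n < threshold = nfib n
  | otherwise     = x `par` (y `pseq` (x + y))
  where
    x = parfib threshold (n - 1)
    y = parfib threshold (n - 2)

main :: IO ()
main = print (parfib 11 25)
```

The cutoff is what makes this benchmark scale at all: with no threshold, the runtime drowns in millions of tiny sparks.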

Flying Frog Consultancy Ltd. said...

Even Simon Marlow's own results show poor performance and scalability for this embarrassingly-parallel matmult Haskell benchmark (as well as the maze and sphere benchmarks).

As usual, the problem is poorly-written parallel code that fails to take advantage of CPU caches effectively so the bottleneck is the shared access to main memory.

saynte said...

What I meant was that the code they ran has a history of actually scaling (at least up to 2.5x as in the paper you cited). Given this, the plot you have referenced is quite uninformative; they probably just wanted some marketing image, or ran the program incorrectly.

Flying Frog Consultancy Ltd. said...

The Haskell solution spends all of its time stalled on cache misses. Consequently, any speedup you observe is simply a measure of how many threads are required to max out main memory on a given machine. Therefore, you might well expect to see ~2× speedup on a 2-socket 4-cores/CPU configuration like the one Simon Marlow used and no speedup on a genuine 8-core like the new Nehalem-EX that Texas Multicore have used.
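
The argument above comes down to simple arithmetic. A minimal sketch using made-up numbers (a 10 GB/s bus and 5 GB/s of demand per core, both hypothetical, not measurements of any real machine):

```haskell
-- Back-of-envelope model of a bandwidth-bound program: once the
-- aggregate demand of the threads exceeds what the memory bus can
-- supply, adding cores yields no further speedup. All numbers used
-- here are illustrative assumptions.
memBoundSpeedup :: Double  -- total bus bandwidth, GB/s
                -> Double  -- bandwidth one core demands, GB/s
                -> Int     -- number of cores in use
                -> Double  -- resulting speedup over one core
memBoundSpeedup bus perCore n = min (fromIntegral n) (bus / perCore)

main :: IO ()
main = mapM_ (print . memBoundSpeedup 10 5) [1, 2, 4, 8]
```

With these assumed numbers the speedup plateaus at 2x no matter how many cores are added, which is exactly the shape of curve described above: the plateau measures the memory system, not the parallelism of the code.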

There is no logical reason to dismiss these results as marketing hype or presume that they ran the Haskell incorrectly. Your accusations are based in fantasy, not reality.

saynte said...

Well, the main problem is that there is a discrepancy between the behaviour shown by their "Example" page and the behaviour that has been observed up until now.

There may be a reason for it, but it isn't accounted for in their small example. Perhaps I am cynical, but when I see a company trying to sell something and coming up with different results for the same piece of code as others have seen, I assume there may be some error in their methodology.

I don't believe this process of reasoning is 'fantasy'.

Flying Frog Consultancy Ltd. said...

@Saynte: Different architectures ⇒ no discrepancy.

Gambling your own money on your own work is hardly conducive to fraud. Orchestrating to milk money from the system by forming a clique of researchers who review each other's work and avoid genuine peer review is, however, fraud.

saynte said...

Yes, different architectures can exhibit different results. However, if this were the case, one would note it alongside the results and explain the difference.

"Gambling your own money on your own work is hardly conducive to fraud. Orchestrating to milk money from the system by forming a clique of researchers who review each other's work and avoid genuine peer review is, however, fraud."

I'm not sure what the point of that statement is.

Flying Frog Consultancy Ltd. said...

@saynte: Strange that you expect this company to give you a theoretical explanation for free when you ignore the fact that the Haskell researchers who are paid to understand and explain this have failed to do so. Only two reasons can explain this: either the Haskell researchers are dishonestly conspiring to deceive people, or they are so incompetent that they do not understand the theoretical foundations of multicore parallelism even though other researchers had already explained it beautifully. I have found them guilty of both in the past.

"I'm not sure what the point of that statement is". The point is that your trust is misplaced. You attacked Texas Multicore only because their measurements undermined your religious beliefs about Haskell. Frankly, I'm amazed nobody from the Haskell community has accused them of being us yet...

saynte said...

- They have results that deviate from all other results I've seen reported for this code; leaving that discrepancy unexplained weakens the argument they make for their own performance. If they explained it, they would have a much stronger case.

- "Haskell researchers" (if you consider them as a homogeneous mass) likely aren't explaining it because counter-examples existed even before the claim was put on the internet. But you really would have to ask them if you want to know why; that is the simplest and best form of scientific inquiry that will give you your answer, and you don't have to guess about incompetence or dishonesty.

- My beliefs about Haskell aren't religious, merely based on previous experience. Indeed, I can run the code from the nofib benchmark and it shows a speedup on my machine as well.

Jules said...

What kind of speedup are you seeing and on which hardware are you running?

Flying Frog Consultancy Ltd. said...

@saynte: "They have results that deviate from all other results I've seen reported for this code". How many of those other results were made on this architecture?

"if it's unexplained". The explanation is well known. I even gave a lecture on it.

"Haskell researchers...likely aren't explaining it because counter-examples to it existed even before it was put on the internet". Where had these conflicting measurements made on the same architecture been published on the internet?

@Jules: Benchmarks that stress memory more scale better on our 2×4-core AMD Opteron than on our 2×4-core Intel Xeon. I'll publish exact measurements for this benchmark when I next have access to those machines.

On a related note, Texas Multicore's other results show even larger variations in scalability between architectures. For example, their own sparse matrix decomposition benchmark shows a considerable speedup moving from 16 to 32 AMD cores but a significant slowdown moving from 16 to 32 Intel cores.

saynte said...

@Jules
I ran it on my home machine, an i7 920 (maybe 940, I forget now). Although anecdotal, the speedup was at least 3x on 8 logical cores. I don't have access to the machine now, as I'm out of the country, so I can't be more concrete than an anecdote.


@Flying
It should be repeated: the results on Texas Multicore's example page do not state an architecture. No one could possibly verify their results from the information provided: they don't state a compiler version, operating system, test input, or machine configuration.