Tuesday, 26 January 2010

Naive parallelism with HLVM

The latest OCaml Journal article High-performance parallel programming with HLVM (23rd January 2010) described a simple but effective parallelization of the HLVM implementation of the ray tracer. Comparing it with similarly naive parallelizations of the fastest serial C++ and Haskell implementations, we obtained the following results:

These results exhibit several interesting characteristics:

  • Even the naive parallelization in C++ is significantly faster than HLVM and Haskell.
  • C++ and HLVM scale well, with performance improving as more cores are used.
  • Despite having serial performance competitive with HLVM, the naively-parallelized Haskell scales poorly. In particular, Haskell failed to obtain a competitive speedup on up to 5 cores, and its performance degraded significantly beyond 5 cores, running 4.4× slower than C++ on 7 cores.

The efficiency of each parallelization can be quantified by measuring the speed of the parallel version on multiple cores relative to its speed on a single core, i.e. its speedup:

These results show that HLVM obtained the largest speedup: 6.3× faster on 8 cores. C++ obtained the next largest speedup: 5.2× faster on 8 cores. Haskell obtained the worst speedup: only 2.9× faster on 8 cores.
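
For reference, the speedup on n cores can equivalently be written as the single-core running time T(1) divided by the n-core running time T(n), and dividing the speedup by the number of cores gives the parallel efficiency:

```latex
\mathrm{speedup}(n) = \frac{T(1)}{T(n)},
\qquad
\mathrm{efficiency}(n) = \frac{\mathrm{speedup}(n)}{n}
```

By that measure, the three parallelizations achieve roughly 79% (HLVM), 65% (C++) and 36% (Haskell) parallel efficiency on 8 cores.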

More sophisticated parallelizations could obtain better performance still. Currently, the parallel for loop over the rows of the image causes each thread to compete for an iteration and results in poor locality: if a thread computes one row then it is unlikely to compute the next. Work-stealing queues and recursive subdivision of the problem space would greatly improve locality and, therefore, performance. Also, the initial construction of the scene tree has some opportunity for parallelization and further performance improvements.
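
As a rough illustration of the difference (a C++/OpenMP sketch, not HLVM's actual code, assuming a hypothetical render_rows helper that renders a contiguous block of rows into the image):

```cpp
// Assumed helper: renders rows [y0, y1) of a width-wide image.
void render_rows(int y0, int y1, int width, double *image);

// Naive scheme: threads compete for one row at a time, so a thread that
// renders one row rarely renders the next, hurting locality.
void render_naive(int height, int width, double *image) {
  #pragma omp parallel for schedule(dynamic, 1)
  for (int y = 0; y < height; ++y)
    render_rows(y, y + 1, width, image);
}

// Recursive subdivision: split the row range in half until it is small,
// so each task works on a contiguous block of the image. A work-stealing
// scheduler (OpenMP tasks, Cilk, TBB, ...) then balances the blocks.
void render_block(int y0, int y1, int width, double *image, int grain) {
  if (y1 - y0 <= grain) { render_rows(y0, y1, width, image); return; }
  int mid = y0 + (y1 - y0) / 2;
  #pragma omp task shared(image)
  render_block(y0, mid, width, image, grain);
  render_block(mid, y1, width, image, grain);
  #pragma omp taskwait
}

void render_subdivided(int height, int width, double *image) {
  #pragma omp parallel
  #pragma omp single
  render_block(0, height, width, image, 16);
}
```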

Furthermore, there is plenty of room to optimize HLVM itself, and some simple hacks can quantify the possibilities. Disabling bounds checking provides a substantial 20% performance improvement. Disabling the shadow stack (and, therefore, disabling GC) provides another 25% performance improvement. With these two changes, HLVM is only 24% slower than C++. In particular, HLVM's current design makes heavy reuse of its own generic routines even in performance-critical sections such as manipulating the shadow stack. Optimizing these routines by hand, for example by removing bounds checks from manipulations of the shadow stack, could recover some of this potential.

Sunday, 24 January 2010

Quadcore ARMs

Since the publication of our recent article about the upcoming ARM architecture in the context of the exploding netbook market, Marvell have announced that they are producing the world's first quadcore ARM CPU.

ARM also announced 2GHz-capable Cortex-A9 cores in September 2009.

Hopefully we'll get the chance to port our new multicore-capable HLVM project to the ARM architecture before long. That should be an easy task thanks to LLVM's existing support for ARM: LLVM is already used on the Apple iPhone, which is ARM-based.

Sunday, 17 January 2010

Naïve Parallelism: a rebuttal

Several people, including Simon Marlow of Microsoft, have objected to our rejection of Saynte's new Haskell code, claiming that the alterations were minor and that they optimize serial performance. This is simply not true. Firstly, over half of the lines of code in the entire program were altered. Secondly, the new version is slower than Lennart's original version 5 on 1 core. So there is no logical reason to choose the revised Haskell as the basis of a naive parallelization unless you want to cherry-pick results by leveraging knowledge of how they will scale after parallelization. Suffice it to say, doing so would be bad science.

This is illustrated in the following graph of performance results for Lennart's original version 5 vs Saynte's new revised Haskell with 11 levels of spheres at 1,024×1,024:

The original code is 5% faster on 1 core but scales poorly and is 2.7× slower on 7 cores. The substantially-revised "serial" Haskell code was obviously specifically designed to be amenable to parallelization and, therefore, had no place in a comparison about naive parallelizations.

This naturally raises the question of how the different programs will perform when optimized for parallelism. A fair comparison will require the C++ to be rewritten just as the revised Haskell was. We shall address this question in the future.

Saturday, 16 January 2010

Naïve Parallelization: C++ vs Haskell

A member of the Haskell community recently published a blog article revisiting our ray tracer language comparison, claiming to address the question of how naïve parallelizations in these two languages compare. The objective was to make only minimal changes to the programs in order to parallelize them and then compare performance.

Our attempts to verify those results turned up a lot of interesting information. Firstly, the Haskell program that was supposedly naïvely parallelized was not the original but, in fact, a complete rewrite. This raises the question of whether the rewrite was specifically designed to be amenable to parallelization and, therefore, is not representative of naïve parallelization at all. The C++ used was the original with minimal changes to parallelize a single loop. Secondly, although the serial benchmark results covered a spectrum of inputs, the parallel results covered only a single case and retrospectively identified the optimal results without acknowledging that there had been no way to predict them. This raises the concern that the results may have been cherry-picked. Finally, those results were gathered on a machine with only 4 real cores.

We have repeated the experiment by applying the minimal change to two of Lennart Augustsson's original Haskell programs, the first and fourth, and compiled them with Lennart's original compile line with -threaded added. Our C++ was parallelized by building the image in-place using a parallel loop and then outputting the image using serial loops, and was compiled with the command line given in our original article plus an additional -fopenmp flag. These programs genuinely represent naïvely parallelized code. We present results for 9, 12 and 13 levels of spheres at 1,024×1,024, and all measurements were made on a machine with 8 real cores: two 2.1GHz quadcore AMD Opteron 2352 CPUs running 32-bit programs.
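
To make the "minimal change" to the C++ concrete, the parallelization amounts to something like the following sketch (illustrative only, with a hypothetical render_pixel function standing in for the ray-tracing code; the real program was simply recompiled with -fopenmp added):

```cpp
#include <cstdio>
#include <vector>

// Assumed helper: traces the rays for one pixel and returns its grey level in [0, 1].
double render_pixel(int x, int y);

void render(int n) {
  std::vector<double> image(n * n);

  // Build the image in place using a single parallel loop over the rows.
  #pragma omp parallel for schedule(dynamic, 1)
  for (int y = 0; y < n; ++y)
    for (int x = 0; x < n; ++x)
      image[y * n + x] = render_pixel(x, y);

  // Output the image using ordinary serial loops.
  std::printf("P5\n%d %d\n255\n", n, n);
  for (int y = 0; y < n; ++y)
    for (int x = 0; x < n; ++x)
      std::putchar(static_cast<int>(0.5 + 255.0 * image[y * n + x]));
}
```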

The following graph illustrates the results for 9 levels of spheres:

In this case, the naïvely parallelized C++ program is 2.5-3.5× faster than the naïvely parallelized Haskell 4 program which is, in turn, 1.8-2.7× faster than the Haskell 1 program on 1-7 cores. These results are consistent with the understanding that the Haskell 4 program is more heavily optimized than the Haskell 1 program. The C++ and Haskell 4 programs eagerly build the scene tree in memory whereas the Haskell 1 program constructs the scene lazily as it is required by the renderer. In this case, there are only 87,381 spheres in the scene so eager construction is fast and effective, taking a small fraction of the total running time and paying off thanks to faster intersection routines. Consequently, the C++ sees a healthy 5.4× speedup moving from 1 to 8 cores.
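
For readers unfamiliar with the benchmark, the scene is a bounding-sphere hierarchy in which every node carries one sphere and, except at the deepest level, four child nodes, so a scene with L levels contains (4^L - 1)/3 spheres: 87,381 for L=9 and 5,592,405 for L=12. An eager builder has roughly the following shape (a simplified C++ sketch; the field names and child placement are illustrative, not the benchmark's exact code):

```cpp
#include <memory>
#include <vector>

struct Vec { double x, y, z; };

// One node of the bounding-sphere hierarchy: a sphere plus up to four children.
struct Scene {
  Vec centre;
  double radius;
  std::vector<std::unique_ptr<Scene>> children;
};

// Eagerly build the whole tree: (4^level - 1) / 3 spheres in total.
std::unique_ptr<Scene> create(int level, Vec centre, double radius) {
  auto node = std::make_unique<Scene>();
  node->centre = centre;
  node->radius = radius;
  if (level > 1) {
    double r = radius / 2.0;  // illustrative child size and placement
    for (int dx : {-1, 1})
      for (int dz : {-1, 1})
        node->children.push_back(create(
            level - 1,
            Vec{centre.x + 3 * r * dx, centre.y + r, centre.z + 3 * r * dz},
            r));
  }
  return node;
}
```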

The results for the Haskell 4 program exhibit two interesting characteristics. Firstly, Haskell is very slow if all 8 cores are used. This is a bug known as the "last core slowdown", so Haskell programmers expect to get optimal performance from n-1 cores. Secondly, in this case the Haskell 4 program runs fastest with only 6 cores and adding another core actually degrades performance. This is an unexpected result: there was no way to predict that Haskell's performance would degrade when using more than 6 of the 8 cores.

The following graph illustrates the results for 12 levels of spheres:

In this case, the C++ program is 1.7-3.3× faster than either of the Haskell programs but the Haskell 1 program is actually faster than the more heavily optimized Haskell 4 program on 7 cores. The scene now contains 5,592,405 spheres and the time taken to eagerly construct the scene is substantially longer. Consequently, the C++ sees only a 2.3× speedup moving from 1 to 8 cores because its scene construction has not been parallelized.

The performance of the Haskell 4 program not only degrades beyond 5 cores but also exhibits a different profile to that of the Haskell 1 program, which sees improved performance right up to 7 cores. Again, this is an unexpected result and there was no way to predict which Haskell program would be faster or what the optimal number of cores would be. The fastest result for Haskell overall happened to be the Haskell 4 program on 5 cores on this machine.

The following graph illustrates the results for 13 levels of spheres:

In this case, we see the incredible result that the fastest program overall is the least optimized Haskell 1 running on 7 cores! This is another unexpected result but it is easy to explain: Haskell 1 is the only program where the scene is constructed in parallel. In contrast, the C++ and Haskell 4 programs waste most of their time in serial code that constructs a scene tree consuming gigabytes of memory. Consequently, the C++ sees only a 40% performance improvement moving from 1 to 8 cores.

Although this last best-case result, where naïvely-parallelized Haskell is 20% faster than naïvely-parallelized C++, is very impressive, there is no way to predict if or when such a result might occur. Consequently, the worst-case results are equally important in practice and we find that the same winning Haskell program is also up to 8.2× slower than C++ in another case: 9 levels of spheres on 2 cores. However, the Haskell program is much simpler than the C++ and its parallelization is statically proven to be free of concurrency bugs such as race conditions. These benefits are well worth this level of performance degradation in many applications, so parallel Haskell would seem to have immediate practical uses.

Friday, 8 January 2010

HLVM on the ray tracer language comparison

We recently completed a translation of our famous ray tracer language comparison to HLVM. The translation is equivalent to the most highly optimized implementations written in other languages and this allows us to compare HLVM with a variety of competing languages for the first time. The results are astonishing.

Running the benchmark with the default settings (level=9, n=5 to render 87,381 spheres at 512×512) on 32-bit x86 gives the following times for different languages:

These results show that HLVM already provides competitive performance for a non-trivial benchmark. HLVM took 6.7s whereas C++ (compiled with g++ 4.3.3) took only 4.3s and Haskell (compiled with GHC 6.12) took 13.9s.

However, cranking up the level parameter to 12 in order to increase the complexity of the scene, rendering a whopping 5,592,405 spheres, we find that HLVM blows away the other garbage-collected languages and is even able to keep up with C++:

This remarkable result is a consequence of HLVM's space-efficient data representation strategy that avoids almost all boxing and is ideal for numerical code. Moreover, these results vindicate HLVM's unique design that many people said would not be efficient.

The main factor separating these languages is their data representations. The slowest languages on this benchmark, Java and Haskell, rely upon uniform data representations that require many values to be boxed, massively increasing the burden on the garbage collector and damaging locality by spreading the same data across a much larger heap. Moreover, the inner loops of the renderer handle boxed values in these languages and, consequently, are constantly allocating and even incurring garbage collections that require them to traverse irrelevant parts of the heap. Conversely, C++ and HLVM are able to unbox virtually all values. In fact, HLVM is so successful at storing values unboxed that the only part of the benchmark to allocate is the initial creation of the scene tree and the renderer itself performs no allocations at all. Consequently, HLVM incurs no garbage collections during rendering and traverses no more heap than the C++.
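
Schematically (in C++ rather than in any of the benchmarked languages, and purely for illustration), the difference between the two representation strategies looks like this:

```cpp
// Roughly what a uniform (boxed) representation implies: each value sits
// behind a pointer to a separately allocated heap cell.
struct BoxedVec {
  double *x, *y, *z;     // three separate heap allocations
};
struct BoxedSphere {
  BoxedVec *centre;      // another allocation and another indirection
  double *radius;
};

// An unboxed representation, as C++ and HLVM use here, stores everything
// inline: one contiguous 32-byte block per sphere, no allocation during
// rendering and nothing extra for the garbage collector to trace.
struct Vec { double x, y, z; };
struct Sphere {
  Vec centre;
  double radius;
};
```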

Moreover, Java, Haskell and HLVM are the only languages on this benchmark providing multicore-capable garbage collectors. We have not yet parallelized this benchmark but a parallel implementation in HLVM should scale extremely well because it incurs no garbage collections during rendering, thanks to its unique data representation.

Sunday, 3 January 2010

Will Intel lose the computer market to ARM in 2012?

The core of the computer industry, laptops and desktops, is on the brink of revolution with Intel and Microsoft set to lose their long-term stranglehold for three main reasons:

  1. Netbook sales are exploding and eating into the laptop and desktop markets.
  2. Ever more users are operating "in the cloud" where the CPU architecture and OS are irrelevant.
  3. Demand has shifted with users now wanting small solid-state fanless computers with long battery lives.

Sales of netbooks rose from only 0.4M in 2007 to 11.4M in 2008 and 35M in 2009 with projected sales of 139M in 2013. By the end of 2008, notebook sales had overtaken desktop sales for the first time in history. The tremendous growth in netbook sales was precipitated by the global financial climate making cheaper devices more alluring for cash-strapped consumers.

Google are desperately trying to get a piece of the action with their Chrome OS, an operating system designed around their Chrome web browser specifically for the new breed of users living in the cloud. Based upon Linux, this free software ultimately has the potential to destroy the profit margins that Microsoft depend upon. Microsoft have already lost 32% of the worldwide netbook market to Linux since 2007. Operating systems like Chrome OS are agnostic with respect to CPU architecture, completely removing the legacy advantage that Intel rely upon.

Demand for fanless ultra-portable devices with the horsepower to display HD video has made power consumption the top priority. Power consumption has always been the Achilles' heel of the x86 architecture. Back in early 1996, Intel's fastest CPU, a 166MHz Pentium, was trumped by the 200MHz StrongARM, but the far lower power consumption of the technologically-superior StrongARM was irrelevant in the context of desktop machines and Intel's legacy won over all markets except the (billion units per year) mobile phone market, where ARM gained 70% market share. Today, netbook motherboards using Intel's fastest Core i7-920XM are competing against single-chip computers with ARM cores that are 10× smaller and 10× more power efficient. Consequently, netbooks built from ARM technology like the nVidia Tegra can perform the same challenging tasks without a fan and with far longer battery lives than Intel's best offerings. Exactly what consumers want.

Intel have been trying to compete with ARM for over a decade. They acquired StrongARM technology from Compaq, who had acquired it from DEC, and built the Intel XScale, but succeeded in little more than increasing power consumption tenfold. The XScale has since joined the Itanium and Larrabee in the Intel hall of shame.

Windows CE and Windows Mobile already run on the ARM architecture and allow .NET applications to be run on ARM netbooks using the .NET Micro Framework. However, lack of multicore support in the .NET Micro Framework may lead to the use of alternatives like HLVM on Linux instead.

Perhaps by 2012 I really will be writing this blog not from a 2GHz 8-core Intel desktop but from a 2GHz quadcore ARM Cortex-A9MP with nVidia GPU netbook from the beach. I hope so!

High-performance parallelism with HLVM

Our open source HLVM project recently reached a major milestone with new support for high-performance shared-memory parallel programming. This is a major advance because it places HLVM among the world's fastest high-level language implementations.

The previous HLVM implementation had demonstrated the proof of concept using a stop-the-world mark-sweep garbage collector. This new release optimizes that robust foundation by carrying thread-local data in registers in order to provide excellent performance.

The following benchmark results show HLVM beating OCaml on many serial benchmarks on x86 despite the overheads of HLVM's new multicore-capable garbage collector:

HLVM is over 2× faster than OCaml on average over these results and 4× faster on the floating-point Fibonacci function.