Thursday, 19 April 2012

Where are my cores?

In 2005, the world's first multicore consumer CPUs became commercially available. In 2007, Intel introduced their first quad-core processors and predicted that the number of cores would continue doubling and reach 64 cores by 2011. The following graph shows this prediction and the actual number of cores that shipped on Intel's CPUs:

Interesting to see how big the discrepancy is now. We were supposed to get 64-core CPUs last year but, instead, the widest Intel CPUs shipping in Dell desktops today have just 6 cores and the widest Intel CPUs available have just 10 cores. After Larrabee and the SCC, Intel are now hyping a Many Integrated Core (MIC) architecture but a consumer version has yet to materialise.

Sunday, 8 April 2012

x86 code generation quality

People often claim that it is difficult to improve upon the code generated by a modern compiler using hand-written assembler. This blog post takes a look at the code generated by GCC for a simple function and finds that it implements impressive high-level optimizations but the generated code leaves a lot of room for improvement.

Consider the recursive integer Fibonacci function that may be defined in C like this:

int fib(int n) { return (n<2 ? n : fib(n-1)+fib(n-2)); }

Writing this function in assembler by hand (with the argument and result passed in eax) we obtain the following 12 instructions:

fib:
 cmp eax, 2
 jl fin
 push eax
 add eax, -1
 call fib
 push eax
 mov eax, [esp+4]
 add eax, -2
 call fib
 add eax, [esp]
 add esp, 8
fin:
 ret

Benchmarking this solution, we find that it takes 3.25s to compute fib(40).

With the -O2 option we find that GCC produces the following 25 assembler instructions:

_fib:
 push edi
 push esi
 push ebx
 sub esp, 16
 mov edi, DWORD [esp+32]
 cmp edi, 1
 jbe L4
 mov ebx, edi
 xor esi, esi
L3:
 lea eax, [ebx-1]
 mov DWORD [esp], eax
 call _fib
 sub ebx, 2
 add esi, eax
 cmp ebx, 1
 ja L3
L2:
 and edi, 1
 lea eax, [esi+edi]
 add esp, 16
 pop ebx
 pop esi
 pop edi
 ret
L4:
 xor esi, esi
 jmp L2

Benchmarking, we find that the performance of this solution is indistinguishable from our hand-written assembler. So the compiler generated over twice as much code for no performance improvement. Moreover, our assembler was comprehensible, whereas the compiler's output uses a wider vocabulary of instructions (and, xor and lea) as well as a dramatic rearrangement: the generated code contains only a single recursive call rather than two.

The transformation made by the compiler cleverly turned one of the two recursive calls into a loop. Disassembling the generated code by hand we can reconstruct the following algorithmically-similar C program:

int fib(int n);
int loop(int a, int n) { return (n<=1 ? a+n : loop(a+fib(n-1), n-2)); }
int fib(int n) { return loop(0, n); }

Compiling this to assembler by hand we obtain the following 15-instruction program:

fib:
 push ebx
 push ecx
 mov ebx, 0
 mov ecx, eax
 jmp loop2

loop1:
 lea eax, [ecx-1]
 call fib
 add ebx, eax
 sub ecx, 2
loop2:
 cmp ecx, 1
 jg loop1

 lea eax, [ebx+ecx]
 pop ecx
 pop ebx
 ret

Note that we have made one non-trivial rearrangement to create this assembler. Specifically, the loop containing a branch has been rotated so that the branch is the loop and, therefore, the entry point to the loop is now in the middle of the code. This removes one of the two branches from a naive translation.

Whereas our previous assembler and the C code both took 3.25s to compute fib(40), this new assembler takes just 2.52s. These results imply that GCC is applying some productive high-level optimizations but their benefit is lost in poor code generation. Concretely, the compiler successfully replaced one call with a jmp and, consequently, removed some stack operations but the remaining code was too inefficient for this to pay off.

Interestingly, applying the same optimization to equivalent F# code reduces the running time from 4.44s to 2.64s and in OCaml from 3.21s to 2.77s. Here is the resulting code:

let rec loop a n =
  if n<2 then a+n else
    loop (a + fib(n-1)) (n-2)
and fib n = loop 0 n

Compiling with gcc -O3 produces far more code but the running time is reduced to 2.01s. The most productive optimization for this function appears to be unrolling the recursive calls once and simplifying, in particular sharing the result of the two calls to fib(n-3), which is equivalent to this program:

let rec fib = function
  | 0 | 1 as n -> n
  | 2 -> 1
  | 3 -> 2
  | n -> fib(n-2) + 2*fib(n-3) + fib(n-4)

This reduces the running time to just 0.72s in F# and 0.39s in OCaml.