Saturday, 7 May 2016

ARM code generation quality

I previously looked at x86 code generation quality and found that GCC does some interesting high-level optimisations but, in that case, any performance benefit was lost in poor quality code generation. In this article I’m going to look at ARM assembly instead.

 

Compiling the Fibonacci function in C from last time we obtain times ranging from 2.4s to 6.6s for fib(40) using –O2 to –O0. Interestingly, using –O3 actually worsens performance over –O2. As before, inspection of the generated assembly language shows that GCC employs nifty high-level optimisations when given the –O2 option.

 

The simplest implementation of our Fibonacci function in ARM assembly is perhaps:

 

fib:

        cmp     r0, #2

        movlt   pc, lr

        stmfd   sp!, {r1, r2, lr}

        sub     r1, r0, #2

        sub     r0, r0, #1

        bl      fib

        mov     r2, r0

        mov     r0, r1

        bl      fib

        add     r0, r0, r2

        ldmfd   sp!, {r1, r2, pc}

 

These 11 instructions are almost identical to the output generated by –O1 except we have been more frugal in order to avoid having to save and restore R3. This takes 3.9s to run.

 

Perhaps the most obvious optimisation is to inline the initial test (if n<2 then n else …) and then skip it when recursing:

 

fib2:

        cmp     r0, #2

        bxlt    lr

fib2i:

        stmfd   sp!, {r1, r2, lr}

        sub     r1, r0, #2

        sub     r0, r0, #1

        cmp     r0, #2

        blge    fib2i

        mov     r2, r0

        mov     r0, r1

        cmp     r0, #2

        blge    fib2i

        add     r0, r0, r2

        ldmfd   sp!, {r1, r2, pc}

 

This immediately reduces the time taken to 1.975, over 20% faster than any of the C solutions. So with very little effort we have written assembly by hand that is both shorter and faster than the assembly generated by GCC.

 

Let’s take a look at that high-level optimisation that GCC did. With –O2, GCC generates 17 instructions:

 

fib:

        cmp     r0, #1

        stmfd   sp!, {r4, r5, r6, lr}

        mov     r6, r0

        ble     .L4

        mov     r4, r0

        mov     r5, #0

.L3:

        sub     r0, r4, #1

        bl      fib

        sub     r4, r4, #2

        cmp     r4, #1

        add     r5, r5, r0

        bgt     .L3

        and     r6, r6, #1

.L2:

        add     r0, r5, r6

        ldmfd   sp!, {r4, r5, r6, pc}

.L4:

        mov     r5, #0

        b       .L2

 

This is equivalent to the following:

 

let rec loop(r4, r5, r6) =

  r5 += fib(r4-1)

  if r4>3 then loop(r4-2, r5, r6) else r5+(r6 & 1)

let fib(n) =

  if n <= 1 then n else

    loop(n, 0, n)

 

We can rewrite this high-level code in assembly by hand as 15 instructions:

 

fib3:

        cmp     r0, #1

        bxle    lr

        stmfd   sp!, {r1, r2, r3, lr}

        mov     r1, r0

        mov     r2, #0

        mov     r3, r0

loop:

        sub     r0, r1, #1

        bl      fib3

        add     r2, r2, r0

        cmp     r1, #3

        subgt   r1, r1, #2

        bgt     loop

        and     r3, r3, #1

        add     r0, r2, r3

        ldmfd   sp!, {r1, r2, r3, pc}

 

Furthermore, whereas the C code took 2.4s this hand-written assembly takes just 1.9s. This is probably because the assembly generated by GCC takes 8 instructions to implement the identity function when n<=1 whereas our solution takes just 2 instructions.

 

GCC’s choice of high-level optimisation is interesting. Looking at the problem, the most obvious high-level optimisation to me is inlining the recursive calls. This is particularly beneficial because it results to two identical calls to fib(n-3) in the general case and that common subexpression can be factored out. The following assembly does this and runs in just 38ms:

 

fib5:

        cmp     r0, #4

        bge     fib5mid

        cmp     r0, #2

        moveq   r0, #1

        movgt   r0, #2

        bx      lr

fib5mid:

        stmfd   sp!, {r1, r2, lr}

        mov     r1, r0

        sub     r0, r1, #4

        bl      fib5

        mov     r2, r0

        sub     r0, r1, #3

        bl      fib5

        add     r2, r2, r0

        cmp     r1, #3

        sublt   r0, r1, #1

        addlt   r0, r0, r2

        blt     fib5end

        add     r2, r2, r0

        sub     r0, r1, #2

        bl      fib5

        add     r0, r0, r2

fib5end:

        ldmfd   sp!, {r1, r2, pc}

 

So it seems the folklore wisdom that it is impossible to beat the assembly generated by a modern C compiler simply isn’t true, at least in this case.

 

No comments: