Monday, 3 September 2018

What languages did you learn, in what order?

I once compiled the following list of programming languages and which year I started learning them:
  • 1981: Sinclair BASIC (on a ZX81)
  • 1983: BBC BASIC (on a BBC Micro)
  • 1985: Pascal (installed via ROM)
  • 1987: Logo
  • 1987: 6502 Assembly (on a BBC Micro)
  • 1989: ARM assembly (on an Acorn Archimedes)
  • 1992: Casio fx-7700 BASIC
  • 1992: C (using the excellent Norcroft compiler for Acorn computers)
  • 1994: C++ (the awful Beebug Easy C++)
  • 1995: UFI (on a VAX running OpenVMS)
  • 1996: StrongARM assembly
  • 1996: Standard ML
  • 1997: Mathematica
  • 1998: Quake C
  • 2004: OCaml
  • 2006: Java
  • 2007: Common Lisp
  • 2007: Scheme
  • 2007: Scala
  • 2007: Haskell
  • 2007: F#
  • 2009: Clojure
  • 2010: HLVM
  • 2017: Elm
  • 2017: Javascript
I was recently concerned to read in Socio-PLT: Quantitative and Social Theories for Programming Language Adoption that the number of languages a programmer knows stagnates after the age of just 20. This decade is nearing its end and I have learned only three new languages. Now I'm wondering which languages I should learn next...

Wednesday, 31 January 2018

Background reading on the reference counting vs tracing garbage collection debate

Eight years ago I answered a question on Stack Overflow about the suitability of OCaml and Haskell for soft real-time work like visualization:
"for real-time applications you will also want low pause times from the garbage collector. OCaml has a nice incremental collector that results in few pauses above 30ms but, IIRC, GHC has a stop-the-world collector that incurs arbitrarily-long pauses"
My personal experience has always been that RAII in C++ incurs long pauses when destroying non-trivial data (i.e. nested, structured collections of collections of collections, trees, graphs and so on), and non-deferred reference counting has the same problem for the same reason. Tracing garbage collectors like OCaml's work beautifully, but many notoriously bad tools like Java have given tracing garbage collection a bad name.
Now that I am revisiting this issue I am surprised to find many individuals and organisations repeating exactly the same experimental tests that I did and coming to the same conclusions.
Pusher published a blog post Low latency, large working set, and GHC’s garbage collector: pick two of three that explains how, after building their production solution in Haskell in the interests of correctness, they were forced to drop the Haskell programming language due to insurmountable problems with GC latency. They chose to switch to Google's Go language instead, but I think OCaml would have served them well. I'd also like to stress that industry's habit of neglecting performance as a requirement is a bugbear of mine!
Gabriel Scherer published his research as a blog post, Measuring GC latencies in Haskell, OCaml, Racket, presenting a related GC benchmark with results showing that OCaml attains 2ms max GC pauses, compared to 38ms with tuned Racket (formerly known as PLT Scheme) and 51ms with Haskell.
Twitch published a post entitled "Go’s march to low-latency GC" that described their five year journey from GC pauses lasting tens of seconds down to 1ms using Go.
Hashtable latencies on Github is a web server benchmark ported to a variety of tracing garbage collected languages as well as Swift (which uses reference counting). Although at the beginning they say "the Swift version is mostly used as a reference of what would be ideal" at the end they actually conclude "Want to build low-latency, memory-intensive services in a GCed language? Use OCaml (or Reason if you prefer the syntax)". Note in particular their graph of measured latency profiles which shows no significant difference between OCaml and Swift at any scale:
Here is a discussion in the Rust subreddit about reference counting woes where the author ends up admitting "it might be interesting to deliberately leak the memory and just hope that the OS is smart enough to swap out pages that aren't being referenced any more".
Google's V8 team have blogged about their garbage collector including the post Getting Garbage Collection for Free about optimising their GC for latency by deferring collection work until idle time.

Tuesday, 26 December 2017

Does reference counting really use less memory than tracing garbage collection? Mathematica vs Swift vs OCaml vs F# on .NET and Mono

Our previous post caused some controversy by questioning the validity of some commonly-held beliefs: specifically, the beliefs that reference counting (RC) always requires less memory than tracing garbage collection (GC) and that tracing GCs require 3-4x more memory in order to perform adequately. We presented a benchmark written in Swift and OCaml and noted that the RC'd Swift implementation ran over 5x slower and required over 3x more memory than the tracing GC'd OCaml implementation. This observation disproves those beliefs in their strong form and brings into question whether there is any validity even in their weak forms. After all, we have never seen any empirical evidence to support these beliefs in any form.
We received a lot of criticism for that post. The vast majority of the criticism was not constructive but two valid points did arise. Firstly, although our result was anomalous it would be more compelling to see the benchmark repeated across a wider variety of production languages using both RC and tracing GC because this would help to reduce the error caused by differences other than GC strategy. Secondly, it would be more compelling to repeat the study with different benchmark tasks in case this one problem (computing symbolic derivatives) is anomalous. This post addresses the former concern.
We implemented equivalent solutions to the original Swift and OCaml, this time in F# and Mathematica. The F# uses the tracing garbage collector inherited from its runtime (either Mono or .NET). Mathematica is a domain specific language (DSL) designed specifically for computer algebra that uses its own heavily-optimized RC implementation, so it stands the best possible chance of being able to compete with the original OCaml. These additional measurements should therefore usefully extend the applicability of our findings.
The F# 4.0 code was run on Mono 4.6.2 directly on the Raspberry Pi 3 alongside the Swift and OCaml and also on .NET 4.7 on Windows 10 (i.e. different architecture and platform!). Our results for Mathematica represent excess memory consumption over the initial consumption (because Mathematica has a uniquely huge standard library) and were obtained running on Mathematica 11.2 in the cloud and 10.0.1 on the Raspberry Pi 3.
According to our measurements, the solutions using tracing GCs require less memory than the solutions using RC in every single case:
Again, we find the common belief that RC requires less memory to be unjustified.
Just out of interest, here are the run times too (on Raspberry Pi 3 only):

Sunday, 24 December 2017

Does reference counting really use less memory than tracing garbage collection? Swift vs OCaml

The way software developers cling to folklore and avoid proper experimental testing of hypotheses really bothers me. I think perhaps the worst branch of computer science still riddled with myths and legends is memory management.
Despite over half a century of research on garbage collection showing that reference counting is inferior to tracing garbage collection algorithms (and that despite almost all GC research restricting consideration to Java, when much better algorithms have been known for years), there are still many people who claim otherwise. C++ developers obviously still believe that reference counted smart pointers are superior to tracing garbage collection, but now other people are doing it too. People who should know better.
This situation recently reared its ugly head again when Apple released a promising new programming language called Swift that shuns tracing garbage collection in favor of reference counting. Now, there are logical reasons for Apple to have chosen reference counting, most notably that their existing infrastructure uses it heavily. That I can tolerate. However, what I cannot tolerate is Apple putting lipstick on their pig by making technical claims about the superiority of reference counting over tracing garbage collection. Baseless claims.
Chris Lattner is a great guy and I have a lot of respect for him as the author of both LLVM and Swift but I'm amazed to see him make claims like:
"More pointedly, not relying on GC enables Swift to be used in domains that don’t want it - think boot loaders, kernels, real time systems like audio processing, etc.
...
One thing that I don’t think is debatable is that the heap compaction behavior of a GC (which is what provides the heap fragmentation win) is incredibly hostile for cache (because it cycles the entire memory space of the process) and performance predictability.
Given that GC’s use a lot more memory than ARC systems do, it isn’t clear what you mean by GC’s winning on heap fragmentation.
...
GC also has several *huge* disadvantages that are usually glossed over: while it is true that modern GC's can provide high performance, they can only do that when they are granted *much* more memory than the process is actually using. Generally, unless you give the GC 3-4x more memory than is needed, you’ll get thrashing and incredibly poor performance. Additionally, since the sweep pass touches almost all RAM in the process, they tend to be very power inefficient (leading to reduced battery life)."
Some of these beliefs are clearly wrong. For example, Docker and Mirage are both written in languages using tracing garbage collection (Go and OCaml, respectively).

Others are less obviously wrong. For example, claiming that GCs "cycle the entire memory space of the process" and "the sweep pass touches almost all RAM in the process" is untrue, but it isn't immediately obvious why. The reason is that tracing garbage collectors only care about pointers. When presented with a block of binary data (e.g. an array of numbers) that doesn't contain any pointers, a tracing GC will not bother to trace it: none of that RAM gets touched. A tracing GC will only touch almost all RAM if the heap is composed almost entirely of pointers, for example if you had a traditional purely functional collection like a balanced binary tree so big that it consumed all available memory. Even on a mobile phone that is extremely unlikely to occur.

Also, Chris' claim about heap fragmentation is at best dubious. The primary source of defragmentation in a generational collector is copying survivors from the nursery into a mostly-contiguous part of an older generation. The nursery is deliberately sized so that this operation is expected to fit into CPU cache, so it is not "incredibly hostile for the cache" at all. On the contrary, GC research has shown the cache friendliness of this operation to be one of the most important benefits of generational GCs.

Finally, some of these claims are surprising to me and I am unwilling to accept them as fact without first running some tests in an attempt to falsify them. Most notably the belief that tracing garbage collection requires 3-4x more memory than reference counting. Where did these numbers come from?!
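As an aside, the pointer-density point is easy to make concrete in OCaml. A minimal sketch (sizes are illustrative):

(* The GC never scans the contents of a float array because it cannot
   contain pointers, whereas every node of a binary tree is a separate
   heap block that must be traced. *)
type tree = Leaf | Node of tree * int * tree

let rec build n = if n = 0 then Leaf else Node (build (n - 1), n, build (n - 1))

let flat = Array.make 25_000_000 0.0   (* ~200MB of RAM the GC skips *)
let pointer_dense = build 21           (* ~2M heap blocks, all traced *)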
Before we get into my findings, perhaps I should explain my unwillingness to accept this information. Perhaps the most obvious problem is that, if the performance of a tracing GC is crippled by less than 3x memory overhead, why is the default memory overhead of OCaml's GC set to just 80%? I've noticed that Apple aficionados often cite this paper by Hertz et al. in an attempt to justify their belief that tracing garbage collection requires 3-4x more memory than manual memory management in order to work efficiently. There are many problems with this. Firstly, extending the conclusion to reference counting requires you to believe that reference counting has no space overheads when, in fact, it obviously needs to store the reference counts, and that it collects at the earliest possible point in a program's execution when, in fact, it doesn't. Secondly, that paper silently ignores the production-quality JVM collector and replaces it with a variety of toy collectors, most of which use algorithms the likes of which haven't been seen in production for decades. Am I supposed to just believe that they did an expert job implementing their toy collectors? Where is their source code so I can reproduce these results for myself? Smells fishy. Finally, they give no justification for their choice of benchmark programs; they end up dropping three of the eight benchmarks they started with because they couldn't get their code to work properly, and some of the remaining benchmarks (e.g. a ray tracer) sound rather anomalous from a memory management perspective.
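For reference, that 80% figure is OCaml's default Gc.space_overhead setting (at the time of writing), which can be inspected and adjusted at runtime. A quick sketch using the stdlib Gc module:

(* Query and adjust the major GC's speed/memory trade-off. *)
let () =
  let c = Gc.get () in
  Printf.printf "space_overhead = %d%%\n" c.Gc.space_overhead;
  Gc.set { c with Gc.space_overhead = 120 }  (* trade memory for less GC work *)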
So I decided to test this hypothesis myself by benchmarking Swift against OCaml. Swift uses reference counting and OCaml uses tracing garbage collection. I am interested primarily in memory consumption but performance, compile times and code size will also be interesting. I thought about possible benchmark tasks. Given Chris' implication of high pointer density I thought it would be good to try some immutable collections. At first I thought of lists but then I figured trees would be more involved but still tractable in both languages. A common source of such collections is expression trees. Rather than writing an interpreter I thought it would be more interesting to write a program that computes the symbolic derivative of a mathematical expression. If we restrict consideration to integers, variables, addition, multiplication, raising to a power and logarithm then we can easily write a function that differentiates any such expression.
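To give a flavour of the task, here is a minimal OCaml sketch of such an expression type and its symbolic derivative (illustrative, not the exact benchmark code):

type expr =
  | Int of int
  | Var of string
  | Add of expr * expr
  | Mul of expr * expr
  | Pow of expr * expr
  | Ln of expr

let rec d f x =
  match f with
  | Int _ -> Int 0
  | Var y -> Int (if y = x then 1 else 0)
  | Add (f, g) -> Add (d f x, d g x)
  | Mul (f, g) -> Add (Mul (f, d g x), Mul (g, d f x))
  | Pow (f, g) ->
      (* d(f^g) = f^g * (g' ln f + g f'/f) *)
      Mul (Pow (f, g),
           Add (Mul (d g x, Ln f),
                Mul (Mul (g, d f x), Pow (f, Int (-1)))))
  | Ln f -> Mul (d f x, Pow (f, Int (-1)))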
Here is a solution in Swift and a solution in OCaml. The algorithms should be identical but clearly aren't optimal. I'm much more familiar with OCaml than Swift so I reached out to the Swift community for code review and folded in one fix and some minor improvements but nobody could optimize it.
On a Raspberry Pi 3 running Swift 3.1.1 and OCaml 4.02.3 to compute up to the 9th derivative of x^x I get:
  • Swift requires 70% more chars of source code (3,939 vs 2,308 chars)
  • OCaml compiles over 3x faster than Swift with -O (1.7s vs 5.3s).
  • OCaml runs over 5x faster than Swift (1.469s vs 7.623s).
  • Swift consumes over 3x more resident memory (104MB vs 29MB).

Turns out the Swift code can be shrunk by prefixing function parameters with an underscore in order to avoid having to name all arguments at all call sites, so the code sizes aren't amazingly interesting, except perhaps that it is awesome to see more mainstream languages providing union types and pattern matching. The compile times are interesting: OCaml is very fast, so 3x slower than OCaml is actually pretty good; many languages are orders of magnitude slower than OCaml at compilation. The run times are very interesting to me because I have previously noted reference counting in C++ running 10x slower than OCaml, so I fully expected Swift to be slow at this, but exactly how slow I didn't know because Swift goes to great lengths to try to optimise away increments and decrements of reference counts. The memory consumption is really amazing to me: Swift uses over 3x more resident memory than OCaml. Wow. Not even close!
This cannot just be because every word-sized datum in the heap has been augmented with a word-sized reference count, because that could only account for a factor of two. I suspect OCaml's tracing garbage collector is collecting garbage earlier than Swift's reference counting. I'll wager part of the problem is that Swift leaks in the presence of recursion due to the use of scope-based reference counting without tail call elimination. If so, that is an obvious optimisation for Chris to put into his Swift compiler. I'll leave the task of replacing my recursive nest function with an iterative looping one to test this theory as an exercise for the reader.
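For anyone tempted by that exercise, here is the shape of the two variants in OCaml (a sketch; the names and details are illustrative rather than the benchmark's exact code):

(* Recursive nesting: apply f to x, n times. *)
let rec nest n f x = if n = 0 then x else nest (n - 1) f (f x)

(* Iterative equivalent: no stack frames to pin intermediate values. *)
let nest_iter n f x =
  let r = ref x in
  for _ = 1 to n do r := f !r done;
  !r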
Anyway, the conclusion seems clear: in at least this case reference counting requires more memory than tracing garbage collection. Therefore, it is unreasonable to make assertions like "tracing GC requires 3-4x more memory than reference counting to perform adequately".

Merry Christmas, y'all.

Monday, 17 April 2017

Xavier Leroy's "standard lecture on threads"

Xavier Leroy’s standard lecture on threads post from the caml-list in 2002 seems to have disappeared from the INRIA archives so I am reproducing it here for anyone who is interested. This post is of historical interest because it explains why OCaml has no support for multicore parallelism to this day (twelve years after the release of the first consumer-level multicore CPUs).

Date: 2002-11-25 (10:01)
From: Xavier Leroy <xavier.leroy@i...>
Subject: Re: [Caml-list] Why systhreads?
It seems that the annual discussion on threads started again.  Allow
me to deliver again my standard lecture on this topic.

Threads have at least three different purposes:

1- Parallelism on shared-memory multiprocessors.
2- Overlapping I/O and computation (while a thread is blocked on a network
   read, other threads may proceed).
3- Supporting the "coroutine" programming style
   (e.g. if a program has a GUI but performs long computations,
    using threads is a nicer way to structure the program than
    trying to wrap the long computation around the GUI event loop).

The goals of OCaml threads are (2) and (3) but not (1) (for reasons
that I'll get into later), with historical emphasis on (2) due to the
MMM (Web browser) and V6 (HTTP proxy) applications.

Pure user-level scheduling, or equivalently control operators (call/cc),
provide (3) but not (2).

To achieve (2) with a user-level scheduler such as OCaml's bytecode
thread library requires all sorts of hacks, such as non-blocking I/O
and select() under Unix, plus wrapping of all I/O operations so that
they call the user-level scheduler in cases where they are about to
block.  (Otherwise, the whole process would block, and not just the
calling thread.)

Not only this is ugly (read the sources of the bytecode thread library
to get an idea) and inefficient, but it interacts very poorly with
external libraries written in C.  For instance, deep inside the C
implementation of gethostbyname(), there are network reads that can
block; there is no way to wrap these with scheduler calls, short of
rewriting gethostbyname() entirely.

To make things worse, non-blocking I/O is done completely differently
under Unix and under Win32.  I'm not even sure Win32 provides enough
support for async I/O to write a real user-level scheduler.

Another issue with user-level threads, at least in native code, is the
handling of the thread stacks, especially if we wish to have thread
stacks that start small and grow on demand.  It can be done, but is
highly processor- and OS-dependent.  (For instance, stack handling on
the IA64 is, ah, peculiar: there are actually two stacks that grow in
opposite directions within the same memory area...)

One aspect of wisdom is to know when not to do something oneself, but
leave it to others.  Scheduling I/O and computation concurrently, and
managing process stacks, is the job of the operating system.  Trying
to do it entirely in a user-mode program is just not reasonable.
(For another reference point, see Java's move away from "green
threads" and towards system threads.)

What about parallelism on SMP machines?  The main issue here is that
the runtime system, and in particular the garbage collector and memory
manager, must be MP-safe.  This means minimizing global state, and
introducing locking around accesses to shared resources.  If done
naively (e.g. locking at each heap allocation), this can be extremely
costly; it also complicates the runtime system a lot.  Finally,
garbage collection can become a limiting factor if it is done in the
"stop the world" fashion (all threads stop during GC); a concurrent GC
avoids this problem, but adds tremendous complexity.

(Of course, all this SMP support stuff slows down the runtime system
even if there is only one processor, which is the case for almost all
our users...)

All this has been done before in the context of Caml: that was
Damien Doligez's Concurrent Caml Light system, in the early 90s.
Indeed, the incremental major GC that we have in OCaml is a
simplification of Damien's concurrent GC.  If you're interested, have
a look at Damien's publications.

Why was Concurrent Caml Light abandoned?  Too complex; too hard to debug
(despite the existence of a machine-checked proof of correctness);
and dubious practical interest.  Shared-memory multiprocessors have
never really "taken off", at least in the general public.  For large
parallel computations, clusters (distributed-memory systems) are the
norm.  For desktop use, monoprocessors are plenty fast.  Even if you
have a 4-processor SMP machine, it isn't clear whether you should
write your program using shared memory or using message passing -- the
latter is slightly more expensive, but scales to clusters...

What about hyperthreading?  Well, I believe it's the last convulsive
movement of SMP's corpse :-)  We'll see how it goes market-wise.  At
any rate, the speedups announced for hyperthreading in the Pentium 4
are below a factor of 1.5; probably not enough to offset the overhead
of making the OCaml runtime system thread-safe. 

In summary: there is no SMP support in OCaml, and it is very very
unlikely that there will ever be.  If you're into parallelism, better
investigate message-passing interfaces.

- Xavier Leroy
-------------------

Monday, 30 May 2016

Disadvantages of purely functional programming

I have been asked to elaborate on my answer on Quora so here goes:

 

1. There is no efficient purely functional unsorted dictionary or set.

Since the 1990s the use of dictionaries in software has gone through the roof. Dictionaries are now a stock collection type that every programmer expects to find in their standard library.

Purely functional or persistent data structures such as those found in Okasaki’s fabulous monograph on the subject can be a great tool. In particular, the persistence they offer means you can reuse old versions of collections without having to worry about mutation. In many cases (particularly for some kinds of problems such as logic programming and compiler writing) this can make solutions shorter and clearer, partly because it makes backtracking trivial. However, persistence comes at a great cost in terms of performance: purely functional dictionaries are typically 10x slower than a decent hash table and I have seen them run up to 40x slower. For some applications this is just too slow.
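The gap is easy to measure for yourself. A rough, stdlib-only OCaml sketch (the numbers will vary with machine, sizes and workload):

module IntMap = Map.Make (Int)

let n = 1_000_000

let bench name f =
  let t0 = Sys.time () in
  ignore (Sys.opaque_identity (f ()));
  Printf.printf "%s: %.2fs\n" name (Sys.time () -. t0)

let () =
  (* Persistent map: O(log n) inserts, allocating a fresh path each time. *)
  bench "persistent Map" (fun () ->
      let m = ref IntMap.empty in
      for i = 0 to n - 1 do m := IntMap.add i i !m done;
      !m);
  (* Mutable hash table: amortised O(1) inserts in place. *)
  bench "mutable Hashtbl" (fun () ->
      let h = Hashtbl.create n in
      for i = 0 to n - 1 do Hashtbl.add h i i done;
      h)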

Furthermore, most functional programming languages (OCaml, Haskell, Scala) are incapable of expressing a fast generic mutable hash table because they lack the killer combo of reified generics, value types and a fast GC write barrier.

BEWARE: people who try to claim that Haskell’s purely functional dictionaries are fast by comparing them with Haskell’s mutable hash tables. The correct conclusion is that Haskell’s mutable hash tables are slow.

 

2. There is no purely functional weak hash table.

With a garbage collected imperative language, the relationships between the vertices and edges of a graph can be expressed using weak hash tables. The garbage collector will then collect unreachable subgraphs for you. There is no purely functional weak hash table so, in a pure language, you must write your own garbage collector.
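In OCaml, for example, the closest thing is a hash table built from ephemerons. A hedged sketch, assuming the Ephemeron.K1.Make API from OCaml 4.x (this module has been reorganised across versions):

(* A hash table whose entries do not keep their keys alive: once a
   vertex becomes unreachable elsewhere, the GC may drop its edge list. *)
module VertexTbl = Ephemeron.K1.Make (struct
  type t = string                 (* vertex ids; purely illustrative *)
  let equal = String.equal
  let hash = Hashtbl.hash
end)

let edges : string list VertexTbl.t = VertexTbl.create 16
let () = VertexTbl.add edges "a" ["b"; "c"]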

Note that this is a really fringe disadvantage with most developers never having used a weak hash table!

 

3. There are no purely functional concurrent collections.

By definition, immutable collections cannot support concurrent mutation. Consequently, if you want a shared mutable collection such as an in-memory database then there is no efficient purely functional solution.

 

4. Most graph algorithms look worse and run much slower when written in an FP style.

Purely functional programming is a great tool for some kinds of problems but graph algorithms are one place where I have noticed that pure solutions are often worse both in terms of speed and clarity.

Compare Prim’s algorithm in 12 lines of Python with Prim’s algorithm in 20 lines of Haskell. And why does the Haskell use Prim’s algorithm? Probably because Kruskal’s algorithm is built upon the union-find collection and there is no known efficient purely functional union-find collection.
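For reference, here is the classic imperative union-find that Kruskal's algorithm leans on, sketched in OCaml; its near-constant amortised cost per operation comes from two benign mutations that no known purely functional structure matches:

type uf = { parent : int array; rank : int array }

let create n = { parent = Array.init n (fun i -> i); rank = Array.make n 0 }

(* Find the representative of i's class, compressing the path as we go. *)
let rec find u i =
  if u.parent.(i) = i then i
  else begin
    let r = find u u.parent.(i) in
    u.parent.(i) <- r;
    r
  end

(* Merge two classes, attaching the shallower tree under the deeper one. *)
let union u i j =
  let ri = find u i and rj = find u j in
  if ri <> rj then
    if u.rank.(ri) < u.rank.(rj) then u.parent.(ri) <- rj
    else if u.rank.(ri) > u.rank.(rj) then u.parent.(rj) <- ri
    else begin u.parent.(rj) <- ri; u.rank.(ri) <- u.rank.(ri) + 1 end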

 

5. The inertia of traditional imperative data structures and algorithms is huge.

Beyond graph algorithms, there are many parts of computer science where 65 years of published literature has focused almost entirely upon imperative solutions. Consequently, imperative programmers can easily build upon the backs of giants whereas purely functional programmers are often left starting from scratch. After all, just a few years ago memoization in Haskell was the topic of a PhD thesis!
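By contrast, memoization in an impure language is a few lines over a mutable hash table. A minimal OCaml sketch:

(* Wrap any function so repeated arguments hit a cached result. *)
let memoize f =
  let tbl = Hashtbl.create 16 in
  fun x ->
    match Hashtbl.find_opt tbl x with
    | Some y -> y
    | None ->
        let y = f x in
        Hashtbl.add tbl x y;
        y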

I once challenged a group of Haskell programmers (several of whom had PhDs in Haskell) to write an efficient generic parallelised quicksort in Haskell and this is what happened.

 

6. All existing implementations of functional programming languages, both pure and impure, happen to allocate far too much by design.

Around 1960, McCarthy invented Lisp. The core data structure was the singly-linked list. Each list node was a separate heap allocated block. All modern functional languages evolved from this. In the 1970s, Scheme used essentially the same data representation strategy as Lisp. In the 1980s, SML added a little unboxing with tuples heap allocated as a single block of memory. In the 1990s, OCaml added a little more with unboxed float arrays. Haskell added the ability to unbox some data. But to date no functional programming language has unboxed tuples by default. Even F#, which sits on .NET which provides arbitrary value types, still uses .NET’s boxed tuples. Consequently, all modern functional programming languages incur very high allocation rates for essentially no good reason. Consequently, they all stress their garbage collectors far more than necessary. This is a serious problem not just because it makes serial code slow but because the garbage collector is a shared resource and, therefore, stressing the GC impedes the scalability of parallel programs.
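A small OCaml example makes the boxing point concrete:

(* An int array is one flat heap block, but an array of pairs is an
   array of pointers, each to a separately allocated block. *)
let flat  = Array.init 1_000_000 (fun i -> i)       (* 1 allocation *)
let boxed = Array.init 1_000_000 (fun i -> (i, i))  (* 1,000,001 allocations *)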

I should note that calling this a “disadvantage” is contentious. Xavier Leroy of OCaml fame regards OCaml’s Lisp-like data representation as a good thing because it is the backbone of OCaml’s excellent performance when running the Coq theorem prover. Here Xavier asserted that “OCaml's strategy is close to optimal for symbolic computing”. Functional languages are often optimised for high performance purely functional collections at the expense of low performance imperative collections. Given that imperative collections are generally faster, this puts an artificially-low ceiling on the performance of almost all functional languages.

 

7. Purely functional programming is theoretically good for parallelism but bad for performance in practice, which is the sole purpose of parallelism.

There are two reasons to write parallel programs today. The first is to write objectively fast solutions. The second is to make a slow solution less slow. In most cases, parallel programming in functional languages is a form of the latter. Almost nobody in high performance computing circles (i.e. supercomputers) is running functional code directly. When most functional programmers employ parallel programming today they do so not to attain the best absolute performance but just to improve the performance they have.

Purely functional languages like Haskell are designed to abstract away space and time. This gives you a higher-level perspective of your solution but it makes it very hard to reason about the amount of memory or length of time a Haskell program will require to produce a result. In parallel programming it is always important to make sure that the gain from parallelisation outweighs the administrative overheads of running code in parallel. Haskell makes this very hard. So hard, in fact, that published research on parallel Haskell notoriously cherry picks the degree of parallelisation that maximises performance even though that degree could not be predicted before running the program many times. I have found that straightforward parallelization often yields reliable speedups in languages like C++ but not in Haskell where performance is unpredictable.

BEWARE: People who talk only about scalability and disregard absolute performance. You can improve the scalability of almost any parallel program by redundantly recomputing the Mandelbrot set after each line of code for no reason because most of the time will then be spent in embarrassingly parallel code. Scalability is necessary but insufficient. You must also look at absolute performance.

 

8. It took 50 years for normal people to dilute the smug weenies to the point where you can get a useful answer about functional programming on social media.

I’ve been doing functional programming for over 20 years now. For decades there was a social chasm between functional programmers and people who had real problems to solve. Thankfully this problem is now starting to dissolve away with functional languages like Scala, Clojure and F# being used for real work but for many years the predominantly-smug-weenies dominated the functional scene, making it hard for people to get real solutions to their real problems. Part of the reason for this was that, having languished in obscurity for so many decades, some communities (most notably Lisp) had highly evolved (but wrong) arguments as to why Lisp was good. It took me many years to figure out what was wrong with these arguments.

 

9. There is a huge amount of misinformation about functional programming in circulation.

If you criticise the performance of hash tables in Haskell (more recently here) then you get pure misinformation from the leading lights of the community such as someone advising people to effectively turn off garbage collection.

For years the functional programming community brandished beautifully short implementations of the Sieve of Eratosthenes and Quicksort algorithms. These were even taught to students for years. Only many years later did it emerge that their solutions did not implement those algorithms. Melissa O’Neill even published a paper correcting the Sieve of Eratosthenes in Haskell. In particular, her genuine sieve requires 100x more code than the original Haskell. Same for quicksort, where Haskell’s elegant two-line sort is over 1,000x slower than Sedgewick’s Quicksort in C because the Haskell deep copies lists over and over again, completely blowing the asymptotic I/O complexity of Hoare's original algorithm.
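For the curious, that famous two-liner transliterated into OCaml makes the problem plain: every level of recursion allocates fresh lists with filter and append instead of partitioning in place as Hoare's algorithm does.

let rec qsort = function
  | [] -> []
  | p :: xs ->
      qsort (List.filter (fun x -> x < p) xs)
      @ (p :: qsort (List.filter (fun x -> x >= p) xs))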

See also “Why is Haskell used so little in industry?” for a thorough debunking of the Haskell in Industry page.

 

Saturday, 7 May 2016

ARM code generation quality

I previously looked at x86 code generation quality and found that GCC does some interesting high-level optimisations but, in that case, any performance benefit was lost in poor quality code generation. In this article I’m going to look at ARM assembly instead.

 

Compiling the Fibonacci function in C from last time we obtain times ranging from 2.4s to 6.6s for fib(40) using -O2 to -O0. Interestingly, using -O3 actually worsens performance over -O2. As before, inspection of the generated assembly language shows that GCC employs nifty high-level optimisations when given the -O2 option.

 

The simplest implementation of our Fibonacci function in ARM assembly is perhaps:

 

fib:
        cmp     r0, #2
        movlt   pc, lr
        stmfd   sp!, {r1, r2, lr}
        sub     r1, r0, #2
        sub     r0, r0, #1
        bl      fib
        mov     r2, r0
        mov     r0, r1
        bl      fib
        add     r0, r0, r2
        ldmfd   sp!, {r1, r2, pc}

 

These 11 instructions are almost identical to the output generated by -O1 except we have been more frugal in order to avoid having to save and restore R3. This takes 3.9s to run.

 

Perhaps the most obvious optimisation is to inline the initial test (if n<2 then n else …) and then skip it when recursing:

 

fib2:
        cmp     r0, #2
        bxlt    lr
fib2i:
        stmfd   sp!, {r1, r2, lr}
        sub     r1, r0, #2
        sub     r0, r0, #1
        cmp     r0, #2
        blge    fib2i
        mov     r2, r0
        mov     r0, r1
        cmp     r0, #2
        blge    fib2i
        add     r0, r0, r2
        ldmfd   sp!, {r1, r2, pc}

 

This immediately reduces the time taken to 1.975s, over 20% faster than any of the C solutions. So with very little effort we have written assembly by hand that is both shorter and faster than the assembly generated by GCC.

 

Let’s take a look at that high-level optimisation that GCC did. With -O2, GCC generates 17 instructions:

 

fib:
        cmp     r0, #1
        stmfd   sp!, {r4, r5, r6, lr}
        mov     r6, r0
        ble     .L4
        mov     r4, r0
        mov     r5, #0
.L3:
        sub     r0, r4, #1
        bl      fib
        sub     r4, r4, #2
        cmp     r4, #1
        add     r5, r5, r0
        bgt     .L3
        and     r6, r6, #1
.L2:
        add     r0, r5, r6
        ldmfd   sp!, {r4, r5, r6, pc}
.L4:
        mov     r5, #0
        b       .L2

 

This is equivalent to the following:

 

let rec fib n =
  if n <= 1 then n else loop (n, 0, n)
and loop (r4, r5, r6) =
  let r5 = r5 + fib (r4 - 1) in
  if r4 > 3 then loop (r4 - 2, r5, r6) else r5 + (r6 land 1)

 

We can rewrite this high-level code in assembly by hand as 15 instructions:

 

fib3:
        cmp     r0, #1
        bxle    lr
        stmfd   sp!, {r1, r2, r3, lr}
        mov     r1, r0
        mov     r2, #0
        mov     r3, r0
loop:
        sub     r0, r1, #1
        bl      fib3
        add     r2, r2, r0
        cmp     r1, #3
        subgt   r1, r1, #2
        bgt     loop
        and     r3, r3, #1
        add     r0, r2, r3
        ldmfd   sp!, {r1, r2, r3, pc}

 

Furthermore, whereas the C code took 2.4s this hand-written assembly takes just 1.9s. This is probably because the assembly generated by GCC takes 8 instructions to implement the identity function when n<=1 whereas our solution takes just 2 instructions.

 

GCC’s choice of high-level optimisation is interesting. Looking at the problem, the most obvious high-level optimisation to me is inlining the recursive calls: fib(n) = fib(n-1) + fib(n-2) = (fib(n-2) + fib(n-3)) + (fib(n-3) + fib(n-4)). This is particularly beneficial because it results in two identical calls to fib(n-3) in the general case, and that common subexpression can be factored out. The following assembly does this and runs in just 38ms:

 

fib5:
        cmp     r0, #4
        bge     fib5mid
        cmp     r0, #2          @ base cases: fib 0 = 0, fib 1 = 1,
        moveq   r0, #1          @ fib 2 = 1, fib 3 = 2
        movgt   r0, #2
        bx      lr
fib5mid:
        stmfd   sp!, {r1, r2, lr}
        mov     r1, r0          @ r1 = n
        sub     r0, r1, #4
        bl      fib5            @ r0 = fib(n-4)
        mov     r2, r0
        sub     r0, r1, #3
        bl      fib5            @ r0 = fib(n-3)
        add     r2, r2, r0      @ r2 = fib(n-4)+fib(n-3) = fib(n-2)
        cmp     r1, #3
        sublt   r0, r1, #1
        addlt   r0, r0, r2
        blt     fib5end
        add     r2, r2, r0      @ reuse fib(n-3): r2 = fib(n-1)
        sub     r0, r1, #2
        bl      fib5            @ r0 = fib(n-2)
        add     r0, r0, r2      @ fib(n) = fib(n-2) + fib(n-1)
fib5end:
        ldmfd   sp!, {r1, r2, pc}
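In the style of the earlier OCaml translation, this inlined version amounts to the following (note fib(n-3) computed once and used twice):

let rec fib n =
  if n < 4 then (if n < 2 then n else n - 1)  (* fib 2 = 1, fib 3 = 2 *)
  else
    let f3 = fib (n - 3) in                   (* the shared subexpression *)
    fib (n - 4) + f3 + f3 + fib (n - 2)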

 

So it seems the folklore wisdom that it is impossible to beat the assembly generated by a modern C compiler simply isn’t true, at least in this case.