Tuesday, 6 November 2018

Benchmarking in the web age

The TechEmpower website contains some fascinating benchmarks of servers. The results on this benchmark of multiple requests to servers provide some insight into the performance characteristics of .NET on a modern problem. Specifically, the C# on ASP.NET Core solutions range from 2.5× to 80× slower than the fastest solution, which is written in Rust. In fact, C# is beaten by the following programming languages, in order:

  1. Rust
  2. Java
  3. Kotlin
  4. Go
  5. C
  6. Perl
  7. Clojure
  8. PHP
  9. C++

Furthermore, .NET Core is Microsoft's new, improved and faster version of .NET aimed specifically at these kinds of tasks. So why is it beaten by all those languages? I suspect that a large part of this is the change in workload, from the kind of number crunching .NET was designed for to a modern string-heavy workload, and I suspect .NET's GC isn't as well optimised for this as the JVM's. As we have found, .NET has really poor support for JSON compared to other languages and frameworks, with support fragmented across many non-standard libraries. Amusingly, OCaml has standardised on Yojson, which we have found to be much faster and more reliable despite the fact that it is a hobby project.
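For those who haven't used it, here is a minimal sketch of JSON handling with Yojson (the payload and field names are made up for illustration; it requires the yojson package):
  let () =
    let json = Yojson.Basic.from_string {|{"name":"benchmark","requests":512}|} in
    let open Yojson.Basic.Util in
    (* extract two fields and print them; Util provides member, to_string and to_int *)
    Printf.printf "%s: %d\n"
      (json |> member "name" |> to_string)
      (json |> member "requests" |> to_int)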

Another interesting observation is that C and C++ are doing surprisingly badly. The fact that Perl and PHP are beating C++ is doubtless because they are calling out to C code, but that seems less likely for Rust, Java, Kotlin and Go. One interesting hypothesis I read on the interwebs is that C++ developers tend to copy strings a lot in order to avoid memory management bugs, whereas Rust protects its developers from such bugs so they can be more efficient without having to worry. If true, that would be quite amazing because the user-land code behind these benchmarks isn't that big and yet C++ has already failed to scale, with developers reaching for one-size-fits-all-badly solutions.

Whatever the case, one thing is clear: the time is ripe for new languages, VMs and frameworks.

Sunday, 4 November 2018

On "Quantifying the Performance of Garbage Collection vs. Explicit Memory Management"

The computer science research paper "Quantifying the Performance of Garbage Collection vs. Explicit Memory Management" by Emery Berger and Matthew Hertz contains an interesting study about memory management. However, the conclusions given in the paper were badly worded and are now being used to justify an anti-GC ideology.

Introduction

That paper describes an experiment that analyzed the performance of a benchmark suite using:
  1. Tracing garbage collection.
  2. Oracular memory management (precomputing the earliest point at which free could have been inserted).
The experiment was performed:
  • On one VM, the Jikes Research Virtual Machine (RVM).
  • Using one programming language, Java, and consequently one programming paradigm, object oriented programming.
  • Using each of the five different garbage collection algorithms provided by that VM.
The five GC algorithms are:
  • GenMS - Appel-style generational collector (1988)
  • GenCopy - two generations with copying mature space
  • CopyMS - nursery with whole-heap collection
  • Semispace - Cheney semi-space (1970)
  • MarkSweep - non-relocating, non-copying single-generation
Note that none of these GC algorithms are in widespread use today. The only one that comes close is Semispace, which is one of Erlang's GC algorithms, albeit used in a substantially different way. In fact, the algorithms studied are all at least 30 years old.

The Obvious Discussion

The experimental results presented in the paper are very interesting and immediately raise many questions. Perhaps the most obvious question is: do these results generalize to any other programming paradigms, languages, virtual machines or garbage collection algorithms?
In the absence of evidence, let's just think about this first. Functional languages use different data structures (purely functional or persistent collections) in different ways (recursion via tail calls rather than loops and mutation), so I see no reason to assume that they would behave in the same way that Java does in this context. Indeed, lean functional languages like OCaml famously require far less memory than Java to solve the same problem, and the relative bloat is usually attributed to object-oriented programming.
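To illustrate what I mean by a persistent collection, here is a tiny OCaml sketch (the keys and values are arbitrary): adding a binding returns a new map that shares most of its structure with the old one rather than mutating it in place.
  module StringMap = Map.Make (String)

  let () =
    let m1 = StringMap.add "x" 1 StringMap.empty in
    let m2 = StringMap.add "y" 2 m1 in
    (* m1 is left untouched: the update produced a new map sharing structure with the old one *)
    Printf.printf "m1 has y? %b  m2 has y? %b\n"
      (StringMap.mem "y" m1) (StringMap.mem "y" m2)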
Closer to Java is C# on .NET, but even this is substantially different because reified generics and value types result in aggressive unboxing, giving a flatter heap topology and relieving pressure on the GC. This manifests particularly in the context of hash tables, where .NET provides an efficient generic hash table implementation that the JVM is incapable of expressing. So it seems extremely unlikely that these results would even generalize to .NET languages like C#.
Still, the observations are interesting.

Their Conclusions

Sadly, rather than asking these obvious questions and suggesting further work, the authors chose to draw several completely unjustified conclusions. Worse, this paper passed peer review, has been published containing these unjustified conclusions and is now being bandied around on the internet.
Firstly, the paper concludes:
"In particular, when garbage collection has five times as much memory as required, its runtime performance matches or slightly exceeds that of explicit memory management"
The evidence described in the paper justifies the conclusion that, at least for these benchmarks in Java on the Jikes RVM, the GenMS GC algorithm requires five times more memory than oracular memory management to achieve the same level of performance. This extremely specific observation cannot possibly be generalized to all garbage collectors.
The paper goes on to conclude:
"Practitioners can use these results to guide their choice of explicitly-managed languages like C or C++, or garbage-collected languages like Java or C#"
Not only is there no evidence that these results have any bearing on C# whatsoever, but there is also no evidence to substantiate the implication that the oracular memory management studied here is at all representative of hand-written C or C++ code. Remember: the oracle precomputes the optimal place to free every heap-allocated block of memory. Is it reasonable to assume that human software developers will do the same? Absolutely not. In practice, software developers using such languages employ pool allocators that amortise the cost of allocation at the expense of dramatically increasing the amount of floating garbage. Furthermore, the program generated by the oracular method is not a general solution that will run correctly on different inputs: it is only correct for, and optimised for, the one input it was given! This is akin to hoisting computation out of the benchmark and is completely unrepresentative of real software.
So, again, it seems extremely unlikely that the oracular memory management described here (although interesting) is at all representative of manual memory management in practice.
Finally, there are two major families of garbage collection: tracing and reference counting. All of the garbage collectors covered by this paper are tracing garbage collectors so the results say nothing about reference counting at all, much less the whole of garbage collection in general.

Our Conclusions

This paper is another piece of an interesting puzzle but it is a puzzle that will never be solved. No amount of objective quantitative factual evidence can ever be enough to justify a completely general conclusion about garbage collection. Hopefully someday someone will repeat this experiment using a different VM, like .NET Core, and a different language and paradigm, like F#. Until then, there is no evidence to support the hypothesis that garbage collection requires 5x more memory to achieve the same performance as manual memory management.

Tuesday, 16 October 2018

Some objective quantitative measurements I once posted on Usenet:
Jon Harrop wrote:
> Andreas Rossberg wrote:
>> That is a wild claim, and I doubt that you have any serious statistics
>> to back it up.
>
> Here are some statistics on the proportion of lines of code devoted to
> type annotations from 175kLOC of production OCaml and 5kLOC of production
> Haskell:
>
> OCaml:
> Hevea 9.0%
> ADVI 8.6%
> FFTW3 5.2%
> Unison 3.5%
> MLDonkey 2.5%
> LEdit 1.4%
> MTASC 0.0%
> HLVM 0.0%
>
> Haskell:
> XMonad 19%
> Darcs 12%

For further comparison, here are some statistics for compilers written in
OCaml and Standard ML:

OCaml: 6.3% of 217kLOC
MosML: 13% of 69kLOC

Monday, 3 September 2018

What languages did you learn, in what order?

I once compiled the following list of programming languages and which year I started learning them:
  • 1981: Sinclair BASIC (on a ZX81)
  • 1983: BBC BASIC (on a BBC Micro)
  • 1985: Pascal (installed via ROM)
  • 1987: Logo
  • 1987: 6502 Assembly (on a BBC Micro)
  • 1989: ARM assembly (on an Acorn Archimedes)
  • 1992: Casio fx-7700 BASIC
  • 1992: C (using the excellent Norcroft compiler for Acorn computers)
  • 1994: C++ (the awful Beebug Easy C++)
  • 1995: UFI (on a VAX running OpenVMS)
  • 1996: StrongARM assembly
  • 1996: Standard ML
  • 1997: Mathematica
  • 1998: Quake C
  • 2004: OCaml
  • 2006: Java
  • 2007: Common Lisp
  • 2007: Scheme
  • 2007: Scala
  • 2007: Haskell
  • 2007: F#
  • 2009: Clojure
  • 2010: HLVM
  • 2017: Elm
  • 2017: Javascript
I was recently concerned to hear from Socio-PLT: Quantitative and Social Theories for Programming Language Adoption that the number of languages a programmer knows stagnates after the age of just 20. The decade is nearing its end but I have only learned three languages. Now I'm wondering which languages I should learn next...

Wednesday, 31 January 2018

Background reading on the reference counting vs tracing garbage collection debate

Eight years ago I answered a question on Stack Overflow about the suitability of OCaml and Haskell for soft real-time work like visualization:
"for real-time applications you will also want low pause times from the garbage collector. OCaml has a nice incremental collector that results in few pauses above 30ms but, IIRC, GHC has a stop-the-world collector that incurs arbitrarily-long pauses"
My personal experience has always been that RAII in C++ incurs long pauses when using non-trivial data (i.e. nested, structured data: collections of collections of collections, trees, graphs and so on); non-deferred reference counting has the same problem for the same reason; tracing garbage collectors like OCaml's work beautifully; but there are many notoriously bad tools, like Java, that have given tracing garbage collection a bad name.
Now that I am revisiting this issue I am surprised to find many individuals and organisations repeating exactly the same experimental tests that I did and coming to the same conclusions.
Pusher published a blog post, Low latency, large working set, and GHC’s garbage collector: pick two of three, that explains how, after building their production solution in Haskell in the interests of correctness, they were forced to drop the Haskell programming language due to insurmountable problems with GC latency. They chose to switch to Google's Go language instead, but I think OCaml would have served them well. I'd also like to stress that people in industry neglecting performance as a functional requirement is a bugbear of mine!
Gabriel Scherer published his research as a blog post, Measuring GC latencies in Haskell, OCaml, Racket, presenting a related GC benchmark with results showing that OCaml attains 2ms maximum GC pauses, compared with 38ms for tuned Racket (formerly known as PLT Scheme) and 51ms for Haskell.
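To give a flavour of how such latency figures are typically obtained, here is a rough OCaml sketch (not Scherer's actual benchmark: the sizes and iteration counts are made up, and it needs the unix library for wall-clock timing). Do a small unit of allocating work per iteration, keep some of it live so the major heap grows, and record the worst iteration time observed; GC pauses show up as outliers in that worst-case figure.
  let () =
    let retained = Array.make 1_000 [||] in
    let worst = ref 0.0 in
    for i = 1 to 100_000 do
      let t0 = Unix.gettimeofday () in
      (* allocate a fresh chunk and retain it, evicting an older one *)
      retained.(i mod 1_000) <- Array.init 1_000 string_of_int;
      let dt = Unix.gettimeofday () -. t0 in
      if dt > !worst then worst := dt
    done;
    Printf.printf "worst pause observed: %.2f ms\n" (1000.0 *. !worst)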
Twitch published a post entitled "Go’s march to low-latency GC" that described their five-year journey from GC pauses lasting tens of seconds down to 1ms using Go.
Hashtable latencies on GitHub is a web server benchmark ported to a variety of tracing garbage collected languages as well as Swift (which uses reference counting). Although at the beginning they say "the Swift version is mostly used as a reference of what would be ideal", at the end they actually conclude "Want to build low-latency, memory-intensive services in a GCed language? Use OCaml (or Reason if you prefer the syntax)". Note in particular their graph of measured latency profiles, which shows no significant difference between OCaml and Swift at any scale.
Here is a discussion in the Rust subreddit about reference counting woes where the author ends up admitting "it might be interesting to deliberately leak the memory and just hope that the OS is smart enough to swap out pages that aren't being referenced any more".
Google's V8 team have blogged about their garbage collector including the post Getting Garbage Collection for Free about optimising their GC for latency by deferring collection work until idle time.

Tuesday, 26 December 2017

Does reference counting really use less memory than tracing garbage collection? Mathematica vs Swift vs OCaml vs F# on .NET and Mono

Our previous post caused some controversy by questioning the validity of some commonly-held beliefs. Specifically, the beliefs that reference counting (RC) always requires less memory than tracing garbage collection (GC) and that tracing GCs require 3-4x more memory in order to perform adequately. We presented a benchmark written in Swift and OCaml and noted that the RC'd Swift implementation ran over 5x slower and required over 3x more memory than the tracing GC'd OCaml implementation. This observation disproves these beliefs in their strong form and brings into question whether there is any validity even in their weak forms. After all, we have never seen any empirical evidence to support these beliefs in any form.
We received a lot of criticism for that post. The vast majority of the criticism was not constructive, but two valid points did arise. Firstly, although our result was anomalous, it would be more compelling to see the benchmark repeated across a wider variety of production languages using both RC and tracing GC because this would help to reduce the error caused by differences other than GC strategy. Secondly, it would be more compelling to repeat the study with different benchmark tasks in case this one problem (computing symbolic derivatives) is anomalous. This post addresses the former concern.
We implemented solutions equivalent to the original Swift and OCaml, this time in F# and Mathematica. The F# uses a tracing garbage collector that is inherited from its runtime (either Mono or .NET). Mathematica is a domain specific language (DSL) designed specifically for computer algebra that uses its own heavily-optimized RC implementation, so it stands the best possible chance of being able to compete with the original OCaml. So these additional measurements should usefully extend the applicability of our findings.
The F# 4.0 code was run on Mono 4.6.2 directly on the Raspberry Pi 3 alongside the Swift and OCaml and also on .NET 4.7 on Windows 10 (i.e. different architecture and platform!). Our results for Mathematica represent excess memory consumption over the initial consumption (because Mathematica has a uniquely huge standard library) and were obtained running on Mathematica 11.2 in the cloud and 10.0.1 on the Raspberry Pi 3.
According to our measurements, the solutions using tracing GCs require less memory than the solutions using RC in every single case:
Again, we find the common belief that RC requires less memory to be unjustified.
Just out of interest, here are the run times too (on Raspberry Pi 3 only):

Sunday, 24 December 2017

Does reference counting really use less memory than tracing garbage collection? Swift vs OCaml

The way software developers cling to folklore and avoid proper experimental testing of hypotheses really bothers me. I think perhaps the worst branch of computer science still riddled with myths and legends is memory management.
Despite over half a century of research on garbage collection showing that reference counting is inferior to tracing garbage collection algorithms (even though almost all GC research restricts its consideration to Java, when much better algorithms have been known for years), there are still many people who claim otherwise. C++ developers obviously still believe that reference counted smart pointers are superior to tracing garbage collection, but now other people are doing it too. People who should know better.
This situation recently reared its ugly head again when Apple released a promising new programming language called Swift that shuns tracing garbage collection in favor of reference counting. Now, there are logical reasons for Apple to have chosen reference counting, most notably that their existing infrastructure uses it heavily. That I can tolerate. However, what I cannot tolerate is Apple putting lipstick on their pig by making technical claims about the superiority of reference counting over tracing garbage collection. Baseless claims.
Chris Lattner is a great guy and I have a lot of respect for him as the author of both LLVM and Swift but I'm amazed to see him make claims like:
"More pointedly, not relying on GC enables Swift to be used in domains that don’t want it - think boot loaders, kernels, real time systems like audio processing, etc.
...
One thing that I don’t think is debatable is that the heap compaction behavior of a GC (which is what provides the heap fragmentation win) is incredibly hostile for cache (because it cycles the entire memory space of the process) and performance predictability.
Given that GC’s use a lot more memory than ARC systems do, it isn’t clear what you mean by GC’s winning on heap fragmentation.
...
GC also has several *huge* disadvantages that are usually glossed over: while it is true that modern GC's can provide high performance, they can only do that when they are granted *much* more memory than the process is actually using. Generally, unless you give the GC 3-4x more memory than is needed, you’ll get thrashing and incredibly poor performance. Additionally, since the sweep pass touches almost all RAM in the process, they tend to be very power inefficient (leading to reduced battery life)."
Some of these beliefs are clearly wrong. For example, Docker and Mirage are both written in languages using tracing garbage collection (Go and OCaml, respectively). Others are less obviously wrong. For example, claiming that GCs "cycle the entire memory space of the process" and "the sweep pass touches almost all RAM in the process" is actually untrue, but it isn't immediately obvious why. The reason is that tracing garbage collectors only care about pointers. When presented with a block of binary data (e.g. an array of numbers) that doesn't contain any pointers, a tracing GC will not bother to trace it. None of that RAM gets touched. A tracing GC will only touch almost all RAM if the heap is composed almost entirely of pointers, for example if you had a traditional purely functional collection like a balanced binary tree so big that it consumed all available memory. Even on a mobile phone that is extremely unlikely to occur. Also, Chris' claim about heap fragmentation is at best dubious. The primary source of defragmentation in a generational collector is copying survivors from the nursery into a mostly-contiguous part of an older generation. The nursery is deliberately sized so that this operation is expected to fit into CPU cache, so it is not "incredibly hostile for the cache" at all. On the contrary, GC research has shown the cache friendliness of this operation to be one of the most important benefits of generational GCs. Finally, some of these claims are surprising to me and I am unwilling to accept them as fact without first running some tests in an attempt to falsify them, most notably this belief that tracing garbage collection requires 3-4x more memory than reference counting. Where did these numbers come from?!
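As a rough illustration of the point about pointer-free data (a sketch only: the sizes are arbitrary and the timings machine-dependent), the following OCaml compares the cost of a full major collection when the heap holds one flat float array against the cost when roughly as much memory is spread across millions of small boxed values. The collector never scans the contents of the pointer-free block, so the first collection is far cheaper.
  (* time a full major collection; Sys.time gives CPU seconds, which is fine here *)
  let time_full_major label =
    let t0 = Sys.time () in
    Gc.full_major ();
    Printf.printf "%s: full major GC took %.1f ms\n" label
      (1000.0 *. (Sys.time () -. t0))

  let () =
    let flat = Array.make 10_000_000 0.0 in                  (* one pointer-free block, roughly 80 MB *)
    time_full_major "pointer-free float array";
    let boxed = Array.init 10_000_000 (fun i -> Some i) in   (* ten million small heap blocks *)
    time_full_major "ten million boxed values";
    ignore flat;
    ignore boxed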
Before we get into my findings, perhaps I should explain my unwillingness to accept this information. Perhaps the most obvious problem is that, if the performance of a tracing GC is crippled by less than 3x memory overhead, why is the default memory overhead of OCaml's GC set to just 80%? I've noticed that Apple aficionados often cite this paper by Hertz et al. in an attempt to justify their belief that tracing garbage collection requires 3-4x more memory than manual memory management in order to work efficiently. There are many problems with this. Firstly, extending the conclusion to reference counting requires you to believe that reference counting has no space overheads when, in fact, it obviously needs to store the reference counts, and that it collects at the earliest possible point in a program's execution when, in fact, it doesn't. Secondly, that paper silently ignores the production-quality JVM collector and replaces it with a variety of toy collectors, most of which use algorithms the likes of which haven't been seen in production for decades. Am I supposed to just believe that they did an expert job implementing their toy collectors? Where is their source code so I can reproduce these results for myself? Smells fishy. Finally, they give no justification for their choice of benchmark programs; they end up dropping three of the eight benchmarks they started with because they couldn't get their code to work properly; and some of the remaining benchmarks (e.g. a ray tracer) sound rather anomalous from a memory management perspective.
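Incidentally, that 80% figure is OCaml's space_overhead parameter, which can be queried and changed at runtime via the Gc module. A minimal sketch (the value 200 below is just an arbitrary example of trading more memory for less frequent major-GC work):
  let () =
    let c = Gc.get () in
    Printf.printf "space_overhead = %d%%\n" c.Gc.space_overhead;
    (* raise it to trade memory for less major-GC work; 200 is an arbitrary example *)
    Gc.set { c with Gc.space_overhead = 200 }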
So I decided to test this hypothesis myself by benchmarking Swift against OCaml. Swift uses reference counting and OCaml uses tracing garbage collection. I am interested primarily in memory consumption, but performance, compile times and code size will also be interesting. I thought about possible benchmark tasks. Given Chris' implication of high pointer density, I thought it would be good to try some immutable collections. At first I thought of lists but then I figured trees would be more involved but still tractable in both languages. A common source of such collections is expression trees. Rather than writing an interpreter I thought it would be more interesting to write a program that computes the symbolic derivative of a mathematical expression. If we restrict consideration to integers, variables, addition, multiplication, raising to a power and logarithms then we can easily write a function that differentiates any such expression.
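To make the task concrete, here is a minimal OCaml sketch of the kind of expression type and derivative function I mean. It is not the benchmark code itself (the constructor names are illustrative and it performs no simplification), just the shape of the problem:
  type expr =
    | Int of int
    | Var of string
    | Add of expr * expr
    | Mul of expr * expr
    | Pow of expr * expr
    | Ln of expr

  (* derivative of f with respect to the variable x, with no simplification *)
  let rec d f x =
    match f with
    | Int _ -> Int 0
    | Var y -> if y = x then Int 1 else Int 0
    | Add (f, g) -> Add (d f x, d g x)
    | Mul (f, g) -> Add (Mul (d f x, g), Mul (f, d g x))
    | Pow (f, g) ->
        (* (f^g)' = f^g * (g' ln f + g f'/f) *)
        Mul (Pow (f, g),
             Add (Mul (d g x, Ln f),
                  Mul (g, Mul (d f x, Pow (f, Int (-1))))))
    | Ln f -> Mul (d f x, Pow (f, Int (-1)))

  (* e.g. the first derivative of x^x with respect to "x" *)
  let () =
    let x = Var "x" in
    ignore (d (Pow (x, x)) "x")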
Here is a solution in Swift and a solution in OCaml. The algorithms should be identical but clearly aren't optimal. I'm much more familiar with OCaml than Swift so I reached out to the Swift community for code review and folded in one fix and some minor improvements but nobody could optimize it.
On a Raspberry Pi 3 running Swift 3.1.1 and OCaml 4.02.3, computing up to the 9th derivative of x^x, I get:
  • Swift requires 70% more characters of source code (3,939 vs 2,308 chars).
  • OCaml compiles over 3x faster than Swift with -O (1.7s vs 5.3s).
  • OCaml runs over 5x faster than Swift (1.469s vs 7.623s).
  • Swift consumes over 3x more resident memory (104MB vs 29MB).

Turns out the Swift code can be shrunk by prefixing function parameters with an underscore in order to avoid having to name all arguments at all call sites, so the code sizes aren't amazingly interesting, except perhaps that it is awesome to see more mainstream languages providing union types and pattern matching. The compile times are interesting: OCaml is very fast, so being 3x slower than OCaml is actually pretty good. Many languages are orders of magnitude slower than OCaml at compilation. The run times are very interesting to me: I have previously noted reference counting in C++ running 10x slower than OCaml and fully expected Swift to be slow at this because reference counting is slow, but exactly how slow I didn't know, because I am aware that Swift goes to great lengths to try to optimise away increments and decrements of reference counts. The memory consumption is really amazing to me. Swift uses over 3x more resident memory than OCaml. Wow. Not even close!
This cannot just be because every word-size datum in the heap has been augmented with a word-sized reference count because that could only account for a factor of two. I suspect OCaml's tracing garbage collector is collecting garbage earlier than Swift's reference counting. I'll wager part of the problem is that Swift leaks in the presence of recursion due to the use of scope-based reference counting without tail call elimination. If so, that is an obvious optimisation for Chris to put into his Swift compiler. I'll leave the task of replacing my recursive nest function with an iterative looping one to test this theory as an exercise for the reader.
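For anyone attempting that exercise, here is the rough shape I have in mind (a hedged sketch: I am assuming nest simply applies a function n times, which may not match the benchmark code exactly), together with an iterative equivalent using an explicit loop:
  let rec nest n f x =
    if n = 0 then x else nest (n - 1) f (f x)

  (* iterative equivalent: an explicit loop over a mutable accumulator *)
  let nest_iter n f x =
    let acc = ref x in
    for _i = 1 to n do
      acc := f !acc
    done;
    !acc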
Anyway, the conclusion seems clear: in at least this case reference counting requires more memory than tracing garbage collection. Therefore, it is unreasonable to make assertions like "tracing GC requires 3-4x more memory than reference counting to perform adequately".

Merry Christmas, y'all.