Sunday, 13 October 2013

Memory management myths: promptness

People often assert that scope-based reference counting such as shared_ptr in C++ collects garbage “promptly” and some people define this as “collects at the earliest possible point”. For example, at the time of writing the Wikipedia page about garbage collection says:

Compared to tracing garbage collection, reference counting guarantees that objects are destroyed as soon as they become unreachable” – Wikipedia

Similar claims can even be seen in published research such as the paper “Down for the Count? Getting Reference Counting Back in the Ring”:

“Of the two fundamental algorithms on which the garbage collection literature is built, reference counting has lived in the shadow of tracing. It has a niche among language developers for whom either performance or completeness is not essential, and is unused by mature high performance systems, despite a number of intrinsic advantages such as promptness of recovery and dependence on local rather than global state.” Blackburn et al.

On the other hand you can see statements by experts like Richard Jones, co-author of the excellent Garbage Collection Handbook, make statements like:

“More importantly, note also that even an immediate (i.e. non deferred) reference counter cannot reclaim objects as soon as they are no longer referenced as finalisation must be asynchronous (see Hans Boehm's POPL03 paper "Destructors, finalizers and synchronization").” – a post on the gc-list by Richard Jones.

Let’s have a closer look at the thinking behind this belief and test it with a simple program. The mental model that underpins this belief is that any function’s local variables are stored in separate slots in the function’s stack frame for the entire duration of a function’s body and, therefore, will be reachable from the point of view of the garbage collector for the duration of the call to the function. This mental model underpins exam and interview questions such as Is object eligible for garbage collection after “obj = null”? and When Is The Object Eligible For Garbage Collection?.

In reality, this mental model is simple, obvious and wrong. Why? Firstly, the garbage collector sees the run-time representation of a program after it has been subjected to transforms such as inlining, instruction reordering and code block reordering by the compiler that can mutilate the structure of a program beyond recognition and, consequently, concepts like scope that exist only in the source code and not in the compiled form are not visible to the garbage collector. Secondly, the register allocator does everything possible to keep references in registers and avoid spilling them to the stack and when they must be spilled it uses the results of liveness analysis to overwrite any dead references in the stack frame whenever possible. In fact, some compilers don’t even use stack frames, such as our own x86 JIT in F# and the HLVM project, and other compilers like SML/NJ convert every call into continuation style and put stack frames on the heap, splitting every segment of code between a pair of function calls in the source into its own separate function in the compiled form.

Enough theory, let’s take a look at some working code. Here is a simple example using tracing garbage collection in OCaml/F# where an argument tmp to a function dies in the middle of the function body and, in particular, before a recursive call:

let rec loop tmp i =
  if i<=0 then tmp else
    let tmp2 = loop (Array.copy tmp) (i-1)
    tmp2.[0] <- tmp2.[0] + 1

When run using loop (Array.init m id) n, this code clearly uses less than mn space and keeps on running indefinitely. This can only be because the argument tmp is no longer reachable via the stack when the recursive call is made and, consequently, gets garbage collected.

Here is the equivalent using scope-based reference counting in C++:

shared_ptr<vector<double> > loop(shared_ptr<vector<double> > tmp, int i) {
  if (i<=0) {
    return tmp;
  } else {
    shared_ptr<vector<double> > tmp1(new vector<double>(*tmp));
    shared_ptr<vector<double> > tmp2 = loop(tmp1, i-1);
    return tmp2;

In contrast, this code clearly requires at least mn space when run, goes to swap and (on Windows) dies from out of memory. Unlike the OCaml/F# code, the scope-based reference counting using shared_ptr in C++ keeps the tmp array allocated for longer than necessary, right until the end of the function call.

This observation also destroys another popular memory management myth: that tracing garbage collection always requires more memory than reference counting.

If there is any advantage to the C++ then it is the presence of guarantees. The semantics of C++ guarantee that after the end of scope the object has been deleted. However, it is worth noting that this guarantee of determinism does not apply to objects shared between threads because in that situation the threads race to decrement the reference counter to zero and the winner of the race condition is burdened with executing the destructor.

Saturday, 12 October 2013

Memory management myths: determinism

Although the vast majority of programmers have now migrated to garbage collected languages and will probably never go back, there are still a few clinging to manual memory management. In most cases, the continued use of manual memory management is for good reason but some of these people are perpetuating myths in an attempt to justify avoiding garbage collection. Determinism can be a genuinely good reason to stick with manual memory management and is practically important in memory-constrained embedded devices. However, C++ programs are not as deterministic as people sometimes claim and, in particular, thread-safe reference counting using shared_ptr is non-deterministic. Specifically, threads holding references to shared reference-counted objects race to perform the final decrement and the thread that wins the race is responsible for destruction.

Thursday, 10 October 2013

Herb Sutter's favorite C++ 10-liner has a memory management bug

In a recently-posted video, Herb Sutter (a prominent C++ expert) describes his favorite C++ 10-liner as “a thread-safe reference-counted object cache”:

shared_ptr<widget> get_widget(int id) {
  static map<int, weak_ptr<widget>> cache;
  static mutex m;

  lock_guard<mutex> hold(m);
  auto sp = cache[id].lock();
  if (!sp) cache[id] = sp = load_widget(id);
  return sp;

This example is very interesting. Firstly, it manages to pull in reference counting, weak references and a mutex which are all very rare in modern programming. Secondly, it contains a memory leak that is difficult to fix in C++ because APIs are burdened with memory management details and this API is incapable of expressing deterministic cleanup because there is no facility for a widget's destructor to remove its entry in the map. Finally, the correct name for this data structure is a concurrent weak dictionary, specifically one with weak values. You'll find correct implementations of this data structure are widely available for C#, F# and Java such as the one here.

The obvious fix is to sweep stale entries from the map when get_widget is called but this leaves floating garbage in the map between calls to get_widget, is asymptotically less efficient and incurs unbounded pauses for an unbounded number of threads.

Update: Matthew Avery (from the USA) suggests altering the API and semantics of the functions involved so load_widget returns a shared_ptr with a custom deleter that removes the stale map entry as soon as a widget is destructed. If this idea can be made to work then it would be the only deterministic solution to have been proposed to date.

Wednesday, 25 September 2013

How do reference counting and tracing garbage collection compare?

Reference counting works by counting the number of references to each object. When the reference count drops to zero the object is definitely unreachable and can be recycled. The first advantage is simplicity. The second advantage is that decrements can occur as locals fall out of scope so collection can be deterministic and predictable. The third advantage is that the asymptotic complexity of reference counting does not depend upon the total size of the heap. The first problem is that cycles keep reference counts above zero even when they are unreachable, so reference counting alone cannot handle the general case. The second problem with reference counting is that incrementing and decrementing a counter in an object every time a reference to it is copied or dropped is very computationally expensive because it incurs a cache miss even if nothing else in the object is read or written. The third problem is that multithreaded reference counting is non-deterministic because increments and decrements race. The fourth problem is that simple implementations like Boost's shared_ptr in C++ allow destructors to avalanche which incurs unbounded pause times.

The first problem is solved in some languages (e.g. Erlang, Mathematica) by imposing unidirectional heaps so cycles cannot be created by design and, therefore, reference counting is an accurate form of GC for them. The second problem can be addressed using a variety of techniques such as deferred decrements but they all add significant complexity and, therefore, undermine the first advantage. The third problem obviously negates the second advantage in the context of multithreaded programs and, in this era of multicore computing, most programs are multithreaded. The problem of avalanching destructors can be solved by adding a queue to defer destruction but this adds complexity and makes the solution unpredictable, i.e. negates both advantages.

Tracing garbage collection works by keeping track of references into the heap (the global roots) and then tracing through the heap to find unreachable objects eligible for collection. The first advantage of tracing collection is that it can be very simple (e.g. the mark-sweep collector in my HLVM project is ~20 lines of code). The second advantage is that it is always accurate (can collect cycles). The third advantages is that naive mark-sweep is much faster than naive reference counting and optimized tracing collectors (e.g. JVM and CLR) are significantly faster than the fastest reference counting collectors. The first disadvantage is that, although they can be deterministic (as OCaml is), they are unpredictable because it is not possible to predict when an object will be collected. The second disadvantage of naive tracing collectors is that mutator threads must all be paused for an unbounded amount of time during each collection cycle. The third disadvantage of tracing collection is that the time spent marking is proportional to the size of the entire heap.

The second disadvantage is easily reduced by adopting incremental mark-sweep and can be completely eliminated by more advanced techniques. The third disadvantage turns out to be practically irrelevant because the constant pre-factor is so small: even with the largest heaps in use today the mark phase is surprisingly quick.

There are some common myths surrounding this topic as well. Many people believe (incorrectly) that objects are always kept alive until the end of their scope in the source code and, therefore, that reference counting always collects at the earliest opportunity because it collects objects reachable from variables as they fall out of scope. In reality, this mental model is horribly broken. In reality, the GC acts at run-time when the code has been compiled so the concept of scope does not even exist. Not all locally-held references are spilled to the stack and even when a reference is spilled to the stack it is likely to be overwritten by another when it is no longer needed. Register allocators rely upon liveness analysis (not scope) to determine when references need to be kept.

Why is garbage collection an important feature?

  1. Garbage collection eliminates a class of bugs caused by erroneous memory management (forget to free, free too soon, free more than once).
  2. Garbage collection removes the need for APIs to describe contracts about memory management.
  3. Garbage collection facilitates programming styles such as first-class lexical closures in functional programming (see the Funarg problem).

What's new in garbage collection?

Since the 1950s there have been three main families of collectors: semi-space, mark-sweep and mark-compact. Almost all production GCs have been generational mark-sweep even though they exhibit pathological performance when the nursery is full of survivors because they are marked, (physically) copied and all references to them updated which is ~3x slower than necessary and is practically useful when filling a hash table with heap-allocated keys and/or values. The Beltway (2002) and Immix (2008) garbage collectors introduced the new family called the mark-region GCs. With a mark-region GC the entire heap is a collection of regions and so is the nursery generation so it can be logically aged by replacing it with another region when it is full of survivors. Sun's Hotspot JVM introduced the first mainstream mark-region GC with its G1 collector.

The advent of multicore in 2005 has meant more emphasis on parallel and concurrent garbage collectors. The Staccato (2008) garbage collector is the first that is simultaneously parallel and concurrent and real-time.

.NET represents a major lateral advancement among mainstream GCs because support for reified generics and value types allows .NET languages like C# and F# to express generic collections that can be filled with new values without requiring any heap allocation. For example, filling a Dictionary with ints, floats, complex numbers and low-dimensional vectors and matrices.

Saturday, 14 September 2013

A look at the IronJS source code

The author of IronJS has made some claims (e.g. F# can feel “like a slightly crappier version of C#”) that I find surprising so I'm reading his code. I find it to be quite odd for a variety of reasons:
  1. Surprisingly OOP (dozens of interfaces and classes and hundreds of augmentations, static members and overloads) given how well suited ML is to writing compilers.
  2. Lots of low-level bit-twiddling.
  3. Uses List.empty instead of [].
  4. Uses large tuples.
  5. Uses arrays of closures, e.g. Stmt : (Lexer.Token -> State -> Ast.Tree) array.
  6. Doesn't use a symbol table.
  7. Choice of data structures appears to incur lots of unnecessary boxing.
  8. Odd choice of data structures for memory representation, e.g. nullleft and stmt fields for a token are scattered.
  9. Doesn't seem to use concurrency or parallelism.
  10. Parser contains only 42 match expressions in 1,400 lines of code.
  11. Massive amounts of code duplication.
  12. Hand-rolled parser is unusual. Hand-rolled lexer without tables is even more unusual.
  13. Strange choice of syntax, e.g. lots of p |> csymbol instead of just csymbol p.
  14. Allegedly optimized but no structs.
  15. Uses HashSet.Add repeatedly instead of the declarative constructor from a seq.
  16. Doesn't use HashIdentity.Structural.
Furthermore, I don't recall ever seeing that author ask anyone for help with it and it looks like he could do with quite a bit of advice. So I'd take the "rewrite in C#" with a pinch of salt...
Here's an example of some code from the hand-rolled lexer that is using many comparisons of unicode characters when single table lookups could suffice:

let inline isAlpha c = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
let inline isDecimal c = c >= '0' && c <= '9'
let inline isOctal c = c >= '0' && c <= '7'
let inline isNonOctalDigit c = c = '8' && c = '9'
let inline isHex c = isDecimal c || (c >= 'A' && c <= 'F') || (c >= 'a' && c <= 'f')

Here's an example of some odd code that uses OOP methods to give names to the elements of a tuple when it should just use a record type instead of a tuple:

member x.TokenName =
  let s, _, _, _ = x.Token in s |> Lexer.Symbol.getName

member x.TokenValue =
  let _, v, _, _ = x.Token in v

member x.TokenLine =
  let _, _, l, _ = x.Token in l

member x.TokenColumn =
  let _, _, _, c = x.Token in c

I believe this is the offending type definition:

type Token = int * string * int * int

Here's another example of some strange code from the lexer where the programmer has manually assigned consecutive integers to symbols:

let [<Literal>] Break = 0
let [<Literal>] Case = 1
let [<Literal>] Catch = 2
let [<Literal>] Continue = 3
let [<Literal>] Default = 4

Note the complete absence of any metaprogramming here, which is surprising because it is written in a MetaLanguage (ML).

Here's another oddity, the use of new when constructing non-IDisposable objects:

new Dictionary<string, int>(

Yet another oddity, pretending that all type definitions are mutually recursive when they are actually completely independent:

and ScopeType
  = GlobalScope
  | FunctionScope
and EvalMode
  = Clean
  | Contains
  | Effected
and LookupMode
  = Static
  | Dynamic

Very odd to say that a Dictionary being faster than a Map is "sad" when it isn't even being mutated:

// Sadly a normal Dictionary is so
// much faster then a F# Map that
// it's worth using it
new Dictionary<string, int>(

In summary, although IronJS is written in F# it is not idiomatic F# code and there is a lot of room for improvement. I see no reason to believe that the problems with the IronJS code base have anything to do with the F# programming language.

Wednesday, 4 September 2013

Efficient Parallel Stencil Convolution in Haskell

Until recently, all research on parallel Haskell programs neglected CPU caches and existing high-performance software solutions like BLAS, LAPACK, FFTW and OpenCV. We criticized the paper Regular, shape-polymorphic, parallel arrays in Haskell for comparing parallel Haskell programs only with poorly-written C code, particularly because this buries the real performance differences in unnecessary cache misses.

In 2011, Ben Lippmeier, Gabriele Keller and Simon Peyton-Jones published a paper entitled Efficient Parallel Stencil Convolution in Haskell that is the first to address caches and locality in the context of parallel Haskell programs and to compare parallel Haskell programs to existing high-performance solutions. The paper discusses an algorithm for edge detection in computer vision that was 10x slower using the old REPA approach and walks through the design and implementation of a more advanced solution using new approaches that take CPU caches into account and, ultimately, attains performance comparable to the high-performance OpenCV library. We highly recommend this paper for anyone interested in the subject. Hopefully future work will progress to more advanced topics like Frigo’s trapezoidal space-time decomposition.

This appears to have marked a turning point in Haskell research. Several other contributors are now discussing locality and cache oblivious algorithms written declaratively in a purely functional style. Best of luck to them!

Tuesday, 3 September 2013

Hash table insertion performance: F# vs OCaml vs Haskell

Time taken to insert ten million int→T bindings for T as each of int, float, complex and string using F# on .NET 4.5, OCaml 3.12 and GHC 7.4.2:
On a 3.4GHz Intel Core i7-3770 running Windows 7.
Value types and reified generics facilitate .NET's very efficient Dictionary implementation when handling value types like int, float and complex (unmatched by any other language/VM including C, C++ and Java). F# is 50% faster than OCaml and Haskell using the reference type string perhaps because the .NET write barrier is significantly more efficient.
Note: we use "int" to refer to ordinary 32-bit ints in F# and Haskell but 31-bit ints in OCaml. Using 32-bit ints in OCaml would be much slower because they are boxed (i.e. heap allocated).

Saturday, 2 February 2013

Curious about the hash table problem

An old post from Stack Overflow:

In short, even with the fix in the latest GHC, Haskell is still incapable of providing a dictionary (mutable or immutable) that is competitively efficient.
Haskell's hash tables were 32× slower than alternatives like C++ and .NET with GHC 6.10. That was partly due to a performance bug in the GHC garbage collector that was fixed for GHC 6.12.2. But Simon Marlow's results there show only a 5× performance improvement which still leaves Haskell's hash tables many times slower than most alternatives.
Purely functional alternatives are also much slower than a decent hash table. For example, Haskell'sIntMap is 10× slower than .NET's hash table.
Using F# 2010 and the latest Haskell Platform 2010.2.0.0 (released yesterday!) with GHC 6.12.3 on this 2.0GHz E5405 Xeon running 32-bit Windows Vista to insert 20M int->int bindings into an empty hash table we find that Haskell is still 29× slower than F# in real time and over 200× slower in terms of CPU time because the Haskell burns all cores:
GHC 6.12.3 Data.HashTable: 42.8s (new!)
.NET hash table:            1.47s
Provided you run only short-lived microbenchmarks you can disable the GHC garbage collector as Don Stewart suggests above. By asking for a nursery generation so large that this particular program will never fill it, he brought the time for the Haskell hash table down to only 1.5s here. However, this completely undermines the whole point of having a nursery generation and will massively degrade the performance of other code because freshly allocated values will now always be cold in the cache (which is why the nursery generation is typically the size of the L2 cache, orders of magnitude smaller than this).