Saturday, 28 February 2015

Memory management myths in Apple circles

Nothing gets my hackles up more than people perpetuating memory management myths. Apparently there is a new trend in town thanks to Apple, who are deprecating their garbage collector on OS X in favor of Automatic Reference Counting (ARC).
News website Cult of Mac say:
iOS is twice as memory-efficient as Android. Here’s why... According to Glyn Williams over on Quora, iOS devices run better than Android devices with twice the RAM because Android apps use Java, and need all the extra RAM to do something called garbage collection.”
Another news website, Redmond Pie, say:
“That was basically the same question put to Quora, the social website that gives people a way to ask questions and then have them answered by people who are experts in their respective field. The upvoting system adds a spot of authority tracking to the answers that are provided, and we have a clear winner as far as the question around why Android phones have so much more memory than iPhones.

Enter Glyn Williams.

The response, upvoted by over 2,600 people, included a handy graph and an explanation that involves garbage collection and Java. Basically, Android needs more memory because of the way it handles things.

You can head on over to the Quora question and check out Glyn’s explanation yourself, but what it boils down to is this: Android apps use Java, and as a result Android does something called garbage collection which involves memory being recycled once applications are finished with it. That’s all well and good, and actually performs really well when given plenty of memory to work with. The problems arise when the system is starved of memory.”
They are both referring to the same answer on Quora by a guy called Glyn Williams. His answer is as follows:
“Servicing RAM uses power. So more memory = more power consumption.
Android apps using Java, recycle released memory using garbage collection.

What this diagram shows is that garbage collectors are really awesomely fast if you have a relative memory footprint of 4 or 8.

In other words, you need four or eight times more memory, than you are actually using to be super efficient. But when the memory becomes constrained, that performance goes way down.

This is why Android devices have all that RAM.

iOS does not use this style of garbage collection and does not slow down in constrained memory environments.

So 1GB for iOS results in more performance than 3GB for Android.”

Some problems are immediately obvious with this. Firstly, RAM typically only accounts for a fraction of a percent of total power consumption in a mobile phone so power is not an excuse for skimping on RAM. Secondly, the graph is one of seven graphs from a ten-year-old research paper that compared various toy garbage collectors with an alternative scheme that used trial runs to deallocate memory aggressively. There are many problems with this. The other six graphs in the paper do not substantiate Glyn’s claims, i.e. he cherry picked his graph. Most of the garbage collectors (e.g. Cheney semi space, stop and copy, non-generational mark-sweep) are not representative of anything used on Android. The most realistic garbage collector on the graph is the generational mark-sweep collector that outperformed all of the others but even this is not as sophisticated as the concurrent garbage collector employed by the latest Android's run-time, ART. Thirdly, Glyn asserts that GCs must “have a relative memory footprint of 4 or 8” to be really awesomely fast when this graph clearly shows a relative footprint of 2.5 for the only realistic GC provides the best possible performance. Fourthly, Glyn implicitly assumed that ARC has optimal performance and memory overhead when, in fact, reference counting can be 10x slower than tracing GC and reference counts take up a lot of room. Ten times slower is literally off the chart here. Fifthly, Glyn asserts that garbage collection is the reason why Android devices have more RAM but there is no evidence to support this. Finally, Glyn asserts that this (garbage collection) is the reason why a 3GB Android device performs like a 1GB iOS device when there is clearly no evidence to support that conclusion which is, in fact, pure speculation.
I took the opportunity to ask Glyn himself what had given him the impression than Android needs more RAM to attain the same performance as iOS and why exactly he thought that was due to garbage collection. The only concrete evidence Glyn offered was this video showing two devices being switched by hand between a variety of different applications. Most of the applications are written in C++ and the Android device actually won the benchmark.
So this news-worthy gem turned out to be pure speculation.

Sunday, 26 January 2014

On the performance of boxed tuples

Stephen Dolan, a PhD student at the University of Cambridge, published an interesting article A “sane” design for multicore OCaml recently and the following discussion on Reddit proved to be quite informative. In particular, more than one person asserted that ints, floats and tuples are fast in OCaml. In this blog post I’m going to take a look at tuple performance.


As Stephen points out, one might reasonably expect unboxed tuples to be faster for passing arguments to functions and returning multiple values from functions because the elements can stay in registers but slower for storing in the heap because they require multi-word reads and writes instead of a single word (a pointer to the existing tuple). However, HLVM has shown that unboxed tuples can be extremely fast so why the discrepancy?


The performance charactistics of different heap topologies are not quite so simple in a garbage collected environment. Two aspects of garbage collection affect the results: the write barrier and survivors. The write barrier is a relatively-slow piece of code injected whenever a program writes a reference into the heap in order to keep the garbage collector apprised of the constantly-changing heap topology. Therefore, writing an unboxed pair of ints into the heap requires two int writes whereas writing a boxed pair of ints into the heap requires one pointer write and a write barrier. Now, in order to record information for the garbage collector the write barrier always performs at least one write in addition to other housekeeping work. Therefore, writing a pair of ints will always be slower if the pair is boxed. In fact, the F# Journal article Pathological garbage collector behaviour found that a write that incurs the write barrier is 2.4x slower than a write that does not. Moreover, .NET is heavily optimized for mutable code so it has a very efficient write barrier whereas OCaml is heavily optimized for purely functional code and has a notoriously slow write barrier.


The next issue that complicates the performance of boxed vs unboxed tuples in the heap is survivors. Both OCaml and .NET use generational garbage collectors. New objects like boxed tuples are allocated in a nursery generation. When the nursery is full, surviving objects are identified and physically copied to the next generation. If a program violates the generational hypothesis (that most objects die young) by allocating many objects that survive then it incurs a performance overhead for marking the survivors, copying them into the next generation and fixing up all pointers to those objects to point at their new locations. If tuples are unboxed then none of these overheads exist.


So it is instructive to measure the performance of writing newly-allocated tuples into progressively longer slices of an array. We have done this using boxed tuples in OCaml and F# as well as unboxed structs in F#. The following graph visualizes the performance as a function of the size of the array slice:

For array slices containing up to 1,000 elements none of the tuples survive the first generation and the performance of the boxed tuples is only slightly worse than for the unboxed representation (probably due to the write barrier). For more than 1,000 elements the performance of boxed tuples in both F# and OCaml rapidly worsens until they are 10x slower than unboxed tuples for 1,000,000 elements.


So tuples can indeed be fast in OCaml but only if they are short-lived temporaries. If tuples survive the nursery generation (which is 256kB by default) then performance is very bad.


The poor performance of long-lived tuples in OCaml has actually been worked around on several occasions. The Map implementation is almost identical to the Set implementation except the key-value pairs have been manually unboxed into the variant type representing the AVL tree, resulting in a large amount of unnecessary code duplication. The Hashtbl implementation uses a custom bucket type that is a list where the key-value pairs in each cons cell have been manually unboxed. Variant types are most elegantly represented as a tag and argument. For multiple arguments, the argument can just be a tuple. This simple and efficient representation is used in HLVM and it works very well. In OCaml, the compiler unboxes multiple arguments as a special case in order to combat the performance of boxed tuples. This results in a language wart where brackets around the arguments to a variant type constructor alter the meaning from multiple arguments to a single argument that is a tuple.


Locality is another important aspect of the performance of boxed vs unboxed tuples. Consider sorting an array of pairs and then enumerating the sorted array. With an unboxed representation the elements of the array are physically moved into place within a contiguous block of memory and enumeration is cache friendly. With a boxed representation, sorting scrambles the pointers and enumeration then has worst-case cache behaviour.


Suffice to say, the performance of boxed tuples is not as clear-cut as one might imagine.


Sunday, 13 October 2013

Memory management myths: promptness

People often assert that scope-based reference counting such as shared_ptr in C++ collects garbage “promptly” and some people define this as “collects at the earliest possible point”. For example, at the time of writing the Wikipedia page about garbage collection says:

Compared to tracing garbage collection, reference counting guarantees that objects are destroyed as soon as they become unreachable” – Wikipedia

Similar claims can even be seen in published research such as the paper “Down for the Count? Getting Reference Counting Back in the Ring”:

“Of the two fundamental algorithms on which the garbage collection literature is built, reference counting has lived in the shadow of tracing. It has a niche among language developers for whom either performance or completeness is not essential, and is unused by mature high performance systems, despite a number of intrinsic advantages such as promptness of recovery and dependence on local rather than global state.” Blackburn et al.

On the other hand you can see statements by experts like Richard Jones, co-author of the excellent Garbage Collection Handbook, make statements like:

“More importantly, note also that even an immediate (i.e. non deferred) reference counter cannot reclaim objects as soon as they are no longer referenced as finalisation must be asynchronous (see Hans Boehm's POPL03 paper "Destructors, finalizers and synchronization").” – a post on the gc-list by Richard Jones.

Let’s have a closer look at the thinking behind this belief and test it with a simple program. The mental model that underpins this belief is that any function’s local variables are stored in separate slots in the function’s stack frame for the entire duration of a function’s body and, therefore, will be reachable from the point of view of the garbage collector for the duration of the call to the function. This mental model underpins exam and interview questions such as Is object eligible for garbage collection after “obj = null”? and When Is The Object Eligible For Garbage Collection?.

In reality, this mental model is simple, obvious and wrong. Why? Firstly, the garbage collector sees the run-time representation of a program after it has been subjected to transforms such as inlining, instruction reordering and code block reordering by the compiler that can mutilate the structure of a program beyond recognition and, consequently, concepts like scope that exist only in the source code and not in the compiled form are not visible to the garbage collector. Secondly, the register allocator does everything possible to keep references in registers and avoid spilling them to the stack and when they must be spilled it uses the results of liveness analysis to overwrite any dead references in the stack frame whenever possible. In fact, some compilers don’t even use stack frames, such as our own x86 JIT in F# and the HLVM project, and other compilers like SML/NJ convert every call into continuation style and put stack frames on the heap, splitting every segment of code between a pair of function calls in the source into its own separate function in the compiled form.

Enough theory, let’s take a look at some working code. Here is a simple example using tracing garbage collection in OCaml/F# where an argument tmp to a function dies in the middle of the function body and, in particular, before a recursive call:

let rec loop tmp i =
  if i<=0 then tmp else
    let tmp2 = loop (Array.copy tmp) (i-1)
    tmp2.[0] <- tmp2.[0] + 1

When run using loop (Array.init m id) n, this code clearly uses less than mn space and keeps on running indefinitely. This can only be because the argument tmp is no longer reachable via the stack when the recursive call is made and, consequently, gets garbage collected.

Here is the equivalent using scope-based reference counting in C++:

shared_ptr<vector<double> > loop(shared_ptr<vector<double> > tmp, int i) {
  if (i<=0) {
    return tmp;
  } else {
    shared_ptr<vector<double> > tmp1(new vector<double>(*tmp));
    shared_ptr<vector<double> > tmp2 = loop(tmp1, i-1);
    return tmp2;

In contrast, this code clearly requires at least mn space when run, goes to swap and (on Windows) dies from out of memory. Unlike the OCaml/F# code, the scope-based reference counting using shared_ptr in C++ keeps the tmp array allocated for longer than necessary, right until the end of the function call.

This observation also destroys another popular memory management myth: that tracing garbage collection always requires more memory than reference counting.

If there is any advantage to the C++ then it is the presence of guarantees. The semantics of C++ guarantee that after the end of scope the object has been deleted. However, it is worth noting that this guarantee of determinism does not apply to objects shared between threads because in that situation the threads race to decrement the reference counter to zero and the winner of the race condition is burdened with executing the destructor.

Saturday, 12 October 2013

Memory management myths: determinism

Although the vast majority of programmers have now migrated to garbage collected languages and will probably never go back, there are still a few clinging to manual memory management. In most cases, the continued use of manual memory management is for good reason but some of these people are perpetuating myths in an attempt to justify avoiding garbage collection. Determinism can be a genuinely good reason to stick with manual memory management and is practically important in memory-constrained embedded devices. However, C++ programs are not as deterministic as people sometimes claim and, in particular, thread-safe reference counting using shared_ptr is non-deterministic. Specifically, threads holding references to shared reference-counted objects race to perform the final decrement and the thread that wins the race is responsible for destruction.

Thursday, 10 October 2013

Herb Sutter's favorite C++ 10-liner has a memory management bug

In a recently-posted video, Herb Sutter (a prominent C++ expert) describes his favorite C++ 10-liner as “a thread-safe reference-counted object cache”:

shared_ptr<widget> get_widget(int id) {
  static map<int, weak_ptr<widget>> cache;
  static mutex m;

  lock_guard<mutex> hold(m);
  auto sp = cache[id].lock();
  if (!sp) cache[id] = sp = load_widget(id);
  return sp;

This example is very interesting. Firstly, it manages to pull in reference counting, weak references and a mutex which are all very rare in modern programming. Secondly, it contains a memory leak that is difficult to fix in C++ because APIs are burdened with memory management details and this API is incapable of expressing deterministic cleanup because there is no facility for a widget's destructor to remove its entry in the map. Finally, the correct name for this data structure is a concurrent weak dictionary, specifically one with weak values. You'll find correct implementations of this data structure are widely available for C#, F# and Java such as the one here.

The obvious fix is to sweep stale entries from the map when get_widget is called but this leaves floating garbage in the map between calls to get_widget, is asymptotically less efficient and incurs unbounded pauses for an unbounded number of threads.

Update: Matthew Avery (from the USA) suggests altering the API and semantics of the functions involved so load_widget returns a shared_ptr with a custom deleter that removes the stale map entry as soon as a widget is destructed. If this idea can be made to work then it would be the only deterministic solution to have been proposed to date.

Wednesday, 25 September 2013

How do reference counting and tracing garbage collection compare?

Reference counting works by counting the number of references to each object. When the reference count drops to zero the object is definitely unreachable and can be recycled. The first advantage is simplicity. The second advantage is that decrements can occur as locals fall out of scope so collection can be deterministic and predictable. The third advantage is that the asymptotic complexity of reference counting does not depend upon the total size of the heap. The first problem is that cycles keep reference counts above zero even when they are unreachable, so reference counting alone cannot handle the general case. The second problem with reference counting is that incrementing and decrementing a counter in an object every time a reference to it is copied or dropped is very computationally expensive because it incurs a cache miss even if nothing else in the object is read or written. The third problem is that multithreaded reference counting is non-deterministic because increments and decrements race. The fourth problem is that simple implementations like Boost's shared_ptr in C++ allow destructors to avalanche which incurs unbounded pause times.

The first problem is solved in some languages (e.g. Erlang, Mathematica) by imposing unidirectional heaps so cycles cannot be created by design and, therefore, reference counting is an accurate form of GC for them. The second problem can be addressed using a variety of techniques such as deferred decrements but they all add significant complexity and, therefore, undermine the first advantage. The third problem obviously negates the second advantage in the context of multithreaded programs and, in this era of multicore computing, most programs are multithreaded. The problem of avalanching destructors can be solved by adding a queue to defer destruction but this adds complexity and makes the solution unpredictable, i.e. negates both advantages.

Tracing garbage collection works by keeping track of references into the heap (the global roots) and then tracing through the heap to find unreachable objects eligible for collection. The first advantage of tracing collection is that it can be very simple (e.g. the mark-sweep collector in my HLVM project is ~20 lines of code). The second advantage is that it is always accurate (can collect cycles). The third advantages is that naive mark-sweep is much faster than naive reference counting and optimized tracing collectors (e.g. JVM and CLR) are significantly faster than the fastest reference counting collectors. The first disadvantage is that, although they can be deterministic (as OCaml is), they are unpredictable because it is not possible to predict when an object will be collected. The second disadvantage of naive tracing collectors is that mutator threads must all be paused for an unbounded amount of time during each collection cycle. The third disadvantage of tracing collection is that the time spent marking is proportional to the size of the entire heap.

The second disadvantage is easily reduced by adopting incremental mark-sweep and can be completely eliminated by more advanced techniques. The third disadvantage turns out to be practically irrelevant because the constant pre-factor is so small: even with the largest heaps in use today the mark phase is surprisingly quick.

There are some common myths surrounding this topic as well. Many people believe (incorrectly) that objects are always kept alive until the end of their scope in the source code and, therefore, that reference counting always collects at the earliest opportunity because it collects objects reachable from variables as they fall out of scope. In reality, this mental model is horribly broken. In reality, the GC acts at run-time when the code has been compiled so the concept of scope does not even exist. Not all locally-held references are spilled to the stack and even when a reference is spilled to the stack it is likely to be overwritten by another when it is no longer needed. Register allocators rely upon liveness analysis (not scope) to determine when references need to be kept.

Why is garbage collection an important feature?

  1. Garbage collection eliminates a class of bugs caused by erroneous memory management (forget to free, free too soon, free more than once).
  2. Garbage collection removes the need for APIs to describe contracts about memory management.
  3. Garbage collection facilitates programming styles such as first-class lexical closures in functional programming (see the Funarg problem).