Saturday, 1 January 2011

Boost's shared_ptr up to 10× slower than OCaml's garbage collection

Our recent post about the merits of accurate garbage collection over reference counting prompted Tezka to ask for measurements demonstrating the performance differences between an accurate GC and the reference counted shared_ptr smart pointers provided by the Boost C++ library.

The benchmarks we have been using to prototype allocators for our HLVM project provide the perfect setting for such an experiment. We already have C++ code that solves the n-queens problem using logic programming with a variety of different allocation strategies. Adding another version of our benchmark using Boost's shared_ptr gave the following remarkable results:

These new results highlight just how slow reference counting can be. The C++ code using Boost's reference counted smart pointers is running up to 10× slower than OCaml!

Note also the new "stack" section that is the first C++ to beat OCaml, albeit cheating by exploiting the fact that this implementation of this benchmark always happens to allocate and free in FIFO order. So this approach is problem-specific and cannot be used as a general-purpose allocator. However, it is worth noting that the only allocation strategy to beat C++ also (like OCaml's nursery) exploits the ability to collect many freed values in constant time by resetting the stack pointer rather than using repeated pops. Therefore, the only allocation strategies likely to approach OCaml's performance on these kinds of problems are those that exploit sparsity. For example, by manipulating free lists in run-length encoded form.

We have yet to find a general-purpose allocation strategy that can match the performance of OCaml's generational garbage collector for these kinds of applications but we can make two interesting observations from these new results:

  • Reference counting is the only difference between the "free" results (using malloc and free in FIFO order) and these new "shared_ptr" results, so around 80% of the time is spent handling reference counts, far more than is spent in actual allocation and deallocation!

  • HLVM currently performs poorly on this benchmark because around 70% of its time is spent maintaining the shadow stack, a comparable overhead to the use of reference counting (but one that handles the collection of cycles correctly).

Finally, we should note that our "region 12" results are also new and note that they are significantly worse than the previous "region" results. The difference is that the previous results allocated and deallocated in FIFO order explicitly in the inner loops of the benchmark whereas these new results are a more accurate reflection of the allocation and deallocate patterns that the current version of HLVM generates, specifically the inner loops only maintain the mark bits and deallocation is now performed by sweeps triggered by sufficient allocation. Sweeping results in much worse performance (from 20% slower than OCaml to 2× slower than OCaml). We have invented two ways to attack this problem. Firstly, the thread-local current region could be swept locally, either using a remembered set or just back to the previous write barrier. Secondly, runs of consecutively-allocated unreachable values are likely to appear near the end of the allocated list, so sparse free lists will hopefully accelerate the collection of contiguous sequences of unreachable values.


16 comments:

veikko_e said...

Is it possible to have knowledge of the compiler, compilation flags, platform and the source code used? I'd be interested to take a closer (just out of curiosity, no plans to publish any results of the digging).

Flying Frog Consultancy Ltd. said...

I'm using GCC 4.4.1 on a 2.1GHz 2354 Opteron with -Wall -O3 on 32-bit Kubuntu. Here's the source code:

#include <assert.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <malloc.h>
#include <vector>
#include <boost/shared_ptr.hpp>

using namespace std;
using namespace boost;

struct intint {
int x, y;
};

template<typename T>
struct cons {
T elt;
shared_ptr<cons<T> > next;
};

template<typename T>
shared_ptr<cons<T> > Cons(T h, shared_ptr<cons<T> > t) {
shared_ptr<cons<T> > xs(new cons<T>());
xs->elt = h;
xs->next = t;
return xs;
}

template<typename Tenv, typename T1, typename T2>
struct closure {
T2 (*func)(Tenv, T1);
Tenv env;
};

template<typename Tenv, typename T1, typename T2>
T2 apply(closure<Tenv, T1, T2> f, T1 x) {
return f.func(f.env, x);
}

template<typename T>
bool isEmpty(shared_ptr<cons<T> > xs) {
return xs == NULL;
}

template<typename T>
int length(shared_ptr<cons<T> > xs) {
if (isEmpty(xs)) return 0;
return 1 + length(xs->next);
}

bool safe(intint p1, intint p2) {
int x1=p1.x, y1=p1.y, x2=p2.x, y2=p2.y;
return x1 != x2 && y1 != y2 && x2-x1 != y2-y1 && x1-y2 != x2-y1;
}

template<typename Tenv, typename T>
shared_ptr<cons<T> > filter(closure<Tenv, T, bool> pred, shared_ptr<cons<T> > xs) {
if (isEmpty(xs)) return xs;
T h = xs->elt;
shared_ptr<cons<T> > t = xs->next;
t = filter(pred, t);
if (apply(pred, h))
return Cons(h, t);
else
return t;
}

bool safe_2(intint *env, intint p2) {
return safe(*env, p2);
}

closure<intint, intint, bool> safe_1(intint p1) {
closure<intint, intint, bool> c;
c.func = safe;
c.env = p1;
return c;
}

int search(int (*f)(shared_ptr<cons<intint> >, int),
int n,
shared_ptr<cons<intint> > qs,
shared_ptr<cons<intint> > ps,
int accu) {
if (isEmpty(ps))
if (length<intint>(qs) == n)
return f(qs, accu);
else
return accu;
else {
intint q = ps->elt;
ps = ps->next;
accu = search(f, n, qs, ps, accu);
return search(f, n, Cons(q, qs), filter(safe_1(q), ps), accu);
}
}

shared_ptr<cons<intint> > ps(int n, int i, int j) {
if (i == n)
if (j == n-1)
return shared_ptr<cons<intint> >();
else
return ps(n, 0, j+1);
else {
intint p;
p.x = i;
p.y = j;
return Cons(p, ps(n, i+1, j));
}
}

int f(shared_ptr<cons<intint> >xs, int n) { return n+1; }

int queens(int n) {
return search(f, n, shared_ptr<cons<intint> >(), ps(n, 0, 0), 0);
}

int main(int argc, char *argv[])
{
int n = atoi(argv[1]);
printf("%d-queens: %d\n", n, queens(n));

return 0;
}

Veikko Eeva said...

@Flying Frog

You sure are seeing trouble to make the comparison be between similar constructs. One one hand, it makes the comparisons be on more even ground, in some sense. Then on second hand, it could also be that it's not playing on the strengths of languages and their implementations. :)

I would think a typical C++ programmer wouldn't structure the code like that (I know I wouldn't have had, but it's a long time I've really programmed in C++).

All right, enough about that. A few things stand out as performance drains. Could probably changed to source code in a matter of minutes without any restructirng (I don't have a Linux nor a C++ compiler here right now).

In the cons function, you are creating the object with "new". I haven't measured or profiled, but in this case I would imagine it does have a significant performance boost to use make_shared instead. Memory allocations like that tend to get heavy in C++, even if with some super-allocator from Intel.

Other thing is that passing the structures as constant references (maybe even as rvalue references in some cases?) should ease considerably the overhead of making copies on every call. In shared pointer case it involves copy constructor and destructor call, which probably involves atomic counting and such.

Flying Frog Consultancy Ltd. said...

@Veikko:

"I would think a typical C++ programmer wouldn't structure the code like that". The code structure is taken from the prototype allocators for HLVM and reflects the structure of code generated by HLVM.

"In the cons function, you are creating the object with "new". I haven't measured or profiled, but in this case I would imagine it does have a significant performance boost to use make_shared instead". Replacing "new" with "make_shared<cons<intint> >" actually degrades performance by 4%.

"Other thing is that passing the structures as constant references (maybe even as rvalue references in some cases?) should ease considerably the overhead of making copies on every call". Yes, the 80% of the time spend manipulating reference counts is where the performance improvements lie. However, I am content that this benchmark has demonstrated that reference counting is inefficient and, therefore, a sub-optimal place to start.

"In shared pointer case it involves copy constructor and destructor call, which probably involves atomic counting and such". Actually, I believe Boost shared_ptr is thread unsafe by default. Therefore, I estimate that a thread safe alternative would be around 30× slower than OCaml due to the atomic operations you mention.

Veikko Eeva said...

Degrades by 4%. It used to be quite the opposite, but I'm content with that. I think you are right in that too that using shared_ptr as a general reference counting mechanism is inefficient. At least if one doesn't want to take the usual approach and design memory issues in the mind. I'd argue many won't the non-technical problems are usually more pressing.

By the way, documentation explains at http://www.boost.org/doc/libs/1_45_0/libs/smart_ptr/shared_ptr.htm#ThreadSafety
explains "shared_ptr objects offer the same level of thread safety as built-in types. A shared_ptr instance can be "read" (accessed using only const operations) simultaneously by multiple threads. Different shared_ptr instances can be "written to" (accessed using mutable operations such as operator= or reset) simultaneosly by multiple threads (even when these instances are copies, and share the same reference count underneath.)"

Arseny Kapoulkine said...

By default boost shared_ptr uses atomic primitives for shared count updates. You can recompile your benchmark with BOOST_SP_DISABLE_THREADS define to get a no-threads version.

Flying Frog Consultancy Ltd. said...

Arseny: "You can recompile your benchmark with BOOST_SP_DISABLE_THREADS define to get a no-threads version". That is faster but still 6× slower than OCaml.

helium said...
This comment has been removed by the author.
helium said...

struct intint {
int x, y;
};

template<typename T>
struct cons {
T elt;
shared_ptr<cons<T> > next;
};

template<typename T>
shared_ptr<cons<T> > Cons(T h, shared_ptr<cons<T> > const & t) {
shared_ptr<cons<T> > xs(new cons<T>());
xs->elt = h;
xs->next = t;
return xs;
}

template<typename Tenv, typename T1, typename T2>
struct closure {
T2 (*func)(Tenv, T1);
Tenv env;
};

template<typename Tenv, typename T1, typename T2>
T2 apply(closure<Tenv, T1, T2> f, T1 x) {
return f.func(f.env, x);
}

template<typename T>
bool isEmpty(shared_ptr<cons<T> > const & xs) {
return xs == NULL;
}

template<typename T>
int length(shared_ptr<cons<T> > const & xs) {
if (isEmpty(xs)) return 0;
return 1 + length(xs->next);
}

bool safe(intint p1, intint p2) {
int x1=p1.x, y1=p1.y, x2=p2.x, y2=p2.y;
return x1 != x2 && y1 != y2 && x2-x1 != y2-y1 && x1-y2 != x2-y1;
}

template<typename Tenv, typename T>
shared_ptr<cons<T> > filter(closure<Tenv, T, bool> pred, shared_ptr<cons<T> > const & xs) {
if (isEmpty(xs)) return xs;
T h = xs->elt;
shared_ptr<cons<T> > t = xs->next;
t = filter(pred, t);
if (apply(pred, h))
return Cons(h, t);
else
return t;
}

bool safe_2(intint *env, intint p2) {
return safe(*env, p2);
}

closure<intint, intint, bool> safe_1(intint p1) {
closure<intint, intint, bool> c;
c.func = safe;
c.env = p1;
return c;
}

int search(int (*f)(shared_ptr<cons<intint> > const & , int),
int n,
shared_ptr<cons<intint> > const & qs,
shared_ptr<cons<intint> > ps,
int accu) {
if (isEmpty(ps))
if (length<intint>(qs) == n)
return f(qs, accu);
else
return accu;
else {
intint q = ps->elt;
ps = ps->next;
accu = search(f, n, qs, ps, accu);
return search(f, n, Cons(q, qs), filter(safe_1(q), ps), accu);
}
}

shared_ptr<cons<intint> > ps(int n, int i, int j) {
if (i == n)
if (j == n-1)
return shared_ptr<cons<intint> >();
else
return ps(n, 0, j+1);
else {
intint p;
p.x = i;
p.y = j;
return Cons(p, ps(n, i+1, j));
}
}

int f(shared_ptr<cons<intint> > const & xs, int n) { return n+1; }

int queens(int n) {
return search(f, n, shared_ptr<cons<intint> >(), ps(n, 0, 0), 0);
}

Flying Frog Consultancy Ltd. said...

@helium: That is 70% faster but still 6× slower than OCaml when thread safe and 5× slower than OCaml when thread unsafe.

Denzil said...

Part of the reason you are seeing poor performance with boost::shared_ptr<> is that you are passing them around by value (IE: copying them). Try passing by reference (IE: boost::shared_ptr&) and I suspect your performance difference will be greatly reduced.

-Denzil

Veikko Eeva said...
This comment has been removed by the author.
Veikko Eeva said...

Denzil, C++ general allocator isn't that good at numerous small allocations from the heap which is exactly what happens in the Cons function. The number of cons allocations grows rapidly as a function of input number.

You could make a point in that the test is measuring more the corresponding memory allocation strategies and not reference counting (though there's quite a lot of "needless" copies without passing by reference). On ther other hand, you could argue C++ -- and reference counting -- doesn't shield one from such problems. That is what I and, I think, Arseny alluded at when we wrote a C++ programmer wouldn't write code like that.

Flying Frog Consultancy Ltd. said...

@Denzil: "I suspect your performance difference will be greatly reduced". This was already done by Helium here in the comments and I posted my benchmark results for the optimized version: still 5× slower than OCaml.

@Veikko: "You could make a point in that the test is measuring more the corresponding memory allocation strategies and not reference counting". How do you explain the fact that the solution using malloc/free is 2.5× faster than the fastest one using shared_ptr? That discrepancy can only be a measure of the (huge) overhead of shared_ptr.

Arseny Kapoulkine said...

@Flying Frog Consultancy

That is because shared_ptr has an externally allocated counter, which by default (unless BOOST_SP_USE_QUICK_ALLOCATOR is defined) uses malloc/free, so you get twice the amount of malloc/free. Defining BOOST_SP_USE_QUICK_ALLOCATOR should reduce the time gap.

Flying Frog Consultancy Ltd. said...

@Arseny: That is almost 20% faster again. Boost's intrusive_ptr is slightly faster still because it puts the reference count into the object itself, avoiding that extra allocation entirely. Even with this, thread unsafe reference counting with smart pointers is still 4× slower than OCaml.