Saturday, 1 January 2011

Boost's shared_ptr up to 10× slower than OCaml's garbage collection

Our recent post about the merits of accurate garbage collection over reference counting prompted Tezka to ask for measurements demonstrating the performance differences between an accurate GC and the reference counted shared_ptr smart pointers provided by the Boost C++ library.

The benchmarks we have been using to prototype allocators for our HLVM project provide the perfect setting for such an experiment. We already have C++ code that solves the n-queens problem using logic programming with a variety of different allocation strategies. Adding another version of our benchmark using Boost's shared_ptr gave the following remarkable results:

These new results highlight just how slow reference counting can be. The C++ code using Boost's reference counted smart pointers is running up to 10× slower than OCaml!

Note also the new "stack" section that is the first C++ to beat OCaml, albeit cheating by exploiting the fact that this implementation of this benchmark always happens to allocate and free in FIFO order. So this approach is problem-specific and cannot be used as a general-purpose allocator. However, it is worth noting that the only allocation strategy to beat C++ also (like OCaml's nursery) exploits the ability to collect many freed values in constant time by resetting the stack pointer rather than using repeated pops. Therefore, the only allocation strategies likely to approach OCaml's performance on these kinds of problems are those that exploit sparsity. For example, by manipulating free lists in run-length encoded form.

We have yet to find a general-purpose allocation strategy that can match the performance of OCaml's generational garbage collector for these kinds of applications but we can make two interesting observations from these new results:

  • Reference counting is the only difference between the "free" results (using malloc and free in FIFO order) and these new "shared_ptr" results, so around 80% of the time is spent handling reference counts, far more than is spent in actual allocation and deallocation!

  • HLVM currently performs poorly on this benchmark because around 70% of its time is spent maintaining the shadow stack, a comparable overhead to the use of reference counting (but one that handles the collection of cycles correctly).

Finally, we should note that our "region 12" results are also new and note that they are significantly worse than the previous "region" results. The difference is that the previous results allocated and deallocated in FIFO order explicitly in the inner loops of the benchmark whereas these new results are a more accurate reflection of the allocation and deallocate patterns that the current version of HLVM generates, specifically the inner loops only maintain the mark bits and deallocation is now performed by sweeps triggered by sufficient allocation. Sweeping results in much worse performance (from 20% slower than OCaml to 2× slower than OCaml). We have invented two ways to attack this problem. Firstly, the thread-local current region could be swept locally, either using a remembered set or just back to the previous write barrier. Secondly, runs of consecutively-allocated unreachable values are likely to appear near the end of the allocated list, so sparse free lists will hopefully accelerate the collection of contiguous sequences of unreachable values.


29 comments:

Veikko Eeva said...

Is it possible to have knowledge of the compiler, compilation flags, platform and the source code used? I'd be interested to take a closer (just out of curiosity, no plans to publish any results of the digging).

Jon Harrop said...

I'm using GCC 4.4.1 on a 2.1GHz 2354 Opteron with -Wall -O3 on 32-bit Kubuntu. Here's the source code:

#include <assert.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <malloc.h>
#include <vector>
#include <boost/shared_ptr.hpp>

using namespace std;
using namespace boost;

struct intint {
int x, y;
};

template<typename T>
struct cons {
T elt;
shared_ptr<cons<T> > next;
};

template<typename T>
shared_ptr<cons<T> > Cons(T h, shared_ptr<cons<T> > t) {
shared_ptr<cons<T> > xs(new cons<T>());
xs->elt = h;
xs->next = t;
return xs;
}

template<typename Tenv, typename T1, typename T2>
struct closure {
T2 (*func)(Tenv, T1);
Tenv env;
};

template<typename Tenv, typename T1, typename T2>
T2 apply(closure<Tenv, T1, T2> f, T1 x) {
return f.func(f.env, x);
}

template<typename T>
bool isEmpty(shared_ptr<cons<T> > xs) {
return xs == NULL;
}

template<typename T>
int length(shared_ptr<cons<T> > xs) {
if (isEmpty(xs)) return 0;
return 1 + length(xs->next);
}

bool safe(intint p1, intint p2) {
int x1=p1.x, y1=p1.y, x2=p2.x, y2=p2.y;
return x1 != x2 && y1 != y2 && x2-x1 != y2-y1 && x1-y2 != x2-y1;
}

template<typename Tenv, typename T>
shared_ptr<cons<T> > filter(closure<Tenv, T, bool> pred, shared_ptr<cons<T> > xs) {
if (isEmpty(xs)) return xs;
T h = xs->elt;
shared_ptr<cons<T> > t = xs->next;
t = filter(pred, t);
if (apply(pred, h))
return Cons(h, t);
else
return t;
}

bool safe_2(intint *env, intint p2) {
return safe(*env, p2);
}

closure<intint, intint, bool> safe_1(intint p1) {
closure<intint, intint, bool> c;
c.func = safe;
c.env = p1;
return c;
}

int search(int (*f)(shared_ptr<cons<intint> >, int),
int n,
shared_ptr<cons<intint> > qs,
shared_ptr<cons<intint> > ps,
int accu) {
if (isEmpty(ps))
if (length<intint>(qs) == n)
return f(qs, accu);
else
return accu;
else {
intint q = ps->elt;
ps = ps->next;
accu = search(f, n, qs, ps, accu);
return search(f, n, Cons(q, qs), filter(safe_1(q), ps), accu);
}
}

shared_ptr<cons<intint> > ps(int n, int i, int j) {
if (i == n)
if (j == n-1)
return shared_ptr<cons<intint> >();
else
return ps(n, 0, j+1);
else {
intint p;
p.x = i;
p.y = j;
return Cons(p, ps(n, i+1, j));
}
}

int f(shared_ptr<cons<intint> >xs, int n) { return n+1; }

int queens(int n) {
return search(f, n, shared_ptr<cons<intint> >(), ps(n, 0, 0), 0);
}

int main(int argc, char *argv[])
{
int n = atoi(argv[1]);
printf("%d-queens: %d\n", n, queens(n));

return 0;
}

Veikko Eeva said...

@Flying Frog

You sure are seeing trouble to make the comparison be between similar constructs. One one hand, it makes the comparisons be on more even ground, in some sense. Then on second hand, it could also be that it's not playing on the strengths of languages and their implementations. :)

I would think a typical C++ programmer wouldn't structure the code like that (I know I wouldn't have had, but it's a long time I've really programmed in C++).

All right, enough about that. A few things stand out as performance drains. Could probably changed to source code in a matter of minutes without any restructirng (I don't have a Linux nor a C++ compiler here right now).

In the cons function, you are creating the object with "new". I haven't measured or profiled, but in this case I would imagine it does have a significant performance boost to use make_shared instead. Memory allocations like that tend to get heavy in C++, even if with some super-allocator from Intel.

Other thing is that passing the structures as constant references (maybe even as rvalue references in some cases?) should ease considerably the overhead of making copies on every call. In shared pointer case it involves copy constructor and destructor call, which probably involves atomic counting and such.

Jon Harrop said...

@Veikko:

"I would think a typical C++ programmer wouldn't structure the code like that". The code structure is taken from the prototype allocators for HLVM and reflects the structure of code generated by HLVM.

"In the cons function, you are creating the object with "new". I haven't measured or profiled, but in this case I would imagine it does have a significant performance boost to use make_shared instead". Replacing "new" with "make_shared<cons<intint> >" actually degrades performance by 4%.

"Other thing is that passing the structures as constant references (maybe even as rvalue references in some cases?) should ease considerably the overhead of making copies on every call". Yes, the 80% of the time spend manipulating reference counts is where the performance improvements lie. However, I am content that this benchmark has demonstrated that reference counting is inefficient and, therefore, a sub-optimal place to start.

"In shared pointer case it involves copy constructor and destructor call, which probably involves atomic counting and such". Actually, I believe Boost shared_ptr is thread unsafe by default. Therefore, I estimate that a thread safe alternative would be around 30× slower than OCaml due to the atomic operations you mention.

Veikko Eeva said...

Degrades by 4%. It used to be quite the opposite, but I'm content with that. I think you are right in that too that using shared_ptr as a general reference counting mechanism is inefficient. At least if one doesn't want to take the usual approach and design memory issues in the mind. I'd argue many won't the non-technical problems are usually more pressing.

By the way, documentation explains at http://www.boost.org/doc/libs/1_45_0/libs/smart_ptr/shared_ptr.htm#ThreadSafety
explains "shared_ptr objects offer the same level of thread safety as built-in types. A shared_ptr instance can be "read" (accessed using only const operations) simultaneously by multiple threads. Different shared_ptr instances can be "written to" (accessed using mutable operations such as operator= or reset) simultaneosly by multiple threads (even when these instances are copies, and share the same reference count underneath.)"

Arseny Kapoulkine said...

By default boost shared_ptr uses atomic primitives for shared count updates. You can recompile your benchmark with BOOST_SP_DISABLE_THREADS define to get a no-threads version.

Jon Harrop said...

Arseny: "You can recompile your benchmark with BOOST_SP_DISABLE_THREADS define to get a no-threads version". That is faster but still 6× slower than OCaml.

helium said...
This comment has been removed by the author.
helium said...

struct intint {
int x, y;
};

template<typename T>
struct cons {
T elt;
shared_ptr<cons<T> > next;
};

template<typename T>
shared_ptr<cons<T> > Cons(T h, shared_ptr<cons<T> > const & t) {
shared_ptr<cons<T> > xs(new cons<T>());
xs->elt = h;
xs->next = t;
return xs;
}

template<typename Tenv, typename T1, typename T2>
struct closure {
T2 (*func)(Tenv, T1);
Tenv env;
};

template<typename Tenv, typename T1, typename T2>
T2 apply(closure<Tenv, T1, T2> f, T1 x) {
return f.func(f.env, x);
}

template<typename T>
bool isEmpty(shared_ptr<cons<T> > const & xs) {
return xs == NULL;
}

template<typename T>
int length(shared_ptr<cons<T> > const & xs) {
if (isEmpty(xs)) return 0;
return 1 + length(xs->next);
}

bool safe(intint p1, intint p2) {
int x1=p1.x, y1=p1.y, x2=p2.x, y2=p2.y;
return x1 != x2 && y1 != y2 && x2-x1 != y2-y1 && x1-y2 != x2-y1;
}

template<typename Tenv, typename T>
shared_ptr<cons<T> > filter(closure<Tenv, T, bool> pred, shared_ptr<cons<T> > const & xs) {
if (isEmpty(xs)) return xs;
T h = xs->elt;
shared_ptr<cons<T> > t = xs->next;
t = filter(pred, t);
if (apply(pred, h))
return Cons(h, t);
else
return t;
}

bool safe_2(intint *env, intint p2) {
return safe(*env, p2);
}

closure<intint, intint, bool> safe_1(intint p1) {
closure<intint, intint, bool> c;
c.func = safe;
c.env = p1;
return c;
}

int search(int (*f)(shared_ptr<cons<intint> > const & , int),
int n,
shared_ptr<cons<intint> > const & qs,
shared_ptr<cons<intint> > ps,
int accu) {
if (isEmpty(ps))
if (length<intint>(qs) == n)
return f(qs, accu);
else
return accu;
else {
intint q = ps->elt;
ps = ps->next;
accu = search(f, n, qs, ps, accu);
return search(f, n, Cons(q, qs), filter(safe_1(q), ps), accu);
}
}

shared_ptr<cons<intint> > ps(int n, int i, int j) {
if (i == n)
if (j == n-1)
return shared_ptr<cons<intint> >();
else
return ps(n, 0, j+1);
else {
intint p;
p.x = i;
p.y = j;
return Cons(p, ps(n, i+1, j));
}
}

int f(shared_ptr<cons<intint> > const & xs, int n) { return n+1; }

int queens(int n) {
return search(f, n, shared_ptr<cons<intint> >(), ps(n, 0, 0), 0);
}

Jon Harrop said...

@helium: That is 70% faster but still 6× slower than OCaml when thread safe and 5× slower than OCaml when thread unsafe.

Denzil said...

Part of the reason you are seeing poor performance with boost::shared_ptr<> is that you are passing them around by value (IE: copying them). Try passing by reference (IE: boost::shared_ptr&) and I suspect your performance difference will be greatly reduced.

-Denzil

Veikko Eeva said...
This comment has been removed by the author.
Veikko Eeva said...

Denzil, C++ general allocator isn't that good at numerous small allocations from the heap which is exactly what happens in the Cons function. The number of cons allocations grows rapidly as a function of input number.

You could make a point in that the test is measuring more the corresponding memory allocation strategies and not reference counting (though there's quite a lot of "needless" copies without passing by reference). On ther other hand, you could argue C++ -- and reference counting -- doesn't shield one from such problems. That is what I and, I think, Arseny alluded at when we wrote a C++ programmer wouldn't write code like that.

Jon Harrop said...

@Denzil: "I suspect your performance difference will be greatly reduced". This was already done by Helium here in the comments and I posted my benchmark results for the optimized version: still 5× slower than OCaml.

@Veikko: "You could make a point in that the test is measuring more the corresponding memory allocation strategies and not reference counting". How do you explain the fact that the solution using malloc/free is 2.5× faster than the fastest one using shared_ptr? That discrepancy can only be a measure of the (huge) overhead of shared_ptr.

Arseny Kapoulkine said...

@Flying Frog Consultancy

That is because shared_ptr has an externally allocated counter, which by default (unless BOOST_SP_USE_QUICK_ALLOCATOR is defined) uses malloc/free, so you get twice the amount of malloc/free. Defining BOOST_SP_USE_QUICK_ALLOCATOR should reduce the time gap.

Jon Harrop said...

@Arseny: That is almost 20% faster again. Boost's intrusive_ptr is slightly faster still because it puts the reference count into the object itself, avoiding that extra allocation entirely. Even with this, thread unsafe reference counting with smart pointers is still 4× slower than OCaml.

Unknown said...


Nice blog it is informative thank you for sharing Python Online Training

Anonymous said...

Sharp
Lampung
Metroyoutube
youtube
lampung
kuota
Indonesia

Repairtech Solutions said...

Right here is the right site for anyone who would like to find out about this topic. You know so much its almost tough to argue with you (not that I personally will need to…HaHa). You definitely put a new spin on a subject which has been written about for ages. Excellent stuff, just excellent! vivo V17 pro display replacement marathahalli Aw, this was a very nice post. Finding the time and actual effort to produce a superb article… but what can I say… I procrastinate a lot and don't manage to get nearly anything done.oneplus 7 pro display replacement bangalore


Repairtech Solutions said...

An impressive share! I've just forwarded this onto a colleague who had been doing a little research on this. And he actually ordered me lunch because I stumbled upon it for him... lol. So allow me to reword this.... Thank YOU for the meal!! But yeah, thanx for spending time to talk about this matter here on your site. samsung M30s display replacement After looking over a few of the blog posts on your blog, I truly appreciate your way of blogging. I bookmarked it to my bookmark website list and will be checking back soon. Take a look at my web site as well and let me know how you feel. iphone 11 pro display replacement marathahalli

Repairtech Solutions said...

You made some decent points there. I checked on the internet for additional information about the issue and found most people will go along with your views on this site. realme x display replacement Good post. I learn something totally new and challenging on sites I stumbleupon everyday. It will always be exciting to read through content from other authors and use something from other web sites. redmi note 8 pro display replacement bangalore

meritstep said...

I just loved your article on the beginners guide to starting a blog.If somebody take this blog article seriously in their life, he/she can earn his living by doing blogging.thank you for thizs article. python online training

Angular expert said...

Nice post. I learn something totally new and challenging on blogs I stumbleupon on a daily basis. It's always exciting to read articles from other writers and use something from their sites.

Selenium Courses in Marathahalli

selenium institutes in Marathahalli

selenium training in Bangalore

Selenium Courses in Bangalore

best selenium training institute in Bangalore

selenium training institute in Bangalore

Angular expert said...

You've made some decent points there. I checked on the web to find out more about the issue and found most individuals will go along with your views on this website.

Best Advanced Java Training In Bangalore Marathahalli

Advanced Java Courses In Bangalore Marathahalli

Advanced Java Training in Bangalore Marathahalli

Advanced Java Training Center In Bangalore

Advanced Java Institute In Marathahalli

meritstep Technology said...

Thanks for Sharing This Article.It is very so much valuable content. I hope these Commenting lists will help to my website
microservices online training

Nandhini said...

Data Science with Python Training in BTM
UI and UX Training in BTM
Angular training in bangalore
Web designing Training in BTM
Digital Marketing Training in BTM

Jack sparrow said...


Hello, I have gone through your post Its really awesome.Thats a great article. I am also want to share about advanced python course and python tutorial for beginners .thank you

Infocampus said...

This is the most supportive blog which I have ever observed.
Java Training in Bangalore
Ui Development Training in Bangalore

meritstep Technology said...

Thanks for Sharing This Article.It is very so much valuable content. I hope these Commenting lists will help to my website
blockchain online training
best blockchain online training
top blockchain online training