Tuesday, 19 October 2010

Can you repro this 64-bit .NET GC bug?

Update: Maoni Stephens of Microsoft, the author of this new "Background" garbage collector in .NET 4, has now reproduced the bug and acknowledged that we really are seeing GC pauses lasting several minutes.

Whilst developing low latency software on .NET, we have discovered a serious bug in the .NET 4 concurrent workstation garbage collector that can cause applications to hang for up to several minutes at a time.

On three of our machines the following simple C# program causes the GC to leak memory until none remains and a single mammoth GC cycle kicks in, stalling the program for several minutes (!) while 11GB of heap is recycled:

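static void Main(string[] args) {
  // Element type assumed to be object: each Enqueue(0) then boxes a fresh int on the heap.
  var q = new System.Collections.Generic.Queue<object>();
  while (true) {
    q.Enqueue(0);
    // Cap the queue at one million live elements.
    if (q.Count > 1000000)
      q.Dequeue();
  }
}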

You need to compile for x64 on a 64-bit Windows OS with .NET 4 and run with the default (concurrent workstation) GC using the default (interactive) latency setting.
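
If you are unsure whether your build matches these settings, a small check along the following lines may help. This is only a sketch: the class name GcSetup and its Describe method are made up for illustration, and it relies on the documented .NET 4 behaviour that the latency mode reports Interactive when concurrent GC is enabled on the workstation collector.

```csharp
using System;
using System.Runtime;

static class GcSetup
{
    // Summarises the settings that the repro depends on.
    public static string Describe()
    {
        return string.Format(
            "64-bit process: {0}\nCLR version: {1}\nServer GC: {2}\nLatency mode: {3}",
            IntPtr.Size == 8,         // 8-byte pointers mean an x64 process
            Environment.Version,      // major version 4 for .NET 4
            GCSettings.IsServerGC,    // expected to be False for the workstation GC
            GCSettings.LatencyMode);  // expected to be Interactive with concurrent GC on
    }

    static void Main()
    {
        Console.WriteLine(Describe());
    }
}
```

If the output shows a 32-bit process, the server GC, or a Batch latency mode, the program is not running under the configuration described above.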

Here's what the Task Manager looks like when running this program on this machine:

Note that 11GB of heap have been leaked here when this program requires no more than 100MB of memory.
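
Task Manager shows memory from the outside, so it can be useful to log the managed heap and the process counters from inside the program as well. The following is a rough sketch rather than the code we ran: MemoryProbe and Snapshot are hypothetical names, the queue is assumed to hold boxed objects (Queue<object>), and the extra logging may itself perturb the timing that triggers the bug.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

static class MemoryProbe
{
    const double MB = 1024.0 * 1024.0;

    // One line of memory statistics for the current process.
    public static string Snapshot()
    {
        using (var p = Process.GetCurrentProcess())
        {
            return string.Format(
                "managed heap {0:F0} MB, private bytes {1:F0} MB, working set {2:F0} MB, gen0/1/2 collections {3}/{4}/{5}",
                GC.GetTotalMemory(false) / MB,  // false: do not force a collection first
                p.PrivateMemorySize64 / MB,
                p.WorkingSet64 / MB,
                GC.CollectionCount(0), GC.CollectionCount(1), GC.CollectionCount(2));
        }
    }

    static void Main()
    {
        var q = new Queue<object>();  // assumed element type
        long n = 0;
        while (true)
        {
            q.Enqueue(0);
            if (q.Count > 1000000)
                q.Dequeue();

            // Log every ten million iterations to keep the overhead small.
            if (++n % 10000000 == 0)
                Console.WriteLine(Snapshot());
        }
    }
}
```

Comparing the managed heap figure with the private bytes of the process helps to distinguish memory that the GC has not yet collected from memory that the process is merely holding on to.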

We have now accumulated around a dozen repros of this bug, written in F# as well as C#, and it appears to be related to a bug in the GC write barrier when most of gen0 survives. However, Microsoft have not yet been able to reproduce it. Can you?

21 comments:

Carsten said...

Hi,

I just tried to reproduce this bug on a Win7/64 with 4 cores.
The app uses about 24% CPU (as expected) and, after about 15 min, 60-100 MB of memory, so the GC seems to do its job.

My memory stays constant and I ain't got any issues so far.

I will continue running this app for some time but I guess I've got a negative here.

Flying Frog Consultancy Ltd. said...

@Carsten: That's the behaviour we wish we were seeing!

Four people have successfully reproduced this bug also using 64-bit Windows Vista/7 so there is something more specific about your setup that means you're evading it.

Are you using .NET 4 and compiling to a ".NET Framework 4" target profile (in project properties → application)? If so, what is the version of

Carsten said...

Hi,

yes to all of your questions.

But now, after running it for some time, I have the bug too.

The memory usage of the app is at 6.5-8 GB and I see something like your sawtooth curve in my memory usage.

So I guess it's just some timing issue.

Carsten said...

Well I guess I misspelled some words there - sorry.

Also noteworthy: my memory usage is at 11.3 GB at max and drops to about 6.5 GB.

Very strange.

I had something similar with an app running on an XP Embedded machine some months ago.

There the GC didn't work at all and the app was failing after about an hour.

Strangest thing was that while I was watching the system with PC-Anywhere (a tool to look at a remote desktop and control it) the problem was non-existent.

So I watched for some hours and couldn't reproduce the problem but a short time after I called it a day I had the next call on my phone ... arg

Solved the problem by installing a newer OS there (XPe didn't support .NET 3.5 SP1 at that time).

So I'm really interested in these kinds of issues ... never thought that you could create one this easily.

paul van brenk said...

I see this on my home machine. But it doesn't repro with the profiler enabled.

Who have you been talking to @ msft?

if you want to, you can contact me here: pvanbren(at)microsoft(dot)com

Michael Robin said...

I can also repro -- using 3.5/AnyCPU the process mem never got above ~100M. Using 4.0/x64 I get a sawtooth with the process memory maxing at about 3.5GB. (I think it's fair to look at the mem column for the process only - I think you're looking at the mem committed by the whole system in your 11GB number, no?)
Weird note: I could be mistaken, but I thought there was one instance in about 10 tries where this *didn't* happen.

Matt said...

I also see the sawtooth pattern: http://imgur.com/42t5h.png. This is on a 4GB machine, Win7, x64. The sample code quickly ramps up its Working Set size to 2.2GB as measured by Process Explorer from Sysinternals.

However, when the GC runs my machine isn't paused. I can listen to Pandora in the background throughout the whole time, for example. I guess it's only reclaiming ~2.1GB. When it starts out, the process uses ~100MB.

Rush said...

confirmed!

http://files.rushfrisby.com/images/gc_x64_leak.png

Nick Martyshchenko said...

First of all, I can repro too..

But, you don't give GC any chance :) It wants to help you gently.

Try:

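int k = 0; // iteration counter (declaration added; needs "using System.Threading;")
while (true)
{
    q.Enqueue(0);
    if (q.Count > 1000000)
        q.Dequeue();

    // Every million iterations, sleep briefly so the GC can catch up.
    ++k;
    if (k == 1000000)
    {
        k = 0;
        Thread.Sleep(100);
    }
}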

And then the GC will have enough time to keep your heap small.

Carsten said...

Nice "bugfix".

Reminds me of the old Windows 3.1 days though ("multitasking" ;) )

Nick Martyshchenko said...

No, it's not a bugfix by any means :) I just wanted to show that the GC doesn't interrupt the process to collect as it does in x86 mode. But if you pause polluting memory then the GC does its work. It can be treated as a bug, though I personally prefer to manipulate memory directly "by hand" under stress conditions like those shown in the post.

Carsten said...

I'm really no expert on GC but doesn't this one live in its own thread/process?
And if it does its work correctly it can go and reclaim the memory without interfering with your program at all. So it can happily free memory on one processor while you run your program on the next.
On x86 there is no interrupt on your process either.
You can check this by triggering a bug in the 3.5 framework:
Using a Transaction object without explicitly keeping the Connection safe can cause you major pain when the GC destroys the connection before you can end the transaction - I had this bug in one of my business apps and it took me weeks to find the issue .... maybe I should have googled before ;)

Jules said...

I couldn't reproduce this.

Flying Frog Consultancy Ltd. said...

@Jules: That's very interesting. Can you describe your exact setup?

Jules said...

Hmm, now it does leak:

http://i.imgur.com/YLnkr.png

steve said...

Well, that's why you better use the JVM for such things.

It's not that the JVM is bug-free but at least it gets tested before it gets released.

Flying Frog Consultancy Ltd. said...

@steve: There are lots of other options as well, of course. We have found that OCaml's pause times are around 100× shorter than .NET's pause times. Another issue for us was the extremely poor performance of .NET serialization and, again, we found that OCaml is over 170× faster than .NET.

Given that our code is written in F#, the obvious backup language is OCaml rather than anything JVM based. Indeed, rewriting all of our code that relies upon tail calls would be quite a laborious undertaking...

Paul said...

One thing to check is that this is actually gen0 memory rather than just process memory, as the GC has a feature which means it won't release memory from a process until it hits the high memory watermark. Think of the behaviour of SQL Server.

W.Cook said...

Was there any resolution to this? We had the same problem a while back and had to go back to .NET 3.5. Does anyone know if the problem has been fixed?