This will be a short interlude from Edulinq, although, as will become obvious, it was in the process of writing the next Edulinq post that I made this discovery. It may be known about elsewhere, but it's the first time I'd come across it - or rather, it's the first time I've considered the impact of the compiler's decisions in this area.
This post is about details of iterator blocks and memory usage. It's going to be quite important that you can easily follow exactly what I'm talking about, so I'm going to define a few things up-front.
When I talk about a sequence, I mean an IEnumerable<T> (or the nongeneric IEnumerable).
When I talk about an iterator, I mean the result of calling GetEnumerator() on a sequence - it's an IEnumerator<T> (or IEnumerator).
The C# language specification calls a sequence an enumerable object and it calls an iterator an enumerator object. However, those terms are easy to get mixed up - partly because they differ in so few letters - hence my use of sequence and iterator.
An iterator block is a method declared to return IEnumerable, IEnumerator, IEnumerable<T> or IEnumerator<T>, and implemented using at least one yield statement.
The details of exactly how the compiler transforms iterator blocks are outside the scope of the C# specification. It does talk about some details, and gives an implementation example, but there's plenty of scope for compiler-specific variation.
The Microsoft C# compiler does a "neat trick" which is the source of the gotcha I'm describing today. When it creates a hidden type to represent a sequence, the same type is used to represent the iterator, and the first time that GetEnumerator() is called on the same thread as the one which created the sequence, it returns a reference to "this" rather than creating a new object.
This is easier to follow with an example. Consider this method:
static IEnumerable<int> CountingSequence(int count)
{
    for (int i = 0; i < count; i++)
    {
        yield return i;
    }
}
After compilation, my method ended up looking like this:
private static IEnumerable<int> CountingSequence(int count)
{
    Test.<CountingSequence>d__0 d__ = new Test.<CountingSequence>d__0(-2);
    d__.<>3__count = count;
    return d__;
}
We can see that the <>3__count field of the generated type represents the initial value for the "count" parameter - although my code happens not to change the parameter value, it's important that it's preserved for all iterators generated from the same sequence. The -2 argument to the constructor is to represent the state of the object: -2 is used to represent a sequence rather than an iterator.
I'm not going into all the details of the generated type in this post (I have an article which goes into some more details) but the important part is the GetEnumerator() method:
[DebuggerHidden]
IEnumerator<int> IEnumerable<int>.GetEnumerator()
{
    Test.<CountingSequence>d__0 d__;
    if ((Thread.CurrentThread.ManagedThreadId == this.<>l__initialThreadId)
        && (this.<>1__state == -2))
    {
        this.<>1__state = 0;
        d__ = this;
    }
    else
    {
        d__ = new Test.<CountingSequence>d__0(0);
    }
    d__.count = this.<>3__count;
    return d__;
}
The line "d__ = this;" is the crucial one for the purpose of this post: a sequence's first iterator is itself.
We'll look into the negative ramifications of this in a minute, but first let's think about the upsides. Basically, it means that if we only need to iterate over the sequence once, we only need a single object. It's reasonable to suspect that the "fetch and iterate once" scenario is an extremely common one, so it's fair to optimize for it. I'm not sure it's actually worth the extra complexity introduced though - especially when there can be a heavy penalty in some odd cases. I assume Microsoft has performed more performance testing of this than I have, but they may not have considered the specific issue mentioned in this post. Let's look at it now.
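The aliasing is easy to observe directly with ReferenceEquals. Here's a minimal sketch (the type and method names here are mine, purely for illustration) showing that the first iterator requested on the creating thread is the sequence object itself, while later iterators are fresh objects:

```csharp
using System;
using System.Collections.Generic;

class AliasingDemo
{
    // Any iterator block returning IEnumerable<T> will do here.
    static IEnumerable<int> Numbers()
    {
        yield return 1;
    }

    static void Main()
    {
        IEnumerable<int> sequence = Numbers();
        IEnumerator<int> first = sequence.GetEnumerator();
        IEnumerator<int> second = sequence.GetEnumerator();

        // The first iterator requested on the creating thread
        // is the very same object as the sequence...
        Console.WriteLine(ReferenceEquals(sequence, first));  // True

        // ...but every subsequent iterator is a brand new object.
        Console.WriteLine(ReferenceEquals(sequence, second)); // False
    }
}
```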
I'll present the problem using the context in which I discovered it: the LINQ "Distinct" operator. First I'll demonstrate it using the LINQ to Objects implementation, then show a few fixes we can make if we use our own one.
Here's the sample program we'll be using throughout, sometimes with a few modifications:
using System;
using System.Collections.Generic;
using System.Linq;
class Test
{
static void Main()
{
var source = Enumerable.Range(0, 1000000);
var query = source.Distinct();
ShowMemory("Before Count()");
query.Count();
ShowMemory("After Count()");
GC.KeepAlive(query);
ShowMemory("After query eligible for GC");
}
static void ShowMemory(string explanation)
{
long memory = GC.GetTotalMemory(true);
Console.WriteLine("{0,-30} {1}",
explanation + ":",
memory);
}
}
Obviously ShowMemory is just a diagnostic method. What we're interested in is what happens in Main:
- We create a source sequence. In this case it's a range, but it could be anything that returns a lot of distinct elements.
- We create a query by calling Distinct. So far, so good - we haven't done anything with the range yet, or iterated over anything.
- We dump the total memory used.
- We count the elements in the query results: this has to iterate over the query, and will dispose of the iterator afterwards.
- We dump the total memory used.
- We call GC.KeepAlive simply to make sure that the query sequence itself isn't eligible for garbage collection until this point. We no longer explicitly have a reference to the iterator used by Count, but we do have a reference to the sequence up until now.
- We dump memory one final time... but because this occurs after the GC.KeepAlive call (and there are no later references to query) the sequence can be garbage collected.

Note that the GetTotalMemory(true) call performs a full GC each time. I dare say it's not foolproof, but it gives a pretty reasonable indication of memory that can't be collected yet.
Now in a sensible world, I would expect the memory to stay reasonably constant. The Count() method may well generate a lot of garbage as it iterates over the query, but that's fine - it should all be transient, right?
Wrong. Here's one sample set of results:
Before Count():                47296
After Count():                 16828480
After query eligible for GC:   51140
Until we allow the sequence to be garbage collected, we're left with all the garbage which was associated with the iterator used by Count. "All the garbage" is a set of elements which have previously been returned - Distinct needs to remember them all to make sure it doesn't return any duplicates.
The problem is the "optimization" performed by the C# compiler: the iterator used by Count() is the same object that "query" refers to, so it's got a reference to that large set of integers... a set we no longer actually care about.
As soon as the sequence can be garbage collected, we're back down to modest memory use - the difference between the first and third lines of output is probably due to the code being loaded and JIT compiled over the course of the program.
Note that if we extended our query to call Where, Select etc after Distinct, we'd have the same problem - each sequence in the chain would keep a reference to its own "source" sequence, and eventually you'd get back to the Distinct sequence and all its post-iteration garbage.
Okay, enough doom and gloom... how can we fix this?
If we're aware of the problem, we can sometimes fix it ourselves, even if we only own the "client" code. Our sample program can be fixed with a single line:
using (query.GetEnumerator());

Just put that anywhere before the call to Count(), and we're in the clear. The compiler gives a warning due to the possibly-accidental empty statement, mind you... why would we want to ask for a value only to dispose of it?
If you remember, the GetEnumerator method only returns "this" the first time you call it. So we're just creating that first iterator and immediately disposing of it. None of the code that was originally within the Distinct implementation will be run, so we won't generate any garbage.
Sure enough, the results bear out the theory:
Before Count():                47296
After Count():                 51228
After query eligible for GC:   51140
Unfortunately, this relies on two things:
- Every caller has to know this is a potential problem. I've never heard anyone mention it before, and I only discovered it myself yesterday. Note that it's only likely to be a problem for sequence types which are generated by the C# compiler... how much do you want to be at the mercy of the implementation details of whatever you're calling?
- It only works if you can force GetEnumerator() to be called on the problematic sequence. If we add a call to Select(x => x) to the end of our query declaration, we're back to square one - because calling GetEnumerator() on the Select sequence doesn't force GetEnumerator() to be called on the Distinct sequence.

It's an interesting fix, but basically not a good idea in the long run.
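That second limitation can be demonstrated with a small spy sequence which counts calls to its GetEnumerator(). This is a sketch of my own, assuming only the documented laziness of Select: it doesn't ask its source for an iterator until the result is actually iterated:

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

// A spy sequence which counts how many times GetEnumerator() is called on it.
class SpySequence : IEnumerable<int>
{
    public int GetEnumeratorCalls { get; private set; }

    public IEnumerator<int> GetEnumerator()
    {
        GetEnumeratorCalls++;
        return ((IEnumerable<int>) new[] { 1, 2, 3 }).GetEnumerator();
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}

class SelectDemo
{
    static void Main()
    {
        var spy = new SpySequence();
        var query = spy.Select(x => x);

        // Asking the Select sequence for an iterator doesn't touch the source...
        using (query.GetEnumerator()) {}
        Console.WriteLine(spy.GetEnumeratorCalls); // 0

        // ...only actually iterating the query does.
        query.Count();
        Console.WriteLine(spy.GetEnumeratorCalls); // 1
    }
}
```

So the "fetch and discard an iterator" trick never reaches the sequence we actually need to poke.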
Obviously I can't change LINQ to Objects, so at this point I'll resort to a very simplistic custom implementation of Distinct. I'm not going to bother about the overload with a custom equality comparer, and I won't even perform argument validation - although I will split the implementation into two methods as we'd have to for normal argument validation to work eagerly. That will be useful later. Here's the code in its broken state:
public static class CustomEnumerable
{
    public static IEnumerable<TSource> Distinct<TSource>(
        this IEnumerable<TSource> source)
    {
        return DistinctImpl(source);
    }

    private static IEnumerable<TSource> DistinctImpl<TSource>(
        IEnumerable<TSource> source)
    {
        HashSet<TSource> seen = new HashSet<TSource>();
        foreach (TSource item in source)
        {
            if (seen.Add(item))
            {
                yield return item;
            }
        }
    }
}
Readers who've been following my Edulinq series should be pretty familiar with this. It suffers from the same problem as the LINQ to Objects implementation, which is what I'd expect. Now let's do something really horrible - set a "local" variable to null when we know it will no longer be referenced within the method:
private static IEnumerable<TSource> DistinctImpl<TSource>(
    IEnumerable<TSource> source)
{
    HashSet<TSource> seen = new HashSet<TSource>();
    try
    {
        foreach (TSource item in source)
        {
            if (seen.Add(item))
            {
                yield return item;
            }
        }
    }
    finally
    {
        seen = null;
    }
}
This is horrible code. If I hadn't come across this problem, and someone presented me with the above code in a code review, I would sigh and shake my head. Setting variables to null at the end of a method for the sake of garbage collection normally shows a hideous misunderstanding of how variables and the GC work.
In this case, it fixes the problem... but at the cost of the ability to look our fellow developers in the eye.
The problem we're facing stems from the compiler optimization used when an iterator block returns a sequence type. What if it only returns an iterator? Well, we still need something to implement IEnumerable<T> - something which calls the iterator block method each time GetEnumerator() is called. A simple delegating sequence does the job:

internal sealed class DelegatingSequence<T> : IEnumerable<T>
{
    private readonly Func<IEnumerator<T>> func;

    public DelegatingSequence(Func<IEnumerator<T>> func)
    {
        this.func = func;
    }

    public IEnumerator<T> GetEnumerator()
    {
        return func();
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}
Now we know that the sequence will be separate from the iterator. Of course we could be given a delegate which does the wrong thing, but it's easy to use correctly. Here's the corresponding code for Distinct:
public static IEnumerable<TSource> Distinct<TSource>(
    this IEnumerable<TSource> source)
{
    return new DelegatingSequence<TSource>(() => DistinctImpl(source));
}

private static IEnumerator<TSource> DistinctImpl<TSource>(
    IEnumerable<TSource> source)
{
    // Body exactly as before - but note that the iterator block
    // now returns IEnumerator<TSource> rather than IEnumerable<TSource>
}
This now solves the "bulky iterator problem", and it's a pretty easy fix to propagate wherever it's necessary... but again, only if you're aware of the issue to begin with.
There are already a few attributes which change the generated IL: we could add another one to suppress this optimization, and apply it to any iterator blocks we want to behave differently:
[AliasSequenceAndIterator(false)]
private static IEnumerable<TSource> DistinctImpl<TSource>(
    IEnumerable<TSource> source)
{
    ...
}
I very much doubt that this will make it into the C# compiler any time soon - and frankly I'm not too happy with it anyway, especially as it still requires anyone using an iterator block to be aware of the issue.
The compiler knows what "local variables" we're using. It's already implementing a Dispose method for various reasons, including setting the iterator state appropriately. Why can't it just set all the fields associated with local variables from the source code to their default values as part of disposal?
There's one reason I can think of, although it would come up rarely: the variables might be captured by anonymous functions within the iterator block. The delegates created from those anonymous functions may still be alive somewhere. Fortunately, the compiler is presumably aware of which variables have been captured. It could simply have a policy of turning off the optimization completely for iterator blocks which capture variables in anonymous functions, and clearing the appropriate variables for all the other situations.
At the same time, the generated code could be changed to clear the field which underlies the Current property - at the moment, if you ask an auto-generated iterator for Current after you've finished iterating over it, the result will be the last value yielded. This could be another subtle pseudo-leak, unlikely as it is. (Aside from anything else, it's just not neat at the moment.)
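The current behaviour is easy to verify - a quick sketch of mine:

```csharp
using System;
using System.Collections.Generic;

class CurrentDemo
{
    static IEnumerable<int> Numbers()
    {
        yield return 10;
        yield return 42;
    }

    static void Main()
    {
        IEnumerator<int> iterator = Numbers().GetEnumerator();
        while (iterator.MoveNext()) {}

        // Iteration has finished, but the field backing Current
        // still holds the last value yielded...
        Console.WriteLine(iterator.Current); // 42

        iterator.Dispose();
        // ...and disposal doesn't clear it either.
        Console.WriteLine(iterator.Current); // 42
    }
}
```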
I'd love to see the data on which Microsoft based the decision to optimize the generated code this way. I have confidence that there is such data somewhere - although I wonder whether it has been updated since the advent of LINQ, which I suspect is one of the heaviest users of autogenerated iterators. It's quite possible that this is still a "win" in most cases - just like the dodgy-sounding mutable struct iterator returned by the public List<T>.GetEnumerator() method.
If the optimization were removed, there would be two options:

- Stick with a single generated type implementing both the sequence and iterator interfaces, but always create a new instance for iterators
- Create one sequence type and one iterator type

Personally the second sounds cleaner to me - and it means that each instance would only need the fields relevant to it. But cleanliness isn't terribly important for generated code, so long as it doesn't have significant issues, so I wouldn't be too upset if the team took the first approach. I expect that in either case it would simplify the code - there'd be no need to remember the thread a sequence was created on, and so on.
The very fact that I've never heard of this problem before suggests that it's not very serious - but it feels serious. While it's possibly rare to hold onto a query for a long time, it ought to be reasonable to do so without incurring a memory penalty related to the size of the query the first time it was executed. This doesn't affect all query operators, of course - and I haven't investigated which ones it does affect. The ones I'd look at are those which have to buffer their data for some reason - including ones which operate on two sequences, buffering one and streaming the other.
Out of all the possible solutions, I would personally favour the last two: having the compiler clear the relevant fields on disposal, or removing the optimization entirely. This really ought to be a compiler fix - it's unreasonable to suggest that all developers should become aware of this idiosyncrasy and work round it. The part of me which longs for simplicity prefers removing the optimization; in reality I suspect that clearing the fields is more likely. It's quite possible that the C# team will decide that it's a sufficiently harmless situation that they won't fix it at all - and I'm not going to claim to know their product and priorities better than they do. (Obviously I'm making them aware of my findings, but not in a fix-this-or-else way.)
Whatever happens, it's been a real blast investigating this... I never get tired of learning more about the details of source transformations and their (sometimes inadvertent) side-effects.

