
Saturday, 19 March 2011

Reimplementing LINQ to Objects: Optimization

I'm not an expert in optimization, and most importantly I don't have any real-world benchmarks to support this post, so please take it with a pinch of salt. That said, let's dive into what optimizations are available in LINQ to Objects.


Just as we think of refactoring as changing the internal structure of code without changing its externally visible behaviour, optimization is the art/craft/science/voodoo of changing the performance of code without changing its externally visible behaviour. Sort of.


This requires two definitions: "performance" and "externally visible behaviour". Neither are as simple as they sound. In almost all cases, performance is a trade-off, whether in speed vs memory, throughput vs latency, big-O complexity vs the factors within that complexity bound, and best case vs worst case.


In LINQ to Objects the "best case vs worst case" is the balance which we need to consider most often: in the cases where we can make a saving, how significant is that saving? How much does it cost in every case to make a saving some of the time? How often do we actually take the fast path?


Externally visible behaviour is even harder to pin down, because we definitely don't mean all externally visible behaviour. Almost all the optimizations in Edulinq are visible if you work hard enough - indeed, that's how I have unit tests for them. I can test that Count() uses the Count property of an ICollection<T> instead of iterating over it by creating an ICollection<T> implementation which works with Count but throws an exception if you try to iterate over it. I don't think we care about that sort of change to externally visible behaviour.
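Here's a sketch of how such a test double might look - the NonEnumerableCollection name and shape are my illustration, not Edulinq's actual test code:

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

// A hostile collection: Count works, but any attempt to iterate throws.
// If Count() returns without throwing, the ICollection<T> fast path was used.
public class NonEnumerableCollection<T> : ICollection<T>
{
    public int Count { get { return 42; } }
    public bool IsReadOnly { get { return true; } }
    public IEnumerator<T> GetEnumerator()
    {
        throw new InvalidOperationException("Iterated!");
    }
    IEnumerator IEnumerable.GetEnumerator()
    {
        throw new InvalidOperationException("Iterated!");
    }
    // Mutating members are irrelevant to this test.
    public void Add(T item) { throw new NotSupportedException(); }
    public void Clear() { throw new NotSupportedException(); }
    public bool Contains(T item) { throw new NotSupportedException(); }
    public void CopyTo(T[] array, int arrayIndex) { throw new NotSupportedException(); }
    public bool Remove(T item) { throw new NotSupportedException(); }
}

public class CountOptimizationTest
{
    static void Main()
    {
        // Prints 42: Count() trusts the collection rather than iterating.
        Console.WriteLine(new NonEnumerableCollection<int>().Count());
    }
}
```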


What we really mean is, "If we use the code in a sensible way, will we get the same results with the optimized code as we would without the optimization?" Much better. Nothing woolly about the term "sensible" at all, is there? More realistically, we could talk about a system where every type adheres to the contracts of every interface it implements - that would at least get rid of the examples used for unit testing. Still, even the performance of a system is externally visible... it's easy to tell the difference between an implementation of Count() which is optimized and one which isn't, if you've got a list of 10 million items.


Effectively we have one technique for significant optimization in LINQ to Objects: finding out that a sequence implements a more capable interface than IEnumerable<T>, and then using that interface. (Or in the case of my optimization for HashSet<T> and Contains, using a concrete type in the same way.) Count() is the most obvious example of this: if the sequence implements ICollection<T> or ICollection, then we can use the Count property on that interface and we're done. No need to iterate at all.
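In code, that fast path looks something like this - a sketch of the shape described above, not necessarily Edulinq's exact implementation:

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

public static class CountSketch
{
    public static int Count<TSource>(this IEnumerable<TSource> source)
    {
        if (source == null) throw new ArgumentNullException("source");

        // O(1) fast paths: the collection already knows its size.
        ICollection<TSource> genericCollection = source as ICollection<TSource>;
        if (genericCollection != null)
        {
            return genericCollection.Count;
        }
        ICollection nonGenericCollection = source as ICollection;
        if (nonGenericCollection != null)
        {
            return nonGenericCollection.Count;
        }

        // O(n) slow path: count by iterating.
        checked
        {
            int count = 0;
            using (IEnumerator<TSource> iterator = source.GetEnumerator())
            {
                while (iterator.MoveNext())
                {
                    count++;
                }
            }
            return count;
        }
    }
}
```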


These are generally good optimizations because they allow us to transform an O(n) computation into an O(1) computation. That's a pretty big win when it's applicable, and the cost of checking is reasonably small. So long as we hit the right type once every so often - and particularly if in those cases the sequences are long - then it's a net gain. The performance characteristics are likely to be reasonably fixed here for any one program - it's not like we'll sometimes win and sometimes lose for a specific query... it's that for some queries we win and some we won't. If we were micro-optimizing, we might want a way of calling a "non-optimized" version which didn't even bother trying the optimization, because we know it will always fail. I would regard such an approach as a colossal waste of effort in the majority of cases.


Some optimizations are slightly less obvious, particularly because they don't offer a change in the big-O complexity, but can still make a significant difference. Take ToArray, for example. If we know the sequence is an IList<T> we can construct an array of exactly the right size and ask the list to copy the elements into it. Chances are that copy can be very efficient indeed, basically copying a whole block of bits from one place to another - and we know we won't need any resizing. Compare that with building up a buffer, resizing periodically and copying all the elements we've already discovered each time. Every part of that process is going to be slower, but both are O(n) operations really. This is a good example of where big-O notation doesn't tell the whole story. Again, the optimization is almost certainly a good one to make.
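A sketch of that ToArray shape - the fallback here leans on List<T> to do the buffer-doubling, which isn't necessarily how Edulinq's real code is structured:

```csharp
using System;
using System.Collections.Generic;

public static class ToArraySketch
{
    public static TSource[] ToArray<TSource>(this IEnumerable<TSource> source)
    {
        if (source == null) throw new ArgumentNullException("source");

        IList<TSource> list = source as IList<TSource>;
        if (list != null)
        {
            // Allocate exactly the right size once, then bulk-copy:
            // typically a fast block copy, with no resizing ever needed.
            TSource[] result = new TSource[list.Count];
            list.CopyTo(result, 0);
            return result;
        }

        // Slow path: build up a buffer, resizing as we go.
        return new List<TSource>(source).ToArray();
    }
}
```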


Then there are distinctly dodgy optimizations which can make a difference, but are unlikely to apply. My optimization for ElementAt and ElementAtOrDefault comes into play here. It's fine to check whether an object implements IList<T>, and use the indexer if so. That's an obvious win. But I have an extra optimization to exit quickly if we can find out that the given index is out of the bounds of the sequence. Unfortunately that optimization is only useful when:

- The sequence implements ICollection<T> or ICollection (but remember it has to implement IEnumerable<T> to be a valid source - there aren't many collections implementing only the non-generic ICollection alongside the generic IEnumerable<T>)
- The sequence doesn't implement IList<T> (which gets rid of almost all implementations of ICollection<T>)
- The given index is actually greater than or equal to the size of the collection
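Putting those conditions into code, the whole method might look like this (a sketch; the structure and names are mine):

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

public static class ElementAtSketch
{
    public static TSource ElementAt<TSource>(this IEnumerable<TSource> source, int index)
    {
        if (source == null) throw new ArgumentNullException("source");
        if (index < 0) throw new ArgumentOutOfRangeException("index");

        // The clear win: O(1) via the indexer.
        IList<TSource> list = source as IList<TSource>;
        if (list != null)
        {
            return list[index];
        }

        // The marginal win: exit quickly if the count is known and the
        // index is definitely out of bounds.
        ICollection<TSource> genericCollection = source as ICollection<TSource>;
        if (genericCollection != null && index >= genericCollection.Count)
        {
            throw new ArgumentOutOfRangeException("index");
        }
        ICollection nonGenericCollection = source as ICollection;
        if (nonGenericCollection != null && index >= nonGenericCollection.Count)
        {
            throw new ArgumentOutOfRangeException("index");
        }

        // Slow path: walk the sequence.
        using (IEnumerator<TSource> iterator = source.GetEnumerator())
        {
            while (iterator.MoveNext())
            {
                if (index == 0)
                {
                    return iterator.Current;
                }
                index--;
            }
        }
        throw new ArgumentOutOfRangeException("index");
    }
}
```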

All that comes at the cost of a couple of type checks... not a great cost, and we do potentially save an O(n) check for being given an index out of the bounds of the collection... but how often are we really going to make that win? This is where I'd love to have something like Dapper, but applied to LINQ to Objects and running in a significant number of real-world projects, just logging in as light a way as possible how often we win, how often we lose, and how big the benefit is.


Finally, we come to the optimizations which don't make sense to me... such as the optimization for First in both Mono and LinqBridge. Both of these projects check whether the sequence is a list; if so, they check the count and then use the indexer to fetch item 0 instead of calling GetEnumerator()/MoveNext()/Current. Now yes, there's a chance this avoids creating an extra object (although not always, as we've seen before) - but those are all O(1) operations which are likely to be darned fast. At this point not only is the payback very small (if it even exists) but the whole operation is likely to be so fast that the tiny check for whether the object implements IList<T> is likely to become more significant. Oh, and then there's the extra code complexity - yes, that's only relevant to the implementers, but I'd personally rather they spent their time on other things (like getting OrderByDescending to work properly... smirk). In other words, I think this is a bad target for optimization. At some point I'll try to do a quick analysis of just how often the collection has to implement IList<T> in order for it to be worth doing this - and whether the improvement is even measurable.


Of course there are other micro-optimizations available. When we don't need to fetch the current item (e.g. when skipping over items) let's just call MoveNext() instead of also assigning the return value of a property to a variable. I've done that in various places in Edulinq, not as an optimization strategy (I suspect it doesn't make a significant difference) but for readability - to make it clearer to the reader that we're just moving along the iterator, not examining the contents as we go.
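For example, the skipping loop at the start of a Skip implementation only ever needs MoveNext() - a sketch, not Edulinq's exact code:

```csharp
using System.Collections.Generic;

public static class SkipSketch
{
    public static IEnumerable<TSource> Skip<TSource>(IEnumerable<TSource> source, int count)
    {
        using (IEnumerator<TSource> iterator = source.GetEnumerator())
        {
            // Just move along the iterator; Current is deliberately never
            // read here, because we don't care what the skipped items are.
            for (int i = 0; i < count; i++)
            {
                if (!iterator.MoveNext())
                {
                    yield break;
                }
            }
            while (iterator.MoveNext())
            {
                yield return iterator.Current;
            }
        }
    }
}
```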


The only other piece of optimization I think I've performed in Edulinq is the "yield the first results before sorting the rest" part of my quicksort implementation. I'm reasonably proud of that, at least conceptually. I don't think it really fits into any other bucket - it's just a matter of thinking about what we really need and when, deferring work just in case we never need to do it.


I've found a few optimizations in both Edulinq and other implementations which I believe to be invalid.


Here's an example I happened to look at just this morning, when reviewing the code for Skip:

var list = source as IList<TSource>;
if (list != null)
{
    count = Math.Max(count, 0);
    for (int index = count; index < list.Count; index++)
    {
        yield return list[index];
    }
    yield break;
}

If our sequence is a list, we can just skip straight to the right part of it and yield the items one at a time. That sounds great, but what if the list changes (or is even truncated!) while we're iterating over it? An implementation working with the simple iterator would usually throw an exception, as the change would invalidate the iterator. This is definitely a behavioural change. When I first wrote about Skip, I included this as a "possible" optimization - and actually turned it on in the Edulinq source code. I now believe it to be a mistake, and have removed it completely.
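To see the behavioural difference concretely, compare a plain iterator-block Skip with the indexer-based version when the list is mutated mid-iteration. Both implementations below are mine, written to make the difference deterministic:

```csharp
using System;
using System.Collections.Generic;

public static class SkipBehaviour
{
    // Plain iterator block: relies on the list's own (versioned) enumerator.
    public static IEnumerable<T> SkipIterator<T>(IEnumerable<T> source, int count)
    {
        foreach (T item in source)
        {
            if (count > 0) { count--; continue; }
            yield return item;
        }
    }

    // The "optimized" version: goes straight to the indexer.
    public static IEnumerable<T> SkipIndexer<T>(IList<T> list, int count)
    {
        for (int index = Math.Max(count, 0); index < list.Count; index++)
        {
            yield return list[index];
        }
    }

    public static void Main()
    {
        foreach (bool useIndexer in new[] { false, true })
        {
            var list = new List<int> { 0, 1, 2, 3, 4 };
            var sequence = useIndexer ? SkipIndexer(list, 2) : SkipIterator(list, 2);
            using (var iterator = sequence.GetEnumerator())
            {
                iterator.MoveNext();   // yields 2 either way
                list.RemoveAt(0);      // mutate the list mid-iteration
                try
                {
                    iterator.MoveNext();
                    Console.WriteLine("yields " + iterator.Current);  // indexer: silently yields 4
                }
                catch (InvalidOperationException)
                {
                    Console.WriteLine("throws");  // iterator: invalidated by the change
                }
            }
        }
    }
}
```

Running this prints "throws" for the iterator-block version and "yields 4" for the indexer version - the same query, with two observably different behaviours.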


Another example is Reverse, and how it should behave. The documentation is fairly unclear, but when I ran the tests, the Mono implementation used an optimization whereby if the sequence is a list, it will just return items from the tail end using the indexer. (This has now been fixed - the Mono team is quick like that!) Again, that means that changes made to the list while iterating will be reflected in the reversed sequence. I believe the documentation for Reverse ought to be clear that:

- Execution is deferred: the input sequence isn't read when the method is called.
- When the result sequence is first read by the caller, a snapshot is taken, and that's what's used to return the data.
- If the result sequence is read more than once (i.e. GetEnumerator is called more than once) then a new snapshot is created each time - so changes to the input sequence between calls to GetEnumerator on the result sequence will be observed.

Now this is still not as precise as it might be in terms of what "reading" a sequence entails - in particular, a simple implementation of Reverse (as per Edulinq) will actually take the snapshot on the first call to MoveNext() on the iterator returned by GetEnumerator() - but that's probably not too bad. The snapshotting behaviour itself is important though, and should be made explicit in my opinion.
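A simple implementation along those lines makes the timing clear - this is the shape of the approach, though the code itself is my sketch:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ReverseSketch
{
    public static IEnumerable<TSource> Reverse<TSource>(this IEnumerable<TSource> source)
    {
        if (source == null) throw new ArgumentNullException("source");
        return ReverseImpl(source);  // validation is eager; everything else is deferred
    }

    private static IEnumerable<TSource> ReverseImpl<TSource>(IEnumerable<TSource> source)
    {
        // The snapshot: taken on the first MoveNext() call, and retaken
        // each time GetEnumerator() is called on the result sequence.
        TSource[] snapshot = source.ToArray();
        for (int i = snapshot.Length - 1; i >= 0; i--)
        {
            yield return snapshot[i];
        }
    }
}
```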


The problem with both of these "optimizations" is arguably that they're applying list-based optimizations within an iterator block used for deferred execution. Optimizing for lists either upfront at the point of the initial method call or within an immediate execution operator (Count, ToList etc) is fine, because we assume the sequence won't change during the course of the method's execution. We can't make that assumption with an iterator block, because the flow of the code is very different: our code is visited repeatedly based on the caller's use of MoveNext().


Another aspect of behaviour which isn't well-specified is that of identity. When is it valid for an operator to return the input sequence itself as the result sequence?


In the Microsoft implementation, this can occur in two operators: AsEnumerable (which always returns the input sequence reference, even if it's null) and Cast (which returns the original reference only if it actually implements IEnumerable<TResult>).


In Edulinq, I have two other operators which can return the input sequence: OfType (only if the original reference implements IEnumerable<TResult> and TResult is a non-nullable value type) and Skip (if you provide a count which is zero or negative). Are these valid optimizations? Let's think about why we might not want them to be...


If you're returning a sequence from one layer of your code to another, you usually want that sequence to be viewed only as a sequence. In particular, if it's backed by a List<T>, you don't want callers casting to List<T> and modifying the list. With any operator implemented by an iterator block, that's fine - the object returned from the operator has no accessible reference to its input, and the type itself only implements IEnumerable<T> (and IEnumerator<T>, and IDisposable, etc - but not IList<T>). It's not so good if the operator decides it's okay to return the original reference.
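A quick demonstration of the leak - AsEnumerable hands back the very same reference, so a caller can get write access to the underlying list:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class IdentityLeak
{
    static void Main()
    {
        List<int> list = new List<int> { 1, 2, 3 };

        IEnumerable<int> hidden = list.Where(x => true);  // iterator-based: new object
        IEnumerable<int> leaked = list.AsEnumerable();    // same reference as the input

        Console.WriteLine(ReferenceEquals(list, hidden)); // False
        Console.WriteLine(ReferenceEquals(list, leaked)); // True

        // The caller can now mutate what was supposedly "just a sequence":
        ((List<int>)leaked).Add(4);
        Console.WriteLine(list.Count);                    // 4
    }
}
```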


The C# language specification refers to this in the section about query expression translation: a no-op projection at the end of a query can be omitted if and only if there are other operators in the query. So a query expression of "from foo in bar select foo" will translate to "bar.Select(foo => foo)" but if we had a "where" clause in the query, the Select call would be removed. It's worth noting that the call to "Cast" generated when you explicitly specify the type of a range variable is not enough to prevent the "no-op" projection from being generated... it's almost as if the C# team "knows" that Cast can leak sequence identity whereas Where can't.


Personally I think that the "hiding" of the input sequence should be guaranteed where it makes sense to do so, and explicitly not guaranteed otherwise. We could also add an operator called something like "HideIdentity" which would simply (and unconditionally) add an extra iterator block into the pipeline. That way library authors wouldn't have to guess, and would have a clear way of expressing their intention. Using Select(x => x) or Skip(0) is not clear, and in the case of Skip it would even be pointless when using Edulinq.
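The whole operator would be trivial to write - something like this (HideIdentity is the hypothetical operator proposed above; no LINQ implementation actually provides it):

```csharp
using System;
using System.Collections.Generic;

public static class HideIdentitySketch
{
    public static IEnumerable<TSource> HideIdentity<TSource>(this IEnumerable<TSource> source)
    {
        if (source == null) throw new ArgumentNullException("source");
        return HideIdentityImpl(source);
    }

    private static IEnumerable<TSource> HideIdentityImpl<TSource>(IEnumerable<TSource> source)
    {
        // An unconditional iterator block: yields the same elements, but the
        // returned object carries no publicly accessible reference to source.
        foreach (TSource item in source)
        {
            yield return item;
        }
    }
}
```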


As for whether my optimizations are valid - that's up for debate, really. It seems hard to justify why leaking sequence identity would be okay for Cast but not okay for OfType, whereas I think there's a better case for claiming that Skip should always hide sequence identity.


If you remember, I have a disagreement around what Contains should do when you don't provide an equality comparer, and when the sequence implements ICollection<T>. I believe it should be consistent with the rest of LINQ to Objects, which always uses the default equality comparer for the element type when it needs one but the user hasn't specified one. Everyone else (Microsoft, Mono, LinqBridge) has gone with delegating to the collection's implementation of ICollection<T>.Contains. That plays well in terms of consistency of what happens if you call Contains on that object, so that it doesn't matter what the compile-time type is. That's a debate to go into in another post, but I just want to point out that this is not an example of optimization. In some cases it may be faster (notably for HashSet<T>) but it stands a very good chance of changing the behaviour. There is absolutely nothing to suggest that the equality comparer used by ICollection<T>.Contains should be the default one for the type - and in some cases it definitely isn't.


It's therefore a matter of what result we want to get, not how to get that result faster. It's correctness, not optimization - but both the LinqBridge and Mono tests which fail for Edulinq are called "Contains_CollectionOptimization_ReturnsTrueWithoutEnumerating" - and I think that shows a mistaken way of thinking about this.


I've been considering a couple of optimizations which I believe to be perfectly legitimate, but which none of the implementations I've seen have used. One reason I haven't implemented them myself yet is that they would reduce the effectiveness of all my unit tests. You see, I've generally used Enumerable.Range as a good way of testing a non-list-based sequence... but what's to stop Range and Repeat returning IList<T> implementations?


All the non-mutating members are easy to implement, and we can just throw exceptions from the mutating members (as other read-only collections do).
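For Range, such a read-only list could look like this - a sketch (the RangeList name is mine, and real code would also need to validate the start/count arithmetic for overflow):

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// A read-only IList<int> representing Range(start, count): every
// non-mutating member is O(1) or a simple loop, and the mutating
// members just throw.
public sealed class RangeList : IList<int>
{
    private readonly int start;
    private readonly int count;

    public RangeList(int start, int count)
    {
        if (count < 0) throw new ArgumentOutOfRangeException("count");
        this.start = start;
        this.count = count;
    }

    public int Count { get { return count; } }
    public bool IsReadOnly { get { return true; } }

    public int this[int index]
    {
        get
        {
            if (index < 0 || index >= count) throw new ArgumentOutOfRangeException("index");
            return start + index;
        }
        set { throw new NotSupportedException(); }
    }

    public bool Contains(int item) { return item >= start && item < start + count; }
    public int IndexOf(int item) { return Contains(item) ? item - start : -1; }

    public void CopyTo(int[] array, int arrayIndex)
    {
        for (int i = 0; i < count; i++)
        {
            array[arrayIndex + i] = start + i;
        }
    }

    public IEnumerator<int> GetEnumerator()
    {
        for (int i = 0; i < count; i++)
        {
            yield return start + i;
        }
    }
    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }

    // Read-only: all mutating members throw, like other read-only collections.
    public void Add(int item) { throw new NotSupportedException(); }
    public void Clear() { throw new NotSupportedException(); }
    public void Insert(int index, int item) { throw new NotSupportedException(); }
    public bool Remove(int item) { throw new NotSupportedException(); }
    public void RemoveAt(int index) { throw new NotSupportedException(); }
}
```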


Would this be more efficient? Well yes, if you ever performed a Count(), ElementAt(), ToArray(), ToList() etc operation on a range or a repeated element... but how often is that going to happen? I suspect it's pretty rare - probably rare enough not to make it worth my time, particularly when you then consider all the tests that would have to be rewritten to use something other than Range when I wanted a non-list sequence...


Surprise, surprise - doing optimization well is difficult. When it's obvious what can be done, it's not obvious what should be done... and sometimes it's not even obvious what's valid in the first place.


Note that none of this has really talked about data structures and algorithms. I looked at some options when implementing ordering, and I'm still thinking about the best approach for implementing TopBy (probably either a heap or a self-balancing tree - something which could take advantage of the size being constant would be nice) - but in general the optimizations here haven't required any cunning knowledge of computer science. That's quite a good thing, because it's many years since I've studied CS seriously...


I suspect that with this post more than almost any other, I'm likely to want to add extra items in the future (or amend mistakes which reveal my incompetence). Watch this space.


Next up, I think it would be worth revisiting query expressions from scratch. Anyone who's read C# in Depth or has followed this blog for long enough is likely to be able to skip it, but I think the series would be incomplete without a quick dive into the compiler translations involved.


 
