Monday, 21 March 2011

Reimplementing LINQ to Objects: ElementAt / ElementAtOrDefault

A nice easy pair of operators tonight. I should possibly have covered them at the same time as First/Last/Single and the OrDefault variants, but never mind...


ElementAt and ElementAtOrDefault have a single overload each:

public static TSource ElementAt<TSource>(
    this IEnumerable<TSource> source,
    int index)

public static TSource ElementAtOrDefault<TSource>(
    this IEnumerable<TSource> source,
    int index)


Isn't that blissfully simple after the overload storm of the past few days?


The two operators work in very similar ways:

They use immediate execution.
The source parameter must not be null, and this is validated immediately.
They return the element at the specified zero-based index, if it's in the range 0 <= index < count.

The methods only differ in their handling of an index which falls outside the given bound. ElementAt will throw an ArgumentOutOfRangeException; ElementAtOrDefault will return the default value for TSource (e.g. 0, null, false). This is true even if index is negative. You might have expected some way to specify the default value to return if the index is out of bounds, but there isn't one. (This is consistent with FirstOrDefault() and so on, but not with Nullable<T>.GetValueOrDefault().)
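To make the difference concrete, here's how the real LINQ to Objects operators behave - a quick illustration using System.Linq, not the Edulinq implementation:

```csharp
using System;
using System.Linq;

class ElementAtDemo
{
    static void Main()
    {
        int[] numbers = { 10, 20, 30 };

        Console.WriteLine(numbers.ElementAt(1));           // 20
        Console.WriteLine(numbers.ElementAtOrDefault(5));  // 0 - default(int) for an index past the end
        Console.WriteLine(numbers.ElementAtOrDefault(-1)); // 0 - even for a negative index

        try
        {
            numbers.ElementAt(5);
        }
        catch (ArgumentOutOfRangeException)
        {
            Console.WriteLine("out of range"); // ElementAt throws instead of returning a default
        }
    }
}
```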


This behaviour leaves us some room for common code - for once I haven't used cut and paste for the implementation. Anyway, I'm getting ahead of myself.


As you can imagine, my tests for the two operators are identical except for the expected result in the case of the index being out of range. I've tested the following cases:

Null source
A negative index
An index which is too big on a NonEnumerableCollection
An index which is too big on a NonEnumerableList
An index which is too big on a lazy sequence (using Enumerable.Range)
A valid index in a NonEnumerableList
A valid index in a lazy sequence

The "non-enumerable" list and collection are there to test that the optimizations we're going to perform are working. In fact, the NonEnumerableCollection test fails on LINQ to Objects - it's only optimized for IList<T>. You'll see what I mean in a minute... and why that might not be a bad thing.
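The Edulinq test helpers aren't reproduced in this post, but the idea is easy to sketch: a collection which reports its count happily but blows up if anything tries to iterate over it. This is my own minimal version of the concept - the name and details are an assumption, not the real test code:

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Hypothetical stand-in for the test helper: implements ICollection<T>
// but throws on enumeration, so a test fails loudly whenever an operator
// falls back to iterating instead of using the Count property.
class NonEnumerableCollection<T> : ICollection<T>
{
    private readonly List<T> items;

    public NonEnumerableCollection(IEnumerable<T> items)
    {
        this.items = new List<T>(items);
    }

    public int Count { get { return items.Count; } }
    public bool IsReadOnly { get { return true; } }
    public bool Contains(T item) { return items.Contains(item); }
    public void CopyTo(T[] array, int arrayIndex) { items.CopyTo(array, arrayIndex); }

    // Mutation isn't needed for these tests.
    public void Add(T item) { throw new NotSupportedException(); }
    public void Clear() { throw new NotSupportedException(); }
    public bool Remove(T item) { throw new NotSupportedException(); }

    public IEnumerator<T> GetEnumerator()
    {
        throw new InvalidOperationException("Should not be enumerated");
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}
```

An out-of-range ElementAt call on this collection should fail before any enumeration happens; if GetEnumerator is ever called, the InvalidOperationException makes the unoptimized path obvious.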


None of the tests are very interesting, to be honest.


As I mentioned earlier, I've used some common code for once (although I admit the first implementation used cut and paste). As the only difference between the two methods is the handling of a particular kind of failure, I've used the TryXXX pattern which exists elsewhere in the framework. There's a common method which tries to retrieve the right element as an out parameter, and indicates whether or not it succeeded via the return value. Not every kind of failure is just returned, of course - we want to throw an ArgumentNullException if source is null in either case.
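The TryXXX pattern will be familiar from the framework itself - int.TryParse is the canonical example. Failure is reported via the boolean return value rather than an exception, and the result comes back through an out parameter:

```csharp
using System;

class TryPatternDemo
{
    static void Main()
    {
        int value;
        Console.WriteLine(int.TryParse("123", out value)); // True
        Console.WriteLine(value);                          // 123

        Console.WriteLine(int.TryParse("xyz", out value)); // False
        Console.WriteLine(value);                          // 0 - the out parameter is set to the default
    }
}
```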


That leaves our public methods looking quite straightforward:

public static TSource ElementAt<TSource>(
    this IEnumerable<TSource> source,
    int index)
{
    TSource ret;
    if (!TryElementAt(source, index, out ret))
    {
        throw new ArgumentOutOfRangeException("index");
    }
    return ret;
}

public static TSource ElementAtOrDefault<TSource>(
    this IEnumerable<TSource> source,
    int index)
{
    TSource ret;
    TryElementAt(source, index, out ret);
    return ret;
}


TryElementAt will only return false if the index is out of bounds, so the exception is always appropriate. However, there is a disadvantage to this approach: we can't easily indicate in the exception message whether index was too large or negative. We could have specified a message which included the value of index itself, of course. I think it's a minor matter either way, to be honest.


The main body of the code is in TryElementAt, obviously. It would actually be very simple - just looping and counting up to index, checking as we went - except there are two potential optimizations.


The most obvious - and most profitable - optimization is if the collection implements IList<T>. If it does, we can efficiently obtain the count using the ICollection<T>.Count property (don't forget that IList<T> extends ICollection<T>), check that it's not too big, and then use the indexer from IList<T> to get straight to the right element. Brilliant! That's a clear win.


The less clear optimization is if the collection implements ICollection<T> but not IList<T>, or if it only implements the nongeneric ICollection. In those cases we can still get at the count - but we can't then get directly to the right element. In other words, we can optimize the failure case (possibly hugely), but at a very slight cost - the cost of checking whether the sequence implements either interface - for the success case, where the check won't do us any good.


This is the sort of optimization which is impossible to judge without real data. How often are these operators called with an invalid index? How often does that happen on a collection which implements ICollection<T> but not IList<T> (or only implements the nongeneric ICollection)? How large are those collections (so how long would it take to have found our error the normal way)? What's the cost of performing the type check? I don't have the answers to any of these questions. I don't even have strong suspicions. I know that Microsoft doesn't use the same optimization, but I don't know whether that was due to hard data or a gut feeling.


For the moment, I've kept all the optimizations. Here's the code:

private static bool TryElementAt<TSource>(
    IEnumerable<TSource> source,
    int index,
    out TSource element)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    element = default(TSource);
    if (index < 0)
    {
        return false;
    }
    ICollection<TSource> collection = source as ICollection<TSource>;
    if (collection != null)
    {
        int count = collection.Count;
        if (index >= count)
        {
            return false;
        }
        IList<TSource> list = source as IList<TSource>;
        if (list != null)
        {
            element = list[index];
            return true;
        }
    }

    ICollection nonGenericCollection = source as ICollection;
    if (nonGenericCollection != null)
    {
        int count = nonGenericCollection.Count;
        if (index >= count)
        {
            return false;
        }
    }
    using (IEnumerator<TSource> iterator = source.GetEnumerator())
    {
        for (int i = -1; i < index; i++)
        {
            if (!iterator.MoveNext())
            {
                return false;
            }
        }
        element = iterator.Current;
        return true;
    }
}


As you can see, the optimized cases actually form the bulk of the code - part of me thinks it would be worth removing the non-IList<T> optimizations just for clarity and brevity.


It's worth looking at the "slow" case where we actually iterate. The for loop looks odd until you remember that to get at element 0, you have to call MoveNext() once. We don't want to just add one to index or use a less-than-or-equal condition: both would fail when index is int.MaxValue. Incrementing index would overflow - either throwing an exception or wrapping round to a negative value - so we'd never loop at all; and with a less-than-or-equal condition we'd loop forever, as every int is less than or equal to int.MaxValue.
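The iteration count is easy to sanity-check in isolation. This little demo mirrors the loop from TryElementAt, counting how many times the body (the stand-in for MoveNext) runs for a given index:

```csharp
using System;

class LoopDemo
{
    // Mirrors the for loop in TryElementAt: i runs from -1 up to index - 1,
    // so the body executes exactly index + 1 times.
    static int MoveNextCalls(int index)
    {
        int calls = 0;
        for (int i = -1; i < index; i++)
        {
            calls++; // stands in for iterator.MoveNext()
        }
        return calls;
    }

    static void Main()
    {
        Console.WriteLine(MoveNextCalls(0)); // 1 - one MoveNext call reaches element 0
        Console.WriteLine(MoveNextCalls(3)); // 4 - index + 1 calls in general
    }
}
```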


Another way to look at it is that the loop counter ("i") is the "current index" within the iterator: the iterator starts before the first element, so it's reasonable to start at -1.


The reason I'm drawing attention to this is that I got all of this wrong first time... and was very grateful for unit tests to catch me out.


For me, the most interesting part of ElementAt is the decision about optimization. I'm sure I'm not the only one who optimizes without data at times - but it's a dangerous thing to do. The problem is that this isn't the normal micro-optimization quandary of "it's always a tiny bit better, but it's probably insignificant and makes the code harder to read". For the cases where this is faster, it could make an enormous difference - asking for element one million of a linked list which doesn't quite have enough elements could be very painful. But do failure cases need to be fast? How common are they? As you can tell, I'm dithering. I think it's at least worth thinking about what optimizations might make a difference - even if we later remove them.


Next time, I think I'll tackle Contains - an operator which you might expect to be really fast on a HashSet, but which has some interesting problems of its own...

