Pages

Monday, 21 March 2011

Reimplementing LINQ to Objects: Min/Max Asp.Net C#

The second and third AOOOD operators today... if I'm brave enough to tackle Average tomorrow, I'll have done them all. More surprises here today, this time in terms of documentation...


Min and Max are both extension methods with 22 overloads each. Min looks like this:

public static int Min(this IEnumerable source)

public static int Min(
this IEnumerable source,
Func selector)


public static int? Min(this IEnumerable source)


public static int? Min(
this IEnumerable source,
Func selector)


public static TSource Min(this IEnumerable source)


public static TResult Min(
this IEnumerable source,
Func selector
)


(Max is exactly the same as Min; just replace the name.)


The more obvious aspects of the behaviour are as follows:

source and selector mustn't be null All overloads use immediate execution The minimum or maximum value within the sequence is returned If a selector is present, it is applied to each value within source, and the maximum of the projected values is returned. (Note how the return type of these methods is TResult, not TSource.)

Some less obvious aspects - in all cases referring to the result type (as the source type is somewhat incidental when a selector is present; it doesn't affect the behaviour):

The type's IComparable implementation is used when available, otherwise IComparable is used. An ArgumentException is thrown if values can't be compared. Fortunately, this is exactly the behaviour of Comparer.Default. For any nullable type (whether it's a reference type or a nullable value type), nulls within the sequence are ignored, and an empty sequence (or one which contains only null values) will cause a null value to be returned. If there are any non-null values in the sequence, the return value will be non-null. (Note that this is different from Sum, which will return the non-null zero value for empty sequences over nullable types.)For any non-nullable value type, an empty sequence will cause InvalidOperationException to be thrown.

The first point is particularly interesting when you consider the double and float types, and their "NaN" (not-a-number) values. For example, Math.Max regards NaN as greater than positive infinity, but Enumerable.Max regards positive infinity as being the greater of the two. Math.Min and Enumerable.Min agree, however, that NaN is less than negative infinity. (It would actually make sense to me for NaN to be treated as the numeric equivalent of null here, but that would be strange in other ways...) Basically, NaN behaves oddly in all kinds of ways. I believe that IEEE-754-2008 actually specifies behaviour with NaNs which encourages the results we're getting here, but I haven't verified that yet. (I can't find a free version of the standard online, which is troubling in itself. Ah well.)


The behaviour of the nullable and non-nullable types is well documented for the type-specific overloads using int, Nullable etc. However, the generic overloads (the ones using TSource) are poorly documented:

InvalidOperationException isn't in the list of possibly-thrown arguments for any of the overloads The methods using selectors from TSource to TResult don't mention the possibility of nullity at all The methods without selectors describe the behaviour of null values for reference types, but don't mention the possibility of empty sequences for non-nullable value types, or consider nullable value types at all.

(I should point out that ArgumentException isn't actually mentioned either for the case where values are incomparable, but that feels like a slightly less important offence for some reason. Possibly just because it didn't trip me up.)


If I remember, I'll open a Connect issue against this hole in the documentation when I find time. Unlike the optimizations and set ordering (where it's reasonably forgivable to deliberately omit implementation details from the contract) you simply can't predict the behaviour in a useful way from the documentation here. And yes, I'm going on about this because it bit me. I had to resort to writing tests and running them against LINQ to Objects to see if they were correct or not. (They were incorrect in various places.)


If you look at the behaviour of the non-generic methods, the generic ones are entirely consistent of course.


There are a couple of things which you might consider "missing" in terms of Max and Min:

The ability to find out the minimum/maximum value of a sequence by a projection. For example, consider a sequence of people. We may wish to find the youngest person in the sequence, in which case we'd like to be able to write something like: var oldest = people.MaxBy(person => person.Age); We can find the maximum age itself easily enough - but then we'd need a second pass to find the first person with that age. I've addressed this in MoreLINQ with the MaxBy and MinBy operators. The System.Interactive assembly in Reactive Extensions has the same methods too. The ability to specify a custom IComparer implementation, as we can in most of the operators using IEqualityComparer. For example, we can't find the "maximum" string in a sequence, using a case-insensitive ordinal comparison.

Still, at least that means there's less to test...


I decided I really couldn't find the energy to replicate all the tests for every type involved here. Instead, I have a bunch of tests for int and Nullable, a few tests exploring the oddness of doubles, and a bunch of tests around the generic methods. In particular, I know that I've implemented decimal, float etc by calling the same methods that the int overloads use.


The tests cover:

Argument validation Empty sequences Sequences of null values where applicable Projections of the above Generic tests for nullable and non-nullable value types, and reference types (with empty sequences etc) Incomparable values

Okay, let's start off with the simplest detail: the order of implementation:

Max(int) Max(generic) Cut and paste Max implementations for other numeric types (replace the type name, basically) Cut and paste the entirety of Max to Min: Replace "Max" with "Min" everywhere Replace " < " with " > " everywhere (only 4 occurrences; basically the results of calling Compare or ComparerTo and comparing with 0)

Just as with Sum, I could have used templating - but I don't think it would actually have saved me significant time.


This time, I thought I'd use Select internally for the overloads with selectors (unlike my approach for Sum which used identity projections). There's no particular reason for this - I just thought it would be interesting to try both approaches. Overall, I think I prefer this one, but I haven't done any benchmarking to find out the relative performance penalties.


Each set of numeric overloads calls into a single pair of generic "implementation" methods. These aren't the public general-purpose ones: they require that the types in use implement IComparable, and I've added a "struct" constraint just for kicks. This is just one approach. Other options:

I could have implemented the code separately for each numeric type. That may well be faster than calling IComparable.Compare (at least for most types) as the IL would have contained the appropriate operator directly. However, it would have meant more code and explicitly dealing with the headache of NaNs for double/float. If I ever write benchmarks, I'll investigate the difference that this can make. I could have used the public generic overloads, which eventually call into Comparer.Default. Again, the penalty for this (if any) is unknown to me at this point. Can the JIT inline deeply enough to make this as fast as a "native" implementation? I wouldn't like to guess without tests.

I've separated out the nullable implementations from the non-nullable ones, as the behaviour differs significantly between the two.


Here's the public code for int:

public static int Max(this IEnumerable source)
{
return PrimitiveMax(source);
}

public static int Max(
this IEnumerable source,
Func selector)
{
return PrimitiveMax(source.Select(selector));
}


public static int? Max(this IEnumerable source)
{
return NullablePrimitiveMax(source);
}


public static int? Max(
this IEnumerable source,
Func selector)
{
return NullablePrimitiveMax(source.Select(selector));
}


All the methods consider argument validation to be somebody else's problem - either Select or the generic method we're calling to find the maximum value. Part of me thinks this is lazy; part of me likes it in terms of not repeating code. All of me would prefer the ability to specify non-nullable parameters declaratively...


Here are the "primitive" methods called into above:


private static T PrimitiveMax(IEnumerable source) where T : struct, IComparable
{
if (source == null)
{
throw new ArgumentNullException("source");
}
using (IEnumerator iterator = source.GetEnumerator())
{
if (!iterator.MoveNext())
{
throw new InvalidOperationException("Sequence was empty");
}
T max = iterator.Current;
while (iterator.MoveNext())
{
T item = iterator.Current;
if (max.CompareTo(item) < 0)
{
max = item;
}
}
return max;
}
}

private static T? NullablePrimitiveMax(IEnumerable source) where T : struct, IComparable
{
if (source == null)
{
throw new ArgumentNullException("source");
}
T? max = null;
foreach (T? item in source)
{
if (item != null &&
(max == null || max.Value.CompareTo(item.Value) < 0))
{
max = item;
}
}
return max;
}


The first method is interesting in terms of its approach to throwing an exception if the first element isn't present, and using that as an initial candidate otherwise.


The second method needs to consider nullity twice on each iteration:

Is the item from the sequence null? If so, we can ignore it. Is our "current maximum" null? If so, we can replace it with the item from the sequence without performing a comparison.

Now there's one case which is ambiguous here: when both values are null. At that point we can choose to replace our "current maximum" with the item, or not... it doesn't matter as the values are the same anyway. It is important that we don't try to perform a comparison unless both values are non-null though... the short-circuiting && and || operators keep us safe here.


Having implemented the code above, all the interesting work lies in the generic forms. Here we don't have different public methods to determine which kind of behaviour we'll use: but I wrote two private methods instead, and just delegated to the right one from the public one. This seemed cleaner than putting the code all in one method:

public static TSource Max(
this IEnumerable source)
{
if (source == null)
{
throw new ArgumentNullException("source");
}
return default(TSource) == null ? NullableGenericMax(source) : NonNullableGenericMax(source);
}

public static TResult Max(
this IEnumerable source,
Func selector)
{
return Max(source.Select(selector));
}



private static T NonNullableGenericMax(IEnumerable source)
{
Comparer comparer = Comparer.Default;


    using (IEnumerator iterator = source.GetEnumerator())
{
if (!iterator.MoveNext())
{
throw new InvalidOperationException("Sequence was empty");
}
T max = iterator.Current;
while (iterator.MoveNext())
{
T item = iterator.Current;
if (comparer.Compare(max, item) < 0)
{
max = item;
}
}
return max;
}
}



private static T NullableGenericMax(IEnumerable source)
{
Comparer comparer = Comparer.Default;


    T max = default(T);
foreach (T item in source)
{
if (item != null &&
(max == null || comparer.Compare(max, item) < 0))
{
max = item;
}
}
return max;
}


As you can tell, there's a significant similarity between the "PrimitiveMax" and "NonNullableGenericMax" methods, and likewise between "NullablePrimitiveMax" and "NullableGenericMax". This should come as no surprise. Fundamentally the difference is just between using an IComparable implementation, and using Comparer.Default. (The argument validation occurs in a different place too, as we'll be going through a public entry point for the non-primitive code.)


Once I'd discovered the correct behaviour, this was reasonably simple. Of course, the above code wasn't my first implementation, where I'd completely forgotten about null values, and hadn't thought about how the nullability of the source type might affect the behaviour of empty sequences...


If you're ever in a company which rewards you for checking in lots of lines of code, offer to implement Sum/Min/Max. This weekend I've checked in about 2,500 lines of code in (split between production and test) and none of it's been terribly hard. Of course, if you're ever in such a company you should also consider looking for another job. (Have I mentioned that Google's hiring? Email me if you're interested. I'm serious.)


As you can tell, I was slightly irritated by the lack of clarity around the documentation in some places - but I find it interesting that even a simple-sounding function like "find the maximum value from a sequence" should need the kind of documentation that's missing here. I'm not saying it's a failure of the design - more just musing how a complete specification is almost always going to be longer than you might think at first glance. And if you think I was diligent here, think again: I didn't bother specifying which maximum or minimum value would be returned if there were two. For example, if a sequence consists of references to two equal but distinct strings, which reference should be returned? I have neither stated what my implementation (or the LINQ to Objects implementation) will do, nor tested for it.


Next up is Average - a single method with a mere 20 overloads. There are various corner cases to consider... but that's a post for another day.


 

0 comments:

Post a Comment

 
Powered by Blogger