Reimplementing LINQ to Objects: What's missing?

I mentioned before that the Zip operator was only introduced in .NET 4, so clearly there's a little wiggle room for LINQ to Objects' query operators to grow in number. This post mentions some of the ones I think are most sorely lacking - either because I've wanted them myself, or because I've seen folks on Stack Overflow want them for entirely reasonable use cases.

There is an issue with respect to other LINQ providers, of course: as soon as some useful operators are available for LINQ to Objects, there will be people who want to apply them to LINQ to SQL, the Entity Framework and the like. Worse, if they're not included in Queryable with overloads based on expression trees, the LINQ to Objects implementation will silently get picked - leading to what looks like a lovely query performing like treacle while the client slurps over the entire database. If they are included in Queryable, then third-party LINQ providers could end up with a nasty versioning problem. In other words, some care is needed and I'm glad I'm not the one who has to decide how new features are introduced.

I've deliberately not looked at the extra set of operators introduced in the System.Interactive part of Reactive Extensions... nor have I looked back over what we've implemented in MoreLINQ (an open source project I started specifically to create new operators). I figured it would be worth thinking about this afresh - but look at both of those projects for actual implementations instead of just ideas.

Currently there's no implementation of any of this in Edulinq - but I could potentially create an "Edulinq.Extras" assembly which made it all available. Let me know if any of these sounds particularly interesting to see in terms of implementation.

I love OrderBy and ThenBy, with their descending cousins. They're so much cleaner than building a custom comparer which just performs a comparison between two properties. So why stop with ordering? There's a whole bunch of operators which could do with some "FooBy" love. For example, imagine we have a list of files, and we want to find the longest one. We don't want to perform a total ordering by size descending, nor do we want to find the maximum file size itself: we want the file with the maximum size. I'd like to be able to write that query as:

FileInfo biggestFile = files.MaxBy(file => file.Length);

Note that we can get a similar result by performing one pass to find the maximum length, and then another pass to find the file with that length. However, that's inefficient and assumes we can read the sequence twice (and get the same results both times). There's no need for that. We could get the same result using Aggregate with a pretty complicated aggregation, but I think this is a sufficiently common case to deserve its own operator.

We'd want to specify which value would be returned if multiple files had the same length (my suggestion would be the first one we encountered with that length) and we could also specify a key comparer to use. The signatures would look like this:

public static TSource MaxBy<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector)

public static TSource MaxBy<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IComparer<TKey> comparer)
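To make the single-pass idea concrete, here is a minimal sketch of how those two signatures might be implemented. It is not taken from Edulinq or MoreLINQ; the class name MaxByExtensions is hypothetical, and throwing InvalidOperationException for an empty sequence is an assumption made to mirror what Max does for non-nullable value types. Ties go to the first element encountered, as suggested above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical container class; the name is illustrative only.
public static class MaxByExtensions
{
    public static TSource MaxBy<TSource, TKey>(
        this IEnumerable<TSource> source,
        Func<TSource, TKey> keySelector)
    {
        return MaxBy(source, keySelector, Comparer<TKey>.Default);
    }

    public static TSource MaxBy<TSource, TKey>(
        this IEnumerable<TSource> source,
        Func<TSource, TKey> keySelector,
        IComparer<TKey> comparer)
    {
        if (source == null) throw new ArgumentNullException("source");
        if (keySelector == null) throw new ArgumentNullException("keySelector");
        comparer = comparer ?? Comparer<TKey>.Default;

        using (IEnumerator<TSource> iterator = source.GetEnumerator())
        {
            // Assumption: an empty sequence is an error rather than default(TSource).
            if (!iterator.MoveNext())
            {
                throw new InvalidOperationException("Sequence was empty");
            }
            TSource best = iterator.Current;
            TKey bestKey = keySelector(best);
            while (iterator.MoveNext())
            {
                TSource candidate = iterator.Current;
                TKey candidateKey = keySelector(candidate);
                // Strictly greater, so the first element with the maximum key wins ties.
                if (comparer.Compare(candidateKey, bestKey) > 0)
                {
                    best = candidate;
                    bestKey = candidateKey;
                }
            }
            return best;
        }
    }
}
```

The source is read exactly once and each key is computed once per element. On newer .NET versions, which ship a built-in MaxBy, a same-named extension like this one may need to be called statically (MaxByExtensions.MaxBy(files, file => file.Length)) to avoid an ambiguous call.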

Now it's not just Max and Min that gain from this "By" idea. It would be useful to apply the same idea to the set operators. The simplest of these to think about would be DistinctBy, but UnionBy, IntersectBy and ExceptBy would be reasonable too. In the case of ExceptBy and IntersectBy we could potentially take the key collection to indicate the keys of the elements we wanted to exclude/include, but it would probably be more consistent to force the two input sequences to be of the same type (as they would have to be for UnionBy and IntersectBy of course). ContainsBy might be useful, but that would effectively be a Select followed by a normal Contains - possibly not useful enough to merit its own operator.
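As an illustration of the simplest of these, DistinctBy could be sketched roughly as follows. The class name is hypothetical, and the choices here (keep the first element seen for each key, use the default equality comparer for keys, stream results lazily) are assumptions rather than anything specified above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical container class; the name is illustrative only.
public static class DistinctByExtensions
{
    public static IEnumerable<TSource> DistinctBy<TSource, TKey>(
        this IEnumerable<TSource> source,
        Func<TSource, TKey> keySelector)
    {
        // Validate eagerly; the iterator below only runs when enumerated.
        if (source == null) throw new ArgumentNullException("source");
        if (keySelector == null) throw new ArgumentNullException("keySelector");
        return DistinctByImpl(source, keySelector);
    }

    private static IEnumerable<TSource> DistinctByImpl<TSource, TKey>(
        IEnumerable<TSource> source,
        Func<TSource, TKey> keySelector)
    {
        HashSet<TKey> seenKeys = new HashSet<TKey>();
        foreach (TSource item in source)
        {
            // HashSet<T>.Add returns false if the key was already present,
            // so only the first element with each key is yielded.
            if (seenKeys.Add(keySelector(item)))
            {
                yield return item;
            }
        }
    }
}
```

An overload taking an IEqualityComparer<TKey> would simply pass the comparer to the HashSet<TKey> constructor.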

TopBy and TopByDescending may sound like they belong with the FooBy operators above, but they're somewhat different: they're effectively specializations of OrderBy and OrderByDescending where you already know how many elements you want to preserve. The return type would be IOrderedEnumerable<T> so you could still use ThenBy/ThenByDescending as normal. That would make the following two queries equivalent - but the second might be a lot more efficient than the first:

var takeQuery = people.OrderBy(p => p.LastName)
                      .ThenBy(p => p.FirstName)
                      .Take(3);

var topQuery = people.TopBy(p => p.LastName, 3)
                     .ThenBy(p => p.FirstName);

An implementation could easily delegate to various different strategies depending on the number given - for example, if you asked for more than 10 values, it may not be worth doing anything more than performing a simple sort and restricting the output. If you asked for just the top 3 values, that could return an IOrderedEnumerable<T> implementation specifically hard-coded to 3 values, etc.

Aside from anything else, if you were confident in what the implementation did (and that's a very big "if") you could use a potentially huge input sequence with such a query - larger than you could fit into memory in one go. That's fine if you're only keeping the top three values you've seen so far, but would fail for a complete ordering, even one which was able to yield results before performing all the ordering: if it doesn't know you're going to stop after three elements, it can't throw anything away.
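The memory argument is easier to see in code. The following is a deliberately simplified sketch: it returns a plain IEnumerable<TSource> rather than the IOrderedEnumerable<T> proposed above (so ThenBy would not work on it), it only handles the ascending case with the default comparer, and the class name is hypothetical. What it does show is that only "count" elements are ever buffered, however long the input is.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical container class; the name is illustrative only.
public static class TopByExtensions
{
    public static IEnumerable<TSource> TopBy<TSource, TKey>(
        this IEnumerable<TSource> source,
        Func<TSource, TKey> keySelector,
        int count)
    {
        if (source == null) throw new ArgumentNullException("source");
        if (keySelector == null) throw new ArgumentNullException("keySelector");
        if (count < 0) throw new ArgumentOutOfRangeException("count");
        return TopByImpl(source, keySelector, count, Comparer<TKey>.Default);
    }

    private static IEnumerable<TSource> TopByImpl<TSource, TKey>(
        IEnumerable<TSource> source,
        Func<TSource, TKey> keySelector,
        int count,
        IComparer<TKey> comparer)
    {
        // At most "count" entries, kept in ascending key order.
        var buffer = new List<KeyValuePair<TKey, TSource>>();
        foreach (TSource item in source)
        {
            TKey key = keySelector(item);
            // Walk back past entries with a strictly larger key, so elements with
            // equal keys keep their original relative order (stable, like OrderBy).
            int position = buffer.Count;
            while (position > 0 && comparer.Compare(buffer[position - 1].Key, key) > 0)
            {
                position--;
            }
            if (position < count)
            {
                buffer.Insert(position, new KeyValuePair<TKey, TSource>(key, item));
                if (buffer.Count > count)
                {
                    // Discard the element which has just been pushed out of the top "count".
                    buffer.RemoveAt(buffer.Count - 1);
                }
            }
        }
        foreach (KeyValuePair<TKey, TSource> entry in buffer)
        {
            yield return entry.Value;
        }
    }
}
```

The linear insertion is fine for small counts such as 3; for larger counts a heap, or just a full sort, would probably be the better strategy, which is exactly the kind of switch described above.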

Perhaps this is too specialized an operator - but it's an interesting one to think about. It's worth noting that this probably only makes sense for LINQ to Objects, which never gets to see the whole query in one go. Providers like LINQ to SQL can optimize queries of the form OrderBy(...).ThenBy(...).Take(...) because by the time they need to translate the query into SQL, they will have an expression tree representation which includes the "Take" part.

One of the implementation details of Edulinq is its TryFastCount method, which basically encapsulates the logic around attempting to find the count of a sequence if it implements ICollection<T> or ICollection. Various built-in LINQ operators find this useful, and anyone writing their own operators has a reasonable chance of bumping into it as well. It seems pointless to duplicate the code all over the place... why not expose it? The signatures might look something like this:

public static bool TryFastCount<TSource>(
    this IEnumerable<TSource> source,
    out int count)

public static bool TryFastElementAt<TSource>(
    this IEnumerable<TSource> source,
    int index,
    out TSource value)

I would expect TryFastElementAt to use the indexer if the sequence implemented IList<T> without performing any validation: that ought to be the responsibility of the caller. TryFastCount could use a Nullable<int> return type instead of the return value / out parameter split, but I've kept it consistent with the methods which exist elsewhere in the framework.
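A sketch of what those two methods might look like follows. This is not the Edulinq source; the class name is hypothetical, and the null check on source is an assumption. As described above, TryFastElementAt deliberately leaves index validation to the list's own indexer.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Hypothetical container class; the name is illustrative only.
public static class FastPathExtensions
{
    public static bool TryFastCount<TSource>(
        this IEnumerable<TSource> source,
        out int count)
    {
        if (source == null) throw new ArgumentNullException("source");

        // Try the generic interface first, then fall back to the non-generic one.
        ICollection<TSource> genericCollection = source as ICollection<TSource>;
        if (genericCollection != null)
        {
            count = genericCollection.Count;
            return true;
        }
        ICollection nonGenericCollection = source as ICollection;
        if (nonGenericCollection != null)
        {
            count = nonGenericCollection.Count;
            return true;
        }
        // No cheap way of counting: the caller has to iterate.
        count = 0;
        return false;
    }

    public static bool TryFastElementAt<TSource>(
        this IEnumerable<TSource> source,
        int index,
        out TSource value)
    {
        if (source == null) throw new ArgumentNullException("source");

        IList<TSource> list = source as IList<TSource>;
        if (list != null)
        {
            // No range check here: an invalid index surfaces as whatever
            // exception the indexer itself throws.
            value = list[index];
            return true;
        }
        value = default(TSource);
        return false;
    }
}
```

A caller such as a Count or ElementAt implementation would try the fast path first and only fall back to iterating the sequence when the method returns false.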

Scan and SelectAdjacent are related operators in that they deal with wanting a more global view than just the current element. Scan would act similarly to Aggregate - except that it would yield the accumulator value after each element. Here's its signature, followed by an example of keeping a running total:

public static IEnumerable<TAccumulate> Scan<TSource, TAccumulate>(
    this IEnumerable<TSource> source,
    TAccumulate seed,
    Func<TAccumulate, TSource, TAccumulate> func)

int[] source = new int[] { 3, 5, 2, 1, 4 };
var query = source.Scan(0, (current, item) => current + item);
query.AssertSequenceEqual(3, 8, 10, 11, 15);
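One possible implementation which matches the running-total example is sketched below. The class name is hypothetical, and the split into an eagerly-validating public method and a private iterator block is my assumption about how it would be structured. Note that the seed itself is not yielded, only the accumulator after each element.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical container class; the name is illustrative only.
public static class ScanExtensions
{
    public static IEnumerable<TAccumulate> Scan<TSource, TAccumulate>(
        this IEnumerable<TSource> source,
        TAccumulate seed,
        Func<TAccumulate, TSource, TAccumulate> func)
    {
        // Validate eagerly; the iterator below only runs when enumerated.
        if (source == null) throw new ArgumentNullException("source");
        if (func == null) throw new ArgumentNullException("func");
        return ScanImpl(source, seed, func);
    }

    private static IEnumerable<TAccumulate> ScanImpl<TSource, TAccumulate>(
        IEnumerable<TSource> source,
        TAccumulate seed,
        Func<TAccumulate, TSource, TAccumulate> func)
    {
        TAccumulate current = seed;
        foreach (TSource item in source)
        {
            // Same step as Aggregate, but the intermediate value is exposed.
            current = func(current, item);
            yield return current;
        }
    }
}
```

Because each result is yielded as soon as it is computed, this streams: it works on arbitrarily long sequences and never buffers more than the current accumulator.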

There could be a more complicated overload with an extra conversion from TAccumulate to an extra TResult type parameter. That would let us write a Fibonacci sequence query in one line, if we really wanted to...

The SelectAdjacent operator would simply present a selector function with pairs of adjacent items. Here's a similar example, this time calculating the difference between each pair:

public static IEnumerable<TResult> SelectAdjacent<TSource, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, TSource, TResult> selector)

int[] source = new int[] { 3, 5, 2, 1, 4 };
var query = source.SelectAdjacent((current, next) => next - current);
query.AssertSequenceEqual(2, -3, -1, 3);

One oddity here is that the result sequence always contains one item fewer than the source sequence. If we wanted to keep the length the same, there are various approaches we could take - but the best one would depend on the situation.
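For reference, a sketch of the simplest behaviour (one result fewer than the number of source elements, and nothing at all for sequences of zero or one element) might look like this. The class name is hypothetical, and this is just one of the possible approaches rather than a definitive one.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical container class; the name is illustrative only.
public static class SelectAdjacentExtensions
{
    public static IEnumerable<TResult> SelectAdjacent<TSource, TResult>(
        this IEnumerable<TSource> source,
        Func<TSource, TSource, TResult> selector)
    {
        // Validate eagerly; the iterator below only runs when enumerated.
        if (source == null) throw new ArgumentNullException("source");
        if (selector == null) throw new ArgumentNullException("selector");
        return SelectAdjacentImpl(source, selector);
    }

    private static IEnumerable<TResult> SelectAdjacentImpl<TSource, TResult>(
        IEnumerable<TSource> source,
        Func<TSource, TSource, TResult> selector)
    {
        using (IEnumerator<TSource> iterator = source.GetEnumerator())
        {
            // With no elements there are no pairs to present.
            if (!iterator.MoveNext())
            {
                yield break;
            }
            TSource previous = iterator.Current;
            while (iterator.MoveNext())
            {
                TSource current = iterator.Current;
                yield return selector(previous, current);
                previous = current;
            }
        }
    }
}
```

Only the previous element is remembered, so the source is read once and the operator streams its results.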

This sounds like a pretty obscure operator, but I've actually seen quite a few LINQ questions on Stack Overflow where it could have been useful. Is it useful often enough to deserve its own operator? Maybe... maybe not.

DelimitWith is really just a bit of a peeve - but again, it's a pretty common requirement. We often want to take a sequence and create a single string which is (say) a comma-delimited version. Yay, String.Join does exactly what we need - particularly in .NET 4, where there's an overload taking IEnumerable<T> so you don't need to convert it to a string array first. However, it's still a static method on string - and the name "Join" also looks slightly odd in the context of a LINQ query, as it's got nothing to do with a LINQ-style join.

Compare these two queries: which do you think reads better, and feels more "natural" in LINQ?

var names = string.Join(",",
                        people.Where(p => p.Age < 18)
                              .Select(p => p.FirstName));

var names = people.Where(p => p.Age < 18)
                  .Select(p => p.FirstName)
                  .DelimitWith(",");

I know which I prefer :)
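The operator itself is tiny. A minimal sketch, assuming it simply wraps the .NET 4 String.Join<T>(string, IEnumerable<T>) overload, might be as follows (the class name is hypothetical):

```csharp
using System;
using System.Collections.Generic;

// Hypothetical container class; the name is illustrative only.
public static class DelimitWithExtensions
{
    public static string DelimitWith<TSource>(
        this IEnumerable<TSource> source,
        string delimiter)
    {
        if (source == null) throw new ArgumentNullException("source");
        // String.Join calls ToString on each element and treats a null
        // delimiter as an empty string.
        return string.Join<TSource>(delimiter, source);
    }
}
```

All the real work is still done by String.Join; the extension method only changes where the call sits in the query.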

(Added on February 23rd 2011.)

I'm surprised I missed ToHashSet first time round - I've bemoaned its omission in various places before now. It's easy to create a list, dictionary, lookup or array from an anonymous type, but you can't create a set that way. That's mad, given how simple the relevant operator is, even with an overload for a custom equality comparer:

public static HashSet<TSource> ToHashSet<TSource>(
    this IEnumerable<TSource> source)
{
    return source.ToHashSet(EqualityComparer<TSource>.Default);
}

public static HashSet<TSource> ToHashSet<TSource>(
    this IEnumerable<TSource> source,
    IEqualityComparer<TSource> comparer)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    return new HashSet<TSource>(source, comparer ?? EqualityComparer<TSource>.Default);
}

This also makes it much simpler to create a HashSet in a readable way from an existing query expression, without either wrapping the whole query in the constructor call or using a local variable.

These are just the most useful extra methods I thought of, based on the kinds of query folks on Stack Overflow have asked about. I think it's interesting that some are quite general - MaxBy, ExceptBy, Scan and so on - whereas others (TopBy, SelectAdjacent and particularly DelimitWith) are simply aimed at making some very specific but common situations simpler. It feels to me like the more general operators really are missing from LINQ - they would fit quite naturally - but the more specific ones probably deserve to be in a separate static class, as "extras".

This is only scratching the surface of what's possible, of course - System.Interactive.EnumerableEx in Reactive Extensions has loads of options. Some of them are deliberate parallels of the operators in Observable, but plenty make sense on their own too.

One operator you may have expected to see in this list is ForEach. This is a controversial topic, but Eric Lippert has written about it very clearly (no surprise there, then). Fundamentally LINQ is about querying a sequence, not taking action on it. ForEach breaks that philosophy, which is why I haven't included it here. Usually a foreach statement is a perfectly good alternative, and makes the "action" aspect clearer.


Tuesday, 22 March 2011
