Reimplementing LINQ to Objects: Comparing implementations

While implementing Edulinq, I only focused on two implementations: .NET 4.0 and Edulinq. However, I was aware that there were other implementations available, notably LinqBridge and the one which comes with Mono. Obviously it's interesting to see how other implementations behave, so I've now made a few changes in order to make the test code run in these different environments.

I'm using Mono 2.8 (I can't remember the minor version number offhand) but I tend to think of it as "Mono 3.5" or "Mono 4.0" depending on which runtime I'm using and which base libraries I'm compiling against, to correspond with the .NET versions. Both runtimes ship as part of Mono 2.8. I will use these version numbers for this post, and ask forgiveness for my lack of precision: whenever you see "Mono 3.5" please just think "Mono 2.8 running against the 2.0 runtime, possibly using some of the class libraries normally associated with .NET 3.5".

LinqBridge is a bit like Edulinq - a clean room implementation of LINQ to Objects, but built against .NET 2.0. It contains its own Func delegate declarations and its own version of ExtensionAttribute for extension methods. In my experience this makes it difficult to use with the "real" .NET 3.5, so my build targets .NET 2.0 when running against LinqBridge. This means that tests using HashSet had to be disabled. The version of LinqBridge I'm running against is 1.2 - the latest binary available on the web site. This has AsEnumerable as a plain static method rather than an extension method; the code has been fixed in source control, but I wanted to run against a prebuilt binary, so I've just disabled my own AsEnumerable tests for LinqBridge. Likewise the tests for Zip are disabled both for LinqBridge and the "Mono 3.5" tests as Zip was only introduced in .NET 4.

The other issue with not having .NET 4 available in the tests is that the string.Join<T>(string, IEnumerable<T>) overload is unavailable - something I'd used quite a lot in the test code. I've created a new static class called "StringEx" and replaced string.Join with StringEx.Join everywhere.
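A helper along these lines would be enough to stand in for the missing overload. This is only a sketch: the post names the StringEx class and its Join method, but the body below is an assumption about how such a helper might be written, not the actual Edulinq test code.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Sketch of a string.Join replacement that compiles without .NET 4.
// Only the names StringEx and Join come from the post; the body is assumed.
public static class StringEx
{
    public static string Join<T>(string separator, IEnumerable<T> values)
    {
        if (values == null)
        {
            throw new ArgumentNullException("values");
        }
        StringBuilder builder = new StringBuilder();
        bool first = true;
        foreach (T value in values)
        {
            // Emit the separator between items, never before the first one.
            if (!first)
            {
                builder.Append(separator);
            }
            first = false;
            // Null items contribute an empty string.
            if (value != null)
            {
                builder.Append(value.ToString());
            }
        }
        return builder.ToString();
    }
}
```

For example, StringEx.Join(",", new[] { 1, 2, 3 }) produces "1,2,3", and an empty sequence produces an empty string.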

There are batch files under a new "testing" directory which will build and run:

- Microsoft's LINQ to Objects and Edulinq under .NET
- LinqBridge, Mono 3.5's LINQ to Objects and Edulinq under Mono 3.5
- Mono 4.0's LINQ to Objects and Edulinq under Mono 4.0

Although I have LinqBridge running under .NET 2.0 in Visual Studio, it's a bit of a pain building the tests from a batch file (at least without just calling msbuild). The failures running under Mono 3.5 are the same as those running under .NET 2.0 as far as I can tell, so I'm not too worried.

Note that while I have built the Mono tests under both the 3.5 and 4.0 profiles, the results were the same apart from differences caused by generic variance, so I've only included the results of the 4.0 profile below.

Don't forget that the Edulinq tests were written in the spirit of investigation. They cover aspects of LINQ's behaviour which are not guaranteed, both in terms of optimization and simple correctness of behaviour. I have included a test which demonstrates the "issue" with calling Contains on an ICollection which uses a non-default equality comparer, as well as the known issue with OrderByDescending using a comparer which returns int.MinValue. There are optimizations which are present in Edulinq but not in LINQ to Objects, and I have tests for those, too.

The tests which fail against Microsoft's implementation (for known reasons) are normally marked with an [Ignore] attribute to prevent them from alarming me unduly during development. NUnit categories would make more sense here, but I don't believe ReSharper supports them, and that's the way I run the tests normally. Likewise the tests which take a very long time (such as counting more than int.MaxValue elements) are normally suppressed.

In order to truly run all my tests, I now have a horrible hack using conditional compilation: if the ALL_TESTS preprocessor symbol is defined, I build my own IgnoreAttribute class in the Edulinq.Tests namespace, which effectively takes precedence over the NUnit one... so NUnit will ignore the [Ignore], so to speak. Frankly all this conditional compilation is pretty horrible, and I wouldn't use it for a "real" project, but this is a slightly unusual situation.
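The trick relies on C# name lookup: a type declared in the enclosing namespace is found before a type of the same name imported by a using directive. The sketch below illustrates that rule under some assumptions. FakeNUnit.Framework is a stand-in so the example compiles without NUnit, and SampleTests, SlowTest, AllTestsDefined and IsIgnoredByFramework are hypothetical names, not part of the Edulinq tests. ALL_TESTS would be supplied by the build (for example via a compiler define option).

```csharp
using System;

// Stand-in for NUnit.Framework so the sketch compiles on its own (assumption).
namespace FakeNUnit.Framework
{
    [AttributeUsage(AttributeTargets.Method)]
    public sealed class IgnoreAttribute : Attribute { }
}

namespace Edulinq.Tests
{
    using FakeNUnit.Framework;

#if ALL_TESTS
    // Declared in the tests' own namespace, so [Ignore] binds to this type
    // rather than to the one imported by the using directive.
    [AttributeUsage(AttributeTargets.Method)]
    public sealed class IgnoreAttribute : Attribute { }
#endif

    public static class SampleTests
    {
#if ALL_TESTS
        public static readonly bool AllTestsDefined = true;
#else
        public static readonly bool AllTestsDefined = false;
#endif

        [Ignore]
        public static void SlowTest() { }

        // True when [Ignore] bound to the framework's attribute, which is
        // the only one a test runner would recognise.
        public static bool IsIgnoredByFramework()
        {
            return typeof(SampleTests).GetMethod("SlowTest")
                .IsDefined(typeof(FakeNUnit.Framework.IgnoreAttribute), false);
        }
    }
}
```

With ALL_TESTS defined, IsIgnoredByFramework() returns false, so a runner looking for the framework's attribute would run the test. Without the symbol it returns true.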

EDIT: It turns out that ReSharper does support categories. I'm not sure how far that support goes yet, but at the very least there's "Group by categories" available. I may go through all my tests and apply a category to each one: optimization, execution mode, time-consuming etc. We'll see whether I can find the energy for that :)

So, let's have a look at what the test results are...

Unsurprisingly, Edulinq passes all its own tests, with the minor exception of CastTest.OriginalSourceReturnedDueToGenericCovariance running under Mono 3.5, which doesn't include covariance. Arguably this test should be conditionalised to not even run in that situation, as it's not expected to work.

Microsoft's LINQ to Objects had 8 failures, all expected:

- Contains delegates to the ICollection<T>.Contains implementation if it exists, rather than using the default comparer for the type. This is a design and documentation issue which I've discussed in more detail in the Contains part of this series.
- Optimization: ElementAt and ElementAtOrDefault don't validate the specified index eagerly when the input sequence implements ICollection<T> but not IList<T>.
- Optimization: OfType always uses an intermediate iterator even when the input sequence already implements IEnumerable<T> and T is a non-nullable value type.
- Optimization: SequenceEqual doesn't compare the counts of the sequences eagerly even when both sequences implement ICollection<T>.
- Correctness: OrderByDescending doesn't work if you use a key comparer which returns int.MinValue.
- Consistency: Single and SingleOrDefault (with a predicate) don't throw InvalidOperationException as soon as they encounter a second element matching the predicate; the predicate-less overloads do throw as soon as they see a second element.

All of these have been discussed already, so I won't go into them now.
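As a quick reminder of why a comparer returning int.MinValue can be troublesome, here is a minimal sketch. It assumes the descending order is obtained by negating the comparer's result, which is one plausible cause and not something this post states about any particular implementation. ExtremeComparer and ReverseComparerDemo are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical comparer that signals "less than" with int.MinValue.
public sealed class ExtremeComparer : IComparer<int>
{
    public int Compare(int x, int y)
    {
        if (x == y)
        {
            return 0;
        }
        return x < y ? int.MinValue : int.MaxValue;
    }
}

public static class ReverseComparerDemo
{
    // Fragile: negating int.MinValue overflows back to int.MinValue,
    // so "less than" is still reported as "less than".
    public static int ReverseByNegation(IComparer<int> comparer, int x, int y)
    {
        return unchecked(-comparer.Compare(x, y));
    }

    // Robust: swap the arguments instead of negating the result.
    public static int ReverseBySwapping(IComparer<int> comparer, int x, int y)
    {
        return comparer.Compare(y, x);
    }
}
```

Reversing the comparison of 1 and 2 should give a positive result. ReverseBySwapping does, while ReverseByNegation still returns a negative value.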

LinqBridge had a total of 33 failures. I haven't looked into them in detail, but just going from the test output I've broken them down into the following broad categories:

- Optimization:
  - Cast never returns the original source, presumably always introducing an intermediate iterator.
  - All three of Microsoft's "missed opportunities" listed above are also missed in LinqBridge.
- Use of input sequences:
  - Except and Intersect appear to read the first sequence first (possibly completely?) and then the second sequence. Edulinq and LINQ to Objects read the second sequence completely and then stream the first sequence. This behaviour is undocumented.
  - Join, GroupBy and GroupJoin appear not to be deferred at all. If I'm right, this is a definite bug.
- Aggregation accuracy: both Average and Sum over an IEnumerable<float> appear to use a float accumulator instead of a double. This is probably worth fixing for the sake of both range and accuracy, but isn't specified in the documentation.
- OrderBy (etc) appears to apply the key selector multiple times while sorting. The behaviour here isn't documented, but as I mentioned before, it could produce performance issues unnecessarily.
- Exceptions:
  - ToDictionary should throw an exception if you give it duplicate keys; it appears not to - at least when a custom comparer is used. (It's possible it's just not passing the comparer along.)
  - The generic Max and Min methods don't return the null value for the element type when that type is nullable. Instead, they throw an exception - which is the normal behaviour if the element type is non-nullable. This behaviour isn't well documented, but is consistent with the behaviour of the non-generic overloads. See the Min/Max post for more details.
- General bugs:
  - The generic form of Min/Max appears not to ignore null values when the element type is nullable.
  - OrderByDescending appears to be broken in the same way as Microsoft's implementation.
  - Range appears to be broken around its boundary testing.
  - Join, GroupJoin, GroupBy and ToLookup break when presented with null keys.
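The accumulator point is easy to demonstrate in isolation. The following sketch does not use LinqBridge at all; AccumulatorDemo and its methods are hypothetical names, and the code simply contrasts summing float values into a float with summing them into a double.

```csharp
using System;
using System.Collections.Generic;

public static class AccumulatorDemo
{
    // Accumulates in single precision.
    public static float SumWithFloat(IEnumerable<float> source)
    {
        float sum = 0f;
        foreach (float value in source)
        {
            // The cast forces the running total back to single precision.
            sum = (float)(sum + value);
        }
        return sum;
    }

    // Accumulates in double precision and narrows only at the end.
    public static float SumWithDouble(IEnumerable<float> source)
    {
        double sum = 0d;
        foreach (float value in source)
        {
            sum += value;
        }
        return (float)sum;
    }

    // Yields the value 1f the requested number of times.
    public static IEnumerable<float> Ones(int count)
    {
        for (int i = 0; i < count; i++)
        {
            yield return 1f;
        }
    }
}
```

Summing 20,000,000 ones shows the difference: the float accumulator stops growing at 16,777,216 (2^24), because adding 1 to that value is no longer representable in single precision, whereas the double accumulator reaches 20,000,000.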

Mono failed 18 of the tests. There are fewer definite bugs than in LinqBridge, but it's definitely not perfect. Here's the breakdown:

- Optimization:
  - Mono misses the same three opportunities that LinqBridge and Microsoft miss.
  - Contains(item) delegates to ICollection<T> when it's implemented, just like in the Microsoft implementation. (I assume the authors would call this an "optimization", hence its location in this section.) I believe that LinqBridge has the same behaviour, but that test didn't run in the LinqBridge configuration as it uses HashSet.
- Average/Sum accumulator types: Mono appears to use float when working with float values, leading to more accumulator error than is necessary.
- Average overflow for integer types: Mono appears to use checked arithmetic when summing a sequence, but not when taking the average of a sequence. So the average of { long.MaxValue, long.MaxValue, 2 } is 0. (This originally confused me into thinking it was using floating point types during the summation, but I now believe it's just a checked/unchecked issue.)
- Bugs:
  - Count doesn't overflow either with or without a predicate.
  - The Max handling of double.NaN isn't in line with .NET. I haven't investigated the reason for this yet.
  - OrderByDescending is broken in the same way as for LinqBridge and the Microsoft implementation.
  - Range is broken for both Range(int.MinValue, 0) and Range(int.MaxValue, 1). Test those boundary cases, folks :)
  - When reversing a list, Mono doesn't buffer the current contents. In other words, changes made while iterating over the reversed list are visible in the returned sequence. The documentation isn't very clear about the desired behaviour here, admittedly.
  - GroupJoin and Join match null keys, unlike Microsoft's implementation.
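The arithmetic behind the "average is 0" result can be reproduced without Mono. The sketch below is an illustration only; AverageOverflowDemo and its methods are hypothetical names, and it makes no claim about how Mono's Average is actually written. It just shows what unchecked and checked Int64 accumulation do with those three values.

```csharp
using System;

public static class AverageOverflowDemo
{
    // Unchecked accumulation: overflow silently wraps around.
    public static double UncheckedAverage(long[] values)
    {
        long sum = 0;
        foreach (long value in values)
        {
            sum = unchecked(sum + value);
        }
        return (double)sum / values.Length;
    }

    // Checked accumulation: overflow throws OverflowException.
    public static double CheckedAverage(long[] values)
    {
        long sum = 0;
        foreach (long value in values)
        {
            sum = checked(sum + value);
        }
        return (double)sum / values.Length;
    }
}
```

For { long.MaxValue, long.MaxValue, 2 }, the unchecked total is 2 * (2^63 - 1) + 2 = 2^64, which wraps to 0 in 64 bits, so the average comes out as 0. The checked version throws as soon as the second long.MaxValue is added.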

It didn't seem fair to only test other implementations against the Edulinq tests. After all, it's only natural that my tests should work against my own code. What happens if we run the Mono and LinqBridge tests against my code?

The LinqBridge tests didn't find anything surprising. There were two failures:

- I don't have the "delegate Contains to ICollection.Contains" behaviour, which the tests check for.
- I don't optimize First in the case of the collection implementing IList. I view this as a pretty dubious optimization to be honest - I doubt that creating an iterator to get to the first item is going to be much slower than checking for IList, fetching the count, and then fetching the first item via the indexer... and it means that all non-list implementations also have to check whether the sequence implements IList. I don't intend to change Edulinq for this.
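The two Contains behaviours can give different answers when the collection uses a non-default equality comparer. The sketch below contrasts them using hypothetical helper methods (ContainsDemo, ContainsDelegating and ContainsWithDefaultComparer are illustrative names, not code from any of the implementations discussed).

```csharp
using System;
using System.Collections.Generic;

public static class ContainsDemo
{
    // Delegates to the collection's own Contains when one is available.
    public static bool ContainsDelegating<T>(IEnumerable<T> source, T value)
    {
        ICollection<T> collection = source as ICollection<T>;
        if (collection != null)
        {
            return collection.Contains(value);
        }
        return ContainsWithDefaultComparer(source, value);
    }

    // Always iterates and uses the default equality comparer for T.
    public static bool ContainsWithDefaultComparer<T>(IEnumerable<T> source, T value)
    {
        IEqualityComparer<T> comparer = EqualityComparer<T>.Default;
        foreach (T item in source)
        {
            if (comparer.Equals(item, value))
            {
                return true;
            }
        }
        return false;
    }
}
```

With a HashSet<string> built using StringComparer.OrdinalIgnoreCase and containing "abc", asking for "ABC" gives true from the delegating version (the set's comparer is used) and false from the default-comparer version.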

The Mono tests picked up the same two failures as above, and two genuine bugs:

- By implementing Take via TakeWhile, I was iterating too far: in order for the condition to become false, we had to iterate to the first item we wouldn't return.
- ToLookup didn't accept null keys - a fault which propagated to GroupJoin, Join and GroupBy too. (EDIT: It turns out that it's more subtle than that. Nothing should break, but the MS implementation ignores null keys for Join and GroupJoin. Edulinq now does the same, but I've raised a Connect issue to suggest this should at least be documented.)
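The Take issue can be illustrated with a source that records how many items were requested from it. This is a sketch under assumptions: TakeViaTakeWhile is a guess at the general shape of a TakeWhile-based Take, TakeStoppingEarly is one possible fix, and neither is copied from Edulinq. All names here are hypothetical, and System.Linq's TakeWhile is used for the first version.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TakeDemo
{
    // Assumed shape of the problematic version: the predicate only turns
    // false once the (count + 1)th item has already been fetched.
    public static IEnumerable<T> TakeViaTakeWhile<T>(IEnumerable<T> source, int count)
    {
        return source.TakeWhile((item, index) => index < count);
    }

    // Stops as soon as the last wanted item has been yielded.
    public static IEnumerable<T> TakeStoppingEarly<T>(IEnumerable<T> source, int count)
    {
        if (count <= 0)
        {
            yield break;
        }
        int taken = 0;
        foreach (T item in source)
        {
            yield return item;
            if (++taken == count)
            {
                yield break;
            }
        }
    }

    // Yields 0, 1, 2, ... and records in pulled[0] how many values were requested.
    public static IEnumerable<int> CountingSource(int[] pulled)
    {
        for (int i = 0; ; i++)
        {
            pulled[0]++;
            yield return i;
        }
    }
}
```

Taking two items through TakeViaTakeWhile pulls three values from the source, while TakeStoppingEarly pulls exactly two. The difference matters when fetching an item is expensive or has side effects.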

I've fixed these in source control, and will add an addendum to each of the relevant posts (Take, ToLookup) when I have a moment spare.

There's one additional failure, trying to find the average of a sequence of two Int64.MaxValue values. That overflows on both Edulinq and LINQ to Objects - that's the downside of using an Int64 to sum the values. As mentioned, Mono suffers a degree of inaccuracy instead; it's all a matter of trade-offs. (A really smart implementation might use Int64 while possible, and then go up to using Double where necessary, I suppose.)

Unfortunately I don't have the tests for the Microsoft implementation, of course... I'd love to know whether there's anything I've failed with there.

This was very interesting - there's a mixture of failure conditions around, and plenty of "non-failures" where each implementation's tests are enforcing their own behaviour.

I do find it amusing that all three of the "mainstream" implementations have the same OrderByDescending bug though. Other than that, the clear bugs between Mono and LinqBridge don't intersect, which is slightly surprising.

It's nice to see that despite not setting out to create a "production-quality" implementation of LINQ to Objects, that's mostly what I've ended up with. Who knows - maybe some aspects of my implementation or tests will end up in Mono in the future :)

Given the various different optimizations mentioned in this post, I think it's only fitting that next time I'll discuss where we can optimize, where it's worth optimizing, and some more tricks we could still pull out of the bag...


Tuesday, 22 March 2011
