Nearly there now...
SequenceEqual has two overloads - the obvious two given that we're dealing with equality:
public static bool SequenceEqual<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second)

public static bool SequenceEqual<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second,
    IEqualityComparer<TSource> comparer)
The purpose of the operator is to determine if two sequences are equal; that is, if they consist of the same elements, in the same order. A custom equality comparer can be used to compare each individual pair of elements. Characteristics:
The first and second parameters mustn't be null, and are validated immediately. The comparer parameter can be null, in which case the default equality comparer for TSource is used. The overload without a comparer uses the default equality comparer for TSource (no funny discrepancies if the source is an ICollection<TSource> with its own idea of equality, such as a hash set with a custom comparer).

So far, so good. Note that no optimizations are mentioned above. There are questions you might consider:
Should foo.SequenceEqual(foo) always return true? If either or both of the sequences implements another collection interface, does that help us?

The first question sounds like it should be a no-brainer, but it's not as simple as it sounds. Suppose we have a sequence which always generates 10 random numbers. Is it equal to itself? If you iterate over it twice, you'll usually get different results. What about a sequence which explicitly changes each time you iterate over it, based on some side-effect? Both the .NET and Edulinq implementations say that these are non-equal. (The random case is interesting, of course - it could happen to yield the same elements as we iterate over the two sequences.)
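To make the side-effect case concrete, here's a hypothetical sketch (the names are mine, not from the post) of a sequence which yields different elements each time it's iterated, so that comparing it against itself returns false:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class SideEffectDemo
{
    static int iterationCount = 0;

    // Each time this sequence is iterated, the side-effect on iterationCount
    // means it yields a different pair of numbers.
    static IEnumerable<int> ChangingSequence()
    {
        int batch = ++iterationCount;
        yield return batch;
        yield return batch + 1;
    }

    public static bool SelfEqual()
    {
        IEnumerable<int> foo = ChangingSequence();
        // SequenceEqual iterates over foo twice, "in parallel": the first
        // iteration sees { 1, 2 } while the second sees { 2, 3 }.
        return foo.SequenceEqual(foo);
    }

    static void Main()
    {
        Console.WriteLine(SelfEqual()); // False
    }
}
```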
The second question feels a little simpler to me. We can't take a shortcut to returning true, but it seems reasonably obvious to me that if you have two collections which allow for a fast "count" operation, and the two counts are different, then the sequences are unequal. Unfortunately, LINQ to Objects appears not to optimize for this case: if you create two huge arrays of differing sizes but equal elements as far as possible, it will take a long time for SequenceEqual to return false. Edulinq does perform this optimization. Note that just having one count isn't useful: you might expect it to be, but it turns out that by the time we could realize that the lengths were different in that case, we're about to find that out in the "normal" fashion anyway, so there's no point in complicating the code to make use of the information.
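A hypothetical sketch of the shortcut described above: when both sequences are materialized collections with different counts, no elements need comparing at all (the helper name here is mine, for illustration only):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class CountCheckDemo
{
    // Mirrors the optimization: if both sequences expose a cheap count and
    // the counts differ, the sequences cannot be equal.
    public static bool DefinitelyUnequalByCount<T>(IEnumerable<T> first, IEnumerable<T> second)
    {
        ICollection<T> c1 = first as ICollection<T>;
        ICollection<T> c2 = second as ICollection<T>;
        return c1 != null && c2 != null && c1.Count != c2.Count;
    }

    static void Main()
    {
        // Two huge arrays, equal elements as far as possible, different sizes.
        int[] big1 = Enumerable.Range(0, 1000000).ToArray();
        int[] big2 = Enumerable.Range(0, 1000001).ToArray();
        // Immediate answer; without the count check, SequenceEqual would walk
        // a million equal pairs before noticing the length difference.
        Console.WriteLine(DefinitelyUnequalByCount(big1, big2)); // True
    }
}
```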
As well as the obvious argument validation, I have tests for:
Collections of different lengths Ranges of different lengths, both with first shorter than second and vice versa Using a null comparer Using a custom comparer Using no comparer Equal arrays Equal ranges The non-optimization of foo.SequenceEqual(foo) (using side-effects) The optimization using Count (fails on LINQ to Objects) Ordering: { 1, 2 } should not be equal to { 2, 1 } The use of a HashSet

None of the test code is particularly interesting, to be honest.
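A few of those behaviours can be seen in a quick usage sketch (not from the original post):

```csharp
using System;
using System.Linq;

static class UsageDemo
{
    static void Main()
    {
        // Ordering matters: same elements, different order
        Console.WriteLine(new[] { 1, 2 }.SequenceEqual(new[] { 2, 1 })); // False

        // A custom comparer is applied to each individual pair of elements
        string[] upper = { "A", "B" };
        string[] lower = { "a", "b" };
        Console.WriteLine(upper.SequenceEqual(lower)); // False (default comparer)
        Console.WriteLine(upper.SequenceEqual(lower, StringComparer.OrdinalIgnoreCase)); // True
    }
}
```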
I'm not going to show the comparer-less overload, as it just delegates to the one with a comparer.
Before we get into the guts of SequenceEqual, it's time for a bit of refactoring. If we're going to optimize for count, we'll need to perform the same type tests as Count() twice. That would be horrifically ugly inline, so let's extract the functionality out into a private method (which Count() can then call as well):
private static bool TryFastCount<TSource>(
    IEnumerable<TSource> source,
    out int count)
{
    ICollection<TSource> genericCollection = source as ICollection<TSource>;
    if (genericCollection != null)
    {
        count = genericCollection.Count;
        return true;
    }
    ICollection nonGenericCollection = source as ICollection;
    if (nonGenericCollection != null)
    {
        count = nonGenericCollection.Count;
        return true;
    }
    count = 0;
    return false;
}
Pretty simple. Note that we always have to set the out parameter to some value. We use 0 on failure - which happens to work out nicely in Count(), as we can just start incrementing if TryFastCount has returned false.
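That fallback can be sketched like this - my guess at the shape of the method, not necessarily Edulinq's exact code, and it assumes the TryFastCount shown above is in scope:

```csharp
// Count() delegates to TryFastCount, then falls back to plain iteration.
public static int Count<TSource>(this IEnumerable<TSource> source)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    int count;
    if (TryFastCount(source, out count))
    {
        return count;
    }
    // TryFastCount has set count to 0, so we can just start incrementing
    using (IEnumerator<TSource> iterator = source.GetEnumerator())
    {
        checked
        {
            while (iterator.MoveNext())
            {
                count++;
            }
        }
    }
    return count;
}
```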
Now we can make a start on SequenceEqual. Here's the skeleton before we start doing the real work:
public static bool SequenceEqual<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second,
    IEqualityComparer<TSource> comparer)
{
    if (first == null)
    {
        throw new ArgumentNullException("first");
    }
    if (second == null)
    {
        throw new ArgumentNullException("second");
    }
    int count1;
    int count2;
    if (TryFastCount(first, out count1) && TryFastCount(second, out count2))
    {
        if (count1 != count2)
        {
            return false;
        }
    }
    comparer = comparer ?? EqualityComparer<TSource>.Default;
    // Main part of implementation here
}
I could have included the comparison between count1 and count2 within the single "if" condition, like this:
if (TryFastCount(first, out count1) &&
    TryFastCount(second, out count2) &&
    count1 != count2)
{
    return false;
}
... but I don't usually like using the values of out parameters like this. The behaviour is well-defined and correct, but it just feels a little ugly to me.
Okay, now let's implement the main part indicated at the bottom of the skeleton. The idea is simple:
Get the iterators for both sequences Use the iterators "in parallel" (not in the multithreading sense, but in the movement of the logical cursor down the sequence) to compare pairs of elements; we can return false if we ever see an unequal pair If we ever see that one sequence has finished and the other hasn't, we can return false If we get to the end of both sequences in the same iteration, we can return true

I've got three different ways of representing the basic algorithm in code though. Fundamentally, the problem is that we don't have a way of iterating over pairs of elements in two sequences with foreach - we can't use one foreach loop inside another, for hopefully obvious reasons. So we'll have to call GetEnumerator() explicitly on at least one of the sequences... and we could do it for both if we want.
The first implementation (and my least favourite) does use a foreach loop:
using (IEnumerator<TSource> iterator2 = second.GetEnumerator())
{
    foreach (TSource item1 in first)
    {
        if (!iterator2.MoveNext())
        {
            return false;
        }
        if (!comparer.Equals(item1, iterator2.Current))
        {
            return false;
        }
    }
    return !iterator2.MoveNext();
}
I don't have a desperately good reason for picking them this way round (i.e. foreach over first, and GetEnumerator() on second) other than that it seems to still give primacy to first somehow... only first gets the "special treatment" of a foreach loop. (I can almost hear the chants now, "Equal rights for second sequences! Don't leave us out of the loop! Stop just 'using' us!") Although I'm being frivolous, I dislike the asymmetry of this.
The second attempt is a half-way house: it's still asymmetric, but slightly less so as we're explicitly fetching both iterators:
using (IEnumerator<TSource> iterator1 = first.GetEnumerator(),
                            iterator2 = second.GetEnumerator())
{
    while (iterator1.MoveNext())
    {
        if (!iterator2.MoveNext())
        {
            return false;
        }
        if (!comparer.Equals(iterator1.Current, iterator2.Current))
        {
            return false;
        }
    }
    return !iterator2.MoveNext();
}
Note the use of the multi-variable "using" statement; this is equivalent to nesting one using statement inside another, of course.
The similarities between these two implementations are obvious - but the differences are worth pointing out. In the latter approach, we call MoveNext() on both sequences, and then we access the Current property on both sequences. In each case we use iterator1 before iterator2, but it still feels like they're being treated more equally somehow. There's still the fact that iterator1 is being used in the while loop condition, whereas iterator2 has to be used both inside and outside the while loop. Hmm.
The third implementation takes this even further, changing the condition of the while loop:
using (IEnumerator<TSource> iterator1 = first.GetEnumerator(),
                            iterator2 = second.GetEnumerator())
{
    while (true)
    {
        bool next1 = iterator1.MoveNext();
        bool next2 = iterator2.MoveNext();
        if (next1 != next2)
        {
            return false;
        }
        if (!next1)
        {
            return true;
        }
        if (!comparer.Equals(iterator1.Current, iterator2.Current))
        {
            return false;
        }
    }
}
This feels about as symmetric as we can get. The use of next1 in the middle "if" condition is incidental - it could just as easily be next2, as we know the values are equal. We could switch round the order of the calls to MoveNext(), the order of arguments to comparer.Equals - the structure is symmetric.
I'm not generally a fan of while(true) loops, but in this case I think I rather like it. It makes it obvious that we're going to keep going until we've got a good reason to stop: one of the three return statements. (I suppose I should apologise to fans of the dogma around single exit points for methods, if any are reading. This must be hell for you...)
Arguably this is all a big fuss about nothing - but writing Edulinq has given me a new appreciation for diving into this level of detail to find the most readable code. As ever, I'd be interested to hear your views. (All three versions are in source control. Which one is active is defined with a #define.)
I really don't know why Microsoft didn't implement the optimization around different lengths for SequenceEqual. Arguably in the context of LINQ you're unlikely to be dealing with two materialized collections at a time - it's much more common to have one collection and a lazily-evaluated query, or possibly just two queries... but it's a cheap optimization and the benefits can be significant. Maybe it was just an oversight.
Our next operator also deals with pairs of elements, so we may be facing similar readability questions around it. It's Zip - the only new LINQ query operator in .NET 4.

