Reimplementing LINQ to Objects: Contains Asp.Net C# ~ DotnetlearningSource

After the dubious optimizations of ElementAt/ElementAtOrDefault yesterday, we meet an operator which is remarkably good at defying optimization. Sort of. Depending on how you feel it should behave.

Contains has two overloads, which only differ by whether or not they take an equality comparer - just like Distinct, Intersect and the like:

public static bool Contains(
this IEnumerable source,
TSource value)

public static bool Contains(
this IEnumerable source,
TSource value,
IEqualityComparer comparer)

The operator simply returns a Boolean indicating whether or not "value" was found in "source". The salient points of its behaviour should be predictable now:

It uses immediate execution (as it's returning a simple value instead of a sequence) The source parameter cannot be null, and is validated immediately The value parameter can be null: it's valid to search for a null value within a sequence The comparer parameter can be null, which is equivalent to passing in EquailtyComparer.Default. The overload without a comparer uses the default equality comparer too. If a match is found, the method returns immediately without reading the rest of the input sequence. There's a documented optimization for ICollection - but there are significant issues with it...

So far, so good.

Aside from argument validation, I have tests for the value being present in the source, and it not being present in the source... for the three options of "no comparer", "null comparer" and "specific comparer".

I then have one final test to validate that we return as soon as we've found a match, by giving a query which will blow up when the element after the match is computed.

Frankly none of the tests are earth-shattering, but in the spirit of giving you an idea of what they're like, here's one with a custom comparer - we use the same source and value for a "default comparer" test which doesn't find the value as the case differs:

[Test]
public void MatchWithCustomComparer()
{
string[] source = { "foo", "bar", "baz" };
Assert.IsTrue(source.Contains("BAR", StringComparer.OrdinalIgnoreCase));
}

Currently I don't have a test for the optimization mentioned in the bullet points above, as I believe it's broken. More later.

To start with, let's dispense with the overload without a comparer parameter: that just delegates to the other one by specifying EqualityComparer.Default. Trivial. (Or so we might think. There's more to this than meets the eye.)

I've got three implementations, but we'll start with just two of them. Which one you pick would depend on whether you're happy to use one operator to implement another. If you think that's okay, it's really simple:

public static bool Contains(
this IEnumerable source,
TSource value,
IEqualityComparer comparer)
{
comparer = comparer ?? EqualityComparer.Default;
return source.Any(item => comparer.Equals(value, item));
}

"Any" has exactly the traits we want, including validation of the non-nullity of "source". It's hardly complicated code if we don't use Any though:

public static bool Contains(
this IEnumerable source,
TSource value,
IEqualityComparer comparer)
{
if (source == null)
{
throw new ArgumentNullException("source");
}
comparer = comparer ?? EqualityComparer.Default;
foreach (TSource item in source)
{
if (comparer.Equals(value, item))
{
return true;
}
}
return false;
}

Obviously there's a slight penalty in using Any just because of executing a delegate on each iteration - and the extra memory requirement of building an object to capture the comparer. I haven't measured the performance impact of this - again, it's a candidate for benchmarking.

The implementations above are all very well, but they feel ever so simplistic. With ElementAt, we were able to take advantage of the fact that an IList allows us random access by index. Surely we've got similar collections which allow us to test for containment cheaply?

Well, yes and no. We've got IDictionary which allows you to check for the presence of a particular key - but even it would be hard to test whether the sequence we're looking at is the key sequence for some IDictionary, and somehow get back to the dictionary.

ICollection has a Contains method, but that doesn't necessarily do the right thing. This is particularly troubling, as the MSDN documentation for the comparer-less overload has contradictory information:

(Summary)

Determines whether a sequence contains a specified element by using the default equality comparer.

(Remarks)

If the type of source implements ICollection, the Contains method in that implementation is invoked to obtain the result. Otherwise, this method determines whether source contains the specified element.

Enumeration is terminated as soon as a matching element is found.

Elements are compared to the specified value by using the default equality comparer, Default.

Why is this troubling? Well, let's look at a test:

[Test]
public void SetWithDifferentComparer()
{
HashSet sourceAsSet = new HashSet(StringComparer.OrdinalIgnoreCase)
{ "foo", "bar", "baz" };
IEnumerable sourceAsSequence = sourceAsSet;
Assert.IsTrue(sourceAsSet.Contains("BAR"));
Assert.IsFalse(sourceAsSequence.Contains("BAR"));
Assert.IsFalse(sourceAsSequence.Contains("BAR", StringComparer.Ordinal));
}

(This exact code won't build in the Edulinq project configuration, as that doesn't have a reference to the System.Core assembly which contains HashSet. I've got a hack which allows me to run effectively this code though. See the source for details.)

Now this test looks correct to me: while we're regarding the set as a set, it should use the set's comparer and find "BAR" with a case-insensitive match. However, when we use it as a sequence in LINQ, it should obey the rules of Enumerable.Contains - which means that the middle call should use the default equality comparer for string. Under that equality comparer, "BAR" isn't present.

It doesn't: the above test fails on that middle call in LINQ to Objects, because HashSet implements ICollection. To fit in with the implementation, the documentation summary should actually be worded as something like:

"Determines whether a sequence contains a specified element by using the default equality comparer if the sequence doesn't implement ICollection, or whatever equality comparison the collection uses if it does implement ICollection."

Now you may be saying to yourself that this is only like relying on IList to fetch an item by index in a fashion consistent with iterating over with it - but I'd argue that any IList implementation which didn't do that was simply broken... whereas ICollection.Contains is specifically documented to allow custom comparisons:

Implementations can vary in how they determine equality of objects; for example, List uses Comparer.Default, whereas Dictionary allows the user to specify the IComparer implementation to use for comparing keys.

Let's leave aside the fact that those "Comparer" and "IComparer" should be "EqualityComparer" and "IEqualityComparer" respectively for the minute, and just note that it's entirely reasonable for an implementation not to use the default equality comparer. That makes sense - but I believe it also makes sense for source.Contains(value) to be more predictable in terms of the equality comparer it uses.

Now I would certainly agree that having a method call which changes semantics based on whether the compile-time type of the source is IEnumerable or ICollection is undesirable too... but I'm not sure there is any particularly nice solution. The options are:

The current LINQ to Objects implementation where the comparer used is hard to predict. The Edulinq implementation where the type's default comparer is always used... if the compile-time type means that Enumerable.Contains is used in the first place. Remove the comparer-less overload entirely, and force people to specify one. This is lousy for convenience.

Note that you might expect the overload which takes a comparer to work the same way if you pass in null as the comparer - but it doesn't. That overload never delegates to ICollection.Contains.

So: convenience, predictability, consistency. Pick any two. Isn't API design fun? This isn't even thinking about performance, of course...

It's worth bearing in mind that even the current behaviour which is presumably meant to encourage consistency doesn't work. One might expect that the following would always be equivalent for any sensible collection:

var first = source.Contains(value);
var second = source.Select(x => x).Contains(value);

... but of course the second line will always use EqualityComparer.Default whereas the first may or may not.

(Just for fun, think about Dictionary which implements ICollection>; its explicitly-implemented ICollection.Contains method will use its own equality comparer for the key, but the default equality comparer for the value part of the pair. Yay!)

I can think of exactly one situation which we could legitimately optimize without making the behaviour hard to predict. Basically we're fine to ask the collection to do our work for us if we can guarantee it will use the right comparer. Ironically, List.Contains has an overload which allows us to specify the equality comparer, so we could delegate to that - but it's not going to be significantly faster than doing it ourselves. It's still got to look through everything.

ISet in .NET 4 doesn't help us much - its API doesn't talk about equality comparers. (This makes a certain amount of sense - consider SortedSet which uses IComparer instead of IEqualityComparer. It wouldn't make sense to ask a SortedSet for an equality comparer - it couldn't give you one, as it wouldn't know how to produce a hash code.)

However, HashSet does give us something to work with. You can ask a HashSet which equality comparer it uses, so we could delegate to its implementation if and only if it would use the one we're interested in. We can bolt that into our existing implementation pretty easily, after we've worked out the comparer to use:

HashSet hashSet = source as HashSet;
if (hashSet != null && comparer.Equals(hashSet.Comparer))
{
return hashSet.Contains(value);
}

So is this worth including or not?

Pros:

It covers one of the biggest use cases for optimizing Contains; I suspect this is used more often than the LINQ implementation of Contains working over a dictionary. So long as the comparer doesn't override Equals in a bizarre way, it should be a true optimization with no difference in behaviour. The optimization is applied for both overloads of Enumerable.Contains, not just the comparer-less one.

Cons:

It's specific to HashSet rather than an interface type. That makes it feel a little too specific to be a good target of optimization. We've still got the issue of consistency in terms of sourceAsSet.Contains(value) vs sourceAsSequence.Contains(value) There's a tiny bit of overhead if the source isn't a hash set, and a further overhead if it is a hash set but with the wrong comparer. I'm not too bothered about this.

It's not the default implementation in Edulinq at the moment, but I could possibly be persuaded to include it. Likewise I have a conditionally-compiled version of Contains which is compatible with LINQ to Objects, with the "broken" optimization for the comparer-less overload; this is turned off by default too.

Gosh! I hadn't expected Contains to be nearly this interesting. I'd worked out that optimization would be a pain, but I hadn't expected it to be such a weird design choice.

This is the first time I've deliberately gone against the LINQ to Objects behaviour, other than the MS bug around descending orderings using "extreme" comparers. The option for compatibility is there, but I feel fairly strongly that this was a bad design decision on Microsoft's part. A bad decision out of some fairly unpleasant alternatives, I grant you. I'm willing to be persuaded of its virtues, of course - and in particular I'd welcome discussion with the LINQ team around this. In particular, it's always fun to hear about the history of design decisions.

Next up, Cast and OfType.

DotnetlearningSource

Pages

Tuesday, 22 March 2011

Reimplementing LINQ to Objects: Contains Asp.Net C#

0 comments:

Post a Comment

Subscribe via email

Featured Video

Recent Posts

BThemes

BTricks

Followers

Download

Archives