After the dubious optimizations of ElementAt/ElementAtOrDefault yesterday, we meet an operator which is remarkably good at defying optimization. Sort of. Depending on how you feel it should behave.
Contains has two overloads, which only differ by whether or not they take an equality comparer - just like Distinct, Intersect and the like:
public static bool Containsthis IEnumerable
TSource value)
public static bool Contains
this IEnumerable
TSource value,
IEqualityComparer
The operator simply returns a Boolean indicating whether or not "value" was found in "source". The salient points of its behaviour should be predictable now:
It uses immediate execution (as it's returning a simple value instead of a sequence) The source parameter cannot be null, and is validated immediately The value parameter can be null: it's valid to search for a null value within a sequence The comparer parameter can be null, which is equivalent to passing in EquailtyComparerSo far, so good.
Aside from argument validation, I have tests for the value being present in the source, and it not being present in the source... for the three options of "no comparer", "null comparer" and "specific comparer".
I then have one final test to validate that we return as soon as we've found a match, by giving a query which will blow up when the element after the match is computed.
Frankly none of the tests are earth-shattering, but in the spirit of giving you an idea of what they're like, here's one with a custom comparer - we use the same source and value for a "default comparer" test which doesn't find the value as the case differs:
[Test]public void MatchWithCustomComparer()
{
string[] source = { "foo", "bar", "baz" };
Assert.IsTrue(source.Contains("BAR", StringComparer.OrdinalIgnoreCase));
}
Currently I don't have a test for the optimization mentioned in the bullet points above, as I believe it's broken. More later.
To start with, let's dispense with the overload without a comparer parameter: that just delegates to the other one by specifying EqualityComparer
I've got three implementations, but we'll start with just two of them. Which one you pick would depend on whether you're happy to use one operator to implement another. If you think that's okay, it's really simple:
public static bool Containsthis IEnumerable
TSource value,
IEqualityComparer
{
comparer = comparer ?? EqualityComparer
return source.Any(item => comparer.Equals(value, item));
}
"Any" has exactly the traits we want, including validation of the non-nullity of "source". It's hardly complicated code if we don't use Any though:
public static bool Containsthis IEnumerable
TSource value,
IEqualityComparer
{
if (source == null)
{
throw new ArgumentNullException("source");
}
comparer = comparer ?? EqualityComparer
foreach (TSource item in source)
{
if (comparer.Equals(value, item))
{
return true;
}
}
return false;
}
Obviously there's a slight penalty in using Any just because of executing a delegate on each iteration - and the extra memory requirement of building an object to capture the comparer. I haven't measured the performance impact of this - again, it's a candidate for benchmarking.
The implementations above are all very well, but they feel ever so simplistic. With ElementAt, we were able to take advantage of the fact that an IList
Well, yes and no. We've got IDictionary
ICollection
(Summary)
Determines whether a sequence contains a specified element by using the default equality comparer.
(Remarks)
If the type of source implements ICollection
, the Contains method in that implementation is invoked to obtain the result. Otherwise, this method determines whether source contains the specified element. Enumeration is terminated as soon as a matching element is found.
Elements are compared to the specified value by using the default equality comparer, Default.
Why is this troubling? Well, let's look at a test:
[Test]public void SetWithDifferentComparer()
{
HashSet
{ "foo", "bar", "baz" };
IEnumerable
Assert.IsTrue(sourceAsSet.Contains("BAR"));
Assert.IsFalse(sourceAsSequence.Contains("BAR"));
Assert.IsFalse(sourceAsSequence.Contains("BAR", StringComparer.Ordinal));
}
(This exact code won't build in the Edulinq project configuration, as that doesn't have a reference to the System.Core assembly which contains HashSet
Now this test looks correct to me: while we're regarding the set as a set, it should use the set's comparer and find "BAR" with a case-insensitive match. However, when we use it as a sequence in LINQ, it should obey the rules of Enumerable.Contains - which means that the middle call should use the default equality comparer for string. Under that equality comparer, "BAR" isn't present.
It doesn't: the above test fails on that middle call in LINQ to Objects, because HashSet
"Determines whether a sequence contains a specified element by using the default equality comparer if the sequence doesn't implement ICollection
Now you may be saying to yourself that this is only like relying on IList
Implementations can vary in how they determine equality of objects; for example, List
uses Comparer .Default, whereas Dictionary allows the user to specify the IComparer implementation to use for comparing keys.
Let's leave aside the fact that those "Comparer
Now I would certainly agree that having a method call which changes semantics based on whether the compile-time type of the source is IEnumerable
Note that you might expect the overload which takes a comparer to work the same way if you pass in null as the comparer - but it doesn't. That overload never delegates to ICollection
So: convenience, predictability, consistency. Pick any two. Isn't API design fun? This isn't even thinking about performance, of course...
It's worth bearing in mind that even the current behaviour which is presumably meant to encourage consistency doesn't work. One might expect that the following would always be equivalent for any sensible collection:
var first = source.Contains(value);var second = source.Select(x => x).Contains(value);
... but of course the second line will always use EqualityComparer
(Just for fun, think about Dictionary
I can think of exactly one situation which we could legitimately optimize without making the behaviour hard to predict. Basically we're fine to ask the collection to do our work for us if we can guarantee it will use the right comparer. Ironically, List
ISet
However, HashSet
if (hashSet != null && comparer.Equals(hashSet.Comparer))
{
return hashSet.Contains(value);
}
So is this worth including or not?
Pros:
It covers one of the biggest use cases for optimizing Contains; I suspect this is used more often than the LINQ implementation of Contains working over a dictionary. So long as the comparer doesn't override Equals in a bizarre way, it should be a true optimization with no difference in behaviour. The optimization is applied for both overloads of Enumerable.Contains, not just the comparer-less one.Cons:
It's specific to HashSetIt's not the default implementation in Edulinq at the moment, but I could possibly be persuaded to include it. Likewise I have a conditionally-compiled version of Contains which is compatible with LINQ to Objects, with the "broken" optimization for the comparer-less overload; this is turned off by default too.
Gosh! I hadn't expected Contains to be nearly this interesting. I'd worked out that optimization would be a pain, but I hadn't expected it to be such a weird design choice.
This is the first time I've deliberately gone against the LINQ to Objects behaviour, other than the MS bug around descending orderings using "extreme" comparers. The option for compatibility is there, but I feel fairly strongly that this was a bad design decision on Microsoft's part. A bad decision out of some fairly unpleasant alternatives, I grant you. I'm willing to be persuaded of its virtues, of course - and in particular I'd welcome discussion with the LINQ team around this. In particular, it's always fun to hear about the history of design decisions.
Next up, Cast and OfType.


0 comments:
Post a Comment