A few parts ago, I jotted down a few thoughts on optimization. Three more topics on that general theme have occurred to me, one of them prompted by the comments.
I mentioned last time that for micro-optimization purposes, we could derive a tiny benefit if there were operators which allowed us to turn off potential optimizations - effectively declaring in the LINQ query that we believed the input sequence would never be an IList<T>, so the check wasn't worth making.
However, going the other way is entirely possible. Imagine if we could say, "There are probably a lot of items in this collection, and the operations I want to perform on them are independent and thread-safe. Feel free to parallelize them."
That's exactly what Parallel LINQ gives you, of course. A simple call to AsParallel() somewhere in the query - often at the start, but it doesn't have to be - enables parallelism. You need to be careful how you use this, of course, which is why it's opt-in... and it gives you a fair amount of control in terms of degrees of potential parallelism, whether the results are required in the original order and so on.
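To make that concrete, here's a small sketch of the opt-in in action. AsParallel(), AsOrdered() and WithDegreeOfParallelism() are the real PLINQ operators; the projection itself is just an arbitrary independent, thread-safe operation.

```csharp
using System;
using System.Linq;

class PlinqDemo
{
    static void Main()
    {
        var numbers = Enumerable.Range(1, 1_000_000);

        // Opt in to parallelism: each projection is independent and
        // thread-safe, so PLINQ is free to partition the work.
        var squares = numbers
            .AsParallel()
            .AsOrdered()                   // keep results in source order
            .WithDegreeOfParallelism(4)    // cap the number of worker threads
            .Select(x => (long) x * x)
            .ToList();

        Console.WriteLine(squares[0]);   // 1
        Console.WriteLine(squares[2]);   // 9
    }
}
```

Without AsOrdered(), the results could legitimately come back in any order - which is exactly the sort of control the query author has to exercise deliberately.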
In some ways my "TopBy" proposal is similar in a very small way, in that it gives information relatively early in the query, allowing the subsequent parts (ThenBy clauses) to take account of the extra information provided by the user. On the other hand, the effect is extremely localized - basically just for the sequence of clauses to do with ordering.
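As a reminder of the shape of that hypothetical operator, here's a deliberately naive sketch. A real implementation would keep only the top count elements in a bounded heap (and would return an ordered sequence so that ThenBy clauses could compose with it); sorting everything and truncating merely has the same observable results.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class TopByExtensions
{
    // Naive, semantically-equivalent sketch of the hypothetical TopBy:
    // knowing 'count' up front is what would let a real implementation
    // use a bounded heap instead of a full sort.
    public static IEnumerable<TSource> TopBy<TSource, TKey>(
        this IEnumerable<TSource> source, int count, Func<TSource, TKey> keySelector)
    {
        return source.OrderBy(keySelector).Take(count);
    }
}

class TopByDemo
{
    static void Main()
    {
        int[] scores = { 30, 10, 50, 20, 40 };
        // The three smallest scores, in ascending order.
        Console.WriteLine(string.Join(", ", scores.TopBy(3, x => x)));  // 10, 20, 30
    }
}
```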
Related to the idea of parallelism is the idea of side-effects, and how they affect LINQ to Objects itself.
The optimizations in LINQ to Objects appear to make some assumptions about side-effects:
- Iterating over a collection won't cause any side-effects
- Predicates may cause side-effects

Without the first point, all kinds of optimizations would effectively be inappropriate. As the simplest example, Count() won't use an iterator - it will just take the count of the collection. What if this was an odd collection which mutated something during iteration, though? Or what if accessing the Count property itself had side-effects? At that point we'd be violating our principle of not changing observable behaviour by optimizing. Again, the optimizations are basically assuming "sensible" behaviour from collections.
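This is easy to observe. The collection below records how many times it has been iterated; the assumption being demonstrated is that Count() spots the ICollection<T> implementation and reads the Count property directly, so the iteration side-effect never fires.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

// A collection that records whether it has ever been iterated.
class NoisyCollection : ICollection<int>
{
    private readonly List<int> items = new List<int> { 1, 2, 3 };
    public int Iterations { get; private set; }

    public IEnumerator<int> GetEnumerator()
    {
        Iterations++;           // the side-effect that Count() silently skips
        return items.GetEnumerator();
    }
    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();

    public int Count => items.Count;
    public bool IsReadOnly => true;
    public bool Contains(int item) => items.Contains(item);
    public void CopyTo(int[] array, int index) => items.CopyTo(array, index);
    public void Add(int item) => throw new NotSupportedException();
    public bool Remove(int item) => throw new NotSupportedException();
    public void Clear() => throw new NotSupportedException();
}

class CountDemo
{
    static void Main()
    {
        var collection = new NoisyCollection();
        Console.WriteLine(collection.Count());     // 3
        Console.WriteLine(collection.Iterations);  // 0 - the optimization never iterated
    }
}
```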
There's a rather more subtle possible cause of side-effects which I've never seen discussed. In some situations - most obviously Skip - an operator can be implemented to move over an iterator for a time without taking each "current" value. This is due to the separation of MoveNext() from Current. What if we were dealing with an iterator which had side-effects only when Current was fetched? It would be easy to write such a sequence - but again, I suspect there's an implicit assumption that such sequences simply don't exist, or that it's reasonable for the behaviour of LINQ operators with respect to them to be left unspecified.
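Such a sequence really is easy to write. The hand-rolled iterator below has a side-effect only in the Current getter; assuming Skip is implemented by calling MoveNext() to discard items without reading Current (as the typical implementation does), the skipped items' side-effects never occur.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

// A sequence of 1..5 whose side-effect fires only when Current is read.
class CurrentSideEffects : IEnumerable<int>
{
    public List<int> Fetched { get; } = new List<int>();

    public IEnumerator<int> GetEnumerator() => new Enumerator(this);
    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();

    private class Enumerator : IEnumerator<int>
    {
        private readonly CurrentSideEffects parent;
        private int value;
        public Enumerator(CurrentSideEffects parent) { this.parent = parent; }

        public bool MoveNext() => ++value <= 5;   // no side-effects here

        public int Current
        {
            get
            {
                parent.Fetched.Add(value);        // side-effect on fetch only
                return value;
            }
        }
        object IEnumerator.Current => Current;
        public void Dispose() { }
        public void Reset() => throw new NotSupportedException();
    }
}

class SkipDemo
{
    static void Main()
    {
        var sequence = new CurrentSideEffects();
        Console.WriteLine(string.Join(", ", sequence.Skip(3)));  // 4, 5
        Console.WriteLine(string.Join(", ", sequence.Fetched));  // 4, 5 - never 1, 2, 3
    }
}
```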
Predicates, on the other hand, might not be so sensible. Suppose we were computing "sequence.Last(x => 10 / x > 1)" on the sequence { 5, 0, 2 }. Iterating over the sequence forwards, we end up with a DivideByZeroException - whereas if we detected that the sequence was a list, and worked our way backwards from the end, we'd see that 10 / 2 > 1, and return that last element (2) immediately. Of course, exceptions aren't the only kind of side-effect that a predicate can have: it could mutate other state. However, it's generally easier to spot that and cry foul that it isn't a proper functional predicate than it is to notice the possibility of an exception.
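Here's that example in full, with the hypothetical backwards-scanning "optimization" written out by hand next to the real forwards-iterating Last:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class LastDemo
{
    static void Main()
    {
        var sequence = new List<int> { 5, 0, 2 };

        // Hypothetical "optimized" Last: walk the list backwards. The
        // predicate is only ever applied to 2, so no exception occurs.
        for (int i = sequence.Count - 1; i >= 0; i--)
        {
            if (10 / sequence[i] > 1)
            {
                Console.WriteLine("Backwards: " + sequence[i]);  // Backwards: 2
                break;
            }
        }

        // The real Last iterates forwards, so the predicate reaches the 0
        // element and the division blows up.
        try
        {
            Console.WriteLine(sequence.Last(x => 10 / x > 1));
        }
        catch (DivideByZeroException)
        {
            Console.WriteLine("Forwards: DivideByZeroException");
        }
    }
}
```

Same sequence, same predicate - but the two strategies give observably different results, which is exactly why the optimization would be wrong.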
I believe this is the reason the predicated Last overload isn't optimized. It would be nice if these assumptions were documented, however.
There's a final set of assumptions around performance: the common ICollection<T> optimizations are only worthwhile if the members they rely on - Count, the indexer, CopyTo - are actually cheap, typically O(1). Nothing in the interface guarantees that.
I've had this problem before, removing items from a HashSet in Java. The problem is that there's no way of communicating this information in a standardized way. We could use attributes for everything, but it gets very complicated, and I strongly suspect it would be a complete pain to use. Basically, performance is one area where abstractions just don't hold up - or rather, the abstractions aren't designed to include performance characteristics.
Even if we knew the complexity of (say) Count that still wouldn't help us necessarily. Suppose it's an O(n) operation - that sounds bad, until you discover that for this particular horrible collection, each iteration step is also O(n) for some reason. Or maybe there's a collection with an O(1) count but a horrible constant value, whereas iterating is really quick per item... so for small values of O(n), iteration would be faster. Then you've got to bear in mind how much processor time is needed trying to work out the fastest approach... it's all bonkers.
So instead we make these assumptions, and for the most part they're correct. Just be aware of their presence.
I have reached the conclusion that I'm tired, and need sleep. I might write about Queryable, IQueryable and query expressions next time.

