[swift-evolution] [RFC] New collections model: collections advance indices

Fri Mar 4 08:50:28 CST 2016

on Thu Mar 03 2016, Károly Lőrentey <swift-evolution at swift.org> wrote:

> On 2016-03-03 16:25:45 +0000, Dave Abrahams via swift-evolution said:
>>
>> on Tue Mar 01 2016, Károly Lőrentey <swift-evolution at swift.org> wrote:
>>
>>> This looks interesting! As the author of a number of custom collection
>>> implementations, including a rather elaborate B-tree package
>>> (https://github.com/lorentey/BTree),
>>
>> FWIW, I am impressed by (even jealous of) this work, so your feedback in
>> this area is much appreciated.
>
> Why thank you. *blush*
>
>> One thing that hasn't been mentioned so far is that when the algorithms
>> using indices are extensions on the collection protocol (or members of a
>> specific collection) these APIs can be used without qualification, which
>> makes them read like free functions, which ends up looking quite natural
>> to my eye.
>
> Nice!
>
>>> - I know that it isn't a new requirement, but I do dislike that
>>> `Indexable` specifies the complexity of index operations; this puts
>>> a hard constraint on custom collection design. I do understand the
>>> desire for concrete complexity promises on operations using
>>> indexes, but can't we express these instead e.g. in terms of number
>>> of index accesses?
>>
>> The problem is that eventually it becomes really difficult to simply
>> describe the efficiency of any given algorithm.  We don't want people to
>> have to do complex algebra to understand how algorithms compose with
>> data structures.
>>
>> Yes, it's a trade-off, and loosening the upper bound on the cost of
>> indexing was one of the things we considered.  We tried to carefully
>> think through different possible collection designs (trees in
>> particular!) to understand what the implications of keeping the O(1)
>> upper bound were.  We'd be happy to discuss specific tradeoffs with you.
>
> That's fair! So far, I've always been able to implement constant access,
> and the code has only become better of it.
>
> The only place where I intentionally decided against it is the tree-backed,
> Array-like `List`. It really wants to be indexed by an Int, to emulate
> Array's loose approach to index invalidation. (Although, like you said,
> B-tree lookup is practically O(1) anyway, so maybe I'll just call it that.)
>
> This reminds me that index invalidation rules are all over the place across
> the various collection types and not very well documented. Some ideas for
> improvements:
>
> - From the sample code in its documentation, it looks like
>  `MutableCollection` requires subscript assignment to not invalidate
>  indices. Can we make this a formal requirement?

IMO yes, we should; IIRC it's intended to be one.  Part of Dmitri's
project has been to develop an understanding of what indices are
supposed to be, conceptually.  We settled on “a path to a node in the
data structure,” which allows us to say with confidence that subscript
assignment doesn't invalidate any indices.  Making this guarantee does
have some side-effects, e.g., that when assigning through a subscript of
a COW value-type data structure with a non-unique reference, you may not
be able to rebalance or compress away unused storage in the new unique
copy.

> - What about range assignment in `MutableCollection`? E.g., what
> happens to indices if it changes the element count? 

I think that's unspecified and in the general case that means indices
are invalidated.

> (Is supporting that a requirement?)

Yes, IIRC supporting that is a requirement; it's sort of implied by
being able to do in-place mutation like c[a..<b].sort().

> - Does `RangeReplacableCollection` imply Array-like index
> invalidation?  I think it should.

“Array-like” is too vague for me to answer that question.  Certainly
array will give some invalidation guarantees that are not always
available from `RangeReplacableCollection`s in general.

> By the way, does the requirement for (amortized) constant complexity also
> apply to advancing an index? That's OK for trees, but it may not come cheap
> for hashed collections. 

I think you have to handle this by ensuring some minimum load factor.
As a result, deletions must be allowed to invalidate indices of elements
that aren't being deleted, which is, after all, true of Array.

> Native dictionaries & sets currently have an index.successor() with
> O(n) complexity (if I'm not mistaken).

Yes, ensuring minimum load factor is something we haven't gotten around
to yet.

> What about `Collection.indices`? Getting the first index will definitely
> take O(log(n)) for tree collections, 

Not if you store that index in the root of the tree.

> and bridged dictionaries/sets have this at O(n).

Bridging throws all efficiency guarantees out the window anyway, since
anybody is allowed to subclass any of those Foundation classes and give
it arbitrarily bad performance.

>>> - I'm using weak references inside the index, with a (seriously
>>> underdeveloped) index invalidation method that happens to be closer
>>> to #2b than #2a. I'm not happy about using weak references, but this
>>> seemed the most sensible thing to do. I'd love to replace them with
>>> `unowned(unsafe)`, and the mutation counter seems like a great idea.
>>> The ARC issue mentioned at the end of the proposal is rather scary,
>>> though -- I don't know how I would protect against that.
>>
>> Hmm, Dmitri, I thought we decided that using unowned(unsafe) to manage
>> the "indices" collection was simply untenable.  Remember the
>> thread-safety issue, iterating indices on one thread while mutating the
>> collection on another?
>
> Hm, aren't multithreaded mutations supposed to use COW to get their own
> copy to mutate?

Yes, but if you use unowned(unsafe), you don't allow the evaluation of
`x.indices` to bump the reference count, and you thereby defeat this
mechanism.  That's my point.

> Or is this about invalidation of the return value of
> `Collection.indices`?

That's one way to look at it.

> 
> That's not supposed to survive (all) mutations, is it?

It is.

> (Ugh, how do I refer to an instance of Indices (the associated type) without
> confusing it with indices in the sense of multiple `Index`es?)

“an instance of Indices” works fine.  You can also use `x.indices`.

>>> - I'm almost positive this has been discussed before, but what is the
>>> rationale behind allowing non-Int `IndexDistance`s?
>>
>> One could imagine indexing a file-backed thing on a 32-bit platform,
>> which could require 64-bit distances.
>
> Ah, OK! But it begs the question: would a file-backed collection be able
> to satisfy the O(1) indexing requirement without making a complete
> mockery of it? :-)

Sure; it has a large constant factor, but we do expect disk access times
to be bounded by a constant.

>>> The distance is getting cast to Int in a lot of places anyway (IIRC,
>>> even the stdlib uses numericCasts to cut a way through it.)
>>
>> We've tried to be careful enough in balancing ease-of-use (Int
>> almost everywhere) with flexibility, but we might have made mistakes,
>> for sure.
>
> I think the largest speed bump there is that sequences and collections
> are often loaded into Arrays, and an oversized collection would have to
> tread very carefully to avoid all of these places. 

Sequences much more so than collections: this happens when we need to
create multi-pass-ness from thin air.

> The API is geared towards finite (and relatively small-ish)
> collections/sequences.  I suspect a Collection with a billion elements
> wouldn't work much better than an infinite sequence.

I don't know why you say so.  I could easily imagine doing a binary
search in such a thing, for example.

>>> - In this declaration:
>>>
>>> subscript(position: Index) -> Generator.Element { get }
>>>
>>> I find the argument name rather unfortunate, because I've been using
>>> the term "position" to consistently refer to the (numerical)
>>> position of an element in an ordered collection,
>>
>> “Offset” would be a better name for that, IMO.
>
> That's spot on! Although in some contexts it could be misunderstood to
> mean a delta. I'll sleep on it, but I think you're right.

I think “numerical position” wouldn't hurt your prose much, and if it
does you're probably overly-reliant on that concept where “position”
would do fine.  But I haven't seen your text, so I'm really just
speculating.

-- 
-Dave