[swift-evolution] [RFC] New collections model: collections advance indices

Thu Mar 3 18:48:39 CST 2016

Hi Károly,

Sorry for a delayed reply!

On Tue, Mar 1, 2016 at 10:39 PM, Károly Lőrentey <karoly at lorentey.hu> wrote:
> This looks interesting! As the author of a number of custom collection
> implementations, including a rather elaborate B-tree package
> (https://github.com/lorentey/BTree), it always felt strange to me that
> indices are expected to be able to move around the collection on their
> own, while element access has to go through by the collection. It is a
> great idea to fix this asymmetry.
>
> I’ll have to carefully read through it a couple more times and look at
> the prototype branch to form a real opinion, but at first glance I
> like the proposal. Here are a couple of quick thoughts, with more to
> come once I had time to think about the implications in detail:
>
> - I’m not at a great fan of the `*IndexType` protocols in Swift 2. I
>   do not believe they work hard enough compared to how hard it is to
>   implement them, and I welcome any change that makes them even a
>   little bit simpler for the collection writer.

Thank you!  I added your point to the proposal.

> - Having to call a collection method to increment an index looks
>   unnatural at first glance, but I can see myself getting used to it
>   in a few hours.
>
> - I know that it isn't a new requirement, but I do dislike that
>   `Indexable` specifies the complexity of index operations; this puts
>   a hard constraint on custom collection design. I do understand the
>   desire for concrete complexity promises on operations using
>   indexes, but can't we express these instead e.g. in terms of number
>   of index accesses?

I think Dave answered this one.

> - I love that there is a section with detailed guidance on designing
>   tree-based collections. It’s interesting and informative.
>
> - My B-trees are persistent data structures, thus my nodes cannot have
>   parent or sibling links. Index lookup and navigation is still O(1)
>   though, as my indices contain pointers to every node on the path to
>   the current element.

Are these pointers unsafe pointers or ARC references?  If they are ARC
references, then we get extra retain/releases.  If they are unsafe
pointers, we need a safety check -- and I am interested to know what
was your approach to this.

> - I found that the most straightforward place to implement tree
>   navigation methods like `next(:)` and `advance(:by:)` is on the path
>   struct that contains the actual node references. There is no reason
>   I couldn't have the new collection methods simply call through to
>   these path methods, though -- I am currently doing the same thing in
>   the BTreeIndex type anyway.
>
> - I'm using weak references inside the index, with a (seriously
>   underdeveloped) index invalidation method that happens to be closer
>   to #2b than #2a. I'm not happy about using weak references, but this
>   seemed the most sensible thing to do. I'd love to replace them with
>   `unowned(unsafe)`, and the mutation counter seems like a great idea.
>   The ARC issue mentioned at the end of the proposal is rather scary,
>   though -- I don't know how I would protect against that.

Does your tree have in-place mutation?  (E.g., if a tree is uniquely
referenced, and there's only one live index, does tree[i] = 42
reallocate any nodes?)

I think the reasoning about unowned(safe) applies to weak references.
(The ARC issue.)

Fundamentally, the problem is that if you want an entity to be an
independent value, it has to hold strong references to all its
critical parts.  For example, an index can hold a weak reference to
something, but only if it does not care and continues to be fully
functional if that reference suddenly becomes nil.  But if a valid
index can't operate without a reference to collection's storage, it
has to be a strong reference.

You can replace a strong reference with an unsafe(unowned) reference,
but as an optimization, when it is possible to guarantee memory safety
by other means that are cheaper than ARC.  You should see no
functional difference between using a strong reference and
unsafe(unowned) + necessary safety checks.

>   Generators/iterators look to be safe from this issue,
>   so I’ll probably optimize their path representation first.

Generators/iterators are separate values.  So they should be
traversing a snapshot of the collection value.  Iterators shouldn't be
auto-updating with concurrent collection mutations.  For example, for
any collection implementation, this code terminates and appends
collection's elements to itself:

var c = getCollection()
for item in c {
  c.append(item)
}

Thus, iterators should be holding a strong reference to the
collection's storage, and make it non-uniquely referenced.

> - For mutation, I think custom APIs often make much more sense
>   than general-purpose solutions. I try to discourage use of the normal
>   B-tree index for doing complex tree mutations, and I instead provide
>   a cursor construct that was designed especially for performing
>   a batch of mutations in a batch:
>
>   https://github.com/lorentey/BTree/blob/master/Sources/BTreeCursor.swift#L295-L707
>
>   The cursor is like an index on steroids. It has an identity with
>   mutable state on its own, and it takes unique ownership of the tree
>   while it is active. This frees the cursor to disable some costly
>   invariants (such as maintaining up-to-date descendant counts in each
>   node). This in turn allows for convenient batch editing of elements
>   in the tree, with amortized O(1) insertion and removal operations.
>
>   The cursor's approach goes the exact opposite way of this proposal:
>   not only is the collection not necessary to use the cursor, but the
>   collection's value isn't even available while there is an active
>   cursor on it. (This is like how
>   `Array.withUnsafeMutableBufferPointer()` works.)

That's a very powerful concept!

> - I'm almost positive this has been discussed before, but what is the
>   rationale behind allowing non-Int `IndexDistance`s? The distance is
>   getting cast to Int in a lot of places anyway (IIRC, even the stdlib
>   uses numericCasts to cut a way through it.)
>
>     associatedtype IndexDistance : SignedIntegerType = Int
>
> - The `Indices` associated type is intriguing. I assume it is brand new?
>   It seems strange that it is allowed to hold a strong reference, but
>   I’ll have to look through the prototype code to grok it.

No, it is not new.  It is a special type for `var indices`, which is
the collection of indices.  This variable used to be typed as
`Range<Index>` -- basically wrapping a pair of indices.
`Range<Index>` won't be able to conform to Collection in this
proposal, since you can't increment all indices in isolation.
Therefore, in the general case, `var indices` has to capture
startIndex, endIndex and the collection itself.  In some cases (for
example, Array) it can be Range<Index>.  To allow using the cheapest
possible implementation, we are making the type of `var indices` an
associated type.

The reason why it needs to hold a strong reference is the same as why
generators need to hold a strong reference.  This index collection is
a separate value.  The value should be usable regardless of what
happens to other values.

Consider this example:

var c = getCollection()
for i in c.indices {
  print(i)
  c.append(c[i])
}

Imagine that `c` is uniquely referenced, and `c.indices` holds an
unowned reference to `c` that does not increase the reference count.
Then, when a new element is appended, the collection might need to
reallocate the storage.  The old storage will be deallocated.  When
`c.indices` needs to increment the index, it will try to access an
unowned reference that was deallocated, and the program will trap.

>   Superficial comment: I’m not too happy with the name. The irregular
>   plural is hard on non-native English speakers, plus it seems weird
>   to have both an `Index` and an `Indices` type. The `indices` property
>   calls it `IndexRange` (I assume by accident); I think I like that
>   name better.

I'll let Dave answer this one :)

Dmitri

-- 
main(i,j){for(i=2;;i++){for(j=2;j<i;j++){if(!(i%j)){j=0;break;}}if
(j){printf("%d\n",i);}}} /*Dmitri Gribenko <gribozavr at gmail.com>*/