[swift-evolution] [RFC] New collections model: collections advance indices

Tue Mar 8 09:09:35 CST 2016

On 2016-03-04 00:48:39 +0000, Dmitri Gribenko via swift-evolution said:

> Hi Károly,
> 
> Sorry for a delayed reply!

Ditto on my part! :-)

> On Tue, Mar 1, 2016 at 10:39 PM, Károly Lőrentey <karoly at lorentey.hu> wrote:
>> - My B-trees are persistent data structures, thus my nodes cannot have
>> parent or sibling links. Index lookup and navigation is still O(1)
>> though, as my indices contain pointers to every node on the path to
>> the current element.
> 
> Are these pointers unsafe pointers or ARC references?  If they are ARC
> references, then we get extra retain/releases.  If they are unsafe
> pointers, we need a safety check -- and I am interested to know what
> was your approach to this.

They are strong ARC references in iterators (and cursors), and weak
references in indices.

It seems to me I could replace all references on the path with
unowned(unsafe), except for a single strong (or weak) reference to the
root node. Keeping a reference to the root node will keep the rest
of the nodes in its subtree alive anyway.

The path rarely changes during iteration (there can be hundreds of
elements under each node), so I expect eliminating retain/release
would have a minimal effect there; but path creation and some
random-access navigation patterns could benefit from it.

>> - I found that the most straightforward place to implement tree
>> navigation methods like `next(:)` and `advance(:by:)` is on the path
>> struct that contains the actual node references. There is no reason
>> I couldn't have the new collection methods simply call through to
>> these path methods, though -- I am currently doing the same thing in
>> the BTreeIndex type anyway.
>> 
>> - I'm using weak references inside the index, with a (seriously
>> underdeveloped) index invalidation method that happens to be closer
>> to #2b than #2a. I'm not happy about using weak references, but this
>> seemed the most sensible thing to do. I'd love to replace them with
>> `unowned(unsafe)`, and the mutation counter seems like a great idea.
>> The ARC issue mentioned at the end of the proposal is rather scary,
>> though -- I don't know how I would protect against that.
> 
> Does your tree have in-place mutation?  (E.g., if a tree is uniquely
> referenced, and there's only one live index, does tree[i] = 42
> reallocate any nodes?)

Yes, BTree does in-place mutation for uniquely referenced nodes.
The half-baked index invalidation algorithm I currently have in place
does not always detect that this happened; I'd rather prefer to
invalidate all indices on most mutations. A per-node mutation counter
seems like a good way to ensure this.

> I think the reasoning about unowned(safe) applies to weak references.
> (The ARC issue.)
> 
> Fundamentally, the problem is that if you want an entity to be an
> independent value, it has to hold strong references to all its
> critical parts.  For example, an index can hold a weak reference to
> something, but only if it does not care and continues to be fully
> functional if that reference suddenly becomes nil.  But if a valid
> index can't operate without a reference to collection's storage, it
> has to be a strong reference.

Yes, but #2a-style strict invalidation means that an index whose node(s)
have been deallocated is guaranteed to be invalid, either because of a
tree mutation, or because the tree that produced the index is not around
anymore (which is a kind of tree mutation).

> You can replace a strong reference with an unsafe(unowned) reference,
> but as an optimization, when it is possible to guarantee memory safety
> by other means that are cheaper than ARC.  You should see no
> functional difference between using a strong reference and
> unsafe(unowned) + necessary safety checks.
> 
>> Generators/iterators look to be safe from this issue,
>> so I’ll probably optimize their path representation first.
> 
> Generators/iterators are separate values.  So they should be
> traversing a snapshot of the collection value.  Iterators shouldn't be
> auto-updating with concurrent collection mutations.  For example, for
> any collection implementation, this code terminates and appends
> collection's elements to itself:
> 
> var c = getCollection()
> for item in c {
>   c.append(item)
> }
> 
> Thus, iterators should be holding a strong reference to the
> collection's storage, and make it non-uniquely referenced.

Understood.

>> - The `Indices` associated type is intriguing. I assume it is brand new?
>> It seems strange that it is allowed to hold a strong reference, but
>> I’ll have to look through the prototype code to grok it.
> 
> No, it is not new.  It is a special type for `var indices`, which is
> the collection of indices.  This variable used to be typed as
> `Range<Index>` -- basically wrapping a pair of indices.
> `Range<Index>` won't be able to conform to Collection in this
> proposal, since you can't increment all indices in isolation.
> Therefore, in the general case, `var indices` has to capture
> startIndex, endIndex and the collection itself.  In some cases (for
> example, Array) it can be Range<Index>.  To allow using the cheapest
> possible implementation, we are making the type of `var indices` an
> associated type.
> 
> The reason why it needs to hold a strong reference is the same as why
> generators need to hold a strong reference.  This index collection is
> a separate value.  The value should be usable regardless of what
> happens to other values.

Ah, interesting.

> Consider this example:
> 
> var c = getCollection()
> for i in c.indices {
>   print(i)
>   c.append(c[i])
> }
> 
> Imagine that `c` is uniquely referenced, and `c.indices` holds an
> unowned reference to `c` that does not increase the reference count.
> Then, when a new element is appended, the collection might need to
> reallocate the storage.  The old storage will be deallocated.  When
> `c.indices` needs to increment the index, it will try to access an
> unowned reference that was deallocated, and the program will trap.

My problem is that this example assumes that mutating `c` at a certain
index won't invalidate earlier indices. And even if that is true for
`c`, whenever `c.indices` holds a strong reference, an idiom like

  for i in c.indices {
    c.mutate(i)
  }

will certainly lead to COW copying that isn't happening today with

  var i = c.startIndex
  while i != c.endIndex {
    c.mutate(i)
    i = i.successor()
  }

How would I emulate the exact semantics of today's code with `c.indices`?

-- 
Károly