[swift-evolution] [RFC] New collections model: collections advance indices

Thu Mar 3 10:25:45 CST 2016

on Tue Mar 01 2016, Károly Lőrentey <swift-evolution at swift.org> wrote:

> This looks interesting! As the author of a number of custom collection
> implementations, including a rather elaborate B-tree package
> (https://github.com/lorentey/BTree), 

FWIW, I am impressed by (even jealous of) this work, so your feedback in
this area is much appreciated.

> it always felt strange to me that indices are expected to be able to
> move around the collection on their own, while element access has to
> go through by the collection. It is a great idea to fix this
> asymmetry.
>
> I’ll have to carefully read through it a couple more times and look at
> the prototype branch to form a real opinion, but at first glance I
> like the proposal. Here are a couple of quick thoughts, with more to
> come once I had time to think about the implications in detail:
>
> - I’m not at a great fan of the `*IndexType` protocols in Swift 2. I
>   do not believe they work hard enough compared to how hard it is to
>   implement them, and I welcome any change that makes them even a 
>   little bit simpler for the collection writer.
>
> - Having to call a collection method to increment an index looks
>   unnatural at first glance, but I can see myself getting used to it
>   in a few hours.

One thing that hasn't been mentioned so far is that when the algorithms
using indices are extensions on the collection protocol (or members of a
specific collection) these APIs can be used without qualification, which
makes them read like free functions, which ends up looking quite natural
to my eye.

> - I know that it isn't a new requirement, but I do dislike that
>   `Indexable` specifies the complexity of index operations; this puts
>   a hard constraint on custom collection design. I do understand the
>   desire for concrete complexity promises on operations using
>   indexes, but can't we express these instead e.g. in terms of number
>   of index accesses?

The problem is that eventually it becomes really difficult to simply
describe the efficiency of any given algorithm.  We don't want people to
have to do complex algebra to understand how algorithms compose with
data structures.

Yes, it's a trade-off, and loosening the upper bound on the cost of
indexing was one of the things we considered.  We tried to carefully
think through different possible collection designs (trees in
particular!) to understand what the implications of keeping the O(1)
upper bound were.  We'd be happy to discuss specific tradeoffs with you.

> - I love that there is a section with detailed guidance on designing 
>   tree-based collections. It’s interesting and informative.
>
> - My B-trees are persistent data structures, thus my nodes cannot have
>   parent or sibling links. Index lookup and navigation is still O(1)
>   though, as my indices contain pointers to every node on the path to
>   the current element. Since I have to keep looking up these nodes
>   anyway to retrieve elements and to navigate around in the tree, I
>   simply decided to keep them directly in the index. B-trees are super
>   shallow, so there are only a handful of nodes on any path.

Yeah, one other way to get O(1) here is to consider that log N is
bounded by a small constant, for practical purposes.

> - I found that the most straightforward place to implement tree
>   navigation methods like `next(:)` and `advance(:by:)` is on the path
>   struct that contains the actual node references. There is no reason
>   I couldn't have the new collection methods simply call through to
>   these path methods, though -- I am currently doing the same thing in
>   the BTreeIndex type anyway.
>
> - I'm using weak references inside the index, with a (seriously
>   underdeveloped) index invalidation method that happens to be closer
>   to #2b than #2a. I'm not happy about using weak references, but this
>   seemed the most sensible thing to do. I'd love to replace them with
>   `unowned(unsafe)`, and the mutation counter seems like a great idea.
>   The ARC issue mentioned at the end of the proposal is rather scary,
>   though -- I don't know how I would protect against that.

Hmm, Dmitri, I thought we decided that using unowned(unsafe) to manage
the "indices" collection was simply untenable.  Remember the
thread-safety issue, iterating indices on one thread while mutating the
collection on another?

>   Generators/iterators look to be safe from this issue,
>   so I’ll probably optimize their path representation first. 
>
> - For mutation, I think custom APIs often make much more sense
>   than general-purpose solutions. I try to discourage use of the normal
>   B-tree index for doing complex tree mutations, and I instead provide 
>   a cursor construct that was designed especially for performing 
>   a batch of mutations in a batch:
>
>   https://github.com/lorentey/BTree/blob/master/Sources/BTreeCursor.swift#L295-L707 
>
>   The cursor is like an index on steroids. It has an identity with
>   mutable state on its own, and it takes unique ownership of the tree
>   while it is active. This frees the cursor to disable some costly
>   invariants (such as maintaining up-to-date descendant counts in each
>   node). This in turn allows for convenient batch editing of elements
>   in the tree, with amortized O(1) insertion and removal operations.
>
>   The cursor's approach goes the exact opposite way of this proposal:
>   not only is the collection not necessary to use the cursor, but the
>   collection's value isn't even available while there is an active
>   cursor on it. (This is like how
>   `Array.withUnsafeMutableBufferPointer()` works.)
>
> - I'm almost positive this has been discussed before, but what is the
>   rationale behind allowing non-Int `IndexDistance`s? 

One could imagine indexing a file-backed thing on a 32-bit platform,
which could require 64-bit distances.

>   The distance is getting cast to Int in a lot of places anyway (IIRC,
>   even the stdlib uses numericCasts to cut a way through it.)

We've tried to be careful enough in balancing ease-of-use (Int
almost everywhere) with flexibility, but we might have made mistakes,
for sure.

>     associatedtype IndexDistance : SignedIntegerType = Int

> - The `Indices` associated type is intriguing. I assume it is brand new?
>   It seems strange that it is allowed to hold a strong reference, but
>   I’ll have to look through the prototype code to grok it.
>
>   Superficial comment: I’m not too happy with the name. The irregular
>   plural is hard on non-native English speakers, plus it seems weird
>   to have both an `Index` and an `Indices` type. The `indices` property
>   calls it `IndexRange` (I assume by accident); I think I like that 
>   name better.
>
> - In this declaration:
>
>     subscript(position: Index) -> Generator.Element { get }
>
>   I find the argument name rather unfortunate, because I've been using
>   the term "position" to consistently refer to the (numerical)
>   position of an element in an ordered collection, 

“Offset” would be a better name for that, IMO.

>   which is typically not the same as the element's index. Could we
>   just quietly rename this to `index` or `i`? :-)
>
>
>> On 2016-03-02, at 03:04, Dmitri Gribenko <gribozavr at gmail.com> wrote:
>> 
>> Hi,
>> 
>> We would like to propose a major change to how collection indices
>> work.  The standard library team has discussed this idea internally
>> and we wrote a prototype.  Now we think it is a viable direction to
>> consider, and we are bringing it for wider public discussion.
>> 
>> I'm pasting the first section of the proposal below to give you a
>> general idea about this change, but please read the proposal to
>> understand the full details.
>> 
>> You can find the most up to date version of the proposal at
>> https://github.com/gribozavr/swift-evolution/blob/new-collections/proposals/NNNN-collections-move-indices.md
>> 
>> Permalink:
>> https://github.com/gribozavr/swift-evolution/blob/87df19a9a9d73e64a2a966b807440216a608b8ad/proposals/NNNN-collections-move-indices.md
>> 
>> Dmitri
>> 
>> ## Introduction
>> 
>> We are proposing a new model for collections, where indices can only be
>> advanced forward or backward by the corresponding collection instance.
>> Indices become opaque tokens representing collection positions, that can
>> be produced and consumed by collection APIs.  This allows us to reduce
>> the amount of data stored in indices to the bare minimum.
>> 
>> Compared to the current state, the new scheme simplifies implementation
>> of non-trivial indices, and fixes concurrency issues in `Set` and
>> `Dictionary` indices.  It also allows us to eliminate reference-counted
>> stored properties from most indices, including non-trivial ones, like
>> `Set.Index` and `Dictionary.Index`, creating more optimizable code.
>> 
>> Out of scope for this proposal:
>> 
>> * Expanding the set of concrete collections provided by the standard
>>  library.
>> 
>> * Expanding the set of collection protocols to provide functionality
>>  beyond what is already provided (for example, protocols for sorted
>>  collections, queues etc.)  Discussing how other concrete collections
>>  fit into the current protocol hierarchy is in scope, though.

-- 
-Dave