[swift-evolution] Strings in Swift 4

Fri Jan 20 16:57:43 CST 2017

on Fri Jan 20 2017, Karl Wagner <swift-evolution at swift.org> wrote:

> Very nice improvements overall!
>
>> To ease the pain of type mismatches, Substring should be a subtype
>> of String in the same way that Int is a subtype of
>> Optional<Int>. This would give users an implicit conversion from
>> Substring to String, as well as the usual implicit conversions such
>> as [Substring] to [String] that other subtype relationships receive.
>
> As others have said, it would be nice for this to be more
> general. Perhaps we can have a special type or protocol, something
> like RecursiveSlice?

A general feature for subtyping is out-of-scope for the String
redesign, so I ask that you bring it up in a separate discussion.

>> A Substring passed where String is expected will be implicitly
>> copied. When compared to the “same type, copied storage” model, we
>> have effectively deferred the cost of copying from the point where a
>> substring is created until it must be converted to String for use
>> with an API.
>
> Could noescape parameters/new memory model with borrowing make this
> more general? 

It makes everything more general.  That said, we don't have a design
yet, and one of the important premises is that users who don't want to
think about borrowing won't have to.  So because we have to design
Strings for the language we have today, and because they have to work
in a user-model without borrowing, we're going to focus on traditional
copyable value semantics.
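
As a rough illustration of the deferred-copy model quoted above, here is
a minimal sketch. It assumes a Swift 4 or later toolchain, where
Substring exists but the implicit conversion proposed in the manifesto
does not, so the conversion is spelled out explicitly. The function
name takesString is hypothetical.

```swift
// Hypothetical API that requires an owned String.
func takesString(_ s: String) -> Int {
    return s.count
}

let text = "Hello, world"
// Slicing yields a Substring, a view onto `text`; no characters are copied here.
let firstWord: Substring = text.prefix(5)
// The copy is deferred to this explicit conversion.
let length = takesString(String(firstWord))
```

Under the proposed subtype relationship, the call would presumably read
takesString(firstWord), with the same copy happening implicitly.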

> Again it seems very useful for all kinds of Collections.  

> The “Empty Subscript”
> 
> Empty subscript seems weird. IMO, it’s because of the asymmetry
> between subscripts and computed properties. I would favour a model
> which unifies computed properties and subscripts (e.g. computed
> properties could return “addressors” for in-place mutation).
> Maybe this could be an “entireCollection”/“entireSlice” computed
> property?

It could, but x.entireSlice is syntactically heavyweight compared to
x[], and x[] lives on a continuum with x[a...], x[..<b], and x[a..<b].
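
To make the continuum concrete, here is a small sketch using an Array.
The standard library does not provide x[] itself, so the unbounded-range
spelling x[...] is used as a stand-in; treating the two as equivalent is
an assumption of this example, not part of the proposal.

```swift
let x = [10, 20, 30, 40, 50]
let a = 1, b = 4

let tail = x[a...]     // from a through the end
let head = x[..<b]     // from the start up to, not including, b
let middle = x[a..<b]  // bounded on both sides
let whole = x[...]     // the entire collection, as a slice
```

Each form returns the same slice type (ArraySlice<Int> here), which is
what makes the whole-collection case read as one more point on the same
scale rather than a separate property.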

>> The goal is that Unicode exposes the underlying encoding and code
>> units in such a way that for types with a known representation
>> (e.g. a high-performance UTF8String) that information can be known
>> at compile-time and can be used to generate a single path, while
>> still allowing types like String that admit multiple representations
>> to use runtime queries and branches to fast path specializations.
>
> Typo: “unicodeScalars” is in the protocol twice.

Nice catch!

> If I understand it, CodeUnits is the thing which should always be
> defined by conformers to Unicode, and UnicodeScalars and ExtendedASCII
> could have default implementations (for example,
> UTF8String/UTF16String/3rd party conformers will use those), and
> String might decide to return its native buffer (e.g. if
> Encoding.CodeUnit == UnicodeScalar).

Yes.  I'm not certain that associated types are needed for anything but
the CodeUnits, FWIW.
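
A minimal sketch of that shape, with CodeUnits as the only required
associated type and the scalar view supplied by a default
implementation. The protocol name UnicodeSketch and the type UTF8Buffer
are hypothetical, the encoding is fixed to UTF-8 to keep it short, and
the default shown here decodes through a temporary String, which is not
the efficient path the manifesto has in mind.

```swift
// Hypothetical stand-in for the manifesto's Unicode protocol, fixed to UTF-8.
protocol UnicodeSketch {
    associatedtype CodeUnits: Collection where CodeUnits.Element == UInt8
    // The one requirement every conformer must supply.
    var codeUnits: CodeUnits { get }
}

extension UnicodeSketch {
    // Default implementation derived from codeUnits alone.
    var unicodeScalars: [Unicode.Scalar] {
        return Array(String(decoding: codeUnits, as: UTF8.self).unicodeScalars)
    }
}

// Hypothetical third-party conformer with its own storage.
struct UTF8Buffer: UnicodeSketch {
    var codeUnits: [UInt8]
}

let buffer = UTF8Buffer(codeUnits: Array("h\u{E9}llo".utf8))
let scalars = buffer.unicodeScalars
```

A conformer with a cheaper way to produce scalars could declare its own
unicodeScalars property, which would be preferred over the default when
called on the concrete type.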

> I’m just wondering how difficult it would be for a 3rd-party type to
> conform to Unicode. If you’re developing a text editor, for example,
> it’s possible that you may need to implement your own String-like type
> with some optimised storage model and it would be nice to be able to
> use generic algorithms with them. I’m thinking that you will have some
> kind of backing buffer, and you will want to expose regions of that to
> clients as Strings so that they can render them for UI or search
> through them, etc, without introducing a copy just for the semantic
> understanding that this data region contains some text content.
>
> I’ll need to examine the generic String idea more, but it’s certainly
> very interesting...
>
>> Indexes
>
> One thing which I think is critical is the ability to advance an index
> by a given number of codeUnits. I was writing some code which
> interfaced with the Cocoa NSTextStorage class, tagging parts of a
> string that a user was editing. If this was an Array, when the user
> inserts some elements before your stored indexes, those indexes become
> invalid but you can easily advance by the difference to efficiently
> have your indexes pointing to the same characters.

  a = Index(codeUnitOffset: a.codeUnitOffset + 5)

> Currently, that’s impossible with String. 

No, it's just super-cumbersome.  You have to go through the utf16 view.
But I agree we need to make it easier.
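
For reference, a sketch of the utf16-view route. It assumes a Swift 4 or
later toolchain, where String and its views share one index type, so an
index obtained from the utf16 view can be used to subscript the String
directly.

```swift
let s = "This is a test"
let utf16 = s.utf16

// Start five code units in, then advance by five more code units.
var a = utf16.index(utf16.startIndex, offsetBy: 5)
a = utf16.index(a, offsetBy: 5)

// Recover the code-unit offset and the character at that position.
let offset = utf16.distance(from: utf16.startIndex, to: a)
let character = s[a]
```

With non-ASCII content, an index advanced this way can land in the
middle of a grapheme cluster, which is the hazard raised in the quoted
text that follows.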

> If the user inserts a string at a given index, your old indexes may
> not even point to the start of a grapheme cluster any more, and
> advancing the index is needlessly costly. For example:
>
> var characters = "This is a test".characters
> assert(characters.count == 14)
>
> // Store an index to something.
> let endBeforePrepending = characters.endIndex
>
> // Insert some characters somewhere.
> let insertedCharacters = "[PREPENDED]".characters
> assert(insertedCharacters.count == 11)
> characters.replaceSubrange(characters.startIndex..<characters.startIndex, with: insertedCharacters)
>
> // This isn’t really correct.
> let endAfterPrepending = characters.index(endBeforePrepending, offsetBy: insertedCharacters.count)
> assert(endAfterPrepending == characters.endIndex) // Fails Anyway. 24 != 25
>
> The manifesto is correct to emphasise machine processing of Strings,
> but it should also ensure that machine processing of mutable Strings
> is efficient. That way we can tag backing-Strings inside
> user-interface components and maintain those indices in a unicode-safe
> way.
>
> The way to solve this would be that, when replacing or removing a
> portion of a String, you learn how many CodeUnits in the receiver’s
> encoding were inserted/removed so you can shift your indexes
> accordingly.

replaceSubrange should always return the new range of the replaced
elements, for all RangeReplaceableCollections.  We just need to
implement that.
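
A rough sketch of what that could look like, written as an extension
rather than a change to the existing requirement. The method name
replaceSubrangeReturningRange is hypothetical, and the range is
recomputed by counting elements, which costs O(n) on collections without
random access and can be off for a String when the inserted text merges
with a neighbouring grapheme cluster.

```swift
extension RangeReplaceableCollection {
    // Hypothetical helper, not a standard library API.
    @discardableResult
    mutating func replaceSubrangeReturningRange<C: Collection>(
        _ subrange: Range<Index>, with newElements: C
    ) -> Range<Index> where C.Element == Element {
        // Record positions as offsets, since the old indices may be invalidated.
        let startOffset = distance(from: startIndex, to: subrange.lowerBound)
        let insertedCount = newElements.count
        replaceSubrange(subrange, with: newElements)
        // Rebuild the range of the new elements in the mutated collection.
        let lower = index(startIndex, offsetBy: startOffset)
        let upper = index(lower, offsetBy: insertedCount)
        return lower..<upper
    }
}

var s = "This is a test"
let inserted = s.replaceSubrangeReturningRange(
    s.startIndex..<s.startIndex, with: "[PREPENDED]")
```

A caller holding indices past the replaced region could then shift them
by the length of the returned range instead of re-deriving them from
scratch.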

-- 
-Dave