[swift-evolution] Strings in Swift 4

Fri Jan 20 16:43:58 CST 2017

on Fri Jan 20 2017, Tony Allevato <swift-evolution at swift.org> wrote:

> I'm excited to see this taking shape. Thanks for all the hard work putting
> this together!
>
> A few random thoughts I had while reading it:
>
> * You talk about an integer `codeUnitOffset` property for indexes. Since
> the current String implementation can switch between backing storage of
> ASCII or UTF-16 depending on the content of the string and how it's
> obtained, presumably this means that integer is not necessarily the same as
> the offset into the buffer, correct? (In other words, for a UTF-16-stored
> string, you would have to multiply it by 2.)

Details of the buffer should not be exposed to users in the API.  This
is not an offset in bytes, but an offset in code units.  Expressing
that was the point of the name `codeUnitOffset`.  Maybe we could have
chosen a better name.
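
To illustrate the distinction, here is a rough sketch.  `codeUnitOffset`
is only a proposed name, so this uses the existing `utf16` view instead,
and the helper `utf16CodeUnitOffset(of:in:)` is hypothetical, made up
for this example:

```swift
// Hypothetical helper (not a proposed API): the position of a character,
// counted in UTF-16 code units from the start of the string.
func utf16CodeUnitOffset(of target: Character, in s: String) -> Int? {
    guard let i = s.firstIndex(of: target) else { return nil }
    return s.utf16.distance(from: s.utf16.startIndex, to: i)
}

let sample = "a😀b"   // "😀" takes two UTF-16 code units (a surrogate pair)
let unitOffset = utf16CodeUnitOffset(of: "b", in: sample)!   // 3 code units

// If the storage happened to be UTF-16, the byte offset would be a
// separate, derived quantity that the API never needs to expose.
let byteOffsetIfUTF16 = unitOffset * MemoryLayout<UInt16>.size   // 6 bytes
```

The code-unit count is the same whatever the backing buffer looks like;
the multiplication by 2 is an implementation detail.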

> * You discuss the possibility of exposing some String methods, like
> `uppercase()`, on Character. Since Swift abstracts away the encoding, it
> seems like Characters are essentially Strings that are enforced at runtime
> (and sometimes at compile time, in the case of initialization from
> literals) to contain exactly 1 grapheme cluster. Given that, I think it
> would be worthwhile for Character to support *any* method on String that
> would be sensible to operate on a single character: case transformations
> (though perhaps not titlecase?), accessing its UTF-8 or UTF-16 views, and
> so forth. 

We thought about that; it would essentially mean conforming `Character`
to `Unicode`, which would make `Character` a `BidirectionalCollection`
of `Character` elements.  I was worried that it might be confusing for
users, so I didn't want to propose it.  It's hard to say whether it would
in fact be OK.

> I would ask whether it makes sense to have a shared protocol between
> Character and String that defines those methods, but I'll defer on
> that because it feels like it would be a "bag of methods" rather than
> semantically meaningful.
>
> On that same point, if I have a lightweight (<= 63 bit) Character, many of
> those operations can only currently be performed by constructing a String
> from it, which incurs a time and heap allocation penalty. (And indeed,
> there are TODOs in the code base to avoid doing such things internally, in
> the case of Character comparisons.) Which leads me to my next thought,
> since I've been doing a lot with Swift String performance lately...
>
> * Currently, Character and String have divergent internal implementations.
> A Character can be "small" (<= 63 bits in UTF-8 packed into an integer) or
> "large" (> 63 bits with a heap-allocated buffer). 

We've been meaning to make Character's "small" representation be UTF-16,
and we intend to give String a few "small" representations.

> Strings are just backed by a heap-allocated buffer. In this write-up,
> you say "Many strings are short enough to store in 64 bits"—not just
> characters. If that's the case, can those optimizations be lowered
> into _StringCore (or its new-world counterpart), which would allow
> both Characters *and* small Strings to reap the benefits of the more
> efficient implementation? 

That's the plan.

> This would let Characters get implementations of common methods like
> `uppercase()` for free, and there would be a zero-cost conversion from
> Characters to Strings.  The only real difference between the types
> would be the APIs they vend, the semantic concept that they represent
> to users, and validation.
>
> * The talk about implicit conversions between Substring and String bums me
> out, even though I see the importance of it in this context and know that
> it outweighs the alternatives. Given that the Swift team seems to prefer
> explicit to implicit conversions in general, I would hope that if they feel
> it's important enough to make a special case for the standard library, it
> could be a language feature that you'd consider making available to
> anyone.

Not speaking for the whole team, I personally feel we should make it
generally available, but I also recognize that we'll likely have to roll
out the String reimplementation before we have time to properly design
a general “struct subtyping” feature for end-users.
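
For reference, without any implicit conversion a slice has to be
converted explicitly before it can be passed where a `String` is
expected.  A minimal sketch, assuming the slice type is spelled
`Substring`; the function `shout` is hypothetical:

```swift
// Hypothetical function that accepts only String, not a slice.
func shout(_ s: String) -> String {
    return s.uppercased()
}

let greeting = "hello, world"
let slice = greeting.prefix(5)       // a Substring sharing greeting's storage
let loud = shout(String(slice))      // explicit conversion to String
```

Passing `slice` directly to `shout` would not compile; that is the
friction an implicit conversion would remove.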

> On Fri, Jan 20, 2017 at 7:35 AM Ben Cohen via swift-evolution <
> swift-evolution at swift.org> wrote:
>
>>
>> On Jan 19, 2017, at 10:42 PM, Jose Cheyo Jimenez <cheyo at masters3d.com>
>> wrote:
>>
>> I just have one concern about the slice of a string being called
>> Substring. Why not StringSlice? The word substring can mean so many things,
>> especially in Cocoa.
>>
>>
>> This idea has a lot of merit, as does the option of not giving them a
>> top-level name at all e.g. they could be String.Slice or
>> String.SubSequence. It would underscore that they really aren’t meant to be
>> used except as the result of a slicing operation or to efficiently pass a
>> slice. OTOH, Substring is a term of art so can help with clarity.
>>
>>
>> _______________________________________________
>> swift-evolution mailing list
>> swift-evolution at swift.org
>> https://lists.swift.org/mailman/listinfo/swift-evolution
>>

-- 
-Dave