[swift-evolution] Pitch: String Index Overhaul

Wed May 31 15:37:41 CDT 2017

on Tue May 30 2017, Jordan Rose <swift-evolution at swift.org> wrote:

>> On May 30, 2017, at 16:13, Dave Abrahams <dabrahams at apple.com> wrote:
>> 
>> 
>> on Tue May 30 2017, Jordan Rose <jordan_rose-AT-apple.com> wrote:
>> 
>
>>>> On May 30, 2017, at 14:53, Dave Abrahams <dabrahams at apple.com> wrote:
>>>> 
>>>> 
>>>> on Tue May 30 2017, Jordan Rose <jordan_rose-AT-apple.com> wrote:
>>>> 
>>>>> My knee-jerk reaction is to say it's too late in Swift 4 for this kind
>>>>> of change, but with that out of the way, I'm most concerned about what
>>>>> it means to have, say, a UTF-8 index that's not on a UTF-16 boundary.
>>>>> 
>>>>> let str = "言"
>>>>> let oneUnitIn = str.utf8.index(after: str.utf8.startIndex)
>>>>> let trailingBytes = str.utf8[oneUnitIn...]
>>>> 
>>>> This is not new; it exists today.
>>> 
>>> Yes, I think that’s valuable. What’s different is that it’s not a String.Index.
>>> 
>>>> 
>>>>> What can I do with 'oneUnitIn'? 
>>>> 
>>>> All the usual stuff; we're not proposing to change what you can do with
>>>> it.
>>> 
>>> By changing the type, you have increased the scope of where an index
>>> can be used. What happens when I use it in one of the other views and
>>> it’s not on a boundary?
>>> 
>>> (I suspect the answer is “it traps” but the proposal should spell that
>>> out explicitly.)
>> 
>> Sorry, I mistakenly limited the “rounding down” behavior to slicing and
>> range replacement.  The index would be rounded down to the previous
>> boundary, and then used as ever.
>
> Makes sense!
>
>> 
>>> 
>>>> 
>>>>> How do I test to see if it's on a Character boundary or a
>>>>> UnicodeScalar boundary?
>>>> 
>>>> as noted,
>>>> 
>>>> Replacing the failable APIs listed [above](#motivation) that detect
>>>> whether an index represents a valid position in a given view, and
>>>> enhancements that explicitly round index positions to nearby boundaries
>>>> in a given view, are left to a later proposal.  For now, we do not
>>>> propose to remove the existing index conversion APIs.
>>>> 
>>>> That means you can use oneUnitIn.samePosition(in: str) or
>>>> oneUnitIn.samePosition(in: str.unicodeScalars) to find out if it's on a
>>>> character or Unicode scalar boundary.
>>> 
>>> I’m sorry, I completely missed that. This part of the question is withdrawn.
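
For anyone following along, here is a minimal sketch of the boundary check
described above, using the existing samePosition(in:) conversions on the
example string from earlier in the thread. The variable names are only
illustrative.

```swift
let str = "言"  // U+8A00: three UTF-8 code units, one UTF-16 code unit
let oneUnitIn = str.utf8.index(after: str.utf8.startIndex)

// samePosition(in:) returns nil when the index is not a valid position
// in the target view, so a nil check doubles as a boundary test.
let onCharacterBoundary = oneUnitIn.samePosition(in: str) != nil
let onScalarBoundary = oneUnitIn.samePosition(in: str.unicodeScalars) != nil

// For comparison: the start index is a valid position in every view.
let startOnCharacterBoundary = str.utf8.startIndex.samePosition(in: str) != nil
```

Here oneUnitIn sits inside the single scalar, so neither check should
succeed, while the start index passes.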
>>> 
>>> I’m also concerned about putting “UTF-16” in the documentation for
>>> encodedOffset. Either it’s a ‘utf16Offset’ or it isn’t
>> 
>> It is today; hopefully it won't be someday
>> 
>>> ; if it’s an opaque value then it should be treated as such. 
>> 
>> Today a String has underlying UTF-16-compatible storage and that's
>> documented as such, but we intend to lift that restriction and don't
>> want the names to lock us into semantics.
>
> I don’t think you should promise that about new APIs, then, or someone
> will start relying on it.

Okay, we could leave it out of this doc comment.  But as long as
something documents that Strings are stored as UTF-16 (e.g. we say you
get random-access performance for the utf16 view when Foundation is
loaded), the implication is there.

>>> (It’s also a little disturbing that round-tripping through
>>> encodedOffset isn’t guaranteed to give you the same index back.)
>> 
>> Define “same.”  
>> 
>> The encodedOffset is not the full value of an *arbitrary* index, and
>> doesn't claim to be.  The indices that can be serialized and
>> reconstructed exactly using encodedOffset are those that fall on code
>> unit boundaries.  Today, that means everything but UTF-8 indices.  We
>> could consider exposing the transcodedOffset (offset within the UTF8
>> encoding of the scalar) as well, but I want to be conservative.
>
> I’m not sure it’s clear from the name “encodedOffset” that this is a
> lossy conversion. 

It's not a conversion :-)
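
To illustrate why only code-unit-aligned indices survive a trip through
encodedOffset, here is a toy model. ModelIndex and roundTrip are
hypothetical names, and this is an assumption-laden sketch, not the
standard library's actual String.Index layout.

```swift
// Hypothetical model of an index: NOT the real String.Index representation.
struct ModelIndex: Equatable {
    // Offset in code units of the underlying storage.
    var encodedOffset: Int
    // Offset within the UTF-8 encoding of the scalar; 0 on a code unit boundary.
    var transcodedOffset: Int

    init(encodedOffset: Int, transcodedOffset: Int = 0) {
        self.encodedOffset = encodedOffset
        self.transcodedOffset = transcodedOffset
    }
}

// Rebuild an index from its encodedOffset alone, as a serializer would.
func roundTrip(_ i: ModelIndex) -> ModelIndex {
    return ModelIndex(encodedOffset: i.encodedOffset)
}

let aligned = ModelIndex(encodedOffset: 1)
// One UTF-8 code unit into the scalar at storage offset 0.
let midScalar = ModelIndex(encodedOffset: 0, transcodedOffset: 1)

let alignedSurvives = roundTrip(aligned) == aligned
let midScalarSurvives = roundTrip(midScalar) == midScalar
```

In this model the aligned index comes back unchanged, while the mid-scalar
UTF-8 index loses its transcodedOffset and comes back as the start of the
scalar.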

> I’d say it should be an optional property, but that’s probably too
> annoying in the invalid case. Maybe it should trap.

I really don't think so; IMO that would be inconsistent with the
“rounding down” behavior proposed.  I think either all misaligned
accesses should trap or they should do something lenient.  I proposed
lenience, but trapping is still an option.
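
As a rough sketch of the two policies, here is what "round down" versus
"trap" could look like over raw UTF-8 bytes. Both functions are
hypothetical helpers written for this message, not proposed or existing
standard library APIs, and they assume well-formed UTF-8.

```swift
// Hypothetical helper: lenient policy. Rounds a UTF-8 offset down to the
// start of the scalar that contains it.
func roundedDownToScalarBoundary(_ offset: Int, in utf8: [UInt8]) -> Int {
    precondition(offset >= 0 && offset <= utf8.count, "offset out of bounds")
    var i = offset
    // Continuation bytes have the bit pattern 10xxxxxx.
    while i > 0 && i < utf8.count && (utf8[i] & 0xC0) == 0x80 {
        i -= 1
    }
    return i
}

// Hypothetical helper: strict policy. Traps on any misaligned offset.
func requireScalarBoundary(_ offset: Int, in utf8: [UInt8]) -> Int {
    precondition(roundedDownToScalarBoundary(offset, in: utf8) == offset,
                 "offset is not on a scalar boundary")
    return offset
}

let bytes = Array("a言".utf8)  // 0x61, then 0xE8 0xA8 0x80
let lenient = roundedDownToScalarBoundary(2, in: bytes)
let alreadyAligned = requireScalarBoundary(1, in: bytes)
```

With the lenient helper, offset 2 (inside the second scalar) rounds down to
offset 1; passing 2 to the strict helper would trap instead.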

-- 
-Dave