[swift-evolution] Pitch: String Index Overhaul
Dave Abrahams
dabrahams at apple.com
Tue May 30 18:13:04 CDT 2017
on Tue May 30 2017, Jordan Rose <jordan_rose-AT-apple.com> wrote:
>> On May 30, 2017, at 14:53, Dave Abrahams <dabrahams at apple.com> wrote:
>>
>>
>> on Tue May 30 2017, Jordan Rose <jordan_rose-AT-apple.com> wrote:
>>
>>> My knee-jerk reaction is to say it's too late in Swift 4 for this kind
>>> of change, but with that out of the way, I'm most concerned about what
>>> it means to have, say, a UTF-8 index that's not on a UTF-16 boundary.
>>>
>>> let str = "言"
>>> let oneUnitIn = str.utf8.index(after: str.utf8.startIndex)
>>> let trailingBytes = str.utf8[oneUnitIn...]
>>
>> This is not new; it exists today.
>
> Yes, I think that’s valuable. What’s different is that it’s not a String.Index.
>
>>
>>> What can I do with 'oneUnitIn'?
>>
>> All the usual stuff; we're not proposing to change what you can do with
>> it.
>
> By changing the type, you have increased the scope of where an index
> can be used. What happens when I use it in one of the other views and
> it’s not on a boundary?
>
> (I suspect the answer is “it traps” but the proposal should spell that
> out explicitly.)
Sorry, I mistakenly limited the “rounding down” behavior to slicing and
range replacement. The index would be rounded down to the previous
boundary, and then used as ever.
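
To illustrate what "rounded down to the previous boundary" means for a
UTF-8 position viewed through the unicode scalar view, here is a minimal
sketch over raw bytes. The helper roundedDownToScalarBoundary is
hypothetical and written only for illustration; it is not a standard
library API and not necessarily how the implementation does it.

```swift
// Hypothetical helper (an assumption for illustration, not stdlib API):
// rounds a UTF-8 code unit offset down to the start of the scalar it is in.
func roundedDownToScalarBoundary(_ offset: Int, in utf8: [UInt8]) -> Int {
    precondition(offset >= 0 && offset <= utf8.count, "offset out of range")
    var i = offset
    // UTF-8 continuation bytes have the bit pattern 0b10xxxxxx;
    // step back until we reach a byte that starts a scalar.
    while i > 0 && i < utf8.count && (utf8[i] & 0b1100_0000) == 0b1000_0000 {
        i -= 1
    }
    return i
}

let bytes = Array("言".utf8)   // three code units: 0xE8, 0xA8, 0x80
let rounded = roundedDownToScalarBoundary(1, in: bytes)
```

With the string from Jordan's example, an offset one code unit in lands
inside the single scalar, so it rounds down to offset 0.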
>
>>
>>> How do I test to see if it's on a Character boundary or a
>>> UnicodeScalar boundary?
>>
>> as noted,
>>
>>   Replacing the failable APIs listed [above](#motivation) that detect
>>   whether an index represents a valid position in a given view, and
>>   enhancements that explicitly round index positions to nearby boundaries
>>   in a given view, are left to a later proposal. For now, we do not
>>   propose to remove the existing index conversion APIs.
>>
>> That means you can use oneUnitIn.samePosition(in: str) or
>> oneUnitIn.samePosition(in: str.unicodeScalars) to find out if it's on a
>> character or unicode scalar boundary.
>
> I’m sorry, I completely missed that. This part of the question is withdrawn.
>
> I’m also concerned about putting “UTF-16” in the documentation for
> encodedOffset. Either it’s a ‘utf16Offset’ or it isn’t
It is today; hopefully it won't be someday
> ; if it’s an opaque value then it should be treated as such.
Today a String has underlying UTF-16-compatible storage and that's
documented as such, but we intend to lift that restriction and don't
want the names to lock us into semantics.
> (It’s also a little disturbing that round-tripping through
> encodedOffset isn’t guaranteed to give you the same index back.)
Define “same.”
The encodedOffset is not the full value of an *arbitrary* index, and
doesn't claim to be. The indices that can be serialized and
reconstructed exactly using encodedOffset are those that fall on code
unit boundaries. Today, that means everything but UTF-8 indices. We
could consider exposing the transcodedOffset (offset within the UTF-8
encoding of the scalar) as well, but I want to be conservative.
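
As a toy model of that point, the sketch below uses a hypothetical
ModelIndex type (an assumption for illustration, not the standard
library's String.Index or its actual layout) that carries both offsets.
Reconstructing from encodedOffset alone drops the transcoded part, so
only indices on code unit boundaries survive the round trip.

```swift
// Hypothetical model of an index; NOT the real String.Index.
struct ModelIndex: Equatable {
    var encodedOffset: Int     // offset in code units of the underlying storage
    var transcodedOffset: Int  // offset within the UTF-8 encoding of the scalar

    init(encodedOffset: Int, transcodedOffset: Int = 0) {
        self.encodedOffset = encodedOffset
        self.transcodedOffset = transcodedOffset
    }
}

// An index on a code unit boundary, and one that is a single UTF-8
// code unit into the scalar at the same storage offset.
let onBoundary = ModelIndex(encodedOffset: 0)
let oneUTF8UnitIn = ModelIndex(encodedOffset: 0, transcodedOffset: 1)

// "Serialize" each index as its encodedOffset and rebuild it from that alone.
let roundTrippedBoundary = ModelIndex(encodedOffset: onBoundary.encodedOffset)
let roundTrippedMidScalar = ModelIndex(encodedOffset: oneUTF8UnitIn.encodedOffset)

let boundarySurvives = (roundTrippedBoundary == onBoundary)
let midScalarSurvives = (roundTrippedMidScalar == oneUTF8UnitIn)
```

In this model the boundary index compares equal after the round trip and
the mid-scalar UTF-8 index does not, which is the sense of "same" at
issue.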
--
-Dave