[swift-evolution] Pitch: String Index Overhaul
Dave Abrahams
dabrahams at apple.com
Wed May 31 15:37:41 CDT 2017
on Tue May 30 2017, Jordan Rose <swift-evolution at swift.org> wrote:
>> On May 30, 2017, at 16:13, Dave Abrahams <dabrahams at apple.com> wrote:
>>
>>
>> on Tue May 30 2017, Jordan Rose <jordan_rose-AT-apple.com> wrote:
>>
>
>>>> On May 30, 2017, at 14:53, Dave Abrahams <dabrahams at apple.com> wrote:
>>>>
>>>>
>>>> on Tue May 30 2017, Jordan Rose <jordan_rose-AT-apple.com> wrote:
>>>>
>>>>> My knee-jerk reaction is to say it's too late in Swift 4 for this kind
>>>>> of change, but with that out of the way, I'm most concerned about what
>>>>> it means to have, say, a UTF-8 index that's not on a UTF-16 boundary.
>>>>>
>>>>> let str = "言"
>>>>> let oneUnitIn = str.utf8.index(after: str.utf8.startIndex)
>>>>> let trailingBytes = str.utf8[oneUnitIn...]
>>>>
>>>> This is not new; it exists today.
>>>
>>> Yes, I think that’s valuable. What’s different is that it’s not a String.Index.
>>>
>>>>
>>>>> What can I do with 'oneUnitIn'?
>>>>
>>>> All the usual stuff; we're not proposing to change what you can do with
>>>> it.
>>>
>>> By changing the type, you have increased the scope of where an index
>>> can be used. What happens when I use it in one of the other views and
>>> it’s not on a boundary?
>>>
>>> (I suspect the answer is “it traps” but the proposal should spell that
>>> out explicitly.)
>>
>> Sorry, I mistakenly limited the “rounding down” behavior to slicing and
>> range replacement. The index would be rounded down to the previous
>> boundary, and then used as ever.
>
> Makes sense!
>
>>
>>>
>>>>
>>>>> How do I test to see if it's on a Character boundary or a
>>>>> UnicodeScalar boundary?
>>>>
>>>> as noted,
>>>>
>>>> Replacing the failable APIs listed [above](#motivation) that detect
>>>> whether an index represents a valid position in a given view, and
>>>> enhancements that explicitly round index positions to nearby boundaries
>>>> in a given view, are left to a later proposal. For now, we do not
>>>> propose to remove the existing index conversion APIs.
>>>>
>>>> That means you can use oneUnitIn.samePosition(in: str) or
>>>> oneUnitIn.samePosition(in: str.unicodeScalars) to find out if it's on a
>>>> character or unicode scalar boundary.
>>>
>>> I’m sorry, I completely missed that. This part of the question is withdrawn.
>>>
>>> I’m also concerned about putting “UTF-16” in the documentation for
>>> encodedOffset. Either it’s a ‘utf16Offset’ or it isn’t
>>
>> It is today; hopefully it won't be someday
>>
>>> ; if it’s an opaque value then it should be treated as such.
>>
>> Today a String has underlying UTF-16-compatible storage and that's
>> documented as such, but we intend to lift that restriction and don't
>> want the names to lock us into semantics.
>
> I don’t think you should promise that about new APIs, then, or someone
> will start relying on it.

Okay, we could leave it out of this doc comment. But as long as
something documents that Strings are stored as UTF-16 (e.g. we say you
get random-access performance for the utf16 view when Foundation is
loaded), the implication is there.

>>> (It’s also a little disturbing that round-tripping through
>>> encodedOffset isn’t guaranteed to give you the same index back.)
>>
>> Define “same.”
>>
>> The encodedOffset is not the full value of an *arbitrary* index, and
>> doesn't claim to be. The indices that can be serialized and
>> reconstructed exactly using encodedOffset are those that fall on code
>> unit boundaries. Today, that means everything but UTF-8 indices. We
>> could consider exposing the transcodedOffset (offset within the UTF8
>> encoding of the scalar) as well, but I want to be conservative.
>
> I’m not sure it’s clear from the name “encodedOffset” that this is a
> lossy conversion.

It's not a conversion :-)

> I’d say it should be an optional property, but that’s probably too
> annoying in the invalid case. Maybe it should trap.

I really don't think so; IMO that would be inconsistent with the
“rounding down” behavior proposed. I think either all misaligned
accesses should trap or they should do something lenient. I proposed
lenience, but trapping is still an option.
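
To make the lenient option concrete, here is a minimal sketch of what
“rounding down” would amount to if you did it by hand with the existing
conversion APIs. roundedDownToScalarBoundary is a hypothetical helper,
not something in the proposal or the standard library, and it only
handles unicode scalar boundaries:

```swift
// Hypothetical helper (an assumption for illustration, not proposed API):
// round an index down to the nearest unicode scalar boundary at or before it.
func roundedDownToScalarBoundary(_ i: String.Index, in s: String) -> String.Index {
    var j = i
    // samePosition(in:) returns nil when j is not on a scalar boundary,
    // so step back one UTF-8 code unit at a time until it succeeds.
    while j > s.utf8.startIndex, j.samePosition(in: s.unicodeScalars) == nil {
        j = s.utf8.index(before: j)
    }
    return j
}

let str = "言"
// One UTF-8 code unit into a three-code-unit scalar: a misaligned index.
let oneUnitIn = str.utf8.index(after: str.utf8.startIndex)
let rounded = roundedDownToScalarBoundary(oneUnitIn, in: str)
print(rounded == str.startIndex)
```

Under the lenient design the views would do the equivalent of this
implicitly before using the index; under the trapping alternative, using
oneUnitIn with str.unicodeScalars would be a runtime error instead.
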
--
-Dave