[swift-evolution] [Review] SE-0180: String Index Overhaul

Mon Jun 12 13:11:11 CDT 2017

> On Jun 9, 2017, at 9:24 PM, Dave Abrahams via swift-evolution <swift-evolution at swift.org> wrote:
> on Fri Jun 09 2017, Kevin Ballard <swift-evolution at swift.org> wrote:
>> On Tue, Jun 6, 2017, at 10:57 AM, Dave Abrahams via swift-evolution wrote:
<snip>
>> 
>> Ah, right. So a String.Index is actually something similar to
>> 
>> public struct Index {
>>    public var encodedOffset: Int
>>    private var byteOffset: Int // UTF-8 offset into the UTF-8 code unit
>> }
> 
> Similar.  I'd write it this way:
> 
> public struct Index {
>   public var encodedOffset: Int
> 
>   // Offset into a UnicodeScalar represented in an encoding other
>   // than the String's underlying encoding
>   private var transcodedOffset: Int 
> }

I *think* the following is what the proposal is saying, but let me walk through my understanding:
- An index manipulated at the string level points to the start of a grapheme cluster, which is also the start of a particular code point and of a code unit in the string's underlying backing data
- The unicodeScalars view index can be intra-grapheme-cluster, pointing at a code point
- The UTF-16 index can be intra-code-point, since some code points are represented by two code units
- The UTF-8 index can be intra-code-point as well, since code points are represented by up to four code units
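
To make the four levels above concrete, here is a small sketch that counts the elements of each view for one string. It assumes Swift 4 or later (where String is itself a collection of Characters); the sample string is my own choice for illustration, not something from the proposal.

```swift
// "e" followed by U+0301 COMBINING ACUTE ACCENT, then U+1F600,
// a code point outside the Basic Multilingual Plane.
let s = "e\u{301}\u{1F600}"

let graphemeCount = s.count                 // grapheme clusters (Characters)
let scalarCount = s.unicodeScalars.count    // code points
let utf16Count = s.utf16.count              // UTF-16 code units
let utf8Count = s.utf8.count                // UTF-8 code units

print(graphemeCount, scalarCount, utf16Count, utf8Count)  // prints "2 3 4 7"
```

The first grapheme spans two code points, and the second code point in the string takes two UTF-8 code units while the last takes two UTF-16 and four UTF-8 code units, so each finer-grained view has positions the coarser views cannot address.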

So is the idea of the Index struct that encodedOffset is an offset in the native representation of the string (byte offset, word offset, etc.) to the start of a grapheme, and that transcodedOffset is data for the Unicode scalar, UTF-16, and UTF-8 views to represent an offset within a grapheme to a code point or code unit?
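
For comparison, here is a rough sketch of the reading Dave's comment suggests, where transcodedOffset counts code units within a single Unicode scalar rather than within a grapheme. Everything here is hypothetical: SketchIndex and utf8ViewIndices are names I made up, and the code simply assumes the string is backed by UTF-16, so encodedOffset is counted in UTF-16 code units. It is not the proposal's implementation.

```swift
// Hypothetical model of the two-field index, for illustration only.
struct SketchIndex: Equatable {
    var encodedOffset: Int     // offset in the assumed UTF-16 backing store
    var transcodedOffset: Int  // UTF-8 code unit offset within the current scalar
}

// Enumerates the positions a UTF-8 view would need to address
// if the string were stored as UTF-16.
func utf8ViewIndices(of s: String) -> [SketchIndex] {
    var result: [SketchIndex] = []
    var utf16Offset = 0
    for scalar in s.unicodeScalars {
        let piece = String(scalar)
        // One position per UTF-8 code unit of this scalar.
        for k in 0..<piece.utf8.count {
            result.append(SketchIndex(encodedOffset: utf16Offset, transcodedOffset: k))
        }
        utf16Offset += piece.utf16.count
    }
    return result
}

let sketchIndices = utf8ViewIndices(of: "a\u{1F600}")
for i in sketchIndices {
    print(i.encodedOffset, i.transcodedOffset)
}
```

Under this reading, "a" yields the single position (0, 0), while U+1F600 yields (1, 0) through (1, 3): one encodedOffset, four transcoded positions inside the scalar.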

My feeling is that ‘encoded’ is not enough to distinguish whether encodedOffset is meant to indicate an offset in graphemes, code points, or code units, or to make clear that indices to the same character in two identically normalized strings may differ if one string is backed by UTF-8 and the other by UTF-16. “encodedCharacterOffset” may be better.
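
As a small illustration of the encoding dependence, the sketch below measures the distance to the same character in UTF-8 and in UTF-16 code units. It assumes Swift 4.2 or later (for firstIndex(of:)) and relies on a String.Index being usable across a string's views; the sample string is my own.

```swift
let word = "caf\u{E9}!"  // "café!" with a precomposed U+00E9

// Index of the "!" character, found at the Character level.
let bang = word.firstIndex(of: "!")!

// Distance to that same position, measured in each encoding's code units.
let utf8Offset = word.utf8.distance(from: word.utf8.startIndex, to: bang)
let utf16Offset = word.utf16.distance(from: word.utf16.startIndex, to: bang)

print(utf8Offset, utf16Offset)  // prints "5 4"
```

U+00E9 takes two UTF-8 code units but only one UTF-16 code unit, so an offset described only as "encoded" means 5 in one backing store and 4 in the other.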

This index struct does limit some sorts of imagined string implementations, such as a string maintained piecewise across multiple allocation units or strings using a stateful character encoding like ISO/IEC 2022.

-DW

P.S. I’m also curious why the methods are failable (returning optionals) rather than retaining the current API and trapping with a fatal error.