[swift-evolution] [Review] SE-0180: String Index Overhaul

Wed Jun 14 11:56:13 CDT 2017

<snipped>

> On Jun 13, 2017, at 3:21 PM, Dave Abrahams via swift-evolution <swift-evolution at swift.org> wrote:
> 
> 
> on Mon Jun 12 2017, David Waite <swift-evolution at swift.org> wrote:
> 
>> So is the idea of the Index struct that the encodedOffset is an
>> offset in the native representation of the string (byte offset, word
>> offset, etc) to the start of a grapheme, and transcodedOffset is data
>> for Unicode Scalar, UTF-16 and UTF-8 views to represent an offset
>> within a grapheme to a code point or code unit?
> 
> Almost.  First, remember that transcodedOffset is currently just a
> conceptual thing and not part of the proposed API.  But if we exposed
> it, the following would be true:
> 
>  s.indices.index(where: { $0.transcodedOffset != 0 }) == nil
>  s.unicodeScalars.indices.index(where: { $0.transcodedOffset != 0 }) == nil
> 
> and, because the native encoding of Strings is currently always UTF-16 compatible
> 
>  s.utf16.indices.index(where: { $0.transcodedOffset != 0 }) == nil
> 
> In other words, a non-zero transcodedOffset can only occur in indices
> from views that represent the string as code units in something other
> than its native encoding, and only if that view is not UTF-32.

My main misconception appears to have been that the implementation would track the beginning of a grapheme as an offset in code units, plus either the offset within that grapheme to a particular code unit or some transcoding state. That would let an index detect when it is misaligned with respect to the string, making translation of indices between views safer.
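
To make "misaligned" concrete, here is a small sketch. It uses the standard library's `samePosition(in:)` as I understand it, not the conceptual transcodedOffset, and the variable names are just for illustration:

```swift
// Illustration only: shows an index from one view that has no exact
// counterpart in another view of the same string.
let s = "\u{E9}"                          // "é": one scalar, two UTF-8 code units
let first = s.utf8.startIndex
let second = s.utf8.index(after: first)   // position of the UTF-8 continuation byte

// The first UTF-8 index sits on a scalar boundary, so it converts.
let alignedScalarIndex = first.samePosition(in: s.unicodeScalars)

// The second falls inside the scalar, so the conversion yields nil.
let misalignedScalarIndex = second.samePosition(in: s.unicodeScalars)
```

The optional result is where the mismatch surfaces today: the conversion has to check alignment at the point of translation rather than reading it off the index.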

Thinking about this more, that design would make creating an index from an encodedOffset, or incrementing an index, a potentially O(n) operation, because it would have to walk the string tracking grapheme clusters.
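
A rough sketch of why, assuming offsets are counted in UTF-16 code units. `isGraphemeAligned` is a hypothetical helper of mine, not something in the proposal:

```swift
// Hypothetical helper: reports whether a UTF-16 code unit offset lands on
// the start of a Character (grapheme cluster). It walks the string from
// the beginning, so the cost grows with the offset being checked.
func isGraphemeAligned(_ s: String, utf16Offset target: Int) -> Bool {
    var offset = 0
    for ch in s {
        if offset == target { return true }
        if offset > target { return false }
        offset += String(ch).utf16.count   // code units in this grapheme
    }
    return offset == target                // the end position counts as aligned
}

// "e" + COMBINING ACUTE ACCENT is one Character made of two code units.
let sample = "e\u{301}a"
let atStart = isGraphemeAligned(sample, utf16Offset: 0)
let insideCluster = isGraphemeAligned(sample, utf16Offset: 1)
let atSecondCharacter = isGraphemeAligned(sample, utf16Offset: 2)
```

An index that had to carry this information would need the same kind of walk whenever it was built from a bare offset.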

> 
>> or to specify that an index to the same character in two normalized
>> strings may be different if one is backed by UTF-8 and the other
>> UTF-16. “encodedCharacterOffset” may be better.
> 
> In what way does bringing the word “Character” into this improve things?

It doesn’t; it is based on my misconception above :-)

>> or strings using a stateful character encoding like ISO/IEC 2022.
> 
> I don't believe it prevents that either.  The index already has state to
> avoid repeating work when in a loop such as:
> 
>   var i = someView.startIndex
>   while i != someView.endIndex {
>      somethingWith(someView[i])   // 1
>      i = someView.index(after: i) // 2
>   }
> 
> where lines 1 and 2 both require determining the extent of the element
> in underlying code units.  There's no reason it couldn't acquire
> additional state.
> 
> The most efficient way to deal with a String in a particular encoding is
> to make a new instance of StringProtocol (say ISO_IEC_2022String), which
> would not have to use this index type.
> 
> It is planned that eventually String could actually use something like
> ISO_IEC_2022String as its backing store.  At that point, we'd have a
> choice:
> 
> 1. Allow String.Index to store arbitrary state, burdening it with the
>   cost of potential ARC traffic, or
> 
> 2. Create a limited “scratch space” using fundamental types (e.g., one
>   UInt) that every instance of StringProtocol would have to be able to
>   use to represent its state.

Yes, this is what I was thinking: the Index becomes more complex as the number of types relying on it to hold their state grows.
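
For what it's worth, this is roughly how I picture option 2. It is only a sketch: `ScratchIndex`, `ShiftState`, and the rule that the scratch word is ignored for comparison are all my assumptions, not anything from the proposal:

```swift
// Hypothetical sketch of option 2: an index made only of fundamental
// types, so copying it involves no reference counting.
struct ScratchIndex: Comparable {
    var encodedOffset: Int   // offset in the string's native code units
    var scratch: UInt        // opaque state owned by the backing encoding

    // Assumption: position alone decides equality and ordering; the
    // scratch word is treated as a cache that can be recomputed.
    static func == (lhs: ScratchIndex, rhs: ScratchIndex) -> Bool {
        return lhs.encodedOffset == rhs.encodedOffset
    }
    static func < (lhs: ScratchIndex, rhs: ScratchIndex) -> Bool {
        return lhs.encodedOffset < rhs.encodedOffset
    }
}

// Hypothetical state for an ISO/IEC 2022-style encoding: which character
// set is active at this position, packed into the scratch word.
enum ShiftState: UInt {
    case ascii = 0
    case jisX0208 = 1
}

let withState = ScratchIndex(encodedOffset: 4, scratch: ShiftState.jisX0208.rawValue)
let withoutState = ScratchIndex(encodedOffset: 4, scratch: 0)
let samePosition = (withState == withoutState)
```

Every StringProtocol type would have to fit whatever it needs into that one word, which is the complexity I was worried about.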

-DW