[swift-evolution] [Review] SE-0180: String Index Overhaul

Tue Jun 13 16:21:47 CDT 2017

on Mon Jun 12 2017, David Waite <swift-evolution at swift.org> wrote:

>> On Jun 9, 2017, at 9:24 PM, Dave Abrahams via swift-evolution
> <swift-evolution at swift.org> wrote:
>> on Fri Jun 09 2017, Kevin Ballard
>> <swift-evolution at swift.org
>> <mailto:swift-evolution at swift.org>>
>> wrote:
>>> On Tue, Jun 6, 2017, at 10:57 AM, Dave Abrahams via swift-evolution wrote:
> <snip>
>>> 
>>> Ah, right. So a String.Index is actually something similar to
>>> 
>>> public struct Index {
>>>    public var encodedOffset: Int
>>>
>>>    private var byteOffset: Int // UTF-8 offset into the UTF-8 code unit
>>> }
>> 
>> Similar.  I'd write it this way:
>> 
>> public struct Index {
>>   public var encodedOffset: Int
>> 
>>   // Offset into a UnicodeScalar represented in an encoding other
>>   // than the String's underlying encoding
>>   private var transcodedOffset: Int 
>> }
>
> I *think* the following is what the proposal is saying, but let me
> walk through it:

OK. I'm going to be extremely nitpicky about terminology just to ensure
complete clarity; please don't take it as criticism.

> My understanding would be:
> - An index manipulated at the string level points to the start of a
> grapheme cluster, which is also the start of a particular code point

* A grapheme cluster is not a code point

* Probably you mean that it also points to the start of a code point

* We try not to say “code point” because

  a) despite its loose and liberal use in the Unicode standard,
     according to Unicode experts that term technically means something
     having specifically to do with UTF-16 (IIRC the space of code
     points includes surrogate values), and while it was the same thing
     as a Unicode scalar value in the days of UCS-2, is mostly not a
     useful concept today.

  b) the potential for confusion between “code unit” and “code point” is
     huge; people mix them up all the time.

  c) Instead we use “Unicode scalar value” or “Unicode scalar” for
     short; my advice is to banish the term “code point” from your
     vocabulary as I have—except when picking nits ;-)

> and to a code unit of the underlying string backing data 

Yes.  If String indices were Hashable, then these would all be true:

    Set(s.indices).isSubset(of: s.unicodeScalars.indices)
    Set(s.unicodeScalars.indices).isSubset(of: s.utf16.indices)
    Set(s.unicodeScalars.indices).isSubset(of: s.utf8.indices)

(the views also all have the same endIndex)
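
As a rough sketch of those relations: in toolchains released after this
thread (Swift 4.2 and later, as far as I know), String.Index is Hashable,
so the checks can be written directly. The sample string is my own choice,
and native storage in those toolchains is UTF-8 rather than UTF-16, but
the subset relations should hold regardless.

```swift
// Sample string chosen for illustration: a combining sequence,
// a flag (two scalars, one Character), and a non-BMP scalar.
let s = "e\u{301}🇺🇸😀"

let characterIndices = Set(s.indices)
let scalarIndices = Set(s.unicodeScalars.indices)
let utf16Indices = Set(s.utf16.indices)
let utf8Indices = Set(s.utf8.indices)

// Every Character boundary is also a Unicode scalar boundary...
let charactersInScalars = characterIndices.isSubset(of: scalarIndices)
// ...and every scalar boundary is also a code unit boundary in each view.
let scalarsInUTF16 = scalarIndices.isSubset(of: utf16Indices)
let scalarsInUTF8 = scalarIndices.isSubset(of: utf8Indices)

// All views share one endIndex.
let sameEnd = s.endIndex == s.unicodeScalars.endIndex
    && s.endIndex == s.utf16.endIndex
    && s.endIndex == s.utf8.endIndex

print(charactersInScalars, scalarsInUTF16, scalarsInUTF8, sameEnd)
```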

Today, the code units are utf16.  If we lift that restriction and add a
codeUnits view, then

    Set(s.indices).isSubset(of: s.codeUnits.indices)

> - The unicodeScalar view can be intra-grapheme cluster, pointing at a
> code point 

I don't follow, sorry.  I think the unicodeScalar view doesn't point at
anything.

> - The utf-16 index can be intra-codepoint, since some code points are
> represented by two code units
> - The utf-8 index can be intra-codepoint as well, since code points
> are represented by up to four code units

If we s/codepoint/unicode scalar/, then yes.

> So is the idea of the Index struct that the encodedOffset is an
> offset in the native representation of the string (byte offset, word
> offset, etc) to the start of a grapheme, and transcodedOffset is data
> for Unicode Scalar, UTF-16 and UTF-8 views to represent an offset
> within a grapheme to a code point or code unit?

Almost.  First, remember that transcodedOffset is currently just a
conceptual thing and not part of the proposed API.  But if we exposed
it, the following would be true:

  s.indices.index(where: { $0.transcodedOffset != 0 }) == nil
  s.unicodeScalars.indices.index(where: { $0.transcodedOffset != 0 }) == nil

and, because the native encoding of Strings is currently always UTF-16 compatible:

  s.utf16.indices.index(where: { $0.transcodedOffset != 0 }) == nil

In other words, a non-zero transcodedOffset can only occur in indices
from views that represent the string as code units in something other
than its native encoding, and only if that view is not UTF-32.
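
To make that concrete, here is a toy model of the conceptual index, assuming
UTF-16 native storage as described above. ModelIndex and both helper
functions are hypothetical names of mine, not proposed or existing API; the
sketch only enumerates which (encodedOffset, transcodedOffset) pairs each
view would produce.

```swift
// Hypothetical model of the conceptual index; not standard library API.
struct ModelIndex: Hashable {
    var encodedOffset: Int     // offset in native (here: UTF-16) code units
    var transcodedOffset: Int  // offset inside one scalar's transcoded form
}

// Indices a UTF-16 view would produce over UTF-16 storage:
// every position is a native code unit, so nothing is transcoded.
func utf16ViewIndices(of s: String) -> [ModelIndex] {
    return (0..<s.utf16.count).map {
        ModelIndex(encodedOffset: $0, transcodedOffset: 0)
    }
}

// Indices a UTF-8 view would produce over the same UTF-16 storage:
// one index per UTF-8 code unit, anchored at the scalar's native offset.
func utf8ViewIndices(of s: String) -> [ModelIndex] {
    var result: [ModelIndex] = []
    var nativeOffset = 0
    for scalar in s.unicodeScalars {
        let text = String(scalar)
        for t in 0..<text.utf8.count {
            result.append(ModelIndex(encodedOffset: nativeOffset,
                                     transcodedOffset: t))
        }
        nativeOffset += text.utf16.count
    }
    return result
}

let sample = "a€😀"
let utf8Model = utf8ViewIndices(of: sample)
let utf16Model = utf16ViewIndices(of: sample)

// Only the transcoded (UTF-8) view ever has a non-zero transcodedOffset.
let midScalarCount = utf8Model.filter { $0.transcodedOffset != 0 }.count
print(utf8Model.count, utf16Model.count, midScalarCount)
```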

> My feeling is that ‘encoded’ is not enough to distinguish whether
> encodedOffset is meant to indicate an offset in graphemes, code
> points, or code units, 

IMO if you know Unicode, it does, because **Unicode encoding** is
specifically about *representation* in terms of code units.  The
question, then, is whether it's confusing for people who know Unicode
less well, and whether that actually matters.  My supposition has been
that, when all the right high-level APIs are in place, most people will
never touch encodedOffset(s).  But I could be wrong.

The best alternative I can come up with is “nativeCodeUnitOffset,” which
is a mouthful.  We can't just use “codeUnitOffset” because, for example,
in the utf8 view of today's UTF-16-encoded string, this is not about
counting UTF-8 code units; it's still about UTF-16 code units.

> or to specify that an index to the same character in two normalized
> strings may be different if one is backed by UTF-8 and the other
> UTF-16. “encodedCharacterOffset” may be better.

In what way does bringing the word “Character” into this improve things?

> This index struct does limit some sorts of imagined string
> implementations, such as a string maintained piecewise across multiple
> allocation units

I'm pretty certain it does not rule out such an implementation.  It was
designed to allow that.

> or strings using a stateful character encoding like ISO/IEC 2022.

I don't believe it prevents that either.  The index already has state to
avoid repeating work when in a loop such as:

   var i = someView.startIndex
   while i != someView.endIndex {
      somethingWith(someView[i])   // 1
      i = someView.index(after: i) // 2
   }

where lines 1 and 2 both require determining the extent of the element
in underlying code units.  There's no reason it couldn't acquire
additional state.
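
A minimal sketch of that kind of cached state, using made-up types
(CachedIndex and ScalarsOverUTF16 are illustrative only, and this is not how
the standard library lays out its index): the index remembers the extent of
the element it points at, so the subscript and index(after:) do not both
have to measure it.

```swift
// Hypothetical types for illustration; not the standard library's.
struct CachedIndex {
    var offset: Int  // position in UTF-16 code units
    var width: Int   // cached extent of the scalar at `offset` (0 at the end)
}

struct ScalarsOverUTF16 {
    let units: [UInt16]

    // The one place that inspects code units to find an element's extent.
    private func width(at offset: Int) -> Int {
        guard offset < units.count else { return 0 }
        let isPair = UTF16.isLeadSurrogate(units[offset])
            && offset + 1 < units.count
            && UTF16.isTrailSurrogate(units[offset + 1])
        return isPair ? 2 : 1
    }

    var startIndex: CachedIndex {
        return CachedIndex(offset: 0, width: width(at: 0))
    }

    func isEnd(_ i: CachedIndex) -> Bool {
        return i.offset >= units.count
    }

    // Uses the cached width instead of re-measuring the element.
    subscript(i: CachedIndex) -> Unicode.Scalar {
        if i.width == 2 {
            let high = UInt32(units[i.offset]) - 0xD800
            let low = UInt32(units[i.offset + 1]) - 0xDC00
            return Unicode.Scalar(0x10000 + (high << 10) + low)!
        }
        // An unpaired surrogate is replaced with U+FFFD.
        return Unicode.Scalar(units[i.offset]) ?? ("\u{FFFD}" as Unicode.Scalar)
    }

    // Advances by the cached width, then measures only the next element.
    func index(after i: CachedIndex) -> CachedIndex {
        let next = i.offset + i.width
        return CachedIndex(offset: next, width: width(at: next))
    }
}

let view = ScalarsOverUTF16(units: Array("a😀b".utf16))
var decoded: [Unicode.Scalar] = []
var i = view.startIndex
while !view.isEnd(i) {
    decoded.append(view[i])
    i = view.index(after: i)
}
print(decoded)
```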

The most efficient way to deal with a String in a particular encoding is
to make a new instance of StringProtocol (say ISO_IEC_2022String), which
would not have to use this index type.

It is planned that eventually String could actually use something like
ISO_IEC_2022String as its backing store.  At that point, we'd have a
choice:

1. Allow String.Index to store arbitrary state, burdening it with the
   cost of potential ARC traffic, or

2. Create a limited “scratch space” using fundamental types (e.g., one
   UInt) that every instance of StringProtocol would have to be able to
   use to represent its state.

> P.S. I’m also curious why the methods are optional failing vs
> retaining the current API and having them fatal error.

Swift 3 has APIs like this:

   extension String.UnicodeScalarView.Index {
     func samePositionIn(_:String.UTF16View) -> String.UTF16View.Index
   }
   extension String.UTF8View.Index {
     func samePositionIn(_:String.UTF16View) -> String.UTF16View.Index?
   }

When String.UnicodeScalarView.Index and String.UTF8View.Index become the
same type (both are now just String.Index), you're left with:

   extension String.Index {
     func samePositionIn(_:String.UTF16View) -> String.UTF16View.Index
     func samePositionIn(_:String.UTF16View) -> String.UTF16View.Index?
   }

If you leave these two overloads in place, you break code because

   let x = i.samePositionIn(s.utf16)

is now ambiguous.  The only way to keep code functioning is to have
these APIs return optionals.
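
For illustration, here is how the optional-returning form reads at a call
site in later toolchains, where the method is spelled samePosition(in:).
The sample string is my own; the point is only that a UTF-16 position in
the middle of a surrogate pair has no counterpart in the unicodeScalars
view, so the result has to be able to be nil.

```swift
let s = "😀"  // one Unicode scalar, two UTF-16 code units

let first = s.utf16.startIndex
let second = s.utf16.index(after: first)  // between the two surrogates

// A UTF-16 position on a scalar boundary converts...
let atScalar = first.samePosition(in: s.unicodeScalars)
// ...but one in the middle of a scalar does not.
let midScalar = second.samePosition(in: s.unicodeScalars)

if let i = atScalar {
    print(s.unicodeScalars[i])
}
print(midScalar == nil)
```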

Hope this helps,

-- 
-Dave