[swift-evolution] [Review] SE-0180: String Index Overhaul

Tue Jun 13 19:02:42 CDT 2017

on Tue Jun 06 2017, Dave Abrahams <swift-evolution at swift.org> wrote:

>> Overall it looks pretty good. But unfortunately the answer to "Will
>> applications still compile but produce different behavior than they
>> used to?" is actually "Yes", when using APIs provided by
>> Foundation. This is because Foundation is currently able to return
>> String.Index values that don't point to Character boundaries.
>>
>> Specifically, in Swift 3, the following code:
>>
>> import Foundation
>>
>> let str = "e\u{301}galite\u{301}"
>> let r = str.rangeOfCharacter(from: ["\u{301}"])!
>> print(str[r] == "\u{301}")
>>
>> will print “true”, because the returned range identifies the combining
>> acute accent only. But with the proposed String.Index revisions, the
>> `str[r]` subscript will return the whole "e\u{301}" combined
>> character.
>
> Hmm, true.
>
> This doesn't totally invalidate the concern, but...
>
> The existing behavior is a bug in the way Foundation interfaces with the
> 3.0 standard library.  str.rangeOfCharacter (which should be
> str.rangeOfUnicodeScalar) should be returning
> Range<String.UnicodeScalarView.Index> but is returning a misaligned
> Range<String.Index>.  Everything in the 3.0 standard library design is
> engineered to ensure that misaligned String indices don't happen at all
> (although they still can—just use an index from string1 in string2),
> thus the rigorous failable index conversion APIs.
>
> It's easy to produce results with this API that don't make sense in
> Swift 3:
>
>   let str = "e\u{301}\u{302}galite\u{301}"
>   let r = str.rangeOfCharacter(from: ["\u{301}"])!
>   print(str[r.lowerBound] == "\u{301}") // false
>
>> This is, of course, an edge case, but we need to consider the
>> implications of this and determine if it actually affects anything
>> that’s likely to be a problem in practice.
>
> I agree.  It would also be reasonable to pick a different behavior for
> misaligned indices, for example:
>
>   Indices *that don't fall on a code unit boundary* are “rounded down”
>   before use.
>
> The existing behaviors for these cases are a cluster of coincidences,
> and were never designed.  I doubt that preserving them in their current
> form makes sense and will lead to a usable string semantics for the long
> term, but if they do in fact happen to make sense, we'd still need to
> codify the rules so we can keep future behaviors consistent.

Having considered this further, I'd like to propose these revised semantics for
misaligned indices, to preserve the behavior of rangeOfCharacter and its
ilk:

* Definition: an index i is aligned with respect to a string view v iff 

     v.indices.contains(i) || v.endIndex == i

  If i is not aligned with respect to v it is *misaligned* with respect
  to v.

* When i is misaligned with respect to a String/Substring view s.xxx
  (imagining s itself could also be spelled as s.xxx), combining s.xxx
  and i is done in terms of underlying code units and i.encodedOffset.

  It's very hard to write these semantics down precisely in terms of
  existing constructs, but this should give you a sense of what I have
  in mind:

  1. The suffix beginning at i is formed by slicing the underlying
     codeUnits at i.encodedOffset, forming a new Substring around that
     slice, and getting its corresponding xxx view.  That is,

       s.xxx[i...]

     is roughly equivalent to:

       Substring(s.utf16[String.Index(encodedOffset: i.encodedOffset)...]).xxx

     (given that we currently have UTF-16 code units).

  2. Similarly,

       s.xxx[..<i]

     is equivalent to something like:

       Substring(s.utf16[..<String.Index(encodedOffset: i.encodedOffset)]).xxx

  3. s.xxx[i] is equivalent to s.xxx[i...].first!

  4. s.xxx.index(after: i) is equivalent to s.xxx[i...].indices.dropFirst().first!

  5. s.xxx.index(before: i) is equivalent to s.xxx[..<i].indices.last!
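
To make the definition and rules 1 and 2 concrete, here is a rough sketch for
the Character view only.  It assumes a Swift 5 toolchain, where
String.Index(utf16Offset:in:) and utf16Offset(in:) stand in for
encodedOffset.  The helper names isAligned, suffixSlice, and prefixSlice are
hypothetical, chosen for illustration; they are not proposed APIs.

```swift
// Illustrative sketch only; `isAligned`, `suffixSlice`, and `prefixSlice`
// are hypothetical helper names, not proposed or existing APIs.

/// The alignment test from the definition: i is aligned with respect to a
/// view v iff it is one of v's indices or v's endIndex.
func isAligned<V: Collection>(_ i: V.Index, in v: V) -> Bool {
    return v.indices.contains(i) || v.endIndex == i
}

/// Rule 1 for the Character view: slice the UTF-16 code units at i's
/// offset and wrap the slice in a new Substring.
func suffixSlice(of s: String, from i: String.Index) -> Substring {
    let cut = String.Index(utf16Offset: i.utf16Offset(in: s), in: s)
    return Substring(s.utf16[cut...])
}

/// Rule 2: the same construction for the prefix that ends at i.
func prefixSlice(of s: String, upTo i: String.Index) -> Substring {
    let cut = String.Index(utf16Offset: i.utf16Offset(in: s), in: s)
    return Substring(s.utf16[..<cut])
}

let s = "e\u{301}galite\u{301}"

// An index one code unit in, just past the first "e".  It falls on a
// scalar boundary but inside the first Character, "e\u{301}".
let i = String.Index(utf16Offset: 1, in: s)
print(isAligned(i, in: s.unicodeScalars), isAligned(i, in: s))

// Under rule 3, s[i] would be the first Character of this suffix.
let tail = suffixSlice(of: s, from: i)
let head = prefixSlice(of: s, upTo: i)
print(Array(tail.utf16).count, Array(head.utf16).count)
```

The sketch deliberately stops short of rules 3 through 5: how the suffix is
segmented into Characters depends on the grapheme breaking of the standard
library in use, which is exactly the part that is hard to pin down.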

I'm concerned that we have no precise way to specify the semantics of #1
and #2, to the point where it might be better to implement them that way
but leave the semantics unspecified.  Another alternative would be to
add the APIs needed to make it possible to express a precise equivalence
instead of a rough equivalence.  If anyone has better ideas, I'm all ears.

-- 
-Dave