[swift-evolution] [Review] SE-0180: String Index Overhaul

Tue Jun 6 12:57:37 CDT 2017

on Mon Jun 05 2017, Kevin Ballard <swift-evolution at swift.org> wrote:

> https://github.com/apple/swift-evolution/blob/master/proposals/0180-string-index-overhaul.md
>
> Overall it looks pretty good. But unfortunately the answer to "Will
> applications still compile but produce different behavior than they
> used to?" is actually "Yes", when using APIs provided by
> Foundation. This is because Foundation is currently able to return
> String.Index values that don't point to Character boundaries.
>
> Specifically, in Swift 3, the following code:
>
> import Foundation
>
> let str = "e\u{301}galite\u{301}"
> let r = str.rangeOfCharacter(from: ["\u{301}"])!
> print(str[r] == "\u{301}")
>
> will print “true”, because the returned range identifies the combining
> acute accent only. But with the proposed String.Index revisions, the
> `str[r]` subscript will return the whole "e\u{301}" combined
> character.

Hmm, true.

This doesn't totally invalidate the concern, but...

The existing behavior is a bug in the way Foundation interfaces with the
3.0 standard library.  str.rangeOfCharacter (which should be
str.rangeOfUnicodeScalar) should be returning
Range<String.UnicodeScalarView.Index> but is returning a misaligned
Range<String.Index>.  Everything in the 3.0 standard library design is
engineered to ensure that misaligned String indices don't happen at all
(although they still can—just use an index from string1 in string2),
thus the rigorous failable index conversion APIs.
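
As a rough illustration of what "failable index conversion" means, here
is a minimal sketch. It assumes a current Swift toolchain, where the
views share String.Index (in Swift 3 each view had its own index type);
the variable names are only illustrative.

```swift
let str = "e\u{301}galite\u{301}"
let scalars = str.unicodeScalars

// Position of the combining acute accent: the second scalar in the string.
let accent = scalars.index(after: scalars.startIndex)

// The accent sits inside the first Character, so there is no exactly
// corresponding Character position and the conversion fails.
let misaligned = accent.samePosition(in: str)

// The third scalar ("g") starts a new Character, so the conversion succeeds.
let g = scalars.index(after: accent)
let aligned = g.samePosition(in: str)

print(misaligned == nil) // true
print(aligned != nil)    // true
```

The conversion returns nil rather than silently producing a misaligned
Character index, which is the guarantee the Foundation API bypasses.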

It's easy to produce results with this API that don't make sense in
Swift 3:

  import Foundation

  let str = "e\u{301}\u{302}galite\u{301}"
  let r = str.rangeOfCharacter(from: ["\u{301}"])!
  print(str[r.lowerBound] == "\u{301}") // false
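
For contrast, a sketch of the same search done directly on the unicode
scalar view, which is roughly what a scalar-returning API would give
you. This uses only standard library calls on a current toolchain; it
is not the Foundation implementation, and the names are illustrative.

```swift
let str = "e\u{301}\u{302}galite\u{301}"
let scalars = str.unicodeScalars
let target: Unicode.Scalar = "\u{301}"

// Search the scalar view directly, so the result is scalar-aligned by
// construction and is only ever used with the scalar view.
let lower = scalars.firstIndex(of: target)!
let r = lower..<scalars.index(after: lower)

let lowerMatches = scalars[r.lowerBound] == target
let sliceMatches = Array(scalars[r]) == [target]
print(lowerMatches, sliceMatches) // true true
```

Here both the lower bound and the slice refer to the accent itself,
because the range never leaves the view it was computed in.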

> This is, of course, an edge case, but we need to consider the
> implications of this and determine if it actually affects anything
> that’s likely to be a problem in practice.

I agree.  It would also be reasonable to pick a different behavior for
misaligned indices, for example:

  Indices *that don't fall on a code unit boundary* are “rounded down”
  before use.
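
To make "rounded down" concrete, here is a minimal sketch on a current
Swift toolchain. The helper `roundedDown(_:in:)` is hypothetical, not a
standard library API, and the linear scan is only for clarity.

```swift
// Hypothetical helper, not a standard library API: returns the greatest
// index of `view` that is not after `i`.
func roundedDown<V: Collection>(_ i: String.Index, in view: V) -> String.Index
    where V.Index == String.Index
{
    if i >= view.endIndex { return view.endIndex }
    var result = view.startIndex
    for j in view.indices {
        if j > i { break }
        result = j
    }
    return result
}

let s = "\u{E9}"  // "é": one UTF-16 code unit, two UTF-8 code units

// Position of the second UTF-8 code unit, which falls inside the single
// UTF-16 code unit.
let mid = s.utf8.index(after: s.utf8.startIndex)
let rounded = roundedDown(mid, in: s.utf16)
print(rounded == s.utf16.startIndex) // true
```

The index that does not fall on a UTF-16 code unit boundary is mapped
to the start of the code unit that contains it.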

The existing behaviors for these cases are a cluster of coincidences,
and were never designed.  I doubt that preserving them in their current
form makes sense or would lead to usable string semantics for the long
term, but if they do in fact happen to make sense, we'd still need to
codify the rules so we can keep future behaviors consistent.

> There’s also the curious case where I can have two String.Index values
> that compare unequal but actually return the same value when used in a
> subscript. For example, with the above string, consider
> String.Index(encodedOffset: 0) and String.Index(encodedOffset: 1).
> This may not be a problem in practice, but it’s something to be
> aware of.

I don't think this one even rises to that level.

let s = "aaa"
var si = s.indices.makeIterator()
let i0 = si.next()!
let i1 = si.next()!
print(i0 == i1)       // false
print(s[i0] == s[i1]) // true.  Surprised?

> I’m also confused by the paragraph about index comparison. It talks
> about if two indices are valid in a single String view, comparison
> semantics are according to Collection, and otherwise indexes are
> compared using encodedOffsets, and this means indexes aren’t totally
> ordered. But I’m not sure what the first part is supposed to mean. How
> is comparing indices that are valid within a single view any different
> than comparing the encodedOffsets?

In today's String, encodedOffset is an offset in UTF-16.  Two indices
into a UTF-8 view may be unequal yet have the same encodedOffset.
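
A small model of that situation, computed by hand rather than through
String.Index itself (whose internal layout has changed since this was
written). The tuple labels and variable names are illustrative only.

```swift
let s = "\u{E9}"  // "é": U+00E9, two UTF-8 code units, one UTF-16 code unit

// For each UTF-8 code unit, record its own position and the UTF-16 offset
// of the scalar that contains it.
var positions: [(utf8: Int, utf16: Int)] = []
var utf8Position = 0
var utf16Position = 0
for scalar in s.unicodeScalars {
    let piece = String(scalar)
    for _ in 0..<piece.utf8.count {
        positions.append((utf8: utf8Position, utf16: utf16Position))
        utf8Position += 1
    }
    utf16Position += piece.utf16.count
}

for p in positions {
    print("UTF-8 position \(p.utf8) -> UTF-16 offset \(p.utf16)")
}
```

Both UTF-8 positions land on UTF-16 offset 0, so ordering them by that
offset alone cannot tell them apart, while the UTF-8 view itself can.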

Regards,

-- 
-Dave