[swift-evolution] [Review] SE-0180: String Index Overhaul
Xiaodi Wu
xiaodi.wu at gmail.com
Tue Jun 13 20:16:27 CDT 2017
I’m coming to this conversation rather late, so forgive the naive question:
Your proposal claims that current code with failable APIs is needlessly
awkward and that most code only interchanges indices whose conversion is
known to succeed. So, why is it not simply a precondition of string slicing that the
index be correctly aligned? It seems like this would simplify the behavior
greatly.
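
To make the question concrete, here is a minimal sketch of slicing with an alignment precondition. The helper name `suffix(of:fromAligned:)` is hypothetical and not a standard library API; the sketch assumes a recent Swift toolchain and relies on the failable `String.Index(_:within:)` initializer to detect an index that has no exact Character position.

```swift
// Hypothetical helper (not a standard library API): slices `s` from `i`,
// trapping if `i` does not fall on a Character boundary of `s`.
func suffix(of s: String, fromAligned i: String.Index) -> Substring {
    // String.Index(_:within:) returns nil when `i` has no exact
    // corresponding Character position in `s`.
    precondition(String.Index(i, within: s) != nil,
                 "index is not Character-aligned")
    return s[i...]
}

let s = "e\u{301}galite\u{301}"
let g = s.firstIndex(of: "g")!            // a Character-aligned index
print(suffix(of: s, fromAligned: g))      // aligned, so the precondition holds
```

Passing the index of the combining accent obtained from `s.unicodeScalars` would trap under this sketch, which is the stricter behavior the question asks about.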
On Tue, Jun 13, 2017 at 19:04 Dave Abrahams via swift-evolution <
swift-evolution at swift.org> wrote:
>
> on Tue Jun 06 2017, Dave Abrahams <swift-evolution at swift.org> wrote:
>
> >> Overall it looks pretty good. But unfortunately the answer to "Will
> >> applications still compile but produce different behavior than they
> >> used to?" is actually "Yes", when using APIs provided by
> >> Foundation. This is because Foundation is currently able to return
> >> String.Index values that don't point to Character boundaries.
> >>
> >> Specifically, in Swift 3, the following code:
> >>
> >> import Foundation
> >>
> >> let str = "e\u{301}galite\u{301}"
> >> let r = str.rangeOfCharacter(from: ["\u{301}"])!
> >> print(str[r] == "\u{301}")
> >>
> >> will print “true”, because the returned range identifies the combining
> >> acute accent only. But with the proposed String.Index revisions, the
> >> `str[r]` subscript will return the whole "e\u{301}" combined
> >> character.
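
The distinction being described can be seen without Foundation. The following is a small sketch using only the standard library's `unicodeScalars` view; it does not reproduce `rangeOfCharacter(from:)` itself, it only shows that the accent is a separate Unicode scalar while being part of the first Character.

```swift
let str = "e\u{301}galite\u{301}"

// Locate the combining acute accent in the Unicode scalar view.
let accent = str.unicodeScalars.firstIndex(of: "\u{301}")!
print(str.unicodeScalars[accent] == "\u{301}")   // scalar-level lookup

// At the Character level the accent belongs to the first grapheme cluster.
print(str.first!.unicodeScalars.count)

// The two views therefore disagree on how many elements the string has.
print(str.count, str.unicodeScalars.count)
```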
> >
> > Hmm, true.
> >
> > This doesn't totally invalidate the concern, but...
> >
> > The existing behavior is a bug in the way Foundation interfaces with the
> > 3.0 standard library. str.rangeOfCharacter (which should be
> > str.rangeOfUnicodeScalar) should be returning
> > Range<String.UnicodeScalarView.Index> but is returning a misaligned
> > Range<String.Index>. Everything in the 3.0 standard library design is
> > engineered to ensure that misaligned String indices don't happen at all
> > (although they still can—just use an index from string1 in string2),
> > thus the rigorous failable index conversion APIs.
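
As an illustration of the failable conversion mentioned above, here is a short sketch using `String.Index(_:within:)` on a recent Swift toolchain. It assumes nothing beyond the standard library: a scalar-view index that sits on a Character boundary converts, while the index of the combining accent does not.

```swift
let str = "e\u{301}galite\u{301}"
let scalars = str.unicodeScalars

let e = scalars.startIndex               // position of "e"
let accent = scalars.index(after: e)     // position of U+0301

// "e" starts a Character, so the conversion succeeds.
print(String.Index(e, within: str) != nil)

// The accent has no Character position of its own, so the result is nil.
print(String.Index(accent, within: str) == nil)
```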
> >
> > It's easy to produce results with this API that don't make sense in
> > Swift 3:
> >
> > let str = "e\u{301}\u{302}galite\u{301}"
> > let r = str.rangeOfCharacter(from: ["\u{301}"])!
> > print(str[r.lowerBound] == "\u{301}") // false
> >
> >> This is, of course, an edge case, but we need to consider the
> >> implications of this and determine if it actually affects anything
> >> that’s likely to be a problem in practice.
> >
> > I agree. It would also be reasonable to pick a different behavior for
> > misaligned indices, for example:
> >
> > Indices *that don't fall on a code unit boundary* are “rounded down”
> > before use.
> >
> > The existing behaviors for these cases are a cluster of coincidences,
> > and were never designed. I doubt that preserving them in their current
> > form makes sense and will lead to a usable string semantics for the long
> > term, but if they do in fact happen to make sense, we'd still need to
> > codify the rules so we can keep future behaviors consistent.
>
> Having considered this further, I'd like to propose these revised
> semantics for misaligned indices, to preserve the behavior of
> rangeOfCharacter and its ilk:
>
> * Definition: an index i is aligned with respect to a string view v iff
>
> v.indices.contains(i) || v.endIndex == i
>
> If i is not aligned with respect to v it is *misaligned* with respect
> to v.
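
The definition above can be written down as a small generic function. This is only a sketch: `isAligned(_:in:)` is a hypothetical name, and the check is a linear scan over the view's indices, so it is meant for illustration rather than production use.

```swift
// Hypothetical helper: true iff `i` is one of `v`'s own indices or its endIndex.
func isAligned<V: Collection>(_ i: V.Index, in v: V) -> Bool {
    // Linear scan over the view's indices, mirroring
    // `v.indices.contains(i) || v.endIndex == i`.
    return v.indices.contains(where: { $0 == i }) || v.endIndex == i
}

let s = "e\u{301}galite\u{301}"
let accent = s.unicodeScalars.index(after: s.unicodeScalars.startIndex)

print(isAligned(accent, in: s.unicodeScalars))  // aligned in the scalar view
print(isAligned(accent, in: s))                 // misaligned in the Character view
print(isAligned(s.endIndex, in: s))             // endIndex counts as aligned
```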
>
> * When i is misaligned with respect to a String/Substring view s.xxx
> (imagining s itself could also be spelled as s.xxx), combining s.xxx
> and i is done in terms of underlying code units and i.encodedOffset.
>
> It's very hard to write these semantics down precisely in terms of
> existing constructs, but this should give you a sense of what I have
> in mind:
>
> 1. the suffix beginning at i is formed by slicing the underlying
> codeUnits at i.encodedOffset, forming a new Substring around that
> slice, and getting its corresponding xxx view
>
> s.xxx[i...]
>
> is roughly equivalent to:
>
> Substring(s.utf16[String.Index(encodedOffset: i.encodedOffset)...]).xxx
>
> (given that we currently have UTF-16 code units)
>
> 2. similarly
>
> s.xxx[..<i]
>
> is equivalent to something like:
>
> Substring(s.utf16[..<String.Index(encodedOffset: i.encodedOffset)]).xxx
>
> 3. s.xxx[i] is equivalent to s.xxx[i...].first!
>
> 4. s.xxx.index(after: i) is equivalent to
> s.xxx[i...].indices.dropFirst().first!
>
> 5. s.xxx.index(before: i) is equivalent to s.xxx[..<i].indices.last!
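
The five rules above can be sketched roughly as follows. The helper names are hypothetical, and the sketch uses `String.Index(utf16Offset:in:)` from newer Swift releases in place of the `encodedOffset` initializer shown in the message. The generic helpers only restate rules 3 to 5 in terms of slicing; they agree with the built-in operations for aligned indices, and what they return for a misaligned index depends on how the toolchain in use slices at that index.

```swift
// Rules 3-5 as hypothetical generic helpers over any collection view.
func elementBySlicing<C: Collection>(of c: C, at i: C.Index) -> C.Element {
    return c[i...].first!                        // rule 3
}
func indexAfterBySlicing<C: Collection>(_ i: C.Index, in c: C) -> C.Index {
    return c[i...].indices.dropFirst().first!    // rule 4
}
func indexBeforeBySlicing<C: BidirectionalCollection>(_ i: C.Index, in c: C) -> C.Index {
    return c[..<i].indices.last!                 // rule 5
}

let s = "e\u{301}galite\u{301}"
let scalars = s.unicodeScalars
let accent = scalars.index(after: scalars.startIndex)

// Rule 1 (rough): slice the UTF-16 code units at the index's offset and
// wrap the slice in a new Substring.
let offset = accent.utf16Offset(in: s)
let tail = Substring(s.utf16[String.Index(utf16Offset: offset, in: s)...])
print(tail.utf16.count)

// For an index that is aligned in the view, the helpers match the built-ins.
print(elementBySlicing(of: scalars, at: accent) == scalars[accent])
print(indexAfterBySlicing(accent, in: scalars) == scalars.index(after: accent))
print(indexBeforeBySlicing(accent, in: scalars) == scalars.index(before: accent))

// Rule 3 applied to the Character view with the misaligned index.
print(elementBySlicing(of: s, at: accent).unicodeScalars.count)
```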
>
> I'm concerned that we have no precise way to specify the semantics of #1
> and #2, to the point where it might be better to implement them that way
> but leave the semantics unspecified. Another alternative would be to
> add the APIs needed to make it possible to express a precise equivalence
> instead of a rough equivalence. If anyone has better ideas, I'm all ears.
>
> --
> -Dave
>