[swift-evolution] [swift-evolution-announce] [Revised and review extended] SE-0180 - String Index Overhaul

Tue Jun 27 19:58:11 CDT 2017

on Thu Jun 22 2017, Kevin Ballard <swift-evolution at swift.org> wrote:

>> https://github.com/apple/swift-evolution/blob/master/proposals/0180-string-index-overhaul.md
> Given the discussion in the original thread about potentially having
> Strings backed by something other than utf16 code units, I'm somewhat
> concerned about having this kind of vague `encodedOffset` that happens
> to be UTF16 code units. 

What's vague about it?

> If this is supposed to represent an offset into whatever code units
> the String is backed by, then it's going to be a problem because the
> user isn't supposed to know or care what the underlying storage for
> the String is. 

In the long run a user who cares about performance may very well care
about the underlying encoding.  However, that, like access to the
index's encodedOffset, is an expert-level concern.

> And I can imagine potential issues with archiving/unarchiving where
> the unarchived String has a different storage type than the archived
> one, and therefore `encodedOffset` would gain a new meaning that
> screws up unarchived String.Index values.  

Yes.  If you round-trip serialize the index and you don't also preserve
the encoding of the string, you will get nonsense.  As far as I know
that is a feature of any universe in which multiple arbitrary encodings
exist... like the universe we live in.
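
To make that concrete, here is a minimal sketch. It assumes a Swift 5 or
later toolchain, where `utf16Offset(in:)` is available; that accessor is
not part of this proposal, and the sample string is made up for
illustration. The same saved number is read back once as a UTF-16
offset and once as a UTF-8 offset:

```swift
// Assumption: Swift 5+; the sample string is hypothetical.
let s = "caf\u{E9}!"                       // "café!"

// Save the position of "!" as an offset counted in UTF-16 code units.
let bang = s.firstIndex(of: "!")!
let saved = bang.utf16Offset(in: s)

// Read the offset back under the encoding it was saved in...
let viaUTF16 = s.utf16[s.utf16.index(s.utf16.startIndex, offsetBy: saved)]
// ...and under a different encoding (UTF-8 code units).
let viaUTF8 = s.utf8[s.utf8.index(s.utf8.startIndex, offsetBy: saved)]

print(saved, viaUTF16, viaUTF8)
```

Interpreted as UTF-16 the offset lands on "!"; interpreted as UTF-8 it
lands on the second byte of "é", which is the kind of nonsense described
above.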

> The other problem with using this as utf16 is how am I supposed to
> archive/unarchive a String.Index that comes from String.UTF8View?

You can serialize its encodedOffset, which will work as long as the
index is on a Unicode scalar boundary.  If it is not, you can compute
the notional transcodedOffset and serialize that too, by measuring the
distance in the UTF8 view to your index from the previous Unicode scalar
boundary.  Then you can reconstruct that position.  Obviously this is
inconvenient.  The proposal is not trying to solve the problem of doing
that conveniently today; it's only trying to lay the necessary
groundwork.
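
The procedure can be sketched roughly as follows. This is only an
illustration under stated assumptions: it uses `utf16Offset(in:)` and
`String.Index(utf16Offset:in:)` from Swift 5 and later as stand-ins for
reading and writing a UTF-16 `encodedOffset`, and the names `boundary`,
`scalarOffset` and `transcoded` are hypothetical.

```swift
// Assumption: Swift 5+; utf16Offset(in:) stands in for a UTF-16 encodedOffset.
let s = "a\u{1F600}b"                      // "a😀b"; the emoji is 4 UTF-8 bytes
let utf8 = s.utf8

// An index in the middle of the emoji's UTF-8 sequence (its third byte).
let target = utf8.index(utf8.startIndex, offsetBy: 3)

// Walk back to the previous Unicode scalar boundary.
var boundary = target
while boundary.samePosition(in: s.unicodeScalars) == nil {
    boundary = utf8.index(before: boundary)
}

// The two integers to serialize.
let scalarOffset = boundary.utf16Offset(in: s)                // offset of the scalar
let transcoded = utf8.distance(from: boundary, to: target)    // bytes into the scalar

// Later: reconstruct the position from the two integers.
let base = String.Index(utf16Offset: scalarOffset, in: s)
let restored = utf8.index(base, offsetBy: transcoded)

print(scalarOffset, transcoded, restored == target)
```

Both integers are needed because the first one alone can only name a
Unicode scalar boundary.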

> AFAICT the only way to do that is to ignore encodedOffset entirely and
> instead calculate the distance between s.utf8.startIndex and my index
> (and then recreate the index later on by advancing from
> startIndex). 
>
> But this RFC explicitly says that archiving/unarchiving indices is one
> of the goals of this overhaul.
>
> The section on comparison still talks about how this is a weak
> ordering.  In the other thread it was explained as being done so
> because the internal transcodedOffset isn't public, but that still
> makes this explanation very odd. String.Index comparison should not be
> weak ordering, because all indices can be expressed in the utf8View if
> nothing else, and in that view they have a total order. So it should
> just be defined as a total order, based on the position in the
> utf8View that the index corresponds with.

An index between the halves of a UTF16 surrogate pair has no
corresponding position in the UTF8 view.  You could arbitrarily choose
one (e.g. two UTF8 code units past the start of the Unicode scalar), but
I'm not sure that would produce better results.
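
A small sketch of the surrogate-pair case, assuming a current Swift
toolchain and a made-up sample string; `samePosition(in:)` is the
existing standard library conversion, used here only to show that the
position has no UTF8 counterpart:

```swift
// Assumption: a current Swift toolchain; the sample string is hypothetical.
let s = "a\u{1F600}b"                      // the emoji is a UTF-16 surrogate pair
let utf16 = s.utf16

// Index of the leading surrogate, and of the trailing surrogate after it.
let pairStart = utf16.index(utf16.startIndex, offsetBy: 1)
let betweenHalves = utf16.index(after: pairStart)

// Try to express each position in the UTF8 view.
let startInUTF8 = pairStart.samePosition(in: s.utf8)
let middleInUTF8 = betweenHalves.samePosition(in: s.utf8)

print(startInUTF8 != nil, middleInUTF8 == nil)
```

The start of the pair converts; the position between the two halves
does not, so any UTF8-based ordering would have to assign it a position
by convention.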

> The detailed design of the index has encodedOffset being mutable (and
> this was confirmed in the other thread as intentional). I don't think
> this is a good idea, because it makes the following code behave oddly:
>   let x = index.encodedOffset
>   index.encodedOffset = x
>
> Specifically, this resets the private transcodedOffset, so if you do
> this with an intra-code-unit Index taken from the utf8View, the
> modified Index may point to a different byte. I'm also not sure why
> you'd ever want to do this operation anyway. If you want to change the
> encodedOffset, you can just say `index = String.Index(encodedOffset:
> x)`.

I can take or leave the mutability of encodedOffset, personally.

-- 
-Dave