[swift-evolution] [swift-evolution-announce] [Revised and review extended] SE-0180 - String Index Overhaul

Dave Abrahams dabrahams at apple.com
Sun Jul 2 21:28:27 CDT 2017


Hi Karl,

It was pointed out to me that I never answered this thoughtful post of
yours...

on Mon Jun 26 2017, Karl Wagner <swift-evolution at swift.org> wrote:

>> On 23. Jun 2017, at 02:59, Kevin Ballard via swift-evolution
>> <swift-evolution at swift.org> wrote:
>> 
>> https://github.com/apple/swift-evolution/blob/master/proposals/0180-string-index-overhaul.md
>> 
>> Given the discussion in the original thread about potentially having
>> Strings backed by something other than utf16 code units, I'm
>> somewhat concerned about having this kind of vague `encodedOffset`
>> that happens to be UTF16 code units. If this is supposed to
>> represent an offset into whatever code units the String is backed
>> by, then it's going to be a problem because the user isn't supposed
>> to know or care what the underlying storage for the String is.
>
> Is that true? The String manifesto shows a design where the underlying
> Encoding and code-units are exposed.

That is the eventual goal.  Note that with this proposal we are making
progress towards that goal, but getting all the way there is out of
scope for this release.

> From the talk about String’s being backed by something that isn’t
> UTF-16, I took that to mean that String might one-day become
> generic. Defaults for generic parameters have been mentioned on the
> list before, so “String” could still refer to “String<UTF16Encoding>”
> on OSX and maybe “String<UTF8Encoding>” on Linux.

I think you may have misunderstood.  String currently supports a few
different underlying representations (ASCII, UTF-16, NSString), all of
which happen to use a UTF-16-compatible encoding.  The eventual goal is
to expand the possible underlying representations of String to
accommodate other encodings.

That said, the underlying representation of String is *not* part of
String's type, and we don't intend to change that.  When String APIs
access the underlying representation, that access is dynamically
dispatched.  If the encoding were a generic parameter, then it would be
statically dispatched (at least in part), but it would also become part
of String's type, and, for example, you would get an error when trying
to pass a String<Unicode.ASCII> where a String<Unicode.UTF16> was
expected.  It's important that code passing Strings around remain
smoothly interoperable, so we don't want to introduce this sort of type
mismatch.
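
To make this concrete, here is a toy sketch (using a hypothetical
GenericString type that is not part of any proposal) of the mismatch
a generic encoding parameter would create:

    // A toy generic design, not the proposed one, shown only to
    // demonstrate the type mismatch.
    struct GenericString<Encoding: Unicode.Encoding> {
        var codeUnits: [Encoding.CodeUnit] = []
    }

    func takesUTF16(_ s: GenericString<Unicode.UTF16>) {}

    let ascii = GenericString<Unicode.ASCII>()
    // error: cannot convert value of type
    // 'GenericString<Unicode.ASCII>' to expected argument type
    // 'GenericString<Unicode.UTF16>'
    // takesUTF16(ascii)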

Instead, the intention is that someone could make a UTF8String type that
conformed to StringProtocol, and that String itself could be constructed
from any instance of StringProtocol to be used as its underlying
representation.  That way, if you need the performance that comes with
knowing and manipulating the underlying encoding, you can use UTF8String
directly, and if you need to interoperate with code that uses the
lingua-franca String type, you can wrap String around your UTF8String
and pass that.
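
You can already see the shape of that pattern today with Substring,
the one non-String conformer of StringProtocol in the current standard
library; a hypothetical UTF8String would slot into the same role:

    func takesLinguaFranca(_ s: String) { print(s) }

    // Generic code stays encoding-agnostic by accepting any
    // StringProtocol conformer:
    func shout<S: StringProtocol>(_ s: S) -> String {
        return s.uppercased()
    }

    let sub: Substring = "Hello, world".dropFirst(7)  // "world"
    print(shout(sub))               // WORLD
    takesLinguaFranca(String(sub))  // wrap to cross a String boundary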

> I would support a definition of encodedOffset that removed mention of
> UTF-16 and phrased things in terms of String.Encoding and
> code-units. 

Well, a few points about this:

I support removing the text “(UTF-16)” from the initial documentation
comments on these APIs, which is, AFAICT, the only source of the concern
you and others have expressed. That said, Strings are in fact currently
encoded as UTF-16, and as long as Cocoa interop is important, that fact
is itself important and useful information, so it should be documented
somewhere.
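
For illustration, here is a sketch that assumes only this proposal's
encodedOffset APIs: because NSRange is measured in UTF-16 code units,
converting one into String indices is simple arithmetic.

    import Foundation

    let s = "café"
    let nsRange = (s as NSString).range(of: "é")

    // encodedOffset counts the same UTF-16 code units NSRange does.
    let start = String.Index(encodedOffset: nsRange.location)
    let end = String.Index(
        encodedOffset: nsRange.location + nsRange.length)
    print(s[start..<end])  // é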

I don't support describing anything in terms of String.Encoding at this
time.  That enum was added to String by the Foundation overlay, and is not
part of the plan for String except insofar as it is required for source
compatibility and Cocoa interop.  A more appropriate way to describe the
encoding in terms of the language would be as something like
Unicode.UTF16 (at compile-time) or an instance of Unicode.Encoding.Type
(at runtime).  But I see no need to describe it in language terms until
we are ready to add APIs to String that can support multiple encodings
and/or report the underlying encoding, and we are not ready to do that
yet.
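
For illustration only (String exposes no such API today), the
compile-time spelling would look something like this:

    // An encoding is described at the language level as a type
    // conforming to Unicode.Encoding.
    func describe<E: Unicode.Encoding>(_: E.Type) {
        print(E.self, "has code unit type", E.CodeUnit.self)
    }

    describe(Unicode.UTF16.self)  // UTF16 has code unit type UInt16
    describe(Unicode.UTF8.self)   // UTF8 has code unit type UInt8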

> For example, I would like to be able to construct new String indices
> from a known index plus a quantity of code-units known to represent a
> sequence of characters:
>
> var stringOne = "Hello,"
> let stringTwo = " world"
>
> var idx = stringOne.endIndex
> stringOne.append(contentsOf: stringTwo)
> idx = String.Index(encodedOffset: idx.encodedOffset + stringTwo.codeUnits.count)
> assert(idx == stringOne.endIndex)

I'm not sure what you mean by “represent a sequence of characters” in
this context.  Doesn't a sequence of code units always represent a
sequence of characters?

The code you wrote above would (almost) work as written under this
proposal, given that Strings always have an encoding that's compatible
with some default.  In other words, making it work *depends* on the fact
that the encoding of stringTwo is compatible with (has a non-strict
sub/superset relation with) that of stringOne.  If stringOne were
encoded as today but stringTwo were encoded with some other encoding,
say, Shift-JIS, the code might not work.  So making code like this work
depends on the very information that you have expressed a concern about
seeing in the documentation.

The reason I wrote “(almost)” above is that we are not yet proposing to
expose a “codeUnits” view on String, and shouldn't do so until we are
ready to introduce the more flexible encoding options discussed earlier.
Today, you'd use the utf16 view to get that information.  We are headed
down the road toward your vision, but we can't arrive there in this
release.
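
Concretely, here is your example adjusted to work today, with
utf16.count standing in for the not-yet-proposed codeUnits.count:

    var stringOne = "Hello,"
    let stringTwo = " world"

    var idx = stringOne.endIndex
    stringOne.append(contentsOf: stringTwo)
    // utf16.count plays the role codeUnits.count would eventually play.
    idx = String.Index(
        encodedOffset: idx.encodedOffset + stringTwo.utf16.count)
    assert(idx == stringOne.endIndex)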

Hope this helps,

-- 
-Dave


