[swift-evolution] [swift-evolution-announce] [Revised and review extended] SE-0180 - String Index Overhaul

Drew Crawford drew at sealedabstract.com
Tue Jun 27 12:58:48 CDT 2017




On June 26, 2017 at 5:43:42 PM, Karl Wagner via swift-evolution (swift-evolution at swift.org) wrote:

I would support a definition of encodedOffset that removed mention of UTF-16 and phrased things in terms of String.Encoding and code-units. For example, I would like to be able to construct new String indices from a known index plus a quantity of code-units known to represent a sequence of characters:

var stringOne = "Hello,"
let stringTwo = " world"

var idx = stringOne.endIndex
stringOne.append(contentsOf: stringTwo)
idx = String.Index(encodedOffset: idx.encodedOffset + stringTwo.codeUnits.count)
assert(idx == stringOne.endIndex)


I second this concern.  We currently use a non-Foundation library that prefers UTF8 encoding, and I think UTF8-backed strings are important.

The choice of UTF16 as string storage in Swift makes historical sense (e.g. runtime interop with ObjC-backed strings) but as Swift moves forward it makes less sense.  We need a string system that behaves more like a lightweight accessor for the underlying storage (e.g. if you like your input's encoding you can keep it) unless you do something (like peruse a view) that requires promotion to a new format.  That's a different proposal, but that's the direction I'd like to see us head.

This proposal is in many ways the opposite of that: it specifies that we standardize on UTF16, and in particular we have in view the problem of file archiving (where we would have long-term unarchival guarantees), which complicates backing this out later.  This feels like a kludge to support Foundation.  In the archive context the offset should either be "whatever the string is" (which you would have to know anyway to archive/unarchive that string) or a full-fledged offset type that specifies the encoding, such as

let i = String.Index(
    encoding: .utf16,
    offset: 36
)

the latter of which would be used to port an Index between string representations if that's a useful feature.
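
To make the idea of an offset that carries its own encoding concrete, here is a minimal sketch. The names OffsetEncoding, EncodedOffset and index(at:) are hypothetical and are not part of the proposal or the standard library, and the code assumes Swift 4 or later, where the UTF8 and UTF16 views share String.Index.

```swift
// Hypothetical sketch: OffsetEncoding, EncodedOffset and index(at:)
// are illustrative names, not standard-library API.
enum OffsetEncoding {
    case utf8
    case utf16
}

/// An integer offset that records which encoding it was measured in.
struct EncodedOffset {
    let encoding: OffsetEncoding
    let offset: Int
}

extension String {
    /// Resolves an encoding-tagged offset against the matching view.
    /// Returns nil when the offset is negative or past the end of that view.
    func index(at position: EncodedOffset) -> String.Index? {
        guard position.offset >= 0 else { return nil }
        switch position.encoding {
        case .utf8:
            return utf8.index(utf8.startIndex,
                              offsetBy: position.offset,
                              limitedBy: utf8.endIndex)
        case .utf16:
            return utf16.index(utf16.startIndex,
                               offsetBy: position.offset,
                               limitedBy: utf16.endIndex)
        }
    }
}

// U+00E9 takes two UTF8 code units but one UTF16 code unit, so the
// same character position has a different offset in each encoding.
let greeting = "h\u{E9}llo"
let viaUTF8 = greeting.index(at: EncodedOffset(encoding: .utf8, offset: 3))
let viaUTF16 = greeting.index(at: EncodedOffset(encoding: .utf16, offset: 2))
```

Because the encoding travels with the offset, an archive could record both together, and a reader would not have to assume UTF16.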

More broadly though, I disagree with the motivation of the proposal, specifically:

"The result is a great deal of API surface area for apparently little gain in ordinary code"

In ordinary code, we work with a single string representation (e.g. in Cocoa it's UTF16), and there is a correspondence between our UTF16 offset and our UTF16 string such that index lookups will succeed.  When we collapse indexes, we lose the information needed to make this correspondence, which was previously encoded into the type system.  So the "gain in ordinary code" is that programmers do not have to sprinkle `!` in the common case of string index lookups, because we can infer at compile time from the type correspondence that it is unnecessary.

Under this proposal, they will have to sprinkle the `!`, which adds friction and a performance cost (`!` is a runtime check, and UTF16 promotions are expensive).  I don't believe the simplicity of implementing archival (which one only has to write once) is worth the hassle of complicating all string index lookups.
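
As a rough illustration of where the optional shows up, the sketch below uses only APIs that exist in Swift 4 and later. It is one reading of the concern, not a measurement of the cost: a lookup inside the view that produced the index needs no unwrapping, while moving that position to the character level is failable.

```swift
// U+1F600 occupies two UTF16 code units (a surrogate pair).
let text = "caf\u{E9} \u{1F600}"
let units = text.utf16

// Offset 6 lands on the trailing surrogate of U+1F600.
let position = units.index(units.startIndex, offsetBy: 6)

// Inside the UTF16 view the lookup is total: no optional involved.
let unit = units[position]

// Converting to a character-level position can fail, because this
// position falls in the middle of a scalar. Callers must unwrap.
let characterPosition = String.Index(position, within: text)
```

Here characterPosition is an optional, so code that knows its offset is valid still has to unwrap it or handle nil.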

Does this proposal fit well with the feel and direction of Swift?

To me, one of Swift's greatest strengths is the type system.  We can encode information into the type system and find our bugs at compile time instead of runtime.

Here, we are proposing to partially erase a type because it's annoying to write code that deals with string encodings.  But our code will deal with string encodings somehow whether `utf16` appears in our source code or not.

When we erase the type of our offset, we lose a powerful tool to prove the correctness of our string encodings, that is, the compiler can check our utf16 offset is used with a utf16 string.  Without that tool, we either have to check that dynamically, or, worst case, there are bugs in our program.
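
A small sketch of what that compile-time check could look like, using phantom type parameters. UTF8Tag, UTF16Tag, Offset and the subscripts are hypothetical names introduced for illustration; none of them is standard-library API.

```swift
// Hypothetical sketch: all names below are illustrative assumptions.
enum UTF8Tag {}
enum UTF16Tag {}

/// An integer offset whose encoding is part of its static type.
struct Offset<Encoding> {
    let value: Int
}

extension String.UTF16View {
    /// Accepts only UTF16 offsets; passing an Offset<UTF8Tag> is rejected
    /// by the compiler rather than discovered at runtime.
    subscript(offset: Offset<UTF16Tag>) -> UInt16 {
        return self[index(startIndex, offsetBy: offset.value)]
    }
}

extension String.UTF8View {
    /// Accepts only UTF8 offsets.
    subscript(offset: Offset<UTF8Tag>) -> UInt8 {
        return self[index(startIndex, offsetBy: offset.value)]
    }
}

let word = "h\u{E9}llo"
let utf16Offset = Offset<UTF16Tag>(value: 1)
let utf8Offset = Offset<UTF8Tag>(value: 1)

let utf16Unit = word.utf16[utf16Offset]   // whole code unit for U+00E9
let utf8Unit = word.utf8[utf8Offset]      // first byte of U+00E9
// word.utf16[utf8Offset]                 // does not compile: encoding mismatch
```

With a bare Int in place of Offset, the commented-out line would compile and silently read the wrong code unit.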

Under this proposal we would encourage the use of bare-integer offsets for string lookup.  That does not seem Swifty to me.  A Swifty solution would be to add a dynamically-checked type-erased String.Index alongside the existing statically-checked fully-typed String.UTF8/16View.Index so that the programmer can choose the abstraction with the performance/simplicity behavior appropriate for their problem.
