[swift-evolution] [Pitch] String revision proposal #1

Thu Mar 30 04:48:46 CDT 2017

> On Mar 29, 2017, at 5:32 PM, Ben Cohen via swift-evolution <swift-evolution at swift.org> wrote:
> 
> Hi Swift Evolution,
> 
> Below is a pitch for the first part of the String revision. This covers a number of changes that would allow the basic internals to be overhauled.
> 
> Online version here: https://github.com/airspeedswift/swift-evolution/blob/3a822c799011ace682712532cfabfe32e9203fbb/proposals/0161-StringRevision1.md

Really great stuff, guys. Thanks for your work on this!

> In order to be able to write extensions across both String and Substring, a new Unicode protocol to which the two types will conform will be introduced. For the purposes of this proposal, Unicode will be defined as a protocol to be used whenever you would previously extend String. It should be possible to substitute extension Unicode { ... } in Swift 4 wherever extension String { ... } was written in Swift 3, with one exception: any passing of self into an API that takes a concrete String will need to be rewritten as String(self). If Self is a String then this should effectively optimize to a no-op, whereas if Self is a Substring then this will force a copy, helping to avoid the “memory leak” problems described above.

I continue to feel that `Unicode` is the wrong name for this protocol, essentially because it sounds like a protocol for, say, a version of Unicode or some kind of encoding machinery instead of a Unicode string. I won't rehash that argument since I made it already in the manifesto thread, but I would like to make a couple of new suggestions in this area.

Later on, you note that it would be nice to namespace many of these types:

> Several of the types related to String, such as the encodings, would ideally reside inside a namespace rather than live at the top level of the standard library. The best namespace for this is probably Unicode, but this is also the name of the protocol. At some point if we gain the ability to nest enums and types inside protocols, they should be moved there. Putting them inside String or some other enum namespace is probably not worthwhile in the meantime.

Perhaps we should use an empty enum to create a `Unicode` namespace and then nest the protocol within it via typealias. If we do that, we can consider names like `Unicode.Collection` or even `Unicode.String`, which would shadow existing types if they were top-level.
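
As a rough sketch of what I mean (every name here is hypothetical: `UnicodeSketch` stands in for the `Unicode` namespace and `_UnicodeText` for the protocol, and I'm assuming `String` and `Substring` are already collections, as planned), the protocol would be declared at the top level under a placeholder name and then surfaced through a typealias inside an empty enum:

```swift
// Hypothetical placeholder for the proposed protocol, declared at the
// top level under a name nobody is expected to type directly.
protocol _UnicodeText: BidirectionalCollection {}

extension String: _UnicodeText {}
extension Substring: _UnicodeText {}

// An enum with no cases cannot be instantiated, so it acts purely as a
// namespace. `UnicodeSketch` is an assumed name standing in for `Unicode`.
enum UnicodeSketch {
    typealias Collection = _UnicodeText
}

// The nested alias can be used as a generic constraint.
func elementCount<S: UnicodeSketch.Collection>(_ text: S) -> Int {
    return text.count
}
```

With that in place, `elementCount` accepts both a `String` and a `Substring` such as `"hello".dropFirst()`.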

If not, then given this:

> The exact nature of the protocol, such as which methods should be protocol requirements vs. which can be implemented as protocol extensions, is considered an implementation detail and so is not covered in this proposal.

We may simply want to wait to choose a name. As the protocol develops, we may discover a theme in its requirements which would suggest a good name. For instance, we may realize that the core of what the protocol abstracts is grouping code units into characters, which might suggest a name like `Characters`, or `Unicode.Characters`, or `CharacterCollection`, or what-have-you.

(By the way, I hope that the eventual protocol requirements will be put through the review process, if only as an amendment, once they're determined.)

> Unicode will conform to BidirectionalCollection. RangeReplaceableCollection conformance will be added directly onto the String and Substring types, as it is possible future Unicode-conforming types might not be range-replaceable (e.g. an immutable type that wraps a const char *).

I'm a little worried about this because it seems to imply that the protocol cannot include any mutation operations that aren't in `RangeReplaceableCollection`. For instance, it won't be possible to include an in-place `applyTransform` method in the protocol. Do you anticipate that being an issue? Might it be a good idea to define a parallel `Mutable` or `RangeReplaceable` protocol?
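
To make the "parallel protocol" idea concrete, here is a minimal sketch under my own assumptions. `TextSketch`, `MutableTextSketch`, and `transformInPlace` are all hypothetical names, and the default implementation simply rebuilds the value rather than doing anything clever:

```swift
// Hypothetical read-only protocol, standing in for the proposed one.
protocol TextSketch: BidirectionalCollection where Element == Character {}

// Hypothetical refinement that is allowed to declare mutating
// requirements beyond those in RangeReplaceableCollection.
protocol MutableTextSketch: TextSketch, RangeReplaceableCollection {
    mutating func transformInPlace(_ transform: (Character) -> Character)
}

extension MutableTextSketch {
    // Naive default: rebuild the whole value from the mapped characters.
    mutating func transformInPlace(_ transform: (Character) -> Character) {
        self = Self(self.map(transform))
    }
}

extension String: MutableTextSketch {}
extension Substring: MutableTextSketch {}
```

An immutable wrapper type could then conform to the read-only protocol alone, while `String` and `Substring` pick up the mutating requirements.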

> The C string interop methods will be updated to those described here: a single withCString operation and two init(cString:) constructors, one for UTF8 and one for arbitrary encodings.

Sorry if I'm repeating something that was already discussed, but is there a reason you don't include a `withCString` variant for arbitrary encodings? It seems like an odd asymmetry.
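
To illustrate the missing counterpart, here is a minimal sketch for a single fixed encoding, UTF-16. The method name `withUTF16CString` is hypothetical and not part of the pitch; a real design would presumably be generic over the encoding rather than hard-coding one:

```swift
extension String {
    // Hypothetical helper: passes a NUL-terminated UTF-16 buffer to `body`.
    // The pointer is only valid for the duration of the call.
    func withUTF16CString<Result>(
        _ body: (UnsafePointer<UInt16>) throws -> Result
    ) rethrows -> Result {
        var units = Array(self.utf16)
        units.append(0) // terminator, so the array is never empty
        return try units.withUnsafeBufferPointer { buffer in
            try body(buffer.baseAddress!)
        }
    }
}
```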

> The standard library currently lacks a Latin1 codec, so an enum Latin1: UnicodeEncoding type will be added.

Nice. I wrote one of those once; I'll enjoy deleting it.

> A new protocol, UnicodeEncoding, will be added to replace the current UnicodeCodec protocol:
> 
> public enum UnicodeParseResult<T, Index> {

Either `T` should be given a more specific name, or the enum should be given a less specific one, becoming `ParseResult` and being oriented towards incremental parsing of anything from any kind of collection.
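
For the second option, a sketch of what a general-purpose version might look like, mirroring the proposal's `valid` and `error` cases. The name `ParseResult`, the parameter name `Output`, and the toy `parseDigit` function are my own assumptions, used only to show the type working on something that isn't Unicode:

```swift
// Hypothetical generalization of the proposed enum.
enum ParseResult<Output, Index> {
    case valid(Output, resumptionPoint: Index)
    case error(resumptionPoint: Index)
}

// Toy parser: reads one ASCII decimal digit at `position`.
func parseDigit<C: Collection>(
    _ input: C, at position: C.Index
) -> ParseResult<Int, C.Index> where C.Element == UInt8 {
    precondition(position != input.endIndex, "nothing left to parse")
    let byte = input[position]
    let next = input.index(after: position)
    if byte >= 48 && byte <= 57 {          // ASCII "0"..."9"
        return .valid(Int(byte) - 48, resumptionPoint: next)
    }
    return .error(resumptionPoint: next)
}
```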

> /// Indicates valid input was recognized.
> ///
> /// `resumptionPoint` is the end of the parsed region
> case valid(T, resumptionPoint: Index)  // FIXME: should these be reordered?

No, I think this is the right order. The thing that's valid is the code point.

> /// Indicates invalid input was recognized.
> ///
> /// `resumptionPoint` is the next position at which to continue parsing after
> /// the invalid input is repaired.
> case error(resumptionPoint: Index)

I know this is abbreviated documentation, but I hope the full version includes a good usage example demonstrating, among other things, how to detect partial characters and defer processing of them instead of rejecting them as erroneous.
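
For instance, a caller feeding UTF-8 in chunks might first split off any incomplete trailing sequence and hold it back until more bytes arrive. The following is only a sketch of that idea, independent of the proposed API; `completeUTF8PrefixLength` is a hypothetical helper, and it deliberately leaves genuinely malformed bytes for the parser to report:

```swift
// Returns how many leading bytes can be parsed now; any bytes after that
// form an incomplete trailing sequence that should be deferred.
func completeUTF8PrefixLength(_ bytes: [UInt8]) -> Int {
    let count = bytes.count
    // An incomplete sequence is at most 3 bytes long, so its lead byte,
    // if there is one, is among the last 3 bytes.
    var index = count - 1
    while index >= 0 && count - index <= 3 {
        let byte = bytes[index]
        if byte & 0b1100_0000 != 0b1000_0000 {
            // Not a continuation byte: either ASCII or a lead byte.
            let expected: Int
            if byte < 0x80 { expected = 1 }
            else if byte & 0b1110_0000 == 0b1100_0000 { expected = 2 }
            else if byte & 0b1111_0000 == 0b1110_0000 { expected = 3 }
            else if byte & 0b1111_1000 == 0b1111_0000 { expected = 4 }
            else { return count } // invalid lead byte: not "partial"
            return count - index < expected ? index : count
        }
        index -= 1
    }
    return count
}
```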

> /// An encoding for text with UnicodeScalar as a common currency type
> public protocol UnicodeEncoding {
>   /// The maximum number of code units in an encoded unicode scalar value
>   static var maxLengthOfEncodedScalar: Int { get }
>   
>   /// A type that can represent a single UnicodeScalar as it is encoded in this
>   /// encoding.
>   associatedtype EncodedScalar : EncodedScalarProtocol

There's an `EncodedScalarProtocol`-shaped hole in this proposal. What does it do? What are its semantics? How does `EncodedScalar` relate to the old `CodeUnit`?

>   @discardableResult
>   public static func parseForward<C: Collection>(
>     _ input: C,
>     repairingIllFormedSequences makeRepairs: Bool = true,
>     into output: (EncodedScalar) throws -> Void
>   ) rethrows -> (remainder: C.SubSequence, errorCount: Int)
>   
>   @discardableResult    
>   public static func parseReverse<C: BidirectionalCollection>(
>     _ input: C,
>     repairingIllFormedSequences makeRepairs: Bool = true,
>     into output: (EncodedScalar) throws -> Void
>   ) rethrows -> (remainder: C.SubSequence, errorCount: Int)
>   where C.SubSequence : BidirectionalCollection,
>         C.SubSequence.SubSequence == C.SubSequence,
>         C.SubSequence.Iterator.Element == EncodedScalar.Iterator.Element
> }

Are there constraints missing on `parseForward`? Unlike `parseReverse`, it has no `where` clause relating `C`'s elements to `EncodedScalar`'s.

What do these do if `makeRepairs` is false? Would it be clearer if we made an enum that described the behaviors and changed the label to something like `ifIllFormed:`?
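
Here is a sketch of the enum-based spelling, written against the existing `UTF8` codec rather than the proposed API. The enum name, its two cases, and the `decodeUTF8` function are assumptions of mine; in particular, I'm guessing that "no repairs" means "stop at the first error", which is exactly the kind of thing the label should make obvious:

```swift
// Hypothetical policy enum replacing the Bool parameter.
enum IllFormedSequencePolicy {
    case repair // substitute U+FFFD and keep going
    case stop   // give up at the first ill-formed sequence
}

func decodeUTF8(
    _ bytes: [UInt8], ifIllFormed policy: IllFormedSequencePolicy
) -> (text: String, errorCount: Int) {
    var decoder = UTF8()
    var iterator = bytes.makeIterator()
    var text = ""
    var errorCount = 0
    decoding: while true {
        switch decoder.decode(&iterator) {
        case .scalarValue(let scalar):
            text.unicodeScalars.append(scalar)
        case .emptyInput:
            break decoding
        case .error:
            errorCount += 1
            if policy == .stop { break decoding }
            text.unicodeScalars.append("\u{FFFD}")
        }
    }
    return (text, errorCount)
}
```

Given the bytes 0x61 0xFF 0x62, `.repair` substitutes U+FFFD for the bad byte and continues, while `.stop` returns only the leading "a"; both report one error.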

> Due to the change in internal implementation, this means that these operations will be O(n) rather than O(1). This is not expected to be a major concern, based on experiences from a similar change made to Java, but projects will be able to work around performance issues without upgrading to Swift 4 by explicitly typing slices as Substring, which will call the Swift 4 variant, and which will be available but not invoked by default in Swift 3 mode.

Will there be a way to make this also work with a real Swift 3 compiler? For instance, can you define `typealias Substring = String` in such a way that real Swift 3 will parse and use it, but Swift 4 in Swift 3 mode will ignore it?

> This proposal does not yet introduce an implicit conversion from Substring to String. The decision on whether to add this will be deferred pending feedback on the initial implementation. The intention is to make a preview toolchain available for feedback, including on whether this implicit conversion is necessary, prior to the release of Swift 4.

This is a sensible approach.

Thank you for developing this into a full proposal. I discussed the plans for Swift 4 with a local group of programmers recently, and everyone was pleased to hear that `String` would get an overhaul, that the `characters` view would be integrated into the string, etc. We even talked a little about `Substring` and people thought it was a good idea. This proposal is shaping up to impact a lot of people, but in a good way!

-- 
Brent Royal-Gordon
Architechies