[swift-evolution] [Pitch] String revision proposal #1

Ben Cohen ben_cohen at apple.com
Thu Mar 30 16:36:35 CDT 2017


Hi Brent,

Thanks for the notes. Replies inline.

> On Mar 30, 2017, at 2:48 AM, Brent Royal-Gordon <brent at architechies.com> wrote:
> 
>> On Mar 29, 2017, at 5:32 PM, Ben Cohen via swift-evolution <swift-evolution at swift.org> wrote:
>> 
>> Hi Swift Evolution,
>> 
>> Below is a pitch for the first part of the String revision. This covers a number of changes that would allow the basic internals to be overhauled.
>> 
>> Online version here: https://github.com/airspeedswift/swift-evolution/blob/3a822c799011ace682712532cfabfe32e9203fbb/proposals/0161-StringRevision1.md
> 
> Really great stuff, guys. Thanks for your work on this!
> 
>> In order to be able to write extensions across both String and Substring, a new Unicode protocol to which the two types will conform will be introduced. For the purposes of this proposal, Unicode will be defined as a protocol to be used whenever you would previously extend String. It should be possible to substitute extension Unicode { ... } in Swift 4 wherever extension String { ... } was written in Swift 3, with one exception: any passing of self into an API that takes a concrete String will need to be rewritten as String(self). If Self is a String then this should effectively optimize to a no-op, whereas if Self is a Substring then this will force a copy, helping to avoid the “memory leak” problems described above.
> 
> I continue to feel that `Unicode` is the wrong name for this protocol, essentially because it sounds like a protocol for, say, a version of Unicode or some kind of encoding machinery instead of a Unicode string. I won't rehash that argument since I made it already in the manifesto thread, but I would like to make a couple new suggestions in this area.
> 
> Later on, you note that it would be nice to namespace many of these types:
> 
>> Several of the types related to String, such as the encodings, would ideally reside inside a namespace rather than live at the top level of the standard library. The best namespace for this is probably Unicode, but this is also the name of the protocol. At some point if we gain the ability to nest enums and types inside protocols, they should be moved there. Putting them inside String or some other enum namespace is probably not worthwhile in the mean-time.
> 
> Perhaps we should use an empty enum to create a `Unicode` namespace and then nest the protocol within it via typealias. If we do that, we can consider names like `Unicode.Collection` or even `Unicode.String` which would shadow existing types if they were top-level.
> 

We’re a bit on the fence about whether Unicode or StringProtocol is the better name.

The big win for Unicode is that it is short. We want to encourage people to write their extensions on this protocol, and we want people who previously extended String to feel very comfortable extending Unicode. It also helps emphasize how important the Unicode-ness of Swift.String is. I like the idea of Unicode.Collection, but it is a little intimidating, and making the protocol even a tiny bit intimidating worries me from an adoption perspective.
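
To make the "extend the protocol instead of String" idea concrete, here is a minimal sketch. It spells the protocol StringProtocol, the second candidate name above and the spelling that current Swift toolchains accept; `isBlank` and `characterCount(of:)` are hypothetical names invented for the illustration, not part of the proposal.

```swift
// Sketch only: `StringProtocol` stands in for the protocol under discussion.
// `isBlank` and `characterCount(of:)` are hypothetical helpers.
extension StringProtocol {
    /// True when the text is empty or holds only spaces, tabs, or newlines.
    var isBlank: Bool {
        for character in self {
            if character != " " && character != "\t" && character != "\n" {
                return false
            }
        }
        return true
    }
}

/// An API that takes a concrete String.
func characterCount(of text: String) -> Int {
    return text.count
}

let whole = "hello   "
let slice = whole.dropFirst(5)      // a Substring sharing storage with `whole`

let wholeIsBlank = whole.isBlank    // the extension is available on String
let sliceIsBlank = slice.isBlank    // and on Substring

// Passing a slice to a concrete-String API needs an explicit conversion,
// which copies the slice's contents into a fresh String.
let sliceLength = characterCount(of: String(slice))
```

The one extension serves both types, and the only place the slice has to be copied is the explicit String(slice) at the boundary with the concrete-String API.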


> If not, then given this:
> 
>> The exact nature of the protocol – such as which methods should be protocol requirements vs which can be implemented as protocol extensions, are considered implementation details and so not covered in this proposal.
> 
> We may simply want to wait to choose a name. As the protocol develops, we may discover a theme in its requirements which would suggest a good name. For instance, we may realize that the core of what the protocol abstracts is grouping code units into characters, which might suggest a name like `Characters`, or `Unicode.Characters`, or `CharacterCollection`, or what-have-you.
> 
> (By the way, I hope that the eventual protocol requirements will be put through the review process, if only as an amendment, once they're determined.)
> 

Definitely. We just want to minimize churn on the list so that as many people as possible can follow the discussion of the broader principles. Once it’s firmed up and we’ve had implementation/usability/performance feedback, we’ll be back.

>> Unicode will conform to BidirectionalCollection. RangeReplaceableCollection conformance will be added directly onto the String and Substring types, as it is possible future Unicode-conforming types might not be range-replaceable (e.g. an immutable type that wraps a const char *).
> 
> I'm a little worried about this because it seems to imply that the protocol cannot include any mutation operations that aren't in `RangeReplaceableCollection`. For instance, it won't be possible to include an in-place `applyTransform` method in the protocol. Do you anticipate that being an issue? Might it be a good idea to define a parallel `Mutable` or `RangeReplaceable` protocol?
> 

You can always assign to self in the general implementation, then provide more efficient implementations where Self is a RangeReplaceableCollection. We do this elsewhere in the std lib with collections, e.g. https://github.com/apple/swift/blob/master/stdlib/public/core/Collection.swift#L1277.
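
As a rough sketch of that pattern (not the proposed protocol itself): `MiniText`, `FixedText`, and `removeVowels()` below are hypothetical names made up for the example. The unconstrained extension builds a new value and assigns it to self, and the constrained extension overrides it with an in-place edit for range-replaceable types.

```swift
// Hypothetical stand-in for the protocol under discussion.
protocol MiniText: BidirectionalCollection where Element == Character {
    init<S: Sequence>(_ elements: S) where S.Element == Character
}

let vowels: Set<Character> = ["a", "e", "i", "o", "u"]

extension MiniText {
    /// General version: build a new value and assign it to self.
    mutating func removeVowels() {
        self = Self(self.filter { !vowels.contains($0) })
    }
}

extension MiniText where Self: RangeReplaceableCollection {
    /// More efficient version for range-replaceable types: edit in place.
    mutating func removeVowels() {
        self.removeAll(where: { vowels.contains($0) })
    }
}

extension String: MiniText {}

/// A hypothetical text type that is not range-replaceable.
struct FixedText: MiniText {
    private let storage: [Character]
    init<S: Sequence>(_ elements: S) where S.Element == Character {
        storage = Array(elements)
    }
    var startIndex: Int { return storage.startIndex }
    var endIndex: Int { return storage.endIndex }
    subscript(position: Int) -> Character { return storage[position] }
    func index(after i: Int) -> Int { return i + 1 }
    func index(before i: Int) -> Int { return i - 1 }
}

var mutableText = "swift evolution"
mutableText.removeVowels()      // String picks the in-place overload

var fixedText = FixedText("unicode")
fixedText.removeVowels()        // FixedText falls back to assigning to self
```

Callers see a single mutating method either way, so no parallel Mutable or RangeReplaceable flavor of the protocol is needed for this kind of operation.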

Proliferating protocol combinations is problematic (looking at you, BidirectionalMutableRandomAccessSlice).

>> The C string interop methods will be updated to those described here: a single withCString operation and two init(cString:) constructors, one for UTF8 and one for arbitrary encodings.
> 
> Sorry if I'm repeating something that was already discussed, but is there a reason you don't include a `withCString` variant for arbitrary encodings? It seems like an odd asymmetry.
> 

Hmm. Is this a common use-case people have? Symmetry for the sake of it doesn’t seem enough. If uncommon, you can do it via an Array that you nul-terminate manually.
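
For reference, the manual route looks roughly like this. It is a sketch using UTF-16 as the example encoding; the round trip through String(decodingCString:as:) is only there to show that the buffer is a well-formed NUL-terminated string, where a real caller would pass the pointer to a C API instead.

```swift
let text = "héllo"

// Encode as UTF-16 and add the terminating NUL code unit by hand.
var utf16CString = Array(text.utf16)
utf16CString.append(0)

// The buffer pointer is only valid inside the closure.
let roundTripped: String = utf16CString.withUnsafeBufferPointer { buffer in
    // `baseAddress` is non-nil because the array holds at least the terminator.
    // A real caller would hand this pointer to a C API expecting UTF-16.
    return String(decodingCString: buffer.baseAddress!, as: UTF16.self)
}
```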

>> The standard library currently lacks a Latin1 codec, so an enum Latin1: UnicodeEncoding type will be added.
> 
> Nice. I wrote one of those once; I'll enjoy deleting it.
> 
>> A new protocol, UnicodeEncoding, will be added to replace the current UnicodeCodec protocol:
>> 
>> public enum UnicodeParseResult<T, Index> {
> 
> Either `T` should be given a more specific name, or the enum should be given a less specific one, becoming `ParseResult` and being oriented towards incremental parsing of anything from any kind of collection.
> 

Yeah, it’s tempting to make ParseResult general. The only reason we held off is that we don’t want the work of making sure it’s generally useful to become a distraction.

As a rule, T is as good as any other name when any alternative (say, “Value”) would be vacuous or tortured. Even with the enum being specific to Unicode, there isn’t really a good other name for it.

(For an example elsewhere in the stdlib, we use T for min<T: Comparable>(_ x: T, _ y: T) -> T. Trying to force in Value or MyComparable or SomeComparableThing wouldn’t be helpful.)

>> /// Indicates valid input was recognized.
>> ///
>> /// `resumptionPoint` is the end of the parsed region
>> case valid(T, resumptionPoint: Index)  // FIXME: should these be reordered?
> 
> No, I think this is the right order. The thing that's valid is the code point.
> 

Oops, I meant to delete that FIXME for the purposes of the proposal!

>> /// Indicates invalid input was recognized.
>> ///
>> /// `resumptionPoint` is the next position at which to continue parsing after
>> /// the invalid input is repaired.
>> case error(resumptionPoint: Index)
> 
> I know this is abbreviated documentation, but I hope the full version includes a good usage example demonstrating, among other things, how to detect partial characters and defer processing of them instead of rejecting them as erroneous.
> 

This documentation should definitely happen as part of the fuller implementation, yes.
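
In the meantime, here is a toy sketch of how a caller might drive a result type of this shape. Everything in it is an assumption made for illustration: `ParseResult` is a local copy of the two cases quoted above plus an assumed `emptyInput` case, and `parseOneASCII` and `decodeASCII` are hypothetical helpers that treat every byte of 0x80 or above as an error. It does not attempt the partial-character handling you mention.

```swift
// Local copy of the quoted cases; `emptyInput` is an assumed extra case.
enum ParseResult<T, Index> {
    case valid(T, resumptionPoint: Index)
    case error(resumptionPoint: Index)
    case emptyInput
}

/// Hypothetical toy parser: bytes below 0x80 are ASCII scalars, the rest are errors.
func parseOneASCII(_ input: [UInt8], from start: Int) -> ParseResult<Unicode.Scalar, Int> {
    guard start < input.endIndex else { return .emptyInput }
    let byte = input[start]
    if byte < 0x80 {
        return .valid(Unicode.Scalar(byte), resumptionPoint: start + 1)
    } else {
        return .error(resumptionPoint: start + 1)
    }
}

/// Decodes the whole input, substituting U+FFFD for each error.
func decodeASCII(_ input: [UInt8]) -> (text: String, errorCount: Int) {
    var scalars = String.UnicodeScalarView()
    var errorCount = 0
    var position = input.startIndex
    parsing: while true {
        switch parseOneASCII(input, from: position) {
        case .valid(let scalar, let next):
            scalars.append(scalar)
            position = next
        case .error(let next):
            scalars.append("\u{FFFD}")   // repair, then resume after the bad byte
            errorCount += 1
            position = next
        case .emptyInput:
            break parsing
        }
    }
    return (String(scalars), errorCount)
}

let decoded = decodeASCII([0x48, 0x69, 0xFF, 0x21])   // "H", "i", invalid, "!"
```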

>> /// An encoding for text with UnicodeScalar as a common currency type
>> public protocol UnicodeEncoding {
>>  /// The maximum number of code units in an encoded unicode scalar value
>>  static var maxLengthOfEncodedScalar: Int { get }
>> 
>>  /// A type that can represent a single UnicodeScalar as it is encoded in this
>>  /// encoding.
>>  associatedtype EncodedScalar : EncodedScalarProtocol
> 
> There's an `EncodedScalarProtocol`-shaped hole in this proposal. What does it do? What are its semantics? How does `EncodedScalar` relate to the old `CodeUnit`?
> 

Ah, yes. Here it is:

public protocol EncodedScalarProtocol : RandomAccessCollection {
  init?(_ scalarValue: UnicodeScalar)
  var utf8: UTF8.EncodedScalar { get }
  var utf16: UTF16.EncodedScalar { get }
  var utf32: UTF32.EncodedScalar { get }
}

This is only really here as a (possibly premature) optimization – a fast path to go from very common encodings of scalars to another without having to turn them into a scalar and back. It doesn’t relate to much else.

>>  @discardableResult
>>  public static func parseForward<C: Collection>(
>>    _ input: C,
>>    repairingIllFormedSequences makeRepairs: Bool = true,
>>    into output: (EncodedScalar) throws->Void
>>  ) rethrows -> (remainder: C.SubSequence, errorCount: Int)
>> 
>>  @discardableResult    
>>  public static func parseReverse<C: BidirectionalCollection>(
>>    _ input: C,
>>    repairingIllFormedSequences makeRepairs: Bool = true,
>>    into output: (EncodedScalar) throws->Void
>>  ) rethrows -> (remainder: C.SubSequence, errorCount: Int)
>>  where C.SubSequence : BidirectionalCollection,
>>        C.SubSequence.SubSequence == C.SubSequence,
>>        C.SubSequence.Iterator.Element == EncodedScalar.Iterator.Element
>> }
> 
> Are there constraints missing on `parseForward`?
> 

Yep – see the note that appears a little later. They’re really implementation details – so not something to capture in the proposal – which may or may not be needed depending on whether this lands before or after the generics features that make them redundant.

> What do these do if `makeRepairs` is false? Would it be clearer if we made an enum that described the behaviors and changed the label to something like `ifIllFormed:`?
> 

The Unicode standard specifies what to substitute when making repairs: ill-formed sequences are replaced with U+FFFD REPLACEMENT CHARACTER.
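
To illustrate the two behaviors with APIs that already exist in the standard library (this is not the proposed parseForward, just a sketch of repairing versus stopping at ill-formed input): String(decoding:as:) always repairs, while the free function transcode can be told to stop at the first error and report it.

```swift
let illFormed: [UInt8] = [0x61, 0xFF, 0x62]   // "a", an invalid byte, "b"

// Repairing decode: the invalid byte becomes U+FFFD.
let repaired = String(decoding: illFormed, as: UTF8.self)

// Non-repairing decode: stop at the first error and report that one occurred.
var codeUnits: [UInt32] = []
let hadError = transcode(
    illFormed.makeIterator(),
    from: UTF8.self,
    to: UTF32.self,
    stoppingOnError: true,
    into: { codeUnits.append($0) }
)
```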

>> Due to the change in internal implementation, this means that these operations will be O(n) rather than O(1). This is not expected to be a major concern, based on experiences from a similar change made to Java, but projects will be able to work around performance issues without upgrading to Swift 4 by explicitly typing slices as Substring, which will call the Swift 4 variant, and which will be available but not invoked by default in Swift 3 mode.
> 
> Will there be a way to make this also work with a real Swift 3 compiler? For instance, can you define `typealias Substring = String` in such a way that real Swift 3 will parse and use it, but Swift 4 in Swift 3 mode will ignore it?
> 

Are you talking about this as a way for people to change their code while still being able to compile it with the old compiler? Yes, that might be a good strategy. I will think about that.

>> This proposal does not yet introduce an implicit conversion from Substring to String. The decision on whether to add this will be deferred pending feedback on the initial implementation. The intention is to make a preview toolchain available for feedback, including on whether this implicit conversion is necessary, prior to the release of Swift 4.
> 
> This is a sensible approach.
> 
> Thank you for developing this into a full proposal. I discussed the plans for Swift 4 with a local group of programmers recently, and everyone was pleased to hear that `String` would get an overhaul, that the `characters` view would be integrated into the string, etc. We even talked a little about `Substring` and people thought it was a good idea. This proposal is shaping up to impact a lot of people, but in a good way!
> 

This is good to hear, including the last part, thanks.

> -- 
> Brent Royal-Gordon
> Architechies
> 


