[swift-evolution] [Pitch] String revision proposal #1

Zach Waldowski zach at waldowski.me
Thu Mar 30 10:45:52 CDT 2017

Loving it so far.

`encode` and `parseScalar[Forward|Backward]` feel asymmetric. What's
wrong with `decode[Forward|Backward]`?

`UnicodeParseResult<T, Index>` really feels like it could/should be
defined as `UnicodeEncoding.ParseResult<Index>` (or `DecodeResult`,
given the above). I can't remember if that generics limitation was
being lifted?


  Zachary Waldowski

  zach at waldowski.me

On Wed, Mar 29, 2017, at 08:32 PM, Ben Cohen via swift-evolution wrote:
> Hi Swift Evolution,


> Below is a pitch for the first part of the String revision. This
> covers a number of changes that would allow the basic internals to be
> overhauled.

> Online version here:
> https://github.com/airspeedswift/swift-evolution/blob/3a822c799011ace682712532cfabfe32e9203fbb/proposals/0161-StringRevision1.md


> String Revision: Collection Conformance, C Interop, Transcoding

>  * Proposal: SE-0161
>  * Authors: Ben Cohen[1], Dave Abrahams[2]
>  * Review Manager: TBD
>  * Status: *Awaiting review*

> Introduction

> This proposal is to implement a subset of the changes from the Swift 4
> String Manifesto[3].
> Specifically:

>  * Make String conform to BidirectionalCollection
>  * Make String conform to RangeReplaceableCollection
>  * Create a Substring type for String.SubSequence
>  * Create a Unicode protocol to allow for generic operations over both
>    types.
>  * Consolidate on a concise set of C interop methods.
>  * Revise the transcoding infrastructure.

> Other existing aspects of String remain unchanged for the purposes of
> this proposal.

> Motivation

> This proposal follows up on a number of recommendations found in the
> manifesto:

> Collection conformance was dropped from String in Swift 2. After
> reevaluation, the feeling is that the minor semantic discrepancies
> (mainly with RangeReplaceableCollection) are outweighed by the
> significant benefits of restoring these conformances. For more detail
> on the reasoning, see here[4].

> While it is not a collection, the Swift 3 string does have slicing
> operations. String is currently serving as its own subsequence,
> allowing substrings to share storage with their “owner”. This can lead
> to memory leaks when small substrings of larger strings are stored
> long-term (see here[5] for more detail on this problem). Introducing a
> separate type of Substring to serve as String.SubSequence is
> recommended to resolve this issue, in a similar fashion to ArraySlice.

> As noted in the manifesto, support for interoperation with
> nul-terminated C strings in Swift 3 is scattered and incoherent, with
> six ways to transform a C string into a String and four ways to do the
> inverse. These APIs should be replaced with a simpler set of methods
> on String.

> Proposed solution

> A new type, Substring, will be introduced. Similar to ArraySlice it
> will be documented as only for short- to medium-term storage:
>> *Important*

>> Long-term storage of Substring instances is discouraged. A substring
>> holds a reference to the entire storage of a larger string, not just
>> to the portion it presents, even after the original string’s lifetime
>> ends. Long-term storage of a substring may therefore prolong the
>> lifetime of elements that are no longer otherwise accessible, which
>> can appear to be memory leakage.
> Aside from minor differences, such as having a SubSequence of Self and
> a larger size to describe the range of the subsequence, Substring will
> be near-identical from a user perspective.
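
A minimal sketch of the storage relationship described above, written against Swift as it later shipped rather than against the pitch text. It assumes Swift 4.2 or later for `firstIndex(of:)`, and the variable names are illustrative only.

```swift
let text = "Hello, Swift Evolution"

// Slicing yields a Substring that shares storage with `text`.
let comma = text.firstIndex(of: ",")!
let head = text[..<comma]

// Converting to String copies the characters, so `owned` does not
// keep the storage of the larger string alive.
let owned = String(head)

print(type(of: head), owned)
```

Holding on to `owned` rather than `head` is the pattern the *Important* note above points toward for long-term storage.
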
> In order to be able to write extensions across both String and
> Substring, a new Unicode protocol to which the two types will conform
> will be introduced. For the purposes of this proposal, Unicode will be
> defined as a protocol to be used whenever you would previously extend
> String. It should be possible to substitute extension Unicode { ... }
> in Swift 4 wherever extension String { ... } was written in Swift 3,
> with one exception: any passing of self into an API that takes a
> concrete String will need to be rewritten as String(self). If Self is
> a String then this should effectively optimize to a no-op, whereas if
> Self is a Substring then this will force a copy, helping to avoid the
> “memory leak” problems described above.
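
A sketch of the substitution described above. In the Swift releases that followed, the protocol pitched here as Unicode is named StringProtocol, so the example uses that name in order to compile. `shout` and `shouted` are hypothetical helpers, not part of the proposal.

```swift
// Stand-in for any existing API that takes a concrete String.
func shout(_ s: String) -> String {
    return s.uppercased() + "!"
}

extension StringProtocol {
    // Written once, available on both String and Substring.
    func shouted() -> String {
        // Passing `self` to an API that takes a concrete String needs
        // an explicit String(self), as the pitch describes.
        return shout(String(self))
    }
}

let greeting = "hello, world"
let slice = greeting.prefix(5)   // a Substring

print(greeting.shouted())
print(slice.shouted())
```
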
> The exact nature of the protocol, such as which methods should be
> protocol requirements and which can be implemented as protocol
> extensions, is considered an implementation detail and so is not
> covered in this proposal.
> Unicode will conform to BidirectionalCollection.
> RangeReplaceableCollection conformance will be added directly onto the
> String and Substring types, as it is possible future Unicode-
> conforming types might not be range-replaceable (e.g. an immutable
> type that wraps a const char *).
> The C string interop methods will be updated to those described
> here[6]: a single withCString operation and two init(cString:)
> constructors, one for UTF8 and one for arbitrary encodings. The
> primary change is to remove “non-repairing” variants of construction
> from nul-terminated C strings. In both of the construction APIs, any
> invalid encoding sequence detected will have its longest valid prefix
> replaced by U+FFFD, the Unicode replacement character, per the Unicode
> specification. This covers the common case. The replacement is done
> physically in the underlying storage and the validity of the result is
> recorded in the String’s encoding such that future accesses need not
> be slowed down by possible error repair separately. Construction that
> is aborted when encoding errors are detected can be accomplished using
> APIs on the encoding.
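
A small sketch of the repairing behaviour and of withCString, using initializers that exist in the standard library today. The byte values are made up for illustration, and the example uses the `UnsafePointer<UInt8>` overload of `init(cString:)` so the ill-formed byte can be written without a `CChar` conversion.

```swift
// "Hi", then 0xFF (never valid in UTF-8), then the terminating NUL.
let bytes: [UInt8] = [0x48, 0x69, 0xFF, 0x00]

// The repairing initializer substitutes U+FFFD for the ill-formed byte.
let repaired = bytes.withUnsafeBufferPointer { buffer in
    String(cString: buffer.baseAddress!)
}

// withCString exposes a NUL-terminated UTF-8 view of the string for
// the duration of the closure; here the code units are counted by hand.
let length = "abc".withCString { pointer -> Int in
    var n = 0
    while pointer[n] != 0 { n += 1 }
    return n
}

print(repaired, length)
```
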
> The current transcoding support will be updated to improve usability
> and performance. The primary changes will be:

>  * to allow transcoding directly from one encoding to another without
>    having to triangulate through an intermediate scalar value
>  * to add the ability to transcode an input collection in reverse,
>    allowing the different views on String to be made bi-directional
>  * to have decoding take a collection rather than an iterator, and
>    return an index of its progress into the source, allowing that
>    method to be static
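
For context on the last bullet, this is roughly what decoding looks like with the existing UnicodeCodec API: it needs a codec instance and a mutable iterator, and it reports no position in the source. `decodeUTF8` is a hypothetical helper name, and the manual U+FFFD repair is just one way a caller might handle errors.

```swift
// Decode UTF-8 code units with the current iterator-based API.
func decodeUTF8(_ bytes: [UInt8]) -> (text: String, errorCount: Int) {
    var decoder = UTF8()                 // decoding is an instance method
    var iterator = bytes.makeIterator()  // and consumes a mutable iterator
    var text = ""
    var errorCount = 0
    while true {
        switch decoder.decode(&iterator) {
        case .scalarValue(let scalar):
            text.unicodeScalars.append(scalar)
        case .error:
            // Repair by hand with the replacement character.
            text.unicodeScalars.append("\u{FFFD}")
            errorCount += 1
        case .emptyInput:
            return (text, errorCount)
        }
    }
}

let ok = decodeUTF8([0x63, 0x61, 0x66, 0xC3, 0xA9])
let bad = decodeUTF8([0x48, 0xFF, 0x69])
print(ok, bad)
```
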
> The standard library currently lacks a Latin1 codec, so an enum Latin1:
> UnicodeEncoding type will be added.

> Detailed design

> The following additions will be made to the standard library:

>     protocol Unicode: BidirectionalCollection {
>       // Implementation detail as described above
>     }

>     extension String: Unicode, RangeReplaceableCollection {
>       typealias SubSequence = Substring
>     }

>     struct Substring: Unicode, RangeReplaceableCollection {
>       typealias SubSequence = Substring
>       // near-identical API surface area to String
>     }
> The subscript operations on String will be amended to return
> Substring:

>     struct String {
>       subscript(bounds: Range<String.Index>) -> Substring { get }
>       subscript(bounds: ClosedRange<String.Index>) -> Substring { get }
>     }
> Note that properties or methods that due to their nature create new
> String storage (such as lowercased()) will *not* change.
> C string interop will be consolidated on the following methods:

>     extension String {
>       /// Constructs a `String` having the same contents as
>       /// `nulTerminatedUTF8`.
>       ///
>       /// - Parameter nulTerminatedUTF8: a sequence of contiguous UTF-8
>       ///   encoded bytes ending just before the first zero byte (NUL
>       ///   character).
>       init(cString nulTerminatedUTF8: UnsafePointer<CChar>)
>
>       /// Constructs a `String` having the same contents as
>       /// `nulTerminatedCodeUnits`.
>       ///
>       /// - Parameter nulTerminatedCodeUnits: a sequence of contiguous code
>       ///   units in the given `encoding`, ending just before the first
>       ///   zero code unit.
>       /// - Parameter encoding: describes the encoding in which the code
>       ///   units should be interpreted.
>       init<Encoding: UnicodeEncoding>(
>         cString nulTerminatedCodeUnits: UnsafePointer<Encoding.CodeUnit>,
>         encoding: Encoding)
>
>       /// Invokes the given closure on the contents of the string,
>       /// represented as a pointer to a null-terminated sequence of UTF-8
>       /// code units.
>       func withCString<Result>(
>         _ body: (UnsafePointer<CChar>) throws -> Result) rethrows -> Result
>     }
> Additionally, the current ability to pass a Swift String into C
> methods that take a C string will remain as-is.
> A new protocol, UnicodeEncoding, will be added to replace the current
> UnicodeCodec protocol:

>     public enum UnicodeParseResult<T, Index> {
>       /// Indicates valid input was recognized.
>       ///
>       /// `resumptionPoint` is the end of the parsed region
>       case valid(T, resumptionPoint: Index)  // FIXME: should these be reordered?
>       /// Indicates invalid input was recognized.
>       ///
>       /// `resumptionPoint` is the next position at which to continue
>       /// parsing after the invalid input is repaired.
>       case error(resumptionPoint: Index)
>
>       /// Indicates that there was no more input to consume.
>       case emptyInput
>
>       /// If any input was consumed, the point from which to continue
>       /// parsing.
>       var resumptionPoint: Index? {
>         switch self {
>         case .valid(_, let r): return r
>         case .error(let r): return r
>         case .emptyInput: return nil
>         }
>       }
>     }
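
To make the three cases concrete, here is a toy sketch of how a caller might drive a result type of this shape. The enum is redeclared locally from the pitch so the block stands alone. `parseASCIIForward` and `decodeASCII` are hypothetical, handle ASCII only, and are not the proposed implementation.

```swift
enum UnicodeParseResult<T, Index> {
    case valid(T, resumptionPoint: Index)
    case error(resumptionPoint: Index)
    case emptyInput
}

// Hypothetical single-scalar parser: one byte below 0x80 is one scalar,
// anything else is reported as an error one code unit long.
func parseASCIIForward<C: Collection>(
    _ input: C
) -> UnicodeParseResult<UnicodeScalar, C.Index> where C.Element == UInt8 {
    guard let byte = input.first else { return .emptyInput }
    let next = input.index(after: input.startIndex)
    if byte < 0x80 {
        return .valid(UnicodeScalar(byte), resumptionPoint: next)
    }
    return .error(resumptionPoint: next)
}

// Drives the parser to the end of the input, repairing errors with U+FFFD.
func decodeASCII(_ bytes: [UInt8]) -> (text: String, errorCount: Int) {
    var text = ""
    var errorCount = 0
    var rest = bytes[...]
    while true {
        switch parseASCIIForward(rest) {
        case .valid(let scalar, let next):
            text.unicodeScalars.append(scalar)
            rest = rest[next...]
        case .error(let next):
            text.unicodeScalars.append("\u{FFFD}")
            errorCount += 1
            rest = rest[next...]
        case .emptyInput:
            return (text, errorCount)
        }
    }
}

let clean = decodeASCII([0x48, 0x69])
let dirty = decodeASCII([0x48, 0xFF, 0x69])
print(clean, dirty)
```

Because the parser takes a collection and hands back an index, it holds no state between calls, which is what lets the pitched methods be static.
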

>     /// An encoding for text with UnicodeScalar as a common currency type
>     public protocol UnicodeEncoding {
>       /// The maximum number of code units in an encoded unicode scalar
>       /// value
>       static var maxLengthOfEncodedScalar: Int { get }
>
>       /// A type that can represent a single UnicodeScalar as it is
>       /// encoded in this encoding.
>       associatedtype EncodedScalar : EncodedScalarProtocol
>
>       /// Produces a scalar of this encoding if possible; returns `nil`
>       /// otherwise.
>       static func encode<Scalar: EncodedScalarProtocol>(
>         _:Scalar) -> Self.EncodedScalar?
>
>       /// Parse a single unicode scalar forward from `input`.
>       ///
>       /// - Parameter knownCount: a number of code units known to exist in
>       ///   `input`.
>       ///   **Note:** passing a known compile-time constant is strongly
>       ///   advised, even if it's zero.
>       static func parseScalarForward<C: Collection>(
>         _ input: C, knownCount: Int /* = 0, via extension */
>       ) -> ParseResult<EncodedScalar, C.Index>
>       where C.Iterator.Element == EncodedScalar.Iterator.Element
>
>       /// Parse a single unicode scalar in reverse from `input`.
>       ///
>       /// - Parameter knownCount: a number of code units known to exist in
>       ///   `input`.
>       ///   **Note:** passing a known compile-time constant is strongly
>       ///   advised, even if it's zero.
>       static func parseScalarReverse<C: BidirectionalCollection>(
>         _ input: C, knownCount: Int /* = 0 , via extension */
>       ) -> ParseResult<EncodedScalar, C.Index>
>       where C.Iterator.Element == EncodedScalar.Iterator.Element
>     }

>     /// Parsing multiple unicode scalar values
>     extension UnicodeEncoding {
>       @discardableResult
>       public static func parseForward<C: Collection>(
>         _ input: C,
>         repairingIllFormedSequences makeRepairs: Bool = true,
>         into output: (EncodedScalar) throws->Void
>       ) rethrows -> (remainder: C.SubSequence, errorCount: Int)
>
>       @discardableResult
>       public static func parseReverse<C: BidirectionalCollection>(
>         _ input: C,
>         repairingIllFormedSequences makeRepairs: Bool = true,
>         into output: (EncodedScalar) throws->Void
>       ) rethrows -> (remainder: C.SubSequence, errorCount: Int)
>       where C.SubSequence : BidirectionalCollection,
>             C.SubSequence.SubSequence == C.SubSequence,
>             C.SubSequence.Iterator.Element == EncodedScalar.Iterator.Element
>     }
> UnicodeCodec will be updated to refine UnicodeEncoding, and all
> existing codecs will conform to it.

> Note, depending on whether this change lands before or after some of
> the generics features, generic where clauses may need to be added
> temporarily.

> Source compatibility

> Adding collection conformance to String should not materially impact
> source stability as it is purely additive: Swift 3’s String interface
> currently fulfills all of the requirements for a bidirectional range
> replaceable collection.
> Altering String’s slicing operations to return a different type is
> source breaking. The following mitigating steps are proposed:

>  * Add a deprecated subscript operator that will run in Swift 3
>    compatibility mode and which will return a String not a Substring.

>  * Add deprecated versions of all current slicing methods to similarly
>    return a String.
> i.e.:

>     extension String {
>       @available(swift, obsoleted: 4)
>       subscript(bounds: Range<Index>) -> String {
>         return String(characters[bounds])
>       }
>
>       @available(swift, obsoleted: 4)
>       subscript(bounds: ClosedRange<Index>) -> String {
>         return String(characters[bounds])
>       }
>     }
> In a review of 77 popular Swift projects found on GitHub, these
> changes resolved any build issues in the 12 projects that assumed an
> explicit String type returned from slicing operations.
> Due to the change in internal implementation, these operations will
> be *O(n)* rather than *O(1)*. This is not expected to be a major
> concern, based on experiences from a similar change made to Java, but
> projects will be able to work around performance issues without
> upgrading to Swift 4 by explicitly typing slices as Substring. Doing
> so will call the Swift 4 variant, which will be available but not
> invoked by default in Swift 3 mode.
> The C string interoperability methods outside the ones described in
> the detailed design will remain in Swift 3 mode, be deprecated in
> Swift 4 mode, and be removed in a subsequent release. UnicodeCodec
> will be similarly deprecated.

> Effect on ABI stability

> As a fundamental currency type for Swift, it is essential that the
> String type (and its associated subsequence) is in a good long-
> term state before being locked down when Swift declares ABI
> stability. Shrinking the size of String to be 64 bits is an
> important part of this.

> Effect on API resilience

> Decisions about the API resilience of the String type are still to be
> determined, but are not adversely affected by this proposal.

> Alternatives considered

> For a more in-depth discussion of some of the trade-offs in string
> design, see the manifesto and associated evolution thread[7].
> This proposal does not yet introduce an implicit conversion from
> Substring to String. The decision on whether to add this will be
> deferred pending feedback on the initial implementation. The intention
> is to make a preview toolchain available for feedback, including on
> whether this implicit conversion is necessary, prior to the release of
> Swift 4.
> Several of the types related to String, such as the encodings, would
> ideally reside inside a namespace rather than live at the top level of
> the standard library. The best namespace for this is probably Unicode,
> but this is also the name of the protocol. At some point if we gain
> the ability to nest enums and types inside protocols, they should be
> moved there. Putting them inside String or some other enum namespace
> is probably not worthwhile in the meantime.


  1. https://github.com/airspeedswift
  2. http://github.com/dabrahams/
  3. https://github.com/apple/swift/blob/master/docs/StringManifesto.md
  4. https://github.com/apple/swift/blob/master/docs/StringManifesto.md#string-should-be-a-collection-of-characters-again
  5. https://github.com/apple/swift/blob/master/docs/StringManifesto.md#substrings
  6. https://github.com/apple/swift/blob/master/docs/StringManifesto.md#c-string-interop
  7. https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20170116/thread.html#30497