[swift-evolution] Strings in Swift 4

Dave Abrahams dabrahams at apple.com
Sat Jan 21 14:31:20 CST 2017



Sent from my iPad

On Jan 21, 2017, at 3:49 AM, Brent Royal-Gordon <brent at architechies.com> wrote:

>> On Jan 19, 2017, at 6:56 PM, Ben Cohen via swift-evolution <swift-evolution at swift.org> wrote:
>> 
>> Below is our take on a design manifesto for Strings in Swift 4 and beyond.
>> 
>> Probably best read in rendered markdown on GitHub:
>> https://github.com/apple/swift/blob/master/docs/StringManifesto.md
>> 
>> We’re eager to hear everyone’s thoughts.
> 
> There is so, so much good stuff here.

Right back atcha, Brent!  Thanks for the detailed review!

> I'm really looking forward to seeing how these ideas develop and enter the language.
> 
>> #### Future Directions
>> 
>> One of the most common internationalization errors is the unintentional
>> presentation to users of text that has not been localized, but regularizing APIs
>> and improving documentation can go only so far in preventing this error.
>> Combined with the fact that `String` operations are non-localized by default,
>> the environment for processing human-readable text may still be somewhat
>> error-prone in Swift 4.
>> 
>> For an audience of mostly non-experts, it is especially important that naïve
>> code is very likely to be correct if it compiles, and that more sophisticated
>> issues can be revealed progressively.  For this reason, we intend to
>> specifically and separately target localization and internationalization
>> problems in the Swift 5 timeframe.
> 
> I am very glad to see this statement in a Swift design document. I have a few ideas about this, but they can wait until the next version.
> 
>> At first blush this just adds work, but consider what it does
>> for equality: two strings that normalize the same, naturally, will collate the
>> same.  But also, *strings that normalize differently will always collate
>> differently*.  In other words, for equality, it is sufficient to compare the
>> strings' normalized forms and see if they are the same.  We can therefore
>> entirely skip the expensive part of collation for equality comparison.
>> 
>> Next, naturally, anything that applies to equality also applies to hashing: it
>> is sufficient to hash the string's normalized form, bypassing collation keys.
> 
> That's a great catch.
> 
>> This leaves us executing the full UCA *only* for localized sorting, and ICU's
>> implementation has apparently been very well optimized.
> 
> Sounds good to me.
> 
>> Because the current `Comparable` protocol expresses all comparisons with binary
>> operators, string comparisons—which may require
>> additional [options](#operations-with-options)—do not fit smoothly into the
>> existing syntax.  At the same time, we'd like to solve other problems with
>> comparison, as outlined
>> in
>> [this proposal](https://gist.github.com/CodaFi/f0347bd37f1c407bf7ea0c429ead380e)
>> (implemented by changes at the head
>> of
>> [this branch](https://github.com/CodaFi/swift/commits/space-the-final-frontier)).
>> We should adopt a modification of that proposal that uses a method rather than
>> an operator `<=>`:
>> 
>> ```swift
>> enum SortOrder { case before, same, after }
>> 
>> protocol Comparable : Equatable {
>>   func compared(to: Self) -> SortOrder
>>   ...
>> }
>> ```
>> 
>> This change will give us a syntactic platform on which to implement methods with
>> additional, defaulted arguments, thereby unifying and regularizing comparison
>> across the library.
>> 
>> ```swift
>> extension String {
>>   func compared(to: Self) -> SortOrder
>> }
>> ```
> 
> While it's great that `compared(to:case:etc.)` is parallel to `compared(to:)`, you don't actually want to *use* anything like `compared(to:)` if you can help it. Think about the clarity at the use site:
> 
>    if foo.compared(to: bar, case: .insensitive, locale: .current) == .before { … }

Right.  We intend to keep the usual comparison operators.

Poor readability of "foo <=> bar == .before" is another reason we think that giving up on "<=>" is no great loss.
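
Concretely: compared(to:) becomes the single customization point, and the familiar operators are defined once on top of it.  A minimal standalone sketch (renamed so it doesn't collide with today's Comparable):

    // SortOrder as in the manifesto.
    enum SortOrder { case before, same, after }

    protocol OrderedComparable : Equatable {
        func compared(to other: Self) -> SortOrder
    }

    extension OrderedComparable {
        // Each operator is defined exactly once, in terms of compared(to:).
        static func < (lhs: Self, rhs: Self) -> Bool { return lhs.compared(to: rhs) == .before }
        static func > (lhs: Self, rhs: Self) -> Bool { return lhs.compared(to: rhs) == .after }
        static func <= (lhs: Self, rhs: Self) -> Bool { return lhs.compared(to: rhs) != .after }
        static func >= (lhs: Self, rhs: Self) -> Bool { return lhs.compared(to: rhs) != .before }
    }

So "foo < bar" keeps reading the way it always has, and the option-taking compared(to:...) overloads layer on top of the same method.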

> The operands and sense of the comparison are kind of lost in all this garbage. You really want to see `foo < bar` in this code somewhere, but you don't.

Yeah, we thought about trying to build a DSL for that, but failed.  I think the best possible option would be something like:

  foo.comparison(case: .insensitive, locale: .current) < bar

The biggest problem is that you can build things like

    fu = foo.comparison(case: .insensitive, locale: .current)
    br = bar.comparison(case: .sensitive)
    fu < br // what does this mean?

We could even prevent such nonsense from compiling, but the cost in library API surface area is quite large.
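
To make the trade-off concrete, here's a rough, entirely hypothetical sketch of the wrapper approach (StringComparison and CaseSensitivity are made-up names, and lowercasing stands in for real locale-aware comparison):

    enum CaseSensitivity { case sensitive, insensitive }

    struct StringComparison {
        let base: String
        let sensitivity: CaseSensitivity
    }

    extension String {
        func comparison(case c: CaseSensitivity) -> StringComparison {
            return StringComparison(base: self, sensitivity: c)
        }
    }

    // One operand carries the options; the other is a plain String.
    func < (lhs: StringComparison, rhs: String) -> Bool {
        switch lhs.sensitivity {
        case .sensitive:   return lhs.base < rhs
        case .insensitive: return lhs.base.lowercased() < rhs.lowercased()
        }
    }
    func < (lhs: String, rhs: StringComparison) -> Bool {
        switch rhs.sensitivity {
        case .sensitive:   return lhs < rhs.base
        case .insensitive: return lhs.lowercased() < rhs.base.lowercased()
        }
    }

    // "foo".comparison(case: .insensitive) < "BAR"   // reads well
    // But once the wrapper is a first-class value, people will write
    //     let fu = "foo".comparison(case: .insensitive)
    //     let br = "bar".comparison(case: .sensitive)
    //     fu < br                                    // whose options win?
    // and either giving that a meaning or statically forbidding it, operator
    // by operator, is where the API surface area gets out of hand.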

> I'm struggling a little with the naming and syntax, but as a general approach, I think we want people to use something more like this:
> 
>    if StringOptions(case: .insensitive, locale: .current).compare(foo < bar) { … }

Yeah, we can't do that without making 

	let a = foo < bar

ambiguous.

> Which might have an implementation like:
> 
>    // This protocol might actually be part of your `Unicode` protocol; I'm just breaking it out separately here.
>    protocol StringOptionsComparable {
>        func compare(to: Self, options: StringOptions) -> SortOrder
>    }
>    extension StringOptionsComparable {
>        static func < (lhs: Self, rhs: Self) -> (lhs: Self, rhs: Self, op: (SortOrder) -> Bool) {
>            return (lhs, rhs, { $0 == .before })
>        }
>        static func == (lhs: Self, rhs: Self) -> (lhs: Self, rhs: Self, op: (SortOrder) -> Bool) {
>            return (lhs, rhs, { $0 == .same })
>        }
>        static func > (lhs: Self, rhs: Self) -> (lhs: Self, rhs: Self, op: (SortOrder) -> Bool) {
>            return (lhs, rhs, { $0 == .after })
>        }
>        // etc.
>    }
>    
>    struct StringOptions {
>        // Obvious properties and initializers go here
>        
>        func compare<StringType: StringOptionsComparable>(_ expression: (lhs: StringType, rhs: StringType, op: (SortOrder) -> Bool)) -> Bool {
>            return expression.op( expression.lhs.compare(to: expression.rhs, options: self) )
>        }
>    }
> 
> You could also imagine much less verbose syntaxes using custom operators. Strawman example:
> 
>    if foo < bar %% (case: .insensitive, locale: .current) { … }
> 
> I think this would make human-friendly comparisons much easier to write and understand than adding a bunch of options to a `compared(to:)` call.

That one has the same problem with ambiguity of "a < b".  There might be an answer here but it's not obvious and I feel solving it can wait a little.

>> This quirk aside, every aspect of strings-as-collections-of-graphemes appears to
>> comport perfectly with Unicode. We think the concatenation problem is tolerable,
>> because the cases where it occurs all represent partially-formed constructs. 
>> ...
>> Admitting these cases encourages exploration of grapheme composition and is
>> consistent with what appears to be an overall Unicode philosophy that “no
>> special provisions are made to get marginally better behavior for… cases that
>> never occur in practice.”[2]
> 
> This sounds good to me.
> 
>> ### Unification of Slicing Operations
> 
> I think you know what I think about this. :^)
> 
> (By the way, I've at least partially let this proposal drop for the moment because it's so dependent on generic subscripts to really be an improvement. I do plan to pick it up when those arrive; ping me then if I don't notice.)

Okeydoke.

> A question, though. We currently have a couple of methods, mostly with `subrange` in their names, that can be thought of as slicing operations but aren't:
> 
>    collection.removeSubrange(i..<j)
>    collection[i..<j].removeAll()
>    
>    collection.replaceSubrange(i..<j, with: others)
>    collection[i..<j].replaceAll(with: others)        // hypothetically
> 
> Should these be changed, too? Can we make them efficient (in terms of e.g. copy-on-write) if we do?

We could, once the ownership model is implemented.  However, I'm not sure whether it's enough of an improvement to be worth doing.  You could go all the way to

	collection[i..<j] = EmptyCollection()
	collection[i..<j] = others

But for that we'd need to (at least) introduce write-only subscripts.
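
The closest approximation today is a get/set subscript that funnels through replaceSubrange; the sketch below shows both why that works and why it falls short (the setter's type is forced to match the getter's SubSequence, rather than accepting any collection):

    extension RangeReplaceableCollection {
        subscript(replacing bounds: Range<Index>) -> SubSequence {
            get { return self[bounds] }
            set { replaceSubrange(bounds, with: newValue) }
        }
    }

    var xs = Array(1...10)
    xs[replacing: 2..<5] = []              // same effect as xs.removeSubrange(2..<5)
    xs[replacing: 2..<4] = [42, 43][...]   // same effect as xs.replaceSubrange(2..<4, with: [42, 43])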

>> ### Substrings
>> 
>> When implementing substring slicing, languages are faced with three options:
>> 
>> 1. Make the substrings the same type as string, and share storage.
>> 2. Make the substrings the same type as string, and copy storage when making the substring.
>> 3. Make substrings a different type, with a storage copy on conversion to string.
>> 
>> We think number 3 is the best choice.
> 
> I agree, and I think `Substring` is the right name for it: parallel to `SubSequence`, explains where it comes from, captures the trade-offs nicely. `StringSlice` is parallel to `ArraySlice`, but it strikes me as a "foolish consistency", as the saying goes; it avoids a term of art for little reason I can see.
> 
> However, is there a reason we're talking about using a separate `Substring` type at all, instead of using `Slice<String>`?

Yes: we couldn't specialize its representation to store short substrings inline, at least not without introducing an undesirable level of complexity.
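
(To illustrate: a dedicated type is free to use a representation like the sketch below, while Slice<String> is pinned to a base-plus-bounds pair.  The names and sizes are purely illustrative.)

    enum SubstringStorage {
        // Short substrings copied into the value itself: no reference to,
        // and no retain of, the original String's storage.
        case small(count: UInt8, bytes: (UInt64, UInt64))
        // Longer substrings share the original storage plus a range.
        case slice(base: String, bounds: Range<String.Index>)
    }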
 
> Perhaps I'm missing something, but I *think* it does everything we need here. (Of course, you could say the same thing about `ArraySlice`, and yet we have that, too.)

ArraySlice is doomed :-)

https://bugs.swift.org/browse/SR-3631

>> The downside of having two types is the inconvenience of sometimes having a
>> `Substring` when you need a `String`, and vice-versa. It is likely this would
>> be a significantly bigger problem than with `Array` and `ArraySlice`, as
>> slicing of `String` is such a common operation. It is especially relevant to
>> existing code that assumes `String` is the currency type. To ease the pain of
>> type mismatches, `Substring` should be a subtype of `String` in the same way
>> that `Int` is a subtype of `Optional<Int>`.
> 
> I've seen people struggle with the `Array`/`ArraySlice` issue when writing recursive algorithms, so personally, I'd like to see a more general solution that handles all `Collection`s.

The more general solution is "extend Unicode" or "extend Collection" (and when a String parameter is needed, "make your method generic over Collection/Unicode").
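
For example, something like this sketch runs unchanged on String, Substring, Array, and ArraySlice, because the recursion happens on the SubSequence:

    // Write the algorithm once, generically; the String/Substring (or
    // Array/ArraySlice) mismatch never comes up.
    func countRuns<C: Collection>(_ c: C) -> Int where C.Element: Equatable {
        guard let first = c.first else { return 0 }
        let rest = c.drop(while: { $0 == first })   // rest is C.SubSequence
        return 1 + countRuns(rest)
    }

    print(countRuns("aabbaa"))              // 3
    print(countRuns("aabbaa".dropFirst(2))) // 2 -- same function, no copy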

> Rather than having an implicit copying conversion from `String` to `Substring` (or `Array` to `ArraySlice`, or `Collection` to `Collection.SubSequence`), I wonder if implicitly converting in the other direction might be more useful, at least in some circumstances. Converting in this direction does *not* involve an implicit copy, merely calculating a range, so you won't have the same performance surprises. On the other hand, it's also useful in fewer situations.

That's the problem, right there, combined with the fact that we don't have a terse syntax like s[] for going the other way.  I think it would be a much more elegant design, personally, but I don't see the tradeoffs working out.  If we can come up with a way to do it that works, we should.  So far, Ben and I have failed.

> (If we did go with consistently using `Slice<T>`, this might merely be a special-cased `T -> Slice<T>` conversion. One type, special-cased until we feel comfortable inventing a general mechanism.)
> 
>> A user who needs to optimize away copies altogether should use this guideline:
>> if for performance reasons you are tempted to add a `Range` argument to your
>> method as well as a `String` to avoid unnecessary copies, you should instead
>> use `Substring`.
> 
> I do like this as a guideline, though. There's definitely room in the standard library for "a string and a range of that string to operate upon".

I don't know what you mean.  It's our intention that nothing but the lowest level operations (e.g. replaceRange) would work on ranges when they could instead be working on slices.

>> ##### The “Empty Subscript”
>> 
>> To make it easy to call such an optimized API when you only have a `String` (or
>> to call any API that takes a `Collection`'s `SubSequence` when all you have is
>> the `Collection`), we propose the following “empty subscript” operation,
>> 
>> ```swift
>> extension Collection {
>>   subscript() -> SubSequence {
>>     return self[startIndex..<endIndex]
>>   }
>> }
>> ```
>> 
>> which allows the following usage:
>> 
>> ```swift
>> funcThatIsJustLooking(at: person.name[]) // pass person.name as Substring
>> ```
> 
> That's a little bit funky, but I guess it might work.
> 
>> Therefore, APIs that operate on an `NSString`/`NSRange` pair should be imported
>> without the `NSRange` argument.  The Objective-C importer should be changed to
>> give these APIs special treatment so that when a `Substring` is passed, instead
>> of being converted to a `String`, the full `NSString` and range are passed to
>> the Objective-C method, thereby avoiding a copy.
>> 
>> As a result, you would never need to pass an `NSRange` to these APIs, which
>> solves the impedance problem by eliminating the argument, resulting in more
>> idiomatic Swift code while retaining the performance benefit.  To help users
>> manually handle any cases that remain, Foundation should be augmented to allow
>> the following syntax for converting to and from `NSRange`:
>> 
>> ```swift
>> let nsr = NSRange(i..<j, in: s) // An NSRange corresponding to s[i..<j]
>> let iToJ = Range(nsr, in: s)    // Equivalent to i..<j
>> ```
> 
> I sort of like this, but note that if we use `String` -> `Substring` conversion instead of the other way around, there's less magic needed to get this effect: `NSString, NSRange` can be imported as `Substring`, which automatically converts from `String` in exactly the manner we want it to.

Indeed.

> 
>> Since Unicode conformance is a key feature of string processing in Swift, we
>> call that protocol `Unicode`:
> 
> I'm sorry, I think the name is too clever by half. It sounds something like what `UnicodeCodec` actually is. Or maybe a type representing a version of the Unicode standard or something. I'd prefer something more prosaic like `StringProtocol`.

It's an option we considered.  So far I think Unicode is better (most especially if we end up with a "facade" design) but we should discuss it.
 
> 
>> **Note:** `Unicode` would make a fantastic namespace for much of
>> what's in this proposal if we could get the ability to nest types and
>> protocols in protocols.
> 
> I mean, sure, but then you imagine it being used generically:
> 
>    func parse<UnicodeType: Unicode>(_ source: UnicodeType) -> UnicodeType
>    // which concrete types can `source` be???

All "string" types, including String, Substring, UTF8String, StaticString, etc.

>> We should provide convenient APIs processing strings by character.  For example,
>> it should be easy to cleanly express, “if this string starts with `"f"`, process
>> the rest of the string as follows…”  Swift is well-suited to expressing this
>> common pattern beautifully, but we need to add the APIs.  Here are two examples
>> of the sort of code that might be possible given such APIs:
>> 
>> ```swift
>> if let firstLetter = input.droppingPrefix(alphabeticCharacter) {
>> somethingWith(input) // process the rest of input
>> }
>> 
>> if let (number, restOfInput) = input.parsingPrefix(Int.self) {
>>  ...
>> }
>> ```
>> 
>> The specific spelling and functionality of APIs like this are TBD.  The larger
>> point is to make sure matching-and-consuming jobs are well-supported.
> 
> Yes.
> 
>> #### Unified Pattern Matcher Protocol
>> 
>> Many of the current methods that do matching are overloaded to do the same
>> logical operations in different ways, with the following axes:
>> 
>> - Logical Operation: `find`, `split`, `replace`, match at start
>> - Kind of pattern: `CharacterSet`, `String`, a regex, a closure
>> - Options, e.g. case/diacritic sensitivity, locale.  Sometimes a part of
>> the method name, and sometimes an argument
>> - Whole string or subrange.
>> 
>> We should represent these aspects as orthogonal, composable components,
>> abstracting pattern matchers into a protocol like
>> [this one](https://github.com/apple/swift/blob/master/test/Prototypes/PatternMatching.swift#L33),
>> that can allow us to define logical operations once, without introducing
>> overloads, and massively reducing API surface area.
> 
> *Very* yes.
> 
>> For example, using the strawman prefix `%` syntax to turn string literals into
>> patterns, the following pairs would all invoke the same generic methods:
>> 
>> ```swift
>> if let found = s.firstMatch(%"searchString") { ... }
>> if let found = s.firstMatch(someRegex) { ... }
>> 
>> for m in s.allMatches((%"searchString"), case: .insensitive) { ... }
>> for m in s.allMatches(someRegex) { ... }
>> 
>> let items = s.split(separatedBy: ", ")
>> let tokens = s.split(separatedBy: CharacterSet.whitespace)
>> ```
> 
> Very, *very* yes.
> 
> If we do this, rather than your `%` operator (or whatever it becomes), I wonder if we can have these extensions:
> 
>    // Assuming a protocol like:
>    protocol Pattern {
>        associatedtype PatternElement
>        func matches<CollectionType: Collection>(…) -> … where CollectionType.Element == PatternElement
>    }
>    extension Equatable: Pattern {
>        typealias PatternElement = Self
>    }
>    extension Collection: Pattern where Element: Equatable {
>        typealias PatternElement = Element
>    }
> 
> ...although then `Collection` would conform to `Pattern` through both itself and (conditionally) `Equatable`. Hmm.
> 
> I suppose we faced this same problem elsewhere and ended up with things like:
> 
>    mutating func append(_ element: Element)
>    mutating func append<Seq: Sequence>(contentsOf seq: Seq) where Seq.Iterator.Element == Element
> 
> So we could do things like:
> 
>    str.firstMatch("x")    // single element, so this is a Character
>    str.firstMatch(contentsOf("xy"))
>    str.firstMatch(anyOf(["x", "y"] as Set))

I really, really want to explore these ideas further, and I really, really don't want to do it in this thread, if you don't mind.  There are lots of ways to slice this particular cupcake.

> 
>> #### Index Interchange Among Views
> 
> I really, really, really want this.
> 
>> We think random-access
>> *code-unit storage* is a reasonable requirement to impose on all `String`
>> instances.
> 
> Wait, you do? Doesn't that mean either using UTF-32, inventing a UTF-24 to use, or using some kind of complicated side table that adjusts for all the multi-unit characters in a UTF-16 or UTF-8 string? None of these sound ideal.

No; I'm not sure why you would think that.

>> Index interchange between `String` and its `unicodeScalars`, `codeUnits`,
>> and [`extendedASCII`](#parsing-ascii-structure) views can be made entirely
>> seamless by having them share an index type (semantics of indexing a `String`
>> between grapheme cluster boundaries are TBD—it can either trap or be forgiving).
> 
> I think it should be forgiving, and I think it should be forgiving in a very specific way: It should treat indexing in the middle of a cluster as though you indexed at the beginning.

That's my intuition as well.

> The reason is `AttributedString`. You can think of `AttributedString` as being a type which adds additional views to a `String`; these views are indexed by `String.Index`, just like `String`, `String.UnicodeScalarView`, et al., and advancing an index with these views advances it to the beginning of the next run. But you can also just subscript these views with an arbitrary index in the middle of a run, and it'll work correctly.
> 
> I think it would be useful for this behavior to be consistent among all `String` views.
> 
>> Having a common index allows easy traversal into the interior of graphemes,
>> something that is often needed, without making it likely that someone will do it
>> by accident.
>> 
>> - `String.index(after:)` should advance to the next grapheme, even when the
>>  index points partway through a grapheme.
>> 
>> - `String.index(before:)` should move to the start of the grapheme before
>>  the current position.
> 
> Good.
> 
>> Seamless index interchange between `String` and its UTF-8 or UTF-16 views is not
>> crucial, as the specifics of encoding should not be a concern for most use
>> cases, and would impose needless costs on the indices of other views.
> 
> I don't know about this, at least for the UTF-16 view. Here's why:
> 
>> That leaves the interchange of bare indices with Cocoa APIs trafficking in
>> `Int`.  Hopefully such APIs will be rare, but when needed, the following
>> extension, which would be useful for all `Collections`, can help:
>> 
>> ```swift
>> extension Collection {
>>   func index(offset: IndexDistance) -> Index {
>>     return index(startIndex, offsetBy: offset)
>>   }
>>   func offset(of i: Index) -> IndexDistance {
>>     return distance(from: startIndex, to: i)
>>   }
>> }
>> ```
>> 
>> Then integers can easily be translated into offsets into a `String`'s `utf16`
>> view for consumption by Cocoa:
>> 
>> ```swift
>> let cocoaIndex = s.utf16.offset(of: String.UTF16Index(i))
>> let swiftIndex = s.utf16.index(offset: cocoaIndex)
>> ```
> 
> I worry that this conversion will be too obscure.

I very much hope it will be rare enough that it'll be OK, but if it isn't, we can always have

	let cocoaIndex = s.utf16Offset(of: i)

and/or take other measures to simplify it.
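
e.g. a sketch of that convenience, spelled with what the utf16 view already gives us (assuming the shared index type discussed above; the names are just the ones I used a moment ago):

    extension String {
        func utf16Offset(of i: Index) -> Int {
            return utf16.distance(from: utf16.startIndex, to: i)
        }
        func index(utf16Offset n: Int) -> Index {
            return utf16.index(utf16.startIndex, offsetBy: n)
        }
    }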

> In Objective-C, you don't really think very much about what "character" means; it's just an index that points to a location inside the string. I don't think people will know to use the `utf16` view instead of the others—especially the plain `String` version, which would be the most obvious one to use.
> 
> I think I'd prefer to see the following:
> 
> 1. UTF-16 is the storage format, at least for an "ordinary" `Swift.String`.

It will be, in the common case, but many people seem to want plain String to be able to store UTF-8, and I'm not yet prepared to rule that out.

> 2. `String.Index` is used down to the `UTF16View`. It stores a UTF-16 offset.
> 
> 3. With just the standard library imported, `String.Index` does not have any obvious way to convert to or from an `Int` offset; you use `index(_:offsetBy:)` on one of the views. `utf16`'s implementation is just faster than the others.

This is roughly where we are today.

> 4. Foundation adds `init(_:)` methods to `String.Index` and `Int`, as well as `Range<String.Index>` and `NSRange`, which perform mutual conversions:
> 
>    XCTAssertEqual(Int(String.Index(cocoaIndex)), cocoaIndex)
>    XCTAssertEqual(NSRange(Range<String.Index>(cocoaRange)), cocoaRange)
> 
> I think this would really help to guide people to the right APIs for the task.
> 
> (Also, it would make my `AttributedString` thing work better, too.)
> 
>> ### Formatting
> 
> Briefly: I am, let's say, 95% on board with your plan to replace format strings with interpolation and format methods. The remaining 5% concern is that we'll need an adequate replacement for the ability to load a format string dynamically and have it reorder or alter the formatting of interpolated values. Obviously dynamic format strings are dangerous and limited, but where you *can* use them, they're invaluable.

Yes.  We have ideas, though they're far from baked.

>> #### String Interpolation
>> 
>> Swift string interpolation provides a user-friendly alternative to printf's
>> domain-specific language (just write ordinary Swift code!) and its type safety
>> problems (put the data right where it belongs!) but the following issues prevent
>> it from being useful for localized formatting (among other jobs):
>> 
>> * [SR-2303](https://bugs.swift.org/browse/SR-2303) We are unable to restrict
>>   types used in string interpolation.
>> * [SR-1260](https://bugs.swift.org/browse/SR-1260) String interpolation can't
>>   distinguish (fragments of) the base string from the string substitutions.
> 
> If I find some copious free time, I could try to develop proposals for one or both of these. Would there be interest in them at this point? (Feel free to contact me off-list about this, preferably in a new thread.)
> 
> (Okay, one random thought, because I can't resist: Perhaps the "\(…)" syntax can be translated directly into an `init(…)` on the type you're creating. That is, you can write:
> 
>    let x: MyString = "foo \(bar) baz \(quux, radix: 16)"
> 
> And that translates to:
> 
>    let x = MyString(stringInterpolationSegments:
>        MyString(stringLiteral: "foo "),
>        MyString(bar),
>        MyString(stringLiteral: " baz "),
>        MyString(quux, radix: 16)
>    )
> 
> That would require you to redeclare `String` initializers on your own string type, but you probably need some of your own logic anyway, right?)

Let's go to a separate thread for this, as you suggested.

>> In the long run, we should improve Swift string interpolation to the point where
>> it can participate in most any formatting job.  Mostly this centers around
>> fixing the interpolation protocols per the previous item, and supporting
>> localization.
> 
> For what it's worth, by using a hacky workaround for SR-1260, I've written (Swift 2.0) code that passes strings with interpolations through the Foundation localized string tables: <https://gist.github.com/brentdax/79fa038c0af0cafb52dd> Obviously that's just a start, but it is incredibly convenient.

I know; it's an inspiration :-)

>> ### C String Interop
>> 
>> Our support for interoperation with nul-terminated C strings is scattered and
>> incoherent, with 6 ways to transform a C string into a `String` and four ways to
>> do the inverse.  These APIs should be replaced with the following
> 
> These APIs are much better than the status quo, but it's a shame that we can't have them handle non-nul-terminated data, too.

We thought about unifying them with other transcoding APIs, but the pointer-to-nul-terminated-code-units case is sufficiently important that we think they deserve dedicated support.

> Actually... (Begin shaggy dog story...)
> 
> Suppose you introduce an `UnsafeNulTerminatedBufferPointer` type. Then you could write a *very* high-level API which handles pretty much every conversion under the sun:
> 
>    extension String {
>        /// Constructs a `String` from a sequence of `codeUnits` in an indicated `encoding`.
>        /// 
>        /// - Parameter codeUnits: A sequence of code units in the given `encoding`.
>        /// - Parameter encoding: The encoding the code units are in.
>        init<CodeUnits: Sequence, Encoding: UnicodeEncoding>(_ codeUnits: CodeUnits, encoding: Encoding)
>            where CodeUnits.Iterator.Element == Encoding.CodeUnit
>    }

Yes, we intend to support something like that.

> For UTF-8, at least, that would cover reading from `Array`, `UnsafeBufferPointer`, `UnsafeRawBufferPointer`, `UnsafeNulTerminatedBufferPointer`, `Data`, you name it. Maybe we could have a second one that always takes something producing bytes, no matter the encoding used:
> 
>    extension String {
>        /// Constructs a `String` from the code units contained in `bytes` in a given `encoding`.
>        /// 
>        /// - Parameter bytes: A sequence of bytes expressing code units in the given `encoding`.
>        /// - Parameter encoding: The encoding the code units are in.
>    init<Bytes: Sequence, Encoding: UnicodeEncoding>(_ bytes: Bytes, encoding: Encoding)
>            where Bytes.Iterator.Element == UInt8
>    }
> 
> These two initializers would replace...um, something like eight existing ones, including ones from Foundation. On the other hand, this is *very* generic. And, unless we actually changed the way `char *` imported to `UnsafeNulTerminatedBufferPointer<CChar>`, the C string call sequence would be pretty complicated:
> 
>    String(UnsafeNulTerminatedBufferPointer(start: cString), encoding: UTF8.self)
> 
> So you might end up having to wrap it in an `init(cString:)` anyway, just for convenience. Oh well, it was worth exploring.

I think you ended up where we did.

> Prototype of the above: https://gist.github.com/brentdax/8b71f46b424dc64abaa77f18556e607b
> 
> (Hmm...maybe bridge `char *` to a type like this instead?
> 
>    struct CCharPointer {
>        var baseAddress: UnsafePointer<CChar> { get }
>        var nulTerminated: UnsafeNulTerminatedBufferPointer<CChar> { get }
>        func ofLength(_ length: Int) -> UnsafeBufferPointer<CChar>
>    }
> 
> Nah, probably not gonna happen...)
> 
>> init(cString nulTerminatedUTF8: UnsafePointer<CChar>)
> 
> By the way, I just noticed an impedance mismatch in current Swift: `CChar` is usually an `Int8`, but `UnicodeScalar` and `UTF8` currently want `UInt8`. It'd be nice to address this somehow, if only by adding some signed variants or something.

We thought about that problem and landed on the proposed interface above as all that is needed to resolve it.
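
e.g. with the CChar-taking initializer, calling straight into a C API needs no Int8/UInt8 juggling at all:

    import Darwin  // Glibc on Linux

    // getenv hands back an UnsafeMutablePointer<CChar>?, which goes straight
    // into init(cString:) with no signedness conversion.
    if let home = getenv("HOME") {
        print(String(cString: home))
    }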

> 
>> ### High-Performance String Processing
>> 
>> Many strings are short enough to store in 64 bits, many can be stored using only
>> 8 bits per unicode scalar, others are best encoded in UTF-16, and some come to
>> us already in some other encoding, such as UTF-8, that would be costly to
>> translate.  Supporting these formats while maintaining usability for
>> general-purpose APIs demands that a single `String` type can be backed by many
>> different representations.
> 
> Just putting a pin in this, because I'll want to discuss it a little later.
> 
>> ### Parsing ASCII Structure
>> 
>> Although many machine-readable formats support the inclusion of arbitrary
>> Unicode text, it is also common that their fundamental structure lies entirely
>> within the ASCII subset (JSON, YAML, many XML formats).  These formats are often
>> processed most efficiently by recognizing ASCII structural elements as ASCII,
>> and capturing the arbitrary sections between them in more-general strings.  The
>> current String API offers no way to efficiently recognize ASCII and skip past
>> everything else without the overhead of full decoding into unicode scalars.
>> 
>> For these purposes, strings should supply an `extendedASCII` view that is a
>> collection of `UInt32`, where values less than `0x80` represent the
>> corresponding ASCII character, and other values represent data that is specific
>> to the underlying encoding of the string.
> 
> This sounds interesting, but:
> 
> 1. It doesn't sound like you anticipate there being any way to compare an element of the `extendedASCII` view to a character literal. That seems like it'd be really useful.

We don't have character literals :-)

However, I agree that there needs to be a way to do it.  The thing would be to make it easy to construct a UInt8 from a string literal.  There are a few possibilities; I'm a little nervous about making this work:

	if c == "X" { ... }

but maybe I should just get over it.  The cleanest alternative I can think of is:

	if c == ascii("X") { ... }

where "X" is required by the compiler to be a single ascii character.

I guess another possibility is to introduce an ASCII type and overload operators so it can be compared with all the Ints:

	if c == "X" as ASCII { ... }
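
A sketch of that last option, with a runtime precondition standing in for the compiler enforcement I'd really want:

    struct ASCII : ExpressibleByUnicodeScalarLiteral {
        let value: UInt8
        init(unicodeScalarLiteral scalar: UnicodeScalar) {
            precondition(scalar.isASCII, "not an ASCII character")
            value = UInt8(scalar.value)
        }
    }

    // Overloads for whichever integer type the view hands back.
    func == (lhs: UInt8, rhs: ASCII) -> Bool { return lhs == rhs.value }
    func == (lhs: UInt32, rhs: ASCII) -> Bool { return lhs == UInt32(rhs.value) }

    // if c == "X" as ASCII { ... }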

> 2. I don't really understand how you envision using the "data specific to the underlying encoding" sections. Presumably you'll want to convert that data into a string eventually, right?

It already is in a string.  The point is that we have a way to scan the string looking for ASCII patterns without transcoding it.

> Do you have pseudocode or something lying around that might help us understand how you think this might be used?

Not exactly.  The pattern matching prototype you referred to earlier would be enhanced to use the extendedASCII view when it was available and the pattern being matched was suitably restricted.  How, exactly, that works is still a bit of a research project though.
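
Very roughly, though, the kind of loop the view is meant to enable looks like this (entirely hypothetical, since extendedASCII doesn't exist yet, and it ignores escape sequences):

    // Find the top-level commas of a JSON-ish array without decoding
    // anything outside the ASCII structure.
    func topLevelCommas(in s: String) -> [Int] {
        var result: [Int] = []
        var depth = 0
        var inQuotes = false
        for (offset, u) in s.extendedASCII.enumerated() {
            guard u < 0x80 else { continue }   // non-ASCII: structurally inert, skip
            switch UInt8(u) {
            case UInt8(ascii: "\""):
                inQuotes = !inQuotes
            case UInt8(ascii: "["), UInt8(ascii: "{"):
                if !inQuotes { depth += 1 }
            case UInt8(ascii: "]"), UInt8(ascii: "}"):
                if !inQuotes { depth -= 1 }
            case UInt8(ascii: ","):
                if !inQuotes && depth == 1 { result.append(offset) }
            default:
                break
            }
        }
        return result
    }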

>> ### Do we need a type-erasable base protocol for UnicodeEncoding?
>> 
>> UnicodeEncoding has an associated type, but it may be important to be able to
>> traffic in completely dynamic encoding values, e.g. for “tell me the most
>> efficient encoding for this string.”
> 
> As long as you're here, we haven't talked about `UnicodeEncoding` much. I assume this is a slightly modified version of `UnicodeCodec`? Anything to say about it?

That's basically right.  You can see a first cut at it in the unicode-rethink branch on GitHub.

> If it *is* similar to `UnicodeCodec`, one thing I will note is that the way `UnicodeCodec` works in code units is rather annoying for I/O. It may make sense to have some sort of type-erasing wrapper around `UnicodeCodec` which always uses bytes. (You then have to worry about endianness, of course...)

Take a look at the branch and let me know how this looks like it would work for I/O.

By the way, I think I/O really needs a special kind of collection: a sort of deque built out of I/O buffer-sized chunks that are filled on demand from a Sequence.  That is part, at least, of how I justify UnicodeEncoding having a Collection-based interface where UnicodeCodec used Iterator.

> 
>> ### Should there be a string “facade?”
>> 
>> An interesting variation on this design is possible if defaulted generic
>> parameters are introduced to the language:
>> 
>> ```swift
>> struct String<U: Unicode = StringStorage>
>>   : BidirectionalCollection {
>> 
>>   // ...APIs for high-level string processing here...
>> 
>>   var unicode: U // access to lower-level unicode details
>> }
>> 
>> typealias Substring = String<StringStorage.SubSequence>
>> ```
> 
> I think this is a very, very interesting idea. A few notes:
> 
> * Earlier, I said I didn't like `Unicode` as a protocol name. If we go this route, I think `StringStorage` is a good name for that protocol. The default storage might be something like `UTF16StringStorage`, or just, you know, `DefaultStringStorage`.
> 
> * Earlier, you mentioned the tension between using multiple representations for flexibility and pinning down one representation for speed. One way to handle this might be to have `String`'s default `StringStorage` be a superclass or type-erased wrapper or something.

Yes, that's the idea.

> That way, if you just write `String`, you get something flexible; if you write `String<NFCNormalizedUTF16StringStorage>`, you get something fast.

This only works in the "facade" variant where you have a defaulted generic parameter feature, but yes, that's the idea of that variant.

> * Could `NSString` be a `StringStorage`, or support a trivial wrapper that converts it into a `StringStorage`? Would that be helpful at all?

Yes, that's part of the idea.

> * If we do this, does `String.Index` become a type-specific thing? That is, might `String<UTF8Storage>.Index` be different from `String<UTF16Storage>.Index`?

Yes.

> What does that mean for `String.Index` unification?

Not much.  We never intended for indices to be interchangeable among different specific string types (other than a string and its SubSequence).

>> ### `description` and `debugDescription`
>> 
>> * Should these be creating localized or non-localized representations?
> 
> `debugDescription`, I think, is non-localized; it's something helpful for the programmer, and the programmer's language is not the user's. It's also usually something you don't want to put *too* much effort into, other than to dump a lot of data about the instance.
> 
> `description` would have to change to be localizable. (Specifically, it would have to take a locale.) This is doable, of course, but it hasn't been done yet.

Well, it could use the current locale.  These things are supposed to remain lightweight.

>> * Is returning a `String` efficient enough?
> 
> I'm not sure how important efficiency is for `description`, honestly.

It depends how intimately this is tied into interpolation and formatting, I think.

>> * Is `debugDescription` pulling the weight of the API surface area it adds?
> 
> Maybe? Or maybe it's better off as part of the `Mirror` instead of a property on the instance itself.

That's a very interesting thought!

>> ### `StaticString`
>> 
>> `StaticString` was added as a byproduct of standard library development and kept
>> around because it seemed useful, but it was never truly *designed* for client
>> programmers.  We need to decide what happens with it.  Presumably *something*
>> should fill its role, and that should conform to `Unicode`.
> 
> Maybe. One complication there is that `Unicode` presumably supports mutation, which `StaticString` doesn't.

No, Unicode doesn't support mutation.  A mutable Unicode will usually conform to Unicode and RangeReplaceableCollection (but not MutableCollection, because replacing a grapheme is not an O(1) operation).
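
The shape I have in mind is roughly the sketch below, with an array of Character standing in for real code-unit storage (and plain RangeReplaceableCollection standing in for Unicode, which doesn't exist yet):

    struct Graphemes : RangeReplaceableCollection {
        private var storage: [Character] = []
        init() {}

        // Collection
        var startIndex: Int { return storage.startIndex }
        var endIndex: Int { return storage.endIndex }
        func index(after i: Int) -> Int { return i + 1 }
        subscript(i: Int) -> Character { return storage[i] }   // get-only on purpose

        // RangeReplaceableCollection: every mutation funnels through here and
        // is allowed to change the length.  There's deliberately no
        // MutableCollection setter, because overwriting one Character with
        // another isn't O(1) on real string storage.
        mutating func replaceSubrange<C: Collection>(_ target: Range<Int>, with newElements: C)
            where C.Element == Character {
            storage.replaceSubrange(target, with: newElements)
        }
    }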

> Another possibility I've discussed in the past is renaming `StaticString` to `StringLiteral` and using it largely as a way to initialize `String`. (I mentioned that in a thread about the need for public integer and floating-point literal types that are more expressive now that we're supporting larger integer/float types.)

Yes, a broad redesign of all literals is crucial.  However, there are other sources of static string data than literals and those need to be accommodated.

> It could have just enough API surface to access it as a buffer of UTF-8 bytes and thereby build a `String` or `Data` from it.
> 
> Well, that's it for this massive email. You guys are doing a hell of a job on this.

Thanks for all the feedback, and the encouragement!

> Hope this helps,
> -- 
> Brent Royal-Gordon
> Architechies
> 