[swift-evolution] Strings in Swift 4
Xiaodi Wu
xiaodi.wu at gmail.com
Sat Jan 21 18:22:31 CST 2017
(Top-replying because Google Inbox.)
You mentioned a syntax like `let a = ascii("X")`. You also mentioned this
idea of a facade or currency type. Was any consideration given to making
String an enum?
Freehanding on a phone (not even close to being valid Swift, but hopefully
conveys the gist of what I'm saying):
```
enum String {
  typealias Index = <whatever we decide on code unit indices>
  case ascii(ASCIIString)
  case utf8(UTF8String)
  case utf16(UTF16String)
  //...
  case slice(Substring)
}
extension String {
  subscript(_ r: Range<Index>) -> String {
    return .slice(Substring(_storage: self._storage, _range: r))
  }
}
extension ASCIIString : StringProtocol, Unicode { ... }
// etc.
extension String : StringProtocol, Unicode {
  // forward to underlying type where appropriate
}
```
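To make the gist above a bit more concrete, here is a minimal sketch that should compile in current Swift. The names `Text`, `codeUnitCount`, and `string` are made up for illustration, plain arrays stand in for `ASCIIString`/`UTF8String`/`UTF16String`, and the slice case is left out. It only shows the "enum as facade, forward to the payload" shape, not how the standard library would do it.

```swift
// Hypothetical facade: one public type, several storage representations.
enum Text {
    case ascii([UInt8])    // invariant: every byte is below 0x80
    case utf8([UInt8])
    case utf16([UInt16])

    /// Chooses the ASCII case when possible, otherwise falls back to UTF-8.
    init(_ s: String) {
        let bytes = Array(s.utf8)
        if bytes.allSatisfy({ $0 < 0x80 }) {
            self = .ascii(bytes)
        } else {
            self = .utf8(bytes)
        }
    }

    /// Forwards to whichever representation is stored.
    var codeUnitCount: Int {
        switch self {
        case .ascii(let bytes), .utf8(let bytes): return bytes.count
        case .utf16(let units): return units.count
        }
    }

    /// Decodes the stored code units back into a `String`.
    var string: String {
        switch self {
        case .ascii(let bytes), .utf8(let bytes):
            return String(decoding: bytes, as: UTF8.self)
        case .utf16(let units):
            return String(decoding: units, as: UTF16.self)
        }
    }
}

let plain = Text("abc")             // stored as .ascii, 3 code units
let accented = Text("caf\u{E9}")    // stored as .utf8, 5 code units
```

Every forwarding member needs a `switch` like the ones above, which is the main cost of the enum approach compared with a protocol-based facade.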
On Sat, Jan 21, 2017 at 14:38 Dave Abrahams via swift-evolution <
swift-evolution at swift.org> wrote:
>
>
> On Jan 21, 2017, at 3:49 AM, Brent Royal-Gordon <brent at architechies.com>
> wrote:
>
> On Jan 19, 2017, at 6:56 PM, Ben Cohen via swift-evolution <
> swift-evolution at swift.org> wrote:
>
>
> Below is our take on a design manifesto for Strings in Swift 4 and beyond.
>
>
> Probably best read in rendered markdown on GitHub:
>
> https://github.com/apple/swift/blob/master/docs/StringManifesto.md
>
>
> We’re eager to hear everyone’s thoughts.
>
>
> There is so, so much good stuff here.
>
>
> Right back atcha, Brent! Thanks for the detailed review!
>
> I'm really looking forward to seeing how these ideas develop and enter the
> language.
>
> #### Future Directions
>
>
> One of the most common internationalization errors is the unintentional
> presentation to users of text that has not been localized, but regularizing
> APIs and improving documentation can go only so far in preventing this error.
> Combined with the fact that `String` operations are non-localized by default,
> the environment for processing human-readable text may still be somewhat
> error-prone in Swift 4.
>
> For an audience of mostly non-experts, it is especially important that naïve
> code is very likely to be correct if it compiles, and that more sophisticated
> issues can be revealed progressively. For this reason, we intend to
> specifically and separately target localization and internationalization
> problems in the Swift 5 timeframe.
>
>
> I am very glad to see this statement in a Swift design document. I have a
> few ideas about this, but they can wait until the next version.
>
> At first blush this just adds work, but consider what it does for equality:
> two strings that normalize the same, naturally, will collate the same. But
> also, *strings that normalize differently will always collate differently*.
> In other words, for equality, it is sufficient to compare the strings'
> normalized forms and see if they are the same. We can therefore entirely skip
> the expensive part of collation for equality comparison.
>
> Next, naturally, anything that applies to equality also applies to hashing:
> it is sufficient to hash the string's normalized form, bypassing collation
> keys.
>
>
> That's a great catch.
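>
> A small illustration of the observable effect, written against present-day
> Swift, where `==` on `String` respects canonical equivalence. It says nothing
> about how the comparison is implemented internally; the variable names are
> only for the example.
>
> ```swift
> let precomposed = "caf\u{E9}"    // "é" as a single scalar, U+00E9
> let decomposed = "cafe\u{301}"   // "e" followed by U+0301 COMBINING ACUTE ACCENT
>
> // The scalar sequences differ...
> let sameScalars = precomposed.unicodeScalars
>     .elementsEqual(decomposed.unicodeScalars)
> // ...but the strings are canonically equivalent, so they compare equal,
> // and values that compare equal are required to hash equally.
> let equal = precomposed == decomposed
> let sameHash = precomposed.hashValue == decomposed.hashValue
> print(sameScalars, equal, sameHash)   // false true true
> ```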
>
> This leaves us executing the full UCA *only* for localized sorting, and ICU's
> implementation has apparently been very well optimized.
>
>
> Sounds good to me.
>
> Because the current `Comparable` protocol expresses all comparisons with
> binary operators, string comparisons, which may require additional
> [options](#operations-with-options), do not fit smoothly into the existing
> syntax. At the same time, we'd like to solve other problems with comparison,
> as outlined in
> [this proposal](https://gist.github.com/CodaFi/f0347bd37f1c407bf7ea0c429ead380e)
> (implemented by changes at the head of
> [this branch](https://github.com/CodaFi/swift/commits/space-the-final-frontier)).
> We should adopt a modification of that proposal that uses a method rather
> than an operator `<=>`:
>
>
> ```swift
> enum SortOrder { case before, same, after }
>
> protocol Comparable : Equatable {
>   func compared(to: Self) -> SortOrder
>   ...
> }
> ```
>
>
> This change will give us a syntactic platform on which to implement methods
> with additional, defaulted arguments, thereby unifying and regularizing
> comparison across the library.
>
> ```swift
> extension String {
>   func compared(to: Self) -> SortOrder
> }
> ```
>
>
> While it's great that `compared(to:case:etc.)` is parallel to
> `compared(to:)`, you don't actually want to *use* anything like
> `compared(to:)` if you can help it. Think about the clarity at the use site:
>
>     if foo.compared(to: bar, case: .insensitive, locale: .current) == .before { … }
>
>
> Right. We intend to keep the usual comparison operators.
>
> Poor readability of "foo <=> bar == .before" is another reason we think
> that giving up on "<=>" is no great loss.
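>
> A minimal sketch of how the usual operators could be derived from a single
> `compared(to:)` requirement. `ThreeWayComparable` and `Version` are
> hypothetical names chosen so the example does not clash with the real
> `Comparable`; this is one possible shape, not the proposal's implementation.
>
> ```swift
> enum SortOrder { case before, same, after }
>
> // Hypothetical stand-in for the modified `Comparable`.
> protocol ThreeWayComparable {
>     func compared(to other: Self) -> SortOrder
> }
>
> // The familiar operators fall out of the one method requirement.
> extension ThreeWayComparable {
>     static func < (lhs: Self, rhs: Self) -> Bool { lhs.compared(to: rhs) == .before }
>     static func > (lhs: Self, rhs: Self) -> Bool { lhs.compared(to: rhs) == .after }
>     static func == (lhs: Self, rhs: Self) -> Bool { lhs.compared(to: rhs) == .same }
> }
>
> struct Version: ThreeWayComparable {
>     var major: Int
>     var minor: Int
>
>     func compared(to other: Version) -> SortOrder {
>         // Tuples of comparable elements compare lexicographically.
>         if (major, minor) < (other.major, other.minor) { return .before }
>         if (major, minor) > (other.major, other.minor) { return .after }
>         return .same
>     }
> }
>
> let old = Version(major: 3, minor: 1)
> let new = Version(major: 4, minor: 0)
> print(old < new, old.compared(to: new) == .before)   // true true
> ```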
>
> The operands and sense of the comparison are kind of lost in all this
> garbage. You really want to see `foo < bar` in this code somewhere, but you
> don't.
>
>
> Yeah, we thought about trying to build a DSL for that, but failed. I
> think the best possible option would be something like:
>
> foo.comparison(case: .insensitive, locale: .current) < bar
>
> The biggest problem is that you can build things like
>
> fu = foo.comparison(case: .insensitive, locale: .current)
> br = bar.comparison(case: .sensitive)
> fu < br // what does this mean?
>
> We could even prevent such nonsense from compiling, but the cost in
> library API surface area is quite large.
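>
> A rough sketch of the `comparison(...) < bar` idea, assuming a hypothetical
> `Comparison` wrapper and a `CaseSensitivity` option (the locale parameter is
> left out, and case-insensitivity is crudely approximated with
> `lowercased()`). Because only `Comparison < String` is declared, comparing
> two wrappers with conflicting options does not type-check, at the price of
> one overload per operator.
>
> ```swift
> enum CaseSensitivity { case sensitive, insensitive }
>
> // Hypothetical wrapper returned by `comparison(case:)`.
> struct Comparison {
>     let base: String
>     let sensitivity: CaseSensitivity
>
>     // Simplification: fold case with `lowercased()` when insensitive.
>     private func key(_ s: String) -> String {
>         sensitivity == .insensitive ? s.lowercased() : s
>     }
>
>     // The options on the left-hand side govern the whole comparison.
>     static func < (lhs: Comparison, rhs: String) -> Bool {
>         lhs.key(lhs.base) < lhs.key(rhs)
>     }
> }
>
> extension String {
>     func comparison(case sensitivity: CaseSensitivity) -> Comparison {
>         Comparison(base: self, sensitivity: sensitivity)
>     }
> }
>
> let foo = "apple", bar = "Banana"
> print(foo < bar)                                  // false: "B" sorts before "a"
> print(foo.comparison(case: .insensitive) < bar)   // true
> ```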
>
> I'm struggling a little with the naming and syntax, but as a general
> approach, I think we want people to use something more like this:
>
> if StringOptions(case: .insensitive, locale: .current).compare(foo <
> bar) { … }
>
>
> Yeah, we can't do that without making
>
> let a = foo < bar
>
> ambiguous
>
> Which might have an implementation like:
>
>     // This protocol might actually be part of your `Unicode` protocol; I'm
>     // just breaking it out separately here.
>     protocol StringOptionsComparable {
>         func compare(to: Self, options: StringOptions) -> SortOrder
>     }
>     extension StringOptionsComparable {
>         static func < (lhs: Self, rhs: Self)
>             -> (lhs: Self, rhs: Self, op: (SortOrder) -> Bool) {
>             return (lhs, rhs, { $0 == .before })
>         }
>         static func == (lhs: Self, rhs: Self)
>             -> (lhs: Self, rhs: Self, op: (SortOrder) -> Bool) {
>             return (lhs, rhs, { $0 == .same })
>         }
>         static func > (lhs: Self, rhs: Self)
>             -> (lhs: Self, rhs: Self, op: (SortOrder) -> Bool) {
>             return (lhs, rhs, { $0 == .after })
>         }
>         // etc.
>     }
>
>     struct StringOptions {
>         // Obvious properties and initializers go here
>
>         func compare<StringType: StringOptionsComparable>(
>             _ expression: (lhs: StringType, rhs: StringType, op: (SortOrder) -> Bool)
>         ) -> Bool {
>             return expression.op(
>                 expression.lhs.compare(to: expression.rhs, options: self))
>         }
>     }
>
> You could also imagine much less verbose syntaxes using custom operators.
> Strawman example:
>
> if foo < bar %% (case: .insensitive, locale: .current) { … }
>
> I think this would make human-friendly comparisons much easier to write
> and understand than adding a bunch of options to a `compared(to:)` call.
>
>
> That one has the same problem with ambiguity of "a < b". There might be
> an answer here but it's not obvious and I feel solving it can wait a little.
>
> This quirk aside, every aspect of strings-as-collections-of-graphemes appears
> to comport perfectly with Unicode. We think the concatenation problem is
> tolerable, because the cases where it occurs all represent partially-formed
> constructs.
>
> ...
>
> Admitting these cases encourages exploration of grapheme composition and is
> consistent with what appears to be an overall Unicode philosophy that “no
> special provisions are made to get marginally better behavior for… cases that
> never occur in practice.”[2]
>
>
> This sounds good to me.
>
> ### Unification of Slicing Operations
>
>
> I think you know what I think about this. :^)
>
> (By the way, I've at least partially let this proposal drop for the moment
> because it's so dependent on generic subscripts to really be an
> improvement. I do plan to pick it up when those arrive; ping me then if I
> don't notice.)
>
>
> Okeydoke.
>
> A question, though. We currently have a couple of methods, mostly with
> `subrange` in their names, that can be thought of as slicing operations but
> aren't:
>
> collection.removeSubrange(i..<j)
> collection[i..<j].removeAll()
>
> collection.replaceSubrange(i..<j, with: others)
> collection[i..<j].replaceAll(with: others) // hypothetically
>
> Should these be changed, too? Can we make them efficient (in terms of e.g.
> copy-on-write) if we do?
>
>
> We could, once the ownership model is implemented. However, I'm not sure
> whether it's enough of an improvement to be worth doing. You could go all
> the way to
>
> collection[i..<j] = EmptyCollection()
> collection[i..<j] = others
>
> But for that we'd need to (at least) introduce write-only subscripts.
>
> ### Substrings
>
>
> When implementing substring slicing, languages are faced with three
> options:
>
>
> 1. Make the substrings the same type as string, and share storage.
>
> 2. Make the substrings the same type as string, and copy storage when
> making the substring.
>
> 3. Make substrings a different type, with a storage copy on conversion to
> string.
>
>
> We think number 3 is the best choice.
>
>
> I agree, and I think `Substring` is the right name for it: parallel to
> `SubSequence`, explains where it comes from, captures the trade-offs
> nicely. `StringSlice` is parallel to `ArraySlice`, but it strikes me as a
> "foolish consistency", as the saying goes; it avoids a term of art for
> little reason I can see.
>
> However, is there a reason we're talking about using a separate
> `Substring` type at all, instead of using `Slice<String>`?
>
>
> Yes: we couldn't specialize its representation to store short substrings
> inline, at least not without introducing an undesirable level of complexity.
>
>
> Perhaps I'm missing something, but I *think* it does everything we need
> here. (Of course, you could say the same thing about `ArraySlice`, and yet
> we have that, too.)
>
>
> ArraySlice is doomed :-)
>
> https://bugs.swift.org/browse/SR-3631
>
> The downside of having two types is the inconvenience of sometimes having a
> `Substring` when you need a `String`, and vice-versa. It is likely this would
> be a significantly bigger problem than with `Array` and `ArraySlice`, as
> slicing of `String` is such a common operation. It is especially relevant to
> existing code that assumes `String` is the currency type. To ease the pain of
> type mismatches, `Substring` should be a subtype of `String` in the same way
> that `Int` is a subtype of `Optional<Int>`.
>
>
> I've seen people struggle with the `Array`/`ArraySlice` issue when writing
> recursive algorithms, so personally, I'd like to see a more general
> solution that handles all `Collection`s.
>
>
> The more general solution is "extend Unicode" or "extend Collection" (and
> when a String *parameter* is needed, "make your method generic over
> Collection/Unicode").
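>
> As a sketch of the "make your method generic" advice, here it is written
> against `StringProtocol`, the protocol that present-day Swift ships in the
> role discussed here as `Unicode`. `vowelCount(in:)` is a made-up example
> function.
>
> ```swift
> // Generic over `StringProtocol`, so it accepts `String` and `Substring` alike.
> func vowelCount<S: StringProtocol>(in text: S) -> Int {
>     text.filter { "aeiou".contains($0) }.count
> }
>
> let name = "Brent Royal-Gordon"
> let first = name.prefix(5)       // a Substring sharing storage with `name`
> print(vowelCount(in: name))      // 5
> print(vowelCount(in: first))     // 1
> let owned = String(first)        // the explicit, copying conversion
> ```
>
> No conversion is needed at either call site; a copy happens only where
> `String(first)` asks for one.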
>
> Rather than having an implicit copying conversion from `String` to
> `Substring` (or `Array` to `ArraySlice`, or `Collection` to
> `Collection.SubSequence`), I wonder if implicitly converting in the other
> direction might be more useful, at least in some circumstances. Converting
> in this direction does *not* involve an implicit copy, merely calculating a
> range, so you won't have the same performance surprises. On the other hand,
> it's also useful in fewer situations.
>
>
> That's the problem, right there, combined with the fact that we don't have
> a terse syntax like s[] for going the other way. I think it would be a
> much more elegant design, personally, but I don't see the tradeoffs working
> out. If we can come up with a way to do it that works, we should. So far,
> Ben and I have failed.
>
> (If we did go with consistently using `Slice<T>`, this might merely be a
> special-cased `T -> Slice<T>` conversion. One type, special-cased until we
> feel comfortable inventing a general mechanism.)
>
> A user who needs to optimize away copies altogether should use this
> guideline: if for performance reasons you are tempted to add a `Range`
> argument to your method as well as a `String` to avoid unnecessary copies,
> you should instead use `Substring`.
>
>
> I do like this as a guideline, though. There's definitely room in the
> standard library for "a string and a range of that string to operate upon".
>
>
> I don't know what you mean. It's our intention that nothing but the
> lowest level operations (e.g. replaceRange) would work on ranges when they
> could instead be working on slices.
>
> ##### The “Empty Subscript”
>
>
> To make it easy to call such an optimized API when you only have a `String`
> (or to call any API that takes a `Collection`'s `SubSequence` when all you
> have is the `Collection`), we propose the following “empty subscript”
> operation,
>
> ```swift
> extension Collection {
>   subscript() -> SubSequence {
>     return self[startIndex..<endIndex]
>   }
> }
> ```
>
> which allows the following usage:
>
> ```swift
> funcThatIsJustLooking(at: person.name[]) // pass person.name as Substring
> ```
>
>
> That's a little bit funky, but I guess it might work.
>
> Therefore, APIs that operate on an `NSString`/`NSRange` pair should be
> imported without the `NSRange` argument. The Objective-C importer should be
> changed to give these APIs special treatment so that when a `Substring` is
> passed, instead of being converted to a `String`, the full `NSString` and
> range are passed to the Objective-C method, thereby avoiding a copy.
>
> As a result, you would never need to pass an `NSRange` to these APIs, which
> solves the impedance problem by eliminating the argument, resulting in more
> idiomatic Swift code while retaining the performance benefit. To help users
> manually handle any cases that remain, Foundation should be augmented to
> allow the following syntax for converting to and from `NSRange`:
>
> ```swift
> let nsr = NSRange(i..<j, in: s) // An NSRange corresponding to s[i..<j]
> let iToJ = Range(nsr, in: s)    // Equivalent to i..<j
> ```
>
>
> I sort of like this, but note that if we use `String` -> `Substring`
> conversion instead of the other way around, there's less magic needed to
> get this effect: `NSString, NSRange` can be imported as `Substring`, which
> automatically converts from `String` in exactly the manner we want it to.
>
>
> Indeed.
>
>
> Since Unicode conformance is a key feature of string processing in Swift, we
> call that protocol `Unicode`:
>
>
> I'm sorry, I think the name is too clever by half. It sounds something
> like what `UnicodeCodec` actually is. Or maybe a type representing a
> version of the Unicode standard or something. I'd prefer something more
> prosaic like `StringProtocol`.
>
>
> It's an option we considered. So far I think Unicode is better (most
> especially if we end up with a "facade" design) but we should discuss it.
>
>
>
> **Note:** `Unicode` would make a fantastic namespace for much of what's in
> this proposal if we could get the ability to nest types and protocols in
> protocols.
>
>
> I mean, sure, but then you imagine it being used generically:
>
> func parse<UnicodeType: Unicode>(_ source: UnicodeType) -> UnicodeType
> // which concrete types can `source` be???
>
>
> All "string" types, including String, Substring, UTF8String, StaticString,
> etc.
>
> We should provide convenient APIs processing strings by character. For
> example, it should be easy to cleanly express, “if this string starts with
> `"f"`, process the rest of the string as follows…” Swift is well-suited to
> expressing this common pattern beautifully, but we need to add the APIs. Here
> are two examples of the sort of code that might be possible given such APIs:
>
> ```swift
> if let firstLetter = input.droppingPrefix(alphabeticCharacter) {
>   somethingWith(input) // process the rest of input
> }
>
> if let (number, restOfInput) = input.parsingPrefix(Int.self) {
>   ...
> }
> ```
>
> The specific spelling and functionality of APIs like this are TBD. The larger
> point is to make sure matching-and-consuming jobs are well-supported.
>
>
> Yes.
>
> #### Unified Pattern Matcher Protocol
>
>
> Many of the current methods that do matching are overloaded to do the same
> logical operations in different ways, with the following axes:
>
> - Logical Operation: `find`, `split`, `replace`, match at start
> - Kind of pattern: `CharacterSet`, `String`, a regex, a closure
> - Options, e.g. case/diacritic sensitivity, locale. Sometimes a part of
>   the method name, and sometimes an argument
> - Whole string or subrange.
>
> We should represent these aspects as orthogonal, composable components,
> abstracting pattern matchers into a protocol like
> [this one](https://github.com/apple/swift/blob/master/test/Prototypes/PatternMatching.swift#L33),
> that can allow us to define logical operations once, without introducing
> overloads, and massively reducing API surface area.
>
>
> *Very* yes.
>
> For example, using the strawman prefix `%` syntax to turn string literals
> into patterns, the following pairs would all invoke the same generic methods:
>
> ```swift
> if let found = s.firstMatch(%"searchString") { ... }
> if let found = s.firstMatch(someRegex) { ... }
>
> for m in s.allMatches((%"searchString"), case: .insensitive) { ... }
> for m in s.allMatches(someRegex) { ... }
>
> let items = s.split(separatedBy: ", ")
> let tokens = s.split(separatedBy: CharacterSet.whitespace)
> ```
>
>
> Very, *very* yes.
>
> If we do this, rather than your `%` operator (or whatever it becomes), I
> wonder if we can have these extensions:
>
>     // Assuming a protocol like:
>     protocol Pattern {
>         associatedtype PatternElement
>         func matches<CollectionType: Collection>(…) -> …
>             where CollectionType.Element == PatternElement
>     }
>     extension Equatable: Pattern {
>         typealias PatternElement = Self
>         …
>     }
>     extension Collection: Pattern where Element: Equatable {
>         typealias PatternElement = Element
>     }
>
> ...although then `Collection` would conform to `Pattern` through both
> itself and (conditionally) `Equatable`. Hmm.
>
> I suppose we faced this same problem elsewhere and ended up with things
> like:
>
> mutating func append(_ element: Element)
> mutating func append<Seq: Sequence>(contentsOf seq: Seq) where
> Seq.Iterator.Element == Element
>
> So we could do things like:
>
> str.firstMatch("x") // single element, so this is a Character
> str.firstMatch(contentsOf("xy"))
> str.firstMatch(anyOf(["x", "y"] as Set))
>
>
> I really, really want to explore these ideas further, and I really, really
> don't want to do it in this thread, if you don't mind. There are lots of
> ways to slice this particular cupcake.
>
>
> #### Index Interchange Among Views
>
>
> I really, really, really want this.
>
> We think random-access *code-unit storage* is a reasonable requirement to
> impose on all `String` instances.
>
>
> Wait, you do? Doesn't that mean either using UTF-32, inventing a UTF-24 to
> use, or using some kind of complicated side table that adjusts for all the
> multi-unit characters in a UTF-16 or UTF-8 string? None of these sound
> ideal.
>
>
> No; I'm not sure why you would think that.
>
> Index interchange between `String` and its `unicodeScalars`, `codeUnits`,
> and [`extendedASCII`](#parsing-ascii-structure) views can be made entirely
> seamless by having them share an index type (semantics of indexing a `String`
> between grapheme cluster boundaries are TBD; it can either trap or be
> forgiving).
>
>
> I think it should be forgiving, and I think it should be forgiving in a
> very specific way: It should treat indexing in the middle of a cluster as
> though you indexed at the beginning.
>
>
> That's my intuition as well.
>
> The reason is `AttributedString`. You can think of `AttributedString` as
> being a type which adds additional views to a `String`; these views are
> indexed by `String.Index`, just like `String`, `String.UnicodeScalarView`,
> et al., and advancing an index with these views advances it to the
> beginning of the next run. But you can also just subscript these views with
> an arbitrary index in the middle of a run, and it'll work correctly.
>
> I think it would be useful for this behavior to be consistent among all
> `String` views.
>
> Having a common index allows easy traversal into the interior of graphemes,
> something that is often needed, without making it likely that someone will do
> it by accident.
>
> - `String.index(after:)` should advance to the next grapheme, even when the
>   index points partway through a grapheme.
>
> - `String.index(before:)` should move to the start of the grapheme before
>   the current position.
>
>
> Good.
>
> Seamless index interchange between `String` and its UTF-8 or UTF-16 views is
> not crucial, as the specifics of encoding should not be a concern for most
> use cases, and would impose needless costs on the indices of other views.
>
>
> I don't know about this, at least for the UTF-16 view. Here's why:
>
> That leaves the interchange of bare indices with Cocoa APIs trafficking in
> `Int`. Hopefully such APIs will be rare, but when needed, the following
> extension, which would be useful for all `Collections`, can help:
>
> ```swift
> extension Collection {
>   func index(offset: IndexDistance) -> Index {
>     return index(startIndex, offsetBy: offset)
>   }
>   func offset(of i: Index) -> IndexDistance {
>     return distance(from: startIndex, to: i)
>   }
> }
> ```
>
> Then integers can easily be translated into offsets into a `String`'s `utf16`
> view for consumption by Cocoa:
>
> ```swift
> let cocoaIndex = s.utf16.offset(of: String.UTF16Index(i))
> let swiftIndex = s.utf16.index(offset: cocoaIndex)
> ```
>
>
> I worry that this conversion will be too obscure.
>
>
> I very much hope it will be rare enough that it'll be OK, but if it isn't,
> we can always have
>
> let cocoaIndex = s.utf16Offset(of: i)
>
> and/or take other measures to simplify it.
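>
> To see why the choice of view matters for such offsets, here is a small
> sketch in present-day Swift. The `Collection` extension mirrors the one
> quoted above with `Int` in place of `IndexDistance`; `utf16Offset(in:)` is
> the spelling current Swift provides on `String.Index`.
>
> ```swift
> extension Collection {
>     // Same shape as the quoted extension, using `Int` distances.
>     func index(offset: Int) -> Index { index(startIndex, offsetBy: offset) }
>     func offset(of i: Index) -> Int { distance(from: startIndex, to: i) }
> }
>
> let s = "a\u{1F600}b"               // "a", an emoji outside the BMP, "b"
> let i = s.firstIndex(of: "b")!
>
> print(s.offset(of: i))              // 2: counted in characters
> print(s.utf16.offset(of: i))        // 3: the emoji is a surrogate pair
> print(i.utf16Offset(in: s))         // 3
> print(s[s.utf16.index(offset: 3)])  // b
> ```
>
> A Cocoa API expects the UTF-16 number (3), so taking the offset from the
> plain `String` view (2) would silently point at the wrong place.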
>
> In Objective-C, you don't really think very much about what "character"
> means; it's just an index that points to a location inside the string. I
> don't think people will know to use the `utf16` view instead of the
> others—especially the plain `String` version, which would be the most
> obvious one to use.
>
> I think I'd prefer to see the following:
>
> 1. UTF-16 is the storage format, at least for an "ordinary" `Swift.String`.
>
>
> It will be, in the common case, but *many* people seem to want plain
> String to be able to store UTF-8, and I'm not yet prepared to rule that out.
>
> 2. `String.Index` is used down to the `UTF16View`. It stores a UTF-16
> offset.
>
> 3. With just the standard library imported, `String.Index` does not have
> any obvious way to convert to or from an `Int` offset; you use
> `index(_:offsetBy:)` on one of the views. `utf16`'s implementation is just
> faster than the others.
>
>
> This is roughly where we are today.
>
> 4. Foundation adds `init(_:)` methods to `String.Index` and `Int`, as well
> as `Range<String.Index>` and `NSRange`, which perform mutual conversions:
>
> XCTAssertEqual(Int(String.Index(cocoaIndex)), cocoaIndex)
> XCTAssertEqual(NSRange(Range<String.Index>(cocoaRange)), cocoaRange)
>
> I think this would really help to guide people to the right APIs for the
> task.
>
> (Also, it would make my `AttributedString` thing work better, too.)
>
> ### Formatting
>
>
> Briefly: I am, let's say, 95% on board with your plan to replace format
> strings with interpolation and format methods. The remaining 5% concern is
> that we'll need an adequate replacement for the ability to load a format
> string dynamically and have it reorder or alter the formatting of
> interpolated values. Obviously dynamic format strings are dangerous and
> limited, but where you *can* use them, they're invaluable.
>
>
> Yes. We have ideas, though they're far from baked.
>
> #### String Interpolation
>
>
> Swift string interpolation provides a user-friendly alternative to printf's
> domain-specific language (just write ordinary Swift code!) and its type
> safety problems (put the data right where it belongs!) but the following
> issues prevent it from being useful for localized formatting (among other
> jobs):
>
> * [SR-2303](https://bugs.swift.org/browse/SR-2303) We are unable to restrict
>   types used in string interpolation.
> * [SR-1260](https://bugs.swift.org/browse/SR-1260) String interpolation can't
>   distinguish (fragments of) the base string from the string substitutions.
>
>
> If I find some copious free time, I could try to develop proposals for one
> or both of these. Would there be interest in them at this point? (Feel free
> to contact me off-list about this, preferably in a new thread.)
>
> (Okay, one random thought, because I can't resist: Perhaps the "\(…)"
> syntax can be translated directly into an `init(…)` on the type you're
> creating. That is, you can write:
>
> let x: MyString = "foo \(bar) baz \(quux, radix: 16)"
>
> And that translates to:
>
> let x = MyString(stringInterpolationSegments:
> MyString(stringLiteral: "foo "),
> MyString(bar),
> MyString(stringLiteral: " baz "),
> MyString(quux, radix: 16)
> )
>
> That would require you to redeclare `String` initializers on your own
> string type, but you probably need some of your own logic anyway, right?)
>
>
> Let's go to a separate thread for this, as you suggested.
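>
> For comparison, Swift 5 and later let code customize `\(…)` segments through
> `appendInterpolation` methods rather than through initializers as floated
> above. A minimal sketch that supports the `\(quux, radix: 16)` spelling on
> plain `String`; the `radix:` overload is written here for the example and is
> not something the standard library provides.
>
> ```swift
> extension String.StringInterpolation {
>     // Lets "\(value, radix: 16)" appear inside an ordinary string literal.
>     mutating func appendInterpolation(_ value: Int, radix: Int) {
>         appendLiteral(String(value, radix: radix))
>     }
> }
>
> let quux = 255
> let x = "foo \(quux) baz \(quux, radix: 16)"
> print(x)   // foo 255 baz ff
> ```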
>
> In the long run, we should improve Swift string interpolation to the point
> where it can participate in most any formatting job. Mostly this centers
> around fixing the interpolation protocols per the previous item, and
> supporting localization.
>
>
> For what it's worth, by using a hacky workaround for SR-1260, I've written
> (Swift 2.0) code that passes strings with interpolations through the
> Foundation localized string tables: <
> https://gist.github.com/brentdax/79fa038c0af0cafb52dd> Obviously that's
> just a start, but it is incredibly convenient.
>
>
> I know; it's an inspiration :-)
>
> ### C String Interop
>
>
> Our support for interoperation with nul-terminated C strings is scattered and
> incoherent, with six ways to transform a C string into a `String` and four
> ways to do the inverse. These APIs should be replaced with the following
>
>
> These APIs are much better than the status quo, but it's a shame that we
> can't have them handle non-nul-terminated data, too.
>
>
> We thought about unifying them with other transcoding APIs, but the
> pointer-to-nul-terminated-code-units case is sufficiently important that we
> think they deserve dedicated support.
>
> Actually... (Begin shaggy dog story...)
>
> Suppose you introduce an `UnsafeNulTerminatedBufferPointer` type. Then you
> could write a *very* high-level API which handles pretty much every
> conversion under the sun:
>
>     extension String {
>         /// Constructs a `String` from a sequence of `codeUnits` in an
>         /// indicated `encoding`.
>         ///
>         /// - Parameter codeUnits: A sequence of code units in the given
>         ///   `encoding`.
>         /// - Parameter encoding: The encoding the code units are in.
>         init<CodeUnits: Sequence, Encoding: UnicodeEncoding>(
>             _ codeUnits: CodeUnits, encoding: Encoding
>         ) where CodeUnits.Iterator.Element == Encoding.CodeUnit
>     }
>
>
> Yes, we intend to support something like that.
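>
> For reference, present-day Swift has an initializer of roughly this shape,
> spelled `String(decoding:as:)`, alongside a dedicated `String(cString:)`. A
> short sketch of both; the byte values are just sample data.
>
> ```swift
> let utf8Bytes: [UInt8] = [0x63, 0x61, 0x66, 0xC3, 0xA9]   // "café" in UTF-8
> let utf16Units: [UInt16] = [0x63, 0x61, 0x66, 0xE9]       // "café" in UTF-16
>
> // One generic initializer covers both encodings.
> let a = String(decoding: utf8Bytes, as: UTF8.self)
> let b = String(decoding: utf16Units, as: UTF16.self)
>
> // A nul-terminated C string goes through the dedicated initializer.
> let cChars: [CChar] = [0x68, 0x69, 0]                      // "hi" plus nul
> let c = cChars.withUnsafeBufferPointer { String(cString: $0.baseAddress!) }
>
> print(a == b, c)   // true hi
> ```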
>
> For UTF-8, at least, that would cover reading from `Array`,
> `UnsafeBufferPointer`, `UnsafeRawBufferPointer`,
> `UnsafeNulTerminatedBufferPointer`, `Data`, you name it. Maybe we could
> have a second one that always takes something producing bytes, no matter
> the encoding used:
>
>     extension String {
>         /// Constructs a `String` from the code units contained in `bytes`
>         /// in a given `encoding`.
>         ///
>         /// - Parameter bytes: A sequence of bytes expressing code units in
>         ///   the given `encoding`.
>         /// - Parameter encoding: The encoding the code units are in.
>         init<Bytes: Sequence, Encoding: UnicodeEncoding>(
>             _ bytes: Bytes, encoding: Encoding
>         ) where Bytes.Iterator.Element == UInt8
>     }
>
> These two initializers would replace...um, something like eight existing
> ones, including ones from Foundation. On the other hand, this is *very*
> generic. And, unless we actually changed the way `char *` imported to
> `UnsafeNulTerminatedBufferPointer<CChar>`, the C string call sequence would
> be pretty complicated:
>
>     String(UnsafeNulTerminatedBufferPointer(start: cString), encoding: UTF8.self)
>
> So you might end up having to wrap it in an `init(cString:)` anyway, just
> for convenience. Oh well, it was worth exploring.
>
>
> I think you ended up where we did.
>
> Prototype of the above:
> https://gist.github.com/brentdax/8b71f46b424dc64abaa77f18556e607b
>
> (Hmm...maybe bridge `char *` to a type like this instead?
>
> struct CCharPointer {
> var baseAddress: UnsafePointer<CChar> { get }
> var nulTerminated: UnsafeNulTerminatedBufferPointer<CChar> { get }
> func ofLength(_ length: Int) -> UnsafeBufferPointer<CChar>
> }
>
> Nah, probably not gonna happen...)
>
> init(cString nulTerminatedUTF8: UnsafePointer<CChar>)
>
>
> By the way, I just noticed an impedance mismatch in current Swift: `CChar`
> is usually an `Int8`, but `UnicodeScalar` and `UTF8` currently want
> `UInt8`. It'd be nice to address this somehow, if only by adding some
> signed variants or something.
>
>
> We thought about that problem and landed on the proposed interface above
> as all that is needed to resolve it.
>
>
> ### High-Performance String Processing
>
>
> Many strings are short enough to store in 64 bits, many can be stored using
> only 8 bits per unicode scalar, others are best encoded in UTF-16, and some
> come to us already in some other encoding, such as UTF-8, that would be
> costly to translate. Supporting these formats while maintaining usability for
> general-purpose APIs demands that a single `String` type can be backed by
> many different representations.
>
>
> Just putting a pin in this, because I'll want to discuss it a little later.
>
> ### Parsing ASCII Structure
>
>
> Although many machine-readable formats support the inclusion of arbitrary
> Unicode text, it is also common that their fundamental structure lies
> entirely within the ASCII subset (JSON, YAML, many XML formats). These
> formats are often processed most efficiently by recognizing ASCII structural
> elements as ASCII, and capturing the arbitrary sections between them in
> more-general strings. The current String API offers no way to efficiently
> recognize ASCII and skip past everything else without the overhead of full
> decoding into unicode scalars.
>
> For these purposes, strings should supply an `extendedASCII` view that is a
> collection of `UInt32`, where values less than `0x80` represent the
> corresponding ASCII character, and other values represent data that is
> specific to the underlying encoding of the string.
>
>
> This sounds interesting, but:
>
> 1. It doesn't sound like you anticipate there being any way to compare an
> element of the `extendedASCII` view to a character literal. That seems like
> it'd be really useful.
>
>
> We don't have character literals :-)
>
> However, I agree that there needs to be a way to do it. The thing would
> be to make it easy to construct a UInt8 from a string literal. There are a
> few possibilities; I'm a little nervous about making this work:
>
> if c == "X" { ... }
>
> but maybe I should just get over it. The cleanest alternative I can think
> of is:
>
> if c == ascii("X") { ... }
>
> where "X" is required by the compiler to be a single ASCII character.
>
> I guess another possibility is to introduce an ASCII type and overload
> operators so it can be compared with all the Ints:
>
> if c == "X" as ASCII { ... }
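>
> To make the last two options concrete, here is a minimal sketch. Both
> `ascii(_:)` and the `ASCII` type are hypothetical, and the check happens
> at run time via `precondition` rather than in the compiler as suggested
> above. (The standard library's existing `UInt8(ascii:)` initializer
> behaves similarly for a single `Unicode.Scalar`.)
>
> ```swift
> // Function form: returns the code unit, trapping on non-ASCII input.
> func ascii(_ scalar: Unicode.Scalar) -> UInt8 {
>     precondition(scalar.isASCII, "ascii(_:) requires an ASCII scalar")
>     return UInt8(scalar.value)
> }
>
> // Type form: a hypothetical ASCII type built from a literal.
> struct ASCII: ExpressibleByUnicodeScalarLiteral {
>     let value: UInt8
>     init(unicodeScalarLiteral scalar: Unicode.Scalar) {
>         precondition(scalar.isASCII, "ASCII requires an ASCII scalar")
>         value = UInt8(scalar.value)
>     }
> }
>
> // One overload covers comparison against every integer type.
> func == <T: BinaryInteger>(lhs: T, rhs: ASCII) -> Bool {
>     return lhs == T(rhs.value)
> }
>
> let c: UInt32 = 0x58 // the code unit for "X"
> let viaFunction = c == UInt32(ascii("X"))
> let viaType = c == ("X" as ASCII)
> ```
>
> The function form needs an explicit widening at the call site when `c`
> is not a `UInt8`; the type form hides that behind the generic overload.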
>
> 2. I don't really understand how you envision using the "data specific to
> the underlying encoding" sections. Presumably you'll want to convert that
> data into a string eventually, right?
>
>
> It already *is* in a string. The point is that we have a way to scan the
> string looking for ASCII patterns without transcoding it.
>
> Do you have pseudocode or something lying around that might help us
> understand how you think this might be used?
>
>
> Not exactly. The pattern matching prototype you referred to earlier would
> be enhanced to use the extendedASCII view when it was available and the
> pattern being matched was suitably restricted. How, exactly, that works is
> still a bit of a research project though.
>
> ### Do we need a type-erasable base protocol for UnicodeEncoding?
>
>
> UnicodeEncoding has an associated type, but it may be important to be able
> to traffic in completely dynamic encoding values, e.g. for “tell me the
> most efficient encoding for this string.”
>
>
> As long as you're here, we haven't talked about `UnicodeEncoding` much. I
> assume this is a slightly modified version of `UnicodeCodec`? Anything to
> say about it?
>
>
> That's basically right. You can see a first cut at it in the
> unicode-rethink branch on GitHub.
>
> If it *is* similar to `UnicodeCodec`, one thing I will note is that the
> way `UnicodeCodec` works in code units is rather annoying for I/O. It may
> make sense to have some sort of type-erasing wrapper around `UnicodeCodec`
> which always uses bytes. (You then have to worry about endianness, of
> course...)
>
>
> Take a look at the branch and let me know how this looks like it would
> work for I/O.
>
> By the way, I think I/O really needs a special kind of collection: a sort
> of deque built out of I/O buffer-sized chunks that are filled on demand
> from a Sequence. That is part, at least, of how I justify UnicodeEncoding
> having a Collection-based interface where UnicodeCodec used Iterator.
>
>
> ### Should there be a string “facade?”
>
> …
>
> An interesting variation on this design is possible if defaulted generic
> parameters are introduced to the language:
>
> ```swift
> struct String<U: Unicode = StringStorage>
>   : BidirectionalCollection {
>
>   // ...APIs for high-level string processing here...
>
>   var unicode: U // access to lower-level unicode details
> }
>
> typealias Substring = String<StringStorage.SubSequence>
> ```
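>
> The shape of this variant can be approximated without defaulted generic
> parameters by letting a typealias play the role of the default. The
> sketch below assumes a type-erased wrapper as that default; every name
> in it (`StorageSketch`, `AnyStorage`, `Facade`, and so on) is
> hypothetical and the protocol is reduced to a single requirement:
>
> ```swift
> protocol StorageSketch {
>     var scalars: [Unicode.Scalar] { get }
> }
>
> struct UTF8Storage: StorageSketch {
>     var bytes: [UInt8]
>     var scalars: [Unicode.Scalar] {
>         return Array(String(decoding: bytes, as: UTF8.self).unicodeScalars)
>     }
> }
>
> struct UTF16Storage: StorageSketch {
>     var units: [UInt16]
>     var scalars: [Unicode.Scalar] {
>         return Array(String(decoding: units, as: UTF16.self).unicodeScalars)
>     }
> }
>
> // Type-erased storage: flexible, at the cost of an indirect call.
> struct AnyStorage: StorageSketch {
>     private let getScalars: () -> [Unicode.Scalar]
>     init<S: StorageSketch>(_ base: S) { getScalars = { base.scalars } }
>     var scalars: [Unicode.Scalar] { return getScalars() }
> }
>
> struct Facade<U: StorageSketch> {
>     var unicode: U // access to lower-level details
>     var scalarCount: Int { return unicode.scalars.count }
> }
>
> // Stand-in for a defaulted parameter `U = AnyStorage`.
> typealias FlexibleFacade = Facade<AnyStorage>
>
> let text = "caf\u{E9}"
> let fast = Facade(unicode: UTF16Storage(units: Array(text.utf16)))
> let flexible = FlexibleFacade(
>     unicode: AnyStorage(UTF8Storage(bytes: Array(text.utf8))))
> ```
>
> `fast` is specialized on one concrete storage, while `flexible` can wrap
> any storage behind the same facade API.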
>
>
> I think this is a very, very interesting idea. A few notes:
>
> * Earlier, I said I didn't like `Unicode` as a protocol name. If we go
> this route, I think `StringStorage` is a good name for that protocol. The
> default storage might be something like `UTF16StringStorage`, or just, you
> know, `DefaultStringStorage`.
>
> * Earlier, you mentioned the tension between using multiple
> representations for flexibility and pinning down one representation for
> speed. One way to handle this might be to have `String`'s default
> `StringStorage` be a superclass or type-erased wrapper or something.
>
>
> Yes, that's the idea.
>
> That way, if you just write `String`, you get something flexible; if you
> write `String<NFCNormalizedUTF16StringStorage>`, you get something fast.
>
>
> This only works in the "facade" variant where you have a defaulted generic
> parameter feature, but yes, that's the idea of that variant.
>
> * Could `NSString` be a `StringStorage`, or support a trivial wrapper that
> converts it into a `StringStorage`? Would that be helpful at all?
>
>
> Yes, that's part of the idea.
>
> * If we do this, does `String.Index` become a type-specific thing? That
> is, might `String<UTF8Storage>.Index` be different from
> `String<UTF16Storage>.Index`?
>
>
> Yes.
>
> What does that mean for `String.Index` unification?
>
>
> Not much. We never intended for indices to be interchangeable among
> different specific string types (other than a string and its SubSequence).
>
> ### `description` and `debugDescription`
>
>
> * Should these be creating localized or non-localized representations?
>
>
> `debugDescription`, I think, is non-localized; it's something helpful for
> the programmer, and the programmer's language is not the user's. It's also
> usually something you don't want to put *too* much effort into, other than
> to dump a lot of data about the instance.
>
> `description` would have to change to be localizable. (Specifically, it
> would have to take a locale.) This is doable, of course, but it hasn't been
> done yet.
>
>
> Well, it could use the current locale. These things are supposed to
> remain lightweight.
>
> * Is returning a `String` efficient enough?
>
>
> I'm not sure how important efficiency is for `description`, honestly.
>
>
> It depends how intimately this is tied into interpolation and formatting,
> I think.
>
> * Is `debugDescription` pulling the weight of the API surface area it adds?
>
>
> Maybe? Or maybe it's better off as part of the `Mirror` instead of a
> property on the instance itself.
>
>
> That's a very interesting thought!
>
> ### `StaticString`
>
>
> `StaticString` was added as a byproduct of standard library development
> and kept around because it seemed useful, but it was never truly
> *designed* for client programmers. We need to decide what happens with it.
> Presumably *something* should fill its role, and that should conform to
> `Unicode`.
>
>
> Maybe. One complication there is that `Unicode` presumably supports
> mutation, which `StaticString` doesn't.
>
>
> No, Unicode doesn't support mutation. A mutable Unicode will usually
> conform to Unicode and RangeReplaceableCollection (but not
> MutableCollection, because replacing a grapheme is not an O(1) operation).
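>
> The existing `String` illustrates the point: it is a
> `RangeReplaceableCollection` but not a `MutableCollection`, so replacing
> one character goes through `replaceSubrange`, and the underlying storage
> may change size. A small example using only current standard library API:
>
> ```swift
> var word = "cafe"
> let utf8CountBefore = word.utf8.count
>
> // Replace the final character with U+00E9, which needs two UTF-8 code
> // units instead of one.
> let last = word.index(before: word.endIndex)
> word.replaceSubrange(last...last, with: "\u{E9}")
>
> let utf8CountAfter = word.utf8.count
> let characterCount = word.count
> ```
>
> The character count is unchanged while the UTF-8 length grows by one,
> which is why an in-place, constant-time element assignment cannot be
> promised.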
>
> Another possibility I've discussed in the past is renaming `StaticString`
> to `StringLiteral` and using it largely as a way to initialize `String`. (I
> mentioned that in a thread about the need for public integer and
> floating-point literal types that are more expressive now that we're
> supporting larger integer/float types.)
>
>
> Yes, a broad redesign of all literals is crucial. However, there are
> other sources of static string data than literals and those need to be
> accommodated.
>
> It could have just enough API surface to access it as a buffer of UTF-8
> bytes and thereby build a `String` or `Data` from it.
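>
> For reference, the current `StaticString` already allows roughly this
> through `withUTF8Buffer`. A short sketch, using `[UInt8]` in place of
> `Data` so that it does not depend on Foundation:
>
> ```swift
> let literal: StaticString = "swift-evolution"
>
> // Build a String by decoding the static UTF-8 buffer.
> let built = literal.withUTF8Buffer { buffer in
>     String(decoding: buffer, as: UTF8.self)
> }
>
> // Copy the same bytes out as a plain array.
> let bytes = literal.withUTF8Buffer { buffer in Array(buffer) }
> ```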
>
> Well, that's it for this massive email. You guys are doing a hell of a job
> on this.
>
>
> Thanks for all the feedback, and the encouragement!
>
> Hope this helps,
> --
> Brent Royal-Gordon
> Architechies
>
> _______________________________________________
> swift-evolution mailing list
> swift-evolution at swift.org
> https://lists.swift.org/mailman/listinfo/swift-evolution
>