<html><head><meta http-equiv="content-type" content="text/html; charset=utf-8"></head><body dir="auto"><div><br><br>Sent from my iPad</div><div><br>On Jan 19, 2017, at 8:20 PM, Xiaodi Wu <<a href="mailto:xiaodi.wu@gmail.com">xiaodi.wu@gmail.com</a>> wrote:<br><br></div><blockquote type="cite"><div><div dir="ltr">Clearly too big to digest in one take. Some initial thoughts:<div><br></div><div>* Not sure about the wisdom of the ad-hoc Substring : String compiler magic. It seems that whatever needs overcoming here would be equally relevant for ArraySlice.</div></div></div></blockquote><div><br></div>We have mixed feelings about it as well, and you make a good point about ArraySlice. I'm not convinced trafficking in slices is going to be as important for Array as it is for String, though.<div><br><div><blockquote type="cite"><div><div dir="ltr"><div>It would be more design work, but perhaps not terribly more implementation work, to have a magical protocol that allows the compiler to apply a similar magic to conforming types (e.g. a `ImplicitlyConvertibleSlice` protocol with an associated type, to which ArraySlice and String could both conform). </div></div></div></blockquote><div><br></div><div><span style="background-color: rgba(255, 255, 255, 0);">It's not just about slices. There are other subtype relationships we'll want in the language eventually anyway. Int8:Int16, for example. 
We don't t</span></div><div><br></div><blockquote type="cite"><div><div dir="ltr"><div>Alternatively, perhaps all of this is not truly necessary for sufficient ergonomics.</div></div></div></blockquote><div><br></div><div><span style="background-color: rgba(255, 255, 255, 0);">Personally I would be happy to try the design without the implicit conversion first, but there are legitimate concerns about forcing users to write String(someSubstring) and the manifesto needs to at least offer a solid plan in place for addressing it.</span></div><div style="background-color: rgba(255, 255, 255, 0);"><br></div></div><div><blockquote type="cite"><div><div dir="ltr"><div>* A requirement to transcode UTF-8 strings to UTF-16 for storage seems...inefficient? </div></div></div></blockquote><div><br></div>To be clear, nobody's suggesting that you can't store a UTF8String, only that it may be necessary to restrict the encodings that can be stored in the currency type "String."</div><div> <br><blockquote type="cite"><div><div dir="ltr"><div>Why any hesitation at all to expose UTF-8-encoded code units as UInt16? Sure, there are going to be unused bits, but so what? </div></div></div></blockquote><div><br></div>We're still exploring the design space. That idea is relatively fresh and I haven't convinced myself that it is both efficient and ergonomic. But it's promising.</div><div><br><blockquote type="cite"><div><div dir="ltr"><div>If I understand it correctly, it's only the concrete type exposed on String for code units that's in play here; the backing representations themselves can use whatever is most efficient. </div></div></div></blockquote><div><br></div><div>Yes.</div><br><blockquote type="cite"><div><div dir="ltr"><div>So, why _not_ support UTF-32 and expose all code units as UInt32? 
Isn't that exactly paralleling the design for the extendedASCII view, where users get ASCII characters back as UInt32 and encoding-specific code units as such as well?</div></div></div></blockquote><div><br></div></div><div>Yes, there's a definite parallel.</div><div><br><blockquote type="cite"><div><div dir="ltr"><div>* Are the backing representations for String also the same types that can be exposed statically (as in the mentioned `NFCNormalizedUTF16String`)?</div></div></div></blockquote><div><br></div>Roughly. I think we want at least the following backing representations for String:</div><div><br></div><div>1. The two compressed representations used by Cocoa "tagged pointer" strings</div><div>2. A third "tagged pointer" representation that stores 63 bits of UTF-16 (so arbitrary UnicodeScalars and most Characters can be stored efficiently)</div><div>3. A known Latin-1 backing store that we can fast-path</div><div>4. A known UTF-16 backing store</div><div>5. A type-erased arbitrary (or nearly-arbitrary, if we have to accept a UTF16 subset restriction) instance of Unicode</div><div><br></div><div>It's possible that some of the representations in the range 3...5 can be collapsed into one.</div><div><br><blockquote type="cite"><div><div dir="ltr"><div>* Why `withCString` with a closure instead of just `cString` returning [CChar]? Particularly if the backing store isn't UTF8, isn't the C string going to have to be a newly allocated buffer anyway? </div></div></div></blockquote><div><br></div><div>Not if the backing store is Latin-1, which will be very common.</div><div>Also not if the string is short and can be transcoded into stack-based storage, which will also be common.</div><br><blockquote type="cite"><div><div dir="ltr"><div>Personally, I find the current `utf8CString` to be quite convenient :P</div></div></div></blockquote><div><br></div>All these string types should bridge seamlessly to char*. 
Isn't that enough for the super-lightweight use case?</div><div><br><blockquote type="cite"><div><div dir="ltr"><div><div><div class="gmail_extra"><div class="gmail_quote">On Thu, Jan 19, 2017 at 8:56 PM, Ben Cohen via swift-evolution <span dir="ltr"><<a href="mailto:swift-evolution@swift.org" target="_blank">swift-evolution@swift.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex">Hi all,<br>
<br>
Below is our take on a design manifesto for Strings in Swift 4 and beyond.<br>
<br>
Probably best read in rendered markdown on GitHub:<br>
<a href="https://github.com/apple/swift/blob/master/docs/StringManifesto.md" rel="noreferrer" target="_blank">https://github.com/apple/<wbr>swift/blob/master/docs/<wbr>StringManifesto.md</a><br>
<br>
We’re eager to hear everyone’s thoughts.<br>
<br>
Regards,<br>
Ben and Dave<br>
<br>
<br>
# String Processing For Swift 4<br>
<br>
* Authors: [Dave Abrahams](<a href="https://github.com/dabrahams" rel="noreferrer" target="_blank">https://github.com/<wbr>dabrahams</a>), [Ben Cohen](<a href="https://github.com/airspeedswift" rel="noreferrer" target="_blank">https://github.com/<wbr>airspeedswift</a>)<br>
<br>
The goal of re-evaluating Strings for Swift 4 has been fairly ill-defined thus<br>
far, with just this short blurb in the<br>
[list of goals](<a href="https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160725/025676.html" rel="noreferrer" target="_blank">https://lists.swift.<wbr>org/pipermail/swift-evolution/<wbr>Week-of-Mon-20160725/025676.<wbr>html</a>):<br>
<br>
> **String re-evaluation**: String is one of the most important fundamental<br>
> types in the language. The standard library leads have numerous ideas of how<br>
> to improve the programming model for it, without jeopardizing the goals of<br>
> providing a unicode-correct-by-default model. Our goal is to be better at<br>
> string processing than Perl!<br>
<br>
For Swift 4 and beyond we want to improve three dimensions of text processing:<br>
<br>
1. Ergonomics<br>
2. Correctness<br>
3. Performance<br>
<br>
This document is meant to both provide a sense of the long-term vision<br>
(including undecided issues and possible approaches), and to define the scope of<br>
work that could be done in the Swift 4 timeframe.<br>
<br>
## General Principles<br>
<br>
### Ergonomics<br>
<br>
It's worth noting that ergonomics and correctness are mutually-reinforcing. An<br>
API that is easy to use—but incorrectly—cannot be considered an ergonomic<br>
success. Conversely, an API that's simply hard to use is also hard to use<br>
correctly. Achieving optimal performance without compromising ergonomics or<br>
correctness is a greater challenge.<br>
<br>
Consistency with the Swift language and idioms is also important for<br>
ergonomics. There are several places both in the standard library and in the<br>
Foundation additions to `String` where patterns and practices found elsewhere<br>
could be applied to improve usability and familiarity.<br>
<br>
### API Surface Area<br>
<br>
Primary data types such as `String` should have APIs that are easily understood<br>
given a signature and a one-line summary. Today, `String` fails that test. As<br>
you can see, the Standard Library and Foundation both contribute significantly to<br>
its overall complexity.<br>
<br>
**Method Arity** | **Standard Library** | **Foundation**<br>
---|:---:|:---:<br>
0: `ƒ()` | 5 | 7<br>
1: `ƒ(:)` | 19 | 48<br>
2: `ƒ(::)` | 13 | 19<br>
3: `ƒ(:::)` | 5 | 11<br>
4: `ƒ(::::)` | 1 | 7<br>
5: `ƒ(:::::)` | - | 2<br>
6: `ƒ(::::::)` | - | 1<br>
<br>
**API Kind** | **Standard Library** | **Foundation**<br>
---|:---:|:---:<br>
`init` | 41 | 18<br>
`func` | 42 | 55<br>
`subscript` | 9 | 0<br>
`var` | 26 | 14<br>
<br>
**Total: 205 APIs**<br>
<br>
By contrast, `Int` has 80 APIs, none with more than two parameters.[0] String processing is complex enough; users shouldn't have<br>
to press through physical API sprawl just to get started.<br>
<br>
Many of the choices detailed below contribute to solving this problem,<br>
including:<br>
<br>
* Restoring `Collection` conformance and dropping the `.characters` view.<br>
* Providing a more general, composable slicing syntax.<br>
* Altering `Comparable` so that parameterized<br>
(e.g. case-insensitive) comparison fits smoothly into the basic syntax.<br>
* Clearly separating language-dependent operations on text produced<br>
by and for humans from language-independent<br>
operations on text produced by and for machine processing.<br>
* Relocating APIs that fall outside the domain of basic string processing and<br>
discouraging the proliferation of ad-hoc extensions.<br>
<br>
<br>
### Batteries Included<br>
<br>
While `String` is available to all programs out-of-the-box, crucial APIs for<br>
basic string processing tasks are still inaccessible until `Foundation` is<br>
imported. While it makes sense that `Foundation` is needed for domain-specific<br>
jobs such as<br>
[linguistic tagging](<a href="https://developer.apple.com/reference/foundation/nslinguistictagger" rel="noreferrer" target="_blank">https://developer.<wbr>apple.com/reference/<wbr>foundation/nslinguistictagger</a>)<wbr>,<br>
one should not need to import anything to, for example, do case-insensitive<br>
comparison.<br>
<br>
### Unicode Compliance and Platform Support<br>
<br>
The Unicode standard provides a crucial objective reference point for what<br>
constitutes correct behavior in an extremely complex domain, so<br>
Unicode-correctness is, and will remain, a fundamental design principle behind<br>
Swift's `String`. That said, the Unicode standard is an evolving document, so<br>
this objective reference-point is not fixed.[1] While<br>
many of the most important operations—e.g. string hashing, equality, and<br>
non-localized comparison—will be stable, the semantics<br>
of others, such as grapheme breaking and localized comparison and case<br>
conversion, are expected to change as platforms are updated, so programs should<br>
be written so their correctness does not depend on precise stability of these<br>
semantics across OS versions or platforms. Although it may be possible to<br>
imagine static and/or dynamic analysis tools that will help users find such<br>
errors, the only sure way to deal with this fact of life is to educate users.<br>
<br>
## Design Points<br>
<br>
### Internationalization<br>
<br>
There is strong evidence that developers cannot determine how to use<br>
internationalization APIs correctly. Although documentation could and should be<br>
improved, the sheer size, complexity, and diversity of these APIs is a major<br>
contributor to the problem, causing novices to tune out, and more experienced<br>
programmers to make avoidable mistakes.<br>
<br>
The first step in improving this situation is to regularize all localized<br>
operations as invocations of normal string operations with extra<br>
parameters. Among other things, this means:<br>
<br>
1. Doing away with `localizedXXX` methods<br>
2. Providing a terse way to name the current locale as a parameter<br>
3. Automatically adjusting defaults for options such<br>
as case sensitivity based on whether the operation is localized.<br>
4. Removing correctness traps like `localizedCaseInsensitiveCompare` (see<br>
guidance in the<br>
[Internationalization and Localization Guide](<a href="https://developer.apple.com/library/content/documentation/MacOSX/Conceptual/BPInternational/InternationalizingYourCode/InternationalizingYourCode.html" rel="noreferrer" target="_blank">https://developer.<wbr>apple.com/library/content/<wbr>documentation/MacOSX/<wbr>Conceptual/BPInternational/<wbr>InternationalizingYourCode/<wbr>InternationalizingYourCode.<wbr>html</a>).<br>
<br>
Along with appropriate documentation updates, these changes will make localized<br>
operations more teachable, comprehensible, and approachable, thereby lowering a<br>
barrier that currently leads some developers to ignore localization issues<br>
altogether.<br>
<br>
#### The Default Behavior of `String`<br>
<br>
Although this isn't well-known, the most accessible forms of many operations on<br>
Swift `String` (and `NSString`) are really only appropriate for text that is<br>
intended to be processed for, and consumed by, machines. The semantics of the<br>
operations with the simplest spellings are always non-localized and<br>
language-agnostic.<br>
<br>
Two major factors play into this design choice:<br>
<br>
1. Machine processing of text is important, so we should have first-class,<br>
accessible functions appropriate to that use case.<br>
<br>
2. The most general localized operations require a locale parameter not required<br>
by their un-localized counterparts. This naturally skews complexity towards<br>
localized operations.<br>
<br>
Reaffirming that `String`'s simplest APIs have<br>
language-independent/machine-processed semantics has the benefit of clarifying<br>
the proper default behavior of operations such as comparison, and allows us to<br>
make [significant optimizations](#collation-semantics) that were previously<br>
thought to conflict with Unicode.<br>
<br>
#### Future Directions<br>
<br>
One of the most common internationalization errors is the unintentional<br>
presentation to users of text that has not been localized, but regularizing APIs<br>
and improving documentation can go only so far in preventing this error.<br>
Combined with the fact that `String` operations are non-localized by default,<br>
the environment for processing human-readable text may still be somewhat<br>
error-prone in Swift 4.<br>
<br>
For an audience of mostly non-experts, it is especially important that naïve<br>
code is very likely to be correct if it compiles, and that more sophisticated<br>
issues can be revealed progressively. For this reason, we intend to<br>
specifically and separately target localization and internationalization<br>
problems in the Swift 5 timeframe.<br>
<br>
### Operations With Options<br>
<br>
There are three categories of common string operations that often need to be<br>
tuned in various dimensions:<br>
<br>
**Operation**|**Applicable Options**<br>
---|---<br>
sort ordering | locale, case/diacritic/width-<wbr>insensitivity<br>
case conversion | locale<br>
pattern matching | locale, case/diacritic/width-<wbr>insensitivity<br>
<br>
The defaults for case-, diacritic-, and width-insensitivity are different for<br>
localized operations than for non-localized operations, so for example a<br>
localized sort should be case-insensitive by default, and a non-localized sort<br>
should be case-sensitive by default. We propose a standard “language” of<br>
defaulted parameters to be used for these purposes, with usage roughly like this:<br>
<br>
```swift<br>
x.compared(to: y, case: .sensitive, in: swissGerman)<br>
<br>
x.lowercased(in: .currentLocale)<br>
<br>
x.allMatches(<br>
somePattern, case: .insensitive, diacritic: .insensitive)<br>
```<br>
<br>
This usage might be supported by code like this:<br>
<br>
```swift<br>
enum StringSensitivity {<br>
case sensitive<br>
case insensitive<br>
}<br>
<br>
extension Locale {<br>
static var currentLocale: Locale { ... }<br>
}<br>
<br>
extension Unicode {<br>
// An example of the option language in declaration context,<br>
// with nil defaults indicating unspecified, so defaults can be<br>
// driven by the presence/absence of a specific Locale<br>
func frobnicated(<br>
case caseSensitivity: StringSensitivity? = nil,<br>
diacritic diacriticSensitivity: StringSensitivity? = nil,<br>
width widthSensitivity: StringSensitivity? = nil,<br>
in locale: Locale? = nil<br>
) -> Self { ... }<br>
}<br>
```<br>
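<br>
As a rough illustration of how the `nil` defaults could be resolved, the sketch<br>
below picks a case sensitivity from the presence or absence of a locale. The<br>
helper `resolvedCaseSensitivity` is hypothetical and not part of any proposal; it<br>
only mirrors the rule stated above (localized operations default to<br>
case-insensitive, non-localized ones to case-sensitive).<br>
<br>
```swift<br>
import Foundation<br>
<br>
enum StringSensitivity {<br>
  case sensitive<br>
  case insensitive<br>
}<br>
<br>
// Hypothetical helper: an explicit choice always wins; otherwise the<br>
// default depends on whether a locale was supplied.<br>
func resolvedCaseSensitivity(<br>
  _ explicit: StringSensitivity?, in locale: Locale?<br>
) -> StringSensitivity {<br>
  if let explicit = explicit { return explicit }<br>
  return locale == nil ? .sensitive : .insensitive<br>
}<br>
<br>
// No locale: machine-oriented default.<br>
let machineDefault = resolvedCaseSensitivity(nil, in: nil)<br>
// Locale present: human-oriented default.<br>
let humanDefault = resolvedCaseSensitivity(nil, in: Locale(identifier: "de_CH"))<br>
```<br>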
<br>
### Comparing and Hashing Strings<br>
<br>
#### Collation Semantics<br>
<br>
What Unicode says about collation—which is used in `<`, `==`, and hashing—turns<br>
out to be quite interesting, once you pick it apart. The full Unicode Collation<br>
Algorithm (UCA) works like this:<br>
<br>
1. Fully normalize both strings<br>
2. Convert each string to a sequence of numeric triples to form a collation key<br>
3. “Flatten” the key by concatenating the sequence of first elements to the<br>
sequence of second elements to the sequence of third elements<br>
4. Lexicographically compare the flattened keys<br>
<br>
While step 1 can usually<br>
be [done quickly](<a href="http://unicode.org/reports/tr15/#Description_Norm" rel="noreferrer" target="_blank">http://unicode.org/<wbr>reports/tr15/#Description_Norm</a><wbr>) and<br>
incrementally, step 2 uses a collation table that maps matching *sequences* of<br>
unicode scalars in the normalized string to *sequences* of triples, which get<br>
accumulated into a collation key. Predictably, this is where the real costs<br>
lie.<br>
<br>
*However*, there are some bright spots to this story. First, as it turns out,<br>
string sorting (localized or not) should be done down to what's called<br>
the<br>
[“identical” level](<a href="http://unicode.org/reports/tr10/#Multi_Level_Comparison" rel="noreferrer" target="_blank">http://unicode.org/<wbr>reports/tr10/#Multi_Level_<wbr>Comparison</a>),<br>
which adds a step 3a: append the string's normalized form to the flattened<br>
collation key. At first blush this just adds work, but consider what it does<br>
for equality: two strings that normalize the same, naturally, will collate the<br>
same. But also, *strings that normalize differently will always collate<br>
differently*. In other words, for equality, it is sufficient to compare the<br>
strings' normalized forms and see if they are the same. We can therefore<br>
entirely skip the expensive part of collation for equality comparison.<br>
<br>
Next, naturally, anything that applies to equality also applies to hashing: it<br>
is sufficient to hash the string's normalized form, bypassing collation keys.<br>
This should provide significant speedups over the current implementation.<br>
Perhaps more importantly, since comparison down to the “identical” level applies<br>
even to localized strings, it means that hashing and equality can be implemented<br>
exactly the same way for localized and non-localized text, and hash tables with<br>
localized keys will remain valid across current-locale changes.<br>
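<br>
A minimal sketch of the equality shortcut, assuming Foundation's<br>
`precomposedStringWithCanonicalMapping` (NFC) as a stand-in for "the normalized<br>
form"; the manifesto does not say which normalization form or API would be used,<br>
and `equalByNormalizedForm` is a hypothetical name.<br>
<br>
```swift<br>
import Foundation<br>
<br>
// Two spellings of "é": precomposed, and "e" followed by U+0301.<br>
let precomposed = "\u{E9}"<br>
let decomposed = "e\u{301}"<br>
<br>
// Compare the scalars of the normalized forms directly, without ever<br>
// building a collation key.<br>
func equalByNormalizedForm(_ a: String, _ b: String) -> Bool {<br>
  let na = a.precomposedStringWithCanonicalMapping<br>
  let nb = b.precomposedStringWithCanonicalMapping<br>
  return na.unicodeScalars.elementsEqual(nb.unicodeScalars)<br>
}<br>
<br>
// The raw scalar sequences differ, but the normalized ones agree.<br>
let rawScalarsMatch =<br>
  precomposed.unicodeScalars.elementsEqual(decomposed.unicodeScalars)<br>
let normalizedMatch = equalByNormalizedForm(precomposed, decomposed)<br>
```<br>
<br>
Hashing could follow the same route by feeding the normalized scalars to the<br>
hasher, so that equal strings hash equally.<br>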
<br>
Finally, once it is agreed that the *default* role for `String` is to handle<br>
machine-generated and machine-readable text, the default ordering of `String`s<br>
need no longer use the UCA at all. It is sufficient to order them in any way<br>
that's consistent with equality, so `String` ordering can simply be a<br>
lexicographical comparison of normalized forms,[4]<br>
(which is equivalent to lexicographically comparing the sequences of grapheme<br>
clusters), again bypassing step 2 and offering another speedup.<br>
<br>
This leaves us executing the full UCA *only* for localized sorting, and ICU's<br>
implementation has apparently been very well optimized.<br>
<br>
Following this scheme everywhere would also allow us to make sorting behavior<br>
consistent across platforms. Currently, we sort `String` according to the UCA,<br>
except that—*only on Apple platforms*—pairs of ASCII characters are ordered by<br>
unicode scalar value.<br>
<br>
#### Syntax<br>
<br>
Because the current `Comparable` protocol expresses all comparisons with binary<br>
operators, string comparisons—which may require<br>
additional [options](#operations-with-options)—do not fit smoothly into the<br>
existing syntax. At the same time, we'd like to solve other problems with<br>
comparison, as outlined<br>
in<br>
[this proposal](<a href="https://gist.github.com/CodaFi/f0347bd37f1c407bf7ea0c429ead380e" rel="noreferrer" target="_blank">https://gist.github.<wbr>com/CodaFi/<wbr>f0347bd37f1c407bf7ea0c429ead38<wbr>0e</a>)<br>
(implemented by changes at the head<br>
of<br>
[this branch](<a href="https://github.com/CodaFi/swift/commits/space-the-final-frontier" rel="noreferrer" target="_blank">https://github.com/<wbr>CodaFi/swift/commits/space-<wbr>the-final-frontier</a>)).<br>
We should adopt a modification of that proposal that uses a method rather than<br>
an operator `<=>`:<br>
<br>
```swift<br>
enum SortOrder { case before, same, after }<br>
<br>
protocol Comparable : Equatable {<br>
func compared(to: Self) -> SortOrder<br>
...<br>
}<br>
```<br>
<br>
This change will give us a syntactic platform on which to implement methods with<br>
additional, defaulted arguments, thereby unifying and regularizing comparison<br>
across the library.<br>
<br>
```swift<br>
extension String {<br>
func compared(to: Self) -> SortOrder<br>
<br>
}<br>
```<br>
<br>
**Note:** `SortOrder` should bridge to `NSComparisonResult`. It's also possible<br>
that the standard library simply adopts Foundation's `ComparisonResult` as is,<br>
but we believe the community should at least consider alternate naming before<br>
that happens. There will be an opportunity to discuss the choices in detail<br>
when the modified<br>
[Comparison Proposal](<a href="https://gist.github.com/CodaFi/f0347bd37f1c407bf7ea0c429ead380e" rel="noreferrer" target="_blank">https://gist.github.<wbr>com/CodaFi/<wbr>f0347bd37f1c407bf7ea0c429ead38<wbr>0e</a>) comes<br>
up for review.<br>
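<br>
To make the shape of a method-based comparison concrete, here is a small sketch.<br>
The protocol name `ThreeWayComparable` and the `Version` type are hypothetical,<br>
chosen only to avoid clashing with the existing `Comparable`; the sketch shows<br>
how a binary operator could be derived from the single `compared(to:)` method.<br>
<br>
```swift<br>
enum SortOrder { case before, same, after }<br>
<br>
// Hypothetical stand-in for the revised `Comparable`.<br>
protocol ThreeWayComparable {<br>
  func compared(to other: Self) -> SortOrder<br>
}<br>
<br>
extension ThreeWayComparable {<br>
  // A binary operator derived from the one required method.<br>
  static func < (lhs: Self, rhs: Self) -> Bool {<br>
    return lhs.compared(to: rhs) == .before<br>
  }<br>
}<br>
<br>
struct Version: ThreeWayComparable {<br>
  var major: Int<br>
  var minor: Int<br>
<br>
  // Compare major first, then minor.<br>
  func compared(to other: Version) -> SortOrder {<br>
    if major != other.major { return major < other.major ? .before : .after }<br>
    if minor != other.minor { return minor < other.minor ? .before : .after }<br>
    return .same<br>
  }<br>
}<br>
<br>
let older = Version(major: 1, minor: 2)<br>
let newer = Version(major: 1, minor: 10)<br>
let order = older.compared(to: newer)<br>
```<br>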
<br>
### `String` should be a `Collection` of `Character`s Again<br>
<br>
In Swift 2.0, `String`'s `Collection` conformance was dropped, because we<br>
convinced ourselves that its semantics differed from those of `Collection` too<br>
significantly.<br>
<br>
It was always well understood that if strings were treated as sequences of<br>
`UnicodeScalar`s, algorithms such as `lexicographicalCompare`, `elementsEqual`,<br>
and `reversed` would produce nonsense results. Thus, in Swift 1.0, `String` was<br>
a collection of `Character` (extended grapheme clusters). During 2.0<br>
development, though, we realized that correct string concatenation could<br>
occasionally merge distinct grapheme clusters at the start and end of combined<br>
strings.<br>
<br>
This quirk aside, every aspect of strings-as-collections-of-<wbr>graphemes appears to<br>
comport perfectly with Unicode. We think the concatenation problem is tolerable,<br>
because the cases where it occurs all represent partially-formed constructs. The<br>
largest class—isolated combining characters such as ◌́ (U+0301 COMBINING ACUTE<br>
ACCENT)—are explicitly called out in the Unicode standard as<br>
“[degenerate](<a href="http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries" rel="noreferrer" target="_blank">http://unicode.<wbr>org/reports/tr29/#Grapheme_<wbr>Cluster_Boundaries</a>)” or<br>
“[defective](<a href="http://www.unicode.org/versions/Unicode9.0.0/ch03.pdf" rel="noreferrer" target="_blank">http://www.<wbr>unicode.org/versions/Unicode9.<wbr>0.0/ch03.pdf</a>)”. The other<br>
cases—such as a string ending in a zero-width joiner or half of a regional<br>
indicator—appear to be equally transient and unlikely outside of a text editor.<br>
<br>
Admitting these cases encourages exploration of grapheme composition and is<br>
consistent with what appears to be an overall Unicode philosophy that “no<br>
special provisions are made to get marginally better behavior for… cases that<br>
never occur in practice.”[2] Furthermore, it seems<br>
unlikely to disturb the semantics of any plausible algorithms. We can handle<br>
these cases by documenting them, explicitly stating that the elements of a<br>
`String` are an emergent property based on Unicode rules.<br>
<br>
The benefits of restoring `Collection` conformance are substantial:<br>
<br>
* Collection-like operations encourage experimentation with strings to<br>
investigate and understand their behavior. This is useful for teaching new<br>
programmers, but also good for experienced programmers who want to<br>
understand more about strings/unicode.<br>
<br>
* Extended grapheme clusters form a natural element boundary for Unicode<br>
strings. For example, searching and matching operations will always produce<br>
results that line up on grapheme cluster boundaries.<br>
<br>
* Character-by-character processing is a legitimate thing to do in many real<br>
use-cases, including parsing, pattern matching, and language-specific<br>
transformations such as transliteration.<br>
<br>
* `Collection` conformance makes a wide variety of powerful operations<br>
available that are appropriate to `String`'s default role as the vehicle for<br>
machine processed text.<br>
<br>
The methods `String` would inherit from `Collection`, where similar to<br>
higher-level string algorithms, have the right semantics. For example,<br>
grapheme-wise `lexicographicalCompare`, `elementsEqual`, and application of<br>
`flatMap` with case-conversion, produce the same results one would expect<br>
from whole-string ordering comparison, equality comparison, and<br>
case-conversion, respectively. `reversed` operates correctly on graphemes,<br>
keeping diacritics moored to their base characters and leaving emoji intact.<br>
Other methods such as `index(of:)` and `contains` make obvious sense. A few<br>
`Collection` methods, like `min` and `max`, may not be particularly useful<br>
on `String`, but we don't consider that to be a problem worth solving, in<br>
the same way that we wouldn't try to suppress `min` and `max` on a<br>
`Set([UInt8])` that was used to store IP addresses.<br>
<br>
* Many of the higher-level operations that we want to provide for `String`s,<br>
such as parsing and pattern matching, should apply to any `Collection`, and<br>
many of the benefits we want for `Collection`s, such<br>
as unified slicing, should accrue<br>
equally to `String`. Making `String` part of the same protocol hierarchy<br>
allows us to write these operations once and not worry about keeping the<br>
benefits in sync.<br>
<br>
* Slicing strings into substrings is a crucial part of the vocabulary of<br>
string processing, and all other sliceable things are `Collection`s.<br>
Because of its collection-like behavior, users naturally think of `String`<br>
in collection terms, but run into frustrating limitations where it fails to<br>
conform and are left to wonder where all the differences lie. Many simply<br>
“correct” this limitation by declaring a trivial conformance:<br>
<br>
```swift<br>
extension String : BidirectionalCollection {}<br>
```<br>
<br>
Even if we removed indexing-by-element from `String`, users could still do<br>
this:<br>
<br>
```swift<br>
extension String : BidirectionalCollection {<br>
subscript(i: Index) -> Character { return characters[i] }<br>
}<br>
```<br>
<br>
It would be much better to legitimize the conformance to `Collection` and<br>
simply document the oddity of any concatenation corner-cases, than to deny<br>
users the benefits on the grounds that a few cases are confusing.<br>
<br>
Note that the fact that `String` is a collection of graphemes does *not* mean<br>
that string operations will necessarily have to do grapheme boundary<br>
recognition. See the Unicode protocol section for details.<br>
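<br>
The behavior described in this section can be tried out with a short sketch. It<br>
assumes a toolchain in which `String` already is a `Collection` of `Character`,<br>
as proposed here; under Swift 3 the same calls would go through `.characters`.<br>
<br>
```swift<br>
// "café" spelled with a combining acute accent (U+0301).<br>
let word = "cafe\u{301}"<br>
<br>
let graphemeCount = word.count                // counts Characters<br>
let scalarCount = word.unicodeScalars.count   // counts Unicode scalars<br>
<br>
// Reversal works grapheme by grapheme, so the accent stays on its "e".<br>
let reversedWord = String(word.reversed())<br>
<br>
// The concatenation quirk: two one-Character strings merge into a<br>
// single Character when the second is an isolated combining mark.<br>
let base = "e"<br>
let accent = "\u{301}"<br>
let separateCount = base.count + accent.count<br>
let combinedCount = (base + accent).count<br>
```<br>
<br>
Here `separateCount` and `combinedCount` differ, which is the corner case the<br>
text proposes to document rather than forbid.<br>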
<br>
### `Character` and `CharacterSet`<br>
<br>
`Character`, which represents a<br>
Unicode<br>
[extended grapheme cluster](<a href="http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries" rel="noreferrer" target="_blank">http://unicode.org/<wbr>reports/tr29/#Grapheme_<wbr>Cluster_Boundaries</a>),<br>
is a bit of a black box, requiring conversion to `String` in order to<br>
do any introspection, including interoperation with ASCII. To fix this, we should:<br>
<br>
- Add a `unicodeScalars` view much like `String`'s, so that the sub-structure<br>
of grapheme clusters is discoverable.<br>
- Add a failable `init` from sequences of scalars (returning nil for sequences<br>
that contain 0 or 2+ graphemes).<br>
- (Lower priority) expose some operations, such as `func uppercased() -><br>
String`, `var isASCII: Bool`, and, to the extent they can be sensibly<br>
generalized, queries of unicode properties that should also be exposed on<br>
`UnicodeScalar` such as `isAlphabetic` and `isGraphemeBase`.<br>
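<br>
For comparison, the current workaround looks roughly like the sketch below: the<br>
`Character` is converted to a `String` before its scalars can be inspected. The<br>
variable names are illustrative only, and the ASCII check is a hand-written<br>
approximation of what a built-in `isASCII` might report.<br>
<br>
```swift<br>
// One grapheme cluster made of two scalars: "e" + COMBINING ACUTE ACCENT.<br>
let accented: Character = "e\u{301}"<br>
<br>
// Introspection requires a round trip through String.<br>
let scalars = Array(String(accented).unicodeScalars)<br>
let scalarValues = scalars.map { $0.value }<br>
<br>
// Hand-written ASCII test: exactly one scalar, below 0x80.<br>
let looksASCII = scalars.count == 1 && scalars[0].value < 0x80<br>
```<br>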
<br>
Despite its name, `CharacterSet` currently operates on the Swift `UnicodeScalar`<br>
type. This means it is usable on `String`, but only by going through the unicode<br>
scalar view. To deal with this clash in the short term, `CharacterSet` should be<br>
renamed to `UnicodeScalarSet`. In the longer term, it may be appropriate to<br>
introduce a `CharacterSet` that provides similar functionality for extended<br>
grapheme clusters.[5]<br>
<br>
### Unification of Slicing Operations<br>
<br>
Creating substrings is a basic part of String processing, but the slicing<br>
operations that we have in Swift are inconsistent in both their spelling and<br>
their naming:<br>
<br>
* Slices with two explicit endpoints are done with subscript, and support<br>
in-place mutation:<br>
<br>
```swift<br>
s[i..<j].mutate()<br>
```<br>
<br>
* Slicing from an index to the end, or from the start to an index, is done<br>
with a method and does not support in-place mutation:<br>
```swift<br>
s.prefix(upTo: i).readOnly()<br>
```<br>
<br>
Prefix and suffix operations should be migrated to be subscripting operations<br>
with one-sided ranges i.e. `s.prefix(upTo: i)` should become `s[..<i]`, as<br>
in<br>
[this proposal](<a href="https://github.com/apple/swift-evolution/blob/9cf2685293108ea3efcbebb7ee6a8618b83d4a90/proposals/0132-sequence-end-ops.md" rel="noreferrer" target="_blank">https://github.com/<wbr>apple/swift-evolution/blob/<wbr>9cf2685293108ea3efcbebb7ee6a86<wbr>18b83d4a90/proposals/0132-<wbr>sequence-end-ops.md</a>).<br>
With generic subscripting in the language, that will allow us to collapse a wide<br>
variety of methods and subscript overloads into a single implementation, and<br>
give users an easy-to-use and composable way to describe subranges.<br>
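<br>
A short side-by-side of the two spellings, as a sketch. It assumes a toolchain in<br>
which `String` is a `Collection` and one-sided range subscripts are available,<br>
which is what this section proposes; `firstIndex(of:)` is the name used by later<br>
Swift releases, and the sample string is made up.<br>
<br>
```swift<br>
let s = "key=value"<br>
let i = s.firstIndex(of: "=") ?? s.endIndex<br>
<br>
let viaMethod = s.prefix(upTo: i)    // method spelling<br>
let viaSubscript = s[..<i]           // one-sided range spelling<br>
<br>
// One-sided ranges compose with other collection operations.<br>
let rest = s[i...].dropFirst()<br>
```<br>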
<br>
Further extending this EDSL to integrate use-cases like `s.prefix(maxLength: 5)`<br>
is an ongoing research project that can be considered part of the potential<br>
long-term vision of text (and collection) processing.<br>
<br>
### Substrings<br>
<br>
When implementing substring slicing, languages are faced with three options:<br>
<br>
1. Make the substrings the same type as string, and share storage.<br>
2. Make the substrings the same type as string, and copy storage when making the substring.<br>
3. Make substrings a different type, with a storage copy on conversion to string.<br>
<br>
We think number 3 is the best choice. A walk-through of the tradeoffs follows.<br>
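<br>
In user code, option 3 would look roughly like the sketch below. It assumes a<br>
distinct `Substring` type as described here; the sample log line and variable<br>
names are made up for illustration.<br>
<br>
```swift<br>
let logLine = "2017-01-19 ERROR disk full"<br>
<br>
// Slicing yields a Substring that shares storage with `logLine`.<br>
let date = logLine.prefix(10)<br>
<br>
// Long-term storage takes an explicit copy, so the small string does<br>
// not keep the whole original buffer alive.<br>
let storedDate = String(date)<br>
```<br>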
<br>
#### Same type, shared storage<br>
<br>
In Swift 3.0, slicing a `String` produces a new `String` that is a view into a<br>
subrange of the original `String`'s storage. This is why `String` is 3 words in<br>
size (the start, length and buffer owner), unlike the similar `Array` type<br>
which is only one.<br>
<br>
This is a simple model with big efficiency gains when chopping up strings into<br>
multiple smaller strings. But it does mean that a stored substring keeps the<br>
entire original string buffer alive even after it would normally have been<br>
released.<br>
<br>
This arrangement has proven to be problematic in other programming languages,<br>
because applications sometimes extract small strings from large ones and keep<br>
those small strings long-term. That is considered a memory leak and was enough<br>
of a problem in Java that they changed from substrings sharing storage to<br>
making a copy in 1.7.<br>
<br>
#### Same type, copied storage<br>
<br>
Copying of substrings is also the choice made in C#, and in the default<br>
`NSString` implementation. This approach avoids the memory leak issue, but has<br>
obvious performance overhead in performing the copies.<br>
<br>
This in turn encourages trafficking in string/range pairs instead of in<br>
substrings, for performance reasons, leading to API challenges. For example:<br>
<br>
```swift<br>
foo.compare(bar, range: start..<end)<br>
```<br>
<br>
Here, it is not clear whether `range` applies to `foo` or `bar`. This<br>
relationship is better expressed in Swift as a slicing operation:<br>
<br>
```swift<br>
foo[start..<end].compare(bar)<br>
```<br>
<br>
Not only does this clarify to which string the range applies, it also brings<br>
this sub-range capability to any API that operates on `String` "for free". So<br>
these other combinations also work equally well:<br>
<br>
```swift<br>
// apply range on argument rather than target<br>
foo.compare(bar[start..<end])<br>
// apply range on both<br>
foo[start..<end].compare(bar[start1..<end1])<br>
// compare two strings ignoring first character<br>
foo.dropFirst().compare(bar.dropFirst())<br>
```<br>
<br>
In all three cases, an explicit range argument need not appear on the `compare`<br>
method itself. The implementation of `compare` does not need to know anything<br>
about ranges. Methods need only take range arguments when that is an<br>
integral part of their purpose (for example, setting the start and end of a<br>
user's current selection in a text box).<br>
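<br>
A small sketch of the same idea with a function that can actually be run.<br>
`commonPrefixLength` is a hypothetical helper of ours, not a proposed API; it<br>
takes no range parameters, and callers narrow either operand by slicing:<br>
<br>
```swift
// Hypothetical helper: counts how many leading characters two strings share.
// It knows nothing about ranges; callers slice whichever operand they like.
func commonPrefixLength<A: StringProtocol, B: StringProtocol>(_ a: A, _ b: B) -> Int {
    var n = 0
    for (x, y) in zip(a, b) {
        if x != y { break }
        n += 1
    }
    return n
}

let foo = "xx-report-2017"
let bar = "report-2016"
let start = foo.index(foo.startIndex, offsetBy: 3)

let whole = commonPrefixLength(foo, bar)                          // no slicing
let sliced = commonPrefixLength(foo[start...], bar)               // range on the first operand
let both = commonPrefixLength(foo.dropFirst(4), bar.dropFirst())  // slices on both operands
```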
<br>
#### Different type, shared storage<br>
<br>
The desire to share underlying storage while preventing accidental memory leaks<br>
occurs with slices of `Array`. For this reason we have an `ArraySlice` type.<br>
The inconvenience of a separate type is mitigated by most operations used on<br>
`Array` from the standard library being generic over `Sequence` or `Collection`.<br>
<br>
We should apply the same approach for `String` by introducing a distinct<br>
`SubSequence` type, `Substring`. Similar advice given for `ArraySlice` would apply to `Substring`:<br>
<br>
> Important: Long-term storage of `Substring` instances is discouraged. A<br>
> substring holds a reference to the entire storage of a larger string, not<br>
> just to the portion it presents, even after the original string's lifetime<br>
> ends. Long-term storage of a `Substring` may therefore prolong the lifetime<br>
> of large strings that are no longer otherwise accessible, which can appear<br>
> to be memory leakage.<br>
<br>
When assigning a `Substring` to a longer-lived variable (usually a stored<br>
property) explicitly of type `String`, a type conversion will be performed, and<br>
at this point the substring buffer is copied and the original string's storage<br>
can be released.<br>
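<br>
A sketch of that conversion point. We assume a toolchain without the implicit<br>
conversion, so the copy is spelled `String(slice)`; `Contact` and the sample<br>
data are illustrative only:<br>
<br>
```swift
struct Contact {
    var name: String              // long-lived storage is declared as String
}

let line = "name=Ada Lovelace;role=mathematician;notes=..."
let nameStart = line.index(line.startIndex, offsetBy: 5)
let nameEnd = line.firstIndex(of: ";")!

let slice = line[nameStart..<nameEnd]        // Substring sharing line's storage
let contact = Contact(name: String(slice))   // new String holding just the slice's characters
```
<br>
After the conversion, `contact.name` no longer refers to the storage of `line`.<br>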
<br>
A `String` that was not its own `Substring` could be one word (a single tagged<br>
pointer) without requiring additional allocations. A `Substring` would be a view<br>
onto a `String`, and so would be 3 words: pointer to owner, pointer to start, and a<br>
length. The small string optimization for `Substring` would take advantage of<br>
the larger size, probably with a less compressed encoding for speed.<br>
<br>
The downside of having two types is the inconvenience of sometimes having a<br>
`Substring` when you need a `String`, and vice-versa. It is likely this would<br>
be a significantly bigger problem than with `Array` and `ArraySlice`, as<br>
slicing of `String` is such a common operation. It is especially relevant to<br>
existing code that assumes `String` is the currency type. To ease the pain of<br>
type mismatches, `Substring` should be a subtype of `String` in the same way<br>
that `Int` is a subtype of `Optional<Int>`. This would give users an implicit<br>
conversion from `Substring` to `String`, as well as the usual implicit<br>
conversions such as `[Substring]` to `[String]` that other subtype<br>
relationships receive.<br>
<br>
In most cases, type inference combined with the subtype relationship should<br>
make the type difference a non-issue and users will not care which type they<br>
are using. For flexibility and optimizability, most operations from the<br>
standard library will traffic in generic models of<br>
[`Unicode`](#the--code-unicode--code--protocol).<br>
<br>
##### Guidance for API Designers<br>
<br>
In this model, **if a user is unsure about which type to use, `String` is always<br>
a reasonable default**. A `Substring` passed where `String` is expected will be<br>
implicitly copied. When compared to the “same type, copied storage” model, we<br>
have effectively deferred the cost of copying from the point where a substring<br>
is created until it must be converted to `String` for use with an API.<br>
<br>
A user who needs to optimize away copies altogether should use this guideline:<br>
if for performance reasons you are tempted to add a `Range` argument to your<br>
method as well as a `String` to avoid unnecessary copies, you should instead<br>
use `Substring`.<br>
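<br>
A sketch of that guideline. `leadingSpaces(in:)` is a hypothetical API of<br>
ours, and the unbounded-range spelling `document[...]` is assumed to be<br>
available for passing a whole string as a `Substring`:<br>
<br>
```swift
// Hypothetical API that only inspects its argument, so it takes a Substring
// rather than a String plus a Range.
func leadingSpaces(in text: Substring) -> Int {
    return text.prefix(while: { $0 == " " }).count
}

let document = "title\n    indented line\nlast"
let lineStart = document.index(after: document.firstIndex(of: "\n")!)
let lineEnd = document[lineStart...].firstIndex(of: "\n")!

let n = leadingSpaces(in: document[lineStart..<lineEnd])  // the line is not copied
let m = leadingSpaces(in: document[...])                  // whole string as a Substring
```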
<br>
##### The “Empty Subscript”<br>
<br>
To make it easy to call such an optimized API when you only have a `String` (or<br>
to call any API that takes a `Collection`'s `SubSequence` when all you have is<br>
the `Collection`), we propose the following “empty subscript” operation,<br>
<br>
```swift<br>
extension Collection {<br>
subscript() -> SubSequence {<br>
return self[startIndex..<endIndex]<br>
}<br>
}<br>
```<br>
<br>
which allows the following usage:<br>
<br>
```swift<br>
funcThatIsJustLooking(at: person.name[]) // pass person.name as Substring<br>
```<br>
<br>
The `[]` syntax can be offered as a fixit when needed, similar to `&` for an<br>
`inout` argument. While it doesn't help a user to convert `[String]` to<br>
`[Substring]`, the need for such conversions is extremely rare, and they can be<br>
done with a simple `map` (which could also be offered by a fixit):<br>
<br>
```swift<br>
takesAnArrayOfSubstring(arrayOfString.map { $0[] })<br>
```<br>
<br>
#### Other Options Considered<br>
<br>
As we have seen, all three options above have downsides, but it's possible<br>
these downsides could be eliminated/mitigated by the compiler. We are proposing<br>
one such mitigation (implicit conversion) as part of the "different type,<br>
shared storage" option, to help avoid the cognitive load on developers of<br>
having to deal with a separate `Substring` type.<br>
<br>
To avoid the memory leak issues of a "same type, shared storage" substring<br>
option, we considered whether the compiler could perform an implicit copy of<br>
the underlying storage when it detects the string is being "stored" for long<br>
term usage, say when it is assigned to a stored property. The trouble with this<br>
approach is it is very difficult for the compiler to distinguish between<br>
long-term storage versus short-term in the case of abstractions that rely on<br>
stored properties. For example, should the storing of a substring inside an<br>
`Optional` be considered long-term? Or the storing of multiple substrings<br>
inside an array? The latter would not work well in the case of a<br>
`components(separatedBy:)` implementation that intended to return an array of<br>
substrings. It would also be difficult to distinguish intentional medium-term<br>
storage of substrings, say by a lexer. There does not appear to be an effective<br>
consistent rule that could be applied in the general case for detecting when a<br>
substring is truly being stored long-term.<br>
<br>
To avoid the cost of copying substrings under "same type, copied storage", the<br>
optimizer could be enhanced to reduce the impact of some of those copies.<br>
For example, this code could be optimized to pull the invariant substring out<br>
of the loop:<br>
<br>
```swift<br>
for _ in 0..<lots {<br>
someFunc(takingString: bigString[bigRange])<br>
}<br>
```<br>
<br>
It's worth noting that a similar optimization is needed to avoid an equivalent<br>
problem with implicit conversion in the "different type, shared storage" case:<br>
<br>
```swift<br>
let substring = bigString[bigRange]<br>
for _ in 0..<lots { someFunc(takingString: substring) }<br>
```<br>
<br>
However, in the case of "same type, copied storage" there are many use cases<br>
that cannot be optimized as easily. Consider the following simple definition of<br>
a recursive `contains` algorithm, which when substring slicing is linear makes<br>
the overall algorithm quadratic:<br>
<br>
```swift<br>
extension String {<br>
func containsChar(_ x: Character) -> Bool {<br>
return !isEmpty && (first == x || dropFirst().containsChar(x))<br>
}<br>
}<br>
```<br>
<br>
It is unrealistic to expect the optimizer to eliminate this problem, so users<br>
who want efficient code must remember to rewrite it by hand to avoid string<br>
slicing:<br>
<br>
```swift<br>
extension String {<br>
// add optional argument tracking progress through the string<br>
func containsCharacter(_ x: Character, atOrAfter idx: Index? = nil) -> Bool {<br>
let idx = idx ?? startIndex<br>
return idx != endIndex<br>
&& (self[idx] == x || containsCharacter(x, atOrAfter: index(after: idx)))<br>
}<br>
}<br>
```<br>
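<br>
For contrast, a sketch of the recursive form under the "different type, shared<br>
storage" model, assuming a `Substring` type whose own slices are also<br>
`Substring`s. No storage is copied at any step, so the simple definition stays<br>
linear (recursion depth still grows with the length of the string):<br>
<br>
```swift
extension Substring {
    // Each dropFirst() yields another Substring over the same storage, so no
    // characters are copied and the whole search is linear in the length.
    func containsChar(_ x: Character) -> Bool {
        return !isEmpty && (first == x || dropFirst().containsChar(x))
    }
}

let text = "shared storage"
let found = text[...].containsChar("g")
let missing = text[...].containsChar("z")
```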
<br>
#### Substrings, Ranges and Objective-C Interop<br>
<br>
The pattern of passing a string/range pair is common in several Objective-C<br>
APIs, and is made especially awkward in Swift by the non-interchangeability of<br>
`Range<String.Index>` and `NSRange`.<br>
<br>
```swift<br>
s2.find(s2, sourceRange: NSRange(j..<s2.endIndex, in: s2))<br>
```<br>
<br>
In general, however, the Swift idiom for operating on a sub-range of a<br>
`Collection` is to *slice* the collection and operate on that:<br>
<br>
```swift<br>
s2.find(s2[j..<s2.endIndex])<br>
```<br>
<br>
Therefore, APIs that operate on an `NSString`/`NSRange` pair should be imported<br>
without the `NSRange` argument. The Objective-C importer should be changed to<br>
give these APIs special treatment so that when a `Substring` is passed, instead<br>
of being converted to a `String`, the full `NSString` and range are passed to<br>
the Objective-C method, thereby avoiding a copy.<br>
<br>
As a result, you would never need to pass an `NSRange` to these APIs, which<br>
solves the impedance problem by eliminating the argument, resulting in more<br>
idiomatic Swift code while retaining the performance benefit. To help users<br>
manually handle any cases that remain, Foundation should be augmented to allow<br>
the following syntax for converting to and from `NSRange`:<br>
<br>
```swift<br>
let nsr = NSRange(i..<j, in: s) // An NSRange corresponding to s[i..<j]<br>
let iToJ = Range(nsr, in: s) // Equivalent to i..<j<br>
```<br>
<br>
### The `Unicode` protocol<br>
<br>
With `Substring` and `String` being distinct types and sharing almost all<br>
interface and semantics, and with the highest-performance string processing<br>
requiring knowledge of encoding and layout that the currency types can't<br>
provide, it becomes important to capture the common “string API” in a protocol.<br>
Since Unicode conformance is a key feature of string processing in Swift, we<br>
call that protocol `Unicode`:<br>
<br>
**Note:** The following assumes several features that are planned but not yet implemented in<br>
Swift, and should be considered a sketch rather than a final design.<br>
<br>
```swift<br>
protocol Unicode<br>
: Comparable, BidirectionalCollection where Element == Character {<br>
<br>
associatedtype Encoding : UnicodeEncoding<br>
var encoding: Encoding { get }<br>
<br>
associatedtype CodeUnits<br>
: RandomAccessCollection where Element == Encoding.CodeUnit<br>
var codeUnits: CodeUnits { get }<br>
<br>
associatedtype UnicodeScalars<br>
: BidirectionalCollection where Element == UnicodeScalar<br>
var unicodeScalars: UnicodeScalars { get }<br>
<br>
associatedtype ExtendedASCII<br>
: BidirectionalCollection where Element == UInt32<br>
var extendedASCII: ExtendedASCII { get }<br>
}<br>
<br>
extension Unicode {<br>
// ... define high-level non-mutating string operations, e.g. search ...<br>
<br>
func compared<Other: Unicode>(<br>
to rhs: Other,<br>
case caseSensitivity: StringSensitivity? = nil,<br>
diacritic diacriticSensitivity: StringSensitivity? = nil,<br>
width widthSensitivity: StringSensitivity? = nil,<br>
in locale: Locale? = nil<br>
) -> SortOrder { ... }<br>
}<br>
<br>
extension Unicode : RangeReplaceableCollection where CodeUnits :<br>
RangeReplaceableCollection {<br>
// Satisfy protocol requirement<br>
mutating func replaceSubrange<C : Collection>(_: Range<Index>, with: C)<br>
where C.Element == Element<br>
<br>
// ... define high-level mutating string operations, e.g. replace ...<br>
}<br>
<br>
```<br>
<br>
The goal is that `Unicode` exposes the underlying encoding and code units in<br>
such a way that for types with a known representation (e.g. a high-performance<br>
`UTF8String`) that information can be known at compile-time and can be used to<br>
generate a single path, while still allowing types like `String` that admit<br>
multiple representations to use runtime queries and branches to fast path<br>
specializations.<br>
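<br>
To make the static-versus-dynamic distinction concrete, here is a heavily<br>
simplified sketch. `CodeUnitBacked`, `UTF8Text`, `UTF16Text` and `DynamicText`<br>
are hypothetical names of ours that model only the code-unit view, not the<br>
full `Unicode` protocol:<br>
<br>
```swift
// Hypothetical stand-in for the protocol sketched above: code units only.
protocol CodeUnitBacked {
    associatedtype CodeUnits: RandomAccessCollection
        where CodeUnits.Element: UnsignedInteger
    var codeUnits: CodeUnits { get }
}

extension CodeUnitBacked {
    // Written once against the protocol; the compiler can specialize it for
    // each conforming type because the code unit type is known statically.
    var isASCII: Bool {
        return codeUnits.allSatisfy { $0 < 0x80 }
    }
}

// Types with a single, statically known representation.
struct UTF8Text: CodeUnitBacked { var codeUnits: [UInt8] }
struct UTF16Text: CodeUnitBacked { var codeUnits: [UInt16] }

// A type admitting several representations branches once at run time and
// then dispatches to the specialized fast path.
enum DynamicText {
    case utf8(UTF8Text), utf16(UTF16Text)

    var isASCII: Bool {
        switch self {
        case .utf8(let t): return t.isASCII
        case .utf16(let t): return t.isASCII
        }
    }
}

let plain = UTF8Text(codeUnits: Array("abc".utf8))
let accented = UTF16Text(codeUnits: Array("\u{E9}".utf16))
let dynamic = DynamicText.utf8(plain)
```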
<br>
**Note:** `Unicode` would make a fantastic namespace for much of<br>
what's in this proposal if we could get the ability to nest types and<br>
protocols in protocols.<br>
<br>
<br>
### Scanning, Matching, and Tokenization<br>
<br>
#### Low-Level Textual Analysis<br>
<br>
We should provide convenient APIs for processing strings by character. For example,<br>
it should be easy to cleanly express, “if this string starts with `"f"`, process<br>
the rest of the string as follows…” Swift is well-suited to expressing this<br>
common pattern beautifully, but we need to add the APIs. Here are two examples<br>
of the sort of code that might be possible given such APIs:<br>
<br>
```swift<br>
if let firstLetter = input.droppingPrefix(alphabeticCharacter) {<br>
somethingWith(input) // process the rest of input<br>
}<br>
<br>
if let (number, restOfInput) = input.parsingPrefix(Int.self) {<br>
...<br>
}<br>
```<br>
<br>
The specific spelling and functionality of APIs like this are TBD. The larger<br>
point is to make sure matching-and-consuming jobs are well-supported.<br>
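<br>
One way such matching-and-consuming helpers might be sketched on `Substring`.<br>
The names `droppingPrefix(_:)` and `parsingIntPrefix()` and their signatures<br>
are assumptions made for illustration, not proposed spellings:<br>
<br>
```swift
extension Substring {
    // Hypothetical: returns the remainder if self begins with `literal`.
    func droppingPrefix(_ literal: String) -> Substring? {
        guard starts(with: literal) else { return nil }
        return dropFirst(literal.count)
    }

    // Hypothetical: parses leading ASCII digits as an Int and returns the
    // value together with the unconsumed remainder.
    func parsingIntPrefix() -> (Int, Substring)? {
        let digitRange: ClosedRange<Character> = "0"..."9"
        let digits = prefix(while: { digitRange.contains($0) })
        guard let value = Int(digits) else { return nil }   // nil if no digits
        return (value, self[digits.endIndex...])
    }
}

let input: Substring = "f42 rest"
// "If this string starts with f, parse a number from what follows."
let parsed = input.droppingPrefix("f")?.parsingIntPrefix()
let noMatch = input.droppingPrefix("g")
```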
<br>
#### Unified Pattern Matcher Protocol<br>
<br>
Many of the current methods that do matching are overloaded to do the same<br>
logical operations in different ways, with the following axes:<br>
<br>
- Logical Operation: `find`, `split`, `replace`, match at start<br>
- Kind of pattern: `CharacterSet`, `String`, a regex, a closure<br>
- Options, e.g. case/diacritic sensitivity, locale. Sometimes a part of<br>
the method name, and sometimes an argument<br>
- Whole string or subrange.<br>
<br>
We should represent these aspects as orthogonal, composable components,<br>
abstracting pattern matchers into a protocol like<br>
[this one](<a href="https://github.com/apple/swift/blob/master/test/Prototypes/PatternMatching.swift#L33" rel="noreferrer" target="_blank">https://github.com/apple/swift/blob/master/test/Prototypes/PatternMatching.swift#L33</a>),<br>
that can allow us to define logical operations once, without introducing<br>
overloads, and massively reducing API surface area.<br>
<br>
For example, using the strawman prefix `%` syntax to turn string literals into<br>
patterns, the following pairs would all invoke the same generic methods:<br>
<br>
```swift<br>
if let found = s.firstMatch(%"searchString") { ... }<br>
if let found = s.firstMatch(someRegex) { ... }<br>
<br>
for m in s.allMatches((%"searchString"), case: .insensitive) { ... }<br>
for m in s.allMatches(someRegex) { ... }<br>
<br>
let items = s.split(separatedBy: ", ")<br>
let tokens = s.split(separatedBy: CharacterSet.whitespaces)<br>
```<br>
<br>
Note that, because Swift requires the indices of a slice to match the indices of<br>
the range from which it was sliced, operations like `firstMatch` can return a<br>
`Substring?` in lieu of a `Range<String.Index>?`: the indices of the match in<br>
the string being searched, if needed, can easily be recovered as the<br>
`startIndex` and `endIndex` of the `Substring`.<br>
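<br>
A minimal sketch of both points: one generic operation serving several kinds<br>
of pattern, and a `Substring?` result whose indices locate the match. The<br>
protocol `PrefixPattern` and the types `Literal` and `OneOf` are hypothetical<br>
and far simpler than the prototype linked above:<br>
<br>
```swift
// Hypothetical minimal pattern protocol; the real design is TBD.
protocol PrefixPattern {
    // Returns the end of a match beginning at text.startIndex, or nil.
    func matchEnd(atStartOf text: Substring) -> String.Index?
}

struct Literal: PrefixPattern {
    let text: String
    func matchEnd(atStartOf s: Substring) -> String.Index? {
        guard !text.isEmpty, s.starts(with: text) else { return nil }
        return s.index(s.startIndex, offsetBy: text.count)
    }
}

struct OneOf: PrefixPattern {
    let members: Set<Character>
    func matchEnd(atStartOf s: Substring) -> String.Index? {
        guard let c = s.first, members.contains(c) else { return nil }
        return s.index(after: s.startIndex)
    }
}

extension String {
    // Defined once; works with every conforming pattern type.
    func firstMatch<P: PrefixPattern>(_ pattern: P) -> Substring? {
        var i = startIndex
        while i < endIndex {
            if let end = pattern.matchEnd(atStartOf: self[i...]) {
                return self[i..<end]
            }
            i = index(after: i)
        }
        return nil
    }
}

let s = "alpha, beta; gamma"
let comma = s.firstMatch(Literal(text: ", "))
let separator = s.firstMatch(OneOf(members: [",", ";"]))

// The match's position in s is recovered from the slice's own indices.
let offset = comma.map { s.distance(from: s.startIndex, to: $0.startIndex) }
```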
<br>
Note also that matching operations are useful for collections in general, and<br>
would fall out of this proposal:<br>
<br>
```swift<br>
// replace subsequences of contiguous NaNs with zero<br>
forces.replace(oneOrMore([Float.nan]), [0.0])<br>
```<br>
<br>
#### Regular Expressions<br>
<br>
Addressing regular expressions is out of scope for this proposal.<br>
That said, it is important to note that the pattern matching protocol mentioned<br>
above provides a suitable foundation for regular expressions, and types such as<br>
`NSRegularExpression` can easily be retrofitted to conform to it. In the<br>
future, support for regular expression literals in the compiler could allow for<br>
compile-time syntax checking and optimization.<br>
<br>
### String Indices<br>
<br>
`String` currently has four views—`characters`, `unicodeScalars`, `utf8`, and<br>
`utf16`—each with its own opaque index type. The APIs used to translate indices<br>
between views add needless complexity, and the opacity of indices makes them<br>
difficult to serialize.<br>
<br>
The index translation problem has two aspects:<br>
<br>
1. `String` views cannot consume one another's indices without a cumbersome<br>
conversion step. An index into a `String`'s `characters` must be translated<br>
before it can be used as a position in its `unicodeScalars`. Although these<br>
translations are rarely needed, they add conceptual and API complexity.<br>
2. Many APIs in the core libraries and other frameworks still expose `String`<br>
positions as `Int`s and regions as `NSRange`s, which can only reference a<br>
`utf16` view and interoperate poorly with `String` itself.<br>
<br>
#### Index Interchange Among Views<br>
<br>
String's need for flexible backing storage and reasonably-efficient indexing<br>
(i.e. without dynamically allocating and reference-counting the indices<br>
themselves) means indices need an efficient underlying storage type. Although<br>
we do not wish to expose `String`'s indices *as* integers, `Int` offsets into<br>
underlying code unit storage make a good underlying storage type, provided<br>
`String`'s underlying storage supports random-access. We think random-access<br>
*code-unit storage* is a reasonable requirement to impose on all `String`<br>
instances.<br>
<br>
Making these `Int` code unit offsets conveniently accessible and constructible<br>
solves the serialization problem:<br>
<br>
```swift<br>
clipboard.write(s.endIndex.codeUnitOffset)<br>
let offset = clipboard.read(Int.self)<br>
let i = String.Index(codeUnitOffset: offset)<br>
```<br>
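<br>
The `codeUnitOffset` spelling above is only a proposal. As a runnable<br>
approximation, the sketch below assumes a standard library that provides<br>
`utf16Offset(in:)` and `String.Index(utf16Offset:in:)`, and round-trips an<br>
index through a plain `Int`:<br>
<br>
```swift
let s = "a\u{1F600}b"                  // "a", an emoji (2 UTF-16 units), "b"
let i = s.firstIndex(of: "b")!

// Serialize the position as a plain integer (UTF-16 code unit offset).
let offset = i.utf16Offset(in: s)

// Later: rebuild an index from the stored integer.
let restored = String.Index(utf16Offset: offset, in: s)
let character = s[restored]
```
<br>
Note that such an integer is only meaningful together with the encoding it was<br>
measured in, which is one motivation for the open question about `String`'s<br>
storable encodings.<br>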
<br>
Index interchange between `String` and its `unicodeScalars`, `codeUnits`,<br>
and [`extendedASCII`](#parsing-ascii-structure) views can be made entirely<br>
seamless by having them share an index type (semantics of indexing a `String`<br>
between grapheme cluster boundaries are TBD—it can either trap or be forgiving).<br>
Having a common index allows easy traversal into the interior of graphemes,<br>
something that is often needed, without making it likely that someone will do it<br>
by accident.<br>
<br>
- `String.index(after:)` should advance to the next grapheme, even when the<br>
index points partway through a grapheme.<br>
<br>
- `String.index(before:)` should move to the start of the grapheme before<br>
the current position.<br>
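<br>
A sketch of index interchange, assuming `String` and its `unicodeScalars` view<br>
share one index type. It deliberately avoids indexing the `String` itself with<br>
a mid-grapheme index, since those semantics are still TBD:<br>
<br>
```swift
let s = "e\u{301}x"                         // "e" + combining acute accent, then "x"
let second = s.index(after: s.startIndex)   // advances past the whole grapheme

// The same index value is accepted by the scalar view with no translation.
let scalar = s.unicodeScalars[second]

// A scalar-view index that points into the interior of the first grapheme.
let interior = s.unicodeScalars.index(after: s.unicodeScalars.startIndex)
let accent = s.unicodeScalars[interior]
```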
<br>
Seamless index interchange between `String` and its UTF-8 or UTF-16 views is not<br>
crucial, as the specifics of encoding should not be a concern for most use<br>
cases, and would impose needless costs on the indices of other views. That<br>
said, we can make translation much more straightforward by exposing simple<br>
bidirectional converting `init`s on both index types:<br>
<br>
```swift<br>
let u8Position = String.UTF8.Index(someStringIndex)<br>
let originalPosition = String.Index(u8Position)<br>
```<br>
<br>
#### Index Interchange with Cocoa<br>
<br>
We intend to address `NSRange`s that denote substrings in Cocoa APIs as<br>
described [earlier in this document](#substrings--ranges-and-objective-c-interop).<br>
That leaves the interchange of bare indices with Cocoa APIs trafficking in<br>
`Int`. Hopefully such APIs will be rare, but when needed, the following<br>
extension, which would be useful for all `Collection`s, can help:<br>
<br>
```swift<br>
extension Collection {<br>
func index(offset: IndexDistance) -> Index {<br>
return index(startIndex, offsetBy: offset)<br>
}<br>
func offset(of i: Index) -> IndexDistance {<br>
return distance(from: startIndex, to: i)<br>
}<br>
}<br>
```<br>
<br>
Then indices can easily be translated to and from integer offsets into a<br>
`String`'s `utf16` view for consumption by Cocoa:<br>
<br>
```swift<br>
let cocoaIndex = s.utf16.offset(of: String.UTF16Index(i))<br>
let swiftIndex = s.utf16.index(offset: cocoaIndex)<br>
```<br>
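<br>
A self-contained sketch of those helpers, assuming index distances are plain<br>
`Int` values. The `ArraySlice` case shows why they are useful for collections<br>
in general: a slice's indices need not start at zero.<br>
<br>
```swift
extension Collection {
    // Same helpers as above, spelled with Int (an assumption of this sketch).
    func index(offset: Int) -> Index {
        return index(startIndex, offsetBy: offset)
    }
    func offset(of i: Index) -> Int {
        return distance(from: startIndex, to: i)
    }
}

let s = "\u{1F600}!"                                 // emoji (2 UTF-16 units), then "!"
let i = s.firstIndex(of: "!")!
let cocoaIndex = s.utf16.offset(of: i)               // integer for an Int-based API
let swiftIndex = s.utf16.index(offset: cocoaIndex)   // back to a String index

let slice = [10, 20, 30, 40][1...]                   // indices run 1...3
let sliceOffset = slice.offset(of: 2)                // offset from the slice's start
let sliceIndex = slice.index(offset: sliceOffset)
```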
<br>
### Formatting<br>
<br>
A full treatment of formatting is out of scope of this proposal, but<br>
we believe it's crucial for completing the text processing picture. This<br>
section details some of the existing issues and thinking that may guide future<br>
development.<br>
<br>
#### Printf-Style Formatting<br>
<br>
`String.format` is designed on the `printf` model: it takes a format string with<br>
textual placeholders for substitution, and an arbitrary list of other arguments.<br>
The syntax and meaning of these placeholders has a long history in<br>
C, but for anyone who doesn't use them regularly they are cryptic and complex,<br>
as the `printf(3)` man page attests.<br>
<br>
Aside from complexity, this style of API has two major problems: First, the<br>
spelling of these placeholders must match up to the types of the arguments, in<br>
the right order, or the behavior is undefined. Some limited support for<br>
compile-time checking of this correspondence could be implemented, but only for<br>
the cases where the format string is a literal. Second, there's no reasonable<br>
way to extend the formatting vocabulary to cover the needs of new types: you are<br>
stuck with what's in the box.<br>
<br>
#### Foundation Formatters<br>
<br>
The formatters supplied by Foundation are highly capable and versatile, offering<br>
both formatting and parsing services. When used for formatting, though, the<br>
design pattern demands more from users than it should:<br>
<br>
* Matching the type of data being formatted to a formatter type<br>
* Creating an instance of that type<br>
* Setting stateful options (`currency`, `dateStyle`) on the type. Note: the<br>
need for this step prevents the instance from being used and discarded in<br>
the same expression where it is created.<br>
* Overall, introduction of needless verbosity into source<br>
<br>
These may seem like small issues, but the experience of Apple localization<br>
experts is that the total drag of these factors on programmers is such that they<br>
tend to reach for `String.format` instead.<br>
<br>
#### String Interpolation<br>
<br>
Swift string interpolation provides a user-friendly alternative to printf's<br>
domain-specific language (just write ordinary Swift code!) and its type safety<br>
problems (put the data right where it belongs!) but the following issues prevent<br>
it from being useful for localized formatting (among other jobs):<br>
<br>
* [SR-2303](<a href="https://bugs.swift.org/browse/SR-2303" rel="noreferrer" target="_blank">https://bugs.swift.org/browse/SR-2303</a>) We are unable to restrict<br>
types used in string interpolation.<br>
* [SR-1260](<a href="https://bugs.swift.org/browse/SR-1260" rel="noreferrer" target="_blank">https://bugs.swift.org/browse/SR-1260</a>) String interpolation can't<br>
distinguish (fragments of) the base string from the string substitutions.<br>
<br>
In the long run, we should improve Swift string interpolation to the point where<br>
it can participate in most any formatting job. Mostly this centers around<br>
fixing the interpolation protocols per the previous item, and supporting<br>
localization.<br>
<br>
To be able to use formatting effectively inside interpolations, it needs to be<br>
both lightweight (because it all happens in-situ) and discoverable. One<br>
approach would be to standardize on `format` methods, e.g.:<br>
<br>
```swift<br>
"Column 1: \(n.format(radix:16, width:8)) *** \(message)"<br>
<br>
"Something with leading zeroes: \(x.format(fill: zero, width:8))"<br>
```<br>
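<br>
To show that such a method can be lightweight, here is one possible sketch.<br>
The `format(radix:fill:width:)` signature and its defaults are assumptions of<br>
ours, and negative values and localization are not handled:<br>
<br>
```swift
extension BinaryInteger {
    // Hypothetical `format` method: digits in the given radix, left-padded
    // with `fill` to at least `width` characters.
    func format(radix: Int = 10, fill: Character = " ", width: Int = 0) -> String {
        let digits = String(self, radix: radix)
        let padding = String(repeating: fill, count: max(0, width - digits.count))
        return padding + digits
    }
}

let n = 255
let x = 42
let zero: Character = "0"
let message = "ok"

let line = "Column 1: \(n.format(radix: 16, width: 8)) *** \(message)"
let padded = "Something with leading zeroes: \(x.format(fill: zero, width: 8))"
```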
<br>
### C String Interop<br>
<br>
Our support for interoperation with nul-terminated C strings is scattered and<br>
incoherent, with six ways to transform a C string into a `String` and four ways to<br>
do the inverse. These APIs should be replaced with the following:<br>
<br>
```swift<br>
extension String {<br>
/// Constructs a `String` having the same contents as `nulTerminatedUTF8`.<br>
///<br>
/// - Parameter nulTerminatedUTF8: a sequence of contiguous UTF-8 encoded<br>
/// bytes ending just before the first zero byte (NUL character).<br>
init(cString nulTerminatedUTF8: UnsafePointer<CChar>)<br>
<br>
/// Constructs a `String` having the same contents as `nulTerminatedCodeUnits`.<br>
///<br>
/// - Parameter nulTerminatedCodeUnits: a sequence of contiguous code units in<br>
/// the given `encoding`, ending just before the first zero code unit.<br>
/// - Parameter encoding: describes the encoding in which the code units<br>
/// should be interpreted.<br>
init<Encoding: UnicodeEncoding>(<br>
cString nulTerminatedCodeUnits: UnsafePointer<Encoding.CodeUnit>,<br>
encoding: Encoding)<br>
<br>
/// Invokes the given closure on the contents of the string, represented as a<br>
/// pointer to a null-terminated sequence of UTF-8 code units.<br>
func withCString<Result>(<br>
_ body: (UnsafePointer<CChar>) throws -> Result) rethrows -> Result<br>
}<br>
```<br>
<br>
In both of the construction APIs, any invalid encoding sequence detected will<br>
have its longest valid prefix replaced by U+FFFD, the Unicode replacement<br>
character, per Unicode specification. This covers the common case. The<br>
replacement is done *physically* in the underlying storage and the validity of<br>
the result is recorded in the `String`'s `encoding` such that future accesses<br>
need not be slowed down by possible error repair separately.<br>
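<br>
The repair behavior can be observed with the existing `String(cString:)` and<br>
`withCString` APIs, which we assume behave like the first and third<br>
declarations above. The byte values are chosen purely for illustration:<br>
<br>
```swift
// 'a', a byte that is never valid in UTF-8, 'b', and the terminating NUL.
let bytes: [UInt8] = [0x61, 0xFF, 0x62, 0x00]

let repaired = bytes.withUnsafeBufferPointer { buffer in
    String(cString: buffer.baseAddress!)   // the invalid byte becomes U+FFFD
}

// The inverse direction: count UTF-8 code units before the terminator.
let length = "h\u{E9}llo".withCString { pointer -> Int in
    var n = 0
    while pointer[n] != 0 { n += 1 }
    return n
}
```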
<br>
Construction that is aborted when encoding errors are detected can be<br>
accomplished using APIs on the `encoding`. String types that retain their<br>
physical encoding even in the presence of errors and are repaired on-the-fly can<br>
be built as different instances of the `Unicode` protocol.<br>
<br>
### Unicode 9 Conformance<br>
<br>
Unicode 9 (and MacOS 10.11) brought us support for family emoji, which changes<br>
the process of properly identifying `Character` boundaries. We need to update<br>
`String` to account for this change.<br>
<br>
### High-Performance String Processing<br>
<br>
Many strings are short enough to store in 64 bits, many can be stored using only<br>
8 bits per Unicode scalar, others are best encoded in UTF-16, and some come to<br>
us already in some other encoding, such as UTF-8, that would be costly to<br>
translate. Supporting these formats while maintaining usability for<br>
general-purpose APIs demands that a single `String` type can be backed by many<br>
different representations.<br>
<br>
That said, the highest performance code always requires static knowledge of the<br>
data structures on which it operates, and for this code, dynamic selection of<br>
representation comes at too high a cost. Heavy-duty text processing demands a<br>
way to opt out of dynamism and directly use known encodings. Having this<br>
ability can also make it easy to cleanly specialize code that handles dynamic<br>
cases for maximal efficiency on the most common representations.<br>
<br>
To address this need, we can build models of the `Unicode` protocol that encode<br>
representation information into the type, such as `NFCNormalizedUTF16String`.<br>
<br>
### Parsing ASCII Structure<br>
<br>
Although many machine-readable formats support the inclusion of arbitrary<br>
Unicode text, it is also common that their fundamental structure lies entirely<br>
within the ASCII subset (JSON, YAML, many XML formats). These formats are often<br>
processed most efficiently by recognizing ASCII structural elements as ASCII,<br>
and capturing the arbitrary sections between them in more-general strings. The<br>
current String API offers no way to efficiently recognize ASCII and skip past<br>
everything else without the overhead of full decoding into Unicode scalars.<br>
<br>
For these purposes, strings should supply an `extendedASCII` view that is a<br>
collection of `UInt32`, where values less than `0x80` represent the<br>
corresponding ASCII character, and other values represent data that is specific<br>
to the underlying encoding of the string.<br>
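<br>
Since `extendedASCII` does not exist yet, the sketch below uses the existing<br>
`utf8` view as a stand-in. It has the same key property for UTF-8 storage:<br>
values below `0x80` are ASCII characters and everything else is<br>
encoding-specific. The JSON sample and the set of structural bytes are ours:<br>
<br>
```swift
// Stand-in for the proposed view: bytes below 0x80 are ASCII characters,
// all other bytes belong to multi-byte sequences and are skipped.
let json = "{\"name\":\"Zo\u{EB}\",\"tags\":[\"a\",\"b\"]}"

let structural: Set<UInt8> = [
    UInt8(ascii: "{"), UInt8(ascii: "}"),
    UInt8(ascii: "["), UInt8(ascii: "]"),
    UInt8(ascii: ":"), UInt8(ascii: ","),
]

var structuralCount = 0
for byte in json.utf8 where byte < 0x80 && structural.contains(byte) {
    structuralCount += 1     // recognized without decoding any scalar
}
```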
<br>
## Language Support<br>
<br>
This proposal depends on two new features in the Swift language:<br>
<br>
1. **Generic subscripts**, to<br>
enable unified slicing syntax.<br>
<br>
2. **A subtype relationship** between<br>
`Substring` and `String`, enabling framework APIs to traffic solely in<br>
`String` while still making it possible to avoid copies by handling<br>
`Substring`s where necessary.<br>
<br>
Additionally, **the ability to nest types and protocols inside<br>
protocols** could significantly shrink the footprint of this proposal<br>
on the top-level Swift namespace.<br>
<br>
<br>
## Open Questions<br>
<br>
### Must `String` be limited to storing UTF-16 subset encodings?<br>
<br>
- The ability to handle `UTF-8`-encoded strings (models of `Unicode`) is not in<br>
question here; this is about what encodings must be storable, without<br>
transcoding, in the common currency type called “`String`”.<br>
- ASCII, Latin-1, UCS-2, and UTF-16 are UTF-16 subsets. UTF-8 is not.<br>
- If we have a way to get at a `String`'s code units, we need a concrete type in<br>
which to express them in the API of `String`, which is a concrete type.<br>
- If String needs to be able to represent UTF-32, presumably the code units need<br>
to be `UInt32`.<br>
- Not supporting UTF-32-encoded text seems like one reasonable design choice.<br>
- Maybe we can allow UTF-8 storage in `String` and expose its code units as<br>
`UInt16`, just as we would for Latin-1.<br>
- Supporting only UTF-16-subset encodings would imply that `String` indices can<br>
be serialized without recording the `String`'s underlying encoding.<br>
<br>
### Do we need a type-erasable base protocol for UnicodeEncoding?<br>
<br>
UnicodeEncoding has an associated type, but it may be important to be able to<br>
traffic in completely dynamic encoding values, e.g. for “tell me the most<br>
efficient encoding for this string.”<br>
<br>
### Should there be a string “facade?”<br>
<br>
One possible design alternative makes `Unicode` a vehicle for expressing<br>
the storage and encoding of code units, but does not attempt to give it an API<br>
appropriate for `String`. Instead, string APIs would be provided by a generic<br>
wrapper around an instance of `Unicode`:<br>
<br>
```swift<br>
struct StringFacade<U: Unicode> : BidirectionalCollection {<br>
<br>
// ...APIs for high-level string processing here...<br>
<br>
var unicode: U // access to lower-level unicode details<br>
}<br>
<br>
typealias String = StringFacade<StringStorage><br>
typealias Substring = StringFacade<StringStorage.<wbr>SubSequence><br>
```<br>
<br>
This design would allow us to de-emphasize lower-level `String` APIs such as<br>
access to the specific encoding, by putting them behind a `.unicode` property.<br>
A similar effect in a facade-less design would require a new top-level<br>
`StringProtocol` playing the role of the facade with an `associatedtype<br>
Storage : Unicode`.<br>
<br>
An interesting variation on this design is possible if defaulted generic<br>
parameters are introduced to the language:<br>
<br>
```swift<br>
struct String<U: Unicode = StringStorage><br>
: BidirectionalCollection {<br>
<br>
// ...APIs for high-level string processing here...<br>
<br>
var unicode: U // access to lower-level unicode details<br>
}<br>
<br>
typealias Substring = String<StringStorage.<wbr>SubSequence><br>
```<br>
<br>
One advantage of such a design is that naïve users will always extend “the right<br>
type” (`String`) without thinking, and the new APIs will show up on `Substring`,<br>
`MyUTF8String`, etc. That said, it also has downsides that should not be<br>
overlooked, not least of which is the confusability of the meaning of the word<br>
“string.” Is it referring to the generic or the concrete type?<br>
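<br>
The “extend the right type” advantage can be sketched with a toy stand-in.<br>
`FacadeSketch`, its UTF-8-only storage constraint, and the `wordCount` extension<br>
are all hypothetical simplifications, not the proposed `String`:<br>
<br>
```swift<br>
// Toy facade: generic over any collection of UTF-8 code units.<br>
struct FacadeSketch<CodeUnits : Collection> where CodeUnits.Element == UInt8 {<br>
  var unicode: CodeUnits  // access to lower-level details<br>
<br>
  // High-level view, decoded on demand.<br>
  var text: String { return String(decoding: unicode, as: UTF8.self) }<br>
}<br>
<br>
typealias OwnedSketch = FacadeSketch<[UInt8]><br>
typealias SliceSketch = FacadeSketch<ArraySlice<UInt8>><br>
<br>
// A user extends the one generic type...<br>
extension FacadeSketch {<br>
  var wordCount: Int { return text.split(separator: " ").count }<br>
}<br>
<br>
// ...and the new API shows up on both the owning and the slice form.<br>
let owned = OwnedSketch(unicode: Array("hello big world".utf8))<br>
let slice = SliceSketch(unicode: owned.unicode[0..<9])  // bytes of "hello big"<br>
print(owned.wordCount, slice.wordCount)<br>
```<br>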
<br>
### `TextOutputStream` and `TextOutputStreamable`<br>
<br>
`TextOutputStreamable` is intended to provide a vehicle for<br>
efficiently transporting formatted representations to an output stream<br>
without forcing the allocation of storage. Its use of `String`, a<br>
type with multiple representations, as the lowest-level unit of<br>
communication, conflicts with this goal. It might be sufficient to<br>
change `TextOutputStream` and `TextOutputStreamable` to traffic in an<br>
associated type conforming to `Unicode`, but that is not yet clear.<br>
This area will require some design work.<br>
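<br>
For reference, the sketch below shows how the current protocols are used. It<br>
relies only on the existing `TextOutputStream` and `TextOutputStreamable`<br>
requirements; `PointSketch` and `ByteCountingStream` are hypothetical types. Note<br>
that every piece handed to the stream must already be a `String`:<br>
<br>
```swift<br>
// A stream that only counts UTF-8 bytes and stores nothing.<br>
struct ByteCountingStream : TextOutputStream {<br>
  var byteCount = 0<br>
  mutating func write(_ string: String) {<br>
    byteCount += string.utf8.count<br>
  }<br>
}<br>
<br>
// A value that streams its formatted pieces one at a time,<br>
// without first concatenating them into a single `String`.<br>
struct PointSketch : TextOutputStreamable {<br>
  var x: Int<br>
  var y: Int<br>
  func write<Target : TextOutputStream>(to target: inout Target) {<br>
    target.write("(")<br>
    target.write(String(x))<br>
    target.write(", ")<br>
    target.write(String(y))<br>
    target.write(")")<br>
  }<br>
}<br>
<br>
var counter = ByteCountingStream()<br>
PointSketch(x: 3, y: 40).write(to: &counter)<br>
<br>
// `String` itself is a `TextOutputStream`, so it can collect the output.<br>
var rendered = ""<br>
PointSketch(x: 3, y: 40).write(to: &rendered)<br>
print(counter.byteCount, rendered)<br>
```<br>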
<br>
### `description` and `debugDescription`<br>
<br>
* Should these be creating localized or non-localized representations?<br>
* Is returning a `String` efficient enough?<br>
* Is `debugDescription` pulling the weight of the API surface area it adds?<br>
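<br>
For context, the sketch below shows how the two properties surface today through<br>
`String(describing:)` and `String(reflecting:)`. `CelsiusSketch` is a hypothetical<br>
type; in this example both properties build a fresh, non-localized `String`:<br>
<br>
```swift<br>
struct CelsiusSketch : CustomStringConvertible, CustomDebugStringConvertible {<br>
  var degrees: Double<br>
<br>
  // User-facing representation.<br>
  var description: String { return "\(degrees) C" }<br>
<br>
  // Representation aimed at debugging output.<br>
  var debugDescription: String { return "CelsiusSketch(degrees: \(degrees))" }<br>
}<br>
<br>
let reading = CelsiusSketch(degrees: 21.5)<br>
let plain = String(describing: reading)   // uses `description`<br>
let debug = String(reflecting: reading)   // uses `debugDescription`<br>
print(plain)<br>
print(debug)<br>
```<br>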
<br>
### `StaticString`<br>
<br>
`StaticString` was added as a byproduct of standard library development and kept<br>
around because it seemed useful, but it was never truly *designed* for client<br>
programmers. We need to decide what happens with it. Presumably *something*<br>
should fill its role, and that should conform to `Unicode`.<br>
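<br>
For reference, a brief sketch of what the existing `StaticString` offers today,<br>
using only its current API (the variable names are illustrative):<br>
<br>
```swift<br>
// A `StaticString` can only be created from a literal known at compile time.<br>
let label: StaticString = "Open Questions"<br>
<br>
let byteCount = label.utf8CodeUnitCount       // size of the UTF-8 data<br>
let asciiOnly = label.isASCII<br>
let firstByte = label.withUTF8Buffer { $0[0] }  // direct access to the bytes<br>
<br>
// Converting to the currency type requires an explicit step.<br>
let copy = label.description<br>
print(byteCount, asciiOnly, firstByte, copy)<br>
```<br>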
<br>
## Footnotes<br>
<br>
<b id="f0">0</b> The integers rewrite currently underway is expected to<br>
substantially reduce the scope of `Int`'s API by using more<br>
generics. [↩](#a0)<br>
<br>
<b id="f1">1</b> In practice, these semantics will usually be tied to the<br>
version of the installed [ICU](<a href="http://icu-project.org" rel="noreferrer" target="_blank">http://icu-project.org</a>) library, which<br>
programmatically encodes the most complex rules of the Unicode Standard and its<br>
de-facto extension, CLDR.[↩](#a1)<br>
<br>
<b id="f2">2</b><br>
See<br>
[http://unicode.org/reports/tr29/#Notation](<a href="http://unicode.org/reports/tr29/#Notation" rel="noreferrer" target="_blank">http://unicode.org/reports/tr29/#Notation</a>). Note<br>
that inserting Unicode scalar values to prevent merging of grapheme clusters would<br>
also constitute a kind of misbehavior (one of the clusters at the boundary would<br>
not be found in the result), and it would be relatively costly to implement, with<br>
little benefit. [↩](#a2)<br>
<br>
<b id="f4">4</b> The use of non-UCA-compliant ordering is fully sanctioned by<br>
the Unicode standard for this purpose. In fact there's<br>
a [whole chapter](<a href="http://www.unicode.org/versions/Unicode9.0.0/ch05.pdf" rel="noreferrer" target="_blank">http://www.unicode.<wbr>org/versions/Unicode9.0.0/<wbr>ch05.pdf</a>)<br>
dedicated to it. In particular, §5.17 says:<br>
<br>
> When comparing text that is visible to end users, a correct linguistic sort<br>
> should be used, as described in _Section 5.16, Sorting and<br>
> Searching_. However, in many circumstances the only requirement is for a<br>
> fast, well-defined ordering. In such cases, a binary ordering can be used.<br>
<br>
[↩](#a4)<br>
<br>
<br>
<b id="f5">5</b> The queries supported by `NSCharacterSet` map directly onto<br>
properties in a table that's indexed by Unicode scalar value. This table is<br>
part of the Unicode standard. Some of these queries (e.g., “is this an<br>
uppercase character?”) may have fairly obvious generalizations to grapheme<br>
clusters, but exactly how to do it is a research topic and *ideally* we'd either<br>
establish the existing practice that the Unicode committee would standardize, or<br>
the Unicode committee would do the research and we'd implement their<br>
result.[↩](#a5)<br>
<br>
______________________________<wbr>_________________<br>
swift-evolution mailing list<br>
<a href="mailto:swift-evolution@swift.org">swift-evolution@swift.org</a><br>
<a href="https://lists.swift.org/mailman/listinfo/swift-evolution" rel="noreferrer" target="_blank">https://lists.swift.org/<wbr>mailman/listinfo/swift-<wbr>evolution</a><br>
</blockquote></div><br></div></div></div></div>
</div></blockquote></div></div></body></html>