<div dir="ltr"><div>The strings proposal is a welcome addition to the Swift language, and I believe many developers are happy to see strings become a priority. String processing may be one of the most common day-to-day tasks for a developer. That said, one thing in the proposal that I believe is a mistake is leaving out regular expressions. </div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">"<span style="color:rgb(51,51,51);font-family:-apple-system,blinkmacsystemfont,'segoe ui',helvetica,arial,sans-serif,'apple color emoji','segoe ui emoji','segoe ui symbol';font-size:16px">Addressing regular expressions is out of scope for this proposal." </span></blockquote><div> </div><div>Working with regular expressions in both Objective-C and Swift is a real pain. I don't believe the existence of NSRegularExpression is a good enough reason to leave regular expressions out of the Swift 4 string improvements. NSRegularExpression is NOT easily retrofitted to strings. Perl, Ruby, JavaScript and many other programming languages have native, easy-to-use regular expression functionality built in. 
<br></div><div><br></div><div>Examples:</div><div>Ruby:</div><div>/hey/ =~ "hey what's up"</div><div>/world/.match('hello world')</div><div><br></div><div>JavaScript:</div><div>'javascript regex'.search(/regex/)</div><div>'hello replace'.replace(/replace/, 'world')<br></div><div><br></div><div>Perl:</div><div>$statement = "The quick brown fox";</div><div><br></div><div>if ($statement =~ /quick/) {</div><div> print "this is what's up\n";</div><div>}</div><div><br></div><div>Now let's look at NSRegularExpression...</div><div><br></div><div>Swift:</div><blockquote style="margin:0 0 0 40px;border:none;padding:0px"><div>do {</div></blockquote><blockquote style="margin:0 0 0 40px;border:none;padding:0px"><div> let pattern = "\\w" // escape everything</div><div> let regex = try NSRegularExpression(pattern: pattern, options: [])</div><div> // NSRange counts UTF-16 code units, not Characters</div><div> let results = regex.matches(in: str, options: [], range: NSRange(location: 0, length: str.utf16.count))</div><div><br></div><div> results.forEach {</div><div> print($0) // why is this an NSTextCheckingResult?!</div><div> }</div></blockquote><blockquote style="margin:0 0 0 40px;border:none;padding:0px"><div>} catch {</div><div> // welp out of luck</div><div>}</div></blockquote><div><br></div><div>Yes, I'm fully aware of the method:</div><div><br></div><div> str.replacingOccurrences(of: "pattern", with: "something", options: .regularExpression, range: nil) </div><div><br></div><div>but it is just not enough for what is needed. Also, it is confusing to have a regex replace method that lives apart from NSRegularExpression. It was not easy to find. </div><div><br></div><div>Taken from <a href="http://nshipster.com/nsregularexpression/">NSHipster</a>:</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">Happily, on one thing we can all agree. 
In NSRegularExpression, Cocoa has the most long-winded and byzantine regular expression interface you’re ever likely to come across.</blockquote><div><br></div><div>There is no way to achieve the goal of being better at string processing than Perl without regular expressions being addressed. It just should not be ignored. </div><div><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Jan 19, 2017 at 7:56 PM, Ben Cohen via swift-evolution <span dir="ltr"><<a href="mailto:swift-evolution@swift.org" target="_blank">swift-evolution@swift.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi all,<br>
<br>
Below is our take on a design manifesto for Strings in Swift 4 and beyond.<br>
<br>
Probably best read in rendered markdown on GitHub:<br>
<a href="https://github.com/apple/swift/blob/master/docs/StringManifesto.md" rel="noreferrer" target="_blank">https://github.com/apple/<wbr>swift/blob/master/docs/<wbr>StringManifesto.md</a><br>
<br>
We’re eager to hear everyone’s thoughts.<br>
<br>
Regards,<br>
Ben and Dave<br>
<br>
<br>
# String Processing For Swift 4<br>
<br>
* Authors: [Dave Abrahams](<a href="https://github.com/dabrahams" rel="noreferrer" target="_blank">https://github.com/<wbr>dabrahams</a>), [Ben Cohen](<a href="https://github.com/airspeedswift" rel="noreferrer" target="_blank">https://github.com/<wbr>airspeedswift</a>)<br>
<br>
The goal of re-evaluating Strings for Swift 4 has been fairly ill-defined thus<br>
far, with just this short blurb in the<br>
[list of goals](<a href="https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160725/025676.html" rel="noreferrer" target="_blank">https://lists.swift.<wbr>org/pipermail/swift-evolution/<wbr>Week-of-Mon-20160725/025676.<wbr>html</a>):<br>
<br>
> **String re-evaluation**: String is one of the most important fundamental<br>
> types in the language. The standard library leads have numerous ideas of how<br>
> to improve the programming model for it, without jeopardizing the goals of<br>
> providing a unicode-correct-by-default model. Our goal is to be better at<br>
> string processing than Perl!<br>
<br>
For Swift 4 and beyond we want to improve three dimensions of text processing:<br>
<br>
1. Ergonomics<br>
2. Correctness<br>
3. Performance<br>
<br>
This document is meant to both provide a sense of the long-term vision<br>
(including undecided issues and possible approaches), and to define the scope of<br>
work that could be done in the Swift 4 timeframe.<br>
<br>
## General Principles<br>
<br>
### Ergonomics<br>
<br>
It's worth noting that ergonomics and correctness are mutually-reinforcing. An<br>
API that is easy to use—but incorrectly—cannot be considered an ergonomic<br>
success. Conversely, an API that's simply hard to use is also hard to use<br>
correctly. Achieving optimal performance without compromising ergonomics or<br>
correctness is a greater challenge.<br>
<br>
Consistency with the Swift language and idioms is also important for<br>
ergonomics. There are several places both in the standard library and in the<br>
Foundation additions to `String` where patterns and practices found elsewhere<br>
could be applied to improve usability and familiarity.<br>
<br>
### API Surface Area<br>
<br>
Primary data types such as `String` should have APIs that are easily understood<br>
given a signature and a one-line summary. Today, `String` fails that test. As<br>
you can see, the Standard Library and Foundation both contribute significantly to<br>
its overall complexity.<br>
<br>
**Method Arity** | **Standard Library** | **Foundation**<br>
---|:---:|:---:<br>
0: `ƒ()` | 5 | 7<br>
1: `ƒ(:)` | 19 | 48<br>
2: `ƒ(::)` | 13 | 19<br>
3: `ƒ(:::)` | 5 | 11<br>
4: `ƒ(::::)` | 1 | 7<br>
5: `ƒ(:::::)` | - | 2<br>
6: `ƒ(::::::)` | - | 1<br>
<br>
**API Kind** | **Standard Library** | **Foundation**<br>
---|:---:|:---:<br>
`init` | 41 | 18<br>
`func` | 42 | 55<br>
`subscript` | 9 | 0<br>
`var` | 26 | 14<br>
<br>
**Total: 205 APIs**<br>
<br>
By contrast, `Int` has 80 APIs, none with more than two parameters.[0] String processing is complex enough; users shouldn't have<br>
to press through physical API sprawl just to get started.<br>
<br>
Many of the choices detailed below contribute to solving this problem,<br>
including:<br>
<br>
* Restoring `Collection` conformance and dropping the `.characters` view.<br>
* Providing a more general, composable slicing syntax.<br>
* Altering `Comparable` so that parameterized<br>
(e.g. case-insensitive) comparison fits smoothly into the basic syntax.<br>
* Clearly separating language-dependent operations on text produced<br>
by and for humans from language-independent<br>
operations on text produced by and for machine processing.<br>
* Relocating APIs that fall outside the domain of basic string processing and<br>
discouraging the proliferation of ad-hoc extensions.<br>
<br>
<br>
### Batteries Included<br>
<br>
While `String` is available to all programs out-of-the-box, crucial APIs for<br>
basic string processing tasks are still inaccessible until `Foundation` is<br>
imported. While it makes sense that `Foundation` is needed for domain-specific<br>
jobs such as<br>
[linguistic tagging](<a href="https://developer.apple.com/reference/foundation/nslinguistictagger" rel="noreferrer" target="_blank">https://developer.<wbr>apple.com/reference/<wbr>foundation/nslinguistictagger</a>)<wbr>,<br>
one should not need to import anything to, for example, do case-insensitive<br>
comparison.<br>
<br>
### Unicode Compliance and Platform Support<br>
<br>
The Unicode standard provides a crucial objective reference point for what<br>
constitutes correct behavior in an extremely complex domain, so<br>
Unicode-correctness is, and will remain, a fundamental design principle behind<br>
Swift's `String`. That said, the Unicode standard is an evolving document, so<br>
this objective reference-point is not fixed.[1] While<br>
many of the most important operations—e.g. string hashing, equality, and<br>
non-localized comparison—will be stable, the semantics<br>
of others, such as grapheme breaking and localized comparison and case<br>
conversion, are expected to change as platforms are updated, so programs should<br>
be written so their correctness does not depend on precise stability of these<br>
semantics across OS versions or platforms. Although it may be possible to<br>
imagine static and/or dynamic analysis tools that will help users find such<br>
errors, the only sure way to deal with this fact of life is to educate users.<br>
<br>
## Design Points<br>
<br>
### Internationalization<br>
<br>
There is strong evidence that developers cannot determine how to use<br>
internationalization APIs correctly. Although documentation could and should be<br>
improved, the sheer size, complexity, and diversity of these APIs is a major<br>
contributor to the problem, causing novices to tune out, and more experienced<br>
programmers to make avoidable mistakes.<br>
<br>
The first step in improving this situation is to regularize all localized<br>
operations as invocations of normal string operations with extra<br>
parameters. Among other things, this means:<br>
<br>
1. Doing away with `localizedXXX` methods<br>
2. Providing a terse way to name the current locale as a parameter<br>
3. Automatically adjusting defaults for options such<br>
as case sensitivity based on whether the operation is localized.<br>
4. Removing correctness traps like `<wbr>localizedCaseInsensitiveCompar<wbr>e` (see<br>
guidance in the<br>
[Internationalization and Localization Guide](<a href="https://developer.apple.com/library/content/documentation/MacOSX/Conceptual/BPInternational/InternationalizingYourCode/InternationalizingYourCode.html" rel="noreferrer" target="_blank">https://developer.<wbr>apple.com/library/content/<wbr>documentation/MacOSX/<wbr>Conceptual/BPInternational/<wbr>InternationalizingYourCode/<wbr>InternationalizingYourCode.<wbr>html</a>).<br>
<br>
Along with appropriate documentation updates, these changes will make localized<br>
operations more teachable, comprehensible, and approachable, thereby lowering a<br>
barrier that currently leads some developers to ignore localization issues<br>
altogether.<br>
<br>
#### The Default Behavior of `String`<br>
<br>
Although this isn't well-known, the most accessible forms of many operations on<br>
Swift `String` (and `NSString`) are really only appropriate for text that is<br>
intended to be processed for, and consumed by, machines. The semantics of the<br>
operations with the simplest spellings are always non-localized and<br>
language-agnostic.<br>
<br>
Two major factors play into this design choice:<br>
<br>
1. Machine processing of text is important, so we should have first-class,<br>
accessible functions appropriate to that use case.<br>
<br>
2. The most general localized operations require a locale parameter not required<br>
by their un-localized counterparts. This naturally skews complexity towards<br>
localized operations.<br>
<br>
Reaffirming that `String`'s simplest APIs have<br>
language-independent/machine-<wbr>processed semantics has the benefit of clarifying<br>
the proper default behavior of operations such as comparison, and allows us to<br>
make [significant optimizations](#collation-<wbr>semantics) that were previously<br>
thought to conflict with Unicode.<br>
<br>
#### Future Directions<br>
<br>
One of the most common internationalization errors is the unintentional<br>
presentation to users of text that has not been localized, but regularizing APIs<br>
and improving documentation can go only so far in preventing this error.<br>
Combined with the fact that `String` operations are non-localized by default,<br>
the environment for processing human-readable text may still be somewhat<br>
error-prone in Swift 4.<br>
<br>
For an audience of mostly non-experts, it is especially important that naïve<br>
code is very likely to be correct if it compiles, and that more sophisticated<br>
issues can be revealed progressively. For this reason, we intend to<br>
specifically and separately target localization and internationalization<br>
problems in the Swift 5 timeframe.<br>
<br>
### Operations With Options<br>
<br>
There are three categories of common string operations that often need to be<br>
tuned in various dimensions:<br>
<br>
**Operation**|**Applicable Options**<br>
---|---<br>
sort ordering | locale, case/diacritic/width-<wbr>insensitivity<br>
case conversion | locale<br>
pattern matching | locale, case/diacritic/width-<wbr>insensitivity<br>
<br>
The defaults for case-, diacritic-, and width-insensitivity are different for<br>
localized operations than for non-localized operations, so for example a<br>
localized sort should be case-insensitive by default, and a non-localized sort<br>
should be case-sensitive by default. We propose a standard “language” of<br>
defaulted parameters to be used for these purposes, with usage roughly like this:<br>
<br>
```swift<br>
x.compared(to: y, case: .sensitive, in: swissGerman)<br>
<br>
x.lowercased(in: .currentLocale)<br>
<br>
x.allMatches(<br>
somePattern, case: .insensitive, diacritic: .insensitive)<br>
```<br>
<br>
This usage might be supported by code like this:<br>
<br>
```swift<br>
enum StringSensitivity {<br>
case sensitive<br>
case insensitive<br>
}<br>
<br>
extension Locale {<br>
static var currentLocale: Locale { ... }<br>
}<br>
<br>
extension Unicode {<br>
// An example of the option language in declaration context,<br>
// with nil defaults indicating unspecified, so defaults can be<br>
// driven by the presence/absence of a specific Locale<br>
func frobnicated(<br>
case caseSensitivity: StringSensitivity? = nil,<br>
diacritic diacriticSensitivity: StringSensitivity? = nil,<br>
width widthSensitivity: StringSensitivity? = nil,<br>
in locale: Locale? = nil<br>
) -> Self { ... }<br>
}<br>
```<br>
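<br>
As a rough illustration of how `nil` defaults could be resolved from the presence or absence of a locale, here is a minimal sketch layered on Foundation's existing `compare(_:options:range:locale:)`.<br>
The `compared(to:case:in:)` spelling, its `ComparisonResult` return type, and the defaulting rule are assumptions drawn from the usage shown earlier, not a settled API:<br>
<br>
```swift<br>
import Foundation<br>
<br>
enum StringSensitivity {<br>
    case sensitive<br>
    case insensitive<br>
}<br>
<br>
extension String {<br>
    // Hypothetical method: a nil `case` argument means "unspecified",<br>
    // so the effective default depends on whether a locale was passed.<br>
    func compared(<br>
        to other: String,<br>
        case caseSensitivity: StringSensitivity? = nil,<br>
        in locale: Locale? = nil<br>
    ) -> ComparisonResult {<br>
        // Localized comparison defaults to case-insensitive,<br>
        // non-localized comparison to case-sensitive.<br>
        let fallback: StringSensitivity = (locale == nil) ? .sensitive : .insensitive<br>
        let effective: StringSensitivity = caseSensitivity ?? fallback<br>
        var options: String.CompareOptions = []<br>
        if effective == .insensitive {<br>
            options.insert(.caseInsensitive)<br>
        }<br>
        return compare(other, options: options, range: nil, locale: locale)<br>
    }<br>
}<br>
<br>
// Non-localized: case matters unless the caller says otherwise.<br>
let machine = "apple".compared(to: "APPLE")<br>
// Localized: case is ignored unless the caller says otherwise.<br>
let human = "apple".compared(to: "APPLE", in: Locale(identifier: "de_CH"))<br>
```<br>
<br>
Under these assumptions `machine` is not `.orderedSame` while `human` is, even though neither call names a case option.<br>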
<br>
### Comparing and Hashing Strings<br>
<br>
#### Collation Semantics<br>
<br>
What Unicode says about collation—which is used in `<`, `==`, and hashing— turns<br>
out to be quite interesting, once you pick it apart. The full Unicode Collation<br>
Algorithm (UCA) works like this:<br>
<br>
1. Fully normalize both strings<br>
2. Convert each string to a sequence of numeric triples to form a collation key<br>
3. “Flatten” the key by concatenating the sequence of first elements to the<br>
sequence of second elements to the sequence of third elements<br>
4. Lexicographically compare the flattened keys<br>
<br>
While step 1 can usually<br>
be [done quickly](<a href="http://unicode.org/reports/tr15/#Description_Norm" rel="noreferrer" target="_blank">http://unicode.org/<wbr>reports/tr15/#Description_Norm</a><wbr>) and<br>
incrementally, step 2 uses a collation table that maps matching *sequences* of<br>
unicode scalars in the normalized string to *sequences* of triples, which get<br>
accumulated into a collation key. Predictably, this is where the real costs<br>
lie.<br>
<br>
*However*, there are some bright spots to this story. First, as it turns out,<br>
string sorting (localized or not) should be done down to what's called<br>
the<br>
[“identical” level](<a href="http://unicode.org/reports/tr10/#Multi_Level_Comparison" rel="noreferrer" target="_blank">http://unicode.org/<wbr>reports/tr10/#Multi_Level_<wbr>Comparison</a>),<br>
which adds a step 3a: append the string's normalized form to the flattened<br>
collation key. At first blush this just adds work, but consider what it does<br>
for equality: two strings that normalize the same, naturally, will collate the<br>
same. But also, *strings that normalize differently will always collate<br>
differently*. In other words, for equality, it is sufficient to compare the<br>
strings' normalized forms and see if they are the same. We can therefore<br>
entirely skip the expensive part of collation for equality comparison.<br>
<br>
Next, naturally, anything that applies to equality also applies to hashing: it<br>
is sufficient to hash the string's normalized form, bypassing collation keys.<br>
This should provide significant speedups over the current implementation.<br>
Perhaps more importantly, since comparison down to the “identical” level applies<br>
even to localized strings, it means that hashing and equality can be implemented<br>
exactly the same way for localized and non-localized text, and hash tables with<br>
localized keys will remain valid across current-locale changes.<br>
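<br>
The equality and hashing argument can be observed with today's `String`. The sketch below uses Foundation's `decomposedStringWithCanonicalMapping` to obtain a normalized form; the choice of NFD is only an assumption for illustration, since the text above does not name a particular normalization form:<br>
<br>
```swift<br>
import Foundation<br>
<br>
let precomposed = "caf\u{E9}"     // "é" as the single scalar U+00E9<br>
let decomposed = "cafe\u{301}"    // "e" followed by U+0301 COMBINING ACUTE ACCENT<br>
<br>
// The raw scalar sequences differ (4 scalars versus 5)...<br>
let scalarsDiffer = precomposed.unicodeScalars.count != decomposed.unicodeScalars.count<br>
<br>
// ...but the normalized forms are scalar-for-scalar identical,<br>
let normalizedA = Array(precomposed.decomposedStringWithCanonicalMapping.unicodeScalars)<br>
let normalizedB = Array(decomposed.decomposedStringWithCanonicalMapping.unicodeScalars)<br>
let sameNormalForm = normalizedA == normalizedB<br>
<br>
// so equality and hashing can agree without building a collation key.<br>
let equal = precomposed == decomposed<br>
let sameHash = precomposed.hashValue == decomposed.hashValue<br>
```<br>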
<br>
Finally, once it is agreed that the *default* role for `String` is to handle<br>
machine-generated and machine-readable text, the default ordering of `String`s<br>
need no longer use the UCA at all. It is sufficient to order them in any way<br>
that's consistent with equality, so `String` ordering can simply be a<br>
lexicographical comparison of normalized forms,[4]<br>
(which is equivalent to lexicographically comparing the sequences of grapheme<br>
clusters), again bypassing step 2 and offering another speedup.<br>
<br>
This leaves us executing the full UCA *only* for localized sorting, and ICU's<br>
implementation has apparently been very well optimized.<br>
<br>
Following this scheme everywhere would also allow us to make sorting behavior<br>
consistent across platforms. Currently, we sort `String` according to the UCA,<br>
except that—*only on Apple platforms*—pairs of ASCII characters are ordered by<br>
unicode scalar value.<br>
<br>
#### Syntax<br>
<br>
Because the current `Comparable` protocol expresses all comparisons with binary<br>
operators, string comparisons—which may require<br>
additional [options](#operations-with-<wbr>options)—do not fit smoothly into the<br>
existing syntax. At the same time, we'd like to solve other problems with<br>
comparison, as outlined<br>
in<br>
[this proposal](<a href="https://gist.github.com/CodaFi/f0347bd37f1c407bf7ea0c429ead380e" rel="noreferrer" target="_blank">https://gist.github.<wbr>com/CodaFi/<wbr>f0347bd37f1c407bf7ea0c429ead38<wbr>0e</a>)<br>
(implemented by changes at the head<br>
of<br>
[this branch](<a href="https://github.com/CodaFi/swift/commits/space-the-final-frontier" rel="noreferrer" target="_blank">https://github.com/<wbr>CodaFi/swift/commits/space-<wbr>the-final-frontier</a>)).<br>
We should adopt a modification of that proposal that uses a method rather than<br>
an operator `<=>`:<br>
<br>
```swift<br>
enum SortOrder { case before, same, after }<br>
<br>
protocol Comparable : Equatable {<br>
func compared(to: Self) -> SortOrder<br>
...<br>
}<br>
```<br>
<br>
This change will give us a syntactic platform on which to implement methods with<br>
additional, defaulted arguments, thereby unifying and regularizing comparison<br>
across the library.<br>
<br>
```swift<br>
extension String {<br>
func compared(to: Self) -> SortOrder<br>
<br>
}<br>
```<br>
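<br>
To make the idea concrete, here is a minimal sketch of how a single three-way `compared(to:)` requirement could drive the comparison operators. `ThreeWayComparable` and `Version` are hypothetical names used only for illustration; the proposal itself would change `Comparable` directly:<br>
<br>
```swift<br>
enum SortOrder { case before, same, after }<br>
<br>
// Hypothetical stand-in for the revised `Comparable`.<br>
protocol ThreeWayComparable: Equatable {<br>
    func compared(to other: Self) -> SortOrder<br>
}<br>
<br>
extension ThreeWayComparable {<br>
    // The binary operators fall out of the one required method.<br>
    static func < (lhs: Self, rhs: Self) -> Bool {<br>
        return lhs.compared(to: rhs) == .before<br>
    }<br>
    static func > (lhs: Self, rhs: Self) -> Bool {<br>
        return lhs.compared(to: rhs) == .after<br>
    }<br>
}<br>
<br>
struct Version: ThreeWayComparable {<br>
    let major: Int<br>
    let minor: Int<br>
<br>
    func compared(to other: Version) -> SortOrder {<br>
        if major != other.major { return major < other.major ? .before : .after }<br>
        if minor != other.minor { return minor < other.minor ? .before : .after }<br>
        return .same<br>
    }<br>
}<br>
<br>
let older = Version(major: 3, minor: 1)<br>
let newer = Version(major: 4, minor: 0)<br>
let order = older.compared(to: newer)   // .before<br>
```<br>
<br>
Because the requirement is a method rather than an operator, a type such as `String` could add defaulted option parameters to it without changing the operator-based call sites.<br>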
<br>
**Note:** `SortOrder` should bridge to `NSComparisonResult`. It's also possible<br>
that the standard library simply adopts Foundation's `ComparisonResult` as is,<br>
but we believe the community should at least consider alternate naming before<br>
that happens. There will be an opportunity to discuss the choices in detail<br>
when the modified<br>
[Comparison Proposal](<a href="https://gist.github.com/CodaFi/f0347bd37f1c407bf7ea0c429ead380e" rel="noreferrer" target="_blank">https://gist.github.<wbr>com/CodaFi/<wbr>f0347bd37f1c407bf7ea0c429ead38<wbr>0e</a>) comes<br>
up for review.<br>
<br>
### `String` should be a `Collection` of `Character`s Again<br>
<br>
In Swift 2.0, `String`'s `Collection` conformance was dropped, because we<br>
convinced ourselves that its semantics differed from those of `Collection` too<br>
significantly.<br>
<br>
It was always well understood that if strings were treated as sequences of<br>
`UnicodeScalar`s, algorithms such as `lexicographicalCompare`, `elementsEqual`,<br>
and `reversed` would produce nonsense results. Thus, in Swift 1.0, `String` was<br>
a collection of `Character` (extended grapheme clusters). During 2.0<br>
development, though, we realized that correct string concatenation could<br>
occasionally merge distinct grapheme clusters at the start and end of combined<br>
strings.<br>
<br>
This quirk aside, every aspect of strings-as-collections-of-<wbr>graphemes appears to<br>
comport perfectly with Unicode. We think the concatenation problem is tolerable,<br>
because the cases where it occurs all represent partially-formed constructs. The<br>
largest class—isolated combining characters such as ◌́ (U+0301 COMBINING ACUTE<br>
ACCENT)—are explicitly called out in the Unicode standard as<br>
“[degenerate](<a href="http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries" rel="noreferrer" target="_blank">http://unicode.<wbr>org/reports/tr29/#Grapheme_<wbr>Cluster_Boundaries</a>)” or<br>
“[defective](<a href="http://www.unicode.org/versions/Unicode9.0.0/ch03.pdf" rel="noreferrer" target="_blank">http://www.<wbr>unicode.org/versions/Unicode9.<wbr>0.0/ch03.pdf</a>)”. The other<br>
cases—such as a string ending in a zero-width joiner or half of a regional<br>
indicator—appear to be equally transient and unlikely outside of a text editor.<br>
<br>
Admitting these cases encourages exploration of grapheme composition and is<br>
consistent with what appears to be an overall Unicode philosophy that “no<br>
special provisions are made to get marginally better behavior for… cases that<br>
never occur in practice.”[2] Furthermore, it seems<br>
unlikely to disturb the semantics of any plausible algorithms. We can handle<br>
these cases by documenting them, explicitly stating that the elements of a<br>
`String` are an emergent property based on Unicode rules.<br>
<br>
The benefits of restoring `Collection` conformance are substantial:<br>
<br>
* Collection-like operations encourage experimentation with strings to<br>
investigate and understand their behavior. This is useful for teaching new<br>
programmers, but also good for experienced programmers who want to<br>
understand more about strings/unicode.<br>
<br>
* Extended grapheme clusters form a natural element boundary for Unicode<br>
strings. For example, searching and matching operations will always produce<br>
results that line up on grapheme cluster boundaries.<br>
<br>
* Character-by-character processing is a legitimate thing to do in many real<br>
use-cases, including parsing, pattern matching, and language-specific<br>
transformations such as transliteration.<br>
<br>
* `Collection` conformance makes a wide variety of powerful operations<br>
available that are appropriate to `String`'s default role as the vehicle for<br>
machine processed text.<br>
<br>
The methods `String` would inherit from `Collection`, where similar to<br>
higher-level string algorithms, have the right semantics. For example,<br>
grapheme-wise `lexicographicalCompare`, `elementsEqual`, and application of<br>
`flatMap` with case-conversion, produce the same results one would expect<br>
from whole-string ordering comparison, equality comparison, and<br>
case-conversion, respectively. `reversed` operates correctly on graphemes,<br>
keeping diacritics moored to their base characters and leaving emoji intact.<br>
Other methods such as `index(of:)` and `contains` make obvious sense. A few<br>
`Collection` methods, like `min` and `max`, may not be particularly useful<br>
on `String`, but we don't consider that to be a problem worth solving, in<br>
the same way that we wouldn't try to suppress `min` and `max` on a<br>
`Set([UInt8])` that was used to store IP addresses.<br>
<br>
* Many of the higher-level operations that we want to provide for `String`s,<br>
such as parsing and pattern matching, should apply to any `Collection`, and<br>
many of the benefits we want for `Collections`, such<br>
as unified slicing, should accrue<br>
equally to `String`. Making `String` part of the same protocol hierarchy<br>
allows us to write these operations once and not worry about keeping the<br>
benefits in sync.<br>
<br>
* Slicing strings into substrings is a crucial part of the vocabulary of<br>
string processing, and all other sliceable things are `Collection`s.<br>
Because of its collection-like behavior, users naturally think of `String`<br>
in collection terms, but run into frustrating limitations where it fails to<br>
conform and are left to wonder where all the differences lie. Many simply<br>
“correct” this limitation by declaring a trivial conformance:<br>
<br>
```swift<br>
extension String : BidirectionalCollection {}<br>
```<br>
<br>
Even if we removed indexing-by-element from `String`, users could still do<br>
this:<br>
<br>
```swift<br>
extension String : BidirectionalCollection {<br>
subscript(i: Index) -> Character { return characters[i] }<br>
}<br>
```<br>
<br>
It would be much better to legitimize the conformance to `Collection` and<br>
simply document the oddity of any concatenation corner-cases, than to deny<br>
users the benefits on the grounds that a few cases are confusing.<br>
<br>
Note that the fact that `String` is a collection of graphemes does *not* mean<br>
that string operations will necessarily have to do grapheme boundary<br>
recognition. See the Unicode protocol section for details.<br>
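<br>
Swift 4 and later did restore this conformance, so the grapheme-wise behavior and the concatenation corner case described above can be tried directly. This is a small sketch against current Swift rather than the exact API proposed here:<br>
<br>
```swift<br>
let word = "cafe\u{301}"                  // "café", spelled with a combining accent<br>
<br>
let length = word.count                    // 4 graphemes, although there are 5 scalars<br>
let backwards = String(word.reversed())    // the accent stays moored to its "e"<br>
<br>
let eAcute: Character = "\u{E9}"          // precomposed "é"<br>
let hasAccentedE = word.contains(eAcute)   // elements compare by canonical equivalence<br>
<br>
// The concatenation quirk: element counts are not always additive.<br>
let base = "e"<br>
let accent = "\u{301}"                    // a degenerate, isolated combining character<br>
let joined = base + accent<br>
let countsAreAdditive = joined.count == base.count + accent.count<br>
```<br>
<br>
Here `countsAreAdditive` is false: each operand has one element, yet the concatenation also has one, because the accent merges into the preceding grapheme.<br>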
<br>
### `Character` and `CharacterSet`<br>
<br>
`Character`, which represents a<br>
Unicode<br>
[extended grapheme cluster](<a href="http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries" rel="noreferrer" target="_blank">http://unicode.org/<wbr>reports/tr29/#Grapheme_<wbr>Cluster_Boundaries</a>),<br>
is a bit of a black box, requiring conversion to `String` in order to<br>
do any introspection, including interoperation with ASCII. To fix this, we should:<br>
<br>
- Add a `unicodeScalars` view much like `String`'s, so that the sub-structure<br>
of grapheme clusters is discoverable.<br>
- Add a failable `init` from sequences of scalars (returning nil for sequences<br>
that contain 0 or 2+ graphemes).<br>
- (Lower priority) expose some operations, such as `func uppercase() -><br>
String`, `var isASCII: Bool`, and, to the extent they can be sensibly<br>
generalized, queries of unicode properties that should also be exposed on<br>
`UnicodeScalar` such as `isAlphabetic` and `isGraphemeBase`.<br>
<br>
Despite its name, `CharacterSet` currently operates on the Swift `UnicodeScalar`<br>
type. This means it is usable on `String`, but only by going through the unicode<br>
scalar view. To deal with this clash in the short term, `CharacterSet` should be<br>
renamed to `UnicodeScalarSet`. In the longer term, it may be appropriate to<br>
introduce a `CharacterSet` that provides similar functionality for extended<br>
grapheme clusters.[5]<br>
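<br>
A short sketch of both points, using APIs available in current Swift and Foundation (the `unicodeScalars` view on `Character` was added after this document was written). It shows the sub-structure of a grapheme cluster and why `CharacterSet` has to be applied through the scalar view:<br>
<br>
```swift<br>
import Foundation<br>
<br>
// One Character made of two regional-indicator scalars (C + H).<br>
let flag: Character = "\u{1F1E8}\u{1F1ED}"<br>
let scalarsInFlag = flag.unicodeScalars.count<br>
<br>
// CharacterSet membership is defined per Unicode scalar, not per Character,<br>
// so the test has to go through `unicodeScalars`.<br>
let digits = CharacterSet.decimalDigits<br>
let text = "a1b2"<br>
let digitCount = text.unicodeScalars.filter { digits.contains($0) }.count<br>
```<br>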
<br>
### Unification of Slicing Operations<br>
<br>
Creating substrings is a basic part of string processing, but the slicing<br>
operations that we have in Swift are inconsistent in both their spelling and<br>
their naming:<br>
<br>
* Slices with two explicit endpoints are done with subscript, and support<br>
in-place mutation:<br>
<br>
```swift<br>
s[i..<j].mutate()<br>
```<br>
<br>
* Slicing from an index to the end, or from the start to an index, is done<br>
with a method and does not support in-place mutation:<br>
```swift<br>
s.prefix(upTo: i).readOnly()<br>
```<br>
<br>
Prefix and suffix operations should be migrated to be subscripting operations<br>
with one-sided ranges i.e. `s.prefix(upTo: i)` should become `s[..<i]`, as<br>
in<br>
[this proposal](<a href="https://github.com/apple/swift-evolution/blob/9cf2685293108ea3efcbebb7ee6a8618b83d4a90/proposals/0132-sequence-end-ops.md" rel="noreferrer" target="_blank">https://github.com/<wbr>apple/swift-evolution/blob/<wbr>9cf2685293108ea3efcbebb7ee6a86<wbr>18b83d4a90/proposals/0132-<wbr>sequence-end-ops.md</a>).<br>
With generic subscripting in the language, that will allow us to collapse a wide<br>
variety of methods and subscript overloads into a single implementation, and<br>
give users an easy-to-use and composable way to describe subranges.<br>
<br>
Further extending this EDSL to integrate use-cases like `s.prefix(maxLength: 5)`<br>
is an ongoing research project that can be considered part of the potential<br>
long-term vision of text (and collection) processing.<br>
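<br>
One-sided ranges were subsequently adopted in Swift 4, so the proposed spelling can be compared with the method-based one. A minimal sketch, where the sample string and variable names are purely illustrative:<br>
<br>
```swift<br>
let s = "Hello, world"<br>
<br>
// Fall back to endIndex so the slices stay valid even without a comma.<br>
let comma = s.firstIndex(of: ",") ?? s.endIndex<br>
<br>
let head = s[..<comma]    // subscript spelling of s.prefix(upTo: comma)<br>
let tail = s[comma...]    // subscript spelling of s.suffix(from: comma)<br>
```<br>
<br>
Both spellings go through the same subscript as `s[i..<j]`, which is what lets the prefix and suffix cases share one implementation.<br>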
<br>
### Substrings<br>
<br>
When implementing substring slicing, languages are faced with three options:<br>
<br>
1. Make the substrings the same type as string, and share storage.<br>
2. Make the substrings the same type as string, and copy storage when making the substring.<br>
3. Make substrings a different type, with a storage copy on conversion to string.<br>
<br>
We think number 3 is the best choice. A walk-through of the tradeoffs follows.<br>
<br>
#### Same type, shared storage<br>
<br>
In Swift 3.0, slicing a `String` produces a new `String` that is a view into a<br>
subrange of the original `String`'s storage. This is why `String` is 3 words in<br>
size (the start, length and buffer owner), unlike the similar `Array` type<br>
which is only one.<br>
<br>
This is a simple model with big efficiency gains when chopping up strings into<br>
multiple smaller strings. But it does mean that a stored substring keeps the<br>
entire original string buffer alive even after it would normally have been<br>
released.<br>
<br>
This arrangement has proven to be problematic in other programming languages,<br>
because applications sometimes extract small strings from large ones and keep<br>
those small strings long-term. That is considered a memory leak and was enough<br>
of a problem in Java that they changed from substrings sharing storage to<br>
making a copy in 1.7.<br>
<br>
#### Same type, copied storage<br>
<br>
Copying of substrings is also the choice made in C#, and in the default<br>
`NSString` implementation. This approach avoids the memory leak issue, but has<br>
obvious performance overhead in performing the copies.<br>
<br>
This in turn encourages trafficking in string/range pairs instead of in<br>
substrings, for performance reasons, leading to API challenges. For example:<br>
<br>
```swift<br>
foo.compare(bar, range: start..<end)<br>
```<br>
<br>
Here, it is not clear whether `range` applies to `foo` or `bar`. This<br>
relationship is better expressed in Swift as a slicing operation:<br>
<br>
```swift<br>
foo[start..<end].compare(bar)<br>
```<br>
<br>
Not only does this clarify to which string the range applies, it also brings<br>
this sub-range capability to any API that operates on `String` "for free". So<br>
these other combinations also work equally well:<br>
<br>
```swift<br>
// apply range on argument rather than target<br>
foo.compare(bar[start..<end])<br>
// apply range on both<br>
foo[start..<end].compare(bar[<wbr>start1..<end1])<br>
// compare two strings ignoring first character<br>
foo.dropFirst().compare(bar.<wbr>dropFirst())<br>
```<br>
<br>
In all three cases, an explicit range argument need not appear on the `compare`<br>
method itself. The implementation of `compare` does not need to know anything<br>
about ranges. Methods need only take range arguments when that is an<br>
integral part of their purpose (for example, setting the start and end of a<br>
user's current selection in a text box).<br>
<br>
#### Different type, shared storage<br>
<br>
The desire to share underlying storage while preventing accidental memory leaks<br>
occurs with slices of `Array`. For this reason we have an `ArraySlice` type.<br>
The inconvenience of a separate type is mitigated by most operations used on<br>
`Array` from the standard library being generic over `Sequence` or `Collection`.<br>
<br>
We should apply the same approach for `String` by introducing a distinct<br>
`SubSequence` type, `Substring`. Advice similar to that given for `ArraySlice`<br>
would apply to `Substring`:<br>
<br>
> Important: Long-term storage of `Substring` instances is discouraged. A<br>
> substring holds a reference to the entire storage of a larger string, not<br>
> just to the portion it presents, even after the original string's lifetime<br>
> ends. Long-term storage of a `Substring` may therefore prolong the lifetime<br>
> of large strings that are no longer otherwise accessible, which can appear<br>
> to be memory leakage.<br>
<br>
When assigning a `Substring` to a longer-lived variable (usually a stored<br>
property) explicitly of type `String`, a type conversion will be performed, and<br>
at this point the substring buffer is copied and the original string's storage<br>
can be released.<br>
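<br>
As a rough illustration of that conversion, the sketch below assumes a<br>
`Substring` type and spells the copy with an explicit `String(...)` initializer<br>
(which is how current Swift toolchains express it) rather than the implicit<br>
conversion proposed here. The `Record` type and the sample text are<br>
hypothetical.<br>
<br>
```swift<br>
// Hypothetical long-lived model type; its stored property is a `String`.<br>
struct Record {<br>
  var title: String<br>
}<br>
<br>
let document = String(repeating: "lorem ipsum ", count: 1_000) + "TITLE"<br>
<br>
// Slicing is cheap: `slice` is a view onto `document`'s storage.<br>
let slice: Substring = document.suffix(5)<br>
<br>
// Converting to `String` copies only the five characters, so the stored<br>
// value no longer keeps `document`'s large buffer alive.<br>
let record = Record(title: String(slice))<br>
print(record.title)<br>
```<br>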
<br>
A `String` that was not its own `Substring` could be one word—a single tagged<br>
pointer—without requiring additional allocations. `Substring`s would be a view<br>
onto a `String`, so are 3 words: pointer to owner, pointer to start, and a<br>
length. The small string optimization for `Substring` would take advantage of<br>
the larger size, probably with a less compressed encoding for speed.<br>
<br>
The downside of having two types is the inconvenience of sometimes having a<br>
`Substring` when you need a `String`, and vice-versa. It is likely this would<br>
be a significantly bigger problem than with `Array` and `ArraySlice`, as<br>
slicing of `String` is such a common operation. It is especially relevant to<br>
existing code that assumes `String` is the currency type. To ease the pain of<br>
type mismatches, `Substring` should be a subtype of `String` in the same way<br>
that `Int` is a subtype of `Optional<Int>`. This would give users an implicit<br>
conversion from `Substring` to `String`, as well as the usual implicit<br>
conversions such as `[Substring]` to `[String]` that other subtype<br>
relationships receive.<br>
<br>
In most cases, type inference combined with the subtype relationship should<br>
make the type difference a non-issue and users will not care which type they<br>
are using. For flexibility and optimizability, most operations from the<br>
standard library will traffic in generic models of<br>
[`Unicode`](#the--code-unicode--code--protocol).<br>
<br>
##### Guidance for API Designers<br>
<br>
In this model, **if a user is unsure about which type to use, `String` is always<br>
a reasonable default**. A `Substring` passed where `String` is expected will be<br>
implicitly copied. When compared to the “same type, copied storage” model, we<br>
have effectively deferred the cost of copying from the point where a substring<br>
is created until it must be converted to `String` for use with an API.<br>
<br>
A user who needs to optimize away copies altogether should use this guideline:<br>
if for performance reasons you are tempted to add a `Range` argument to your<br>
method as well as a `String` to avoid unnecessary copies, you should instead<br>
use `Substring`.<br>
<br>
##### The “Empty Subscript”<br>
<br>
To make it easy to call such an optimized API when you only have a `String` (or<br>
to call any API that takes a `Collection`'s `SubSequence` when all you have is<br>
the `Collection`), we propose the following “empty subscript” operation,<br>
<br>
```swift<br>
extension Collection {<br>
  subscript() -> SubSequence {<br>
    return self[startIndex..<endIndex]<br>
  }<br>
}<br>
```<br>
<br>
which allows the following usage:<br>
<br>
```swift<br>
funcThatIsJustLooking(at: person.name[]) // pass person.name as Substring<br>
```<br>
<br>
The `[]` syntax can be offered as a fixit when needed, similar to `&` for an<br>
`inout` argument. While it doesn't help a user to convert `[String]` to<br>
`[Substring]`, the need for such conversions is extremely rare, and they can be<br>
done with a simple `map` (which could also be offered by a fixit):<br>
<br>
```swift<br>
takesAnArrayOfSubstring(arrayOfString.map { $0[] })<br>
```<br>
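<br>
To make the guideline concrete, here is a small sketch of an API that accepts a<br>
`Substring` instead of a `String`/`Range` pair. It assumes the unbounded-range<br>
subscript `[...]`, which plays the role of the proposed empty subscript in<br>
current Swift; `countVowels(in:)` is a hypothetical function.<br>
<br>
```swift<br>
// Hypothetical API that only inspects its argument, so it takes a `Substring`<br>
// and needs no separate range parameter.<br>
func countVowels(in text: Substring) -> Int {<br>
  return text.filter { "aeiouAEIOU".contains($0) }.count<br>
}<br>
<br>
let name = "Swift String Manifesto"<br>
<br>
// Whole string: `[...]` produces a `Substring` without copying the text.<br>
let total = countVowels(in: name[...])<br>
<br>
// Sub-range: slice first, then call the same function.<br>
let firstWord = countVowels(in: name.prefix(5))<br>
print(total, firstWord)<br>
```<br>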
<br>
#### Other Options Considered<br>
<br>
As we have seen, all three options above have downsides, but it's possible<br>
these downsides could be eliminated/mitigated by the compiler. We are proposing<br>
one such mitigation, implicit conversion, as part of the "different type,<br>
shared storage" option, to help avoid the cognitive load on developers of<br>
having to deal with a separate `Substring` type.<br>
<br>
To avoid the memory leak issues of a "same type, shared storage" substring<br>
option, we considered whether the compiler could perform an implicit copy of<br>
the underlying storage when it detects the string is being "stored" for long<br>
term usage, say when it is assigned to a stored property. The trouble with this<br>
approach is it is very difficult for the compiler to distinguish between<br>
long-term storage versus short-term in the case of abstractions that rely on<br>
stored properties. For example, should the storing of a substring inside an<br>
`Optional` be considered long-term? Or the storing of multiple substrings<br>
inside an array? The latter would not work well in the case of a<br>
`components(separatedBy:)` implementation that intended to return an array of<br>
substrings. It would also be difficult to distinguish intentional medium-term<br>
storage of substrings, say by a lexer. There does not appear to be an effective<br>
consistent rule that could be applied in the general case for detecting when a<br>
substring is truly being stored long-term.<br>
<br>
To avoid the cost of copying substrings under "same type, copied storage", the<br>
optimizer could be enhanced to reduce the impact of some of those copies.<br>
For example, this code could be optimized to pull the invariant substring out<br>
of the loop:<br>
<br>
```swift<br>
for _ in 0..<lots {<br>
  someFunc(takingString: bigString[bigRange])<br>
}<br>
```<br>
<br>
It's worth noting that a similar optimization is needed to avoid an equivalent<br>
problem with implicit conversion in the "different type, shared storage" case:<br>
<br>
```swift<br>
let substring = bigString[bigRange]<br>
for _ in 0..<lots { someFunc(takingString: substring) }<br>
```<br>
<br>
However, in the case of "same type, copied storage" there are many use cases<br>
that cannot be optimized as easily. Consider the following simple definition of<br>
a recursive `contains` algorithm, which when substring slicing is linear makes<br>
the overall algorithm quadratic:<br>
<br>
```swift<br>
extension String {<br>
  func containsChar(_ x: Character) -> Bool {<br>
    return !isEmpty && (first == x || dropFirst().containsChar(x))<br>
  }<br>
}<br>
```<br>
<br>
For the optimizer to eliminate this problem is unrealistic, forcing the user to<br>
remember to optimize the code to not use string slicing if they want it to be<br>
efficient (assuming they remember):<br>
<br>
```swift<br>
extension String {<br>
  // add optional argument tracking progress through the string<br>
  func containsCharacter(_ x: Character, atOrAfter idx: Index? = nil) -> Bool {<br>
    let idx = idx ?? startIndex<br>
    return idx != endIndex<br>
      && (self[idx] == x || containsCharacter(x, atOrAfter: index(after: idx)))<br>
  }<br>
}<br>
```<br>
<br>
#### Substrings, Ranges and Objective-C Interop<br>
<br>
The pattern of passing a string/range pair is common in several Objective-C<br>
APIs, and is made especially awkward in Swift by the non-interchangeability of<br>
`Range<String.Index>` and `NSRange`.<br>
<br>
```swift<br>
s2.find(s2, sourceRange: NSRange(j..<s2.endIndex, in: s2))<br>
```<br>
<br>
In general, however, the Swift idiom for operating on a sub-range of a<br>
`Collection` is to *slice* the collection and operate on that:<br>
<br>
```swift<br>
s2.find(s2[j..<s2.endIndex])<br>
```<br>
<br>
Therefore, APIs that operate on an `NSString`/`NSRange` pair should be imported<br>
without the `NSRange` argument. The Objective-C importer should be changed to<br>
give these APIs special treatment so that when a `Substring` is passed, instead<br>
of being converted to a `String`, the full `NSString` and range are passed to<br>
the Objective-C method, thereby avoiding a copy.<br>
<br>
As a result, you would never need to pass an `NSRange` to these APIs, which<br>
solves the impedance problem by eliminating the argument, resulting in more<br>
idiomatic Swift code while retaining the performance benefit. To help users<br>
manually handle any cases that remain, Foundation should be augmented to allow<br>
the following syntax for converting to and from `NSRange`:<br>
<br>
```swift<br>
let nsr = NSRange(i..<j, in: s) // An NSRange corresponding to s[i..<j]<br>
let iToJ = Range(nsr, in: s) // Equivalent to i..<j<br>
```<br>
<br>
### The `Unicode` protocol<br>
<br>
With `Substring` and `String` being distinct types and sharing almost all<br>
interface and semantics, and with the highest-performance string processing<br>
requiring knowledge of encoding and layout that the currency types can't<br>
provide, it becomes important to capture the common “string API” in a protocol.<br>
Since Unicode conformance is a key feature of string processing in Swift, we<br>
call that protocol `Unicode`:<br>
<br>
**Note:** The following assumes several features that are planned but not yet implemented in<br>
Swift, and should be considered a sketch rather than a final design.<br>
<br>
```swift<br>
protocol Unicode<br>
  : Comparable, BidirectionalCollection where Element == Character {<br>
<br>
  associatedtype Encoding : UnicodeEncoding<br>
  var encoding: Encoding { get }<br>
<br>
  associatedtype CodeUnits<br>
    : RandomAccessCollection where Element == Encoding.CodeUnit<br>
  var codeUnits: CodeUnits { get }<br>
<br>
  associatedtype UnicodeScalars<br>
    : BidirectionalCollection where Element == UnicodeScalar<br>
  var unicodeScalars: UnicodeScalars { get }<br>
<br>
  associatedtype ExtendedASCII<br>
    : BidirectionalCollection where Element == UInt32<br>
  var extendedASCII: ExtendedASCII { get }<br>
}<br>
<br>
extension Unicode {<br>
  // ... define high-level non-mutating string operations, e.g. search ...<br>
<br>
  func compared<Other: Unicode>(<br>
    to rhs: Other,<br>
    case caseSensitivity: StringSensitivity? = nil,<br>
    diacritic diacriticSensitivity: StringSensitivity? = nil,<br>
    width widthSensitivity: StringSensitivity? = nil,<br>
    in locale: Locale? = nil<br>
  ) -> SortOrder { ... }<br>
}<br>
<br>
extension Unicode : RangeReplaceableCollection where CodeUnits :<br>
  RangeReplaceableCollection {<br>
  // Satisfy protocol requirement<br>
  mutating func replaceSubrange<C : Collection>(_: Range<Index>, with: C)<br>
    where C.Element == Element<br>
<br>
  // ... define high-level mutating string operations, e.g. replace ...<br>
}<br>
<br>
```<br>
<br>
The goal is that `Unicode` exposes the underlying encoding and code units in<br>
such a way that for types with a known representation (e.g. a high-performance<br>
`UTF8String`) that information can be known at compile-time and can be used to<br>
generate a single path, while still allowing types like `String` that admit<br>
multiple representations to use runtime queries and branches to fast path<br>
specializations.<br>
<br>
**Note:** `Unicode` would make a fantastic namespace for much of<br>
what's in this proposal if we could get the ability to nest types and<br>
protocols in protocols.<br>
<br>
<br>
### Scanning, Matching, and Tokenization<br>
<br>
#### Low-Level Textual Analysis<br>
<br>
We should provide convenient APIs for processing strings by character. For example,<br>
it should be easy to cleanly express, “if this string starts with `"f"`, process<br>
the rest of the string as follows…” Swift is well-suited to expressing this<br>
common pattern beautifully, but we need to add the APIs. Here are two examples<br>
of the sort of code that might be possible given such APIs:<br>
<br>
```swift<br>
if let firstLetter = input.droppingPrefix(alphabeticCharacter) {<br>
  somethingWith(input) // process the rest of input<br>
}<br>
<br>
if let (number, restOfInput) = input.parsingPrefix(Int.self) {<br>
  ...<br>
}<br>
```<br>
<br>
The specific spelling and functionality of APIs like this are TBD. The larger<br>
point is to make sure matching-and-consuming jobs are well-supported.<br>
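<br>
One way to approximate this kind of matching-and-consuming code with collection<br>
operations that exist today is sketched below. `parseIntPrefix` is a<br>
hypothetical helper, not one of the proposed APIs, and it only handles unsigned<br>
runs of ASCII digits.<br>
<br>
```swift<br>
// Hypothetical helper: parses a leading run of ASCII digits and returns the<br>
// number together with the unconsumed remainder of the input.<br>
func parseIntPrefix(_ input: Substring) -> (number: Int, rest: Substring)? {<br>
  let digits = input.prefix(while: { $0.isASCII && $0.isNumber })<br>
  // `Int(digits)` is nil for an empty run or a value that overflows `Int`.<br>
  guard let number = Int(digits) else { return nil }<br>
  // `digits` is a slice of `input`, so its `endIndex` is valid in `input`.<br>
  return (number, input[digits.endIndex...])<br>
}<br>
<br>
let input = "42 apples"<br>
if let (number, rest) = parseIntPrefix(input[...]) {<br>
  print(number, rest)<br>
}<br>
```<br>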
<br>
#### Unified Pattern Matcher Protocol<br>
<br>
Many of the current methods that do matching are overloaded to do the same<br>
logical operations in different ways, with the following axes:<br>
<br>
- Logical Operation: `find`, `split`, `replace`, match at start<br>
- Kind of pattern: `CharacterSet`, `String`, a regex, a closure<br>
- Options, e.g. case/diacritic sensitivity, locale. Sometimes a part of<br>
the method name, and sometimes an argument<br>
- Whole string or subrange.<br>
<br>
We should represent these aspects as orthogonal, composable components,<br>
abstracting pattern matchers into a protocol like<br>
[this one](<a href="https://github.com/apple/swift/blob/master/test/Prototypes/PatternMatching.swift#L33" rel="noreferrer" target="_blank">https://github.com/apple/swift/blob/master/test/Prototypes/PatternMatching.swift#L33</a>),<br>
that can allow us to define logical operations once, without introducing<br>
overloads, and massively reducing API surface area.<br>
<br>
For example, using the strawman prefix `%` syntax to turn string literals into<br>
patterns, the following pairs would all invoke the same generic methods:<br>
<br>
```swift<br>
if let found = s.firstMatch(%"searchString") { ... }<br>
if let found = s.firstMatch(someRegex) { ... }<br>
<br>
for m in s.allMatches((%"searchString"), case: .insensitive) { ... }<br>
for m in s.allMatches(someRegex) { ... }<br>
<br>
let items = s.split(separatedBy: ", ")<br>
let tokens = s.split(separatedBy: CharacterSet.whitespace)<br>
```<br>
<br>
Note that, because Swift requires the indices of a slice to match the indices of<br>
the range from which it was sliced, operations like `firstMatch` can return a<br>
`Substring?` in lieu of a `Range<String.Index>?`: the indices of the match in<br>
the string being searched, if needed, can easily be recovered as the<br>
`startIndex` and `endIndex` of the `Substring`.<br>
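<br>
A minimal sketch of how such a protocol might let `firstMatch` be written once<br>
follows. The `PrefixPattern` protocol, the `Literal` and `OneOrMore` types, and<br>
this `firstMatch` are all hypothetical and far simpler than the linked<br>
prototype; the point is only that one generic method serves several kinds of<br>
pattern, and that the returned `Substring` carries the match's indices.<br>
<br>
```swift<br>
// Hypothetical protocol: a pattern reports where a match ends, if it matches<br>
// at the very start of `text`.<br>
protocol PrefixPattern {<br>
  func matchEnd(atStartOf text: Substring) -> String.Index?<br>
}<br>
<br>
// Matches a fixed, non-empty string.<br>
struct Literal: PrefixPattern {<br>
  let literal: String<br>
  func matchEnd(atStartOf text: Substring) -> String.Index? {<br>
    guard !literal.isEmpty, text.starts(with: literal) else { return nil }<br>
    return text.index(text.startIndex, offsetBy: literal.count)<br>
  }<br>
}<br>
<br>
// Matches one or more characters satisfying a closure.<br>
struct OneOrMore: PrefixPattern {<br>
  let predicate: (Character) -> Bool<br>
  func matchEnd(atStartOf text: Substring) -> String.Index? {<br>
    let run = text.prefix(while: predicate)<br>
    return run.isEmpty ? nil : run.endIndex<br>
  }<br>
}<br>
<br>
extension String {<br>
  // Defined once for every kind of pattern; no overloads needed.<br>
  func firstMatch<P: PrefixPattern>(_ pattern: P) -> Substring? {<br>
    var start = startIndex<br>
    while start < endIndex {<br>
      if let end = pattern.matchEnd(atStartOf: self[start...]) {<br>
        return self[start..<end]<br>
      }<br>
      start = index(after: start)<br>
    }<br>
    return nil<br>
  }<br>
}<br>
<br>
let s = "order 66, aisle 7"<br>
if let digits = s.firstMatch(OneOrMore(predicate: { $0.isNumber })) {<br>
  // The match's position in `s` is recoverable from the slice itself.<br>
  print(digits, s[digits.startIndex..<digits.endIndex] == digits)<br>
}<br>
```<br>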
<br>
Note also that matching operations are useful for collections in general, and<br>
would fall out of this proposal:<br>
<br>
```swift<br>
// replace subsequences of contiguous NaNs with zero<br>
forces.replace(oneOrMore([Float.nan]), [0.0])<br>
```<br>
<br>
#### Regular Expressions<br>
<br>
Addressing regular expressions is out of scope for this proposal.<br>
That said, it is important to note that the pattern matching protocol mentioned<br>
above provides a suitable foundation for regular expressions, and types such as<br>
`NSRegularExpression` can easily be retrofitted to conform to it. In the<br>
future, support for regular expression literals in the compiler could allow for<br>
compile-time syntax checking and optimization.<br>
<br>
### String Indices<br>
<br>
`String` currently has four views—`characters`, `unicodeScalars`, `utf8`, and<br>
`utf16`—each with its own opaque index type. The APIs used to translate indices<br>
between views add needless complexity, and the opacity of indices makes them<br>
difficult to serialize.<br>
<br>
The index translation problem has two aspects:<br>
<br>
1. `String` views cannot consume one another's indices without a cumbersome<br>
conversion step. An index into a `String`'s `characters` must be translated<br>
before it can be used as a position in its `unicodeScalars`. Although these<br>
translations are rarely needed, they add conceptual and API complexity.<br>
2. Many APIs in the core libraries and other frameworks still expose `String`<br>
positions as `Int`s and regions as `NSRange`s, which can only reference a<br>
`utf16` view and interoperate poorly with `String` itself.<br>
<br>
#### Index Interchange Among Views<br>
<br>
String's need for flexible backing storage and reasonably-efficient indexing<br>
(i.e. without dynamically allocating and reference-counting the indices<br>
themselves) means indices need an efficient underlying storage type. Although<br>
we do not wish to expose `String`'s indices *as* integers, `Int` offsets into<br>
underlying code unit storage make a good underlying storage type, provided<br>
`String`'s underlying storage supports random-access. We think random-access<br>
*code-unit storage* is a reasonable requirement to impose on all `String`<br>
instances.<br>
<br>
Making these `Int` code unit offsets conveniently accessible and constructible<br>
solves the serialization problem:<br>
<br>
```swift<br>
clipboard.write(s.endIndex.codeUnitOffset)<br>
let offset = clipboard.read(Int.self)<br>
let i = String.Index(codeUnitOffset: offset)<br>
```<br>
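<br>
For comparison, current Swift offers a close analogue for UTF-16 offsets. The<br>
sketch below assumes the `utf16Offset(in:)` method and the<br>
`String.Index(utf16Offset:in:)` initializer rather than the `codeUnitOffset`<br>
spelling proposed here, and round-trips an index through a plain `Int`.<br>
<br>
```swift<br>
let s = "na\u{EF}ve caf\u{E9}"   // "naïve café", written with escapes<br>
let i = s.firstIndex(of: "c")!<br>
<br>
// Serialize the position as an integer offset into the UTF-16 code units...<br>
let offset = i.utf16Offset(in: s)<br>
<br>
// ...and rebuild an index from that integer later.<br>
let restored = String.Index(utf16Offset: offset, in: s)<br>
print(offset, s[restored])<br>
```<br>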
<br>
Index interchange between `String` and its `unicodeScalars`, `codeUnits`,<br>
and [`extendedASCII`](#parsing-ascii-structure) views can be made entirely<br>
seamless by having them share an index type (semantics of indexing a `String`<br>
between grapheme cluster boundaries are TBD—it can either trap or be forgiving).<br>
Having a common index allows easy traversal into the interior of graphemes,<br>
something that is often needed, without making it likely that someone will do it<br>
by accident.<br>
<br>
- `String.index(after:)` should advance to the next grapheme, even when the<br>
index points partway through a grapheme.<br>
<br>
- `String.index(before:)` should move to the start of the grapheme before<br>
the current position.<br>
<br>
Seamless index interchange between `String` and its UTF-8 or UTF-16 views is not<br>
crucial, as the specifics of encoding should not be a concern for most use<br>
cases, and would impose needless costs on the indices of other views. That<br>
said, we can make translation much more straightforward by exposing simple<br>
bidirectional converting `init`s on both index types:<br>
<br>
```swift<br>
let u8Position = String.UTF8.Index(someStringIndex)<br>
let originalPosition = String.Index(u8Position)<br>
```<br>
<br>
#### Index Interchange with Cocoa<br>
<br>
We intend to address `NSRange`s that denote substrings in Cocoa APIs as<br>
described [earlier in this document](#substrings--ranges-and-objective-c-interop).<br>
That leaves the interchange of bare indices with Cocoa APIs trafficking in<br>
`Int`. Hopefully such APIs will be rare, but when needed, the following<br>
extension, which would be useful for all `Collection`s, can help:<br>
<br>
```swift<br>
extension Collection {<br>
  func index(offset: IndexDistance) -> Index {<br>
    return index(startIndex, offsetBy: offset)<br>
  }<br>
  func offset(of i: Index) -> IndexDistance {<br>
    return distance(from: startIndex, to: i)<br>
  }<br>
}<br>
```<br>
<br>
Then integers can easily be translated into offsets into a `String`'s `utf16`<br>
view for consumption by Cocoa:<br>
<br>
```swift<br>
let cocoaIndex = s.utf16.offset(of: String.UTF16Index(i))<br>
let swiftIndex = s.utf16.index(offset: cocoaIndex)<br>
```<br>
<br>
### Formatting<br>
<br>
A full treatment of formatting is out of scope of this proposal, but<br>
we believe it's crucial for completing the text processing picture. This<br>
section details some of the existing issues and thinking that may guide future<br>
development.<br>
<br>
#### Printf-Style Formatting<br>
<br>
`String.format` is designed on the `printf` model: it takes a format string with<br>
textual placeholders for substitution, and an arbitrary list of other arguments.<br>
The syntax and meaning of these placeholders has a long history in<br>
C, but for anyone who doesn't use them regularly they are cryptic and complex,<br>
as the `printf (3)` man page attests.<br>
<br>
Aside from complexity, this style of API has two major problems: First, the<br>
spelling of these placeholders must match up to the types of the arguments, in<br>
the right order, or the behavior is undefined. Some limited support for<br>
compile-time checking of this correspondence could be implemented, but only for<br>
the cases where the format string is a literal. Second, there's no reasonable<br>
way to extend the formatting vocabulary to cover the needs of new types: you are<br>
stuck with what's in the box.<br>
<br>
#### Foundation Formatters<br>
<br>
The formatters supplied by Foundation are highly capable and versatile, offering<br>
both formatting and parsing services. When used for formatting, though, the<br>
design pattern demands more from users than it should:<br>
<br>
* Matching the type of data being formatted to a formatter type<br>
* Creating an instance of that type<br>
* Setting stateful options (`currency`, `dateStyle`) on the type. Note: the<br>
need for this step prevents the instance from being used and discarded in<br>
the same expression where it is created.<br>
* Overall, introduction of needless verbosity into source<br>
<br>
These may seem like small issues, but the experience of Apple localization<br>
experts is that the total drag of these factors on programmers is such that they<br>
tend to reach for `String.format` instead.<br>
<br>
#### String Interpolation<br>
<br>
Swift string interpolation provides a user-friendly alternative to printf's<br>
domain-specific language (just write ordinary Swift code!) and its type safety<br>
problems (put the data right where it belongs!) but the following issues prevent<br>
it from being useful for localized formatting (among other jobs):<br>
<br>
* [SR-2303](<a href="https://bugs.swift.org/browse/SR-2303" rel="noreferrer" target="_blank">https://bugs.swift.org/browse/SR-2303</a>) We are unable to restrict<br>
  types used in string interpolation.<br>
* [SR-1260](<a href="https://bugs.swift.org/browse/SR-1260" rel="noreferrer" target="_blank">https://bugs.swift.org/browse/SR-1260</a>) String interpolation can't<br>
  distinguish (fragments of) the base string from the string substitutions.<br>
<br>
In the long run, we should improve Swift string interpolation to the point where<br>
it can participate in most any formatting job. Mostly this centers around<br>
fixing the interpolation protocols per the previous item, and supporting<br>
localization.<br>
<br>
To be able to use formatting effectively inside interpolations, it needs to be<br>
both lightweight (because it all happens in-situ) and discoverable. One<br>
approach would be to standardize on `format` methods, e.g.:<br>
<br>
```swift<br>
"Column 1: \(n.format(radix:16, width:8)) *** \(message)"<br>
<br>
"Something with leading zeroes: \(x.format(fill: zero, width:8))"<br>
```<br>
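<br>
As a sketch of what one such `format` method could look like, the extension<br>
below is hypothetical: the name, the parameter labels, and the `Character` fill<br>
argument are assumptions made for illustration, and sign handling with a zero<br>
fill is deliberately left out.<br>
<br>
```swift<br>
extension Int {<br>
  // Hypothetical `format` method: renders the value in the given radix,<br>
  // left-padded with `fill` to at least `width` characters.<br>
  // Note: a negative value with a "0" fill would place the padding before<br>
  // the minus sign; a real design would need to handle that.<br>
  func format(radix: Int = 10, width: Int = 0, fill: Character = " ") -> String {<br>
    let digits = String(self, radix: radix)<br>
    let padding = String(repeating: fill, count: max(0, width - digits.count))<br>
    return padding + digits<br>
  }<br>
}<br>
<br>
let n = 255<br>
print("Column 1: \(n.format(radix: 16, width: 8)) ***")<br>
print("Something with leading zeroes: \(7.format(width: 3, fill: "0"))")<br>
```<br>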
<br>
### C String Interop<br>
<br>
Our support for interoperation with nul-terminated C strings is scattered and<br>
incoherent, with six ways to transform a C string into a `String` and four ways to<br>
do the inverse. These APIs should be replaced with the following:<br>
<br>
```swift<br>
extension String {<br>
  /// Constructs a `String` having the same contents as `nulTerminatedUTF8`.<br>
  ///<br>
  /// - Parameter nulTerminatedUTF8: a sequence of contiguous UTF-8 encoded<br>
  ///   bytes ending just before the first zero byte (NUL character).<br>
  init(cString nulTerminatedUTF8: UnsafePointer<CChar>)<br>
<br>
  /// Constructs a `String` having the same contents as `nulTerminatedCodeUnits`.<br>
  ///<br>
  /// - Parameter nulTerminatedCodeUnits: a sequence of contiguous code units in<br>
  ///   the given `encoding`, ending just before the first zero code unit.<br>
  /// - Parameter encoding: describes the encoding in which the code units<br>
  ///   should be interpreted.<br>
  init<Encoding: UnicodeEncoding>(<br>
    cString nulTerminatedCodeUnits: UnsafePointer<Encoding.CodeUnit>,<br>
    encoding: Encoding)<br>
<br>
  /// Invokes the given closure on the contents of the string, represented as a<br>
  /// pointer to a nul-terminated sequence of UTF-8 code units.<br>
  func withCString<Result>(<br>
    _ body: (UnsafePointer<CChar>) throws -> Result) rethrows -> Result<br>
}<br>
```<br>
<br>
In both of the construction APIs, any invalid encoding sequence detected will<br>
have its longest valid prefix replaced by U+FFFD, the Unicode replacement<br>
character, per Unicode specification. This covers the common case. The<br>
replacement is done *physically* in the underlying storage and the validity of<br>
the result is recorded in the `String`'s `encoding` such that future accesses<br>
need not be slowed down by possible error repair separately.<br>
<br>
Construction that is aborted when encoding errors are detected can be<br>
accomplished using APIs on the `encoding`. String types that retain their<br>
physical encoding even in the presence of errors and are repaired on-the-fly can<br>
be built as different instances of the `Unicode` protocol.<br>
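<br>
The repair behavior can be observed with APIs in the current standard library.<br>
This sketch assumes `String(decoding:as:)`, `String(cString:)`, and<br>
`withCString`, and uses made-up sample bytes; it illustrates the<br>
replacement-character policy, not the proposed generic initializer.<br>
<br>
```swift<br>
// Invalid UTF-8 is repaired with U+FFFD when decoding.<br>
let bytes: [UInt8] = [0x68, 0x69, 0xFF]   // "hi" followed by an invalid byte<br>
let repaired = String(decoding: bytes, as: UTF8.self)<br>
print(repaired == "hi\u{FFFD}")<br>
<br>
// Round-trip a string through a nul-terminated UTF-8 buffer.<br>
let original = "caf\u{E9}"<br>
let copy = original.withCString { String(cString: $0) }<br>
print(copy == original)<br>
```<br>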
<br>
### Unicode 9 Conformance<br>
<br>
Unicode 9 (and MacOS 10.11) brought us support for family emoji, which changes<br>
the process of properly identifying `Character` boundaries. We need to update<br>
`String` to account for this change.<br>
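<br>
A quick way to check the intended behavior, assuming a toolchain whose grapheme<br>
breaking follows Unicode 9 or later, is to count the elements of a family emoji<br>
at each level. The sequence below is four person emoji joined by three<br>
zero-width joiners.<br>
<br>
```swift<br>
// man + ZWJ + woman + ZWJ + girl + ZWJ + boy<br>
let family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}"<br>
<br>
// One `Character`, seven Unicode scalars, eleven UTF-16 code units.<br>
print(family.count, family.unicodeScalars.count, family.utf16.count)<br>
```<br>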
<br>
### High-Performance String Processing<br>
<br>
Many strings are short enough to store in 64 bits, many can be stored using only<br>
8 bits per Unicode scalar, others are best encoded in UTF-16, and some come to<br>
us already in some other encoding, such as UTF-8, that would be costly to<br>
translate. Supporting these formats while maintaining usability for<br>
general-purpose APIs demands that a single `String` type can be backed by many<br>
different representations.<br>
<br>
That said, the highest performance code always requires static knowledge of the<br>
data structures on which it operates, and for this code, dynamic selection of<br>
representation comes at too high a cost. Heavy-duty text processing demands a<br>
way to opt out of dynamism and directly use known encodings. Having this<br>
ability can also make it easy to cleanly specialize code that handles dynamic<br>
cases for maximal efficiency on the most common representations.<br>
<br>
To address this need, we can build models of the `Unicode` protocol that encode<br>
representation information into the type, such as `NFCNormalizedUTF16String`.<br>
<br>
### Parsing ASCII Structure<br>
<br>
Although many machine-readable formats support the inclusion of arbitrary<br>
Unicode text, it is also common that their fundamental structure lies entirely<br>
within the ASCII subset (JSON, YAML, many XML formats). These formats are often<br>
processed most efficiently by recognizing ASCII structural elements as ASCII,<br>
and capturing the arbitrary sections between them in more-general strings. The<br>
current String API offers no way to efficiently recognize ASCII and skip past<br>
everything else without the overhead of full decoding into Unicode scalars.<br>
<br>
For these purposes, strings should supply an `extendedASCII` view that is a<br>
collection of `UInt32`, where values less than `0x80` represent the<br>
corresponding ASCII character, and other values represent data that is specific<br>
to the underlying encoding of the string.<br>
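<br>
The `extendedASCII` view does not exist today. As an approximation of the idea,<br>
the sketch below scans a string's `utf8` view, relying on the fact that in<br>
UTF-8 every byte below `0x80` is an ASCII character. `splitOnASCIIComma` is a<br>
hypothetical helper.<br>
<br>
```swift<br>
// Hypothetical helper: finds ASCII commas by comparing raw code units, without<br>
// decoding the non-ASCII text between them into Unicode scalars.<br>
func splitOnASCIIComma(_ text: String) -> [Substring] {<br>
  let comma = UInt8(ascii: ",")<br>
  var fields: [Substring] = []<br>
  var fieldStart = text.startIndex<br>
  for i in text.utf8.indices where text.utf8[i] == comma {<br>
    fields.append(text[fieldStart..<i])<br>
    fieldStart = text.utf8.index(after: i)<br>
  }<br>
  fields.append(text[fieldStart...])<br>
  return fields<br>
}<br>
<br>
print(splitOnASCIIComma("id,na\u{EF}ve,\u{1F600}"))<br>
```<br>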
<br>
## Language Support<br>
<br>
This proposal depends on two new features in the Swift language:<br>
<br>
1. **Generic subscripts**, to<br>
enable unified slicing syntax.<br>
<br>
2. **A subtype relationship** between<br>
`Substring` and `String`, enabling framework APIs to traffic solely in<br>
`String` while still making it possible to avoid copies by handling<br>
`Substring`s where necessary.<br>
<br>
Additionally, **the ability to nest types and protocols inside<br>
protocols** could significantly shrink the footprint of this proposal<br>
on the top-level Swift namespace.<br>
<br>
<br>
## Open Questions<br>
<br>
### Must `String` be limited to storing UTF-16 subset encodings?<br>
<br>
- The ability to handle `UTF-8`-encoded strings (models of `Unicode`) is not in<br>
question here; this is about what encodings must be storable, without<br>
transcoding, in the common currency type called “`String`”.<br>
- ASCII, Latin-1, UCS-2, and UTF-16 are UTF-16 subsets. UTF-8 is not.<br>
- If we have a way to get at a `String`'s code units, we need a concrete type in<br>
which to express them in the API of `String`, which is a concrete type.<br>
- If String needs to be able to represent UTF-32, presumably the code units need<br>
to be `UInt32`.<br>
- Not supporting UTF-32-encoded text seems like one reasonable design choice.<br>
- Maybe we can allow UTF-8 storage in `String` and expose its code units as<br>
`UInt16`, just as we would for Latin-1.<br>
- Supporting only UTF-16-subset encodings would imply that `String` indices can<br>
be serialized without recording the `String`'s underlying encoding.<br>
<br>
### Do we need a type-erasable base protocol for UnicodeEncoding?<br>
<br>
UnicodeEncoding has an associated type, but it may be important to be able to<br>
traffic in completely dynamic encoding values, e.g. for “tell me the most<br>
efficient encoding for this string.”<br>
<br>
### Should there be a string “facade?”<br>
<br>
One possible design alternative makes `Unicode` a vehicle for expressing<br>
the storage and encoding of code units, but does not attempt to give it an API<br>
appropriate for `String`. Instead, string APIs would be provided by a generic<br>
wrapper around an instance of `Unicode`:<br>
<br>
```swift<br>
struct StringFacade<U: Unicode> : BidirectionalCollection {<br>
<br>
  // ...APIs for high-level string processing here...<br>
<br>
  var unicode: U // access to lower-level unicode details<br>
}<br>
<br>
typealias String = StringFacade<StringStorage><br>
typealias Substring = StringFacade<StringStorage.SubSequence><br>
```<br>
<br>
This design would allow us to de-emphasize lower-level `String` APIs such as<br>
access to the specific encoding, by putting them behind a `.unicode` property.<br>
A similar effect in a facade-less design would require a new top-level<br>
`StringProtocol` playing the role of the facade with an<br>
`associatedtype Storage : Unicode`.<br>
<br>
An interesting variation on this design is possible if defaulted generic<br>
parameters are introduced to the language:<br>
<br>
```swift<br>
struct String<U: Unicode = StringStorage><br>
  : BidirectionalCollection {<br>
<br>
  // ...APIs for high-level string processing here...<br>
<br>
  var unicode: U // access to lower-level unicode details<br>
}<br>
<br>
typealias Substring = String<StringStorage.SubSequence><br>
```<br>
<br>
One advantage of such a design is that naïve users will always extend “the right<br>
type” (`String`) without thinking, and the new APIs will show up on `Substring`,<br>
`MyUTF8String`, etc. That said, it also has downsides that should not be<br>
overlooked, not least of which is the confusability of the meaning of the word<br>
“string.” Is it referring to the generic or the concrete type?<br>
<br>
### `TextOutputStream` and `TextOutputStreamable`<br>
<br>
`TextOutputStreamable` is intended to provide a vehicle for<br>
efficiently transporting formatted representations to an output stream<br>
without forcing the allocation of storage. Its use of `String`, a<br>
type with multiple representations, as the lowest-level unit of<br>
communication, conflicts with this goal. It might be sufficient to<br>
change `TextOutputStream` and `TextOutputStreamable` to traffic in an<br>
associated type conforming to `Unicode`, but that is not yet clear.<br>
This area will require some design work.<br>
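<br>
For context, the example below uses the two protocols as they exist in the<br>
standard library today. `Point` and `ChunkRecorder` are made-up types for<br>
illustration. It shows that every piece of output, however small, crosses the<br>
protocol boundary as a `String`.<br>
<br>
```swift
// A type that streams its text in pieces instead of building one big String.
struct Point: TextOutputStreamable {
    var x: Int
    var y: Int

    func write<Target: TextOutputStream>(to target: inout Target) {
        // Each piece is still handed to the stream as a `String` value.
        target.write("(")
        target.write(String(x))
        target.write(", ")
        target.write(String(y))
        target.write(")")
    }
}

// A stream that records every chunk it receives.
struct ChunkRecorder: TextOutputStream {
    var chunks: [String] = []

    mutating func write(_ string: String) {
        chunks.append(string)
    }
}

var recorder = ChunkRecorder()
Point(x: 1, y: 2).write(to: &recorder)
print(recorder.chunks)            // prints ["(", "1", ", ", "2", ")"]
print(recorder.chunks.joined())   // prints "(1, 2)"
```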
<br>
### `description` and `debugDescription`<br>
<br>
* Should these be creating localized or non-localized representations?<br>
* Is returning a `String` efficient enough?<br>
* Is `debugDescription` pulling the weight of the API surface area it adds?<br>
<br>
### `StaticString`<br>
<br>
`StaticString` was added as a byproduct of standard library development and kept<br>
around because it seemed useful, but it was never truly *designed* for client<br>
programmers. We need to decide what happens with it. Presumably *something*<br>
should fill its role, and that should conform to `Unicode`.<br>
<br>
## Footnotes<br>
<br>
<b id="f0">0</b> The integers rewrite currently underway is expected to<br>
substantially reduce the scope of `Int`'s API by using more<br>
generics. [↩](#a0)<br>
<br>
<b id="f1">1</b> In practice, these semantics will usually be tied to the<br>
version of the installed [ICU](<a href="http://icu-project.org" rel="noreferrer" target="_blank">http://icu-project.org</a>) library, which<br>
programmatically encodes the most complex rules of the Unicode Standard and its<br>
de facto extension, CLDR. [↩](#a1)<br>
<br>
<b id="f2">2</b><br>
See<br>
[http://unicode.org/reports/tr29/#Notation](<a href="http://unicode.org/reports/tr29/#Notation" rel="noreferrer" target="_blank">http://unicode.org/reports/<wbr>tr29/#Notation</a>). Note<br>
that inserting Unicode scalar values to prevent merging of grapheme clusters would<br>
also constitute a kind of misbehavior (one of the clusters at the boundary would<br>
not be found in the result), so would be relatively costly to implement, with<br>
little benefit. [↩](#a2)<br>
<br>
<b id="f4">4</b> The use of non-UCA-compliant ordering is fully sanctioned by<br>
the Unicode standard for this purpose. In fact there's<br>
a [whole chapter](<a href="http://www.unicode.org/versions/Unicode9.0.0/ch05.pdf" rel="noreferrer" target="_blank">http://www.unicode.<wbr>org/versions/Unicode9.0.0/<wbr>ch05.pdf</a>)<br>
dedicated to it. In particular, §5.17 says:<br>
<br>
> When comparing text that is visible to end users, a correct linguistic sort<br>
> should be used, as described in _Section 5.16, Sorting and<br>
> Searching_. However, in many circumstances the only requirement is for a<br>
> fast, well-defined ordering. In such cases, a binary ordering can be used.<br>
<br>
[↩](#a4)<br>
<br>
<br>
<b id="f5">5</b> The queries supported by `NSCharacterSet` map directly onto<br>
properties in a table that's indexed by Unicode scalar value. This table is<br>
part of the Unicode standard. Some of these queries (e.g., “is this an<br>
uppercase character?”) may have fairly obvious generalizations to grapheme<br>
clusters, but exactly how to do it is a research topic and *ideally* we'd either<br>
establish the existing practice that the Unicode committee would standardize, or<br>
the Unicode committee would do the research and we'd implement their<br>
result.[↩](#a5)<br>
<br>
</blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature" data-smartmail="gmail_signature">Joshua Alvarado<div><a href="mailto:alvaradojoshua0@gmail.com" target="_blank">alvaradojoshua0@gmail.com</a></div></div>
</div>