<html><head><meta http-equiv="content-type" content="text/html; charset=utf-8"></head><body dir="auto"><div><span></span></div><div><span></span><br><span></span><br><span>Sent from my iPad</span><br><span></span><br><blockquote type="cite"><span>On Jan 20, 2017, at 5:48 AM, Jonathan Hull <<a href="mailto:jhull@gbis.com">jhull@gbis.com</a>> wrote:</span><br></blockquote><blockquote type="cite"><span></span><br></blockquote><blockquote type="cite"><span>Thanks for all the hard work!</span><br></blockquote><blockquote type="cite"><span></span><br></blockquote><blockquote type="cite"><span>Still digesting, but I definitely support the goal of string processing even better than Perl. Some random thoughts:</span><br></blockquote><blockquote type="cite"><span></span><br></blockquote><blockquote type="cite"><span>• I also like the suggestion of implicit conversion from substring slices to strings based on a subtype relationship, since I keep running into that issue when trying to use array slices. </span><br></blockquote><span></span><br><span>Interesting. Could you offer some examples?</span><br><span></span><br><blockquote type="cite"><span>It would be nice to be able to specify that conversion behavior with other types that have a similar subtype relationship.</span><br></blockquote><span></span><br><span>Indeed.</span><br><span></span><br><blockquote type="cite"><span>• One thing that stood out was the interpolation format syntax, which seemed a bit convoluted and difficult to parse:</span><br></blockquote><blockquote type="cite"><blockquote type="cite"><span>"Something with leading zeroes: \(x.format(fill: zero, width:8))"</span><br></blockquote></blockquote><blockquote type="cite"><span></span><br></blockquote><blockquote type="cite"><span></span><br></blockquote><blockquote type="cite"><span>Have you considered treating the interpolation parenthesis more like the function call syntax? 
It should be a familiar pattern and easily parseable to someone versed in other areas of Swift:</span><br></blockquote><blockquote type="cite"><span></span><br></blockquote><blockquote type="cite"><span> "Something with leading zeroes: \(x, fill: .zero, width: 8)"</span><br></blockquote><span></span><br><span>Yes, we've considered it:</span></div><div><br></div><div><span style="background-color: rgba(255, 255, 255, 0);"> 1. "\(f(expr1, label2: expr2, label3: expr3))" <br><br> String(describing: f(expr1, label2: expr2, label3: expr3))<br><br> 2. "\(expr0 + expr1(label2: expr2, label3: expr3))"<br><br> String(describing: expr0 + expr1(label2: expr2, label3: expr3))<br><br> 3. "\((expr1, label2: expr2, label3: expr3))"<br><br> String(describing: (expr1, label2: expr2, label3: expr3))<br><br> 4. "\(expr1, label2: expr2, label3: expr3)"<br><br> String(describing: expr1, label2: expr2, label3: expr3)<br><br>I think I'm primarily concerned with the differences among cases 1, 3,<br>and 4, which are extremely minor. 3 and 4 differ by just a set of<br>parentheses, though that might be mitigated by the ${...} suggestion someone else posted. The point of using string interpolation is to improve<br>readability, and I fear these cases make too many things look alike that<br>have very different meanings. Using a common term like "format" calls<br>out what is being done.<br><br>It's possible to produce terser versions of the syntax that don't suffer<br>from this problem by using a dedicated operator:<br><br> "Column 1: \(n⛄(radix:16, width:8)) *** \(message)"<br> "Something with leading zeroes: \(x⛄(fill: zero, width:8))"<br><br>or even<br><br> "Column 1: \(n⛄radix:16⛄width:8) *** \(message)"<br> "Something with leading zeroes: \(x⛄fill:zero⛄width:8)"</span></div><div><blockquote type="cite"><span></span><br></blockquote><blockquote type="cite"><span>I think that should work for the common cases (e.g. 
padding, truncating, and alignment), with string-returning methods on the type (or even formatting objects à la NSNumberFormatter) being used for more exotic formatting needs (e.g. outputting a number as hex instead of decimal).</span><br></blockquote><blockquote type="cite"><span></span><br></blockquote><blockquote type="cite"><span>• Have you considered having an explicit .machine locale which means that the function should treat the string as machine readable? (as opposed to the lack of a locale)</span><br></blockquote><div><br></div>No, we hadn't. What would be the goal of such a design?</div><div><br></div><div><blockquote type="cite"><span></span></blockquote><blockquote type="cite"><span>• I almost feel like the machine readability vs human readability of a string is information that should travel with the string itself. It would be nice to have an extremely terse way to specify that a string is localizable (strawman syntax below), and that might also classify the string as human readable.</span><br></blockquote><blockquote type="cite"><span></span><br></blockquote><blockquote type="cite"><span> let myLocalizedStr = $"This is localizable" //This gets used as the comment in the localization file</span></blockquote><div><br></div>Yes, there are also arguments for encoding "human readable" in the type system. 
But as noted in <a href="https://github.com/apple/swift/blob/master/docs/StringManifesto.md#future-directions">https://github.com/apple/swift/blob/master/docs/StringManifesto.md#future-directions</a> those ideas are scoped out of Swift 4.</div><div><br><blockquote type="cite"><span></span></blockquote><blockquote type="cite"><span>• Looking forward to RegEx literals!</span></blockquote><blockquote type="cite"><span>Thanks,</span><br></blockquote><blockquote type="cite"><span>Jon</span><br></blockquote><blockquote type="cite"><span></span><br></blockquote><blockquote type="cite"><span></span><br></blockquote><blockquote type="cite"><blockquote type="cite"><span>On Jan 19, 2017, at 6:56 PM, Ben Cohen via swift-evolution <<a href="mailto:swift-evolution@swift.org">swift-evolution@swift.org</a>> wrote:</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>Hi all,</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>Below is our take on a design manifesto for Strings in Swift 4 and beyond.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>Probably best read in rendered markdown on GitHub:</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span><a href="https://github.com/apple/swift/blob/master/docs/StringManifesto.md">https://github.com/apple/swift/blob/master/docs/StringManifesto.md</a></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>We’re eager to hear everyone’s thoughts.</span><br></blockquote></blockquote><blockquote 
type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>Regards,</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>Ben and Dave</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span># String Processing For Swift 4</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>* Authors: [Dave Abrahams](<a href="https://github.com/dabrahams">https://github.com/dabrahams</a>), [Ben Cohen](<a href="https://github.com/airspeedswift">https://github.com/airspeedswift</a>)</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>The goal of re-evaluating Strings for Swift 4 has been fairly ill-defined thus</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>far, with just this short blurb in the</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>[list of goals](<a href="https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160725/025676.html">https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160725/025676.html</a>):</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><span>**String re-evaluation**: String is one of the most important fundamental</span><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote 
type="cite"><span>types in the language. The standard library leads have numerous ideas of how</span><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><span>to improve the programming model for it, without jeopardizing the goals of</span><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><span>providing a unicode-correct-by-default model. Our goal is to be better at</span><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><span>string processing than Perl!</span><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>For Swift 4 and beyond we want to improve three dimensions of text processing:</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>1. Ergonomics</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>2. Correctness</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>3. 
Performance</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>This document is meant to both provide a sense of the long-term vision </span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>(including undecided issues and possible approaches), and to define the scope of</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>work that could be done in the Swift 4 timeframe.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>## General Principles</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>### Ergonomics</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>It's worth noting that ergonomics and correctness are mutually-reinforcing. An</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>API that is easy to use—but incorrectly—cannot be considered an ergonomic</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>success. Conversely, an API that's simply hard to use is also hard to use</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>correctly. 
Achieving optimal performance without compromising ergonomics or</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>correctness is a greater challenge.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>Consistency with the Swift language and idioms is also important for</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>ergonomics. There are several places both in the standard library and in the</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>Foundation additions to `String` where patterns and practices found elsewhere</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>could be applied to improve usability and familiarity.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>### API Surface Area</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>Primary data types such as `String` should have APIs that are easily understood</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>given a signature and a one-line summary. Today, `String` fails that test. 
As</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>you can see, the Standard Library and Foundation both contribute significantly to</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>its overall complexity.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>**Method Arity** | **Standard Library** | **Foundation**</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>---|:---:|:---:</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>0: `ƒ()` | 5 | 7</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>1: `ƒ(:)` | 19 | 48</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>2: `ƒ(::)` | 13 | 19</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>3: `ƒ(:::)` | 5 | 11</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>4: `ƒ(::::)` | 1 | 7</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>5: `ƒ(:::::)` | - | 2</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>6: `ƒ(::::::)` | - | 1</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>**API Kind** | **Standard Library** | **Foundation**</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>---|:---:|:---:</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>`init` | 41 | 18</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>`func` | 42 | 
55</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>`subscript` | 9 | 0</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>`var` | 26 | 14</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>**Total: 205 APIs**</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>By contrast, `Int` has 80 APIs, none with more than two parameters.[0] String processing is complex enough; users shouldn't have</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>to press through physical API sprawl just to get started.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>Many of the choices detailed below contribute to solving this problem,</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>including:</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>* Restoring `Collection` conformance and dropping the `.characters` view.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>* Providing a more general, composable slicing syntax.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>* Altering `Comparable` so that parameterized</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> (e.g. 
case-insensitive) comparison fits smoothly into the basic syntax.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>* Clearly separating language-dependent operations on text produced </span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> by and for humans from language-independent</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> operations on text produced by and for machine processing.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>* Relocating APIs that fall outside the domain of basic string processing and</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> discouraging the proliferation of ad-hoc extensions.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>### Batteries Included</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>While `String` is available to all programs out-of-the-box, crucial APIs for</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>basic string processing tasks are still inaccessible until `Foundation` is</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>imported. 
While it makes sense that `Foundation` is needed for domain-specific</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>jobs such as</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>[linguistic tagging](<a href="https://developer.apple.com/reference/foundation/nslinguistictagger">https://developer.apple.com/reference/foundation/nslinguistictagger</a>),</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>one should not need to import anything to, for example, do case-insensitive</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>comparison.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>### Unicode Compliance and Platform Support</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>The Unicode standard provides a crucial objective reference point for what</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>constitutes correct behavior in an extremely complex domain, so</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>Unicode-correctness is, and will remain, a fundamental design principle behind</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>Swift's `String`. That said, the Unicode standard is an evolving document, so</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>this objective reference-point is not fixed.[1] While</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>many of the most important operations—e.g. 
string hashing, equality, and</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>non-localized comparison—will be stable, the semantics</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>of others, such as grapheme breaking and localized comparison and case</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>conversion, are expected to change as platforms are updated, so programs should</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>be written so their correctness does not depend on precise stability of these</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>semantics across OS versions or platforms. Although it may be possible to</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>imagine static and/or dynamic analysis tools that will help users find such</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>errors, the only sure way to deal with this fact of life is to educate users.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>## Design Points</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>### Internationalization</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>There is strong evidence that developers cannot determine how to use</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>internationalization APIs correctly. 
Although documentation could and should be</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>improved, the sheer size, complexity, and diversity of these APIs is a major</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>contributor to the problem, causing novices to tune out, and more experienced</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>programmers to make avoidable mistakes.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>The first step in improving this situation is to regularize all localized</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>operations as invocations of normal string operations with extra</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>parameters. Among other things, this means:</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>1. Doing away with `localizedXXX` methods </span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>2. Providing a terse way to name the current locale as a parameter</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>3. Automatically adjusting defaults for options such</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> as case sensitivity based on whether the operation is localized.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>4. 
Removing correctness traps like `localizedCaseInsensitiveCompare` (see</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> guidance in the</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> [Internationalization and Localization Guide](<a href="https://developer.apple.com/library/content/documentation/MacOSX/Conceptual/BPInternational/InternationalizingYourCode/InternationalizingYourCode.html">https://developer.apple.com/library/content/documentation/MacOSX/Conceptual/BPInternational/InternationalizingYourCode/InternationalizingYourCode.html</a>)).</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>Along with appropriate documentation updates, these changes will make localized</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>operations more teachable, comprehensible, and approachable, thereby lowering a</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>barrier that currently leads some developers to ignore localization issues</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>altogether.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>#### The Default Behavior of `String`</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>Although this isn't well-known, the most accessible forms of many operations on</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>Swift `String` (and `NSString`) are really only appropriate for text that 
is</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>intended to be processed for, and consumed by, machines. The semantics of the</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>operations with the simplest spellings are always non-localized and</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>language-agnostic.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>Two major factors play into this design choice:</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>1. Machine processing of text is important, so we should have first-class,</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> accessible functions appropriate to that use case.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>2. The most general localized operations require a locale parameter not required</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> by their un-localized counterparts. 
This naturally skews complexity towards</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> localized operations.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>Reaffirming that `String`'s simplest APIs have</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>language-independent/machine-processed semantics has the benefit of clarifying</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>the proper default behavior of operations such as comparison, and allows us to</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>make [significant optimizations](#collation-semantics) that were previously</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>thought to conflict with Unicode.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>#### Future Directions</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>One of the most common internationalization errors is the unintentional</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>presentation to users of text that has not been localized, but regularizing APIs</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>and improving documentation can go only so far in preventing this error.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>Combined with the fact that `String` operations are non-localized by 
default,</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>the environment for processing human-readable text may still be somewhat</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>error-prone in Swift 4.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>For an audience of mostly non-experts, it is especially important that naïve</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>code is very likely to be correct if it compiles, and that more sophisticated</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>issues can be revealed progressively. For this reason, we intend to</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>specifically and separately target localization and internationalization</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>problems in the Swift 5 timeframe.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>### Operations With Options</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>There are three categories of common string operation that commonly need to be</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>tuned in various dimensions:</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>**Operation**|**Applicable 
Options**</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>---|---</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>sort ordering | locale, case/diacritic/width-insensitivity</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>case conversion | locale</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>pattern matching | locale, case/diacritic/width-insensitivity</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>The defaults for case-, diacritic-, and width-insensitivity are different for</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>localized operations than for non-localized operations, so for example a</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>localized sort should be case-insensitive by default, and a non-localized sort</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>should be case-sensitive by default. 
We propose a standard “language” of</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>defaulted parameters to be used for these purposes, with usage roughly like this:</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>```swift</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>x.compared(to: y, case: .sensitive, in: swissGerman)</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>x.lowercased(in: .currentLocale)</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>x.allMatches(</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> somePattern, case: .insensitive, diacritic: .insensitive)</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>```</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>This usage might be supported by code like this:</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>```swift</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>enum StringSensitivity {</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>case sensitive</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>case insensitive</span><br></blockquote></blockquote><blockquote 
type="cite"><blockquote type="cite"><span>}</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>extension Locale {</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>static var currentLocale: Locale { ... }</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>}</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>extension Unicode {</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>// An example of the option language in declaration context,</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>// with nil defaults indicating unspecified, so defaults can be</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>// driven by the presence/absence of a specific Locale</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>func frobnicated(</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> case caseSensitivity: StringSensitivity? = nil,</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> diacritic diacriticSensitivity: StringSensitivity? = nil,</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> width widthSensitivity: StringSensitivity? = nil,</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> in locale: Locale? = nil</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>) -> Self { ... 
}</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>}</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>```</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>### Comparing and Hashing Strings</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>#### Collation Semantics</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>What Unicode says about collation—which is used in `<`, `==`, and hashing— turns</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>out to be quite interesting, once you pick it apart. The full Unicode Collation</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>Algorithm (UCA) works like this:</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>1. Fully normalize both strings</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>2. Convert each string to a sequence of numeric triples to form a collation key</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>3. “Flatten” the key by concatenating the sequence of first elements to the</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> sequence of second elements to the sequence of third elements</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>4. 
Lexicographically compare the flattened keys </span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>While step 1 can usually</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>be [done quickly](<a href="http://unicode.org/reports/tr15/#Description_Norm">http://unicode.org/reports/tr15/#Description_Norm</a>) and</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>incrementally, step 2 uses a collation table that maps matching *sequences* of</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>Unicode scalars in the normalized string to *sequences* of triples, which get</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>accumulated into a collation key. Predictably, this is where the real costs</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>lie.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>*However*, there are some bright spots to this story. 
First, as it turns out,</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>string sorting (localized or not) should be done down to what's called</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>the</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>[“identical” level](<a href="http://unicode.org/reports/tr10/#Multi_Level_Comparison">http://unicode.org/reports/tr10/#Multi_Level_Comparison</a>),</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>which adds a step 3a: append the string's normalized form to the flattened</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>collation key. At first blush this just adds work, but consider what it does</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>for equality: two strings that normalize the same, naturally, will collate the</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>same. But also, *strings that normalize differently will always collate</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>differently*. In other words, for equality, it is sufficient to compare the</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>strings' normalized forms and see if they are the same. 
We can therefore</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>entirely skip the expensive part of collation for equality comparison.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>Next, naturally, anything that applies to equality also applies to hashing: it</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>is sufficient to hash the string's normalized form, bypassing collation keys.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>This should provide significant speedups over the current implementation.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>Perhaps more importantly, since comparison down to the “identical” level applies</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>even to localized strings, it means that hashing and equality can be implemented</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>exactly the same way for localized and non-localized text, and hash tables with</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>localized keys will remain valid across current-locale changes.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>Finally, once it is agreed that the *default* role for `String` is to handle</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>machine-generated and machine-readable text, the default ordering of `String`s</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>need no longer use the UCA at all. 
It is sufficient to order them in any way</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>that's consistent with equality, so `String` ordering can simply be a</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>lexicographical comparison of normalized forms,[4]</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>(which is equivalent to lexicographically comparing the sequences of grapheme</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>clusters), again bypassing step 2 and offering another speedup.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>This leaves us executing the full UCA *only* for localized sorting, and ICU's</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>implementation has apparently been very well optimized.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>Following this scheme everywhere would also allow us to make sorting behavior</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>consistent across platforms. 
Currently, we sort `String` according to the UCA,</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>except that—*only on Apple platforms*—pairs of ASCII characters are ordered by</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>unicode scalar value.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>#### Syntax</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>Because the current `Comparable` protocol expresses all comparisons with binary</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>operators, string comparisons—which may require</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>additional [options](#operations-with-options)—do not fit smoothly into the</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>existing syntax. 
At the same time, we'd like to solve other problems with</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>comparison, as outlined</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>in</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>[this proposal](<a href="https://gist.github.com/CodaFi/f0347bd37f1c407bf7ea0c429ead380e">https://gist.github.com/CodaFi/f0347bd37f1c407bf7ea0c429ead380e</a>)</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>(implemented by changes at the head</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>of</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>[this branch](<a href="https://github.com/CodaFi/swift/commits/space-the-final-frontier">https://github.com/CodaFi/swift/commits/space-the-final-frontier</a>)).</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>We should adopt a modification of that proposal that uses a method rather than</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>an operator `<=>`:</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>```swift</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>enum SortOrder { case before, same, after }</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>protocol Comparable : Equatable {</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>func compared(to: Self) -> SortOrder</span><br></blockquote></blockquote><blockquote type="cite"><blockquote 
type="cite"><span>...</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>}</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>```</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>This change will give us a syntactic platform on which to implement methods with</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>additional, defaulted arguments, thereby unifying and regularizing comparison</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>across the library.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>```swift</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>extension String {</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>func compared(to: Self) -> SortOrder</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>}</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>```</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>**Note:** `SortOrder` should bridge to `NSComparisonResult`. 
It's also possible</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>that the standard library simply adopts Foundation's `ComparisonResult` as is,</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>but we believe the community should at least consider alternate naming before</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>that happens. There will be an opportunity to discuss the choices in detail</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>when the modified</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>[Comparison Proposal](<a href="https://gist.github.com/CodaFi/f0347bd37f1c407bf7ea0c429ead380e">https://gist.github.com/CodaFi/f0347bd37f1c407bf7ea0c429ead380e</a>) comes</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>up for review.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>### `String` should be a `Collection` of `Character`s Again</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>In Swift 2.0, `String`'s `Collection` conformance was dropped, because we</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>convinced ourselves that its semantics differed from those of `Collection` too</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>significantly.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>It was always well understood that if strings were 
treated as sequences of</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>`UnicodeScalar`s, algorithms such as `lexicographicalCompare`, `elementsEqual`,</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>and `reversed` would produce nonsense results. Thus, in Swift 1.0, `String` was</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>a collection of `Character` (extended grapheme clusters). During 2.0</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>development, though, we realized that correct string concatenation could</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>occasionally merge distinct grapheme clusters at the start and end of combined</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>strings.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>This quirk aside, every aspect of strings-as-collections-of-graphemes appears to</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>comport perfectly with Unicode. We think the concatenation problem is tolerable,</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>because the cases where it occurs all represent partially-formed constructs. 
The</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>largest class—isolated combining characters such as ◌́ (U+0301 COMBINING ACUTE</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>ACCENT)—are explicitly called out in the Unicode standard as</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>“[degenerate](<a href="http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries">http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries</a>)” or</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>“[defective](<a href="http://www.unicode.org/versions/Unicode9.0.0/ch03.pdf">http://www.unicode.org/versions/Unicode9.0.0/ch03.pdf</a>)”. The other</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>cases—such as a string ending in a zero-width joiner or half of a regional</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>indicator—appear to be equally transient and unlikely outside of a text editor.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>Admitting these cases encourages exploration of grapheme composition and is</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>consistent with what appears to be an overall Unicode philosophy that “no</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>special provisions are made to get marginally better behavior for… cases that</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>never occur in practice.”[2] Furthermore, it seems</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>unlikely to disturb the semantics of 
any plausible algorithms. We can handle</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>these cases by documenting them, explicitly stating that the elements of a</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>`String` are an emergent property based on Unicode rules.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>The benefits of restoring `Collection` conformance are substantial: </span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>* Collection-like operations encourage experimentation with strings to</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> investigate and understand their behavior. This is useful for teaching new</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> programmers, but also good for experienced programmers who want to</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> understand more about strings and Unicode.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>* Extended grapheme clusters form a natural element boundary for Unicode</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> strings. 
For example, searching and matching operations will always produce</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> results that line up on grapheme cluster boundaries.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>* Character-by-character processing is a legitimate thing to do in many real</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> use cases, including parsing, pattern matching, and language-specific</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> transformations such as transliteration.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>* `Collection` conformance makes a wide variety of powerful operations</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> available that are appropriate to `String`'s default role as the vehicle for</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> machine-processed text.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> The methods `String` would inherit from `Collection`, where similar to</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> higher-level string algorithms, have the right semantics. 
For example,</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> grapheme-wise `lexicographicalCompare`, `elementsEqual`, and application of</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> `flatMap` with case-conversion, produce the same results one would expect</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> from whole-string ordering comparison, equality comparison, and</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> case-conversion, respectively. `reversed` operates correctly on graphemes,</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> keeping diacritics moored to their base characters and leaving emoji intact.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> Other methods such as `indexOf` and `contains` make obvious sense. A few</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> `Collection` methods, like `min` and `max`, may not be particularly useful</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> on `String`, but we don't consider that to be a problem worth solving, in</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> the same way that we wouldn't try to suppress `min` and `max` on a</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> `Set([UInt8])` that was used to store IP addresses.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>* Many of the higher-level operations that we want to provide for `String`s,</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> such as parsing and 
pattern matching, should apply to any `Collection`, and</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> many of the benefits we want for `Collection`s, such</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> as unified slicing, should accrue</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> equally to `String`. Making `String` part of the same protocol hierarchy</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> allows us to write these operations once and not worry about keeping the</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> benefits in sync.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>* Slicing strings into substrings is a crucial part of the vocabulary of</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> string processing, and all other sliceable things are `Collection`s.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> Because of its collection-like behavior, users naturally think of `String`</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> in collection terms, but run into frustrating limitations where it fails to</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> conform and are left to wonder where all the differences lie. 
Many simply</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> “correct” this limitation by declaring a trivial conformance:</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> ```swift</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>extension String : BidirectionalCollection {}</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> ```</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> Even if we removed indexing-by-element from `String`, users could still do</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> this:</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> ```swift</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> extension String : BidirectionalCollection {</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> subscript(i: Index) -> Character { return characters[i] }</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> }</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> ```</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> It would be much better to legitimize the conformance to `Collection` and</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> simply document the oddity of any concatenation 
corner-cases, than to deny</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> users the benefits on the grounds that a few cases are confusing.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>Note that the fact that `String` is a collection of graphemes does *not* mean</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>that string operations will necessarily have to do grapheme boundary</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>recognition. See the Unicode protocol section for details.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>### `Character` and `CharacterSet`</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>`Character`, which represents a</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>Unicode</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>[extended grapheme cluster](<a href="http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries">http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries</a>),</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>is a bit of a black box, requiring conversion to `String` in order to</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>do any introspection, including interoperation with ASCII. 
To fix this, we should:</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>- Add a `unicodeScalars` view much like `String`'s, so that the sub-structure</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> of grapheme clusters is discoverable.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>- Add a failable `init` from sequences of scalars (returning nil for sequences</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> that contain 0 or 2+ graphemes).</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>- (Lower priority) expose some operations, such as `func uppercase() -></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> String`, `var isASCII: Bool`, and, to the extent they can be sensibly</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> generalized, queries of unicode properties that should also be exposed on</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> `UnicodeScalar` such as `isAlphabetic` and `isGraphemeBase` .</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>Despite its name, `CharacterSet` currently operates on the Swift `UnicodeScalar`</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>type. This means it is usable on `String`, but only by going through the unicode</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>scalar view. 
To deal with this clash in the short term, `CharacterSet` should be</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>renamed to `UnicodeScalarSet`. In the longer term, it may be appropriate to</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>introduce a `CharacterSet` that provides similar functionality for extended</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>grapheme clusters.[5]</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>### Unification of Slicing Operations</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>Creating substrings is a basic part of String processing, but the slicing</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>operations that we have in Swift are inconsistent in both their spelling and</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>their naming: </span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>* Slices with two explicit endpoints are done with subscript, and support</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> in-place mutation:</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> ```swift</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> s[i..<j].mutate()</span><br></blockquote></blockquote><blockquote type="cite"><blockquote 
type="cite"><span> ```</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>* Slicing from an index to the end, or from the start to an index, is done</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> with a method and does not support in-place mutation:</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> ```swift</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> s.prefix(upTo: i).readOnly()</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> ```</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>Prefix and suffix operations should be migrated to be subscripting operations</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>with one-sided ranges i.e. 
`s.prefix(upTo: i)` should become `s[..<i]`, as</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>in</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>[this proposal](<a href="https://github.com/apple/swift-evolution/blob/9cf2685293108ea3efcbebb7ee6a8618b83d4a90/proposals/0132-sequence-end-ops.md">https://github.com/apple/swift-evolution/blob/9cf2685293108ea3efcbebb7ee6a8618b83d4a90/proposals/0132-sequence-end-ops.md</a>).</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>With generic subscripting in the language, that will allow us to collapse a wide</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>variety of methods and subscript overloads into a single implementation, and</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>give users an easy-to-use and composable way to describe subranges.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>Further extending this EDSL to integrate use-cases like `s.prefix(maxLength: 5)`</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>is an ongoing research project that can be considered part of the potential</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>long-term vision of text (and collection) processing.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>### Substrings</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>When implementing substring slicing, languages 
are faced with three options:</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>1. Make the substrings the same type as string, and share storage.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>2. Make the substrings the same type as string, and copy storage when making the substring.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>3. Make substrings a different type, with a storage copy on conversion to string.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>We think number 3 is the best choice. A walk-through of the tradeoffs follows.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>#### Same type, shared storage</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>In Swift 3.0, slicing a `String` produces a new `String` that is a view into a</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>subrange of the original `String`'s storage. 
This is why `String` is 3 words in</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>size (the start, length and buffer owner), unlike the similar `Array` type</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>which is only one.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>This is a simple model with big efficiency gains when chopping up strings into</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>multiple smaller strings. But it does mean that a stored substring keeps the</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>entire original string buffer alive even after it would normally have been</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>released.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>This arrangement has proven to be problematic in other programming languages,</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>because applications sometimes extract small strings from large ones and keep</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>those small strings long-term. 
That is considered a memory leak and was enough</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>of a problem in Java that they changed from substrings sharing storage to</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>making a copy in 1.7.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>#### Same type, copied storage</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>Copying of substrings is also the choice made in C#, and in the default</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>`NSString` implementation. This approach avoids the memory leak issue, but has</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>obvious performance overhead in performing the copies.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>This in turn encourages trafficking in string/range pairs instead of in</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>substrings, for performance reasons, leading to API challenges. 
For example:</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>```swift</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>foo.compare(bar, range: start..<end)</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>```</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>Here, it is not clear whether `range` applies to `foo` or `bar`. This</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>relationship is better expressed in Swift as a slicing operation:</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>```swift</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>foo[start..<end].compare(bar)</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>```</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>Not only does this clarify to which string the range applies, it also brings</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>this sub-range capability to any API that operates on `String` "for free". 
So</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>these other combinations also work equally well:</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>```swift</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>// apply range on argument rather than target</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>foo.compare(bar[start..<end])</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>// apply range on both</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>foo[start..<end].compare(bar[start1..<end1])</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>// compare two strings ignoring first character</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>foo.dropFirst().compare(bar.dropFirst())</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>```</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>In all three cases, an explicit range argument need not appear on the `compare`</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>method itself. The implementation of `compare` does not need to know anything</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>about ranges. 
Methods need only take range arguments when that is an</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>integral part of their purpose (for example, setting the start and end of a</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>user's current selection in a text box).</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>#### Different type, shared storage</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>The desire to share underlying storage while preventing accidental memory leaks</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>occurs with slices of `Array`. For this reason we have an `ArraySlice` type.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>The inconvenience of a separate type is mitigated by most operations used on</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>`Array` from the standard library being generic over `Sequence` or `Collection`.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>We should apply the same approach for `String` by introducing a distinct</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>`SubSequence` type, `Substring`. 
Similar advice given for `ArraySlice` would apply to `Substring`:</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><span>Important: Long-term storage of `Substring` instances is discouraged. A</span><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><span>substring holds a reference to the entire storage of a larger string, not</span><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><span>just to the portion it presents, even after the original string's lifetime</span><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><span>ends. Long-term storage of a `Substring` may therefore prolong the lifetime</span><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><span>of large strings that are no longer otherwise accessible, which can appear</span><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><span>to be memory leakage.</span><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>When assigning a `Substring` to a longer-lived variable (usually a stored</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>property) explicitly of type `String`, a type conversion will be performed, and</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>at this point the substring buffer is copied and the original string's storage</span><br></blockquote></blockquote><blockquote type="cite"><blockquote 
type="cite"><span>can be released.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>A `String` that was not its own `Substring` could be one word—a single tagged</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>pointer—without requiring additional allocations. `Substring`s would be a view</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>onto a `String`, so are 3 words - pointer to owner, pointer to start, and a</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>length. The small string optimization for `Substring` would take advantage of</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>the larger size, probably with a less compressed encoding for speed.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>The downside of having two types is the inconvenience of sometimes having a</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>`Substring` when you need a `String`, and vice-versa. It is likely this would</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>be a significantly bigger problem than with `Array` and `ArraySlice`, as</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>slicing of `String` is such a common operation. It is especially relevant to</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>existing code that assumes `String` is the currency type. 
To ease the pain of</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>type mismatches, `Substring` should be a subtype of `String` in the same way</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>that `Int` is a subtype of `Optional<Int>`. This would give users an implicit</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>conversion from `Substring` to `String`, as well as the usual implicit</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>conversions such as `[Substring]` to `[String]` that other subtype</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>relationships receive.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>In most cases, type inference combined with the subtype relationship should</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>make the type difference a non-issue and users will not care which type they</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>are using. 
For flexibility and optimizability, most operations from the</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>standard library will traffic in generic models of</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>[`Unicode`](#the--code-unicode--code--protocol).</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>##### Guidance for API Designers</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>In this model, **if a user is unsure about which type to use, `String` is always</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>a reasonable default**. A `Substring` passed where `String` is expected will be</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>implicitly copied. 
When compared to the “same type, copied storage” model, we</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>have effectively deferred the cost of copying from the point where a substring</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>is created until it must be converted to `String` for use with an API.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>A user who needs to optimize away copies altogether should use this guideline:</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>if for performance reasons you are tempted to add a `Range` argument to your</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>method as well as a `String` to avoid unnecessary copies, you should instead</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>use `Substring`.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>##### The “Empty Subscript”</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>To make it easy to call such an optimized API when you only have a `String` (or</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>to call any API that takes a `Collection`'s `SubSequence` when all you have is</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>the `Collection`), we propose the following “empty subscript” operation,</span><br></blockquote></blockquote><blockquote type="cite"><blockquote 
type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>```swift</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>extension Collection {</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>subscript() -> SubSequence { </span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> return self[startIndex..<endIndex] </span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>}</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>}</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>```</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>which allows the following usage:</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>```swift</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>funcThatIsJustLooking(at: person.name[]) // pass person.name as Substring</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>```</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>The `[]` syntax can be offered as a fixit when needed, similar to `&` for an</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>`inout` argument. 
While it doesn't help a user to convert `[String]` to</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>`[Substring]`, the need for such conversions is extremely rare and can be met with</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>a simple `map` (which could also be offered by a fixit):</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>```swift</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>takesAnArrayOfSubstring(arrayOfString.map { $0[] })</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>```</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>#### Other Options Considered</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>As we have seen, all three options above have downsides, but it's possible</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>these downsides could be eliminated or mitigated by the compiler. 
We are proposing</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>one such mitigation, implicit conversion, as part of the "different type,</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>shared storage" option, to help avoid the cognitive load on developers of</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>having to deal with a separate `Substring` type.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>To avoid the memory leak issues of a "same type, shared storage" substring</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>option, we considered whether the compiler could perform an implicit copy of</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>the underlying storage when it detects the string is being "stored" for long</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>term usage, say when it is assigned to a stored property. The trouble with this</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>approach is it is very difficult for the compiler to distinguish between</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>long-term storage versus short-term in the case of abstractions that rely on</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>stored properties. For example, should the storing of a substring inside an</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>`Optional` be considered long-term? 
Or the storing of multiple substrings</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>inside an array? The latter would not work well in the case of a</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>`components(separatedBy:)` implementation that intended to return an array of</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>substrings. It would also be difficult to distinguish intentional medium-term</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>storage of substrings, say by a lexer. There does not appear to be an effective,</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>consistent rule that could be applied in the general case for detecting when a</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>substring is truly being stored long-term.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>To avoid the cost of copying substrings under "same type, copied storage", the</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>optimizer could be enhanced to reduce the impact of some of those copies.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>For example, this code could be optimized to pull the invariant substring out</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>of the loop:</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>```swift</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>for _ in 0..<lots { 
</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>someFunc(takingString: bigString[bigRange]) </span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>}</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>```</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>It's worth noting that a similar optimization is needed to avoid an equivalent</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>problem with implicit conversion in the "different type, shared storage" case:</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>```swift</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>let substring = bigString[bigRange]</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>for _ in 0..<lots { someFunc(takingString: substring) }</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>```</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>However, in the case of "same type, copied storage" there are many use cases</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>that cannot be optimized as easily. 
Consider the following simple definition of</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>a recursive `contains` algorithm, which when substring slicing is linear makes</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>the overall algorithm quadratic:</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>```swift</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>extension String {</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> func containsChar(_ x: Character) -> Bool {</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> return !isEmpty && (first == x || dropFirst().containsChar(x))</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> }</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>}</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>```</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>For the optimizer to eliminate this problem is unrealistic, forcing the user to</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>remember to optimize the code to not use string slicing if they want it to be</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>efficient (assuming they remember):</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>```swift</span><br></blockquote></blockquote><blockquote 
type="cite"><blockquote type="cite"><span>extension String {</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> // add optional argument tracking progress through the string</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> func containsCharacter(_ x: Character, atOrAfter idx: Index? = nil) -> Bool {</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> let idx = idx ?? startIndex</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> return idx != endIndex</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> && (self[idx] == x || containsCharacter(x, atOrAfter: index(after: idx)))</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> }</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>}</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>```</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>#### Substrings, Ranges and Objective-C Interop</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>The pattern of passing a string/range pair is common in several Objective-C</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>APIs, and is made especially awkward in Swift by the non-interchangeability of</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>`Range<String.Index>` and `NSRange`. 
</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>```swift</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>s2.find(s2, sourceRange: NSRange(j..<s2.endIndex, in: s2))</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>```</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>In general, however, the Swift idiom for operating on a sub-range of a</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>`Collection` is to *slice* the collection and operate on that:</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>```swift</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>s2.find(s2[j..<s2.endIndex])</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>```</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>Therefore, APIs that operate on an `NSString`/`NSRange` pair should be imported</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>without the `NSRange` argument. 
The Objective-C importer should be changed to</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>give these APIs special treatment so that when a `Substring` is passed, instead</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>of being converted to a `String`, the full `NSString` and range are passed to</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>the Objective-C method, thereby avoiding a copy.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>As a result, you would never need to pass an `NSRange` to these APIs, which</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>solves the impedance problem by eliminating the argument, resulting in more</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>idiomatic Swift code while retaining the performance benefit. 
To help users</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>manually handle any cases that remain, Foundation should be augmented to allow</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>the following syntax for converting to and from `NSRange`:</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>```swift</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>let nsr = NSRange(i..<j, in: s) // An NSRange corresponding to s[i..<j]</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>let iToJ = Range(nsr, in: s) // Equivalent to i..<j</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>```</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>### The `Unicode` protocol</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>With `Substring` and `String` being distinct types and sharing almost all</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>interface and semantics, and with the highest-performance string processing</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>requiring knowledge of encoding and layout that the currency types can't</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>provide, it becomes important to capture the common “string API” in a protocol.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>Since Unicode conformance 
is a key feature of string processing in Swift, we</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>call that protocol `Unicode`:</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>**Note:** The following assumes several features that are planned but not yet implemented in</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>Swift, and should be considered a sketch rather than a final design.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>```swift</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>protocol Unicode </span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>: Comparable, BidirectionalCollection where Element == Character {</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>associatedtype Encoding : UnicodeEncoding</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>var encoding: Encoding { get }</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>associatedtype CodeUnits </span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> : RandomAccessCollection where Element == Encoding.CodeUnit</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>var codeUnits: CodeUnits { get }</span><br></blockquote></blockquote><blockquote type="cite"><blockquote 
type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>associatedtype UnicodeScalars </span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> : BidirectionalCollection where Element == UnicodeScalar</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>var unicodeScalars: UnicodeScalars { get }</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>associatedtype ExtendedASCII </span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> : BidirectionalCollection where Element == UInt32</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>var extendedASCII: ExtendedASCII { get }</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>}</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>extension Unicode {</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>// ... define high-level non-mutating string operations, e.g. 
search ...</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>func compared<Other: Unicode>(</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> to rhs: Other,</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> case caseSensitivity: StringSensitivity? = nil,</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> diacritic diacriticSensitivity: StringSensitivity? = nil,</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> width widthSensitivity: StringSensitivity? = nil,</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> in locale: Locale? = nil</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>) -> SortOrder { ... }</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>}</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>extension Unicode : RangeReplaceableCollection where CodeUnits :</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>RangeReplaceableCollection {</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> // Satisfy protocol requirement</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> mutating func replaceSubrange<C : Collection>(_: Range<Index>, with: C) </span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> where C.Element == Element</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote 
type="cite"><blockquote type="cite"><span>// ... define high-level mutating string operations, e.g. replace ...</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>}</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>```</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>The goal is that `Unicode` exposes the underlying encoding and code units in</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>such a way that for types with a known representation (e.g. a high-performance</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>`UTF8String`) that information can be known at compile-time and can be used to</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>generate a single path, while still allowing types like `String` that admit</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>multiple representations to use runtime queries and branches to fast path</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>specializations.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>**Note:** `Unicode` would make a fantastic namespace for much of</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>what's in this proposal if we could get the ability to nest types and</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>protocols in protocols.</span><br></blockquote></blockquote><blockquote 
type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>### Scanning, Matching, and Tokenization</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>#### Low-Level Textual Analysis</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>We should provide convenient APIs for processing strings by character. For example,</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>it should be easy to cleanly express, “if this string starts with `"f"`, process</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>the rest of the string as follows…” Swift is well-suited to expressing this</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>common pattern beautifully, but we need to add the APIs. 
Here are two examples</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>of the sort of code that might be possible given such APIs:</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>```swift</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>if let firstLetter = input.droppingPrefix(alphabeticCharacter) {</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>somethingWith(input) // process the rest of input</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>}</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>if let (number, restOfInput) = input.parsingPrefix(Int.self) {</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> ...</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>}</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>```</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>The specific spelling and functionality of APIs like this are TBD. 
The larger</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>point is to make sure matching-and-consuming jobs are well-supported.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>#### Unified Pattern Matcher Protocol</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>Many of the current methods that do matching are overloaded to do the same</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>logical operations in different ways, with the following axes:</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>- Logical Operation: `find`, `split`, `replace`, match at start</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>- Kind of pattern: `CharacterSet`, `String`, a regex, a closure</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>- Options, e.g. case/diacritic sensitivity, locale. 
Sometimes a part of</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>the method name, and sometimes an argument</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>- Whole string or subrange.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>We should represent these aspects as orthogonal, composable components,</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>abstracting pattern matchers into a protocol like</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>[this one](<a href="https://github.com/apple/swift/blob/master/test/Prototypes/PatternMatching.swift#L33">https://github.com/apple/swift/blob/master/test/Prototypes/PatternMatching.swift#L33</a>),</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>that can allow us to define logical operations once, without introducing</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>overloads, and massively reducing API surface area.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>For example, using the strawman prefix `%` syntax to turn string literals into</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>patterns, the following pairs would all invoke the same generic methods:</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>```swift</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>if let found = 
s.firstMatch(%"searchString") { ... }</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>if let found = s.firstMatch(someRegex) { ... }</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>for m in s.allMatches((%"searchString"), case: .insensitive) { ... }</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>for m in s.allMatches(someRegex) { ... }</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>let items = s.split(separatedBy: ", ")</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>let tokens = s.split(separatedBy: CharacterSet.whitespace)</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>```</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>Note that, because Swift requires the indices of a slice to match the indices of</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>the range from which it was sliced, operations like `firstMatch` can return a</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>`Substring?` in lieu of a `Range<String.Index>?`: the indices of the match in</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>the string being searched, if needed, can easily be recovered as the</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>`startIndex` and `endIndex` of the `Substring`.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote 
type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>Note also that matching operations are useful for collections in general, and</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>would fall out of this proposal:</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>```swift</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>// replace subsequences of contiguous NaNs with zero</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>forces.replace(oneOrMore([Float.nan]), [0.0])</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>```</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>#### Regular Expressions</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>Addressing regular expressions is out of scope for this proposal.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>That said, it is important to note that the pattern matching protocol mentioned</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>above provides a suitable foundation for regular expressions, and types such as</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>`NSRegularExpression` can easily be retrofitted to conform to it. 
In the</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>future, support for regular expression literals in the compiler could allow for</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>compile-time syntax checking and optimization.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>### String Indices</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>`String` currently has four views—`characters`, `unicodeScalars`, `utf8`, and</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>`utf16`—each with its own opaque index type. The APIs used to translate indices</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>between views add needless complexity, and the opacity of indices makes them</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>difficult to serialize.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>The index translation problem has two aspects:</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>1. `String` views cannot consume one another's indices without a cumbersome</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> conversion step. 
An index into a `String`'s `characters` must be translated</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> before it can be used as a position in its `unicodeScalars`. Although these</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> translations are rarely needed, they add conceptual and API complexity.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>2. Many APIs in the core libraries and other frameworks still expose `String`</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> positions as `Int`s and regions as `NSRange`s, which can only reference a</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> `utf16` view and interoperate poorly with `String` itself.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>#### Index Interchange Among Views</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>String's need for flexible backing storage and reasonably-efficient indexing</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>(i.e. without dynamically allocating and reference-counting the indices</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>themselves) means indices need an efficient underlying storage type. 
Although</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>we do not wish to expose `String`'s indices *as* integers, `Int` offsets into</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>underlying code unit storage make a good underlying storage type, provided</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>`String`'s underlying storage supports random access. We think random-access</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>*code-unit storage* is a reasonable requirement to impose on all `String`</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>instances.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>Making these `Int` code unit offsets conveniently accessible and constructible</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>solves the serialization problem:</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>```swift</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>clipboard.write(s.endIndex.codeUnitOffset)</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>let offset = clipboard.read(Int.self)</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>let i = String.Index(codeUnitOffset: offset)</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>```</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote 
type="cite"><blockquote type="cite"><span>Index interchange between `String` and its `unicodeScalars`, `codeUnits`,</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>and [`extendedASCII`](#parsing-ascii-structure) views can be made entirely</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>seamless by having them share an index type (semantics of indexing a `String`</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>between grapheme cluster boundaries are TBD—it can either trap or be forgiving).</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>Having a common index allows easy traversal into the interior of graphemes,</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>something that is often needed, without making it likely that someone will do it</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>by accident.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>- `String.index(after:)` should advance to the next grapheme, even when the</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> index points partway through a grapheme.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>- `String.index(before:)` should move to the start of the grapheme before</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> the current position.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote 
type="cite"><span>Seamless index interchange between `String` and its UTF-8 or UTF-16 views is not</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>crucial, as the specifics of encoding should not be a concern for most use</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>cases, and would impose needless costs on the indices of other views. That</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>said, we can make translation much more straightforward by exposing simple</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>bidirectional converting `init`s on both index types:</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>```swift</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>let u8Position = String.UTF8.Index(someStringIndex)</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>let originalPosition = String.Index(u8Position)</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>```</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>#### Index Interchange with Cocoa</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>We intend to address `NSRange`s that denote substrings in Cocoa APIs as</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>described [later in this document](#substrings--ranges-and-objective-c-interop).</span><br></blockquote></blockquote><blockquote 
type="cite"><blockquote type="cite"><span>That leaves the interchange of bare indices with Cocoa APIs trafficking in</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>`Int`. Hopefully such APIs will be rare, but when needed, the following</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>extension, which would be useful for all `Collection`s, can help:</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>```swift</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>extension Collection {</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>func index(offset: IndexDistance) -> Index {</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> return index(startIndex, offsetBy: offset)</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>}</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>func offset(of i: Index) -> IndexDistance {</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> return distance(from: startIndex, to: i)</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>}</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>}</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>```</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>Then integers can easily be translated into offsets into a `String`'s `utf16`</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>view 
for consumption by Cocoa:</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>```swift</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>let cocoaIndex = s.utf16.offset(of: String.UTF16Index(i))</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>let swiftIndex = s.utf16.index(offset: cocoaIndex)</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>```</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>### Formatting</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>A full treatment of formatting is out of scope of this proposal, but</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>we believe it's crucial for completing the text processing picture. 
This</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>section details some of the existing issues and thinking that may guide future</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>development.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>#### Printf-Style Formatting</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>`String.format` is designed on the `printf` model: it takes a format string with</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>textual placeholders for substitution, and an arbitrary list of other arguments.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>The syntax and meaning of these placeholders has a long history in</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>C, but for anyone who doesn't use them regularly they are cryptic and complex,</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>as the `printf (3)` man page attests.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>Aside from complexity, this style of API has two major problems: First, the</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>spelling of these placeholders must match up to the types of the arguments, in</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>the right order, or the behavior is undefined. 
Some limited support for</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>compile-time checking of this correspondence could be implemented, but only for</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>the cases where the format string is a literal. Second, there's no reasonable</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>way to extend the formatting vocabulary to cover the needs of new types: you are</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>stuck with what's in the box.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>#### Foundation Formatters</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>The formatters supplied by Foundation are highly capable and versatile, offering</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>both formatting and parsing services. When used for formatting, though, the</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>design pattern demands more from users than it should:</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>* Matching the type of data being formatted to a formatter type</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>* Creating an instance of that type</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>* Setting stateful options (`currency`, `dateStyle`) on the type. 
Note: the</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> need for this step prevents the instance from being used and discarded in</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> the same expression where it is created.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>* Overall, introduction of needless verbosity into source</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>These may seem like small issues, but the experience of Apple localization</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>experts is that the total drag of these factors on programmers is such that they</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>tend to reach for `String.format` instead.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>#### String Interpolation</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>Swift string interpolation provides a user-friendly alternative to printf's</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>domain-specific language (just write ordinary Swift code!) and its type safety</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>problems (put the data right where it belongs!) 
but the following issues prevent</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>it from being useful for localized formatting (among other jobs):</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>* [SR-2303](<a href="https://bugs.swift.org/browse/SR-2303">https://bugs.swift.org/browse/SR-2303</a>) We are unable to restrict</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> types used in string interpolation.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>* [SR-1260](<a href="https://bugs.swift.org/browse/SR-1260">https://bugs.swift.org/browse/SR-1260</a>) String interpolation can't</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> distinguish (fragments of) the base string from the string substitutions.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>In the long run, we should improve Swift string interpolation to the point where</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>it can participate in most any formatting job. 
Mostly this centers around</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>fixing the interpolation protocols per the previous item, and supporting</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>localization.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>To be able to use formatting effectively inside interpolations, it needs to be</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>both lightweight (because it all happens in-situ) and discoverable. One </span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>approach would be to standardize on `format` methods, e.g.:</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>```swift</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>"Column 1: \(n.format(radix:16, width:8)) *** \(message)"</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>"Something with leading zeroes: \(x.format(fill: zero, width:8))"</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>```</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>### C String Interop</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>Our support for interoperation with nul-terminated C strings is 
scattered and</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>incoherent, with six ways to transform a C string into a `String` and four ways to</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>do the inverse. These APIs should be replaced with the following:</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>```swift</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>extension String {</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>/// Constructs a `String` having the same contents as `nulTerminatedUTF8`.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>///</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>/// - Parameter nulTerminatedUTF8: a sequence of contiguous UTF-8 encoded </span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>/// bytes ending just before the first zero byte (NUL character).</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>init(cString nulTerminatedUTF8: UnsafePointer<CChar>)</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>/// Constructs a `String` having the same contents as `nulTerminatedCodeUnits`.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>///</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>/// - Parameter nulTerminatedCodeUnits: a sequence of contiguous code units in</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>/// the given 
`encoding`, ending just before the first zero code unit.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>/// - Parameter encoding: describes the encoding in which the code units</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>/// should be interpreted.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>init<Encoding: UnicodeEncoding>(</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> cString nulTerminatedCodeUnits: UnsafePointer<Encoding.CodeUnit>,</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> encoding: Encoding)</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>/// Invokes the given closure on the contents of the string, represented as a</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>/// pointer to a null-terminated sequence of UTF-8 code units.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>func withCString<Result>(</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> _ body: (UnsafePointer<CChar>) throws -> Result) rethrows -> Result</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>}</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>```</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>In both of the construction APIs, any invalid encoding sequence detected will</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>have its longest valid prefix replaced by U+FFFD, 
the Unicode replacement</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>character, per Unicode specification. This covers the common case. The</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>replacement is done *physically* in the underlying storage and the validity of</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>the result is recorded in the `String`'s `encoding` such that future accesses</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>need not be slowed down by possible error repair separately.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>Construction that is aborted when encoding errors are detected can be</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>accomplished using APIs on the `encoding`. 
String types that retain their</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>physical encoding even in the presence of errors and are repaired on-the-fly can</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>be built as different instances of the `Unicode` protocol.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>### Unicode 9 Conformance</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>Unicode 9 (and MacOS 10.11) brought us support for family emoji, which changes</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>the process of properly identifying `Character` boundaries. We need to update</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>`String` to account for this change.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>### High-Performance String Processing</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>Many strings are short enough to store in 64 bits, many can be stored using only</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>8 bits per unicode scalar, others are best encoded in UTF-16, and some come to</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>us already in some other encoding, such as UTF-8, that would be costly to</span><br></blockquote></blockquote><blockquote type="cite"><blockquote 
type="cite"><span>translate. Supporting these formats while maintaining usability for</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>general-purpose APIs demands that a single `String` type can be backed by many</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>different representations.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>That said, the highest performance code always requires static knowledge of the</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>data structures on which it operates, and for this code, dynamic selection of</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>representation comes at too high a cost. Heavy-duty text processing demands a</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>way to opt out of dynamism and directly use known encodings. 
Having this</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>ability can also make it easy to cleanly specialize code that handles dynamic</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>cases for maximal efficiency on the most common representations.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>To address this need, we can build models of the `Unicode` protocol that encode</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>representation information into the type, such as `NFCNormalizedUTF16String`.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>### Parsing ASCII Structure</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>Although many machine-readable formats support the inclusion of arbitrary</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>Unicode text, it is also common that their fundamental structure lies entirely</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>within the ASCII subset (JSON, YAML, many XML formats). These formats are often</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>processed most efficiently by recognizing ASCII structural elements as ASCII,</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>and capturing the arbitrary sections between them in more-general strings. 
The</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>current String API offers no way to efficiently recognize ASCII and skip past</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>everything else without the overhead of full decoding into unicode scalars.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>For these purposes, strings should supply an `extendedASCII` view that is a</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>collection of `UInt32`, where values less than `0x80` represent the</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>corresponding ASCII character, and other values represent data that is specific</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>to the underlying encoding of the string.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>## Language Support</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>This proposal depends on two new features in the Swift language:</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>1. 
**Generic subscripts**, to</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> enable unified slicing syntax.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>2. **A subtype relationship** between</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> `Substring` and `String`, enabling framework APIs to traffic solely in</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> `String` while still making it possible to avoid copies by handling</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> `Substring`s where necessary.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>Additionally, **the ability to nest types and protocols inside</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>protocols** could significantly shrink the footprint of this proposal</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>on the top-level Swift namespace.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>## Open Questions</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>### Must `String` be limited to storing UTF-16 subset encodings?</span><br></blockquote></blockquote><blockquote type="cite"><blockquote 
type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>- The ability to handle `UTF-8`-encoded strings (models of `Unicode`) is not in</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>question here; this is about what encodings must be storable, without</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>transcoding, in the common currency type called “`String`”.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>- ASCII, Latin-1, UCS-2, and UTF-16 are UTF-16 subsets. UTF-8 is not.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>- If we have a way to get at a `String`'s code units, we need a concrete type in</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>which to express them in the API of `String`, which is a concrete type</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>- If String needs to be able to represent UTF-32, presumably the code units need</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>to be `UInt32`.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>- Not supporting UTF-32-encoded text seems like one reasonable design choice.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>- Maybe we can allow UTF-8 storage in `String` and expose its code units as</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>`UInt16`, just as we would for Latin-1.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>- Supporting only UTF-16-subset encodings would imply that `String` indices can</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>be 
serialized without recording the `String`'s underlying encoding.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>### Do we need a type-erasable base protocol for UnicodeEncoding?</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>UnicodeEncoding has an associated type, but it may be important to be able to</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>traffic in completely dynamic encoding values, e.g. for “tell me the most</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>efficient encoding for this string.”</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>### Should there be a string “facade?”</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>One possible design alternative makes `Unicode` a vehicle for expressing</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>the storage and encoding of code units, but does not attempt to give it an API</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>appropriate for `String`. 
Instead, string APIs would be provided by a generic</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>wrapper around an instance of `Unicode`:</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>```swift</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>struct StringFacade<U: Unicode> : BidirectionalCollection {</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>// ...APIs for high-level string processing here...</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>var unicode: U // access to lower-level unicode details</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>}</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>typealias String = StringFacade<StringStorage></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>typealias Substring = StringFacade<StringStorage.SubSequence></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>```</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>This design would allow us to de-emphasize lower-level `String` APIs such as</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>access to the specific encoding, by putting them behind a `.unicode` 
property.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>A similar effect in a facade-less design would require a new top-level</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>`StringProtocol` playing the role of the facade with an `associatedtype</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>Storage : Unicode`.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>An interesting variation on this design is possible if defaulted generic</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>parameters are introduced to the language:</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>```swift</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>struct String<U: Unicode = StringStorage> </span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>: BidirectionalCollection {</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>// ...APIs for high-level string processing here...</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>var unicode: U // access to lower-level unicode details</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>}</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote 
type="cite"><span>typealias Substring = String<StringStorage.SubSequence></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>```</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>One advantage of such a design is that naïve users will always extend “the right</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>type” (`String`) without thinking, and the new APIs will show up on `Substring`,</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>`MyUTF8String`, etc. That said, it also has downsides that should not be</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>overlooked, not least of which is the confusability of the meaning of the word</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>“string.” Is it referring to the generic or the concrete type?</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>### `TextOutputStream` and `TextOutputStreamable`</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>`TextOutputStreamable` is intended to provide a vehicle for</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>efficiently transporting formatted representations to an output stream</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>without forcing the allocation of storage. 
Its use of `String`, a</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>type with multiple representations, at the lowest-level unit of</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>communication, conflicts with this goal. It might be sufficient to</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>change `TextOutputStream` and `TextOutputStreamable` to traffic in an</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>associated type conforming to `Unicode`, but that is not yet clear.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>This area will require some design work.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>### `description` and `debugDescription`</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>* Should these be creating localized or non-localized representations?</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>* Is returning a `String` efficient enough?</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>* Is `debugDescription` pulling the weight of the API surface area it adds?</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>### `StaticString`</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>`StaticString` was added as a byproduct of standard library 
development and kept</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>around because it seemed useful, but it was never truly *designed* for client</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>programmers. We need to decide what happens with it. Presumably *something*</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>should fill its role, and that should conform to `Unicode`.</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>## Footnotes</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span><b id="f0">0</b> The integers rewrite currently underway is expected to</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> substantially reduce the scope of `Int`'s API by using more</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span> generics. 
[↩](#a0)</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span><b id="f1">1</b> In practice, these semantics will usually be tied to the</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>version of the installed [ICU](<a href="http://icu-project.org">http://icu-project.org</a>) library, which</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>programmatically encodes the most complex rules of the Unicode Standard and its</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>de-facto extension, CLDR.[↩](#a1)</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span><b id="f2">2</b></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>See</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>[<a href="http://unicode.org/reports/tr29/#Notation">http://unicode.org/reports/tr29/#Notation</a>](<a href="http://unicode.org/reports/tr29/#Notation">http://unicode.org/reports/tr29/#Notation</a>). Note</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>that inserting Unicode scalar values to prevent merging of grapheme clusters would</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>also constitute a kind of misbehavior (one of the clusters at the boundary would</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>not be found in the result), so would be relatively costly to implement, with</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>little benefit. 
[↩](#a2)</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span><b id="f4">4</b> The use of non-UCA-compliant ordering is fully sanctioned by</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>the Unicode standard for this purpose. In fact there's</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>a [whole chapter](<a href="http://www.unicode.org/versions/Unicode9.0.0/ch05.pdf">http://www.unicode.org/versions/Unicode9.0.0/ch05.pdf</a>)</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>dedicated to it. In particular, §5.17 says:</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><span>When comparing text that is visible to end users, a correct linguistic sort</span><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><span>should be used, as described in _Section 5.16, Sorting and</span><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><span>Searching_. However, in many circumstances the only requirement is for a</span><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><span>fast, well-defined ordering. 
In such cases, a binary ordering can be used.</span><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>[↩](#a4)</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span><b id="f5">5</b> The queries supported by `NSCharacterSet` map directly onto</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>properties in a table that's indexed by Unicode scalar value. This table is</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>part of the Unicode standard. Some of these queries (e.g., “is this an</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>uppercase character?”) may have fairly obvious generalizations to grapheme</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>clusters, but exactly how to do it is a research topic and *ideally* we'd either</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>establish the existing practice that the Unicode committee would standardize, or</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>the Unicode committee would do the research and we'd implement their</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>result.[↩](#a5)</span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><span>_______________________________________________</span><br></blockquote></blockquote><blockquote type="cite"><blockquote 
type="cite"><span></span><br></blockquote></blockquote><blockquote type="cite"><span></span><br></blockquote></div></body></html>