[swift-evolution] Strings in Swift 4
Saagar Jha
saagar at saagarjha.com
Fri Jan 20 16:58:46 CST 2017
Comments inline.
Saagar Jha
> On Jan 20, 2017, at 12:53 PM, Dave Abrahams via swift-evolution <swift-evolution at swift.org> wrote:
>
>
> on Thu Jan 19 2017, Saagar Jha <swift-evolution at swift.org> wrote:
>
>> Looks pretty good in general from my quick glance–at least, it’s much
>> better than the current situation. I do have a couple of comments and
>> questions, which I’ve inlined below.
>>
>> Saagar Jha
>>
>>> On Jan 19, 2017, at 6:56 PM, Ben Cohen via swift-evolution <swift-evolution at swift.org> wrote:
>>>
>>> Hi all,
>>>
>>> Below is our take on a design manifesto for Strings in Swift 4 and beyond.
>>>
>>> Probably best read in rendered markdown on GitHub:
>>> https://github.com/apple/swift/blob/master/docs/StringManifesto.md
>>>
>>> We’re eager to hear everyone’s thoughts.
>>>
>>> Regards,
>>> Ben and Dave
>>>
>>>
>>> # String Processing For Swift 4
>>>
>>> * Authors: [Dave Abrahams](https://github.com/dabrahams), [Ben Cohen](https://github.com/airspeedswift)
>>>
>>> The goal of re-evaluating Strings for Swift 4 has been fairly ill-defined thus
>>> far, with just this short blurb in the
>>> [list of goals](https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160725/025676.html):
>>>
>>>> **String re-evaluation**: String is one of the most important fundamental
>>>> types in the language. The standard library leads have numerous ideas of how
>>>> to improve the programming model for it, without jeopardizing the goals of
>>>> providing a unicode-correct-by-default model. Our goal is to be better at
>>>> string processing than Perl!
>>>
>>> For Swift 4 and beyond we want to improve three dimensions of text processing:
>>>
>>> 1. Ergonomics
>>> 2. Correctness
>>> 3. Performance
>>>
>>> This document is meant to both provide a sense of the long-term vision
>>> (including undecided issues and possible approaches), and to define the scope of
>>> work that could be done in the Swift 4 timeframe.
>>>
>>> ## General Principles
>>>
>>> ### Ergonomics
>>>
>>> It's worth noting that ergonomics and correctness are mutually-reinforcing. An
>>> API that is easy to use—but incorrectly—cannot be considered an ergonomic
>>> success. Conversely, an API that's simply hard to use is also hard to use
>>> correctly. Acheiving optimal performance without compromising ergonomics or
>>> correctness is a greater challenge.
>>
>> Minor typo: acheiving->achieving
>>
>>> Consistency with the Swift language and idioms is also important for
>>> ergonomics. There are several places both in the standard library and in the
>>> foundation additions to `String` where patterns and practices found elsewhere
>>> could be applied to improve usability and familiarity.
>>>
>>> ### API Surface Area
>>>
>>> Primary data types such as `String` should have APIs that are easily understood
>>> given a signature and a one-line summary. Today, `String` fails that test. As
>>> you can see, the Standard Library and Foundation both contribute significantly to
>>> its overall complexity.
>>>
>>> **Method Arity** | **Standard Library** | **Foundation**
>>> ---|:---:|:---:
>>> 0: `ƒ()` | 5 | 7
>>> 1: `ƒ(:)` | 19 | 48
>>> 2: `ƒ(::)` | 13 | 19
>>> 3: `ƒ(:::)` | 5 | 11
>>> 4: `ƒ(::::)` | 1 | 7
>>> 5: `ƒ(:::::)` | - | 2
>>> 6: `ƒ(::::::)` | - | 1
>>>
>>> **API Kind** | **Standard Library** | **Foundation**
>>> ---|:---:|:---:
>>> `init` | 41 | 18
>>> `func` | 42 | 55
>>> `subscript` | 9 | 0
>>> `var` | 26 | 14
>>>
>>> **Total: 205 APIs**
>>>
>>> By contrast, `Int` has 80 APIs, none with more than two parameters.[0]
>>> String processing is complex enough; users shouldn't have to press
>>> through physical API sprawl just to get started.
>>>
>>> Many of the choices detailed below contribute to solving this problem,
>>> including:
>>>
>>> * Restoring `Collection` conformance and dropping the `.characters` view.
>>> * Providing a more general, composable slicing syntax.
>>> * Altering `Comparable` so that parameterized
>>> (e.g. case-insensitive) comparison fits smoothly into the basic syntax.
>>> * Clearly separating language-dependent operations on text produced
>>> by and for humans from language-independent
>>> operations on text produced by and for machine processing.
>>> * Relocating APIs that fall outside the domain of basic string processing and
>>> discouraging the proliferation of ad-hoc extensions.
>>>
>>>
>>> ### Batteries Included
>>>
>>> While `String` is available to all programs out-of-the-box, crucial APIs for
>>> basic string processing tasks are still inaccessible until `Foundation` is
>>> imported. While it makes sense that `Foundation` is needed for domain-specific
>>> jobs such as
>>> [linguistic tagging](https://developer.apple.com/reference/foundation/nslinguistictagger),
>>> one should not need to import anything to, for example, do case-insensitive
>>> comparison.
>>>
>>> ### Unicode Compliance and Platform Support
>>>
>>> The Unicode standard provides a crucial objective reference point for what
>>> constitutes correct behavior in an extremely complex domain, so
>>> Unicode-correctness is, and will remain, a fundamental design principle behind
>>> Swift's `String`. That said, the Unicode standard is an evolving document, so
>>> this objective reference-point is not fixed.[1] While
>>> many of the most important operations—e.g. string hashing, equality, and
>>> non-localized comparison—will be stable, the semantics
>>> of others, such as grapheme breaking and localized comparison and case
>>> conversion, are expected to change as platforms are updated, so programs should
>>> be written so their correctness does not depend on precise stability of these
>>> semantics across OS versions or platforms. Although it may be possible to
>>> imagine static and/or dynamic analysis tools that will help users find such
>>> errors, the only sure way to deal with this fact of life is to educate users.
>>>
>>> ## Design Points
>>>
>>> ### Internationalization
>>>
>>> There is strong evidence that developers cannot determine how to use
>>> internationalization APIs correctly. Although documentation could and should be
>>> improved, the sheer size, complexity, and diversity of these APIs is a major
>>> contributor to the problem, causing novices to tune out, and more experienced
>>> programmers to make avoidable mistakes.
>>>
>>> The first step in improving this situation is to regularize all localized
>>> operations as invocations of normal string operations with extra
>>> parameters. Among other things, this means:
>>>
>>> 1. Doing away with `localizedXXX` methods
>>> 2. Providing a terse way to name the current locale as a parameter
>>> 3. Automatically adjusting defaults for options such
>>> as case sensitivity based on whether the operation is localized.
>>> 4. Removing correctness traps like `localizedCaseInsensitiveCompare` (see
>>> guidance in the
>>> [Internationalization and Localization Guide](https://developer.apple.com/library/content/documentation/MacOSX/Conceptual/BPInternational/InternationalizingYourCode/InternationalizingYourCode.html)).
>>>
>>> Along with appropriate documentation updates, these changes will make localized
>>> operations more teachable, comprehensible, and approachable, thereby lowering a
>>> barrier that currently leads some developers to ignore localization issues
>>> altogether.
>>>
>>> #### The Default Behavior of `String`
>>>
>>> Although this isn't well-known, the most accessible forms of many operations on
>>> Swift `String` (and `NSString`) are really only appropriate for text that is
>>> intended to be processed for, and consumed by, machines. The semantics of the
>>> operations with the simplest spellings are always non-localized and
>>> language-agnostic.
>>>
>>> Two major factors play into this design choice:
>>>
>>> 1. Machine processing of text is important, so we should have first-class,
>>> accessible functions appropriate to that use case.
>>>
>>> 2. The most general localized operations require a locale parameter not required
>>> by their un-localized counterparts. This naturally skews complexity towards
>>> localized operations.
>>>
>>> Reaffirming that `String`'s simplest APIs have
>>> language-independent/machine-processed semantics has the benefit of clarifying
>>> the proper default behavior of operations such as comparison, and allows us to
>>> make [significant optimizations](#collation-semantics) that were previously
>>> thought to conflict with Unicode.
>>>
>>> #### Future Directions
>>>
>>> One of the most common internationalization errors is the unintentional
>>> presentation to users of text that has not been localized, but regularizing APIs
>>> and improving documentation can go only so far in preventing this error.
>>> Combined with the fact that `String` operations are non-localized by default,
>>> the environment for processing human-readable text may still be somewhat
>>> error-prone in Swift 4.
>>>
>>> For an audience of mostly non-experts, it is especially important that naïve
>>> code is very likely to be correct if it compiles, and that more sophisticated
>>> issues can be revealed progressively. For this reason, we intend to
>>> specifically and separately target localization and internationalization
>>> problems in the Swift 5 timeframe.
>>>
>>> ### Operations With Options
>>>
>>> There are three categories of common string operations that often need to be
>>> tuned in various dimensions:
>>>
>>> **Operation**|**Applicable Options**
>>> ---|---
>>> sort ordering | locale, case/diacritic/width-insensitivity
>>> case conversion | locale
>>> pattern matching | locale, case/diacritic/width-insensitivity
>>>
>>> The defaults for case-, diacritic-, and width-insensitivity are different for
>>> localized operations than for non-localized operations, so for example a
>>> localized sort should be case-insensitive by default, and a non-localized sort
>>> should be case-sensitive by default. We propose a standard “language” of
>>> defaulted parameters to be used for these purposes, with usage roughly like this:
>>>
>>> ```swift
>>> x.compared(to: y, case: .sensitive, in: swissGerman)
>>>
>>> x.lowercased(in: .currentLocale)
>>>
>>> x.allMatches(
>>> somePattern, case: .insensitive, diacritic: .insensitive)
>>> ```
>>>
>>> This usage might be supported by code like this:
>>>
>>> ```swift
>>> enum StringSensitivity {
>>> case sensitive
>>> case insensitive
>>> }
>>>
>>> extension Locale {
>>> static var currentLocale: Locale { ... }
>>> }
>>>
>>> extension Unicode {
>>> // An example of the option language in declaration context,
>>> // with nil defaults indicating unspecified, so defaults can be
>>> // driven by the presence/absence of a specific Locale
>>> func frobnicated(
>>> case caseSensitivity: StringSensitivity? = nil,
>>> diacritic diacriticSensitivity: StringSensitivity? = nil,
>>> width widthSensitivity: StringSensitivity? = nil,
>>> in locale: Locale? = nil
>>> ) -> Self { ... }
>>> }
>>> ```
>>
>> Any reason why Locale is defaulted to nil, instead of currentLocale?
>> It seems more useful to me.
>
> We're establishing a repeating pattern: string (and Unicode) operations
> are locale-insensitive by default, meaning the string is treated as
> machine-readable rather than human-readable text.
Makes sense.
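A minimal sketch of that defaulting rule, assuming nothing beyond the manifesto's `StringSensitivity` enum. `LocaleTag` and `effectiveCaseSensitivity` are hypothetical names made up for illustration (a stand-in for `Locale` keeps the snippet free of Foundation); this is not proposed API:

```swift
enum StringSensitivity {
    case sensitive
    case insensitive
}

// Hypothetical stand-in for Foundation's Locale.
struct LocaleTag {
    let identifier: String
}

// An explicit choice always wins. Otherwise the default follows the rule in
// the manifesto: localized operations are case-insensitive by default,
// non-localized operations are case-sensitive by default.
func effectiveCaseSensitivity(
    _ requested: StringSensitivity?,
    in locale: LocaleTag?
) -> StringSensitivity {
    if let requested = requested {
        return requested
    }
    return locale == nil ? .sensitive : .insensitive
}

let swissGerman = LocaleTag(identifier: "de_CH")
let machineDefault = effectiveCaseSensitivity(nil, in: nil)
let humanDefault = effectiveCaseSensitivity(nil, in: swissGerman)
let explicitChoice = effectiveCaseSensitivity(.sensitive, in: swissGerman)
```

Defaulting `locale` to `currentLocale` instead of `nil` would collapse the first two cases into one, which is why the `nil` default carries information here.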
>
>>> ### Comparing and Hashing Strings
>>>
>>> #### Collation Semantics
>>>
>>> What Unicode says about collation—which is used in `<`, `==`, and hashing— turns
>>> out to be quite interesting, once you pick it apart. The full Unicode Collation
>>> Algorithm (UCA) works like this:
>>>
>>> 1. Fully normalize both strings
>>> 2. Convert each string to a sequence of numeric triples to form a collation key
>>> 3. “Flatten” the key by concatenating the sequence of first elements to the
>>> sequence of second elements to the sequence of third elements
>>> 4. Lexicographically compare the flattened keys
>>>
>>> While step 1 can usually
>>> be [done quickly](http://unicode.org/reports/tr15/#Description_Norm) and
>>> incrementally, step 2 uses a collation table that maps matching *sequences* of
>>> unicode scalars in the normalized string to *sequences* of triples, which get
>>> accumulated into a collation key. Predictably, this is where the real costs
>>> lie.
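Steps 3 and 4 above can be sketched as follows, under heavy assumptions: `CollationElement`, `flattenedKey`, and `collatesBefore` are hypothetical names, and the weights are toy values rather than entries from a real collation table. Normalization, the table lookup of step 2, level separators, and ignorable weights are all left out:

```swift
// One "numeric triple" from step 2. The weights used below are invented.
struct CollationElement {
    let primary: UInt16
    let secondary: UInt16
    let tertiary: UInt16
}

// Step 3: all primary weights, then all secondary, then all tertiary.
func flattenedKey(_ elements: [CollationElement]) -> [UInt16] {
    var key: [UInt16] = []
    key += elements.map { $0.primary }
    key += elements.map { $0.secondary }
    key += elements.map { $0.tertiary }
    return key
}

// Step 4: lexicographic comparison of the flattened keys.
func collatesBefore(_ a: [CollationElement], _ b: [CollationElement]) -> Bool {
    return flattenedKey(a).lexicographicallyPrecedes(flattenedKey(b))
}

// Toy weights: case only differs at the tertiary level.
let lowerA = CollationElement(primary: 10, secondary: 1, tertiary: 1)
let upperA = CollationElement(primary: 10, secondary: 1, tertiary: 2)
let lowerB = CollationElement(primary: 11, secondary: 1, tertiary: 1)

let aa = [lowerA, lowerA]
let upperAb = [upperA, lowerB]
let aaFirst = collatesBefore(aa, upperAb)
```

Because every primary weight precedes every tertiary weight in the key, the letter difference in the second position decides the order before the case difference in the first position is even looked at.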
>>>
>>> *However*, there are some bright spots to this story. First, as it turns out,
>>> string sorting (localized or not) should be done down to what's called
>>> the
>>> [“identical” level](http://unicode.org/reports/tr10/#Multi_Level_Comparison),
>>> which adds a step 3a: append the string's normalized form to the flattened
>>> collation key. At first blush this just adds work, but consider what it does
>>> for equality: two strings that normalize the same, naturally, will collate the
>>> same. But also, *strings that normalize differently will always collate
>>> differently*. In other words, for equality, it is sufficient to compare the
>>> strings' normalized forms and see if they are the same. We can therefore
>>> entirely skip the expensive part of collation for equality comparison.
>>>
>>> Next, naturally, anything that applies to equality also applies to hashing: it
>>> is sufficient to hash the string's normalized form, bypassing collation keys.
>>> This should provide significant speedups over the current implementation.
>>> Perhaps more importantly, since comparison down to the “identical” level applies
>>> even to localized strings, it means that hashing and equality can be implemented
>>> exactly the same way for localized and non-localized text, and hash tables with
>>> localized keys will remain valid across current-locale changes.
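The observable side of this argument can be checked in Swift as it ships, where `==` and hashing on `String` already honor canonical equivalence. The snippet only shows the behavior; it says nothing about how the standard library implements it:

```swift
let precomposed = "caf\u{E9}"    // é as the single scalar U+00E9
let decomposed = "cafe\u{301}"   // e followed by U+0301 COMBINING ACUTE ACCENT

// Different scalar sequences, same normalized form.
let sameScalars = precomposed.unicodeScalars.elementsEqual(decomposed.unicodeScalars)
let sameText = precomposed == decomposed
let sameHash = precomposed.hashValue == decomposed.hashValue

// Both spellings therefore land on the same dictionary key.
var occurrences: [String: Int] = [:]
occurrences[precomposed, default: 0] += 1
occurrences[decomposed, default: 0] += 1
```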
>>>
>>> Finally, once it is agreed that the *default* role for `String` is to handle
>>> machine-generated and machine-readable text, the default ordering of `String`s
>>> need no longer use the UCA at all. It is sufficient to order them in any way
>>> that's consistent with equality, so `String` ordering can simply be a
>>> lexicographical comparison of normalized forms,[4]
>>> (which is equivalent to lexicographically comparing the sequences of grapheme
>>> clusters), again bypassing step 2 and offering another speedup.
>>>
>>> This leaves us executing the full UCA *only* for localized sorting, and ICU's
>>> implementation has apparently been very well optimized.
>>>
>>> Following this scheme everywhere would also allow us to make sorting behavior
>>> consistent across platforms. Currently, we sort `String` according to the UCA,
>>> except that—*only on Apple platforms*—pairs of ASCII characters are ordered by
>>> unicode scalar value.
>>>
>>> #### Syntax
>>>
>>> Because the current `Comparable` protocol expresses all comparisons with binary
>>> operators, string comparisons—which may require
>>> additional [options](#operations-with-options)—do not fit smoothly into the
>>> existing syntax. At the same time, we'd like to solve other problems with
>>> comparison, as outlined
>>> in
>>> [this proposal](https://gist.github.com/CodaFi/f0347bd37f1c407bf7ea0c429ead380e)
>>> (implemented by changes at the head
>>> of
>>> [this branch](https://github.com/CodaFi/swift/commits/space-the-final-frontier)).
>>> We should adopt a modification of that proposal that uses a method rather than
>>> an operator `<=>`:
>>
>> Why not both? Have the “UFO” operator, with the methods as support for
>> more complicated use cases where the sugar doesn’t hold up.
>
> Two reasons:
>
> 1. It's more API surface area for very little benefit
>
> 2. <,<=,==,>=, and > offer more than enough sugar. We don't see many
> circumstances where <=> would actually get used, and those few cases
> can live with the weight of x.compared(to:y).
Is this aimed at the UFO operator in general, or is this just for Strings?
>
>>> ```swift
>>> enum SortOrder { case before, same, after }
>>>
>>> protocol Comparable : Equatable {
>>> func compared(to: Self) -> SortOrder
>>> ...
>>> }
>>> ```
>>>
>>> This change will give us a syntactic platform on which to implement methods with
>>> additional, defaulted arguments, thereby unifying and regularizing comparison
>>> across the library.
>>>
>>> ```swift
>>> extension String {
>>> func compared(to: Self) -> SortOrder
>>>
>>> }
>>> ```
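A rough sketch of how the usual operators could fall out of a single `compared(to:)` requirement. `ThreeWayComparable` and `Version` are hypothetical names used here so the snippet does not collide with the real `Comparable`; the actual proposal may differ in its details:

```swift
enum SortOrder { case before, same, after }

// Hypothetical stand-in for the revised `Comparable`.
protocol ThreeWayComparable {
    func compared(to other: Self) -> SortOrder
}

// The operator sugar is derived once, from the single method.
extension ThreeWayComparable {
    static func < (lhs: Self, rhs: Self) -> Bool { return lhs.compared(to: rhs) == .before }
    static func > (lhs: Self, rhs: Self) -> Bool { return lhs.compared(to: rhs) == .after }
    static func <= (lhs: Self, rhs: Self) -> Bool { return lhs.compared(to: rhs) != .after }
    static func >= (lhs: Self, rhs: Self) -> Bool { return lhs.compared(to: rhs) != .before }
}

// Example conformer, invented for illustration.
struct Version: ThreeWayComparable {
    let major: Int
    let minor: Int

    func compared(to other: Version) -> SortOrder {
        if major != other.major { return major < other.major ? .before : .after }
        if minor != other.minor { return minor < other.minor ? .before : .after }
        return .same
    }
}

let older = Version(major: 1, minor: 2)
let newer = Version(major: 1, minor: 10)
```

A type that needs extra options can then add defaulted parameters to its own `compared(to:)` overload without touching the operators.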
>>>
>>> **Note:** `SortOrder` should bridge to `NSComparisonResult`. It's also possible
>>> that the standard library simply adopts Foundation's `ComparisonResult` as is,
>>> but we believe the community should at least consider alternate naming before
>>> that happens. There will be an opportunity to discuss the choices in detail
>>> when the modified
>>> [Comparison Proposal](https://gist.github.com/CodaFi/f0347bd37f1c407bf7ea0c429ead380e) comes
>>> up for review.
>>>
>>> ### `String` should be a `Collection` of `Character`s Again
>>>
>>> In Swift 2.0, `String`'s `Collection` conformance was dropped, because we
>>> convinced ourselves that its semantics differed from those of `Collection` too
>>> significantly.
>>>
>>> It was always well understood that if strings were treated as sequences of
>>> `UnicodeScalar`s, algorithms such as `lexicographicalCompare`, `elementsEqual`,
>>> and `reversed` would produce nonsense results. Thus, in Swift 1.0, `String` was
>>> a collection of `Character` (extended grapheme clusters). During 2.0
>>> development, though, we realized that correct string concatenation could
>>> occasionally merge distinct grapheme clusters at the start and end of combined
>>> strings.
>>>
>>> This quirk aside, every aspect of strings-as-collections-of-graphemes appears to
>>> comport perfectly with Unicode. We think the concatenation problem is tolerable,
>>> because the cases where it occurs all represent partially-formed constructs. The
>>> largest class—isolated combining characters such as ◌́ (U+0301 COMBINING ACUTE
>>> ACCENT)—are explicitly called out in the Unicode standard as
>>> “[degenerate](http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries)” or
>>> “[defective](http://www.unicode.org/versions/Unicode9.0.0/ch03.pdf)”. The other
>>> cases—such as a string ending in a zero-width joiner or half of a regional
>>> indicator—appear to be equally transient and unlikely outside of a text editor.
>>>
>>> Admitting these cases encourages exploration of grapheme composition and is
>>> consistent with what appears to be an overall Unicode philosophy that “no
>>> special provisions are made to get marginally better behavior for… cases that
>>> never occur in practice.”[2] Furthermore, it seems
>>> unlikely to disturb the semantics of any plausible algorithms. We can handle
>>> these cases by documenting them, explicitly stating that the elements of a
>>> `String` are an emergent property based on Unicode rules.
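The concatenation quirk is easy to reproduce in current Swift, where `String` is once again a collection of `Character`s. The variable names are only for illustration:

```swift
let base = "e"
let accent = "\u{301}"       // a lone combining acute accent: a "degenerate" cluster
let joined = base + accent   // the accent attaches to the preceding "e"

// Element counts are not additive across this concatenation.
let countBefore = base.count + accent.count
let countAfter = joined.count
```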
>>>
>>> The benefits of restoring `Collection` conformance are substantial:
>>>
>>> * Collection-like operations encourage experimentation with strings to
>>> investigate and understand their behavior. This is useful for teaching new
>>> programmers, but also good for experienced programmers who want to
>>> understand more about strings/unicode.
>>>
>>> * Extended grapheme clusters form a natural element boundary for Unicode
>>> strings. For example, searching and matching operations will always produce
>>> results that line up on grapheme cluster boundaries.
>>>
>>> * Character-by-character processing is a legitimate thing to do in many real
>>> use-cases, including parsing, pattern matching, and language-specific
>>> transformations such as transliteration.
>>>
>>> * `Collection` conformance makes a wide variety of powerful operations
>>> available that are appropriate to `String`'s default role as the vehicle for
>>> machine processed text.
>>>
>>> The methods `String` would inherit from `Collection`, where similar to
>>> higher-level string algorithms, have the right semantics. For example,
>>> grapheme-wise `lexicographicalCompare`, `elementsEqual`, and application of
>>> `flatMap` with case-conversion, produce the same results one would expect
>>> from whole-string ordering comparison, equality comparison, and
>>> case-conversion, respectively. `reverse` operates correctly on graphemes,
>>> keeping diacritics moored to their base characters and leaving emoji intact.
>>> Other methods such as `indexOf` and `contains` make obvious sense. A few
>>> `Collection` methods, like `min` and `max`, may not be particularly useful
>>> on `String`, but we don't consider that to be a problem worth solving, in
>>> the same way that we wouldn't try to suppress `min` and `max` on a
>>> `Set([UInt8])` that was used to store IP addresses.
>>>
>>> * Many of the higher-level operations that we want to provide for `String`s,
>>> such as parsing and pattern matching, should apply to any `Collection`, and
>>> many of the benefits we want for `Collections`, such
>>> as unified slicing, should accrue
>>> equally to `String`. Making `String` part of the same protocol hierarchy
>>> allows us to write these operations once and not worry about keeping the
>>> benefits in sync.
>>>
>>> * Slicing strings into substrings is a crucial part of the vocabulary of
>>> string processing, and all other sliceable things are `Collection`s.
>>> Because of its collection-like behavior, users naturally think of `String`
>>> in collection terms, but run into frustrating limitations where it fails to
>>> conform and are left to wonder where all the differences lie. Many simply
>>> “correct” this limitation by declaring a trivial conformance:
>>>
>>> ```swift
>>> extension String : BidirectionalCollection {}
>>> ```
>>>
>>> Even if we removed indexing-by-element from `String`, users could still do
>>> this:
>>>
>>> ```swift
>>> extension String : BidirectionalCollection {
>>> subscript(i: Index) -> Character { return characters[i] }
>>> }
>>> ```
>>>
>>> It would be much better to legitimize the conformance to `Collection` and
>>> simply document the oddity of any concatenation corner-cases, than to deny
>>> users the benefits on the grounds that a few cases are confusing.
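For a concrete picture of the grapheme-wise semantics described in the bullets above, here is a small example that runs on current Swift, where the conformance exists. It illustrates behavior only and makes no claim about the eventual design:

```swift
let word = "cafe\u{301}"   // "café" spelled with a combining accent

// Elements are graphemes, so the accent does not count separately.
let graphemeCount = word.count
let scalarCount = word.unicodeScalars.count

// Reversal keeps the accent moored to its base character.
let backwards = String(word.reversed())

// Element-wise search compares whole Characters.
let hasAccentedE = word.contains(Character("\u{E9}"))

// A plain loop over the characters.
var letters: [Character] = []
for character in word {
    letters.append(character)
}
```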
>>>
>>
>> Will String also conform to SequenceType?
>
> You mean Sequence, I presume (SequenceType is the old name). Every
> Collection is-a Sequence, so yes.
Forgot that this was renamed a couple months back :)
>
>> I’ve seen many users (coming from other languages) confused that they
>> can’t “just” loop over a String’s characters.
>>
>>> Note that the fact that `String` is a collection of graphemes does *not* mean
>>> that string operations will necessarily have to do grapheme boundary
>>> recognition. See the Unicode protocol section for details.
>>>
>>> ### `Character` and `CharacterSet`
>>>
>>> `Character`, which represents a
>>> Unicode
>>> [extended grapheme cluster](http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries),
>>> is a bit of a black box, requiring conversion to `String` in order to
>>> do any introspection, including interoperation with ASCII. To fix this, we should:
>>>
>>> - Add a `unicodeScalars` view much like `String`'s, so that the sub-structure
>>> of grapheme clusters is discoverable.
>>> - Add a failable `init` from sequences of scalars (returning nil for sequences
>>> that contain 0 or 2+ graphemes).
>>> - (Lower priority) expose some operations, such as `func uppercase() ->
>>> String`, `var isASCII: Bool`, and, to the extent they can be sensibly
>>> generalized, queries of unicode properties that should also be exposed on
>>> `UnicodeScalar` such as `isAlphabetic` and `isGraphemeBase` .
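The failable initializer in the second bullet could behave roughly like the free function below. `makeCharacter(from:)` is a hypothetical name, and going through `String` is just the simplest way to get grapheme breaking in a sketch, not a suggestion for the real implementation:

```swift
// Returns a Character only when the scalars form exactly one grapheme.
func makeCharacter<S: Sequence>(from scalars: S) -> Character?
    where S.Element == Unicode.Scalar
{
    var view = String.UnicodeScalarView()
    view.append(contentsOf: scalars)
    let text = String(view)
    // Zero graphemes or two and more: no single Character to return.
    return text.count == 1 ? text.first : nil
}

let accented = makeCharacter(from: ["e", "\u{301}"] as [Unicode.Scalar])
let empty = makeCharacter(from: [] as [Unicode.Scalar])
let tooMany = makeCharacter(from: ["a", "b"] as [Unicode.Scalar])
```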
>>>
>>> Despite its name, `CharacterSet` currently operates on the Swift `UnicodeScalar`
>>> type. This means it is usable on `String`, but only by going through the unicode
>>> scalar view. To deal with this clash in the short term, `CharacterSet` should be
>>> renamed to `UnicodeScalarSet`. In the longer term, it may be appropriate to
>>> introduce a `CharacterSet` that provides similar functionality for extended
>>> grapheme clusters.[5]
>>>
>>> ### Unification of Slicing Operations
>>>
>>> Creating substrings is a basic part of String processing, but the slicing
>>> operations that we have in Swift are inconsistent in both their spelling and
>>> their naming:
>>>
>>> * Slices with two explicit endpoints are done with subscript, and support
>>> in-place mutation:
>>>
>>> ```swift
>>> s[i..<j].mutate()
>>> ```
>>>
>>> * Slicing from an index to the end, or from the start to an index, is done
>>> with a method and does not support in-place mutation:
>>> ```swift
>>> s.prefix(upTo: i).readOnly()
>>> ```
>>>
>>> Prefix and suffix operations should be migrated to be subscripting operations
>>> with one-sided ranges i.e. `s.prefix(upTo: i)` should become `s[..<i]`, as
>>> in
>>> [this proposal](https://github.com/apple/swift-evolution/blob/9cf2685293108ea3efcbebb7ee6a8618b83d4a90/proposals/0132-sequence-end-ops.md).
>>> With generic subscripting in the language, that will allow us to collapse a wide
>>> variety of methods and subscript overloads into a single implementation, and
>>> give users an easy-to-use and composable way to describe subranges.
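One-sided range subscripts of this shape are available in later Swift releases, so the proposed spelling can be tried directly. `splitAtFirstComma` is a hypothetical helper written only for this example:

```swift
// Splits around the first comma using one-sided ranges instead of
// prefix(upTo:) and suffix(from:).
func splitAtFirstComma(_ text: String) -> (head: Substring, tail: Substring)? {
    guard let comma = text.firstIndex(of: ",") else { return nil }
    let head = text[..<comma]                    // start up to the comma
    let tail = text[text.index(after: comma)...] // after the comma to the end
    return (head, tail)
}

let parts = splitAtFirstComma("key,value")
let missing = splitAtFirstComma("no separator")
```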
>>>
>>> Further extending this EDSL to integrate use-cases like `s.prefix(maxLength: 5)`
>>> is an ongoing research project that can be considered part of the potential
>>> long-term vision of text (and collection) processing.
>>>
>>> ### Substrings
>>>
>>> When implementing substring slicing, languages are faced with three options:
>>>
>>> 1. Make the substrings the same type as string, and share storage.
>>> 2. Make the substrings the same type as string, and copy storage when making the substring.
>>> 3. Make substrings a different type, with a storage copy on conversion to string.
>>>
>>> We think number 3 is the best choice. A walk-through of the tradeoffs follows.
>>>
>>> #### Same type, shared storage
>>>
>>> In Swift 3.0, slicing a `String` produces a new `String` that is a view into a
>>> subrange of the original `String`'s storage. This is why `String` is 3 words in
>>> size (the start, length and buffer owner), unlike the similar `Array` type
>>> which is only one.
>>>
>>> This is a simple model with big efficiency gains when chopping up strings into
>>> multiple smaller strings. But it does mean that a stored substring keeps the
>>> entire original string buffer alive even after it would normally have been
>>> released.
>>>
>>> This arrangement has proven to be problematic in other programming languages,
>>> because applications sometimes extract small strings from large ones and keep
>>> those small strings long-term. That is considered a memory leak and was enough
>>> of a problem in Java that they changed from substrings sharing storage to
>>> making a copy in 1.7.
>>>
>>> #### Same type, copied storage
>>>
>>> Copying of substrings is also the choice made in C#, and in the default
>>> `NSString` implementation. This approach avoids the memory leak issue, but has
>>> obvious performance overhead in performing the copies.
>>>
>>> This in turn encourages trafficking in string/range pairs instead of in
>>> substrings, for performance reasons, leading to API challenges. For example:
>>>
>>> ```swift
>>> foo.compare(bar, range: start..<end)
>>> ```
>>>
>>> Here, it is not clear whether `range` applies to `foo` or `bar`. This
>>> relationship is better expressed in Swift as a slicing operation:
>>>
>>> ```swift
>>> foo[start..<end].compare(bar)
>>> ```
>>>
>>> Not only does this clarify to which string the range applies, it also brings
>>> this sub-range capability to any API that operates on `String` "for free". So
>>> these other combinations also work equally well:
>>>
>>> ```swift
>>> // apply range on argument rather than target
>>> foo.compare(bar[start..<end])
>>> // apply range on both
>>> foo[start..<end].compare(bar[start1..<end1])
>>> // compare two strings ignoring first character
>>> foo.dropFirst().compare(bar.dropFirst())
>>> ```
>>>
>>> In all three cases, an explicit range argument need not appear on the `compare`
>>> method itself. The implementation of `compare` does not need to know anything
>>> about ranges. Methods need only take range arguments when that is an
>>> integral part of their purpose (for example, setting the start and end of a
>>> user's current selection in a text box).
>>>
>>> #### Different type, shared storage
>>>
>>> The desire to share underlying storage while preventing accidental memory leaks
>>> occurs with slices of `Array`. For this reason we have an `ArraySlice` type.
>>> The inconvenience of a separate type is mitigated by most operations used on
>>> `Array` from the standard library being generic over `Sequence` or `Collection`.
>>>
>>> We should apply the same approach for `String` by introducing a distinct
>>> `SubSequence` type, `Substring`. Similar advice given for `ArraySlice` would apply to `Substring`:
>>>
>>>> Important: Long-term storage of `Substring` instances is discouraged. A
>>>> substring holds a reference to the entire storage of a larger string, not
>>>> just to the portion it presents, even after the original string's lifetime
>>>> ends. Long-term storage of a `Substring` may therefore prolong the lifetime
>>>> of large strings that are no longer otherwise accessible, which can appear
>>>> to be memory leakage.
>>>
>>> When assigning a `Substring` to a longer-lived variable (usually a stored
>>> property) explicitly of type `String`, a type conversion will be performed, and
>>> at this point the substring buffer is copied and the original string's storage
>>> can be released.
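A small sketch of the storage pattern described above. It spells the conversion out as `String(...)`, which is what compiles in current Swift; the implicit conversion discussed in this section is not assumed. `Person` and `firstField(of:)` are hypothetical:

```swift
struct Person {
    var name: String   // long-lived storage holds a String, not a Substring
}

// Returns a view into `line`; no characters are copied here.
func firstField(of line: String) -> Substring {
    return line.prefix(while: { $0 != "," })
}

let line = "Ada Lovelace,1815,London"
let field = firstField(of: line)

// Converting to String copies just the field, so the stored value
// does not keep the whole line's buffer alive.
let person = Person(name: String(field))
```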
>>>
>>> A `String` that was not its own `Substring` could be one word—a single tagged
>>> pointer—without requiring additional allocations. `Substring`s would be a view
>>> onto a `String`, so are 3 words - pointer to owner, pointer to start, and a
>>> length. The small string optimization for `Substring` would take advantage of
>>> the larger size, probably with a less compressed encoding for speed.
>>>
>>> The downside of having two types is the inconvenience of sometimes having a
>>> `Substring` when you need a `String`, and vice-versa. It is likely this would
>>> be a significantly bigger problem than with `Array` and `ArraySlice`, as
>>> slicing of `String` is such a common operation. It is especially relevant to
>>> existing code that assumes `String` is the currency type. To ease the pain of
>>> type mismatches, `Substring` should be a subtype of `String` in the same way
>>> that `Int` is a subtype of `Optional<Int>`. This would give users an implicit
>>> conversion from `Substring` to `String`, as well as the usual implicit
>>> conversions such as `[Substring]` to `[String]` that other subtype
>>> relationships receive.
>>>
>>> In most cases, type inference combined with the subtype relationship should
>>> make the type difference a non-issue and users will not care which type they
>>> are using. For flexibility and optimizability, most operations from the
>>> standard library will traffic in generic models of
>>> [`Unicode`](#the--code-unicode--code--protocol).
>>>
>>> ##### Guidance for API Designers
>>>
>>> In this model, **if a user is unsure about which type to use, `String` is always
>>> a reasonable default**. A `Substring` passed where `String` is expected will be
>>> implicitly copied. When compared to the “same type, copied storage” model, we
>>> have effectively deferred the cost of copying from the point where a substring
>>> is created until it must be converted to `String` for use with an API.
>>>
>>> A user who needs to optimize away copies altogether should use this guideline:
>>> if for performance reasons you are tempted to add a `Range` argument to your
>>> method as well as a `String` to avoid unnecessary copies, you should instead
>>> use `Substring`.
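As an illustration of that guideline, here is a function that takes a `Substring` where a `String` plus `Range` pair might otherwise have been tempting. `vowelCount(in:)` is a made-up example; the slices are written with the range subscripts available in current Swift:

```swift
// The caller decides which part of the text to look at by slicing;
// the function itself knows nothing about ranges.
func vowelCount(in text: Substring) -> Int {
    let vowels: Set<Character> = ["a", "e", "i", "o", "u"]
    return text.filter { vowels.contains($0) }.count
}

let title = "swift evolution"
let space = title.firstIndex(of: " ")!

let inFirstWord = vowelCount(in: title[..<space])   // slice, no copy
let inWholeTitle = vowelCount(in: title[...])       // the entire string as a Substring
```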
>>>
>>> ##### The “Empty Subscript”
>>>
>>> To make it easy to call such an optimized API when you only have a `String` (or
>>> to call any API that takes a `Collection`'s `SubSequence` when all you have is
>>> the `Collection`), we propose the following “empty subscript” operation,
>>>
>>> ```swift
>>> extension Collection {
>>>   subscript() -> SubSequence {
>>>     return self[startIndex..<endIndex]
>>>   }
>>> }
>>> ```
>>>
>>> which allows the following usage:
>>>
>>> ```swift
>>> funcThatIsJustLooking(at: person.name[]) // pass person.name as Substring
>>> ```
>>>
>>> The `[]` syntax can be offered as a fixit when needed, similar to `&` for an
>>> `inout` argument. While it doesn't help a user to convert `[String]` to
>>> `[Substring]`, the need for such conversions is extremely rare, and they can be
>>> done with a simple `map` (which could also be offered by a fixit):
>>>
>>> ```swift
>>> takesAnArrayOfSubstring(arrayOfString.map { $0[] })
>>> ```
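To check the call-site ergonomics, here is a minimal sketch. It assumes an unbounded-range subscript spelled `[...]` is available as a stand-in for the proposed `[]`, and `countVowels(in:)` is a hypothetical function that only needs to look at the text:

```swift
// Hypothetical "just looking" API: it takes Substring, so a caller holding a
// slice passes it directly and a caller holding a String slices the whole thing.
func countVowels(in text: Substring) -> Int {
    return text.filter { "aeiou".contains($0) }.count
}

let name = "Saagar"

// Whole string viewed as a Substring (stand-in for the proposed `name[]`).
let whole = countVowels(in: name[...])

// An existing slice goes through the same entry point.
let tail = countVowels(in: name.dropFirst(3))

// The array case from above, spelled with the stand-in subscript.
let names = ["ab", "cd"]
let slices: [Substring] = names.map { $0[...] }
```

If the real spelling ends up being `[]`, only the call sites change; the function signature stays the same.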
>>>
>>> #### Other Options Considered
>>>
>>> As we have seen, all three options above have downsides, but it's possible
>>> these downsides could be eliminated/mitigated by the compiler. We are proposing
>>> one such mitigation, implicit conversion, as part of the "different type,
>>> shared storage" option, to help avoid the cognitive load on developers of
>>> having to deal with a separate `Substring` type.
>>>
>>> To avoid the memory leak issues of a "same type, shared storage" substring
>>> option, we considered whether the compiler could perform an implicit copy of
>>> the underlying storage when it detects the string is being "stored" for long
>>> term usage, say when it is assigned to a stored property. The trouble with this
>>> approach is it is very difficult for the compiler to distinguish between
>>> long-term storage versus short-term in the case of abstractions that rely on
>>> stored properties. For example, should the storing of a substring inside an
>>> `Optional` be considered long-term? Or the storing of multiple substrings
>>> inside an array? The latter would not work well in the case of a
>>> `components(separatedBy:)` implementation that intended to return an array of
>>> substrings. It would also be difficult to distinguish intentional medium-term
>>> storage of substrings, say by a lexer. There does not appear to be an effective
>>> consistent rule that could be applied in the general case for detecting when a
>>> substring is truly being stored long-term.
>>>
>>> To avoid the cost of copying substrings under "same type, copied storage", the
>>> optimizer could be enhanced to reduce the impact of some of those copies.
>>> For example, this code could be optimized to pull the invariant substring out
>>> of the loop:
>>>
>>> ```swift
>>> for _ in 0..<lots {
>>>   someFunc(takingString: bigString[bigRange])
>>> }
>>> ```
>>>
>>> It's worth noting that a similar optimization is needed to avoid an equivalent
>>> problem with implicit conversion in the "different type, shared storage" case:
>>>
>>> ```swift
>>> let substring = bigString[bigRange]
>>> for _ in 0..<lots { someFunc(takingString: substring) }
>>> ```
>>>
>>> However, in the case of "same type, copied storage" there are many use cases
>>> that cannot be optimized as easily. Consider the following simple definition of
>>> a recursive `contains` algorithm, which when substring slicing is linear makes
>>> the overall algorithm quadratic:
>>>
>>> ```swift
>>> extension String {
>>>   func containsChar(_ x: Character) -> Bool {
>>>     return !isEmpty && (first == x || dropFirst().containsChar(x))
>>>   }
>>> }
>>> ```
>>>
>>> For the optimizer to eliminate this problem is unrealistic, forcing the user to
>>> remember to optimize the code to not use string slicing if they want it to be
>>> efficient (assuming they remember):
>>>
>>> ```swift
>>> extension String {
>>>   // add optional argument tracking progress through the string
>>>   func containsCharacter(_ x: Character, atOrAfter idx: Index? = nil) -> Bool {
>>>     let idx = idx ?? startIndex
>>>     return idx != endIndex
>>>       && (self[idx] == x || containsCharacter(x, atOrAfter: index(after: idx)))
>>>   }
>>> }
>>> ```
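For comparison, a sketch of how the simple recursive version might look under the "different type, shared storage" model. It assumes slices share storage with their base, so `dropFirst()` on a `Substring` does not copy; the `[...]` subscript used to get a whole-string slice is also an assumption:

```swift
extension Substring {
    // Same recursion as the first version above, but written on the slice
    // type: dropFirst() returns another Substring over the same storage.
    func containsChar(_ x: Character) -> Bool {
        return !isEmpty && (first == x || dropFirst().containsChar(x))
    }
}

extension String {
    // Entry point for callers holding a String: slice once, then recurse.
    func containsChar(_ x: Character) -> Bool {
        return self[...].containsChar(x)
    }
}

let found = "hello".containsChar("l")
let missing = "hello".containsChar("z")
```

The user never has to thread an index through the recursion by hand, which is the ergonomic win the manifesto is after.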
>>>
>>> #### Substrings, Ranges and Objective-C Interop
>>>
>>> The pattern of passing a string/range pair is common in several Objective-C
>>> APIs, and is made especially awkward in Swift by the non-interchangeability of
>>> `Range<String.Index>` and `NSRange`.
>>>
>>> ```swift
>>> s2.find(s2, sourceRange: NSRange(j..<s2.endIndex, in: s2))
>>> ```
>>>
>>> In general, however, the Swift idiom for operating on a sub-range of a
>>> `Collection` is to *slice* the collection and operate on that:
>>>
>>> ```swift
>>> s2.find(s2[j..<s2.endIndex])
>>> ```
>>>
>>> Therefore, APIs that operate on an `NSString`/`NSRange` pair should be imported
>>> without the `NSRange` argument. The Objective-C importer should be changed to
>>> give these APIs special treatment so that when a `Substring` is passed, instead
>>> of being converted to a `String`, the full `NSString` and range are passed to
>>> the Objective-C method, thereby avoiding a copy.
>>>
>>> As a result, you would never need to pass an `NSRange` to these APIs, which
>>> solves the impedance problem by eliminating the argument, resulting in more
>>> idiomatic Swift code while retaining the performance benefit. To help users
>>> manually handle any cases that remain, Foundation should be augmented to allow
>>> the following syntax for converting to and from `NSRange`:
>>>
>>> ```swift
>>> let nsr = NSRange(i..<j, in: s) // An NSRange corresponding to s[i..<j]
>>> let iToJ = Range(nsr, in: s) // Equivalent to i..<j
>>> ```
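A small round trip through the two conversions, as a sketch that assumes Foundation provides both initializers with exactly the spellings shown above:

```swift
import Foundation

let s = "hello world"
let i = s.startIndex
let j = s.index(i, offsetBy: 5)

// String range to NSRange (UTF-16 location and length).
let nsr = NSRange(i..<j, in: s)

// NSRange back to a String range; optional because the NSRange might not
// fall inside `s`.
let iToJ = Range(nsr, in: s)
```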
>>>
>>> ### The `Unicode` protocol
>>>
>>> With `Substring` and `String` being distinct types and sharing almost all
>>> interface and semantics, and with the highest-performance string processing
>>> requiring knowledge of encoding and layout that the currency types can't
>>> provide, it becomes important to capture the common “string API” in a protocol.
>>> Since Unicode conformance is a key feature of string processing in swift, we
>>> call that protocol `Unicode`:
>>
>> Another minor typo: capitalize “Swift”
>>
>>>
>>> **Note:** The following assumes several features that are planned but not yet implemented in
>>> Swift, and should be considered a sketch rather than a final design.
>>>
>>> ```swift
>>> protocol Unicode
>>> : Comparable, BidirectionalCollection where Element == Character {
>>>
>>> associatedtype Encoding : UnicodeEncoding
>>> var encoding: Encoding { get }
>>>
>>> associatedtype CodeUnits
>>> : RandomAccessCollection where Element == Encoding.CodeUnit
>>> var codeUnits: CodeUnits { get }
>>>
>>> associatedtype UnicodeScalars
>>> : BidirectionalCollection where Element == UnicodeScalar
>>> var unicodeScalars: UnicodeScalars { get }
>>>
>>> associatedtype ExtendedASCII
>>> : BidirectionalCollection where Element == UInt32
>>> var extendedASCII: ExtendedASCII { get }
>>> }
>>>
>>> extension Unicode {
>>> // ... define high-level non-mutating string operations, e.g. search ...
>>>
>>> func compared<Other: Unicode>(
>>> to rhs: Other,
>>> case caseSensitivity: StringSensitivity? = nil,
>>> diacritic diacriticSensitivity: StringSensitivity? = nil,
>>> width widthSensitivity: StringSensitivity? = nil,
>>> in locale: Locale? = nil
>>> ) -> SortOrder { ... }
>>> }
>>>
>>> extension Unicode : RangeReplaceableCollection where CodeUnits :
>>> RangeReplaceableCollection {
>>> // Satisfy protocol requirement
>>> mutating func replaceSubrange<C : Collection>(_: Range<Index>, with: C)
>>> where C.Element == Element
>>>
>>> // ... define high-level mutating string operations, e.g. replace ...
>>> }
>>>
>>> ```
>>>
>>> The goal is that `Unicode` exposes the underlying encoding and code units in
>>> such a way that for types with a known representation (e.g. a high-performance
>>> `UTF8String`) that information can be known at compile-time and can be used to
>>> generate a single path, while still allowing types like `String` that admit
>>> multiple representations to use runtime queries and branches to fast path
>>> specializations.
>>>
>>> **Note:** `Unicode` would make a fantastic namespace for much of
>>> what's in this proposal if we could get the ability to nest types and
>>> protocols in protocols.
>>>
>>>
>>> ### Scanning, Matching, and Tokenization
>>>
>>> #### Low-Level Textual Analysis
>>>
>>> We should provide convenient APIs for processing strings by character. For example,
>>> it should be easy to cleanly express, “if this string starts with `"f"`, process
>>> the rest of the string as follows…” Swift is well-suited to expressing this
>>> common pattern beautifully, but we need to add the APIs. Here are two examples
>>> of the sort of code that might be possible given such APIs:
>>>
>>> ```swift
>>> if let firstLetter = input.droppingPrefix(alphabeticCharacter) {
>>> somethingWith(input) // process the rest of input
>>> }
>>>
>>> if let (number, restOfInput) = input.parsingPrefix(Int.self) {
>>> ...
>>> }
>>> ```
>>>
>>> The specific spelling and functionality of APIs like this are TBD. The larger
>>> point is to make sure matching-and-consuming jobs are well-supported.
>>>
>>
>> +100, this kind of work is currently quite painful in Swift. Looking forward to seeing this
>> implemented!
>>
>>> #### Unified Pattern Matcher Protocol
>>>
>>> Many of the current methods that do matching are overloaded to do the same
>>> logical operations in different ways, with the following axes:
>>>
>>> - Logical Operation: `find`, `split`, `replace`, match at start
>>> - Kind of pattern: `CharacterSet`, `String`, a regex, a closure
>>> - Options, e.g. case/diacritic sensitivity, locale. Sometimes a part of
>>> the method name, and sometimes an argument
>>> - Whole string or subrange.
>>>
>>> We should represent these aspects as orthogonal, composable components,
>>> abstracting pattern matchers into a protocol like
>>> [this one](https://github.com/apple/swift/blob/master/test/Prototypes/PatternMatching.swift#L33),
>>> that can allow us to define logical operations once, without introducing
>>> overloads, and massively reducing API surface area.
>>>
>>> For example, using the strawman prefix `%` syntax to turn string literals into
>>> patterns, the following pairs would all invoke the same generic methods:
>>>
>>> ```swift
>>> if let found = s.firstMatch(%"searchString") { ... }
>>> if let found = s.firstMatch(someRegex) { ... }
>>>
>>> for m in s.allMatches((%"searchString"), case: .insensitive) { ... }
>>> for m in s.allMatches(someRegex) { ... }
>>>
>>> let items = s.split(separatedBy: ", ")
>>> let tokens = s.split(separatedBy: CharacterSet.whitespace)
>>> ```
>>>
>>> Note that, because Swift requires the indices of a slice to match the indices of
>>> the range from which it was sliced, operations like `firstMatch` can return a
>>> `Substring?` in lieu of a `Range<String.Index>?`: the indices of the match in
>>> the string being searched, if needed, can easily be recovered as the
>>> `startIndex` and `endIndex` of the `Substring`.
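The index-sharing property this relies on can be sketched with plain slicing. Here `firstIndex(of:)` plus a range subscript stands in for the hypothetical `firstMatch`, and the names are illustrative only:

```swift
let text = "key: value"
var recovered: Range<String.Index>? = nil

if let colon = text.firstIndex(of: ":") {
    // Stand-in for a match result: a Substring of `text`.
    let match = text[colon...]

    // The slice's own bounds are valid positions in the original string,
    // so the range of the match can be rebuilt from the Substring alone.
    recovered = match.startIndex..<match.endIndex
}
```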
>>>
>>> Note also that matching operations are useful for collections in general, and
>>> would fall out of this proposal:
>>>
>>> ```swift
>>> // replace subsequences of contiguous NaNs with zero
>>> forces.replace(oneOrMore([Float.nan]), [0.0])
>>> ```
>>>
>>> #### Regular Expressions
>>>
>>> Addressing regular expressions is out of scope for this proposal.
>>> That said, it is important to note that the pattern matching protocol mentioned
>>> above provides a suitable foundation for regular expressions, and types such as
>>> `NSRegularExpression` can easily be retrofitted to conform to it. In the
>>> future, support for regular expression literals in the compiler could allow for
>>> compile-time syntax checking and optimization.
>>>
>>> ### String Indices
>>>
>>> `String` currently has four views—`characters`, `unicodeScalars`, `utf8`, and
>>> `utf16`—each with its own opaque index type. The APIs used to translate indices
>>> between views add needless complexity, and the opacity of indices makes them
>>> difficult to serialize.
>>>
>>> The index translation problem has two aspects:
>>>
>>> 1. `String` views cannot consume one another's indices without a cumbersome
>>> conversion step. An index into a `String`'s `characters` must be translated
>>> before it can be used as a position in its `unicodeScalars`. Although these
>>> translations are rarely needed, they add conceptual and API complexity.
>>> 2. Many APIs in the core libraries and other frameworks still expose `String`
>>> positions as `Int`s and regions as `NSRange`s, which can only reference a
>>> `utf16` view and interoperate poorly with `String` itself.
>>>
>>> #### Index Interchange Among Views
>>>
>>> String's need for flexible backing storage and reasonably-efficient indexing
>>> (i.e. without dynamically allocating and reference-counting the indices
>>> themselves) means indices need an efficient underlying storage type. Although
>>> we do not wish to expose `String`'s indices *as* integers, `Int` offsets into
>>> underlying code unit storage make a good underlying storage type, provided
>>> `String`'s underlying storage supports random-access. We think random-access
>>> *code-unit storage* is a reasonable requirement to impose on all `String`
>>> instances.
>>>
>>> Making these `Int` code unit offsets conveniently accessible and constructible
>>> solves the serialization problem:
>>>
>>> ```swift
>>> clipboard.write(s.endIndex.codeUnitOffset)
>>> let offset = clipboard.read(Int.self)
>>> let i = String.Index(codeUnitOffset: offset)
>>> ```
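A sketch of the same round trip, assuming a toolchain that spells the offset accessors as `utf16Offset(in:)` and `String.Index(utf16Offset:in:)`. Those names are an assumption here and are not the `codeUnitOffset` spelling proposed above:

```swift
// Escapes keep the code units unambiguous regardless of source-file encoding.
let s = "na\u{EF}ve caf\u{E9}"

// Serialize a position as a plain Int.
let offset = s.endIndex.utf16Offset(in: s)

// Later: rebuild an index for the same string from that Int.
let restored = String.Index(utf16Offset: offset, in: s)
let prefix = String(s[..<restored])
```

The integer is only meaningful together with the string (and encoding) it was taken from, which is worth keeping in mind for anything written to disk.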
>>>
>>> Index interchange between `String` and its `unicodeScalars`, `codeUnits`,
>>> and [`extendedASCII`](#parsing-ascii-structure) views can be made entirely
>>> seamless by having them share an index type (semantics of indexing a `String`
>>> between grapheme cluster boundaries are TBD—it can either trap or be forgiving).
>>> Having a common index allows easy traversal into the interior of graphemes,
>>> something that is often needed, without making it likely that someone will do it
>>> by accident.
>>>
>>> - `String.index(after:)` should advance to the next grapheme, even when the
>>> index points partway through a grapheme.
>>>
>>> - `String.index(before:)` should move to the start of the grapheme before
>>> the current position.
>>>
>>> Seamless index interchange between `String` and its UTF-8 or UTF-16 views is not
>>> crucial, as the specifics of encoding should not be a concern for most use
>>> cases, and would impose needless costs on the indices of other views. That
>>> said, we can make translation much more straightforward by exposing simple
>>> bidirectional converting `init`s on both index types:
>>>
>>> ```swift
>>> let u8Position = String.UTF8.Index(someStringIndex)
>>> let originalPosition = String.Index(u8Position)
>>> ```
>>>
>>> #### Index Interchange with Cocoa
>>>
>>> We intend to address `NSRange`s that denote substrings in Cocoa APIs as
>>> described [earlier in this document](#substrings--ranges-and-objective-c-interop).
>>> That leaves the interchange of bare indices with Cocoa APIs trafficking in
>>> `Int`. Hopefully such APIs will be rare, but when needed, the following
>>> extension, which would be useful for all `Collection`s, can help:
>>>
>>> ```swift
>>> extension Collection {
>>>   func index(offset: IndexDistance) -> Index {
>>>     return index(startIndex, offsetBy: offset)
>>>   }
>>>   func offset(of i: Index) -> IndexDistance {
>>>     return distance(from: startIndex, to: i)
>>>   }
>>> }
>>> ```
>>>
>>> Then integers can easily be translated into offsets into a `String`'s `utf16`
>>> view for consumption by Cocoa:
>>>
>>> ```swift
>>> let cocoaIndex = s.utf16.offset(of: String.UTF16Index(i))
>>> let swiftIndex = s.utf16.index(offset: cocoaIndex)
>>> ```
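A self-contained sketch of that translation. It assumes `Int` in place of `IndexDistance`, and it assumes `String` and its `utf16` view already share an index type, so no explicit index conversion appears:

```swift
extension Collection {
    // Position reached by walking `offset` elements from the start.
    func index(offset: Int) -> Index {
        return index(startIndex, offsetBy: offset)
    }
    // Number of elements between the start and `i`.
    func offset(of i: Index) -> Int {
        return distance(from: startIndex, to: i)
    }
}

// "café", a space, an emoji (two UTF-16 code units), then "!".
let s = "caf\u{E9} \u{1F600}!"
let bang = s.firstIndex(of: "!")!

// Swift index to Cocoa-style UTF-16 offset, and back again.
let cocoaIndex = s.utf16.offset(of: bang)
let swiftIndex = s.utf16.index(offset: cocoaIndex)

// For contrast: the offset counted in Characters.
let characterOffset = s.offset(of: bang)
```

The emoji is why the two offsets differ, which is exactly the mismatch that makes passing raw `Int`s to Cocoa error-prone.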
>>>
>>> ### Formatting
>>>
>>> A full treatment of formatting is out of scope of this proposal, but
>>> we believe it's crucial for completing the text processing picture. This
>>> section details some of the existing issues and thinking that may guide future
>>> development.
>>>
>>> #### Printf-Style Formatting
>>>
>>> `String.format` is designed on the `printf` model: it takes a format string with
>>> textual placeholders for substitution, and an arbitrary list of other arguments.
>>> The syntax and meaning of these placeholders has a long history in
>>> C, but for anyone who doesn't use them regularly they are cryptic and complex,
>>> as the `printf (3)` man page attests.
>>>
>>> Aside from complexity, this style of API has two major problems: First, the
>>> spelling of these placeholders must match up to the types of the arguments, in
>>> the right order, or the behavior is undefined. Some limited support for
>>> compile-time checking of this correspondence could be implemented, but only for
>>> the cases where the format string is a literal. Second, there's no reasonable
>>> way to extend the formatting vocabulary to cover the needs of new types: you are
>>> stuck with what's in the box.
>>>
>>> #### Foundation Formatters
>>>
>>> The formatters supplied by Foundation are highly capable and versatile, offering
>>> both formatting and parsing services. When used for formatting, though, the
>>> design pattern demands more from users than it should:
>>>
>>> * Matching the type of data being formatted to a formatter type
>>> * Creating an instance of that type
>>> * Setting stateful options (`currency`, `dateStyle`) on the type. Note: the
>>> need for this step prevents the instance from being used and discarded in
>>> the same expression where it is created.
>>> * Overall, introduction of needless verbosity into source
>>>
>>> These may seem like small issues, but the experience of Apple localization
>>> experts is that the total drag of these factors on programmers is such that they
>>> tend to reach for `String.format` instead.
>>>
>>> #### String Interpolation
>>>
>>> Swift string interpolation provides a user-friendly alternative to printf's
>>> domain-specific language (just write ordinary Swift code!) and its type safety
>>> problems (put the data right where it belongs!) but the following issues prevent
>>> it from being useful for localized formatting (among other jobs):
>>>
>>> * [SR-2303](https://bugs.swift.org/browse/SR-2303) We are unable to restrict
>>> types used in string interpolation.
>>> * [SR-1260](https://bugs.swift.org/browse/SR-1260) String interpolation can't
>>> distinguish (fragments of) the base string from the string substitutions.
>>>
>>> In the long run, we should improve Swift string interpolation to the point where
>>> it can participate in most any formatting job. Mostly this centers around
>>> fixing the interpolation protocols per the previous item, and supporting
>>> localization.
>>>
>>> To be able to use formatting effectively inside interpolations, it needs to be
>>> both lightweight (because it all happens in-situ) and discoverable. One
>>> approach would be to standardize on `format` methods, e.g.:
>>>
>>> ```swift
>>> "Column 1: \(n.format(radix:16, width:8)) *** \(message)"
>>>
>>> "Something with leading zeroes: \(x.format(fill: zero, width:8))"
>>> ```
>>
>> Another thing that might limit adoption is the verbosity of this
>> format. It works fine if I need to print one or two things, but it
>> gets unwieldy very quickly.
>
> I'd like to see examples of the sorts of uses you're concerned about.
For example, I have a command line tool that draws a “table”, and includes lines that look something like this:
print("Total: | \(String(format: "%8d | %8d | %8.4f | %8d | %8.4f |", results, searched, elapsedTime, mean, deviation))")
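To make the concern concrete, here is a rough sketch of the same row written with per-value formatting inside the interpolation. `column(_:width:)` is a hypothetical helper standing in for the proposed `format` methods, not an existing API:

```swift
// Hypothetical helper: right-align a value's description in a fixed-width
// column by padding on the left with spaces.
func column<T>(_ value: T, width: Int = 8) -> String {
    let text = String(describing: value)
    let padding = String(repeating: " ", count: max(0, width - text.count))
    return padding + text
}

let results = 42
let searched = 1000

// Two columns are fine; five of these on one line is where it gets unwieldy.
let line = "Total: | \(column(results)) | \(column(searched)) |"
print(line)
```

Every column needs its own interpolation segment and call, whereas the printf-style string states the whole layout once.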
>
>>> ### C String Interop
>>>
>>> Our support for interoperation with nul-terminated C strings is scattered and
>>> incoherent, with six ways to transform a C string into a `String` and four ways to
>>> do the inverse. These APIs should be replaced with the following:
>>>
>>> ```swift
>>> extension String {
>>> /// Constructs a `String` having the same contents as `nulTerminatedUTF8`.
>>> ///
>>> /// - Parameter nulTerminatedUTF8: a sequence of contiguous UTF-8 encoded
>>> /// bytes ending just before the first zero byte (NUL character).
>>> init(cString nulTerminatedUTF8: UnsafePointer<CChar>)
>>>
>>> /// Constructs a `String` having the same contents as `nulTerminatedCodeUnits`.
>>> ///
>>> /// - Parameter nulTerminatedCodeUnits: a sequence of contiguous code units in
>>> /// the given `encoding`, ending just before the first zero code unit.
>>> /// - Parameter encoding: describes the encoding in which the code units
>>> /// should be interpreted.
>>> init<Encoding: UnicodeEncoding>(
>>> cString nulTerminatedCodeUnits: UnsafePointer<Encoding.CodeUnit>,
>>> encoding: Encoding)
>>>
>>> /// Invokes the given closure on the contents of the string, represented as a
>>> /// pointer to a nul-terminated sequence of UTF-8 code units.
>>> func withCString<Result>(
>>> _ body: (UnsafePointer<CChar>) throws -> Result) rethrows -> Result
>>> }
>>> ```
>>>
>>> In both of the construction APIs, any invalid encoding sequence detected will
>>> have its longest valid prefix replaced by U+FFFD, the Unicode replacement
>>> character, per Unicode specification. This covers the common case. The
>>> replacement is done *physically* in the underlying storage and the validity of
>>> the result is recorded in the `String`'s `encoding` such that future accesses
>>> need not be slowed down by possible error repair separately.
>>>
>>> Construction that is aborted when encoding errors are detected can be
>>> accomplished using APIs on the `encoding`. String types that retain their
>>> physical encoding even in the presence of errors and are repaired on-the-fly can
>>> be built as different instances of the `Unicode` protocol.
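A short sketch of the repair behavior and the C string round trip. It uses `String(decoding:as:)` as a stand-in for the proposed initializers, which is an assumption about the toolchain and not part of the API list above:

```swift
// "a", one byte that can never appear in valid UTF-8, then "b".
let bytes: [UInt8] = [0x61, 0xFF, 0x62]

// Decoding repairs the invalid byte with U+FFFD and never fails.
let repaired = String(decoding: bytes, as: UTF8.self)

// Round trip through a nul-terminated UTF-8 buffer; the pointer is only
// valid inside the closure, so the copy is made there.
let roundTripped = "hello".withCString { String(cString: $0) }
```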
>>>
>>> ### Unicode 9 Conformance
>>>
>>> Unicode 9 (and MacOS 10.11) brought us support for family emoji, which changes
>>> the process of properly identifying `Character` boundaries. We need to update
>>> `String` to account for this change.
>>>
>>> ### High-Performance String Processing
>>>
>>> Many strings are short enough to store in 64 bits, many can be stored using only
>>> 8 bits per Unicode scalar, others are best encoded in UTF-16, and some come to
>>> us already in some other encoding, such as UTF-8, that would be costly to
>>> translate. Supporting these formats while maintaining usability for
>>> general-purpose APIs demands that a single `String` type can be backed by many
>>> different representations.
>>>
>>> That said, the highest performance code always requires static knowledge of the
>>> data structures on which it operates, and for this code, dynamic selection of
>>> representation comes at too high a cost. Heavy-duty text processing demands a
>>> way to opt out of dynamism and directly use known encodings. Having this
>>> ability can also make it easy to cleanly specialize code that handles dynamic
>>> cases for maximal efficiency on the most common representations.
>>>
>>> To address this need, we can build models of the `Unicode` protocol that encode
>>> representation information into the type, such as `NFCNormalizedUTF16String`.
>>>
>>> ### Parsing ASCII Structure
>>>
>>> Although many machine-readable formats support the inclusion of arbitrary
>>> Unicode text, it is also common that their fundamental structure lies entirely
>>> within the ASCII subset (JSON, YAML, many XML formats). These formats are often
>>> processed most efficiently by recognizing ASCII structural elements as ASCII,
>>> and capturing the arbitrary sections between them in more-general strings. The
>>> current String API offers no way to efficiently recognize ASCII and skip past
>>> everything else without the overhead of full decoding into Unicode scalars.
>>>
>>> For these purposes, strings should supply an `extendedASCII` view that is a
>>> collection of `UInt32`, where values less than `0x80` represent the
>>> corresponding ASCII character, and other values represent data that is specific
>>> to the underlying encoding of the string.
>>
>> There are some things that are known to lie entirely within ASCII–are
>> there any plans to add a way to work with them in a simple manner
>> (subscripting, looping, etc.), possibly through the use of a
>> Array<ASCIIChar>? property or whatever?
>
> Maybe I'm misunderstanding what you have in mind but it sounds like
> that's exactly what extendedASCII is designed for.
>
Sorry if I wasn’t clear; I’m looking for indexing using Int, instead of using formIndex.
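Roughly, this is the kind of code I'd like to be able to write directly on the string. The sketch below gets there by copying the bytes into an array first, which is the step I'm hoping to avoid; the input and variable names are just for illustration:

```swift
let line = "key=value"

// Copy the UTF-8 code units into an Array so they can be indexed by Int.
let bytes = Array(line.utf8)

var key = ""
var value = ""

// Plain Int arithmetic on positions, no index(after:) or formIndex.
if let eq = bytes.firstIndex(of: UInt8(ascii: "=")) {
    key = String(decoding: bytes[..<eq], as: UTF8.self)
    value = String(decoding: bytes[(eq + 1)...], as: UTF8.self)
}

let firstByte = bytes[0]
```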
>
> --
> -Dave
>