[swift-dev] State of String: ABI, Performance, Ergonomics, and You!

Michael Ilseman milseman at apple.com
Wed Jan 10 13:55:07 CST 2018


(A gist-formatted version of this email can be found at https://gist.github.com/milseman/bb39ef7f170641ae52c13600a512782f <https://gist.github.com/milseman/bb39ef7f170641ae52c13600a512782f>)

# State of String: ABI, Performance, Ergonomics, and You!

Hello, I’ve been working on implementing, optimizing, and improving String in preparation for ABI stability, and I thought I’d share the current status of String in Swift 5 and some potential directions to go. This is the product of conversations with open source contributors and my colleagues on the Swift standard library team at Apple. 

The primary area of focus is stabilizing String’s ABI, but we’d be remiss if we couldn’t also fit in performance and ergonomic improvements. String’s ergonomics in particular is one area where we think the status quo is woefully inadequate, and over half of this email is devoted to that topic. At the end, there’s a section about a community initiative that we hope can help users of String as well as guide future development.

(Note: I’m sending this to swift-dev because much of the contents revolve around implementation concerns. I’ll also cross-reference on swift-evolution and swift-users. See also the [StringManifesto](https://github.com/apple/swift/blob/master/docs/StringManifesto.md <https://github.com/apple/swift/blob/master/docs/StringManifesto.md>) from Swift 4 era, which elaborates more on the philosophy and justification of many of the things discussed here.)

## String ABI

String’s ABI consists of its in-memory layout, any methods that are public or callable from inlineable methods, and any interpretation of its bit pattern by inlineable methods.  The challenge in designing String’s ABI is determining what avenues are worth keeping open to pursue in the future and what should be closed off for good to ensure performance.

In a non-ABI-stable world, the Swift optimizer is able to peer deeply into the bowels of String’s implementation to perform analyses and optimizations. But, ABI stability introduces resilience barriers through which the optimizer cannot peer. Keeping reasonable performance while stabilizing an ABI requires careful crafting of the String type as well as reevaluating many implementation choices. This area is critical to String’s long term health, even though it has little user-visibility in the short term.

For example, as an implementation detail, String uses UTF-16 as its underlying encoding. For values that fit within the first 7-bit subset of UTF-16 (that is, ASCII), String uses 1-byte backing storage to save space. UTF-16 provides the best interop story for application development and interacting with ICU. But, other domains such as scripting, systems programming, and server-side development strongly prefer UTF-8 and transcoding may not serve their performance needs. If this implementation detail can remain behind a resilience barrier, that is if the performance cost is not too high, then String could evolve in the future to natively support backing UTF-8 storage.

The details are still in flux, and likely not of interest to most readers. Much of the re-gutting work is happening on the appropriately-named [StringGuts branch](https://github.com/apple/swift/pull/12442 <https://github.com/apple/swift/pull/12442>).

## String Performance

Strings are pervasive in any system and their performance matters. We’re planning on two major efforts to improve performance this release: comparison improvements and small-string optimizations. Additionally, internal to the standard library, we’re introducing and using unmanaged strings and some performance flags, which may be worth surfacing as API for highly-performance-sensitive uses.

### String Comparison

**Strings should compare consistently**, with all the nice properties we know and love such as transitivity. String honors Unicode canonical equivalence, i.e. comparison is modulo-normalization. This means that ự (U+1EF1: LATIN SMALL LETTER U WITH HORN AND DOT BELOW) compares equivalent to ư ◌̣ (U+01B0 U+0323: LATIN SMALL LETTER U WITH HORN, COMBINING DOT BELOW) and u ◌̣ ◌̛ (U+0075 U+0323 U+031B: LATIN SMALL LETTER U, COMBINING DOT BELOW, COMBINING HORN), and all the other wonderful wacky permutations thereof.

**String’s ordering is not sufficient for human presentation**, as there is no ordering that’s appropriate for all languages or uses. For example, `ö` comes before `z` in German, but not in Swedish. String’s comparison operators (`==` and `<`) do not attempt to provide locale-specific orderings, but instead a *universal* ordering, that is one which is useful for maintaining programmer invariants such as the sorted-ness of a data structure. String’s choice to honor Unicode canonical equivalence strikes a balance between universality and efficiency while helping most programmers avoid common pitfalls in dealing with Unicode.

(Quick note on localization: Localization is not within the purview of the standard library as it’s better provided by the platform, i.e. Foundation. If you’re presenting an ordering to a user, use a locale-aware method from Foundation such as `localizedStandardCompare()`.)

Given that ordering is not fit for human consumption, but rather machine processing, it might as well be fast. The current ordering differs on Darwin and Linux platforms, with Linux in particular suffering from poor performance due to choice of ordering (UCA with DUCET) and older versions of ICU. Instead, [String Comparison Prototype](https://github.com/apple/swift/pull/12115 <https://github.com/apple/swift/pull/12115>)  provides a simpler ordering that allows for many common-case fast-paths and optimizations. For all the Unicode enthusiasts out there, this is the lexicographical ordering of NFC-normalized UTF-16 code units.

### Small String Optimization

Many strings are small. It is wasteful and inefficient to store them on the heap and track their lifetimes when the String struct could store it inline, i.e. encoded directly in the struct’s bit pattern.  Substrings, if they cover a small span of a potentially-large String, can avoid retaining ownership of the storage and instead encode their contents inline with an offset to map indices.

NSString has leveraged this kind of optimization on 64 bit platforms to great effect through tagged-pointer representations. Swift’s String has the advantage of being a struct and, depending on the final ABI, having more bits and flags available. We will probably have multiple small string representations.

For example, assuming a 16-byte String struct and 8 bits used for flags and discriminators (including discriminators for which small form a String is in), 120 bits are available for a small string payload. 120 bits can hold 7 UTF-16 code units, which is sufficient for most graphemes and many common words and separators. 120 bits can also fit 15 ASCII/UTF-8 code units without any packing, which suffices for many programmer/system strings (which have a strong skew towards ASCII).

We may also want a compact 5-bit encoding for formatted numbers, such as 64-bit memory addresses in hex, `Int.max` in base-10, and `Double` in base-10, which would require 18, 19, and 24 characters respectively. 120 bits with a 5-bit encoding would fit all of these. This would speed up the creation and interpolation of many strings containing numbers.

Final details are still in exploration. If the construction and interpretation of small bit patterns can remain behind a resilience barrier, new forms could be added in the future. However, the performance impact of this would need to be carefully evaluated.

### Unmanaged Strings

One of the ways that the standard library internally is dealing with the upcoming optimization barriers is by restructuring much of String’s implementation to be on top of an unmanaged form. These are rough analogues to UnsafeBufferPointer, i.e. a pointer and length with flags and string-semantic operations defined on them. Like small strings, these still have type String as they are just a different form encoded in the struct’s bit pattern.

Such unmanaged strings can be used when the lifetime of the storage is known to outlive all uses. As such, they don’t need to participate in reference counting. In the future, perhaps with ownership annotations or sophisticated static analysis, Swift could convert managed strings into unmanaged ones as an optimization when it knows the contents would not escape. Similarly in the future, a String constructed from a Substring that is known to not outlive the Substring’s owner can use an unmanaged form to skip copying the contents. This would benefit String’s status as the recommended common-currency type for API design.

Some kind of unmanaged string construct is an often-requested feature for highly performance-sensitive code. We haven’t thought too deeply about how best to surface this as API and it can be sliced and diced many ways. Some considerations:

* Does it make sense to introduce a new type (or types) here, or is it better to base everything on a sort of  `withUnsafeCodeUnits`-like construct that supplies UnsafeBufferPointers?
* How does 1-byte or 2-byte backing storage surface? Some options:
	* Closures receive optional arguments. E.g. `withUnsafeUTF16CodeUnits` passes an  `UnsafeBufferPointer<UInt16>?` to its closure. String would also have to be queryable as to the nature of the underlying storage, and those queries also become new API and ABI.
	* The backing storage is abstracted away sufficiently. I.e. the 1-byte or 2-byte decision is tracked by the type provided by e.g. `withUnsafeCodeUnits`. This trades a little performance for simpler usage. This would probably require a new type which in turn is new API and ABI.
	* Operations are given an enum of the different forms. That enum also becomes API as well as ABI.
		* Is the enum open or closed? Can this evolve over time?
	* Note that whether the underlying storage is 1-byte or 2-byte is not necessarily stable version-to-version, e.g. if String utilized UTF-8 storage for mostly-ASCII Strings in the future.
* Many Strings are effectively immutable and immortal, as in, the String’s lifetime will definitely outlive all potential uses (e.g. String literals). Would this work also encompass uses of StaticString? Can they be unified?
* What about a low-level API for efficiently building Strings? Could the user call `reserveCapacity` and then get a buffer pointer to excess capacity, updating count along the way?
* If this is niche and specifically for performance, should it be UnsafeString instead and skip bounds checks in release configurations?
* The APIs need to be designed and named. UnmanagedString’s subscript loads raw code units, but it also provides the functionality to perform grapheme breaking (as String is layered on top of it). How can this be designed in a way that implies semantic symmetry with String?
* How would small strings operate? Should the APIs fail on them, or receive a scratch buffer to spill them into, or be abstracted by the type?

### Performance Flags

Many operations such as decoding, comparison (under the new model), and grapheme breaking have fast-paths for common situations where the portion of the String under consideration does not exhibit the full complexity of Unicode. We plan on introducing flags denoting whether the *entire* String has these properties, allowing the implementation to just execute these fast-paths without repeatedly querying the underlying code units and with better instruction locality. Unsetting a flag will always preserve semantics, just slightly regress performance (the fast-paths will remain). Some flags include:

* **isTriviallyEncoded**: If all scalars are comprised of a single code unit, then decoding is trivial: just zero-extend.
* **isCanonical**: If a String is already in our preferred canonical form for comparison purposes, then honoring canonical equivalence is trivial.
* **isSingleScalarGrapheme**: If all graphemes are composed of a single scalar, then grapheme breaking is trivial.

String is a struct and these flags are stored inline, so they can only be set upon creation or inside a mutable method. In general, it’s not worth expensive checks to set these flags or onerous bookkeeping to maintain them, but a best-effort can be made when it’s cheap to do so. Since unsetting a flag is semantics-preserving, flags can be reserved for future use by always setting them to 0 in the current version. In this way, flags can be added in a backwards-compatible way, where older versions ignore them.

For highly performance-sensitive code, it might be worth having a mutating method to perform a linear scan and update these flags. This should be a very fast check to update inline flags, and exposing it as API should probably also expose getters to query flags. We haven’t thought deeply about how best to surface this for judicious use without creating an easily [mis-applied API](http://perldoc.perl.org/functions/study.html <http://perldoc.perl.org/functions/study.html>) littered throughout code bases.

### Miscellaneous

There are many other tweaks and improvements possible under the new ABI prior to stabilization, such as:

* Guaranteed nul-termination for String’s storage classes for faster C bridging.
* Unification of String and Substring’s various Views.
* Some form of opaque string, such as an existential, which can be used as an extension point.
* More compact/performant indices.

## String Ergonomics

Ergonomics is an area that’s not been well-served by String. Collection conformance and multi-line literals in Swift 4 were a welcome first step. But, trying to write a script, solve a classic whiteboard interview question, or participate in a coding competition can quickly become an exercise in frustration.

String ergonomics is a long term, multi-release endeavor. It has to continually compete for focus against near-term critical needs such as ABI stability and performance. The below provides some potential focus areas in the Swift 5 timeframe, though we probably won’t be able to focus on them all.

### API Gaps

String’s APIs have a lot of holes in them, and we’re hoping the effort mentioned later in this email can help expose them. Filling these gaps is always in-scope for any release.

#### Character Properties

One immediately obvious gap is Character, which lacks basic queries. For example, it should be possible to write something like `myStr.filter { !$0.isWhitespace }` to get a string with all whitespace removed.

While this can quickly get bogged down in the details (Q: “Does the Mongolian vowel separator count as whitespace?” A: Depends on the version of Unicode!), there should be obvious, well-understood, generally useful queries that Character can answer. This area has been well-trodden by CharacterSet and its static properties can provide guidance on useful queries. Some potential queries would include isWhitespace, isNewline, isPunctuation, isAlphaNumeric, etc. These should probably also be present on UnicodeScalar.

These might be useful on other types such as Substring, where they would apply to the whole slice. E.g., it should be possible to write `myStr.split(separator: ";").filter { $0.isAlphaNumeric }` rather than `myStr.split(separator: ";").filter { !$0.contains { !$0.isAlphaNumeric } }`

Note: In no way would these properties obviate the need for CharacterSet, as CharacterSet API is independently useful and the right model for many whole-string operations.

### Interpolation: Building Strings Up

String interpolation, especially when used in conjunction with other language features such as multi-line string literals, present a compelling, visual way to construct strings.  For example, the following theoretical printout of a benchmark result:

```swift
print("""
 Result: \(benchmark.result)
 Execution time:
   user(ms): \(benchmark.userTime.milliseconds)
   system(ms): \(benchmark.systemTime.milliseconds)
   wall(ms): \(benchmark.wallTime.milliseconds)
 """)
```

String interpolation has many issues and limitations:

* Current interface is inefficient. Individual segments are turned into Strings before interpolation. While small string optimizations alleviate some of this, larger segments incur unneeded heap traffic and tracking.
* While it is [fairly customizable](https://oleb.net/blog/2017/01/fun-with-string-interpolation/ <https://oleb.net/blog/2017/01/fun-with-string-interpolation/>), the interface is clunky and deprecated. Brent Royal-Gordon has clearly articulated the problems and provided potential solutions in [Fix ExpressibleByStringInterpolation](https://github.com/brentdax/swift-evolution/blob/230e3ab5fe4a4acaabf52806f69ae3dd1ff9f538/proposals/NNNN-fix-expressible-by-string-interpolation.md <https://github.com/brentdax/swift-evolution/blob/230e3ab5fe4a4acaabf52806f69ae3dd1ff9f538/proposals/NNNN-fix-expressible-by-string-interpolation.md>). This approach does need to be adjusted to address the performance issue above.
* Interpolated expressions are supplied inside the literal, meaning they cannot be passed around like format strings without extra boilerplate (e.g. a wrapping function).

Additionally, some concept of string formatters would greatly aid the ergonomics of constructing strings, whether inside interpolations or not.

### Regex: Tearing Strings Apart

Perhaps the most pressing area in need of ergonomic aid is tearing strings apart to extract information embedded within them. This can take many forms, and we would like to focus the discussion on language-level support for string matching literals, aka “regexes”. (Obligatory note that regexes here may or may not have a loose relationship to regular expressions in formal language theory, depending on details.)

This wouldn’t be worth doing if all it afforded was replacing `“foo”` with `/foo/` inside calls to NSRegularExpression. Regexes become compelling when viewed in conjunction with other language constructs such as switches (i.e. a new kind of [Pattern](https://developer.apple.com/library/content/documentation/Swift/Conceptual/Swift_Programming_Language/Patterns.html <https://developer.apple.com/library/content/documentation/Swift/Conceptual/Swift_Programming_Language/Patterns.html>)), an ability to bind to variables, interpolate other matchers, a rich type system, etc.

Before delving into details on feature set, semantics, and implementation, here’s a quick tour of how they could fit into Swift.

#### One Possible Approach

This is one, brief and hand-wavy, approach to adding regexes to Swift. This is not a proposal, and **all syntax is a total strawman** meant for illustration purposes only. The regex literal syntax below is an adaptation of  [Chris Lattner’s strawman syntax](https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20171120/041619.html <https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20171120/041619.html>). In this syntax, whitespace is insignificant.

###### Compile Regex Literals into `Regex<T>`

```swift
let usPhoneNumber = /
 (let area: Int? <- \d{3}?) -
 (let routing: Int <- \d{3}) -
 (let local: Int <- \d{4}) /
print(type(of: usPhoneNumber)) // => Regex<(area: Int?, routing: Int, local: Int)>
...
let (_, routing, local) = try usPhoneNumber.match(myStr)!
print(type(of: routing)) // => Int
```

If no type is specified, the default is `Substring` covering the match.

###### Types in Captures Conform to Some Protocol

```swift
extension Int: RegexSubmatchableiblewobble {
 init?(regexSubmatch: Substring)
}
```

###### Captures and Interpolation Propagated Through Type System

```swift
let countryCode = / \+ (let country: Int <- \d+) /
print(type(of: countryCode)) // => Regex<(country: Int)>

let betterUSPhoneNumber = /
 (let country: Int? <- countryCode?)
 (let (area, routing, local) <- usPhoneNumber) /
print(type(of: betterUSPhoneNumber)) 
 // => Regex<(country: Int?, area: Int?, routing: Int, local: Int)>

let nestedCapture = /
 (let country: Int? <- countryCode?)
 (let number <- usPhoneNumber) /
print(type(of: nestedCapture)) 
 // => Regex<(country: Int?, number: (area: Int?, routing: Int, local: Int))> 
```

This alone is not sufficient to use regexes in pattern matching, as `~=` returns a bool rather than a tuple describing a match. For that, we need to **generalize pattern matching in Swift**.

###### Pattern Matching Through Conformance to `Pattern`

```swift
protocol Pattern {
 associatedtype In
 associatedtype Out
 func match(_: In) -> Out?
}
```

`~=` can still be checked if needed for compatibility. Alternatively, a default implementation of `match` could be provided in terms of `~=`, where `Out` is `Void`, i.e. only match success or failure. We may want a default implementation of match generic over any collection that `In` is a SubSequence of, to address the `Slice` vs `Collection` or `Substring` vs `String` divide.

###### Syntax for Destructing Matching

Introduce some kind of syntax for a new [value-binding-pattern](https://developer.apple.com/library/content/documentation/Swift/Conceptual/Swift_Programming_Language/Patterns.html#//apple_ref/swift/grammar/value-binding-pattern <https://developer.apple.com/library/content/documentation/Swift/Conceptual/Swift_Programming_Language/Patterns.html#//apple_ref/swift/grammar/value-binding-pattern>) that performs destructing. E.g.

```swift
let myValue: T = ...
let pattern: MyTPattern<(U,U)> = ... // matches a T to (U, U)
let otherPattern: OtherTPattern<V> = ... // matches a T to V
switch myValue {
 case (let a: Int, let b: Int) <- pattern: ...
 case let pair <- pattern: ... // pair has type (U,U)
 case let d: Double <- otherPattern: ...
 case let num <- otherPattern: // num has type V
}
```

which also (hand-wavy timey-wimey) invokes some failable init on `Int` taking a `U`, `Double` taking a `V`, etc., guided by some protocol. 

###### Regexes are Just Patterns

```swift
struct Regex<T> {
 var program: CompiledRegex<T>
}
extension Regex: Pattern {
 typealias In = Substring
 typealias Out = T
 func match(_ s: In) -> Out? {
   // magic goes here
 }
}
```

Regex *literals* may have special syntax, but otherwise regexes are not a special case in the language and libraries can take advantage of generalized destructing matching. Similarly, character classes (which may be user defined, e.g. with a closure), could be Patterns over Characters.

###### Regexes Provide Variants

```swift
extension Regex {
 var caseInsensitive: Regex<T> { get }
 var allMatches: Regex<[T]> { get }
 var lineBased: Regex<[T]> { get }
 // ... matches with ranges, maybe even a Match object, etc.
 // ... lazy allMatches variant, etc.
}
```

###### Final Contrived Example (for Illustrative Purposes)

```swift
switch myStringlyTypedNumber {
case let (_, area?, routing, local) <- betterUSPhoneNumber: // ...
case let hexNumStr <- 
	/ 0 x (let num <- [0-9 A-F]+) /.caseInsensitive: // ...
case let (mantissa: Double, exponent: Int) <- scientificNumber: // ...
}
```

#### Feature Set

The examples above merely illustrate how regex literals could appear, but we need to carefully design the feature set in accordance with Swift’s goals and principles as a language.

One goal of Swift is to eventually [be better at string processing than Perl](https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160725/025676.html <https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160725/025676.html>), so let’s take a look at Perl. Perl’s history is full of interesting insights. Perl organically evolved into the de-facto standard for regex-based string processing culminating in Perl 5. More importantly, they reflected on the [“mess”](https://perl6.org/archive/doc/design/apo/A05.html <https://perl6.org/archive/doc/design/apo/A05.html>) and cleaned it up in [Perl 6](https://docs.perl6.org/language/regexes <https://docs.perl6.org/language/regexes>).  Perl 6 also has a [PEG](https://dl.acm.org/citation.cfm?id=964001.964011 <https://dl.acm.org/citation.cfm?id=964001.964011>)-like [grammar system](https://docs.perl6.org/language/grammars <https://docs.perl6.org/language/grammars>) and we think similar approaches may be a part of Swift’s long-term future. For now, we’re focusing on regexes, although future considerations do impact the desired feature set.

Swift contrasts with Perl in that Swift places greater importance on clarity and simplicity over raw expressive power and extreme configurability. However, these are some features Swift should outright steal from Perl 6.

* **Free form syntax**. Whitespace is insignificant, comments can occur inside literals, etc., all of which aids readability
* **Named Captures**. If something’s worth capturing, it’s worth giving it a name.
* **Multi-line matching**. A new line is just another character, not a match terminator. 
* **Regex interpolation**. Refactor complex regexes into simpler ones (with names!) and interpolate them. This is perhaps the most Swifty aspect of Perl 6 regexes!
* Miscellaneous features common to modern regex systems, such as:
	* Quantifier modifiers, e.g. frugal (“lazy”), ratcheting (“possessive”)
	* Non-capturing grouping, ratcheting (“atomic”) grouping
	* Separator-quantification (“[modified quantifier](https://docs.perl6.org/language/regexes#Modified_quantifier <https://docs.perl6.org/language/regexes#Modified_quantifier>:_%,_%%)”)
	* String literals
	* Rich set of character classes (more on that next)

##### Character Classes

We will want a broad array of builtin pre-defined character classes for use in regexes. Additionally, users should be able to construct and define their own, whether from CharacterSets (i.e. enumeration) or through an arbitrary closure of type `(Character) -> Bool`. There should be syntax to use user-defined character classes in a regex literal.

Character classes should provide many set-algebraic operations such as union, intersection, inversion, mutual-exclusion (“xor”), and set difference. (Note that for our purposes, the standard library’s `SetAlgebra` protocol is not appropriate for character classes, as `SetAlgebra` requires conformances to be able to check for subset relationships which is intractable, if not undecidable, for closures over graphemes.)

##### Arguable Features

Some literal syntactic features common in other regex systems may be unnecessary or a bad fit for Swift.

1. Substitution literals (e.g. sed-like `s/regex/replacement/g`)
	* If String had better substitution APIs, such as one that took a `Regex<T>` and a closure from `(T) -> String` alongside options such as `.all` or `.firstOnly`, that might be sufficient and/or clearer anyways.
2. Adverbs (e.g. the `i` meaning case-insensitive outside the literal’s delimiters)
	* These could be Regex variants accessible via computed properties, which may be clearer
3. Anonymous (numbered) captures
	* In general, if something’s worth capturing it’s worth giving it a name.
	* A construct like `(let _ = \d+)` could produce an unlabeled tuple element.
	* Tuples (the representation for literal captures) support 0-based access already, e.g. `myCaptures.3`.
4. Enumerated character class ranges for non-ASCII graphemes (the `-` in  `[A-Z]`)
	* Convenient for ASCII, but meaningless for graphemes.
	* Good built-in classes and convenient ways for users to combine and enumerate their own may even obviate the need for ranges for ASCII.

##### Forbidden Features (with Alternatives)

There are some regex features that we argue should be disallowed, or at least carefully constrained. Regexes are powerful, specifying simple textual analysis in a concise syntax, but are easily misused in ways that add exponentially-compounding complexity in non-obvious ways. This misuse contrasts with Swift’s values of balancing expressivity with clarity, where complex behavior should be *explicitly expressed in code*.

However, we think there’s so much flexibility in what can accomplished without the most general form of these features that it might not be an issue in practice.

###### 1. Arbitrarily-Complex Look-Around Assertions

Assertions are the ability for a regex to “peek” ahead or behind at surrounding characters without consuming them as part of a match. They can be positive or negative. Arbitrarily-complex assertions include the ability to do captures as part of an assertion, unknown-length assertions satisfied by subpatterns, etc., which can have non-intuitively complex interactions with other parts of the regex.

**Alternative:** The following constrained assertions could be permissible:

1. Anchors (zero-width assertions), which test for positions within a string, e.g. `^` and `$`.

2. Fixed-length character-class or closure-based assertions, which help avoid [common pitfalls in negation](https://stackoverflow.com/questions/406230/regular-expression-to-match-a-line-that-doesnt-contain-a-word <https://stackoverflow.com/questions/406230/regular-expression-to-match-a-line-that-doesnt-contain-a-word>) as well as aid readability. This would permit fixed-length arbitrary computation *explicitly expressed in code*. These might also encompass boundaries.

3. Invoking a closure on the string prefix/suffix prior/after the assertion point. This would permit variable-length arbitrary computation, but *explicitly expressed in code*.

###### 2. Arbitrary Uses of Backreferences

Backreferences allow testing a later portion of a string against a previously-matched portion. Unfortunately, in their most general form, they require exploration of the entire search space and [make matching NP-complete](https://perl.plover.com/NPC/NPC-3SAT.html <https://perl.plover.com/NPC/NPC-3SAT.html>), that is take exponential time. Very simple uses of backreferences just want to check for an unambiguous capture after the fact, for which a `where` clause suffices. Very complex uses are better served with an exponential loop *explicitly expressed in code*.

**Alternative:** There is a range of convenient uses between these two extremes, which we may be able to tractably address. The feasibility of these is still under investigation, but some potential constrained backreferences may include:

1. A “ratcheting” backreference, where upon matching (i.e. the current section of the string satisfies the backreference), the referred-to capture is “frozen” in place, so that alternative assignments need not be explored.
2. A delimiter-checking construct. While eventually we want to encourage the user to write a real grammar/parser (assuming some future grammar/parser description system in Swift), simple delimiter matching can greatly aid readability. This might even be general and powerful enough to define tasks such as pairing XML tags and ignoring escaped delimiters or delimiters inside literals. Ideally, this would be done in a direction conducive to a future parsing solution.

#### Semantics

Matching semantics should roughly corresponding to [UTS #18 Level 2](https://unicode.org/reports/tr18/#Extended_Unicode_Support <https://unicode.org/reports/tr18/#Extended_Unicode_Support>) of Unicode support. That is, “any character” would be any Swift Character (a grapheme), matching obeys canonical equivalence, etc. This is consistent with String’s overall philosophy of universality and harm reduction. This does require that the implementation use Unicode rules and tables provided at run time, e.g. for grapheme breaking.

Regex interpolation should effectively “inline” the interpolated regex, i.e. [formal concatenation](https://en.m.wikipedia.org/wiki/Regular_expression#Formal_definition <https://en.m.wikipedia.org/wiki/Regular_expression#Formal_definition>). We will probably also want different combination semantics for combining patterns in general (including regexes), possibly in conjunction with a future parsing solution. For example, a form of alternation that is short-circuiting and some form of concatenation that is [“ratcheting”](https://docs.perl6.org/language/regexes#Ratchet <https://docs.perl6.org/language/regexes#Ratchet>), where something like  `(myRegex1 || myRegex2) ++ myRegex3` tries myRegex1 first, and if successful myRegex2 is *never* tried regardless of whether myRegex3 succeeds or not. This is similar to the semantics of [PEGs](https://en.wikipedia.org/wiki/Parsing_expression_grammar <https://en.wikipedia.org/wiki/Parsing_expression_grammar>) or parser-combinators and much more closely reflects programmer intuition for parsing.

We’ll need to nail down a lot of important details:

* How are alternations matched? [LTM](https://perlgeek.de/en/article/longest-token-matching <https://perlgeek.de/en/article/longest-token-matching>)?
* What are the built in character classes and what do they cover?
	* What Unicode properties to expose?
	* Should there be API symmetry with Character queries mentioned above?
* What are the built in anchors, fixed-length assertions, and can they be customized?
* What is the type of a quantified capture? An array of the capture’s declared type?
* What guarantees about runtime and space complexity do we provide, if any?
* While we may have variants for whole-string, partial-string, line-by-line, etc., what should the defaults be?

#### Implementation

With static knowledge, many regexes can be compiled into [efficient, minimal state machines](https://swtch.com/%7Ersc/regexp/regexp1.html <https://swtch.com/%7Ersc/regexp/regexp1.html>) (albeit with Unicode rules and tables as input at run time). [Virtual machine approaches](https://swtch.com/%7Ersc/regexp/regexp2.html <https://swtch.com/%7Ersc/regexp/regexp2.html>) can make it easier to specify and guarantee behavior while also being much more extensible. They also may alleviate some ABI-stability concerns, as it’s easier to keep a bytecode stable while growing new features.

Delivering ergonomic regexes in Swift 5 should not be blocked on having an efficient implementation available immediately. However, long term, it is important to have a [predictable and consistent performance model](https://github.com/google/re2/wiki/WhyRE2 <https://github.com/google/re2/wiki/WhyRE2>) to prevent [unanticipated exponential behavior](https://en.wikipedia.org/wiki/ReDoS <https://en.wikipedia.org/wiki/ReDoS>). Thus, implementation concerns are relevant in scoping the initial feature set (e.g. regarding backreferences). 

Some regexes are amenable to task-parallel strategies (i.e. threads), data-parallel strategies (i.e. vector or GPU), or both. Swift will probably grow new strategies and combinations of strategies over time. Implementations can also take advantage of the aforementioned performance flags (e.g. isCanonical) to greatly speed up matches.

The interface to a regex engine will be part of Swift’s ABI, including whatever intermediate representation regexes are compiled into. This could be a simple regex AST, a virtual machine bytecode, etc., but it needs to be extensible to accommodate any potential future additions and implementation strategies.

### Recap: Potential Additions for Swift 5

* Some form of unmanaged or unsafe Strings, and corresponding APIs
* Exposing performance flags, and some way to request a scan to populate them
* API gaps
* Character and UnicodeScalar properties, such as isNewline
* Generalizing, and optimizing, String interpolation
* Regex literals, Regex type, and generalized pattern match destructuring
* Substitution APIs, in conjunction with Regexes.

## Community

With Swift’s mailing lists [moving to Discourse](https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20171211/042127.html <https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20171211/042127.html>), this allows the opportunity for better community engagement and cross-referencing posts. We plan on starting an area under the “Using Swift” category focusing on idiomatic and/or performant solutions to string programming needs. These might be anything from simple programming problems to make-my-parser-faster challenges.

While this provides a community benefit by showing how best to use String, our ulterior motive is to mine posts looking for benchmarks, API gaps, and representative code that new features could improve. We’ll also, of course, want to bikeshed the name! My vote is for “String Dojo”, but there’s probably a reason I don’t normally name things.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.swift.org/pipermail/swift-dev/attachments/20180110/4c6b6b70/attachment.html>


More information about the swift-dev mailing list