[swift-dev] State of String: Ergonomics, and You!

Michael Ilseman milseman at apple.com
Sat Jan 13 12:30:40 CST 2018


(I’m still working on a response on the ABI-side of this conversation, which deserves more time and thought)


> On Jan 11, 2018, at 10:28 PM, Chris Lattner <clattner at nondot.org> wrote:
> 
> On Jan 10, 2018, at 11:55 AM, Michael Ilseman via swift-dev <swift-dev at swift.org <mailto:swift-dev at swift.org>> wrote:
>> ## String Ergonomics
>> 
>> Ergonomics is an area that’s not been well-served by String. Collection conformance and multi-line literals in Swift 4 were a welcome first step. But, trying to write a script, solve a classic whiteboard interview question, or participate in a coding competition can quickly become an exercise in frustration.
> 
> Indeed.  It has come a long way, but isn’t there yet.
> 
>> #### Character Properties
>> 
>> One immediately obvious gap is Character, which lacks basic queries. For example, it should be possible to write something like `myStr.filter { !$0.isWhitespace }` to get a string with all whitespace removed.
> 
> Yes, I think that this is critical for multiple reasons.  First of which is direct utility, which you’re focusing on.  The second of which is that I think that this directly dovetails into the regex syntax discussion: ideally we’d have named character classes and be able to explain them as being *exactly* defined as a corresponding character classification property.
> 
> It would be great to get a property for each of (e.g.) the Perl 6 character classes:
> https://docs.perl6.org/language/regexes#Backslashed,_predefined_character_classes <https://docs.perl6.org/language/regexes#Backslashed,_predefined_character_classes>
> 
> I consider this (along with all the annoying work to define what they should match) as a prerequisite for getting regex’s nailed down.
> 

I agree 100%

>> While this can quickly get bogged down in the details (Q: “Does the Mongolian vowel separator count as whitespace?” A: Depends on the version of Unicode!), there should be obvious, well-understood, generally useful queries that Character can answer. This area has been well-trodden by CharacterSet and its static properties can provide guidance on useful queries. Some potential queries would include isWhitespace, isNewline, isPunctuation, isAlphaNumeric, etc. These should probably also be present on UnicodeScalar.
> 
> Relatedly, the Foundation CharacterSet type is extremely confusing because it isn’t a set of swift Characters.  I don’t suppose there is any appetite to rectify or generalize it somehow is there?

We could rectify it through renaming, which could be proposed in a source-compatible way. I don’t know the naming history here, but I assumed it had some kind of term-of-art status. As far as generalizing, that may be more complicated than it seems (more details below in response to “A Grapheme-cluster based analog…”).


>  I wouldn’t overly rely on it for guidance on these issues give that it it stuck so squarely in the realm of UTF16.
> 

Wading a little into the weeds here, CharacterSet’s builtins model older Unicode semantics than what we probably want to provide. E.g. CharacterSet.lowercaseLetters means general category “Ll”, while modern Unicode defines property Lowercase which includes more scalars. The Lowercase property vs CharacterSet.lowercaseLetters is equivalent to ICU’s u_isULowercase [1] vs u_islower [2], where the former is the modern recommended API. Contrarily, Perl 6’s <lower> built-in rule[3] seems to also just be “Ll” and I’m not familiar with the rationale there (compatibility?). Certainly needs investigation.

I supposed by “guidance” I meant “here are some of the things people commonly ask about characters”. Built-in character classes in various regex flavors also provide guidance.

[1] u_isULowercase: http://icu-project.org/apiref/icu4c/uchar_8h.html#a8321c9ba617ed00787f20c4b23a254bc
[2] u_islower: http://icu-project.org/apiref/icu4c/uchar_8h.html#aa26a51b768147fbf34029f8141132815 <http://icu-project.org/apiref/icu4c/uchar_8h.html#aa26a51b768147fbf34029f8141132815>
[3] https://docs.perl6.org/language/regexes#Backslashed,_predefined_character_classes



>> Note: In no way would these properties obviate the need for CharacterSet, as CharacterSet API is independently useful and the right model for many whole-string operations.
> 
> No it isn’t.  A Grapheme-cluster based analog of CharacterSet would be a reasonable model to consider though.  It could conceptually support predicates like isEmoji()
> 

In the regex section I talk about character classes as being modeled by `(Character) -> Bool` rather than something that conforms to SetAlgebra. One big reason (beyond being awesomely convenient) is because I don’t think graphemes are something to be enumerated.

Theoretically speaking, I’m not sure if the set of graphemes is even recursively enumerable (e.g. zalgo-text or zalgo-emoji). Practically speaking, it darn-well might as well be uncountable. Best case, if even possible, any enumeration would be emergent behavior of encoding details and change significantly with each Unicode version. If we cannot enumerate graphemes, we cannot answer equality, isDisjoint, isSubset, etc.

I don’t know if those operations are important for a grapheme analogue. They just demonstrate that a grapheme set follows a different model than a scalar set, and different than what we have traditionally used the word “Set” for in Swift, though you would know the history here better.

As far as uses of such a grapheme set are concerned, they seem equivalent to application of a regex consisting of a character class. For example, rangeOfCharacter(from:options:) could be accomplished with the below. In an ever-futile attempt to avoid focusing on specific syntax, I’ll form an army of straw-people, using 「」 delimiters for a regex literal, « » to denote character classes, subscript on String to model application of a Regex, parenthesis to denote a capture that must always be a decl (even if the name is dropped by using `_`), and Regex literals defaulting to whole-string matching rather than the first partial-string match:

extension Character {
    var isEmoji: Bool { get }
}
extension Regex {
    var rangeOfMatch: Regex<(Range<String.Index>)>  // Drops the captured value, just wants the range
    var allMatches: Regex<LazySequence<T>>
}

let emojiPattern =「(let _ = «.isEmoji»)」 // Regex<(Character)>

let theFirst  = myString[emojiPattern.rangeOfMatch.firstPartial] // Range<String.Index>
let allOfThem = myString[emojiPattern.rangeOfMatch.allMatches] // LazySequence<Range<String.Index>>

(Alternatively, Regex could have an init taking a (Character) -> Bool)

In such a world, what do you image the role of a grapheme set type is? It might have more uses as provided by the platform, alongside things such as localization, linguistic analysis, etc.


>> ### Interpolation: Building Strings Up
>> 
>> String interpolation, especially when used in conjunction with other language features such as multi-line string literals, present a compelling, visual way to construct strings.  For example, the following theoretical printout of a benchmark result:
>> 
>> ```swift
>> print("""
>>  Result: \(benchmark.result)
>>  Execution time:
>>    user(ms): \(benchmark.userTime.milliseconds)
>>    system(ms): \(benchmark.systemTime.milliseconds)
>>    wall(ms): \(benchmark.wallTime.milliseconds)
>>  """)
>> ```
>> 
>> String interpolation has many issues and limitations:
>> 
>> * Current interface is inefficient. Individual segments are turned into Strings before interpolation. While small string optimizations alleviate some of this, larger segments incur unneeded heap traffic and tracking.
> 
> Completely agreed: personally I’d consider this a critical P1 to fix for ABI stability, far more important than the ergonomic issues you’re describing here.
> 
>> * While it is [fairly customizable](https://oleb.net/blog/2017/01/fun-with-string-interpolation/ <https://oleb.net/blog/2017/01/fun-with-string-interpolation/>), the interface is clunky and deprecated. Brent Royal-Gordon has clearly articulated the problems and provided potential solutions in [Fix ExpressibleByStringInterpolation](https://github.com/brentdax/swift-evolution/blob/230e3ab5fe4a4acaabf52806f69ae3dd1ff9f538/proposals/NNNN-fix-expressible-by-string-interpolation.md <https://github.com/brentdax/swift-evolution/blob/230e3ab5fe4a4acaabf52806f69ae3dd1ff9f538/proposals/NNNN-fix-expressible-by-string-interpolation.md>). This approach does need to be adjusted to address the performance issue above.
> 
> Yes, yes yes.
> 
>> * Interpolated expressions are supplied inside the literal, meaning they cannot be passed around like format strings without extra boilerplate (e.g. a wrapping function).
> 
> The original (2013 era) idea was to allow for interpolants to take multiple arguments (the contents of \(….) would be passed as the argument list, so you could specify formatting information as subsequent parameters, e.g.:  
> 
>     “your file access is set to \(mode, .Octal)”.
> 
> or:
>     “your file access is set to \(mode, format: .Octal)”.
>   
> or something like that.  Of course each type would define its own formatting modes, and this would probably be reflected through a type-specific initializer.
> 

That looks very similar to one of Brent’s proposals (that I can no longer find a link to). Was there a reason it didn’t happen other than not getting around to it?

>> ### Regex: Tearing Strings Apart
> 
> This is clearly necessary but also clearly not part of Swift 5.  Regex’s should themselves be the subject of an intense design process.  Swift 6 pretty please???  I’d suggest splitting this section out and starting a regex manifesto. :-)
> 

Since it is so big, I feel like if we’re going to make any progress we need to start soon. Of course, any focus will be preempted as needed by ABI stability. Even if Swift N doesn’t get the fully formed feature, we’ll pick up things along the way. E.g. as you mention there’s a parity between Character’s Bool properties and character classes.

I definitely want to circulate and discuss these ideas a little bit before writing a “manifesto”.

> 
>> This wouldn’t be worth doing if all it afforded was replacing `“foo”` with `/foo/` inside calls to NSRegularExpression. Regexes become compelling when viewed in conjunction with other language constructs such as switches (i.e. a new kind of [Pattern](https://developer.apple.com/library/content/documentation/Swift/Conceptual/Swift_Programming_Language/Patterns.html <https://developer.apple.com/library/content/documentation/Swift/Conceptual/Swift_Programming_Language/Patterns.html>)), an ability to bind to variables, interpolate other matchers, a rich type system, etc.
> 
> +1
> 
>> #### One Possible Approach
>> 
>> This is one, brief and hand-wavy, approach to adding regexes to Swift. This is not a proposal, and **all syntax is a total strawman** meant for illustration purposes only. The regex literal syntax below is an adaptation of  [Chris Lattner’s strawman syntax](https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20171120/041619.html <https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20171120/041619.html>). In this syntax, whitespace is insignificant.
> 
> Nice. :-)
> 
>> 
>> ###### Compile Regex Literals into `Regex<T>`
> 
> Agreed.
> 
> Ok, I might sound like a madman here, but you realize that there is a really logical progression here:
> 
> 1. You teach the compiler to parse and validate regex literals.
> 2. You teach the compiler to know what regex’s are actually regular, so you can use linear time algorithms like DFA/NFAs instead of exponential algorithms where possible.
> 3. You teach the compiler to form the DFA.
> 4. You then teach the compiler that a switch on string where you have multiple regular regex’s can be turned into a flex style “lexer” that matches all of the cases in parallel.
> 5. You rewrite the Swift compiler in Swift, because obviously the lexer was the hard part. :-)
> 

Clearly the hardest part!

> QED 
> 
>> #### Feature Set
>> 
>> The examples above merely illustrate how regex literals could appear, but we need to carefully design the feature set in accordance with Swift’s goals and principles as a language.
>> 
>> One goal of Swift is to eventually [be better at string processing than Perl](https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160725/025676.html <https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160725/025676.html>), so let’s take a look at Perl. Perl’s history is full of interesting insights. Perl organically evolved into the de-facto standard for regex-based string processing culminating in Perl 5. More importantly, they reflected on the [“mess”](https://perl6.org/archive/doc/design/apo/A05.html <https://perl6.org/archive/doc/design/apo/A05.html>) and cleaned it up in [Perl 6](https://docs.perl6.org/language/regexes <https://docs.perl6.org/language/regexes>).  Perl 6 also has a [PEG](https://dl.acm.org/citation.cfm?id=964001.964011 <https://dl.acm.org/citation.cfm?id=964001.964011>)-like [grammar system](https://docs.perl6.org/language/grammars <https://docs.perl6.org/language/grammars>) and we think similar approaches may be a part of Swift’s long-term future. For now, we’re focusing on regexes, although future considerations do impact the desired feature set.
> 
> Completely agreed. I’d strongly suggest thinking about this in terms of defining what the ultimate goal is, and then defining a progressive set of steps towards it.  We should design a solution that can support grammars in the fullness of time, but shouldn’t block on that to get basic regexes.
> 

I definitely want to see where some discussion here leads and whether the “StringDojo” effort provides good example fodder.

>> Swift contrasts with Perl in that Swift places greater importance on clarity and simplicity over raw expressive power and extreme configurability. However, these are some features Swift should outright steal from Perl 6.
>> 
>> * **Free form syntax**. Whitespace is insignificant, comments can occur inside literals, etc., all of which aids readability
>> * **Named Captures**. If something’s worth capturing, it’s worth giving it a name.
>> * **Multi-line matching**. A new line is just another character, not a match terminator. 
>> * **Regex interpolation**. Refactor complex regexes into simpler ones (with names!) and interpolate them. This is perhaps the most Swifty aspect of Perl 6 regexes!
>> * Miscellaneous features common to modern regex systems, such as:
>> 	* Quantifier modifiers, e.g. frugal (“lazy”), ratcheting (“possessive”)
>> 	* Non-capturing grouping, ratcheting (“atomic”) grouping
>> 	* Separator-quantification (“[modified quantifier](https://docs.perl6.org/language/regexes#Modified_quantifier <https://docs.perl6.org/language/regexes#Modified_quantifier>:_%,_%%)”)
>> 	* String literals
>> 	* Rich set of character classes (more on that next)
> 
> I want to live in the future, give it to me now :-) :-)
> 
>> 
>> ##### Character Classes
>> 
>> We will want a broad array of builtin pre-defined character classes for use in regexes. Additionally, users should be able to construct and define their own, whether from CharacterSets (i.e. enumeration) or through an arbitrary closure of type `(Character) -> Bool`. There should be syntax to use user-defined character classes in a regex literal.
>> 
>> Character classes should provide many set-algebraic operations such as union, intersection, inversion, mutual-exclusion (“xor”), and set difference. (Note that for our purposes, the standard library’s `SetAlgebra` protocol is not appropriate for character classes, as `SetAlgebra` requires conformances to be able to check for subset relationships which is intractable, if not undecidable, for closures over graphemes.)
> 
> Have you considered making the ‘named character class’ regex feature be sugar for calling Character methods inline in the rest of the regex?

If by “Character methods” you mean “such as those new Unicode computed properties of type Bool we’re adding”, then that’s the plan!

>  This seems best to me both because it offers user extensibility as well as an easy way to explain the behavior of the model.
> 
>> ##### Arguable Features
>> 
>> Some literal syntactic features common in other regex systems may be unnecessary or a bad fit for Swift.
>> 
>> 1. Substitution literals (e.g. sed-like `s/regex/replacement/g`)
>> 	* If String had better substitution APIs, such as one that took a `Regex<T>` and a closure from `(T) -> String` alongside options such as `.all` or `.firstOnly`, that might be sufficient and/or clearer anyways.
>> 2. Adverbs (e.g. the `i` meaning case-insensitive outside the literal’s delimiters)
>> 	* These could be Regex variants accessible via computed properties, which may be clearer
>> 3. Anonymous (numbered) captures
>> 	* In general, if something’s worth capturing it’s worth giving it a name.
>> 	* A construct like `(let _ = \d+)` could produce an unlabeled tuple element.
>> 	* Tuples (the representation for literal captures) support 0-based access already, e.g. `myCaptures.3`.
>> 4. Enumerated character class ranges for non-ASCII graphemes (the `-` in  `[A-Z]`)
>> 	* Convenient for ASCII, but meaningless for graphemes.
>> 	* Good built-in classes and convenient ways for users to combine and enumerate their own may even obviate the need for ranges for ASCII.
> 
> Agreed that all of those are marginal or discardable except for #1.  IMO, we’ll need some solution for it, but it need not have first class language syntax like sed: a well designed API would be perfectly fine.
> 

Right, something like (strawman):

extension String {
	mutating func substitute<T>(_: Regex<T>, _: (T) -> String)
}

>> ##### Forbidden Features (with Alternatives)
>> 
>> There are some regex features that we argue should be disallowed, or at least carefully constrained. Regexes are powerful, specifying simple textual analysis in a concise syntax, but are easily misused in ways that add exponentially-compounding complexity in non-obvious ways. This misuse contrasts with Swift’s values of balancing expressivity with clarity, where complex behavior should be *explicitly expressed in code*.
> 
> Right, this is one of the things that I think would be nice about tying character classes together with methods.  The really nice thing about this is that tying it to predicates on Characters keeps usage of these as *actually regular* which permits efficient implementation, even if the generated code does a call to the user defined method.
> 

+1, and communicating that such a method could be invoked many times in an unspecified order.

>> However, we think there’s so much flexibility in what can accomplished without the most general form of these features that it might not be an issue in practice.
>> 
>> ###### 1. Arbitrarily-Complex Look-Around Assertions
>> 
>> Assertions are the ability for a regex to “peek” ahead or behind at surrounding characters without consuming them as part of a match. They can be positive or negative. Arbitrarily-complex assertions include the ability to do captures as part of an assertion, unknown-length assertions satisfied by subpatterns, etc., which can have non-intuitively complex interactions with other parts of the regex.
>> 
>> **Alternative:** The following constrained assertions could be permissible:
>> 
>> 1. Anchors (zero-width assertions), which test for positions within a string, e.g. `^` and `$`.
> 
> ^ and $ anchors are clearly ok.  They keep it regular and are widely used and understood.  I don’t see a pitfall to supporting them.
> 
>> 2. Fixed-length character-class or closure-based assertions, which help avoid [common pitfalls in negation](https://stackoverflow.com/questions/406230/regular-expression-to-match-a-line-that-doesnt-contain-a-word <https://stackoverflow.com/questions/406230/regular-expression-to-match-a-line-that-doesnt-contain-a-word>) as well as aid readability. This would permit fixed-length arbitrary computation *explicitly expressed in code*. These might also encompass boundaries.
>> 
>> 3. Invoking a closure on the string prefix/suffix prior/after the assertion point. This would permit variable-length arbitrary computation, but *explicitly expressed in code*.
> 
> I don’t understand the tradeoffs involved on this - I’d suggest lazily evaluating this after the general model is more nailed down.  We’ll know more and understand the implications of the decisions better.
> 
>> ###### 2. Arbitrary Uses of Backreferences
>> 
>> Backreferences allow testing a later portion of a string against a previously-matched portion. Unfortunately, in their most general form, they require exploration of the entire search space and [make matching NP-complete](https://perl.plover.com/NPC/NPC-3SAT.html <https://perl.plover.com/NPC/NPC-3SAT.html>), that is take exponential time. Very simple uses of backreferences just want to check for an unambiguous capture after the fact, for which a `where` clause suffices. Very complex uses are better served with an exponential loop *explicitly expressed in code*.
>> 
>> **Alternative:** There is a range of convenient uses between these two extremes, which we may be able to tractably address. The feasibility of these is still under investigation, but some potential constrained backreferences may include:
>> 
>> 1. A “ratcheting” backreference, where upon matching (i.e. the current section of the string satisfies the backreference), the referred-to capture is “frozen” in place, so that alternative assignments need not be explored.
>> 2. A delimiter-checking construct. While eventually we want to encourage the user to write a real grammar/parser (assuming some future grammar/parser description system in Swift), simple delimiter matching can greatly aid readability. This might even be general and powerful enough to define tasks such as pairing XML tags and ignoring escaped delimiters or delimiters inside literals. Ideally, this would be done in a direction conducive to a future parsing solution.
> 
> Interesting, I’m not familiar with the tradeoffs implied by these. I have always assumed that we would support the generality of exponential regex matching but optimize the common regular cases.  It would be awesome if there is a better model, but I hope we don’t give up too much familiarity and power in the process.
> 

I think these will be clearer after more prototyping and investigation. My main inspiration for how regexes could fit in a language is Perl 6, and my inspiration for implementation is RE2. Perl 6 strongly encourages users off of regex and onto grammars (made up of “ratcheting” tokens), which is a more appropriate tool. RE2 forbids backreferences and most non-trivial assertions altogether. I think we can extend the RE2-style approach to service many of these uses without compromising its goals.

>> 
>> #### Semantics
>> 
>> Matching semantics should roughly corresponding to [UTS #18 Level 2](https://unicode.org/reports/tr18/#Extended_Unicode_Support <https://unicode.org/reports/tr18/#Extended_Unicode_Support>) of Unicode support. That is, “any character” would be any Swift Character (a grapheme), matching obeys canonical equivalence, etc. This is consistent with String’s overall philosophy of universality and harm reduction. This does require that the implementation use Unicode rules and tables provided at run time, e.g. for grapheme breaking.
> 
> +1.  Of course we’d fastpath ASCII.
> 
>> #### Implementation
>> 
>> With static knowledge, many regexes can be compiled into [efficient, minimal state machines](https://swtch.com/%7Ersc/regexp/regexp1.html <https://swtch.com/%7Ersc/regexp/regexp1.html>) (albeit with Unicode rules and tables as input at run time). [Virtual machine approaches](https://swtch.com/%7Ersc/regexp/regexp2.html <https://swtch.com/%7Ersc/regexp/regexp2.html>) can make it easier to specify and guarantee behavior while also being much more extensible. They also may alleviate some ABI-stability concerns, as it’s easier to keep a bytecode stable while growing new features.
>> 
>> Delivering ergonomic regexes in Swift 5 should not be blocked on having an efficient implementation available immediately. However, long term, it is important to have a [predictable and consistent performance model](https://github.com/google/re2/wiki/WhyRE2 <https://github.com/google/re2/wiki/WhyRE2>) to prevent [unanticipated exponential behavior](https://en.wikipedia.org/wiki/ReDoS <https://en.wikipedia.org/wiki/ReDoS>). Thus, implementation concerns are relevant in scoping the initial feature set (e.g. regarding backreferences). 
> 
> Agreed.  We should define the model with the aim of making it easily optimizable.  Once the model is defined and implemented, getting compiler-generated DFA/NFA implementations would be a great intern project.
> 
>> Some regexes are amenable to task-parallel strategies (i.e. threads), data-parallel strategies (i.e. vector or GPU), or both. Swift will probably grow new strategies and combinations of strategies over time. Implementations can also take advantage of the aforementioned performance flags (e.g. isCanonical) to greatly speed up matches.
>> 
>> The interface to a regex engine will be part of Swift’s ABI, including whatever intermediate representation regexes are compiled into. This could be a simple regex AST, a virtual machine bytecode, etc., but it needs to be extensible to accommodate any potential future additions and implementation strategies.
> 
> Obviously we should define a new Turing complete bytecode format for the regex VM, then build an LLVM backend that generates it, then compile the swift compiler into it, then embed that into the Swift compiler and use it to implement a first class macro system in Swift.
> 

Finally, a reasonable approach to a hygienic macro system.

> Oh wait, no we shouldn’t.  :-)
> 
> -Chris
> 
> 



-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.swift.org/pipermail/swift-dev/attachments/20180113/917745e0/attachment.html>


More information about the swift-dev mailing list