[swift-evolution] Strings in Swift 4

Dave Abrahams dabrahams at apple.com
Wed Jan 25 21:32:06 CST 2017


on Tue Jan 24 2017, Chris Lattner <sabre-AT-nondot.org> wrote:

> On Jan 24, 2017, at 12:05 AM, Chris Eidhof via swift-evolution <swift-evolution at swift.org> wrote:
>> 
>> I agree that being able to implement parsers in a nice way can be a
>> huge step forward in being really good at string processing.
>
> +1 from me as well, I agree with Joe that Swift can learn a lot from
> Perl 6 grammar’s and we should take the time to do it right.  Below I
> say “regex” a lot, but I really mean a more general grammar system
> (and even Perl 5 regex’s aren’t regular :-)
>
>> There are a couple of possibilities that come to mind directly:
>> 
>> 1. Build parsers right into the language (like Perl 6 grammars)
>> 2. Provide a parser combinator language
>> (e.g. https://github.com/davedufresne/SwiftParsec
>> <https://github.com/davedufresne/SwiftParsec>).
>> 3. Rely on external tools like bison/yacc/etc.
>> 4. Make it easy for people to write hand-written parsers (e.g. by providing an NSScanner
> alternative).
>
> My opinion is that #1 is the right path to start with, but it wouldn’t
> preclude doing #2.  Here’s my rationale / half-baked thought process:
>
> There are two important use cases for regex's: the literal case
> (e.g. /aa+b*/) and the dynamically computed case.  The former is
> really what we’re talking about here, the latter should obviously be
> handled with some sort of Regex type which can be formed from string
> values or whatever.  

Ideally these patterns interoperate so that you can combine them.

> Regex literals in an expression context should default to producing
> the Regex type of course.
>
> This means that when you pass a regex literal into an API call
> (e.g. split on a string), it is really just creating something of
> Regex type, and passing it down.  If you wanted to introduce a parser
> combinator DSL, you could totally plug it into the system, by having
> the combinators produce something of the Regex type.
>
> So why bless regex literals with language support at all?  I see
> several reasons:
>
> 1. Diagnostics: These will be heavily used by people, and you want to
> have good compiler error and warning messages for them.  You want to
> be able to validate the regex at compile time, not wait until runtime
> to detect syntactic mistakes like unbalanced parens.
>
> 2. Syntax Familiarity: To take advantage of people’s familiarity with
> other languages, we should strive to make the basic regex syntax
> familiar and obvious.  I’d argue that /aa+b*/ should “just work” and
> do the thing you think it does.  Relying on a combinator library to do
> that would be crazy.
>
> 3. Performance: Many regex’s are actually regular, so they can be
> trivially compiled into DFAs.  There is a well understood body of work
> that can be simply dropped into the compiler to do this. Regex’s that
> are not regular can be compiled into hybrid DFA/NFA+backtracking
> schemes, and allowing a divide and conquer style of compiler
> optimization to do this is the path that makes the most sense (to me
> at least).  Further, if you switch on a string and have a bunch of
> cases that are regex’s, you’d obviously want the compiler to generate
> a single state machine (like a lexer), not check each pattern in
> series.
>
> 4. Pattern matching greatness: One of the most obnoxious/error prone
> aspects of regex’s in many languages is that when you match a pattern,
> the various matches are dumped into numbered result values (often by
> the order of the parens in the pattern).  This is totally barbaric: it
> begs for off by one errors, often breaks as the program is being
> evolved/maintained, etc.  It is just as bad as printf/scanf!
>
> You should instead be able to directly bind subexpressions into local
> variables.  For example if you were trying to match something like
> “42: Chris”, you should be able to use straw man syntax like this:
>
>    case /(let id: \d+): (let name: \w+)/: print(id); print(name)

This is a good start, but inadequate for handling the kind of recursive
grammars to which you want to generalize regexes, because you have to
bind the same variable multiple times—often re-entrantly—during the same
match.  Actually the Kleene star (*) already has this basic problem,
without the re-entrancy, but if you want to build real parsers, you need
to do more than simply capture the last substring matched by each group.

> Unless we were willing to dramatically expand how patterns work, this
> requires baking support into the language.

I don't understand the "Unless" part of that sentence.  It seems obvious
that no expansion of how patterns work could make the above work without
language changes.

> 5. Scanner/“Formatter" integration: Taking the above one step farther,
> we could have default patterns for known types (and make it extensible
> to user defined types of course). For example, \d+ is the obvious
> pattern for integers, so you should be able to write the above like
> this (in principle):
>
>    case /(let id: Int): (let name: \w+)/: print(id); print(name)
>
> In addition to avoiding having to specify \d+ all the time, this
> eliminates the need for a “string to int” conversion after the pattern
> is matched, because id would be bound as type Int already.

Yup.

> Anyway, to summarize, I think that getting regex’s into the language
> is really important and expect them to be widely used.  As such, I
> think it is worth burning compiler/language complexity to make them be
> truly great in Swift.

Thanks for this post, Chris; it clarifies beautifully the reasons that
we're not trying to tackle regexes right away.

-- 
-Dave


More information about the swift-evolution mailing list