<html><head><meta http-equiv="Content-Type" content="text/html charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class=""><div class="">If you haven’t seen Perl 6 grammars, I highly suggest taking a look.</div><div class=""><br class=""></div><div class=""><a href="https://perl6advent.wordpress.com/2009/12/21/day-21-grammars-and-actions/" class="">https://perl6advent.wordpress.com/2009/12/21/day-21-grammars-and-actions/</a></div><div class=""><br class=""></div><div class=""><a href="https://docs.perl6.org/language/grammars" class="">https://docs.perl6.org/language/grammars</a></div><div class=""><br class=""></div><div class=""><br class=""></div><div class=""><br class=""></div><div class="">My off-the-cuff straw-man attempt to apply this to Swift. (Fair warning, I’m no expert on Perl 6 grammars, just an enthusiast)</div><div class=""><br class=""></div><div class=""><div class=""><font face="Menlo" class="">grammar FooGrammar {</font></div><div class=""><font face="Menlo" class=""> token top = {</font></div><div class=""><font face="Menlo" class=""> <string> |</font></div><div class=""><font face="Menlo" class=""> <number></font></div><div class=""><font face="Menlo" class=""> }</font></div><div class=""><font face="Menlo" class=""> token string {</font></div><div class=""><font face="Menlo" class=""> pattern {</font></div><div class=""><font face="Menlo" class=""> '\"' <literal>\w+ '\"'</font></div><div class=""><font face="Menlo" class=""> }</font></div><div class=""><font face="Menlo" class=""> matched {</font></div><div class=""><font face="Menlo" class=""> // literal is in scope and holds the match</font></div><div class=""><font face="Menlo" class=""> // manipulate the AST</font></div><div class=""><font face="Menlo" class=""> return SomeTypeAdoptingAstNodeProtocol(x: y, value: literal)</font></div><div class=""><font face="Menlo" class=""> }</font></div><div class=""><font face="Menlo" class=""> }</font></div><div class=""><font face="Menlo" class=""> token number {</font></div><div class=""><font face="Menlo" class=""> pattern {</font></div><div class=""><font face="Menlo" class=""> /regex here/</font></div><div class=""><font face="Menlo" class=""> }</font></div><div class=""><font face="Menlo" class=""> matched { </font></div><div class=""><font face="Menlo" class=""> }</font></div><div class=""><font face="Menlo" class=""> }</font></div><div class=""><font face="Menlo" class="">}</font></div><div class=""><font face="Menlo" class=""><br class=""></font></div><div class=""><font face="Menlo" class="">// produces an AST:</font></div><div class=""><font face="Menlo" class="">let ast = FooGrammar.parse("some input")</font></div></div><div class=""><br class=""></div><div class=""><br class=""></div><div class="">The key is that you use regexes for the basic tokens but you combine tokens more or less along BNF lines. When a token matches, the matched() block executes and you can manipulate the resulting AST. Presumably the AST types would come from the standard library. </div><div class=""><br class=""></div><div class="">That is also a stepping stone to a hygienic macro system where the macro invocation can receive an AST representing the syntax inside the macro. You could even require a macro to define a grammar so the compiler could at least validate the syntax of a macro is well-formed.</div><div class=""><br class=""></div><div class=""><br class=""></div><div class="">Russ</div><div class=""><br class=""></div><br class=""><div><blockquote type="cite" class=""><div class="">On Jan 25, 2017, at 7:27 AM, Chris Eidhof <<a href="mailto:chris@eidhof.nl" class="">chris@eidhof.nl</a>> wrote:</div><br class="Apple-interchange-newline"><div class=""><div dir="ltr" class="">Great stuff!<div class=""><br class=""></div><div class="">I wonder if built-in grammars (that's what Perl calls them) would work only for things that are backed by string literals, or if it's worth the time/effort to make them work for other kind of data as well. For example, what if you write a grammar to tokenize (yielding some sequence of `Token`s), and then want to parse those `Token`s? Or what if you want to parse other kinds of data? Or should we try to make the 80% case work (only provide grammar/regex literals for Strings) to avoid complexity?</div><div class=""><br class=""></div><div class="">I think it's worth looking at parser combinators. The really cool thing about them is that they provide a few basic elements, and well-defined ways to combine those elements. Specifically, it's interesting to look at the problems surrounding parser combinators and seeing how we could do better:</div><div class=""><br class=""></div><div class="">- Error messages can be cryptic</div><div class="">- Because it's built on top of the language, you can't really do stuff like binding subexpressions to local variables</div><div class="">- Once you allow something like flatMap in a parser combinator library, you lose the possibility to optimize by rewriting the parser structure. This is because the parser that comes out of the rhs of a flatMap can depend on the parsed output of the lhs. </div><div class=""><br class=""></div><div class="">To me it seems like there's a lot of (exciting) work to be done to get this right :).</div><div class=""><br class=""></div></div><div class="gmail_extra"><br class=""><div class="gmail_quote">On Wed, Jan 25, 2017 at 6:35 AM, Chris Lattner <span dir="ltr" class=""><<a href="mailto:sabre@nondot.org" target="_blank" class="">sabre@nondot.org</a>></span> wrote:<br class=""><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word" class=""><span class="">On Jan 24, 2017, at 12:05 AM, Chris Eidhof via swift-evolution <<a href="mailto:swift-evolution@swift.org" target="_blank" class="">swift-evolution@swift.org</a>> wrote:<br class=""></span><div class=""><span class=""><blockquote type="cite" class=""><br class="m_8210535650356298369Apple-interchange-newline"><div class=""><div dir="ltr" class="">I agree that being able to implement parsers in a nice way can be a huge step forward in being really good at string processing.</div></div></blockquote><div class=""><br class=""></div></span><div class="">+1 from me as well, I agree with Joe that Swift can learn a lot from Perl 6 grammar’s and we should take the time to do it right. Below I say “regex” a lot, but I really mean a more general grammar system (and even Perl 5 regex’s aren’t regular :-)</div><span class=""><br class=""><blockquote type="cite" class=""><div class=""><div dir="ltr" class=""><div class="">There are a couple of possibilities that come to mind directly:</div><div class=""><br class=""></div><div class="">1. Build parsers right into the language (like Perl 6 grammars)</div><div class="">2. Provide a parser combinator language (e.g. <a href="https://github.com/davedufresne/SwiftParsec" target="_blank" class="">https://github.com/<wbr class="">davedufresne/SwiftParsec</a>). </div><div class="">3. Rely on external tools like bison/yacc/etc.</div><div class="">4. Make it easy for people to write hand-written parsers (e.g. by providing an NSScanner alternative).</div></div></div></blockquote></span></div><div class=""><br class=""></div><div class="">My opinion is that #1 is the right path to start with, but it wouldn’t preclude doing #2. Here’s my rationale / half-baked thought process:</div><div class=""><br class=""></div><div class="">There are two important use cases for regex's: the literal case (e.g. /aa+b*/) and the dynamically computed case. The former is really what we’re talking about here, the latter should obviously be handled with some sort of Regex type which can be formed from string values or whatever. Regex literals in an expression context should default to producing the Regex type of course.</div><div class=""><br class=""></div><div class="">This means that when you pass a regex literal into an API call (e.g. split on a string), it is really just creating something of Regex type, and passing it down. If you wanted to introduce a parser combinator DSL, you could totally plug it into the system, by having the combinators produce something of the Regex type.</div><div class=""><br class=""></div><div class="">So why bless regex literals with language support at all? I see several reasons:</div><div class=""><br class=""></div><div class="">1. Diagnostics: These will be heavily used by people, and you want to have good compiler error and warning messages for them. You want to be able to validate the regex at compile time, not wait until runtime to detect syntactic mistakes like unbalanced parens.</div><div class=""><br class=""></div><div class="">2. Syntax Familiarity: To take advantage of people’s familiarity with other languages, we should strive to make the basic regex syntax familiar and obvious. I’d argue that /aa+b*/ should “just work” and do the thing you think it does. Relying on a combinator library to do that would be crazy.</div><div class=""><br class=""></div><div class="">3. Performance: Many regex’s are actually regular, so they can be trivially compiled into DFAs. There is a well understood body of work that can be simply dropped into the compiler to do this. Regex’s that are not regular can be compiled into hybrid DFA/NFA+backtracking schemes, and allowing a divide and conquer style of compiler optimization to do this is the path that makes the most sense (to me at least). Further, if you switch on a string and have a bunch of cases that are regex’s, you’d obviously want the compiler to generate a single state machine (like a lexer), not check each pattern in series.</div><div class=""><br class=""></div><div class="">4. Pattern matching greatness: One of the most obnoxious/error prone aspects of regex’s in many languages is that when you match a pattern, the various matches are dumped into numbered result values (often by the order of the parens in the pattern). This is totally barbaric: it begs for off by one errors, often breaks as the program is being evolved/maintained, etc. It is just as bad as printf/scanf! </div><div class=""><br class=""></div><div class="">You should instead be able to directly bind subexpressions into local variables. For example if you were trying to match something like “42: Chris”, you should be able to use straw man syntax like this:</div><div class=""><br class=""></div><div class=""> case /(let id: \d+): (let name: \w+)/: print(id); print(name)</div><div class=""><br class=""></div><div class="">Unless we were willing to dramatically expand how patterns work, this requires baking support into the language.</div><div class=""><br class=""></div><div class="">5. Scanner/“Formatter" integration: Taking the above one step farther, we could have default patterns for known types (and make it extensible to user defined types of course). For example, \d+ is the obvious pattern for integers, so you should be able to write the above like this (in principle):</div><div class=""><br class=""></div><div class=""><div class=""> case /(let id: Int): (let name: \w+)/: print(id); print(name)</div><div class=""><br class=""></div><div class="">In addition to avoiding having to specify \d+ all the time, this eliminates the need for a “string to int” conversion after the pattern is matched, because id would be bound as type Int already.</div><div class=""><br class=""></div></div><div class=""><br class=""></div><div class="">Anyway, to summarize, I think that getting regex’s into the language is really important and expect them to be widely used. As such, I think it is worth burning compiler/language complexity to make them be truly great in Swift.</div><span class="HOEnZb"><font color="#888888" class=""><div class=""><br class=""></div><div class="">-Chris</div><div class=""><br class=""></div><br class=""></font></span></div></blockquote></div><br class=""><br clear="all" class=""><div class=""><br class=""></div>-- <br class=""><div class="gmail_signature" data-smartmail="gmail_signature">Chris Eidhof</div>
</div>
</div></blockquote></div><br class=""></body></html>