[swift-evolution] Strings in Swift 4

Thu Jan 26 19:02:23 CST 2017

To throw another ingredient into the mix, there are issues for Unicode regex that don’t appear in more “traditional” regex implementations. See:

http://userguide.icu-project.org/strings/regexp

For example:

> Case insensitive matching is specified by the UREGEX_CASE_INSENSITIVE flag during pattern compilation, or by the (?i) flag within a pattern itself.  Unicode case insensitive matching is complicated by the fact that changing the case of a string may change its length.  See http://unicode.org/faq/casemap_charprop.html for more information on Unicode casing operations.
> 
> Examples:
> 	• pattern "fussball" will match "fußball" or "fussball"
> 	• pattern "fu(s)(s)ball" or "fus{2}ball" will match "fussball" or "FUSSBALL" but not "fußball".
> 	• pattern "ß" will find occurrences of "ss" or "ß"
> 	• pattern "s+" will not find "ß"
> 
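
As a minimal sketch of the length-changing point, assuming Foundation's NSRegularExpression (whose pattern syntax is documented as ICU's) stands in for ICU here: the helper name matchesIgnoringCase is hypothetical, and the match results are printed rather than asserted, since they depend on the ICU version underneath.

```swift
import Foundation

// Hypothetical helper: true if `pattern` matches anywhere in `text`
// with the case-insensitive option set.
func matchesIgnoringCase(_ pattern: String, in text: String) -> Bool {
    guard let regex = try? NSRegularExpression(pattern: pattern,
                                               options: [.caseInsensitive]) else {
        return false
    }
    let whole = NSRange(text.startIndex..<text.endIndex, in: text)
    return regex.firstMatch(in: text, options: [], range: whole) != nil
}

// Full Unicode case mapping can change the length of a string.
let lower = "fußball"
let upper = lower.uppercased()
print(upper)                      // FUSSBALL
print(lower.count, upper.count)   // 7 8

// Try the quoted examples directly and print whatever the engine reports.
for (pattern, text) in [("fussball", "fußball"), ("ß", "ss"), ("s+", "ß")] {
    print(pattern, text, matchesIgnoringCase(pattern, in: text))
}
```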

and

> 
> Flag w (API constant UREGEX_UWORD): Controls the behavior of \b in a pattern. If set, word boundaries are found according to the definitions of word found in Unicode UAX 29, Text Boundaries. By default, word boundaries are identified by means of a simple classification of characters as either “word” or “non-word”, which approximates traditional regular expression behavior. The results obtained with the two options can be quite different in runs of spaces and other non-word characters.
> 
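
A rough way to compare the two behaviors, assuming NSRegularExpression's .useUnicodeWordBoundaries option corresponds to UREGEX_UWORD: the helper boundaryCount and the sample string are made up for illustration, and the counts are printed rather than stated because they depend on the underlying ICU.

```swift
import Foundation

// Hypothetical helper: number of positions at which \b matches in `text`.
func boundaryCount(in text: String, unicodeWordBoundaries: Bool) -> Int {
    let options: NSRegularExpression.Options =
        unicodeWordBoundaries ? [.useUnicodeWordBoundaries] : []
    guard let regex = try? NSRegularExpression(pattern: "\\b", options: options) else {
        return 0
    }
    let whole = NSRange(text.startIndex..<text.endIndex, in: text)
    return regex.numberOfMatches(in: text, options: [], range: whole)
}

// An apostrophe and a run of spaces are where the two definitions tend to differ.
let sample = "can't stop   now"
print(boundaryCount(in: sample, unicodeWordBoundaries: false))
print(boundaryCount(in: sample, unicodeWordBoundaries: true))
```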

If regexes are going to be used on human language text, these are all important considerations.

Debbie

> On Jan 26, 2017, at 11:15 AM, Dave Abrahams via swift-evolution <swift-evolution at swift.org> wrote:
> 
> 
> on Wed Jan 25 2017, Chris Lattner <sabre-AT-nondot.org> wrote:
> 
>> On Jan 25, 2017, at 7:32 PM, Dave Abrahams <dabrahams at apple.com> wrote:
>>>> There are two important use cases for regexes: the literal case
>>>> (e.g. /aa+b*/) and the dynamically computed case.  The former is
>>>> really what we’re talking about here, the latter should obviously be
>>>> handled with some sort of Regex type which can be formed from string
>>>> values or whatever.  
>>> 
>>> Ideally these patterns interoperate so that you can combine them.
>> 
>> Yes, as I mentioned, the regex literal should form something of the
>> Regex type.  Any API that takes a Regex would work with them.
> 
> But I think we want distinct types for some of these patterns so they
> can capture compile-time knowledge.  That's why
> https://github.com/apple/swift/blob/master/test/Prototypes/PatternMatching.swift#L32
> has a Pattern protocol.  If you mean “type” in a looser sense that
> admits protocols, then we are aligned.
> 
>>>> You should instead be able to directly bind subexpressions into local
>>>> variables.  For example if you were trying to match something like
>>>> “42: Chris”, you should be able to use straw man syntax like this:
>>>> 
>>>>  case /(let id: \d+): (let name: \w+)/: print(id); print(name)
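
For comparison, here is roughly what extracting the same two values takes today without language support, sketched with NSRegularExpression and numbered capture groups. The function name parseEntry is hypothetical, and anchoring the pattern to the whole string is an assumption about the intended match.

```swift
import Foundation

// Hypothetical helper: parses input such as "42: Chris" into (id, name).
func parseEntry(_ text: String) -> (id: String, name: String)? {
    guard let regex = try? NSRegularExpression(pattern: "^(\\d+): (\\w+)$",
                                               options: []) else {
        return nil
    }
    let whole = NSRange(text.startIndex..<text.endIndex, in: text)
    // Each capture comes back as an NSRange that must be converted
    // to a String range before it can be used as a substring.
    guard let match = regex.firstMatch(in: text, options: [], range: whole),
          let idRange = Range(match.range(at: 1), in: text),
          let nameRange = Range(match.range(at: 2), in: text) else {
        return nil
    }
    return (String(text[idRange]), String(text[nameRange]))
}

if let entry = parseEntry("42: Chris") {
    print(entry.id)     // 42
    print(entry.name)   // Chris
}
```

The captures are addressed by position and the compiler knows nothing about how many groups the pattern has, which is the gap the straw man syntax is aiming at.
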
>>> 
>>> This is a good start, but inadequate for handling the kind of recursive
>>> grammars to which you want to generalize regexes, because you have to
>>> bind the same variable multiple times—often re-entrantly—during the same
>>> match.  Actually the Kleene star (*) already has this basic problem,
>>> without the re-entrancy, but if you want to build real parsers, you need
>>> to do more than simply capture the last substring matched by each group.
>> 
>> Please specify some more details about what the problem is, because
>> I’m not seeing it.  Lots of existing regex implementations work with
>> “(…)*” patterns by binding to the last value.  From my perspective,
>> this is pragmatic, useful, and proven.  What is your specific concern?
> 
> My specific concern is that merely capturing the last match is
> inadequate to many real parsing jobs.
> 
>> 
>> 
>> When you say “real” parsers, you’re implicitly insulting the “unreal”
>> parsers, 
> 
> No offense intended, truly.  As a PL guy I assumed you'd know what I
> meant.  As you know, regexes aren't sufficiently powerful to handle
> parsing languages like Swift, and even if they were, retaining only the
> last match of a capture would be insufficient to go from recognizing
> valid input (parsing) to semantic analysis.
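
To make the "last match" behavior concrete, here is a small sketch using NSRegularExpression; the helper lastCapture and the comma-separated input are invented for illustration. A capture group inside a repetition reports only its final iteration, so recovering every item takes a second pass outside the regex.

```swift
import Foundation

// Hypothetical helper: the text captured by group 1 of a repeated group.
func lastCapture(in input: String) -> String? {
    // One capture group inside a repeated non-capturing group.
    guard let regex = try? NSRegularExpression(pattern: "^(?:(\\w),?)+$",
                                               options: []) else {
        return nil
    }
    let whole = NSRange(input.startIndex..<input.endIndex, in: input)
    guard let match = regex.firstMatch(in: input, options: [], range: whole),
          let range = Range(match.range(at: 1), in: input) else {
        return nil
    }
    return String(input[range])
}

let input = "a,b,c"
print(lastCapture(in: input) ?? "no match")   // c

// The earlier iterations ("a" and "b") are gone from the match result,
// so collecting all items needs separate code.
let items = input.split(separator: ",").map(String.init)
print(items)   // ["a", "b", "c"]
```
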
> 
>> without explaining what the “real” ones are, or why they matter.
>> Please provide specific use cases that would be harmed by this
>> approach.
> 
> I'm talking about the kinds of parsers made possible by Perl 6 grammars,
> which can be recursive.  Some examples:
> 
> http://stackoverflow.com/questions/18561179/example-of-perl-6-grammar-with-operator-precedence-rules?answertab=active#tab-top
> 
>>>> Unless we were willing to dramatically expand how patterns work, this
>>>> requires baking support into the language.
>>> 
>>> I don't understand the "Unless" part of that sentence.  It seems obvious
>>> that no expansion of how patterns work could make the above work without
>>> language changes.
>> 
>> I’m not a believer in this approach, but someone could argue that we
>> should allow arbitrary user-defined syntactic expansion of the pattern
>> grammar, similar to how we allow syntactic expansion of the expression
>> grammar through operator definitions.  This is what I meant by
>> “dramatically expanding” how patterns work.
> 
> Oh, I see what you mean.  You need to bake *something* into the
> language.  That thing could either be regex support or it could be
> something more general, like a macro system that allowed beautiful regex
> support to be built in a library.  Well, I'd love to have the latter,
> but wouldn't be willing to sacrifice much quality-of-user-experience
> in order to get it.  It would have to be roughly indistinguishable
> from the end-user's point-of-view.
> 
> -- 
> -Dave
> _______________________________________________
> swift-evolution mailing list
> swift-evolution at swift.org
> https://lists.swift.org/mailman/listinfo/swift-evolution