[swift-evolution] Strings in Swift 4

Deborah Goldsmith goldsmit at apple.com
Fri Jan 27 18:11:34 CST 2017


I think they affect the implementation, and (to a small extent) the exposed semantics. It’s also possible that, to provide some of this functionality, it’s necessary to use an NFA implementation where a traditional regex might be able to use a DFA.

They certainly impact the implementation to the extent that a traditional regex wouldn’t exhibit Unicode-compliant behavior in some cases.
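
To make the length-change and word-boundary points concrete, here is a minimal sketch in Python (chosen only for illustration; it is not ICU and not any proposed Swift API). The helper name naive_caseless_equal is hypothetical and stands in for a matcher that compares one character at a time.

```python
import re


def naive_caseless_equal(a: str, b: str) -> bool:
    """Hypothetical per-character caseless comparison: lowercase each pair
    of characters and compare them position by position."""
    return len(a) == len(b) and all(x.lower() == y.lower() for x, y in zip(a, b))


def simple_words(text: str) -> list:
    """Word segmentation using the simple word/non-word classification for \\b."""
    return re.findall(r"\b\w+\b", text)


# Full Unicode case mapping can change a string's length:
# the single character "ß" uppercases to the two characters "SS".
word = "fußball"
print(len(word), len(word.upper()))

# Full case folding makes the two spellings compare equal...
print(word.casefold() == "fussball".casefold())

# ...but a per-character comparison cannot see that, because the lengths differ.
print(naive_caseless_equal("fußball", "fussball"))

# The simple \b classification splits at the apostrophe, whereas UAX 29
# keeps "can't" together as a single word.
print(simple_words("can't stop"))
```

A matcher that wants the first behavior has to allow one pattern character to consume more than one text character (or the reverse), which is where the implementation impact comes from.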

Debbie

> On Jan 26, 2017, at 7:31 PM, Dave Abrahams via swift-evolution <swift-evolution at swift.org> wrote:
> 
> 
> on Thu Jan 26 2017, Deborah Goldsmith <swift-evolution at swift.org> wrote:
> 
>> To throw another ingredient into the mix, there are issues for Unicode regex that don’t appear in
>> more “traditional” regex implementations. See:
>> 
>> http://userguide.icu-project.org/strings/regexp
>> 
>> For example:
>> 
>>> Case insensitive matching is specified by the
>>> UREGEX_CASE_INSENSITIVE flag during pattern compilation, or by the
>>> (?i) flag within a pattern itself.  Unicode case insensitive
>>> matching is complicated by the fact that changing the case of a
>>> string may change its length.  See
>>> http://unicode.org/faq/casemap_charprop.html for more information on
>>> Unicode casing operations.
>>> 
>>> Examples:
>>> 	• pattern "fussball" will match "fußball" or "fussball"
>>> 	• pattern "fu(s)(s)ball" or "fus{2}ball" will match "fussball" or "FUSSBALL" but not "fußball"
>>> 	• pattern "ß" will find occurrences of "ss" or "ß"
>>> 	• pattern "s+" will not find "ß"
>>> 
> 
> These all appear to be issues for users to consider rather than design
> issues for the regex implementation.  Am I mistaken?
> 
>> and
>> 
>>> 
>>> w UREGEX_UWORD Controls the behavior of \b in a pattern. If set,
>>> word boundaries are found according to the definitions of word found
>>> in Unicode UAX 29, Text Boundaries. By default, word boundaries are
>>> identified by means of a simple classification of characters as
>>> either “word” or “non-word”, which approximates traditional regular
>>> expression behavior. The results obtained with the two options can
>>> be quite different in runs of spaces and other non-word characters.
>>> 
>> 
>> If regexes are going to be used on human language text, these are all
>> important considerations.
> 
> Yup, but I don't see how they affect the design, other than that maybe
> matching on a LocalizedString type would use UREGEX_UWORD by default.
> 
> -- 
> -Dave
> 
> _______________________________________________
> swift-evolution mailing list
> swift-evolution at swift.org
> https://lists.swift.org/mailman/listinfo/swift-evolution


