[swift-evolution] Strings in Swift 4

Thu Jan 26 21:31:05 CST 2017

on Thu Jan 26 2017, Deborah Goldsmith <swift-evolution at swift.org> wrote:

> To throw another ingredient into the mix, there are issues for Unicode regex that don’t appear in
> more “traditional” regex implementations. See:
>
> http://userguide.icu-project.org/strings/regexp
>
> For example:
>
>> Case insensitive matching is specified by the
>> UREGEX_CASE_INSENSITIVE flag during pattern compilation, or by the
>> (?i) flag within a pattern itself.  Unicode case insensitive
>> matching is complicated by the fact that changing the case of a
>> string may change its length.  See
>> http://unicode.org/faq/casemap_charprop.html for more information on
>> Unicode casing operations.
>> 
>> Examples:
>> 	• pattern "fussball" will match "fußball" or "fussball"
>> 	• pattern "fu(s)(s)ball" or "fus{2}ball" will match "fussball" or "FUSSBALL" but not "fußball"
>> 	• pattern "ß" will find occurrences of "ss" or "ß"
>> 	• pattern "s+" will not find "ß"
>> 

These all appear to be issues for users to consider rather than design
issues for the regex implementation.  Am I mistaken?
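
To make the length-changing point concrete, here is a minimal sketch
using only the standard library.  The helper name matchesIgnoringCase
is made up for illustration, and comparing uppercased strings is only a
crude stand-in for real Unicode case folding, not a proposal for how a
regex engine should do it:

```swift
// Full Unicode case mapping can change the number of characters:
// uppercasing "ß" yields "SS".
let lower = "fußball"
let upper = lower.uppercased()

print(lower.count, upper.count)   // 7 8

// Hypothetical helper: compares two strings after uppercasing both.
// This approximates case-insensitive comparison; it is not case folding.
func matchesIgnoringCase(_ a: String, _ b: String) -> Bool {
    return a.uppercased() == b.uppercased()
}

// Lowercasing does not help, because "ß" is already lowercase.
print(lower.lowercased() == "FUSSBALL".lowercased())   // false
print(matchesIgnoringCase(lower, "FUSSBALL"))          // true
```

Whichever direction the mapping goes, the user has to know that offsets
into the original text and offsets into the case-mapped text need not
line up.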

> and
>
>> 
>> w (UREGEX_UWORD): Controls the behavior of \b in a pattern. If set,
>> word boundaries are found according to the definitions of word found
>> in Unicode UAX 29, Text Boundaries. By default, word boundaries are
>> identified by means of a simple classification of characters as
>> either “word” or “non-word”, which approximates traditional regular
>> expression behavior. The results obtained with the two options can
>> be quite different in runs of spaces and other non-word characters.
>> 
>
> If regexes are going to be used on human language text, these are all
> important considerations.

Yup, but I don't see how they affect the design, other than that maybe
matching on a LocalizedString type would use UREGEX_UWORD by default.
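
For anyone who wants to compare the two \b behaviors today, the sketch
below uses Foundation's NSRegularExpression, whose
.useUnicodeWordBoundaries option appears to correspond to UREGEX_UWORD.
The function name wordBoundaryOffsets is hypothetical, and the offsets
it returns are UTF-16 offsets, as with any NSRange:

```swift
import Foundation

// Hypothetical helper: returns the UTF-16 offsets at which \b matches,
// using either the default word/non-word classification or the
// UAX 29 word boundaries.
func wordBoundaryOffsets(in text: String, unicodeBoundaries: Bool) -> [Int] {
    let options: NSRegularExpression.Options =
        unicodeBoundaries ? [.useUnicodeWordBoundaries] : []
    guard let regex = try? NSRegularExpression(pattern: "\\b", options: options) else {
        return []
    }
    let whole = NSRange(text.startIndex..., in: text)
    // \b is a zero-width assertion, so each match has length 0 and
    // only its location is interesting.
    return regex.matches(in: text, options: [], range: whole).map { $0.range.location }
}

let sample = "can't  stop"
print(wordBoundaryOffsets(in: sample, unicodeBoundaries: false))
print(wordBoundaryOffsets(in: sample, unicodeBoundaries: true))
```

Printing both lists for text with apostrophes or runs of spaces is a
quick way to see where the two definitions disagree.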

-- 
-Dave