[swift-evolution] Strings in Swift 4
Dave Abrahams
dabrahams at apple.com
Thu Jan 26 21:31:05 CST 2017
on Thu Jan 26 2017, Deborah Goldsmith <swift-evolution at swift.org> wrote:
> To throw another ingredient into the mix, there are issues for Unicode regex that don’t appear in
> more “traditional” regex implementations. See:
>
> http://userguide.icu-project.org/strings/regexp
>
> For example:
>
>> Case insensitive matching is specified by the
>> UREGEX_CASE_INSENSITIVE flag during pattern compilation, or by the
>> (?i) flag within a pattern itself. Unicode case insensitive
>> matching is complicated by the fact that changing the case of a
>> string may change its length. See
>> http://unicode.org/faq/casemap_charprop.html for more information on
>> Unicode casing operations.
>>
>> Examples:
>> • pattern "fussball" will match "fußball" or "fussball"
>> • pattern "fu(s)(s)ball" or "fus{2}ball" will match "fussball" or "FUSSBALL" but not "fußball".
>> • pattern "ß" will find occurrences of "ss" or "ß"
>> • pattern "s+" will not find "ß"
>>
These all appear to be issues for users to consider rather than design
issues for the regex implementation. Am I mistaken?
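To make the quoted point about length concrete, here is a minimal sketch in
Python. It uses only the built-in str methods, not ICU's regex engine, and
the variable names are purely illustrative:

```python
# Illustration only: Python's built-in str methods, not ICU's regex engine.
lower = "fußball"
upper = lower.upper()        # full case mapping turns "ß" into "SS"
folded = lower.casefold()    # full case folding turns "ß" into "ss"

length_before = len(lower)   # 7 characters
length_after = len(upper)    # 8 characters

# Comparing case-folded forms treats the two spellings as equal.
same_when_folded = folded == "FUSSBALL".casefold()
```

Since the mapped string has a different length from the original, a
case-insensitive match for one pattern can cover a different number of
characters depending on how the text is spelled.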
> and
>
>>
>> w UREGEX_UWORD Controls the behavior of \b in a pattern. If set,
>> word boundaries are found according to the definitions of word found
>> in Unicode UAX 29, Text Boundaries. By default, word boundaries are
>> identified by means of a simple classification of characters as
>> either “word” or “non-word”, which approximates traditional regular
>> expression behavior. The results obtained with the two options can
>> be quite different in runs of spaces and other non-word characters.
>>
>
> If regexes are going to be used on human language text, these are all
> important considerations.
Yup, but I don't see how they affect the design, other than that maybe
matching on a LocalizedString type would use UREGEX_UWORD by default.
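As a rough illustration of the "simple classification" default described in
the quoted text, here is a sketch using Python's re module. This is an
assumption made for illustration only: Python's re is not ICU and has no
equivalent of the UAX 29 word-boundary flag.

```python
import re

text = "can't stop"

# Python's \b uses a simple word/non-word classification, so the
# apostrophe counts as a non-word character and splits "can't".
simple_words = re.findall(r"\b\w+\b", text)
```

Under the UAX 29 rules, an apostrophe between two letters is not a word
boundary, so "can't" would be treated as a single word.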
--
-Dave