[swift-dev] State of String: Ergonomics, and You!

Chris Lattner clattner at nondot.org
Wed Jan 17 12:58:15 CST 2018


> On Jan 13, 2018, at 10:30 AM, Michael Ilseman via swift-dev <swift-dev at swift.org> wrote:
>>  I wouldn’t overly rely on it for guidance on these issues, given that it is stuck so squarely in the realm of UTF-16.
>> 
> 
> Wading a little into the weeds here, CharacterSet’s builtins model older Unicode semantics than what we probably want to provide. E.g. CharacterSet.lowercaseLetters means general category “Ll”, while modern Unicode defines property Lowercase which includes more scalars. The Lowercase property vs CharacterSet.lowercaseLetters is equivalent to ICU’s u_isULowercase [1] vs u_islower [2], where the former is the modern recommended API. Contrarily, Perl 6’s <lower> built-in rule[3] seems to also just be “Ll” and I’m not familiar with the rationale there (compatibility?). Certainly needs investigation.

Makes sense.  It would be great to standardize towards the grapheme cluster as the canonical concept of character (this is why Character is defined that way), and by pushing regexes to use that definition of character, we can simplify and reduce aggregate complexity.  This is a definition that “just works” most often, and the introduction of regexes will eliminate all of the awkwardness of having to work with grapheme clusters, by eliminating some of the most common reasons that people want to use integer indexes into strings.
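
To make the grapheme-cluster definition of Character concrete, here is a minimal sketch (the sample string is an illustration, not from the thread) showing that counting Characters differs from counting scalars or UTF-16 code units:

```swift
// "e" followed by U+0301 COMBINING ACUTE ACCENT:
// two Unicode scalars, but a single grapheme cluster (one Character).
let cafe = "cafe\u{301}"

let characterCount = cafe.count                // grapheme clusters
let scalarCount = cafe.unicodeScalars.count    // Unicode scalars
let utf16Count = cafe.utf16.count              // UTF-16 code units

// String equality uses canonical equivalence, so the precomposed
// spelling (U+00E9) compares equal to the decomposed one.
let matchesPrecomposed = (cafe == "caf\u{E9}")

print(characterCount, scalarCount, utf16Count, matchesPrecomposed)
```

An integer offset means something different in each of the three views, which is one reason integer indexing into String is awkward.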

> I suppose by “guidance” I meant “here are some of the things people commonly ask about characters”. Built-in character classes in various regex flavors also provide guidance.
> 
> [1] u_isULowercase: http://icu-project.org/apiref/icu4c/uchar_8h.html#a8321c9ba617ed00787f20c4b23a254bc
> [2] u_islower: http://icu-project.org/apiref/icu4c/uchar_8h.html#aa26a51b768147fbf34029f8141132815
> [3] https://docs.perl6.org/language/regexes#Backslashed,_predefined_character_classes
One of the frustrating things about the regex state of the art is that most engines were built before the complexity of modern Unicode.  As you say, it is surprising that Perl 6 didn’t fix some of this, but it could either be because of compatibility or (more likely) that they made many of these decisions 15 years ago when Perl 6 was just getting going.  Perl 6 had a long gestation period.
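
The general category “Ll” versus Lowercase property distinction discussed above can be sketched with the Unicode.Scalar.Properties API that shipped later, in Swift 5. The helper names and the sample scalars are chosen for illustration and are not from the thread:

```swift
// Helper names are hypothetical; the underlying properties are
// Unicode.Scalar.Properties members (Swift 5 and later).
func isGeneralCategoryLl(_ scalar: Unicode.Scalar) -> Bool {
    return scalar.properties.generalCategory == .lowercaseLetter
}

func hasLowercaseProperty(_ scalar: Unicode.Scalar) -> Bool {
    return scalar.properties.isLowercase
}

// U+00AA FEMININE ORDINAL INDICATOR and U+2170 SMALL ROMAN NUMERAL ONE
// carry the Lowercase property without being in general category Ll.
let samples: [Unicode.Scalar] = ["a", "\u{AA}", "\u{2170}", "A"]

for scalar in samples {
    let hex = String(scalar.value, radix: 16, uppercase: true)
    print("U+\(hex) Ll: \(isGeneralCategoryLl(scalar)) Lowercase: \(hasLowercaseProperty(scalar))")
}
```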

>>> Note: In no way would these properties obviate the need for CharacterSet, as CharacterSet API is independently useful and the right model for many whole-string operations.
>> 
>> No it isn’t.  A Grapheme-cluster based analog of CharacterSet would be a reasonable model to consider though.  It could conceptually support predicates like isEmoji()
>> 
> 
> In the regex section I talk about character classes as being modeled by `(Character) -> Bool` rather than something that conforms to SetAlgebra. One big reason (beyond being awesomely convenient) is because I don’t think graphemes are something to be enumerated.

+1

> Theoretically speaking, I’m not sure if the set of graphemes is even recursively enumerable (e.g. zalgo-text or zalgo-emoji). Practically speaking, it darn-well might as well be uncountable. Best case, if even possible, any enumeration would be emergent behavior of encoding details and change significantly with each Unicode version. If we cannot enumerate graphemes, we cannot answer equality, isDisjoint, isSubset, etc.

+2

> I don’t know if those operations are important for a grapheme analogue. They just demonstrate that a grapheme set follows a different model than a scalar set, and different than what we have traditionally used the word “Set” for in Swift, though you would know the history here better.

+3.  I completely agree that a “set” in the computer sciency usage is no longer a useful notion for characters.  It is still technically correct with the mathematical definition, but any alignment of terminology towards this will just be confusing to users.
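
As a rough sketch of the `(Character) -> Bool` model discussed above: predicates can still be combined like sets, even though the underlying collection is never enumerated. The typealias and combinator names are hypothetical; `isLetter` and `isUppercase` are Character properties that arrived later, in Swift 5.

```swift
// A character class modeled as a predicate rather than a SetAlgebra type.
typealias CharacterClass = (Character) -> Bool

// Set-like combinators that only need membership tests.
func union(_ a: @escaping CharacterClass, _ b: @escaping CharacterClass) -> CharacterClass {
    return { a($0) || b($0) }
}

func intersection(_ a: @escaping CharacterClass, _ b: @escaping CharacterClass) -> CharacterClass {
    return { a($0) && b($0) }
}

func complement(_ a: @escaping CharacterClass) -> CharacterClass {
    return { !a($0) }
}

let letter: CharacterClass = { $0.isLetter }
let uppercase: CharacterClass = { $0.isUppercase }
let nonUppercaseLetter = intersection(letter, complement(uppercase))

print(nonUppercaseLetter("a"), nonUppercaseLetter("A"), nonUppercaseLetter("1"))
```

Membership, union, intersection, and complement all fall out of plain closures. Equality, isDisjoint, and isSubset do not, because answering them would require enumerating the graphemes each predicate accepts.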

> As far as uses of such a grapheme set are concerned, they seem equivalent to application of a regex consisting of a character class. For example, rangeOfCharacter(from:options:) could be accomplished with the below.

Yes, that’s a great way to consider it.  

> In an ever-futile attempt to avoid focusing on specific syntax, I’ll form an army of straw-people, using 「」 delimiters for a regex literal, « » to denote character classes, subscript on String to model application of a Regex, parentheses to denote a capture that must always be a decl (even if the name is dropped by using `_`), and Regex literals defaulting to whole-string matching rather than the first partial-string match:
> 
> extension Character {
>     var isEmoji: Bool { get }
> }
> extension Regex {
>     var rangeOfMatch: Regex<(Range<String.Index>)>  // Drops the captured value, just wants the range
>     var allMatches: Regex<LazySequence<T>>
> }
> 
> let emojiPattern =「(let _ = «.isEmoji»)」 // Regex<(Character)>

I assume that we will support an escape sequence of some sort and use the \ character for it.  Given that, it makes sense to use \(.isEmoji) as the syntax for referring to user-defined predicates, given that we use it for string literal interpolation, and that regex’s are the dual of it.

> 
> let theFirst  = myString[emojiPattern.rangeOfMatch.firstPartial] // Range<String.Index>
> let allOfThem = myString[emojiPattern.rangeOfMatch.allMatches] // LazySequence<Range<String.Index>>
> 
> (Alternatively, Regex could have an init taking a (Character) -> Bool)
> 
> In such a world, what do you imagine the role of a grapheme set type is? It might have more uses as provided by the platform, alongside things such as localization, linguistic analysis, etc.

I think you’re arguing here that CharacterSet should go away.  If that is possible, then that would be great.  I’d check to see how CharacterSet is used across Cocoa, I’m not an expert on that.  It would be fantastic if those could be replaced with versions that take predicate functions or regex’s.
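
A predicate-taking replacement of the kind described above might look roughly like the following sketch. Both `rangeOfCharacter(where:)` and `isEmojiApprox` are hypothetical names, not existing APIs, and the emoji test is a deliberately crude approximation based on the Emoji_Presentation scalar property (available in Swift 5 and later):

```swift
extension Character {
    // Hypothetical, approximate emoji check: true if any scalar in the
    // grapheme cluster defaults to emoji presentation. Not a stdlib API.
    var isEmojiApprox: Bool {
        return unicodeScalars.contains { $0.properties.isEmojiPresentation }
    }
}

extension String {
    // Hypothetical predicate-based analogue of rangeOfCharacter(from:options:).
    // Returns the range of the first Character satisfying the predicate.
    func rangeOfCharacter(where predicate: (Character) -> Bool) -> Range<String.Index>? {
        guard let start = firstIndex(where: predicate) else { return nil }
        return start..<index(after: start)
    }
}

let message = "abc😀def"
if let range = message.rangeOfCharacter(where: { $0.isEmojiApprox }) {
    print(message[range])
}
```

Because the predicate operates on Character, the returned range always covers a whole grapheme cluster.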

>> 
>>> * Interpolated expressions are supplied inside the literal, meaning they cannot be passed around like format strings without extra boilerplate (e.g. a wrapping function).
>> 
>> The original (2013 era) idea was to allow for interpolants to take multiple arguments (the contents of \(….) would be passed as the argument list), so you could specify formatting information as subsequent parameters, e.g.:
>> 
>>     “your file access is set to \(mode, .Octal)”.
>> 
>> or:
>>     “your file access is set to \(mode, format: .Octal)”.
>>   
>> or something like that.  Of course each type would define its own formatting modes, and this would probably be reflected through a type-specific initializer.
>> 
> 
> That looks very similar to one of Brent’s proposals (that I can no longer find a link to). Was there a reason it didn’t happen other than not getting around to it?

It simply wasn’t the priority at the time (e.g. we didn’t even have let vs var yet :-).  There were bigger fish to fry and we didn’t come back to it.
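
For reference, the string interpolation redesign that later shipped in Swift 5 allows something close to the multi-argument form quoted above. The following is only a sketch: `IntegerFormat` and its cases are hypothetical names invented for illustration, not a standard library type.

```swift
// Hypothetical formatting modes for integers.
enum IntegerFormat {
    case decimal, octal, hex
}

extension String.StringInterpolation {
    // Lets \(value, format: ...) appear inside an ordinary string literal.
    mutating func appendInterpolation(_ value: Int, format: IntegerFormat) {
        switch format {
        case .decimal: appendLiteral(String(value))
        case .octal:   appendLiteral(String(value, radix: 8))
        case .hex:     appendLiteral(String(value, radix: 16))
        }
    }
}

let mode = 0o644
let message = "your file access is set to \(mode, format: .octal)"
print(message)  // your file access is set to 644
```

Each type can add its own overload of appendInterpolation, which matches the idea of type-specific formatting modes.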


>>> ### Regex: Tearing Strings Apart
>> 
>> This is clearly necessary but also clearly not part of Swift 5.  Regex’s should themselves be the subject of an intense design process.  Swift 6 pretty please???  I’d suggest splitting this section out and starting a regex manifesto. :-)
>> 
> 
> Since it is so big, I feel like if we’re going to make any progress we need to start soon. Of course, any focus will be preempted as needed by ABI stability. Even if Swift N doesn’t get the fully formed feature, we’ll pick up things along the way. E.g. as you mention there’s a parity between Character’s Bool properties and character classes.
> 
> I definitely want to circulate and discuss these ideas a little bit before writing a “manifesto”.

K, I’m just saying that this is meaty enough and should continue to develop, so I think that it being a manifesto is an inevitability. :-) I’m thrilled you’re pushing this forward btw.

-Chris
