[swift-evolution] Pitch: Renaming CharacterSet to UnicodeScalarSet

Xiaodi Wu xiaodi.wu at gmail.com
Wed Sep 28 22:46:12 CDT 2016


On Wed, Sep 28, 2016 at 10:34 PM, Xiaodi Wu <xiaodi.wu at gmail.com> wrote:

> On Wed, Sep 28, 2016 at 10:23 PM, Charles Srstka via swift-evolution <
> swift-evolution at swift.org> wrote:
>
>> On Sep 28, 2016, at 9:57 PM, Erica Sadun via swift-evolution <
>> swift-evolution at swift.org> wrote:
>>
>>
>> D'erp. I missed that. And that's an unambiguous answer.
>>
>> So let me move on to part B of the pitch: I think CharacterSets are
>> broken.
>>
>> Xiaodi Wu: "isn't the problem you're presenting really an argument that
>> the type should be fleshed out to handle characters (grapheme clusters)
>> containing more than one Unicode scalar?"
>>
>>
>> It seems that it already does handle such characters:
>>
>> (done in Objective-C so we can log the length of the range as a count of
>> UTF-16 code units)
>>
>> #import <Foundation/Foundation.h>
>>
>> int main(int argc, char *argv[]) {
>>     @autoreleasepool {
>>         NSCharacterSet *bikeSet = [NSCharacterSet characterSetWithCharactersInString:@"🚲"];
>>         NSString *str = @"foo🚲bar";
>>
>>
>>         NSRange range = [str rangeOfCharacterFromSet:bikeSet];
>>
>>
>>         NSLog(@"location: %lu length: %lu", range.location, range.length);
>>     }
>> }
>>
>> - - - - - - -
>>
>> *2016-09-28 22:20:00.622471 test[15577:2433912] location: 3 length: 2*
>> *Program ended with exit code: 0*
>>
>> - - - - - - -
>>
>> As we can see, the character from the set is recognized as consisting of
>> two code units. There are a few bugs in the system, though. See the
>> cocoa-dev thread “Where is my bicycle?” from about a year ago:
>> http://prod.lists.apple.com/archives/cocoa-dev/2015/Apr/msg00074.html
>>
>
> The bike emoji might be two code units, but it is one Unicode scalar
> (U+1F6B2). However, the Canadian flag emoji, for instance, is two Unicode
> scalars (U+1F1E8 U+1F1E6) but nonetheless one character.
>

To illustrate in code how CharacterSet doesn't actually handle characters
made up of multiple Unicode scalars:

```
import Foundation

let str1 = "🇦🇩"
let first = CharacterSet(charactersIn: str1) // this actually crashes corelibs-foundation
let str2 = "🇦🇺"
let second = CharacterSet(charactersIn: str2)
let intersection = first.intersection(second)
print(intersection.isEmpty)
// actual output: false
// obviously, if we were really dealing with characters, the intersection
// should be empty
```
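
The same distinction can be seen with the standard library alone. What follows is only a rough sketch: it assumes Swift 4 or later (for `String.count`), and it uses a plain `Set` of Unicode scalars as a stand-in for what CharacterSet appears to store, which is my assumption and not a description of Foundation's internals. The names `bike`, `canada`, `andorra`, `australia` and `shared` are made up for illustration.

```swift
// Standard library only; assumes Swift 4 or later for String.count.
let bike = "\u{1F6B2}"             // bicycle emoji
let canada = "\u{1F1E8}\u{1F1E6}"  // regional indicators C + A

// One scalar, encoded as a UTF-16 surrogate pair; still one character.
print(bike.utf16.count, bike.unicodeScalars.count, bike.count)
// prints "2 1 1"

// Two scalars (four UTF-16 code units) that form a single character.
print(canada.utf16.count, canada.unicodeScalars.count, canada.count)
// prints "4 2 1"

// Sets of scalars, standing in for CharacterSet(charactersIn:).
let andorra = Set("\u{1F1E6}\u{1F1E9}".unicodeScalars)    // A + D
let australia = Set("\u{1F1E6}\u{1F1FA}".unicodeScalars)  // A + U

// The two flags share the regional indicator A, so the sets overlap.
let shared = andorra.intersection(australia)
print(shared.map { String($0.value, radix: 16, uppercase: true) })
// prints ["1F1E6"]
```

If the sets held characters rather than scalars, `shared` would be empty, since the two flags are distinct characters.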


> Charles
>>
>>
>> _______________________________________________
>> swift-evolution mailing list
>> swift-evolution at swift.org
>> https://lists.swift.org/mailman/listinfo/swift-evolution
>>
>>
>