[swift-evolution] Strings in Swift 4

Wed Jan 25 15:08:15 CST 2017

> On Jan 24, 2017, at 7:02 PM, Félix Cloutier via swift-evolution <swift-evolution at swift.org> wrote:
> 
> 
>> On Jan 24, 2017, at 11:33, Dave Abrahams via swift-evolution <swift-evolution at swift.org> wrote:
>>>>> I've never seen anyone start a string with a combining character on purpose, 
>>>> 
>>>> It will occur as a byproduct of the process of attaching a diacritic
>>>> to a base character.
>>> 
>>> Unless you're in the business of writing a text editor, I don't know
>>> if that's a common use case.
>> 
>> I don't either, to be honest.  But the experts I consult with keep
>> reassuring me that it's an important one.
> 
> Would it be possible that the Unicode experts' use cases are different from non-experts' use cases? It would make sense to put people who know a lot about Unicode in charge of handling complex Unicode operations, and that makes that use case very important to them, but through their hard work no one else needs to care about it.
> 
>>>>> though I'm familiar with just one natural language that needs
>>>>> combining characters. I can imagine that it could be a convenient
>>>>> feature in other natural languages.
>>>>> 
>>>>> However, if Swift Strings are now designed for machine processing
>>>>> and less for human language convenience, for me, it's easy enough to
>>>>> justify a safe default in the context of machine processing: `a+b`
>>>>> will not combine the end of `a` with the start of `b`. You could do
>>>>> this by inserting a ◌ that `b` could combine with if necessary.
>>>> 
>>>> You can do it, but it trades one semantic problem for a usability
>>>> problem, without solving all the semantic problems: you end up with
>>>> a.count + b.count == (a+b).count, sure, but you still don't satisfy
>>>> the usual law of collections that (a+b).contains(b.first!) if b is
>>>> non-empty, and now you've made it difficult to attach diacritics to
>>>> base characters.
>>> 
>>> "Difficult".
>>> 
>>> What kind of processing would you suggest on a variable "b" in the
>>> expression "\(a),\(b)" to ensure that the result can be split with a
>>> comma?
>> 
>> I'm sorry, I don't understand what you're driving at, here.
> 
> Okay, so I'm serializing two strings "a" and "b", and later on I want to deserialize them. I control "a", and the user controls "b". I know that I'll never have a comma in "a", so one obvious way to serialize the two strings is with "\(a),\(b)", and the most obvious way to deserialize them is with string.split(maxSplits: 1) { $0 == "," }.
> 
> For the example, string "a" is "hello", and the user put in "\u{0301}screw you" for "b". This makes the result "hello,́screw you". Now split misses the comma.
> 
> How do I fix it?
> 

One option (once Character acquires a unicodeScalars view similar to String’s) would be:

s.split { $0.unicodeScalars.first == "," }
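
As a rough sketch of how that plays out, assuming a Swift version in which Character already exposes unicodeScalars (Swift 5 or later), the snippet below reproduces Félix's example. The variable names are illustrative only. Note that the combining accent is part of the separator Character, so it is dropped along with the comma.

```swift
let a = "hello"
let b = "\u{0301}screw you"   // user input that starts with a combining acute accent
let joined = "\(a),\(b)"

// Character-level split: "," followed by U+0301 forms a single Character
// that is not equal to ",", so nothing is split.
let naive = joined.split { $0 == "," }

// Compare only the first scalar of each Character instead.
let comma: Unicode.Scalar = ","
let byFirstScalar = joined
    .split { $0.unicodeScalars.first == comma }
    .map { String($0) }

print(naive.count)      // 1
print(byFirstScalar)    // ["hello", "screw you"]
```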

There’s probably also a case to be made for a String-specific overload split(separator: UnicodeScalar) in which case you’d pass in the scalar of “,”. This would replicate similar behavior to languages that use code points as their “character”.
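
No such overload is assumed to exist here. A similar effect can be approximated by splitting the unicodeScalars view directly and rebuilding strings from the pieces, as in this minimal sketch. Unlike a Character-level split, it keeps the leading combining mark in the second piece.

```swift
let serialized = "hello,\u{0301}screw you"
let separator: Unicode.Scalar = ","

// Split at the scalar level, then turn each slice back into a String.
let pieces = serialized.unicodeScalars
    .split(separator: separator)
    .map { String(String.UnicodeScalarView($0)) }

print(pieces.count)   // 2
// pieces[1] still begins with U+0301.
```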

Alternatively, the right solution is to sanitize your input before the interpolation. Sanitization is a big topic, of which this is just one example. Essentially, you are asking for this kind of sanitization to be applied automatically to all range-replaceable operations on strings, for this specific use case. I’m not sure that’s a good precedent to set. There are other ways in which Unicode can be abused that wouldn’t be covered; should we be sanitizing for those too on all low-level operations?
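
One possible shape for that kind of sanitization, purely as a sketch: the helper below (droppingLeadingCombiners is a hypothetical name, not a standard library API) strips leading scalars from the input for as long as they would fuse with a preceding comma. It re-probes after every dropped scalar, so it is quadratic in the number of leading combiners, which is assumed to be small. Rejecting such input outright would be an equally reasonable policy.

```swift
// Hypothetical helper: drop leading scalars that would merge with a "," placed before them.
func droppingLeadingCombiners(_ input: String) -> String {
    var rest = input.unicodeScalars[...]
    while !rest.isEmpty {
        let probe = "," + String(String.UnicodeScalarView(rest))
        // If the first Character is still a bare comma, nothing fused with it.
        if probe.first == "," { break }
        rest = rest.dropFirst()
    }
    return String(String.UnicodeScalarView(rest))
}

let safeB = droppingLeadingCombiners("\u{0301}screw you")
let record = "hello,\(safeB)"
let fields = record.split { $0 == "," }
print(fields.count)   // 2
```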

This would also have pretty far-reaching implications across lots of different types and operations. For example, it’s not just on append:

var s = "pokemon"
let i = s.index(of: "m")!
// insert not just \u{0301} but also a separator?
s.insert("\u{0301}", at: i)
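
To make the current behavior concrete, here is a small sketch of what that insertion does without any automatic separator. It uses firstIndex(of:), the later spelling of index(of:). The variable names are illustrative.

```swift
var word = "pokemon"
let m = word.firstIndex(of: "m")!
let countBefore = word.count

let accent: Character = "\u{0301}"
word.insert(accent, at: m)

// The accent attaches to the preceding "e", so the Character count is unchanged
// and the inserted element cannot be found as a Character of its own.
print(word.count == countBefore)           // true
print(word.contains { $0 == accent })      // false
print(word == "pok\u{E9}mon")              // true (canonical equivalence)
```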

It also would apply to in-place mutation on slices, given you can do this:

var a = [1,2,3,4]
a[0...2].append(99)
a // [1,2,3,99,4]

In this case, suppose you appended "e" to a slice that ended between "m" and "\u{0301}". The append operation on the substring would need to look into the outer string, see that the next scalar is a combining character, and then insert a spacer element in between them.

We would still need the ability to append modifiers to characters legitimately. If users could not do this by inserting/appending these modifiers into String, we would have to put this logic onto Character, which would need the ability to range-replace within its scalars, which adds a lot to the complexity of that type. It would also be fiddly to use, given that String is not going to conform to MutableCollection (because mutation on an element cannot be done in constant time). So you couldn’t do it in place; i.e., s[i].unicodeScalars.append("\u{0301}") wouldn’t work.
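
For comparison, this is roughly what attaching a modifier to one element looks like when it has to go through range replacement on the String rather than in-place mutation of the Character. It is only a sketch of the workaround, with illustrative names.

```swift
var name = "pokemon"
let e = name.firstIndex(of: "e")!

// Build the accented replacement separately, then swap it in for the old element.
let accented = String(name[e]) + "\u{0301}"
name.replaceSubrange(e...e, with: accented)

print(name.count)   // 7
```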

> Félix
> 
> _______________________________________________
> swift-evolution mailing list
> swift-evolution at swift.org
> https://lists.swift.org/mailman/listinfo/swift-evolution
