[swift-evolution] Strings in Swift 4

Félix Cloutier felixcca at yahoo.ca
Wed Jan 25 21:54:25 CST 2017

> On Jan 25, 2017, at 13:08, Ben Cohen <ben_cohen at apple.com> wrote:
>> Okay, so I'm serializing two strings "a" and "b", and later on I want to deserialize them. I control "a", and the user controls "b". I know that I'll never have a comma in "a", so one obvious way to serialize the two strings is with "\(a),\(b)", and the most obvious way to deserialize them is with string.split(maxSplits: 1) { $0 == "," }.
>> For the example, string "a" is "hello", and the user put in "\u{0301}screw you" for "b". This makes the result "hello,́screw you". Now split misses the comma.
>> How do I fix it?
> One option (once Character acquires a unicodeScalars view similar to String’s) would be:
> s.split { $0.unicodeScalars.first == "," }

My two main objections to this are that (1) this drops the acute accent (although that's probably an acceptable sacrifice in the face of purposefully bad input); and (2) it's annoying to me that you have to drop below the Character level to safely perform a task this simple.
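
To make the failure and the workaround concrete, here is a minimal sketch. It assumes a current Swift toolchain in which Character already exposes unicodeScalars (the view the suggestion above depends on); the variable names are mine, and maxSplits is 1 so that at most two pieces come back.

```swift
let a = "hello"
let b = "\u{0301}screw you"   // user-controlled; starts with a combining acute accent
let serialized = "\(a),\(b)"

// Character-level split: the comma and U+0301 form a single Character,
// so no Character in the string compares equal to ",".
let naive = serialized.split(maxSplits: 1) { $0 == "," }

// Scalar-aware split: treats any Character whose first scalar is a comma
// as the separator. The whole Character is consumed, accent included.
let byFirstScalar = serialized
    .split(maxSplits: 1) { $0.unicodeScalars.first == "," }
    .map { String($0) }
```

With this input, naive comes back as a single piece, while byFirstScalar recovers "hello" and "screw you" with the accent gone, which is objection (1).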

> There’s probably also a case to be made for a String-specific overload split(separator: UnicodeScalar) in which case you’d pass in the scalar of “,”. This would replicate similar behavior to languages that use code points as their “character”.

The way they're being built, I'm leaning towards the opinion that Strings wouldn't be the right tool to serialize anything. Unfortunately, in a world of XML, JSON, YAML, Markdown and such, they're also a very obvious choice.
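
For reference, the scalar-level overload suggested above could be approximated today along these lines. This is only a sketch: splitOnScalar is a hypothetical name, not a standard library API, and it simply splits the unicodeScalars view and rebuilds a String from each piece.

```swift
extension String {
    /// Hypothetical helper (not part of the standard library): splits on a
    /// Unicode scalar rather than a Character, so a combining mark that
    /// follows the separator stays with the payload instead of hiding it.
    func splitOnScalar(_ separator: Unicode.Scalar, maxSplits: Int = Int.max) -> [String] {
        return unicodeScalars
            .split(separator: separator, maxSplits: maxSplits, omittingEmptySubsequences: false)
            .map { String(String.UnicodeScalarView($0)) }
    }
}

let parts = "hello,\u{0301}screw you".splitOnScalar(",", maxSplits: 1)
```

Here the second piece keeps its leading U+0301, so the round trip is lossless, at the cost of working below the Character level.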

> Alternatively, the right solution is to sanitize your input before the interpolation. Sanitization is a big topic, of which this is just one example. Essentially, you are asking for this kind of sanitization to be automatically applied for all range-replaceable operations on strings for this specific use case. I’m not sure that’s a good precedent to set. There are other ways in which Unicode can be abused that wouldn’t be covered, should we be sanitizing for those too on all low-level operations?

I agree that the general Unicode abuse problem cannot be solved. The novel thing here is that Swift is one of the first languages to bring grapheme-cluster-aware strings to a wide audience, and in doing so, it introduces a class of bugs that have essentially no precedent. I feel like this should worry people a little bit. People have been able to abuse RTL overrides for several years now, and we found that it's a problem for users, but machines are pretty good at dealing with it. However, if you'll allow me to dramatize, these are characters that basically eat their neighbor.
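
A small sketch of what "eating the neighbor" means, plus the kind of check that sanitizing before interpolation would involve. preservesSeparator is a hypothetical helper of mine, and it only covers this one separator scenario, not Unicode abuse in general.

```swift
var s = "hello,"
let countBefore = s.count
s += "\u{0301}"   // append a lone combining acute accent
// The accent merges into the trailing comma: the Character count is unchanged,
// but the last Character is no longer ",".
let countAfter = s.count
let commaStillFound = s.firstIndex(of: ",") != nil

/// Hypothetical sanitization check: returns true if `payload` can follow
/// `separator` without merging into it.
func preservesSeparator(_ payload: String, separator: Character = ",") -> Bool {
    let joined = String(separator) + payload
    return joined.first == separator
}

let safeInput = preservesSeparator("world")
let hostileInput = preservesSeparator("\u{0301}screw you")
```

The check works by building the concatenation and looking at what the first Character turned into, which is exactly the step that is easy to forget.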

> This would also have pretty far-reaching implications across lots of different types and operations. For example, it’s not just on append:
> var s = "pokemon"
> let i = s.index(of: "m")!
> // insert not just \u{0301} but also a separator?
> s.insert("\u{0301}", at: i)
> It also would apply to in-place mutation on slices, given you can do this:
> var a = [1,2,3,4]
> a[0...2].append(99)
> a // [1,2,3,99,4]
> In this case, suppose you appended "e" to a slice that ended between "m" and "\u{0301}". The append operation on the substring would need to look into the outer string, see that the next scalar is a combining character, and then insert a spacer element in between them.
> We would still need the ability to append modifiers to characters legitimately. If users could not do this by inserting/appending these modifiers into String, we would have to put this logic onto Character, which would need to have the ability to range-replace within its scalars, which adds a lot to the complexity of that type. It would also be fiddly to use, given that String is not going to conform to MutableCollection (because mutation on an element cannot be done in constant time). So you couldn’t do it in place, i.e. s[i].unicodeScalars.append("\u{0301}") wouldn’t work.

I'd argue that no one should feel particularly great about writing code points to a collection that exposes Characters in return. Have any alternatives around modifying a Unicode scalar view been explored? I don't have any problem with making it impossible to add a Character-that-is-not-a-Character to a String's Character view if you can opt in to Unicode scalars when you mean it.
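
As a sketch of what opting in could look like, the unicodeScalars view of a String is already mutable, so the "pokemon" example can state at the call site that it is working with scalars. This assumes a current toolchain, where index(of:) is spelled firstIndex(of:).

```swift
var s = "pokemon"
let i = s.firstIndex(of: "m")!

// Explicitly opt in to scalars: insert a combining acute accent before "m".
// It attaches to the preceding "e", so no new Character appears.
s.unicodeScalars.insert("\u{0301}", at: i)

let characterCount = s.count              // still 7 Characters
let scalarCount = s.unicodeScalars.count  // 8 scalars
```

After the insertion the string compares equal to "pokémon", since String equality is based on canonical equivalence.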

