[swift-evolution] Trial balloon: Ensure that String always contains valid Unicode

Paul Cantrell cantrell at pobox.com
Fri Dec 18 15:56:59 CST 2015


Er, typo in the first sentence! I meant to say:

I was quite surprised to learn that it’s possible to create Swift strings that contain things other than valid Unicode characters.


> On Dec 18, 2015, at 3:47 PM, Paul Cantrell via swift-evolution <swift-evolution at swift.org> wrote:
> 
> I was quite surprised to learn that it’s possible to create Swift strings that do not contain things other than valid Unicode characters. Is it feasible to guarantee that this cannot happen?
> 
> String.init(bytes:encoding:) is failable, and does in fact validate that the given bytes are decodable with the given encoding in most circumstances:
> 
>     // Returns nil
>     String(
>         bytes: [0xD8, 0x00] as [UInt8],
>         encoding: NSUTF8StringEncoding)
> 
> However, that initializer does not reject invalid surrogate characters in UTF-16:
> 
>     // Succeeds (wat?!)
>     let bogusStr = String(
>         bytes: [0xD8, 0x00] as [UInt8],
>         encoding: NSUTF16BigEndianStringEncoding)!
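
For illustration only, here is a minimal sketch of why those two bytes are not well-formed UTF-16: read big-endian, they form the single code unit 0xD800, a high surrogate with no low surrogate after it. The helper names utf16Units(bigEndianBytes:) and isWellFormedUTF16(_:) are hypothetical, not Foundation or standard library API, and the code uses current Swift syntax rather than the Swift 2 syntax of the quoted snippets.

```swift
// Hypothetical helper (not a Foundation or standard library API):
// combines big-endian byte pairs into UTF-16 code units.
// Returns nil when the byte count is odd.
func utf16Units(bigEndianBytes bytes: [UInt8]) -> [UInt16]? {
    guard bytes.count % 2 == 0 else { return nil }
    var units: [UInt16] = []
    units.reserveCapacity(bytes.count / 2)
    for i in stride(from: 0, to: bytes.count, by: 2) {
        units.append(UInt16(bytes[i]) << 8 | UInt16(bytes[i + 1]))
    }
    return units
}

// Hypothetical helper: true only when every surrogate code unit is part
// of a high/low pair in the right order.
func isWellFormedUTF16(_ units: [UInt16]) -> Bool {
    var index = 0
    while index < units.count {
        switch units[index] {
        case 0xD800...0xDBFF:
            // High surrogate: the next unit must be a low surrogate.
            let next = index + 1
            guard next < units.count,
                  (0xDC00...0xDFFF).contains(units[next]) else {
                return false
            }
            index += 2
        case 0xDC00...0xDFFF:
            // Low surrogate with no high surrogate before it.
            return false
        default:
            index += 1
        }
    }
    return true
}

let bogusUnits = utf16Units(bigEndianBytes: [0xD8, 0x00])!              // [0xD800]
let smileyUnits = utf16Units(bigEndianBytes: [0xD8, 0x3D, 0xDE, 0x00])! // U+1F600
print(isWellFormedUTF16(bogusUnits))   // false
print(isWellFormedUTF16(smileyUnits))  // true
```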
> 
> Ever wonder why dataWithJSONObject(…) is declared “throws”? Now you know!
> 
>     // Throws an error (try! turns it into a runtime trap)
>     try! NSJSONSerialization.dataWithJSONObject(
>         ["foo": bogusStr], options: [])
> 
> And why does the URL escaping method in Foundation return an optional even though it escapes the string using UTF-8, which is a complete Unicode encoding? Same reason:
> 
>     // Returns nil
>     bogusStr.stringByAddingPercentEncodingWithAllowedCharacters(
>         NSCharacterSet.alphanumericCharacterSet())
> 
> AFAIK, the first method could lose its “throws” modifier and the second method would not need to return an optional if only String itself guaranteed that it would always contain valid Unicode. There are likely other APIs that would see similar benefits.
> 
> Are there downsides to making all String initializers guarantee that the Strings always contain valid Unicode? I can think of two possibilities:
> 
> 1. Is there some circumstance where you actually want a String to contain unpaired UTF-16 surrogate characters? I can’t imagine what that would be, but perhaps someone else can.
> 2. Is it important to ensure that String.init(…) is O(1) when it uses UTF-16? This seems thin: I assume that the library has to copy the raw bytes regardless, and it’s O(n) for other character encodings, so…?
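
As a rough sketch of what a validating UTF-16 path might cost, the function below makes one pass over the code units and returns nil on any unpaired surrogate. It is an illustration under assumptions, not how String is actually implemented: validatedString(utf16:) is a hypothetical name, and the check relies on the standard library's transcode(_:from:to:stoppingOnError:into:), which returns true when it detects an encoding error. It uses current Swift syntax.

```swift
// Hypothetical validating factory (not an existing String initializer):
// returns nil instead of building a String from ill-formed UTF-16.
func validatedString(utf16 units: [UInt16]) -> String? {
    var utf8: [UInt8] = []
    utf8.reserveCapacity(units.count)
    // Single linear pass; transcode returns true if it hit an encoding error.
    let hadError = transcode(
        units.makeIterator(),
        from: UTF16.self,
        to: UTF8.self,
        stoppingOnError: true,
        into: { utf8.append($0) })
    if hadError { return nil }
    return String(decoding: utf8, as: UTF8.self)
}

let rejected = validatedString(utf16: [0xD800])          // lone high surrogate
let accepted = validatedString(utf16: [0x0048, 0x0069])  // "Hi"
print(rejected == nil)
print(accepted ?? "<nil>")
```

The scan is O(n) in the number of code units, the same order as copying the bytes in the first place.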
> 
> Cheers,
> 
> Paul
> 
> 
> _______________________________________________
> swift-evolution mailing list
> swift-evolution at swift.org
> https://lists.swift.org/mailman/listinfo/swift-evolution
