<!DOCTYPE html>
<html>
<head>
<title></title>
</head>
<body><div>I agree in principle that it would be great if String could enforce that it's always valid.<br></div>
<div> </div>
<div>But unfortunately, in practice, there's no way to do that without making it expensive to bridge from Obj-C. As you've demonstrated, an NSString can contain things that aren't valid Unicode sequences, so every bridge from an NSString to a String would have to be checked for validity. It's also not clear what should happen when an invalid string is found, since these bridges are unconditional: would Swift panic? Would it silently replace the invalid sequence with U+FFFD? Or something else entirely? That question is moot, though, because turning these bridges from O(1) into O(N) would be an unacceptable performance penalty anyway.</div>
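<div> </div>
<div>To make the cost concrete, here is a minimal sketch of the kind of scan an eager validity check would have to run on every bridge. The helper name firstUnpairedSurrogateIndex is hypothetical and is not how the standard library is implemented; the point is only that the loop has to visit every UTF-16 code unit. The last lines show the "silently replace with U+FFFD" option using String(decoding:as:), which exists in current Swift but postdates this thread.</div>
<pre>
```swift
// Hypothetical helper for illustration only; not a standard library API.
// Returns the index of the first unpaired surrogate, or nil if the
// code units form well-formed UTF-16. The scan is O(n) in the input length.
func firstUnpairedSurrogateIndex(in units: [UInt16]) -> Int? {
    let highSurrogates = UInt16(0xD800)...UInt16(0xDBFF)
    let lowSurrogates = UInt16(0xDC00)...UInt16(0xDFFF)
    var index = 0
    while index != units.count {
        let unit = units[index]
        if highSurrogates.contains(unit) {
            // A high surrogate is valid only when a low surrogate follows it.
            let next = index + 1
            guard next != units.count, lowSurrogates.contains(units[next]) else {
                return index
            }
            index += 2
        } else if lowSurrogates.contains(unit) {
            // A low surrogate with no preceding high surrogate is unpaired.
            return index
        } else {
            index += 1
        }
    }
    return nil
}

// The repairing alternative: ill-formed sequences become U+FFFD.
let repaired = String(decoding: [0x0041, 0xD800, 0x0042] as [UInt16], as: UTF16.self)
// repaired == "A\u{FFFD}B"
```
</pre>
<div>Either policy (reject or repair) needs the same linear walk over the contents, which is the cost being objected to here.</div>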
<div> </div>
<div>-Kevin Ballard</div>
<div> </div>
<div>On Fri, Dec 18, 2015, at 01:47 PM, Paul Cantrell via swift-evolution wrote:<br></div>
<blockquote type="cite"><div>I was quite surprised to learn that it’s possible to create Swift strings that contain things other than valid Unicode characters. Is it feasible to guarantee that this cannot happen?<br></div>
<div> </div>
<div>String.init(bytes:encoding:) is failable, and does in fact validate that the given bytes are decodable with the given encoding in most circumstances:<br></div>
<div> </div>
<div><div style="color:rgb(102, 139, 73);font-family:Menlo;font-size:10.5px;margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;line-height:normal;"><span class="colour" style="color:rgb(0, 0, 0)"></span>// Returns nil<br></div>
<div style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;font-size:10.5px;line-height:normal;font-family:Menlo;color:rgb(88, 126, 168);"><span class="colour" style="color:rgb(0, 0, 0)"></span>String<span class="colour" style="color:rgb(0, 0, 0)">(</span><br></div>
<div style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;font-size:10.5px;line-height:normal;font-family:Menlo;"> bytes: [<span class="colour" style="color:rgb(50, 62, 125)">0xD8</span>, <span class="colour" style="color:rgb(50, 62, 125)">0x00</span>] <span class="colour" style="color:rgb(50, 62, 125)">as</span> [<span class="colour" style="color:rgb(88, 126, 168)">UInt8</span>],<br></div>
<div style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;font-size:10.5px;line-height:normal;font-family:Menlo;"> encoding: <span class="colour" style="color:rgb(88, 126, 168)">NSUTF8StringEncoding</span>)<br></div>
</div>
<div> </div>
<div>However, that initializer does <i>not</i> reject invalid surrogate characters in UTF-16:<br></div>
<div> </div>
<div style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;font-size:10.5px;line-height:normal;font-family:Menlo;color:rgb(102, 139, 73);"><span class="colour" style="color:rgb(0, 0, 0)"></span>// Succeeds (wat?!)<br></div>
<div style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;font-size:10.5px;line-height:normal;font-family:Menlo;"><span class="colour" style="color:rgb(50, 62, 125)">let</span> bogusStr = <span class="colour" style="color:rgb(88, 126, 168)">String</span>(<br></div>
<div style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;font-size:10.5px;line-height:normal;font-family:Menlo;"> bytes: [<span class="colour" style="color:rgb(50, 62, 125)">0xD8</span>, <span class="colour" style="color:rgb(50, 62, 125)">0x00</span>] <span class="colour" style="color:rgb(50, 62, 125)">as</span> [<span class="colour" style="color:rgb(88, 126, 168)">UInt8</span>],<br></div>
<div style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;font-size:10.5px;line-height:normal;font-family:Menlo;color:rgb(88, 126, 168);"><span class="colour" style="color:rgb(0, 0, 0)"> encoding: </span>NSUTF16BigEndianStringEncoding<span class="colour" style="color:rgb(0, 0, 0)">)!</span><br></div>
<div> </div>
<div>Ever wonder why dataWithJSONObject(…) is declared “throws?” Now you know!<br></div>
<div><div style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;font-size:10.5px;line-height:normal;font-family:Menlo;color:rgb(102, 139, 73);"><div style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;font-size:10.5px;line-height:normal;min-height:12px;"><div> </div>
</div>
<div style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;font-size:10.5px;line-height:normal;"><span class="colour" style="color:rgb(0, 0, 0)"><span class="size" style="font-size:10.5px"></span></span><span class="size" style="font-size:10.5px">// Throws an error</span><br></div>
<div style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;font-size:10.5px;line-height:normal;color:rgb(88, 126, 168);"><span class="colour" style="color:rgb(0, 0, 0)"></span><span class="colour" style="color:rgb(50, 62, 125)">try</span><span class="colour" style="color:rgb(0, 0, 0)">! </span>NSJSONSerialization<span class="colour" style="color:rgb(0, 0, 0)">.</span>dataWithJSONObject<span class="colour" style="color:rgb(0, 0, 0)">(</span><br></div>
<div style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;font-size:10.5px;line-height:normal;"> [<span class="colour" style="color:rgb(132, 62, 100)">"foo"</span>: <span class="colour" style="color:rgb(88, 126, 168)">bogusStr</span>], options: [])<br></div>
<div style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;font-size:10.5px;line-height:normal;"> </div>
<div style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;font-size:10.5px;line-height:normal;"><div style="color:rgb(0, 0, 0);font-family:'Helvetica Neue';font-size:13px;">And why does the URL escaping method in Foundation return an optional even though it escapes the string using UTF-8, which is a complete Unicode encoding? Same reason:<br></div>
<div style="color:rgb(0, 0, 0);font-family:'Helvetica Neue';font-size:13px;"><div style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;font-size:10.5px;line-height:normal;font-family:Menlo;color:rgb(102, 139, 73);"><div style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;font-size:10.5px;line-height:normal;min-height:12px;"> </div>
</div>
</div>
</div>
<div><div style="font-size:10.5px;margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;line-height:normal;"><span class="colour" style="color:rgb(0, 0, 0)"></span>// Returns nil<br></div>
<div style="font-size:10.5px;margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;line-height:normal;color:rgb(88, 126, 168);"><span class="colour" style="color:rgb(0, 0, 0)"></span>bogusStr<span class="colour" style="color:rgb(0, 0, 0)">.</span>stringByAddingPercentEncodingWithAllowedCharacters<span class="colour" style="color:rgb(0, 0, 0)">(</span><br></div>
<div style="font-size:10.5px;margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;line-height:normal;color:rgb(88, 126, 168);"><span class="colour" style="color:rgb(0, 0, 0)"></span>NSCharacterSet<span class="colour" style="color:rgb(0, 0, 0)">.</span>alphanumericCharacterSet<span class="colour" style="color:rgb(0, 0, 0)">())</span><br></div>
</div>
<div> </div>
</div>
</div>
<div>AFAIK, the first method could lose its “throws” modifier and the second method would not need to return an optional if only String itself guaranteed that it would always contain valid Unicode. There are likely other APIs that would see similar benefits.<br></div>
<div> </div>
<div>Are there downsides to making all String initializers guarantee that the Strings always contain valid Unicode? I can think of two possibilities:<br></div>
<div> </div>
<div><ul><li>Is there some circumstance where you actually want a String to contain unpaired UTF-16 surrogate characters? I can’t imagine what that would be, but perhaps someone else can.<br></li><li>Is it important to ensure that String.init(…) is O(1) when it uses UTF-16? This seems thin: I assume that the library has to copy the raw bytes regardless, and it’s O(n) for other character encodings, so…?<br></li></ul></div>
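<div> </div>
<div>As a rough sketch of what a validating UTF-16 initializer might look like, assuming the input has to be walked once anyway: the function strictString(utf16BigEndianBytes:) below is a hypothetical name, not an existing Foundation or standard library API. It uses the standard library's UTF16 codec to decode and returns nil on the first ill-formed sequence.</div>
<pre>
```swift
// Hypothetical helper for illustration only; not an existing API.
// Decodes big-endian UTF-16 bytes and fails on unpaired surrogates.
func strictString(utf16BigEndianBytes bytes: [UInt8]) -> String? {
    // An odd byte count cannot form whole 16-bit code units.
    guard bytes.count % 2 == 0 else { return nil }

    // Combine each big-endian byte pair into one UTF-16 code unit.
    var units: [UInt16] = []
    units.reserveCapacity(bytes.count / 2)
    for i in stride(from: 0, to: bytes.count, by: 2) {
        units.append(UInt16(bytes[i]) * 256 + UInt16(bytes[i + 1]))
    }

    var decoder = UTF16()
    var iterator = units.makeIterator()
    var result = ""
    decoding: while true {
        switch decoder.decode(&iterator) {
        case .scalarValue(let scalar):
            result.unicodeScalars.append(scalar)
        case .emptyInput:
            break decoding   // Consumed everything without an error.
        case .error:
            return nil       // Unpaired surrogate: reject the input.
        }
    }
    return result
}

let rejected = strictString(utf16BigEndianBytes: [0xD8, 0x00])
// rejected == nil
let accepted = strictString(utf16BigEndianBytes: [0x00, 0x68, 0x00, 0x69])
// accepted == "hi"
```
</pre>
<div>The copy and the validation happen in the same single pass, so this adds no extra traversal when the bytes are being copied anyway.</div>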
<div> </div>
<div>Cheers,<br></div>
<div> </div>
<div>Paul<br></div>
<div> </div>
<div><u>_______________________________________________</u><br></div>
<div>swift-evolution mailing list<br></div>
<div><a href="mailto:swift-evolution@swift.org">swift-evolution@swift.org</a><br></div>
<div><a href="https://lists.swift.org/mailman/listinfo/swift-evolution">https://lists.swift.org/mailman/listinfo/swift-evolution</a><br></div>
</blockquote><div> </div>
</body>
</html>