[swift-evolution] Trial balloon: Ensure that String always contains valid Unicode

Tue Jan 5 11:01:06 CST 2016

> On Jan 4, 2016, at 5:39 PM, Kevin Ballard <kevin at sb.org> wrote:
> 
> On Mon, Jan 4, 2016, at 03:22 PM, Paul Cantrell wrote:
>>  
>> The bottom line is that not every NSString → String bridge needs to be O(n). At least in theory. Someone with more intimate knowledge of NSString can correct me if I’m wrong.
>  
> I thought it was a given that we can't modify NSString. If we can modify it, all bets are off; heck, if we can modify it, why not just make NSString reject invalid sequences to begin with?

Good question. And if we can’t modify NSString, then yes, we’re up against a tough problem.

But should NSString legacy constraints really compromise the design of Swift’s native String type?

Félix and Dmitri’s comments suggest that there are ways to prevent that, and that there’s precedent for placing any distasteful behavior necessary for compatibility in the bridging, not in the core type.

>> Keep in mind that we’re already incurring that O(n) expense right now for every Swift operation that turns an NSString-backed string into characters — that plus the API burden of having that check deferred, which is what originally motivated this thread.
>  
> That's true for native Strings as well. The native String storage is actually a sequence of UTF-16 code units, it's not a sequence of characters. Any time you iterate over the CharacterView, it has to calculate the grapheme cluster boundaries.

Aren’t Swift strings encoded as UTF-8, or at least designed to behave as if they are, however they might be stored under the hood?

https://github.com/apple/swift/blob/master/docs/StringDesign.rst#strings-are-encoded-as-utf-8
https://github.com/apple/swift/blob/master/docs/StringDesign.rst#how-would-you-design-it

Given the warning at the top about this having been a planning document, I see that this may no longer be true. But at least the original design rationale strongly suggests that String’s failable initializers should fail when given invalid Unicode.
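
To make "fail when given invalid Unicode" concrete, here is a minimal sketch of a validating conversion next to a repairing one. It is written against current Swift rather than the 2016 toolchain, and `validatedString(utf8:)` is a hypothetical helper invented for illustration, not an existing standard library API. It relies only on the `UTF8` codec's `decode(_:)` method and on `String(decoding:as:)`, which substitutes U+FFFD for ill-formed input instead of failing.

```swift
// Hypothetical helper: returns nil if `bytes` is not well-formed UTF-8.
// One O(n) pass over the input; nothing is repaired or replaced.
func validatedString(utf8 bytes: [UInt8]) -> String? {
    var decoder = UTF8()
    var iterator = bytes.makeIterator()
    var scalars = String.UnicodeScalarView()
    while true {
        switch decoder.decode(&iterator) {
        case .scalarValue(let scalar):
            scalars.append(scalar)      // well-formed scalar, keep going
        case .emptyInput:
            return String(scalars)      // reached the end with no errors
        case .error:
            return nil                  // ill-formed sequence: fail, do not repair
        }
    }
}

let valid: [UInt8] = [0x63, 0x61, 0x66, 0xC3, 0xA9]   // "café"
let invalid: [UInt8] = [0x63, 0x61, 0x66, 0xC3]       // truncated two-byte sequence

let strict = validatedString(utf8: valid)             // Optional("café")
let rejected = validatedString(utf8: invalid)         // nil
let repaired = String(decoding: invalid, as: UTF8.self) // "caf" followed by U+FFFD
```

The failable version surfaces the problem once, at the boundary where the bytes enter; the repairing version never fails but silently changes the content.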

> But that's ok, because unless you call `count` on it, you're typically doing an O(N) operation _anyway_. But there's plenty of things you can do with strings that don't require iterating over the CharacterView.

Indeed, but per my earlier message, those things could all still be O(1) except in the case when you’re transcoding a string from something other than ASCII or UTF-8 — and those transcoding cases are O(n) already. That certainly seems like a better design for the core lib.
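
As a small illustration of the distinction under discussion, code units versus characters, the sketch below counts the same string through each of its views. It uses current Swift spellings (`count` directly on `String`, rather than the `characters` view of the time), and the sample string is an assumption chosen so that the counts differ.

```swift
// "e" followed by U+0301 COMBINING ACUTE ACCENT, then "llo".
let s = "e\u{301}llo"

let characterCount = s.count                 // 4: grapheme clusters, needs boundary analysis
let scalarCount = s.unicodeScalars.count     // 5: "e" and the accent are separate scalars
let utf16Count = s.utf16.count               // 5: every scalar here fits in one UTF-16 unit
let utf8Count = s.utf8.count                 // 6: the combining accent takes two bytes
```

Only the first count requires finding grapheme cluster boundaries; the others are properties of the encoded form.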

I’m really hoping a core team member can weigh in on this.

Cheers,

Paul
