[swift-evolution] Trial balloon: Ensure that String always contains valid Unicode

Kevin Ballard kevin at sb.org
Mon Jan 4 17:39:20 CST 2016


On Mon, Jan 4, 2016, at 03:22 PM, Paul Cantrell wrote:
>
>> On Jan 4, 2016, at 5:11 PM, Kevin Ballard <kevin at sb.org> wrote:
>>
>> On Mon, Jan 4, 2016, at 03:08 PM, Paul Cantrell wrote:
>>>>> But doing lazy checking of strings would end up having to check
>>>>> *every* string that comes from ObjC
>>>
>>> I don’t think that’s necessarily true. There’s a limited set of
>>> places where invalid Unicode can creep into an NSString, and so the
>>> lazy check could probably bypass quite a few common cases — an ASCII
>>> string for example. Without digging into it, I suspect any NSString
>>> created from UTF-8 data can be safely bridged, since unpaired
>>> surrogate chars can’t make it through UTF-8.
>>
>> Every single method you implement that takes a `String` parameter and
>> is either exposed to Obj-C or overrides an Obj-C declaration will
>> have to check that String parameter every single time the function is
>> called.
>>
>> Every time you call an Obj-C method that returns a String, you'll
>> have to check that String result.
>
> Not necessarily. While it’s true that an NSString is represented as
> UTF-16 internally (right?), there’s a limited set of operations that
> can introduce invalid Unicode. In theory, at least, an NSString could
> keep a flag that tracks whether it could potentially be invalid.
>
> This is much better than the doomsday scenario you lay out in two
> respects:
>
> (1) That flag would start out false in many common situations
>     (including NSStrings decoded from UTF-8, Latin-1, and ASCII), and
>     could stay false with O(1) effort for substring operations. My
>     guess is that this covers the vast majority of strings floating
>     around in a typical app.
>
> (2) Once a string is verified, the flag can be flipped to false. No
>     need to keep revalidating. Yes, there are threading concerns with
>     that, but I trust the team that made the dark magic of Swift’s
>     `weak` references work may have some bright ideas on this.
>
> The bottom line is that not every NSString → String bridge needs to be
> O(n). At least in theory. Someone with more intimate knowledge of
> NSString can correct me if I’m wrong.

I thought it was a given that we can't modify NSString. If we can modify
it, all bets are off; heck, if we can modify it, why not just make
NSString reject invalid sequences to begin with?

Besides the fact that NSString is provided by the OS instead of the
Swift stdlib, relying on a modification to NSString also means that the
logic will only work on a new version of the OS that contains the
modified NSString.
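
For what it's worth, here's roughly what that flag scheme amounts to if
you sketch it in Swift. To be clear, this type and everything in it is
hypothetical -- it exists nowhere; it's only here to make the proposed
cost model concrete:

    struct LazilyValidatedString {
        private let units: [UInt16]   // stand-in for NSString's UTF-16 storage
        private var knownValid: Bool  // the "already verified" flag

        init(utf16 units: [UInt16], knownValid: Bool) {
            self.units = units
            // Strings decoded from UTF-8, Latin-1, or ASCII could start
            // out with knownValid == true, since unpaired surrogates
            // can't survive those encodings.
            self.knownValid = knownValid
        }

        // O(n) the first time, O(1) on every later call.
        mutating func isValid() -> Bool {
            if knownValid { return true }
            var i = 0
            while i < units.count {
                let u = units[i]
                if 0xD800...0xDBFF ~= u {
                    // High surrogate: must be followed by a low surrogate.
                    i += 1
                    if i == units.count || !(0xDC00...0xDFFF ~= units[i]) {
                        return false
                    }
                } else if 0xDC00...0xDFFF ~= u {
                    return false  // lone low surrogate
                }
                i += 1
            }
            knownValid = true  // memoize; no need to revalidate
            return true
        }
    }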

>> Basically, any time a String object is backed by an NSString, which
>> is going to be very common in most apps, that backing NSString will
>> have to be checked.
> Keep in mind that we’re *already* incurring that O(n) expense right
> now for every Swift operation that turns an NSString-backed string
> into characters — plus the API burden of having that check deferred,
> which is what originally motivated this thread.

That's true for native Strings as well. The native String storage is
actually a sequence of UTF-16 code units, not a sequence of characters.
Any time you iterate over the CharacterView, it has to calculate the
grapheme cluster boundaries. But that's OK, because unless you're just
calling `count`, you're typically doing an O(n) amount of work over the
string _anyway_. And there are plenty of things you can do with strings
that don't require iterating over the CharacterView at all.
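
To make that last point concrete, here's a small example (Swift 2-era
syntax; the counts assume a native String, and the cheapness of the
UTF-16 view's count assumes the native storage really is UTF-16, as
described above):

    let s = "e\u{301}galite\u{301}"  // "égalité", spelled with combining accents

    s.utf16.count       // 9 -- cheap: the storage is already UTF-16 code units
    s.characters.count  // 7 -- O(n): must compute grapheme cluster boundaries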

-Kevin Ballard