[swift-evolution] Strings in Swift 4

Ben Cohen ben_cohen at apple.com
Tue Jan 24 11:54:05 CST 2017


> On Jan 24, 2017, at 12:35 AM, Russ Bishop <xenadu at gmail.com> wrote:
> 
>> ## Open Questions
>> 
>> ### Must `String` be limited to storing UTF-16 subset encodings?
>> 
>> - The ability to handle `UTF-8`-encoded strings (models of `Unicode`) is not in
>> question here; this is about what encodings must be storable, without
>> transcoding, in the common currency type called “`String`”.
>> - ASCII, Latin-1, UCS-2, and UTF-16 are UTF-16 subsets.  UTF-8 is not.
> 
> Depending on who you believe, UTF-8 is the encoding of ~65-88% of all text content transmitted over the web. JSON and XML represent the lion’s share of REST and non-REST APIs in use, and both are almost exclusively transmitted as UTF-8. As you point out with extendedASCII, a lot of markup and structure is ASCII even if the content is not, so UTF-8 represents significant size savings even on Chinese/Japanese web pages that require 3 bytes to represent many characters (the savings on markup overwhelming the loss on textual content).

Right – unfortunately, when we’ve had these conversations they start with solid data (like https://w3techs.com/technologies/history_overview/character_encoding/ms/y – though note this data tells you nothing about what proportion of UTF-8 would be better held in memory as Latin-1 or UTF-16), but then descend into hand-wavy “utf8 is faster because tags” :) I’d love to see some empirical data on this; there must be some out there. It’s no small undertaking to produce it, though.
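For what it’s worth, the raw size tradeoff Russ describes is easy to see from code unit counts. A quick sketch using the standard library’s encoded views (the sample strings are just made-up examples):

    // ASCII-heavy markup: one byte per scalar in UTF-8, two in UTF-16.
    let markup = "<div class=\"post\"><p>hello</p></div>"
    print(markup.utf8.count)       // 36 bytes as UTF-8
    print(markup.utf16.count * 2)  // 72 bytes as UTF-16

    // CJK text: most scalars take 3 bytes in UTF-8 but only 2 in UTF-16.
    let cjk = "日本語のテキスト"
    print(cjk.utf8.count)          // 24 bytes as UTF-8
    print(cjk.utf16.count * 2)     // 16 bytes as UTF-16

What we don’t have is corpus-level data on how those two effects net out for the strings programs actually hold in memory – which is exactly the empirical question above.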

The Java folks did some research into the speedup gains from compacting down to Latin-1 where possible in their implementation (http://openjdk.java.net/jeps/254). But their hands were presumably partly tied by needing to preserve random access into UTF-16, so I don’t know whether they really considered/benchmarked UTF-8, and the detail about the exact nature of the corpus they tested on appears scant (“a collection of over 950 heap dumps from a variety of different Oracle software applications using Java”), unless there are more in-depth papers available that I haven’t found.
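(For reference, the condition JEP 254 exploits – every scalar fits in one byte, so the string can be stored at one byte per code unit while still mapping 1:1 onto UTF-16 code units for random access – is straightforward to express. A sketch; the helper name is mine:)

    /// Whether every scalar is in Latin-1 (U+0000...U+00FF), i.e. the
    /// string could be stored one byte per code unit with no loss and
    /// still support constant-time indexing as if it were UTF-16.
    func fitsInLatin1(_ s: String) -> Bool {
        return !s.unicodeScalars.contains { $0.value > 0xFF }
    }

    fitsInLatin1("café")    // true – U+00E9 is in Latin-1
    fitsInLatin1("日本語")   // false – would need UTF-16 (or UTF-8)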

> Any model that makes UTF-8-backed Strings difficult or cumbersome to use can have a negative performance and memory impact. I don’t have a good sense of the actual cost, but it might be worth doing some tests to determine that.
> 

A UTF8String type that a developer chooses explicitly for performance reasons ought not to be cumbersome to use with much of the standard library, which will be implemented mostly generically on Unicode or Collection. The question is: will it be a problem that creators of UTF8String find themselves needing to convert to regular String often when passing values into APIs implemented in terms of String? This isn’t really something we can test easily – it can only be determined in the field, based on which APIs those users find themselves using. My instinct is that it won’t be a big problem – that the set of people who need to hold UTF-8 for performance doesn’t overlap much with the set of people who need to pass those strings into higher-level libraries (at least on the performance-critical paths of their code), as long as we provide enough batteries-included features on String. But that’s not an evidence-based notion, just a hope.
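To make that boundary concrete, here’s roughly the shape I mean – a hypothetical UTF8String (none of these names exist today; this is purely illustrative) that generic Collection algorithms consume directly, but that needs an explicit transcoding step at a String-taking API:

    // Hypothetical type: stores UTF-8 code units directly, no transcoding.
    struct UTF8String {
        var bytes: [UInt8]
        init(_ s: String) { bytes = Array(s.utf8) }
    }

    extension UTF8String: Collection {
        var startIndex: Int { return bytes.startIndex }
        var endIndex: Int { return bytes.endIndex }
        subscript(i: Int) -> UInt8 { return bytes[i] }
        func index(after i: Int) -> Int { return i + 1 }
    }

    // Generic algorithms implemented on Collection just work:
    let u = UTF8String("hello")
    let hasH = u.contains(0x68)  // true – 0x68 is “h”

    // ...but an API written in terms of String forces a conversion
    // (and, with UTF-16-subset storage, a transcode). The initializer
    // here is a stand-in for whatever transcoding init we provide:
    func log(_ message: String) { print(message) }
    log(String(decoding: u.bytes, as: UTF8.self))

Whether that last line lands on anyone’s performance-critical path is the part that can only be answered in the field.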

> Is NSString interop the only reason not to just use UTF-8 as the default storage? If so, is that a solvable problem? Could one choose which default storage they wanted via a typealias or a compiler flag?
> 

Not just NSString interop – ICU interop as well, which is very important given we rely on ICU to implement much of our Unicode functionality.

