[swift-evolution] Strings in Swift 4

Dave Abrahams dabrahams at apple.com
Tue Jan 24 12:10:11 CST 2017


on Tue Jan 24 2017, Russ Bishop <xenadu-AT-gmail.com> wrote:

>> On Jan 19, 2017, at 6:56 PM, Ben Cohen via swift-evolution <swift-evolution at swift.org> wrote:
>> 
>> ### Formatting
>> 
>> A full treatment of formatting is out of scope of this proposal, but
>> we believe it's crucial for completing the text processing picture.  This
>> section details some of the existing issues and thinking that may guide future
>> development.
>> 
>
> Filesystem paths are Strings on Apple platforms but not on Linux. 

What, exactly, do you mean by that?

> How are we going to square that circle? What about Swift on the
> server, where distinguishing HTML and JavaScript is security-critical?
> There are huge security implications to string processing, often
> around platforms making it easy to do the wrong thing in a careless
> way and promoting ad-hoc formatting, serialization and parsing. That’s
> a huge area to consider of course but it might be worth thinking about
> how an ergonomic API for a few example cases would work.
>
> I guess my point is that formatting and interpolation are far more than
> “just formatting”; making the right thing difficult will directly lead
> to exploitable security vulnerabilities or not as the case may be. (To
> be clear I’m not saying the follow-on proposals from this need to
> solve those problems, maybe just give them some consideration).
>
>> ## Open Questions
>> 
>> ### Must `String` be limited to storing UTF-16 subset encodings?
>> 
>> - The ability to handle `UTF-8`-encoded strings (models of `Unicode`) is not in
>>  question here; this is about what encodings must be storable, without
>>  transcoding, in the common currency type called “`String`”.
>> - ASCII, Latin-1, UCS-2, and UTF-16 are UTF-16 subsets.  UTF-8 is not.
>
> Depending on who you believe, UTF-8 is the encoding of ~65-88% of all
> text content transmitted over the web. JSON and XML represent the
> lion’s share of REST and non-REST APIs in use and both are almost
> exclusively transmitted as UTF-8. As you point out with extendedASCII,
> a lot of markup and structure is ASCII even if the content is not so
> UTF-8 represents a significant size savings even on Chinese/Japanese
> web pages that require 3 bytes to represent many characters (the
> savings on markup overwhelming the loss on textual content).
>
> Any model that makes UTF-8-backed Strings difficult or cumbersome to
> use can have a negative performance and memory impact. I don’t have a
> good idea of the actual cost, but it might be worth running some tests
> to measure it.
>
> Is NSString interop the only reason to not just use UTF-8 as the
> default storage? 

Perhaps more universally important is interop with ICU, which also
requires UTF-16 in most places.

None of this necessarily means String won't be able to store UTF-8, but
it does mean you'd pay a conversion cost when a UTF-8 string crosses the
boundary into Foundation or ICU.
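
To make that cost concrete, here is a minimal sketch using only
standard library calls. It is not a statement about how String will be
implemented; it just assumes some text is held as UTF-8 code units and
shows the transcoding pass needed to hand it to an API that wants
UTF-16. The sample text and variable names are made up for
illustration.

```swift
// Escapes keep the scalars unambiguous: U+00EF (ï) and U+00E9 (é).
let text = "na\u{EF}ve caf\u{E9}"

let utf8Units = Array(text.utf8)    // [UInt8]: 8 ASCII bytes plus 2 two-byte sequences
let utf16Units = Array(text.utf16)  // [UInt16]: one code unit per scalar for this text

// Simulate handing UTF-8 storage to an API that wants UTF-16:
// every code unit has to be decoded and re-encoded.
var transcoded: [UInt16] = []
let sawError = transcode(
    utf8Units.makeIterator(),
    from: UTF8.self,
    to: UTF16.self,
    stoppingOnError: true
) { transcoded.append($0) }

print(utf8Units.count, utf16Units.count, sawError)
```

For this text the UTF-8 form is 12 code units and the UTF-16 form is
10; the point is only that the conversion touches every code unit, so
the cost scales with the length of the string.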

> If so, is that a solvable problem? Could one choose by typealias or a
> compiler flag which default storage they wanted?

The latter *might* be possible, but I'd very much rather not.

>> - If we have a way to get at a `String`'s code units, we need a concrete type in
>>  which to express them in the API of `String`, which is a concrete type.
>> - If String needs to be able to represent UTF-32, presumably the code units need
>>  to be `UInt32`.
>> - Not supporting UTF-32-encoded text seems like one reasonable design choice.
>> - Maybe we can allow UTF-8 storage in `String` and expose its code units as
>>  `UInt16`, just as we would for Latin-1.
>> - Supporting only UTF-16-subset encodings would imply that `String` indices can
>>  be serialized without recording the `String`'s underlying encoding.
>
> I suppose you could be clever on 64-bit platforms by stealing some
> bits to indicate the encoding…  not that I recommend that :D

There's nothing particularly tricky about that; it's absolutely
possible and the sort of thing we'd consider.
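
As a rough sketch of what that could look like (all names here are
hypothetical, and the choice of two tag bits is an assumption for
illustration, not a design decision), a 64-bit value can carry an
encoding tag in its top bits and a code unit offset in the rest:

```swift
// Hypothetical tag identifying how the underlying storage is encoded.
enum StorageEncoding: UInt64 {
    case latin1 = 0
    case utf16 = 1
    case utf8 = 2
}

// Hypothetical 64-bit position: top 2 bits hold the encoding tag,
// the low 62 bits hold the code unit offset.
struct TaggedOffset {
    private static let tagShift: UInt64 = 62
    private static let offsetMask: UInt64 = (1 << 62) - 1

    private(set) var bits: UInt64

    init(offset: UInt64, encoding: StorageEncoding) {
        precondition(offset <= TaggedOffset.offsetMask,
                     "offset must fit in the low 62 bits")
        bits = (encoding.rawValue << TaggedOffset.tagShift) | offset
    }

    var offset: UInt64 {
        return bits & TaggedOffset.offsetMask
    }

    var encoding: StorageEncoding {
        // Only tags produced by init can appear here, so this cannot fail.
        return StorageEncoding(rawValue: bits >> TaggedOffset.tagShift)!
    }
}

let position = TaggedOffset(offset: 42, encoding: .utf8)
print(position.offset, position.encoding)
```

A serialized value of this kind records its own encoding, at the price
of a slightly smaller offset range.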

>> ### Do we need a type-erasable base protocol for UnicodeEncoding?
>> 
>> UnicodeEncoding has an associated type, but it may be important to be able to
>> traffic in completely dynamic encoding values, e.g. for “tell me the most
>> efficient encoding for this string.”
>
> Generalized Existentials 
> tis but happiness by another name
> For we who live 
> in The Land of Protocols and Faeries
>
>> 
>> ### Should there be a string “facade?”
>> 
>> One possible design alternative makes `Unicode` a vehicle for expressing
>> the storage and encoding of code units, but does not attempt to give it an API
>> appropriate for `String`.  Instead, string APIs would be provided by a generic
>> wrapper around an instance of `Unicode`:
>> 
>> ```swift
>> struct StringFacade<U: Unicode> : BidirectionalCollection {
>> 
>>  // ...APIs for high-level string processing here...
>> 
>>  var unicode: U // access to lower-level unicode details
>> }
>> 
>> typealias String = StringFacade<StringStorage>
>> typealias Substring = StringFacade<StringStorage.SubSequence>
>> ```
>> 
>> This design would allow us to de-emphasize lower-level `String` APIs such as
>> access to the specific encoding, by putting them behind a `.unicode` property.
>> A similar effect in a facade-less design would require a new top-level
>> `StringProtocol` playing the role of the facade with an `associatedtype
>> Storage : Unicode`.
>> 
>> An interesting variation on this design is possible if defaulted generic
>> parameters are introduced to the language:
>> 
>> ```swift
>> struct String<U: Unicode = StringStorage> 
>>  : BidirectionalCollection {
>> 
>>  // ...APIs for high-level string processing here...
>> 
>>  var unicode: U // access to lower-level unicode details
>> }
>> 
>> typealias Substring = String<StringStorage.SubSequence>
>> ```
>> 
>> One advantage of such a design is that naïve users will always extend “the right
>> type” (`String`) without thinking, and the new APIs will show up on `Substring`,
>> `MyUTF8String`, etc.  That said, it also has downsides that should not be
>> overlooked, not least of which is the confusability of the meaning of the word
>> “string.”  Is it referring to the generic or the concrete type?
>
> Fair point, but I do like the idea of separating the two and
> encouraging people to extend String while automatically extending all
> the String-ish types. This would compose well with a hypothetical
> HTMLString, JavaScriptString, etc (assuming one could design a model
> where those things compose well, e.g. appending MyUTF8String to
> HTMLString performs automatic HTML-escaping whereas appending
> HTMLString to HTMLString does not).

Another thing that has been pointed out is that we'll be able to make
String RangeReplaceable when its storage is, but we can't do the same
thing for a protocol.
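
A minimal sketch of that point, assuming the language has conditional
conformances. `Facade` is a hypothetical stand-in for the generic
wrapper, and plain standard library collections stand in for `Unicode`
storage:

```swift
// Hypothetical facade: forwards collection operations to its storage.
struct Facade<Storage: BidirectionalCollection>: BidirectionalCollection {
    var storage: Storage

    typealias Index = Storage.Index
    typealias Element = Storage.Element

    var startIndex: Index { return storage.startIndex }
    var endIndex: Index { return storage.endIndex }

    subscript(position: Index) -> Element { return storage[position] }

    func index(after i: Index) -> Index { return storage.index(after: i) }
    func index(before i: Index) -> Index { return storage.index(before: i) }
}

// The facade gains RangeReplaceableCollection only when its storage has it.
extension Facade: RangeReplaceableCollection
    where Storage: RangeReplaceableCollection {

    init() {
        storage = Storage()
    }

    mutating func replaceSubrange<C: Collection>(
        _ subrange: Range<Index>, with newElements: C
    ) where C.Element == Element {
        storage.replaceSubrange(subrange, with: newElements)
    }
}

var mutable = Facade(storage: [1, 2, 3])
mutable.append(4)  // available because Array is range-replaceable
print(Array(mutable))
```

A `Facade` over storage that is not range-replaceable simply does not
get `append` and friends, which is the behavior a protocol cannot
express.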

> Anything that avoids forcing the average app or library author to stop
> and think about which String type to use is probably a net win if the
> performance isn’t horrible; someone writing a web server pipeline will
> need to write their own String-ish type for performance reasons anyway
> so a slight perf hit may be no great loss.

What are you imagining would cause a “perf hit?”

> Thanks to you and Ben for the hard work so far; I can’t even imagine
> taking on such a task!

Best job in the world.  Bar none.

-- 
-Dave
