[swift-evolution] Strings in Swift 4

Russ Bishop xenadu at gmail.com
Tue Jan 24 02:35:33 CST 2017


> On Jan 19, 2017, at 6:56 PM, Ben Cohen via swift-evolution <swift-evolution at swift.org> wrote:
> 
> ### Formatting
> 
> A full treatment of formatting is out of scope of this proposal, but
> we believe it's crucial for completing the text processing picture.  This
> section details some of the existing issues and thinking that may guide future
> development.
> 

Filesystem paths are Strings on Apple platforms but not on Linux. How are we going to square that circle? What about Swift on the server, where distinguishing HTML and JavaScript is security-critical? There are huge security implications to string processing, often around platforms making it easy to do the wrong thing in a careless way and promoting ad-hoc formatting, serialization and parsing. That’s a huge area to consider, of course, but it might be worth thinking about how an ergonomic API for a few example cases would work.

I guess my point is that formatting and interpolation are far more than “just formatting”; whether the right thing is easy or difficult will directly determine whether we end up with exploitable security vulnerabilities. (To be clear, I’m not saying the follow-on proposals from this need to solve those problems; maybe just give them some consideration.)



> ## Open Questions
> 
> ### Must `String` be limited to storing UTF-16 subset encodings?
> 
> - The ability to handle `UTF-8`-encoded strings (models of `Unicode`) is not in
>  question here; this is about what encodings must be storable, without
>  transcoding, in the common currency type called “`String`”.
> - ASCII, Latin-1, UCS-2, and UTF-16 are UTF-16 subsets.  UTF-8 is not.

Depending on who you believe, UTF-8 is the encoding of ~65-88% of all text content transmitted over the web. JSON and XML represent the lion’s share of REST and non-REST APIs in use, and both are almost exclusively transmitted as UTF-8. As you point out with extendedASCII, a lot of markup and structure is ASCII even if the content is not, so UTF-8 represents a significant size savings even on Chinese/Japanese web pages that require 3 bytes to represent many characters (the savings on markup overwhelm the loss on textual content).

Any model that makes UTF-8-backed Strings difficult or cumbersome to use can have a negative performance and memory impact. I don’t have a good idea of the actual cost, but it might be worth doing some tests to determine it.
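
As a very rough illustration of the size argument (not a benchmark), here is a sketch that compares the storage cost of the same text in both encodings. The sample strings and the `storageBytes` helper are made up for this example; only the `utf8` and `utf16` views come from the standard library.

```swift
// Hypothetical helper: bytes needed to store `s` as UTF-8 vs. UTF-16.
func storageBytes(_ s: String) -> (utf8: Int, utf16: Int) {
    // UTF-8 code units are 1 byte each; UTF-16 code units are 2 bytes each.
    return (utf8: s.utf8.count, utf16: s.utf16.count * 2)
}

// Made-up samples: Japanese text on its own, and the same text wrapped in ASCII markup.
let text = "こんにちは世界"
let page = "<html><body><p class=\"greeting\">\(text)</p></body></html>"

let textCost = storageBytes(text)
let pageCost = storageBytes(page)

print("text only: UTF-8 \(textCost.utf8) bytes, UTF-16 \(textCost.utf16) bytes")
print("with markup: UTF-8 \(pageCost.utf8) bytes, UTF-16 \(pageCost.utf16) bytes")
```

The bare text is larger in UTF-8, but once the ASCII markup is included the UTF-8 form comes out smaller. Real pages would need real measurements, of course.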

Is NSString interop the only reason to not just use UTF-8 as the default storage? If so, is that a solvable problem? Could one choose by typealias or a compiler flag which default storage they wanted?


> - If we have a way to get at a `String`'s code units, we need a concrete type in
>  which to express them in the API of `String`, which is a concrete type
> - If String needs to be able to represent UTF-32, presumably the code units need
>  to be `UInt32`.
> - Not supporting UTF-32-encoded text seems like one reasonable design choice.
> - Maybe we can allow UTF-8 storage in `String` and expose its code units as
>  `UInt16`, just as we would for Latin-1.
> - Supporting only UTF-16-subset encodings would imply that `String` indices can
>  be serialized without recording the `String`'s underlying encoding.

I suppose you could be clever on 64-bit platforms by stealing some bits to indicate the encoding… not that I recommend that :D
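
For the curious, a minimal sketch of what that bit-stealing could look like, assuming a 64-bit offset whose top two bits are reserved for an encoding tag. `StorageEncoding` and `TaggedOffset` are hypothetical names invented here; nothing like this exists in the standard library, and I still don’t recommend it.

```swift
// Hypothetical tag for the encoding an offset was measured in.
enum StorageEncoding: UInt64 {
    case utf16 = 0, utf8 = 1, utf32 = 2
}

// Hypothetical serializable index: top 2 bits hold the tag, low 62 bits the offset.
struct TaggedOffset {
    static let tagShift: UInt64 = 62
    static let offsetMask: UInt64 = (1 << 62) - 1

    let bits: UInt64

    // Fails if the offset would collide with the tag bits.
    init?(offset: UInt64, encoding: StorageEncoding) {
        guard offset <= TaggedOffset.offsetMask else { return nil }
        bits = (encoding.rawValue << TaggedOffset.tagShift) | offset
    }

    var offset: UInt64 {
        return bits & TaggedOffset.offsetMask
    }

    var encoding: StorageEncoding {
        // Safe to force-unwrap: `bits` is only ever built from a valid tag in init.
        return StorageEncoding(rawValue: bits >> TaggedOffset.tagShift)!
    }
}

if let index = TaggedOffset(offset: 42, encoding: .utf8) {
    print("offset \(index.offset), encoding \(index.encoding)")
}
```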

> 
> ### Do we need a type-erasable base protocol for UnicodeEncoding?
> 
> UnicodeEncoding has an associated type, but it may be important to be able to
> traffic in completely dynamic encoding values, e.g. for “tell me the most
> efficient encoding for this string.”

Generalized Existentials 
tis but happiness by another name
For we who live 
in The Land of Protocols and Faeries

> 
> ### Should there be a string “facade?”
> 
> One possible design alternative makes `Unicode` a vehicle for expressing
> the storage and encoding of code units, but does not attempt to give it an API
> appropriate for `String`.  Instead, string APIs would be provided by a generic
> wrapper around an instance of `Unicode`:
> 
> ```swift
> struct StringFacade<U: Unicode> : BidirectionalCollection {
> 
>  // ...APIs for high-level string processing here...
> 
>  var unicode: U // access to lower-level unicode details
> }
> 
> typealias String = StringFacade<StringStorage>
> typealias Substring = StringFacade<StringStorage.SubSequence>
> ```
> 
> This design would allow us to de-emphasize lower-level `String` APIs such as
> access to the specific encoding, by putting them behind a `.unicode` property.
> A similar effect in a facade-less design would require a new top-level
> `StringProtocol` playing the role of the facade with an `associatedtype
> Storage : Unicode`.
> 
> An interesting variation on this design is possible if defaulted generic
> parameters are introduced to the language:
> 
> ```swift
> struct String<U: Unicode = StringStorage> 
>  : BidirectionalCollection {
> 
>  // ...APIs for high-level string processing here...
> 
>  var unicode: U // access to lower-level unicode details
> }
> 
> typealias Substring = String<StringStorage.SubSequence>
> ```
> 
> One advantage of such a design is that naïve users will always extend “the right
> type” (`String`) without thinking, and the new APIs will show up on `Substring`,
> `MyUTF8String`, etc.  That said, it also has downsides that should not be
> overlooked, not least of which is the confusability of the meaning of the word
> “string.”  Is it referring to the generic or the concrete type?

Fair point, but I do like the idea of separating the two and encouraging people to extend String while automatically extending all the String-ish types. This would compose well with a hypothetical HTMLString, JavaScriptString, etc. (assuming one could design a model where those things compose well, e.g. appending MyUTF8String to HTMLString performs automatic HTML-escaping whereas appending HTMLString to HTMLString does not).
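
To make the composition idea concrete, here is a minimal sketch of such a hypothetical `HTMLString`. It is a standalone struct rather than anything built on the proposed `Unicode` protocol or facade, plain `String` stands in for MyUTF8String, and the initializer and method names are my own invention. The escaping table is illustrative only, not a vetted sanitizer.

```swift
// Hypothetical wrapper: its contents are always safe-to-emit markup.
struct HTMLString {
    private(set) var markup: String

    // Wraps text the caller asserts is already valid, trusted markup.
    init(trustedMarkup: String) {
        self.markup = trustedMarkup
    }

    // Appending plain text escapes it first.
    mutating func append(_ text: String) {
        markup += HTMLString.escape(text)
    }

    // Appending another HTMLString does not escape again.
    mutating func append(_ html: HTMLString) {
        markup += html.markup
    }

    // Replaces the characters that are significant in HTML with entities.
    static func escape(_ text: String) -> String {
        var out = ""
        for ch in text {
            switch ch {
            case "&": out += "&amp;"
            case "<": out += "&lt;"
            case ">": out += "&gt;"
            case "\"": out += "&quot;"
            case "'": out += "&#39;"
            default: out.append(ch)
            }
        }
        return out
    }
}

var body = HTMLString(trustedMarkup: "<p>")
body.append("<script>alert(1)</script>")           // untrusted text: escaped
body.append(HTMLString(trustedMarkup: "</p>"))     // trusted markup: appended as-is
print(body.markup)
```

The point is that the type of the right-hand side picks the behavior, so the careless path (appending a plain string) is also the safe one.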

Anything that avoids forcing the average app or library author to stop and think about which String type to use is probably a net win if the performance isn’t horrible; someone writing a web server pipeline will need to write their own String-ish type for performance reasons anyway, so a slight perf hit may be no great loss.


Thanks to you and Ben for the hard work so far; I can’t even imagine taking on such a task!

Russ


