[swift-evolution] Strings in Swift 4

Wed Jan 25 15:16:47 CST 2017

on Tue Jan 24 2017, Karl Wagner <swift-evolution at swift.org> wrote:

>> 
>>> I hope I am correct about the no-copy thing, and I would also like to
>>> permit promoting C strings to Swift strings without validation.  This
>>> is obviously unsafe in general, but I know my strings... and I care
>>> about performance. ;)
>> 
>> We intend to support that use-case.  That's part of the reason for the
>> ValidUTF8 and ValidUTF16 encodings you see here:
>> https://github.com/apple/swift/blob/unicode-rethink/stdlib/public/core/Unicode2.swift#L598
>> and here:
>> https://github.com/apple/swift/blob/unicode-rethink/stdlib/public/core/Unicode2.swift#L862
>
> It seems a little strange to me that a pre-validated UTF8 string from C would have different types
> to a UTF8String (i.e. using ValidUTF8 vs UTF8). It defeats the point of having the encoding
> represented in the type-system.

Why do you say that?  

The main point is to allow the compiler to make static choices about how
to do decoding efficiently.
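
As a rough illustration of what "static choices" can mean, here is a
self-contained sketch.  Every name in it (SketchEncoding, UnvalidatedUTF8,
PrevalidatedUTF8, scalarCount) is hypothetical and invented for this
example; none of it is the prototype's API.  Because the encoding is a type
parameter, the call can be resolved statically when the generic function is
specialized, so pre-validated input can take the cheaper path.

```swift
// Hypothetical sketch; these names are not from the unicode-rethink prototype.
protocol SketchEncoding {
    /// Number of Unicode scalars encoded by `bytes`.
    static func scalarCount(_ bytes: [UInt8]) -> Int
}

/// UTF-8 that may be malformed: go through the stdlib's repairing decoder.
enum UnvalidatedUTF8: SketchEncoding {
    static func scalarCount(_ bytes: [UInt8]) -> Int {
        // Ill-formed sequences are replaced with U+FFFD while decoding.
        return String(decoding: bytes, as: UTF8.self).unicodeScalars.count
    }
}

/// UTF-8 the caller promises is valid: every byte that is not a
/// continuation byte (10xxxxxx) starts exactly one scalar.
enum PrevalidatedUTF8: SketchEncoding {
    static func scalarCount(_ bytes: [UInt8]) -> Int {
        return bytes.filter { $0 & 0xC0 != 0x80 }.count
    }
}

// The encoding is chosen by type, not by a runtime flag.
func scalarCount<E: SketchEncoding>(of bytes: [UInt8], as encoding: E.Type) -> Int {
    return E.scalarCount(bytes)
}

let bytes: [UInt8] = [0x68, 0xC3, 0xA9, 0x6C, 0x6C, 0x6F]  // "héllo" in UTF-8
print(scalarCount(of: bytes, as: UnvalidatedUTF8.self))   // 5
print(scalarCount(of: bytes, as: PrevalidatedUTF8.self))  // 5
```

Both calls give the same answer for well-formed input; the difference is
only in how much work is done to get there.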

> For example, if I write a generic function:
>
> func sendMessage<Source: Unicode>(from: Source) where Source.Encoding == UTF8
>
> I would only be able to accept UTF-8 text which hasn’t already been
> validated. 

// A common refinement lets generic code accept either encoding.
protocol UTF8Encoding : UnicodeEncoding where CodeUnit == UInt8 {}
extension UTF8 : UTF8Encoding {}
extension ValidUTF8 : UTF8Encoding {}

func sendMessage<Source: Unicode>(from: Source)
  where Source.Encoding : UTF8Encoding
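
For anyone who wants to try the shape of this in a playground, here is a
minimal sketch that compiles on its own.  The Demo-prefixed types are
simplified stand-ins I made up for the example (they are not the
declarations on the unicode-rethink branch), and sendMessage returns a
String here only so there is something to print.

```swift
// Hypothetical stand-ins; none of these are the real stdlib declarations.
protocol DemoEncoding {
    associatedtype CodeUnit
    static var name: String { get }
}

// Refinement shared by every encoding whose code units are UTF-8 bytes.
protocol DemoUTF8Encoding: DemoEncoding where CodeUnit == UInt8 {}

enum DemoUTF8: DemoUTF8Encoding {
    typealias CodeUnit = UInt8
    static let name = "UTF8"
}
enum DemoValidUTF8: DemoUTF8Encoding {
    typealias CodeUnit = UInt8
    static let name = "ValidUTF8"
}
enum DemoUTF16: DemoEncoding {
    typealias CodeUnit = UInt16
    static let name = "UTF16"
}

// Stand-in for the `Unicode` protocol discussed in the thread.
protocol DemoUnicode {
    associatedtype Encoding: DemoEncoding
    var codeUnits: [Encoding.CodeUnit] { get }
}

struct DemoText<E: DemoEncoding>: DemoUnicode {
    typealias Encoding = E
    var codeUnits: [E.CodeUnit]
}

// Accepts text in any encoding that refines DemoUTF8Encoding.
func sendMessage<Source: DemoUnicode>(from source: Source) -> String
    where Source.Encoding: DemoUTF8Encoding {
    return "\(Source.Encoding.name): \(source.codeUnits.count) bytes"
}

let unchecked = DemoText<DemoUTF8>(codeUnits: [0x68, 0x69])
let validated = DemoText<DemoValidUTF8>(codeUnits: [0x68, 0x69])
print(sendMessage(from: unchecked))  // UTF8: 2 bytes
print(sendMessage(from: validated))  // ValidUTF8: 2 bytes
// sendMessage(from: DemoText<DemoUTF16>(codeUnits: [0x68])) is rejected
// at compile time, because DemoUTF16 does not refine DemoUTF8Encoding.
```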

> What about if we allowed each encoding to provide multiple kinds of decoder? That would also allow
> us to substitute our own decoders in, if there are application-specific shortcuts we can take.
>
> protocol UnicodeEncoding {
>   associatedtype CodeUnit
>
>   associatedtype ValidatingDecoder: UnicodeDecoder
>   associatedtype NonValidatingDecoder: UnicodeDecoder
> }
>
> protocol UnicodeDecoder {
>     associatedtype Encoding: UnicodeEncoding
>     associatedtype DecodedScalar: RandomAccessCollection where Iterator.Element == Encoding.CodeUnit
>
>     static func parse1Forward<C>(…) -> ParseResult<DecodedScalar, C.Index>
>     static func parse1Backward<C>(…) -> ParseResult<DecodedScalar, C.Index>
> }
> // Not shown: UnicodeEncoder protocol, with transcodeScalar<T> function.
>
> struct UTF8: UnicodeEncoding  { 
>     typealias CodeUnit             = UInt8  
>     typealias ValidatingDecoder    = ValidatingUTF8Decoder
>     typealias NonValidatingDecoder = NonValidatingUTF8Decoder
> }
>
> struct NonValidatingUTF8Decoder: UnicodeDecoder {
>     typealias Encoding = UTF8
>     struct DecodedScalar: RandomAccessCollection { … }
>     // Parsing functions
> }
>
> struct ValidatingUTF8Decoder: UnicodeDecoder {
>     typealias Encoding = UTF8
>     typealias DecodedScalar = NonValidatingUTF8Decoder.DecodedScalar // newtype would be cool here
>     // Parsing functions
> }
>
> struct String {
>     init<C, Encoding, Decoder>(from: C, encodedAs: Encoding, using: Decoder = Encoding.ValidatingDecoder)
>         where C: Collection, C.Iterator.Element == Encoding.CodeUnit, Decoder.Encoding == Encoding {
>
>          // transcode to native String encoding using ‘Decoder’ we were given
>     }
> }

That's another way to slice the same pie.  I'll think about this, thanks.

Note: part of the thinking had been that we might want to represent other
information, like "it's NFC normalized" in the encoding type.  At that
point, I think a design like your suggestion above may start to get messy.
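
To show what carrying "it's NFC normalized" in the encoding type could look
like, here is a small sketch.  All of the names (NormAwareEncoding,
PlainUTF8, NFCUTF8, TaggedText) are hypothetical, and it leans on
Foundation's precomposedStringWithCanonicalMapping for the normalization
step; it is only meant to illustrate the idea, not any planned design.

```swift
import Foundation  // for precomposedStringWithCanonicalMapping (NFC)

// Hypothetical names throughout; not the prototype's API.
protocol NormAwareEncoding {
    /// True when text in this encoding is promised to be NFC already.
    static var isKnownNFC: Bool { get }
}
enum PlainUTF8: NormAwareEncoding { static let isKnownNFC = false }
enum NFCUTF8: NormAwareEncoding { static let isKnownNFC = true }

struct TaggedText<E: NormAwareEncoding> {
    var contents: String

    /// Scalar values in NFC, normalizing only when the type does not
    /// already promise it.
    var nfcScalars: [UInt32] {
        let s = E.isKnownNFC ? contents : contents.precomposedStringWithCanonicalMapping
        return s.unicodeScalars.map { $0.value }
    }
}

let decomposed = TaggedText<PlainUTF8>(contents: "e\u{301}")  // e + combining acute
let composed = TaggedText<NFCUTF8>(contents: "\u{E9}")        // precomposed é
print(decomposed.nfcScalars == composed.nfcScalars)  // true
```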

-- 
-Dave