[swift-evolution] Strings in Swift 4

Dave Abrahams dabrahams at apple.com
Tue Jan 31 13:23:49 CST 2017


on Mon Jan 30 2017, Olivier Tardieu <tardieu-AT-us.ibm.com> wrote:

> Thanks for the clarifications.
> More comments below.
>
> dabrahams at apple.com wrote on 01/24/2017 05:50:59 PM:
>
>> Maybe it wasn't clear from the document, but the intention is that
>> String would be able to use any model of Unicode as a backing store, and
>> that you could easily build unsafe models of Unicode... but also that
>> you could use your unsafe model of Unicode directly, in string-ish ways.
>
> I see. If I understand correctly, it will be possible for instance to 
> implement an unsafe model of Unicode with a UInt8 code unit and a 
> maxLengthOfEncodedScalar equal to 1 by only keeping the 8 lowest bits of 
> Unicode scalars.

Eh... I think you'd just use an unsafe Latin-1 for that; why waste a
bit?

Here's an example (work very much in-progress):
https://github.com/apple/swift/blob/9defe9ded43c6f480f82a28d866ec73d803688db/test/Prototypes/Unicode.swift#L877
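
Roughly, the shape of such a thing (a sketch with made-up names, not the
prototype's actual protocol requirements):

  // A Latin-1 "codec": single-byte code units, maxLengthOfEncodedScalar == 1.
  // Decoding can't fail because every byte value 0x00...0xFF is also a valid
  // Unicode scalar value, so an "unsafe" variant has nothing to validate.
  struct Latin1Sketch {
    typealias CodeUnit = UInt8
    static let maxLengthOfEncodedScalar = 1

    static func decode(_ unit: CodeUnit) -> UnicodeScalar {
      return UnicodeScalar(unit)                     // always valid
    }

    static func encode(_ scalar: UnicodeScalar) -> CodeUnit? {
      return scalar.value <= 0xFF ? CodeUnit(scalar.value) : nil
    }
  }

  let bytes: [UInt8] = [0x63, 0x61, 0x66, 0xE9]      // "café" in Latin-1
  let scalars = bytes.map(Latin1Sketch.decode)
  print(String(String.UnicodeScalarView(scalars)))   // "café"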


>> > A lot of machine processing of strings continues to deal with 8-bit
>> > quantities (even 7-bit quantities, not UTF-8).  Swift strings are
>> > not very good at that. I see progress in the manifesto but nothing
>> > to really close the performance gap with C.  That's where "unsafe"
>> > mechanisms could come into play.
>> 
>> extendedASCII is supposed to address that.  Given a smart enough
>> optimizer, it should be possible to become competitive with C even
>> without using unsafe constructs.  However, we recognize the importance
>> of being able to squeeze out that last bit of performance by dropping
>> down to unsafe storage.
>
> I doubt a 32-bit encoding can bridge the performance gap with C in
> particular because wire protocols will continue to favor compact
> encodings.  Incoming strings will have to be expanded to the
> extendedASCII representation before processing and probably compacted
> afterwards. So while this may address the needs of computationally
> intensive string processing tasks, this does not help simple parsing
> tasks on simple strings.

I'm pretty sure it does; we're not going to change representations.

extendedASCII doesn't require anything to actually be expanded to
32 bits per code unit, except *maybe* in a register, and then only if
the optimizer isn't smart enough to eliminate zero-extension followed by
comparison with a known narrow value.  You can always

  latin1.lazy.map { UInt32($0) }

to produce 32-bit code units.  All the common encodings are ASCII
supersets, so this will “just work” for those.  The only place where it
becomes more complicated is with encodings like Shift-JIS (which might not
even be important enough to support as a String backing-storage format).
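
For example (assuming plain [UInt8] storage just for illustration; the
real backing store is whatever the String happens to use):

  let request: [UInt8] = Array("GET /index.html HTTP/1.1".utf8)  // ASCII

  // Each byte is widened to UInt32 only when it's inspected; nothing is
  // copied or re-encoded up front.
  let wide = request.lazy.map { UInt32($0) }

  // Comparing against a known narrow value still works, e.g. finding the
  // first space (0x20) to split off the method.
  if let space = wide.firstIndex(of: 0x20) {
    let method = request[..<space]                   // still the original bytes
    print(String(decoding: method, as: UTF8.self))   // "GET"
  }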

>
>> > To guarantee Unicode correctness, a C string must be validated or 
>> > transformed to be considered a Swift string.
>> 
>> Not really.  You can do error-correction on the fly.  However, I think
>> pre-validation is often worthwhile because once you know something is
>> valid it's much cheaper to decode correctly (especially for UTF-8).
>
> Sure. Eager vs. lazy validation is a valuable distinction, but what I am 
> after here is side-stepping validation altogether. I understand now that 
> user-defined encodings will make side-stepping validation possible.

Right.
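
For concreteness, here's eager validation vs. error-correction on the fly
with APIs that exist today (side-stepping validation entirely isn't
expressible yet, so it isn't shown):

  let valid: [CChar] = Array("ok".utf8CString)       // NUL-terminated, valid UTF-8
  let broken: [UInt8] = [0x68, 0x69, 0xFF]           // 0xFF never appears in UTF-8

  // Eager pre-validation: fails up front if the bytes aren't valid UTF-8.
  let checked = valid.withUnsafeBufferPointer {
    String(validatingUTF8: $0.baseAddress!)
  }

  // Error-correction on the fly: ill-formed units are replaced with U+FFFD
  // instead of causing a failure.
  let repaired = String(decoding: broken, as: UTF8.self)

  print(checked ?? "invalid", repaired)              // ok hi\u{FFFD}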

>
>> > If I understand the C String interop section correctly, in Swift 4,
>> > this should not force a copy, but traversing the string is still
>> > required. 
>> 
>> *What* should not force a copy?
>
> I would like to have a constructor that takes a pointer to a 
> null-terminated sequence of bytes (or a sequence of bytes and a length) 
> and turns it into a Swift string without allocation of a new backing store 
> for the string and without copying the bytes in the sequence from one 
> place in memory to another. 

We probably won't expose this at the top level of String, but you should
be able to construct an UnsafeCString (which is-a Unicode) and then, if
you really need the String type, construct a String from that:

   String(UnsafeCString(ntbs))

That would not do any copying.
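
As a sketch of the shape of that (CStringView below is a made-up stand-in
for the not-yet-existing UnsafeCString, built on today's APIs):

  import Foundation  // strlen

  struct CStringView {
    let base: UnsafePointer<CChar>
    let count: Int                                   // bytes before the NUL

    init(_ ntbs: UnsafePointer<CChar>) {
      base = ntbs
      count = Int(strlen(ntbs))
    }

    // No-copy access to the bytes; the caller owns the buffer and manages
    // its lifetime.
    var bytes: UnsafeRawBufferPointer {
      return UnsafeRawBufferPointer(start: UnsafeRawPointer(base), count: count)
    }

    // Escape hatch for when a real String is required; this is where a copy
    // finally happens.
    var materialized: String {
      return String(decoding: bytes, as: UTF8.self)
    }
  }

  "host: example.com".withCString { p in
    let view = CStringView(p)
    print(view.count, view.materialized)             // 17 host: example.com
  }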

> I understand this may require the programmer to handle memory
> management for the backing store.
>
>> > I hope I am correct about the no-copy thing, and I would also like to
>> > permit promoting C strings to Swift strings without validation.  This
>> > is obviously unsafe in general, but I know my strings... and I care
>> > about performance. ;)
>> 
>> We intend to support that use-case.  That's part of the reason for the
>> ValidUTF8 and ValidUTF16 encodings you see here:
>> https://github.com/apple/swift/blob/unicode-rethink/stdlib/public/core/Unicode2.swift#L598
>> and here:
>> https://github.com/apple/swift/blob/unicode-rethink/stdlib/public/core/Unicode2.swift#L862
>
> OK
>
>> > More importantly, it is not possible to mutate bytes in a Swift string
>> > at will.  Again it makes sense from the point of view of always
>> > correct Unicode sequences.  But it does not for machine processing of
>> > C strings with C-like performance.  Today, I can cheat using a
>> > "_public" API for this, i.e., myString._core._baseAddress!.  This
>> > should be doable from an official "unsafe" API.
>> 
>> We intend to support that use-case.
>> 
>> > Memory safety is also at play here, as well as ownership.  A proper
>> > API could guarantee the backing store is writable for instance, that
>> > it is not shared.  A memory-safe but not unicode-safe API could do
>> > bounds checks.
>> >
>> > While low-level C string processing can be done using unsafe memory
>> > buffers with performance, the lack of bridging with "real" Swift
>> > strings kills the deal.  No literals syntax (or costly coercions),
>> > none of the many useful string APIs.
>> >
>> > To illustrate these points here is a simple experiment: code written
>> > to synthesize an http date string from a bunch of integers.  There are
>> > four versions of the code going from nice high-level Swift code to
>> > low-level C-like code.  (Some of this code is also about avoiding ARC
>> > overheads, and string interpolation overheads, hence the four
>> > versions.)
>> >
>> > On my macbook pro (swiftc -O), the performance is as follows:
>> >
>> > interpolation + func:  2.303032365s
>> > interpolation + array: 1.224858418s
>> > append:                0.918512377s
>> > memcpy:                0.182104674s
>> >
>> > While the benchmarking could be done more carefully, I think the main
>> > observation is valid.  The nice code is more than 10x slower than the
>> > C-like code.  Moreover, the ugly-but-still-valid-Swift code is still
>> > about 5x slower than the C-like code.  For some applications, e.g. web
>> > servers, numbers like these matter...
>> >
>> > Some of the proposed improvements would help with this, e.g., small
>> > strings optimization, and maybe changes to the concatenation
>> > semantics.  But it seems to me that a big performance gap will remain.
>> > (Concatenation even with strncat is significantly slower than memcpy
>> > for fixed-size strings.)
>> >
>> > I believe there is a need and an opportunity for a fast "less safe"
>> > String API.  I hope it will be on the roadmap soon.
>> 
>> I think it's already in the roadmap...the one that's in my head.  If you
>> want to submit a PR with amendments to the manifesto, that'd be great.
>> Also thanks very much for the example below; we'll definitely
>> be referring to it as we proceed forward.
>
> Here is a gist for the example code:
> https://gist.github.com/tardieu/b6a9c4d53d56d089c58089ba8f6274b5
>
> I can sketch key elements of an unsafe String API and some motivating 
> arguments in a pull request. Is this what you are asking for?

That would be awesome, thanks!
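
For reference, the two ends of that spectrum look roughly like this (not
the code from the gist; names and format simplified):

  // Convenient one-liner using interpolation.
  func viaInterpolation(hour: Int, minute: Int, second: Int) -> String {
    return "\(hour):\(minute):\(second) GMT"
  }

  // Lower-level alternative: build UTF-8 bytes directly in preallocated
  // storage and hand the raw bytes to the wire protocol.
  func viaAppend(hour: Int, minute: Int, second: Int) -> [UInt8] {
    var out = [UInt8]()
    out.reserveCapacity(16)                          // one allocation up front
    out.append(contentsOf: String(hour).utf8)
    out.append(UInt8(ascii: ":"))
    out.append(contentsOf: String(minute).utf8)
    out.append(UInt8(ascii: ":"))
    out.append(contentsOf: String(second).utf8)
    out.append(contentsOf: " GMT".utf8)
    return out
  }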

-- 
-Dave

