[swift-evolution] Strings in Swift 4

Dave Abrahams dabrahams at apple.com
Tue Jan 31 13:23:49 CST 2017

on Mon Jan 30 2017, Olivier Tardieu <tardieu-AT-us.ibm.com> wrote:

> Thanks for the clarifications.
> More comments below.
> dabrahams at apple.com wrote on 01/24/2017 05:50:59 PM:
>> Maybe it wasn't clear from the document, but the intention is that
>> String would be able to use any model of Unicode as a backing store, and
>> that you could easily build unsafe models of Unicode... but also that
>> you could use your unsafe model of Unicode directly, in string-ish ways.
> I see. If I understand correctly, it will be possible for instance to 
> implement an unsafe model of Unicode with a UInt8 code unit and a 
> maxLengthOfEncodedScalar equal to 1 by only keeping the 8 lowest bits of 
> Unicode scalars.

Eh... I think you'd just use an unsafe Latin-1 for that; why build a
custom truncating encoding when Latin-1 code units already are the
lowest 8 bits of the corresponding Unicode scalars?

Here's an example (work very much in-progress):
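The prototype itself isn't reproduced here, but the shape of such an
unsafe, truncating 8-bit model might look roughly like this (a sketch
only; `UnsafeLatin1` and its members are hypothetical names, not the
actual prototype API):

```swift
// Hypothetical sketch of an unsafe 8-bit model of Unicode: UInt8 code
// units, one code unit per scalar, and decoding that zero-extends
// without any validation.  Not the real prototype API.
struct UnsafeLatin1 {
    var codeUnits: [UInt8]

    // Every code unit encodes exactly one scalar, so the maximum
    // encoded length is 1.
    static var maxLengthOfEncodedScalar: Int { return 1 }

    // Latin-1 code points coincide with the first 256 Unicode scalars,
    // so decoding is a plain zero-extension -- nothing to validate.
    var unicodeScalars: [Unicode.Scalar] {
        return codeUnits.map { Unicode.Scalar($0) }
    }
}

let s = UnsafeLatin1(codeUnits: [0x48, 0x69, 0xE9])   // "Hié" in Latin-1
print(String(s.unicodeScalars.map(Character.init)))   // Hié
```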

>> > A lot of machine processing of strings continues to deal with 8-bit
>> > quantities (even 7-bit quantities, not UTF-8).  Swift strings are
>> > not very good at that. I see progress in the manifesto but nothing
>> > to really close the performance gap with C.  That's where "unsafe"
>> > mechanisms could come into play.
>> extendedASCII is supposed to address that.  Given a smart enough
>> optimizer, it should be possible to become competitive with C even
>> without using unsafe constructs.  However, we recognize the importance
>> of being able to squeeze out that last bit of performance by dropping
>> down to unsafe storage.
> I doubt a 32-bit encoding can bridge the performance gap with C in
> particular because wire protocols will continue to favor compact
> encodings.  Incoming strings will have to be expanded to the
> extendedASCII representation before processing and probably compacted
> afterwards. So while this may address the needs of computationally
> intensive string processing tasks, this does not help simple parsing
> tasks on simple strings.

I'm pretty sure it does; we're not going to change representations
just to process incoming strings.

extendedASCII doesn't require anything to actually be expanded to
32-bits per code unit, except *maybe* in a register, and then only if
the optimizer isn't smart enough to eliminate zero-extension followed by
comparison with a known narrow value.  You can always

  latin1.lazy.map { UInt32($0) }

to produce 32-bit code units.  All the common encodings are ASCII
supersets, so this will “just work” for those.  The only place where it
becomes more complicated is in encodings like Shift-JIS (which might not
even be important enough to support as a String backing-storage format).
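To make the widening concrete: the `lazy.map` above is only a view, so
nothing is re-encoded or copied until a consumer actually reads a code
unit.  A small illustration (the variable names are mine):

```swift
// Latin-1 bytes for "Héllo" (0xE9 is é in Latin-1).
let latin1: [UInt8] = [0x48, 0xE9, 0x6C, 0x6C, 0x6F]

// A lazy view of 32-bit code units: no 32-bit buffer is allocated;
// each byte is zero-extended only at the point of use.
let wide = latin1.lazy.map { UInt32($0) }

// Comparisons against known narrow values can compile down to a plain
// byte compare once the zero-extension is optimized away.
print(wide.first == 0x48)      // true
print(Array(wide.prefix(2)))   // [72, 233]
```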

>> > To guarantee Unicode correctness, a C string must be validated or 
>> > transformed to be considered a Swift string.
>> Not really.  You can do error-correction on the fly.  However, I think
>> pre-validation is often worthwhile because once you know something is
>> valid it's much cheaper to decode correctly (especially for UTF-8).
> Sure. Eager vs. lazy validation is a valuable distinction, but what I am 
> after here is side-stepping validation altogether. I understand now that 
> user-defined encodings will make side-stepping validation possible.


>> > If I understand the C String interop section correctly, in Swift 4,
>> > this should not force a copy, but traversing the string is still
>> > required. 
>> *What* should not force a copy?
> I would like to have a constructor that takes a pointer to a 
> null-terminated sequence of bytes (or a sequence of bytes and a length) 
> and turns it into a Swift string without allocation of a new backing store 
> for the string and without copying the bytes in the sequence from one 
> place in memory to another. 

We probably won't expose this at the top level of String, but you should
be able to construct an UnsafeCString (which is-a Unicode) and then, if
you really need the String type, construct a String from that.  That
construction would not do any copying.
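UnsafeCString doesn't exist yet, so the following is only a sketch of
the shape such a type could take with today's standard library: a
non-owning, non-validating view over a null-terminated buffer (all
names hypothetical):

```swift
// Hypothetical sketch: a no-copy, no-validation view over a
// null-terminated byte buffer.  The caller owns the memory and must
// keep it alive for the lifetime of the view.
struct UnsafeCString {
    let base: UnsafePointer<UInt8>

    // Find the terminating NUL: an O(n) walk, but no allocation.
    var count: Int {
        var n = 0
        while base[n] != 0 { n += 1 }
        return n
    }

    // A buffer view sharing the same storage -- still zero-copy.
    var codeUnits: UnsafeBufferPointer<UInt8> {
        return UnsafeBufferPointer(start: base, count: count)
    }
}

// Usage: borrow some UTF-8 bytes as a stand-in for a C buffer.
let bytes: [UInt8] = Array("hello".utf8) + [0]
bytes.withUnsafeBufferPointer { buf in
    let c = UnsafeCString(base: buf.baseAddress!)
    print(c.count)                                      // 5
    print(String(decoding: c.codeUnits, as: UTF8.self)) // hello
}
```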

> I understand this may require the programmer to handle memory
> management for the backing store.
>> > I hope I am correct about the no-copy thing, and I would also like to
>> > permit promoting C strings to Swift strings without validation.  This
>> > is obviously unsafe in general, but I know my strings... and I care
>> > about performance. ;)
>> We intend to support that use-case.  That's part of the reason for the
>> ValidUTF8 and ValidUTF16 encodings you see here:
>> https://github.com/apple/swift/blob/unicode-rethink/stdlib/public/core/Unicode2.swift#L598
>> and here:
>> https://github.com/apple/swift/blob/unicode-rethink/stdlib/public/core/Unicode2.swift#L862
> OK
>> > More importantly, it is not possible to mutate bytes in a Swift string
>> > at will.  Again it makes sense from the point of view of always
>> > correct Unicode sequences.  But it does not for machine processing of
>> > C strings with C-like performance.  Today, I can cheat using a
>> > "_public" API for this, i.e., myString._core._baseAddress!.  This
>> > should be doable from an official "unsafe" API.
>> We intend to support that use-case.
>> > Memory safety is also at play here, as well as ownership.  A proper
>> > API could guarantee the backing store is writable for instance, that
>> > it is not shared.  A memory-safe but not unicode-safe API could do
>> > bounds checks.
>> >
>> > While low-level C string processing can be done using unsafe memory
>> > buffers with performance, the lack of bridging with "real" Swift
>> > strings kills the deal.  No literals syntax (or costly coercions),
>> > none of the many useful string APIs.
>> >
>> > To illustrate these points here is a simple experiment: code written
>> > to synthesize an http date string from a bunch of integers.  There are
>> > four versions of the code going from nice high-level Swift code to
>> > low-level C-like code.  (Some of this code is also about avoiding ARC
>> > overheads, and string interpolation overheads, hence the four
>> > versions.)
>> >
>> > On my macbook pro (swiftc -O), the performance is as follows:
>> >
>> > interpolation + func:  2.303032365s
>> > interpolation + array: 1.224858418s
>> > append:                0.918512377s
>> > memcpy:                0.182104674s
>> >
>> > While the benchmarking could be done more carefully, I think the main
>> > observation is valid.  The nice code is more than 10x slower than the
>> > C-like code.  Moreover, the ugly-but-still-valid-Swift code is still
>> > about 5x slower than the C-like code.  For some applications, e.g. web
>> > servers, these kinds of numbers matter...
>> >
>> > Some of the proposed improvements would help with this, e.g., small
>> > strings optimization, and maybe changes to the concatenation
>> > semantics.  But it seems to me that a big performance gap will remain.
>> > (Concatenation even with strncat is significantly slower than memcpy
>> > for fixed-size strings.)
>> >
>> > I believe there is a need and an opportunity for a fast "less safe"
>> > String API.  I hope it will be on the roadmap soon.
>> I think it's already in the roadmap...the one that's in my head.  If you
>> want to submit a PR with amendments to the manifesto, that'd be great.
>> Also thanks very much for the example below; we'll definitely
>> be referring to it as we proceed forward.
> Here is a gist for the example code:
> https://gist.github.com/tardieu/b6a9c4d53d56d089c58089ba8f6274b5
> I can sketch key elements of an unsafe String API and some motivating 
> arguments in a pull request. Is this what you are asking for?

That would be awesome, thanks!
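For readers who skip the gist, the contrast it measures can be sketched
like this (a simplified sketch, not Olivier's actual code):
interpolation builds and copies intermediate strings, while appending
raw UTF-8 bytes into one reserved buffer approaches the behavior of the
memcpy variant.

```swift
// High-level style: clear, but each interpolation segment can
// allocate and copy an intermediate String.
func pad2(_ n: Int) -> String { return n < 10 ? "0\(n)" : "\(n)" }
func dateInterpolated(day: Int, month: String, year: Int) -> String {
    return "\(pad2(day)) \(month) \(year) GMT"
}

// Lower-level style: write ASCII digits into one preallocated byte
// buffer, then materialize a String exactly once at the end.
func dateAppended(day: Int, month: String, year: Int) -> String {
    var bytes = [UInt8](); bytes.reserveCapacity(24)
    bytes.append(UInt8(ascii: "0") + UInt8(day / 10))  // assumes 0 <= day < 100
    bytes.append(UInt8(ascii: "0") + UInt8(day % 10))
    bytes.append(UInt8(ascii: " "))
    bytes.append(contentsOf: month.utf8)
    bytes.append(UInt8(ascii: " "))
    bytes.append(contentsOf: String(year).utf8)
    bytes.append(contentsOf: " GMT".utf8)
    return String(decoding: bytes, as: UTF8.self)
}

print(dateInterpolated(day: 5, month: "Jan", year: 2017)) // 05 Jan 2017 GMT
print(dateAppended(day: 5, month: "Jan", year: 2017))     // 05 Jan 2017 GMT
```

The appending version still pays for the final String construction, which
is part of why the memcpy variant in the benchmark remains faster still.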

