[swift-evolution] Strings in Swift 4

Mon Jan 30 09:08:58 CST 2017

Thanks for the clarifications.
More comments below.

dabrahams at apple.com wrote on 01/24/2017 05:50:59 PM:

> Maybe it wasn't clear from the document, but the intention is that
> String would be able to use any model of Unicode as a backing store, and
> that you could easily build unsafe models of Unicode... but also that
> you could use your unsafe model of Unicode directly, in string-ish ways.

I see. If I understand correctly, it will be possible for instance to 
implement an unsafe model of Unicode with a UInt8 code unit and a 
maxLengthOfEncodedScalar equal to 1 by only keeping the 8 lowest bits of 
Unicode scalars.
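
To make sure we are talking about the same thing, here is a minimal sketch 
of what I have in mind. It is written against today's standard library, not 
the prototype, and the names Truncating8, encode and decode are made up for 
illustration. Only CodeUnit and maxLengthOfEncodedScalar echo the manifesto, 
and I am assuming their roles rather than quoting a protocol.

```swift
// Hypothetical sketch: a lossy 8-bit "encoding" that keeps only the low
// 8 bits of each Unicode scalar. Not an API from the manifesto or stdlib.
enum Truncating8 {
    typealias CodeUnit = UInt8
    static let maxLengthOfEncodedScalar = 1

    // Drop everything above the low 8 bits; this is where safety is lost.
    static func encode(_ scalar: Unicode.Scalar) -> CodeUnit {
        return UInt8(truncatingIfNeeded: scalar.value)
    }

    // Every 8-bit value maps back to a scalar in U+0000...U+00FF.
    static func decode(_ unit: CodeUnit) -> Unicode.Scalar {
        return Unicode.Scalar(unit)
    }
}

// U+00E9 survives the round trip; U+20AC is truncated to 0xAC (U+00AC).
let units = "h\u{E9}llo\u{20AC}".unicodeScalars.map(Truncating8.encode)

var decodedScalars = String.UnicodeScalarView()
decodedScalars.append(contentsOf: units.map(Truncating8.decode))
let roundTrip = String(decodedScalars)
```

Scalars above U+00FF do not round-trip, which is exactly the sense in which 
such a model is unsafe.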

> > A lot of machine processing of strings continues to deal with 8-bit 
> > quantities (even 7-bit quantities, not UTF-8).
> > Swift strings are not very good at that. I see progress in the manifesto
> > but nothing to really close the performance gap with C.
> > That's where "unsafe" mechanisms could come into play.
> 
> extendedASCII is supposed to address that.  Given a smart enough
> optimizer, it should be possible to become competitive with C even
> without using unsafe constructs.  However, we recognize the importance
> of being able to squeeze out that last bit of performance by dropping
> down to unsafe storage.

I doubt a 32-bit encoding can bridge the performance gap with C, in 
particular because wire protocols will continue to favor compact 
encodings. Incoming strings will have to be expanded to the extendedASCII 
representation before processing and probably compacted afterwards. So 
while this may address the needs of computationally intensive string 
processing tasks, this does not help simple parsing tasks on simple 
strings.

> > To guarantee Unicode correctness, a C string must be validated or 
> > transformed to be considered a Swift string.
> 
> Not really.  You can do error-correction on the fly.  However, I think
> pre-validation is often worthwhile because once you know something is
> valid it's much cheaper to decode correctly (especially for UTF-8).

Sure. Eager vs. lazy validation is a valuable distinction, but what I am 
after here is side-stepping validation altogether. I understand now that 
user-defined encodings will make side-stepping validation possible.

> > If I understand the C String interop section correctly, in Swift 4,
> > this should not force a copy, but traversing the string is still
> > required. 
> 
> *What* should not force a copy?

I would like to have a constructor that takes a pointer to a 
null-terminated sequence of bytes (or a sequence of bytes and a length) 
and turns it into a Swift string without allocation of a new backing store 
for the string and without copying the bytes in the sequence from one 
place in memory to another. I understand this may require the programmer 
to handle memory management for the backing store.
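
As a rough illustration of the shape I am asking for, here is a sketch 
using only existing standard library pieces. UnownedCString and its 
initializer labels are hypothetical names of mine. It is a borrowed view, 
not a String, so it shows the no-allocation, no-copy, no-validation part 
but none of the String API I would like to get on top of it.

```swift
// Hypothetical sketch; UnownedCString is an illustrative name, not a stdlib type.
// It only borrows the bytes: the caller must keep the memory alive and unchanged.
struct UnownedCString {
    let bytes: UnsafeBufferPointer<UInt8>

    // Pointer + length: no traversal, no copy, no validation.
    init(start: UnsafePointer<UInt8>, count: Int) {
        bytes = UnsafeBufferPointer(start: start, count: count)
    }

    // Null-terminated: one pass to find the terminator, still no copy.
    init(nulTerminated start: UnsafePointer<UInt8>) {
        var count = 0
        while start[count] != 0 { count += 1 }
        self.init(start: start, count: count)
    }

    // Escape hatch to a real String: this copies and repairs invalid UTF-8.
    func makeString() -> String {
        return String(decoding: bytes, as: UTF8.self)
    }
}

let storage: [UInt8] = Array("GET / HTTP/1.1".utf8) + [0]
let (length, text) = storage.withUnsafeBufferPointer { buffer -> (Int, String) in
    let view = UnownedCString(nulTerminated: buffer.baseAddress!)
    return (view.bytes.count, view.makeString())
}
```

The null-terminated form still has to traverse the bytes once to find the 
length; the pointer-plus-length form avoids even that.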

> > I hope I am correct about the no-copy thing, and I would also like to
> > permit promoting C strings to Swift strings without validation.  This
> > is obviously unsafe in general, but I know my strings... and I care
> > about performance. ;)
> 
> We intend to support that use-case.  That's part of the reason for the
> ValidUTF8 and ValidUTF16 encodings you see here:
> https://github.com/apple/swift/blob/unicode-rethink/stdlib/public/core/Unicode2.swift#L598
> and here:
> https://github.com/apple/swift/blob/unicode-rethink/stdlib/public/core/Unicode2.swift#L862

OK

> > More importantly, it is not possible to mutate bytes in a Swift string
> > at will.  Again it makes sense from the point of view of always
> > correct Unicode sequences.  But it does not for machine processing of
> > C strings with C-like performance.  Today, I can cheat using a
> > "_public" API for this, i.e., myString._core._baseAddress!.  This
> > should be doable from an official "unsafe" API.
> 
> We intend to support that use-case.
> 
> > Memory safety is also at play here, as well as ownership.  A proper
> > API could guarantee the backing store is writable for instance, that
> > it is not shared.  A memory-safe but not unicode-safe API could do
> > bounds checks.
> >
> > While low-level C string processing can be done using unsafe memory
> > buffers with performance, the lack of bridging with "real" Swift
> > strings kills the deal.  No literals syntax (or costly coercions),
> > none of the many useful string APIs.
> >
> > To illustrate these points here is a simple experiment: code written
> > to synthesize an http date string from a bunch of integers.  There are
> > four versions of the code going from nice high-level Swift code to
> > low-level C-like code.  (Some of this code is also about avoiding ARC
> > overheads, and string interpolation overheads, hence the four
> > versions.)
> >
> > On my macbook pro (swiftc -O), the performance is as follows:
> >
> > interpolation + func:  2.303032365s
> > interpolation + array: 1.224858418s
> > append:                0.918512377s
> > memcpy:                0.182104674s
> >
> > While the benchmarking could be done more carefully, I think the main
> > observation is valid.  The nice code is more than 10x slower than the
> > C-like code.  Moreover, the ugly-but-still-valid-Swift code is still
> > about 5x slower than the C-like code.  For some applications, e.g. web
> > servers, numbers like these matter...
> >
> > Some of the proposed improvements would help with this, e.g., small
> > strings optimization, and maybe changes to the concatenation
> > semantics.  But it seems to me that a big performance gap will remain.
> > (Concatenation even with strncat is significantly slower than memcpy
> > for fixed-size strings.)
> >
> > I believe there is a need and an opportunity for a fast "less safe"
> > String API.  I hope it will be on the roadmap soon.
> 
> I think it's already in the roadmap...the one that's in my head.  If you
> want to submit a PR with amendments to the manifesto, that'd be great.
> Also thanks very much for the example below; we'll definitely
> be referring to it as we proceed forward.

Here is a gist for the example code:
https://gist.github.com/tardieu/b6a9c4d53d56d089c58089ba8f6274b5
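
For readers who do not want to open the gist, the two extremes can be 
sketched roughly as follows. This is a condensed illustration, not the gist 
code: the function names, the parameter layout (weekday and month as 
zero-based indices, a four-digit year) and the template trick are 
assumptions made for the sketch, and it carries no timing claims.

```swift
let dayNames = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]
let monthNames = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
                  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
let dayBytes = dayNames.map { Array($0.utf8) }
let monthBytes = monthNames.map { Array($0.utf8) }
// Fixed 29-byte layout; the placeholders are overwritten in place.
let template = Array("Xxx, 00 Xxx 0000 00:00:00 GMT".utf8)

func pad2(_ n: Int) -> String { return n < 10 ? "0\(n)" : "\(n)" }

// High-level version: string interpolation and a helper function.
func httpDateInterpolated(weekday: Int, day: Int, month: Int, year: Int,
                          hour: Int, minute: Int, second: Int) -> String {
    return "\(dayNames[weekday]), \(pad2(day)) \(monthNames[month]) \(year) "
        + "\(pad2(hour)):\(pad2(minute)):\(pad2(second)) GMT"
}

// Writes `value` as `width` zero-padded decimal digits starting at `offset`.
func writeDigits(_ value: Int, width: Int, into out: inout [UInt8], at offset: Int) {
    var v = value
    for i in stride(from: offset + width - 1, through: offset, by: -1) {
        out[i] = UInt8(48 + v % 10)  // 48 is ASCII "0"
        v /= 10
    }
}

// Low-level version: byte writes at fixed offsets, no String involved.
func httpDateBytes(weekday: Int, day: Int, month: Int, year: Int,
                   hour: Int, minute: Int, second: Int) -> [UInt8] {
    var out = template
    for i in 0..<3 {
        out[i] = dayBytes[weekday][i]
        out[8 + i] = monthBytes[month][i]
    }
    writeDigits(day, width: 2, into: &out, at: 5)
    writeDigits(year, width: 4, into: &out, at: 12)
    writeDigits(hour, width: 2, into: &out, at: 17)
    writeDigits(minute, width: 2, into: &out, at: 20)
    writeDigits(second, width: 2, into: &out, at: 23)
    return out
}
```

The second version produces bytes, not a String, which is the bridging 
problem I described: turning them into a String today costs a copy and a 
validation pass.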

I can sketch key elements of an unsafe String API and some motivating 
arguments in a pull request. Is this what you are asking for?

Best,

Olivier
