[swift-evolution] Strings in Swift 4

Tue Feb 14 19:14:43 CST 2017

As suggested, I created a pull request for the String manifesto adding an 
unsafe String API discussion.
https://github.com/apple/swift/pull/7479

I included in the comments a tentative implementation in Swift 3.
https://gist.github.com/tardieu/7ca43d19b6197033dc39b138ba0e500e

I focused for now on the most essential capabilities that, hopefully, are 
not too controversial.

Regards,

Olivier

dabrahams at apple.com wrote on 01/31/2017 02:23:49 PM:

> From: Dave Abrahams <dabrahams at apple.com>
> To: Olivier Tardieu/Watson/IBM at IBMUS
> Cc: Ben Cohen <ben_cohen at apple.com>, swift-evolution <swift-evolution at swift.org>
> Date: 01/31/2017 02:24 PM
> Subject: Re: [swift-evolution] Strings in Swift 4
> Sent by: dabrahams at apple.com
> 
> 
> on Mon Jan 30 2017, Olivier Tardieu <tardieu-AT-us.ibm.com> wrote:
> 
> > Thanks for the clarifications.
> > More comments below.
> >
> > dabrahams at apple.com wrote on 01/24/2017 05:50:59 PM:
> >
> >> Maybe it wasn't clear from the document, but the intention is that
> >> String would be able to use any model of Unicode as a backing store, and
> >> that you could easily build unsafe models of Unicode... but also that
> >> you could use your unsafe model of Unicode directly, in string-ish ways.
> >
> > I see. If I understand correctly, it will be possible for instance to 
> > implement an unsafe model of Unicode with a UInt8 code unit and a 
> > maxLengthOfEncodedScalar equal to 1 by only keeping the 8 lowest bits of
> > Unicode scalars.
> 
> Eh... I think you'd just use an unsafe Latin-1 for that; why waste a
> bit?
> 
> Here's an example (work very much in-progress):
> https://github.com/apple/swift/blob/9defe9ded43c6f480f82a28d866ec73d803688db/test/Prototypes/Unicode.swift#L877
> 
> 
> >> > A lot of machine processing of strings continues to deal with 8-bit
> >> > quantities (even 7-bit quantities, not UTF-8).  Swift strings are
> >> > not very good at that. I see progress in the manifesto but nothing
> >> > to really close the performance gap with C.  That's where "unsafe"
> >> > mechanisms could come into play.
> >> 
> >> extendedASCII is supposed to address that.  Given a smart enough
> >> optimizer, it should be possible to become competitive with C even
> >> without using unsafe constructs.  However, we recognize the importance
> >> of being able to squeeze out that last bit of performance by dropping
> >> down to unsafe storage.
> >
> > I doubt a 32-bit encoding can bridge the performance gap with C in
> > particular because wire protocols will continue to favor compact
> > encodings.  Incoming strings will have to be expanded to the
> > extendedASCII representation before processing and probably compacted
> > afterwards. So while this may address the needs of computationally
> > intensive string processing tasks, this does not help simple parsing
> > tasks on simple strings.
> 
> I'm pretty sure it does; we're not going to change representations
> 
> extendedASCII doesn't require anything to actually be expanded to
> 32-bits per code unit, except *maybe* in a register, and then only if
> the optimizer isn't smart enough to eliminate zero-extension followed by
> comparison with a known narrow value.  You can always
> 
>   latin1.lazy.map { UInt32($0) }
> 
> to produce 32-bit code units.  All the common encodings are ASCII
> supersets, so this will “just work” for those.  The only places where it
> becomes more complicated is in encodings like Shift-JIS (which might not
> even be important enough to support as a String backing-storage format).
> 
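As an illustration of the extendedASCII point above (a minimal sketch in
current Swift, not code from this thread): `latin1` here is an assumed
8-bit, ASCII-superset buffer, and the header text is made up. The storage
stays at one byte per code unit; widening to 32 bits happens lazily, one
element at a time, at the point of comparison.

```swift
// Assumed 8-bit storage: one byte per code unit (ASCII-only sample data).
let latin1: [UInt8] = Array("Content-Length: 42".utf8)

// Lazily widened view; no array of 32-bit code units is ever allocated.
let extended = latin1.lazy.map { UInt32($0) }

// Compare against a known narrow value (':') through the widened view.
let colon = UInt32(UInt8(ascii: ":"))
let nameLength = extended.firstIndex(of: colon) ?? extended.endIndex

// The prefix is pure ASCII, so decoding it as UTF-8 is safe here.
let name = String(decoding: latin1[..<nameLength], as: UTF8.self)
print(name, latin1.count)
```

Whether the optimizer removes the zero-extension entirely is not something
this sketch demonstrates; it only shows that no expanded copy is required.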
> >
> >> > To guarantee Unicode correctness, a C string must be validated or 
> >> > transformed to be considered a Swift string.
> >> 
> >> Not really.  You can do error-correction on the fly.  However, I think
> >> pre-validation is often worthwhile because once you know something is
> >> valid it's much cheaper to decode correctly (especially for UTF-8).
> >
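The two strategies mentioned above can be sketched with standard-library
calls available in current Swift (they postdate this thread, so this is an
illustration rather than the design under discussion). The input `bytes`
and the helper name `isValidUTF8` are made up for the example.

```swift
// Made-up input: "a", a stray continuation byte (0x80), then "b".
let bytes: [UInt8] = [0x61, 0x80, 0x62]

// Error correction on the fly: ill-formed sequences decode to U+FFFD.
let repaired = String(decoding: bytes, as: UTF8.self)

// Pre-validation: scan once, stop at the first error, discard the output.
func isValidUTF8(_ input: [UInt8]) -> Bool {
    let hadError = transcode(input.makeIterator(), from: UTF8.self,
                             to: UTF32.self, stoppingOnError: true,
                             into: { _ in })
    return !hadError
}

print(repaired, isValidUTF8(bytes))
```

Once a buffer has passed a check like `isValidUTF8`, later decoding passes
can in principle skip their error paths, which is the saving described
above.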
> > Sure. Eager vs. lazy validation is a valuable distinction, but what I am
> > after here is side-stepping validation altogether. I understand now that
> > user-defined encodings will make side-stepping validation possible.
> 
> Right.
> 
> >
> >> > If I understand the C String interop section correctly, in Swift 4,
> >> > this should not force a copy, but traversing the string is still
> >> > required. 
> >> 
> >> *What* should not force a copy?
> >
> > I would like to have a constructor that takes a pointer to a
> > null-terminated sequence of bytes (or a sequence of bytes and a length)
> > and turns it into a Swift string without allocation of a new backing store
> > for the string and without copying the bytes in the sequence from one
> > place in memory to another.
> 
> We probably won't expose this at the top level of String, but you should
> be able to construct an UnsafeCString (which is-a Unicode) and then, if
> you really need the String type, construct a String from that:
> 
>    String(UnsafeCString(ntbs))
> 
> That would not do any copying.
> 
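A rough sketch of what a non-copying wrapper of this kind could look like.
`CStringView` is a hypothetical name chosen for this example; it is not
the UnsafeCString type mentioned above, whose design is not shown in this
thread. The view neither copies nor validates the bytes, and the caller
remains responsible for keeping the memory alive.

```swift
// Hypothetical non-owning view over a null-terminated byte string.
struct CStringView: RandomAccessCollection {
    let bytes: UnsafeBufferPointer<UInt8>

    // Scans for the terminator once; no allocation, no copy, no validation.
    init(_ ntbs: UnsafePointer<UInt8>) {
        var n = 0
        while ntbs[n] != 0 { n += 1 }
        bytes = UnsafeBufferPointer(start: ntbs, count: n)
    }

    var startIndex: Int { return 0 }
    var endIndex: Int { return bytes.count }
    subscript(i: Int) -> UInt8 { return bytes[i] }
}

// Usage: the view is only valid while the underlying buffer is alive,
// here for the duration of the closure.
let raw: [UInt8] = Array("GET /index.html".utf8) + [0]
let result: [Int] = raw.withUnsafeBufferPointer { buf in
    let view = CStringView(buf.baseAddress!)
    let space = view.firstIndex(of: UInt8(ascii: " ")) ?? -1
    return [view.count, space]
}
print(result)
```

Because the type is a collection of bytes, generic algorithms such as
`firstIndex(of:)` work on it directly, in the "string-ish" spirit described
earlier, without going through String at all.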
> > I understand this may require the programmer to handle memory
> > management for the backing store.
> >
> >> > I hope I am correct about the no-copy thing, and I would also like to
> >> > permit promoting C strings to Swift strings without validation. This
> >> > is obviously unsafe in general, but I know my strings... and I care
> >> > about performance. ;)
> >> 
> >> We intend to support that use-case.  That's part of the reason for the
> >> ValidUTF8 and ValidUTF16 encodings you see here:
> >> https://github.com/apple/swift/blob/unicode-rethink/stdlib/public/core/Unicode2.swift#L598
> >> and here:
> >> https://github.com/apple/swift/blob/unicode-rethink/stdlib/public/core/Unicode2.swift#L862
> >
> > OK
> >
> >> > More importantly, it is not possible to mutate bytes in a Swift string
> >> > at will.  Again it makes sense from the point of view of always
> >> > correct Unicode sequences.  But it does not for machine processing of
> >> > C strings with C-like performance.  Today, I can cheat using a
> >> > "_public" API for this, i.e., myString._core._baseAddress!.  This
> >> > should be doable from an official "unsafe" API.
> >> 
> >> We intend to support that use-case.
> >> 
> >> > Memory safety is also at play here, as well as ownership.  A proper
> >> > API could guarantee the backing store is writable for instance, that
> >> > it is not shared.  A memory-safe but not unicode-safe API could do
> >> > bounds checks.
> >> >
> >> > While low-level C string processing can be done using unsafe memory
> >> > buffers with performance, the lack of bridging with "real" Swift
> >> > strings kills the deal.  No literals syntax (or costly coercions),
> >> > none of the many useful string APIs.
> >> >
> >> > To illustrate these points here is a simple experiment: code written
> >> > to synthesize an HTTP date string from a bunch of integers.  There are
> >> > four versions of the code going from nice high-level Swift code to
> >> > low-level C-like code.  (Some of this code is also about avoiding ARC
> >> > overheads, and string interpolation overheads, hence the four
> >> > versions.)
> >> >
> >> > On my MacBook Pro (swiftc -O), the performance is as follows:
> >> >
> >> > interpolation + func:  2.303032365s
> >> > interpolation + array: 1.224858418s
> >> > append:                0.918512377s
> >> > memcpy:                0.182104674s
> >> >
> >> > While the benchmarking could be done more carefully, I think the main
> >> > observation is valid.  The nice code is more than 10x slower than the
> >> > C-like code.  Moreover, the ugly-but-still-valid-Swift code is still
> >> > about 5x slower than the C-like code.  For some applications, e.g. web
> >> > servers, these kinds of numbers matter...
> >> >
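To make the contrast concrete, here is a small sketch of the two ends of
that spectrum. It is not the code from the gist and carries no timing
claims; the function names are hypothetical, only the time-of-day part of
a date is produced, and values are assumed to lie in 0...99.

```swift
// Hypothetical helper: write a two-digit decimal number at buffer[offset...].
func putTwoDigits(_ value: Int, into buffer: inout [UInt8], at offset: Int) {
    let zero = UInt8(ascii: "0")
    buffer[offset] = zero + UInt8(value / 10)
    buffer[offset + 1] = zero + UInt8(value % 10)
}

// High-level version: string interpolation builds several temporary strings.
func timeInterpolated(h: Int, m: Int, s: Int) -> String {
    func pad(_ v: Int) -> String { return v < 10 ? "0\(v)" : "\(v)" }
    return "\(pad(h)):\(pad(m)):\(pad(s))"
}

// Low-level version: patch digits into a fixed-size "00:00:00" template.
func timeInPlace(h: Int, m: Int, s: Int) -> String {
    var buffer: [UInt8] = Array("00:00:00".utf8)
    putTwoDigits(h, into: &buffer, at: 0)
    putTwoDigits(m, into: &buffer, at: 3)
    putTwoDigits(s, into: &buffer, at: 6)
    return String(decoding: buffer, as: UTF8.self)
}

print(timeInterpolated(h: 7, m: 5, s: 42), timeInPlace(h: 7, m: 5, s: 42))
```

Even the low-level version ends with a conversion into String, which is
the bridging cost that an unsafe String API would aim to avoid.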
> >> > Some of the proposed improvements would help with this, e.g., small
> >> > strings optimization, and maybe changes to the concatenation
> >> > semantics.  But it seems to me that a big performance gap will remain.
> >> > (Concatenation even with strncat is significantly slower than memcpy
> >> > for fixed-size strings.)
> >> >
> >> > I believe there is a need and an opportunity for a fast "less safe"
> >> > String API.  I hope it will be on the roadmap soon.
> >> 
> >> I think it's already in the roadmap...the one that's in my head.  If you
> >> want to submit a PR with amendments to the manifesto, that'd be great.
> >> Also thanks very much for the example below; we'll definitely
> >> be referring to it as we proceed forward.
> >
> > Here is a gist for the example code:
> > https://gist.github.com/tardieu/b6a9c4d53d56d089c58089ba8f6274b5
> >
> > I can sketch key elements of an unsafe String API and some motivating 
> > arguments in a pull request. Is this what you are asking for?
> 
> That would be awesome, thanks!
> 
> -- 
> -Dave
> 
