[swift-evolution] Strings in Swift 4
Olivier Tardieu
tardieu at us.ibm.com
Tue Feb 14 19:14:43 CST 2017
As suggested, I created a pull request for the String manifesto adding an
unsafe String API discussion.
https://github.com/apple/swift/pull/7479
I included in the comments a tentative implementation in Swift 3.
https://gist.github.com/tardieu/7ca43d19b6197033dc39b138ba0e500e
I focused for now on the most essential capabilities that, hopefully, are
not too controversial.
Regards,
Olivier
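
For readers who have not followed the links, here is a minimal sketch of the kind of no-copy, no-validation view over a C string discussed in this thread. It is not the implementation from the gist or the pull request, and it is not the UnsafeCString mentioned in the quoted message below. The type name UnsafeByteString and its members are hypothetical, and the code assumes a recent Swift toolchain (4.2 or later) rather than Swift 3.

```swift
// Hypothetical type for illustration only; not the API proposed in the
// pull request and not part of the standard library.
struct UnsafeByteString {
    let bytes: UnsafeBufferPointer<UInt8>   // borrowed storage, never copied

    // Scans for the terminating NUL to find the length. Performs no
    // Unicode validation and no allocation.
    init(nullTerminated start: UnsafePointer<UInt8>) {
        var count = 0
        while start[count] != 0 { count += 1 }
        bytes = UnsafeBufferPointer(start: start, count: count)
    }

    // Bounds-checked byte access; says nothing about Unicode validity.
    subscript(i: Int) -> UInt8 {
        precondition(i >= 0 && i < bytes.count, "index out of range")
        return bytes[i]
    }
}

let storage: [UInt8] = Array("Tue, 14 Feb 2017".utf8) + [0]
let commaIndex: Int? = storage.withUnsafeBufferPointer { buffer -> Int? in
    // The view is only valid inside this closure; the caller owns the memory.
    let view = UnsafeByteString(nullTerminated: buffer.baseAddress!)
    return view.bytes.firstIndex(of: UInt8(ascii: ","))
}
print(commaIndex as Any)
```

The view borrows the caller's memory, so lifetime management stays with the programmer, which is the trade-off acknowledged in the thread.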
dabrahams at apple.com wrote on 01/31/2017 02:23:49 PM:
> From: Dave Abrahams <dabrahams at apple.com>
> To: Olivier Tardieu/Watson/IBM at IBMUS
> Cc: Ben Cohen <ben_cohen at apple.com>, swift-evolution
> <swift-evolution at swift.org>
> Date: 01/31/2017 02:24 PM
> Subject: Re: [swift-evolution] Strings in Swift 4
> Sent by: dabrahams at apple.com
>
>
> on Mon Jan 30 2017, Olivier Tardieu <tardieu-AT-us.ibm.com> wrote:
>
> > Thanks for the clarifications.
> > More comments below.
> >
> > dabrahams at apple.com wrote on 01/24/2017 05:50:59 PM:
> >
> >> Maybe it wasn't clear from the document, but the intention is that
> >> String would be able to use any model of Unicode as a backing store,
> >> and that you could easily build unsafe models of Unicode... but also
> >> that you could use your unsafe model of Unicode directly, in
> >> string-ish ways.
> >
> > I see. If I understand correctly, it will be possible for instance to
> > implement an unsafe model of Unicode with a UInt8 code unit and a
> > maxLengthOfEncodedScalar equal to 1 by only keeping the 8 lowest bits
> > of Unicode scalars.
>
> Eh... I think you'd just use an unsafe Latin-1 for that; why waste a
> bit?
>
> Here's an example (work very much in-progress):
> https://github.com/apple/swift/blob/9defe9ded43c6f480f82a28d866ec73d803688db/test/Prototypes/Unicode.swift#L877
>
>
> >> > A lot of machine processing of strings continues to deal with 8-bit
> >> > quantities (even 7-bit quantities, not UTF-8). Swift strings are
> >> > not very good at that. I see progress in the manifesto but nothing
> >> > to really close the performance gap with C. That's where "unsafe"
> >> > mechanisms could come into play.
> >>
> >> extendedASCII is supposed to address that. Given a smart enough
> >> optimizer, it should be possible to become competitive with C even
> >> without using unsafe constructs. However, we recognize the
> >> importance of being able to squeeze out that last bit of performance
> >> by dropping down to unsafe storage.
> >
> > I doubt a 32-bit encoding can bridge the performance gap with C in
> > particular because wire protocols will continue to favor compact
> > encodings. Incoming strings will have to be expanded to the
> > extendedASCII representation before processing and probably compacted
> > afterwards. So while this may address the needs of computationally
> > intensive string processing tasks, this does not help simple parsing
> > tasks on simple strings.
>
> I'm pretty sure it does; we're not going to change representations.
>
> extendedASCII doesn't require anything to actually be expanded to
> 32-bits per code unit, except *maybe* in a register, and then only if
> the optimizer isn't smart enough to eliminate zero-extension followed by
> comparison with a known narrow value. You can always
>
> latin1.lazy.map { UInt32($0) }
>
> to produce 32-bit code units. All the common encodings are ASCII
> supersets, so this will “just work” for those. The only place where it
> becomes more complicated is in encodings like Shift-JIS (which might not
> even be important enough to support as a String backing-storage format).
>
> >
> >> > To guarantee Unicode correctness, a C string must be validated or
> >> > transformed to be considered a Swift string.
> >>
> >> Not really. You can do error-correction on the fly. However, I
> >> think pre-validation is often worthwhile because once you know
> >> something is valid it's much cheaper to decode correctly (especially
> >> for UTF-8).
> >
> > Sure. Eager vs. lazy validation is a valuable distinction, but what I
> > am after here is side-stepping validation altogether. I understand now
> > that user-defined encodings will make side-stepping validation
> > possible.
>
> Right.
>
> >
> >> > If I understand the C String interop section correctly, in Swift 4,
> >> > this should not force a copy, but traversing the string is still
> >> > required.
> >>
> >> *What* should not force a copy?
> >
> > I would like to have a constructor that takes a pointer to a
> > null-terminated sequence of bytes (or a sequence of bytes and a
> > length) and turns it into a Swift string without allocation of a new
> > backing store for the string and without copying the bytes in the
> > sequence from one place in memory to another.
>
> We probably won't expose this at the top level of String, but you should
> be able to construct an UnsafeCString (which is-a Unicode) and then, if
> you really need the String type, construct a String from that:
>
> String(UnsafeCString(ntbs))
>
> That would not do any copying.
>
> > I understand this may require the programmer to handle memory
> > management for the backing store.
> >
> >> > I hope I am correct about the no-copy thing, and I would also like
> >> > to permit promoting C strings to Swift strings without validation.
> >> > This is obviously unsafe in general, but I know my strings... and I
> >> > care about performance. ;)
> >>
> >> We intend to support that use-case. That's part of the reason for
> >> the ValidUTF8 and ValidUTF16 encodings you see here:
> >> https://github.com/apple/swift/blob/unicode-rethink/stdlib/public/core/Unicode2.swift#L598
> >> and here:
> >> https://github.com/apple/swift/blob/unicode-rethink/stdlib/public/core/Unicode2.swift#L862
> >
> > OK
> >
> >> > More importantly, it is not possible to mutate bytes in a Swift
> >> > string at will. Again it makes sense from the point of view of
> >> > always correct Unicode sequences. But it does not for machine
> >> > processing of C strings with C-like performance. Today, I can cheat
> >> > using a "_public" API for this, i.e., myString._core._baseAddress!.
> >> > This should be doable from an official "unsafe" API.
> >>
> >> We intend to support that use-case.
> >>
> >> > Memory safety is also at play here, as well as ownership. A proper
> >> > API could guarantee the backing store is writable for instance,
> >> > that it is not shared. A memory-safe but not unicode-safe API could
> >> > do bounds checks.
> >> >
> >> > While low-level C string processing can be done using unsafe memory
> >> > buffers with performance, the lack of bridging with "real" Swift
> >> > strings kills the deal. No literals syntax (or costly coercions),
> >> > none of the many useful string APIs.
> >> >
> >> > To illustrate these points here is a simple experiment: code
> >> > written to synthesize an HTTP date string from a bunch of integers.
> >> > There are four versions of the code going from nice high-level
> >> > Swift code to low-level C-like code. (Some of this code is also
> >> > about avoiding ARC overheads, and string interpolation overheads,
> >> > hence the four versions.)
> >> >
> >> > On my MacBook Pro (swiftc -O), the performance is as follows:
> >> >
> >> > interpolation + func: 2.303032365s
> >> > interpolation + array: 1.224858418s
> >> > append: 0.918512377s
> >> > memcpy: 0.182104674s
> >> >
> >> > While the benchmarking could be done more carefully, I think the
> >> > main observation is valid. The nice code is more than 10x slower
> >> > than the C-like code. Moreover, the ugly-but-still-valid-Swift code
> >> > is still about 5x slower than the C-like code. For some
> >> > applications, e.g. web servers, these kinds of numbers matter...
> >> >
> >> > Some of the proposed improvements would help with this, e.g., small
> >> > strings optimization, and maybe changes to the concatenation
> >> > semantics. But it seems to me that a big performance gap will
> >> > remain. (Concatenation even with strncat is significantly slower
> >> > than memcpy for fixed-size strings.)
> >> >
> >> > I believe there is a need and an opportunity for a fast "less safe"
> >> > String API. I hope it will be on the roadmap soon.
> >>
> >> I think it's already in the roadmap...the one that's in my head. If
> >> you want to submit a PR with amendments to the manifesto, that'd be
> >> great. Also thanks very much for the example below; we'll definitely
> >> be referring to it as we proceed forward.
> >
> > Here is a gist for the example code:
> > https://gist.github.com/tardieu/b6a9c4d53d56d089c58089ba8f6274b5
> >
> > I can sketch key elements of an unsafe String API and some motivating
> > arguments in a pull request. Is this what you are asking for?
>
> That would be awesome, thanks!
>
> --
> -Dave
>
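
The remark quoted above, that code units can be widened lazily instead of being expanded in memory, can be illustrated with a small sketch. It only uses standard-library lazy sequences; the helper name hasASCIIPrefix is hypothetical and is not part of the prototype or of the manifesto.

```swift
// Hypothetical helper for illustration; not a standard-library API.
// Compares 8-bit code units against an ASCII pattern by widening each
// byte to UInt32 on demand, so no widened copy of the input is stored.
func hasASCIIPrefix(_ bytes: [UInt8], _ prefix: String) -> Bool {
    let widened = bytes.lazy.map { UInt32($0) }                 // widened lazily
    let expected = prefix.unicodeScalars.lazy.map { $0.value }  // scalar values
    return widened.starts(with: expected)
}

let request: [UInt8] = Array("GET /index.html HTTP/1.1".utf8)
print(hasASCIIPrefix(request, "GET "))   // true
print(hasASCIIPrefix(request, "POST "))  // false
```

Whether the optimizer removes the zero-extension entirely depends on the compiler; the sketch only shows that the source representation stays at one byte per code unit.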