<font size=2 face="sans-serif">As suggested, I created a pull request
for the String manifesto adding an unsafe String API discussion.</font><br><a href=https://github.com/apple/swift/pull/7479><font size=2 color=blue face="sans-serif">https://github.com/apple/swift/pull/7479</font></a><br><br><font size=2 face="sans-serif">I included in the comments a tentative
implementation in Swift 3.</font><br><a href=https://gist.github.com/tardieu/7ca43d19b6197033dc39b138ba0e500e><font size=2 color=blue face="sans-serif">https://gist.github.com/tardieu/7ca43d19b6197033dc39b138ba0e500e</font></a><br><br><font size=2 face="sans-serif">I focused for now on the most essential
capabilities that, hopefully, are not too controversial.</font><br><br><font size=2 face="sans-serif">Regards,</font><br><br><font size=2 face="sans-serif">Olivier</font><br><br><br><tt><font size=2>dabrahams@apple.com wrote on 01/31/2017 02:23:49 PM:<br><br>> From: Dave Abrahams &lt;dabrahams@apple.com&gt;</font></tt><br><tt><font size=2>> To: Olivier Tardieu/Watson/IBM@IBMUS</font></tt><br><tt><font size=2>> Cc: Ben Cohen &lt;ben_cohen@apple.com&gt;, swift-evolution
&lt;swift-evolution@swift.org&gt;</font></tt><br><tt><font size=2>> Date: 01/31/2017 02:24 PM</font></tt><br><tt><font size=2>> Subject: Re: [swift-evolution] Strings in Swift
4</font></tt><br><tt><font size=2>> Sent by: dabrahams@apple.com</font></tt><br><tt><font size=2>> <br>> <br>> on Mon Jan 30 2017, Olivier Tardieu &lt;tardieu-AT-us.ibm.com&gt;
wrote:<br>> <br>> > Thanks for the clarifications.<br>> > More comments below.<br>> ><br>> > dabrahams@apple.com wrote on 01/24/2017 05:50:59 PM:<br>> ><br>> >> Maybe it wasn't clear from the document, but the intention
is that<br>> >> String would be able to use any model of Unicode as a backing
store, and<br>> >> that you could easily build unsafe models of Unicode... but
also that<br>> >> you could use your unsafe model of Unicode directly, in string-ish
ways.<br>> ><br>> > I see. If I understand correctly, it will be possible for instance
to <br>> > implement an unsafe model of Unicode with a UInt8 code unit and
a <br>> > maxLengthOfEncodedScalar equal to 1 by only keeping the 8 lowest
bits of <br>> > Unicode scalars.<br>> <br>> Eh... I think you'd just use an unsafe Latin-1 for that; why waste
a<br>> bit?<br>> <br>> Here's an example (work very much in progress):<br>> </font></tt><a href="https://github.com/apple/swift/blob/9defe9ded43c6f480f82a28d866ec73d803688db/test/Prototypes/Unicode.swift#L877"><tt><font size=2>https://github.com/apple/swift/blob/9defe9ded43c6f480f82a28d866ec73d803688db/test/Prototypes/Unicode.swift#L877</font></tt></a><tt><font size=2><br>> <br>> <br>> >> > A lot of machine processing of strings continues to
deal with 8-bit<br>> >> > quantities (even 7-bit quantities, not UTF-8). Swift
strings are<br>> >> > not very good at that. I see progress in the manifesto
but nothing<br>> >> > to really close the performance gap with C. That's
where "unsafe"<br>> >> > mechanisms could come into play.<br>> >> <br>> >> extendedASCII is supposed to address that. Given a
smart enough<br>> >> optimizer, it should be possible to become competitive with
C even<br>> >> without using unsafe constructs. However, we recognize
the importance<br>> >> of being able to squeeze out that last bit of performance
by dropping<br>> >> down to unsafe storage.<br>> ><br>> > I doubt a 32-bit encoding can bridge the performance gap with
C in<br>> > particular because wire protocols will continue to favor compact<br>> > encodings. Incoming strings will have to be expanded to
the<br>> > extendedASCII representation before processing and probably compacted<br>> > afterwards. So while this may address the needs of computationally<br>> > intensive string processing tasks, this does not help simple
parsing<br>> > tasks on simple strings.<br>> <br>> I'm pretty sure it does; we're not going to change representations<br>> <br>> extendedASCII doesn't require anything to actually be expanded to<br>> 32-bits per code unit, except *maybe* in a register, and then only
if<br>> the optimizer isn't smart enough to eliminate zero-extension followed
by<br>> comparison with a known narrow value. You can always<br>> <br>> latin1.lazy.map { UInt32($0) }<br>> <br>> to produce 32-bit code units. All the common encodings are ASCII<br>> supersets, so this will “just work” for those. The only place
where it<br>> becomes more complicated is in encodings like Shift-JIS (which might
not<br>> even be important enough to support as a String backing-storage format).<br>> <br>> ><br>> >> > To guarantee Unicode correctness, a C string must be
validated or <br>> >> > transformed to be considered a Swift string.<br>> >> <br>> >> Not really. You can do error-correction on the fly.
However, I think<br>> >> pre-validation is often worthwhile because once you know
something is<br>> >> valid it's much cheaper to decode correctly (especially for
UTF-8).<br>> ><br>> > Sure. Eager vs. lazy validation is a valuable distinction, but
what I am <br>> > after here is side-stepping validation altogether. I understand
now that <br>> > user-defined encodings will make side-stepping validation possible.<br>> <br>> Right.<br>> <br>> ><br>> >> > If I understand the C String interop section correctly,
in Swift 4,<br>> >> > this should not force a copy, but traversing the string
is still<br>> >> > required. <br>> >> <br>> >> *What* should not force a copy?<br>> ><br>> > I would like to have a constructor that takes a pointer to a
<br>> > null-terminated sequence of bytes (or a sequence of bytes and
a length) <br>> > and turns it into a Swift string without allocation of a new
backing store <br>> > for the string and without copying the bytes in the sequence
from one <br>> > place in memory to another. <br>> <br>> We probably won't expose this at the top level of String, but you
should<br>> be able to construct an UnsafeCString (which is-a Unicode) and then,
if<br>> you really need the String type, construct a String from that:<br>> <br>> String(UnsafeCString(ntbs))<br>> <br>> That would not do any copying.<br>> <br>> > I understand this may require the programmer to handle memory<br>> > management for the backing store.<br>> ><br>> >> > I hope I am correct about the no-copy thing, and I would
also like to<br>> >> > permit promoting C strings to Swift strings without
validation. This<br>> >> > is obviously unsafe in general, but I know my strings...
and I care<br>> >> > about performance. ;)<br>> >> <br>> >> We intend to support that use-case. That's part of
the reason for the<br>> >> ValidUTF8 and ValidUTF16 encodings you see here:<br>> >> </font></tt><a href="https://github.com/apple/swift/blob/unicode-rethink/stdlib/public/core/Unicode2.swift#L598"><tt><font size=2>https://github.com/apple/swift/blob/unicode-rethink/stdlib/public/core/Unicode2.swift#L598</font></tt></a><tt><font size=2><br>> >> and here:<br>> >> </font></tt><a href="https://github.com/apple/swift/blob/unicode-rethink/stdlib/public/core/Unicode2.swift#L862"><tt><font size=2>https://github.com/apple/swift/blob/unicode-rethink/stdlib/public/core/Unicode2.swift#L862</font></tt></a><tt><font size=2><br>> ><br>> > OK<br>> ><br>> >> > More importantly, it is not possible to mutate bytes
in a Swift string<br>> >> > at will. Again it makes sense from the point of
view of always<br>> >> > correct Unicode sequences. But it does not for
machine processing of<br>> >> > C strings with C-like performance. Today, I can
cheat using a<br>> >> > "_public" API for this, i.e., myString._core._baseAddress!. This
<br>> >> > should be doable from an official "unsafe"
API.<br>> >> <br>> >> We intend to support that use-case.<br>> >> <br>> >> > Memory safety is also at play here, as well as ownership.
A proper<br>> >> > API could guarantee the backing store is writable for
instance, that<br>> >> > it is not shared. A memory-safe but not unicode-safe
API could do<br>> >> > bounds checks.<br>> >> ><br>> >> > While low-level C string processing can be done using
unsafe memory<br>> >> > buffers with performance, the lack of bridging with
"real" Swift<br>> >> > strings kills the deal. No literals syntax (or
costly coercions),<br>> >> > none of the many useful string APIs.<br>> >> ><br>> >> > To illustrate these points here is a simple experiment:
code written<br>> >> > to synthesize an http date string from a bunch of integers.
There are<br>> >> > four versions of the code going from nice high-level
Swift code to<br>> >> > low-level C-like code. (Some of this code is also
about avoiding ARC<br>> >> > overheads, and string interpolation overheads, hence
the four<br>> >> > versions.)<br>> >> ><br>> >> > On my MacBook Pro (swiftc -O), the performance is as
follows:<br>> >> ><br>> >> > interpolation + func: 2.303032365s<br>> >> > interpolation + array: 1.224858418s<br>> >> > append:
0.918512377s<br>> >> > memcpy:
0.182104674s<br>> >> ><br>> >> > While the benchmarking could be done more carefully,
I think the main<br>> >> > observation is valid. The nice code is more than
10x slower than the<br>> >> > C-like code. Moreover, the ugly-but-still-valid-Swift
code is still<br>> >> > about 5x slower than the C-like code. For some
applications, e.g. web<br>> >> > servers, numbers like these matter...<br>> >> ><br>> >> > Some of the proposed improvements would help with this,
e.g., small<br>> >> > strings optimization, and maybe changes to the concatenation<br>> >> > semantics. But it seems to me that a big performance
gap will remain.<br>> >> > (Concatenation even with strncat is significantly slower
than memcpy<br>> >> > for fixed-size strings.)<br>> >> ><br>> >> > I believe there is a need and an opportunity for a fast
"less safe"<br>> >> > String API. I hope it will be on the roadmap soon.<br>> >> <br>> >> I think it's already in the roadmap...the one that's in my
head. If you<br>> >> want to submit a PR with amendments to the manifesto, that'd
be great.<br>> >> Also thanks very much for the example below; we'll definitely<br>> >> be referring to it as we proceed forward.<br>> ><br>> > Here is a gist for the example code:<br>> > </font></tt><a href=https://gist.github.com/tardieu/b6a9c4d53d56d089c58089ba8f6274b5><tt><font size=2>https://gist.github.com/tardieu/b6a9c4d53d56d089c58089ba8f6274b5</font></tt></a><tt><font size=2><br>> ><br>> > I can sketch key elements of an unsafe String API and some motivating
<br>> > arguments in a pull request. Is this what you are asking for?<br>> <br>> That would be awesome, thanks!<br>> <br>> -- <br>> -Dave<br>> <br></font></tt><BR>
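<br><font size=2 face="sans-serif">A minimal sketch of the lazy widening mentioned in the quoted reply (latin1.lazy.map { UInt32($0) }). It assumes a plain [UInt8] array as a stand-in for Latin-1 storage, because the Unicode prototype types discussed in the thread are not a released API. The names latin1, wide, space and firstSpace are illustrative only.</font><br>
<pre>
```swift
// Assumption: a plain byte array stands in for unsafe Latin-1 storage.
let latin1: [UInt8] = Array("GET /index.html".utf8)

// Lazily zero-extend each 8-bit code unit to 32 bits.
// No widened buffer is allocated; elements are converted when read.
let wide = latin1.lazy.map { UInt32($0) }

// ASCII comparisons can be made directly against the widened code units.
let space = UInt32(UInt8(ascii: " "))

// firstIndex(of:) requires Swift 4.2 or later (index(of:) before that).
let firstSpace = wide.firstIndex(of: space)
```
</pre>
<font size=2 face="sans-serif">Because the map is lazy, the widened view keeps reading from the original 8-bit storage, and each code unit is zero-extended only when it is accessed.</font><br>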