[swift-evolution] [Draft proposal] Faster/lower-level external String initialization
moiseev at apple.com
Tue Jan 12 13:57:59 CST 2016
We looked at the CString APIs as part of the API Naming Guidelines application effort.
You can see the results here: https://github.com/apple/swift/commit/f4aaece75e97379db6ba0a1fdb1da42c231a1c3b
The main idea is to turn static factories into initializers and make init(cString:) do ‘most probably the right thing’, i.e. repair UTF8 code units.
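To illustrate what "repair" means here, a minimal sketch in present-day Swift. The spellings below postdate this thread and are not taken from the linked commit, and the byte values are made up for the example:

```swift
// Bytes for "a", one invalid UTF-8 byte (0xFF), "b", then the NUL terminator.
let bytes: [UInt8] = [0x61, 0xFF, 0x62, 0x00]

// init(cString:) scans up to the NUL terminator and repairs ill-formed
// input by substituting U+FFFD REPLACEMENT CHARACTER.
let repaired = bytes.withUnsafeBufferPointer { buffer in
    String(cString: buffer.baseAddress!)
}

// decodeCString with repair turned off refuses the same input instead.
let strict = bytes.withUnsafeBufferPointer { buffer in
    String.decodeCString(buffer.baseAddress, as: UTF8.self,
                         repairingInvalidCodeUnits: false)
}

print(repaired == "a\u{FFFD}b")   // the bad byte became U+FFFD
print(strict == nil)              // no string is produced without repair
```

Both calls still have to find the terminator first, which is the cost the proposal is trying to avoid.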
I haven’t looked at your proposal in detail, but I think that if we add a new String.decodeCString that accepts an UnsafeBufferPointer instead of an UnsafePointer (and so does not have to call _swift_stdlib_strlen), that would solve the problem. Unless I’m missing something.
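As a rough sketch of the buffer-based idea, here is how decoding from a length-carrying buffer looks in present-day Swift. String(decoding:as:) is a later standard library initializer, not the decodeCString variant suggested above, and it always repairs invalid input. The payload is a made-up example:

```swift
// Hypothetical payload whose length is known up front: "hi", an embedded
// NUL, then "!". There is no trailing NUL terminator.
let payload: [UInt8] = [0x68, 0x69, 0x00, 0x21]

// The buffer pointer carries its own count, so no strlen-style scan is
// needed and the embedded NUL is decoded like any other code unit.
let decoded = payload.withUnsafeBufferPointer { buffer in
    String(decoding: buffer, as: UTF8.self)
}

// For contrast, a C-string initializer stops at the first NUL it finds.
// This is only safe here because the payload happens to contain one.
let truncated = payload.withUnsafeBufferPointer { buffer in
    String(cString: buffer.baseAddress!)
}

print(decoded.utf8.count)   // all four code units survive
print(truncated)            // only the part before the NUL
```

The contrast covers both limitations raised in the proposal: the need for a terminator and the inability to hold a null character in the string.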
> On Jan 11, 2016, at 1:56 PM, Zach Waldowski via swift-evolution <swift-evolution at swift.org> wrote:
> Given the initial positive response, I've taken a crack both at
> implementation and converting the request to a proposal. The proposal
> draft is located at:
> The code is located at:
> The proposal is reproduced below:
> # Expose code unit initializers on String
> * Proposal:
> * Author: [Zachary Waldowski](https://github.com/zwaldowski)
> * Status: **Awaiting review**
> * Review manager: TBD
> ## Introduction
> Going back and forth from Strings to their byte representations is an
> important part of solving many problems, including object
> serialization, binary file formats, wire/network interfaces, and
> cryptography. Swift has such utilities, currently only exposed through
> See swift-evolution
> ## Motivation
> In developing a parser, a coworker did the yeoman's work of benchmarking
> Swift's Unicode types. He swore up and down that
> `String.Type.fromCString(_:)` was the fastest way he found. I, stubborn
> and noobish as I am, was skeptical that a better way couldn't be wrought
> from Swift's `UnicodeCodecType`s.
> After reading through stdlib source and doing my own testing, this is no
> tale. `fromCString` is essentially the only public-facing user of
> `String.Type._fromCodeUnitSequence(_:input:)`, which serves the exact
> role of both efficient and safe initialization-by-buffer-copy.
> Of course, `fromCString` isn't a silver bullet; it has to have a null
> terminator, requiring a copy of the origin buffer if one needs to be added
> (as is the case with formats that specify the length up front, or unstructured
> that use unescaped double quotes as the terminator). It also prevents
> the string itself from containing the null character.
> ## Proposed solution
> I'd like to expose `String.Type._fromCodeUnitSequence(_:input:)` as
> public API:
> ```
> init?<Input: CollectionType, Encoding: UnicodeCodecType where
>     Encoding.CodeUnit == Input.Generator.Element>(codeUnits input: Input,
>     encoding: Encoding.Type)
> ```
> And, for consistency with
> exposing `String.Type._fromCodeUnitSequenceWithRepair(_:input:)`:
> ```
> static func fromCodeUnitsWithRepair<Input: CollectionType, Encoding:
>     UnicodeCodecType where Encoding.CodeUnit ==
>     Input.Generator.Element>(input: Input, encoding: Encoding.Type)
> ```
> ## Detailed design
> See [full
> This is a fairly straightforward renaming of the internal APIs.
> The initializer, its labels, and their order were chosen to match other
> initializers in the stdlib. "Sequence" was removed, as it was a
> "input" was kept as a generic name in order to allow for future
> The static initializer made the same changes, but was otherwise kept as
> a factory function due to its multiple return values.
> `String.Type._fromWellFormedCodeUnitSequence(_:input:)` was kept as-is
> for internal use. I assume it wouldn't be good to expose publicly because,
> for lack of a better phrase, we only "trust" the stdlib to accurately know
> the well-formedness of its code units. Since it is a simple call through,
> its use could be elided throughout the stdlib.
> ## Impact on existing code
> This is an additive change to the API.
> ## Alternatives considered
> * A protocol-oriented API.
> Some kind of `func decode<Encoding>(_:)` on `SequenceType`. It's not
> clear this method would be related to string processing, and would need
> some kind of bounding (like `where Generator.Element:
> UnsignedIntegerType`), but
> that would be introducing a type bound that doesn't exist on
> * Do nothing.
> This seems suboptimal. For many use cases, `String` lacking this
> constructor is
> a limiting factor on performance for many kinds of pure-Swift
> * Make the `NSString` [bridge
> After reading the bridge code, I don't really know why it's slower.
> Maybe it's
> a bug.
> * Make `String.append(_:)`
> I don't completely understand the growth strategy of `_StringCore`, but
> it doesn't seem to exhibit the documented amortized `O(1)`, even when
> `reserveCapacity(_:)` is used. In the pre-proposal discussion, a user
> noted that
> it seems like `reserveCapacity` acts like a no-op.
> Zachary Waldowski
> zach at waldowski.me
> On Fri, Jan 8, 2016, at 03:21 PM, Zach Waldowski wrote:
>> Going back and forth from Strings to their byte representations is an
>> important part of solving many problems, including object
>> serialization, binary file formats, wire/network interfaces, and
>> In developing such a parser, a coworker did the yeoman's work of
>> benchmarking Swift's Unicode types. He swore up and down that
>> String.Type.fromCString(_:)
>> was the fastest way he found. I, stubborn and noobish as I am, was
>> skeptical that a better way couldn't be wrought from Swift's UnicodeCodecTypes.
>> After reading through stdlib source and doing my own testing, this is no
>> tale. fromCString is essentially the only public user of
>> String.Type._fromCodeUnitSequence(_:input:), which serves the exact role
>> of both efficient and safe initialization-by-buffer-copy.
>> Of course, fromCString isn't a silver bullet; it has to have a null
>> terminator, requiring a copy of the origin buffer if one needs to be added
>> (as is the case with formats that specify the length up front, or unstructured
>> that use unescaped double quotes as the terminator). It also prevents
>> the string itself from containing the null character.
>> I'd like to see _fromCodeUnitSequence become public API as (just
>> spitballing here) String.init?<Collection, Codec>(codeUnits:encoding:).
>> If that can't happen, an alternative to fromCString that doesn't use strlen
>> would be nice, and we can just eat the performance hit on other code unit
>> I can't really think of a reason why it's not exposed yet, so I'm led to
>> believe I'm just missing something major, and not that a reason doesn't exist.
>> There's also discussion to be had about whether API is needed. Try as I might, I
>> can't seem to get the reserveCapacity/append(UnicodeScalar) workflow to
>> anything close to the same speed.  Profiling indicates that I keep
>> _StringBuffer.grow. I don't know if that means the buffer isn't uniquely
>> referenced, or it's a bug, or what, but it's consistently slower than
>> an Array of the bytes and performing fromCString on it. Similar story
>> crossing the NSString bridge, which is even stranger. 
>> Anyway, I wanted to stir up discussion, see if I'm way off base and/or
>> this can be turned into a proposal.
>> Zachary Waldowski
>> zach at waldowski.me