[swift-evolution] [Draft proposal] Faster/lower-level external String initialization

Charles Kissinger crk at akkyra.com
Tue Jan 12 13:18:35 CST 2016


Zach,

Thanks very much for writing up this proposal! This will be a very valuable addition to the standard library for some of us. My comments are below:

> On Jan 11, 2016, at 1:56 PM, Zach Waldowski via swift-evolution <swift-evolution at swift.org> wrote:
> 
> Given the initial positive response, I've taken a crack both at
> implementation and converting the request to a proposal. The proposal
> draft is located at:
> 
>    https://github.com/zwaldowski/swift-evolution/blob/string-from-code-units/proposals/0000-string-from-code-units.md
> 
> The code is located at:
> 
>    https://github.com/apple/swift/compare/master...zwaldowski:string-from-code-units
> 
> The proposal is reproduced below:
> 
> # Expose code unit initializers on String
> 
> * Proposal:
> [SE-NNNN](https://github.com/apple/swift-evolution/blob/master/proposals/NNNN-string-from-code-units.md)
> * Author: [Zachary Waldowski](https://github.com/zwaldowski)
> * Status: **Awaiting review**
> * Review manager: TBD
> 
> ## Introduction
> 
> Going back and forth from Strings to their byte representations is an
> important part of solving many problems, including object
> serialization, binary file formats,

binary *and* text file formats!

> wire/network interfaces, and
> cryptography. Swift has such utilities, currently only exposed through
> `String.Type.fromCString(_:)`.
> 
> See swift-evolution
> [thread](https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160104/005951.html).
> 
> ## Motivation
> 
> In developing a parser, a coworker did the yeoman's work of benchmarking
> Swift's Unicode types. He swore up and down that
> `String.Type.fromCString(_:)`
> ([use](https://gist.github.com/zwaldowski/5f1a1011ea368e1c833e#file-fromcstring-swift))
> was the fastest way he found. I, stubborn and noobish as I am, was
> skeptical that a better way couldn't be wrought from Swift's
> `UnicodeCodecType`s.
> 
> After reading through stdlib source and doing my own testing, this is no
> wives' tale. `fromCString` is essentially the only public-facing user of
> `String.Type._fromCodeUnitSequence(_:input:)`, which serves the exact role
> of both efficient and safe initialization-by-buffer-copy.

It might be worth mentioning here in the Motivation section that `String.append(_: UnicodeScalar)` is not a viable alternative in many cases because it is much slower. (I know it is discussed below under Alternatives.)

> 
> Of course, `fromCString` isn't a silver bullet; it has to have a null
> sentinel, requiring a copy of the origin buffer if one needs to be added
> (as is the case with formats that specify the length up front, or
> unstructured payloads that use unescaped double quotes as the terminator).
> It also prevents the string itself from containing the null character.

This also means that something as fundamental as parsing substrings out of an NSData object requires either copying to intermediate buffers or using much slower character-by-character appends.

Another limitation is that `fromCString` only works with UTF-8 (or ASCII) encoding.

It is also worth mentioning that the implementation of fromCString() computes the string length (a call to strlen()). In many cases the client code has already calculated that length. The proposed solution could be at least slightly faster because the strlen call is not needed. Maybe this should go in the Proposed Solution section.
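
To make the copy and the redundant length scan concrete, here is a rough sketch. It is written in current Swift rather than the Swift 2 syntax used in the proposal, the record layout is made up for illustration, and `String(decoding:as:)` is today's standard library spelling, not the API proposed in this thread (it also repairs ill-formed input instead of failing):

```swift
// Hypothetical length-prefixed record: one length byte, then that many
// UTF-8 bytes, followed by unrelated trailing data.
let buffer: [UInt8] = [5, 0x48, 0x65, 0x6C, 0x6C, 0x6F, 0xFF, 0xFF]
let length = Int(buffer[0])
let payload = buffer[1 ..< 1 + length]   // a slice of the buffer, no copy yet

// C-string route: copy the payload, append a NUL sentinel, and let the
// initializer rediscover the length by scanning for that sentinel.
var terminated = Array(payload)
terminated.append(0)
let viaCString = terminated.withUnsafeBufferPointer {
    String(cString: $0.baseAddress!)
}

// Code-unit route: decode the slice directly, using the length we already know.
let viaCodeUnits = String(decoding: payload, as: UTF8.self)

print(viaCString, viaCodeUnits)
```

The second route needs neither the intermediate array nor the sentinel, and it would also work if the payload itself contained a zero byte.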

> 
> ## Proposed solution
> 
> I'd like to expose `String.Type._fromCodeUnitSequence(_:input:)` as
> public API:
> 
> ```swift
> init?<Input: CollectionType, Encoding: UnicodeCodecType where
> Encoding.CodeUnit == Input.Generator.Element>(codeUnits input: Input,
> encoding: Encoding.Type)
> ```
> 
> And, for consistency with
> `String.Type.fromCStringRepairingIllFormedUTF8(_:)`,
> exposing `String.Type._fromCodeUnitSequenceWithRepair(_:input:)`:
> 
> ```swift
> static func fromCodeUnitsWithRepair<Input: CollectionType, Encoding:
> UnicodeCodecType where Encoding.CodeUnit ==
> Input.Generator.Element>(input: Input, encoding: Encoding.Type)
> ```
> 

These two functions seem like a good approach. The only alternatives I can think of are either to have a `withRepair: Bool` parameter to the initializer (possibly with a default value) or to make the initializer a type method instead for complete consistency with fromCString() and fromCStringRepairingIllFormedUTF8().

It would be nice to get some feedback from someone at Apple as to why fromCString() was implemented as a type method instead of a failable initializer. Presumably it was because there is both a repairing and a failable, non-repairing version. 
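
For the sake of discussion, the `withRepair: Bool` alternative could look roughly like the sketch below. This is only an illustration written in current Swift syntax: the initializer name and labels are hypothetical, and the scalar-by-scalar decoding loop is a stand-in for whatever the stdlib does internally, so it says nothing about performance.

```swift
extension String {
    /// Hypothetical initializer: decodes `input` as `encoding`.
    /// With `withRepair == false` it fails on the first ill-formed sequence;
    /// with `withRepair == true` it substitutes U+FFFD and never fails.
    init?<Input: Collection, Encoding: UnicodeCodec>(
        codeUnits input: Input,
        encoding: Encoding.Type,
        withRepair: Bool = false
    ) where Input.Element == Encoding.CodeUnit {
        var decoder = Encoding()
        var iterator = input.makeIterator()
        var scalars = String.UnicodeScalarView()

        decoding: while true {
            switch decoder.decode(&iterator) {
            case .scalarValue(let scalar):
                scalars.append(scalar)
            case .emptyInput:
                break decoding
            case .error:
                // Strict mode gives up; repairing mode inserts U+FFFD.
                guard withRepair else { return nil }
                scalars.append("\u{FFFD}")
            }
        }
        self.init(scalars)
    }
}

let wellFormed: [UInt8] = [0x68, 0x69]   // "hi"
let illFormed: [UInt8] = [0x68, 0xFF]    // 0xFF is never valid in UTF-8

let strict = String(codeUnits: illFormed, encoding: UTF8.self)                      // nil
let repaired = String(codeUnits: illFormed, encoding: UTF8.self, withRepair: true)  // "h\u{FFFD}"
let plain = String(codeUnits: wellFormed, encoding: UTF8.self)                      // "hi"
```

One drawback of this shape is that the repairing call still returns an optional even though it can never fail, which may be part of why a separate factory function was used for fromCStringRepairingIllFormedUTF8().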

> ## Detailed design
> 
> See [full
> implementation](https://github.com/apple/swift/compare/master...zwaldowski:string-from-code-units).
> 
> This is a fairly straightforward renaming of the internal APIs.
> 
> The initializer, its labels, and their order were chosen to match other
> non-cast initializers in the stdlib. "Sequence" was removed, as it was a
> misnomer. "input" was kept as a generic name in order to allow for future
> refinements.
> 
> The static initializer made the same changes, but was otherwise kept as
> a factory function due to its multiple return values.
> 
> `String.Type._fromWellFormedCodeUnitSequence(_:input:)` was kept as-is
> for internal use. I assume it wouldn't be good to expose publicly because,
> for lack of a better phrase, we only "trust" the stdlib to accurately know
> the wellformedness of their code units. Since it is a simple call through,
> its use could be elided throughout the stdlib.
> 
> ## Impact on existing code
> 
> This is an additive change to the API.
> 
> ## Alternatives considered
> 
> * A protocol-oriented API.
> 
> Some kind of `func decode<Encoding>(_:)` on `SequenceType`. It's not
> really
> clear this method would be related to string processing, and would
> require
> some kind of bounding (like `where Generator.Element:
> UnsignedIntegerType`), but
> that would be introducing a type bound that doesn't exist on
> 
> * Do nothing.
> 
> This seems suboptimal. For many use cases, `String` lacking this
> constructor is a limiting factor on performance for many kinds of
> pure-Swift implementations.

And performance is extremely important in many file parsing scenarios because the size of the input files is unpredictable (and often large!).

> * Make the `NSString` [bridge
> faster](https://gist.github.com/zwaldowski/5f1a1011ea368e1c833e#file-nsstring-swift).
> 
> After reading the bridge code, I don't really know why it's slower.
> Maybe it's
> a bug.
> 
> * Make `String.append(_:)`
> [faster](https://gist.github.com/zwaldowski/5f1a1011ea368e1c833e#file-unicodescalar-swift).
> 
> I don't completely understand the growth strategy of `_StringCore`, but
> it doesn't seem to exhibit the documented amortized `O(1)`, even when
> `reserveCapacity(_:)` is used. In the pre-proposal discussion, a user
> noted that
> it seems like `reserveCapacity` acts like a no-op.

Even if the performance problems here are fixed, relying on String.append() would still lead to more verbose code than the proposed direct initializer or factory function.

> ----
> 
> Cheers,
> Zachary Waldowski
> zach at waldowski.me
> 
> On Fri, Jan 8, 2016, at 03:21 PM, Zach Waldowski wrote:
>> Going back and forth from Strings to their byte representations is an
>> important part of solving many problems, including object
>> serialization, binary file formats, wire/network interfaces, and
>> cryptography.
>> 
>> In developing such a parser, a coworker did the yeoman's work of
>> benchmarking
>> Swift's Unicode types. He swore up and down that
>> String.Type.fromCString(_:) [0]
>> was the fastest way he found. I, stubborn and noobish as I am, was
>> skeptical
>> that a better way couldn't be wrought from Swift's UnicodeCodecTypes.
>> 
>> After reading through stdlib source and doing my own testing, this is no
>> wives'
>> tale. fromCString [1] is essentially the only public user of
>> String.Type._fromCodeUnitSequence(_:input:), which serves the exact role
>> of
>> both efficient and safe initialization-by-buffer-copy.
>> 
>> Of course, fromCString isn't a silver bullet; it has to have a null
>> sentinel,
>> requiring a copy of the origin buffer if one needs to be added (as is
>> the
>> case with formats that specify the length up front, or unstructured
>> payloads
>> that use unescaped double quotes as the terminator). It also prevents
>> the string
>> itself from containing the null character.
>> 
>> I'd like to see _fromCodeUnitSequence [2] become public API as (just
>> spitballing here) String.init?<Collection, Codec>(codeUnits:encoding:).
>> If that
>> can't happen, an alternative to fromCString that doesn't use strlen
>> would be
>> nice, and we can just eat the performance hit on other code unit
>> sequences.
>> 
>> I can't really think of a reason why it's not exposed yet, so I'm led to
>> believe
>> I'm just missing something major, and not that a reason doesn't exist.
>> ;-)
>> 
>> There's also discussion to be had about whether API is needed. Try as I might, I
>> can't seem to get the reserveCapacity/append(UnicodeScalar) workflow to
>> have
>> anything close to the same speed. [3] Profiling indicates that I keep
>> hitting
>> _StringBuffer.grow. I don't know if that means the buffer isn't uniquely
>> referenced, or it's a bug, or what, but it's consistently slower than
>> creating
>> an Array of the bytes and performing fromCString on it. Similar story
>> with
>> crossing the NSString bridge, which is even stranger. [4]
>> 
>> Anyway, I wanted to stir up discussion, see if I'm way off base and/or
>> whether
>> this can be turned into a proposal.
>> 
>> [0]:
>> https://gist.github.com/zwaldowski/5f1a1011ea368e1c833e#file-fromcstring-swift
>> [1]:
>> https://github.com/apple/swift/blob/master/stdlib/public/core/CString.swift#L18-L31
>> [2]:
>> https://github.com/apple/swift/blob/master/stdlib/public/core/String.swift#L134-L150
>> [3]:
>> https://gist.github.com/zwaldowski/5f1a1011ea368e1c833e#file-unicodescalar-swift
>> [4]:
>> https://gist.github.com/zwaldowski/5f1a1011ea368e1c833e#file-nsstring-swift
>> 
>> Cheers,
>> Zachary Waldowski
>> zach at waldowski.me
> _______________________________________________
> swift-evolution mailing list
> swift-evolution at swift.org
> https://lists.swift.org/mailman/listinfo/swift-evolution


