[swift-evolution] [Draft proposal] Faster/lower-level external String initialization

Zach Waldowski zach at waldowski.me
Tue Jan 12 14:33:54 CST 2016


Though Max's follow-up might call into question the need for the
proposal (in a perfect world I'd like to see this in 2.2), I've
addressed your comments. Thanks!

-- 
Zach Waldowski
zach at waldowski.me

On Tue, Jan 12, 2016, at 02:18 PM, Charles Kissinger wrote:
> Zach,
> 
> Thanks very much for writing up this proposal! This will be a very
> valuable addition to the standard library for some of us. My comments are
> below:
> 
> > On Jan 11, 2016, at 1:56 PM, Zach Waldowski via swift-evolution <swift-evolution at swift.org> wrote:
> > 
> > Given the initial positive response, I've taken a crack both at
> > implementation and converting the request to a proposal. The proposal
> > draft is located at:
> > 
> >    https://github.com/zwaldowski/swift-evolution/blob/string-from-code-units/proposals/0000-string-from-code-units.md
> > 
> > The code is located at:
> > 
> >    https://github.com/apple/swift/compare/master...zwaldowski:string-from-code-units
> > 
> > The proposal is reproduced below:
> > 
> > # Expose code unit initializers on String
> > 
> > * Proposal:
> > [SE-NNNN](https://github.com/apple/swift-evolution/blob/master/proposals/NNNN-string-from-code-units.md)
> > * Author: [Zachary Waldowski](https://github.com/zwaldowski)
> > * Status: **Awaiting review**
> > * Review manager: TBD
> > 
> > ## Introduction
> > 
> > Going back and forth from Strings to their byte representations is an
> > important part of solving many problems, including object
> > serialization, binary file formats,
> 
> binary *and* text file formats!
> 
> > wire/network interfaces, and
> > cryptography. Swift has such utilities, currently only exposed through
> > `String.Type.fromCString(_:)`.
> > 
> > See swift-evolution
> > [thread](https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160104/005951.html).
> > 
> > ## Motivation
> > 
> > In developing a parser, a coworker did the yeoman's work of benchmarking
> > Swift's Unicode types. He swore up and down that
> > `String.Type.fromCString(_:)`
> > ([use](https://gist.github.com/zwaldowski/5f1a1011ea368e1c833e#file-fromcstring-swift))
> > was the fastest way he found. I, stubborn and noobish as I am, was
> > skeptical that a better way couldn't be wrought from Swift's
> > `UnicodeCodecType`s.
> > 
> > After reading through stdlib source and doing my own testing, this is no
> > wives' tale. `fromCString` is essentially the only public-facing user of
> > `String.Type._fromCodeUnitSequence(_:input:)`, which serves the exact
> > role of both efficient and safe initialization-by-buffer-copy.
> 
> It might be worth mentioning here in the Motivation section that
> String.append(_: UnicodeScalar) is not a viable alternative in many cases
> because it has much slower performance. (I know it is discussed below
> under alternatives.)
> 
> > 
> > Of course, `fromCString` isn't a silver bullet; it has to have a null
> > sentinel, requiring a copy of the origin buffer if one needs to be added
> > (as is the case with formats that specify the length up front, or
> > unstructured payloads that use unescaped double quotes as the
> > terminator). It also prevents the string itself from containing the null
> > character.
> 
> This also means that something as fundamental as parsing sub-strings out
> of an NSData object requires copying to intermediate buffers or the use
> of much slower character-by-character appends.
> 
> Another limitation is that `fromCString` only works with UTF8 (or ASCII)
> encoding.
> 
> It is worth mentioning also that the implementation of fromCString()
> involves a string length calculation (call to strlen()). In many cases
> that length has already been calculated in the client code. The proposed
> solution has the potential of being at least slightly faster because the
> strlen call is not needed. Maybe this should go in the Proposed Solution
> section.
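
To make the terminator and strlen points concrete, here is a rough sketch. It is written against a newer Swift toolchain than the one this thread targets, so `String(cString:)` stands in for `fromCString`, and `String(decoding:as:)` stands in for a length-aware code unit initializer (note that it repairs ill-formed input rather than failing). The record layout is made up for illustration.

```swift
// Hypothetical length-prefixed record: one length byte, then that many
// UTF-8 bytes, with no NUL terminator anywhere.
let record: [UInt8] = [5, 0x68, 0x65, 0x6C, 0x6C, 0x6F, 0x21]
let length = Int(record[0])
let body = record[1 ..< 1 + length]  // ArraySlice view, no copy yet

// C-string route: copy the slice just to append a terminator, then let
// the initializer rediscover a length we already knew.
var terminated = Array(body)
terminated.append(0)
let viaCString = terminated.withUnsafeBufferPointer {
    String(cString: $0.baseAddress!)
}

// Code-unit route: hand over exactly the bytes we have.
let viaCodeUnits = String(decoding: body, as: UTF8.self)

// An embedded NUL is silently treated as the end of a C string...
let withNul: [UInt8] = [0x61, 0x00, 0x62, 0x00]
let truncated = withNul.withUnsafeBufferPointer {
    String(cString: $0.baseAddress!)
}
// ...but survives when the length is passed explicitly.
let intact = String(decoding: withNul.dropLast(), as: UTF8.self)

print(viaCString, viaCodeUnits)
print(truncated.unicodeScalars.count, intact.unicodeScalars.count)
```

Both routes produce the same text for the well-behaved record; the difference is the extra copy and length scan on the C-string side, and the lost data once a NUL appears in the payload.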
> 
> > 
> > ## Proposed solution
> > 
> > I'd like to expose `String.Type._fromCodeUnitSequence(_:input:)` as
> > public API:
> > 
> > ```swift
> > init?<Input: CollectionType, Encoding: UnicodeCodecType where
> > Encoding.CodeUnit == Input.Generator.Element>(codeUnits input: Input,
> > encoding: Encoding.Type)
> > ```
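
For anyone who wants to play with the shape of this initializer, here is a minimal stand-in. It assumes a newer toolchain (`Collection` and `UnicodeCodec` instead of `CollectionType` and `UnicodeCodecType`), and it decodes one scalar at a time through the public codec API, so it shows the semantics only. It is not the buffer-copying fast path in the branch above.

```swift
extension String {
    /// Illustrative stand-in for the proposed `init?(codeUnits:encoding:)`.
    /// Fails (returns nil) on the first ill-formed sequence.
    init?<Input: Collection, Encoding: UnicodeCodec>(
        codeUnits input: Input, encoding: Encoding.Type
    ) where Encoding.CodeUnit == Input.Element {
        var decoder = Encoding()
        var iterator = input.makeIterator()
        var scalars = String.UnicodeScalarView()
        decoding: while true {
            switch decoder.decode(&iterator) {
            case .scalarValue(let scalar):
                scalars.append(scalar)
            case .emptyInput:
                break decoding      // all code units consumed
            case .error:
                return nil          // strict: no repair
            }
        }
        self.init(scalars)
    }
}

let utf8Bytes: [UInt8] = [0x68, 0x69]          // "hi"
let badBytes: [UInt8] = [0x68, 0xFF]           // 0xFF never appears in UTF-8
let utf16Units = Array("héllo".utf16)

let valid = String(codeUnits: utf8Bytes, encoding: UTF8.self)
let invalid = String(codeUnits: badBytes, encoding: UTF8.self)
let wide = String(codeUnits: utf16Units, encoding: UTF16.self)
print(valid as Any, invalid as Any, wide as Any)
```

The same call works for any codec whose code unit matches the collection's element type, which is what removes the UTF-8-only restriction of the C string entry point.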
> > 
> > And, for consistency with
> > `String.Type.fromCStringRepairingIllFormedUTF8(_:)`,
> > exposing `String.Type._fromCodeUnitSequenceWithRepair(_:input:)`:
> > 
> > ```swift
> > static func fromCodeUnitsWithRepair<Input: CollectionType, Encoding:
> > UnicodeCodecType where Encoding.CodeUnit ==
> > Input.Generator.Element>(input: Input, encoding: Encoding.Type)
> > ```
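
A similar sketch for the repairing variant, again assuming newer Swift spellings. The `(String, hadError: Bool)` return type is my assumption, modelled on `fromCStringRepairingIllFormedUTF8`; the signature above does not spell it out. Ill-formed sequences are replaced with U+FFFD and reported through the flag.

```swift
extension String {
    /// Illustrative stand-in for the proposed repairing factory function.
    /// The return tuple is an assumption, not taken from the proposal text.
    static func fromCodeUnitsWithRepair<Input: Collection, Encoding: UnicodeCodec>(
        input: Input, encoding: Encoding.Type
    ) -> (String, hadError: Bool) where Encoding.CodeUnit == Input.Element {
        var decoder = Encoding()
        var iterator = input.makeIterator()
        var scalars = String.UnicodeScalarView()
        var hadError = false
        decoding: while true {
            switch decoder.decode(&iterator) {
            case .scalarValue(let scalar):
                scalars.append(scalar)
            case .emptyInput:
                break decoding
            case .error:
                // Keep going, substituting the replacement character.
                hadError = true
                scalars.append("\u{FFFD}")
            }
        }
        return (String(scalars), hadError: hadError)
    }
}

let damaged: [UInt8] = [0x68, 0xFF, 0x69]      // "h", invalid byte, "i"
let clean: [UInt8] = [0x68, 0x69]

let repaired = String.fromCodeUnitsWithRepair(input: damaged, encoding: UTF8.self)
let untouched = String.fromCodeUnitsWithRepair(input: clean, encoding: UTF8.self)
print(repaired.0, repaired.hadError, untouched.0, untouched.hadError)
```

Returning the flag alongside the string is what keeps this a factory function rather than an initializer.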
> > 
> 
> These two functions seem like a good approach. The only alternatives I
> can think of are either to have a `withRepair: Bool` parameter to the
> initializer (possibly with a default value) or to make the initializer a
> type method instead for complete consistency with fromCString() and
> fromCStringRepairingIllFormedUTF8().
> 
> It would be nice to get some feedback from someone at Apple as to why
> fromCString() was implemented as a type method instead of a failable
> initializer. Presumably it was because there is both a repairing and a
> failable, non-repairing version. 
> 
> > ## Detailed design
> > 
> > See [full
> > implementation](https://github.com/apple/swift/compare/master...zwaldowski:string-from-code-units).
> > 
> > This is a fairly straightforward renaming of the internal APIs.
> > 
> > The initializer, its labels, and their order were chosen to match other
> > non-cast initializers in the stdlib. "Sequence" was removed, as it was a
> > misnomer. "input" was kept as a generic name in order to allow for future
> > refinements.
> > 
> > The static initializer made the same changes, but was otherwise kept as
> > a factory function due to its multiple return values.
> > 
> > `String.Type._fromWellFormedCodeUnitSequence(_:input:)` was kept as-is
> > for internal use. I assume it wouldn't be good to expose publicly
> > because, for lack of a better phrase, we only "trust" the stdlib to
> > accurately know the wellformedness of their code units. Since it is a
> > simple call through, its use could be elided throughout the stdlib.
> > 
> > ## Impact on existing code
> > 
> > This is an additive change to the API.
> > 
> > ## Alternatives considered
> > 
> > * A protocol-oriented API.
> > 
> > Some kind of `func decode<Encoding>(_:)` on `SequenceType`. It's not
> > really clear this method would be related to string processing, and
> > would require some kind of bounding (like `where Generator.Element:
> > UnsignedIntegerType`), but that would be introducing a type bound that
> > doesn't exist on
> > 
> > * Do nothing.
> > 
> > This seems suboptimal. For many kinds of pure-Swift implementations,
> > `String` lacking this constructor is a limiting factor on performance.
> 
> And performance is extremely important in many file parsing scenarios
> because the size of the input files is unpredictable (and often large!).
> 
> > * Make the `NSString` [bridge
> > faster](https://gist.github.com/zwaldowski/5f1a1011ea368e1c833e#file-nsstring-swift).
> > 
> > After reading the bridge code, I don't really know why it's slower.
> > Maybe it's a bug.
> > 
> > * Make `String.append(_:)`
> > [faster](https://gist.github.com/zwaldowski/5f1a1011ea368e1c833e#file-unicodescalar-swift).
> > 
> > I don't completely understand the growth strategy of `_StringCore`, but
> > it doesn't seem to exhibit the documented amortized `O(1)`, even when
> > `reserveCapacity(_:)` is used. In the pre-proposal discussion, a user
> > noted that it seems like `reserveCapacity` acts like a no-op.
> 
> Even if the performance problems here are fixed, relying on
> String.append() would still lead to more verbose code than the proposed
> direct initializer or factory function.
> 
> > ----
> > 
> > Cheers,
> > Zachary Waldowski
> > zach at waldowski.me
> > 
> > On Fri, Jan 8, 2016, at 03:21 PM, Zach Waldowski wrote:
> >> Going back and forth from Strings to their byte representations is an
> >> important part of solving many problems, including object
> >> serialization, binary file formats, wire/network interfaces, and
> >> cryptography.
> >> 
> >> In developing such a parser, a coworker did the yeoman's work of
> >> benchmarking Swift's Unicode types. He swore up and down that
> >> String.Type.fromCString(_:) [0] was the fastest way he found. I,
> >> stubborn and noobish as I am, was skeptical that a better way couldn't
> >> be wrought from Swift's UnicodeCodecTypes.
> >> 
> >> After reading through stdlib source and doing my own testing, this is no
> >> wives' tale. fromCString [1] is essentially the only public user of
> >> String.Type._fromCodeUnitSequence(_:input:), which serves the exact role
> >> of both efficient and safe initialization-by-buffer-copy.
> >> 
> >> Of course, fromCString isn't a silver bullet; it has to have a null
> >> sentinel, requiring a copy of the origin buffer if one needs to be added
> >> (as is the case with formats that specify the length up front, or
> >> unstructured payloads that use unescaped double quotes as the
> >> terminator). It also prevents the string itself from containing the null
> >> character.
> >> 
> >> I'd like to see _fromCodeUnitSequence [2] become public API as (just
> >> spitballing here) String.init?<Collection, Codec>(codeUnits:encoding:).
> >> If that can't happen, an alternative to fromCString that doesn't use
> >> strlen would be nice, and we can just eat the performance hit on other
> >> code unit sequences.
> >> 
> >> I can't really think of a reason why it's not exposed yet, so I'm led to
> >> believe I'm just missing something major, and not that a reason doesn't
> >> exist. ;-)
> >> 
> >> There's also discussion to be had about whether API is needed at all.
> >> Try as I might, I can't seem to get the
> >> reserveCapacity/append(UnicodeScalar) workflow to have anything close to
> >> the same speed. [3] Profiling indicates that I keep hitting
> >> _StringBuffer.grow. I don't know if that means the buffer isn't uniquely
> >> referenced, or it's a bug, or what, but it's consistently slower than
> >> creating an Array of the bytes and performing fromCString on it. Similar
> >> story with crossing the NSString bridge, which is even stranger. [4]
> >> 
> >> Anyway, I wanted to stir up discussion, see if I'm way off base and/or
> >> whether this can be turned into a proposal.
> >> 
> >> [0]:
> >> https://gist.github.com/zwaldowski/5f1a1011ea368e1c833e#file-fromcstring-swift
> >> [1]:
> >> https://github.com/apple/swift/blob/master/stdlib/public/core/CString.swift#L18-L31
> >> [2]:
> >> https://github.com/apple/swift/blob/master/stdlib/public/core/String.swift#L134-L150
> >> [3]:
> >> https://gist.github.com/zwaldowski/5f1a1011ea368e1c833e#file-unicodescalar-swift
> >> [4]:
> >> https://gist.github.com/zwaldowski/5f1a1011ea368e1c833e#file-nsstring-swift
> >> 
> >> Cheers,
> >> Zachary Waldowski
> >> zach at waldowski.me
> 

