[swift-evolution] [Draft proposal] Faster/lower-level external String initialization
Charles Kissinger
crk at akkyra.com
Tue Jan 12 14:59:41 CST 2016
> On Jan 12, 2016, at 11:57 AM, Max Moiseev via swift-evolution <swift-evolution at swift.org> wrote:
>
> Hi Zach,
>
> We looked at the CString APIs as part of API Naming Guidelines application effort.
> You can see the results here: https://github.com/apple/swift/commit/f4aaece75e97379db6ba0a1fdb1da42c231a1c3b
>
> The main idea is to turn static factories into initializers and make init(cString:) do ‘most probably the right thing’, i.e. repair UTF8 code units.
>
> Haven’t looked at your proposal in detail, but I think that if we add a new String.decodeCString that accepts an UnsafeBufferPointer instead of an UnsafePointer (and does not have to call _swift_stdlib_strlen), that would solve the problem. Unless I’m missing something.
That would solve my particular problems anyway. Will a proposal still be required for this to happen?
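
For illustration, here is a rough sketch of the difference between the two routes, written in present-day Swift rather than the API under discussion in this thread. It assumes `String(decoding:as:)`, a standard-library initializer from later Swift releases, as a stand-in for a buffer-based decode; the variable names are purely illustrative.

```swift
// Payload with a known length and no trailing NUL, e.g. a length-prefixed field.
let payload: [UInt8] = [0x63, 0x61, 0x66, 0xC3, 0xA9]   // "café" in UTF-8

// C-string route: a NUL-terminated copy has to be built first,
// and the decoder then scans for that terminator to find the length.
let viaCString = String(cString: payload + [0])

// Buffer route: the count travels with the pointer, so no sentinel
// and no strlen-style scan is needed. Ill-formed UTF-8 would be repaired.
let viaBuffer = payload.withUnsafeBufferPointer { buffer in
    String(decoding: buffer, as: UTF8.self)
}

// An embedded NUL truncates the C-string route but survives the buffer route.
let withNul: [UInt8] = [0x61, 0x00, 0x62]               // "a", NUL, "b"
let truncated = String(cString: withNul + [0])
let complete = withNul.withUnsafeBufferPointer { buffer in
    String(decoding: buffer, as: UTF8.self)
}

print(viaCString == viaBuffer)   // both routes agree when there is no embedded NUL
print(truncated.utf8.count, complete.utf8.count)
```

The buffer route avoids both the extra copy and the restriction on embedded null characters, which are the two costs of `fromCString` raised in the proposal.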
-CK
>
> regards,
> max
>
>> On Jan 11, 2016, at 1:56 PM, Zach Waldowski via swift-evolution <swift-evolution at swift.org> wrote:
>>
>> Given the initial positive response, I've taken a crack both at
>> implementation and converting the request to a proposal. The proposal
>> draft is located at:
>>
>> https://github.com/zwaldowski/swift-evolution/blob/string-from-code-units/proposals/0000-string-from-code-units.md
>>
>> The code is located at:
>>
>> https://github.com/apple/swift/compare/master...zwaldowski:string-from-code-units
>>
>> The proposal is reproduced below:
>>
>> # Expose code unit initializers on String
>>
>> * Proposal:
>> [SE-NNNN](https://github.com/apple/swift-evolution/blob/master/proposals/NNNN-string-from-code-units.md)
>> * Author: [Zachary Waldowski](https://github.com/zwaldowski)
>> * Status: **Awaiting review**
>> * Review manager: TBD
>>
>> ## Introduction
>>
>> Going back and forth from Strings to their byte representations is an
>> important part of solving many problems, including object
>> serialization, binary file formats, wire/network interfaces, and
>> cryptography. Swift has such utilities, currently only exposed through
>> `String.Type.fromCString(_:)`.
>>
>> See swift-evolution
>> [thread](https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160104/005951.html).
>>
>> ## Motivation
>>
>> In developing a parser, a coworker did the yeoman's work of benchmarking
>> Swift's Unicode types. He swore up and down that
>> `String.Type.fromCString(_:)`
>> ([use](https://gist.github.com/zwaldowski/5f1a1011ea368e1c833e#file-fromcstring-swift))
>> was the fastest way he found. I, stubborn and noobish as I am, was
>> skeptical that a better way couldn't be wrought from Swift's
>> `UnicodeCodecType`s.
>>
>> After reading through stdlib source and doing my own testing, this is no
>> wives' tale. `fromCString` is essentially the only public-facing user of
>> `String.Type._fromCodeUnitSequence(_:input:)`, which serves the exact
>> role of both efficient and safe initialization-by-buffer-copy.
>>
>> Of course, `fromCString` isn't a silver bullet; it has to have a null
>> sentinel, requiring a copy of the origin buffer if one needs to be added
>> (as is the case with formats that specify the length up front, or
>> unstructured payloads that use unescaped double quotes as the
>> terminator). It also prevents the string itself from containing the null
>> character.
>>
>> ## Proposed solution
>>
>> I'd like to expose `String.Type._fromCodeUnitSequence(_:input:)` as
>> public API:
>>
>> ```swift
>> init?<Input: CollectionType, Encoding: UnicodeCodecType where
>> Encoding.CodeUnit == Input.Generator.Element>(codeUnits input: Input,
>> encoding: Encoding.Type)
>> ```
>>
>> And, for consistency with
>> `String.Type.fromCStringRepairingIllFormedUTF8(_:)`,
>> exposing `String.Type._fromCodeUnitSequenceWithRepair(_:input:)`:
>>
>> ```swift
>> static func fromCodeUnitsWithRepair<Input: CollectionType, Encoding:
>> UnicodeCodecType where Encoding.CodeUnit ==
>> Input.Generator.Element>(input: Input, encoding: Encoding.Type)
>> ```
>>
>> ## Detailed design
>>
>> See [full
>> implementation](https://github.com/apple/swift/compare/master...zwaldowski:string-from-code-units).
>>
>> This is a fairly straightforward renaming of the internal APIs.
>>
>> The initializer, its labels, and their order were chosen to match other
>> non-cast
>> initializers in the stdlib. "Sequence" was removed, as it was a
>> misnomer.
>> "input" was kept as a generic name in order to allow for future
>> refinements.
>>
>> The static initializer made the same changes, but was otherwise kept as
>> a
>> factory function due to its multiple return values.
>>
>> `String.Type._fromWellFormedCodeUnitSequence(_:input:)` was kept as-is
>> for
>> internal use. I assume it wouldn't be good to expose publicly because,
>> for
>> lack of a better phrase, we only "trust" the stdlib to accurately know
>> the
>> wellformedness of their code units. Since it is a simple call through,
>> its
>> use could be elided throughout the stdlib.
>>
>> ## Impact on existing code
>>
>> This is an additive change to the API.
>>
>> ## Alternatives considered
>>
>> * A protocol-oriented API.
>>
>> Some kind of `func decode<Encoding>(_:)` on `SequenceType`. It's not
>> really
>> clear this method would be related to string processing, and would
>> require
>> some kind of bounding (like `where Generator.Element:
>> UnsignedIntegerType`), but
>> that would be introducing a type bound that doesn't exist on
>>
>> * Do nothing.
>>
>> This seems suboptimal. The lack of this constructor on `String` is a
>> limiting factor on performance for many kinds of pure-Swift
>> implementations.
>>
>> * Make the `NSString` [bridge
>> faster](https://gist.github.com/zwaldowski/5f1a1011ea368e1c833e#file-nsstring-swift).
>>
>> After reading the bridge code, I don't really know why it's slower.
>> Maybe it's
>> a bug.
>>
>> * Make `String.append(_:)`
>> [faster](https://gist.github.com/zwaldowski/5f1a1011ea368e1c833e#file-unicodescalar-swift).
>>
>> I don't completely understand the growth strategy of `_StringCore`, but
>> it doesn't seem to exhibit the documented amortized `O(1)`, even when
>> `reserveCapacity(_:)` is used. In the pre-proposal discussion, a user
>> noted that
>> it seems like `reserveCapacity` acts like a no-op.
>>
>> ----
>>
>> Cheers,
>> Zachary Waldowski
>> zach at waldowski.me
>>
>> On Fri, Jan 8, 2016, at 03:21 PM, Zach Waldowski wrote:
>>> Going back and forth from Strings to their byte representations is an
>>> important part of solving many problems, including object
>>> serialization, binary file formats, wire/network interfaces, and
>>> cryptography.
>>>
>>> In developing such a parser, a coworker did the yeoman's work of
>>> benchmarking
>>> Swift's Unicode types. He swore up and down that
>>> String.Type.fromCString(_:) [0]
>>> was the fastest way he found. I, stubborn and noobish as I am, was
>>> skeptical
>>> that a better way couldn't be wrought from Swift's UnicodeCodecTypes.
>>>
>>> After reading through stdlib source and doing my own testing, this is no
>>> wives'
>>> tale. fromCString [1] is essentially the only public user of
>>> String.Type._fromCodeUnitSequence(_:input:), which serves the exact role
>>> of
>>> both efficient and safe initialization-by-buffer-copy.
>>>
>>> Of course, fromCString isn't a silver bullet; it has to have a null
>>> sentinel,
>>> requiring a copy of the origin buffer if one needs to be added (as is
>>> the
>>> case with formats that specify the length up front, or unstructured
>>> payloads
>>> that use unescaped double quotes as the terminator). It also prevents
>>> the string
>>> itself from containing the null character.
>>>
>>> I'd like to see _fromCodeUnitSequence [2] become public API as (just
>>> spitballing here) String.init?<Collection, Codec>(codeUnits:encoding:).
>>> If that
>>> can't happen, an alternative to fromCString that doesn't use strlen
>>> would be
>>> nice, and we can just eat the performance hit on other code unit
>>> sequences.
>>>
>>> I can't really think of a reason why it's not exposed yet, so I'm led to
>>> believe
>>> I'm just missing something major, and not that a reason doesn't exist.
>>> ;-)
>>>
>>> There's also discussion to be had about whether new API is needed at all. Try as I might, I
>>> can't seem to get the reserveCapacity/append(UnicodeScalar) workflow to
>>> have
>>> anything close to the same speed. [3] Profiling indicates that I keep
>>> hitting
>>> _StringBuffer.grow. I don't know if that means the buffer isn't uniquely
>>> referenced, or it's a bug, or what, but it's consistently slower than
>>> creating
>>> an Array of the bytes and performing fromCString on it. Similar story
>>> with
>>> crossing the NSString bridge, which is even stranger. [4]
>>>
>>> Anyway, I wanted to stir up discussion, see if I'm way off base and/or
>>> whether
>>> this can be turned into a proposal.
>>>
>>> [0]:
>>> https://gist.github.com/zwaldowski/5f1a1011ea368e1c833e#file-fromcstring-swift
>>> [1]:
>>> https://github.com/apple/swift/blob/master/stdlib/public/core/CString.swift#L18-L31
>>> [2]:
>>> https://github.com/apple/swift/blob/master/stdlib/public/core/String.swift#L134-L150
>>> [3]:
>>> https://gist.github.com/zwaldowski/5f1a1011ea368e1c833e#file-unicodescalar-swift
>>> [4]:
>>> https://gist.github.com/zwaldowski/5f1a1011ea368e1c833e#file-nsstring-swift
>>>
>>> Cheers,
>>> Zachary Waldowski
>>> zach at waldowski.me
>> _______________________________________________
>> swift-evolution mailing list
>> swift-evolution at swift.org
>> https://lists.swift.org/mailman/listinfo/swift-evolution