[swift-evolution] [Draft proposal] Faster/lower-level external String initialization

Tue Jan 12 18:54:33 CST 2016

Zach, Charles. I’ll try to reply to both of you in one shot.

As @gribozavr pointed out in a private conversation, `UnsafeBufferPointer` conforms to CollectionType, so we can generalize String.decodeCString to accept a CollectionType and constrain it precisely as you, Zach, did in your proposal.
(I remember there were some troubles with the fact that CChar is Int8 (signed) and UTF8.CodeUnit is UInt8, but that might not affect this new method).
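
To make the shape concrete, here is a rough sketch of a length-based decode over any collection of code units. It is written in current Swift syntax (`Collection`, `UnicodeCodec`) rather than the Swift 2 names used in this thread. `decodeCodeUnits` and its labels are made up for illustration, and this is not the stdlib implementation. The `CChar` case is handled by mapping each element to `UInt8` before decoding.

```swift
// Hypothetical helper, not stdlib API: decodes `input` by length, so no NUL
// terminator is required and no strlen call is made.
// Returns nil on ill-formed input unless `repairing` is true, in which case
// each decoding error is replaced with U+FFFD.
func decodeCodeUnits<Input: Collection, Encoding: UnicodeCodec>(
    _ input: Input,
    as encoding: Encoding.Type,
    repairing: Bool
) -> (result: String, repairsMade: Bool)? where Input.Element == Encoding.CodeUnit {
    let replacement: Unicode.Scalar = "\u{FFFD}"
    var decoder = Encoding()
    var iterator = input.makeIterator()
    var result = ""
    var repairsMade = false
    while true {
        switch decoder.decode(&iterator) {
        case .scalarValue(let scalar):
            result.unicodeScalars.append(scalar)
        case .emptyInput:
            return (result, repairsMade)
        case .error:
            guard repairing else { return nil }
            result.unicodeScalars.append(replacement)
            repairsMade = true
        }
    }
}

// A length-delimited payload with an embedded NUL decodes in full.
let bytes: [UInt8] = [0x68, 0x69, 0x00, 0x21]
let fromBuffer = bytes.withUnsafeBufferPointer {
    decodeCodeUnits($0, as: UTF8.self, repairing: false)
}

// CChar may be signed, so reinterpret each element as UInt8 before decoding.
let cChars: [CChar] = [72, 105]
let fromCChars = decodeCodeUnits(
    cChars.lazy.map { UInt8(truncatingIfNeeded: $0) },
    as: UTF8.self,
    repairing: true
)
```

Returning an optional tuple keeps both behaviours (fail on error, or repair and report) behind one entry point; whether that is the right split is of course up for discussion.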

I don’t quite understand what you mean by ‘custom code-unit level transforms’, but maybe having a CollectionType can address that.
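
If the JSON case from your message quoted below is the kind of thing you have in mind, one possible reading is a pass over the code units that rewrites them before decoding. A rough sketch in current Swift syntax follows. `expandUnicodeEscapes` and `hexValue` are made-up names, only `\uXXXX` is handled, and `String(decoding:as:)` is today's stdlib spelling, which this thread predates.

```swift
// Hypothetical helper: value of one ASCII hex digit, or nil.
func hexValue(_ byte: UInt8) -> UInt16? {
    switch byte {
    case 0x30...0x39: return UInt16(byte - 0x30)       // "0"..."9"
    case 0x41...0x46: return UInt16(byte - 0x41) + 10  // "A"..."F"
    case 0x61...0x66: return UInt16(byte - 0x61) + 10  // "a"..."f"
    default: return nil
    }
}

// Hypothetical helper: expands JSON-style \uXXXX escapes found in UTF-8 input.
// Other escapes (\n, \", ...) are left untouched to keep the sketch short.
func expandUnicodeEscapes<Input: Collection>(_ input: Input) -> String
    where Input.Element == UInt8 {
    let bytes = Array(input)
    var result = ""
    var plain: [UInt8] = []     // literal UTF-8 bytes not yet decoded
    var escaped: [UInt16] = []  // UTF-16 units from adjacent escapes
    var i = 0
    while i < bytes.count {
        var unit: UInt16? = nil
        // Match a backslash, "u", and exactly four hex digits.
        if bytes[i] == 0x5C, i + 6 <= bytes.count, bytes[i + 1] == 0x75 {
            unit = 0
            for byte in bytes[(i + 2)..<(i + 6)] {
                guard let digit = hexValue(byte), let soFar = unit else {
                    unit = nil
                    break
                }
                unit = soFar << 4 | digit
            }
        }
        if let unit = unit {
            // Decode pending literal bytes, then queue the escaped unit so
            // that surrogate pairs spanning two escapes are decoded together.
            result += String(decoding: plain, as: UTF8.self)
            plain.removeAll()
            escaped.append(unit)
            i += 6
        } else {
            result += String(decoding: escaped, as: UTF16.self)
            escaped.removeAll()
            plain.append(bytes[i])
            i += 1
        }
    }
    // At most one of the two queues is non-empty here.
    result += String(decoding: escaped, as: UTF16.self)
    result += String(decoding: plain, as: UTF8.self)
    return result
}

let json = Array("caf\\u00E9 \\uD83D\\uDE00".utf8)
let text = expandUnicodeEscapes(json)  // "café 😀"
```

Because the function only needs a collection of `UInt8`, it works the same on an array, a slice, or an `UnsafeBufferPointer`.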

As for the proposal: this does not have to wait until Swift 3. The change I pointed at was a side effect of revisiting all the APIs in stdlib. So if you guys feel strongly about this change (and I think you do, otherwise you wouldn’t go as far as writing a proposal document), you can take what’s in the swift-3-api-guidelines branch, implement the new method we’ve discussed, add some ‘deprecation’ magic to make it compatible with Swift 2.1, and run it through the evolution process.
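
The ‘deprecation’ part could look roughly like this: the new initializer does the work, and the old factory stays around as a deprecated forwarding shim. The sketch uses current Swift attribute syntax and made-up names (`init(repairingUTF8:)`, `fromUTF8RepairingIllFormed`) as stand-ins; these are not the real stdlib declarations.

```swift
extension String {
    // New-style initializer (hypothetical name): decodes by length and
    // repairs ill-formed UTF-8 with U+FFFD.
    init<Input: Collection>(repairingUTF8 input: Input) where Input.Element == UInt8 {
        self.init(decoding: input, as: UTF8.self)
    }

    // Old-style static factory (hypothetical name), kept so existing callers
    // still compile; the attribute makes the compiler suggest the new spelling.
    @available(*, deprecated, renamed: "init(repairingUTF8:)")
    static func fromUTF8RepairingIllFormed<Input: Collection>(_ input: Input) -> String
        where Input.Element == UInt8 {
        return String(repairingUTF8: input)
    }
}

let viaInitializer = String(repairingUTF8: [0x6F, 0x6B] as [UInt8])
// Still compiles, but with a deprecation warning pointing at the initializer.
let viaFactory = String.fromUTF8RepairingIllFormed([0x6F, 0x6B] as [UInt8])
```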

max

> On Jan 12, 2016, at 12:22 PM, Zach Waldowski <zach at waldowski.me> wrote:
> 
> Max,
> 
> Seems like a fantastic change, if indeed the move is made from
> UnsafePointer to UnsafeBufferPointer! That still doesn't cover the case
> where you'd be doing code-unit level transforms (i.e., for custom
> encoding schemes in some formats, like the Unicode escapes in JSON), but
> that can probably also be done at the String level after-the-fact.
> 
> Awesome change, though! It'd be a shame to have to wait until 3.0 for it
> to land.
> 
> -- 
> Zach Waldowski
> zach at waldowski.me
> 
> On Tue, Jan 12, 2016, at 02:57 PM, Max Moiseev wrote:
>> Hi Zach,
>> 
>> We looked at the CString APIs as part of API Naming Guidelines
>> application effort.
>> You can see the results here:
>> https://github.com/apple/swift/commit/f4aaece75e97379db6ba0a1fdb1da42c231a1c3b
>> 
>> The main idea is to turn static factories into initializers and make
>> init(cString:) do ‘most probably the right thing’, i.e. repair UTF8 code
>> units.
>> 
>> Haven’t looked at your proposal in detail, but I think that if we add a
>> new String.decodeCString that accepts an UnsafeBufferPointer instead of
>> an UnsafePointer (and does not have to call _swift_stdlib_strlen), that
>> would solve the problem. Unless I’m missing something.
>> 
>> regards,
>> max
>> 
>>> On Jan 11, 2016, at 1:56 PM, Zach Waldowski via swift-evolution <swift-evolution at swift.org> wrote:
>>> 
>>> Given the initial positive response, I've taken a crack both at
>>> implementation and converting the request to a proposal. The proposal
>>> draft is located at:
>>> 
>>>   https://github.com/zwaldowski/swift-evolution/blob/string-from-code-units/proposals/0000-string-from-code-units.md
>>> 
>>> The code is located at:
>>> 
>>>   https://github.com/apple/swift/compare/master...zwaldowski:string-from-code-units
>>> 
>>> The proposal is reproduced below:
>>> 
>>> # Expose code unit initializers on String
>>> 
>>> * Proposal:
>>> [SE-NNNN](https://github.com/apple/swift-evolution/blob/master/proposals/NNNN-string-from-code-units.md)
>>> * Author: [Zachary Waldowski](https://github.com/zwaldowski)
>>> * Status: **Awaiting review**
>>> * Review manager: TBD
>>> 
>>> ## Introduction
>>> 
>>> Going back and forth from Strings to their byte representations is an
>>> important part of solving many problems, including object
>>> serialization, binary file formats, wire/network interfaces, and
>>> cryptography. Swift has such utilities, currently only exposed through
>>> `String.Type.fromCString(_:)`.
>>> 
>>> See swift-evolution
>>> [thread](https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160104/005951.html).
>>> 
>>> ## Motivation
>>> 
>>> In developing a parser, a coworker did the yeoman's work of benchmarking
>>> Swift's Unicode types. He swore up and down that
>>> `String.Type.fromCString(_:)`
>>> ([use](https://gist.github.com/zwaldowski/5f1a1011ea368e1c833e#file-fromcstring-swift))
>>> was the fastest way he found. I, stubborn and noobish as I am, was
>>> skeptical that a better way couldn't be wrought from Swift's
>>> `UnicodeCodecType`s.
>>> 
>>> After reading through stdlib source and doing my own testing, this is no
>>> wives' tale. `fromCString` is essentially the only public-facing user of
>>> `String.Type._fromCodeUnitSequence(_:input:)`, which serves the exact
>>> role of both efficient and safe initialization-by-buffer-copy.
>>> 
>>> Of course, `fromCString` isn't a silver bullet; it has to have a null
>>> sentinel, requiring a copy of the origin buffer if one needs to be added
>>> (as is the case with formats that specify the length up front, or
>>> unstructured payloads that use unescaped double quotes as the
>>> terminator). It also prevents the string itself from containing the
>>> null character.
>>> 
>>> ## Proposed solution
>>> 
>>> I'd like to expose `String.Type._fromCodeUnitSequence(_:input:)` as
>>> public API:
>>> 
>>> ```swift
>>> init?<Input: CollectionType, Encoding: UnicodeCodecType where
>>> Encoding.CodeUnit == Input.Generator.Element>(codeUnits input: Input,
>>> encoding: Encoding.Type)
>>> ```
>>> 
>>> And, for consistency with
>>> `String.Type.fromCStringRepairingIllFormedUTF8(_:)`,
>>> exposing `String.Type._fromCodeUnitSequenceWithRepair(_:input:)`:
>>> 
>>> ```swift
>>> static func fromCodeUnitsWithRepair<Input: CollectionType, Encoding:
>>> UnicodeCodecType where Encoding.CodeUnit ==
>>> Input.Generator.Element>(input: Input, encoding: Encoding.Type)
>>> ```
>>> 
>>> ## Detailed design
>>> 
>>> See [full
>>> implementation](https://github.com/apple/swift/compare/master...zwaldowski:string-from-code-units).
>>> 
>>> This is a fairly straightforward renaming of the internal APIs.
>>> 
>>> The initializer, its labels, and their order were chosen to match other
>>> non-cast initializers in the stdlib. "Sequence" was removed, as it was a
>>> misnomer. "input" was kept as a generic name in order to allow for future
>>> refinements.
>>> 
>>> The static initializer made the same changes, but was otherwise kept as
>>> a factory function due to its multiple return values.
>>> 
>>> `String.Type._fromWellFormedCodeUnitSequence(_:input:)` was kept as-is
>>> for internal use. I assume it wouldn't be good to expose publicly
>>> because, for lack of a better phrase, we only "trust" the stdlib to
>>> accurately know the wellformedness of their code units. Since it is a
>>> simple call through, its use could be elided throughout the stdlib.
>>> 
>>> ## Impact on existing code
>>> 
>>> This is an additive change to the API.
>>> 
>>> ## Alternatives considered
>>> 
>>> * A protocol-oriented API.
>>> 
>>> Some kind of `func decode<Encoding>(_:)` on `SequenceType`. It's not
>>> really clear this method would be related to string processing, and would
>>> require some kind of bounding (like `where Generator.Element:
>>> UnsignedIntegerType`), but that would be introducing a type bound that
>>> doesn't exist on
>>> 
>>> * Do nothing.
>>> 
>>> This seems suboptimal. For many kinds of pure-Swift implementations,
>>> `String` lacking this constructor is a limiting factor on performance.
>>> 
>>> * Make the `NSString` [bridge
>>> faster](https://gist.github.com/zwaldowski/5f1a1011ea368e1c833e#file-nsstring-swift).
>>> 
>>> After reading the bridge code, I don't really know why it's slower.
>>> Maybe it's a bug.
>>> 
>>> * Make `String.append(_:)`
>>> [faster](https://gist.github.com/zwaldowski/5f1a1011ea368e1c833e#file-unicodescalar-swift).
>>> 
>>> I don't completely understand the growth strategy of `_StringCore`, but
>>> it doesn't seem to exhibit the documented amortized `O(1)`, even when
>>> `reserveCapacity(_:)` is used. In the pre-proposal discussion, a user
>>> noted that it seems like `reserveCapacity` acts like a no-op.
>>> 
>>> ----
>>> 
>>> Cheers,
>>> Zachary Waldowski
>>> zach at waldowski.me
>>> 
>>> On Fri, Jan 8, 2016, at 03:21 PM, Zach Waldowski wrote:
>>>> Going back and forth from Strings to their byte representations is an
>>>> important part of solving many problems, including object
>>>> serialization, binary file formats, wire/network interfaces, and
>>>> cryptography.
>>>> 
>>>> In developing such a parser, a coworker did the yeoman's work of
>>>> benchmarking Swift's Unicode types. He swore up and down that
>>>> String.Type.fromCString(_:) [0] was the fastest way he found. I,
>>>> stubborn and noobish as I am, was skeptical that a better way couldn't
>>>> be wrought from Swift's UnicodeCodecTypes.
>>>> 
>>>> After reading through stdlib source and doing my own testing, this is no
>>>> wives' tale. fromCString [1] is essentially the only public user of
>>>> String.Type._fromCodeUnitSequence(_:input:), which serves the exact role
>>>> of both efficient and safe initialization-by-buffer-copy.
>>>> 
>>>> Of course, fromCString isn't a silver bullet; it has to have a null
>>>> sentinel, requiring a copy of the origin buffer if one needs to be added
>>>> (as is the case with formats that specify the length up front, or
>>>> unstructured payloads that use unescaped double quotes as the
>>>> terminator). It also prevents the string itself from containing the
>>>> null character.
>>>> 
>>>> I'd like to see _fromCodeUnitSequence [2] become public API as (just
>>>> spitballing here) String.init?<Collection, Codec>(codeUnits:encoding:).
>>>> If that can't happen, an alternative to fromCString that doesn't use
>>>> strlen would be nice, and we can just eat the performance hit on other
>>>> code unit sequences.
>>>> 
>>>> I can't really think of a reason why it's not exposed yet, so I'm led to
>>>> believe I'm just missing something major, and not that a reason doesn't
>>>> exist. ;-)
>>>> 
>>>> There's also a discussion to be had about whether new API is needed at
>>>> all. Try as I might, I can't seem to get the
>>>> reserveCapacity/append(UnicodeScalar) workflow to have anything close to
>>>> the same speed. [3] Profiling indicates that I keep hitting
>>>> _StringBuffer.grow. I don't know if that means the buffer isn't uniquely
>>>> referenced, or it's a bug, or what, but it's consistently slower than
>>>> creating an Array of the bytes and performing fromCString on it. Similar
>>>> story with crossing the NSString bridge, which is even stranger. [4]
>>>> 
>>>> Anyway, I wanted to stir up discussion, see if I'm way off base and/or
>>>> whether
>>>> this can be turned into a proposal.
>>>> 
>>>> [0]:
>>>> https://gist.github.com/zwaldowski/5f1a1011ea368e1c833e#file-fromcstring-swift
>>>> [1]:
>>>> https://github.com/apple/swift/blob/master/stdlib/public/core/CString.swift#L18-L31
>>>> [2]:
>>>> https://github.com/apple/swift/blob/master/stdlib/public/core/String.swift#L134-L150
>>>> [3]:
>>>> https://gist.github.com/zwaldowski/5f1a1011ea368e1c833e#file-unicodescalar-swift
>>>> [4]:
>>>> https://gist.github.com/zwaldowski/5f1a1011ea368e1c833e#file-nsstring-swift
>>>> 
>>>> Cheers,
>>>> Zachary Waldowski
>>>> zach at waldowski.me
>>