[swift-evolution] [Draft proposal] Faster/lower-level external String initialization

Zach Waldowski zach at waldowski.me
Wed Jan 13 03:14:58 CST 2016


Max -

Great! Looking through it again, big +1 in favor of
`repairIllFormedSequences: true` being the normal path.
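
To make the repairing-versus-strict distinction concrete, here is a
minimal sketch. It assumes the spelling that shipped in Swift 3 and
later, `String.decodeCString(_:as:repairingInvalidCodeUnits:)`, rather
than anything settled in this thread, and the byte values are made up
for illustration.

```swift
// "H", then 0xFF (never valid in UTF-8), then "i", then the NUL terminator.
let cString: [UInt8] = [0x48, 0xFF, 0x69, 0x00]

// Repairing path: the bad byte is replaced with U+FFFD and the call succeeds.
let repaired = String.decodeCString(cString, as: UTF8.self,
                                    repairingInvalidCodeUnits: true)

// Strict path: any ill-formed sequence makes the whole decode fail.
let strict = String.decodeCString(cString, as: UTF8.self,
                                  repairingInvalidCodeUnits: false)

print(repaired?.result ?? "nil", repaired?.repairsMade ?? false)
print(strict == nil)
```

With repair as the default, callers who only want "some String" never
have to handle the failure case, while the strict path stays available
for those who do.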

I'm trying now to suss out the full gamut of methods that are needed, so
I can adapt the proposal + stdlib while backporting the changes from
3.0.

I'm in favor of two inits + decodeCString, with the latter sort of
becoming a "primitive". Just figuring out the best permutations of
those…

I might've mis-parsed the meaning of "'deprecation' magic". What's the
best path forward in the near-term? Would `decodeCString` be the only
one that becomes generic? Or, phrased differently, should there still be
`UnsafePointer<CChar>`+`strlen` versions?
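
As a rough illustration of the length-delimited case (no NUL
terminator, no `strlen`) and of the `Int8`/`UInt8` mismatch, the sketch
below uses `String(decoding:as:)`, which exists in Swift 4 and later and
is not the API being drafted here. The payload bytes are invented for
the example.

```swift
// A length-delimited payload as C might hand it over: signed bytes, no NUL.
// These are the UTF-8 bytes 63 61 66 C3 A9 reinterpreted as Int8.
let payload: [Int8] = [0x63, 0x61, 0x66, -61, -87]

let decoded = payload.withUnsafeBufferPointer { buffer in
    // Reinterpret each signed byte as unsigned on the fly. The lazy map is
    // itself a Collection of UInt8, so no intermediate array is allocated.
    String(decoding: buffer.lazy.map { UInt8(bitPattern: $0) }, as: UTF8.self)
}

print(decoded)
```

Because the buffer carries its own count, nothing here scans for a
terminator, and an embedded NUL would simply be decoded like any other
code unit.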

-- 
Zach Waldowski
zach at waldowski.me

On Tue, Jan 12, 2016, at 07:54 PM, Max Moiseev wrote:
> Zach, Charles. I’ll try to reply to both of you in one shot.
> 
> As @gribozavr pointed out in a private conversation,
> `UnsafeBufferPointer` conforms to CollectionType, so we can generalize
> String.decodeCString to accept a CollectionType and constrain it
> precisely as you, Zach, did in your proposal.
> (I remember there were some troubles with the fact that CChar is Int8
> (signed) and UTF8.CodeUnit is UInt8, but that might not affect this new
> method).
> 
> I don’t quite understand what you mean by ‘custom code-unit level
> transforms’, but maybe having a CollectionType can address that.
> 
> As for the proposal. This does not have to wait until Swift 3. The change
> I pointed at was a side effect of revisiting all the APIs in stdlib. So
> if you guys feel strongly about this change (and I think you do,
> otherwise you wouldn’t go as far as writing a proposal document), you can
> take what’s in the swift-3-api-guidelines branch, implement the new
> method we’ve discussed, add some ‘deprecation’ magic to make it
> compatible with Swift 2.1 and run it through the evolution process.
> 
> max
> 
> 
> > On Jan 12, 2016, at 12:22 PM, Zach Waldowski <zach at waldowski.me> wrote:
> > 
> > Max,
> > 
> > Seems like a fantastic change, if indeed the move is made from
> > UnsafePointer to UnsafeBufferPointer! That still doesn't cover the case
> > where you'd be doing code-unit level transforms (i.e., for custom
> > encoding schemes in some formats, like the Unicode escapes in JSON), but
> > that can probably also be done at the String level after-the-fact.
> > 
> > Awesome change, though! It'd be a shame to have to wait until 3.0 for it
> > to land.
> > 
> > -- 
> > Zach Waldowski
> > zach at waldowski.me
> > 
> > On Tue, Jan 12, 2016, at 02:57 PM, Max Moiseev wrote:
> >> Hi Zach,
> >> 
> >> We looked at the CString APIs as part of the API Naming Guidelines
> >> application effort.
> >> You can see the results here:
> >> https://github.com/apple/swift/commit/f4aaece75e97379db6ba0a1fdb1da42c231a1c3b
> >> 
> >> The main idea is to turn static factories into initializers and make
> >> init(cString:) do ‘most probably the right thing’, i.e. repair UTF8 code
> >> units.
> >> 
> >> Haven’t looked at your proposal in detail, but I think that if we add a
> >> new String.decodeCString that accepts an UnsafeBufferPointer instead of
> >> an UnsafePointer (and does not have to call _swift_stdlib_strlen), that
> >> would solve the problem. Unless I’m missing something.
> >> 
> >> regards,
> >> max
> >> 
> >>> On Jan 11, 2016, at 1:56 PM, Zach Waldowski via swift-evolution <swift-evolution at swift.org> wrote:
> >>> 
> >>> Given the initial positive response, I've taken a crack both at
> >>> implementation and converting the request to a proposal. The proposal
> >>> draft is located at:
> >>> 
> >>>   https://github.com/zwaldowski/swift-evolution/blob/string-from-code-units/proposals/0000-string-from-code-units.md
> >>> 
> >>> The code is located at:
> >>> 
> >>>   https://github.com/apple/swift/compare/master...zwaldowski:string-from-code-units
> >>> 
> >>> The proposal is reproduced below:
> >>> 
> >>> # Expose code unit initializers on String
> >>> 
> >>> * Proposal:
> >>> [SE-NNNN](https://github.com/apple/swift-evolution/blob/master/proposals/NNNN-string-from-code-units.md)
> >>> * Author: [Zachary Waldowski](https://github.com/zwaldowski)
> >>> * Status: **Awaiting review**
> >>> * Review manager: TBD
> >>> 
> >>> ## Introduction
> >>> 
> >>> Going back and forth from Strings to their byte representations is an
> >>> important part of solving many problems, including object
> >>> serialization, binary file formats, wire/network interfaces, and
> >>> cryptography. Swift has such utilities, currently only exposed through
> >>> `String.Type.fromCString(_:)`.
> >>> 
> >>> See swift-evolution
> >>> [thread](https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160104/005951.html).
> >>> 
> >>> ## Motivation
> >>> 
> >>> In developing a parser, a coworker did the yeoman's work of benchmarking
> >>> Swift's Unicode types. He swore up and down that
> >>> `String.Type.fromCString(_:)`
> >>> ([use](https://gist.github.com/zwaldowski/5f1a1011ea368e1c833e#file-fromcstring-swift))
> >>> was the fastest way he found. I, stubborn and noobish as I am, was
> >>> skeptical that a better way couldn't be wrought from Swift's
> >>> `UnicodeCodecType`s.
> >>> 
> >>> After reading through stdlib source and doing my own testing, this is
> >>> no wives' tale. `fromCString` is essentially the only public-facing
> >>> user of `String.Type._fromCodeUnitSequence(_:input:)`, which serves
> >>> the exact role of both efficient and safe
> >>> initialization-by-buffer-copy.
> >>> 
> >>> Of course, `fromCString` isn't a silver bullet; it has to have a null
> >>> sentinel, requiring a copy of the origin buffer if one needs to be
> >>> added (as is the case with formats that specify the length up front,
> >>> or unstructured payloads that use unescaped double quotes as the
> >>> terminator). It also prevents the string itself from containing the
> >>> null character.
> >>> 
> >>> ## Proposed solution
> >>> 
> >>> I'd like to expose `String.Type._fromCodeUnitSequence(_:input:)` as
> >>> public API:
> >>> 
> >>> ```swift
> >>> init?<Input: CollectionType, Encoding: UnicodeCodecType where
> >>> Encoding.CodeUnit == Input.Generator.Element>(codeUnits input: Input,
> >>> encoding: Encoding.Type)
> >>> ```
> >>> 
> >>> And, for consistency with
> >>> `String.Type.fromCStringRepairingIllFormedUTF8(_:)`,
> >>> exposing `String.Type._fromCodeUnitSequenceWithRepair(_:input:)`:
> >>> 
> >>> ```swift
> >>> static func fromCodeUnitsWithRepair<Input: CollectionType, Encoding:
> >>> UnicodeCodecType where Encoding.CodeUnit ==
> >>> Input.Generator.Element>(input: Input, encoding: Encoding.Type)
> >>> ```
> >>> 
> >>> ## Detailed design
> >>> 
> >>> See [full
> >>> implementation](https://github.com/apple/swift/compare/master...zwaldowski:string-from-code-units).
> >>> 
> >>> This is a fairly straightforward renaming of the internal APIs.
> >>> 
> >>> The initializer, its labels, and their order were chosen to match
> >>> other non-cast initializers in the stdlib. "Sequence" was removed, as
> >>> it was a misnomer. "input" was kept as a generic name in order to
> >>> allow for future refinements.
> >>> 
> >>> The static method made the same changes, but was otherwise kept as a
> >>> factory function due to its multiple return values.
> >>> 
> >>> `String.Type._fromWellFormedCodeUnitSequence(_:input:)` was kept as-is
> >>> for internal use. I assume it wouldn't be good to expose publicly
> >>> because, for lack of a better phrase, we only "trust" the stdlib to
> >>> accurately know the well-formedness of its code units. Since it is a
> >>> simple call-through, its use could be elided throughout the stdlib.
> >>> 
> >>> ## Impact on existing code
> >>> 
> >>> This is an additive change to the API.
> >>> 
> >>> ## Alternatives considered
> >>> 
> >>> * A protocol-oriented API.
> >>> 
> >>> Some kind of `func decode<Encoding>(_:)` on `SequenceType`. It's not
> >>> really clear this method would be related to string processing, and
> >>> would require some kind of bounding (like
> >>> `where Generator.Element: UnsignedIntegerType`), but that would be
> >>> introducing a type bound that doesn't exist on
> >>> 
> >>> * Do nothing.
> >>> 
> >>> This seems suboptimal. For many kinds of pure-Swift implementations,
> >>> `String` lacking this initializer is a limiting factor on performance.
> >>> 
> >>> * Make the `NSString` [bridge
> >>> faster](https://gist.github.com/zwaldowski/5f1a1011ea368e1c833e#file-nsstring-swift).
> >>> 
> >>> After reading the bridge code, I don't really know why it's slower.
> >>> Maybe it's a bug.
> >>> 
> >>> * Make `String.append(_:)`
> >>> [faster](https://gist.github.com/zwaldowski/5f1a1011ea368e1c833e#file-unicodescalar-swift).
> >>> 
> >>> I don't completely understand the growth strategy of `_StringCore`,
> >>> but it doesn't seem to exhibit the documented amortized `O(1)`, even
> >>> when `reserveCapacity(_:)` is used. In the pre-proposal discussion, a
> >>> user noted that `reserveCapacity` seems to act like a no-op.
> >>> 
> >>> ----
> >>> 
> >>> Cheers,
> >>> Zachary Waldowski
> >>> zach at waldowski.me
> >>> 
> >>> On Fri, Jan 8, 2016, at 03:21 PM, Zach Waldowski wrote:
> >>>> Going back and forth from Strings to their byte representations is an
> >>>> important part of solving many problems, including object
> >>>> serialization, binary file formats, wire/network interfaces, and
> >>>> cryptography.
> >>>> 
> >>>> In developing such a parser, a coworker did the yeoman's work of
> >>>> benchmarking
> >>>> Swift's Unicode types. He swore up and down that
> >>>> String.Type.fromCString(_:) [0]
> >>>> was the fastest way he found. I, stubborn and noobish as I am, was
> >>>> skeptical
> >>>> that a better way couldn't be wrought from Swift's UnicodeCodecTypes.
> >>>> 
> >>>> After reading through stdlib source and doing my own testing, this is no
> >>>> wives'
> >>>> tale. fromCString [1] is essentially the only public user of
> >>>> String.Type._fromCodeUnitSequence(_:input:), which serves the exact role
> >>>> of
> >>>> both efficient and safe initialization-by-buffer-copy.
> >>>> 
> >>>> Of course, fromCString isn't a silver bullet; it has to have a null
> >>>> sentinel,
> >>>> requiring a copy of the origin buffer if one needs to be added (as is
> >>>> the
> >>>> case with formats that specify the length up front, or unstructured
> >>>> payloads
> >>>> that use unescaped double quotes as the terminator). It also prevents
> >>>> the string
> >>>> itself from containing the null character.
> >>>> 
> >>>> I'd like to see _fromCodeUnitSequence [2] become public API as (just
> >>>> spitballing here) String.init?<Collection, Codec>(codeUnits:encoding:).
> >>>> If that can't happen, an alternative to fromCString that doesn't use
> >>>> strlen would be nice, and we can just eat the performance hit on
> >>>> other code unit sequences.
> >>>> 
> >>>> I can't really think of a reason why it's not exposed yet, so I'm led to
> >>>> believe
> >>>> I'm just missing something major, and not that a reason doesn't exist.
> >>>> ;-)
> >>>> 
> >>>> There's also discussion to be had about whether new API is needed at
> >>>> all. Try as I might, I can't seem to get the
> >>>> reserveCapacity/append(UnicodeScalar) workflow to have anything close
> >>>> to the same speed. [3] Profiling indicates that I keep hitting
> >>>> _StringBuffer.grow. I don't know if that means the buffer isn't
> >>>> uniquely referenced, or it's a bug, or what, but it's consistently
> >>>> slower than creating an Array of the bytes and performing fromCString
> >>>> on it. Similar story with crossing the NSString bridge, which is even
> >>>> stranger. [4]
> >>>> 
> >>>> Anyway, I wanted to stir up discussion, see if I'm way off base and/or
> >>>> whether
> >>>> this can be turned into a proposal.
> >>>> 
> >>>> [0]:
> >>>> https://gist.github.com/zwaldowski/5f1a1011ea368e1c833e#file-fromcstring-swift
> >>>> [1]:
> >>>> https://github.com/apple/swift/blob/master/stdlib/public/core/CString.swift#L18-L31
> >>>> [2]:
> >>>> https://github.com/apple/swift/blob/master/stdlib/public/core/String.swift#L134-L150
> >>>> [3]:
> >>>> https://gist.github.com/zwaldowski/5f1a1011ea368e1c833e#file-unicodescalar-swift
> >>>> [4]:
> >>>> https://gist.github.com/zwaldowski/5f1a1011ea368e1c833e#file-nsstring-swift
> >>>> 
> >>>> Cheers,
> >>>> Zachary Waldowski
> >>>> zach at waldowski.me
> >> 
> 

