[swift-evolution] [Draft proposal] Faster/lower-level external String initialization

Zach Waldowski zach at waldowski.me
Mon Jan 11 15:56:24 CST 2016


Given the initial positive response, I've taken a crack both at
implementation and converting the request to a proposal. The proposal
draft is located at:

    https://github.com/zwaldowski/swift-evolution/blob/string-from-code-units/proposals/0000-string-from-code-units.md

The code is located at:

    https://github.com/apple/swift/compare/master...zwaldowski:string-from-code-units

The proposal is reproduced below:

# Expose code unit initializers on String

* Proposal: [SE-NNNN](https://github.com/apple/swift-evolution/blob/master/proposals/NNNN-string-from-code-units.md)
* Author: [Zachary Waldowski](https://github.com/zwaldowski)
* Status: **Awaiting review**
* Review manager: TBD

## Introduction

Going back and forth from Strings to their byte representations is an
important part of solving many problems, including object
serialization, binary file formats, wire/network interfaces, and
cryptography. Swift has such utilities, but they are currently exposed
only through `String.Type.fromCString(_:)`.

See swift-evolution
[thread](https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160104/005951.html).

## Motivation

In developing a parser, a coworker did the yeoman's work of benchmarking
Swift's Unicode types. He swore up and down that
`String.Type.fromCString(_:)`
([use](https://gist.github.com/zwaldowski/5f1a1011ea368e1c833e#file-fromcstring-swift))
was the fastest way he found. I, stubborn and noobish as I am, was
skeptical that a better way couldn't be wrought from Swift's
`UnicodeCodecType`s.

After reading through the stdlib source and doing my own testing, I can
confirm this is no old wives' tale. `fromCString` is essentially the
only public-facing user of
`String.Type._fromCodeUnitSequence(_:input:)`, which serves the exact
role of both efficient and safe initialization-by-buffer-copy.

Of course, `fromCString` isn't a silver bullet; it requires a null
sentinel, which forces a copy of the origin buffer if one needs to be
added (as is the case with formats that specify the length up front, or
unstructured payloads that use unescaped double quotes as the
terminator). It also prevents the string itself from containing the
null character.
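Both limitations can be seen in a small sketch. It is written against current Swift, so `String(cString:)` stands in for `fromCString`, and the length-prefixed `payload` layout is a made-up example rather than any real format.

```swift
// Hypothetical length-prefixed payload: one length byte, then that many
// UTF-8 bytes, then unrelated data. There is no NUL terminator.
let payload: [UInt8] = [5, 0x68, 0x65, 0x6C, 0x6C, 0x6F, 0xFF, 0xFF]
let length = Int(payload[0])
let body = payload[1 ..< 1 + length]

// The C-string route needs a sentinel, so the bytes are copied into a
// fresh array just to append the trailing 0.
var terminated = Array(body)
terminated.append(0)
let viaCString = terminated.withUnsafeBufferPointer { buffer in
    String(cString: buffer.baseAddress!)
}

// An embedded NUL ends the C string early, so the "b" is silently lost.
let withEmbeddedNul: [UInt8] = [0x61, 0x00, 0x62, 0x00]
let truncated = withEmbeddedNul.withUnsafeBufferPointer { buffer in
    String(cString: buffer.baseAddress!)
}

print(viaCString)            // hello
print(truncated.utf8.count)  // 1
```

The extra `Array(body)` copy exists only to make room for the sentinel, which is the cost a length-aware initializer would avoid.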

## Proposed solution

I'd like to expose `String.Type._fromCodeUnitSequence(_:input:)` as
public API:

```swift
init?<Input: CollectionType, Encoding: UnicodeCodecType where
Encoding.CodeUnit == Input.Generator.Element>(codeUnits input: Input,
encoding: Encoding.Type)
```

And, for consistency with
`String.Type.fromCStringRepairingIllFormedUTF8(_:)`, I'd also like to
expose `String.Type._fromCodeUnitSequenceWithRepair(_:input:)`:

```swift
static func fromCodeUnitsWithRepair<Input: CollectionType, Encoding:
UnicodeCodecType where Encoding.CodeUnit ==
Input.Generator.Element>(input: Input, encoding: Encoding.Type)
```
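To make the call site of the failable initializer concrete, here is a minimal sketch. It uses current Swift spelling (`Collection`, `UnicodeCodec`, a trailing `where` clause) rather than the Swift 2 syntax of the signatures above, and the body is a hypothetical scalar-by-scalar stand-in, not the buffer-copy implementation in the stdlib.

```swift
extension String {
    // Hypothetical stand-in for the proposed initializer, for illustration
    // only. It decodes one scalar at a time instead of copying a buffer.
    init?<Input: Collection, Encoding: UnicodeCodec>(
        codeUnits input: Input, encoding: Encoding.Type
    ) where Input.Element == Encoding.CodeUnit {
        var decoder = Encoding()
        var iterator = input.makeIterator()
        var result = String()
        decoding: while true {
            switch decoder.decode(&iterator) {
            case .scalarValue(let scalar):
                result.unicodeScalars.append(scalar)
            case .emptyInput:
                break decoding
            case .error:
                return nil  // ill-formed input fails the initializer
            }
        }
        self = result
    }
}

let utf8Bytes: [UInt8] = [0x61, 0x00, 0x62]   // "a", NUL, "b"
let utf16Units: [UInt16] = [0x0068, 0x0069]   // "hi"
let illFormed: [UInt8] = [0x61, 0xFF]         // 0xFF is never valid UTF-8

let withNul = String(codeUnits: utf8Bytes, encoding: UTF8.self)
let fromUTF16 = String(codeUnits: utf16Units, encoding: UTF16.self)
let rejected = String(codeUnits: illFormed, encoding: UTF8.self)
```

No sentinel is needed, the embedded NUL survives in `withNul`, the same call shape works for UTF-16 input, and `rejected` is `nil` because the input is ill-formed.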

## Detailed design

See [full
implementation](https://github.com/apple/swift/compare/master...zwaldowski:string-from-code-units).

This is a fairly straightforward renaming of the internal APIs.

The initializer, its labels, and their order were chosen to match other
non-cast initializers in the stdlib. "Sequence" was removed, as it was
a misnomer. "input" was kept as a generic name in order to allow for
future refinements.

The static function received the same changes, but was otherwise kept
as a factory function because it returns multiple values.
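A sketch of why the repairing variant reads more naturally as a factory function follows. It uses current Swift spelling, and the `(String, hadError: Bool)` tuple shape is an assumption about what the multiple return values look like. The body is a hypothetical stand-in rather than the stdlib implementation.

```swift
extension String {
    // Hypothetical stand-in; the tuple return shape is an assumption.
    static func fromCodeUnitsWithRepair<Input: Collection, Encoding: UnicodeCodec>(
        input: Input, encoding: Encoding.Type
    ) -> (String, hadError: Bool) where Input.Element == Encoding.CodeUnit {
        var decoder = Encoding()
        var iterator = input.makeIterator()
        var result = String()
        var hadError = false
        decoding: while true {
            switch decoder.decode(&iterator) {
            case .scalarValue(let scalar):
                result.unicodeScalars.append(scalar)
            case .emptyInput:
                break decoding
            case .error:
                // Substitute U+FFFD REPLACEMENT CHARACTER and keep going.
                result.unicodeScalars.append("\u{FFFD}")
                hadError = true
            }
        }
        return (result, hadError: hadError)
    }
}

let damaged: [UInt8] = [0x61, 0xFF, 0x62]  // "a", an invalid byte, "b"
let (repaired, hadError) = String.fromCodeUnitsWithRepair(
    input: damaged, encoding: UTF8.self)
```

An initializer can only produce the string itself, so reporting `hadError` alongside it calls for a function that returns both values.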

`String.Type._fromWellFormedCodeUnitSequence(_:input:)` was kept as-is
for internal use. I assume it wouldn't be good to expose publicly
because, for lack of a better phrase, we only "trust" the stdlib to
accurately know the well-formedness of its code units. Since it is a
simple call through, its use could be elided throughout the stdlib.

## Impact on existing code

This is an additive change to the API.

## Alternatives considered

* A protocol-oriented API.

Some kind of `func decode<Encoding>(_:)` on `SequenceType`. It's not
really clear this method would be related to string processing, and it
would require some kind of bounding (like
`where Generator.Element: UnsignedIntegerType`), but that would be
introducing a type bound that doesn't exist on
* Do nothing.

This seems suboptimal. `String` lacking this initializer is a limiting
factor on performance for many kinds of pure-Swift implementations.

* Make the `NSString` [bridge
faster](https://gist.github.com/zwaldowski/5f1a1011ea368e1c833e#file-nsstring-swift).

After reading the bridge code, I don't really know why it's slower.
Maybe it's a bug.

* Make `String.append(_:)`
[faster](https://gist.github.com/zwaldowski/5f1a1011ea368e1c833e#file-unicodescalar-swift).

I don't completely understand the growth strategy of `_StringCore`, but
it doesn't seem to exhibit the documented amortized `O(1)`, even when
`reserveCapacity(_:)` is used. In the pre-proposal discussion, a user
noted that it seems like `reserveCapacity` acts like a no-op.
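For reference, the reserve-then-append workflow discussed in the last alternative looks roughly like the sketch below. It is not the linked benchmark. It uses current Swift spelling, where scalars are appended through `unicodeScalars`, and the sample text is arbitrary.

```swift
// Arbitrary sample input: the UTF-8 bytes of a short string.
let bytes: [UInt8] = Array("héllo wörld".utf8)

var decoder = UTF8()
var iterator = bytes.makeIterator()
var string = String()

// Reserve storage up front, with the aim of avoiding regrowth while
// appending. The byte count is an upper bound on the UTF-8 size needed.
string.reserveCapacity(bytes.count)

decoding: while true {
    switch decoder.decode(&iterator) {
    case .scalarValue(let scalar):
        string.unicodeScalars.append(scalar)  // one append per scalar
    case .emptyInput, .error:
        break decoding                        // stop at end or bad input
    }
}
```

Every scalar goes through a separate append, so the loop depends entirely on the reservation being honored to stay at amortized `O(1)` per append.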

----

Cheers,
Zachary Waldowski
zach at waldowski.me

On Fri, Jan 8, 2016, at 03:21 PM, Zach Waldowski wrote:
> Going back and forth from Strings to their byte representations is an
> important part of solving many problems, including object
> serialization, binary file formats, wire/network interfaces, and
> cryptography.
> 
> In developing such a parser, a coworker did the yeoman's work of
> benchmarking
> Swift's Unicode types. He swore up and down that
> String.Type.fromCString(_:) [0]
> was the fastest way he found. I, stubborn and noobish as I am, was
> skeptical
> that a better way couldn't be wrought from Swift's UnicodeCodecTypes.
> 
> After reading through stdlib source and doing my own testing, this is no
> wives'
> tale. fromCString [1] is essentially the only public user of
> String.Type._fromCodeUnitSequence(_:input:), which serves the exact role
> of
> both efficient and safe initialization-by-buffer-copy.
> 
> Of course, fromCString isn't a silver bullet; it has to have a null
> sentinel,
> requiring a copy of the origin buffer if one needs to be added (as is
> the
> case with formats that specify the length up front, or unstructured
> payloads
> that use unescaped double quotes as the terminator). It also prevents
> the string
> itself from containing the null character.
> 
> I'd like to see _fromCodeUnitSequence [2] become public API as (just
> spitballing here) String.init?<Collection, Codec>(codeUnits:encoding:).
> If that
> can't happen, an alternative to fromCString that doesn't use strlen
> would be
> nice, and we can just eat the performance hit on other code unit
> sequences.
> 
> I can't really think of a reason why it's not exposed yet, so I'm led to
> believe
> I'm just missing something major, and not that a reason doesn't exist.
> ;-)
> 
> There's also discussion to be had about whether API is needed. Try as I might, I
> can't seem to get the reserveCapacity/append(UnicodeScalar) workflow to
> have
> anything close to the same speed. [3] Profiling indicates that I keep
> hitting
> _StringBuffer.grow. I don't know if that means the buffer isn't uniquely
> referenced, or it's a bug, or what, but it's consistently slower than
> creating
> an Array of the bytes and performing fromCString on it. Similar story
> with
> crossing the NSString bridge, which is even stranger. [4]
> 
> Anyway, I wanted to stir up discussion, see if I'm way off base and/or
> whether
> this can be turned into a proposal.
> 
> [0]:
> https://gist.github.com/zwaldowski/5f1a1011ea368e1c833e#file-fromcstring-swift
> [1]:
> https://github.com/apple/swift/blob/master/stdlib/public/core/CString.swift#L18-L31
> [2]:
> https://github.com/apple/swift/blob/master/stdlib/public/core/String.swift#L134-L150
> [3]:
> https://gist.github.com/zwaldowski/5f1a1011ea368e1c833e#file-unicodescalar-swift
> [4]:
> https://gist.github.com/zwaldowski/5f1a1011ea368e1c833e#file-nsstring-swift
> 
> Cheers,
> Zachary Waldowski
> zach at waldowski.me

