[swift-evolution] [pitch] Substring performance affordances
Dave Abrahams
dabrahams at apple.com
Sun Jul 2 23:55:43 CDT 2017
on Wed Jun 28 2017, Ben Cohen <swift-evolution at swift.org> wrote:
> Hi swift-evolution,
>
> Below is a short pitch for some performance improvements to be made to String to accommodate
> Substrings.
>
> As outlined in SE-0163, the more general question of implicit conversion from Substring to String
> was deferred pending feedback on the initial implementation. To date, the feedback we’ve received
> hasn’t suggested that such an implicit conversion is necessary – that migrations from 3.2 to 4.0
> haven’t led to an extensive need to perform Substring->String conversion. Any further input, either
> on or off list, about this particular aspect of migration would be very gratefully received.
>
> # Substring performance affordances
>
> * Proposal: [SE-NNNN](NNNN-substring-affordances.md)
> * Authors: [Ben Cohen](https://github.com/airspeedswift <https://github.com/airspeedswift>)
> * Review Manager: TBD
> * Status: **Awaiting review**
>
> ## Introduction
>
> This proposal modifies a small number of methods in the standard library that
> are commonly used with the `Substring` type:
>
> - Modify the `init` on floating point and integer types, to construct them
> from `StringProtocol` rather than `String`.
+1. This just makes good sense.
>
> - Change `join` to be an extension `where Element: StringProtocol`
I don't see how this is implementable unless we ignore the May 10
core team resolution that RangeReplaceableCollection conformance be
removed from StringProtocol, or we add RangeReplaceableCollection to the
constraints on Element. We had good reasons for the resolution, so I
think we'd need to do the latter.
That said, I presume this is being called "joined()" per
https://github.com/apple/swift/pull/10734... I'm concerned that we
haven't sufficiently considered the forest of joined() overloads we're
expanding. For all other sequences, joined() produces a lazy result. I
realize that joining strings to produce a string is a really important
operation, but
String(myStrings.joined())
is not so onerous as to make it an obvious choice to add this overload
to StringProtocol, especially since the other ones will still appear on
String thanks to its Sequence conformance. String's conformance to
Collection gets confusing when these kinds of things happen. Instead,
we should be considering dropping the eager overload from String for
Swift 4. Or, if we think it's important enough to avoid the explicit
conversion to string, maybe we should be thinking about making the other
joined()s eager.
> - Add extensions to `Dictionary where Key == String` and `Set where
> Element == String` to test for presence of a `Substring`.
Here's why I don't think we should do this:
* As noted in your introduction, we deliberately avoided taking steps
that would make Substrings transparently work where Strings are now
accepted so that we'd find out whether the impedance mismatch was
going to be an issue. We've had very little time to get information
about String/Substring interop issues that people might face and
doing this now would mute the signal.
* The proposal's stated motivation is to work around a performance
problem, and it does so by adding a very specific, special-purpose
piece of API. But the performance issues are actually more general:
- Sets and dictionaries offer no way to look up a key and, when the
key isn't found, use the bucket information computed by the lookup
for a subsequent insertion.
extension Set {
func toggle(_ k: Element) {
let p = s.lookup(k) // <==== for example
if p.found {
s.remove(at: p.index)
}
else {
s.insert(k, at: p)
}
}
}
- Substring offers no way to create a corresponding String that avoids
a copy when necessary for performance reasons:
for w: Substring in words {
result += transform(
String(maintainingExcessStorageBySharing: w)) // <== for example
}
If transform is an API that—as we recommend by default—operates on
String rather than Substring or StringProtocol, and is known not to
make long-term copies of its argument, users currently have no way
to avoid the performance cost of a copy. It's not something people
should use every day, but when you need it, you need it.
Together these APIs can be used to solve the Substring performance
problem with Set<String>/Dictionary<String,V> insertion, and a whole
raft of others besides.
let p = s.lookup(String(maintainingExcessStorageBySharing: w))
if !p.found { s.insert(String(w), at: p)
* The proposed API adds overloads, which harms usability, and having
these particular overloads is IMO not clearly the right answer in the
long term. Removing APIs later is expensive in labor, harmful to
users, and sometimes difficult technically.
* Handling these cases transparently sets up an expectation that other
such cases should be handled transparently as well. The proposal
itself raises many such possibilities, and the implications seem to be
lots of API complexity creep and things that magically work in some
cases but not in others. Even if we refuse to extend this pattern to
other APIs, failure to meet expectations and the resulting confusion
is a bad user experience.
In sum, we should be quite sure of this API before we add it, and I, at
least, am not. If we *do* decide to add it, we should at the same time
establish some general principles that drive its addition so that we
have a consistent basis on which to make the same kind of decision about
other APIs.
Thanks for listening,
Dave
> ## Motivation
>
> Swift 4 introduced `Substring` as the slice type for `String`. Previously,
> `String` had been its own slice type, but this leads to issues where string
> buffers can be unexpectedly retained. This approach was adopted instead of the
> alternative of having the slicing operation make a copy. A copying slicing
> operation would have negative performance consequences, and would also conflict
> with the requirement that `Collection` be sliceable in constant time. In cases
> where an API requires a `String`, the user must construct a new `String` from a
> `Substring`. This can be thought of as a "deferral" of the copy that was
> avoided at the time of the slice.
>
> There are a few places in the standard library where it is notably inefficient
> to force a copy of a substring in order to use it with a string: performing
> lookups in hashed containers, joining substrings, and converting substrings to
> integers. In particular, these operations are likely to be used inside a loop
> over a number of substrings extracted from a string. For example, suppose you
> had a string of key/value pairs, where the values were integers and you wanted
> to sum them by key. You would be forced to convert both the `Substring` keys
> and values to `String` to do this.
>
> ## Proposed solution
>
> Add the following to the standard library:
>
> ```swift
> extension FixedWidthInteger {
> public init?<S : StringProtocol>(_ text: S, radix: Int = 10)
> }
>
> extension Float/Double/Float80 {
> public init?<S : StringProtocol>(_ text: S, radix: Int = 10)
> }
>
> extension Sequence where Element: StringProtocol {
> public func joined(separator: String = "") -> String
> }
>
> extension Dictionary where Key == String {
> public subscript(key: Substring) -> Value? { get set }
> public subscript(key: Substring, default defaultValue: @autoclosure () -> Value) -> Value { get set
> }
> }
>
> extension Set where Element == String {
> public func contains(_ member: Substring) -> Bool
> public func index(of member: Substring) -> Index?
> public mutating func insert(_ newMember: Substring) -> (inserted: Bool, memberAfterInsert: Element)
> public mutating func remove(_ member: Substring) -> Element?
> }
> ```
>
> These additions are deliberately narrow in scope. They are _not_ intended to
> solve a general problem of being able to interchange substrings for strings (or
> more generally slices for collections) generically in different APIs. See the
> alternatives considered section for more on this.
>
> ## Source compatibility
>
> No impact, these are either additive (in case of hashed containers) or
> generalize an existing API to a protocol (in case of numeric
> conversion/joining).
>
> ## Effect on ABI stability
>
> The hashed container changes are additive so no impact. The switch from conrete
> to generic types for the numeric conversions needs to be made before ABI
> stability.
>
> ## Alternatives considered
>
> While they have a convenience benefit as well, this is not the primary goal of
> these additions, but a side-effect of helping avoid a performance problem. In
> many other cases, the performance issues can be avoided via modified use e.g.
> `Sequence.contains` of a `Substring` in a sequence of strings can be written as
> `sequence.contains { $0 == substring }` .
>
> These changes are limited in scope, and further additions could be considered
> in the future. For example, should the `Dictionary.init(grouping:by:) where Key
> == String` operation be enhanced to similarly take a sequence of substrings?
> There is a long tail of these cases, and the need to keep unnecessary overloads
> to a minimum, avoiding typechecker work and code bloat, must be weighed against
> the likelyhood that string copies will be a performance problems.
>
> There is a more general problem of interoperating between collections and
> slices. In the future, there may be other affordances for converting/comparing
> them. For example, it might be desirable to require equatable collections to
> have equatable slices, and to automatically provide default implementations of
> `==` that efficiently compare a collection to its default slice. These
> enhancements rely on features such as conditional conformance, and so may be
> worth considering in later versions of Swift but are not an option currently.
> _______________________________________________
> swift-evolution mailing list
> swift-evolution at swift.org
> https://lists.swift.org/mailman/listinfo/swift-evolution
>
--
-Dave
More information about the swift-evolution
mailing list