[swift-dev] State of String: ABI & Performance
Michael Ilseman
milseman at apple.com
Wed Jan 17 14:18:26 CST 2018
(Replying out-of-order)
> Sadly, I don’t. I’m only an opinionated hobbyist in this domain, one who has coded a lot of string processing over the years and understands at least some of the tradeoffs.
Wouldn’t it be amazing to be able to write a high performance lexer with all the awesomeness of Swift and without String getting in your way? :-)
> ^ String should NOT be an Any!
<snip>
> Bridging is either “always eager” or “possibly lazy”; there is nothing useful in between. Lazy bridging means that every operation down to the lowly ascii parsing operations needs to check for the slow path. I understand that this may not be possible to eliminate on Apple platforms, but it would be great to not add the existential generalization since that would affect other platforms that don’t have the legacy API concern.
<snip>
> Yes, StringRef style algorithms are a big deal, as I mentioned in my previous email, but it is also unclear if this will really be a win when shoehorned into String. The benefit of StringRef is that it is a completely trivial type (both in the SIL sense but also in the implementation sense) and all the primitive ops get inlined. Given the “all things to all people” design of String, I’m very much afraid that trying to shoehorn this into the String currency type will fail to provide significant wins and thus lead to having a separate StringRef style type anyway. Providing a StringRef style projection type that is trivial (in the Swift sense) that knows in its static type that it never owns memory seems like the best path.
>
> By point of comparison, C++ has std::string (yes, sure, with lots of issues) but they still introduced StringRef nee std::string_view instead of wedging it in.
I left the “UnmanagedString” section open for debate on SE, but for the purposes of talking about ABI design (i.e. what is possible to expose in the future), let’s just say we’ll eventually have an `UnsafeString<CodeUnit>`, which is a pointer, length, and packed performance flags that provides both a random access view of its code units as well as full-Unicode operations.
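To make that concrete, here’s a rough sketch of the shape I have in mind; the name, the constraints, and the particular flags are placeholders rather than a settled design:

struct UnsafeString<CodeUnit: FixedWidthInteger & UnsignedInteger> {
  var start: UnsafePointer<CodeUnit>
  var count: Int

  // Packed performance flags (illustrative, not exhaustive):
  var isASCII: Bool
  var isCanonical: Bool

  // Random access to raw code units; full-Unicode operations would layer on top.
  subscript(i: Int) -> CodeUnit { return start[i] }
}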
Please bear with me as I try to reason through what String should be, given the current reality of source compatibility. If you have any design insight or historical context to fill in, please do so!
In one extreme corner of this multi-dimensional design space we have a proliferation of string types such as UTF8String, UTF16String, and even combinations of performance flags such as TrivialUTF8String, CanonicalSingleScalarGraphemeSmallString, etc. This proliferation encodes all these would-be-dynamic branches in the type, which is great for performance. But it hits APIs hard: they either accept total chaos by taking concrete types, or try to generalize/erase over these myriad types.
On the other extreme end is AnyString, an existential that can be backed by anything that can vend code units in some fashion. The advantage of such a type is API unification: no matter what form someone has, they can always call your API without copying the underlying data. The downside is that everyone’s in the same slow boat.
Somewhere between these two extremes will be String. The question is, what details does String treat as dynamic properties? Source compatibility dictates that String abstracts away the underlying encoding, branching inside every API that performs a read of its code units. There will always be performance-critical use cases that wish to avoid these branches, i.e. branch at the top rather than throughout. A more appropriate type is needed for this, such as UTF16String, UTF8String, and UnsafeString<CodeUnit>. Treating encoding as a dynamic property is great for APIs, as it avoids an encoding schism.
If we do small string optimizations for the String type, then every String could either be small (a value) or large (a reference). Every use of large strings will be slightly penalized. To a very small extent, even performance flags slightly penalize strings that don’t set them. These are worth doing nonetheless, as the benefit is so large when they apply and the penalty is paltry relative to the cost of the operations in this slow path.
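As a conceptual model only (not the actual representation), the dynamic forms under discussion look something like this, and every read has to discriminate among them:

enum StringForm {
  case small(count: Int)       // value form: code units packed into the struct itself
  case largeUTF8(AnyObject)    // reference form: native heap-allocated storage
  case largeUTF16(AnyObject)   // reference form: e.g. lazily bridged NSString storage
}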
(...continued after quote)
> Right. It is actually a bit more complicated than that though. Consider the Swift 4 generation of strings and arrays: because of lazy bridging each primitive operation includes the dynamic branching sequence. Beyond the dynamic performance impact of this, we end up having a really unfortunate set of tradeoffs to choose between:
>
> 1. We can leave the operations outlined in the standard library. Particularly for Array which benefits from being specialized on its element type, this is a huge perf loss.
>
> 2. We can inline the operation, but we end up inlining both sides of the operation, causing massive code bloat.
>
> 3. We can partially inline just the presumed fastpath, and leave the slow path out of line.
>
> The problem with this when you bring it to string is that the “fastpath” would naturally be the small string case. If you make *it* have many levels of branching underneath it, not only do you slow down the fast case, but you also cause intense code bloat if you want to inline core operations into user code.
>
<snip>
>> The “high-order bit” for performance is whether a string is small-or-not, as it avoids an allocation and, just as importantly, skips managing the memory with ARC. Beyond that, we’re debating smaller performance deltas, which are still important. This reasoning also influences where we choose to stick our resilience barriers, which can be relaxed in the future.
>
> You’re making an argument about aggregate use cases on average, but it is unwise to ignore the common performance critical cases that do actually want to look at the bytes.
That is a very good point, and it brings up some of the tradeoffs we need to balance in a data-driven fashion. This reminds me of the classic throughput vs latency view of performance. We have many, many strings on our systems and anything that reduces overall resource usage is extremely valuable. On the other hand, some operations need to be as fast as possible and shouldn’t be heavily penalized for some aggregate win.
A String with small string optimizations has “value” forms and “reference” forms, from a runtime and ABI perspective. Evaluating whether to add more value forms involves answering these questions:
1. What is the potential gain for accommodating this additional value form? This is probably a “throughput” gain.
2. What is the cost to the fast path, and to the slow path, of accommodating this additional value form? That is, how does this hurt “latency”?
Let’s say “fast path” means the small UTF-8 form, and “slow path” is the out-of-line encoding-abstracted form. The cost to the fast path could be reduced if we recognize that one particular value form (e.g. small UTF-8 string) is worth checking first and all other value forms can be handled by a call to some non-inlined function. For the fast path, it’s the difference between checking if the top bit is 1 vs the top n bits being 1, which is a couple extra instructions (on x86_64 at least).
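Schematically, with illustrative bit positions rather than the real layout, the difference looks like:

// One small form: the inlined fast path tests a single discriminator bit.
func isSmall(_ bits: UInt64) -> Bool {
  return bits & 0x8000_0000_0000_0000 != 0
}

// Several small forms: the inlined fast path must recognize the small UTF-8
// form specifically, deferring every other value form to a non-inlined call.
func isSmallUTF8(_ bits: UInt64) -> Bool {
  return bits & 0xE000_0000_0000_0000 == 0xE000_0000_0000_0000
}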
The cost to the slow path is checking and branching for other small forms. This may be significant or insignificant, depending on the operation. E.g. meaningless compared to general grapheme breaking, but could be significant for `endIndex`.
As to String being able to accommodate “Any”-like usage, it would not impact the fast path (since it is not a value form). It would have a similar impact on the slow path as multiple value forms, and would need similar justification.
> Consider the http header parsing challenges the swift on server workgroup faced that led them to use a C implementation, for example.
That’s very unfortunate. Highly-performance-sensitive use cases have largely been neglected by String, and it’s very important to consider their needs in the ABI. Do you know of a retrospective or post-mortem of the attempt?
HTTP header parsing probably wouldn’t want to use String, which dynamically branches on reads to accommodate both UTF-8 and UTF-16. Similarly, the headers wouldn’t be small, so no need for that extra branch around inlined code there either. A “UTF8String” type would be more appropriate. The headers might further be known-ASCII, which would imply a trivially-encoded-and-canonical UTF8String (maybe even single-scalar-grapheme if they’re known not to have Windows-style newlines), so there’s no need to check-and-branch: just execute the fast path. For parsed portions, they probably want UnsafeString<UInt8>, or perhaps some variant that encodes perf flags too.
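As a minimal sketch of what that buys (written against UnsafeBufferPointer<UInt8>, since UnsafeString<UInt8> is hypothetical at this point): with known-ASCII input every byte is a complete character, so the scanning loop carries no encoding or grapheme branches at all:

func headerNameLength(_ bytes: UnsafeBufferPointer<UInt8>) -> Int? {
  // Returns the length of the header field name, i.e. the index of the ':'
  // terminator, assuming the buffer starts at the beginning of a header field.
  for i in bytes.indices where bytes[i] == UInt8(ascii: ":") {
    return i
  }
  return nil
}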
I think the future is bright if we get the right ABI.
>
>>> Given the ability to hold 15 digits of ascii, I don’t see why it would be worthwhile to worry about a 5-bit representation for digits. String should not be an Any!
>
> I still don’t understand why this would even be considered, do you have any more rationale to explain that?
>
<snip>
>> As to a 5-bit encoding, again, all details pending more experimentation. Its importance may be diminished by more efficient, low-level String construction interfaces. The difference between 15 and 24 characters could be big, considering that formatted addresses and Doubles fit in that range. We’ll see.
>
> I still can’t imagine any workload that would care about this, and cannot see why punishing the normal cases would be worth any cost. Can you?
I do not at this time have detailed rationale. FWIW, my plan is to roll out the small UTF-8 representation first and only add the others if there’s sufficient data and justification (i.e. cost/benefit analysis). I’m especially interested in whether more efficient ways to construct and interpolate Strings would address the perceived need for this form.
>>>> Final details are still in exploration. If the construction and interpretation of small bit patterns can remain behind a resilience barrier, new forms could be added in the future. However, the performance impact of this would need to be carefully evaluated.
>>>
>>> I’d love to see performance numbers on this. Have you considered a design where you have exactly one small string representation: a sequence of 15 UTF8 bytes? This holds your 15 bytes of ascii, probably more non-ascii characters on average than 7 UTF16 codepoints, and only needs one determinator branch on the entry point of hot functions that want to touch the bytes. If you have a lot of algorithms that are sped up by knowing they have ascii, you could go with two bits: “is small” and “isascii”, where isascii is set in both the small string and normal string cases.
>>>
>>
>> We have not yet done the investigation, so all details could change. This is my (perhaps flawed!) reasoning as to why, in general, it may be useful to have multiple small forms. Again, this will change as we learn more.
>>
>> Strings are created and copied (i.e. a value-copy/retain) more often than they are read (and strings are read far more often than they are modified).
>
> If you believe it is highly biased towards this, then this strongly argues for a single word representation...
>
… depending on how far along the diminishing-returns curve 7 vs 15 UTF-8 code units is. The killer feature of single-word String is that e.g. Array<String> is denser, and since growth happens via move/take operations, grow operations are also quicker. However, an Array copy triggered by a write (i.e. copy-on-write) is subject to the tradeoff of small vs large. For Strings as arguments / return values, the win for single-word String is conditional on how many other arguments / return values there are and on our calling convention (which IIRC has devoted many registers to arguments / return values). There’s also the existential inline buffer size, which I think is worth spinning off into a separate discussion.
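A back-of-the-envelope illustration of the density point, assuming a 64-bit platform:

let elements = 1_000_000
let twoWordStorage = elements * 16   // 2-word String: ~16 MB of contiguous Array storage
let oneWordStorage = elements * 8    // 1-word String: ~8 MB; grows and copies touch half the memory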
>> 4. The second word can cache useful information from large strings. `endIndex` is a very frequently requested computed property and it could be stored directly in-line rather than loaded from memory (though perhaps a load happens anyways in a subsequent read of the string). Alternatively, we could store the grapheme count or some other piece of information that we’d otherwise have to recompute. More experimentation needed here.
>
> This seems weakly motivated: large strings can store end index in the heap allocation.
>
Right. I was hoping that large strings could use the extra bits to recover some of the (albeit small) regression from accommodating a small form, and endIndex is hit relatively often. But it would probably be better to use them for something else and not try to “anchor” ourselves to large-string micro benchmarks.
>> 5. (vague and hand-wavy) Two words fit into a nice groove that 3 words doesn’t: 2 words is a rule-of-thumb size for very small buffers. It’s a common heap alignment, stack alignment, vector width, double-word-load width, etc. 1-word Strings may be under-utilizing available resources; that is, the second word will often be there for use anyway. The main case where this is not true, and where 1-word shines, is aggregates of String.
>
> What is the expected existential inline buffer size going to wind up being? We sized it to 3 words specifically to fit string and array. It would be great to shrink that to 2 or 1 words.
>
(Bob replied to this).
Another angle is that Substring is currently String + a 2-word range. We think it should be shrunk to String + a 1-word index-mapping offset. 2-word String would let Substring fit in the existing inline buffer. The question is whether the inline buffer should shrink with String, or if there are cases such as Substring that are still important.
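A layout sketch of the comparison (approximate word counts; these are not the actual standard library definitions):

struct SubstringToday {       // String (2 words) + range (2 words) = 4 words
  var base: String
  var bounds: Range<Int>      // stands in for the 2-word index range
}

struct SubstringProposed {    // String (2 words) + offset (1 word) = 3 words,
  var base: String            // which fits the current 3-word inline buffer
  var startOffset: Int        // the 1-word index-mapping offset
}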
>>> - array has exactly the same sort of concern, why is this a string-specific problem?
>>
>> I think it’s interesting as slices may emerge more often in practice on strings than arrays, e.g. for presenting parsing results. Also, applications that care about this level of performance may custom-allocate all their string data contiguously (e.g. llvm::BumpPtr). I believe that projects such as llbuild have done this, forcing themselves into an API schism between String and UnsafeBufferPointer<UInt8>, often duplicating functionality. In theory all of this could also apply to array, but it seems to happen more for strings.
>
> I see String and Array as directly analogous in the space. The *only* difference IMO is when you get to algorithms that interpret the bytes because of frickin’ unicode.
<snip>
> Yes, I think that your point about Unsafe is important, which is even more reason not to wedge it into String. I really do understand the tradeoff of how you can’t pass a “Ref” to something that takes a String, but we already extensively debated that topic and accepted it w.r.t. Slice types. This seems directly analogous.
>
The analogy between String vs Substring and Array vs Slice is interesting and makes sense. I’ll have to think about this a bit more, and whether String vs Substring might differ in practice.
>>>> * Unification of String and Substring’s various Views.
>>>> * Some form of opaque string, such as an existential, which can be used as an extension point.
>>>> * More compact/performant indices.
>>>
>>> What is the current theory on eager vs lazy bridging with Objective-C? Can we get to an eager bridging model? That alone would dramatically simplify and speed up every operation on a Swift string.
>>>
>>
>> I don’t think that will be feasible in general, although perhaps it could happen for most common occurrences.
>
>
>>>
>>> func hexToInt(_ c : UInt8) -> UInt8 {
>>>   switch c {
>>>   case charToByte("0")...charToByte("9"): return c - charToByte("0")
>>>   case charToByte("a")...charToByte("f"): return c - charToByte("a") + 10
>>>   case charToByte("A")...charToByte("F"): return c - charToByte("A") + 10
>>>   default: fatalError("invalid hexadecimal character")
>>>   }
>>> }
>>>
>>
>> Yeah, trying to do ASCII value ranges is currently pretty rough. Here’s what I did recently when I wanted something similar, using unicode scalar literal ranges:
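>> Roughly, something along these lines (illustrative rather than the exact code):
>>
>> func hexToInt(_ c: Unicode.Scalar) -> UInt8 {
>>   switch c {
>>   case "0"..."9": return UInt8(c.value - ("0" as Unicode.Scalar).value)
>>   case "a"..."f": return UInt8(c.value - ("a" as Unicode.Scalar).value + 10)
>>   case "A"..."F": return UInt8(c.value - ("A" as Unicode.Scalar).value + 10)
>>   default: fatalError("invalid hexadecimal character")
>>   }
>> }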
>
> We resisted adding single quoted strings for characters early on because we felt that it was a marginal case and might want to use them for additional quoting characters (like scripting languages like Perl that allow either single or double quotes) or for multi-line strings. However, now that multiline strings are settled and now that it is clear that we’re not going to add multiple redundant ways to quote strings, I think it is worth considering whether we should introduce ExpressibleByCharacterLiteral types and whether UInt8 should be one (along with Character of course) and tie it to single quotes.
>
> -Chris
>
>
>
>