[swift-dev] State of String: ABI & Performance

Mon Jan 15 13:20:35 CST 2018


> On Jan 11, 2018, at 9:46 PM, Chris Lattner via swift-dev <swift-dev at swift.org> wrote:
> 
>>> 
>>> Finally, what tradeoffs do you see between a 1-word vs 2-word string?  Are we really destined to have 2-words?  That’s still much better than the 3 words we have now, but for some workloads it is a significant bloat.
>> 
>> <repeat disclaimer about final details being down to real data>. Some arguments in favor of 2-word, presented roughly in order of impact:
> 
> Understood.  I don’t have a strong opinion on 1 vs 2 words; either is dramatically better than 3 :-).  I’m glad you’re carefully evaluating the tradeoff.
> 
>> 1. This allows the String type to accommodate llvm::StringRef-style usages. This is pretty broad usage: “mmap a file and treat its contents as a String”, “store all my contents in an llvm::BumpPtrAllocator which outlives uses”, un-owned slices, etc. A one-word String would greatly limit this to only whole-string nul-terminated cases.
> 
> Yes, StringRef-style algorithms are a big deal, as I mentioned in my previous email, but it is also unclear whether this will really be a win when shoehorned into String.  The benefit of StringRef is that it is a completely trivial type (in both the SIL sense and the implementation sense) and all the primitive ops get inlined.  Given the “all things to all people” design of String, I’m very much afraid that wedging this into the String currency type will fail to provide significant wins and thus lead to having a separate StringRef-style type anyway.  Providing a StringRef-style projection type that is trivial (in the Swift sense) and that knows in its static type that it never owns memory seems like the best path.
> 
> By point of comparison, C++ has std::string (yes, sure, with lots of issues) but they still introduced a StringRef-style type, std::string_view, instead of wedging it in.
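
To make the "trivial projection type" idea concrete, here is a minimal sketch of a pointer-plus-length view in Swift. The type name `UnownedUTF8Ref` and its methods are hypothetical, invented for illustration; nothing here is a proposed or existing standard library API. It only shows why such a view naturally takes two words (a start pointer and a count) and why un-owned slices do not need a nul terminator.

```swift
// Hypothetical StringRef-style view: a borrowed pointer plus a length.
// It never owns or retains memory, so copying it is a plain bitwise copy.
struct UnownedUTF8Ref {
    let start: UnsafePointer<UInt8>
    let count: Int

    // Un-owned slice: no allocation, and no nul terminator is required
    // because the length travels with the pointer.
    func dropFirst(_ n: Int) -> UnownedUTF8Ref {
        let k = min(max(n, 0), count)
        return UnownedUTF8Ref(start: start + k, count: count - k)
    }

    // Copies the viewed bytes out into an owning String.
    func makeString() -> String {
        let buffer = UnsafeBufferPointer(start: start, count: count)
        return String(decoding: buffer, as: UTF8.self)
    }
}

let bytes: [UInt8] = Array("hello, world".utf8)
let tail: String = bytes.withUnsafeBufferPointer { buf -> String in
    // The view is only valid inside this closure, while `bytes` is pinned.
    let view = UnownedUTF8Ref(start: buf.baseAddress!, count: buf.count)
    return view.dropFirst(7).makeString()
}
print(tail)

// Pointer + count occupy exactly two machine words.
let isTwoWords = MemoryLayout<UnownedUTF8Ref>.size == 2 * MemoryLayout<Int>.size
print(isTwoWords)
```

The caller is responsible for keeping the underlying storage alive, which is exactly the property a static "never owns memory" type would encode.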
> 
>> 2. Two-word String fits more small strings. Exactly where along the diminishing-returns curve 7 vs 15 UTF-8 code units lie is dependent on the data set. One example is NSString, which (according to reasoning at https://www.mikeash.com/pyblog/friday-qa-2015-07-31-tagged-pointer-strings.html) considered it important enough to have 6- and 5-bit reduced ASCII character sets to squeeze up to 11-length strings in a word. 15 code unit small strings would be a superset of tagged NSStrings, meaning we could bridge them eagerly in-line, while 7 code unit small strings would be a subset (and also a strong argument against eagerly bridging them).
> 
> Agreed, this is a big deal.
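
The 7 vs 15 figures can be sketched as simple arithmetic. This assumes a 64-bit platform (8-byte words) and that one byte of the representation is reserved for a discriminator and length; both are assumptions made for illustration, and the helper names `smallCapacity` and `fitsInline` are hypothetical rather than anything in the actual String implementation.

```swift
// Assumption: 64-bit platform, so one word is 8 bytes.
let bytesPerWord = 8

// Assumption: one byte is reserved for a discriminator plus length,
// and the remaining bytes hold UTF-8 code units in-line.
func smallCapacity(words: Int) -> Int {
    return words * bytesPerWord - 1
}

// True if the string's UTF-8 encoding would fit in-line under this model.
func fitsInline(_ s: String, words: Int) -> Bool {
    return s.utf8.count <= smallCapacity(words: words)
}

print(smallCapacity(words: 1))            // 7
print(smallCapacity(words: 2))            // 15
print(fitsInline("swift-dev", words: 1))  // false: 9 code units
print(fitsInline("swift-dev", words: 2))  // true
```

Running `fitsInline` over a real corpus of identifiers, keys, or log tokens would be one way to produce the kind of statistics requested below.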
> 
>> If you have access to any interesting data sets and can report back some statistics, that would be immensely helpful!
> 
> Sadly, I don’t. I’m only an opinionated hobbyist in this domain, one who has coded a lot of string processing over the years and understands at least some of the tradeoffs.
> 
>> 3. More bits available to reserve for future-proofing, etc., though many of these could be stored in the header.
>> 
>> 4. The second word can cache useful information from large strings. `endIndex` is a very frequently requested computed property and it could be stored directly in-line rather than loaded from memory (though perhaps a load happens anyways in a subsequent read of the string). Alternatively, we could store the grapheme count or some other piece of information that we’d otherwise have to recompute. More experimentation needed here.
> 
> This seems weakly motivated: large strings can store end index in the heap allocation.
> 
>> 5. (vague and hand-wavy) Two words fits into a nice groove that 3 words doesn’t: 2 words is a rule-of-thumb size for very small buffers. It’s a common heap alignment, stack alignment, vector width, double-word-load width, etc. 1-word Strings may be under-utilizing available resources; that is, the second word will often be there for use anyway. The main case where this is not true and 1-word shines is aggregates of String.
> 
> What is the expected existential inline buffer size going to wind up being?  We sized it to 3 words specifically to fit string and array.  It would be great to shrink that to 2 or 1 words.
> 

We are planning to reevaluate the size of the inline buffer based on experimental performance data, but we can’t do that in a useful way until the size of String has been settled.
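
For anyone who wants to check the sizes under discussion on their own toolchain, a minimal sketch using `MemoryLayout` follows. The helper `sizeInWords` is a hypothetical name for illustration. The number printed for `String` depends on the compiler version, so no particular value is claimed for it here; `Any` is an existential container, which holds the inline buffer plus a type metadata pointer.

```swift
let wordSize = MemoryLayout<Int>.size

// Rounds a type's size up to a whole number of machine words.
func sizeInWords<T>(_ type: T.Type) -> Int {
    return (MemoryLayout<T>.size + wordSize - 1) / wordSize
}

print("String:", sizeInWords(String.self), "word(s)")
print("[Int]: ", sizeInWords([Int].self), "word(s)")
// Inline buffer plus one word of type metadata.
print("Any:   ", sizeInWords(Any.self), "word(s)")
```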