[swift-dev] Resilient dynamic dispatch ABI. Notes and mini-proposal.

Sat Feb 4 04:35:24 CST 2017

> On Feb 3, 2017, at 9:37 PM, John McCall <rjmccall at apple.com> wrote:
> 
>>> IV. The function that performs the lookup:
>>>  IV1) is parameterized by an isa
>>>  IV2) is not parameterized by an isa
>>> IV1 allows the same function to be used for super-dispatch but requires extra work to be inlined at the call site (possibly requiring a chain of resolution function calls).
>> 
>> In my first message I was trying to accomplish IV1. But IV2 is simpler
>> and I can't see a fundamental advantage to IV1.
> 
> Well, you can use IV1 to implement super dispatch (+ sibling dispatch, if we add it)
> by passing in the isa of either the superclass or the current class.  IV2 means
> that the dispatch function is always based on the isa from the object, so those
> dispatch schemes need something else to implement them.
> 
>> Why would it need a lookup chain?
> 
> Code size, because you might not want to inline the isa load at every call site.
> So, for a normal dispatch, you'd have an IV2 function (defined client-side?)
> that just loads the isa and calls the IV1 function (defined by the class).

Right. Looks like I wrote the opposite of what I meant. The important thing to me is that the vtable offset load + check is issued in parallel with the isa load. I was originally pushing IV2 for this reason, but now think that optimization could be entirely lazy via a client-side cache.

>>> V. For any particular function or piece of information, it can be accessed:
>>>  V1) directly through a symbol
>>>  V2) through a class-specific table
>>>  V3) through a hierarchy-specific table (e.g. the class object)
>>> V1 requires more global symbols, especially if the symbol is per-method, but doesn't have any index-computation problems, and it's generally a bit more efficient.
>>> V2 allows stable assignment of fixed indexes to entries because of availability-sorting.
>>> V3 does not; it requires some ability to (at least) slide indexes of entries because of changes elsewhere in the hierarchy.
>>> If there are multiple instantiations of a table (perhaps with different information, like a v-table), V2 and V3 can be subject to table bloat.
>> 
>> I had proposed V2 as an option, but am strongly leaning toward V1 for
>> ABI simplicity and lower static costs (why generate vtables and offset
>> tables?)
> 
> V1 doesn't remove the need for tables, it just hides them from the ABI.

I like that it makes the offset tables lazy and optional. They don’t even need to be complete.

>>> So I think your alternatives were:
>>> 1. I3, II2, III1, IV2, V1 (for the dispatch function): a direct call to a per-method global function that performs the dispatch.  We could apply V2 to this to decrease the number of global symbols required, but at the cost of inflating the call site and requiring a global variable whose address would have to be resolved at load time.  Has an open question about super dispatch.
>>> 2. I1, V3 (for the v-table), V1 (for the global offset): a load of a per-method global variable giving an offset into the v-table.  Joe's suggestion adds a helper function as a code-size optimization that follows I2, II1, III1, IV2.  Again, we could also use V2 for the global offset to reduce the symbol-table costs.
>>> 3. I2, II2, III2, IV1, V2 (for the class offset / dispatch mechanism table).  At least I think this is right?  The difference between 3a and 3b seems to be about initialization, but maybe also shifts a lot of code-generation to the call site?
>> 
>> I'll pick the following option as a starting point because it constrains the ABI the least in
>> terms of static costs and potential directions for optimization:
>> 
>> "I2; (II1+II2); III2; IV1; V1"
>> 
>> method_entry = resolveMethodAddress_ForAClass(isa, method_index, &vtable_offset)
>> 
>> (where both modules would need to opt into the vtable_offset.)
> 
> Wait, remind me what this &vtable_offset is for at this point?  Is it basically just a client-side cache?  I can't figure out what it's doing for us.

It’s a client side cache that can be checked in parallel with the `isa` load. The resolver is not required to provide an offset, and the client does not need cache all the method offsets. It does burn an extra register, but gains the ability to implement vtable dispatch entirely on the client side.

You might be thinking of caching the method entry itself and checking `isa` within `resolveMethod`. I didn’t mention that possibility because the cost of calling the non-local `resolveMethod` function followed by an indirect call largely defeats the purpose of something like an inline-cache.

>> I think any alternative would need to be demonstrably better in terms of code size or dynamic dispatch cost.
> 
> That's a lot of stuff to materialize at every call site.  It makes calls
> into something like a 10 instruction sequence on ARM64, ignoring
> the actual formal arguments:
> 
>   %raw_isa = load %object                        // 1 instruction
>   %isa_mask = load @swift_isaMask                // 3: 2 to materialize address from GOT (not necessarily with ±1MB), 1 to load from it
>   %isa = and %raw_isa, %isa_mask                 // 1
>   %method_index = 13                             // 1
>   %cache = @local.A.foo.cache                    // 2: not necessarily within ±1MB
>   %method = call @A.resolveMethod(%isa, %method_index, %cache) // 1
>   call %method(...)                              // 1
> 
> On x86-64, it'd just be 8 instructions because the immediate range for leaq/movq
> is ±2GB, which is Good Enough for the standard code model, but of course it still
> expands to roughly the same amount of code.
> 
> Even without vtable_offset, it's a lot of code to inline.
> 
> So we'd almost certainly want a client-side resolver function that handled
> the normal case.  Is that what you mean when you say II1+II2?  So the local
> resolver would be I2; II1; III2; IV2; V1, which leaves us with a three-instruction
> call sequence, which I think is equivalent to Objective-C, and that function
> would do this sequence:
> 
> define @local_resolveMethodAddress(%object, %method_index)
>   %raw_isa = load %object                        // 1 instruction
>   %isa_mask = load @swift_isaMask                // 3: 2 to materialize address from GOT (not necessarily with ±1MB), 1 to load from it
>   %isa = and %raw_isa, %isa_mask                 // 1
>   %cache_table = @local.A.cache_table            // 2: not necessarily within ±1MB
>   %cache = add %cache_table, %method_index * 8   // 1
>   tailcall @A.resolveMethod(%isa, %method_index, %cache)  // 1
> 
> John.

Yes, exactly, except we haven’t even done any client-side vtable optimization yet.

To me the point of the local cache is to avoid calling @A.resolveMethod in the common case. So we need another load-compare-and-branch, which makes the local helper 12-13 instructions. Then you have the vtable load itself, so that’s 13-14 instructions. You would be saving on dynamic instructions but paying with 4 extra static instructions per class.

It would be lame if we can't force @local.A.cache_table to be ±1MB relative to the helper.

Inlining the cache table address might be worthwhile because %method_index would then be an immediate and hoisted to the top of the function.

-Andy

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.swift.org/pipermail/swift-dev/attachments/20170204/17bd16aa/attachment.html>