[swift-dev] Resilient dynamic dispatch ABI. Notes and mini-proposal.

Fri Feb 3 23:37:37 CST 2017

> On Feb 3, 2017, at 7:06 PM, Andrew Trick <atrick at apple.com> wrote:
>> On Feb 3, 2017, at 3:12 PM, John McCall <rjmccall at apple.com <mailto:rjmccall at apple.com>> wrote:
>> 
>> I think we can generalize this discussion a bit by describing some mostly-independent axes of variation:
> 
> That's great (sorry I used up the arabic+letter naming convention earlier in the thread)...
> 
>> I. A call site:
>>  I1) inlines the lookup
>>  I2) calls a function that performs the lookup and returns a function pointer
>>  I3) calls a function that performs the lookup and calls the function
>> I3 minimizes code size at the call site, but it prevents the lookup (or components thereof) from being hoisted.  I1 requires at least some details of the lookup to be ABI.
> 
> Exactly.
> 
>> II. The function that performs the lookup:
>> II1) is emitted lazily
>> II2) is emitted eagerly
>> II2 can use static information to implement the lookup inline without affecting the ABI.  II1 cannot; it must either rely on ABI assumptions or just recursively use another strategy, in which case the function is probably just a call-site code-size optimization.
> 
> Sure, but I was thinking of it as client side vs. class-definition
> side.  My suggestions explored the tradeoff between II1 and II2, where
> the client code could optionally optimize for vtable dispatch by
> emitting some lazy lookup helpers with a fall-back to the eagerly
> emitted lookup.

That's how I see it, too.

>> III. The function that performs the lookup:
>>  III1) is method-specific
>>  III2) is class-specific
>>  III3) is part of the language runtime
>> III1 requires a lot more functions, but if emitted with the class definition, it allows the lookup to be statically specialized based on the method called, so e.g. if a non-open method is never overridden, it can resolved to a specific function pointer immediately, possibly by just aliasing a symbol to the function definition.  III3 minimizes the number of functions required but doesn't allow us to evolve dispatch beyond what's supported by the current language runtime and doesn't allow class-specific optimizations (e.g. if the class definition statically knows the offset to its v-table).
> 
> III1 is the sort of optimization that can be selectively added later
> if we're willing to version the ABI.

Well, at the cost of leaving us with a legacy exported function.

> I suppose by that logic we should start with III3. We could never
> ditch the runtime/compiler support, but newer code could move toward
> III2 through ABI versioning.

It would depend on whether we build features on top of this.  For example,
I could see a reflection system assuming that all methods could be dispatched
via the global runtime function, if we had one.

> I actually prefer to start with III2 because the static overhead is
> not significantly worse than III3 and it can be optimized without ABI
> migration.

If it's between III2 and III3, yeah, I think I agree.

>> IV. The function that performs the lookup:
>>  IV1) is parameterized by an isa
>>  IV2) is not parameterized by an isa
>> IV1 allows the same function to be used for super-dispatch but requires extra work to be inlined at the call site (possibly requiring a chain of resolution function calls).
> 
> In my first message I was trying to accomplish IV1. But IV2 is simpler
> and I can't see a fundamental advantage to IV1.

Well, you can use IV1 to implement super dispatch (+ sibling dispatch, if we add it)
by passing in the isa of either the superclass or the current class.  IV2 means
that the dispatch function is always based on the isa from the object, so those
dispatch schemes need something else to implement them.

> Why would it need a lookup chain?

Code size, because you might not want to inline the isa load at every call site.
So, for a normal dispatch, you'd have an IV2 function (defined client-side?)
that just loads the isa and calls the IV1 function (defined by the class).

>> V. For any particular function or piece of information, it can be accessed:
>>  V1) directly through a symbol
>>  V2) through a class-specific table
>>  V3) through a hierarchy-specific table (e.g. the class object)
>> V1 requires more global symbols, especially if the symbol is per-method, but doesn't have any index-computation problems, and it's generally a bit more efficient.
>> V2 allows stable assignment of fixed indexes to entries because of availability-sorting.
>> V3 does not; it requires some ability to (at least) slide indexes of entries because of changes elsewhere in the hierarchy.
>> If there are multiple instantiations of a table (perhaps with different information, like a v-table), V2 and V3 can be subject to table bloat.
> 
> I had proposed V2 as an option, but am strongly leaning toward V1 for
> ABI simplicity and lower static costs (why generate vtables and offset
> tables?)

V1 doesn't remove the need for tables, it just hides them from the ABI.

>> So I think your alternatives were:
>> 1. I3, II2, III1, IV2, V1 (for the dispatch function): a direct call to a per-method global function that performs the dispatch.  We could apply V2 to this to decrease the number of global symbols required, but at the cost of inflating the call site and requiring a global variable whose address would have to be resolved at load time.  Has an open question about super dispatch.
>> 2. I1, V3 (for the v-table), V1 (for the global offset): a load of a per-method global variable giving an offset into the v-table.  Joe's suggestion adds a helper function as a code-size optimization that follows I2, II1, III1, IV2.  Again, we could also use V2 for the global offset to reduce the symbol-table costs.
>> 3. I2, II2, III2, IV1, V2 (for the class offset / dispatch mechanism table).  At least I think this is right?  The difference between 3a and 3b seems to be about initialization, but maybe also shifts a lot of code-generation to the call site?
> 
> I'll pick the following option as a starting point because it constrains the ABI the least in
> terms of static costs and potential directions for optimization:
> 
> "I2; (II1+II2); III2; IV1; V1"
> 
> method_entry = resolveMethodAddress_ForAClass(isa, method_index, &vtable_offset)
> 
> (where both modules would need to opt into the vtable_offset.)

Wait, remind me what this &vtable_offset is for at this point?  Is it basically just a client-side cache?  I can't figure out what it's doing for us.

> I think any alternative would need to be demonstrably better in terms of code size or dynamic dispatch cost.

That's a lot of stuff to materialize at every call site.  It makes calls
into something like a 10 instruction sequence on ARM64, ignoring
the actual formal arguments:

  %raw_isa = load %object                        // 1 instruction
  %isa_mask = load @swift_isaMask                // 3: 2 to materialize address from GOT (not necessarily with ±1MB), 1 to load from it
  %isa = and %raw_isa, %isa_mask                 // 1
  %method_index = 13                             // 1
  %cache = @local.A.foo.cache                    // 2: not necessarily within ±1MB
  %method = call @A.resolveMethod(%isa, %method_index, %cache) // 1
  call %method(...)                              // 1

On x86-64, it'd just be 8 instructions because the immediate range for leaq/movq
is ±2GB, which is Good Enough for the standard code model, but of course it still
expands to roughly the same amount of code.

Even without vtable_offset, it's a lot of code to inline.

So we'd almost certainly want a client-side resolver function that handled
the normal case.  Is that what you mean when you say II1+II2?  So the local
resolver would be I2; II1; III2; IV2; V1, which leaves us with a three-instruction
call sequence, which I think is equivalent to Objective-C, and that function
would do this sequence:

define @local_resolveMethodAddress(%object, %method_index)
  %raw_isa = load %object                        // 1 instruction
  %isa_mask = load @swift_isaMask                // 3: 2 to materialize address from GOT (not necessarily with ±1MB), 1 to load from it
  %isa = and %raw_isa, %isa_mask                 // 1
  %cache_table = @local.A.cache_table            // 2: not necessarily within ±1MB
  %cache = add %cache_table, %method_index * 8   // 1
  tailcall @A.resolveMethod(%isa, %method_index, %cache)  // 1

John.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.swift.org/pipermail/swift-dev/attachments/20170204/96fe9fec/attachment.html>