[swift-dev] Resilient dynamic dispatch ABI. Notes and mini-proposal.

John McCall rjmccall at apple.com
Fri Feb 3 17:12:16 CST 2017

> On Feb 3, 2017, at 2:55 PM, Andrew Trick <atrick at apple.com> wrote:
>> On Feb 3, 2017, at 10:58 AM, John McCall <rjmccall at apple.com> wrote:
>>> On Feb 2, 2017, at 9:57 PM, Andrew Trick via swift-dev <swift-dev at swift.org> wrote:
>>> ---
>>> #1. (thunk export) The simplest, most flexible way to expose dispatch
>>> across resilience boundaries is by exporting a single per-method entry
>>> point. Future compilers could improve dispatch and gradually expose
>>> more ABI details.
>> Probably the most important such case is that many of these "dispatch" symbols
>> for a non-open class could simply be aliases to the actual method definition.
>> Note that open classes have to support super-dispatch, not just ordinary dynamic
>> dispatch; that's something that needs to be discussed at the same time, since it
>> may affect the trade-offs.  #1 has some serious disadvantages here; the others
>> are likely fine, since their mechanisms are all easily parameterized by the isa.
>>> ---
>>> #3. (method index) This is an alternative that I've alluded to before,
>>> but was not discussed in yesterday's meeting. One that makes a
>>> tradeoff between exporting symbols vs. exposing vtable layout. I want
>>> to focus on direct cost of the ABI support and flexibility of this
>>> approach vs. approach #1 without arguing over how to micro-optimize
>>> various dispatching schemes. Here's how it works:
>>> The ABI specifies a sort function for public methods that gives each
>>> one a per-class index. Version availability takes sort precedence, so
>>> public methods can be added without affecting other
>>> indices. [Apparently this is the same approach we're taking with
>>> witness tables].
>>> As with #2 this avoids locking down the vtable format for now--in the
>>> future we'll likely optimize it further. To avoid locking all methods
>>> into the vtable mechanism, the offset can be tagged. The alternative
>>> dispatch mechanism for tagged offsets will be hidden within the
>>> class-defining framework.
>> As a note, I went down a garden path here — I didn't realize at first that #3
>> wasn't an option on its own, but really just front matter for options #3a and #3b.
>> The paragraph about the sort, especially coming after #2, made me think that
>> you were talking about the option — let's call it #2b — where the class only
>> exports an offset to the start of its v-table and an availability-sort allows the
>> caller to hard-code a fixed offset to add to that.  This greatly cuts down on the
>> number of symbols required to be exported, but of course it does force all
>> the methods to actually end up in the v-table.
> Right, I didn't go down that path initially because, as Joe pointed
> out, the vtable offsets aren't statically knowable because of the
> resilient base class problem. But, of course the vtable base offset
> could be exported and we can rely on the class being realized before
> it's invoked. This is great if we're willing to commit to vtable
> dispatch.
>>> This avoids the potential explosion of exported symbols--it's limited
>>> to one per public class. It avoids explosion of metadata by allowing
>>> alternative dispatch for some subset of methods. These tradeoffs can
>>> be explored in the future, independent of the ABI.
>>> ---
>>> #3a. (offset table export) A single per-class entry point provides a
>>> pointer to an offset table. [It can be optionally cached on the client
>>> side].
>> Why would this need to be a function?  Just to allow its resolution to be lazy?
>> It seems to me that the class definition always knows the size of this table.
> I don't know what you're asking. In #3b, the offset is resolved by a
> class-exported function.

You said "A single per-class entry point provides a pointer to an offset table",
which suggests that, well, there's a function that provides (i.e. returns) a pointer
to an offset table.  I'm asking why accessing the offset table requires a function
call instead of just accessing a global variable.  If you're worried about initializing it,
I think it's clearly better to rely on it having been initialized as part of initializing the
class metadata, which means that (optimization aside) call sites can always just
assume it's been initialized by the point of call.

Now, that does create a more complex optimization problem because the access
isn't arbitrarily hoistable, the same way that an ivar offset load isn't arbitrarily
hoistable, but, well, as that example shows, this is an optimization problem we
already have and need to get better at, and it's not a total blocker.
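
The point above — that the offset table can be a plain global, initialized as a side effect of realizing the class metadata, so that call sites just load from it — can be modeled in a few lines. This is a Python toy, not Swift runtime code; all names here are invented for illustration.

```python
class ClassMetadata:
    """A stand-in for realized class metadata: just holds a vtable."""
    def __init__(self, methods):
        self.vtable = list(methods)  # list of function pointers

# The per-class offset table is a global variable, not a function.
# It is filled in when the class metadata is realized.
method_offset_table = {}

def realize_class(metadata, method_names):
    # Runs once, as part of initializing the class metadata, so call
    # sites may assume the table is populated by the point of call.
    for index, name in enumerate(method_names):
        method_offset_table[name] = index

def call_method(metadata, name, *args):
    # A call site: a plain load from the already-initialized global
    # table, then an indexed load from the vtable. No accessor call.
    offset = method_offset_table[name]
    return metadata.vtable[offset](*args)

meta = ClassMetadata([lambda x: x + 1, lambda x: x * 2])
realize_class(meta, ["increment", "double"])
```

The tradeoff is exactly the one noted above: the load is cheap, but it cannot be hoisted before the class is realized.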

>>> method_index = immediate
>>> { // common per-class method lookup
>>>  isa = load[obj]
>>>  isa = isa & @isa_mask
>>>  offset = load[@class_method_table + method_index]
>>>  if (isVtableOffset(offset))
>>>    method_entry = load[isa + offset]
>>>  else
>>>    method_entry = @resolveMethodAddress(isa, @class_method_table, method_index)
>>> }
>>> call method_entry
>> I hope the proposal is that the stuff in braces is a locally-emitted function,
>> not something inlined into every call site.
> This approach is based on generating local per-class "nocallersave" helper functions
> for method lookup (all the stuff in curly braces).

Okay, good.
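
For concreteness, here is the braced lookup from the quoted pseudocode as a runnable Python model. The tag encoding (high bit), the `slow_methods` side table, and all names are hypothetical; the real mechanism for tagged entries would be hidden inside the class-defining framework.

```python
TAG = 1 << 63  # hypothetical encoding: high bit marks "not a vtable offset"

def is_vtable_offset(entry):
    return not (entry & TAG)

def resolve_method_address(isa, method_table, method_index):
    # Stand-in for the framework-private resolver used for tagged entries.
    return isa["slow_methods"][method_table[method_index] & ~TAG]

def lookup(obj, method_table, method_index):
    # Corresponds to the braced block in the pseudocode: load the isa,
    # index the class method table, and branch on whether the entry is
    # a real vtable offset or a tagged value needing out-of-line lookup.
    isa = obj["isa"]
    entry = method_table[method_index]
    if is_vtable_offset(entry):
        return isa["vtable"][entry]
    return resolve_method_address(isa, method_table, method_index)

cls = {
    "vtable": [lambda: "fast"],        # slot 0: ordinary vtable dispatch
    "slow_methods": [lambda: "slow"],  # reached only via tagged entries
}
obj = {"isa": cls}
table = [0, TAG | 0]  # index 0: vtable slot 0; index 1: tagged, slow path
```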

>> A sort of #4 is that this function could be exported by the class definition.
>> It would be passed an isa and a method index and could resolve it however it likes.
> Yeah, let's call that option #4. I like that from an ABI perspective, but didn't go there
> because it seems strictly worse than #1 in terms of runtime cost.

Well, it doesn't require per-method symbols, but yes, it definitely gives up some static
optimizations relative to #1.  Most of these do.
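
Option #4 in miniature, as a hedged sketch (Python, all names invented): the class definition exports exactly one function, which takes an isa and a method index and resolves the entry however it likes, so callers learn nothing about the dispatch mechanism.

```python
def make_class(methods):
    cls = {"vtable": list(methods)}
    def resolve(isa, method_index):
        # The class is free to change this strategy in a future release
        # without breaking the ABI, since only this function is exported.
        return isa["vtable"][method_index]
    cls["resolve"] = resolve  # the single exported per-class symbol
    return cls

def call_site(obj, method_index, *args):
    # Every call site pays an extra indirect call to the resolver,
    # which is the runtime cost noted above relative to #1.
    isa = obj["isa"]
    fn = isa["resolve"](isa, method_index)
    return fn(*args)

C = make_class([lambda x: x.upper()])
instance = {"isa": C}
```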

I think we can generalize this discussion a bit by describing some mostly-independent axes of variation:

I. A call site:
  I1) inlines the lookup
  I2) calls a function that performs the lookup and returns a function pointer
  I3) calls a function that performs the lookup and calls the function
I3 minimizes code size at the call site, but it prevents the lookup (or components thereof) from being hoisted.  I1 requires at least some details of the lookup to be ABI.
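
The difference between I2 and I3 can be shown with two toy helpers (names invented; this is a model, not an ABI proposal):

```python
def lookup_i2(obj, index):
    # I2: performs the lookup and returns a function pointer. The caller
    # invokes it, so the lookup result can be hoisted out of a loop.
    return obj["isa"]["vtable"][index]

def dispatch_i3(obj, index, *args):
    # I3: performs the lookup *and* the call. Smallest call sites, but
    # the lookup can never be separated from the call for hoisting.
    return obj["isa"]["vtable"][index](*args)

cls = {"vtable": [lambda x: x + 10]}
o = {"isa": cls}

# An I2 call site: two steps, and `fn` is reusable across calls.
fn = lookup_i2(o, 0)
```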

II. The function that performs the lookup:
 II1) is emitted lazily
 II2) is emitted eagerly
II2 can use static information to implement the lookup inline without affecting the ABI.  II1 cannot; it must either rely on ABI assumptions or just recursively use another strategy, in which case the function is probably just a call-site code-size optimization.

III. The function that performs the lookup:
  III1) is method-specific
  III2) is class-specific
  III3) is part of the language runtime
III1 requires a lot more functions, but if emitted with the class definition, it allows the lookup to be statically specialized based on the method called; e.g. if a non-open method is never overridden, it can be resolved to a specific function pointer immediately, possibly by just aliasing a symbol to the function definition.  III3 minimizes the number of functions required, but it doesn't allow us to evolve dispatch beyond what's supported by the current language runtime, and it doesn't allow class-specific optimizations (e.g. if the class definition statically knows the offset to its v-table).

IV. The function that performs the lookup:
  IV1) is parameterized by an isa
  IV2) is not parameterized by an isa
IV1 allows the same function to be used for super-dispatch but requires extra work to be inlined at the call site (possibly requiring a chain of resolution function calls).
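
The super-dispatch benefit of IV1 is easy to see in a toy model (Python, invented names): the same lookup function serves ordinary and super dispatch, because the caller chooses which isa to pass.

```python
class Meta:
    """Toy class metadata: vtable inherits from the superclass."""
    def __init__(self, superclass, overrides):
        self.superclass = superclass
        self.vtable = dict(superclass.vtable) if superclass else {}
        self.vtable.update(overrides)

def lookup(isa, method_index):
    # IV1: parameterized by an isa, so it works for both dispatch kinds.
    return isa.vtable[method_index]

Base = Meta(None, {0: lambda: "Base.describe"})
Derived = Meta(Base, {0: lambda: "Derived.describe"})

# Ordinary dispatch: pass the object's dynamic isa.
ordinary = lookup(Derived, 0)()
# Super dispatch: pass the statically-known superclass isa instead.
via_super = lookup(Derived.superclass, 0)()
```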

V. For any particular function or piece of information, it can be accessed:
  V1) directly through a symbol
  V2) through a class-specific table
  V3) through a hierarchy-specific table (e.g. the class object)
V1 requires more global symbols, especially if the symbol is per-method, but doesn't have any index-computation problems, and it's generally a bit more efficient.
V2 allows stable assignment of fixed indexes to entries because of availability-sorting.
V3 does not; it requires some ability to (at least) slide indexes of entries because of changes elsewhere in the hierarchy.
If there are multiple instantiations of a table (perhaps with different information, like a v-table), V2 and V3 can be subject to table bloat.
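
The availability-sort that makes V2's fixed indexes stable can be sketched directly (a model only; the real sort function would be specified by the ABI):

```python
def assign_indices(methods):
    # methods: list of (name, introduced_version). Version availability
    # takes sort precedence; ties are broken by name so the order is
    # deterministic within a single release.
    ordered = sorted(methods, key=lambda m: (m[1], m[0]))
    return {name: i for i, (name, _) in enumerate(ordered)}

v1 = [("load", 1), ("store", 1)]
v2 = v1 + [("atomicLoad", 2)]  # a method added in release 2

old = assign_indices(v1)
new = assign_indices(v2)
```

Because later-introduced methods sort after earlier ones, adding a public method only appends entries; every previously assigned index is preserved.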

So I think your alternatives were:
1. I3, II2, III1, IV2, V1 (for the dispatch function): a direct call to a per-method global function that performs the dispatch.  We could apply V2 to this to decrease the number of global symbols required, but at the cost of inflating the call site and requiring a global variable whose address would have to be resolved at load time.  Has an open question about super dispatch.
2. I1, V3 (for the v-table), V1 (for the global offset): a load of a per-method global variable giving an offset into the v-table.  Joe's suggestion adds a helper function as a code-size optimization that follows I2, II1, III1, IV2.  Again, we could also use V2 for the global offset to reduce the symbol-table costs.
3. I2, II2, III2, IV1, V2 (for the class offset / dispatch mechanism table).  At least I think this is right?  The difference between 3a and 3b seems to be about initialization, but maybe also shifts a lot of code-generation to the call site?
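
For completeness, the "#2b" variant mentioned earlier can be sketched too (Python model, invented names): the class exports a single global, the offset to the start of its v-table within the class object, and the availability-sort lets the caller hard-code the per-method index.

```python
class ClassObject:
    def __init__(self, header_words, methods):
        # A resilient base class may grow, sliding the v-table; only the
        # class definition knows where the v-table starts.
        self.words = list(header_words) + list(methods)
        self.vtable_base_offset = len(header_words)  # the one exported symbol

def call(obj_isa, fixed_method_index, *args):
    # Call site: load the exported base offset, add the compile-time
    # availability-sorted index, then do an indexed load plus a call.
    slot = obj_isa.vtable_base_offset + fixed_method_index
    return obj_isa.words[slot](*args)

cls = ClassObject(["<metadata header>"], [lambda: "m0", lambda: "m1"])
```

This needs only one exported symbol per class, at the cost of forcing every public method into the v-table.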

