<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class=""><br class=""><div><blockquote type="cite" class=""><div class="">On Feb 3, 2017, at 9:37 PM, John McCall <<a href="mailto:rjmccall@apple.com" class="">rjmccall@apple.com</a>> wrote:</div><div class=""><div style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class=""><div class=""><br class=""></div><div class=""><blockquote type="cite" class=""><div class=""><div style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class=""><div class=""><blockquote type="cite" class=""><div class=""><span style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class="">IV. 
The function that performs the lookup:</span><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><span style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class=""> IV1) is parameterized by an isa</span><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><span style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class=""> IV2) is not parameterized by an isa</span><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><span style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class="">IV1 allows the same function to 
be used for super-dispatch but requires extra work to be inlined at the call site (possibly requiring a chain of resolution function calls).</span><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""></div></blockquote><div class=""><br class=""></div><div class="">In my first message I was trying to accomplish IV1. But IV2 is simpler</div><div class="">and I can't see a fundamental advantage to IV1.</div></div></div></div></blockquote><div class=""><br class=""></div><div class="">Well, you can use IV1 to implement super dispatch (+ sibling dispatch, if we add it)</div><div class="">by passing in the isa of either the superclass or the current class. IV2 means</div><div class="">that the dispatch function is always based on the isa from the object, so those</div><div class="">dispatch schemes need something else to implement them.</div><br class=""><blockquote type="cite" class=""><div style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class=""><div class=""><div class=""> Why would it need a lookup chain?</div></div></div></blockquote><div class=""><br class=""></div>Code size, because you might not want to inline the isa load at every call site.</div><div class="">So, for a normal dispatch, you'd have an IV2 function (defined client-side?)</div><div class="">that just loads the isa and calls the IV1 function (defined by the class).</div></div></div></blockquote><div><br class=""></div><div>Right. Looks like I wrote the opposite of what I meant. The important thing to me is that the vtable offset load + check is issued in parallel with the isa load. 
I was originally pushing IV2 for this reason, but now think that optimization could be entirely lazy via a client-side cache.</div><br class=""><blockquote type="cite" class=""><div class=""><div style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class=""><div class=""><blockquote type="cite" class=""><div class=""><div style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class=""><div class=""><blockquote type="cite" class=""><div class=""><span style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class="">V. For any particular function or piece of information, it can be accessed:</span><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><span style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class=""> V1) directly through a symbol</span><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><span style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; 
letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class=""> V2) through a class-specific table</span><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><span style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class=""> V3) through a hierarchy-specific table (e.g. the class object)</span><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><span style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class="">V1 requires more global symbols, especially if the symbol is per-method, but doesn't have any index-computation problems, and it's generally a bit more efficient.</span><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px;" 
class=""><span style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class="">V2 allows stable assignment of fixed indexes to entries because of availability-sorting.</span><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><span style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class="">V3 does not; it requires some ability to (at least) slide indexes of entries because of changes elsewhere in the hierarchy.</span><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><span style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class="">If there are multiple instantiations of a table (perhaps with different information, like a v-table), V2 and V3 can be subject to table bloat.</span><br style="font-family: Helvetica; font-size: 12px; 
font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""></div></blockquote><div class=""><br class=""></div><div class="">I had proposed V2 as an option, but am strongly leaning toward V1 for</div><div class="">ABI simplicity and lower static costs (why generate vtables and offset</div><div class="">tables?)</div></div></div></div></blockquote><div class=""><br class=""></div><div class="">V1 doesn't remove the need for tables, it just hides them from the ABI.</div></div></div></div></blockquote><div><br class=""></div><div>I like that it makes the offset tables lazy and optional. They don’t even need to be complete.</div><br class=""><blockquote type="cite" class=""><div class=""><div style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class=""><div class=""><blockquote type="cite" class=""><div class=""><div style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class=""><div class=""><blockquote type="cite" class=""><div class=""><span style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class="">So I think your alternatives were:</span><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><span style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; 
letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class="">1. I3, II2, III1, IV2, V1 (for the dispatch function): a direct call to a per-method global function that performs the dispatch. We could apply V2 to this to decrease the number of global symbols required, but at the cost of inflating the call site and requiring a global variable whose address would have to be resolved at load time. Has an open question about super dispatch.</span><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><span style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class="">2. I1, V3 (for the v-table), V1 (for the global offset): a load of a per-method global variable giving an offset into the v-table. Joe's suggestion adds a helper function as a code-size optimization that follows I2, II1, III1, IV2. 
Again, we could also use V2 for the global offset to reduce the symbol-table costs.</span><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><span style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class="">3. I2, II2, III2, IV1, V2 (for the class offset / dispatch mechanism table). At least I think this is right? The difference between 3a and 3b seems to be about initialization, but maybe also shifts a lot of code-generation to the call site?</span></div></blockquote></div><br class=""><div class=""><div class="">I'll pick the following option as a starting point because it constrains the ABI the least in</div><div class="">terms of static costs and potential directions for optimization:</div><div class=""><br class=""></div><div class="">"I2; (II1+II2); III2; IV1; V1"</div><div class=""><br class=""></div><div class="">method_entry = resolveMethodAddress_ForAClass(isa, method_index, &vtable_offset)</div><div class=""><br class=""></div><div class="">(where both modules would need to opt into the vtable_offset.)</div></div></div></div></blockquote><div class=""><br class=""></div>Wait, remind me what this &vtable_offset is for at this point? Is it basically just a client-side cache? I can't figure out what it's doing for us.</div></div></div></blockquote><div><br class=""></div><div>It’s a client-side cache that can be checked in parallel with the `isa` load. The resolver is not required to provide an offset, and the client does not need to cache all the method offsets. 
It does burn an extra register, but gains the ability to implement vtable dispatch entirely on the client side.</div><div><br class=""></div><div>You might be thinking of caching the method entry itself and checking `isa` within `resolveMethod`. I didn’t mention that possibility because the cost of calling the non-local `resolveMethod` function followed by an indirect call largely defeats the purpose of something like an inline-cache.</div><br class=""><blockquote type="cite" class=""><div class=""><div style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class=""><div class=""><blockquote type="cite" class=""><div class=""><div style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class=""><div class=""><div class="">I think any alternative would need to be demonstrably better in terms of code size or dynamic dispatch cost.</div></div></div></div></blockquote><div class=""><br class=""></div>That's a lot of stuff to materialize at every call site. It makes calls</div><div class="">into something like a 10-instruction sequence on ARM64, ignoring</div><div class="">the actual formal arguments:</div><div class=""><br class=""></div><div class=""><font face="Menlo" class=""> %raw_isa = load %object // 1 instruction</font></div><div class=""><font face="Menlo" class=""> %isa_mask = load @swift_isaMask // 3: 2 to materialize address from GOT (not necessarily within ±1MB), 1 to load from it</font></div><div class=""><font face="Menlo" class=""> %isa = and %raw_isa, %isa_mask // 1</font></div><div class=""><font face="Menlo" class=""> %method_index = 13 // 1</font></div><div class=""><font face="Menlo" class=""> %cache = @local.A.foo.cache // 2: not necessarily within ±1MB</font></div><div class=""><font face="Menlo" class=""> %method = call @A.resolveMethod(%isa, %method_index, %cache) // 1</font></div><div class=""><font face="Menlo" class=""> call %method(...) 
// 1</font></div><div class=""><br class=""></div><div class="">On x86-64, it'd just be 8 instructions because the immediate range for leaq/movq</div><div class="">is ±2GB, which is Good Enough for the standard code model, but of course it still</div><div class="">expands to roughly the same amount of code.</div><div class=""><br class=""></div><div class="">Even without vtable_offset, it's a lot of code to inline.</div><div class=""><br class=""></div><div class="">So we'd almost certainly want a client-side resolver function that handled</div><div class="">the normal case. Is that what you mean when you say II1+II2? So the local</div><div class="">resolver would be I2; II1; III2; IV2; V1, which leaves us with a three-instruction</div><div class="">call sequence, which I think is equivalent to Objective-C, and that function</div><div class="">would do this sequence:</div><div class=""><font face="Menlo" class=""><br class=""></font></div><div class=""><font face="Menlo" class="">define @local_resolveMethodAddress(%object, %method_index)</font></div><div class=""><font face="Menlo" class=""> %raw_isa = load %object // 1 instruction</font></div><div class=""><font face="Menlo" class=""> %isa_mask = load @swift_isaMask // 3: 2 to materialize address from GOT (not necessarily within ±1MB), 1 to load from it</font></div><div class=""><font face="Menlo" class=""> %isa = and %raw_isa, %isa_mask // 1</font></div><div class=""><font face="Menlo" class=""> %cache_table = @local.A.cache_table // 2: not necessarily within ±1MB</font></div><div class=""><font face="Menlo" class=""> %cache = add %cache_table, %method_index * 8 // 1</font></div><div class=""><font face="Menlo" class=""> tailcall @A.resolveMethod</font><span style="font-family: Menlo;" class="">(%isa, %method_index, %cache)</span><span style="font-family: Menlo;" class=""> // 1</span></div><div class=""><br class=""></div><div class="">John.</div></div></div></blockquote><br class=""></div><div>Yes, exactly, except 
we haven’t even done any client-side vtable optimization yet.</div><div><br class=""></div><div>To me the point of the local cache is to avoid calling @A.resolveMethod in the common case. So we need another load-compare-and-branch, which makes the local helper 12-13 instructions. Then you have the vtable load itself, so that’s 13-14 instructions. You would be saving on dynamic instructions but paying with 4 extra static instructions per class.</div><div><br class=""></div><div>It would be lame if we couldn't force <span style="font-family: Menlo;" class="">@local.A.cache_table to be within </span><span style="font-family: Menlo;" class="">±1MB of the helper.</span></div><div><br class=""></div><div>Inlining the cache table address might be worthwhile because %method_index would then be an immediate and hoisted to the top of the function.</div><div><br class=""></div><div>-Andy</div><div><br class=""></div><br class=""></body></html>