[swift-dev] Continuation-based versus nested-function-based materializeForSet

Joe Groff jgroff at apple.com
Wed Nov 2 16:54:05 CDT 2016


> On Nov 1, 2016, at 3:32 PM, John McCall <rjmccall at apple.com> wrote:
> 
>> On Oct 31, 2016, at 12:22 PM, Joe Groff via swift-dev <swift-dev at swift.org> wrote:
>> We currently abstract over mutable property accesses using what I’ll call a continuation-based model–the materializeForSet accessor is called before an inout access, and returns a continuation callback that must be called when the inout access is completed. I think a nested-function-based model, if designed correctly, has the potential to be more efficient and easier to implement.
> 
> Well, I'm not totally convinced about either of those two points, but it's definitely something we need to talk about.
> 
> Let me try to break this down.  I think I'm going to repeat you a lot in different words.
> 
> We have a coroutine-esque problem:
>  0. Outer is executing.
>  1. Outer calls Inner.
>  2. Inner executes for a while.
>  3. Inner passes control back to Outer without abandoning its execution context.
>  4. Outer executes for a while.
>  5. Outer passes control back to Inner, potentially having changed its own execution context.
>  6. Inner executes for a while.
>  7. Inner finally abandons its execution context and passes control back to Outer.
>  8. Outer is executing.
> 
> Note that there might be multiple static points for (3) and (5), each requiring a different resumption point (and creating different context changes) at (5) and (7) respectively.
> 
> (I'm going to use the terms "standard spilling", "subfunction spilling", and "transfer spilling" without defining them.  I'll come back to this.)
> 
> I. Lowerings
> 
> Currently, as you say, we do this with a sort of CPS conversion where Inner is split into multiple sub-functions.  The initial function is passed a (necessarily fixed-size, opaque to Outer) buffer; at (3), it saves its context into that buffer and returns a sub-function that Outer will invoke (passing the same buffer) at point (5); at (5), that sub-function restores its context from that buffer; and finally the sub-function returns normally at (7).  This has two advantages: Outer can reason locally about the data flow from (0) to (4) to (8), and Inner can return a null pointer at (3) to indicate that there's no work to do in (6) and thus save an indirect branch (at the cost of a local conditional branch).  However, if Inner has interesting context to save at (3), that requires extra code to manage that data flow, and the amount of data to save can overflow the buffer and therefore require malloc.  Total branching cost: the original call at (1), the easily-predicted return at (3), the conditional branch at (5), the possible indirect call at (5), and the easily-predicted return at (7).  Total save/restore cost: standard spilling at (1-3) and (5-7), transfer spilling at (3-5), possible malloc.
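
To make the shapes concrete, here's a rough plain-Swift sketch of the continuation-style contract. The names are illustrative only, not the actual materializeForSet ABI or its lowered calling convention, and the returned closure stands in for the fixed-size buffer plus continuation pointer:

    final class ContinuationStyle {
        private let storage = UnsafeMutablePointer<Int>.allocate(capacity: 1)
        private var writeCount = 0

        init() { storage.initialize(to: 0) }
        deinit {
            storage.deinitialize(count: 1)
            storage.deallocate()
        }

        // Inner: runs (2)-(3), hands Outer an address to mutate at (4) plus
        // an optional continuation for (5)-(7). A nil continuation means
        // there is no work at (6), so Outer pays only a local conditional
        // branch.
        func startMutation() -> (address: UnsafeMutablePointer<Int>,
                                 continuation: (() -> Void)?) {
            // Whatever Inner needs to carry across (3)-(5) is captured here;
            // in the real lowering it would be spilled into the opaque
            // buffer, falling back to malloc if it doesn't fit.
            return (storage, { self.writeCount += 1 })
        }
    }

    // Outer: (1) begin, (4) mutate locally, then (5)-(7) finish if needed.
    let access = ContinuationStyle()
    let (address, continuation) = access.startMutation()
    address.pointee = 42                       // (4): the inout access
    if let finish = continuation { finish() }  // (5): test, then possible indirect call
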
> 
> The simplest alternative proposal is a callback conversion, extracting (4) from Outer as a sub-function which is passed to Inner.  Inner now gets an ordinary stack frame to save its context, so its data flow and control flow are easier for the backend to reason about, and malloc is not required.  However, the data/control flow in Outer is significantly obscured; most importantly, different static points for (5) may need to resume to different instructions at (7) (which the continuation lowering implicitly handles by duplicating the call in (5)), and Outer's context will be modified at various points from (1-7).  Total branching cost: the original call at (1), the indirect call at (3), the easily-predicted return at (5), the easily-predicted return at (7), the possible switch-out at (7).  Total save/restore cost: subfunction spilling at (1-7), standard spilling at (3-5), only stack allocations.
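
And the same access in the callback (nested-function) style, again with illustrative names only; here Inner keeps its context in an ordinary stack frame and Outer's (4) is the closure argument:

    final class CallbackStyle {
        private let storage = UnsafeMutablePointer<Int>.allocate(capacity: 1)
        private var writeCount = 0

        init() { storage.initialize(to: 0) }
        deinit {
            storage.deinitialize(count: 1)
            storage.deallocate()
        }

        // Inner: (2) any setup lives in this frame, (3) call back into
        // Outer's extracted (4), then (6) finish. No continuation buffer
        // and no malloc.
        func mutate(_ body: (UnsafeMutablePointer<Int>) -> Void) {
            body(storage)    // (3)-(5): indirect call into Outer's sub-function
            writeCount += 1  // (6): the work the continuation model would put
                             // behind the returned callback
        }
    }

    // Outer: (4) becomes a closure; its captures are what "subfunction
    // spilling" has to keep reachable across (1)-(7).
    let resource = CallbackStyle()
    let newValue = 42
    resource.mutate { $0.pointee = newValue }
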
> 
> Note that the call at (5) needs to be duplicated for continuation lowering in exactly those cases where we need a switch-out at (7) in callback conversion.
> 
> Your observation is that, if Outer needs to perform a nested access with this sequence at (4), a naive callback conversion will have a lot of branches.  Let the outer access be (A), with its steps being (A.0-8), and the inner access be (B), with its steps being (B.0-8), so we have (A.4) := (B.0-8).  If the nesting is simple — i.e. there is no code in B.0 or B.8 — and the initiation sequences are compatible — i.e. you can convince (A.3) to directly make the original call for (B.1) — then it is unnecessary to extract a sub-function for (A.4), and instead (A.1) can just set up the chain of calls.  Total branching cost: the original call at (A.1), the indirect call at (A.3 / B.1), the indirect call at (B.3), the easily-predicted return at (B.5), the easily-predicted return at (B.7 / A.5), the easily-predicted return at (A.7), the possible switch-out at (A.7).  Total save/restore cost: subfunction spilling at (A.1-7), standard spilling at (A.3-5), standard spilling at (B.3-5), only stack allocations.  But it does have some additional set-up overhead, and it expects a lot from the indirect call at (3); in particular, to get this to work well in practice, we really have to change all entrypoints for (1) so that they're callable in exactly the same way, i.e. taking a context argument, a "next callback" pointer, and an array of extra argument data for the types, indices, whatever.
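
A rough sketch of that chained shape, loosely modeled on the uniform-entrypoint idea above (a context argument, the caller's body, and a "next" entrypoint); the extra argument data is omitted, and none of these names correspond to an actual calling convention:

    typealias AccessBody = (UnsafeMutablePointer<Int>) -> Void
    typealias AccessEntry = (UnsafeMutableRawPointer?, AccessBody) -> Void

    // The innermost access (B): ordinary callback style, with its context in
    // its own frame.
    func innerAccess(_ context: UnsafeMutableRawPointer?,
                     _ body: @escaping AccessBody) {
        let storage = UnsafeMutablePointer<Int>.allocate(capacity: 1)  // (B.2)
        storage.initialize(to: 0)
        body(storage)                    // (B.3)-(B.5)
        storage.deinitialize(count: 1)   // (B.6): finish the access
        storage.deallocate()
    }

    // The outer access (A) with simple nesting: (A.4) is exactly (B.0-8), so
    // (A.3) forwards the caller's body to the next entrypoint in the chain
    // instead of extracting another sub-function for (A.4).
    func outerAccess(_ context: UnsafeMutableRawPointer?,
                     _ body: @escaping AccessBody,
                     next: AccessEntry) {
        next(context, body)              // (A.3 / B.1): one indirect call
    }

    // Caller: sets up the chain once at (A.1).
    outerAccess(nil, { $0.pointee = 42 }, next: innerAccess)
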
> 
> II. Spilling
> 
> "Standard spilling" means ordinary compiler behavior to save values across a call; it includes putting values in callee-save registers.  The compiler has a lot of intelligence about this.
> 
> "Subfunction spilling" is like standard spilling except the values must be accessible to a sub-function.  The compiler has some very limited intelligence for this; I'm confident that that intelligence can grow over time, but it's not much right now.  This precludes using callee-save registers unless, like you suggest, we promise to preserve them when making the callback; and note that "preserve" here really means "expect our caller and our callee to have a private contract about these registers that we shouldn't disturb" — in particular, our caller might set the register, our callee might expect to see that value *and change it*, and our caller might expect to see that changed value.  This is completely implementable, but it's not how the backend is really engineered right now.
> 
> "Transfer spelling" means saving values across a shift in execution contexts, given only a buffer.  The compiler is not well-engineered for this; we'd probably always be lowering to sub-functions early on, and doing so will block a lot of value optimizations in LLVM.
> 
> III. Implementation
> 
> I think SIL should clearly work with structured, unified Outer and Inner functions here, freely able to move operations between them without having to worry about what goes where.  That means that we have to do some sort of lowering post-SIL (or at least in some low-level late lowering, rather than in SILGen).  I expect that that alone will greatly improve our current implementation.
> 
> In order to do this late lowering, we'll need some structural restrictions on SIL that should be in line with many of the other restrictions we're imposing with Semantic SIL.  Given that, I don't think that it's particularly harder to extract a continuation vs. extracting an inner sub-function.
> 
> IV. Summary
> 
> So to sum up:
> 
> Our current continuation-style lowering is good when:
>  - there's a lot of data flow from 0 to 4 or from 4 to 8,
>  - there's complex control flow from 4 to 8, or
>  - the inner function has no code in 6, which is common for stored/addressed accesses.
> 
> A callback-style lowering is good when:
>  - there's a lot of data flow from 2 to 6 or
>  - a zero-allocation implementation is required.
> 
> A chained callback-style lowering is good when:
>  - callback-style lowerings are good and
>  - simply nested accesses are common enough to pay for the slight additional overhead.
> 
> Honestly, I have a hard time seeing the callback trade-offs as being the right ones in general.  Removing the malloc is required for true systems programming, but when that guarantee isn't needed, I think we can provide a large enough buffer to make allocation basically a non-factor.

OK, if you're not convinced that chained callbacks are a win, then it's kind of a wash. I agree that we'd want to represent the caller and callee as independent coroutines in SIL either way, so the trade-offs come down to how we lower to readily-available LLVM abstractions. Either approach's drawbacks could be mitigated with enough implementation effort: we could avoid mallocing in the continuation-style lowering by manipulating the stack pointer across calls to make room for extra state, and conversely, for the callback-style lowering, we could increase the 0-to-4-to-8 bandwidth by saving more registers when switching into the callee's context. One possibly small advantage the callback style still holds is that (3) can be a tail call when the callee has no code in 6 (the stored/addressed case); that saves the test and branch the caller has to perform when we represent that case as a nullable continuation return in the continuation-style model.
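
For concreteness, here's a hypothetical look at that stored/addressed case in both styles, with plain Swift standing in for the lowered code; whether the call is actually emitted as a tail call is of course up to the optimizer, not something the source language guarantees:

    // Continuation style: nothing to do at (6), so (3) returns a nil
    // continuation and the caller still pays a test and branch at (5).
    func storedBegin(_ storage: UnsafeMutablePointer<Int>)
        -> (address: UnsafeMutablePointer<Int>, continuation: (() -> Void)?) {
        return (storage, nil)
    }

    // Callback style: Inner has no code after the call into Outer's body, so
    // that call is in tail position and can be lowered as a jump.
    func storedMutate(_ storage: UnsafeMutablePointer<Int>,
                      _ body: (UnsafeMutablePointer<Int>) -> Void) {
        body(storage)
    }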

-Joe

