[swift-evolution] Contextualizing async coroutines

Pierre Habouzit phabouzit at apple.com
Fri Sep 1 17:13:33 CDT 2017


-Pierre

> On Aug 31, 2017, at 9:12 PM, Joe Groff <jgroff at apple.com> wrote:
> 
> 
> 
> On Aug 31, 2017, at 7:50 PM, Pierre Habouzit <phabouzit at apple.com <mailto:phabouzit at apple.com>> wrote:
> 
>>> On Aug 31, 2017, at 11:35 AM, Joe Groff via swift-evolution <swift-evolution at swift.org <mailto:swift-evolution at swift.org>> wrote:
>>> 
>>> The coroutine proposal as it stands essentially exposes raw delimited continuations. While this is a flexible and expressive feature in the abstract, for the concrete purpose of representing asynchronous coroutines, it provides weak user-level guarantees about where their code might be running after being resumed from suspension, and puts a lot of pressure on APIs to be well-behaved in this respect. And if we're building toward actors, where async actor methods should be guaranteed to run "in the actor", I think we'll *need* something more than the bare-bones delimited continuation approach to get there. I think the proposal's desire to keep coroutines independent of a specific runtime model is a good idea, but I also think there are a couple possible modifications we could add to the design to make it easier to reason about what context things run in for any runtime model that benefits from async/await:
>>> 
>>> # Coroutine context
>>> 
>>> Associating a context value with a coroutine would let us thread useful information through the execution of the coroutine. This is particularly useful for GCD, so you could attach a queue, QoS, and other attributes to the coroutine, since these aren't reliably available from the global environment. It could be a performance improvement even for things like per-pthread queues, since coroutine context should be cheaper to access than pthread_self. 
>> 
>>> [...]
>> 
>> 
>> YES!
>> 
>> We need that. You're very focused on performance and affinity and whatnot here, but knowing where the completion will run upfront is critical for priority inheritance purposes.
>> 
>> This is exactly the spirit of the mail I just wrote in reply to Chris a bit earlier tonight. Execution context matters to the OS, a lot.
>> 
>> The OS needs to know two things:
>> - where is the precursor of this coroutine (which work is preventing the coroutine from executing)
>> - where will the coroutine go (which for GCD is critical, because the OS assigns threads lazily, so the typical OS primitives for raising an existing thread's priority don't work)
>> 
>> In other words, a coroutine needs:
>> - various tags (QoS, logging context, ...)
>> - precursors / reverse dependencies
>> - where it will execute (whether it's a dispatch queue or a runloop is completely irrelevant though).
>> 
>> 
>> And then, if you do it that way, when the precursor fires and allows your coroutine to be scheduled, the runtime can actually schedule it right away on the right execution context and minimize context switches (which are far worse for your performance than shared mutable state).
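As a strawman, the three pieces of information listed above could be grouped into a single context record; all of the names here are hypothetical, purely to illustrate the shape:

```swift
import Dispatch

// Hypothetical context record grouping the three things a coroutine
// needs, per the list above: tags, precursors, and a target executor.
// None of these names exist in the proposal.
struct CoroutineContext {
    var qos: DispatchQoS        // tags: QoS, logging context, ...
    var precursors: [AnyObject] // work preventing this coroutine from executing
    var executor: DispatchQueue // where the coroutine will run
}
```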
>> 
>> 
>>> # `onResume` hooks
>>> 
>>> Relying on coroutine context alone still leaves responsibility wholly on suspending APIs to pay attention to the coroutine context and schedule the continuation correctly. You'd still have the expression problem when coroutine-spawning APIs from one framework interact with suspending APIs from another framework that doesn't understand the spawning framework's desired scheduling policy. We could provide some defense against this by letting the coroutine control its own resumption with an "onResume" hook, which would run when a suspended continuation is invoked instead of immediately resuming the coroutine. That would let the coroutine-aware dispatch_async example from above do something like this, to ensure the continuation always ends up back on the correct queue:
>>> 
>>> extension DispatchQueue {
>>>   func `async`(_ body: @escaping () async -> ()) {
>>>     self.async {
>>>       beginAsync(
>>>         context: self,
>>>         body: { await body() },
>>>         onResume: { continuation in
>>>           // Defensively hop back to the right queue
>>>           self.async(execute: continuation)
>>>         }
>>>       )
>>>     }
>>>   }
>>> }
>>> 
>>> This would let spawning APIs provide a stronger guarantee that the spawned coroutine is always executing as if scheduled by a specific queue/actor/event loop/HWND/etc., even if later suspended by an async API working in a different paradigm. This would also let you more strongly associate a coroutine with a future object representing its completion:
>>> 
>>> class CoroutineFuture<T> {
>>>   enum State {
>>>     case busy                // currently running
>>>     case suspended(() -> ()) // suspended, holding the continuation
>>>     case success(T)          // completed with success
>>>     case failure(Error)      // completed with error
>>>   }
>>> 
>>>   var state: State = .busy
>>> 
>>>   init(_ body: @escaping () async throws -> T) {
>>>     beginAsync(
>>>       body: {
>>>         do {
>>>           self.state = .success(try await body())
>>>         } catch {
>>>           self.state = .failure(error)
>>>         }
>>>       },
>>>       onResume: { continuation in
>>>         guard case .busy = self.state else {
>>>           preconditionFailure("already running?!")
>>>         }
>>>         self.state = .suspended(continuation)
>>>       }
>>>     )
>>>   }
>>> 
>>>   // Return the result of the future, or try to make progress computing it
>>>   func poll() throws -> T? {
>>>     switch state {
>>>     case .busy:
>>>       return nil
>>>     case .suspended(let cont):
>>>       cont()
>>>       switch state {
>>>       case .success(let value):
>>>         return value
>>>       case .failure(let error):
>>>         throw error
>>>       case .busy, .suspended:
>>>         return nil
>>>       }
>>>     case .success(let value):
>>>       return value
>>>     case .failure(let error):
>>>       throw error
>>>     }
>>>   }
>>> }
>>> 
>>> 
>>> A downside of this design is that it incurs some cost from defensive rescheduling on the continuation side, and also prevents writing APIs that intentionally change context across an `await`, like a theoretical "goToMainThread()" function (though you could do that by spawning a semantically-independent coroutine associated with the main thread, which might be a better design anyway).
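A strawman sketch of the suggested alternative to a "goToMainThread()" function, using the proposal's hypothetical `beginAsync` entry point; `onMainQueue` is an invented name:

```swift
// Strawman sketch (beginAsync is the proposal's hypothetical API):
// rather than changing context across an await, spawn a separate,
// semantically-independent coroutine that is pinned to the main
// queue from the start.
func onMainQueue(_ body: @escaping () async -> ()) {
    DispatchQueue.main.async {
        beginAsync { await body() }
    }
}
```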
>> 
>> Given the limitations, I'm very skeptical. Also, in general, suspending/resuming work is very difficult for a runtime to handle (implementation-wise), has large memory costs, and breaks priority-inversion avoidance. dispatch_suspend()/dispatch_resume() is one of the banes of my existence when it comes to the dispatch API surface. It only makes sense for dispatch sources, where "I don't want to receive these events anymore for a while" is a perfectly valid thing to say or do. But suspending a queue or a work item is ripping the carpet out from under the feet of the OS: you basically make all the work that depends on the suspended one invisible and impossible to reason about.
> 
> Sorry, I was using the term 'suspend' somewhat imprecisely. I was specifically referring to an operation that semantically pauses the coroutine and gives you its continuation closure, to be handed off as a completion handler or something of that sort, not something that would block the thread or suspend the queue. Execution would return back up the non-async layer at the point this happens. 

I wasn't worried about "blocking the thread", but about an anonymous suspend/resume like dispatch has, where you can't predict who will do the "resume".
I'm fine with a token that has a unique owner who will call it, because tracking who references it tells you who will resume the coroutine, and who needs to receive an override in a priority-aware world if needed.

I think I'd rather have this operation called yield than suspend, because it makes obvious that you can't yield twice, and that it is a decision the actor makes from its own code, not something an external entity has the power to do to you from the outside (the way dispatch queues can be suspended). That's clearly what threw me off here ;)
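A minimal sketch of such a single-use yield token; the type and its names are hypothetical, only meant to show the "unique owner, can't yield twice" shape:

```swift
// Hypothetical single-use "yield token": ownership is explicit, so the
// runtime can tell who will resume the coroutine (and who should receive
// a priority override), and a trap makes it impossible to resume twice.
final class ResumeToken {
    private var continuation: (() -> ())?

    init(_ continuation: @escaping () -> ()) {
        self.continuation = continuation
    }

    func resume() {
        guard let cont = continuation else {
            preconditionFailure("resume token already consumed")
        }
        continuation = nil // consume the token before running
        cont()
    }
}
```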

> 
> -Joe
> 
>> 
>> The proper way to do something akin to suspension is really to "fail" your operation with a "you need to redrive me later" error, or to implement an event-monitoring system inside the subsystem providing the Actor that wants suspension, so that the client handles the redrive/monitoring; this way the priority relationship is established and the OS can reason about it. Said another way, the Actor should fail with an error that carries some kind of "resume token" that the requestor can hold and redrive according to its own rules, in a way that makes it clear it is the waiter. Most of the time, suspension is a waiting-on-behalf-of relationship, and that is a bad thing to build (except in priority-homogeneous environments, which iOS/macOS are *not*).
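A sketch of the "fail with a redrive token" shape described above; the error type and its member are invented names, purely illustrative:

```swift
// Hypothetical error type: instead of suspending, the actor fails and
// hands back a redrive closure. The caller holds it and redrives under
// its own rules, so the waiter relationship stays visible to the OS.
struct NeedsRedrive: Error {
    let redrive: () -> ()
}
```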
>> 
>> Also, implementing the state you described requires more synchronization than is useful: if you want to take action after observing a state, then you really really really don't want that state to change while you perform the consequence. The "on$Event" hook approach (which dispatch uses for dispatch sources, e.g.) is much better, because ordering and serialization are provided by the actor itself. The only states that are valid to expose as getters are states you cannot go back from: success, failure, and canceled are all perfectly fine states to expose as getters because they only change state once. .suspended/.busy is not such a thing.
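A sketch of that split between one-shot getters and hooks; all names here are hypothetical:

```swift
// Sketch (hypothetical names): only irreversible, terminal states are
// exposed through a getter; transient states ("busy", "suspended") are
// reported via a hook the actor invokes itself, so observation is
// serialized by the actor and cannot race with a state change.
final class Op<T> {
    enum Terminal {
        case success(T), failure(Error), canceled
    }
    private(set) var result: Terminal? // nil until done; set exactly once
    var onEvent: ((String) -> ())?     // transient states via callback only
}
```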
>> 
>> FWIW, dispatch sources, and more importantly dispatch mach channels (the private interface used to implement XPC Connections), have a design that tries really really really hard not to fall into any of these pitfalls: they are priority-inheritance friendly, execute on *distributed* execution contexts, and have a state machine exposed through "on$Event" callbacks. We should benefit from the many years of experience condensed in these implementations when thinking about Actors and the primitives they provide.
>> 
>> -Pierre
