[swift-evolution] [Concurrency] async/await + actors

Sun Sep 3 14:44:44 CDT 2017

> On Sep 3, 2017, at 10:26 AM, Chris Lattner via swift-evolution <swift-evolution at swift.org> wrote:
> 
> On Sep 2, 2017, at 11:09 PM, Pierre Habouzit <phabouzit at apple.com <mailto:phabouzit at apple.com>> wrote:
>>> On Sep 2, 2017, at 12:19 PM, Pierre Habouzit <pierre at habouzit.net <mailto:pierre at habouzit.net>> wrote:
>>>>>> What do you mean by this?
>>>>> 
>>>>> My understanding is that GCD doesn’t currently scale to 1M concurrent queues / tasks.
>>>> 
>>>> It completely does provided these 1M queues / tasks are organized on several well known independent contexts.
>>> 
>>> Ok, I stand corrected.  My understanding was that you could run into situations where you get stack explosions, fragment your VM and run out of space, but perhaps that is a relic of 32-bit systems.
>> 
>> a queue on 64bit systems is 128 bytes (nowadays). Provided you have that amount of VM available to you (1M queues is 128M after all) then you're good.
>> If a large amount of them fragments the VM beyond this is a malloc/VM bug on 64bit systems that are supposed to have enough address space.
> 
> Right, I was referring to the fragmentation you get by having a large number of 2M stacks allocated for each kernel thread.  I recognize the queues themselves are small.

Yes, this is still poor, if every single queue needs to make a thread at the same time, we've lost. Whatever abstraction you pick on top it'll be true: physical OS threads (pthreads or workqueue threads) are inherently expensive and don't scale well.

>> What doesn't scale is asking for threads, not having queues.
> 
> Right, agreed.
> 
>>> Agreed, to be clear, I have no objection to building actors on top of (perhaps enhanced) GCD queues.  In fact I *hope* that this can work, since it leads to a naturally more incremental path forward, which is therefore much more likely to actually happen.
>> 
>> Good :)
> 
> I meant to be pretty clear about that all along, but perhaps I missed the mark.  In any case, I’ve significantly revised the “scalable runtime” section of the doc to reflect some of this discussion, please let me know what you think:
> https://gist.github.com/lattner/31ed37682ef1576b16bca1432ea9f782#scalable-runtime <https://gist.github.com/lattner/31ed37682ef1576b16bca1432ea9f782#scalable-runtime>
I'm much happier with this new wording which is IMO a very fair assessment. thanks for rewording it.

> 
>> 
>>>> My currently not very well formed opinion on this subject is that GCD queues are just what you need with these possibilities:
>>>> - this Actor queue can be targeted to other queues by the developer when he means for these actor to be executed in an existing execution context / locking domain,
>>>> - we disallow Actors to be directly targeted to GCD global concurrent queues ever
>>>> - for the other ones we create a new abstraction with stronger and better guarantees (typically limiting the number of possible threads servicing actors to a low number, not greater than NCPU).
>>> 
>>> Is there a specific important use case for being able to target an actor to an existing queue?  Are you looking for advanced patterns where multiple actors (each providing disjoint mutable state) share an underlying queue? Would this be for performance reasons, for compatibility with existing code, or something else?
>> 
>> Mostly for interaction with current designs where being on a given bottom serial queue gives you the locking context for resources naturally attached to it.
> 
> Ok.  I don’t understand the use-case well enough to know how we should model this.  For example, is it important for an actor to be able to change its queue dynamically as it goes (something that sounds really scary to me) or can the “queue to use” be specified at actor initialization time?

I think I need to read more on actors, because the same way you're not an OS runtime expert, I'm not (or rather no longer, I started down that path a lifetime ago) a language expert at all, and I feel like I need to understand your world better to try to explain this part better to you.

> 
>>> One plausible way to model this is to say that it is a “multithreaded actor” of some sort, where the innards of the actor allow arbitrary number of client threads to call into it concurrently.  The onus would be on the implementor of the NIC or database to implement the proper synchronization on the mutable state within the actor.
>> 
>> I think what you said made sense.
> 
> Ok, I captured this in yet-another speculative section:
> https://gist.github.com/lattner/31ed37682ef1576b16bca1432ea9f782#intra-actor-concurrency <https://gist.github.com/lattner/31ed37682ef1576b16bca1432ea9f782#intra-actor-concurrency>
Great. BTW I agree 100% with:

That said, this is definitely a power-user feature, and we should understand, build, and get experience using the basic system before considering adding something like this.

Private concurrent queues are not a success in dispatch and cause several issues, these queues are second class citizens in GCD in terms of feature they support, and building something with concurrency *within* is hard. I would keep it as "that's where we'll go some day" but not try to attempt it until we've build the simpler (or rather less hard) purely serial case first.

> 
> 
>> But it wasn't what I meant. I was really thinking at sqlite where the database is strongly serial (you can't use it in a multi-threaded way well, or rather you can but it has a big lock inside). It is much better to interact with that dude on the same exclusion context all the time. What I meant is really having some actors that have a "strong affinity" with a given execution context which eases the task of the actor scheduler.
> 
> Ah ok.  Yes, I think that wrapping a “subsystem with a big lock” in an actor is a very natural thing to do, just as much as it makes sense to wrap a non-threadsafe API in an actor.  Any internal locking would be subsumed by the outer actor queue, but that’s ok - all the lock acquires would be uncontended and fast :)
> 
>> Another problem I haven't touched either is kernel-issued events (inbound IPC from other processes, networking events, etc...). Dispatch for the longest time used an indirection through a manager thread for all such events, and that had two major issues:
>> 
>> - the thread hops it caused, causing networking workloads to utilize up to 15-20% more CPU time than an equivalent manually made pthread parked in kevent(), because networking pace even when busy idles back all the time as far as the CPU is concerned, so dispatch queues never stay hot, and the context switch is not only a scheduled context switch but also has the cost of a thread bring up
>> 
>> - if you deliver all possible events this way you also deliver events that cannot possibly make progress because the execution context that will handle them is already "locked" (as in busy running something else.
>> 
>> It took us several years to get to the point we presented at WWDC this year where we deliver events directly to the right dispatch queue. If you only have very anonymous execution contexts then all this machinery is wasted and unused. However, this machinery has been evaluated and saves full percents of CPU load system-wide. I'd hate for us to go back 5 years here.
> 
> I don’t have anything intelligent to say here, but it sounds like you understand the issues well :-)  I agree that giving up 5 years of progress is not appealing.

TBH our team has to explain into more depth how eventing works on Darwin, the same way we (or maybe it's just I, I don't want to disparage my colleagues here :P) need to understand Actors and what they mean better, I think the swift core team (and whoever works on concurrency) needs to be able to understand what I explained above.

> 
>> Declaring that an actor targets a given existing serial context also means that if that actor needs to make urgent progress the context in question has to be rushed, and its priority elevated. It's really hard to do the same on an anonymous global context (the way dispatch does it still is to actually enqueue stealer work that try to steal the "actor" at a higher priority. this approach is terribly wasteful).
> 
> Ok.
> 
>> It is clear today that it was a mistake and that there should have been 3 kind of queues:
>> - the global queues, which aren't real queues but represent which family of system attributes your execution context requires (mostly priorities), and we should have disallowed enqueuing raw work on it
>> - the bottom queues (which GCD since last year tracks and call "bases" in the source code) that are known to the kernel when they have work enqueued
>> - any other "inner" queue, which the kernel couldn't care less about
>> 
>> In dispatch, we regret every passing day that the difference between the 2nd and 3rd group of queues wasn't made clear in the API originally.
> 
> Well this is an opportunity to encourage the right thing to happen.  IMO it seems that the default should strongly be for an “everyday actor” to be the third case, which is what developers will reach for when building out their app abstractions.

Correct.

> We can provide some API or other system that allows sufficiently advanced developers to tie an actor into the second bucket, and we can disallow the first entirely if that is desirable (e.g. by a runtime trap if nothing else).

Yes, I would be pleased with something like that, this is exactly what I'm advocating.

>> I like to call the 2nd category execution contexts, but I can also see why you want to pass them as Actors, it's probably more uniform (and GCD did the same by presenting them both as queues). Such top-level "Actors" should be few, because if they all become active at once, they will need as many threads in your process, and this is not a resource that scales. This is why it is important to distinguish them. And like we're discussing they usually also wrap some kind of shared mutable state, resource, or similar, which inner actors probably won't do.
> 
> The thing I still don’t understand is where #2 comes from: are these system defined queues that the developer interacts with, or are these things that developers define in their code?  How does the kernel know about them if the developer defines new ones?

This is, sir, a very good question, and I think anyone working on Actors will need to understand most of this ;) You "just" need to understand most of the kevent and workq integration to answer this. It ties into what I was saying about kernel issued events above. But to try to give you a sense quickly, there are two cases, pure queues, and queues with kernel events registrations:

1) for the former case, the kernel knows just in time when the queue is woken up (the term GCD uses to describe the idle -> non empty transition for a queue), because the kernel never needs to know before that point. The GCD runtime will evaluate that the queue is rooted at "the bottom", and will make it know to the kernel then.

2) for the second case, queue hierarchies that have event monitoring setup (dispatch sources, xpc connections) are registered upfront with the kernel. The evaluation of which is the bottom queue an event is handled on is done by the GCD runtime when the first registration happens. This allows for both kernel and user spaces to the thread requests on behalf of this hierarchy and for these requests to coalesce in kernel. (this is btw awesome technology that has been very difficult in the making FWIW, but I digress).

(2) is what scares me: the kernel has stuff to deliver, and kernels don't like to hold on data on behalf of userspace forever, because this burns wired memory. This means that this doesn't quite play nice with a global anonymous pool that can be starved at any time. Especially if we're talking XPC Connections in a daemon, where super high priority clients such as the frontmost app (or even worse, SpringBoard) can ask you questions that you'd better answer as fast as you can.

The solution we recommend to solve this to developers (1st and 3rd parties), is for all your XPC connections for your clients are rooted at the same bottom queue that represents your "communication" subsystem, so you can imagine it be the "Incoming request multiplexer" Actor or something, this one is in the 2nd category and is known to the kernel so that the kernel can instantiate the Actor itself without asking permission to userspace and directly make the execution context, and rely on the scheduler to get the priorities right.

The alternative solution is for userspace to have a construct similar to a kernel scheduler in every process and have a strong tie with the kernel so that the context can be created within the shared pool, and then the userspace scheduler would take over to try to schedule this, and this is where all the discussion about phsyical threads starts: it starts to be very tempting to have "lightweight" (I really believe that for sufficiently complex code, with C interleaved, these actually are merely slightly less heavyweight) users-pace stack switching and basically make every process replicate what a kernel does. This is where I'd like for us not to go because it means solving the same problems in two places (kernel and usperspace), which requires synchronization between states and doubles your bug surface. It also requires a large amount of diagnostic tools to be able to represent these things in meaningful ways, and last, doesn't work at all with TSDs (which Swift uses extensively it its runtime, and that obj-c & C frameworks use all over the place) and most of our locking primitives (which when you take them make the locking primitive tied to the thread ID / port that took the lock, so if you have a fluid association between userspace threads and physical ones, you're in for a very bad time). All these reasons is why we took the direction with the OS of making the kernel more aware of the runtime rather than having the runtime be more like a kernel.

This is why our belief (in my larger K(ernel) & R(untime) team) is that instead of trying to paper over the fact that there's a real OS running your stuff, we need to embrace it and explain to people that everywhere in a traditional POSIX world they would have used a real pthread_create()d thread to perform the work of a given subsystem, they create one such category #2 bottom queue that represents this thread (and you make this subsystem an Actor), and that the same way in POSIX you can't quite expect a process to have thousands of threads, you can't really have thousands of such top level actors either. It doesn't mean that there can't be some anonymous pool to run stuff on it for the stuff that is less your raison d'être, it just means that the things your app that are really what it was built to do should be organized in a resource aware way to take maximum advantage of the system. Throwing a million actors to the system without more upfront organization and initial thinking by the developers is not optimizable.

FWIW for other people wanting to participate in this discussion, what I'm saying feels very Darwin centric because I use XPC as an example, but on Linux replace XPC with D-BUS and all my arguments apply the same. OS Design meets the same class of problems everywhere, I'm just taking concrete example that people who've worked at Apple understand. I don't know Windows at all, but I'd be very surprised if they don't have exactly something like this too.

>>>> You can obviously model this as "the database is an actor itself", and have the queue of other actors that only do database work target this database actor queue, but while this looks very appealing, in practice this creates a system which is hard to understand for developers. Actors talking to each other/messaging each other is fine. Actors nesting their execution inside each other is not because the next thing people will then ask from such a system is a way to execute code from the outer actor when in the context of the inner Actor, IOW what a recursive mutex is to a mutex, but for the Actor queue. This obvious has all the terrible issues of recursive locks where you think you hold the lock for the first time and expect your object invariants to be valid, except that you're really in a nested execution and see broken invariants from the outer call and this creates terribly hard bugs to find.
>>> 
>>> Yes, I understand the problem of recursive locks, but I don’t see how or why you’d want an outer actor to have an inner actor call back to it.
>> 
>> I don't see why you'd need this with dispatch queues either today, however my radar queue disagrees strongly with this statement. People want this all the time, mostly because the outer actor has a state machine and the inner actor wants it to make progress before it continues working.
>> 
>> In CFRunloop terms it's like running the runloop from your work by calling CFRunloopRun() yourself again until you can observe some event happened.
>> 
>> It's not great and problematic for tons of reasons. If actors are nested, we need a way to make sure people don't have to ever do something like that.
>> 
>> Just as a data point, my reactions to this thread yielded a private discussion off list *exactly* about someone wanting us to provide something like this for dispatch (or rather in this instance having the inner actor be able to suspend the outer one, but it's just a corollary / similar layering violation where the inner actor wants to affect the outer one in a way the outer one didn't expect).
> 
> I get that this comes up now and then, but I also don’t understand what we can do about it.  Allowing “recursive lock” style reentrancy into an actor breaks fundamental aspects of the design of actors: that you know at the start of any actor message that the actor invariants are intact.

Exactly.

> It would be possible to relax this and give up on that (perhaps rather theoretical) win, but I’d rather not.

Me neither, I hate this, and all the radars I was alluding to are either NTBFed with an explanation to the reporter on how to do the same without using recursive patterns, or sit in our queue with us being perplexed. It's just that we have to know these questions will happen.

> I think that any recursive case can be expressed through a refactoring of the code, e.g. to use async sending instead of sync calls.

This is true, but I would argue that if you reached the point where you need a refactor, then you screwed up during design first. I think what I mean here is that we should explain upfront how to not design things that lead to this problem with simple example so that people can use them to build complex systems that never fall in this trap.

>  Also, I think we should strongly encourage pure async “fire and forget” actor methods anyway - IOW, we should encourage push, not pull

I almost agree. We should strongly encourage the `pure async "account for, fire and forget" actor methods`. The `account for` is really backpressure, where you actually don't fire if the remote queue is full and instead rely on some kind of reactive pattern to pull from you. (but I know you wrote that on your proposal and you're aware of it).

The other problem is the following. I used to work on iCloud Drive, and the design of the daemon uses a pure GCD design (including its interaction with the SQLite metadata database(s) it has, which will probably make you understand why I took this example earlier) only which makes it relevant to this discussion. Everytime you propose an idea, I'm really trying to ask myself "what would we do in bird".

One problem that is interesting and that I don't know how to handle in a pure fire-and-forget world is "iCloud Account Logout". This is something that today is implemented in a very synchronous fashion, because getting this wrong means the worst thing you can do for privacy: if you have a very quick succession of login/logout/login/logout and you screw this up, then you can have cross-polination between accounts, and we take that very seriously as you'd imagine.

That being said, the solution that was chosen is that the part of the code performing the logout sends tons of cancelations of all the things in flight, starts deleting your files on disk, and waits *synchronously* on all the subsystems for them to return that they've forgotten everything about the previous account. For debugging purposes, this is the right model, because when one subsystem has a bug and doesn't complete, you can see by inspecting the blocking relationship which one it is, which is trivial with regular operating introspection tools. I don't believe that writing the same code in a purely async way would be as easy to debug.

> - since they provide much stronger guarantees in general.

It depends which guarantees you're talking about. I don't think this statement is true. Async work has good and strong properties when you write code in the "normal" priority ranges, what we refer as to "in the QoS world" on Darwin (from background up to UI work).

However, there are tons of snowflakes on any platform that can't be in that world:
- media rendering (video/audio)
- HID (touch, gesture recognition, keyboard, mouses, trackpads, ...)
- some use cases of networking (bluetooth is a very good example, you hate when your audio drops with your bluetooth headset don't you?)
- ...

And these use cases are many, and run in otherwise regular processes all the time.

Today, the only way you don't end up with massive priority inversions when one of these subsystems has to perform something that interacts with generic code (such as loading a user pref, do an IPC, etc...) is that they do everything synchronously, because blocking is how the kernel can resolve these inversions locally, without priority overhangs. Asynchronous propagation has a ton of overhang (the term we use to describe the fact that your current effective priority is tainted by a higher priority piece of work that happened in the past). Overhang outside of the QoS world is not an option: if you run at the priority of gesture recognition by accident and use a lot of CPU, you just forfeited the touch interface of your phone, you can see how people would be upset with this.

However it is a fact of life that these subsystems, have to interact with generic subsystems sometimes, and that mean they need to be able to synchronously wait on an actor, so that this actor's priority is elevated. And you can't waive this off, there are tons of legitimate reasons for very-high priorities subsystems to have to interact and wait on regular priority work.

I'll take one compelling example: logging, when your HID stack wants to report a runtime fault of some sort, it has to interact with the logging subsystem which is mostly utility work. If the HID stack logs all the time, this is wrong, obviously, but if it has to report a critical issue, then it's fair for this subsystem to have to log, and asyncing the log is not a good option as this breaks the stacks and many things you want for a report. However you really want once this log is done, to reset your HID stack and move on with your life.

I 100% agree with you that if *everything* was asynchronous and written this way, our lives would be great. I don't however think it's possible on real life operating system to write all your code this way. And this is exactly where things start to be *very* messy.

However, if you're not in one of these "extra difficult cases" you should use pure async. Which means that we still need to provide synchronous waiting, but explain very clearly when it's okay to use. I think e.g. that if you're running on the anonymous pool, you should not be allowed to synchronously wait on another actor, for all the examples I took (whether the iCloud Drive logout case, or the high priority people), the code that waits on actors is only running on a non anonymous context, and it's "fine" to block this context, because you made it as a developer in the first place, so you understand what you do to that unique serial resource. If you block the shared pool, then you can't possibly understand since you don't know what else runs there by definition.

>> It is definitely the family of problems I'm worried about. I want to make sure we have an holistic approach here because I think that recursive mutexes, recursive CFRunloop runs and similar ideas are flawed and dangerous. I want to make sure we understand which limitations we want to impose here, and by limitations I really mean layering/architecture rules, and that we document them upfront and explain how to work with them.
> 
> +1
> 
>> I think that what I'm getting at is that we can't build Actors without defining how they interact with the operating system and what guarantees of latency and liveness they have.
> 
> Agreed, thanks Pierre!
> 
> -Chris
> 
> 
> _______________________________________________
> swift-evolution mailing list
> swift-evolution at swift.org
> https://lists.swift.org/mailman/listinfo/swift-evolution

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.swift.org/pipermail/swift-evolution/attachments/20170903/81ed71d2/attachment.html>