[swift-corelibs-dev] libdispatch roadmap and api addition proposal

Pierre Habouzit phabouzit at apple.com
Tue Dec 8 10:07:06 CST 2015



> On Dec 8, 2015, at 7:34 AM, Joakim Hassila via swift-corelibs-dev <swift-corelibs-dev at swift.org> wrote:
> 
> Hi Pierre,
> 
> Thanks for the good explanation, will try to respond inline below:
> 
>> On 7 dec. 2015, at 23:10, Pierre Habouzit <phabouzit at apple.com <mailto:phabouzit at apple.com>> wrote:
>> 
>> Hi Joakim, Kevin,
>> 
>> [ Full disclosure, I made that reply in rdar://problem/16436943 <rdar://problem/16436943> and your use case was slightly different IIRC but you’re right it’s a close enough problem ]
>> 
>> Dispatch internally has a notion of something that does almost that, called _dispatch_barrier_trysync_f[1]. However, it is used internally to serialize state changes on sources and queues such as setting the target queue or event handlers.
>> 
>> The problem is that this call bypasses the target queue hierarchy in its fastpath, which, while correct when changing the state of a given source or queue, is generally the wrong thing to do. Consider this code, assuming a public dispatch_barrier_trysync() that behaves like the internal one:
>> 
>> 
>>     dispatch_queue_t outer = dispatch_queue_create("outer", NULL);
>>     dispatch_queue_t inner = dispatch_queue_create("inner", NULL);
>>     dispatch_set_target_queue(outer, inner);
>> 
>>     dispatch_async(inner, ^{
>>         // write global state protected by inner
>>     });
>>     dispatch_barrier_trysync(outer, ^{
>>         // write global state protected by inner
>>     });
>> 
>> 
>> Then if it works like the internal version we have today, the code above has a data race, which we’ll all agree is bad.
>> Or we make an API version that always goes through async whenever the queue you trysync on is not targeted at a global root queue. That is weird, because the performance characteristics would then depend entirely on the target queue hierarchy, which, once layering and frameworks come into play, is a bad characteristic for a good API.
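[For concreteness, the fastpath being described can be sketched as follows. This is purely illustrative: the names fake_queue_t and barrier_trysync_f are made up, and a real dispatch queue is far more than a mutex. The point is that the fastpath takes only the queue's own drain lock and never touches the locks of the queues it targets, which is exactly why the example above races on the state protected by "inner".]

```c
#include <pthread.h>

/* Hypothetical sketch of a trysync fastpath built on a trylock.
 * A real queue's drain state is emulated here with a plain mutex. */
typedef struct fake_queue {
    pthread_mutex_t drain_lock;   /* stands in for the queue's drain state */
    struct fake_queue *target;    /* target queue -- ignored by the fastpath! */
} fake_queue_t;

/* Returns 1 if the work ran inline on the caller's thread,
 * 0 if it would have to be enqueued asynchronously instead. */
int barrier_trysync_f(fake_queue_t *q, void *ctx, void (*work)(void *))
{
    if (pthread_mutex_trylock(&q->drain_lock) == 0) {
        work(ctx);                         /* fastpath: caller's thread */
        pthread_mutex_unlock(&q->drain_lock);
        return 1;
    }
    return 0;   /* slowpath: a real implementation would async here */
}

/* Trivial work function used for demonstration. */
void bump_counter(void *ctx) { ++*(int *)ctx; }
```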
> 
> Yes, we could currently assume that we only targeted a root queue for our use case, so our implementation has this limitation (so it is not a valid general solution as you say). 
> 
> It would perhaps be a bit strange to have different performance characteristics depending on the target queue hierarchy, as you say, but there are already some performance differences in actual behavior between e.g. an overcommit queue and a non-overcommit one. So perhaps another option would be to express this as an optional queue attribute instead of an additional generic API (a queue attribute meaning ’steal the calling thread for inline processing of requests if the queue was empty when dispatching’)…?

My point is, adding API to dispatch is not something we do lightly, and I’m not keen on an interface that only works for base queues. In Mac OS and iOS code, where dispatch is pervasive, queue hierarchies more than two queues deep are very common.

>> Or we don’t give up right away when the hierarchy is deep, but then dispatch_trysync needs to be able to unwind all the locks it took, and then you have ordering issues: enqueuing the block that couldn’t run synchronously may place it after another one and break the FIFO ordering of queues. Respecting that ordering, which is a desired property of our API, and getting an efficient implementation are somewhat at odds.
> 
> Yes, agree it is a desirable property of the API to retain the ordering.
> 
>> The other argument against trysync done that way is that during testing it would almost always go through the uncontended codepath, leading developers not to realize that they should have taken copies of variables and the like (this is less of a problem on Darwin with Objective-C and ARC); trysync running on the caller’s thread hides that. But once it starts being contended in production, it’ll bite you hard with memory corruption everywhere.
> 
> Less of an issue for us as we depend on the _f interfaces throughout due to portability concerns, but fair point.
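[The same pitfall shows up with the _f interfaces in a different form: any context that may be deferred to the slowpath must not point into the caller's stack frame. A hedged sketch of the usual defensive pattern — the names request_t, request_copy, and request_work are invented for illustration:]

```c
#include <stdlib.h>

/* Illustrative only: with the function-pointer (_f) interfaces, heap-copy
 * the context up front and let the work function free it, so the code is
 * correct whether it runs inline (fastpath) or deferred (slowpath). */
typedef struct {
    int value;
} request_t;

/* Copy a caller-owned (possibly stack-allocated) request to the heap. */
request_t *request_copy(const request_t *src)
{
    request_t *r = malloc(sizeof *r);
    if (r) *r = *src;
    return r;
}

/* Work function: consumes and frees its heap-allocated context. */
void request_work(void *ctx)
{
    request_t *r = ctx;
    /* ... use r->value here ... */
    free(r);
}
```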
> 
>> Technically, what you’re after is this: bringing up a new thread is very costly, and you’d rather use the thread that’s asyncing the request because it will soon give up control. The wake-up of a queue isn’t that expensive, in the sense that the overhead of dispatch_sync() in terms of memory barriers and locking is more or less comparable. What’s expensive is creating a thread to satisfy this enqueue.
> 
> Yes, in fact, bringing up a new thread is so costly that we keep a pool around in the libpwq implementation. Unfortunately we would often see double-digit-microsecond latency incurred by this, which is unacceptable for us, so for some configurations/special deployments we had to add a dedicated spin thread that grabs the next queue to work on (that cut the latency down by a factor of 10 or so), with the next thread woken from the thread pool taking over as spinner…
> 
>> In my opinion, to get the win you’re after, you’d rather want an async() that, when it wakes up the target queue hierarchy all the way to the root, applies more resistance to bringing up a new thread to satisfy that request. Fortunately, the overcommit property of queues could be used by a thread pool to decide to apply that resistance. There are various parts of the thread pool handling (especially without kernel workqueue support) that could get some love to provide these exact benefits without changing the API.
> 
> That would indeed be a very interesting idea. The problem is that the thread using ‘dispatch_barrier_trysync’ is not returning to the pthread_workqueue pool to grab the next dispatch queue for processing, but is instead going back to block on a syscall (e.g. read() from a socket) - and even the latency to wake up a thread with mutex/condition signaling (as is commonly done now) is way too slow for our use case (thus the very ugly workaround with a spin thread for some deployments).
> 
> Essentially, for these kinds of operations we really want to avoid all context switches as long as we can keep up with the rate of inbound data, and in general such dynamics would be a nice property to have. If the thread performing the async call was known to always return to the global pwq thread pool, applying resistance as you suggest would solve this nicely; the problem is what to do when that thread blocks and you thus get stuck.
> 
> Perhaps we have to live with the limited implementation we have for practical purposes, but I have the feeling that the behavior we are after would be useful for other use cases too; perhaps the queue attribute suggested above could be a way of expressing it without introducing new dispatch API.

I completely agree with you, but I think that the way to address this is by making the thread pool smarter, not by having developers sprinkle their code with dispatch_barrier_trysync() wherever they feel like it. Using it properly requires a deep understanding of the dispatch implementation in use, which changes with each platform/version combination. That’s not really the kind of interface we want to build.

“overcommit” is exactly the hint you’re after as far as the queue is concerned. It means “if I’m woken up, bring up a new thread no matter what, provided it doesn’t blow up the system”. So make your queue non-overcommit by targeting it manually at dispatch_get_global_queue(0, 0) (that one isn’t overcommit), and make the thread pool smarter. That’s the right way to go, and the design-compatible way to do it.
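[Concretely, that setup is a couple of lines — a sketch only, the queue label is made up; note this needs libdispatch to build:]

```c
#include <dispatch/dispatch.h>

/* Sketch: create a serial queue and make it non-overcommit by
 * retargeting it at the default-priority global queue, which is
 * not an overcommit queue. */
static dispatch_queue_t make_non_overcommit_queue(void)
{
    dispatch_queue_t q = dispatch_queue_create("com.example.ingest", NULL);
    dispatch_set_target_queue(q,
        dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0));
    return q;
}
```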

If your thread blocks in read(), then I would argue it should use a READ dispatch source instead; that way, the source handler gets enqueued *after* your async and you can ping-pong. Doing blocking read()s is not dispatchy at all and will cause you all sorts of problems like this one, because re-asyncing doesn’t work for you.
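[The READ-source shape looks roughly like this — a hedged sketch (the function name and the handler body are illustrative; it needs libdispatch and an fd set to non-blocking mode):]

```c
#include <dispatch/dispatch.h>
#include <unistd.h>

/* Sketch: watch a (non-blocking) fd with a READ source so that reads
 * are delivered as handler invocations on the queue, interleaved in
 * FIFO order with anything else async'd to it. */
static dispatch_source_t watch_fd(int fd, dispatch_queue_t q)
{
    dispatch_source_t src =
        dispatch_source_create(DISPATCH_SOURCE_TYPE_READ, fd, 0, q);
    dispatch_source_set_event_handler(src, ^{
        char buf[4096];
        /* Non-blocking read; the source fires again while data remains. */
        ssize_t n = read(fd, buf, sizeof buf);
        (void)n; /* ... process buf[0..n) here ... */
    });
    dispatch_resume(src);   /* sources are created suspended */
    return src;
}
```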

-Pierre
