[swift-evolution] Preconditions aborting process in server scenarios [was: Throws? and throws!]

Wed Jan 18 12:45:33 CST 2017

> On Jan 18, 2017, at 9:06 AM, Joe Groff via swift-evolution <swift-evolution at swift.org> wrote:
>> On Jan 18, 2017, at 12:04 AM, Rien via swift-evolution <swift-evolution at swift.org> wrote:
>> 
>>> 
>>> On 18 Jan 2017, at 08:54, Jonathan Hull via swift-evolution <swift-evolution at swift.org> wrote:
>>> 
>>> 
>>>> On Jan 17, 2017, at 7:13 PM, Dave Abrahams <dabrahams at apple.com> wrote:
>>>> 
>>>> 
>>>> on Tue Jan 17 2017, Jonathan Hull <jhull-AT-gbis.com> wrote:
>>>> 
>>>>> Bringing it back towards the initial post, what if there was a
>>>>> separation from true needs-to-take-down-the-entire-system trapping and
>>>>> things like out-of-bounds and overflow errors which could stop at
>>>>> thread/actor bounds (or in some cases even be recovered)?
>>>>> 
>>>>> The latter were the ones I was targeting with my proposal.  They live
>>>>> in this grey area, because honestly, they should be throwing errors if
>>>>> not for the performance overhead and usability issues.  
>>>> 
>>>> I fundamentally disagree with that statement.  There is value in
>>>> declaring certain program behaviors illegal, and in general for things
>>>> like out-of-bounds access and overflow no sensible recovery (where
>>>> “recovery” means something that would allow the program to continue
>>>> reliably) is possible.  
>>> 
>>> I think we do fundamentally disagree.  I know I come from a very different background (Human-Computer Interaction & Human Factors) than most people here, and I am kind of the odd man out, but I have never understood this viewpoint for anything but the most severe cases where the system itself is in danger of being compromised (certainly not for an index out of bounds).  In my mind “fail fast” is great for iterating in development builds, but once you are deploying, the user’s needs should come ahead of the programmer’s.
>>> 
>>> Shouldn’t a system be as robust as possible
>> 
>> Yes
>> 
>>> and try to minimize the fallout from any failure point?
>> 
>> That is in direct conflict with robustness.
>> Once an error is detected that is not handled by the immediate code, it must be assumed that a worst-case scenario has happened, and further damage to the user can only be prevented by bringing down the app, even if that means losing all work in progress.
>> 
>> A compromised system must be prevented from accessing any resources. Once a system is compromised the risk to the user is *much* higher than simply losing work in progress. He might lose his job, career, etc.
> 
> That's certainly true of code that makes unaudited use of `unsafe` constructs that can violate safety without any checking. It's my hope that our normal safety checks are thorough and fire early enough that your subprocess would crash before system-wide compromise happens. In an "actor" or similar model, even if we decide we don't want to pay for unwinding to fully clean up after the crashed actor, that crash could still at least be noted by a coordinator actor, which in your server situation could handle the problem by not accepting any new connections and letting its existing connections finish before restarting the process, or in an iOS-like mobile situation could trigger serialization of the current user state so that the process can be transparently killed and restarted. In either situation, perhaps we'd want to "taint" actors that use unsafe constructs so that their failure can't be recovered at all.

This seems like basically the right approach to me.  It means we don't make any effort to "clean up" the failing actor — essentially, it's treated as if it were just deadlocked — which means we don't pay the pervasive code-size costs of unwinding.  That's even fairly likely to leave the process in a state that can still be usefully debugged (as opposed to unwinding stacks, which completely destroys the execution context).  But there's still an opportunity to react and try to wind up other tasks.
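
To make the server case concrete, here is a minimal sketch of the coordinator behavior described above. Every name in it (`WorkerState`, `Coordinator`, `workerDidFail`, and so on) is hypothetical and invented for illustration; none of it is an existing or proposed Swift API, and how a trap would actually be reported to a coordinator is left open. The failed worker is only marked as failed and is never cleaned up, which is the "treated as if it were deadlocked" part.

```swift
// Hypothetical sketch: these types and method names are assumptions made for
// illustration only, not part of Swift or of any proposal in this thread.

/// State of one worker ("actor" in the sense used in this thread).
enum WorkerState {
    case running
    case failed   // trapped; left as-is, never unwound or cleaned up
}

/// Coordinator that reacts to a worker failure by draining, then restarting.
final class Coordinator {
    private(set) var workers: [Int: WorkerState] = [:]
    private(set) var openConnections = 0
    private(set) var acceptingConnections = true
    private(set) var restartRequested = false

    func addWorker(id: Int) {
        workers[id] = .running
    }

    /// Returns false once the coordinator has started draining.
    func acceptConnection() -> Bool {
        guard acceptingConnections else { return false }
        openConnections += 1
        return true
    }

    func connectionFinished() {
        precondition(openConnections > 0, "no open connection to finish")
        openConnections -= 1
        requestRestartIfDrained()
    }

    /// Called when a worker's safety check fires. The worker is only marked
    /// as failed; its stack and resources are deliberately left untouched.
    func workerDidFail(id: Int) {
        workers[id] = .failed
        acceptingConnections = false
        requestRestartIfDrained()
    }

    /// Requests a process restart once draining has begun and no
    /// connections remain open.
    private func requestRestartIfDrained() {
        if !acceptingConnections && openConnections == 0 {
            restartRequested = true
        }
    }
}

// Usage: one connection is in flight when worker 1 fails.
let coordinator = Coordinator()
coordinator.addWorker(id: 1)
_ = coordinator.acceptConnection()
coordinator.workerDidFail(id: 1)   // stop accepting, start draining
coordinator.connectionFinished()   // last connection done, restart requested
```

The sketch is single-threaded on purpose. A real coordinator would need its own isolation from the failing worker, which is exactly the part the language would have to provide.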

I'm not sure it makes any sense to call out actors that have used unsafe constructs as somehow specially unrecoverable.  If the concern is that the unsafe code may corrupt the other actors, well, that's true, but (1) that implies that you have to forbid recovery if *any* actor has used unsafe constructs, since low-level corruption can be passed between actors when they communicate normally, and (2) that's equally true of all sorts of high-level corruption that don't depend on unsafe constructs, and which the failing assertion may be the first indication of.

John.