[swift-evolution] [Draft] Hasher & HashVisitable

Károly Lőrentey karoly at lorentey.hu
Tue Mar 14 00:35:29 CDT 2017



>> On 2017. Mar 14., at 2:44, Xiaodi Wu via swift-evolution <swift-evolution at swift.org> wrote:
>> 
>> On Mon, Mar 13, 2017 at 8:30 PM, Tony Allevato via swift-evolution <swift-evolution at swift.org> wrote:
>> Adding the list back to this reply because I don't believe you meant to reply only to me, and I think it's an important discussion to have :)
>> 
>> 
>> On Mon, Mar 13, 2017 at 4:32 PM Vincent Esche <regexident.mailinglists at gmail.com> wrote:
>> Automatic generation of Hashable implementation code only "solves" the problem of having to implement `var hashValue`.
>> It however only works for some types. By no means for all.
>> 
>> Certainly, and I never suggested that it would work for every case. I was responding to Sean's point that the compiler should be able to do "good enough" in the common cases and I offered a way that that can be achieved.
>> 
>> 
>> Take this hypothetical implementation of a HSB color:
>> 
>> struct Color {
>> 	let hue: UInt8
>> 	let saturation: UInt8
>> 	let brightness: UInt8
>> }
>> 
>> Let the semantic of this type be this:
>> Any two colors with a `brightness` of `0` are to be considered equal regardless of their respective `hue` or `saturation`. At night all cats are gray. 
>> 
>> An auto-generated conformance impl for `Color` would most likely be wrong afaict.
>> And those kind of semantics are everywhere.
>> 
>> Of course, and in those cases, you wouldn't want to use an auto-derived Equatable or Hashable implementation. Those kinds of semantics are "everywhere" for certain definitions of "everywhere", but similarly, semantics where the hash value of a thing can be determined simply via combination of the hash values of its components are also "everywhere".
>> 
>> I wouldn't suggest that auto-deriving Hashable solves *all* problems, and similarly, your proposal doesn't help in the situation you described either. Your API provides a different way to mix the hue, saturation, and brightness components in an implementation of hashValue, but it doesn't force the user to mix them *correctly* (by excluding hue/saturation if brightness is zero).
>> 
>> So in both cases, users still have to provide custom implementations of == and hashValue. But without auto-deriving, users have to provide them even for the cases where the auto-derived implementation *would have been correct.*  Auto-deriving is about reducing the number of types that need to have custom implementations, thereby reducing the chance that a user will do it wrong.
>> 
>> When considering a type with auto-derived Hashable and Equatable, there are two ways it can be wrong:
>> 
>> 1) The auto-generated implementations of both == and hashValue don't honor the semantic contract of the type (for example, don't ignore brightness when it's zero).
>> 2) The user overrides the auto-generated implementation of one of ==/hashValue but not both, and violates the contracts between them.
>> 
>> For #1, yes, the compiler generated an incorrect implementation in that case. However, I would argue it's the developer's responsibility to fix it by overriding it if they need different semantics. As I mentioned above, without derivation, the developer could still implement it incorrectly, just as they could with your API.
>> 
>> For #2, this could be solved by requiring users to override both if they override one of them. Now they're back in the same situation as #1—they either did it right, or they did it wrong.
>> 
>>  
>> 
>> Auto-generation clearly is no panacea when it comes to hashing. Especially if it leads to invalid and unsafe default implementations.
>> 
>> Auto-deriving cannot produce "unsafe" implementations because it's defined in terms of a function operating over the hash values of its components. It can produce an implementation that does not match the intended semantics of the class that are defined outside of its component values, but that's not the compiler's job to decide; it's up to the user to provide those.
>> 
>> Regarding unsafety, it's worth noting that your ContiguouslyHashable protocol is inherently unsafe and fragile. Imagine that a user implements a struct that they make conform to ContiguouslyHashable because at the time, it's a simple fixed layout type with primitive fields. If they take that type and add a new field to it that isn't contiguously hashable (say, a class reference) and they forget to replace the ContiguouslyHashable conformance, they now have a very broken type, and that behavior isn't caught until *runtime* when things go wrong.
>> 
>> At least with derived conformances, in that situation the *compiler* would see that the type was no longer hashable and emit an error when it's used anywhere that Hashable/hashValue was expected.
>> 
>> So if safety is your motivation, ContiguouslyHashable kind of fights back against that because it gives the user a door they can open—in some cases, accidentally—that produces invalid results.
>> 
>>  
>> 
>> It would however be nice to be able to annotate types for which you want HashVisitable implementations to be generated.
>> 
>> That's one of the still-open questions from the last discussion thread on the topic; whether auto-deriving should be automatic for any type where it is possible (as it is now for enums without associated values) or whether the user should have to opt-in.
>> 
>>  
>> 
>> I actually don't see auto-generation as an alternative to a proper hashing API. But rather as an addition.
>> HashVisitable and auto-deriving would actually work great together!
>> 
>> I completely agree that these ideas complement each other very well! And I also agree that the language would do well to make it easier for users to easily do the right thing. I just feel that *the API proposed here specifically* adds too much complexity for the problem it's trying to solve.
>> 
>> I'd find it more compelling if it was simplified a bit:
>> 
>> * The standard library doesn't need to provide a number of hash implementations; it should just provide one that works "well enough" in most cases
>> * Hashable doesn't have tie itself to a visitor pattern. It's not necessarily safer because users can still mix the wrong values; in that case, I'd rather the stdlib implementation of the "good enough" hash just be a standalone mixer that the language can encourage users to use.
> 
> I agree with this assessment. If SipHash is deemed "good enough," then Swift should offer these as standalone facilities and encourage users to call them in `Hashable`. I think the API design proposed here is much too complex, but offering tried-and-true hash functions is certainly reasonable and would be an improvement over xor.

One important point about SipHash and similar secure hash functions is that they work better if all nested components of a hashable value are (recursively) appended to a single hasher, rather than having each component create its own separate hasher instance in its hashValue implementation. Initializing and finalizing the hasher state is not free; it has some nontrivial cost. So ideally only collections would create and finalize hashers; composite types that implement Hashable should not do that. 

Therefore, implementing SipHash as a separate, optional facility for use inside individual hashValue implementations (as opposed to being an outright replacement for it) would be somewhat inefficient. It also wouldn't discourage/prevent people from continuing to use xor and similar ad hoc hashing methods; people would need to discover these useful new hash utilities on their own.

> 
> Certainly also, the documentation for `Hashable` can be improved. In general, however, it's not very convincing to motivate an entire proposal for a new feature based on a documentation bug. Recognize that there are (or at least have been, in past shipped versions of Swift) code snippets in the documentation that _don't even compile_! If we accepted the logic put forward here, we should give up on Swift entirely; after all, if even the official Swift documentation can't provide code that compiles, what hope do the rest of us have?!

IIUC, the proposal suggests that HashVisitable should replace Hashable as the official hashing protocol. This is certainly a major change; but given the sorry state of not just the example given in the Hashable docs, but an alarmingly large fraction of every hand-written hashValue implementation out there (including almost all hashValues I ever made before I gave up & switched to SipHash), perhaps it'd be a good idea to make a clean break here and force people to switch to an API that's a lot easier to get right.

<rant>
Consider that hashValue is pretty much the only place in "everyday" Swift code where people need to routinely use the overflowing arithmetic operators and bitwise operators from the "Advanced Operators" section in the Swift manual. Typical hashValue implementations look like they were written in a different language than the code that surrounds them. They're full of arcane symbols, magical numbers and mystical incantations that are superstitiously repeated over and over, often with no real understanding of their original meaning or function.

It took a lot of effort to learn C-style for loops (ding!), and they were often tricky to use correctly; but it was certainly possible to learn them well enough to write decent code. In contrast, practically nobody I know is able to write a good hashValue implementation from scratch for even simple structs like CGPoint with any confidence. Not just junior programmers: nobody. One has to be some sort of wizard to do it. The hashValue API is a miserable failure.
</rant>

Automatic generation of hashValue implementations using SipHash would (mostly) resolve the usability issue, but not the efficiency thing. Perhaps it'd be acceptable to simultaneously implement autogeneration and to switch over to (simplified!) visitor-based hashing in the same compiler release—so that the migrator's fixits would simply suggest removing hashValue implementations instead of replacing them with something that looks dramatically different. 

If the boilerplate generator would include some sort of syntax to explicitly add/remove specific properties from/to the scope of hashing & equality, then only people implementing hashed collections would ever need to care about the specific hashing APIs, leaving other programmers blissfully and safely unaware of such esoteric concerns.


> 
>  
>> 
>> In fact they already do in other languages.
>> 
>> 
>> On Mon, Mar 13, 2017 at 8:51 PM, Tony Allevato via swift-evolution <swift-evolution at swift.org> wrote:
>> I'm not convinced this is the right approach: it couples the fact that a type is hashable with a specific strategy for implementing the hash value computation. 
>> 
>> Instead, the language should make efforts to avoid requiring the user to think about hashing algorithms *at all*; to answer Sean's question a couple messages up, I've proposed in the past (rough working draft) adding derivation of Equatable and Hashable for structs and enums where it's possible to automatically do so (for example, if all of the enum's associated values are Equatable/Hashable). I've been experimenting with an implementation in the compiler and I have something that works for enums, except for recursive ones. I haven't started structs yet because I think there are more open questions there, and I hope to propose enums and structs independently so that the enum one doesn't get bogged down by the struct one.
>> 
>> The core team seems to be aware of the need for this; the logic that derives Equatable/Hashable for enums without associated values also has TODO comments to handle those with associated values, and to handle structs as well. Slava Pestov mentioned a while back that they encouraged PRs on it, which is why I started, but my free time has been limited lately.
>> 
>> That being said, I *do* think there would be value in having some kind of hash "mixer" as part of the standard library and strongly encouraging through documentation that hashValue implementors use it. Getting the function right is the hard part, and if Swift both (1) took care of it for you by default in many cases and (2) made the harder cases easier by providing a high quality canned implementation, I think we'd have a much cleaner solution than what is being proposed here.
>> 
>> 
>> On Mon, Mar 13, 2017 at 12:18 PM Sean Heber via swift-evolution <swift-evolution at swift.org> wrote:
>> Is there any reason the API couldn’t be expressed as something along these lines?
>> 
>> func hash() -> [Hashable] {
>>   return [x, y]
>> }
>> 
>> l8r
>> Sean
>> 
>> 
>> > On Mar 13, 2017, at 2:15 PM, David Hart <david at hartbit.com> wrote:
>> >
>> >>
>> >> On 13 Mar 2017, at 18:54, Sean Heber via swift-evolution <swift-evolution at swift.org> wrote:
>> >>
>> >> I’m dumb when it comes to proper hashing, but it’s such a tediously common thing in my experience to need to add Hashable to some kind of a struct so I can stash it in a set or use it as a dictionary key. Is there really no way to make this all more automatic? I have to be honest - this proposal seems *harder* to understand than the way it works now.
>> >
>> > It's not really harder: just call hash on each of your type's significant values:
>> >
>> > x.hash(&hasher)
>> > y.hash(&hasher)
>> >
>> > How would you implement hashValue in a simpler way, remembering that 'x ^ y' is an incorrect implementation?
>> >
>> >> Of course the easiest would be if the language could just do this “good enough" for me using reflection or whatever and if I really did run into a problem where I wanted to do this myself, I could override something.
>> >>
>> >> Perfect is the enemy of good.
>> >>
>> >> l8r
>> >> Sean
>> >>
>> >>
>> >>> On Mar 13, 2017, at 10:38 AM, Vincent Esche via swift-evolution <swift-evolution at swift.org> wrote:
>> >>>
>> >>> Hi there,
>> >>>
>> >>> I've written up a proposal for an overhaul/replacement of the Hashable protocol and would love to hear your feedback on it!
>> >>>
>> >>> Rendered | Blog Post
>> >>>
>> >>> Cheers,
>> >>> Vincent
>> >>>
>> >>> Ps: I'd like to thank David Hart (@hartbit) for his great editorial feedback on this proposal. 👍
>> >>>
>> >>> HashVisitable
>> >>>
>> >>>   • Proposal: SE-NNNN
>> >>>   • Authors: Vincent Esche
>> >>>   • Review Manager: TBD
>> >>>   • Status: Awaiting review
>> >>> Introduction
>> >>>
>> >>> Replace the Hashable protocol by two new procotols (Hasher and HashVisitable) to improve safety, versatility and learnability.
>> >>>
>> >>> Motivation
>> >>>
>> >>> Implementing Hashable is difficult and the consequences if not done well have dire performance and safety repercussions.
>> >>>
>> >>> The documentation of Hashable lists a sample implementation of var hashValue:
>> >>>
>> >>> /// A point in an x-y coordinate system.
>> >>> struct GridPoint {
>> >>>   var x: Int
>> >>>   var y: Int
>> >>> }
>> >>>
>> >>> extension GridPoint: Hashable {
>> >>>   var hashValue: Int {
>> >>>       return x.hashValue ^ y.hashValue
>> >>>   }
>> >>>
>> >>>   static func == (lhs: GridPoint, rhs: GridPoint) -> Bool {
>> >>>       return lhs.x == rhs.x && lhs.y == rhs.y
>> >>>   }
>> >>> }
>> >>>
>> >>> Calculating the hashes of all GridPoints (given the above implementation) on a 1000 × 1000 grid …
>> >>>
>> >>> let (width, height) = (1000, 1000)
>> >>> let total = width * height
>> >>> var hashes = Set<Int>()
>> >>> for x in 0..<width {
>> >>>   for y in 0..<height {
>> >>>       hashes.insert(GridPoint(x: x, y: y).hashValue)
>> >>>   }
>> >>> }
>> >>> print("\(hashes.count) unique hashes out of a total of \(total).")
>> >>>
>> >>> … results in just 1024 unique hash values for 1_000_000 unique values.
>> >>>
>> >>> In other words: The recommended implementation causes 99.9% of values to trigger a hash collision.
>> >>>
>> >>> Out of those 1_000_000 values the median collision count was 976 with min and max being 976 and 1000respectively.
>> >>>
>> >>> The collision rate will have negative impact in algorithms which heavily use hashValue like the ones in Dictionaryand Set. Furthermore, it increases the vulnerability to DDOS attacks when exposed to the web.
>> >>>
>> >>> If even the official Swift documentation gets the implementation of hashValue wrong, then who is to expect the average Swift programmer to do any better?
>> >>>
>> >>> In contrast, running the same snippet using HashVisitable and the semi-secure Fnv1aHash (see below) results in zero collisions!
>> >>>
>> >>> Finally, the design of the Hashable protocol forces the use of one implementation without the possibility of switching between multiple hashing algorithms.
>> >>>
>> >>> Proposed solution
>> >>>
>> >>> Instead of coupling the hashing algorithm with each and every Swift type, we should provide a hashing API based on the visitor-pattern. By freeing application developers from the burden of having to implement hashing algorithms, the Standard Library can provide default ones which fulfill collision, performance and security goals. Furthermore, it would allow developers to swap to different algorithms based on the use case.
>> >>>
>> >>> Detailed design
>> >>>
>> >>> The proposal deprecates the Hashable protocol and introduces the following two:
>> >>>
>> >>> protocol Hasher
>> >>> {
>> >>>
>> >>> mutating func finish() -> Int
>> >>>
>> >>>
>> >>> mutating func write(bytes
>> >>> : UnsafeRawBufferPointer)
>> >>> }
>> >>>
>> >>>
>> >>> protocol HashVisitable
>> >>> {
>> >>>
>> >>> func hash<H: Hasher>(_ hasher: inout
>> >>> H)
>> >>> }
>> >>>
>> >>> Hasher is the protocol which represents a hashing algorithm, and HashVisitable replaces Hashable. For types entirely represented by their memory layout, the following protocol would provide a default implementation:
>> >>>
>> >>> protocol ContiguouslyHashable: HashVisitable {}
>> >>>
>> >>> extension ContiguouslyHashable {
>> >>>   func hash<H: Hasher>(_ hasher: inout H) {
>> >>>       var mutableSelf = self
>> >>>       try! Swift.withUnsafeBytes(of: &mutableSelf) {
>> >>>           hasher.write(bytes: $0)
>> >>>       }
>> >>>   }
>> >>> }
>> >>>
>> >>> extension Bool : ContiguouslyHashable {}
>> >>> extension UInt8 : ContiguouslyHashable {}
>> >>> extension UInt16 : ContiguouslyHashable {}
>> >>> extension UInt32 : ContiguouslyHashable {}
>> >>> extension UInt64 : ContiguouslyHashable {}
>> >>> extension UInt : ContiguouslyHashable {}
>> >>> extension Int8 : ContiguouslyHashable {}
>> >>> extension Int16 : ContiguouslyHashable {}
>> >>> extension Int32 : ContiguouslyHashable {}
>> >>> extension Int64 : ContiguouslyHashable {}
>> >>> extension Int : ContiguouslyHashable {}
>> >>>
>> >>> The Standard-Library would then provide a set of hashing implementations specific to each purpose. A possible choice for hashing algorithms would be the reasonably fast SipHash-2-4, and the reasonably secure SipHash-4-8.
>> >>>
>> >>> FNV-1A is another popular semi-secure but blazingly fast hash algorithm, which – for the sake of demonstration – could be implemented as follows:
>> >>>
>> >>> struct Fnv1aHash
>> >>> {
>> >>>
>> >>> fileprivate var state: UInt
>> >>>
>> >>>
>> >>> init(seed: UInt
>> >>> ) {
>> >>>
>> >>> self.state = seed &+ 14695981039346656037
>> >>>
>> >>>   }
>> >>> }
>> >>>
>> >>>
>> >>> extension Fnv1aHash: Hasher
>> >>> {
>> >>>
>> >>> mutating func write(bytes
>> >>> : UnsafeRawBufferPointer) {
>> >>>
>> >>> for byte in
>> >>> bytes {
>> >>>
>> >>> self.state = (self.state ^ UInt(byte)) &* 1099511628211
>> >>>
>> >>>       }
>> >>>   }
>> >>>
>> >>> mutating func finish() -> Int
>> >>> {
>> >>>
>> >>> return unsafeBitCast(self.state, to: Int.self
>> >>> )
>> >>>   }
>> >>> }
>> >>>
>> >>> Coming back to the sample code present in the Hashable documentation, the new implementation would look like:
>> >>>
>> >>> extension GridPoint: HashVisitable
>> >>> {
>> >>>
>> >>> func hash<H: Hasher>(_ hasher: inout
>> >>> H) {
>> >>>
>> >>> self.x.hash(&
>> >>> hasher)
>> >>>
>> >>> self.y.hash(&
>> >>> hasher)
>> >>>   }
>> >>> }
>> >>>
>> >>> Source compatibility
>> >>>
>> >>> Making use of "extending protocols to conform to protocols":
>> >>>
>> >>> extension Hashable: HashVisitable
>> >>> {
>> >>>
>> >>> func hash<H: Hasher>(_ hasher: inout
>> >>> H) {
>> >>>
>> >>> self.hashValue.hash(&
>> >>> hasher)
>> >>>   }
>> >>> }
>> >>>
>> >>> Effect on ABI stability
>> >>>
>> >>> n/a
>> >>>
>> >>> Effect on API resilience
>> >>>
>> >>> This feature should be possible to add/remove without breaking ABI.
>> >>>
>> >>> Alternatives considered
>> >>>
>> >>> n/a
>> >>> _______________________________________________
>> >>> swift-evolution mailing list
>> >>> swift-evolution at swift.org
>> >>> https://lists.swift.org/mailman/listinfo/swift-evolution
>> >>
>> >> _______________________________________________
>> >> swift-evolution mailing list
>> >> swift-evolution at swift.org
>> >> https://lists.swift.org/mailman/listinfo/swift-evolution
>> 
>> _______________________________________________
>> swift-evolution mailing list
>> swift-evolution at swift.org
>> https://lists.swift.org/mailman/listinfo/swift-evolution
>> 
>> _______________________________________________
>> swift-evolution mailing list
>> swift-evolution at swift.org
>> https://lists.swift.org/mailman/listinfo/swift-evolution
>> 
>> 
>> 
>> _______________________________________________
>> swift-evolution mailing list
>> swift-evolution at swift.org
>> https://lists.swift.org/mailman/listinfo/swift-evolution
> 
> _______________________________________________
> swift-evolution mailing list
> swift-evolution at swift.org
> https://lists.swift.org/mailman/listinfo/swift-evolution
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.swift.org/pipermail/swift-evolution/attachments/20170314/ad312293/attachment.html>


More information about the swift-evolution mailing list