[swift-evolution] [Draft] Hasher & HashVisitable

Tony Allevato tony.allevato at gmail.com
Mon Mar 13 17:29:11 CDT 2017


On Mon, Mar 13, 2017 at 3:14 PM Tony Allevato <tony.allevato at gmail.com>
wrote:

> On Mon, Mar 13, 2017 at 1:43 PM David Hart <david at hartbit.com> wrote:
>
> On 13 Mar 2017, at 20:51, Tony Allevato <tony.allevato at gmail.com> wrote:
>
> I'm not convinced this is the right approach: it couples the fact that a
> type is hashable with a specific strategy for implementing the hash value
> computation.
>
>
> I’m confused. This proposal specifically *decouples* the implementation
> of hashable from hashing strategies: that’s its main goal. Could you
> explain how you see coupling here?
>
>
> Hashable as it is currently defined says "this type provides a hashValue
> property that allows it to be used in a hashing container". The proposed
> API says "this type is hashable using a specific visitor API and hasher
> interface". The "visitor API and hasher interface" is what I mean by
> "strategy" in my original message, compared to the current Hashable which
> simply says "it's hashable and here's its hash—a number".
>
> I don't buy the argument that we need to support the same type to support
> being hashed with arbitrary hashing algorithms at runtime. Hash values in
> languages like Swift are intended to be used specifically for hashing
> collections. What value for that use case is provided by allowing a
> collection to plug in various algorithms for the same type? I don't think
> the added flexibility pays for the increased complexity.
>
> A simpler solution to increase safety/force people to consider the
> resiliency of their hash function would just be to have Hashable.hashValue
> return a concrete "HashValue" type that provides the mixing interface, with
> a concrete implementation in the standard library
>

[self-follow up] ...to clarify what I meant here, since I was trying to
juggle too many things at once: hashValue could *return* a HashValue
protocol that has a default concrete implementation in the standard
library, but which users could write conforming types for as well.

I'll admit that does go counter to my original point about coupling hashing
to a specific API, but I was trying to offer a better balance between
safety and flexibility without the added complexity of requiring the
eventual consumer of the hash code to always choose a hash algorithm.

That being said, even though it's somewhat less safe/still opens the door
to user error, I think a simple hashValue: Int interface + a canned hasher
in the standard library is more than sufficient and we don't need to
overcomplicate the general case.



> —but I think a simple interface that allows mixing (1) integer constants
> and (2) hash values of other Hashables would be more sufficient and easier
> to understand. If users need something more complex than that, then it
> stands to reason that they *should* have to do a little more work.
>
>
>
>
> Instead, the language should make efforts to avoid requiring the user to
> think about hashing algorithms *at all*; to answer Sean's question a couple
> messages up, I've proposed in the past
> <https://gist.github.com/allevato/2fd10290bfa84accfbe977d8ac07daad> (rough
> working draft) adding derivation of Equatable and Hashable for structs and
> enums where it's possible to automatically do so (for example, if all of
> the enum's associated values are Equatable/Hashable).
>
>
> This proposal does so with the *ContinuouslyHashable* protocol, which
> provides a default implementation of hashable that works for enums and
> structs that have “inclusive” memory layouts. I think this is good thing,
> because it means you have to make a conscious choice to use
> *ContinuouslyHashable* instead of relying on an automatic implementation
> which will sometimes be wrong.
>
>
> From what I can tell, ContiguouslyHashable only works with fixed layout
> types, whereas a derived implementation would work for anything that is
> composed entirely of other Hashable types by mixing the hash values of
> those components in a known way.
>
> In such a situation, why do you expect that such an automatically
> generated implementation would be "wrong"?
>
>
>
> I've been experimenting with an implementation in the compiler and I have
> something that works for enums, except for recursive ones. I haven't
> started structs yet because I think there are more open questions there,
> and I hope to propose enums and structs independently so that the enum one
> doesn't get bogged down by the struct one.
>
> The core team seems to be aware of the need for this; the logic that
> derives Equatable/Hashable for enums without associated values also has
> TODO comments to handle those with associated values, and to handle structs
> as well. Slava Pestov mentioned a while back that they encouraged PRs on
> it, which is why I started, but my free time has been limited lately.
>
> That being said, I *do* think there would be value in having some kind of
> hash "mixer" as part of the standard library and strongly encouraging
> through documentation that hashValue implementors use it. Getting the
> function right is the hard part, and if Swift both (1) took care of it for
> you by default in many cases and (2) made the harder cases easier by
> providing a high quality canned implementation, I think we'd have a much
> cleaner solution than what is being proposed here.
>
>
> On Mon, Mar 13, 2017 at 12:18 PM Sean Heber via swift-evolution <
> swift-evolution at swift.org> wrote:
>
> Is there any reason the API couldn’t be expressed as something along these
> lines?
>
> func hash() -> [Hashable] {
>   return [x, y]
> }
>
> l8r
> Sean
>
>
> > On Mar 13, 2017, at 2:15 PM, David Hart <david at hartbit.com> wrote:
> >
> >>
> >> On 13 Mar 2017, at 18:54, Sean Heber via swift-evolution <
> swift-evolution at swift.org> wrote:
> >>
> >> I’m dumb when it comes to proper hashing, but it’s such a tediously
> common thing in my experience to need to add Hashable to some kind of a
> struct so I can stash it in a set or use it as a dictionary key. Is there
> really no way to make this all more automatic? I have to be honest - this
> proposal seems *harder* to understand than the way it works now.
> >
> > It's not really harder: just call hash on each of your type's
> significant values:
> >
> > x.hash(&hasher)
> > y.hash(&hasher)
> >
> > How would you implement hashValue in a simpler way, remembering that 'x
> ^ y' is an incorrect implementation?
> >
> >> Of course the easiest would be if the language could just do this “good
> enough" for me using reflection or whatever and if I really did run into a
> problem where I wanted to do this myself, I could override something.
> >>
> >> Perfect is the enemy of good.
> >>
> >> l8r
> >> Sean
> >>
> >>
> >>> On Mar 13, 2017, at 10:38 AM, Vincent Esche via swift-evolution <
> swift-evolution at swift.org> wrote:
> >>>
> >>> Hi there,
> >>>
> >>> I've written up a proposal for an overhaul/replacement of the Hashable
> protocol and would love to hear your feedback on it!
> >>>
> >>> Rendered | Blog Post
> >>>
> >>> Cheers,
> >>> Vincent
> >>>
> >>> Ps: I'd like to thank David Hart (@hartbit) for his great editorial
> feedback on this proposal. 👍
> >>>
> >>> HashVisitable
> >>>
> >>>   • Proposal: SE-NNNN
> >>>   • Authors: Vincent Esche
> >>>   • Review Manager: TBD
> >>>   • Status: Awaiting review
> >>> Introduction
> >>>
> >>> Replace the Hashable protocol by two new procotols (Hasher and
> HashVisitable) to improve safety, versatility and learnability.
> >>>
> >>> Motivation
> >>>
> >>> Implementing Hashable is difficult and the consequences if not done
> well have dire performance and safety repercussions.
> >>>
> >>> The documentation of Hashable lists a sample implementation of var
> hashValue:
> >>>
> >>> /// A point in an x-y coordinate system.
> >>> struct GridPoint {
> >>>   var x: Int
> >>>   var y: Int
> >>> }
> >>>
> >>> extension GridPoint: Hashable {
> >>>   var hashValue: Int {
> >>>       return x.hashValue ^ y.hashValue
> >>>   }
> >>>
> >>>   static func == (lhs: GridPoint, rhs: GridPoint) -> Bool {
> >>>       return lhs.x == rhs.x && lhs.y == rhs.y
> >>>   }
> >>> }
> >>>
> >>> Calculating the hashes of all GridPoints (given the above
> implementation) on a 1000 × 1000 grid …
> >>>
> >>> let (width, height) = (1000, 1000)
> >>> let total = width * height
> >>> var hashes = Set<Int>()
> >>> for x in 0..<width {
> >>>   for y in 0..<height {
> >>>       hashes.insert(GridPoint(x: x, y: y).hashValue)
> >>>   }
> >>> }
> >>> print("\(hashes.count) unique hashes out of a total of \(total).")
> >>>
> >>> … results in just 1024 unique hash values for 1_000_000 unique values.
> >>>
> >>> In other words: The recommended implementation causes 99.9% of values
> to trigger a hash collision.
> >>>
> >>> Out of those 1_000_000 values the median collision count was 976 with
> min and max being 976 and 1000respectively.
> >>>
> >>> The collision rate will have negative impact in algorithms which
> heavily use hashValue like the ones in Dictionaryand Set. Furthermore, it
> increases the vulnerability to DDOS attacks when exposed to the web.
> >>>
> >>> If even the official Swift documentation gets the implementation of
> hashValue wrong, then who is to expect the average Swift programmer to do
> any better?
> >>>
> >>> In contrast, running the same snippet using HashVisitable and the
> semi-secure Fnv1aHash (see below) results in zero collisions!
> >>>
> >>> Finally, the design of the Hashable protocol forces the use of one
> implementation without the possibility of switching between multiple
> hashing algorithms.
> >>>
> >>> Proposed solution
> >>>
> >>> Instead of coupling the hashing algorithm with each and every Swift
> type, we should provide a hashing API based on the visitor-pattern. By
> freeing application developers from the burden of having to implement
> hashing algorithms, the Standard Library can provide default ones which
> fulfill collision, performance and security goals. Furthermore, it would
> allow developers to swap to different algorithms based on the use case.
> >>>
> >>> Detailed design
> >>>
> >>> The proposal deprecates the Hashable protocol and introduces the
> following two:
> >>>
> >>> protocol Hasher
> >>> {
> >>>
> >>> mutating func finish() -> Int
> >>>
> >>>
> >>> mutating func write(bytes
> >>> : UnsafeRawBufferPointer)
> >>> }
> >>>
> >>>
> >>> protocol HashVisitable
> >>> {
> >>>
> >>> func hash<H: Hasher>(_ hasher: inout
> >>> H)
> >>> }
> >>>
> >>> Hasher is the protocol which represents a hashing algorithm, and
> HashVisitable replaces Hashable. For types entirely represented by their
> memory layout, the following protocol would provide a default
> implementation:
> >>>
> >>> protocol ContiguouslyHashable: HashVisitable {}
> >>>
> >>> extension ContiguouslyHashable {
> >>>   func hash<H: Hasher>(_ hasher: inout H) {
> >>>       var mutableSelf = self
> >>>       try! Swift.withUnsafeBytes(of: &mutableSelf) {
> >>>           hasher.write(bytes: $0)
> >>>       }
> >>>   }
> >>> }
> >>>
> >>> extension Bool : ContiguouslyHashable {}
> >>> extension UInt8 : ContiguouslyHashable {}
> >>> extension UInt16 : ContiguouslyHashable {}
> >>> extension UInt32 : ContiguouslyHashable {}
> >>> extension UInt64 : ContiguouslyHashable {}
> >>> extension UInt : ContiguouslyHashable {}
> >>> extension Int8 : ContiguouslyHashable {}
> >>> extension Int16 : ContiguouslyHashable {}
> >>> extension Int32 : ContiguouslyHashable {}
> >>> extension Int64 : ContiguouslyHashable {}
> >>> extension Int : ContiguouslyHashable {}
> >>>
> >>> The Standard-Library would then provide a set of hashing
> implementations specific to each purpose. A possible choice for hashing
> algorithms would be the reasonably fast SipHash-2-4, and the reasonably
> secure SipHash-4-8.
> >>>
> >>> FNV-1A is another popular semi-secure but blazingly fast hash
> algorithm, which – for the sake of demonstration – could be implemented as
> follows:
> >>>
> >>> struct Fnv1aHash
> >>> {
> >>>
> >>> fileprivate var state: UInt
> >>>
> >>>
> >>> init(seed: UInt
> >>> ) {
> >>>
> >>> self.state = seed &+ 14695981039346656037
> >>>
> >>>   }
> >>> }
> >>>
> >>>
> >>> extension Fnv1aHash: Hasher
> >>> {
> >>>
> >>> mutating func write(bytes
> >>> : UnsafeRawBufferPointer) {
> >>>
> >>> for byte in
> >>> bytes {
> >>>
> >>> self.state = (self.state ^ UInt(byte)) &* 1099511628211
> >>>
> >>>       }
> >>>   }
> >>>
> >>> mutating func finish() -> Int
> >>> {
> >>>
> >>> return unsafeBitCast(self.state, to: Int.self
> >>> )
> >>>   }
> >>> }
> >>>
> >>> Coming back to the sample code present in the Hashable documentation,
> the new implementation would look like:
> >>>
> >>> extension GridPoint: HashVisitable
> >>> {
> >>>
> >>> func hash<H: Hasher>(_ hasher: inout
> >>> H) {
> >>>
> >>> self.x.hash(&
> >>> hasher)
> >>>
> >>> self.y.hash(&
> >>> hasher)
> >>>   }
> >>> }
> >>>
> >>> Source compatibility
> >>>
> >>> Making use of "extending protocols to conform to protocols":
> >>>
> >>> extension Hashable: HashVisitable
> >>> {
> >>>
> >>> func hash<H: Hasher>(_ hasher: inout
> >>> H) {
> >>>
> >>> self.hashValue.hash(&
> >>> hasher)
> >>>   }
> >>> }
> >>>
> >>> Effect on ABI stability
> >>>
> >>> n/a
> >>>
> >>> Effect on API resilience
> >>>
> >>> This feature should be possible to add/remove without breaking ABI.
> >>>
> >>> Alternatives considered
> >>>
> >>> n/a
> >>> _______________________________________________
> >>> swift-evolution mailing list
> >>> swift-evolution at swift.org
> >>> https://lists.swift.org/mailman/listinfo/swift-evolution
> >>
> >> _______________________________________________
> >> swift-evolution mailing list
> >> swift-evolution at swift.org
> >> https://lists.swift.org/mailman/listinfo/swift-evolution
>
> _______________________________________________
> swift-evolution mailing list
> swift-evolution at swift.org
> https://lists.swift.org/mailman/listinfo/swift-evolution
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.swift.org/pipermail/swift-evolution/attachments/20170313/c6134ef0/attachment.html>


More information about the swift-evolution mailing list