[swift-evolution] [Draft] Hasher & HashVisitable

Vincent Esche regexident.mailinglists at gmail.com
Tue Mar 14 03:34:54 CDT 2017


The documentation bug is not the sole motivation for the proposal.
I did in fact only find it while writing the accompanying blog post
<https://blog.definiteloops.com/ha-r-sh-visitors-8c0c3686a46f>,
in which I present a couple of additional motivations besides hash
collisions.

Namely:
- decoupling
- independence from a one-size-fits-all hasher impl
- multiple hashers per instance (for implementing Bloom Filters, etc.)
- different hashers for different use-cases (fast&weak for safe
bottlenecks, slow&secure for web-exposed APIs)
- …

I’m actually much more interested in the independence gained, than pure
performance.


On Tue, Mar 14, 2017 at 2:44 AM, Xiaodi Wu <xiaodi.wu at gmail.com> wrote:

> On Mon, Mar 13, 2017 at 8:30 PM, Tony Allevato via swift-evolution <
> swift-evolution at swift.org> wrote:
>
>> Adding the list back to this reply because I don't believe you meant to
>> reply only to me, and I think it's an important discussion to have :)
>>
>>
>> On Mon, Mar 13, 2017 at 4:32 PM Vincent Esche <
>> regexident.mailinglists at gmail.com> wrote:
>>
>> Automatic generation of Hashable implementation code only "solves" the
>> problem of having to implement `var hashValue`.
>> It however only works for some types. By no means for all.
>>
>>
>> Certainly, and I never suggested that it would work for every case. I was
>> responding to Sean's point that the compiler should be able to do "good
>> enough" in the common cases and I offered a way that that can be achieved.
>>
>>
>> Take this hypothetical implementation of a HSB color:
>>
>> struct Color {
>> let hue: UInt8
>> let saturation: UInt8
>> let brightness: UInt8
>> }
>>
>> Let the semantic of this type be this:
>> Any two colors with a `brightness` of `0` are to be considered equal
>> regardless of their respective `hue` or `saturation`. At night all cats are
>> gray.
>>
>>
>> An auto-generated conformance impl for `Color` would most likely be wrong
>> afaict.
>> And those kind of semantics are everywhere.
>>
>>
>> Of course, and in those cases, you wouldn't want to use an auto-derived
>> Equatable or Hashable implementation. Those kinds of semantics are
>> "everywhere" for certain definitions of "everywhere", but similarly,
>> semantics where the hash value of a thing can be determined simply via
>> combination of the hash values of its components are also "everywhere".
>>
>> I wouldn't suggest that auto-deriving Hashable solves *all* problems, and
>> similarly, your proposal doesn't help in the situation you described
>> either. Your API provides a different way to mix the hue, saturation, and
>> brightness components in an implementation of hashValue, but it doesn't
>> force the user to mix them *correctly* (by excluding hue/saturation if
>> brightness is zero).
>>
>> So in both cases, users still have to provide custom implementations of
>> == and hashValue. But without auto-deriving, users have to provide them
>> even for the cases where the auto-derived implementation *would have been
>> correct.*  Auto-deriving is about reducing the number of types that need to
>> have custom implementations, thereby reducing the chance that a user will
>> do it wrong.
>>
>> When considering a type with auto-derived Hashable and Equatable, there
>> are two ways it can be wrong:
>>
>> 1) The auto-generated implementations of both == and hashValue don't
>> honor the semantic contract of the type (for example, don't ignore
>> brightness when it's zero).
>> 2) The user overrides the auto-generated implementation of one of
>> ==/hashValue but not both, and violates the contracts between them.
>>
>> For #1, yes, the compiler generated an incorrect implementation in that
>> case. However, I would argue it's the developer's responsibility to fix it
>> by overriding it if they need different semantics. As I mentioned above,
>> without derivation, the developer could still implement it incorrectly,
>> just as they could with your API.
>>
>> For #2, this could be solved by requiring users to override both if they
>> override one of them. Now they're back in the same situation as #1—they
>> either did it right, or they did it wrong.
>>
>>
>>
>>
>> Auto-generation clearly is no panacea when it comes to hashing.
>> Especially if it leads to invalid and unsafe default implementations.
>>
>>
>> Auto-deriving cannot produce "unsafe" implementations because it's
>> defined in terms of a function operating over the hash values of its
>> components. It can produce an implementation that does not match the
>> intended semantics of the class that are defined outside of its component
>> values, but that's not the compiler's job to decide; it's up to the user to
>> provide those.
>>
>> Regarding unsafety, it's worth noting that your ContiguouslyHashable
>> protocol is inherently unsafe and fragile. Imagine that a user implements a
>> struct that they make conform to ContiguouslyHashable because at the time,
>> it's a simple fixed layout type with primitive fields. If they take that
>> type and add a new field to it that isn't contiguously hashable (say, a
>> class reference) and they forget to replace the ContiguouslyHashable
>> conformance, they now have a very broken type, and that behavior isn't
>> caught until *runtime* when things go wrong.
>>
>> At least with derived conformances, in that situation the *compiler*
>> would see that the type was no longer hashable and emit an error when it's
>> used anywhere that Hashable/hashValue was expected.
>>
>> So if safety is your motivation, ContiguouslyHashable kind of fights back
>> against that because it gives the user a door they can open—in some cases,
>> accidentally—that produces invalid results.
>>
>>
>>
>>
>> It would however be nice to be able to annotate types for which you want
>> HashVisitable implementations to be generated.
>>
>>
>> That's one of the still-open questions from the last discussion thread on
>> the topic; whether auto-deriving should be automatic for any type where it
>> is possible (as it is now for enums without associated values) or whether
>> the user should have to opt-in.
>>
>>
>>
>>
>> I actually don't see auto-generation as an alternative to a proper
>> hashing API. But rather as an addition.
>> HashVisitable and auto-deriving would actually work great together!
>>
>>
>> I completely agree that these ideas complement each other very well! And
>> I also agree that the language would do well to make it easier for users to
>> easily do the right thing. I just feel that *the API proposed here
>> specifically* adds too much complexity for the problem it's trying to solve.
>>
>> I'd find it more compelling if it was simplified a bit:
>>
>> * The standard library doesn't need to provide a number of hash
>> implementations; it should just provide one that works "well enough" in
>> most cases
>> * Hashable doesn't have tie itself to a visitor pattern. It's not
>> necessarily safer because users can still mix the wrong values; in that
>> case, I'd rather the stdlib implementation of the "good enough" hash just
>> be a standalone mixer that the language can encourage users to use.
>>
>
> I agree with this assessment. If SipHash is deemed "good enough," then
> Swift should offer these as standalone facilities and encourage users to
> call them in `Hashable`. I think the API design proposed here is much too
> complex, but offering tried-and-true hash functions is certainly reasonable
> and would be an improvement over xor.
>
> Certainly also, the documentation for `Hashable` can be improved. In
> general, however, it's not very convincing to motivate an entire proposal
> for a new feature based on a documentation bug. Recognize that there are
> (or at least have been, in past shipped versions of Swift) code snippets in
> the documentation that _don't even compile_! If we accepted the logic put
> forward here, we should give up on Swift entirely; after all, if even the
> official Swift documentation can't provide code that compiles, what hope do
> the rest of us have?!
>
>
>
>>
>> In fact they already do in other languages <https://is.gd/iy5770>.
>>
>>
>> On Mon, Mar 13, 2017 at 8:51 PM, Tony Allevato via swift-evolution <
>> swift-evolution at swift.org> wrote:
>>
>> I'm not convinced this is the right approach: it couples the fact that a
>> type is hashable with a specific strategy for implementing the hash value
>> computation.
>>
>> Instead, the language should make efforts to avoid requiring the user to
>> think about hashing algorithms *at all*; to answer Sean's question a couple
>> messages up, I've proposed in the past
>> <https://gist.github.com/allevato/2fd10290bfa84accfbe977d8ac07daad> (rough
>> working draft) adding derivation of Equatable and Hashable for structs and
>> enums where it's possible to automatically do so (for example, if all of
>> the enum's associated values are Equatable/Hashable). I've been
>> experimenting with an implementation in the compiler and I have something
>> that works for enums, except for recursive ones. I haven't started structs
>> yet because I think there are more open questions there, and I hope to
>> propose enums and structs independently so that the enum one doesn't get
>> bogged down by the struct one.
>>
>> The core team seems to be aware of the need for this; the logic that
>> derives Equatable/Hashable for enums without associated values also has
>> TODO comments to handle those with associated values, and to handle structs
>> as well. Slava Pestov mentioned a while back that they encouraged PRs on
>> it, which is why I started, but my free time has been limited lately.
>>
>> That being said, I *do* think there would be value in having some kind of
>> hash "mixer" as part of the standard library and strongly encouraging
>> through documentation that hashValue implementors use it. Getting the
>> function right is the hard part, and if Swift both (1) took care of it for
>> you by default in many cases and (2) made the harder cases easier by
>> providing a high quality canned implementation, I think we'd have a much
>> cleaner solution than what is being proposed here.
>>
>>
>> On Mon, Mar 13, 2017 at 12:18 PM Sean Heber via swift-evolution <
>> swift-evolution at swift.org> wrote:
>>
>> Is there any reason the API couldn’t be expressed as something along
>> these lines?
>>
>> func hash() -> [Hashable] {
>>   return [x, y]
>> }
>>
>> l8r
>> Sean
>>
>>
>> > On Mar 13, 2017, at 2:15 PM, David Hart <david at hartbit.com> wrote:
>> >
>> >>
>> >> On 13 Mar 2017, at 18:54, Sean Heber via swift-evolution <
>> swift-evolution at swift.org> wrote:
>> >>
>> >> I’m dumb when it comes to proper hashing, but it’s such a tediously
>> common thing in my experience to need to add Hashable to some kind of a
>> struct so I can stash it in a set or use it as a dictionary key. Is there
>> really no way to make this all more automatic? I have to be honest - this
>> proposal seems *harder* to understand than the way it works now.
>> >
>> > It's not really harder: just call hash on each of your type's
>> significant values:
>> >
>> > x.hash(&hasher)
>> > y.hash(&hasher)
>> >
>> > How would you implement hashValue in a simpler way, remembering that 'x
>> ^ y' is an incorrect implementation?
>> >
>> >> Of course the easiest would be if the language could just do this
>> “good enough" for me using reflection or whatever and if I really did run
>> into a problem where I wanted to do this myself, I could override something.
>> >>
>> >> Perfect is the enemy of good.
>> >>
>> >> l8r
>> >> Sean
>> >>
>> >>
>> >>> On Mar 13, 2017, at 10:38 AM, Vincent Esche via swift-evolution <
>> swift-evolution at swift.org> wrote:
>> >>>
>> >>> Hi there,
>> >>>
>> >>> I've written up a proposal for an overhaul/replacement of the
>> Hashable protocol and would love to hear your feedback on it!
>> >>>
>> >>> Rendered | Blog Post
>> >>>
>> >>> Cheers,
>> >>> Vincent
>> >>>
>> >>> Ps: I'd like to thank David Hart (@hartbit) for his great editorial
>> feedback on this proposal. 👍
>> >>>
>> >>> HashVisitable
>> >>>
>> >>>   • Proposal: SE-NNNN
>> >>>   • Authors: Vincent Esche
>> >>>   • Review Manager: TBD
>> >>>   • Status: Awaiting review
>> >>> Introduction
>> >>>
>> >>> Replace the Hashable protocol by two new procotols (Hasher and
>> HashVisitable) to improve safety, versatility and learnability.
>> >>>
>> >>> Motivation
>> >>>
>> >>> Implementing Hashable is difficult and the consequences if not done
>> well have dire performance and safety repercussions.
>> >>>
>> >>> The documentation of Hashable lists a sample implementation of var
>> hashValue:
>> >>>
>> >>> /// A point in an x-y coordinate system.
>> >>> struct GridPoint {
>> >>>   var x: Int
>> >>>   var y: Int
>> >>> }
>> >>>
>> >>> extension GridPoint: Hashable {
>> >>>   var hashValue: Int {
>> >>>       return x.hashValue ^ y.hashValue
>> >>>   }
>> >>>
>> >>>   static func == (lhs: GridPoint, rhs: GridPoint) -> Bool {
>> >>>       return lhs.x == rhs.x && lhs.y == rhs.y
>> >>>   }
>> >>> }
>> >>>
>> >>> Calculating the hashes of all GridPoints (given the above
>> implementation) on a 1000 × 1000 grid …
>> >>>
>> >>> let (width, height) = (1000, 1000)
>> >>> let total = width * height
>> >>> var hashes = Set<Int>()
>> >>> for x in 0..<width {
>> >>>   for y in 0..<height {
>> >>>       hashes.insert(GridPoint(x: x, y: y).hashValue)
>> >>>   }
>> >>> }
>> >>> print("\(hashes.count) unique hashes out of a total of \(total).")
>> >>>
>> >>> … results in just 1024 unique hash values for 1_000_000 unique values.
>> >>>
>> >>> In other words: The recommended implementation causes 99.9% of values
>> to trigger a hash collision.
>> >>>
>> >>> Out of those 1_000_000 values the median collision count was 976 with
>> min and max being 976 and 1000respectively.
>> >>>
>> >>> The collision rate will have negative impact in algorithms which
>> heavily use hashValue like the ones in Dictionaryand Set. Furthermore, it
>> increases the vulnerability to DDOS attacks when exposed to the web.
>> >>>
>> >>> If even the official Swift documentation gets the implementation of
>> hashValue wrong, then who is to expect the average Swift programmer to do
>> any better?
>> >>>
>> >>> In contrast, running the same snippet using HashVisitable and the
>> semi-secure Fnv1aHash (see below) results in zero collisions!
>> >>>
>> >>> Finally, the design of the Hashable protocol forces the use of one
>> implementation without the possibility of switching between multiple
>> hashing algorithms.
>> >>>
>> >>> Proposed solution
>> >>>
>> >>> Instead of coupling the hashing algorithm with each and every Swift
>> type, we should provide a hashing API based on the visitor-pattern. By
>> freeing application developers from the burden of having to implement
>> hashing algorithms, the Standard Library can provide default ones which
>> fulfill collision, performance and security goals. Furthermore, it would
>> allow developers to swap to different algorithms based on the use case.
>> >>>
>> >>> Detailed design
>> >>>
>> >>> The proposal deprecates the Hashable protocol and introduces the
>> following two:
>> >>>
>> >>> protocol Hasher
>> >>> {
>> >>>
>> >>> mutating func finish() -> Int
>> >>>
>> >>>
>> >>> mutating func write(bytes
>> >>> : UnsafeRawBufferPointer)
>> >>> }
>> >>>
>> >>>
>> >>> protocol HashVisitable
>> >>> {
>> >>>
>> >>> func hash<H: Hasher>(_ hasher: inout
>> >>> H)
>> >>> }
>> >>>
>> >>> Hasher is the protocol which represents a hashing algorithm, and
>> HashVisitable replaces Hashable. For types entirely represented by their
>> memory layout, the following protocol would provide a default
>> implementation:
>> >>>
>> >>> protocol ContiguouslyHashable: HashVisitable {}
>> >>>
>> >>> extension ContiguouslyHashable {
>> >>>   func hash<H: Hasher>(_ hasher: inout H) {
>> >>>       var mutableSelf = self
>> >>>       try! Swift.withUnsafeBytes(of: &mutableSelf) {
>> >>>           hasher.write(bytes: $0)
>> >>>       }
>> >>>   }
>> >>> }
>> >>>
>> >>> extension Bool : ContiguouslyHashable {}
>> >>> extension UInt8 : ContiguouslyHashable {}
>> >>> extension UInt16 : ContiguouslyHashable {}
>> >>> extension UInt32 : ContiguouslyHashable {}
>> >>> extension UInt64 : ContiguouslyHashable {}
>> >>> extension UInt : ContiguouslyHashable {}
>> >>> extension Int8 : ContiguouslyHashable {}
>> >>> extension Int16 : ContiguouslyHashable {}
>> >>> extension Int32 : ContiguouslyHashable {}
>> >>> extension Int64 : ContiguouslyHashable {}
>> >>> extension Int : ContiguouslyHashable {}
>> >>>
>> >>> The Standard-Library would then provide a set of hashing
>> implementations specific to each purpose. A possible choice for hashing
>> algorithms would be the reasonably fast SipHash-2-4, and the reasonably
>> secure SipHash-4-8.
>> >>>
>> >>> FNV-1A is another popular semi-secure but blazingly fast hash
>> algorithm, which – for the sake of demonstration – could be implemented as
>> follows:
>> >>>
>> >>> struct Fnv1aHash
>> >>> {
>> >>>
>> >>> fileprivate var state: UInt
>> >>>
>> >>>
>> >>> init(seed: UInt
>> >>> ) {
>> >>>
>> >>> self.state = seed &+ 14695981039346656037
>> >>>
>> >>>   }
>> >>> }
>> >>>
>> >>>
>> >>> extension Fnv1aHash: Hasher
>> >>> {
>> >>>
>> >>> mutating func write(bytes
>> >>> : UnsafeRawBufferPointer) {
>> >>>
>> >>> for byte in
>> >>> bytes {
>> >>>
>> >>> self.state = (self.state ^ UInt(byte)) &* 1099511628211
>> >>>
>> >>>       }
>> >>>   }
>> >>>
>> >>> mutating func finish() -> Int
>> >>> {
>> >>>
>> >>> return unsafeBitCast(self.state, to: Int.self
>> >>> )
>> >>>   }
>> >>> }
>> >>>
>> >>> Coming back to the sample code present in the Hashable documentation,
>> the new implementation would look like:
>> >>>
>> >>> extension GridPoint: HashVisitable
>> >>> {
>> >>>
>> >>> func hash<H: Hasher>(_ hasher: inout
>> >>> H) {
>> >>>
>> >>> self.x.hash(&
>> >>> hasher)
>> >>>
>> >>> self.y.hash(&
>> >>> hasher)
>> >>>   }
>> >>> }
>> >>>
>> >>> Source compatibility
>> >>>
>> >>> Making use of "extending protocols to conform to protocols":
>> >>>
>> >>> extension Hashable: HashVisitable
>> >>> {
>> >>>
>> >>> func hash<H: Hasher>(_ hasher: inout
>> >>> H) {
>> >>>
>> >>> self.hashValue.hash(&
>> >>> hasher)
>> >>>   }
>> >>> }
>> >>>
>> >>> Effect on ABI stability
>> >>>
>> >>> n/a
>> >>>
>> >>> Effect on API resilience
>> >>>
>> >>> This feature should be possible to add/remove without breaking ABI.
>> >>>
>> >>> Alternatives considered
>> >>>
>> >>> n/a
>> >>> _______________________________________________
>> >>> swift-evolution mailing list
>> >>> swift-evolution at swift.org
>> >>> https://lists.swift.org/mailman/listinfo/swift-evolution
>> >>
>> >> _______________________________________________
>> >> swift-evolution mailing list
>> >> swift-evolution at swift.org
>> >> https://lists.swift.org/mailman/listinfo/swift-evolution
>>
>> _______________________________________________
>> swift-evolution mailing list
>> swift-evolution at swift.org
>> https://lists.swift.org/mailman/listinfo/swift-evolution
>>
>>
>> _______________________________________________
>> swift-evolution mailing list
>> swift-evolution at swift.org
>> https://lists.swift.org/mailman/listinfo/swift-evolution
>>
>>
>>
>> _______________________________________________
>> swift-evolution mailing list
>> swift-evolution at swift.org
>> https://lists.swift.org/mailman/listinfo/swift-evolution
>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.swift.org/pipermail/swift-evolution/attachments/20170314/ce6811f8/attachment.html>


More information about the swift-evolution mailing list