[swift-evolution] [Draft] Hasher & HashVisitable

Tony Allevato tony.allevato at gmail.com
Mon Mar 13 20:30:23 CDT 2017


Adding the list back to this reply because I don't believe you meant to
reply only to me, and I think it's an important discussion to have :)


On Mon, Mar 13, 2017 at 4:32 PM Vincent Esche <
regexident.mailinglists at gmail.com> wrote:

Automatic generation of Hashable implementation code only "solves" the
problem of having to implement `var hashValue`.
It however only works for some types. By no means for all.


Certainly, and I never suggested that it would work for every case. I was
responding to Sean's point that the compiler should be able to do "good
enough" in the common cases and I offered a way that that can be achieved.


Take this hypothetical implementation of a HSB color:

struct Color {
let hue: UInt8
let saturation: UInt8
let brightness: UInt8
}

Let the semantic of this type be this:
Any two colors with a `brightness` of `0` are to be considered equal
regardless of their respective `hue` or `saturation`. At night all cats are
gray.


An auto-generated conformance impl for `Color` would most likely be wrong
afaict.
And those kind of semantics are everywhere.


Of course, and in those cases, you wouldn't want to use an auto-derived
Equatable or Hashable implementation. Those kinds of semantics are
"everywhere" for certain definitions of "everywhere", but similarly,
semantics where the hash value of a thing can be determined simply via
combination of the hash values of its components are also "everywhere".

I wouldn't suggest that auto-deriving Hashable solves *all* problems, and
similarly, your proposal doesn't help in the situation you described
either. Your API provides a different way to mix the hue, saturation, and
brightness components in an implementation of hashValue, but it doesn't
force the user to mix them *correctly* (by excluding hue/saturation if
brightness is zero).

So in both cases, users still have to provide custom implementations of ==
and hashValue. But without auto-deriving, users have to provide them even
for the cases where the auto-derived implementation *would have been
correct.*  Auto-deriving is about reducing the number of types that need to
have custom implementations, thereby reducing the chance that a user will
do it wrong.

When considering a type with auto-derived Hashable and Equatable, there are
two ways it can be wrong:

1) The auto-generated implementations of both == and hashValue don't honor
the semantic contract of the type (for example, don't ignore brightness
when it's zero).
2) The user overrides the auto-generated implementation of one of
==/hashValue but not both, and violates the contracts between them.

For #1, yes, the compiler generated an incorrect implementation in that
case. However, I would argue it's the developer's responsibility to fix it
by overriding it if they need different semantics. As I mentioned above,
without derivation, the developer could still implement it incorrectly,
just as they could with your API.

For #2, this could be solved by requiring users to override both if they
override one of them. Now they're back in the same situation as #1—they
either did it right, or they did it wrong.




Auto-generation clearly is no panacea when it comes to hashing. Especially
if it leads to invalid and unsafe default implementations.


Auto-deriving cannot produce "unsafe" implementations because it's defined
in terms of a function operating over the hash values of its components. It
can produce an implementation that does not match the intended semantics of
the class that are defined outside of its component values, but that's not
the compiler's job to decide; it's up to the user to provide those.

Regarding unsafety, it's worth noting that your ContiguouslyHashable
protocol is inherently unsafe and fragile. Imagine that a user implements a
struct that they make conform to ContiguouslyHashable because at the time,
it's a simple fixed layout type with primitive fields. If they take that
type and add a new field to it that isn't contiguously hashable (say, a
class reference) and they forget to replace the ContiguouslyHashable
conformance, they now have a very broken type, and that behavior isn't
caught until *runtime* when things go wrong.

At least with derived conformances, in that situation the *compiler* would
see that the type was no longer hashable and emit an error when it's used
anywhere that Hashable/hashValue was expected.

So if safety is your motivation, ContiguouslyHashable kind of fights back
against that because it gives the user a door they can open—in some cases,
accidentally—that produces invalid results.




It would however be nice to be able to annotate types for which you want
HashVisitable implementations to be generated.


That's one of the still-open questions from the last discussion thread on
the topic; whether auto-deriving should be automatic for any type where it
is possible (as it is now for enums without associated values) or whether
the user should have to opt-in.




I actually don't see auto-generation as an alternative to a proper hashing
API. But rather as an addition.
HashVisitable and auto-deriving would actually work great together!


I completely agree that these ideas complement each other very well! And I
also agree that the language would do well to make it easier for users to
easily do the right thing. I just feel that *the API proposed here
specifically* adds too much complexity for the problem it's trying to solve.

I'd find it more compelling if it was simplified a bit:

* The standard library doesn't need to provide a number of hash
implementations; it should just provide one that works "well enough" in
most cases
* Hashable doesn't have tie itself to a visitor pattern. It's not
necessarily safer because users can still mix the wrong values; in that
case, I'd rather the stdlib implementation of the "good enough" hash just
be a standalone mixer that the language can encourage users to use.





In fact they already do in other languages <https://is.gd/iy5770>.


On Mon, Mar 13, 2017 at 8:51 PM, Tony Allevato via swift-evolution <
swift-evolution at swift.org> wrote:

I'm not convinced this is the right approach: it couples the fact that a
type is hashable with a specific strategy for implementing the hash value
computation.

Instead, the language should make efforts to avoid requiring the user to
think about hashing algorithms *at all*; to answer Sean's question a couple
messages up, I've proposed in the past
<https://gist.github.com/allevato/2fd10290bfa84accfbe977d8ac07daad> (rough
working draft) adding derivation of Equatable and Hashable for structs and
enums where it's possible to automatically do so (for example, if all of
the enum's associated values are Equatable/Hashable). I've been
experimenting with an implementation in the compiler and I have something
that works for enums, except for recursive ones. I haven't started structs
yet because I think there are more open questions there, and I hope to
propose enums and structs independently so that the enum one doesn't get
bogged down by the struct one.

The core team seems to be aware of the need for this; the logic that
derives Equatable/Hashable for enums without associated values also has
TODO comments to handle those with associated values, and to handle structs
as well. Slava Pestov mentioned a while back that they encouraged PRs on
it, which is why I started, but my free time has been limited lately.

That being said, I *do* think there would be value in having some kind of
hash "mixer" as part of the standard library and strongly encouraging
through documentation that hashValue implementors use it. Getting the
function right is the hard part, and if Swift both (1) took care of it for
you by default in many cases and (2) made the harder cases easier by
providing a high quality canned implementation, I think we'd have a much
cleaner solution than what is being proposed here.


On Mon, Mar 13, 2017 at 12:18 PM Sean Heber via swift-evolution <
swift-evolution at swift.org> wrote:

Is there any reason the API couldn’t be expressed as something along these
lines?

func hash() -> [Hashable] {
  return [x, y]
}

l8r
Sean


> On Mar 13, 2017, at 2:15 PM, David Hart <david at hartbit.com> wrote:
>
>>
>> On 13 Mar 2017, at 18:54, Sean Heber via swift-evolution <
swift-evolution at swift.org> wrote:
>>
>> I’m dumb when it comes to proper hashing, but it’s such a tediously
common thing in my experience to need to add Hashable to some kind of a
struct so I can stash it in a set or use it as a dictionary key. Is there
really no way to make this all more automatic? I have to be honest - this
proposal seems *harder* to understand than the way it works now.
>
> It's not really harder: just call hash on each of your type's significant
values:
>
> x.hash(&hasher)
> y.hash(&hasher)
>
> How would you implement hashValue in a simpler way, remembering that 'x ^
y' is an incorrect implementation?
>
>> Of course the easiest would be if the language could just do this “good
enough" for me using reflection or whatever and if I really did run into a
problem where I wanted to do this myself, I could override something.
>>
>> Perfect is the enemy of good.
>>
>> l8r
>> Sean
>>
>>
>>> On Mar 13, 2017, at 10:38 AM, Vincent Esche via swift-evolution <
swift-evolution at swift.org> wrote:
>>>
>>> Hi there,
>>>
>>> I've written up a proposal for an overhaul/replacement of the Hashable
protocol and would love to hear your feedback on it!
>>>
>>> Rendered | Blog Post
>>>
>>> Cheers,
>>> Vincent
>>>
>>> Ps: I'd like to thank David Hart (@hartbit) for his great editorial
feedback on this proposal. 👍
>>>
>>> HashVisitable
>>>
>>>   • Proposal: SE-NNNN
>>>   • Authors: Vincent Esche
>>>   • Review Manager: TBD
>>>   • Status: Awaiting review
>>> Introduction
>>>
>>> Replace the Hashable protocol by two new procotols (Hasher and
HashVisitable) to improve safety, versatility and learnability.
>>>
>>> Motivation
>>>
>>> Implementing Hashable is difficult and the consequences if not done
well have dire performance and safety repercussions.
>>>
>>> The documentation of Hashable lists a sample implementation of var
hashValue:
>>>
>>> /// A point in an x-y coordinate system.
>>> struct GridPoint {
>>>   var x: Int
>>>   var y: Int
>>> }
>>>
>>> extension GridPoint: Hashable {
>>>   var hashValue: Int {
>>>       return x.hashValue ^ y.hashValue
>>>   }
>>>
>>>   static func == (lhs: GridPoint, rhs: GridPoint) -> Bool {
>>>       return lhs.x == rhs.x && lhs.y == rhs.y
>>>   }
>>> }
>>>
>>> Calculating the hashes of all GridPoints (given the above
implementation) on a 1000 × 1000 grid …
>>>
>>> let (width, height) = (1000, 1000)
>>> let total = width * height
>>> var hashes = Set<Int>()
>>> for x in 0..<width {
>>>   for y in 0..<height {
>>>       hashes.insert(GridPoint(x: x, y: y).hashValue)
>>>   }
>>> }
>>> print("\(hashes.count) unique hashes out of a total of \(total).")
>>>
>>> … results in just 1024 unique hash values for 1_000_000 unique values.
>>>
>>> In other words: The recommended implementation causes 99.9% of values
to trigger a hash collision.
>>>
>>> Out of those 1_000_000 values the median collision count was 976 with
min and max being 976 and 1000respectively.
>>>
>>> The collision rate will have negative impact in algorithms which
heavily use hashValue like the ones in Dictionaryand Set. Furthermore, it
increases the vulnerability to DDOS attacks when exposed to the web.
>>>
>>> If even the official Swift documentation gets the implementation of
hashValue wrong, then who is to expect the average Swift programmer to do
any better?
>>>
>>> In contrast, running the same snippet using HashVisitable and the
semi-secure Fnv1aHash (see below) results in zero collisions!
>>>
>>> Finally, the design of the Hashable protocol forces the use of one
implementation without the possibility of switching between multiple
hashing algorithms.
>>>
>>> Proposed solution
>>>
>>> Instead of coupling the hashing algorithm with each and every Swift
type, we should provide a hashing API based on the visitor-pattern. By
freeing application developers from the burden of having to implement
hashing algorithms, the Standard Library can provide default ones which
fulfill collision, performance and security goals. Furthermore, it would
allow developers to swap to different algorithms based on the use case.
>>>
>>> Detailed design
>>>
>>> The proposal deprecates the Hashable protocol and introduces the
following two:
>>>
>>> protocol Hasher
>>> {
>>>
>>> mutating func finish() -> Int
>>>
>>>
>>> mutating func write(bytes
>>> : UnsafeRawBufferPointer)
>>> }
>>>
>>>
>>> protocol HashVisitable
>>> {
>>>
>>> func hash<H: Hasher>(_ hasher: inout
>>> H)
>>> }
>>>
>>> Hasher is the protocol which represents a hashing algorithm, and
HashVisitable replaces Hashable. For types entirely represented by their
memory layout, the following protocol would provide a default
implementation:
>>>
>>> protocol ContiguouslyHashable: HashVisitable {}
>>>
>>> extension ContiguouslyHashable {
>>>   func hash<H: Hasher>(_ hasher: inout H) {
>>>       var mutableSelf = self
>>>       try! Swift.withUnsafeBytes(of: &mutableSelf) {
>>>           hasher.write(bytes: $0)
>>>       }
>>>   }
>>> }
>>>
>>> extension Bool : ContiguouslyHashable {}
>>> extension UInt8 : ContiguouslyHashable {}
>>> extension UInt16 : ContiguouslyHashable {}
>>> extension UInt32 : ContiguouslyHashable {}
>>> extension UInt64 : ContiguouslyHashable {}
>>> extension UInt : ContiguouslyHashable {}
>>> extension Int8 : ContiguouslyHashable {}
>>> extension Int16 : ContiguouslyHashable {}
>>> extension Int32 : ContiguouslyHashable {}
>>> extension Int64 : ContiguouslyHashable {}
>>> extension Int : ContiguouslyHashable {}
>>>
>>> The Standard-Library would then provide a set of hashing
implementations specific to each purpose. A possible choice for hashing
algorithms would be the reasonably fast SipHash-2-4, and the reasonably
secure SipHash-4-8.
>>>
>>> FNV-1A is another popular semi-secure but blazingly fast hash
algorithm, which – for the sake of demonstration – could be implemented as
follows:
>>>
>>> struct Fnv1aHash
>>> {
>>>
>>> fileprivate var state: UInt
>>>
>>>
>>> init(seed: UInt
>>> ) {
>>>
>>> self.state = seed &+ 14695981039346656037
>>>
>>>   }
>>> }
>>>
>>>
>>> extension Fnv1aHash: Hasher
>>> {
>>>
>>> mutating func write(bytes
>>> : UnsafeRawBufferPointer) {
>>>
>>> for byte in
>>> bytes {
>>>
>>> self.state = (self.state ^ UInt(byte)) &* 1099511628211
>>>
>>>       }
>>>   }
>>>
>>> mutating func finish() -> Int
>>> {
>>>
>>> return unsafeBitCast(self.state, to: Int.self
>>> )
>>>   }
>>> }
>>>
>>> Coming back to the sample code present in the Hashable documentation,
the new implementation would look like:
>>>
>>> extension GridPoint: HashVisitable
>>> {
>>>
>>> func hash<H: Hasher>(_ hasher: inout
>>> H) {
>>>
>>> self.x.hash(&
>>> hasher)
>>>
>>> self.y.hash(&
>>> hasher)
>>>   }
>>> }
>>>
>>> Source compatibility
>>>
>>> Making use of "extending protocols to conform to protocols":
>>>
>>> extension Hashable: HashVisitable
>>> {
>>>
>>> func hash<H: Hasher>(_ hasher: inout
>>> H) {
>>>
>>> self.hashValue.hash(&
>>> hasher)
>>>   }
>>> }
>>>
>>> Effect on ABI stability
>>>
>>> n/a
>>>
>>> Effect on API resilience
>>>
>>> This feature should be possible to add/remove without breaking ABI.
>>>
>>> Alternatives considered
>>>
>>> n/a
>>> _______________________________________________
>>> swift-evolution mailing list
>>> swift-evolution at swift.org
>>> https://lists.swift.org/mailman/listinfo/swift-evolution
>>
>> _______________________________________________
>> swift-evolution mailing list
>> swift-evolution at swift.org
>> https://lists.swift.org/mailman/listinfo/swift-evolution

_______________________________________________
swift-evolution mailing list
swift-evolution at swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution


_______________________________________________
swift-evolution mailing list
swift-evolution at swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.swift.org/pipermail/swift-evolution/attachments/20170314/81fd772a/attachment.html>


More information about the swift-evolution mailing list