[swift-evolution] [Draft] Hasher & HashVisitable

Tue Mar 14 04:06:26 CDT 2017

Turns out I’m rather bad at mailing lists. Hit “reply” (vs. "reply all")
again.

On Tue, Mar 14, 2017 at 3:34 AM, Károly Lőrentey <karoly at lorentey.hu> wrote:

>
> On 2017. Mar 13., at 23:56, Vincent Esche <regexident.mailinglists@
> gmail.com> wrote:
>
> On Mon, Mar 13, 2017 at 9:56 PM, Károly Lőrentey <karoly at lorentey.hu>
> wrote:
>
>> Yes please! I’ve a package on GitHub to implement roughly the same thing.
>> I’ve been happily using it for months now, and I wouldn’t ever write a
>> hashValue implementation by hand again.
>>
>
>>
> https://github.com/lorentey/SipHash
>>
>> I think the fact that we’ve both come up with essentially the same API is
>> an interesting data point; it definitely underlines the need for a better
>> Hashable protocol.
>>
>
> It's not just us, actually. Rust does the very same thing
> <https://doc.rust-lang.org/std/hash/>.
> A very similar concept has been proposed
> <https://isocpp.org/files/papers/n3980.html> and implemented
> <https://github.com/HowardHinnant/hash_append> for C++ by Howard Hinnant.
>
>
> Hm. It is possible I may have been influenced by one of these.
>
> My comments:
>>
>> * In an ideal world, this would be a replacement for Hashable, not a
>> protocol with a different name. Once visitor-based hashing becomes
>> available, nobody should implement a hashValue property.
>>
>
> Same here. I fear however that turning a protocol upside down would do
> more harm than good.
>
>
>> * All standard Hashable types in standard library should implement the
>> new hash function directly, rather than relying on the default
>> implementation.
>>
>
> Yes.
>
> * Why is the HashVisitable.hash a generic function? Hasher could just as
>> well be a concrete type defined in stdlib. Making it generic may have some
>> performance implications.
>>
>
> Because we specifically don't want to
> <https://blog.definiteloops.com/ha-r-sh-visitors-8c0c3686a46f#.6emnc0d5x>
> be limited to a single hash implementation.
>
>
> Allowing custom hashers indeed sounds nice in theory, but are there any
> more practical examples where you'd want to actually do that? Are Bloom
> Filters an important enough use case on their own to justify the extra
> complexity? A concrete Hasher type would still allow stdlib to switch to a
> different hash algorithm; this is already a long way beyond what people are
> typically doing with Hashable.)
>

Bloom Filters are – while very interesting – an edge case, admittedly.
Being able to choose a different hasher for certain Dictionaries, Sets, …
however is a key goal for me. You want secure defaults, yet be able to
exchange them with a weaker impl for safe bottlenecks.

Especially in the scope of Swift going more and more into systems
programming territory. You don’t want to wait yourself into a corner now.

I'm also worried people would find clever ways to abuse the visitor
> parameter in ways not related to hashing. (E.g., you can implement one half
> of an amusingly broken serialization library in just two minutes by making
> OutputStream conform to Hasher. Implementing deserialization might take a
> *little* more work, though.) 😈
>

While technically true that’s kind of a straw man argument, isn’t it? 😉
I could easily turn it around and argue that nothing hinders me from
implementing half of my app in `var hashValue`. 😈

* I find that I prefer to hash components by calling a mutating method on
>> the hasher, rather than directly calling the components' hash
>> implementations. Putting the hasher first is much more readable to me,
>> primarily because it gets rid of all the distracting &s. It also makes it
>> possible to find slightly better names, eliminating the repetitiveness of
>> "foo.hash(&hasher)":
>>
>> extension GridPoint: SipHashable {
>>     func appendHashes(to hasher: inout
>>  SipHasher) {
>>          hasher.append(x)
>>          hasher.append(y)
>>     }
>> }
>>
>
> That would require SipHasher to overload `append` for every type
> imaginable. And it would require SipHasher know about the type's memory
> layout, which violates encapsulation. "What should be hashed?" should be
> specified in the type. "How should it be hashed?" however in the given
> hasher.
>
>
> Hasher's mutating method would be implemented as a one-line generic
> function:
>
> https://github.com/lorentey/SipHash/blob/master/sources/
> SipHashable.swift#L56
>
> The actual hashing would of course still be done by implementations of the
> HashVisitable protocol. This is just a convenience thing.
>

Oh, I misread then.

(Admittedly this would introduce a generic function, while at the same time
> I'm also arguing for making the protocol requirement non-generic partly
> due to (not very convincing) worries about performance. But this would be a
> single fragile generic in stdlib, not many generics sprinkled over separate
> user-defined modules.)
>
>
> * I suggest using SipHash instead of FNV-1a. The standard library already
>> contains an implementation for SipHash, as undocumented internal API,
>> complete with a note that it should be made public. AFAICR, it is currently
>> only used for String hashing, but it would be very much worth making it
>> universal. (Accomodating SipHash's random key is one reason why hashValue’s
>> documentation explicitly notes that its value is "not guaranteed to be
>> equal across different executions of your program.”)
>>
>
> I do not propose making FNV-1a part of stdlib. I just included FNV-1a
> because it is trivial to implement and easily fits in a proposal as a
> sample impl of Hasher.
>
> I do however propose to include SipHash-2-4, SipHash-4-8 and maybe even
> SipHash-1-3.
>
>
> Oh, indeed, I misread the proposal. (Sorry!) But if the API allows for
> custom hashers, and stdlib comes with more than one, then which one would
> Set and Dictionary use? Would their choice of hasher be user configurable?
> stdlib has been remarkably free of such elaborations so far. The
> alternative is to bake a particular hash into the standard hashed
> collections, but then the overall API would be strangely lopsided, with an
> inconsistent mix of flexible and rigid parts.
>

I would propose to make them generic, but default to a Hasher provided by
stdlib (like `Set<Element, Hasher = SipHash13>`).

> All these questions go away if there is but a single standard Hasher.
> Well, except the question about which algorithm should be the One. But I
> think any keyed hash function would be much better than the status quo.
>
> * ContiguouslyHashable seems like an unsafe construct to me. It is quite
>> dangerous to base hashing on the raw byte sequence underlying a value: for
>> example, struct values may include uninitialized gaps between some of their
>> stored properties due to alignment constraints. So two otherwise identical
>> values may very well have different in-memory representations.
>>
>
> These type are clearly not ContiguouslyHashable then. Which was also
> explained in detail in the accompanying blog post:
>
> "ContiguouslyHashable marks a special family of types which allow for
> simply passing their contiguous memory to the given hasher en bloc, which
> Float, Double and String would decidedly not conform to, while for types
> matching these criteria then it would be a simple matter of tagging them
> with the protocol." Blog Post
> <https://blog.definiteloops.com/ha-r-sh-visitors-8c0c3686a46f#.6emnc0d5x>
>
>
>> Therefore, I suggest ContiguouslyHashable should be removed from the
>> proposal. Swift programmers who know what they’re doing would still be able
>> to call withUnsafeBytes(of:) when they want to, but regular schmoes like
>> myself will not be tempted to shoot themselves in the foot by using it
>> inappropriately.
>>
>
> Fine with me. I don't care to much about how those stdlib primitive types
> gain their hashing powers. ;)
>
>
> To be clear, I think it would be fine to keep ContiguouslyHashable as an
> internal implementation detail in stdlib; it just shouldn't be a public,
> documented protocol in its current form. (Perhaps as
> UnsafeContiguouslyHashable? 🙂) I don't think Swift programmers in general
> are aware of such subtle aspects of how their structs are laid out in
> memory. It's certainly a fuzzy area for me, at least; for example, I
> remember getting surprised by the fact that Float80 values end with six
> uninitialized bytes; no other floating point type in stdlib is padded like
> that.
>

I’d be absolutely fine with just auto-generating HashVisitable for all the
types I had tagged as ContiguouslyHashVisitable instead. Thinking about it
I’d actually prefer that now. Keep it an implementation detail and for all
those who really want to hash a type from its contiguous memory slice the
code necessary is easy to whip up yourself.

> To pick another nit: is it necessary to put a label on the parameter of
> Hasher.write(bytes:)? I may not be up to date on the naming convention, but
> it seems to me it would read just as well without the label. (I also think
> there are problems with the name HashVisitable.hash(_:), but my best
> attempt at naming it is writeHashes(to:) which isn't much of an
> improvement.)
>

I haven’t put too much effort into the naming, to be honest. As such I
don’t really like HashVisitable either. But just changing the semantics of
Hashable by 180° for the sake of a better name isn’t feasible.

>
> Karoly
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.swift.org/pipermail/swift-evolution/attachments/20170314/d956b306/attachment.html>