[swift-dev] Dictionary Collision Resolution Guarantees

Thu Oct 13 17:09:45 CDT 2016

on Thu Oct 13 2016, Alexis <swift-dev-AT-swift.org> wrote:

> I’m currently cleaning up the docs on Dictionary to reflect the new indexing model, and I stumbled
> across the note that the following code should work assuming no reallocations occur:
>
> //
> //   var (i, found) = d.find(k) // i is associated with d's buffer
> //   if found {
> //      var e = d            // now d is sharing its data with e
> //      e[newKey] = newValue // e now has a unique copy of the data
> //      return e[i]          // use i to access e
> //   }
> //
>
> This is effectively assuming that the open-addressing scheme being
> used is first-come-first-serve (FCFS). That is, any element being
> inserted can *only* be inserted into vacant buckets, rather than
> displacing existing elements. This is currently only internal docs,
> but do we actually want to guarantee this? 

No, not to users.  But "//" comments are not user-level comments and
don't imply any guarantees.

> The popular alternative of robin hood (RH) doesn’t follow this.
>
> Some background: One interesting optimization avenue worth exploring
> is to tweak Dictionary to store hashes, rather than bits, to identify
> occupied slots (with some careful masking so that 0 still means
> “unoccupied”). Then searching for elements can be implemented as
> follows:
>
> let hash = hashFromKey(key)
> var i = indexFromHash(hash)
> while true {
>   if hashes[i] == 0 {
>     // vacant, not contained
>   } else if hashes[i] == hash {
>     // maybe already exists?
>     if keys[i] == key {
>       // actually exists
>     }
>   }
>   i = nextIndex(i)
> }
>
> The interesting property of this is that it almost all of the search
> time is spent linearly walking through an array and doing simple
> comparisons on integers, which is of course really easy to optimize
> (potentially even SIMD). 

Yep, that's cool.  It does cost a lot more storage, though.  Tradeoffs.

> 99.9999% of the time, if the element isn’t stored in the Dictionary,
> then you’ll just hit a 0 hash, and never look at the keys. Similarly,
> 99.9999% of the time, if the element *is* stored in the Dictionary,
> you’ll only do a single key comparison (on the actually equal
> element). So this is also really great for cache — pretty much all of
> your access are just linear scans of the hashes.
>
> The main downside is you’re now “wasting” more memory to store hashes,
> but you can potentially compensate for this by truncating the hashes
> to 8/16 bits for small Dictionaries (which will be better for SIMD
> anyway).

Hey, cool.

> So what does this have to do with the RH scheme? Compared to FCFS, RH
> generally leads to lower variance on search distances, and provides a
> mechanism to short-circuit long runs if you have the hashes handy. If
> you find hashes which want to live later in the array than where you
> started, then the current element definitely isn’t contained. This
> means you can increase the load factor on your dictionary and further
> improve cache/memory usage (compensating for the memory-usage loss
> from storing hashes). However it will be prohibitively expensive if
> you require hashes to be recomputed on-the-fly.

Sounds like a worthy optimization!

-- 
-Dave