[swift-dev] Dictionary Collision Resolution Guarantees

Thu Oct 13 13:47:07 CDT 2016

I’m currently cleaning up the docs on Dictionary to reflect the new indexing model, and I stumbled across the note that the following code should work assuming no reallocations occur:

//
//   var (i, found) = d.find(k) // i is associated with d's buffer
//   if found {
//      var e = d            // now d is sharing its data with e
//      e[newKey] = newValue // e now has a unique copy of the data
//      return e[i]          // use i to access e
//   }
//

This is effectively assuming that the open-addressing scheme being used is first-come-first-serve (FCFS). That is, any element being inserted can *only* be inserted into vacant buckets, rather than displacing existing elements. This is currently only internal docs, but do we actually want to guarantee this? The popular alternative of robin hood (RH) doesn’t follow this.

Some background: One interesting optimization avenue worth exploring is to tweak Dictionary to store hashes, rather than bits, to identify occupied slots (with some careful masking so that 0 still means “unoccupied”). Then searching for elements can be implemented as follows:

let hash = hashFromKey(key)
var i = indexFromHash(hash)
while true {
  if hashes[i] == 0 {
    // vacant, not contained
  } else if hashes[i] == hash {
    // maybe already exists?
    if keys[i] == key {
      // actually exists
    }
  }
  i = nextIndex(i)
}

The interesting property of this is that it almost all of the search time is spent linearly walking through an array and doing simple comparisons on integers, which is of course really easy to optimize (potentially even SIMD). 99.9999% of the time, if the element isn’t stored in the Dictionary, then you’ll just hit a 0 hash, and never look at the keys. Similarly, 99.9999% of the time, if the element *is* stored in the Dictionary, you’ll only do a single key comparison (on the actually equal element). So this is also really great for cache — pretty much all of your access are just linear scans of the hashes. 

The main downside is you’re now “wasting” more memory to store hashes, but you can potentially compensate for this by truncating the hashes to 8/16 bits for small Dictionaries (which will be better for SIMD anyway).

So what does this have to do with the RH scheme? Compared to FCFS, RH generally leads to lower variance on search distances, and provides a mechanism to short-circuit long runs if you have the hashes handy. If you find hashes which want to live later in the array than where you started, then the current element definitely isn’t contained. This means you can increase the load factor on your dictionary and further improve cache/memory usage (compensating for the memory-usage loss from storing hashes). However it will be prohibitively expensive if you require hashes to be recomputed on-the-fly.