[swift-dev] Dictionary Collision Resolution Guarantees
Dave Abrahams
dabrahams at apple.com
Thu Oct 13 17:09:45 CDT 2016
on Thu Oct 13 2016, Alexis <swift-dev-AT-swift.org> wrote:
> I’m currently cleaning up the docs on Dictionary to reflect the new indexing model, and I stumbled
> across the note that the following code should work assuming no reallocations occur:
>
> //
> // var (i, found) = d.find(k) // i is associated with d's buffer
> // if found {
> // var e = d // now d is sharing its data with e
> // e[newKey] = newValue // e now has a unique copy of the data
> // return e[i] // use i to access e
> // }
> //
>
> This is effectively assuming that the open-addressing scheme being
> used is first-come-first-serve (FCFS). That is, any element being
> inserted can *only* be inserted into vacant buckets, rather than
> displacing existing elements. This is currently only internal docs,
> but do we actually want to guarantee this?
No, not to users. But "//" comments are not user-level comments and
don't imply any guarantees.
> The popular alternative of robin hood (RH) doesn’t follow this.
>
> Some background: One interesting optimization avenue worth exploring
> is to tweak Dictionary to store hashes, rather than bits, to identify
> occupied slots (with some careful masking so that 0 still means
> “unoccupied”). Then searching for elements can be implemented as
> follows:
>
> let hash = hashFromKey(key)
> var i = indexFromHash(hash)
> while true {
> if hashes[i] == 0 {
> // vacant, not contained
> } else if hashes[i] == hash {
> // maybe already exists?
> if keys[i] == key {
> // actually exists
> }
> }
> i = nextIndex(i)
> }
>
> The interesting property of this is that it almost all of the search
> time is spent linearly walking through an array and doing simple
> comparisons on integers, which is of course really easy to optimize
> (potentially even SIMD).
Yep, that's cool. It does cost a lot more storage, though. Tradeoffs.
> 99.9999% of the time, if the element isn’t stored in the Dictionary,
> then you’ll just hit a 0 hash, and never look at the keys. Similarly,
> 99.9999% of the time, if the element *is* stored in the Dictionary,
> you’ll only do a single key comparison (on the actually equal
> element). So this is also really great for cache — pretty much all of
> your access are just linear scans of the hashes.
>
> The main downside is you’re now “wasting” more memory to store hashes,
> but you can potentially compensate for this by truncating the hashes
> to 8/16 bits for small Dictionaries (which will be better for SIMD
> anyway).
Hey, cool.
> So what does this have to do with the RH scheme? Compared to FCFS, RH
> generally leads to lower variance on search distances, and provides a
> mechanism to short-circuit long runs if you have the hashes handy. If
> you find hashes which want to live later in the array than where you
> started, then the current element definitely isn’t contained. This
> means you can increase the load factor on your dictionary and further
> improve cache/memory usage (compensating for the memory-usage loss
> from storing hashes). However it will be prohibitively expensive if
> you require hashes to be recomputed on-the-fly.
Sounds like a worthy optimization!
--
-Dave
More information about the swift-dev
mailing list