[swift-users] Optimising Set<String> comparisons

Nial Giacomelli measuredweighed at gmail.com
Mon Nov 14 15:01:53 CST 2016


Many thanks to Tim for suggesting isSubset, which I had somehow missed
while browsing the documentation.

To answer Dave's questions: unfortunately the Set comparisons aren't an
ideal candidate for asynchronous work. The comparisons take place as part
of a CSS-like styling phase, whereby a number of Sets (the stylesheet
definitions) are compared against many thousands of other Sets (the class
lists for each styleable object). While I believe the Int comparisons would
undoubtedly be faster, I'm likely to lose even more time calculating the
hashValues for each Set.

I believe the use of isSubset as well as some result caching (storing
matches for subsequent queries including identical class lists) is likely
to result in a decent speed improvement. Thank you both for taking the time
to respond.

And if anyone has any suggestions for alternative approaches I'm all ears
:-)

On Sun, Nov 13, 2016 at 11:49 PM, David Sweeris <davesweeris at mac.com> wrote:

>
> On Nov 13, 2016, at 1:58 PM, Nial Giacomelli via swift-users <
> swift-users at swift.org> wrote:
>
> Using Swift 3 I have a function that's called extremely frequently and is
> appearing regularly in Profiler runs. The function takes two Set<String>
> instances and simply attempts to determine whether all items from Set A are
> present in Set B (it's important to note that Set B may well contain
> additional items).
>
> I've attempted to approach the problem in two ways:
>
> let diff = a.subtracting(b)
> guard diff.count == 0 else { return false }
>
> And also by simply iterating over the contents of Set A, like so:
>
> for item in a {
> if !b.contains(item) {
> return false
> }
> }
>
> Both ultimately end up spending the majority of their time in String._
> compareDeterministicUnicodeCollaton(String) -> Int. Which makes sense,
> given what I'm doing - but ideally I'd like to come up with a more
> efficient way of performing the above check. Swift's String representation
> is incredibly robust, but for my needs the strings could be adequately
> represented in ASCII encoding. I've also considered storing and comparing
> the hashValue of the strings, to speed up comparisons...
>
> Hopefully this is an acceptable question for this mailing list. I'm aware
> that this may not be a Swift-specific question and could well be solved
> with a more efficient data structure or approach, but I'd really appreciate
> some input from more experienced Swift developers :-)
>
>
> How often do the sets change? And where do you get them from? I’m
> wondering, if there’s enough time between when you get the strings and when
> you need to know if setA is a subset of setB, could you somehow compute the
> answer asynchronously in a background thread or something? Strictly
> speaking it wouldn't be any less work, but if you can move it to where
> you’re waiting for user input or something you’ll probably get your answer
> sooner.
>
> Regarding how to get the answer using less work in the first place, I
> think you’re doing a tiny bit of extra work — just some allocation
> overhead, really — by calculating the set difference and checking that its
> count == 0, rather than just checking if a.isSubset(of: b). Beyond that,
> I’m not really sure… You might be able to do something with the strings’
> hash values. I mean, I don’t know if they really “work like that” (I’m
> quite foggy on how hash values are calculated and exactly what they’re
> suitable for… this might be a horrible idea), but if they do, comparing
> hash values (Ints) will definitely be faster than comparing Strings. I did
> some quick’n’dirty testing on my machine with some code very much like:
>
> let setA: Set<String> = ["some", "set", "of", "random", "strings"]
> let setB: Set<String> = ["some", "other", "set", "of", "strings"]
> let hashedA = Set(setA.map{$0.hashValue})
> let hashedB = Set(setB.map{$0.hashValue})
> setA.isSubset(of: setB)
> hashedA.isSubset(of: hashedB)
>
>
> The actual execution of hashedA.isSubset(of: hashedB) seems to be very
> roughly 2x faster than setA.isSubset(of: setB). However, if you include
> the overhead of calculating them, using hash values drops to about 5x
> *slower*. So unless you’d be reusing the hashed values a lot or
> something, I think you’re already doing about the least amount of work I
> can think of.
>
> - Dave Sweeris
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.swift.org/pipermail/swift-users/attachments/20161114/02a25645/attachment.html>


More information about the swift-users mailing list