[swift-dev] Very slow Set<String>(arrayOfStrings) for some arrayOfStrings
Arnold Schwaighofer
aschwaighofer at apple.com
Fri Mar 4 09:25:27 CST 2016
Tracked by: https://bugs.swift.org/browse/SR-877
> On Mar 4, 2016, at 6:10 AM, Arnold via swift-dev <swift-dev at swift.org> wrote:
>
> Yes. NSString's hash is not the same function as String's hashValue. To make matters more confusing when you call hashValue on NSString type you actually get
> NSStrings hash.
>
> I have confirmed that we are indeed getting a lot more collisions with the bad Strings and I will be looking into this.
>
>
> Sent from my iPhone
>
> On Mar 4, 2016, at 5:18 AM, Jens Persson <jens at bitcycle.com> wrote:
>
>> If you make a WrappedString type where you can choose to use .hash instead of .hashValue, then the below test(wrappedStrings) will be fast if .hash is used, and slow if .hashValue is used. Demo-program:
>>
>> import QuartzCore
>> struct WrappedString : Hashable {
>> let wrapped : String
>> init(_ s: String) { wrapped = s }
>> var hashValue: Int {
>> return wrapped.hash // <-- NOTE: Will make test(wrappedStrings) fast.
>> // return wrapped.hashValue // <-- NOTE: Will make test(wrappedStrings) slow.
>> }
>> }
>> func ==(lhs: WrappedString, rhs: WrappedString) -> Bool { return lhs.wrapped == rhs.wrapped }
>> func test<T: Hashable>(array: [T]) {
>> let t0 = CACurrentMediaTime()
>> let set = Set(array)
>> let t1 = CACurrentMediaTime()
>> print("Created set of", set.count, " elements in", t1 - t0, "seconds.")
>> }
>> // Download http://sloppyfocus.com/strings.zip, unzip and set the correct path here:
>> let path = "/Users/jens/strings.txt"
>> do {
>> let strings = try String.init(contentsOfFile: path).componentsSeparatedByString("\n")
>> print("Loaded", strings.count, "strings.") // Prints: Loaded 88379 strings.
>> let wrappedStrings = strings.map { WrappedString($0) }
>> test(wrappedStrings) // Prints: Created set of 88379 strings in 0.0923392080003396 seconds.
>> test(strings) // Prints: Created set of 88379 strings in 9.25633556599496 seconds.
>> }
>> catch let e { fatalError(String(e)) }
>>
>> /Jens
>>
>>
>> On Fri, Mar 4, 2016 at 11:13 AM, Jens Persson <jens at bitcycle.com> wrote:
>> (Correction: 0.07 seconds, not 0.7 seconds.)
>>
>> On Fri, Mar 4, 2016 at 10:33 AM, Jens Persson <jens at bitcycle.com> wrote:
>> Another thing I noticed is that making a Set<NSString> (rather than a Set<String>) is fast ...
>>
>> That is (for those same particular strings) this is fast (0.7 s):
>> let setOfNSStrings = Set<NSString>(strings.map { $0 as NSString })
>>
>> while this is slow (9.5 s):
>> let setOfNSStrings = Set<String>(strings)
>>
>> /Jens
>>
>>
>> On Fri, Mar 4, 2016 at 9:48 AM, Jens Persson <jens at bitcycle.com> wrote:
>> Are you already tracking/looking into this in someway or should I file a bug?
>> /Jens
>>
>> On Thu, Mar 3, 2016 at 5:55 AM, Arnold <aschwaighofer at apple.com> wrote:
>> Yes this does not explain what you are seeing. (I could repeat this on my end)
>>
>> If you look at the two runs under instruments: one run with the slow strings file and one with the fast strings file (I tested with space appended to each line); you will notice that we spend a lot more time in the Set's find function doing NSString comparisons. The only explanation I have for this is hash collisions leading to more comparisons as we are probing non empty buckets. Why one of the files would trigger this and the other not makes no sense to me.
>> This will need looking at.
>>
>> Sent from my iPhone
>>
>> On Mar 2, 2016, at 6:43 PM, Jens Persson <jens at bitcycle.com> wrote:
>>
>>> Just thought I should mention that I tried another thing:
>>>
>>> Using my original example:
>>> http://sloppyfocus.com/slowSetFromParticularButSeeminglyNormalArrayOfStrings.html
>>> Keeping the code as it is but adding a space to the end of every line in the actual text file (strings.txt).
>>>
>>> This actually makes the slow test(strings) run almost as fast as the fast test(caseSwappedStrings), 0.09 seconds, so about a hundred times faster just because the text file has an extra space at the end of each line.
>>>
>>> I just can't see how Arnold Schwaighofer's explanation explains that ... An added space character at the end of each line of that text file is just one more ASCII character, how can that make test(strings) so much faster?
>>>
>>> And other operations on the text file makes the slow test faster too, like eg:
>>> Inserting a space as the first char of each line, 0.9 seconds.
>>> Removing all spaces, 0.9 seconds.
>>> Replacing all spaces with "-" (replacing the dashes back to spaces made it slow again), 0.9 seconds.
>>> Replacing all "a" with "*", 0.9 seconds.
>>> Swapping case (now in the actual file instead of programmatically), 2.3 seconds.
>>> (I only tried operations that kept the resulting strings unique and their characters plain ASCII.)
>>>
>>> /Jens
>>>
>>> On Wed, Mar 2, 2016 at 6:28 PM, Arnold Schwaighofer <aschwaighofer at apple.com> wrote:
>>> That is the difference between a “String” type instance that can use the ascii fast path and NSString backed “String” type instances.
>>>
>>> When the code translates the string (the map function that calls upper case) it becomes a natively backed “String” that isASCII.
>>>
>>> // The strings in strings.txt are not strange in any way, they all match ^[a-zA-Z ]+$.
>>>
>>> Set the isASCII bit and things are blazing fast.
>>>
>>> The other path is that componentsSeparatedByString(separator: String) -> [String] returns an “[String]” where every “String” is backed by “NSString”. When it comes to do comparison we call a NSString function on the NSString and this is where the difference in performance comes from.
>>>
>>>
>>>
>>>
>>> > On Mar 2, 2016, at 12:15 AM, Jens Persson via swift-dev <swift-dev at swift.org> wrote:
>>> >
>>> > I guess it would be possible to isolate the problem by analyzing my example, and thereby making it possible to write a smaller benchmark which doesn't need that big textfile. I didn't manage to do that however, hence my question about the problem here.
>>> > /Jens
>>> >
>>> > On Wed, Mar 2, 2016 at 9:11 AM, Jens Persson <jens at bitcycle.com> wrote:
>>> > Thanks, but I'll have to invite anyone with more time and experience to do that.
>>> >
>>> > One (of many) thing(s) I wouldn't know how include/handle is the 1.3 MB text file for the particular arrayOfStrings that makes Set(arrayOfStrings) slow. It seems a bit unnecessary/bloating to put it in the code base.
>>> >
>>> > /Jens
>>> >
>>> > On Tue, Mar 1, 2016 at 11:58 PM, Nadav Rotem <nrotem at apple.com> wrote:
>>> > Hi Jens,
>>> >
>>> > Thanks for reporting this issue. I don’t know what’s going on but we’ll take a look.
>>> >
>>> > I think it would be great if you could add this program as a benchmark under swift/benchmarks/. This will allow us to track the performance of this test and ensure that we don’t regress.
>>> >
>>> > Thanks,
>>> > Nadav
>>> >
>>> >> On Mar 1, 2016, at 5:01 AM, Jens Persson via swift-dev <swift-dev at swift.org> wrote:
>>> >>
>>> >> For some particular arrays of strings, creating a Set<String>(arrayOfStrings) takes about 100 to 200 times longer than for other very similar arrays of strings (equally many unique simple [a-zA-Z ]+ strings).
>>> >>
>>> >> I've put together a tiny program to demonstrate the problem here:
>>> >> http://sloppyfocus.com/slowSetFromParticularButSeeminglyNormalArrayOfStrings.html
>>> >>
>>> >> Is this due to a bug / performance problem in Set or can it be explained (and solved) in some way?
>>> >>
>>> >> /Jens
>>> >>
>>> >> _______________________________________________
>>> >> swift-dev mailing list
>>> >> swift-dev at swift.org
>>> >> https://lists.swift.org/mailman/listinfo/swift-dev
>>> >
>>> >
>>> >
>>> >
>>> > --
>>> > bitCycle AB | Smedjegatan 12 | 742 32 Östhammar | Sweden
>>> > http://www.bitcycle.com/
>>> > Phone: +46-73-753 24 62
>>> > E-mail: jens at bitcycle.com
>>> >
>>> >
>>> >
>>> >
>>> > --
>>> > bitCycle AB | Smedjegatan 12 | 742 32 Östhammar | Sweden
>>> > http://www.bitcycle.com/
>>> > Phone: +46-73-753 24 62
>>> > E-mail: jens at bitcycle.com
>>> >
>>> > _______________________________________________
>>> > swift-dev mailing list
>>> > swift-dev at swift.org
>>> > https://lists.swift.org/mailman/listinfo/swift-dev
>>>
>>>
>>>
>>>
>>> --
>>> bitCycle AB | Smedjegatan 12 | 742 32 Östhammar | Sweden
>>> http://www.bitcycle.com/
>>> Phone: +46-73-753 24 62
>>> E-mail: jens at bitcycle.com
>>>
>>
>>
>>
>> --
>> bitCycle AB | Smedjegatan 12 | 742 32 Östhammar | Sweden
>> http://www.bitcycle.com/
>> Phone: +46-73-753 24 62
>> E-mail: jens at bitcycle.com
>>
>>
>>
>>
>> --
>> bitCycle AB | Smedjegatan 12 | 742 32 Östhammar | Sweden
>> http://www.bitcycle.com/
>> Phone: +46-73-753 24 62
>> E-mail: jens at bitcycle.com
>>
>>
>>
>>
>> --
>> bitCycle AB | Smedjegatan 12 | 742 32 Östhammar | Sweden
>> http://www.bitcycle.com/
>> Phone: +46-73-753 24 62
>> E-mail: jens at bitcycle.com
>>
>>
>>
>>
>> --
>> bitCycle AB | Smedjegatan 12 | 742 32 Östhammar | Sweden
>> http://www.bitcycle.com/
>> Phone: +46-73-753 24 62
>> E-mail: jens at bitcycle.com
>>
> _______________________________________________
> swift-dev mailing list
> swift-dev at swift.org
> https://lists.swift.org/mailman/listinfo/swift-dev
More information about the swift-dev
mailing list