[swift-evolution] For-loop revisited

David Owens II david at owensd.io
Wed Mar 9 13:36:56 CST 2016


My concern has never really been with optimized builds of Swift; I think it will get there or fairly close to C++ perf. My concern has always been about non-optimized builds (as that is where I spend 90% of my dev time) and the ability to break out of the abstractions when necessary if extremely important. The corollary for this is the ability to reason about the performance of your code.

Here's a basic test project that I have: https://github.com/owensd/swift-perf. If anyone spots anything I'm doing wrong, let me know! Hopefully it's fixes perf to make it usable for me again!

Report for Swift 2.1.1

Language: Swift, Optimization: -Onone, Samples = 10, Iterations = 30      ┃ Avg (ms) ┃ Min (ms) ┃ Max (ms) ┃ StdDev ┃
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━┩
RenderGradient ([Pixel])                                                  │ 2832.123 │ 2782.729 │ 2935.282 │ 51.660 │
RenderGradient ([UInt32])                                                 │ 356.4907 │ 349.7944 │ 364.8663 │ 4.7476 │
RenderGradient (UnsafeMutablePointer)                                     │ 156.4229 │ 151.4939 │ 164.6879 │ 3.9761 │
RenderGradient (UnsafeMutablePointer<UInt32>)                             │ 182.3293 │ 179.6009 │ 188.4805 │ 3.3929 │
RenderGradient ([Pixel].withUnsafeMutablePointer)                         │ 365.9573 │ 359.1546 │ 371.4564 │ 4.0989 │
RenderGradient ([UInt32].withUnsafeMutablePointer)                        │ 377.3043 │ 365.1158 │ 394.5101 │ 7.8776 │
RenderGradient ([UInt32].withUnsafeMutablePointer (SIMD))                 │ 305.3486 │ 298.6232 │ 317.0662 │ 6.6205 │
RenderGradient ([Pixel].withUnsafeMutablePointer (SIMD))                  │ 407.7086 │ 399.4519 │ 419.2850 │ 6.4855 │
──────────────────────────────────────────────────────────────────────────┴──────────┴──────────┴──────────┴────────┘

Language: Swift, Optimization: -O, Samples = 10, Iterations = 30          ┃ Avg (ms) ┃ Min (ms) ┃ Max (ms) ┃ StdDev ┃
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━┩
RenderGradient ([Pixel])                                                  │ 19.71537 │ 18.93118 │ 22.85709 │ 1.2512 │
RenderGradient ([UInt32])                                                 │ 17.40496 │ 17.10440 │ 17.81726 │ 0.2768 │
RenderGradient (UnsafeMutablePointer)                                     │ 20.00288 │ 18.99458 │ 21.66878 │  0.965 │
RenderGradient (UnsafeMutablePointer<UInt32>)                             │ 17.96327 │ 17.05896 │ 22.94012 │ 1.7797 │
RenderGradient ([Pixel].withUnsafeMutablePointer)                         │ 22.62061 │ 21.44827 │ 26.07391 │ 1.6991 │
RenderGradient ([UInt32].withUnsafeMutablePointer)                        │ 19.25109 │ 17.25802 │ 21.41414 │ 1.5964 │
RenderGradient ([UInt32].withUnsafeMutablePointer (SIMD))                 │ 14.69814 │ 13.91954 │ 17.96169 │ 1.2282 │
RenderGradient ([Pixel].withUnsafeMutablePointer (SIMD))                  │ 24.28080 │  23.1602 │ 27.49104 │ 1.6667 │
──────────────────────────────────────────────────────────────────────────┴──────────┴──────────┴──────────┴────────┘

Report for Swift 3.0-March snapshot

Language: Swift, Optimization: -Onone, Samples = 10, Iterations = 30      ┃ Avg (ms) ┃ Min (ms) ┃ Max (ms) ┃ StdDev ┃
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━┩
RenderGradient ([Pixel])                                                  │ 3220.543 │ 2898.071 │ 3775.192 │ 325.96 │
RenderGradient ([UInt32])                                                 │ 363.5971 │ 357.4558 │ 377.5159 │ 6.4495 │
RenderGradient (UnsafeMutablePointer)                                     │ 155.5633 │ 150.4397 │ 166.5735 │ 5.3346 │
RenderGradient (UnsafeMutablePointer<UInt32>)                             │ 182.1435 │ 178.1345 │ 194.9554 │ 5.1296 │
RenderGradient ([Pixel].withUnsafeMutablePointer)                         │ 340.1474 │ 332.2264 │ 358.1106 │ 7.4227 │
RenderGradient ([UInt32].withUnsafeMutablePointer)                        │ 347.6466 │ 341.8327 │ 355.2599 │ 4.1142 │
RenderGradient ([UInt32].withUnsafeMutablePointer (SIMD))                 │ 297.6354 │ 289.0012 │ 310.2637 │ 7.3171 │
RenderGradient ([Pixel].withUnsafeMutablePointer (SIMD))                  │ 379.2153 │ 374.3724 │ 387.1544 │ 4.6838 │
──────────────────────────────────────────────────────────────────────────┴──────────┴──────────┴──────────┴────────┘

Language: Swift, Optimization: -O, Samples = 10, Iterations = 30          ┃ Avg (ms) ┃ Min (ms) ┃ Max (ms) ┃ StdDev ┃
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━┩
RenderGradient ([Pixel])                                                  │ 22.51406 │ 21.26175 │ 27.60297 │ 2.1561 │
RenderGradient ([UInt32])                                                 │ 18.39304 │ 17.11022 │ 24.14741 │ 2.2164 │
RenderGradient (UnsafeMutablePointer)                                     │ 20.67769 │ 19.03668 │ 23.70964 │ 1.8423 │
RenderGradient (UnsafeMutablePointer<UInt32>)                             │ 15.29333 │ 13.90142 │ 19.20010 │ 1.6236 │
RenderGradient ([Pixel].withUnsafeMutablePointer)                         │ 22.51703 │ 21.28654 │ 27.03154 │ 1.9406 │
RenderGradient ([UInt32].withUnsafeMutablePointer)                        │ 19.27868 │ 17.20521 │ 22.53724 │  2.066 │
RenderGradient ([UInt32].withUnsafeMutablePointer (SIMD))                 │ 15.63351 │ 13.18523 │ 19.79255 │ 2.0291 │
RenderGradient ([Pixel].withUnsafeMutablePointer (SIMD))                  │ 24.48129 │ 23.05487 │ 28.77946 │ 2.0785 │
──────────────────────────────────────────────────────────────────────────┴──────────┴──────────┴──────────┴────────┘

Report for Swift 3.0-March snapshot (with changes, like for-loop and ++ removal)

Language: Swift, Optimization: -Onone, Samples = 10, Iterations = 30      ┃ Avg (ms) ┃ Min (ms) ┃ Max (ms) ┃ StdDev ┃
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━┩
RenderGradient ([Pixel])                                                  │ 2992.632 │ 2947.714 │ 3041.948 │ 23.913 │
RenderGradient ([UInt32])                                                 │ 389.1035 │ 380.1675 │ 404.4974 │ 7.0670 │
RenderGradient (UnsafeMutablePointer)                                     │ 200.0568 │ 191.7541 │ 209.4043 │ 5.2761 │
RenderGradient (UnsafeMutablePointer<UInt32>)                             │ 212.9935 │ 207.6584 │ 222.6322 │ 4.9337 │
RenderGradient ([Pixel].withUnsafeMutablePointer)                         │ 366.9555 │ 360.9891 │ 374.4666 │  4.533 │
RenderGradient ([UInt32].withUnsafeMutablePointer)                        │ 367.9753 │ 359.5587 │ 378.8135 │ 7.0686 │
RenderGradient ([UInt32].withUnsafeMutablePointer (SIMD))                 │ 512.4176 │ 502.9525 │ 522.4241 │ 5.2631 │
RenderGradient ([Pixel].withUnsafeMutablePointer (SIMD))                  │ 601.9091 │ 591.0957 │ 609.4677 │ 5.8693 │
──────────────────────────────────────────────────────────────────────────┴──────────┴──────────┴──────────┴────────┘

Language: Swift, Optimization: -O, Samples = 10, Iterations = 30          ┃ Avg (ms) ┃ Min (ms) ┃ Max (ms) ┃ StdDev ┃
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━┩
RenderGradient ([Pixel])                                                  │ 19.84214 │ 18.86398 │ 22.93223 │ 1.4087 │
RenderGradient ([UInt32])                                                 │ 18.58096 │ 17.13248 │ 21.16593 │ 1.7183 │
RenderGradient (UnsafeMutablePointer)                                     │ 19.66112 │ 18.87033 │ 22.60767 │ 1.0966 │
RenderGradient (UnsafeMutablePointer<UInt32>)                             │ 15.05691 │ 13.91324 │ 17.63912 │ 1.3198 │
RenderGradient ([Pixel].withUnsafeMutablePointer)                         │ 20.23184 │ 18.80135 │ 22.65893 │ 1.5356 │
RenderGradient ([UInt32].withUnsafeMutablePointer)                        │ 15.52520 │ 14.72497 │ 19.22203 │ 1.3226 │
RenderGradient ([UInt32].withUnsafeMutablePointer (SIMD))                 │ 12.40027 │ 11.74772 │ 15.66219 │  1.209 │
RenderGradient ([Pixel].withUnsafeMutablePointer (SIMD))                  │ 24.01914 │ 23.19182 │ 27.04742 │ 1.2454 │
──────────────────────────────────────────────────────────────────────────┴──────────┴──────────┴──────────┴────────┘

As you can see, some of these changes do have an impact on performance for the way in which I run code the vast majority of the time. Yes, the optimized performance equivalent, but as a day-to-day language trying to write real-time apps in Swift, there is a lot to be desired on the non-optimized performance of the language. 

Here's the diff of the API changes; I'm not sure there's a huge readability win (and hopefully I didn't introduce an error in the conversion!): https://github.com/owensd/swift-perf/pull/2/files

A note about the performance tests: these are a stripped down subset from a rendering path that I was using while developing a game that I'm working on. They represent real problems that I have, and they are not simply arbitrary benchmarks.

As a comparison, here's what the C performance looks like in non-optimized and optimized builds:

Language: C, Optimization: -O0, Samples = 10, Iterations = 30             ┃ Avg (ms) ┃ Min (ms) ┃ Max (ms) ┃ StdDev ┃
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━┩
RenderGradient (Pointer Math)                                             │   38.567 │   36.426 │   41.702 │  1.686 │
RenderGradient (SIMD)                                                     │   79.431 │   73.697 │   86.309 │  3.997 │
──────────────────────────────────────────────────────────────────────────┴──────────┴──────────┴──────────┴────────┘

Language: C, Optimization: -Os, Samples = 10, Iterations = 30             ┃ Avg (ms) ┃ Min (ms) ┃ Max (ms) ┃ StdDev ┃
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━┩
RenderGradient (Pointer Math)                                             │    9.582 │    8.688 │   13.167 │  1.468 │
RenderGradient (SIMD)                                                     │    4.608 │    3.640 │    8.792 │  1.568 │
──────────────────────────────────────────────────────────────────────────┴──────────┴──────────┴──────────┴────────┘

Language: C, Optimization: -Ofast, Samples = 10, Iterations = 30          ┃ Avg (ms) ┃ Min (ms) ┃ Max (ms) ┃ StdDev ┃
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━┩
RenderGradient (Pointer Math)                                             │    3.247 │    2.865 │    4.676 │  0.594 │
RenderGradient (SIMD)                                                     │    4.489 │    3.655 │    8.266 │  1.442 │
──────────────────────────────────────────────────────────────────────────┴──────────┴──────────┴──────────┴────────┘

In the best case, Swift's debug performance is still over 4x slower, though that path doesn't lead to the best optimized times.

-David


> On Mar 9, 2016, at 8:37 AM, Taras Zakharko via swift-evolution <swift-evolution at swift.org> wrote:
> 
> BTW, I got curious about this so I actually tested it out. I have written a very rudimentary C program (attached) that implements the same simple loop in three styles:
> 
> 1. A classical for(int i … ) loop
> 2. A loop using functions for controlling the counter
> 3. A object-oriented iterator protocol (this is essentially what Swift would compile to)
> 
> To make sure that the compiler would not optimise the loop away, it produces side effects (console output). 
> 
> In release mode, I can't see any substantial difference in the output produced by the compiler. Its not identical, but the actual loop looks very efficient, everything is inlined and all iterator fields are eliminated (they are pushed to registers). So the optimisations in LLVM are there already. If Swift doesn’t do it yet, then its probably the case of few missing heuristics. 
> 
> <main.c>
> 
> 
>> On 09 Mar 2016, at 16:45, Taras Zakharko <taras.zakharko at uzh.ch> wrote:
>> 
>> 
>>> On 09 Mar 2016, at 14:52, Ted F.A. van Gaalen <tedvgiosdev at gmail.com> wrote:
>>> 
>>> Hello Taras
>>> If I am not mistaken, the compiler cannot optimize: 
>>> 
>>> for element in collection {}
>>> 
>>> simply because the contents of the collection are
>>> unknown at compile time.
>> 
>> 
>> The progression 
>> 
>> for(int i=0; i<=n; i++) {...}
>> 
>> is also unknown at compile time. Compiler needs to generate a series of instructions that increment a loop counter and check whether the exit condition has been reached.
>> 
>> Now, in case of a higher abstraction such as
>> 
>> for i in stride(0, n, by=1) {...}
>> 
>> there is indeed more code being generated, but the structure of the loop is exactly the same. Specifically, the above loop can essentially be rewritten in C as
>> 
>> for(int i = get_initial_value(), !should_exit_on_counter(i), i = next_value(i)) {...}
>> 
>> There is a piece of code that increments a loop counter and a piece of code that checks the exit condition. In an unoptimised build, the overhead is indeed significant because of all the extra function calls. However, an optimising compiler can easily
>> 
>> a)  detect that the iterator does not escape the loop (and therefore eliminate refcounting for it)
>> b)  inline the methods that implement loop increment and exit condition check
>> c)  infer (e.g. via SSA form transformations) that i and the internal counter used by the iterator variable are always identical and eliminate any unnecessary assignments
>> 
>> All of these optimisations are fairly trivial and have existed for quite some time in modern compilers. At this point the code of the abstract collection-driven loop becomes identical to an optimised for loop. Furthermore, the compiler can use additional heuristics to unroll or otherwise optimise the loop (such as moving the loop counter to a register). 
>> 
>>> On 09 Mar 2016, at 14:25, Goffredo Marocchi <panajev at gmail.com> wrote:
>>> 
>>> Sometimes programmers directives and runtime knowledge are essential though and the compilers should be optimised but not held to a practically impossible standards.
>> 
>> 
>> I do not think that expecting the compiler to optimise away a simple iterator is a practically impossible standard. There are compilers out there that perform much more impressive feats (such as autovectorisation). In fact, I am quite sure that GCC and clang will optimise away the for loop that uses simple functions for  exit condition/increment. 
>> 
>> I certainly agree with you that a compiler can’t do everything — but thats what the while loop is for. 
>> 
>>> On 09 Mar 2016, at 16:11, David Owens II <david at owensd.io> wrote:
>>> 
>>> And yet, developers spend the vast majority of their time running and validating code in non-optimized builds. 
>> 
>> This is a good point! But I believe that this can be alleviated somehow by performing partial optimisations on loops even in the debug mode (e.g. refcounting elimination, inlining)
>> 
>> — Taras
>> 
> 
> _______________________________________________
> swift-evolution mailing list
> swift-evolution at swift.org
> https://lists.swift.org/mailman/listinfo/swift-evolution

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.swift.org/pipermail/swift-evolution/attachments/20160309/bfd0c4f8/attachment.html>


More information about the swift-evolution mailing list