[swift-evolution] [Review] SE-0007 Remove C-style for-loops with conditions and incrementers

Fri Dec 11 11:11:43 CST 2015

FWIW:

Performance differences in loops exist in any language. And the vast
majority of cases it is only a matter of compiler optimization difference:

Using this example:

http://stackoverflow.com/a/2959105

"For instance, the following code shows that the "do-while" is a bit
faster. This because the "jmp" istruction is not used by the "do-while"
loop."

int main(int argc, char* argv[])  {     int i;  char x[100];
//"FOR" LOOP:for(i=0; i<100; i++ )  { x[i] = 0; }
//"WHILE" LOOP:
i=0;  while(i<100 )  { x[i++] = 0;  }
//"DO-WHILE" LOOP:
i=0;  do  { x[i++] = 0;   }  while(i<100);
return 0;  }

*// "FOR" LOOP:*

    010013C8  mov         dword ptr [ebp-0Ch],0
    010013CF  jmp         wmain+3Ah (10013DAh)

  for(i=0; i<100; i++ )
  { x[i] = 0;
    010013D1  mov         eax,dword ptr [ebp-0Ch]  <<< UPDATE i
    010013D4  add         eax,1
    010013D7  mov         dword ptr [ebp-0Ch],eax
    010013DA  cmp         dword ptr [ebp-0Ch],64h  <<< TEST
    010013DE  jge         wmain+4Ah (10013EAh)     <<< COND JUMP
    010013E0  mov         eax,dword ptr [ebp-0Ch]  <<< DO THE JOB..
    010013E3  mov         byte ptr [ebp+eax-78h],0
    010013E8  jmp         wmain+31h (10013D1h)     <<< UNCOND JUMP
  }

*// "WHILE" LOOP:*

  i=0;
  010013EA  mov         dword ptr [ebp-0Ch],0
  while(i<100 )
  { x[i++] = 0;
    010013F1  cmp         dword ptr [ebp-0Ch],64h   <<< TEST
    010013F5  jge         wmain+6Ah (100140Ah)      <<< COND JUMP
    010013F7  mov         eax,dword ptr [ebp-0Ch]   <<< DO THE JOB..
    010013FA  mov         byte ptr [ebp+eax-78h],0
    010013FF  mov         ecx,dword ptr [ebp-0Ch]   <<< UPDATE i
    01001402  add         ecx,1
    01001405  mov         dword ptr [ebp-0Ch],ecx
    01001408  jmp         wmain+51h (10013F1h)      <<< UNCOND JUMP
  }

*// "DO-WHILE" LOOP:*

i=0;  .  0100140A  mov         dword ptr [ebp-0Ch],0
  do
  { x[i++] = 0;
    01001411  mov         eax,dword ptr [ebp-0Ch]   <<< DO THE JOB..
    01001414  mov         byte ptr [ebp+eax-78h],0
    01001419  mov         ecx,dword ptr [ebp-0Ch]   <<< UPDATE i
    0100141C  add         ecx,1
    0100141F  mov         dword ptr [ebp-0Ch],ecx
    01001422  cmp         dword ptr [ebp-0Ch],64h   <<< TEST
    01001426  jl          wmain+71h (1001411h)      <<< COND JUMP
  }
  while(i<100);

Em sex, 11 de dez de 2015 às 06:34, David Owens II via swift-evolution <
swift-evolution at swift.org> escreveu:

> I don’t know what you did, your gist 404s.
>
> Here’s an update with the while-loop:
> https://gist.github.com/owensd/b438bea90075907fa6ec and using both i and
> j within the loops. This is a simple OS X framework project with unit tests.
>
> *Debug Build:*
>
>    - testZipStride - 2.496s
>    - testCStyleFor - 0.210s
>    - testWhileLoop - 0.220s
>
>
> *Release Build:*
>
>    - testZipStride - 0.029s
>    - testCStyleFor - 0.018s
>    - testWhileLoop - 0.019s
>
>
> I ran these tests from my MacBook Pro, the previous tests were from my
> iMac.
>
> When you use the sum += (i - j) construct, I think all you are ending up
> with a hot-path that the optimizer can end up optimizing better (my guess
> is that the i-j turns into a constant expression - after all, the
> difference is always 1, but I don’t know enough about the SIL
> representation to confirm that). If you use a code path where that
> expression is not constant time (again, assuming my suspicion is correct),
> the zip+stride is against slower.
>
> I would argue the following:
>
>
>    1. The code is not objectively easier to read or understand with the
>    zip+stride construct (arguably, they are not even semantically equivalent).
>    2. The debug builds are prohibitively slower, especially in the
>    context of high-performance requirement code (I’m doing a lot of
>    prototyping Swift in the context of games, so yes, performance matters
>    considerably).
>    3. The optimized builds are still slower than the for-in “equivalent"
>    functionality.
>    4. The optimizer is inconsistent, like all optimizers are (this is a
>    simple truth, not a value judgement - optimizers are not magic, they are
>    code that is run like any other code and can only do as well as they are
>    coded under the conditions they are coded against), at actually producing
>    similar results with code that ends up with slightly different shapes.
>    5. There is not functionally equivalent version of the code that I can
>    write that is not more verbose, while requiring artificial scoping
>    constructs, to achieve the same behavior.
>
>
> So no, there is no evidence that I’ve seen to reconsider my opinion that
> this proposal should not be implemented. If there is evidence to show that
> my findings are incorrect or a poor summary of the general problem I am
> seeing, then of course I would reconsider my opinion.
>
> -David
>
> On Dec 10, 2015, at 9:12 PM, Paul Cantrell <cantrell at pobox.com> wrote:
>
> Hold the presses.
>
> David, I found the radical differences in our results troubling, so I did
> some digging. It turns out that the zip+stride code:
>
>     var sum = 0
>     for (i, j) in zip(first.stride(to: 0, by: -1), second.stride(to: 0,
> by: -2)) {
>         if i % 2 == 0 { continue }
>         sum += 1
>     }
>
> …runs *much* faster if you actually use both i and j inside the loop:
>
>     var sum = 0
>     for (i, j) in zip(first.stride(to: 0, by: -1), second.stride(to: 0,
> by: -2)) {
>         if i % 2 == 0 { continue }
>         sum += *i-j*
>     }
>
> Weird, right? This is with optimization on (default “production” build).
> It smells like a compiler quirk.
>
> With that tweak, the zip+stride approach actually clocks in faster than
> the C-style for. Yes, you read that right: *faster*. Also smells like a
> quirk. Am I doing something fantastically stupid in my code? Or maybe it’s
> just my idiosyncratic taste in indentation? :P
>
> Here’s my test case, which was a command-line app with manual timing,
> followed by David’s dropped into the same harness, followed by David’s but
> with sum += i-j instead of sum += 1:
>
>     https://gist.github.com/pcantrell/6bbe80e630d227ed0262
>
> Point is: *no big performance difference here; even a performance
> advantage* (that is probably a compiler artifact).
>
> David and Thorsten, you might want to reconsider your reviews?
>
> Results:
>
> —————— Paul’s comparison ——————
>
> zip+stride
>
>   Iter 0: 0.519110977649689
>   Iter 1: 0.503385007381439
>   Iter 2: 0.503321051597595
>   Iter 3: 0.485216021537781
>   Iter 4: 0.524757027626038
>   Iter 5: 0.478078007698059
>   Iter 6: 0.503880977630615
>   Iter 7: 0.498068988323212
>   Iter 8: 0.485781013965607
>          ——————————————
>   Median: 0.524757027626038
>
> C-style
>
>   Iter 0: 0.85480797290802
>   Iter 1: 0.879491031169891
>   Iter 2: 0.851797997951508
>   Iter 3: 0.836017966270447
>   Iter 4: 0.863684952259064
>   Iter 5: 0.837742984294891
>   Iter 6: 0.839070022106171
>   Iter 7: 0.849772989749908
>   Iter 8: 0.819278955459595
>          ——————————————
>   Median: 0.863684952259064
>
> Zip+stride takes 0.607579217692143x the time of C-style for
>
> —————— David’s comparison ——————
>
> zip+stride
>
>   Iter 0: 1.15285503864288
>   Iter 1: 1.1244450211525
>   Iter 2: 1.24192994832993
>   Iter 3: 1.02782195806503
>   Iter 4: 1.13640999794006
>   Iter 5: 1.15879601240158
>   Iter 6: 1.12114900350571
>   Iter 7: 1.21364599466324
>   Iter 8: 1.10698300600052
>          ——————————————
>   Median: 1.13640999794006
>
> C-style
>
>   Iter 0: 0.375869989395142
>   Iter 1: 0.371365010738373
>   Iter 2: 0.356527984142303
>   Iter 3: 0.384984970092773
>   Iter 4: 0.367590010166168
>   Iter 5: 0.365644037723541
>   Iter 6: 0.384257972240448
>   Iter 7: 0.379297018051147
>   Iter 8: 0.363133013248444
>          ——————————————
>   Median: 0.367590010166168
>
> Zip+stride takes 3.09151491202482x the time of C-style for
>
> —————— David’s comparison, actually using indices in the loop ——————
>
> zip+stride
>
>   Iter 0: 0.328687965869904
>   Iter 1: 0.332105994224548
>   Iter 2: 0.336817979812622
>   Iter 3: 0.321089029312134
>   Iter 4: 0.338591992855072
>   Iter 5: 0.348567008972168
>   Iter 6: 0.34687602519989
>   Iter 7: 0.34755402803421
>   Iter 8: 0.341500997543335
>          ——————————————
>   Median: 0.338591992855072
>
> C-style
>
>   Iter 0: 0.422354996204376
>   Iter 1: 0.427953958511353
>   Iter 2: 0.403640985488892
>   Iter 3: 0.415378987789154
>   Iter 4: 0.403639018535614
>   Iter 5: 0.416707038879395
>   Iter 6: 0.415345013141632
>   Iter 7: 0.417587995529175
>   Iter 8: 0.415713012218475
>          ——————————————
>   Median: 0.403639018535614
>
> Zip+stride takes 0.838848518865867x the time of C-style for
>
> Program ended with exit code: 0
>
> Cheers,
>
> Paul
>
> On Dec 10, 2015, at 5:36 PM, David Owens II <david at owensd.io> wrote:
>
> Here’s my basic test case:
>
> let first = 10000000
> let second = 20000000
>
> class LoopPerfTests: XCTestCase {
>
>     func testZipStride() {
>         self.measureBlock {
>             var sum = 0
>             for (i, j) in zip(first.stride(to: 0, by:
> -1), second.stride(to: 0, by: -2)) {
>                 if i % 2 == 0 { continue }
>                 sum += 1
>             }
>             print(sum)
>         }
>     }
>
>     func testCStyleFor() {
>         self.measureBlock {
>             var sum = 0
>             for var i = first, j = second; i > 0 && j > 0; i -= 1, j -= 2 {
>                 if i % 2 == 0 { continue }
>                 sum += 1
>             }
>             print(sum)
>         }
>
>     }
>
> }
>
>
> Non-optimized timings:
>
>    - testCStyleFor - 0.126s
>    - testZipStride - 2.189s
>
>
> Optimized timings:
>
>    - testCStyleFor - 0.008s
>    - testZipStride - 0.015s
>
>
> That’s a lot worse than 34%; even in optimized builds, that’s 2x slower
> and in debug builds, that’s 17x slower. I think it’s unreasonable to force
> people to write a more verbose while-loop construct to simply get the
> performance they need.
>
> Also, the readability argument is very subjective; for example, I don’t
> find the zip version more readability. In fact, I think it obscures what
> the logic of the loop is doing. But again, that’s subjective.
>
> -David
>
> On Dec 10, 2015, at 2:41 PM, Paul Cantrell <cantrell at pobox.com> wrote:
>
> Is there any guarantee that these two loops have the exact same runtime
> performance?
>
> for (i, j) in zip(10.stride(to: 0, by: -1), 20.stride(to: 0, by: -2)) {
>    if i % 2 == 0 { continue }
>    print(i, j)
> }
>
> for var i = 10, j = 20; i > 0 && j > 0; i -= 1, j -= 2 {
>    if i % 2 == 0 { continue }
>    print(i, j)
> }
>
>
> In a quick and dirty test, the second is approximately 34% slower.
>
> I’d say that’s more than acceptable for the readability gain. If you’re in
> that rare stretch of critical code where the extra 34% actually matters,
> write it using a while loop instead.
>
> P
>
>
>
> On Dec 10, 2015, at 4:07 PM, David Owens II via swift-evolution <
> swift-evolution at swift.org> wrote:
>
>
> On Dec 10, 2015, at 1:57 PM, thorsten--- via swift-evolution <
> swift-evolution at swift.org> wrote:
>
> Yes, performance is one thing neglected by the discussions and the
> proposal.
>
>
> This is my primary objection to to this proposal; it assumes (or
> neglects?) that all of the types used can magically be inlined to nothing
> but the imperative code. This isn’t magical, someone has to implement the
> optimizations to do this.
>
> Is there any guarantee that these two loops have the exact same runtime
> performance?
>
> for (i, j) in zip(10.stride(to: 0, by: -1), 20.stride(to: 0, by: -2)) {
>    if i % 2 == 0 { continue }
>    print(i, j)
> }
>
> for var i = 10, j = 20; i > 0 && j > 0; i -= 1, j -= 2 {
>    if i % 2 == 0 { continue }
>    print(i, j)
> }
>
>
> And can you guarantee that performance is relatively the same across debug
> and release builds? Because historically, Swift has suffered greatly in
> this regard with respects to the performance of optimized versus
> non-optimized builds.
>
> These types of optimizer issues are real-world things I’ve had to deal
> with (and have written up many blog posts about). I get the desire to
> simplify the constructs, but we need an escape hatch to write performant
> code when the optimizer isn’t up to the job.
>
> -David
> _______________________________________________
> swift-evolution mailing list
> swift-evolution at swift.org
> https://lists.swift.org/mailman/listinfo/swift-evolution
>
>
>
>
> _______________________________________________
> swift-evolution mailing list
> swift-evolution at swift.org
> https://lists.swift.org/mailman/listinfo/swift-evolution
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.swift.org/pipermail/swift-evolution/attachments/20151211/50bca11a/attachment.html>