<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class=""><br class=""><div><blockquote type="cite" class=""><div class="">On Jun 12, 2017, at 1:45 PM, Pavol Vaskovic <<a href="mailto:pali@pali.sk" class="">pali@pali.sk</a>> wrote:</div><br class="Apple-interchange-newline"><div class=""><div dir="ltr" class=""><div class="gmail_extra"><div class="gmail_quote">On Tue, May 16, 2017 at 9:10 PM, Dave Abrahams via swift-dev <span dir="ltr" class=""><<a href="mailto:swift-dev@swift.org" target="_blank" class="">swift-dev@swift.org</a>></span> wrote:<br class=""><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><span class="gmail-m_-1742069484177758159gmail-"><br class="">
on Thu May 11 2017, Pavol Vaskovic <<a href="http://swift-dev-AT-swift.org" class="">swift-dev-AT-swift.org</a>> wrote:<br class=""></span><span class="gmail-m_-1742069484177758159gmail-"><br class=""></span><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><span class="gmail-m_-1742069484177758159gmail-">I have run Benchmark_O with --num-iters=100 on my machine for the<br class=""></span><span class="gmail-m_-1742069484177758159gmail-">whole performance test suite, to get a feeling for the distribution of<br class=""></span><span class="gmail-m_-1742069484177758159gmail-">benchmark samples, because I also want to move the Benchmark_Driver to<br class=""></span><span class="gmail-m_-1742069484177758159gmail-">use MEAN instead of MIN in the analysis.</span></blockquote><span class="gmail-m_-1742069484177758159gmail-">
<br class="">
</span>I'm concerned about that, especially for microbenchmarks; it seems to me<br class="">
as though MIN is the right measurement. Can you explain why MEAN is<br class="">
better?<br class="">
<span class="gmail-m_-1742069484177758159gmail-"><br class=""></span></blockquote><div class=""><br class=""></div><div class="">On Wed, May 17, 2017 at 1:26 AM, Andrew Trick <span dir="ltr" class=""><<a href="mailto:atrick@apple.com" target="_blank" class="">atrick@apple.com</a>></span> <wbr class="">wrote:<br class=""><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"></blockquote></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Using MEAN wasn’t part of the aforementioned SR-4669. The purpose of that task is to reduce the time CI takes to get useful results (e.g. by using 3 runs as a baseline). MEAN isn’t useful if you’re only gathering 3 data points.<br class=""></blockquote><div class=""><br class=""></div><div class=""><br class=""></div><div class="">The current approach to detecting performance changes is fragile for tests that have very low absolute runtime, as they easily exceed the 5% improvement/regression threshold when the test machine gets a little bit noisy. 
For example, in the <a href="https://github.com/apple/swift/pull/9806#issuecomment-303370149" target="_blank" class="">benchmark on PR #9806</a>:</div><div class=""><br class=""></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><table style="box-sizing:border-box;border-collapse:collapse;margin-top:0px;margin-bottom:16px;display:block;width:668px;overflow:auto;color:rgb(36,41,46);font-family:-apple-system,BlinkMacSystemFont,"Segoe UI",Helvetica,Arial,sans-serif,"Apple Color Emoji","Segoe UI Emoji","Segoe UI Symbol";font-size:14px" class=""><tbody style="box-sizing:border-box" class=""><tr style="box-sizing:border-box;border-top:1px solid rgb(198,203,209)" class=""><td style="box-sizing:border-box;padding:6px 13px;border:1px solid rgb(223,226,229)" class=""><span style="box-sizing:border-box;font-weight:600" class="">TEST</span></td><td style="box-sizing:border-box;padding:6px 13px;border:1px solid rgb(223,226,229)" class=""><span style="box-sizing:border-box;font-weight:600" class="">OLD</span></td><td style="box-sizing:border-box;padding:6px 13px;border:1px solid rgb(223,226,229)" class=""><span style="box-sizing:border-box;font-weight:600" class="">NEW</span></td><td style="box-sizing:border-box;padding:6px 13px;border:1px solid rgb(223,226,229)" class=""><span style="box-sizing:border-box;font-weight:600" class="">DELTA</span></td><td style="box-sizing:border-box;padding:6px 13px;border:1px solid rgb(223,226,229)" class=""><span style="box-sizing:border-box;font-weight:600" class="">SPEEDUP</span></td></tr><tr style="box-sizing:border-box;border-top:1px solid rgb(198,203,209)" class=""><td style="box-sizing:border-box;padding:6px 13px;border:1px solid rgb(223,226,229)" class="">BitCount</td><td style="box-sizing:border-box;padding:6px 13px;border:1px solid rgb(223,226,229)" class="">12</td><td style="box-sizing:border-box;padding:6px 13px;border:1px solid rgb(223,226,229)" class="">14</td><td style="box-sizing:border-box;padding:6px 13px;border:1px solid rgb(223,226,229)" class="">+16.7%</td><td style="box-sizing:border-box;padding:6px 13px;border:1px solid rgb(223,226,229)" class=""><span style="box-sizing:border-box;font-weight:600" class="">0.86x</span></td></tr><tr style="box-sizing:border-box;background-color:rgb(246,248,250);border-top:1px solid rgb(198,203,209)" class=""><td style="box-sizing:border-box;padding:6px 13px;border:1px solid rgb(223,226,229)" class="">SuffixCountableRange</td><td style="box-sizing:border-box;padding:6px 13px;border:1px solid rgb(223,226,229)" class="">10</td><td style="box-sizing:border-box;padding:6px 13px;border:1px solid rgb(223,226,229)" class="">11</td><td style="box-sizing:border-box;padding:6px 13px;border:1px solid rgb(223,226,229)" class="">+10.0%</td><td style="box-sizing:border-box;padding:6px 13px;border:1px solid rgb(223,226,229)" class=""><span 
style="box-sizing:border-box;font-weight:600" class="">0.91x</span></td></tr><tr style="box-sizing:border-box;border-top:1px solid rgb(198,203,209)" class=""><td style="box-sizing:border-box;padding:6px 13px;border:1px solid rgb(223,226,229)" class="">MapReduce</td><td style="box-sizing:border-box;padding:6px 13px;border:1px solid rgb(223,226,229)" class="">303</td><td style="box-sizing:border-box;padding:6px 13px;border:1px solid rgb(223,226,229)" class="">331</td><td style="box-sizing:border-box;padding:6px 13px;border:1px solid rgb(223,226,229)" class="">+9.2%</td><td style="box-sizing:border-box;padding:6px 13px;border:1px solid rgb(223,226,229)" class=""><span style="box-sizing:border-box;font-weight:600" class="">0.92x</span></td></tr></tbody></table></blockquote><div class="">These are all false changes (and there are quite a few more there).</div><div class=""><br class=""></div><div class="">To partially address this issue (I'm guessing), the last SPEEDUP column sometimes features a mysterious question mark in brackets. It's emitted when the new MIN falls inside the (MIN..MAX) range of the OLD baseline. The check is not performed the other way around.</div></div></div></div></div></blockquote><div><br class=""></div><div>That bug must have been introduced during one of the rewrites. Is that in the driver or the compare script? Why not fix that bug?</div><div><br class=""></div>We clearly don’t want to see any false changes. The ‘?’ is a signal to me to avoid reporting those results. They should either be ignored as flaky benchmarks or rerun. I thought rerunning them was the fix you were working on.</div><div><br class=""></div><div>If you have some other proposal for fixing this, then please, in a separate proposal, explain your new approach and why it works, and demonstrate its effectiveness with results that you’ve gathered over time on the side. 
Please don’t change how the driver computes performance changes on a whim while introducing other features.</div><div><br class=""><blockquote type="cite" class=""><div class=""><div dir="ltr" class=""><div class="gmail_extra"><div class="gmail_quote"><div class="">I'm suggesting using the MEAN in combination with the SD (standard deviation) to detect changes (improvements/regressions). At the moment, this is hard to do, because the aggregate test results reported by Benchmark_O (and co.) can include anomalous results in the sample population that mess up the MEAN and SD, too. Currently this is only visible in the wide sample range: the difference between the reported MIN and MAX. But it is not clear how many results are anomalous.</div></div></div></div></div></blockquote><div><br class=""></div><div>I honestly don’t know what MEAN/SD has to do with the problem you’re pointing to above. The benchmark harness is already set up to compute the average iteration time, and our benchmarks are not currently designed to measure cache effects or any other phenomenon that would have a statistically meaningful sample distribution. Statistical methods might be interesting if you’re analyzing benchmark results over a long period of time or system noise levels across benchmarks.</div><div><br class=""></div><div>The primary purpose of the benchmark suite is identifying performance bugs/regressions at the point they occur. It should be no more complicated than necessary to do that. The current approach is simple: run a microbenchmark long enough in a loop to factor out benchmark startup time, cache/cpu warmup effects, and timer resolution, then compute the average iteration time. Throw away any run that was apparently impacted by system noise.</div><div><br class=""></div><div>We really have two problems:</div><div>1. spurious results </div><div><br class=""></div><div>2. 
the turnaround time for the entire benchmark suite</div><div><br class=""></div><div>Running benchmarks on a noisy machine is a losing proposition because you won’t be able to address problem #1 without making problem #2 much worse.</div><div><br class=""></div><div>-Andy</div><br class=""><blockquote type="cite" class=""><div class=""><div dir="ltr" class=""><div class="gmail_extra"><div class="gmail_quote"><div class="">Currently I'm working on an improved sample-filtering algorithm. Stay tuned for a demonstration in Benchmark_Driver (Python); if it pans out, it might be time to change the adaptive sampling in DriverUtil.swift.</div><div class=""><br class=""></div><div class=""><div class="">Best regards</div><div class="">Pavol Vaskovic</div></div><div class=""><br class=""></div><div class=""><br class=""></div></div></div></div>
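To illustrate the statistics debated in this thread, here is a minimal Python sketch (hypothetical, not code from Benchmark_Driver or DriverUtil.swift; the sample values are made up): a couple of noisy samples inflate the MEAN and SD while leaving the MIN stable, and a simple interquartile-range filter could exclude the anomalous samples before the MEAN/SD are computed. The '?' range-overlap check mentioned earlier is sketched at the end.

```python
# Hypothetical sketch, not actual benchmark-suite code: why a few noisy
# samples skew MEAN/SD but not MIN, and one way to filter them out.
import statistics

# Ten simulated per-iteration times in microseconds; two hit system noise.
samples = [100, 101, 100, 102, 100, 101, 100, 180, 100, 240]

min_t = min(samples)                # robust: noise only ever adds time
mean_t = statistics.mean(samples)   # pulled upward by the two outliers
sd_t = statistics.pstdev(samples)   # dominated by the two spikes

# A simple interquartile-range (IQR) filter to drop anomalous samples
# before computing MEAN/SD:
s = sorted(samples)
q1, q3 = s[len(s) // 4], s[3 * len(s) // 4]
iqr = q3 - q1
filtered = [x for x in samples if q1 - 1.5 * iqr <= x <= q3 + 1.5 * iqr]
filtered_mean = statistics.mean(filtered)   # close to MIN again

# The '?' heuristic described above, sketched: a result is dubious when
# the new MIN falls inside the old baseline's (MIN..MAX) range.
def dubious(old_min, old_max, new_min):
    return old_min <= new_min <= old_max

print(min_t, round(mean_t, 1), round(filtered_mean, 1))
```

With --num-iters=100 there would be 100 such samples per benchmark; the point is only that MEAN/SD become meaningful once the anomalous samples are excluded, which is what the filtering work mentioned above is aiming at.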
</div></blockquote></div><br class=""></body></html>