<html><head><meta http-equiv="Content-Type" content="text/html charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class=""><br class=""><div><blockquote type="cite" class=""><div class="">On Jun 12, 2017, at 10:36 PM, Pavol Vaskovic &lt;<a href="mailto:pali@pali.sk" class="">pali@pali.sk</a>&gt; wrote:</div><br class="Apple-interchange-newline"><div class=""><div class=""><div dir="auto" class="">As the next two paragraphs after the part you quoted go on explaining, I'm hoping that with this approach we could adaptively sample the benchmark until we get stable population, but starting from lower iteration count.&nbsp;</div><div dir="auto" class=""><br class=""></div><div dir="auto" class="">If the Python implementation bears this out, the proper solution would be to change the implementation in DriverUtil.swift, from the current ~1s run adaptive num-iters to more finer grained runs. We'd be gathering more smaller samples, tossing out anomalies as we go until we gather stable sample population (with low coefficient of variation) or run out of the allotted time.</div></div></div></blockquote><div><br class=""></div>~1s might be longer than necessary for the benchmarks with cheap setup. Another option is for the benchmark to call back to the Driver’s “start button” after setup. With no setup work, I think 200 ms is a bare minimum if we care about changes in the 1% range.</div><div><br class=""></div><div>I’m confused though because I thought we agreed that all samples need to run with exactly the same number of iterations. So, there would be one short run to find the desired num_iters for each benchmark, then each subsequent invocation of the benchmark harness would be handed num_iters as input.</div><div><br class=""></div><div>-Andy</div><div><br class=""><blockquote type="cite" class=""><div class=""><div class=""><div dir="auto" class="">This has a potential to speed up the benchmark suite with more intelligent management of the measurements, instead of using brute force of super-long runtime to drown out the errors as we do currently.&nbsp;</div><div dir="auto" class=""><br class=""></div><div dir="auto" class="">(I am aware of various aspects this approach might introduce that have the potential to mess with the caching: time measurement itself, more frequent logging - this would currently rely on --verbose mode, invoking Benchmark_O from Python…)<br class=""></div><div dir="auto" class=""><br class=""></div><div dir="auto" class=""><div dir="auto" class="">The proof is in the pudding, so I guess we'll learn if this approach would work this week, when I hammer the implementation down in Python for demonstration.&nbsp;</div></div><div dir="auto" class=""><br class=""></div><div dir="auto" class="">--Pavol</div><br class=""><div class="gmail_quote"><div class="">On Tue, 13 Jun 2017 at 03:19, Andrew Trick &lt;<a href="mailto:atrick@apple.com" class="">atrick@apple.com</a>&gt; wrote:<br class=""></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word" class=""><br class=""><div class=""><blockquote type="cite" class=""><div class="">On Jun 12, 2017, at 4:45 PM, Pavol Vaskovic &lt;<a href="mailto:pali@pali.sk" target="_blank" class="">pali@pali.sk</a>&gt; wrote:</div><br class="m_-3590598712144602244Apple-interchange-newline"><div class=""><div style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant-caps:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px" class=""><div style="font-size:12.8px" class="">I have sketched an algorithm for getting more consistent test results, so far its in Numbers. I have ran the whole test suite for 100 samples and observed the varying distribution of test results. The first result is quite often an outlier, with subsequent results being quicker. Depending on the "weather" on the test machine, you sometimes measure anomalies. So I'm tracking the coefficient of variance from the sample population and purging anomalous results when it exceeds 5%. This results in solid sample population where standard deviation is a meaningful value, that can be use in judging the significance of change between master and branch.</div></div></div></blockquote><br class=""></div></div><div style="word-wrap:break-word" class=""><div class="">That’s a reasonable approach for running 100 samples. I’m not sure how it fits with the goal of minimizing turnaround time. Typically you don’t need more than 3 samples (keeping in mind were usually averaging over thousands of iterations per sample).</div><div class=""><br class=""></div><div class="">-Andy</div></div></blockquote></div></div>

</div></blockquote></div><br class=""></body></html>