<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class=""><br class=""><div><blockquote type="cite" class=""><div class="">On Jun 12, 2017, at 4:54 PM, Pavol Vaskovic &lt;<a href="mailto:pali@pali.sk" class="">pali@pali.sk</a>&gt; wrote:</div><br class="Apple-interchange-newline"><div class=""><div dir="ltr" class=""><br class=""><div class="gmail_extra"><br class=""><div class="gmail_quote">On Mon, Jun 12, 2017 at 11:55 PM, Michael Gottesman <span dir="ltr" class="">&lt;<a href="mailto:mgottesman@apple.com" target="_blank" class="">mgottesman@apple.com</a>&gt;</span> wrote:<br class=""><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div style="word-wrap:break-word" class=""><div class=""><div class=""><div class="gmail-h5"><div class=""><br class=""></div></div></div><div class="">The current design assumes that in such cases, the workload will be increased so that this is not an issue.</div></div></div></blockquote><div class=""><br class=""></div><div class="">I understand. But clearly some part of our process is failing, because multiple benchmarks in the 10ms range have been in the tree for months without being fixed.</div></div></div></div></div></blockquote><div><br class=""></div><div>I think that is just inertia and being busy. Patch? I'll review = ).</div><br class=""><blockquote type="cite" class=""><div class=""><div dir="ltr" class=""><div class="gmail_extra"><div class="gmail_quote"><div class="">&nbsp;</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div style="word-wrap:break-word" class=""><div class=""><div class=""></div><div class="">The reason why we use the min is that statistically we are not interested in estimating the "mean" or "center" of the distribution.
Rather, we are actually interested in the "speed of light" of the computation implying that we are looking for the min.</div></div></div></blockquote><div class=""><br class=""></div><div class="">I understand that. But all measurements have a certain degree of error associated with them. Our issue is two-fold: we need to differentiate between normal variation between measured samples under "perfect" conditions and samples that are worse because of interference from other background processes.</div></div></div></div></div></blockquote><div><br class=""></div><div>I disagree. CPUs are inherently messy but disruptions tend to be due to temporary spikes most of the time once you have quieted down your system by unloading a few processes.</div><br class=""><blockquote type="cite" class=""><div class=""><div dir="ltr" class=""><div class="gmail_extra"><div class="gmail_quote"><div class="">&nbsp;</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div style="word-wrap:break-word" class=""><div class="">What do you mean by anomalous results?</div></div></blockquote><div class=""><br class=""></div><div class="">I mean results that significantly stand out from the measured sample population.</div></div></div></div></div></blockquote><div><br class=""></div><div>What that could mean is that we need to run a couple of extra iterations to warm up the cpu/cache/etc before we start gathering samples.</div><br class=""><blockquote type="cite" class=""><div class=""><div dir="ltr" class=""><div class="gmail_extra"><div class="gmail_quote"><div class=""><br class=""></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div style="word-wrap:break-word" class=""><div class=""><span class="gmail-"><blockquote type="cite" class=""><div class=""><div dir="ltr" class=""><div class="gmail_extra"><div class="gmail_quote"><div 
class=""></div><div class="">Currently I'm working on an improved sample filtering algorithm. Stay tuned for a demonstration in Benchmark_Driver (Python); if it pans out, it might be time to change the adaptive sampling in DriverUtil.swift.</div></div></div></div></div></blockquote><div class=""><br class=""></div></span><div class="">Have you looked at using the Mann-Whitney U test? (I am not sure if we are using it or not)</div></div></div></blockquote></div><div class="gmail_extra"><br class=""></div>I don't know what that is. </div></div></div></blockquote><div><br class=""></div><div>Check it out:&nbsp;<a href="https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test" class="">https://en.wikipedia.org/wiki/Mann–Whitney_U_test</a>. It is a non-parametric test of whether two sets of samples come from the same distribution. As a bonus, it does not assume that our data is from a normal distribution (a problem with using mean/standard deviation, which assumes a normal distribution).</div><div><br class=""></div><div>We have been using Mann-Whitney internally for a while, successfully, to reduce the noise.</div><br class=""><blockquote type="cite" class=""><div class=""><div dir="ltr" class=""><div class="gmail_extra">Here's what I've been doing:</div><div class="gmail_extra"><br class=""></div><div class="gmail_extra"><span style="font-size:12.8px" class="">Depending on the "weather" on the test machine, you sometimes measure anomalies. So I'm tracking the coefficient of variation of the sample population and purging anomalous results (1 sigma from max) when it exceeds 5%. This results in quite a solid sample population where the standard deviation is a meaningful value that can be used in judging the significance of a change between master and branch.</span><br class=""></div><div class="gmail_extra"><br class=""></div><div class="gmail_extra">--Pavol</div></div>

</div></blockquote></div><br class=""></body></html>