[swift-dev] Measuring MEAN Performance (was: Questions about Swift-CI)

Michael Gottesman mgottesman at apple.com
Mon Jun 12 19:29:51 CDT 2017


> On Jun 12, 2017, at 4:54 PM, Pavol Vaskovic <pali at pali.sk> wrote:
> 
> 
> 
> On Mon, Jun 12, 2017 at 11:55 PM, Michael Gottesman <mgottesman at apple.com <mailto:mgottesman at apple.com>> wrote:
> 
> The current design assumes that in such cases, the workload will be increased so that is not an issue.
> 
> I understand. But clearly some part of our process is failing, because there have been multiple benchmarks in the 10 ms range in the tree for months without this being fixed.

I think that is just inertia and being busy. Patch? I'll review = ).

>  
> The reason why we use the min is that statistically we are not interested in estimating the "mean" or "center" of the distribution. Rather, we are actually interested in the "speed of light" of the computation, implying that we are looking for the min.
> 
> I understand that. But all measurements have a certain degree of error associated with them. Our issue is that we need to differentiate between normal variation among samples measured under "perfect" conditions and samples that are worse because of interference from other background processes.

I disagree. CPUs are inherently messy, but once you have quieted down your system by unloading a few processes, most disruptions are temporary spikes.

>  
> What do you mean by anomalous results?
> 
> I mean results that significantly stand out from the measured sample population.

What that could mean is that we need to run a couple of extra iterations to warm up the CPU/cache/etc. before we start gathering samples.
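As a minimal sketch of that idea (the function and parameter names here are illustrative, not DriverUtil's actual API): run a few discarded warm-up iterations so caches and CPU frequency scaling settle, then sample, and report the min per the "speed of light" rationale above.

```python
import time

def measure(fn, num_samples=10, warmup=3):
    """Time fn, discarding warm-up runs; return the minimum sample."""
    for _ in range(warmup):
        fn()  # warm caches / branch predictors; timing discarded
    samples = []
    for _ in range(num_samples):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return min(samples)  # the "speed of light" of the computation
```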

> 
>> Currently I'm working on improved sample filtering algorithm. Stay tuned for demonstration in Benchmark_Driver (Python), if it pans out, it might be time to change adaptive sampling in DriverUtil.swift.
> 
> Have you looked at using the Mann-Whitney U algorithm? (I am not sure if we are using it or not)
> 
> I don't know what that is.

Check it out: https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test. It is a non-parametric test of whether two sets of samples are drawn from the same distribution. As a bonus, it does not assume that our data is from a normal distribution (a problem with using mean/standard deviation, which assumes a normal distribution).

We have been using Mann-Whitney internally for a while successfully to reduce the noise.
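For illustration, here is a self-contained sketch of the U statistic with a normal-approximation two-sided p-value. This is not the implementation used internally; in practice a library routine such as SciPy's `scipy.stats.mannwhitneyu` would also apply tie corrections and exact small-sample tables.

```python
import math

def mann_whitney_u(a, b):
    """Return (U, approximate two-sided p-value) for samples a and b."""
    combined = [(x, 0) for x in a] + [(x, 1) for x in b]
    combined.sort(key=lambda t: t[0])
    # Assign 1-based ranks, averaging ranks over tied values.
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    rank_sum_a = sum(r for r, (_, grp) in zip(ranks, combined) if grp == 0)
    n1, n2 = len(a), len(b)
    u1 = rank_sum_a - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    # Normal approximation (reasonable for larger n; ignores tie correction).
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma  # z <= 0 since u is the smaller of u1, u2
    p = 2 * 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return u, min(p, 1.0)
```

A small p-value means the two benchmark runs likely come from different distributions, i.e. a real performance change rather than noise.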

> Here's what I've been doing:
> 
> Depending on the "weather" on the test machine, you sometimes measure anomalies. So I'm tracking the coefficient of variation of the sample population and purging anomalous results (1 sigma from max) when it exceeds 5%. This results in a quite solid sample population where the standard deviation is a meaningful value that can be used in judging the significance of change between master and branch.
> 
> --Pavol
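The filtering described above could be sketched like this. Note this is one possible reading of "1 sigma from max": while the coefficient of variation exceeds 5%, purge samples lying within one standard deviation of the current maximum. The function name and the exact purge rule are assumptions for illustration, not the actual Benchmark_Driver code.

```python
import statistics

def purge_outliers(samples, cv_threshold=0.05):
    """Drop high outliers until the coefficient of variation <= threshold."""
    samples = sorted(samples)
    while len(samples) > 2:
        mean = statistics.mean(samples)
        sd = statistics.stdev(samples)
        if sd / mean <= cv_threshold:
            break  # population is already "solid"; stop purging
        cutoff = max(samples) - sd  # purge anything 1 sigma from max
        kept = [s for s in samples if s < cutoff]
        if not kept or len(kept) == len(samples):
            break  # avoid purging everything or looping forever
        samples = kept
    return samples
```

On the resulting population the standard deviation is meaningful, so a master-vs-branch comparison can use it to judge significance, as described above.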

