[swift-dev] Measuring MEAN Performance (was: Questions about Swift-CI)

Tue Jun 13 00:36:06 CDT 2017

As the next two paragraphs after the part you quoted go on to explain, I'm
hoping that with this approach we could adaptively sample the benchmark
until we get a stable population, while starting from a lower iteration
count.

If the Python implementation bears this out, the proper solution would be
to change the implementation in DriverUtil.swift, from the current ~1s runs
with adaptive num-iters to finer-grained runs. We'd be gathering more,
smaller samples, tossing out anomalies as we go, until we either gather a
stable sample population (with a low coefficient of variation) or run out
of the allotted time.
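
To make the loop described above concrete, here is a minimal Python sketch.
It is not the code in DriverUtil.swift, nor the prototype mentioned in this
message. The names adaptive_sample, measure, max_cv, min_samples and
budget_s are hypothetical, and dropping the sample farthest from the median
is only one assumed way of "tossing out anomalies". The 5% threshold is the
one from the quoted message below.

```python
import statistics
import time


def coefficient_of_variation(samples):
    """Sample standard deviation divided by the mean (0.0 if undefined)."""
    if len(samples) < 2:
        return 0.0
    mean = statistics.mean(samples)
    return statistics.stdev(samples) / mean if mean else 0.0


def adaptive_sample(measure, max_cv=0.05, min_samples=3, budget_s=1.0,
                    clock=time.perf_counter):
    """Collect small samples until the population is stable or time runs out.

    measure() is a hypothetical callable returning one timing sample.
    """
    deadline = clock() + budget_s
    samples = []
    while clock() < deadline:
        samples.append(measure())
        # Assumed purge rule: while the spread is too high, drop the sample
        # farthest from the median, but never go below min_samples.
        while (len(samples) > min_samples
               and coefficient_of_variation(samples) > max_cv):
            median = statistics.median(samples)
            samples.remove(max(samples, key=lambda s: abs(s - median)))
        # Stop early once enough samples agree with each other.
        if (len(samples) >= min_samples
                and coefficient_of_variation(samples) <= max_cv):
            break
    return samples
```

With a slow first sample followed by consistent ones, this sketch discards
the outlier and stops as soon as the remaining samples agree, rather than
running for the whole time budget.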

This has the potential to speed up the benchmark suite through more
intelligent management of the measurements, instead of using the brute
force of super-long runtimes to drown out the errors, as we do currently.

(I am aware that this approach might introduce various effects with the
potential to mess with caching: the time measurement itself, more frequent
logging (this would currently rely on --verbose mode), invoking Benchmark_O
from Python…)

The proof is in the pudding, so I guess we'll learn this week whether this
approach works, when I hammer out the implementation in Python for
demonstration.

--Pavol

On Tue, 13 Jun 2017 at 03:19, Andrew Trick <atrick at apple.com> wrote:

>
> On Jun 12, 2017, at 4:45 PM, Pavol Vaskovic <pali at pali.sk> wrote:
>
> I have sketched an algorithm for getting more consistent test results; so
> far it's in Numbers. I have run the whole test suite for 100 samples and
> observed the varying distribution of test results. The first result is
> quite often an outlier, with subsequent results being quicker. Depending on
> the "weather" on the test machine, you sometimes measure anomalies. So I'm
> tracking the coefficient of variation of the sample population and purging
> anomalous results when it exceeds 5%. This results in a solid sample
> population where standard deviation is a meaningful value that can be used
> in judging the significance of change between master and branch.
>
>
> That’s a reasonable approach for running 100 samples. I’m not sure how it
> fits with the goal of minimizing turnaround time. Typically you don’t need
> more than 3 samples (keeping in mind we're usually averaging over thousands
> of iterations per sample).
>
> -Andy
>