[swift-dev] Measuring MEAN Performance (was: Questions about Swift-CI)
atrick at apple.com
Tue Jun 13 01:51:01 CDT 2017
> On Jun 12, 2017, at 10:36 PM, Pavol Vaskovic <pali at pali.sk> wrote:
> As the next two paragraphs after the part you quoted go on explaining, I'm hoping that with this approach we could adaptively sample the benchmark until we get a stable population, but starting from a lower iteration count.
> If the Python implementation bears this out, the proper solution would be to change the implementation in DriverUtil.swift, from the current ~1s run with adaptive num-iters to finer-grained runs. We'd be gathering more, smaller samples, tossing out anomalies as we go until we gather a stable sample population (with a low coefficient of variation) or run out of the allotted time.
~1s might be longer than necessary for the benchmarks with cheap setup. Another option is for the benchmark to call back to the Driver’s “start button” after setup. With no setup work, I think 200 ms is a bare minimum if we care about changes in the 1% range.
I’m confused though because I thought we agreed that all samples need to run with exactly the same number of iterations. So, there would be one short run to find the desired num_iters for each benchmark, then each subsequent invocation of the benchmark harness would be handed num_iters as input.
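To make the two-phase scheme concrete, here is a rough Python sketch, not the actual driver. The names calibrate_num_iters, sample, and run_once are hypothetical, and a plain Python callable stands in for invoking the real benchmark harness. The 200 ms target and the 3 samples are the figures mentioned in this thread; the doubling strategy is just an assumption for illustration.

```python
import time

def calibrate_num_iters(run_once, target_seconds=0.2, max_iters=1 << 20):
    """Short calibration run: double num_iters until one timed run
    lasts at least target_seconds (or max_iters is reached)."""
    num_iters = 1
    while num_iters < max_iters:
        start = time.perf_counter()
        for _ in range(num_iters):
            run_once()
        if time.perf_counter() - start >= target_seconds:
            break
        num_iters *= 2
    return num_iters

def sample(run_once, num_iters, num_samples=3):
    """Take num_samples measurements, every one with exactly the same
    num_iters. Returns the time per iteration, in seconds, for each sample."""
    results = []
    for _ in range(num_samples):
        start = time.perf_counter()
        for _ in range(num_iters):
            run_once()
        results.append((time.perf_counter() - start) / num_iters)
    return results
```

Calibration happens once per benchmark; after that, every sample is comparable because it averages over the same number of iterations.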
> This has the potential to speed up the benchmark suite with more intelligent management of the measurements, instead of using the brute force of a super-long runtime to drown out the errors as we do currently.
> (I am aware of various aspects this approach might introduce that have the potential to mess with the caching: time measurement itself, more frequent logging - this would currently rely on --verbose mode, invoking Benchmark_O from Python…)
> The proof is in the pudding, so I guess we'll learn if this approach would work this week, when I hammer the implementation down in Python for demonstration.
> On Tue, 13 Jun 2017 at 03:19, Andrew Trick <atrick at apple.com> wrote:
>> On Jun 12, 2017, at 4:45 PM, Pavol Vaskovic <pali at pali.sk> wrote:
>> I have sketched an algorithm for getting more consistent test results; so far it's in Numbers. I have run the whole test suite for 100 samples and observed the varying distribution of test results. The first result is quite often an outlier, with subsequent results being quicker. Depending on the "weather" on the test machine, you sometimes measure anomalies. So I'm tracking the coefficient of variation of the sample population and purging anomalous results when it exceeds 5%. This results in a solid sample population where standard deviation is a meaningful value that can be used in judging the significance of a change between master and branch.
> That’s a reasonable approach for running 100 samples. I’m not sure how it fits with the goal of minimizing turnaround time. Typically you don’t need more than 3 samples (keeping in mind we're usually averaging over thousands of iterations per sample).
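For reference, the purging step described in the quoted message could look roughly like the Python sketch below. The 5% threshold comes from the message. How an "anomalous" result is picked is not specified there, so dropping the sample farthest from the median is purely an assumption, and the function names are hypothetical.

```python
import statistics

def coefficient_of_variation(samples):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(samples) / statistics.mean(samples)

def purge_anomalies(samples, max_cv=0.05, min_samples=3):
    """Drop samples one at a time until the coefficient of variation
    is at most max_cv or only min_samples remain.

    Assumption: the sample farthest from the median is treated as the
    anomaly; the original message does not say how it is chosen."""
    kept = sorted(samples)
    while len(kept) > min_samples and coefficient_of_variation(kept) > max_cv:
        median = statistics.median(kept)
        farthest = max(kept, key=lambda s: abs(s - median))
        kept.remove(farthest)
    return kept
```

For example, given the runtimes [150, 101, 100, 99, 100, 102, 130], this discards 150 and 130 and keeps the five samples clustered around 100.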