🤖 AI Summary
To address the low efficiency of performance monitoring in CI/CD pipelines caused by excessive repetition in microbenchmarking, this paper proposes μOpTime, a static, data-driven method for determining the minimal reliable repetition count per microbenchmark. Leveraging historical execution data and five statistical stability metrics, such as the coefficient of variation (CV), the approach statically infers the minimum number of repetitions required for reliable measurements. It explicitly accounts for JVM warm-up effects, supports both Java and Go projects, and can be combined with performance regression detection in CI/CD pipelines. Experimental evaluation across 14 open-source projects demonstrates that, while preserving regression detection accuracy, the technique reduces microbenchmark measurement time by up to 95.83% (Go) and 94.17% (Java).
📝 Abstract
Performance regressions have a tremendous impact on the quality of software. One way to catch regressions before they reach production is executing performance tests before deployment, e.g., using microbenchmarks, which measure performance at subroutine level. In projects with many microbenchmarks, this may take several hours due to repeated execution to get accurate results, disqualifying them from frequent use in CI/CD pipelines. We propose μOpTime, a static approach to reduce the execution time of microbenchmark suites by configuring the number of repetitions for each microbenchmark. Based on the results of a full, previous microbenchmark suite run, μOpTime determines the minimal number of (measurement) repetitions with statistical stability metrics that still lead to accurate results. We evaluate μOpTime with an experimental study on 14 open-source projects written in two programming languages and five stability metrics. Our results show that (i) μOpTime reduces the total suite execution time (measurement phase) by up to 95.83% (Go) and 94.17% (Java), (ii) the choice of stability metric depends on the project and programming language, (iii) microbenchmark warmup phases have to be considered for Java projects (potentially leading to higher reductions), and (iv) μOpTime can be used to reliably detect performance regressions in CI/CD pipelines.
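To make the core idea concrete, here is a minimal sketch of how a static repetition-count decision could work: given per-repetition results from one full historical run of a microbenchmark, find the smallest prefix of repetitions whose dispersion is already low enough. This illustration assumes the coefficient of variation (CV) as the stability metric and a hypothetical threshold of 0.05; the paper evaluates five metrics, and μOpTime's actual selection logic may differ.

```python
import statistics

def coefficient_of_variation(samples):
    # CV = sample standard deviation / mean; a dispersion-based
    # stability metric (one of several a tool like this could use).
    return statistics.stdev(samples) / statistics.mean(samples)

def minimal_repetitions(measurements, threshold=0.05, min_reps=3):
    """Return the smallest prefix length whose CV falls below `threshold`.

    `measurements` holds per-repetition results of one microbenchmark
    from a full historical suite run (warmup repetitions assumed to be
    already excluded, which matters for JVM benchmarks). Falls back to
    the full repetition count if the threshold is never reached.
    """
    for n in range(min_reps, len(measurements) + 1):
        if coefficient_of_variation(measurements[:n]) <= threshold:
            return n
    return len(measurements)

# A stable benchmark needs only a few repetitions ...
stable = [10.1, 10.0, 10.2, 10.1, 10.0, 10.1, 10.2, 10.0, 10.1, 10.0]
# ... while a noisy one keeps its full repetition budget.
noisy = [5.0, 9.0, 2.0, 12.0, 7.0, 1.0, 11.0, 4.0, 8.0, 3.0]

print(minimal_repetitions(stable))
print(minimal_repetitions(noisy))
```

Running this per microbenchmark before a CI/CD run yields a reduced, benchmark-specific repetition configuration, which is where the reported suite-level time savings would come from.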