AI Summary
This study addresses the limited sensitivity of traditional cloud service performance regression detection with application benchmarks, which is hindered by variable I/O calls and changing performance characteristics of the underlying cloud infrastructure. The authors propose a paradigm termed "Duet Instrumentation," which combines large language model (LLM)-driven analysis of code changes with a synchronized benchmark of two application versions. An LLM identifies the performance-relevant changes between two consecutive versions, and the system instruments only those code regions, so that performance differences are measured directly at the changes. In experiments with a realistic testbed application offering configurable performance regressions, the generated instrumentation reaches 58% precision, 93% recall, and 71% specificity (averaged across tasks) against the ideal instrumentation at a line-distance threshold of five. In the downstream application benchmark, the prototype detects regressions at up to 5x lower injected severity than a traditional duet application benchmark while preserving similar A/A latency distributions.
Abstract
Continuous cloud service performance benchmarking is essential for detecting performance bugs early before deploying them to production. However, detecting performance regressions using application benchmarks, which usually treat the system under test as a black box, is challenging due to variable I/O calls or changing performance characteristics of the underlying cloud infrastructure. Microbenchmarks are often more sensitive and accurate, but also more time-consuming to implement and run. Further, they do not capture the performance of the integrated system as a whole. A comprehensive performance assessment therefore typically requires a combination of both approaches. To address the shortcomings of application benchmarks, we propose duet instrumentation, a novel benchmarking paradigm enabled by recent advancements in large language model (LLM) code understanding. The idea is to analyze code changes between two consecutive application versions and measure performance differences directly at performance-relevant changes during a synchronized benchmark of both application versions, uncovering performance changes with higher sensitivity. We design a system that reliably automates the assessment and instrumentation of performance-relevant code changes between the two application versions. In experiments with a realistic testbed application offering configurable performance regressions, we find that our prototype achieves 58% precision, 93% recall, and 71% specificity (averaged across tasks) when comparing the generated instrumentation against the ideal instrumentation with a line-distance threshold of five. In the downstream application benchmark, we find that our prototype can detect performance regressions at up to 5x lower injected severity compared to a traditional duet application benchmark while preserving similar A/A latency distributions.