🤖 AI Summary
Frequent updates to mobile operating systems often introduce user-perceptible performance regressions, such as slower responses, UI stutters, or dropped frames. Existing detection methods rely heavily on system-level metrics (e.g., CPU or memory usage) or on analysis of specific OS components, so they can miss the regressions that end users actually experience. This paper presents MobileUPReg, a black-box framework for detecting user-perceived performance regressions across OS versions. It runs the same apps under different OS versions, extracts four user-perceived metrics (response time, finish time, launch time, and dropped frames), and compares them between versions to flag regressions that are truly perceptible to users. Deployed in an industrial CI pipeline, the framework analyzes thousands of screencasts across hundreds of apps daily. In a large-scale evaluation it detects regressions with 0.96 precision, 0.91 recall, and a 0.93 F1-score, significantly outperforming a statistical baseline that combines the Wilcoxon rank-sum test with Cliff's Delta. It has also uncovered real-world regressions missed by traditional tools.
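As a quick consistency check, the reported F1-score follows from the reported precision $P$ and recall $R$ via the standard harmonic mean:

```latex
F_1 = \frac{2PR}{P + R}
    = \frac{2 \times 0.96 \times 0.91}{0.96 + 0.91}
    = \frac{1.7472}{1.87}
    \approx 0.934
```

which rounds to the stated 0.93.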
📝 Abstract
Mobile operating systems (OS) are frequently updated, but such updates can unintentionally degrade user experience by introducing performance regressions. Existing detection techniques often rely on system-level metrics (e.g., CPU or memory usage) or focus on specific OS components, which may miss the regressions users actually perceive, such as slower responses or UI stutters. To address this gap, we present MobileUPReg, a black-box framework for detecting user-perceived performance regressions across OS versions. MobileUPReg runs the same apps under different OS versions and compares user-perceived performance metrics (response time, finish time, launch time, and dropped frames) to identify regressions that are truly perceptible to users. In a large-scale study, MobileUPReg achieves high accuracy in extracting user-perceived metrics and detects user-perceived regressions with 0.96 precision, 0.91 recall, and 0.93 F1-score, significantly outperforming a statistical baseline using the Wilcoxon rank-sum test and Cliff's Delta. MobileUPReg has been deployed in an industrial CI pipeline, where it analyzes thousands of screencasts across hundreds of apps daily and has uncovered regressions missed by traditional tools. These results demonstrate that MobileUPReg enables accurate, scalable, and perceptually aligned regression detection for mobile OS validation.
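To make the comparison concrete, here is a minimal sketch of the kind of statistical baseline the abstract mentions: a one-sided Wilcoxon rank-sum test combined with Cliff's Delta, applied to per-run samples of one metric (e.g., response time) on the old and new OS builds. This is not MobileUPReg's own detection method, which the abstract does not specify. The decision rule, the significance level `alpha=0.05`, the effect-size cutoff `min_delta=0.147` (a commonly used "negligible" boundary for Cliff's Delta), the one-sided "new is slower" direction, and all function names are illustrative assumptions. The p-value uses the large-sample normal approximation without a tie correction.

```python
import math
from statistics import NormalDist


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_p_greater(new, old):
    """One-sided Wilcoxon rank-sum p-value that `new` tends to exceed `old`.

    Uses the normal approximation to the rank-sum statistic (no tie correction).
    """
    n1, n2 = len(new), len(old)
    ranks = average_ranks(list(new) + list(old))
    w = sum(ranks[:n1])  # rank sum of the new-build sample
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / sd
    return 1 - NormalDist().cdf(z)


def cliffs_delta(new, old):
    """Cliff's Delta in [-1, 1]; positive means `new` values tend to be larger."""
    gt = sum(1 for a in new for b in old if a > b)
    lt = sum(1 for a in new for b in old if a < b)
    return (gt - lt) / (len(new) * len(old))


def is_regression(new, old, alpha=0.05, min_delta=0.147):
    """Hypothetical rule: flag a slowdown only if it is significant AND non-negligible."""
    return rank_sum_p_greater(new, old) < alpha and cliffs_delta(new, old) >= min_delta


if __name__ == "__main__":
    # Hypothetical response times (ms) for one app scenario on two OS builds.
    old_build = [120, 118, 125, 122, 119, 121, 124, 117]
    new_build = [135, 140, 132, 138, 136, 141, 133, 137]
    print(cliffs_delta(new_build, old_build))   # every new sample is slower: 1.0
    print(is_regression(new_build, old_build))  # True
```

Requiring both a small p-value and a non-negligible effect size avoids flagging statistically significant but tiny differences, which matter when many runs are collected. A purely distributional rule like this still says nothing about whether a difference is perceptible to users, which is the gap the abstract says MobileUPReg targets.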