🤖 AI Summary
This study addresses the unreliability of conventional subject-exclusive cross-validation in facial action unit (AU) detection, which is susceptible to random partitioning noise that obscures genuine performance improvements. For the first time, the work systematically quantifies the stochastic variance introduced by this evaluation protocol and proposes Leave-One-Dataset-Out (LODO) cross-validation to eliminate partition randomness, thereby enabling more robust assessment of cross-dataset generalization. Experiments across five mainstream AU datasets reveal substantial evaluation instability for low-prevalence AUs (e.g., an average F1-score noise floor of ±0.065 on BP4D+) and demonstrate that LODO uncovers domain-level instabilities invisible to single-dataset cross-validation. These findings suggest that many reported performance gains may fall within the margin of evaluation variance rather than reflecting true model advances.
📝 Abstract
Subject-exclusive cross-validation is the standard evaluation protocol for facial Action Unit (AU) detection, yet reported improvements are often small. We show that cross-validation itself introduces measurable stochastic variance. On BP4D+, repeated 3-fold subject-exclusive splits produce an empirical noise floor of $\pm 0.065$ in average F1, with substantially larger variation for low-prevalence AUs. Operating-point metrics such as F1 fluctuate more than threshold-independent measures such as AUC, and model rankings can change under different fold assignments.
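To make the protocol concrete, below is a minimal sketch (not the authors' code) of how the repeated subject-exclusive 3-fold experiment can be implemented: whole subjects are randomly reassigned to folds under different seeds, and the spread of average F1 across repeats gives the empirical noise floor. The `train_and_predict` function and the `features`/`labels`/`subjects` arrays are hypothetical placeholders, not the paper's actual pipeline.

```python
import numpy as np
from sklearn.metrics import f1_score


def subject_exclusive_folds(subjects, n_folds, rng):
    """Randomly assign whole subjects to folds so no identity crosses folds."""
    ids = rng.permutation(np.unique(subjects))
    return np.array_split(ids, n_folds)


def noise_floor(features, labels, subjects, train_and_predict,
                n_folds=3, n_repeats=20, seed=0):
    """Repeat subject-exclusive k-fold CV and measure the spread of average F1."""
    rng = np.random.default_rng(seed)
    avg_f1_per_repeat = []
    for _ in range(n_repeats):
        fold_f1 = []
        for held_out in subject_exclusive_folds(subjects, n_folds, rng):
            test_mask = np.isin(subjects, held_out)
            preds = train_and_predict(features[~test_mask], labels[~test_mask],
                                      features[test_mask])
            # Binary F1 per AU column, averaged over AUs (operating-point metric).
            fold_f1.append(np.mean([
                f1_score(labels[test_mask][:, au], preds[:, au])
                for au in range(labels.shape[1])
            ]))
        avg_f1_per_repeat.append(np.mean(fold_f1))
    # Std across repeats reflects partition randomness alone: the noise floor.
    return float(np.mean(avg_f1_per_repeat)), float(np.std(avg_f1_per_repeat))
```

Because the model and data are fixed and only the subject-to-fold assignment changes between repeats, the standard deviation returned here isolates the variance contributed by the evaluation protocol itself.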
We further evaluate cross-dataset robustness using a Leave-One-Dataset-Out (LODO) protocol across five AU datasets. LODO removes partition randomness and exposes domain-level instability that is not visible under single-dataset cross-validation. Together, these results suggest that gains reported under subject-exclusive cross-validation may fall within protocol variance, and that LODO yields more stable and interpretable findings.
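For comparison, a LODO evaluation can be sketched as follows. It involves no random partitioning at all: each run trains on four datasets and tests on the fixed held-out fifth. The dataset names, `load_dataset`, and `train_and_predict` are illustrative assumptions, not identifiers from the paper.

```python
import numpy as np
from sklearn.metrics import f1_score

# Assumed dataset names for illustration; the paper specifies five AU datasets.
DATASETS = ["BP4D", "BP4D+", "DISFA", "GFT", "Aff-Wild2"]


def lodo_evaluation(load_dataset, train_and_predict):
    """Train on all datasets but one, evaluate on the held-out domain."""
    results = {}
    for held_out in DATASETS:
        train_parts = [load_dataset(d) for d in DATASETS if d != held_out]
        x_train = np.concatenate([x for x, _ in train_parts])
        y_train = np.concatenate([y for _, y in train_parts])
        x_test, y_test = load_dataset(held_out)
        preds = train_and_predict(x_train, y_train, x_test)
        # Average F1 over the AU labels shared across all five datasets.
        results[held_out] = float(np.mean([
            f1_score(y_test[:, au], preds[:, au])
            for au in range(y_test.shape[1])
        ]))
    return results  # one deterministic score per held-out domain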