🤖 AI Summary
This work addresses inaccurate continuous valence-arousal estimation in real-world scenarios, where inconsistent modality reliability and varying interaction phases degrade performance. To tackle this, the authors propose the SAGE framework, which decouples modality reliability estimation from feature representation and introduces a phase-adaptive, reliability-aware fusion mechanism. This mechanism dynamically calibrates the confidence of the audio and visual modalities across interaction stages and adjusts their representation weights accordingly. Experiments on the Aff-Wild2 benchmark show that SAGE substantially improves concordance correlation coefficient (CCC) scores over existing multimodal fusion approaches, validating the effectiveness and robustness of reliability-driven modeling under noise, occlusion, and changing interaction conditions.
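The summary only names the fusion mechanism, so a minimal sketch may help make it concrete. The following PyTorch module is an illustrative assumption, not the authors' implementation: the class and attribute names (`ReliabilityAwareFusion`, `stage_emb`, `conf_audio`, `conf_visual`), the learned stage embedding, and the softmax gating are all hypothetical choices for how per-modality confidence could be estimated separately from the feature encoders and used to rebalance audio and visual representations.

```python
import torch
import torch.nn as nn

class ReliabilityAwareFusion(nn.Module):
    """Hypothetical sketch of stage-adaptive, reliability-weighted fusion.

    Per-modality confidence is estimated by small heads that are
    decoupled from the feature encoders, then used to rebalance the
    audio and visual representations before valence-arousal regression.
    """

    def __init__(self, dim: int, num_stages: int = 3):
        super().__init__()
        # Confidence heads see the features plus a learned stage
        # embedding, so estimated reliability can vary with the
        # (assumed discrete) interaction stage.
        self.stage_emb = nn.Embedding(num_stages, dim)
        self.conf_audio = nn.Linear(2 * dim, 1)
        self.conf_visual = nn.Linear(2 * dim, 1)
        self.regressor = nn.Linear(dim, 2)  # outputs: valence, arousal

    def forward(self, f_a, f_v, stage_id):
        # f_a, f_v: (batch, dim) audio / visual features
        # stage_id: (batch,) integer stage labels
        s = self.stage_emb(stage_id)  # (batch, dim)
        logits = torch.cat([
            self.conf_audio(torch.cat([f_a, s], dim=-1)),
            self.conf_visual(torch.cat([f_v, s], dim=-1)),
        ], dim=-1)  # (batch, 2)
        w = torch.softmax(logits, dim=-1)        # reliability weights
        fused = w[:, :1] * f_a + w[:, 1:] * f_v  # rebalanced fusion
        return self.regressor(fused), w

# Example: batch of 8 clips, 256-d features, 3 hypothetical stages
fusion = ReliabilityAwareFusion(dim=256, num_stages=3)
va, w = fusion(torch.randn(8, 256), torch.randn(8, 256),
               torch.randint(0, 3, (8,)))
print(va.shape, w.shape)  # torch.Size([8, 2]) torch.Size([8, 2])
```

The design point the summary stresses is the decoupling: the confidence heads calibrate reliability without altering the modality representations themselves, so a degraded modality is down-weighted rather than allowed to distort the fused feature.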
📝 Abstract
Continuous valence-arousal estimation in real-world environments is challenging due to inconsistent modality reliability and interaction-dependent variability in audio-visual signals. Existing approaches primarily focus on modeling temporal dynamics, often overlooking the fact that modality reliability can vary substantially across interaction stages. To address this issue, we propose SAGE, a Stage-Adaptive reliability modeling framework that explicitly estimates and calibrates modality-wise confidence during multimodal integration. SAGE introduces a reliability-aware fusion mechanism that dynamically rebalances audio and visual representations according to their stage-dependent informativeness, preventing unreliable signals from dominating the prediction process. By separating reliability estimation from feature representation, the proposed framework enables more stable emotion estimation under cross-modal noise, occlusion, and varying interaction conditions. Extensive experiments on the Aff-Wild2 benchmark demonstrate that SAGE consistently improves concordance correlation coefficient (CCC) scores compared with existing multimodal fusion approaches, highlighting the effectiveness of reliability-driven modeling for continuous affect prediction.
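For context, the CCC used as the evaluation metric is the standard measure for continuous valence-arousal prediction on Aff-Wild2. For a ground-truth sequence $y$ and predictions $\hat{y}$ it is defined as

$$\mathrm{CCC}(y, \hat{y}) = \frac{2\rho\,\sigma_y \sigma_{\hat{y}}}{\sigma_y^2 + \sigma_{\hat{y}}^2 + (\mu_y - \mu_{\hat{y}})^2},$$

where $\rho$ is the Pearson correlation between $y$ and $\hat{y}$, and $\mu$, $\sigma^2$ denote the mean and variance of each sequence. Unlike plain correlation, CCC also penalizes shifts in mean and scale, so a model must reproduce the label trajectory itself, not merely its trend.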