🤖 AI Summary
Cross-corpus speech emotion recognition suffers from unstable acoustic features and limited generalizability caused by speaker variability, domain shift, and heterogeneous recording conditions. To address this, we propose a contrastive learning framework anchored on articulatory mouth movements: physiologically interpretable articulatory dynamics replace conventional acoustic feature alignment as the core signal for cross-domain emotional representation. Our method integrates lip-motion modeling, acoustic-visual disentangled representation learning, and joint contrastive training across multiple corpora (CREMA-D and MSP-IMPROV). Experimental results show substantial improvements in cross-corpus emotion recognition accuracy, validating that mouth articulation provides a stable, consistent, and generalizable cue for emotion representation. This work establishes a new paradigm for unsupervised cross-domain speech emotion recognition grounded in articulatory physiology.
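The core idea of aligning acoustic and articulatory (lip-motion) embeddings via contrastive training can be illustrated with a standard symmetric InfoNCE objective. This is a minimal sketch, not the paper's actual architecture or loss: the function name `info_nce_loss`, the temperature value, and the toy embeddings are all illustrative assumptions.

```python
import numpy as np

def info_nce_loss(acoustic, articulatory, temperature=0.1):
    """Symmetric InfoNCE loss aligning acoustic embeddings with
    articulatory (lip-motion) embeddings of the same utterances.
    Rows with the same index are positive pairs; all other rows
    in the batch serve as negatives. (Illustrative sketch only.)"""
    # L2-normalize so the dot product becomes cosine similarity
    a = acoustic / np.linalg.norm(acoustic, axis=1, keepdims=True)
    v = articulatory / np.linalg.norm(articulatory, axis=1, keepdims=True)
    logits = a @ v.T / temperature  # (batch, batch) similarity matrix
    # Row-wise log-softmax; the diagonal holds the positive pairs
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    loss_a2v = -np.mean(np.diag(log_p))
    # Symmetric term: articulatory -> acoustic direction
    log_p_t = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    loss_v2a = -np.mean(np.diag(log_p_t))
    return 0.5 * (loss_a2v + loss_v2a)

# Toy check: correctly paired embeddings score a much lower loss
# than deliberately misaligned ones (np.roll shifts every pairing).
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
print(info_nce_loss(emb, emb) < info_nce_loss(emb, np.roll(emb, 1, axis=0)))  # → True
```

In this setup the articulatory embedding acts as the stable anchor: the acoustic encoder is pulled toward utterance-matched lip-motion representations, so corpus-specific acoustic variation is discouraged from dominating the learned emotion space.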
📝 Abstract
Cross-corpus speech emotion recognition (SER) plays a vital role in numerous practical applications. Traditional approaches to cross-corpus emotion transfer typically adapt acoustic features to align different corpora, domains, or labels. However, acoustic features are inherently variable and error-prone due to factors such as speaker differences, domain shifts, and recording conditions. To address these challenges, this study takes a novel contrastive approach that focuses on emotion-specific articulatory gestures as the core elements of analysis. By shifting the emphasis to these more stable and consistent articulatory gestures, we aim to improve emotion transfer learning in SER tasks. Using the CREMA-D and MSP-IMPROV corpora as benchmarks, our experiments reveal valuable insights into the commonality and reliability of these gestures. The findings highlight the potential of mouth articulatory gestures as a stronger constraint for improving emotion recognition across different settings and domains.