🤖 AI Summary
To address the underutilization of unlabeled data and the insufficient modeling of local and global representations in audio-visual speech emotion recognition (AVSER), this paper proposes VQ-MAE, the first vector-quantized masked autoencoder tailored to AVSER. VQ-MAE jointly models temporal and semantic features from the audio and video streams, using vector quantization (VQ) to learn discrete representations. It combines cross-modal masked reconstruction with contrastive-loss-guided codebook optimization to improve robustness and generalization. Key innovations include the fusion of multimodal features, a Transformer-based encoder, and a hierarchical masking strategy. Evaluated on RAVDESS and CMU-MOSEI, VQ-MAE achieves state-of-the-art results, improving accuracy by 3.2% in speaker-independent settings and F1-score by 4.7% under low-resource (few-shot) conditions.
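To make the two core mechanisms concrete, here is a minimal PyTorch sketch, not the authors' code: a VQ-VAE-style quantization bottleneck with a straight-through estimator, plus MAE-style masking of the token sequence before reconstruction. The class names, codebook size, mask ratio, and the use of plain random masking in place of the paper's hierarchical masking strategy are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Maps continuous encoder outputs to their nearest codebook entries."""
    def __init__(self, num_codes: int = 512, dim: int = 256, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # weight of the commitment term

    def forward(self, z):  # z: (batch, seq, dim)
        # Squared L2 distance from every token to every codebook vector.
        d = (z.pow(2).sum(-1, keepdim=True)
             - 2 * z @ self.codebook.weight.t()
             + self.codebook.weight.pow(2).sum(-1))
        idx = d.argmin(-1)                      # discrete token ids
        zq = self.codebook(idx)                 # quantized vectors
        # Codebook + commitment losses (standard VQ-VAE objective).
        loss = F.mse_loss(zq, z.detach()) + self.beta * F.mse_loss(z, zq.detach())
        # Straight-through estimator: copy gradients from zq back to z.
        zq = z + (zq - z).detach()
        return zq, idx, loss

def random_mask(tokens, mask_ratio: float = 0.5):
    """Drops a fraction of token positions (MAE-style); returns visible tokens."""
    b, n, d = tokens.shape
    keep = int(n * (1 - mask_ratio))
    perm = torch.rand(b, n).argsort(dim=1)       # random permutation per sample
    keep_idx = perm[:, :keep]
    visible = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, d))
    return visible, keep_idx

# Usage: quantize fused audio-visual tokens, then mask before the decoder.
tokens = torch.randn(2, 100, 256)               # (batch, seq, dim) dummy input
vq = VectorQuantizer()
zq, ids, vq_loss = vq(tokens)
visible, keep_idx = random_mask(zq, mask_ratio=0.5)
```

In the full model, the visible quantized tokens would feed the Transformer encoder, the decoder would reconstruct the masked positions across both modalities, and `vq_loss` would be added to the reconstruction and contrastive objectives the summary describes.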