A vector quantized masked autoencoder for audiovisual speech emotion recognition

📅 2023-05-05
🏛️ Computer Vision and Image Understanding
📈 Citations: 6
Influential: 1
🤖 AI Summary
To address the underutilization of unlabeled data and the insufficient modeling of local and global representations in audio-visual speech emotion recognition (AVSER), this paper proposes VQ-MAE, the first vector-quantized masked autoencoder tailored for AVSER. VQ-MAE jointly models temporal-semantic features from the audio and video streams, using vector quantization (VQ) for discrete representation learning, and combines cross-modal masked reconstruction with contrastive-loss-guided codebook optimization to improve robustness and generalization. Key contributions include the fusion of multimodal features, a Transformer-based encoder, and a hierarchical masking strategy. Evaluated on RAVDESS and CMU-MOSEI, VQ-MAE achieves state-of-the-art performance: +3.2% accuracy in speaker-independent settings and +4.7% F1-score under low-resource (few-shot) conditions.
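
The discrete bottleneck at the core of this design can be illustrated with a minimal VQ-VAE-style quantizer. The sketch below is a generic vector-quantization layer in PyTorch, not the authors' implementation; the codebook size, code dimension, and commitment weight are illustrative assumptions.

```python
# Minimal vector-quantization bottleneck (VQ-VAE style): nearest-neighbour
# codebook lookup with straight-through gradients. Hyperparameters are
# illustrative assumptions, not values from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=512, code_dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1 / num_codes, 1 / num_codes)
        self.beta = beta  # commitment-loss weight

    def forward(self, z):
        # z: (batch, tokens, code_dim) continuous encoder outputs
        flat = z.reshape(-1, z.size(-1))
        # Squared L2 distance from each token to every codebook entry
        d = (flat.pow(2).sum(1, keepdim=True)
             - 2 * flat @ self.codebook.weight.t()
             + self.codebook.weight.pow(2).sum(1))
        idx = d.argmin(dim=1)
        q = self.codebook(idx).view_as(z)
        # Codebook loss pulls codes toward encoder outputs; commitment
        # loss keeps encoder outputs close to their assigned codes
        loss = F.mse_loss(q, z.detach()) + self.beta * F.mse_loss(z, q.detach())
        # Straight-through estimator: gradients bypass the argmin
        q = z + (q - z).detach()
        return q, idx.view(z.shape[:-1]), loss
```

The discrete indices returned here are what a masked autoencoder can then be trained to predict, which is the role the summary attributes to the VQ stage.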
Problem

Research questions and friction points this paper is trying to address.

Leveraging unlabeled audiovisual speech data for emotion recognition
Learning multimodal representations via masked autoencoders
Improving emotion recognition accuracy in diverse conditions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vector quantized autoencoders compress speech data
Masked autoencoder with attention learns multimodal representations
Masked audiovisual tokens are reconstructed, with a contrastive loss guiding codebook optimization (see the sketch below)
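
To make the masked-pretraining side concrete, here is a hedged sketch of random masking over a joint audio-video token sequence. The 50% ratio, token shapes, and the use of simple random masking (as a stand-in for the paper's hierarchical strategy) are assumptions for illustration; the encoder and reconstruction loss are omitted.

```python
# Sketch of masked audiovisual pretraining: randomly mask tokens from the
# concatenated audio/video sequence, encode only the visible ones, and
# reconstruct the masked positions (e.g. cross-entropy over the codebook
# indices produced by the quantizer above). Shapes are hypothetical.
import torch

def random_mask(tokens: torch.Tensor, ratio: float = 0.5):
    """tokens: (batch, seq, dim). Returns visible tokens and a boolean mask
    where True marks masked (to-be-reconstructed) positions."""
    b, s, _ = tokens.shape
    n_keep = int(s * (1 - ratio))
    scores = torch.rand(b, s, device=tokens.device)
    keep = scores.argsort(dim=1)[:, :n_keep]   # indices of visible tokens
    mask = torch.ones(b, s, dtype=torch.bool, device=tokens.device)
    mask.scatter_(1, keep, False)              # False = visible, True = masked
    visible = torch.gather(
        tokens, 1, keep.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))
    return visible, mask

# Usage: build one joint sequence from both modalities, then mask it.
audio = torch.randn(2, 100, 64)  # hypothetical audio tokens
video = torch.randn(2, 50, 64)   # hypothetical video tokens
av = torch.cat([audio, video], dim=1)
visible, mask = random_mask(av, ratio=0.5)
```

A contrastive term over the codebook, as the summary describes, would be added on top of the reconstruction objective during codebook optimization.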