🤖 AI Summary
To address the underutilization of unlabeled data and the insufficient modeling of local and global representations in audio-visual speech emotion recognition (AVSER), this paper proposes VQ-MAE, the first vector-quantized masked autoencoder tailored to AVSER. VQ-MAE jointly models temporal and semantic features from the audio and video streams, using vector quantization (VQ) to learn discrete representations. It combines cross-modal masked reconstruction with contrastive-loss-guided codebook optimization to improve robustness and generalization. Key innovations include the fusion of multimodal features, a Transformer-based encoder, and a hierarchical masking strategy. Evaluated on RAVDESS and CMU-MOSEI, VQ-MAE achieves state-of-the-art results, improving accuracy by 3.2% in speaker-independent settings and F1-score by 4.7% under low-resource (few-shot) conditions.
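To make the two core mechanisms concrete, here is a minimal PyTorch sketch, not the authors' code: a VQ-VAE-style quantization bottleneck with a straight-through estimator, plus MAE-style masking of the token sequence before reconstruction. The class names, codebook size, mask ratio, and the use of plain random masking in place of the paper's hierarchical masking strategy are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Maps continuous encoder outputs to their nearest codebook entries."""
    def __init__(self, num_codes: int = 512, dim: int = 256, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # weight of the commitment term

    def forward(self, z):  # z: (batch, seq, dim)
        # Squared L2 distance from every token to every codebook vector.
        d = (z.pow(2).sum(-1, keepdim=True)
             - 2 * z @ self.codebook.weight.t()
             + self.codebook.weight.pow(2).sum(-1))
        idx = d.argmin(-1)                      # discrete token ids
        zq = self.codebook(idx)                 # quantized vectors
        # Codebook + commitment losses (standard VQ-VAE objective).
        loss = F.mse_loss(zq, z.detach()) + self.beta * F.mse_loss(z, zq.detach())
        # Straight-through estimator: copy gradients from zq back to z.
        zq = z + (zq - z).detach()
        return zq, idx, loss

def random_mask(tokens, mask_ratio: float = 0.5):
    """Drops a fraction of token positions (MAE-style); returns visible tokens."""
    b, n, d = tokens.shape
    keep = int(n * (1 - mask_ratio))
    perm = torch.rand(b, n).argsort(dim=1)       # random permutation per sample
    keep_idx = perm[:, :keep]
    visible = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, d))
    return visible, keep_idx

# Usage: quantize fused audio-visual tokens, then mask before the decoder.
tokens = torch.randn(2, 100, 256)               # (batch, seq, dim) dummy input
vq = VectorQuantizer()
zq, ids, vq_loss = vq(tokens)
visible, keep_idx = random_mask(zq, mask_ratio=0.5)
```

In the full model, the visible quantized tokens would feed the Transformer encoder, the decoder would reconstruct the masked positions across both modalities, and `vq_loss` would be added to the reconstruction and contrastive objectives the summary describes.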