Social-MAE: A Transformer-Based Multimodal Autoencoder for Face and Voice

📅 2025-08-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Modeling multimodal social cues—such as facial expressions and speech—remains challenging due to their temporal complexity and inter-modal dependencies. To address this, we propose Social-MAE, a Transformer-based audiovisual masked autoencoder that extends the CAV-MAE architecture to support long-sequence inputs. It is the first model to perform domain-specific self-supervised pretraining on VoxCeleb2 for social interaction understanding. Social-MAE jointly encodes face and speech modalities via integrated mechanisms: masked reconstruction, contrastive learning, and cross-modal fusion. This design enhances representation learning for downstream social perception tasks. Experiments demonstrate state-of-the-art performance on multimodal emotion recognition and laughter detection, and competitive results on explicit personality trait estimation. These outcomes validate Social-MAE’s strong representational capacity and cross-task transferability, establishing it as a robust foundation model for social behavior analysis.
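The two self-supervised signals named above (masked reconstruction and cross-modal contrastive learning) can be illustrated with a minimal, framework-agnostic sketch. This is not the paper's implementation: the embedding shapes, temperature value, and NumPy formulation are illustrative assumptions, following the general CAV-MAE-style recipe of pairing an MAE reconstruction loss with a symmetric InfoNCE loss between matched audio and video clips.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Unit-normalize embeddings along the feature axis."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def contrastive_loss(audio_emb, video_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch: the audio and video embeddings
    of the same clip are positives; all other pairings are negatives.
    (temperature=0.07 is a common default, assumed here.)"""
    a = l2_normalize(audio_emb)
    v = l2_normalize(video_emb)
    logits = a @ v.T / temperature          # (B, B) similarity matrix
    labels = np.arange(len(a))              # positives on the diagonal

    def ce(lg):
        # numerically stable cross-entropy against the diagonal targets
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    return 0.5 * (ce(logits) + ce(logits.T))  # audio->video and video->audio

def masked_reconstruction_loss(pred_patches, true_patches, mask):
    """MAE-style objective: mean-squared error computed only on the
    masked patch positions (mask == 1)."""
    diff = (pred_patches - true_patches) ** 2
    return (diff.mean(axis=-1) * mask).sum() / mask.sum()
```

In training, the two terms would be combined into one objective (e.g. `recon + lam * contrastive` for some weighting `lam`), so the encoder is pushed both to recover masked content within each modality and to align face and voice representations of the same clip.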

📝 Abstract
Human social behaviors are inherently multimodal, necessitating the development of powerful audiovisual models for their perception. In this paper, we present Social-MAE, our pre-trained audiovisual Masked Autoencoder based on an extended version of Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE), which is pre-trained on audiovisual social data. Specifically, we modify CAV-MAE to receive a larger number of frames as input and pre-train it on a large dataset of human social interaction (VoxCeleb2) in a self-supervised manner. We demonstrate the effectiveness of this model by fine-tuning and evaluating it on different social and affective downstream tasks, namely, emotion recognition, laughter detection, and apparent personality estimation. The model achieves state-of-the-art results on multimodal emotion recognition and laughter recognition and competitive results for apparent personality estimation, demonstrating the effectiveness of in-domain self-supervised pre-training. Code and model weights are available at https://github.com/HuBohy/SocialMAE.
Problem

Research questions and friction points this paper is trying to address.

Developing a multimodal autoencoder for face- and voice-based social perception
Addressing emotion recognition through audiovisual social data
Improving the accuracy of laughter detection and personality estimation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer-based multimodal autoencoder for face and voice
Extended CAV-MAE with increased frame input capacity
Self-supervised pre-training on VoxCeleb2 social dataset
Hugo Bohy
Numediart Institute, ISIA Lab, University of Mons, Mons, Belgium
Minh Tran
Institute for Creative Technologies, University of Southern California, Los Angeles, CA, USA
Kevin El Haddad
Unknown affiliation
Affective computing, Speech processing, Conversational AI, NLP, Human-agent Interactions
Thierry Dutoit
Université de Mons
Media Art Technology, Speech and Audio Processing, Biological Signal Processing
Mohammad Soleymani
Institute for Creative Technologies, University of Southern California, Los Angeles, CA, USA