Scalable Audio-Visual Masked Autoencoders for Efficient Affective Video Facial Analysis

📅 2025-09-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address three key challenges in Affective Video Facial Analysis (AVFA)—data scarcity, difficulty in modeling cross-modal associations, and limited scalability—this paper introduces AVF-MAE++, the first scalable audio-visual masked autoencoder framework tailored for AVFA. The method features: (1) a dual-modal collaborative masking strategy to jointly enhance intra- and inter-modal representation learning; (2) an iterative audio-visual correlation learning module that explicitly captures dynamic cross-modal dependencies; and (3) a progressive semantic injection mechanism to improve knowledge transfer from pretraining to downstream tasks. Built upon a two-stream encoder, cross-modal attention, and staged self-supervised training, the framework achieves state-of-the-art performance across 17 datasets and three core affective analysis tasks. Ablation studies confirm the effectiveness of each component. The code and pretrained models are publicly released.
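The dual masking strategy can be illustrated with a minimal sketch. This is a generic MAE-style random token mask applied independently per modality; the paper's actual collaborative coupling between the audio and visual masks, the mask ratios, and any tube/structured masking are not specified here, so those details are assumptions for illustration only.

```python
import numpy as np

def dual_modal_masks(n_video_tokens, n_audio_tokens, mask_ratio=0.75, seed=0):
    """Sample random token masks for video and audio sequences.

    Returns boolean arrays where True marks a masked (hidden) token.
    Generic MAE-style masking sketch; the paper's collaborative
    strategy couples the two modalities, which is omitted here.
    """
    rng = np.random.default_rng(seed)

    def sample(n):
        n_masked = int(round(n * mask_ratio))
        mask = np.zeros(n, dtype=bool)
        mask[rng.choice(n, size=n_masked, replace=False)] = True
        return mask

    return sample(n_video_tokens), sample(n_audio_tokens)

# e.g. 196 video patch tokens, 64 audio spectrogram tokens
v_mask, a_mask = dual_modal_masks(196, 64, mask_ratio=0.75)
print(v_mask.sum(), a_mask.sum())  # 147 48
```

In an MAE setup, only the unmasked tokens (`~mask`) are fed to the encoders, and the decoder reconstructs the masked positions.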

📝 Abstract
Affective video facial analysis (AVFA) has emerged as a key research field for building emotion-aware intelligent systems, yet this field continues to suffer from limited data availability. In recent years, the self-supervised learning (SSL) technique of Masked Autoencoders (MAE) has gained momentum, with growing adaptations to audio-visual contexts. While scaling has proven essential for breakthroughs in general multi-modal learning domains, its specific impact on AVFA remains largely unexplored. Another core challenge in this field is capturing both intra- and inter-modal correlations through scalable audio-visual representations. To tackle these issues, we propose AVF-MAE++, a family of audio-visual MAE models designed to efficiently investigate the scaling properties in AVFA while enhancing cross-modal correlation modeling. Our framework introduces a novel dual masking strategy across audio and visual modalities and strengthens the modality encoders with a more holistic design to better support scalable pre-training. Additionally, we present the Iterative Audio-Visual Correlation Learning Module, which improves correlation learning within the SSL paradigm, bridging the limitations of previous methods. To support smooth adaptation and reduce overfitting risks, we further introduce a progressive semantic injection strategy, organizing model training into three structured stages. Extensive experiments conducted on 17 datasets, covering three major AVFA tasks, demonstrate that AVF-MAE++ achieves consistent state-of-the-art performance across multiple benchmarks. Comprehensive ablation studies further highlight the importance of each proposed component and provide deeper insights into the design choices driving these improvements. Our code and models have been publicly released on GitHub.
Problem

Research questions and friction points this paper is trying to address.

Addressing limited data availability in affective video facial analysis
Exploring scaling impact on audio-visual masked autoencoders for emotion recognition
Improving cross-modal correlation modeling through novel dual masking strategy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual masking strategy across audio and visual modalities
Iterative Audio-Visual Correlation Learning Module for SSL
Progressive semantic injection strategy with three training stages
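Cross-modal attention, the building block underlying the correlation learning described above, can be sketched as follows. This is a generic single-head cross-attention where one modality queries the other; the paper's Iterative Audio-Visual Correlation Learning Module is not reproduced here, and the token counts and dimensions are arbitrary illustrative values.

```python
import numpy as np

def cross_modal_attention(q_tokens, kv_tokens):
    """Single-head cross-attention: one modality attends to the other.

    q_tokens:  (Nq, d) queries (e.g. video tokens)
    kv_tokens: (Nk, d) keys/values (e.g. audio tokens)
    Returns (Nq, d) audio-conditioned video features.
    """
    d = q_tokens.shape[-1]
    scores = q_tokens @ kv_tokens.T / np.sqrt(d)          # (Nq, Nk) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    return weights @ kv_tokens                            # weighted audio summary per query

rng = np.random.default_rng(0)
video = rng.normal(size=(4, 8))   # 4 video tokens, dim 8
audio = rng.normal(size=(6, 8))   # 6 audio tokens, dim 8
out = cross_modal_attention(video, audio)
print(out.shape)  # (4, 8)
```

Running this block in both directions (video→audio and audio→video), possibly over several iterations, is one common way such correlation modules are organized.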
🔎 Similar Papers
2023-05-05 · Computer Vision and Image Understanding · Citations: 6
👥 Authors
Xuecheng Wu
School of Computer Science and Technology, Xi’an Jiaotong University
Junxiao Xue
Zhejiang Lab
Xinyi Yin
School of Cyber Science and Engineering, Zhengzhou University
Yunyun Shi
School of Computer Science and Technology, Xi’an Jiaotong University
Liangyu Fu
School of Software, Northwestern Polytechnical University
Danlei Huang
School of Computer Science and Technology, Xi’an Jiaotong University
Yifan Wang
Institute of Advanced Technology, University of Science and Technology of China
Jia Zhang
School of Computer Science and Technology, Xi’an Jiaotong University
Jiayu Nie
Inspur Group
Jun Wang
Research Center for Space Computing System, Zhejiang Lab