Feature-Based Dual Visual Feature Extraction Model for Compound Multimodal Emotion Recognition

📅 2025-03-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Composite emotion recognition in real-world scenarios faces significant challenges from modality conflicts and heightened uncertainty, particularly under complex audiovisual cues. To address this, we propose a dual-visual-backbone collaborative architecture that, for the first time, performs feature-level fusion of complementary visual representations extracted by Vision Transformers (ViT) and ResNet, jointly with audio features, enabling end-to-end multimodal modeling. This design enhances visual representation diversity and robustness while effectively mitigating inter-modal inconsistency. Extensive experiments on the C-EXPR-DB and MELD benchmarks demonstrate that our method substantially outperforms single-backbone baselines in composite emotion recognition—especially exhibiting superior generalization on noisy and low-quality audiovisual samples. The source code is publicly available.

📝 Abstract
This article presents our results for the eighth Affective Behavior Analysis in-the-wild (ABAW) competition. Multimodal emotion recognition (ER) has important applications in affective computing and human-computer interaction. However, in the real world, compound emotion recognition faces greater issues of uncertainty and modal conflicts. For the Compound Expression (CE) Recognition Challenge, this paper proposes a multimodal emotion recognition method that fuses the features of Vision Transformer (ViT) and Residual Network (ResNet). We conducted experiments on the C-EXPR-DB and MELD datasets. The results show that in scenarios with complex visual and audio cues (such as C-EXPR-DB), the model that fuses the features of ViT and ResNet exhibits superior performance. Our code is available at https://github.com/MyGitHub-ax/8th_ABAW
Problem

Research questions and friction points this paper is trying to address.

Addresses uncertainty and modal conflicts in compound emotion recognition
Proposes ViT-ResNet fusion for multimodal emotion recognition
Improves performance in complex visual-audio scenarios like C-EXPR-DB
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fuses Vision Transformer and ResNet features
Targets compound emotion recognition challenges
Improves performance in complex multimodal scenarios
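The fusion idea described above can be illustrated with a minimal PyTorch sketch: two visual branches (a ResNet-style CNN and a ViT-style patch transformer) encode the same frames, and their features are concatenated with an audio embedding before classification. All class names, layer sizes, `audio_dim`, and the 7-class output are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class TinyResNetBranch(nn.Module):
    # ResNet-style convolutional branch (stand-in for the paper's ResNet backbone)
    def __init__(self, out_dim=128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, out_dim),
        )

    def forward(self, x):
        return self.body(x)

class TinyViTBranch(nn.Module):
    # ViT-style patch-embedding + transformer branch (stand-in for the paper's ViT backbone)
    def __init__(self, patch=16, dim=128):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, x):
        tokens = self.patchify(x).flatten(2).transpose(1, 2)  # (B, num_patches, dim)
        return self.encoder(tokens).mean(dim=1)               # mean-pool to (B, dim)

class CompoundERModel(nn.Module):
    # Feature-level fusion: concatenate both visual features with an audio embedding
    def __init__(self, audio_dim=64, num_classes=7):  # 7 compound classes assumed
        super().__init__()
        self.res_branch = TinyResNetBranch()
        self.vit_branch = TinyViTBranch()
        self.head = nn.Linear(128 + 128 + audio_dim, num_classes)

    def forward(self, frames, audio_feat):
        fused = torch.cat(
            [self.res_branch(frames), self.vit_branch(frames), audio_feat], dim=1
        )
        return self.head(fused)

model = CompoundERModel()
logits = model(torch.randn(2, 3, 64, 64), torch.randn(2, 64))
print(tuple(logits.shape))  # (2, 7)
```

The key design point the paper emphasizes is that fusion happens at the feature level (before the classifier), so the ViT's global-attention features and the ResNet's local convolutional features complement each other end-to-end rather than being combined by late voting.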
Ran Liu
University of Chinese Academy of Sciences, China; The State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences; Tianjin Normal University
Fengyu Zhang
University of Chinese Academy of Sciences, China; The State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences
Cong Yu
Head of Engineering, Dandy
ML / Language Model · ML / Computer Vision · 3D/CAD · Process Mining · Data Mining
Longjiang Yang
University of Chinese Academy of Sciences, China; The State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences
Zhuofan Wen
University of Chinese Academy of Sciences, China; The State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences
Siyuan Zhang
University of Chinese Academy of Sciences, China; The State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences
Hailiang Yao
University of Chinese Academy of Sciences, China; The State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences
Shun Chen
Institute of Automation, Chinese Academy of Sciences
Affective computing · human-computer interaction · deep learning
Zheng Lian
Associate Professor, IEEE/CCF Senior Member, Institute of Automation, Chinese Academy of Sciences
Affective Computing · Sentiment Analysis · Machine Learning
Bin Liu
University of Chinese Academy of Sciences, China; The State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences