ECMF: Enhanced Cross-Modal Fusion for Multimodal Emotion Recognition in MER-SEMI Challenge

📅 2025-08-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address data scarcity and challenging cross-modal fusion in multimodal emotion recognition (MER), this paper proposes an enhanced cross-modal fusion framework. Methodologically, it introduces a dual-branch visual encoder and a context-aware textual representation module; incorporates a dynamic modality-weighted fusion mechanism with residual connections; leverages large-scale pretrained models, self-attention mechanisms, and large language models for feature alignment and emotional cue enhancement; and adopts a multi-source annotation strategy to mitigate noise from imperfect labels. Evaluated on the MER2025-SEMI semi-supervised benchmark, the framework achieves a weighted F1-score of 87.49%, significantly outperforming the official baseline (78.63%). This demonstrates its effectiveness and robustness for cross-modal collaborative modeling under limited supervision.
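The reported 87.49% is a weighted F-score: per-class F1 averaged with weights proportional to each class's support in the ground truth. A minimal pure-Python sketch of that metric (the official MER2025-SEMI scoring script may differ in implementation details):

```python
def weighted_f1(y_true, y_pred):
    """Support-weighted F1: per-class F1, averaged with weights
    proportional to each class's frequency in y_true."""
    classes = sorted(set(y_true))
    n = len(y_true)
    total = 0.0
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        support = sum(1 for t in y_true if t == c)
        total += f1 * support / n
    return total
```

For example, with `y_true = [0, 0, 1, 1]` and `y_pred = [0, 1, 1, 1]`, class 0 has F1 = 2/3 and class 1 has F1 = 0.8, so the support-weighted average is about 0.733.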

📝 Abstract
Emotion recognition plays a vital role in enhancing human-computer interaction. In this study, we tackle the MER-SEMI challenge of the MER2025 competition by proposing a novel multimodal emotion recognition framework. To address the issue of data scarcity, we leverage large-scale pre-trained models to extract informative features from visual, audio, and textual modalities. Specifically, for the visual modality, we design a dual-branch visual encoder that captures both global frame-level features and localized facial representations. For the textual modality, we introduce a context-enriched method that employs large language models to enrich emotional cues within the input text. To effectively integrate these multimodal features, we propose a fusion strategy comprising two key components, i.e., self-attention mechanisms for dynamic modality weighting, and residual connections to preserve original representations. Beyond architectural design, we further refine noisy labels in the training set by a multi-source labeling strategy. Our approach achieves a substantial performance improvement over the official baseline on the MER2025-SEMI dataset, attaining a weighted F-score of 87.49% compared to 78.63%, thereby validating the effectiveness of the proposed framework.
Problem

Research questions and friction points this paper is trying to address.

Enhancing multimodal emotion recognition accuracy
Addressing data scarcity with pre-trained models
Improving fusion of visual, audio, textual features
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-branch visual encoder captures global and local features
Context-enriched method enhances emotional cues in text
Self-attention and residual connections optimize multimodal fusion
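The fusion idea above can be sketched in plain Python: per-modality attention scores are turned into softmax weights for a dynamic weighted sum, and a residual term preserves the original representations. This is an illustrative stand-in for the paper's learned self-attention, not its actual implementation; `score_w` is a hypothetical scoring vector introduced for the example.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_modalities(feats, score_w):
    """Dynamic modality-weighted fusion with a residual connection (sketch).

    feats: dict mapping modality name -> feature vector (equal length d).
    score_w: length-d scoring vector standing in for the paper's
             learned self-attention (assumption for illustration).
    """
    names = sorted(feats)
    vecs = [feats[n] for n in names]
    d = len(vecs[0])
    # one scalar attention score per modality
    scores = [sum(v[i] * score_w[i] for i in range(d)) for v in vecs]
    weights = softmax(scores)  # dynamic modality weights, sum to 1
    fused = [sum(w * v[i] for w, v in zip(weights, vecs)) for i in range(d)]
    # residual connection: add back the unweighted mean representation
    residual = [sum(v[i] for v in vecs) / len(vecs) for i in range(d)]
    out = [f + r for f, r in zip(fused, residual)]
    return out, dict(zip(names, weights))
```

With three toy modality vectors, the modality whose features align best with the scoring vector receives the largest weight, while the residual keeps information from all modalities in the fused output.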
Authors
Juewen Hu
State Key Laboratory of General Artificial Intelligence, BIGAI, Beijing, China
Yexin Li
State Key Laboratory of General Artificial Intelligence, BIGAI, Beijing, China
Research interests: reinforcement learning, multi-agent systems, multi-armed bandits, data mining
Jiulin Li
State Key Laboratory of General Artificial Intelligence, BIGAI, Beijing, China
Shuo Chen
State Key Laboratory of General Artificial Intelligence, BIGAI, Beijing, China
Pring Wong
State Key Laboratory of General Artificial Intelligence, BIGAI, Beijing, China