Relational graph-driven differential denoising and diffusion attention fusion for multimodal conversation emotion recognition

📅 2026-03-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses multimodal conversational emotion recognition under adverse environmental and acquisition conditions, which corrupt audiovisual features with noise and create imbalanced information quality across modalities, leading to distorted representations and biased weighting during fusion. To mitigate these issues, the authors propose a relation-aware denoising and diffusion attention fusion model. The approach employs a differential Transformer to suppress temporally irrelevant noise, constructs intra- and inter-modal relation subgraphs to capture affective dependencies, and introduces a text-guided cross-modal diffusion mechanism to achieve semantically aligned and robust fusion. Notably, this method explicitly models modality discrepancies and the dominant role of textual cues, departing from conventional implicit weighted fusion schemes, and demonstrates significant improvements in both accuracy and robustness for emotion recognition in noisy settings.
📝 Abstract
In real-world scenarios, audio and video signals are often subject to environmental noise and limited acquisition conditions, so the extracted features contain excessive noise. Furthermore, data quality and information-carrying capacity are imbalanced across modalities. Together, these two issues cause information distortion and weight bias during the fusion phase, impairing overall recognition performance. Most existing methods neglect the impact of noisy modalities and rely on implicit weighting to model modality importance, thereby failing to explicitly account for the predominant contribution of the textual modality to emotion understanding. To address these issues, we propose a relation-aware denoising and diffusion attention fusion model for multimodal conversational emotion recognition (MCER). Specifically, we first design a differential Transformer that explicitly computes the difference between two attention maps, enhancing temporally consistent information while suppressing temporally irrelevant noise, which leads to effective denoising in both the audio and video modalities. Second, we construct modality-specific and cross-modality relation subgraphs to capture speaker-dependent emotional dependencies, enabling fine-grained modeling of intra- and inter-modal relationships. Finally, we introduce a text-guided cross-modal diffusion mechanism that leverages self-attention to model intra-modal dependencies and adaptively diffuses audiovisual information into the textual stream, ensuring more robust and semantically aligned multimodal fusion.
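The abstract's first component subtracts one attention map from another so that noise attended to by both maps cancels while temporally consistent structure survives. A minimal NumPy sketch of one such differential-attention head is below; the projection names, the scaling factor λ, and all shapes are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable row-wise softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def differential_attention(X, W_q1, W_k1, W_q2, W_k2, W_v, lam=0.5):
    """One differential-attention head over a sequence X of shape (T, d_model).

    Two query/key projections yield two softmax attention maps; subtracting
    a lam-scaled second map from the first cancels attention mass that is
    common to both (treated here as noise), keeping the differential signal.
    """
    d = W_q1.shape[1]
    A1 = softmax((X @ W_q1) @ (X @ W_k1).T / np.sqrt(d))  # (T, T)
    A2 = softmax((X @ W_q2) @ (X @ W_k2).T / np.sqrt(d))  # (T, T)
    A = A1 - lam * A2  # common-mode attention cancels
    return A @ (X @ W_v)  # (T, d_head)

# Toy usage with random projections (illustrative sizes only).
rng = np.random.default_rng(0)
T, d_model, d_head = 6, 16, 8
X = rng.standard_normal((T, d_model))
Ws = [0.1 * rng.standard_normal((d_model, d_head)) for _ in range(5)]
out = differential_attention(X, *Ws)
print(out.shape)  # (6, 8)
```

The same idea extends to multiple heads and to the audio and video streams independently; here λ is a fixed constant, whereas a full model would typically learn it.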
Problem

Research questions and friction points this paper is trying to address.

multimodal emotion recognition
noise robustness
modality imbalance
feature fusion
environmental noise
Innovation

Methods, ideas, or system contributions that make the work stand out.

differential Transformer
relation graph
cross-modal diffusion
multimodal emotion recognition
attention fusion
Ying Liu
University of Science and Technology of China
seismology
Yuntao Shou
College of Computer and Mathematics, Central South University of Forestry and Technology, 410004, Hunan, Changsha China
Wei Ai
College of Computer and Mathematics, Central South University of Forestry and Technology, 410004, Hunan, Changsha China
Tao Meng
Central South University of Forestry and Technology
Graph Neural Network · Multimodal Emotion Recognition · Text Classification · Entity Alignment
Keqin Li
AMA University
Robotics · Machine learning · Artificial intelligence · Computer vision