AMB-DSGDN: Adaptive Modality-Balanced Dynamic Semantic Graph Differential Network for Multimodal Emotion Recognition

📅 2026-03-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing multimodal emotion recognition approaches, which are often susceptible to noise and suffer from dominant modalities—such as text—overwhelming non-dominant ones like speech and visual cues, thereby introducing biased emotional representations. To mitigate these issues, the authors construct modality-specific dynamic semantic subgraphs to jointly model intra- and inter-speaker emotional dependencies. They propose a differential graph attention mechanism that explicitly contrasts attention distributions across modalities to filter out shared noise while preserving modality-unique signals. Additionally, an adaptive modality balancing mechanism is introduced to dynamically adjust the contribution weights of each modality during fusion. This framework effectively enhances the influence of non-dominant modalities, improves modeling of emotional state dynamics, and yields cleaner, more discriminative multimodal emotion representations.
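The summary's key idea, contrasting two attention distributions so that noise patterns common to both cancel out, can be illustrated with a rough sketch. The paper's exact formulation is not reproduced on this page, so every function and parameter name below is hypothetical; this is only a minimal single-graph illustration of the differential-attention idea, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def differential_graph_attention(h, Wq1, Wk1, Wq2, Wk2, lam=0.5):
    """Hypothetical sketch: compute two attention maps over the same node
    features and subtract them, so noise patterns shared by both maps
    cancel while distribution-specific signal is retained."""
    a1 = softmax((h @ Wq1) @ (h @ Wk1).T)  # first attention distribution
    a2 = softmax((h @ Wq2) @ (h @ Wk2).T)  # second attention distribution
    diff = a1 - lam * a2                   # differential attention weights
    return diff @ h                        # aggregate node features

# Toy usage: 5 utterance nodes with 8-dim features.
rng = np.random.default_rng(0)
h = rng.normal(size=(5, 8))
Ws = [rng.normal(size=(8, 8)) for _ in range(4)]
out = differential_graph_attention(h, *Ws)
```

The subtraction weight `lam` (an assumed hyperparameter) controls how aggressively the second map is cancelled against the first.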

📝 Abstract
Multimodal dialogue emotion recognition captures emotional cues by fusing text, visual, and audio modalities. However, existing approaches still suffer from notable limitations in modeling emotional dependencies and learning multimodal representations. On the one hand, they are unable to effectively filter out redundant or noisy signals within multimodal features, which hinders the accurate capture of the dynamic evolution of emotional states across and within speakers. On the other hand, during multimodal feature learning, dominant modalities tend to overwhelm the fusion process, thereby suppressing the complementary contributions of non-dominant modalities such as speech and vision, ultimately constraining the overall recognition performance. To address these challenges, we propose an Adaptive Modality-Balanced Dynamic Semantic Graph Differential Network (AMB-DSGDN). Concretely, we first construct modality-specific subgraphs for text, speech, and vision, where each modality contains intra-speaker and inter-speaker graphs to capture both self-continuity and cross-speaker emotional dependencies. On top of these subgraphs, we introduce a differential graph attention mechanism, which computes the discrepancy between two sets of attention maps. By explicitly contrasting these attention distributions, the mechanism cancels out shared noise patterns while retaining modality-specific and context-relevant signals, thereby yielding purer and more discriminative emotional representations. In addition, we design an adaptive modality balancing mechanism, which estimates a dropout probability for each modality according to its relative contribution in emotion modeling.
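The abstract's closing idea, estimating a per-modality dropout probability from each modality's relative contribution so dominant modalities are suppressed more often, can be sketched as follows. The scoring function, names, and sampling scheme here are all assumptions for illustration; the paper's actual mechanism is not detailed on this page.

```python
import numpy as np

def adaptive_modality_dropout(contributions, temperature=1.0, rng=None):
    """Hypothetical sketch: map each modality's relative contribution to a
    dropout probability, so dominant modalities (e.g. text) are dropped
    more often and non-dominant ones (speech, vision) can catch up."""
    if rng is None:
        rng = np.random.default_rng()
    c = np.asarray(contributions, dtype=float)
    w = np.exp(c / temperature)
    p_drop = w / w.sum()                  # higher contribution -> higher drop prob
    keep = rng.random(len(c)) >= p_drop   # sample a keep mask per modality
    if not keep.any():                    # always retain at least one modality
        keep[np.argmax(c)] = True
    return keep, p_drop

# Toy usage: text contributes most, then audio, then vision.
keep, p_drop = adaptive_modality_dropout(
    [0.7, 0.2, 0.1], rng=np.random.default_rng(0))
```

Raising `temperature` flattens the drop probabilities toward uniform; lowering it penalizes the dominant modality more sharply.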
Problem

Research questions and friction points this paper is trying to address.

multimodal emotion recognition
emotional dependencies
modality imbalance
noisy signals
dynamic emotion evolution
Innovation

Methods, ideas, or system contributions that make the work stand out.

differential graph attention
adaptive modality balancing
dynamic semantic graph
multimodal emotion recognition
modality-specific subgraphs
Yunsheng Wang
Assistant Professor in Department of Computer Science, California State Polytechnic University
Connected Vehicle, Autonomous Vehicle, Edge Computing, Opportunistic Networks, Cybersecurity
Yuntao Shou
College of Computer and Mathematics, Central South University of Forestry and Technology, Changsha, Hunan, China
Yilong Tan
College of Computer and Mathematics, Central South University of Forestry and Technology, Changsha, Hunan, China
Wei Ai
Department of Computer Science, Hunan University, Changsha, Hunan, China
Tao Meng
Central South University of Forestry and Technology
Graph Neural Network, Multimodal Emotion Recognition, Text Classification, Entity Alignment
Keqin Li
AMA University
Robotics, Machine Learning, Artificial Intelligence, Computer Vision