Improving Multimodal Sentiment Analysis via Modality Optimization and Dynamic Primary Modality Selection

πŸ“… 2025-11-09
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
To address modality imbalance, static primary-modality assignment, and redundancy/noise in non-language modalities (acoustic and visual sequences) in multimodal sentiment analysis, this paper proposes a fusion framework driven by dynamic primary modality selection. Methodologically, it introduces: (1) a sample-adaptive primary modality selector that identifies the most discriminative modality for each instance; (2) a graph-structured sequence compressor that uses capsule networks and graph convolution to compress redundant acoustic/visual sequences and suppress noise; and (3) a primary-modality-centric cross-attention mechanism that anchors cross-modal interaction on the selected primary modality to strengthen the exchange of critical information. Evaluated on four benchmark video sentiment datasets, the framework consistently outperforms state-of-the-art methods, with clear gains in modality balance, noise robustness, and overall accuracy.
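The paper's implementation is not reproduced on this page. As a rough illustration of the graph-structured compression idea described above, here is a minimal PyTorch-style sketch, assuming frames are graph nodes, pairwise cosine similarity defines edges, and a few learned capsule-like slots pool the sequence; every name and hyperparameter here (`GraphSequenceCompressor`, `num_slots`, the single GCN layer) is an assumption, not the authors' GDC.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphSequenceCompressor(nn.Module):
    """Minimal sketch of graph-based compression of a redundant sequence.

    Each timestep is a graph node; edges come from pairwise cosine
    similarity. One graph-convolution pass smooths node features, then
    a few learned capsule-like slots pool T timesteps down to K slots.
    """

    def __init__(self, dim: int, num_slots: int = 8):
        super().__init__()
        self.gcn = nn.Linear(dim, dim)                           # GCN feature transform
        self.slots = nn.Parameter(torch.randn(num_slots, dim))   # learned pooling queries

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, dim) acoustic or visual feature sequence
        sim = F.cosine_similarity(x.unsqueeze(2), x.unsqueeze(1), dim=-1)  # (B, T, T)
        adj = F.softmax(sim, dim=-1)                   # row-normalized adjacency
        h = F.relu(self.gcn(adj @ x))                  # one graph-convolution step
        attn = F.softmax(self.slots @ h.transpose(1, 2), dim=-1)  # (B, K, T)
        return attn @ h                                # (B, K, dim): compressed sequence
```

Compressing a long, noisy frame sequence down to a handful of slots is what would let a visual or acoustic stream serve as a usable primary input downstream.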

πŸ“ Abstract
Multimodal Sentiment Analysis (MSA) aims to predict sentiment from language, acoustic, and visual data in videos. However, imbalanced unimodal performance often leads to suboptimal fused representations. Existing approaches typically adopt fixed primary modality strategies to maximize the advantages of the dominant modality, yet fail to adapt to dynamic variations in modality importance across samples. Moreover, non-language modalities suffer from sequential redundancy and noise, degrading model performance when they serve as primary inputs. To address these issues, this paper proposes a modality optimization and dynamic primary modality selection framework (MODS). First, a Graph-based Dynamic Sequence Compressor (GDC) is constructed, which employs capsule networks and graph convolution to reduce sequential redundancy in acoustic/visual modalities. Then, we develop a sample-adaptive Primary Modality Selector (MSelector) for dynamic dominance determination. Finally, a Primary-modality-Centric Cross-Attention (PCCA) module is designed to enhance dominant modalities while facilitating cross-modal interaction. Extensive experiments on four benchmark datasets demonstrate that MODS outperforms state-of-the-art methods, effectively balancing modality contributions and suppressing redundancy and noise.
Problem

Research questions and friction points this paper is trying to address.

Addresses imbalanced unimodal performance in multimodal sentiment analysis
Reduces sequential redundancy and noise in non-language modalities
Dynamically selects the primary modality per sample to adapt to sample-level variation (a minimal selector sketch follows this list)
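The selector's actual architecture is not given on this page; the sketch below shows one plausible reading of per-sample dynamic selection, assuming pooled unimodal features and a shared scoring head. `PrimaryModalitySelector` and `scorer` are hypothetical names, not the paper's MSelector.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrimaryModalitySelector(nn.Module):
    """Sketch: score each modality per sample; the highest-scoring one
    becomes the primary modality for that sample."""

    def __init__(self, dim: int):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)  # shared confidence head across modalities

    def forward(self, feats: list[torch.Tensor]):
        # feats: pooled unimodal features, each (batch, dim), e.g. [text, audio, video]
        stacked = torch.stack(feats, dim=1)          # (B, M, dim)
        scores = self.scorer(stacked).squeeze(-1)    # (B, M)
        weights = F.softmax(scores, dim=-1)          # soft dominance weights
        primary_idx = weights.argmax(dim=-1)         # hard per-sample choice
        return primary_idx, weights
```

The soft weights keep the selector differentiable during training, while the argmax gives the hard per-sample choice used to route the cross-modal interaction.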
Innovation

Methods, ideas, or system contributions that make the work stand out.

Graph-based compressor reduces acoustic/visual sequential redundancy
Dynamic modality selector adaptively determines primary modality per sample
Cross-attention module enhances the dominant modality while enabling cross-modal interaction (see the sketch after this list)
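Again as an illustration only: one way to center cross-attention on the selected primary modality is to draw queries from the primary stream and keys/values from the remaining modalities. The class and argument names below are assumptions, not the paper's PCCA.

```python
import torch
import torch.nn as nn

class PrimaryCentricCrossAttention(nn.Module):
    """Sketch: queries from the primary modality, keys/values from the
    auxiliary modalities, so interaction is anchored on the dominant stream."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, primary: torch.Tensor, auxiliaries: list[torch.Tensor]):
        # primary: (B, Tp, dim); auxiliaries: list of (B, Ti, dim) sequences
        context = torch.cat(auxiliaries, dim=1)       # (B, sum(Ti), dim)
        fused, _ = self.attn(query=primary, key=context, value=context)
        return self.norm(primary + fused)             # residual keeps the primary stream intact
```

The residual connection preserves the primary stream, so the auxiliary modalities refine it rather than overwrite it.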
Dingkang Yang
ByteDance
Multimodal Learning · Generative AI · Embodied AI
Mingcheng Li
Fudan University
Xuecheng Wu
School of Computer Science and Technology, Xi’an Jiaotong University
Zhaoyu Chen
TikTok
AI Security · Trustworthy AI · Multimodal AI · Generative AI
Kaixun Jiang
Fudan University
Computer Vision · Adversarial Examples
Keliang Liu
College of Intelligent Robotics and Advanced Manufacturing, Fudan University
Peng Zhai
College of Intelligent Robotics and Advanced Manufacturing, Fudan University
Lihua Zhang
Wuhan University
Computational Biology · Bioinformatics · Data Mining