AI Summary
To address modality imbalance, static dominant modality assignment, and redundancy/noise in non-linguistic modalities (e.g., visual/audio sequences) in multimodal sentiment analysis, this paper proposes a dynamic dominant-modality-driven fusion framework. Methodologically, it introduces: (1) a sample-adaptive dominant modality selector that identifies the most discriminative modality per instance; (2) a graph-structured sequence compressor leveraging capsule networks and graph convolution to compress redundant non-linguistic sequences and suppress noise; and (3) a dominant-modality-centered cross-attention mechanism that anchors cross-modal interaction on the selected dominant modality to enhance critical information exchange. Evaluated on four benchmark video sentiment datasets, the framework consistently outperforms state-of-the-art methods, demonstrating significant improvements in modality balance, robustness against noise, and overall accuracy.
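The sample-adaptive selection step above can be illustrated with a minimal sketch. This is not the paper's implementation; it only shows the idea of scoring each modality per sample with a (here randomly initialized, hypothetical) linear scorer and picking the highest-scoring one as dominant.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-sample features for three modalities (batch of 4, dim 8).
feats = {m: rng.standard_normal((4, 8)) for m in ("language", "acoustic", "visual")}

# Hypothetical learned scoring weights: one linear scorer per modality.
weights = {m: rng.standard_normal(8) for m in feats}

def select_dominant(feats, weights):
    """Score each modality per sample; return the argmax modality name per sample."""
    names = list(feats)
    scores = np.stack([feats[m] @ weights[m] for m in names], axis=1)  # (batch, 3)
    return [names[i] for i in scores.argmax(axis=1)]

dominant = select_dominant(feats, weights)
print(len(dominant))  # one dominant-modality name per sample in the batch
```

In the actual framework the scorer would be trained end-to-end, so the selection reflects learned discriminativeness rather than random weights.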
Abstract
Multimodal Sentiment Analysis (MSA) aims to predict sentiment from language, acoustic, and visual data in videos. However, imbalanced unimodal performance often leads to suboptimal fused representations. Existing approaches typically adopt fixed primary modality strategies to maximize dominant modality advantages, yet fail to adapt to dynamic variations in modality importance across different samples. Moreover, non-language modalities suffer from sequential redundancy and noise, degrading model performance when they serve as primary inputs. To address these issues, this paper proposes a modality optimization and dynamic primary modality selection framework (MODS). First, a Graph-based Dynamic Sequence Compressor (GDC) is constructed, which employs capsule networks and graph convolution to reduce sequential redundancy in acoustic/visual modalities. Then, we develop a sample-adaptive Primary Modality Selector (MSelector) that dynamically determines the dominant modality for each sample. Finally, a Primary-modality-Centric Cross-Attention (PCCA) module is designed to enhance dominant modalities while facilitating cross-modal interaction. Extensive experiments on four benchmark datasets demonstrate that MODS outperforms state-of-the-art methods, achieving superior performance by effectively balancing modality contributions and suppressing redundancy and noise.
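The primary-modality-centric cross-attention described above can be sketched as standard scaled dot-product attention in which the selected dominant modality supplies the queries and an auxiliary modality supplies keys and values. This is a simplified illustration under that assumption, not the paper's PCCA module; shapes and names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def primary_centric_attention(dominant_seq, auxiliary_seq):
    """Dominant modality queries an auxiliary modality (scaled dot-product)."""
    d = dominant_seq.shape[-1]
    attn = softmax(dominant_seq @ auxiliary_seq.T / np.sqrt(d))  # (Td, Ta)
    return attn @ auxiliary_seq  # one auxiliary-context vector per dominant step

rng = np.random.default_rng(1)
dominant = rng.standard_normal((6, 16))    # dominant modality sequence, 6 steps
auxiliary = rng.standard_normal((10, 16))  # auxiliary modality sequence, 10 steps

fused = primary_centric_attention(dominant, auxiliary)
print(fused.shape)  # (6, 16)
```

Anchoring the queries on the per-sample dominant modality is what distinguishes this interaction pattern from symmetric pairwise cross-attention: the exchange is always organized around the most discriminative modality for that sample.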