AI Summary
To address modality imbalance, static dominant modality assignment, and redundancy/noise in non-linguistic modalities (e.g., visual/audio sequences) in multimodal sentiment analysis, this paper proposes a dynamic dominant-modality-driven fusion framework. Methodologically, it introduces: (1) a sample-adaptive dominant modality selector that identifies the most discriminative modality per instance; (2) a graph-structured sequence compressor leveraging capsule networks and graph convolution to compress redundant non-linguistic sequences and suppress noise; and (3) a dominant-modality-centered cross-attention mechanism that anchors cross-modal interaction on the selected dominant modality to enhance critical information exchange. Evaluated on four benchmark video sentiment datasets, the framework consistently outperforms state-of-the-art methods, demonstrating significant improvements in modality balance, robustness against noise, and overall accuracy.
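The sample-adaptive selection step above can be illustrated with a minimal sketch. This is not the paper's implementation; it only shows the idea of scoring each modality per sample with a (here randomly initialized, hypothetical) linear scorer and picking the highest-scoring one as dominant.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-sample features for three modalities (batch of 4, dim 8).
feats = {m: rng.standard_normal((4, 8)) for m in ("language", "acoustic", "visual")}

# Hypothetical learned scoring weights: one linear scorer per modality.
weights = {m: rng.standard_normal(8) for m in feats}

def select_dominant(feats, weights):
    """Score each modality per sample; return the argmax modality name per sample."""
    names = list(feats)
    scores = np.stack([feats[m] @ weights[m] for m in names], axis=1)  # (batch, 3)
    return [names[i] for i in scores.argmax(axis=1)]

dominant = select_dominant(feats, weights)
print(len(dominant))  # one dominant-modality name per sample in the batch
```

In the actual framework the scorer would be trained end-to-end, so the selection reflects learned discriminativeness rather than random weights.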
Abstract
Multimodal Sentiment Analysis (MSA) aims to predict sentiment from language, acoustic, and visual data in videos. However, imbalanced unimodal performance often leads to suboptimal fused representations. Existing approaches typically adopt fixed primary modality strategies to maximize dominant modality advantages, yet fail to adapt to dynamic variations in modality importance across different samples. Moreover, non-language modalities suffer from sequential redundancy and noise, degrading model performance when they serve as primary inputs. To address these issues, this paper proposes a modality optimization and dynamic primary modality selection framework (MODS). First, a Graph-based Dynamic Sequence Compressor (GDC) is constructed, which employs capsule networks and graph convolution to reduce sequential redundancy in acoustic/visual modalities. Then, we develop a sample-adaptive Primary Modality Selector (MSelector) that dynamically determines the dominant modality for each sample. Finally, a Primary-modality-Centric Cross-Attention (PCCA) module is designed to enhance dominant modalities while facilitating cross-modal interaction. Extensive experiments on four benchmark datasets demonstrate that MODS outperforms state-of-the-art methods, achieving superior performance by effectively balancing modality contributions and suppressing redundancy and noise.
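The primary-modality-centric cross-attention described above can be sketched as standard scaled dot-product attention in which the selected dominant modality supplies the queries and an auxiliary modality supplies keys and values. This is a simplified illustration under that assumption, not the paper's PCCA module; shapes and names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def primary_centric_attention(dominant_seq, auxiliary_seq):
    """Dominant modality queries an auxiliary modality (scaled dot-product)."""
    d = dominant_seq.shape[-1]
    attn = softmax(dominant_seq @ auxiliary_seq.T / np.sqrt(d))  # (Td, Ta)
    return attn @ auxiliary_seq  # one auxiliary-context vector per dominant step

rng = np.random.default_rng(1)
dominant = rng.standard_normal((6, 16))    # dominant modality sequence, 6 steps
auxiliary = rng.standard_normal((10, 16))  # auxiliary modality sequence, 10 steps

fused = primary_centric_attention(dominant, auxiliary)
print(fused.shape)  # (6, 16)
```

Anchoring the queries on the per-sample dominant modality is what distinguishes this interaction pattern from symmetric pairwise cross-attention: the exchange is always organized around the most discriminative modality for that sample.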