TCAN: Text-oriented Cross Attention Network for Multimodal Sentiment Analysis

📅 2024-04-06
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal sentiment analysis (MSA) suffers from modality heterogeneity and disparities in semantic richness, causing conventional equal-weight fusion to overemphasize weak modalities (e.g., visual/audio) while diluting strong ones (e.g., textual), thereby amplifying noise. To address this, we propose a text-centric cross-modal attention framework: (1) text-query-driven visual and audio cross-modal attention mechanisms explicitly model textual semantic guidance over other modalities; (2) a gated noise suppression module jointly filters modality-specific noise; and (3) a unimodal joint backpropagation strategy enhances robustness and inter-modal synergy. Evaluated on CMU-MOSI and CMU-MOSEI, our method consistently outperforms state-of-the-art approaches in both sentiment classification and regression tasks. It establishes the first text-dominant, controllably guided multimodal fusion paradigm—achieving superior accuracy, interpretability, and generalization without sacrificing computational efficiency.

📝 Abstract
Multimodal Sentiment Analysis (MSA) endeavors to understand human sentiment by leveraging language, visual, and acoustic modalities. Despite the remarkable performance exhibited by previous MSA approaches, the presence of inherent multimodal heterogeneities poses a challenge, with the contribution of different modalities varying considerably. Past research predominantly focused on improving representation learning techniques and feature fusion strategies. However, many of these efforts overlooked the variation in semantic richness among different modalities, treating each modality uniformly. This approach may lead to underestimating the significance of strong modalities while overemphasizing the importance of weak ones. Motivated by these insights, we introduce a Text-oriented Cross-Attention Network (TCAN), emphasizing the predominant role of the text modality in MSA. Specifically, for each multimodal sample, by taking unaligned sequences of the three modalities as inputs, we initially allocate the extracted unimodal features into a visual-text and an acoustic-text pair. Subsequently, we implement self-attention on the text modality and apply text-queried cross-attention to the visual and acoustic modalities. To mitigate the influence of noise signals and redundant features, we incorporate a gated control mechanism into the framework. Additionally, we introduce unimodal joint learning to gain a deeper understanding of homogeneous emotional tendencies across diverse modalities through backpropagation. Experimental results demonstrate that TCAN consistently outperforms state-of-the-art MSA methods on two datasets (CMU-MOSI and CMU-MOSEI).
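The core mechanism the abstract describes, text tokens querying another modality's features, then gating the attended result before fusion, can be sketched in a few lines. This is a minimal single-head NumPy illustration, not the paper's implementation; the function names, the bias-free attention, and the additive residual fusion are assumptions made for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def text_queried_cross_attention(text, other):
    """Text tokens act as queries; the other modality supplies keys/values.

    text:  (T_text, d) textual features
    other: (T_other, d) visual or acoustic features (may be unaligned)
    returns: (T_text, d) other-modality features re-expressed per text token
    """
    d_k = text.shape[-1]
    scores = text @ other.T / np.sqrt(d_k)   # (T_text, T_other)
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ other                   # (T_text, d)

def gated_fusion(text, attended, W_g, b_g):
    """Sigmoid gate (an assumed form of the paper's gated control)
    down-weights noisy attended features before residual fusion."""
    z = np.concatenate([text, attended], axis=-1) @ W_g + b_g
    gate = 1.0 / (1.0 + np.exp(-z))          # elementwise in (0, 1)
    return text + gate * attended

# Toy example: 5 text tokens, 12 unaligned acoustic frames, dim 8.
rng = np.random.default_rng(0)
d = 8
text = rng.standard_normal((5, d))
audio = rng.standard_normal((12, d))
att = text_queried_cross_attention(text, audio)
W_g = rng.standard_normal((2 * d, d)) * 0.1
b_g = np.zeros(d)
fused = gated_fusion(text, att, W_g, b_g)
print(fused.shape)  # (5, 8)
```

The same text-queried pattern would be applied to the visual-text pair, with the two gated outputs concatenated with the self-attended text features for prediction.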
Problem

Research questions and friction points this paper is trying to address.

Address multimodal heterogeneity in sentiment analysis
Balance modality contributions by emphasizing text dominance
Reduce noise and redundancy in multimodal features
Innovation

Methods, ideas, or system contributions that make the work stand out.

Text-oriented cross-attention for multimodal fusion
Gated control reduces noise and redundant features
Unimodal joint learning captures emotional tendencies
Ming Zhou
School of Information Science and Technology, Donghua University, Shanghai 201620, China
Weize Quan
MAIS-CASIA
Image Processing · Computer Graphics · Deep Learning
Ziqi Zhou
MAIS, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China
Kai Wang
Univ. Grenoble Alpes, CNRS, Grenoble INP, GIPSA-lab, Grenoble 38000, France
Tong Wang
School of Information Science and Technology, Donghua University, Shanghai 201620, China
Dong Yan
AI Chief Expert, Bosch
Reinforcement Learning · Foundation Model