🤖 AI Summary
Existing multimodal sentiment analysis methods often neglect intra-modal structural dependencies and inter-modal semantic misalignment, yielding representations with weak discriminability, poor interpretability, and limited robustness. To address these issues, we propose the Structural-Semantic Unifier (SSU) framework. SSU jointly models modality-specific structure (e.g., syntactic dependency graphs for text) and introduces text-guided cross-modal semantic anchors that enable fine-grained semantic interaction and alignment across heterogeneous embedding spaces. It further integrates a lightweight attention mechanism with multi-view contrastive learning to jointly optimize structural and semantic fusion. Evaluated on CMU-MOSI and CMU-MOSEI, SSU achieves state-of-the-art performance while significantly reducing computational overhead. Ablation studies further confirm consistent gains in robustness against input perturbations, and attention visualization together with anchor-based alignment analysis demonstrates improved interpretability.
📝 Abstract
Multimodal sentiment analysis (MSA) aims to infer emotional states by effectively integrating textual, acoustic, and visual modalities. Despite notable progress, existing multimodal fusion methods often neglect modality-specific structural dependencies and fail to resolve cross-modal semantic misalignment, limiting the quality, interpretability, and robustness of the learned representations. To address these challenges, we propose a novel framework called the Structural-Semantic Unifier (SSU), which systematically integrates modality-specific structural information and cross-modal semantic grounding for enhanced multimodal representations. Specifically, SSU dynamically constructs modality-specific graphs by leveraging linguistic syntax for text and a lightweight, text-guided attention mechanism for the acoustic and visual modalities, thus capturing detailed intra-modal relationships and semantic interactions. We further introduce a semantic anchor, derived from global textual semantics, that serves as a cross-modal alignment hub, effectively harmonizing heterogeneous semantic spaces across modalities. Additionally, we develop a multi-view contrastive learning objective that promotes discriminability, semantic consistency, and structural coherence across intra- and inter-modal views. Extensive evaluations on two widely used benchmark datasets, CMU-MOSI and CMU-MOSEI, demonstrate that SSU consistently achieves state-of-the-art performance while significantly reducing computational overhead compared to prior methods. Comprehensive qualitative analyses further validate SSU's interpretability and its ability to capture nuanced emotional patterns through semantically grounded interactions.
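To make the modality-specific graph construction concrete, below is a minimal sketch of building a syntactic dependency graph for the text modality as a token-level adjacency matrix. The use of spaCy, the undirected edges, and the added self-loops are illustrative assumptions, not necessarily the paper's exact construction.

```python
# Hedged sketch: syntactic dependency graph for the text modality.
# spaCy's parser and the symmetric, self-loop-augmented adjacency are
# assumed choices for illustration only.
import spacy
import torch

nlp = spacy.load("en_core_web_sm")

def dependency_adjacency(sentence: str) -> torch.Tensor:
    """Return an (n_tokens, n_tokens) adjacency matrix from the parse."""
    doc = nlp(sentence)
    n = len(doc)
    adj = torch.eye(n)  # self-loops so each node keeps its own features
    for tok in doc:
        if tok.i != tok.head.i:           # the root points to itself; skip it
            adj[tok.i, tok.head.i] = 1.0  # edge from dependent to head
            adj[tok.head.i, tok.i] = 1.0  # mirrored edge (undirected graph)
    return adj
```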
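The text-derived semantic anchor can be illustrated as follows: a global textual summary (here obtained by mean pooling, an assumed pooling choice) queries each non-text modality through a lightweight single-head attention layer, pulling acoustic and visual tokens toward a shared semantic space. All dimensions, module names, and the single-head design are hypothetical placeholders, not the paper's specification; the default feature sizes merely echo common CMU-MOSI/MOSEI preprocessing.

```python
# Hedged sketch: text-guided semantic anchor as a cross-modal alignment hub.
import torch
import torch.nn as nn

class SemanticAnchorAlignment(nn.Module):
    def __init__(self, d_text=768, d_audio=74, d_visual=35, d_model=128):
        super().__init__()
        # Project every modality into one shared space before alignment.
        self.proj_t = nn.Linear(d_text, d_model)
        self.proj_a = nn.Linear(d_audio, d_model)
        self.proj_v = nn.Linear(d_visual, d_model)
        # Lightweight attention: the anchor is the query over each modality.
        self.attn = nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)

    def forward(self, text, audio, visual):
        # text: (B, Lt, d_text); audio: (B, La, d_audio); visual: (B, Lv, d_visual)
        t = self.proj_t(text)
        # Global textual semantics -> anchor of shape (B, 1, d_model).
        anchor = t.mean(dim=1, keepdim=True)
        aligned = []
        for m in (self.proj_a(audio), self.proj_v(visual)):
            # The anchor attends over the modality sequence, summarizing it
            # in a way that is grounded in the textual semantic space.
            out, _ = self.attn(query=anchor, key=m, value=m)
            aligned.append(out.squeeze(1))
        # Anchor (B, d_model) plus one aligned summary per non-text modality.
        return anchor.squeeze(1), aligned
```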
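Finally, the multi-view contrastive objective can be sketched as an InfoNCE-style loss over paired views, where the same sample index across two views is the positive and all other rows in the batch are negatives. This generic formulation is assumed for illustration; the paper's exact view construction, pairing, and weighting may differ.

```python
# Hedged sketch: InfoNCE-style multi-view contrastive objective.
import torch
import torch.nn.functional as F

def multiview_info_nce(views, temperature=0.1):
    """Contrastive loss over K views of one batch.

    views: list of (B, d) tensors; row i in every view is the same sample.
    Matching indices across views are positives; other rows are negatives.
    """
    loss, pairs = 0.0, 0
    for i in range(len(views)):
        for j in range(len(views)):
            if i == j:
                continue
            zi = F.normalize(views[i], dim=-1)
            zj = F.normalize(views[j], dim=-1)
            logits = zi @ zj.t() / temperature            # (B, B) similarities
            labels = torch.arange(zi.size(0), device=zi.device)
            loss = loss + F.cross_entropy(logits, labels)  # diagonal = positives
            pairs += 1
    return loss / pairs

# Example usage with three hypothetical views, e.g. the text anchor and the
# anchor-aligned acoustic and visual summaries from the sketch above:
#   loss = multiview_info_nce([z_text, z_audio, z_visual])
```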