🤖 AI Summary
Existing multimodal sentiment analysis methods often neglect intra-modal structural dependencies and inter-modal semantic misalignment, yielding representations with weak discriminability, poor interpretability, and limited robustness. To address these issues, we propose the Structural-Semantic Unifier (SSU) framework. SSU jointly models modality-specific structure (e.g., syntactic dependency graphs for text) and introduces text-guided cross-modal semantic anchors that enable fine-grained semantic interaction and alignment across heterogeneous embedding spaces. It further integrates a lightweight attention mechanism with multi-view contrastive learning to jointly optimize structural and semantic fusion. Evaluated on CMU-MOSI and CMU-MOSEI, SSU achieves state-of-the-art performance while significantly reducing computational overhead. Ablation studies further confirm consistent gains in robustness against input perturbations, and attention visualization together with anchor-based alignment analysis demonstrates improved interpretability.
📝 Abstract
Multimodal sentiment analysis (MSA) aims to infer emotional states by effectively integrating textual, acoustic, and visual modalities. Despite notable progress, existing multimodal fusion methods often neglect modality-specific structural dependencies and fail to resolve cross-modal semantic misalignment, limiting the quality, interpretability, and robustness of the learned representations. To address these challenges, we propose a novel framework called the Structural-Semantic Unifier (SSU), which systematically integrates modality-specific structural information and cross-modal semantic grounding for enhanced multimodal representations. Specifically, SSU dynamically constructs modality-specific graphs by leveraging linguistic syntax for text and a lightweight, text-guided attention mechanism for the acoustic and visual modalities, thus capturing detailed intra-modal relationships and semantic interactions. We further introduce a semantic anchor, derived from global textual semantics, that serves as a cross-modal alignment hub, effectively harmonizing heterogeneous semantic spaces across modalities. Additionally, we develop a multi-view contrastive learning objective that promotes discriminability, semantic consistency, and structural coherence across intra- and inter-modal views. Extensive evaluations on two widely used benchmark datasets, CMU-MOSI and CMU-MOSEI, demonstrate that SSU consistently achieves state-of-the-art performance while significantly reducing computational overhead compared to prior methods. Comprehensive qualitative analyses further validate SSU's interpretability and its ability to capture nuanced emotional patterns through semantically grounded interactions.
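To make the modality-specific graph construction concrete, below is a minimal sketch of building a syntactic dependency graph for the text modality as a token-level adjacency matrix. The use of spaCy, the undirected edges, and the added self-loops are illustrative assumptions, not necessarily the paper's exact construction.

```python
# Hedged sketch: syntactic dependency graph for the text modality.
# spaCy's parser and the symmetric, self-loop-augmented adjacency are
# assumed choices for illustration only.
import spacy
import torch

nlp = spacy.load("en_core_web_sm")

def dependency_adjacency(sentence: str) -> torch.Tensor:
    """Return an (n_tokens, n_tokens) adjacency matrix from the parse."""
    doc = nlp(sentence)
    n = len(doc)
    adj = torch.eye(n)  # self-loops so each node keeps its own features
    for tok in doc:
        if tok.i != tok.head.i:           # the root points to itself; skip it
            adj[tok.i, tok.head.i] = 1.0  # edge from dependent to head
            adj[tok.head.i, tok.i] = 1.0  # mirrored edge (undirected graph)
    return adj
```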
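The text-derived semantic anchor can be illustrated as follows: a global textual summary (here obtained by mean pooling, an assumed pooling choice) queries each non-text modality through a lightweight single-head attention layer, pulling acoustic and visual tokens toward a shared semantic space. All dimensions, module names, and the single-head design are hypothetical placeholders, not the paper's specification; the default feature sizes merely echo common CMU-MOSI/MOSEI preprocessing.

```python
# Hedged sketch: text-guided semantic anchor as a cross-modal alignment hub.
import torch
import torch.nn as nn

class SemanticAnchorAlignment(nn.Module):
    def __init__(self, d_text=768, d_audio=74, d_visual=35, d_model=128):
        super().__init__()
        # Project every modality into one shared space before alignment.
        self.proj_t = nn.Linear(d_text, d_model)
        self.proj_a = nn.Linear(d_audio, d_model)
        self.proj_v = nn.Linear(d_visual, d_model)
        # Lightweight attention: the anchor is the query over each modality.
        self.attn = nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)

    def forward(self, text, audio, visual):
        # text: (B, Lt, d_text); audio: (B, La, d_audio); visual: (B, Lv, d_visual)
        t = self.proj_t(text)
        # Global textual semantics -> anchor of shape (B, 1, d_model).
        anchor = t.mean(dim=1, keepdim=True)
        aligned = []
        for m in (self.proj_a(audio), self.proj_v(visual)):
            # The anchor attends over the modality sequence, summarizing it
            # in a way that is grounded in the textual semantic space.
            out, _ = self.attn(query=anchor, key=m, value=m)
            aligned.append(out.squeeze(1))
        # Anchor (B, d_model) plus one aligned summary per non-text modality.
        return anchor.squeeze(1), aligned
```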
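Finally, the multi-view contrastive objective can be sketched as an InfoNCE-style loss over paired views, where the same sample index across two views is the positive and all other rows in the batch are negatives. This generic formulation is assumed for illustration; the paper's exact view construction, pairing, and weighting may differ.

```python
# Hedged sketch: InfoNCE-style multi-view contrastive objective.
import torch
import torch.nn.functional as F

def multiview_info_nce(views, temperature=0.1):
    """Contrastive loss over K views of one batch.

    views: list of (B, d) tensors; row i in every view is the same sample.
    Matching indices across views are positives; other rows are negatives.
    """
    loss, pairs = 0.0, 0
    for i in range(len(views)):
        for j in range(len(views)):
            if i == j:
                continue
            zi = F.normalize(views[i], dim=-1)
            zj = F.normalize(views[j], dim=-1)
            logits = zi @ zj.t() / temperature            # (B, B) similarities
            labels = torch.arange(zi.size(0), device=zi.device)
            loss = loss + F.cross_entropy(logits, labels)  # diagonal = positives
            pairs += 1
    return loss / pairs

# Example usage with three hypothetical views, e.g. the text anchor and the
# anchor-aligned acoustic and visual summaries from the sketch above:
#   loss = multiview_info_nce([z_text, z_audio, z_visual])
```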