🤖 AI Summary
Existing multimodal sentiment analysis methods overlook signals shared exclusively by subsets of modalities, limiting the expressiveness and discriminative power of learned representations. To address this, the work proposes a tri-subspace disentanglement framework that explicitly decomposes features into three complementary subspaces: globally shared, pairwise modality-shared, and modality-private. Subspace independence is enforced through disentanglement supervision and structural regularization. A Subspace-Aware Cross-Attention (SACA) module is further introduced to enable fine-grained, adaptive fusion. The approach is claimed to be the first to model multi-granularity cross-modal affective cues, achieving state-of-the-art performance with an MAE of 0.691 on CMU-MOSI and an ACC-7 of 54.9% on CMU-MOSEI. The framework also transfers successfully to multimodal intent recognition tasks.
📝 Abstract
Multimodal Sentiment Analysis (MSA) integrates language, visual, and acoustic modalities to infer human sentiment. Most existing methods focus on either globally shared representations or modality-specific features, while overlooking signals that are shared only by certain modality pairs. This limits the expressiveness and discriminative power of multimodal representations. To address this limitation, we propose a Tri-Subspace Disentanglement (TSD) framework that explicitly factorizes features into three complementary subspaces: a common subspace capturing global consistency, pairwise-shared subspaces modeling cross-modal synergies between modality pairs, and private subspaces preserving modality-specific cues. To keep these subspaces pure and independent, we introduce a decoupling supervisor together with structured regularization losses. We further design a Subspace-Aware Cross-Attention (SACA) fusion module that adaptively models and integrates information from the three subspaces to obtain richer and more robust representations. Experiments on CMU-MOSI and CMU-MOSEI demonstrate that TSD achieves state-of-the-art performance across all key metrics, reaching 0.691 MAE on CMU-MOSI and 54.9% ACC-7 on CMU-MOSEI, and also transfers well to multimodal intent recognition tasks. Ablation studies confirm that tri-subspace disentanglement and SACA jointly enhance the modeling of multi-granular cross-modal sentiment cues.
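The tri-subspace idea described above can be illustrated with a minimal sketch. This is not the paper's implementation: the projector matrices, subspace dimensions, and the specific orthogonality penalty below are all assumptions standing in for whatever learned encoders and decoupling losses TSD actually uses. The sketch only shows the structural skeleton: each modality's feature is projected into a common part, one part per modality pair it belongs to, and a private part, with a soft independence penalty between subspaces.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_sub = 16, 8  # hypothetical input / subspace dimensions
modalities = ["language", "visual", "acoustic"]
pairs = [("language", "visual"), ("language", "acoustic"), ("visual", "acoustic")]

# Hypothetical linear projectors for each subspace; the paper presumably
# learns these jointly with the decoupling supervisor, but random matrices
# suffice to show the decomposition structure.
W_common = {m: rng.normal(size=(d_in, d_sub)) for m in modalities}
W_pair = {p: {m: rng.normal(size=(d_in, d_sub)) for m in p} for p in pairs}
W_private = {m: rng.normal(size=(d_in, d_sub)) for m in modalities}

def decompose(feats):
    """Split each modality's feature into common / pairwise-shared / private parts."""
    common = {m: feats[m] @ W_common[m] for m in modalities}
    shared = {p: {m: feats[m] @ W_pair[p][m] for m in p} for p in pairs}
    private = {m: feats[m] @ W_private[m] for m in modalities}
    return common, shared, private

def orthogonality_penalty(a, b):
    """Soft independence constraint: squared inner products between two subspaces."""
    return float(np.sum((a @ b.T) ** 2))

feats = {m: rng.normal(size=(1, d_in)) for m in modalities}
common, shared, private = decompose(feats)

# One plausible structural regularizer: push each modality's private part
# away from its common part (a TSD-style model would combine several such terms).
loss = sum(orthogonality_penalty(private[m], common[m]) for m in modalities)
print(f"orthogonality loss: {loss:.3f}")
```

SACA fusion would then attend across these three groups of representations rather than across raw modalities; the sketch stops short of that, since the attention layout is not specified in the abstract.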