🤖 AI Summary
This work addresses two problems in multimodal sentiment analysis: missing or low-quality modalities shift feature distributions and destabilize sentiment decisions. To mitigate these issues, the authors propose a two-level reference alignment framework that introduces stable references at both the feature representation stage and the sentiment decision stage. At the first level, complete-modality samples guide representation learning so that different modality combinations map into a shared sentiment space. At the second level, a prototype retrieval and voting mechanism suppresses the influence of unreliable modalities, enforcing cross-modal consistency. Applying reference alignment at both levels is intended to improve robustness under missing modalities. On the CMU-MOSI and CMU-MOSEI datasets, the method reports state-of-the-art performance under full-modality settings, with accuracies of 86.28% and 85.88% and F1 scores of 86.24% and 85.86%, respectively, and consistent improvements across various missing-modality scenarios.
📝 Abstract
Multimodal sentiment analysis relies on textual, acoustic, and visual signals, yet real-world data often suffer from missing modalities and imbalanced modality quality. Existing methods generate features for the missing modalities from the available ones, but differences in expression mechanisms and sentiment dynamics across modalities may cause the generated features to deviate from the true distributions and mislead prediction. In addition, unreliable modalities may dominate fusion, resulting in representation shift across modality combinations and unstable sentiment representations. To address these challenges, we propose a two-level reference alignment framework. The framework introduces stable references at the feature representation and sentiment decision levels to improve robustness when modalities are missing. First-level reference alignment leverages complete-modality samples to constrain representations and align different modality combinations into a shared sentiment space. Second-level reference alignment enforces cross-modal consistency at the decision level by suppressing unreliable modalities through prototype retrieval and voting. As a result, the framework maintains stable and reliable sentiment predictions under diverse missing-modality patterns. Experiments on CMU-MOSI and CMU-MOSEI show consistent improvements across various missing-modality settings. Under full-modality input, the proposed method achieves state-of-the-art performance, with accuracy (ACC) of 86.28% and 85.88% and F1 of 86.24% and 85.86% on CMU-MOSI and CMU-MOSEI, respectively.
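The abstract does not give the exact formulation of either alignment level, so the following is only a minimal Python sketch of the general idea, not the authors' implementation. Every name in it (`alignment_loss`, `nearest_prototype`, `vote`, `min_similarity`) is hypothetical. It also assumes a mean-squared-error alignment term, cosine similarity for prototype retrieval, a fixed similarity threshold for deciding that a modality is unreliable, and similarity-weighted voting. None of these choices is confirmed by the source.

```python
import math
from collections import defaultdict


def alignment_loss(partial, reference):
    """First level (assumed form): mean squared distance between the
    representation of an incomplete input and a complete-modality reference."""
    return sum((p - r) ** 2 for p, r in zip(partial, reference)) / len(reference)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_prototype(feature, prototypes):
    """Retrieve the (label, similarity) of the closest sentiment prototype."""
    scored = ((label, cosine(feature, proto)) for label, proto in prototypes.items())
    return max(scored, key=lambda pair: pair[1])


def vote(modality_features, prototypes, min_similarity=0.8):
    """Second level (assumed form): each available modality votes for its
    nearest prototype; missing modalities (None) and modalities whose best
    match falls below min_similarity are skipped as unreliable."""
    tally = defaultdict(float)
    for feature in modality_features.values():
        if feature is None:
            continue  # modality is missing
        label, similarity = nearest_prototype(feature, prototypes)
        if similarity < min_similarity:
            continue  # treated as unreliable under the assumed threshold
        tally[label] += similarity  # similarity-weighted vote
    return max(tally, key=tally.get) if tally else None


# Toy 2-D prototypes and features, invented purely for illustration.
prototypes = {"positive": [1.0, 0.0], "negative": [0.0, 1.0]}
features = {"text": [0.9, 0.1], "audio": [0.2, 0.8], "vision": [0.8, 0.3]}
decision = vote(features, prototypes)
```

In this toy setup the text and vision features both retrieve the positive prototype and outvote the audio feature. Setting a modality to `None` simulates a missing input, and an ambiguous feature such as `[0.7, 0.7]` falls below the assumed threshold and is excluded from the vote.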