🤖 AI Summary
To address the significant performance degradation that multimodal sentiment analysis suffers when the audio modality is missing, this paper proposes a knowledge-transfer-enhanced robust modeling framework. Methodologically, it introduces: (1) a novel cross-modal knowledge transfer network that reconstructs missing audio representations jointly from the visual and language modalities; (2) a cross-modal attention mechanism designed to preserve maximal information, enabling adaptive fusion of reconstructed and observed modalities without full multimodal supervision; and (3) an implicit feature alignment strategy to enhance inter-modal consistency. Evaluated on three benchmark datasets (CMU-MOSEI, IEMOCAP, and RAVDESS), the proposed method achieves substantial improvements over existing baselines under audio-missing conditions, approaching the performance of models trained with complete multimodal supervision. It effectively mitigates the performance deterioration induced by modality absence while maintaining robustness and generalizability.
📝 Abstract
Multimodal sentiment analysis aims to identify the emotions expressed by individuals through visual, language, and acoustic cues. However, most existing research assumes that all modalities are available during both training and testing, making such algorithms susceptible to missing-modality scenarios. In this paper, we propose a novel knowledge-transfer network that translates between modalities to reconstruct the missing audio modality. Moreover, we develop a cross-modality attention mechanism to retain maximal information from the reconstructed and observed modalities for sentiment prediction. Extensive experiments on three publicly available datasets demonstrate significant improvements over baselines and results comparable to previous methods trained with complete multi-modality supervision.
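To make the reconstruction idea concrete, the sketch below shows one plausible way to use cross-modal attention to recover a missing audio representation from the observed text and visual features. It is a minimal illustration with assumed module names, feature dimensions, and PyTorch as the framework; it is not the authors' released implementation.

```python
# Minimal sketch (hypothetical shapes and names): reconstruct a missing audio
# representation by attending over the observed text and visual features.
import torch
import torch.nn as nn

class CrossModalReconstructor(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        # Learned query standing in for the absent audio token
        self.audio_query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, text_feats, visual_feats):
        # text_feats, visual_feats: (batch, seq_len, dim)
        context = torch.cat([text_feats, visual_feats], dim=1)   # (B, 2*L, dim)
        q = self.audio_query.expand(context.size(0), -1, -1)     # (B, 1, dim)
        # Cross-modal attention: query the observed modalities
        audio_hat, _ = self.attn(q, context, context)
        return self.proj(audio_hat)                               # reconstructed audio feature

# Usage with dummy inputs (hypothetical dimensions)
recon = CrossModalReconstructor()
audio_hat = recon(torch.randn(8, 20, 128), torch.randn(8, 20, 128))
print(audio_hat.shape)  # torch.Size([8, 1, 128])
```

In the paper's framework, a reconstructed feature of this kind would then be fused with the observed text and visual representations (via the proposed cross-modality attention) before sentiment prediction.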