🤖 AI Summary
This work addresses the performance degradation of conventional models in real-world multimodal scenarios due to missing modalities. To tackle this challenge, the authors propose the CyIN framework, which jointly optimizes learning from both complete and incomplete multimodal data within a unified architecture. CyIN integrates a cross-modal cyclic information bottleneck mechanism with a bidirectional cross-modal reconstruction strategy, combining token-level and label-level variational information bottlenecks, cyclic cross-modal translation, and latent variable reconstruction. This approach effectively distills task-relevant features and imputes missing modalities. Extensive experiments on four standard multimodal benchmarks demonstrate that CyIN consistently outperforms state-of-the-art methods across various modality-completion settings, including fully observed and multiple missing-modality conditions.
📝 Abstract
Multimodal machine learning, which mimics the human brain's ability to integrate various modalities, has seen rapid growth. Most previous multimodal models are trained on perfectly paired multimodal input to reach optimal performance. In real-world deployments, however, modality availability is highly variable and unpredictable, causing pre-trained models to suffer significant performance drops and lose robustness under dynamic missing-modality conditions. In this paper, we present a novel Cyclic INformative Learning framework (CyIN) to bridge the gap between complete and incomplete multimodal learning. Specifically, we first build an informative latent space by applying token- and label-level Information Bottlenecks (IB) cyclically across modalities. By capturing task-related features with variational approximation, the informative bottleneck latents are purified for more efficient cross-modal interaction and multimodal fusion. Moreover, to supplement the information lost under incomplete multimodal input, we propose cross-modal cyclic translation, which reconstructs the missing modalities from the remaining ones through a forward and reverse propagation process. With the extracted and reconstructed informative latents, CyIN jointly optimizes complete and incomplete multimodal learning in one unified model. Extensive experiments on 4 multimodal datasets demonstrate the superior performance of our method in both complete and diverse incomplete scenarios.
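The abstract's two core ingredients can be illustrated with a minimal sketch: a variational information bottleneck that maps each modality's features to a Gaussian latent (with a KL regularizer toward a standard normal prior), and a cyclic translation loss that maps one modality's latent to another and back to reconstruct what a missing modality would provide. All names, dimensions, and the use of plain linear maps here are illustrative assumptions for exposition, not details from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def vib_encode(x, W_mu, W_logvar):
    """Variational bottleneck: map features x to a Gaussian latent z.
    Returns a sampled latent and the KL(q(z|x) || N(0, I)) penalty."""
    mu = x @ W_mu
    logvar = x @ W_logvar
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * logvar) * eps  # reparameterization trick
    # Closed-form KL divergence to a standard normal prior, batch-averaged
    kl = 0.5 * np.mean(np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1))
    return z, kl

def cyclic_translation_loss(z_src, z_tgt, W_fwd, W_rev):
    """Cycle: translate source latent -> target latent -> back to source.
    At test time, W_fwd could impute a missing target modality's latent."""
    z_tgt_hat = z_src @ W_fwd        # forward translation (impute target)
    z_src_back = z_tgt_hat @ W_rev   # reverse translation (close the cycle)
    forward = np.mean((z_tgt_hat - z_tgt) ** 2)
    cycle = np.mean((z_src_back - z_src) ** 2)
    return forward + cycle

# Toy example with two hypothetical modalities (text, audio)
d_in, d_z, batch = 16, 8, 4
x_text = rng.standard_normal((batch, d_in))
x_audio = rng.standard_normal((batch, d_in))
W = lambda i, o: 0.1 * rng.standard_normal((i, o))

z_t, kl_t = vib_encode(x_text, W(d_in, d_z), W(d_in, d_z))
z_a, kl_a = vib_encode(x_audio, W(d_in, d_z), W(d_in, d_z))
loss = kl_t + kl_a + cyclic_translation_loss(z_t, z_a, W(d_z, d_z), W(d_z, d_z))
print(float(loss))
```

In a full model, the translation maps would be trained networks and the total objective would also include the task (label-level) loss; this toy only shows how the KL and cyclic-reconstruction terms fit together.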