🤖 AI Summary
In real-world deployments, video-language models often encounter missing modalities (for example, when cameras are disabled because of privacy constraints), which creates a distribution shift between training and testing conditions and undermines model reliability. To address this, the work proposes what the authors describe as the first unified framework for incomplete video-language modeling, built around a plug-and-play multimodal fusion module that handles inputs with arbitrary missing modalities and integrates into existing architectures. The approach aims to improve the robustness and trustworthiness of video-language models under missing-modality conditions, and the reported experiments show improved performance and stability across multiple multimodal tasks.
📝 Abstract
Video-Language Models (VLMs) have demonstrated impressive multi-modal reasoning capabilities across diverse computer vision applications. However, these VLMs are task-specific and assume that both video and language inputs are complete. In real-world applications, sensors may be deactivated (e.g., cameras are unavailable due to data privacy), which yields modality-incomplete data and causes inconsistency between training and testing data. While naively feeding incomplete inputs can boost generalization ability or lead to training failure, the potential risks to the safety and trustworthiness of VLMs have been largely neglected. To this end, we make the first attempt to propose a unified incomplete video-language model that processes incomplete multi-modal inputs. Extensive experimental results show that our method can serve as a plug-and-play module for previous works, improving their performance in various multi-modal tasks.