🤖 AI Summary
Existing video-language models are constrained by predefined text templates, limiting their ability to handle semantically equivalent yet linguistically diverse open-ended language inputs in real-world scenarios. This work proposes a plug-and-play general framework that rethinks video-language alignment from the perspective of language input. By generating positive and negative textual samples, incorporating an attribute-level semantic reasoning mechanism, and introducing a video-guided self-weighted cross-modal loss, the framework effectively models arbitrary textual formulations without modifying the backbone architecture. The approach consistently enhances the performance of multiple state-of-the-art video-language models on open-vocabulary understanding tasks, demonstrating its versatility and effectiveness.
📝 Abstract
Driven by the wave of large language models, Video-Language Models (VLMs) have become an important yet challenging technology for bridging the gap between videos and texts. Although previous VLM works have made significant progress, almost all of them implicitly assume that every text input follows a predefined template. In real-world applications, such a strict assumption is impossible to satisfy because 1) predefining all the texts is extremely time-consuming and labor-intensive, and 2) predefined text inputs are too restrictive and user-unfriendly, which limits their applications. We observe that, for a given video input, texts with similar semantics but different templates lead to varying performance. To this end, we propose a novel plug-and-play framework that helps various VLM-based methods fully bridge videos and texts. Specifically, we first generate positive and negative texts from the original ones, each targeting specific text components. Then, we propose an attribute-based text reasoning strategy to mine the fine-grained textual semantics of the generated texts. Finally, we use videos as guidance to conduct cross-modal bridging by designing a self-weighted loss. Extensive experiments show that the proposed method can serve as a plug-and-play module that effectively improves the performance of state-of-the-art VLMs.
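The abstract does not give the form of the self-weighted loss, so the following is only an illustrative sketch of how a video embedding might guide the weighting of generated positive texts against generated negative ones. Everything here is an assumption made for illustration: the function names (`cosine`, `softmax`, `self_weighted_loss`), the temperature `tau`, the softmax weighting over video-text similarities, and the contrastive form of the loss are hypothetical and are not taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_weighted_loss(video, positives, negatives, tau=0.1):
    """Hypothetical video-guided, self-weighted contrastive loss.

    Assumption: each positive text is weighted by a softmax over its own
    similarity to the video, so the weights come from the data itself
    rather than from a hand-tuned schedule.
    """
    pos_logits = [cosine(video, p) / tau for p in positives]
    neg_logits = [cosine(video, n) / tau for n in negatives]

    # Weights derived from the video-text similarities themselves.
    weights = softmax(pos_logits)

    # Shared denominator mass contributed by all negative texts.
    neg_mass = sum(math.exp(s) for s in neg_logits)

    loss = 0.0
    for w, s in zip(weights, pos_logits):
        # Weighted InfoNCE-style term: one positive against all negatives.
        loss += -w * math.log(math.exp(s) / (math.exp(s) + neg_mass))
    return loss, weights


# Toy 2-D embeddings standing in for encoder outputs (illustrative values).
video = [1.0, 0.0]
positives = [[1.0, 0.0], [0.8, 0.6]]    # paraphrases of the original text
negatives = [[0.0, 1.0], [-1.0, 0.0]]   # texts with an altered component
loss, weights = self_weighted_loss(video, positives, negatives)
```

In this sketch, the positive text that is closer to the video receives the larger weight, and the loss grows when negative texts are more similar to the video than positive ones. Because it operates only on embeddings, a loss of this shape could in principle sit on top of an existing backbone without changing its architecture, which is the plug-and-play property the abstract emphasizes.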