🤖 AI Summary
TVR suffers from cross-modal information asymmetry between rich video content and fragmented textual descriptions. To address this, we propose a data-driven framework that enhances textual representations at their source. First, we introduce event-level video segmentation annotations to improve the granularity of textual coverage. Second, we design LLM-based prompt engineering to generate semantically diverse, complementary query variants, followed by a learnable diversity evaluation and optimal subset selection mechanism that filters them intelligently. Our approach is the first to jointly integrate fine-grained event segmentation, LLM-driven query generation, and structured query selection, thereby bridging the semantic gap at the data level. Evaluated on major TVR benchmarks, including MSR-VTT, MSVD, and ActivityNet, our method achieves state-of-the-art performance, significantly improving Recall@1 and retrieval efficiency. These results empirically validate the effectiveness and scalability of a “data-centric” paradigm for cross-modal retrieval.
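The summary does not spell out how the diversity evaluation and subset selection operate. As a minimal illustration (not the paper's learnable mechanism), the selection step can be sketched as a greedy maximal-marginal-relevance pass over embeddings of the candidate query variants: keep queries that stay relevant to the original query while penalizing near-duplicates. The function name, the `lam` trade-off parameter, and the use of cosine similarity are all assumptions made for this sketch.

```python
import numpy as np

def select_diverse_queries(query_embs, k, lam=0.5):
    """Greedy MMR-style selection of k query variants (illustrative sketch).

    query_embs: (n, d) L2-normalized embeddings of candidate queries;
    query_embs[0] is assumed to be the original query and is always kept.
    lam trades off relevance to the original query against redundancy
    with already-selected variants. Returns the selected indices in order.
    """
    n = query_embs.shape[0]
    sim = query_embs @ query_embs.T               # cosine similarities (normalized inputs)
    selected = [0]                                # always keep the original query
    candidates = set(range(1, n))
    while len(selected) < min(k, n):
        best, best_score = None, -np.inf
        for c in candidates:
            relevance = sim[c, 0]                 # stay on-topic w.r.t. the original query
            redundancy = max(sim[c, s] for s in selected)  # penalize near-duplicate variants
            score = lam * relevance - (1 - lam) * redundancy
            if score > best_score:
                best, best_score = c, score
        selected.append(best)
        candidates.remove(best)
    return selected
```

With a small `lam`, an exact duplicate of the original query scores poorly (high redundancy, no diversity gain) and is skipped in favor of variants that cover different semantics; the actual paper replaces this fixed heuristic with a learned diversity evaluator.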
📝 Abstract
As online video content rapidly grows, the task of text-video retrieval (TVR) becomes increasingly important. A key challenge in TVR is the information asymmetry between video and text: videos are inherently richer in information, while their textual descriptions often capture only fragments of this complexity. This paper introduces a novel, data-centric framework to bridge this gap by enriching textual representations to better match the richness of video content. During training, videos are segmented into event-level clips and captioned to ensure comprehensive coverage. During retrieval, a large language model (LLM) generates semantically diverse queries to capture a broader range of possible matches. To enhance retrieval efficiency, we propose a query selection mechanism that identifies the most relevant and diverse queries, reducing computational cost while improving accuracy. Our method achieves state-of-the-art results across multiple benchmarks, demonstrating the power of data-centric approaches in addressing information asymmetry in TVR. This work paves the way for new research focused on leveraging data to improve cross-modal retrieval.
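To make the multi-query retrieval step concrete: once the LLM has produced a set of diverse queries, a simple way to score a video against that set is to max-pool its similarity over all query variants, so a video surfaces if any single variant matches it well. This aggregation rule and the function below are an illustrative assumption, not necessarily the scoring used in the paper.

```python
import numpy as np

def rank_videos(query_embs, video_embs):
    """Rank videos against a multi-query text representation (illustrative sketch).

    query_embs: (q, d) embeddings of the original query plus its generated
    variants; video_embs: (v, d) video embeddings; both assumed L2-normalized.
    Each video is scored by its best-matching query variant (max-pooling),
    and videos are returned from best to worst.
    """
    sims = query_embs @ video_embs.T        # (q, v) query-video cosine similarities
    scores = sims.max(axis=0)               # best-matching variant per video
    return list(np.argsort(-scores))        # indices sorted by descending score
```

Max-pooling over variants is what lets a semantically diverse query set capture "a broader range of possible matches": a video that the original query misses can still be retrieved through one of the variants, whereas averaging would dilute such a match.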