Repeating Words for Video-Language Retrieval with Coarse-to-Fine Objectives

📅 2025-08-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address insufficient fine-grained alignment and excessive training costs in video–language retrieval, this paper proposes a “coarse-to-fine” multi-level learning framework. Methodologically: (1) we design a granularity-aware representation module that jointly optimizes contrastive and matching objectives to explicitly model hierarchical semantic correspondences between videos and texts; (2) we introduce a keyword repetition mechanism—requiring no additional training—and a matching-entropy-guided voting inference strategy to strengthen fine-grained alignment. Our key contribution lies in the first integration of granularity-aware modeling with lightweight inference mechanisms. Evaluated on four standard benchmarks—including MSR-VTT and DiDeMo—our approach achieves significant improvements: Recall@1 increases by 2.1% on MSR-VTT and 1.6% on DiDeMo, demonstrating superior accuracy while maintaining low computational overhead.

📝 Abstract
The explosive growth of video streaming makes it challenging to achieve both high accuracy and low training cost in video-language retrieval. Existing methods rely on large-scale pre-training to improve video retrieval performance, resulting in significant computational demands, and the fine-grained information in videos and texts remains underexplored. To alleviate these problems, we propose a novel framework that learns fine-grained features for better alignment and introduce an inference pipeline that improves performance without additional training. Specifically, we employ coarse-to-fine objectives, including contrastive and matching learning, to understand the semantic information of video-text pairs. The fine-grained data used for training is obtained through the Granularity-Aware Representation module, which is designed based on similarity analysis between video frames and words in captions. Furthermore, we observe that repeating keywords in the original captions, referred to as "Repetition", can enhance retrieval performance and improve alignment between video and text. Based on this insight, we propose a novel and effective inference pipeline that incorporates a voting mechanism and a new Matching Entropy metric to achieve better retrieval performance without requiring additional pre-training. Experimental results on four benchmarks demonstrate that the proposed method outperforms previous approaches. Additionally, our inference pipeline achieves significant performance improvements, with a 2.1% increase in Recall@1 on the MSR-VTT dataset and a 1.6% increase on the DiDeMo dataset.
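The abstract describes the training-free inference pipeline only at a high level. The sketch below is one plausible reading, not the paper's implementation: it assumes Matching Entropy is the Shannon entropy of the softmax-normalized similarity distribution over candidate videos, and that the voting mechanism fuses the rankings of the original caption and its keyword-repeated variants with weights inversely related to that entropy. The keyword-selection step, repetition counts, temperature, and all function names here are assumptions.

```python
import numpy as np

def matching_entropy(similarities, temperature=0.01):
    """Shannon entropy of the softmax-normalized similarity distribution
    over candidate videos; low entropy = a confident, peaked match."""
    logits = np.asarray(similarities, dtype=np.float64) / temperature
    logits -= logits.max()  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    probs = np.clip(probs, 1e-12, 1.0)
    return -np.sum(probs * np.log(probs))

def repeat_keywords(caption, keywords, times=2):
    """Build an augmented query ("Repetition") by repeating keyword tokens."""
    out = []
    for tok in caption.split():
        out.append(tok)
        if tok.lower().strip(".,") in keywords:
            out.extend([tok] * (times - 1))
    return " ".join(out)

def vote_retrieval(score_fn, caption, keywords):
    """Score the original caption and its repeated variants against all
    candidate videos, then fuse the scores with entropy-based weights."""
    queries = [caption] + [repeat_keywords(caption, keywords, t) for t in (2, 3)]
    sims = np.stack([score_fn(q) for q in queries])  # (num_queries, num_videos)
    weights = np.array([1.0 / (matching_entropy(s) + 1e-6) for s in sims])
    weights /= weights.sum()
    fused = (weights[:, None] * sims).sum(axis=0)
    return int(fused.argmax()), fused
```

Here `score_fn` stands in for the trained model's text-to-video similarity head; because no training is involved, the pipeline can wrap any existing retrieval model at inference time.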
Problem

Research questions and friction points this paper is trying to address.

Achieving high accuracy video-language retrieval with low training costs
Addressing underexplored fine-grained information in videos and texts
Reducing computational demands of large-scale pre-training methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Coarse-to-fine objectives for semantic understanding
Granularity-Aware Representation module for fine-grained data
Inference pipeline with voting mechanism and Matching Entropy
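The coarse-to-fine objectives listed above combine contrastive and matching learning. The paper does not spell out the exact losses here, so the following is a minimal sketch under common conventions: a symmetric InfoNCE contrastive loss as the coarse video-text alignment objective, plus a binary cross-entropy matching loss on pair-level match scores as the finer objective. The function names, the temperature, and the weighting factor `alpha` are assumptions for illustration.

```python
import numpy as np

def info_nce(video_emb, text_emb, temperature=0.05):
    """Symmetric contrastive (InfoNCE) loss over a batch of paired
    video/text embeddings; diagonal entries are the positive pairs."""
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature  # (B, B) cosine similarities

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_p))

    # average the video->text and text->video directions
    return 0.5 * (xent(logits) + xent(logits.T))

def matching_bce(match_scores, labels):
    """Fine-grained matching loss: binary cross-entropy on the model's
    pair-level match logits (1 = matched pair, 0 = mismatched)."""
    p = 1.0 / (1.0 + np.exp(-np.asarray(match_scores, dtype=np.float64)))
    y = np.asarray(labels, dtype=np.float64)
    p = np.clip(p, 1e-7, 1 - 1e-7)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def coarse_to_fine_loss(video_emb, text_emb, match_scores, labels, alpha=1.0):
    """Total objective: coarse contrastive alignment + fine matching."""
    return info_nce(video_emb, text_emb) + alpha * matching_bce(match_scores, labels)
```

In the paper's framework, the fine-grained pairs fed to the matching term would come from the Granularity-Aware Representation module (frame-word similarity analysis); the sketch treats them as given inputs.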