GranAlign: Granularity-Aware Alignment Framework for Zero-Shot Video Moment Retrieval

📅 2026-01-02
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of imprecise temporal localization in zero-shot video moment retrieval, which arises from a mismatch in semantic granularity between textual queries and visual content. To resolve this issue, the authors propose a training-free, granularity-aware alignment framework that achieves precise cross-modal semantic alignment across multiple levels of granularity. The approach leverages multi-granularity query rewriting and query-aware video caption generation to bridge the semantic gap without requiring any model training. As the first zero-shot method to explicitly incorporate granularity awareness, this framework establishes new state-of-the-art results on three major benchmarks—QVHighlights, Charades-STA, and ActivityNet-Captions—with a notable 3.23% absolute improvement in mAP@avg on QVHighlights.

📝 Abstract
Zero-shot video moment retrieval (ZVMR) is the task of localizing a temporal moment within an untrimmed video from a natural language query, without relying on task-specific training data. The primary challenge in this setting is the mismatch in semantic granularity between textual queries and visual content. Previous ZVMR studies have attempted to achieve alignment by leveraging high-quality pre-trained knowledge that represents video and language in a joint space. However, these approaches fail to balance the semantic granularity of the representations each modality provides for a given scene; despite the high quality of each modality's representations, the granularity mismatch leads to inaccurate retrieval. In this paper, we propose a training-free framework, called Granularity-Aware Alignment (GranAlign), that bridges the gap between coarse and fine semantic representations. Our approach introduces two complementary techniques: granularity-based query rewriting, which generates queries at varied semantic granularities, and query-aware caption generation, which embeds query intent into video captions. By pairing multi-level queries with both query-agnostic and query-aware captions, we effectively resolve semantic mismatches. As a result, our method sets a new state of the art across all three major benchmarks (QVHighlights, Charades-STA, ActivityNet-Captions), with a notable 3.23% mAP@avg improvement on the challenging QVHighlights dataset.
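The abstract's pipeline can be sketched in miniature: rewrite the query at several granularities, describe each video segment with multiple captions, and align every query variant with every caption, keeping the best match per segment. The sketch below is purely illustrative and assumes toy stand-ins everywhere: `embed` is a hypothetical bag-of-words encoder (the paper relies on pre-trained vision-language models), `rewrite_queries` crudely mimics LLM-based multi-granularity rewriting, and the caption lists are hand-written placeholders for query-agnostic and query-aware captioner outputs.

```python
# Illustrative, training-free retrieval loop in the spirit of GranAlign.
# All names and the embedding scheme are hypothetical stand-ins.
from collections import Counter
import math

def embed(text):
    # Toy bag-of-words "embedding"; a real system would use a
    # pre-trained text encoder shared across queries and captions.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two word-count vectors.
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rewrite_queries(query):
    # Stand-in for LLM-based rewriting: a coarse variant (truncated
    # query) and a fine variant (the full query).
    words = query.split()
    return [" ".join(words[:3]), query]

def score_segment(captions, queries):
    # Align every query granularity with every caption for this
    # segment; the best pairing decides the segment score.
    return max(cosine(embed(q), embed(c)) for q in queries for c in captions)

def retrieve_moment(segment_captions, query):
    # Return the index of the segment whose captions best match
    # any granularity of the rewritten query.
    queries = rewrite_queries(query)
    scores = [score_segment(caps, queries) for caps in segment_captions]
    return max(range(len(scores)), key=scores.__getitem__)

# Each segment carries a query-agnostic and a (mocked) query-aware caption.
segments = [
    ["a man walks into a kitchen", "man enters the kitchen area"],
    ["a man chops vegetables on a board", "man chopping carrots for dinner"],
    ["a man washes dishes at the sink", "man cleaning plates"],
]
print(retrieve_moment(segments, "a man chops vegetables for dinner"))  # → 1
```

The max-over-pairs scoring is the key design point: a coarse query variant can match a terse caption while the fine variant matches a detailed one, so no single granularity has to fit all segments.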
Problem

Research questions and friction points this paper is trying to address.

Zero-shot video moment retrieval
semantic granularity mismatch
video-language alignment
natural language query
untrimmed video
Innovation

Methods, ideas, or system contributions that make the work stand out.

zero-shot video moment retrieval
semantic granularity alignment
query rewriting
query-aware captioning
training-free framework
Mingyu Jeon
Department of Artificial Intelligence, Chung-Ang University
Sunjae Yoon
KAIST
Deep Learning · Computer Vision · Generative AI
Jonghee Kim
Electronics and Telecommunications Research Institute
Junyeoung Kim
Department of Artificial Intelligence, Chung-Ang University