MarkIt: Training-Free Visual Markers for Precise Video Temporal Grounding

📅 2026-04-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limitations of current video-language foundation models in temporal localization, which stem from the scarcity of explicit temporal cues and the difficulty of consistently tracking query-relevant entities across long videos. The authors propose MarkIt, a training-free, plug-and-play framework built around a Query-to-Mask Bridge (Q2M-Bridge). The bridge automatically translates a natural-language query into instance masks and lightweight semantic markers and embeds a frame index in each frame, so that long-range temporal reasoning becomes a matter of reading explicit visual cues within frames. Combining language parsing with text-conditioned open-vocabulary segmentation to generate instance masks, MarkIt works purely at inference time and, according to the authors, consistently improves localization accuracy across multiple video-language models on standard moment retrieval and highlight detection benchmarks, reaching state-of-the-art results.

📝 Abstract

Video temporal grounding (VTG) aims to localize the start and end timestamps of the event described by a given query within an untrimmed video. Despite the strong open-world video understanding and recognition ability of video large language models (Vid-LLMs), outputting precise temporal grounding information remains challenging, since explicit temporal cues are scarce in untrimmed videos, and query-relevant entities are hard to track consistently across the video timeline. In this paper, we present MarkIt, a training-free framework that transforms an input video into a query-conditioned marked video, which empowers Vid-LLMs to generate more reliable temporal localization predictions. The core component of MarkIt is an annotation-free query-to-mask grounding bridge (Q2M-Bridge). Given a natural-language query, it automatically derives a compact set of canonical subject tags through linguistic parsing and normalization, then maps these tags to query-conditioned instance masks using text-conditioned open-vocabulary segmentation. The bridge also embeds lightweight semantic instance markers and a persistent frame index into each frame, effectively transforming long-range temporal reasoning into explicit visual cues for Vid-LLMs. MarkIt adopts an inference-time plug-and-play design, needs no modifications to Vid-LLM weights, and is fully compatible with supervised fine-tuning. Experiments conducted on multiple mainstream moment retrieval and highlight detection benchmarks demonstrate that MarkIt achieves state-of-the-art results, delivering consistent temporal grounding improvements across a wide range of existing models.
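
To make the marking step concrete, the following is a minimal sketch of how a single frame might be turned into a "marked" frame: an instance mask is blended in as a colored overlay and the frame index is stamped into the top-left corner. It is an illustration under stated assumptions, not the authors' implementation. The function names (`mark_frame`, `digit_glyph`), the 3x5 bitmap font, the overlay color, and the corner placement are all hypothetical choices, and the mask is assumed to come from some open-vocabulary segmentation model that is not shown here.

```python
import numpy as np

# Tiny 3x5 bitmap font for the digits 0-9 (rows listed top to bottom).
# Hypothetical stand-in for real text rendering, so the sketch needs only NumPy.
_FONT = {
    "0": "111 101 101 101 111",
    "1": "010 110 010 010 111",
    "2": "111 001 111 100 111",
    "3": "111 001 111 001 111",
    "4": "101 101 111 001 001",
    "5": "111 100 111 001 111",
    "6": "111 100 111 101 111",
    "7": "111 001 001 001 001",
    "8": "111 101 111 101 111",
    "9": "111 101 111 001 111",
}


def digit_glyph(ch, scale):
    """Return a boolean (5*scale, 3*scale) array that is True on the digit's strokes."""
    rows = _FONT[ch].split()
    glyph = np.array([[c == "1" for c in row] for row in rows], dtype=bool)
    return glyph.repeat(scale, axis=0).repeat(scale, axis=1)


def mark_frame(frame, mask, frame_idx, color=(255, 0, 0), alpha=0.5, scale=4):
    """Overlay an instance mask and stamp the frame index on a copy of the frame.

    frame:     (H, W, 3) uint8 RGB image, assumed large enough to hold the index box.
    mask:      (H, W) boolean instance mask (assumed to be produced upstream).
    frame_idx: non-negative integer frame index to render.
    """
    # Alpha-blend the overlay color into the masked pixels (astype makes a copy).
    out = frame.astype(np.float32)
    out[mask] = (1.0 - alpha) * out[mask] + alpha * np.asarray(color, dtype=np.float32)
    out = np.clip(np.rint(out), 0, 255).astype(np.uint8)

    # Draw a black box in the top-left corner, then white digits on top of it.
    text = str(frame_idx)
    pad = scale
    height = 5 * scale
    width = (4 * len(text) - 1) * scale  # 3 columns per digit plus 1 column of spacing
    out[0:height + 2 * pad, 0:width + 2 * pad] = 0
    x = pad
    for ch in text:
        region = out[pad:pad + height, x:x + 3 * scale]  # view into `out`
        region[digit_glyph(ch, scale)] = 255
        x += 4 * scale
    return out
```

In a full pipeline along the lines the abstract describes, a function like this would be applied to every sampled frame, with one mask per canonical subject tag, before the marked video is passed to the Vid-LLM. Because the marking happens purely on the input pixels, no model weights need to change.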

Problem

Research questions and friction points this paper is trying to address.

video temporal grounding

temporal localization

untrimmed video

query-relevant entity tracking

explicit temporal cues

Innovation

Methods, ideas, or system contributions that make the work stand out.

training-free

video temporal grounding

query-to-mask grounding