GroundVTS: Visual Token Sampling in Multimodal Large Language Models for Video Temporal Grounding

📅 2026-04-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing video large language models commonly rely on uniform frame sampling, which often discards critical temporal cues and makes accurate video temporal grounding difficult. To address this limitation, this work proposes GroundVTS, an architecture featuring the first query-guided, fine-grained mechanism for dynamically sampling visual tokens. The mechanism adaptively selects tokens from the most informative temporal segments before they are passed to the large language model, so the retained tokens are non-uniformly distributed in time while preserving rich spatiotemporal semantics. A progressive optimization training strategy then helps the model adapt to these non-uniform video features. The method achieves state-of-the-art performance on three standard video temporal grounding benchmarks, improving moment retrieval mIoU by 7.7 points and highlight detection mAP by 12.0 points.

📝 Abstract

Video temporal grounding (VTG) is a critical task in video understanding and a key capability for extending video large language models (Vid-LLMs) to broader applications. However, existing Vid-LLMs rely on uniform frame sampling to extract video information, resulting in a sparse distribution of key frames and the loss of crucial temporal cues. To address this limitation, we propose Grounded Visual Token Sampling (GroundVTS), a Vid-LLM architecture that focuses on the most informative temporal segments. GroundVTS employs a fine-grained, query-guided mechanism to filter visual tokens before feeding them into the LLM, thereby preserving essential spatio-temporal information and maintaining temporal coherence. Furthermore, we introduce a progressive optimization strategy that enables the LLM to effectively adapt to the non-uniform distribution of visual features, enhancing its ability to model temporal dependencies and achieve precise video localization. We comprehensively evaluate GroundVTS on three standard VTG benchmarks, where it outperforms existing methods, achieving a 7.7-point improvement in mIoU for moment retrieval and a 12.0-point improvement in mAP for highlight detection. Code is available at https://github.com/Florence365/GroundVTS.
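
To make the contrast between uniform frame sampling and query-guided token filtering concrete, here is a minimal sketch in plain Python. It is not the GroundVTS implementation: the abstract does not specify the scoring function or selection rule, so cosine similarity with top-k selection is an assumption, and the names `cosine`, `select_tokens`, and `uniform_frames` are hypothetical. The one property taken from the abstract is that the kept tokens stay in temporal order.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_tokens(tokens, query, keep):
    """Keep the `keep` tokens most similar to `query`, preserving time order.

    tokens: list of (frame_index, vector) pairs, ordered by time.
    query:  embedding of the text query (assumed to share the token space).
    """
    scored = [(cosine(vec, query), pos) for pos, (_, vec) in enumerate(tokens)]
    # Highest similarity first; ties go to the earlier token.
    top = sorted(scored, key=lambda s: (-s[0], s[1]))[:keep]
    # Re-sort the survivors by position so temporal order is kept.
    kept_positions = sorted(pos for _, pos in top)
    return [tokens[pos] for pos in kept_positions]


def uniform_frames(num_frames, keep):
    """Baseline: evenly spaced frame indices, chosen without looking at the query."""
    if keep <= 0:
        return []
    step = num_frames / keep
    return [int(step * k) for k in range(keep)]


# Toy video: one 2-D token per frame; only frames 2 and 3 match the query.
tokens = [
    (0, [0.0, 1.0]),
    (1, [0.1, 1.0]),
    (2, [1.0, 0.0]),
    (3, [1.0, 0.1]),
    (4, [0.0, 1.0]),
    (5, [-1.0, 0.0]),
]
query = [1.0, 0.0]

print([frame for frame, _ in select_tokens(tokens, query, keep=2)])  # [2, 3]
print(uniform_frames(num_frames=6, keep=2))                          # [0, 3]
```

In this toy case the query-guided selection concentrates its budget on the two adjacent relevant frames, while the uniform baseline spends half of it on frame 0, which is unrelated to the query. A real system would score many tokens per frame with learned embeddings rather than hand-written vectors.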

Problem

Research questions and friction points this paper is trying to address.

Video Temporal Grounding

Multimodal Large Language Models

Visual Token Sampling

Temporal Coherence

Moment Retrieval

Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual Token Sampling

Query-Guided Filtering

Temporal Grounding