VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding

📅 2025-07-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing frame sampling methods for long-video understanding lack task awareness and fail to adapt to user instructions in complex scenes. Method: We propose Instructed Temporal Grounding for Videos (VideoITG), a discriminative, instruction-driven frame sampling framework. We introduce the first instruction-aligned temporal grounding paradigm, which mimics human annotation logic, and design VidThinker, an automated annotation tool that generates instruction-conditioned segment descriptions, performs reasoning-based retrieval, and selects fine-grained keyframes. We further develop a plug-and-play VideoITG model that enables end-to-end, instruction-guided sampling. Contribution/Results: We release VideoITG-40K, the first large-scale instruction-spatiotemporal alignment dataset (40K videos, 500K annotations). Extensive experiments on multiple long-video understanding benchmarks show significant performance gains, validating both the effectiveness and generalizability of task-customized frame sampling.
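The VidThinker flow summarized above can be pictured as three chained calls. The sketch below is illustrative only: `caption_clip`, `retrieve_clips`, and `score_frame` are hypothetical stand-ins for the paper's stage models, stubbed here so the pipeline runs end to end.

```python
# A minimal sketch of the three VidThinker stages, assuming the stage
# models sit behind these stub functions. The names caption_clip,
# retrieve_clips, and score_frame are hypothetical, not the paper's API.

def caption_clip(frames, instruction):
    """Stage 1 stand-in: instruction-conditioned clip-level captioning."""
    return f"caption of {len(frames)} frames w.r.t. '{instruction}'"

def retrieve_clips(instruction, captions):
    """Stage 2 stand-in: reasoning-based retrieval over clip captions."""
    return list(range(len(captions)))  # a real model keeps only relevant clips

def score_frame(frame, instruction):
    """Stage 3 stand-in: per-frame informativeness score."""
    return 0.0

def vidthinker(clips, instruction, top_k=8):
    """Return the frames judged most informative for the instruction."""
    captions = [caption_clip(c, instruction) for c in clips]      # stage 1
    relevant = retrieve_clips(instruction, captions)              # stage 2
    candidates = [f for i in relevant for f in clips[i]]          # pool frames
    ranked = sorted(candidates,
                    key=lambda f: score_frame(f, instruction),
                    reverse=True)                                 # stage 3
    return ranked[:top_k]

# Toy usage: four clips of eight placeholder frames each.
clips = [[f"clip{i}_frame{j}" for j in range(8)] for i in range(4)]
print(vidthinker(clips, "When does the chef flip the pancake?"))
```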

📝 Abstract
Recent studies have revealed that selecting informative and relevant video frames can significantly improve the performance of Video Large Language Models (Video-LLMs). Current methods, such as reducing inter-frame redundancy, employing separate models for image-text relevance assessment, or utilizing temporal video grounding for event localization, largely adopt unsupervised learning paradigms and consequently struggle with the complex scenarios of long-video understanding. We propose Instructed Temporal Grounding for Videos (VideoITG), featuring customized frame sampling aligned with user instructions. The core of VideoITG is the VidThinker pipeline, an automated annotation framework that explicitly mimics the human annotation process. First, it generates detailed clip-level captions conditioned on the instruction; then, it retrieves relevant video segments through instruction-guided reasoning; finally, it performs fine-grained frame selection to pinpoint the most informative visual evidence. Leveraging VidThinker, we construct the VideoITG-40K dataset, containing 40K videos and 500K instructed temporal grounding annotations. We then design a plug-and-play VideoITG model, which leverages the visual-language alignment and reasoning capabilities of Video-LLMs, for effective frame selection in a discriminative manner. Coupled with Video-LLMs, VideoITG achieves consistent performance improvements across multiple multimodal video understanding benchmarks, demonstrating its superiority and strong potential for video understanding.
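As a concrete picture of "frame selection in a discriminative manner", the sketch below scores each frame feature against an instruction embedding with a small fusion head and keeps the top-k frames in temporal order. The architecture, feature dimensions, and the `FrameSelector` name are assumptions for illustration, not the paper's released model.

```python
import torch
import torch.nn as nn

class FrameSelector(nn.Module):
    """Illustrative discriminative selector: fuses each frame feature with
    an instruction embedding and emits one relevance logit per frame."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.GELU(),
            nn.Linear(dim, 1),  # one relevance logit per frame
        )

    def forward(self, frame_feats: torch.Tensor, text_feat: torch.Tensor,
                k: int = 8) -> torch.Tensor:
        # frame_feats: (T, dim) per-frame features; text_feat: (dim,)
        text = text_feat.unsqueeze(0).expand_as(frame_feats)
        logits = self.fuse(torch.cat([frame_feats, text], dim=-1)).squeeze(-1)
        topk = torch.topk(logits, k=min(k, frame_feats.size(0))).indices
        return torch.sort(topk).values  # keep selected frames in temporal order

# Usage with random features standing in for a video and an instruction:
selector = FrameSelector()
frames = torch.randn(128, 768)  # 128 pre-sampled frames
query = torch.randn(768)        # encoded user instruction
print(selector(frames, query, k=8))
```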
Problem

Research questions and friction points this paper is trying to address.

Improving video frame selection for Video-LLMs
Addressing complex long video understanding scenarios
Aligning frame sampling with user instructions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Customized frame sampling aligned with instructions
Automated annotation mimicking human process
Plug-and-play model for discriminative frame selection (see the coupling sketch below)
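A minimal sketch of the plug-and-play coupling referenced in the last bullet, assuming stub functions `select_frames` and `video_llm_answer` in place of a trained VideoITG selector and a downstream Video-LLM:

```python
# Minimal coupling sketch: VideoITG-style selection feeds a Video-LLM.
# Both functions below are hypothetical stand-ins, not a released API.

def select_frames(frames, instruction, k=8):
    """Stub selector: a real VideoITG model ranks frames by instruction
    relevance; here we simply take the first k frames."""
    return frames[:k]

def video_llm_answer(frames, instruction):
    """Stub Video-LLM: answers the instruction from the selected frames."""
    return f"answer derived from {len(frames)} frames"

video = [f"frame_{i}" for i in range(1024)]        # densely decoded video
instruction = "When does the goalkeeper make the save?"
evidence = select_frames(video, instruction, k=8)  # instructed grounding
print(video_llm_answer(evidence, instruction))     # plug-and-play inference
```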