Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models

📅 2024-10-04
🏛️ arXiv.org
📈 Citations: 36
Influential: 3
🤖 AI Summary
Existing video large language models (Video-LLMs) excel at coarse-grained understanding but underperform on fine-grained temporal grounding, owing to weak temporal modeling and the absence of explicit timestamp representations. To address this, the paper proposes Grounded-VideoLLM, built on two key components: (1) a dedicated temporal stream that explicitly models inter-frame dynamics, augmented with discrete time tokens encoding explicit temporal semantics; and (2) a multi-stage curriculum training strategy coupled with a fully automated pipeline for constructing grounding-aware VideoQA data. The method achieves significant gains over state-of-the-art methods on fine-grained benchmarks, including temporal sentence grounding, dense video captioning, and grounded VideoQA, while preserving strong performance on general video understanding, demonstrating both generalizability and practicality. The core contributions are the decoupling of spatio-temporal modeling, a structured representation of temporal information, and a scalable, grounding-centric training paradigm.

📝 Abstract
Video Large Language Models (Video-LLMs) have demonstrated remarkable capabilities in coarse-grained video understanding; however, they struggle with fine-grained temporal grounding. In this paper, we introduce Grounded-VideoLLM, a novel Video-LLM adept at perceiving and reasoning over specific video moments in a fine-grained manner. We identify that current Video-LLMs have limitations for fine-grained video understanding since they lack effective temporal modeling and timestamp representation. In light of this, we sharpen our model by incorporating (1) an additional temporal stream to encode the relationships between frames and (2) discrete temporal tokens enriched with specific time knowledge to represent timestamps. To optimize the training of Grounded-VideoLLM, we employ a multi-stage training scheme, beginning with simple video-captioning tasks and progressively introducing video temporal grounding tasks of increasing complexity. To further enhance Grounded-VideoLLM's temporal reasoning capability, we also curate a grounded VideoQA dataset via an automatic annotation pipeline. Extensive experiments demonstrate that Grounded-VideoLLM not only excels in fine-grained grounding tasks such as temporal sentence grounding, dense video captioning, and grounded VideoQA, but also shows great potential as a versatile video assistant for general video understanding.
Problem

Research questions and friction points this paper is trying to address.

Addresses fine-grained temporal grounding in Video-LLMs
Overcomes limitations in temporal modeling and timestamp representation
Enhances perception and reasoning over specific video moments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Additional temporal stream for frame relationships
Discrete temporal tokens with time knowledge
Multi-stage training with progressive complexity
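The discrete temporal tokens above can be understood as a quantization of continuous timestamps into a small fixed vocabulary the language model can emit. A minimal sketch, assuming 100 bins and a hypothetical `<T_k>` token format (the paper's actual vocabulary size and token naming may differ):

```python
def timestamp_to_token(t: float, duration: float, k: int = 100) -> str:
    """Quantize timestamp t (seconds) in a video of given duration
    into one of k discrete time tokens."""
    if duration <= 0:
        raise ValueError("duration must be positive")
    t = min(max(t, 0.0), duration)           # clamp into [0, duration]
    idx = min(int(t / duration * k), k - 1)  # bin index in [0, k-1]
    return f"<T_{idx}>"

def token_to_timestamp(token: str, duration: float, k: int = 100) -> float:
    """Map a discrete time token back to the center of its bin, in seconds."""
    idx = int(token.strip("<>").split("_")[1])
    return (idx + 0.5) / k * duration
```

Because grounding outputs become ordinary tokens, temporal localization can be trained with the same next-token objective as captioning, which is what makes the progressive multi-stage scheme straightforward to apply.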
Haibo Wang
Fudan University
Zhiyang Xu
Virginia Tech
Yu Cheng
The Chinese University of Hong Kong
Shizhe Diao
NVIDIA Research
Large Language Models, Natural Language Processing
Yufan Zhou
Adobe Research
Yixin Cao
Fudan University
Qifan Wang
Meta AI
Weifeng Ge
Fudan University
Humanoid Robot, Computer Vision, Artificial Intelligence, AI4Science
Lifu Huang
Assistant Professor, UC Davis
Natural Language Processing, Multimodal Learning, AI for Science, Multilingual