🤖 AI Summary
Video large language models (Video-LLMs) struggle to accurately model temporal dynamics in time-sensitive video understanding (TSV) tasks. Method: This paper proposes GO-Tokenizer, a plug-and-play module that directly encodes the spatial localization of grounded objects (GOs) within frames into lightweight, compact visual tokens, bypassing textual object descriptions that are noisy and costly in token length. GO-Tokenizer integrates GO features during pretraining and is compatible with arbitrary object detectors and Video-LLMs. Contribution/Results: Extensive experiments demonstrate significant improvements over baselines and text-augmented methods on TSV tasks, including reasoning temporal localization and dense video captioning, across multiple datasets and model architectures. The results validate that explicit spatio-temporal grounding substantially enhances the temporal awareness of Video-LLMs, establishing GO-Tokenizer as an effective and generalizable solution for temporally grounded video understanding.
📝 Abstract
We propose to improve the time-sensitive video understanding (TSV) capability of video large language models (Video-LLMs) with grounded objects (GOs). We hypothesize that TSV tasks can benefit from GOs within frames, which is supported by our preliminary experiments on LITA, a state-of-the-art Video-LLM for reasoning temporal localization. While augmenting prompts with textual descriptions of these object annotations improves the performance of LITA, it also adds token length and introduces susceptibility to noise in object-level information. To address this, we propose GO-Tokenizer, a lightweight add-on module for Video-LLMs that leverages off-the-shelf object detectors to encode compact object information on the fly. Experimental results demonstrate that pretraining with GO-Tokenizer outperforms the vanilla Video-LLM and its counterpart utilizing textual descriptions of objects in the prompt. The gain generalizes across models, datasets, and video understanding tasks such as reasoning temporal localization and dense video captioning.
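To make the core idea concrete, here is a minimal sketch of compressing per-frame detector outputs (class id plus normalized bounding box) into a single compact "GO token" per frame instead of a verbose textual description. All names, dimensions, and design choices below are illustrative assumptions, not the paper's actual implementation, which learns this module jointly during pretraining.

```python
import numpy as np

rng = np.random.default_rng(0)

N_CLASSES = 80   # hypothetical detector vocabulary size (e.g. COCO-style)
D_TOKEN = 64     # hypothetical compact GO-token dimension

# Parameters that would be learned during pretraining; randomly
# initialized here purely for the sketch.
class_emb = rng.normal(0.0, 0.02, (N_CLASSES, D_TOKEN))  # class embedding table
box_proj = rng.normal(0.0, 0.02, (4, D_TOKEN))           # projects (x, y, w, h)

def go_token(detections):
    """detections: list of (class_id, (x, y, w, h)), coords normalized to [0, 1].
    Returns one D_TOKEN vector summarizing the frame's grounded objects."""
    if not detections:
        return np.zeros(D_TOKEN)  # frames with no detections get a null token
    feats = []
    for cls, box in detections:
        # Fuse what the object is (class embedding) with where it is
        # (projected box coordinates) into a single vector.
        feats.append(class_emb[cls] + np.asarray(box) @ box_proj)
    return np.mean(feats, axis=0)  # pool all objects into one compact token

# One token per frame, regardless of how many objects the detector found,
# versus many text tokens for a prompt like "a person at (0.5, 0.5)...".
frame = [(0, (0.1, 0.2, 0.3, 0.3)), (17, (0.5, 0.5, 0.2, 0.4))]
print(go_token(frame).shape)  # (64,)
```

The appeal of this design is that the token count stays constant per frame, whereas textual descriptions grow with the number of detected objects and inherit the detector's labeling noise verbatim.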