🤖 AI Summary
Video large language models (Video-LLMs) struggle to accurately model temporal dynamics in time-sensitive video understanding (TSV) tasks. Method: This paper proposes GO-Tokenizer, a plug-and-play module that directly encodes the spatial localization of grounded objects (GOs) within frames into lightweight, compact visual tokens, bypassing textual object descriptions that are noisy and costly in token length. GO-Tokenizer integrates GO features during pretraining and is compatible with arbitrary object detectors and Video-LLMs. Contribution/Results: Extensive experiments demonstrate significant improvements over baselines and text-augmented methods on TSV tasks, including reasoning temporal localization and dense video captioning, across multiple datasets and model architectures. The results validate that explicit spatio-temporal grounding substantially enhances the temporal awareness of Video-LLMs, establishing GO-Tokenizer as an effective and generalizable solution for temporally grounded video understanding.
📝 Abstract
We propose to improve the time-sensitive video understanding (TSV) capability of video large language models (Video-LLMs) with grounded objects (GOs). We hypothesize that TSV tasks can benefit from GOs within frames, which is supported by our preliminary experiments on LITA, a state-of-the-art Video-LLM for reasoning temporal localization. While augmenting prompts with textual descriptions of these object annotations improves the performance of LITA, it also adds token length and introduces susceptibility to noise in object-level information. To address this, we propose GO-Tokenizer, a lightweight add-on module for Video-LLMs that leverages off-the-shelf object detectors to encode compact object information on the fly. Experimental results demonstrate that pretraining with GO-Tokenizer outperforms the vanilla Video-LLM and its counterpart utilizing textual descriptions of objects in the prompt. The gain generalizes across models, datasets, and video understanding tasks such as reasoning temporal localization and dense video captioning.
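To make the core idea concrete, here is a minimal sketch of compressing per-frame detector outputs (class id plus normalized bounding box) into a single compact "GO token" per frame instead of a verbose textual description. All names, dimensions, and design choices below are illustrative assumptions, not the paper's actual implementation, which learns this module jointly during pretraining.

```python
import numpy as np

rng = np.random.default_rng(0)

N_CLASSES = 80   # hypothetical detector vocabulary size (e.g. COCO-style)
D_TOKEN = 64     # hypothetical compact GO-token dimension

# Parameters that would be learned during pretraining; randomly
# initialized here purely for the sketch.
class_emb = rng.normal(0.0, 0.02, (N_CLASSES, D_TOKEN))  # class embedding table
box_proj = rng.normal(0.0, 0.02, (4, D_TOKEN))           # projects (x, y, w, h)

def go_token(detections):
    """detections: list of (class_id, (x, y, w, h)), coords normalized to [0, 1].
    Returns one D_TOKEN vector summarizing the frame's grounded objects."""
    if not detections:
        return np.zeros(D_TOKEN)  # frames with no detections get a null token
    feats = []
    for cls, box in detections:
        # Fuse what the object is (class embedding) with where it is
        # (projected box coordinates) into a single vector.
        feats.append(class_emb[cls] + np.asarray(box) @ box_proj)
    return np.mean(feats, axis=0)  # pool all objects into one compact token

# One token per frame, regardless of how many objects the detector found,
# versus many text tokens for a prompt like "a person at (0.5, 0.5)...".
frame = [(0, (0.1, 0.2, 0.3, 0.3)), (17, (0.5, 0.5, 0.2, 0.4))]
print(go_token(frame).shape)  # (64,)
```

The appeal of this design is that the token count stays constant per frame, whereas textual descriptions grow with the number of detected objects and inherit the detector's labeling noise verbatim.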